-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathphyloseq_6_Gap Statistic.Rmd
123 lines (86 loc) · 3.57 KB
/
phyloseq_6_Gap Statistic.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
---
title: "phyloseq_6_Gap Statistic"
author: "wentao"
date: "2019/8/23"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# 聚类分析
**Gap Statistic**
### 这批数据究竟能聚类成几份呢?
**How many clusters are there?**
我们对数据进行分类,看看到底聚成多少的类别。
Gap统计值来确定k的个数,通过对数据进行bootstrap抽样来比较内差异性。这里使用cluster软件包里面的clusGap函数计算。
以下是phyloseq用于聚类的案例,并且将结果使用包装在phyloseq包中的ggplot展示。
><font size=2>From the clusGap documentation: The clusGap function from the cluster package calculates a goodness of clustering measure, called the “gap” statistic. For each number of clusters k, it compares (W(k)) with E^*[(W(k))] where the latter is defined via bootstrapping, i.e. simulating from a reference distribution.
><font size=2>The following is an example performing the gap statistic on ordination results calculated using phyloseq tools, followed by an example of how a ggplot-based wrapper for this example might be included in the phyloseq package.
载入包,查看版本信息
```{R}
library("phyloseq"); packageVersion("phyloseq")
library("cluster"); packageVersion("cluster")
library("ggplot2"); packageVersion("ggplot2")
```
### 导入绘图主题和数据
```{R}
theme_set(theme_bw())
# Load data
data(enterotype)
```
进行系统聚类时,先计算样本之间的距离, 这里使用jsd距离进行MDS排序
><font size=2>in this case, MDS on the Bray-Curtis distance.
```{R}
# ordination
exord = ordinate(enterotype, method="MDS", distance="jsd")
```
### 聚类计算
Compute Gap Statistic
根据排序得分估计最优聚类
clusGap()函数一般我们用于来计算用于估计最优聚类k个数
```{R}
pam1 = function(x, k){list(cluster = pam(x,k, cluster.only=TRUE))}
x = phyloseq:::scores.pcoa(exord, display="sites")
# gskmn = clusGap(x[, 1:2], FUN=kmeans, nstart=20, K.max = 6, B = 500)
gskmn = clusGap(x[, 1:2], FUN=pam1, K.max = 6, B = 50)
gskmn
```
### 可视化K值
```{R}
library(factoextra)
library(cluster)
fviz_gap_stat(gskmn)
```
### 下面构造统计函数,方便运算,构造出图函数。
```{R}
gap_statistic_ordination = function(ord, FUNcluster, type="sites", K.max=6, axes=c(1:2), B=500, verbose=interactive(), ...){
require("cluster")
# If "pam1" was chosen, use this internally defined call to pam
if(FUNcluster == "pam1"){
FUNcluster = function(x,k) list(cluster = pam(x, k, cluster.only=TRUE))
}
# Use the scores function to get the ordination coordinates
x = phyloseq:::scores.pcoa(ord, display=type)
# If axes not explicitly defined (NULL), then use all of them
if(is.null(axes)){axes = 1:ncol(x)}
# Finally, perform, and return, the gap statistic calculation using cluster::clusGap
clusGap(x[, axes], FUN=FUNcluster, K.max=K.max, B=B, verbose=verbose, ...)
}
plot_clusgap = function(clusgap, title="Gap Statistic calculation results"){
require("ggplot2")
gstab = data.frame(clusgap$Tab, k=1:nrow(clusgap$Tab))
p = ggplot(gstab, aes(k, gap)) + geom_line() + geom_point(size=5)
p = p + geom_errorbar(aes(ymax=gap+SE.sim, ymin=gap-SE.sim))
p = p + ggtitle(title)
return(p)
}
```
计算
```{R}
gs = gap_statistic_ordination(exord, "pam1", B=50, verbose=FALSE)
print(gs, method="Tibs2001SEmax")
```
可视化结果
```{R}
plot_clusgap(gs)
```