R: Consensus Clustering with ConsensusClusterPlus
01
—
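When would you use consensus clustering?
Here is a familiar recipe for a paper:
Suppose you happen to find that several groups of samples differ in immune cell composition and also in survival. Isn't that already a paper revealing the heterogeneity of the immune response in XX cancer?
Many papers of this kind identify subtypes with consensus clustering.
Because consensus clustering is built on resampling, its results are characteristically stable, which lets it overcome the shortcomings of traditional clustering listed below.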
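Determining the number of clusters and their members in unsupervised analysis
Cluster analysis
Shortcomings of traditional methods
Some methods, for example hierarchical clustering, provide no "objective" criterion for the number of classes or for the class boundaries.
Others, for example k-means clustering, need the number of classes to be fixed in advance, and there is no common criterion for comparing the results obtained with different numbers of classes.
The soundness and reliability of the clustering result cannot be verified.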
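Consensus Clustering
Consensus clustering uses resampling to check whether a clustering is sound.
The main purpose of the method is to assess the stability of the clusters.
Underlying assumption
Suppose we build a new dataset by drawing samples from the different subclasses of the original dataset, so that each subclass contributes a different subset of its samples. Clustering this new dataset should then give a result close to that of the original dataset, both in the number of clusters and in the membership of each cluster. It follows that the more stable the clusters are under sampling variability, the more confident we can be that they reflect a real subclass structure. Resampling perturbs the original dataset: each resampled dataset is clustered, and the results of the many clustering runs are then combined into a consensus assessment.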
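To make the resampling idea concrete, here is a minimal base-R sketch. It is not the ConsensusClusterPlus implementation: the toy data, the helper name `consensus_matrix`, and the choice of Euclidean distance with average linkage are all assumptions made for illustration. For every pair of items it counts how often the pair lands in the same cluster, divided by how often the pair was drawn together.

```r
# Illustrative sketch only; not the ConsensusClusterPlus implementation.
set.seed(1)

# Toy data (assumption): 50 features (rows) x 20 items (columns),
# two well-separated groups of 10 items each.
x <- cbind(matrix(rnorm(50 * 10, mean = 0), nrow = 50),
           matrix(rnorm(50 * 10, mean = 3), nrow = 50))
colnames(x) <- paste0("s", seq_len(ncol(x)))

# Hypothetical helper: consensus matrix for a given k.
consensus_matrix <- function(x, k, reps = 50, p_item = 0.8) {
  n <- ncol(x)
  together <- matrix(0, n, n)  # times a pair fell into the same cluster
  sampled  <- matrix(0, n, n)  # times a pair was drawn in the same subsample
  for (r in seq_len(reps)) {
    idx <- sort(sample(n, floor(p_item * n)))          # subsample the items
    tree <- hclust(dist(t(x[, idx])), method = "average")
    cl <- cutree(tree, k = k)                          # cluster the subsample
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }
  m <- together / sampled
  m[sampled == 0] <- 0          # pairs never drawn together get consensus 0
  dimnames(m) <- list(colnames(x), colnames(x))
  m
}

m <- consensus_matrix(x, k = 2)
# Consensus within the first group, and between the two groups
round(m[c("s1", "s2", "s11"), c("s1", "s2", "s11")], 2)
```

Values near 1 mean a pair is almost always clustered together, and values near 0 mean almost never. A stable clustering gives a matrix made up mostly of 0s and 1s.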
02
—
Consensus clustering in R
1. About ConsensusClusterPlus
library(ConsensusClusterPlus)
ls("package:ConsensusClusterPlus")
ConsensusClusterPlus
function for determining cluster number and class membership by stability evidence.
calcICL
function for calculating cluster-consensus and item-consensus.
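2. Workflow
Using ConsensusClusterPlus involves three main steps:
Prepare the input data
Run the program
Compute cluster-consensus and item-consensus
3. Preparing the input data
First collect the data to be clustered, for example experimental results such as mRNA expression microarrays or immunohistochemical staining intensities. The input must be a matrix, with samples (items) in columns and features in rows. The steps below use the ALL gene expression data as an example.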
library(ALL)
data(ALL)
# An example expression microarray dataset
dataset <- exprs(ALL)
# Look at the first five rows and the first five columns
dataset[1:5,1:5]
# 01005 01010 03002 04006 04007
# 1000_at 7.597323 7.479445 7.567593 7.384684 7.905312
# 1001_at 5.046194 4.932537 4.799294 4.922627 4.844565
# 1002_f_at 3.900466 4.208155 3.886169 4.206798 3.416923
# 1003_s_at 5.903856 6.169024 5.860459 6.116890 5.687997
# 1004_at 5.925260 5.912780 5.893209 6.170245 5.615210
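Keep the 5000 rows of the matrix with the largest MAD (median absolute deviation):
In statistics, the median absolute deviation (MAD) is a robust measure of the variability of a univariate numeric sample.
# Compute the median absolute deviation of each row (gene)
mads <- apply(dataset, 1, mad)
# Sort by MAD and keep the top 5000 rows
dataset <- dataset[rev(order(mads))[1:5000],]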
dim(dataset)
# [1] 5000 128
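As a small illustration of what `mad()` computes, the sketch below uses made-up numbers (the vector `x` and the matrix `toy` are assumptions, not part of the ALL data). Note that R's `mad()` multiplies the raw median absolute deviation by the constant 1.4826 by default, so that it is comparable to the standard deviation for normally distributed data.

```r
# A vector with one extreme outlier (made-up values)
x <- c(2, 4, 4, 5, 7, 9, 100)

raw_mad    <- median(abs(x - median(x)))  # median of absolute deviations from the median
scaled_mad <- mad(x)                      # R's default: 1.4826 * raw_mad

# The outlier inflates sd() far more than mad()
c(raw = raw_mad, scaled = scaled_mad, sd = sd(x))

# Row-wise filtering on a tiny matrix, mirroring the top-5000 step above
toy <- rbind(flat = c(5, 5, 5, 6, 5),
             mid  = c(1, 3, 5, 7, 9),
             wide = c(0, 10, 20, 30, 40))
toy_mads <- apply(toy, 1, mad)
top2 <- toy[order(toy_mads, decreasing = TRUE)[1:2], ]
rownames(top2)
```

`order(..., decreasing = TRUE)` is used here in place of `rev(order(...))`. The two select the same rows except possibly when MAD values are tied.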
4. Running ConsensusClusterPlus
First set a few parameters:
pItem (item resampling, proportion of items to sample): 80%
pFeature (gene resampling, proportion of features to sample): 80%
maxK (maximum cluster number to evaluate): 6, the largest number of groups you want to try
reps (resamplings, number of subsamples): 50
clusterAlg (agglomerative hierarchical clustering algorithm): 'hc' (hclust)
distance: 'pearson' (1 - Pearson correlation)
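The 'pearson' distance in the list above is 1 minus the Pearson correlation between samples. The sketch below shows that quantity in base R on a small made-up matrix (`expr` and its values are assumptions) and is not taken from the package source.

```r
set.seed(42)

# Made-up expression matrix: 6 genes (rows) x 4 samples (columns)
expr <- matrix(rnorm(6 * 4), nrow = 6,
               dimnames = list(paste0("g", 1:6), paste0("s", 1:4)))
expr[, "s2"] <- 2 * expr[, "s1"] + 1  # perfectly correlated with s1
expr[, "s3"] <- -expr[, "s1"]         # perfectly anti-correlated with s1

# cor() correlates columns, so this is a sample-by-sample distance
d  <- as.dist(1 - cor(expr, method = "pearson"))
dm <- as.matrix(d)
round(dm, 3)

# The distance object can be passed straight to hclust(), as with clusterAlg = "hc"
tree <- hclust(d, method = "average")
```

The distance ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated), so two samples with the same expression pattern at different scales still count as close.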
title <- "YOUR PATH" # all plots and data are written to this directory
results <- ConsensusClusterPlus(dataset, maxK = 6,
reps = 50, pItem = 0.8,
pFeature = 0.8,
clusterAlg = "hc",
distance = "pearson",
title = title,
plot = "png")
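## The original author uses pFeature = 1 here, which does not match the parameter list above, so I kept 0.8
Nine plots now appear in the output folder under the working directory.
Take a look at the results:
results[[2]][["consensusMatrix"]][1:5, 1:5]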
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1.00000 0.9375000 1.0000000 0.90625 1.0000000
# [2,] 0.93750 1.0000000 0.9677419 1.00000 0.9393939
# [3,] 1.00000 0.9677419 1.0000000 0.93750 1.0000000
# [4,] 0.90625 1.0000000 0.9375000 1.00000 0.9062500
# [5,] 1.00000 0.9393939 1.0000000 0.90625 1.0000000
results[[2]][["consensusTree"]]
# Call:
# hclust(d = as.dist(1 - fm), method = finalLinkage)
#
# Cluster method : average
# Number of objects: 128
results[[2]][["consensusClass"]][1:5]
# 01005 01010 03002 04006 04007
# 1 1 1 1 1
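4.1 Consensus matrix
The panels are, in order, the legend and the matrix heatmaps for k = 2, 3, 4, 5.
These are the CM plots. Their purpose is to show how the samples are partitioned. Look for the "cleanest" panel, meaning the one whose white areas contain as little blue as possible. That k gives the best partition.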
CM plots
4.2 Consensus cumulative distribution function (CDF) plot
CDF plot
The CDF of the consensus values for each number of clusters k.
Empirical cumulative distribution function (CDF) plots display consensus distributions for each k. The purpose of the CDF plot is to find the k at which the distribution reaches an approximate maximum, which indicates a maximum stability and after which divisions are equivalent to random picks rather than true cluster structure.
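To see what the CDF plot summarizes, the following sketch computes an empirical CDF of the pairwise consensus values and the area under it. It uses two made-up consensus matrices, and the helper `consensus_cdf` is hypothetical. The package computes its curves internally, and its exact binning may differ.

```r
# Two toy 6 x 6 consensus matrices (made-up values)
block <- matrix(1, 3, 3)
zero  <- matrix(0, 3, 3)
clean <- rbind(cbind(block, zero), cbind(zero, block))  # perfect two-cluster consensus
fuzzy <- matrix(0.5, 6, 6)
diag(fuzzy) <- 1                                        # every pair co-clusters half the time

# Hypothetical helper: empirical CDF of consensus values and area under it
consensus_cdf <- function(m, n_grid = 101) {
  v    <- m[lower.tri(m)]                 # one consensus value per item pair
  grid <- seq(0, 1, length.out = n_grid)  # consensus index from 0 to 1
  cdf  <- ecdf(v)(grid)                   # fraction of pairs with consensus <= grid
  area <- sum(diff(grid) * (cdf[-1] + cdf[-n_grid]) / 2)  # trapezoidal rule
  list(grid = grid, cdf = cdf, area = area)
}

clean_cdf <- consensus_cdf(clean)
fuzzy_cdf <- consensus_cdf(fuzzy)

plot(clean_cdf$grid, clean_cdf$cdf, type = "s",
     xlab = "consensus index", ylab = "CDF", ylim = c(0, 1))
lines(fuzzy_cdf$grid, fuzzy_cdf$cdf, type = "s", lty = 2)
```

A clean clustering concentrates the consensus values at 0 and 1, so its CDF is flat across the middle. Ambiguous pairs pull the curve up somewhere in between.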
4.3 Delta Area Plot
The usual approach is the elbow method: take the k at the bend of the curve as the best number of clusters.
The delta area score (y-axis) indicates the relative increase in cluster stability.
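As I understand the tutorial, the delta area is the relative change in the area under the CDF between k - 1 and k, and for k = 2 the area itself is plotted because there is no k - 1. The sketch below applies that to made-up area values. Both the `areas` vector and the 0.1 cut-off for locating the elbow are assumptions for illustration, not package defaults.

```r
# Hypothetical areas under the CDF for k = 2..6 (made-up numbers)
areas <- c("2" = 0.30, "3" = 0.45, "4" = 0.54, "5" = 0.56, "6" = 0.565)

# Relative increase in area from k - 1 to k; the first entry is the area at k = 2
delta <- c(areas[1], diff(areas) / areas[-length(areas)])
names(delta) <- names(areas)
round(delta, 3)

# One simple elbow heuristic (assumption): the largest k whose relative gain
# still exceeds a chosen threshold
threshold <- 0.1
best_k <- as.integer(names(delta)[max(which(delta > threshold))])
best_k

plot(as.integer(names(delta)), delta, type = "b",
     xlab = "k", ylab = "relative change in area under CDF")
```

In practice the elbow is usually read off the plot by eye. A fixed threshold is only a convenience and should be checked against the CM and CDF plots.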
4.4 Tracking Plot
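Read this plot row by row (each row is one k). It shows which cluster every sample (column) is assigned to at each number of clusters k. For example, at k = 2 most samples fall into the light blue cluster, and only a small group in the middle falls into the dark blue cluster.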
The item tracking plot shows the consensus cluster of items (in columns) at each k (in rows). This allows a user to track an item's cluster assignments across different k, to identify promiscuous items that are suggestive of weak class membership, and to visualize the distribution of cluster sizes across k.
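The data behind a tracking plot is simply a table with one row per k and one column per item. In the package these assignments would come from `results[[k]][["consensusClass"]]`. The sketch below uses `cutree()` on a toy hierarchical clustering as a stand-in, so the data and the resulting labels are assumptions.

```r
set.seed(7)

# Toy data (assumption): 20 features x 15 items in three well-separated groups
x <- cbind(matrix(rnorm(20 * 5, mean = 0), nrow = 20),
           matrix(rnorm(20 * 5, mean = 4), nrow = 20),
           matrix(rnorm(20 * 5, mean = 8), nrow = 20))
colnames(x) <- paste0("s", seq_len(ncol(x)))

tree <- hclust(dist(t(x)), method = "average")

# cutree() with several k returns an item-by-k matrix; transpose it so that
# rows are k and columns are items, as in the tracking plot
tracking <- t(cutree(tree, k = 2:4))
tracking

# Cluster sizes at each k
sizes <- lapply(rownames(tracking), function(k) table(tracking[k, ]))
names(sizes) <- rownames(tracking)
sizes
```

Items whose label keeps changing relative to their neighbours as k grows are the ones the tutorial calls promiscuous.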
5. Computing cluster-consensus and item-consensus
icl <- calcICL(results, title = title,
plot = "png")
## This returns a list with two elements; look at each in turn
dim(icl[["clusterConsensus"]])
# [1] 20 3
icl[["clusterConsensus"]]
# k cluster clusterConsensus
# [1,] 2 1 0.9402982
# [2,] 2 2 0.9062500
# [3,] 3 1 0.8504193
# [4,] 3 2 0.9062500
# [5,] 3 3 0.9869781
# [6,] 4 1 0.9652282
# [7,] 4 2 0.9045058
# [8,] 4 3 0.9062500
# [9,] 4 4 0.9728043
# [10,] 5 1 0.9216686
# [11,] 5 2 0.9145987
# [12,] 5 3 0.9062500
# [13,] 5 4 0.9874950
# [14,] 5 5 NaN
# [15,] 6 1 0.9307379
# [16,] 6 2 0.8897721
# [17,] 6 3 0.7474747
# [18,] 6 4 0.8750000
# [19,] 6 5 0.9885269
# [20,] 6 6 0.6333333
dim(icl[["itemConsensus"]])
# [1] 2560 4
icl[["itemConsensus"]][1:5,]
# k cluster item itemConsensus
# 1 2 1 28032 0.9523526
# 2 2 1 28024 0.9366226
# 3 2 1 03002 0.9686272
# 4 2 1 01005 0.9573623
# 5 2 1 04007 0.9549235
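Cluster-consensus can be read as the average pairwise consensus among the members of one cluster. The sketch below computes it from a made-up consensus matrix and class vector. The helper `cluster_consensus` is hypothetical and may differ in detail from what `calcICL` does internally. It also shows one plausible reason for a `NaN` like the one at k = 5, cluster 5 above: a cluster with a single member has no pairs to average.

```r
items <- c("a", "b", "c", "d", "e")

# Made-up symmetric consensus matrix
m <- matrix(c(1.0, 0.9, 0.8, 0.1, 0.0,
              0.9, 1.0, 0.7, 0.2, 0.1,
              0.8, 0.7, 1.0, 0.0, 0.2,
              0.1, 0.2, 0.0, 1.0, 0.3,
              0.0, 0.1, 0.2, 0.3, 1.0),
            nrow = 5, byrow = TRUE, dimnames = list(items, items))

# Made-up class assignment: clusters 2 and 3 are singletons
classes <- c(a = 1, b = 1, c = 1, d = 2, e = 3)

# Hypothetical helper: mean pairwise consensus within each cluster
cluster_consensus <- function(m, classes) {
  sapply(sort(unique(classes)), function(cl) {
    members <- which(classes == cl)
    sub <- m[members, members, drop = FALSE]
    mean(sub[lower.tri(sub)])  # NaN when the cluster has only one member
  })
}

cc <- cluster_consensus(m, classes)
cc
```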
5.1 Item-Consensus Plot
IC plot
Item-consensus (IC) is the average consensus value between an item and members of a consensus cluster, so that there are multiple IC values for an item at a k corresponding to the k clusters. IC plots display items as vertical bars of coloured rectangles whose height corresponds to IC values.
(This plot is not important.)
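Even if the plot itself is rarely used, the quantity behind it is simple to compute. Following the definition quoted above, item-consensus is the average consensus between one item and the members of a cluster, giving one value per item and cluster. The sketch below works on a made-up consensus matrix. The helper `item_consensus` is hypothetical, and leaving the item itself out of the average is my assumption.

```r
items <- c("a", "b", "c", "d")

# Made-up symmetric consensus matrix
m <- matrix(c(1.0, 0.9, 0.1, 0.2,
              0.9, 1.0, 0.3, 0.0,
              0.1, 0.3, 1.0, 0.8,
              0.2, 0.0, 0.8, 1.0),
            nrow = 4, byrow = TRUE, dimnames = list(items, items))
classes <- c(a = 1, b = 1, c = 2, d = 2)

# Hypothetical helper: items in rows, clusters in columns
item_consensus <- function(m, classes) {
  clusters <- sort(unique(classes))
  out <- sapply(clusters, function(cl) {
    sapply(seq_len(nrow(m)), function(i) {
      others <- setdiff(which(classes == cl), i)  # cluster members except item i
      mean(m[i, others])
    })
  })
  dimnames(out) <- list(rownames(m), paste0("cluster", clusters))
  out
}

ic <- item_consensus(m, classes)
ic

# The cluster each item agrees with most
best <- colnames(ic)[apply(ic, 1, which.max)]
best
```

An item whose largest value is only slightly above its other values is a weak member of its cluster.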
5.2 Cluster-Consensus Plot
cluster-consensus plot
(This plot is not important either.)
References
ConsensusClusterPlus Tutorial https://bioconductor.org/packages/release/bioc/vignettes/ConsensusClusterPlus/inst/doc/ConsensusClusterPlus.pdf
Nowicka, Malgorzata, et al. "CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets." F1000Research 6 (2017).