BACKGROUND: Distance-based unsupervised clustering of gene expression data is commonly used to identify heterogeneity in biological samples. However, gene expression data are often noisy and genes are often highly correlated with one another, so traditional distances such as the Euclidean distance may not discriminate well between biologically different samples. An alternative way to examine disease phenotypes is to use pre-defined biological pathways, which have been shown to be perturbed in different ways in subjects who have similar clinical features. We hypothesize that differences in the expression of genes within a given pathway are more predictive of biological differences between samples than standard approaches, and that integrating them into clustering analysis will improve the robustness and accuracy of the clustering. To examine this hypothesis, we developed a novel computational method that assesses the biological differences between samples from gene expression data, assuming that ontologically defined biological pathways behave similarly in biologically similar samples.

RESULTS: Pre-defined biological pathways were downloaded, and the genes in each pathway were used to cluster samples with a Gaussian mixture model. The clustering results across the different pathways were then summarized to calculate a pathway-based distance score between samples. The method was applied to both simulated and real data sets and compared with the traditional Euclidean distance and with another pathway-based clustering method, Pathifier. The results show that the pathway-based distance score performs significantly better than the Euclidean distance, especially when heterogeneity is low and genes in the same pathway are correlated. Compared with Pathifier, our approach achieves higher accuracy and robustness for small pathways. For large pathways, our approach achieved comparable performance once the pathways were downsampled into smaller ones.

CONCLUSIONS: We have developed a novel distance score that represents the biological differences between samples using gene expression data and pre-defined biological pathway information. Compared with traditional methods, applying this distance score yields more accurate, robust, and biologically meaningful clustering in both simulated and real data. Its performance is also comparable to or better than that of Pathifier.
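The abstract does not say how the per-pathway clustering results are summarized into a distance score. As a minimal sketch of one plausible reading, the Python below assumes that each pathway has already produced one cluster label per sample (for example, from a Gaussian mixture model fitted on that pathway's genes) and defines the distance between two samples as the fraction of pathways in which they fall into different clusters. The function name `pathway_distance`, the pathway names, and the toy labels are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pathway_distance(labels_by_pathway):
    """Return a symmetric sample-by-sample distance matrix.

    labels_by_pathway maps a pathway name to a list of cluster labels,
    one label per sample, in the same sample order for every pathway.
    ASSUMPTION: the distance is the fraction of pathways in which two
    samples receive different labels; the paper's actual score may differ.
    """
    label_lists = list(labels_by_pathway.values())
    if not label_lists:
        raise ValueError("at least one pathway is required")
    n_samples = len(label_lists[0])
    if any(len(labels) != n_samples for labels in label_lists):
        raise ValueError("every pathway must label the same samples")

    dist = [[0.0] * n_samples for _ in range(n_samples)]
    for i, j in combinations(range(n_samples), 2):
        # Count the pathways whose clustering separates samples i and j.
        disagreements = sum(labels[i] != labels[j] for labels in label_lists)
        dist[i][j] = dist[j][i] = disagreements / len(label_lists)
    return dist


# Hypothetical cluster labels for four samples across three pathways.
toy_labels = {
    "pathway_A": [0, 0, 1, 1],
    "pathway_B": [0, 0, 0, 1],
    "pathway_C": [1, 1, 0, 0],
}
D = pathway_distance(toy_labels)
```

Because the score depends only on whether two samples share a cluster within each pathway, it is unaffected by how cluster labels are numbered from one pathway to the next. The resulting matrix could then be passed to any distance-based clustering routine in place of a Euclidean distance matrix.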

