BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphism found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides machinery for predicting tagged SNPs, and thereby for assessing the performance of tag sets through their ability to predict larger SNP sets.

RESULTS: Here, we compute the description code-lengths of SNP data for an array of models, and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several respects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection.

CONCLUSION: Our study provides strong evidence that the tag sets selected by our best method, based on the Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is a more sensitive measure of the quality of a tagging set than the correct prediction rate of tagged SNPs. In addition, we show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. Software implementing our approach is available.
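
The entropy-maximization strategy described in the abstract can be illustrated with a minimal sketch. It does not implement the Li and Stephens hidden Markov model. As a simplifying assumption, it estimates entropy from the empirical frequencies of phased haplotypes and adds tag SNPs greedily. The function names `joint_entropy` and `greedy_tag_snps` and the toy data are hypothetical and are not taken from the authors' software.

```python
from collections import Counter
from math import log2


def joint_entropy(haplotypes, columns):
    """Empirical Shannon entropy (in bits) of the haplotypes restricted to `columns`."""
    n = len(haplotypes)
    counts = Counter(tuple(h[c] for c in columns) for h in haplotypes)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def greedy_tag_snps(haplotypes, max_tags):
    """Greedily pick up to `max_tags` SNP indices that maximize joint entropy.

    Assumption: an empirical-frequency model stands in for a proper
    haplotype model; ties go to the lowest SNP index.
    """
    n_snps = len(haplotypes[0])
    selected = []
    for _ in range(min(max_tags, n_snps)):
        remaining = [c for c in range(n_snps) if c not in selected]
        best = max(remaining, key=lambda c: joint_entropy(haplotypes, selected + [c]))
        selected.append(best)
    return selected


# Toy phased haplotypes (one string per chromosome, one character per SNP).
# SNPs 0 and 1 are in perfect LD, so tagging both would add no information.
toy_haplotypes = ["000", "000", "110", "111"]
tags = greedy_tag_snps(toy_haplotypes, max_tags=2)
print(tags, joint_entropy(toy_haplotypes, tags))
```

In this toy example the second tag is SNP 2 rather than SNP 1, because SNP 1 is redundant with SNP 0. A model-based variant would replace the empirical counts with probabilities from a fitted haplotype model.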

