BACKGROUND: Word sense disambiguation (WSD) is critical in the biomedical domain for improving the precision of natural language processing (NLP), text mining, and information retrieval systems, because ambiguous words hinder accurate access to literature containing biomolecular entities such as genes, proteins, cells, diseases, and other important entities. Automated techniques have been developed to address the WSD problem in a number of text processing situations, but the problem remains challenging. Supervised machine learning (ML) methods for WSD have been applied in the biomedical domain and have shown promising results, but the reported results typically incorporate a number of confounding factors. Because these factors interact with each other and affect the final results, it is difficult to truly understand the effectiveness and generalizability of the methods. Thus, there is a need to address these factors explicitly and to quantify their effects on performance systematically.

RESULTS: Experiments were designed to measure the effect of "sample size" (i.e., the size of the datasets), "sense distribution" (i.e., the distribution of the different meanings of the ambiguous word), and "degree of difficulty" (i.e., a measure of the distances between the meanings of the senses of an ambiguous word) on the performance of WSD classifiers. Support Vector Machine (SVM) classifiers were applied to an automatically generated dataset containing four ambiguous biomedical abbreviations, BPD, BSA, PCA, and RSV, which were chosen because their respective senses differ to varying degrees. Results showed that: 1) increasing the sample size generally reduced the error rate, but this was limited mainly to well-separated senses (i.e., cases where the distances between the senses were large); in difficult cases, an impractically large increase in sample size was needed to improve performance even slightly; 2) the sense distribution had no effect on performance when the senses were separable; 3) when the majority sense accounted for over 90% of the instances, the WSD classifier performed no better than simply assigning the majority sense; 4) error rates were proportional to the similarity of the senses; and 5) there was no statistically significant difference between results obtained with 5-fold and 10-fold cross-validation. Other issues that affect performance are also enumerated.

CONCLUSION: Several different independent aspects affect performance when ML techniques are used for WSD. We found that combining them into a single result obscures understanding of the underlying methods. Although we studied only four abbreviations, we used a well-established statistical method that ensures the results are likely to generalize to abbreviations with similar characteristics. The results of our experiments show that, in order to understand the performance of these ML methods, it is critical that papers report the baseline performance, the distribution and sample size of the senses in the datasets, and the standard deviation or confidence intervals. In addition, papers should characterize the difficulty of the WSD task, the WSD situations addressed and not addressed, and the ML methods and features used. This should lead to an improved understanding of the generalizability and the limitations of the methodology.
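As a minimal sketch of the reporting the conclusion recommends, the Python below computes a dataset's sample size, sense distribution, and majority-sense baseline, plus the mean, standard deviation, and an approximate 95% confidence interval of per-fold cross-validation accuracy. The function names, sense labels, and fold accuracies are hypothetical, and this is not the authors' evaluation code. The interval is a simple normal approximation that treats folds as independent (they are not strictly independent, and a t-based interval would be wider for only five folds).

```python
from collections import Counter
import math
import statistics


def sense_distribution(labels):
    """Return each sense's share of the dataset, most frequent first."""
    counts = Counter(labels)
    total = len(labels)
    return {sense: n / total for sense, n in counts.most_common()}


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent sense."""
    return max(Counter(labels).values()) / len(labels)


def summarize_cv(fold_accuracies, z=1.96):
    """Mean, sample standard deviation, and a normal-approximation
    confidence interval over per-fold accuracies (z=1.96 ~ 95%).
    Simplifying assumption: folds are treated as independent samples."""
    mean = statistics.mean(fold_accuracies)
    sd = statistics.stdev(fold_accuracies)
    half_width = z * sd / math.sqrt(len(fold_accuracies))
    return mean, sd, (mean - half_width, mean + half_width)


def report(labels, fold_accuracies):
    """Collect baseline, sense distribution, sample size, and spread."""
    baseline = majority_baseline_accuracy(labels)
    mean, sd, (lo, hi) = summarize_cv(fold_accuracies)
    return {
        "sample_size": len(labels),
        "sense_distribution": sense_distribution(labels),
        "majority_baseline": baseline,
        "cv_mean": mean,
        "cv_sd": sd,
        "cv_ci": (lo, hi),
        # True only if the whole interval lies above the majority baseline.
        "beats_baseline": lo > baseline,
    }


if __name__ == "__main__":
    # Hypothetical data: a 92% majority sense, echoing the >90% finding.
    labels = ["sense_A"] * 92 + ["sense_B"] * 8
    folds = [0.93, 0.91, 0.94, 0.90, 0.92]
    print(report(labels, folds)["beats_baseline"])  # False
```

With a 92% majority sense and a mean fold accuracy of about 0.92, the interval's lower bound falls below the majority baseline, so the sketch does not count the classifier as better than always choosing the majority sense, mirroring finding 3 above.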
