To address the two-class imbalance problem in breast cancer diagnosis, we propose K-Boosted C5.0, a hybrid of K-means and Boosted C5.0 that is based on undersampling. K-means is used to select informative samples near the class boundary: during the training phase, it clusters the majority and minority instances and selects a similar number of instances from each cluster. Boosted C5.0 is then used as the classifier. Because clustering-based instance selection encourages diversity in the training subspace, K-Boosted C5.0 is expected to achieve better performance. To test the new hybrid classifier, we evaluate it on 12 small-scale and 2 large-scale datasets that are commonly used in class-imbalanced learning. Extensive experimental results show that the proposed hybrid method outperforms most of the competing algorithms in terms of Matthews correlation coefficient (MCC) and accuracy. It can be a good alternative to well-known machine learning methods.
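The cluster-then-sample step and the MCC metric can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract does not say how clusters are formed or how many instances are drawn, so the sketch assumes one possible reading (each class is clustered separately and at most `per_cluster` instances are drawn at random from each cluster). The function names and the parameters `k`, `per_cluster`, and `seed` are hypothetical, and the Boosted C5.0 classifier itself is omitted.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_undersample(X, y, k=2, per_cluster=3, seed=0):
    """Assumed reading of the abstract: cluster each class separately and
    keep at most `per_cluster` randomly chosen instances from every cluster."""
    rng = random.Random(seed)
    keep = []
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        labels = kmeans([X[i] for i in idx], k, seed=seed)
        for c in range(k):
            members = [i for i, lab in zip(idx, labels) if lab == c]
            rng.shuffle(members)
            keep.extend(members[:per_cluster])
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In a pipeline of this shape, the reduced set returned by `cluster_undersample` would be passed to a boosted tree learner in place of the full training set, and `mcc` would be computed on held-out predictions.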

