BACKGROUND: A growing number of observational studies do not focus only on single biomarkers for predicting an outcome event, but address questions in a multivariable setting. For example, when quantifying the added value of new biomarkers in addition to established risk factors, the aim might be to rank several new markers with respect to their prediction performance. This makes it important to consider the correlation structure of the markers when planning such a study. Because of this complexity, a simulation approach may be required to adequately assess sample size or other aspects, such as the choice of a performance measure.

METHODS: In a simulation study based on real data, we investigated how to generate covariates with realistic distributions and which generating model should be used for the outcome, aiming to determine the least amount of information and complexity needed to obtain realistic results. A large epidemiological cohort study, the Gutenberg Health Study, was used as the basis for the simulation. The added value of markers was quantified and ranked in data sets subsampled from this population, and simulation approaches were judged by the quality of the resulting ranking. One of the evaluated approaches, the random forest, requires original data at the individual level. Therefore, the effect of the size of a pilot study on random forest based simulation was also investigated.

RESULTS: We found that simple logistic regression models failed to generate realistic data, even with extensions such as interaction terms or non-linear effects. The random forest approach was more appropriate for simulating complex data structures. Pilot studies with about 250 observations or more provided a reasonable level of information for this approach.

CONCLUSIONS: We advise against using oversimplified regression models for simulation, in particular when focusing on multivariable research questions. More generally, a simulation should be based on real data in order to adequately reflect complex observational data structures, such as those found in epidemiological cohort studies.
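To make the idea of judging a simulation "by the quality of the ranking" concrete, the following is a minimal sketch and not the authors' procedure. It assumes, purely for illustration, that each marker is scored by its univariate AUC (a stand-in for the added value over established risk factors) and that a subsample ranking is compared with the population ranking by Spearman correlation; the abstract specifies neither. The marker names, effect sizes, event rate, and all function names are hypothetical.

```python
import random


def auc(scores, labels):
    """Mann-Whitney estimate of the AUC; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_markers(markers, labels):
    """Return marker names ordered from highest to lowest AUC."""
    return sorted(markers, key=lambda name: auc(markers[name], labels),
                  reverse=True)


def spearman(order_a, order_b):
    """Spearman correlation of two orderings of the same items (n >= 2, no ties)."""
    n = len(order_a)
    pos_b = {name: i for i, name in enumerate(order_b)}
    d2 = sum((i - pos_b[name]) ** 2 for i, name in enumerate(order_a))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def demo(seed=1, n_pop=2000, n_sub=250):
    """Compare the marker ranking in one subsample with the population ranking."""
    rng = random.Random(seed)
    # Hypothetical toy population: three markers with decreasing signal.
    effects = {"marker_a": 1.0, "marker_b": 0.5, "marker_c": 0.0}
    labels = [1 if rng.random() < 0.3 else 0 for _ in range(n_pop)]
    markers = {name: [eff * y + rng.gauss(0, 1) for y in labels]
               for name, eff in effects.items()}
    reference = rank_markers(markers, labels)

    # Draw a subsample of pilot-study size and rank the markers again.
    idx = rng.sample(range(n_pop), n_sub)
    sub_labels = [labels[i] for i in idx]
    sub_markers = {name: [vals[i] for i in idx]
                   for name, vals in markers.items()}
    return spearman(rank_markers(sub_markers, sub_labels), reference)


if __name__ == "__main__":
    print(demo())
```

Repeating `demo` over many seeds and subsample sizes would give a rough picture of how stable the ranking is at a given study size. The toy generator here draws independent markers, so it ignores exactly the marker correlation structure that the abstract identifies as important.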
