
Simulation of complex data structures for planning of studies with focus on biomarker comparison.


Published: 2017-06-13
Journal: BMC Med Res Methodol

DOI: 10.1186/s12874-017-0364-y
Article type: Journal article


BACKGROUND: A growing number of observational studies do not focus only on single biomarkers for predicting an outcome event but address questions in a multivariable setting. For example, when quantifying the added value of new biomarkers in addition to established risk factors, the aim might be to rank several new markers with respect to their prediction performance. This makes it important to consider the marker correlation structure when planning such a study. Because of this complexity, a simulation approach may be required to adequately assess sample size or other aspects, such as the choice of a performance measure.

METHODS: In a simulation study based on real data, we investigated how to generate covariates with realistic distributions and which generating model should be used for the outcome, aiming to determine the least amount of information and complexity needed to obtain realistic results. A large epidemiological cohort study, the Gutenberg Health Study, was used as the basis for the simulation. The added value of markers was quantified and ranked in data sets subsampled from this population, and simulation approaches were judged by the quality of the ranking. One of the evaluated approaches, the random forest, requires original data at the individual level. Therefore, the effect of the size of a pilot study on random-forest-based simulation was also investigated.

RESULTS: We found that simple logistic regression models failed to adequately generate realistic data, even with extensions such as interaction terms or non-linear effects. The random forest approach was seen to be more appropriate for simulation of complex data structures. Pilot studies starting at about 250 observations were seen to provide a reasonable level of information for this approach.

CONCLUSIONS: We advise against using oversimplified regression models for simulation, in particular when focusing on multivariable research questions. More generally, a simulation should be based on real data to adequately reflect complex observational data structures, such as those found in epidemiological cohort studies.
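
The subsample-and-rank idea described in the methods can be illustrated with a small toy sketch. It is not the authors' procedure, and several parts are assumptions made only for illustration: the "population" is synthetic rather than real cohort data, the outcome comes from a plain logistic model (the kind of simplification the abstract cautions against), markers are ranked by univariate AUC instead of added value over established risk factors, and all names and parameter values (`simulate_cohort`, `betas`, `rho`, and so on) are hypothetical.

```python
import math
import random


def simulate_cohort(n, betas, rho, intercept, rng):
    """Draw n subjects with equicorrelated standard-normal markers and a
    binary outcome from a logistic model (a deliberately simple stand-in)."""
    rows = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)  # common factor; pairwise correlation is rho
        markers = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in betas
        ]
        linear = intercept + sum(b * x for b, x in zip(betas, markers))
        prob = 1.0 / (1.0 + math.exp(-linear))
        rows.append((markers, 1 if rng.random() < prob else 0))
    return rows


def auc(values, events):
    """Rank-sum (Mann-Whitney) AUC; assumes continuous values without ties."""
    n_cases = sum(events)
    n_controls = len(events) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    order = sorted(range(len(values)), key=values.__getitem__)
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if events[i] == 1)
    return (rank_sum - n_cases * (n_cases + 1) / 2) / (n_cases * n_controls)


def rank_markers(rows):
    """Return marker indices ordered from highest to lowest univariate AUC."""
    events = [event for _, event in rows]
    n_markers = len(rows[0][0])
    aucs = [auc([m[j] for m, _ in rows], events) for j in range(n_markers)]
    return sorted(range(n_markers), key=lambda j: aucs[j], reverse=True)


def ranking_agreement(population, subsample_size, repetitions, rng):
    """Share of subsamples whose marker ranking equals the population ranking."""
    reference = rank_markers(population)
    hits = sum(
        rank_markers(rng.sample(population, subsample_size)) == reference
        for _ in range(repetitions)
    )
    return hits / repetitions


rng = random.Random(2017)  # fixed seed so repeated runs give the same numbers
population = simulate_cohort(5000, betas=[0.8, 0.4, 0.0], rho=0.3,
                             intercept=-1.0, rng=rng)
for size in (100, 250, 1000):
    print(size, ranking_agreement(population, size, 200, rng))
```

Each printed line gives a subsample size and the fraction of subsamples that reproduce the population ranking. In a planning exercise along the lines of the abstract, the synthetic population would be replaced by real or pilot data, and the ranking criterion by whatever performance measure the study intends to use.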

