BACKGROUND: A growing number of observational studies do not focus only on single biomarkers for predicting an outcome event but address questions in a multivariable setting. For example, when quantifying the added value of new biomarkers beyond established risk factors, the aim might be to rank several new markers by their prediction performance. This makes it important to consider the correlation structure of the markers when planning such a study. Because of this complexity, a simulation approach may be required to adequately assess sample size or other aspects, such as the choice of a performance measure.
METHODS: In a simulation study based on real data, we investigated how to generate covariates with realistic distributions and which generating model should be used for the outcome, aiming to determine the least amount of information and complexity needed to obtain realistic results. A large epidemiological cohort study, the Gutenberg Health Study, served as the basis for the simulation. The added value of markers was quantified and ranked in data sets subsampled from this population data, and simulation approaches were judged by the quality of the resulting ranking. One of the evaluated approaches, the random forest, requires original data at the individual level. Therefore, the effect of the size of a pilot study on random forest-based simulation was also investigated.
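The subsample-and-rank evaluation described above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the study: the markers are equicorrelated standard-normal variables, the outcome comes from a plain logistic model (the kind of simple generating model under evaluation, used here only for illustration), and univariate AUC stands in for the added-value measure. All function names, effect sizes, and sample sizes are hypothetical, and no Gutenberg Health Study data or random forest is involved.

```python
import math
import random


def simulate_population(n, effects, rho, rng):
    """Draw n subjects with equicorrelated normal markers and a logistic outcome.

    Assumption: a shared latent factor induces pairwise correlation rho.
    """
    rows = []
    for _ in range(n):
        shared = rng.gauss(0, 1)
        markers = [
            math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1)
            for _ in effects
        ]
        linear = -1.0 + sum(b * x for b, x in zip(effects, markers))
        prob = 1.0 / (1.0 + math.exp(-linear))
        rows.append((markers, 1 if rng.random() < prob else 0))
    return rows


def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.5  # AUC is undefined without both classes; use a neutral value
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def rank_markers(rows):
    """Order marker indices from highest to lowest univariate AUC."""
    labels = [y for _, y in rows]
    n_markers = len(rows[0][0])
    aucs = [auc([m[j] for m, _ in rows], labels) for j in range(n_markers)]
    return sorted(range(n_markers), key=lambda j: -aucs[j])


def ranking_agreement(population, size, n_rep, rng):
    """Fraction of subsamples whose marker ranking matches the population ranking."""
    reference = rank_markers(population)
    hits = 0
    for _ in range(n_rep):
        subsample = rng.sample(population, size)
        if rank_markers(subsample) == reference:
            hits += 1
    return hits / n_rep


rng = random.Random(1)
# Hypothetical effect sizes for three markers with pairwise correlation 0.3.
population = simulate_population(5000, [0.8, 0.4, 0.0], 0.3, rng)
reference_ranking = rank_markers(population)
agreement_small = ranking_agreement(population, 100, 50, rng)
agreement_large = ranking_agreement(population, 1000, 50, rng)
print(reference_ranking, agreement_small, agreement_large)
```

Comparing the agreement at different subsample sizes gives a rough sense of how sample size affects the stability of a marker ranking. A closer analogue of the study design would replace the univariate AUC with a measure of added value over a model containing established risk factors.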
RESULTS: We found that simple logistic regression models failed to generate adequately realistic data, even with extensions such as interaction terms or non-linear effects. The random forest approach was more appropriate for simulating complex data structures. Pilot studies of about 250 observations or more provided a reasonable level of information for this approach.
CONCLUSIONS: We advise against using oversimplified regression models for simulation, in particular when focusing on multivariable research questions. More generally, a simulation should be based on real data to adequately reflect complex observational data structures, such as those found in epidemiological cohort studies.