Methodological NEAR results: Tackling systematically missing data in multi-center studies

NEAR’s ongoing efforts to tackle statistical challenges in multi-center studies have produced new insights on handling systematically missing data, that is, data on a variable that is 100% missing in some studies. Conducted in collaboration with statistician Nicola Orsini and his PhD student, Robert Thiesmeier, this recent study explores innovative methods to address such gaps when data across centers cannot be fully pooled.

Addressing Data Challenges in Multi-Site Studies with Cross-site Imputation
Analyzing data from multiple study sites can present practical challenges (e.g., sharing individual data) and methodological ones (e.g., missing data) that can lead to a loss of information and potentially introduce bias in the results. Multiple imputation is typically used to address missing values, but the standard procedure doesn’t work if the data from different study sites can’t be combined into one file. To address this issue, scientists have developed a new method that breaks down the multiple imputation procedure into four steps that can be applied without pooling individual data across studies.

Photo: Created by Adobe Photoshop Firefly

Using Cross-site Imputation and the new Stata command

The procedure – cross-site imputation – operates in four steps:

  1. Estimate: In the studies where the variable is observed, use the available data to estimate the statistical distribution of the variable that is systematically missing elsewhere.
  2. Share Information: Share the resulting statistical information (regression coefficients) with the site that has the missing data, where it is used for the imputation.
  3. Impute Missing Data: Use the shared information to fill in the missing data based on the data available at that site.
  4. Combine Results: Use a statistical method called Rubin’s rules to combine the results from all the different imputations.
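
The four steps above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' implementation and not the Stata command: the data are simulated, the imputation model is a simple linear regression of the missing variable x on the outcome y, and all function and variable names (fit_line, draw_parameters, rubins_rules, shared) are hypothetical. Only the small summary dictionary crosses from one site to the other; no individual records are shared.

```python
import math
import random
from statistics import fmean


def fit_line(pred, resp):
    """Least-squares fit of resp on pred, returning summary statistics only."""
    n = len(pred)
    pbar, rbar = fmean(pred), fmean(resp)
    spp = sum((p - pbar) ** 2 for p in pred)
    spr = sum((p - pbar) * (r - rbar) for p, r in zip(pred, resp))
    b1 = spr / spp
    b0 = rbar - b1 * pbar
    s2 = sum((r - b0 - b1 * p) ** 2 for p, r in zip(pred, resp)) / (n - 2)
    cov = (s2 * (1 / n + pbar ** 2 / spp),  # var(b0)
           -pbar * s2 / spp,                # cov(b0, b1)
           s2 / spp)                        # var(b1)
    return {"b0": b0, "b1": b1, "s2": s2, "cov": cov, "df": n - 2}


def draw_parameters(shared, rng):
    """Draw one plausible set of imputation parameters (normal approximation)."""
    # Residual variance: scaled inverse chi-square draw.
    s2_star = shared["s2"] * shared["df"] / rng.gammavariate(shared["df"] / 2, 2)
    scale = s2_star / shared["s2"]
    v00, v01, v11 = (scale * v for v in shared["cov"])
    # Hand-written 2x2 Cholesky factor for a correlated normal draw.
    l00 = math.sqrt(v00)
    l10 = v01 / l00
    l11 = math.sqrt(v11 - l10 ** 2)
    z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
    return shared["b0"] + l00 * z0, shared["b1"] + l10 * z0 + l11 * z1, s2_star


def rubins_rules(estimates, variances):
    """Pool estimates: total variance = within + (1 + 1/m) * between."""
    m = len(estimates)
    qbar = fmean(estimates)
    within = fmean(variances)
    between = sum((q - qbar) ** 2 for q in estimates) / (m - 1)
    return qbar, within + (1 + 1 / m) * between


rng = random.Random(2024)


def simulate(n):
    """Toy data: y depends on x with a true slope of 0.5."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
    return x, y


x_a, y_a = simulate(500)  # site A: x is observed
_, y_b = simulate(500)    # site B: x is systematically missing

# Step 1 (site A): estimate the imputation model for x given y.
shared = fit_line(y_a, x_a)

# Step 2: only the `shared` summary statistics are sent to site B.

# Step 3 (site B): impute x several times, analysing each completed dataset.
estimates, variances = [], []
for _ in range(20):
    b0, b1, s2 = draw_parameters(shared, rng)
    x_imp = [b0 + b1 * yi + rng.gauss(0, math.sqrt(s2)) for yi in y_b]
    fit = fit_line(x_imp, y_b)  # analysis model: y regressed on x
    estimates.append(fit["b1"])
    variances.append(fit["cov"][2])

# Step 4: combine the per-imputation results with Rubin's rules.
slope, total_var = rubins_rules(estimates, variances)
print(f"pooled slope: {slope:.3f}, standard error: {math.sqrt(total_var):.3f}")
```

In this toy setup the pooled slope at site B should land close to the simulated value of 0.5, with a standard error that reflects both the uncertainty in the shared coefficients and the noise added by imputation. A real application would typically involve more covariates and other variable types.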

This method was tested across various simulation scenarios and showed promising first results. Based on this research, Nicola and Robert worked on a software command in Stata called “mi impute from” to facilitate the use of this approach. The command enables researchers to fill in missing values without sharing individual data across multi-center sites. So far, options for continuous, categorical, and binary variables are available.

In summary, cross-site imputation can be a new way of imputing missing data in multi-center studies, avoiding legal and logistic barriers to sharing individual data. With these methods, NEAR strives toward reliable, data-driven insights across diverse datasets, helping researchers and policymakers make well-informed decisions.


Robert Thiesmeier, first author of the study.
Photo: Stephanie Pitt

Publications

Thiesmeier, R., Bottai, M., & Orsini, N. Systematically missing data in distributed data networks: multiple imputation when data cannot be pooled. Journal of Statistical Computation and Simulation. 2024: 1–19. Online ahead of print. https://doi.org/10.1080/00949655.2024.2404220

Thiesmeier, R., Bottai, M., & Orsini, N. Imputing Missing Values with External Data. arXiv. 2024. https://doi.org/10.48550/arXiv.2410.02982