Abstract
Background
Privacy-protecting analytic approaches without centralized pooling of individual-level data, such as distributed regression, are particularly important for vulnerable populations, such as children, but these methods have not yet been tested in multi-center pediatric studies.
Methods
Using the electronic health data from 34 healthcare institutions in the National Patient-Centered Clinical Research Network (PCORnet), we fit 12 multivariable-adjusted linear regression models to assess the associations of antibiotic use <24 months of age with body mass index z-score at 48 to <72 months of age. We ran these models using pooled individual-level data and conventional multivariable-adjusted regression (reference method), as well as using the more privacy-protecting pooled summary-level intermediate statistics and distributed regression technique. We compared the results from these two methods.
Results
Pooled individual-level and distributed linear regression analyses produced virtually identical parameter estimates and standard errors. Across all 12 models, the maximum difference in any of the parameter estimates or standard errors was 4.4833 × 10⁻¹⁰.
Conclusions
We demonstrated empirically the feasibility and validity of distributed linear regression analysis using only summary-level information within a large multi-center study of children. This approach could enable expanded opportunities for multi-center pediatric research, especially when sharing of granular individual-level data is challenging.
Introduction
The use of large clinical data sources for research on children can substantially improve pragmatic evaluations of clinical interventions, enable disease surveillance and rare disease research, and expedite assessments of exposure-disease associations.1 The widespread adoption of electronic health records (EHRs) and the development of multi-center clinical data networks have facilitated these types of investigations on diverse populations using real-world data.2 This new era presents unique challenges, especially for pediatric research.3 Privacy protections for children are more stringent than those for the general population, because children are classified as a vulnerable population in the U.S. Department of Health and Human Services regulations for the protection of human subjects in research.4 New methodologies and approaches are needed to properly protect children and their data.
There are several ways to conduct multi-center or multi-database studies. An intuitive and conventional approach is to pool the entire databases or the derived study-specific individual-level datasets for analysis. However, centralized pooling of detailed individual-level datasets, even when stripped of direct patient identifiers, is not always possible. Healthcare systems and patients are often concerned about patient privacy and confidentiality, unauthorized uses of transferred data, or unintended disclosures of sensitive corporate or institutional information, issues compounded with pediatric research.5,6,7,8 Contractual agreements between health plans, delivery systems, and their members or patients may further restrict sharing of individual-level data with other entities for secondary purposes such as research. These challenges can be addressed in part by proper governance, appropriate ethical approval and data use agreements, and applicable updates to laws or regulations that oversee privacy protection in research. However, the considerable amount of time and resources required to obtain layers of formal agreements and approvals may render the project infeasible.
Another promising option is to employ more privacy-protecting analytic methods that require less granular information from participating sites yet provide results equivalent or very similar to those from the conventional pooled individual-level data analysis. In this article, we describe the application of distributed linear regression, a method that allows researchers to use only summary-level information to perform standard multivariable-adjusted linear regression analysis that is traditionally done by pooling individual-level data.9,10 Distributed regression requires only intermediate summary statistics (e.g., the sums of squares and cross products matrix) to be shared but produces results statistically equivalent to those obtained when the individual-level datasets are pooled.9,10 We have previously demonstrated the use of this analytic method by comparing different bariatric surgery procedures in an adult study conducted within a large distributed research network.11 Here, we describe the use of this analytic method in a pediatric study conducted within the same network.
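The equivalence described above follows from the additivity of cross-product matrices across sites. The identities below are a sketch in notation that is assumed here rather than taken from the study: sites are indexed k = 1, …, K, X_k is the design matrix at site k (including the intercept column), y_k is the outcome vector at site k, n is the total number of patients, and p is the number of columns of X.

```latex
\begin{aligned}
X^\top X &= \sum_{k=1}^{K} X_k^\top X_k, \qquad
X^\top y = \sum_{k=1}^{K} X_k^\top y_k, \qquad
y^\top y = \sum_{k=1}^{K} y_k^\top y_k, \\
\hat\beta &= \left(X^\top X\right)^{-1} X^\top y, \\
\hat\sigma^2 &= \frac{y^\top y - \hat\beta^\top X^\top y}{n - p}, \qquad
\widehat{\operatorname{Var}}\bigl(\hat\beta\bigr) = \hat\sigma^2 \left(X^\top X\right)^{-1}.
\end{aligned}
```

Every quantity on the right-hand side is a sum of site-level matrices or scalars, so the estimates and standard errors can be computed without any row of individual-level data leaving a site.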
Methods
Pooled de-identified individual-level data analysis in a multi-center study
In a typical multi-center pediatric study, the analysis center, which can also be a data-contributing site, receives data from all participating sites and performs the statistical analysis using the pooled data. The convention in most multi-center studies is to request de-identified individual-level datasets from the participating sites. In pooled individual-level data analysis, the participating sites send the analysis center an analytic dataset with distinct covariate information from each patient. Each site-specific dataset includes one or more rows (or observations) per patient and one column per covariate (e.g., treatment status, outcome status, confounders). Upon pooling, the combined dataset is essentially a bigger individual-level dataset that allows the analysis center to perform a wide range of statistical analyses. Direct patient identifiers and most protected health information per the U.S. Health Insurance Portability and Accountability Act can often be removed or masked without compromising the validity of the analysis.12
Distributed linear regression in a multi-center study
Distributed regression is another approach that allows for the execution of standard multivariable-adjusted regression analysis in a multi-center study using only summary-level information from each data-contributing site.9,10,11 It performs the same numeric algorithm as standard individual-level regression analysis and, therefore, should theoretically produce the same results. For continuous outcomes, researchers can employ distributed linear regression to generate a sums of squares and cross products (SSCP) matrix for the intercept, the dependent variable (i.e., outcome), and independent variables (i.e., treatment and covariates) at each data-contributing site. Once this summary-level information is provided to the analysis center, it can be used to produce parameter estimates and standard errors (or 95% confidence intervals).9,10,11 Some standard statistical software procedures, including PROC REG in SAS (SAS Institute, Cary, North Carolina), can input or output the SSCP matrix, which can then be used to perform the distributed analysis. In practice, distributed linear regression analysis and the pooled individual-level data analysis follow similar steps, but the former requires more data processing (specifically, the creation of the SSCP matrix) to occur at the participating sites.
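The two roles described above (sites build an SSCP matrix, the analysis center combines them) can be sketched as follows. This is a minimal illustration in Python with NumPy on synthetic data, not the SAS programs used in the study; the function names `site_sscp` and `fit_from_sscp`, the simulated covariates, and the coefficients are all hypothetical.

```python
import numpy as np

def site_sscp(X, y):
    """Run at a site: SSCP matrix of [intercept, covariates, outcome] and row count."""
    Z = np.column_stack([np.ones(len(y)), X, y])
    return Z.T @ Z, len(y)

def fit_from_sscp(site_results):
    """Run at the analysis center: sum the site SSCP matrices and fit the model."""
    total = sum(sscp for sscp, _ in site_results)
    n = sum(rows for _, rows in site_results)
    xtx = total[:-1, :-1]          # X'X block
    xty = total[:-1, -1]           # X'y block
    yty = total[-1, -1]            # y'y scalar
    beta = np.linalg.solve(xtx, xty)
    n_params = xtx.shape[0]
    rss = yty - beta @ xty         # residual sum of squares from summaries only
    sigma2 = rss / (n - n_params)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(xtx)))
    return beta, se

# Synthetic stand-in for three sites (illustrative values only).
rng = np.random.default_rng(0)
sites = []
for n_k in (50, 120, 80):
    X = rng.normal(size=(n_k, 2))
    y = 0.5 + X @ np.array([0.2, -0.1]) + rng.normal(size=n_k)
    sites.append((X, y))

beta, se = fit_from_sscp([site_sscp(X, y) for X, y in sites])
print(beta, se)
```

Only the small square matrix and the row count leave each site. In a real analysis the site indicator variables would be additional columns of the design matrix, and very small sites may still warrant disclosure review of their summary statistics.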
Application of distributed linear regression in a multi-center pediatric study
Setting
The National Patient-Centered Clinical Research Network (PCORnet) is a large distributed data network designed to facilitate multi-center research. During the time of this study, PCORnet included 13 Clinical Research Networks (CRNs), 20 Patient-Powered Research Networks (PPRNs), and 2 Health Plan Research Networks (HPRNs).13 In Fall 2018, the network condensed to nine CRNs, all of which were included in this study. The CRNs are each composed of multiple healthcare institutions, which in total contribute EHR or other healthcare data, including some pharmacy dispensing data, from millions of individuals. The PPRNs and HPRNs also can contribute data for patient-centered research projects. PCORnet uses a common data model that includes data across 15 tables and approximately 100 variables.14 Data elements include patient demographics, diagnoses, procedures, vital signs, prescribed or dispensed medications, laboratory test results, and mortality. The PCORnet Antibiotics and Childhood Growth Study was one of two inaugural observational demonstration projects funded to help develop the PCORnet data infrastructure. The other study was the PCORnet Bariatric Study,15,16 which has previously examined the distributed linear regression technique in an adult cohort.11 For these two studies, we had pooled individual-level data and the capacity to conduct distributed linear regression, allowing for direct comparisons of results from both analytic approaches.
Study cohort
Initiated in 2016, the PCORnet Antibiotics and Childhood Growth Study examined the association of antibiotic use at <24 months of age with body mass index (BMI) z-score and overweight and obesity at age 48 to <72 months. Details of the study are available elsewhere.17,18 Briefly, the study included data from 2009 to 2016 from 35 healthcare institutions that were organized into 28 “network partners” or distinct databases that served as the basis of the distributed analysis described in this article. Children were eligible for inclusion if they had same-day height and weight measures at 0 to <12 months, 12 to <30 months, and 48 to <72 months of age. Requiring multiple longitudinal measures ensured that children were receiving regular care over time, allowing for better capture of antibiotic prescriptions. During the outcome assessment period of age 48 to <72 months, we used the same-day height and weight measures closest to 60 months to calculate age-sex-specific BMI z-scores, using publicly available macros from the Centers for Disease Control and Prevention.19 The final sample size in the main study was 362,550 children. For the methods study described here, we used data from 27 network partners, including 34 of the 35 healthcare institutions; one network partner was unable to participate because it did not have the necessary SAS software to run the linear regression model.
Statistical analysis
As we did in the main PCORnet Antibiotics and Childhood Growth Study,18 we examined the continuous outcome of BMI z-score using the analyses of the pooled de-identified individual-level data as the benchmark. We fit 12 linear regression models to assess the associations of antibiotic use <24 months of age with BMI z-score at 48 to <72 months of age. The 12 models separately analyzed different categories of antibiotic exposure (all, broad-spectrum, narrow-spectrum), two exposure types (binary [yes/no], categorical [0, 1, 2, 3, ≥4 episodes]), and two strata (patients with and without complex chronic conditions). We used the condition list developed by Feudtner20 plus hypothyroidism and pituitary disorders to define complex chronic conditions; these conditions were generally considered serious chronic childhood illnesses.
Because multiple antibiotic prescriptions may be written to treat a single illness, we joined together all prescriptions written within 10 days of another prescription to create an antibiotic episode, and we classified the episode as broad- or narrow-spectrum based on the broadest spectrum antibiotic prescribed. Narrow-spectrum antibiotics included mostly amoxicillin but also penicillin and dicloxacillin; broad-spectrum antibiotics were all others. All models adjusted for age in months within the 48 to <72 month outcome assessment window, sex (male/female), race (Asian, Black or African American, White, Other, Unknown), Hispanic ethnicity (yes/no), network partner (26 binary indicator variables), preterm birth status (yes/no), asthma diagnosis (yes/no), and the number of infection episodes (0, 1, 2, 3, ≥4; treated as a continuous variable for the purpose of the analysis), systemic corticosteroid prescription episodes (0, 1, 2, 3, ≥4; treated as a continuous variable for the purposes of the analysis), and healthcare encounters (log transformed; continuous variable) measured before 24 months of age.
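The episode rule described above can be sketched as follows. This is a hypothetical Python illustration, not the study's SAS code: it assumes that prescriptions chain (each one is compared with the previous prescription in the episode), that "within 10 days" is inclusive, and that each prescription is already labeled "narrow" or "broad". The function name and example dates are invented.

```python
from datetime import date

def build_episodes(prescriptions, gap_days=10):
    """Group (date, spectrum) prescriptions for one child into episodes.

    Assumption: a prescription joins the current episode when it is written
    within `gap_days` of the most recent prescription in that episode.
    """
    episodes = []
    for rx_date, spectrum in sorted(prescriptions):
        if episodes and (rx_date - episodes[-1]["end"]).days <= gap_days:
            current = episodes[-1]
            current["end"] = rx_date
            # The episode takes the broadest spectrum prescribed within it.
            if spectrum == "broad":
                current["spectrum"] = "broad"
        else:
            episodes.append({"start": rx_date, "end": rx_date, "spectrum": spectrum})
    return episodes

# Hypothetical prescription history for one child.
rx = [
    (date(2012, 1, 3), "narrow"),
    (date(2012, 1, 9), "broad"),
    (date(2012, 3, 1), "narrow"),
]
episodes = build_episodes(rx)
episode_category = min(len(episodes), 4)  # exposure levels 0, 1, 2, 3, >=4
print(episodes, episode_category)
```

Here the first two prescriptions are 6 days apart and merge into a single broad-spectrum episode, while the third starts a new narrow-spectrum episode.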
We then fit the same 12 models using the distributed regression approach. The SAS packages used to extract the individual-level data (for the benchmark analysis) and the summary-level information (for the distributed linear regression analysis) from the participating sites, as well as the SAS package used to analyze the pooled data in each approach at the analysis center, are freely available at https://github.com/pcornet-analytics/antibiotics. We performed all analyses using SAS version 9.4 (SAS Institute, Cary, North Carolina).
Results
We identified 356,283 patients within 27 network partners (Table 1). The number of patients ranged from 34 to 187,226 across network partners. Figure 1 shows the results from the pooled de-identified individual-level linear regression model that assessed the association of any (vs. no) antibiotic use before 24 months of age with BMI z-score at 48 to <72 months, by network partner, among patients without complex chronic conditions. Table 2 shows the results from the benchmark pooled individual-level model (exposure of any vs. no antibiotics for children without a complex chronic condition) and the corresponding distributed regression model. The results were virtually identical between the two analytic approaches, with the maximum difference in any parameter estimate or standard error being 2.5886 × 10⁻¹⁰. The results from the remaining 11 models were also essentially identical between the two analytic approaches (Table 3). Across all 12 models, the maximum difference in any of the values was 4.4833 × 10⁻¹⁰.
Discussion
Using the association of antibiotic use in early life with weight outcomes in later childhood, we demonstrated the validity and feasibility of conducting distributed linear regression analysis in a real-world multi-center pediatric study. To our knowledge, this is the first study to employ the more privacy-protecting distributed regression technique in a multi-center pediatric study. The validated distributed analytic approach is particularly valuable for pediatric studies, which face greater scrutiny and require more privacy protections. In the main PCORnet Antibiotics and Childhood Growth Study, we required institutions to share de-identified individual-level data, in part because the distributed approach had not been used in PCORnet at the time. Two healthcare institutions that originally signed up for the study could not participate because they were unwilling to share individual-level data for the main analysis of the study. Had we used distributed regression, both could have participated. Moving forward, PCORnet, as a large distributed network, could consider using only distributed regression to conduct certain analyses.
Distributed regression can be implemented for other regression models, including logistic, Poisson, and Cox proportional hazards models.10,21,22,23,24,25,26 These modeling approaches require multiple iterative steps, in contrast to the single computation step we demonstrated in this study for linear regression. The extra iterative process includes exchanges of intermediate statistics between the analysis center and the participating sites.27 These steps can be labor-intensive, and the inability to execute them automatically in standard statistical software limits the use of distributed regression. Researchers have been working to develop statistical packages and stand-alone software to facilitate the use of distributed regression in PCORnet and other networks.21,22,25,26,27 However, there are also some modeling procedures that cannot currently be performed with distributed regression, including multi-level modeling and generalized estimating equations. Some model diagnostics cannot readily be computed using summary-level information without making some compromises. For example, residual plots require data points from individual patients. More methodological development is needed to expand the capability of distributed regression methods.
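To make the iterative exchange concrete, the sketch below fits a logistic model by Newton-Raphson, where each round the analysis center sends the current coefficients to the sites and receives back only a gradient vector and an information matrix. This is an assumed, simplified illustration on synthetic data, not the protocol of any cited package; the function names and simulated values are hypothetical.

```python
import numpy as np

def site_gradient_and_information(X, y, beta):
    """Run at a site: summary statistics evaluated at the current beta."""
    A = np.column_stack([np.ones(len(y)), X])
    p = 1.0 / (1.0 + np.exp(-(A @ beta)))
    grad = A.T @ (y - p)
    info = (A * (p * (1.0 - p))[:, None]).T @ A
    return grad, info

def distributed_logistic(sites, n_params, tol=1e-10, max_iter=25):
    """Run at the analysis center: one exchange with every site per iteration."""
    beta = np.zeros(n_params)
    for n_rounds in range(1, max_iter + 1):
        parts = [site_gradient_and_information(X, y, beta) for X, y in sites]
        grad = sum(g for g, _ in parts)
        info = sum(h for _, h in parts)
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Standard errors from the information matrix of the final exchange.
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return beta, se, n_rounds

# Synthetic stand-in for two sites (illustrative values only).
rng = np.random.default_rng(1)
sites = []
for n_k in (200, 300):
    X = rng.normal(size=(n_k, 1))
    prob = 1.0 / (1.0 + np.exp(-(-0.5 + 0.8 * X[:, 0])))
    y = (rng.random(n_k) < prob).astype(float)
    sites.append((X, y))

beta, se, n_rounds = distributed_logistic(sites, n_params=2)
print(beta, se, n_rounds)
```

Unlike the linear case, several rounds are needed before convergence, which is why automation of the exchange matters in practice.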
Distributed regression can be more prone to errors because the analysis center does not have access to the individual-level data from all participating sites for data exploration and data quality assessment. This may lead to biased results due to the impact of unappreciated data characteristics that could not be accounted for in developing the analysis. Because of the reliance on quality of the underlying data, distributed analyses may be best suited for mature networks in which multiple cycles of data characterization and quality assurance have been done. PCORnet is now reaching that stage of maturity. As an alternative, researchers doing multi-center research can pursue a hybrid approach whereby they have access to individual-level data from one or a few institutions as a beta-testing environment, allowing for assessment of data quality and testing of analytic programs. A phased process with an initial round of queries to provide descriptive results for key variables could also help identify potential data issues early in the process, before the analytic queries are done.
Distributed regression may also introduce additional time and burden on data-contributing sites. However, this may not be a major concern within research networks like PCORnet that have standardized their information into a common data format. In these networks, the analysis center can develop an analytic program that processes the data into the correct format (e.g., SSCP matrix). As all sites have their data structured in the same manner, the participating sites can execute the program with minimal modification to the code. In the case of PCORnet distributed queries, sites were asked to execute the queries unaltered except for changing the data library name. As with conventional pooled individual-level data analysis, all statistical code in distributed regression can be shared, allowing for any institution to execute analytic programs on their data in the same manner as the institutions included in the study.
In addition to distributed regression, there are other privacy-protecting analytic methods that can perform sophisticated statistical analysis using only summary-level information in multi-center pediatric studies, including methods that leverage confounder summary scores (e.g., propensity scores) and meta-analysis of site-specific effect estimates.28,29,30,31 Some of the analytic options are available across various methods while others are unique to specific techniques. For example, it is possible to use only summary-level information to perform confounder summary score-matched or -stratified analysis of binary or categorical exposures and binary or time-to-event outcomes with any of these methods; the results will be identical to those obtained from the corresponding pooled individual-level data analysis.28,29,30,31 Meta-analysis of site-specific effect estimates allows researchers to examine the relations between different types of exposures (binary, categorical, and continuous) and outcomes (binary, categorical, continuous, and time-to-event); site-specific confounding adjustment can be achieved via matching, stratification, weighting, or modeling. However, meta-analysis of site-specific effect estimates generally produces results that are similar, but not identical, to those obtained from the corresponding pooled individual-level data analysis.28,29,30,31
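As one illustration of meta-analysis of site-specific effect estimates, the sketch below pools estimates with fixed-effect inverse-variance weighting. The choice of this particular pooling rule is an assumption made for the example (the text above does not specify one), and the site estimates and standard errors are invented numbers, not study results.

```python
import math

def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical site-specific regression coefficients and standard errors.
site_estimates = [0.05, 0.03, 0.08]
site_ses = [0.02, 0.01, 0.04]

pooled, pooled_se = fixed_effect_meta(site_estimates, site_ses)
print(pooled, pooled_se)
```

Each site shares only two numbers per coefficient, but because every site adjusts for confounders within its own data, the pooled value generally approximates rather than reproduces the pooled individual-level result.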
In conclusion, privacy-protecting methods, such as distributed linear regression, can perform multivariable-adjusted regression analysis without transferring individual-level data in multi-center pediatric studies. The analytic approach enables researchers to analyze data that are otherwise not accessible due to restrictions to sharing individual-level data, including pediatric data, for which this approach may be particularly well-suited.
References
Cheng, T. L., Bogue, C. W. & Dover, G. J. The next 7 great achievements in pediatric research. Pediatrics 139, e20163803 (2017).
Curtis, L. H., Brown, J. & Platt, R. Four health data networks illustrate the potential for a shared national multipurpose big-data network. Health Aff. (Millwood) 33, 1178–1186 (2014).
Currie, J. “Big data” versus “big brother”: on the appropriate use of large-scale data collections in pediatrics. Pediatrics 131(Suppl 2), S127–S132 (2013).
Department of Health and Human Services. The Code of Federal Regulations. Title 45, Subtitle A, Subchapter A, Part 46: Protection of Human Subjects. (https://www.ecfr.gov/cgi-bin/retrieveECFR?gp=&SID=83cd09e1c0f5c6937cd9d7513160fc3f&pitd=20180719&n=pt45.1.46&r=PART&ty=HTML#se45.1.46_1401).
Simon, G. E. et al. Data sharing and embedded research. Ann. Intern. Med. 167, 668–670 (2017).
Brown, J. S. et al. Distributed health data networks: a practical and preferred approach to multi-institutional evaluations of comparative effectiveness, safety, and quality of care. Med. Care 48, S45–S51 (2010).
Toh, S., Platt, R., Steiner, J. F. & Brown, J. S. Comparative-effectiveness research in distributed health data networks. Clin. Pharmacol. Ther. 90, 883–887 (2011).
Mazor, K. M. et al. Stakeholders’ views on data sharing in multicenter studies. J. Comp. Eff. Res. 6, 537–547 (2017).
Karr, A. F., Lin, X., Sanil, A. P. & Reiter, J. P. Secure regression on distributed databases. J. Comput. Graph. Stat. 14, 263–279 (2005).
Fienberg, S. E., Fulp, W. J., Slavković, A. B. & Wrobel, T. A. “Secure” log-linear and logistic regression analysis of distributed databases. Lect. Notes Comput. Sci. 2006, 277–290 (2006).
Toh, S. et al. Combining distributed regression and propensity scores: a doubly privacy-protecting analytic method for multicenter research. Clin. Epidemiol. 10, 1773–1786 (2018).
Sarpatwari, A., Kesselheim, A. S., Malin, B. A., Gagne, J. J. & Schneeweiss, S. Ensuring patient privacy in data sharing for postapproval research. N. Engl. J. Med. 371, 1644–1649 (2014).
Fleurence, R. L. et al. Launching PCORnet, a national patient-centered clinical research network. J. Am. Med Inf. Assoc. 21, 578–582 (2014).
PCORnet. PCORnet Common Data Model. The People-Centered Research Foundation, 2019. (https://pcornet.org/data-driven-common-model/).
Toh, S. et al. The National Patient-Centered Clinical Research Network (PCORnet) Bariatric Study Cohort: Rationale, Methods, and Baseline Characteristics. JMIR Res. Protoc. 6, e222 (2017).
Arterburn, D. et al. Comparative effectiveness and safety of bariatric procedures for weight loss: a PCORnet Cohort Study. Ann. Intern. Med. 169, 741–750 (2018).
Block, J. P. et al. PCORnet Antibiotics and Childhood Growth Study: Process for cohort creation and cohort description. Acad. Pediatr. 18, 569–576 (2018).
Block, J. P. et al. Early antibiotic exposure and weight outcomes in young children. Pediatrics 142 (2018).
Kuczmarski, R. J. et al. CDC growth charts: United States. Adv. Data 1–27 (2000).
Feudtner, C. et al. Deaths attributed to pediatric complex chronic conditions: national trends and implications for supportive care services. Pediatrics 107, E99 (2001).
Wu, Y., Jiang, X., Kim, J. & Ohno-Machado, L. Grid Binary LOgistic REgression (GLORE): building shared models without sharing data. J. Am. Med. Inf. Assoc. 19, 758–764 (2012).
El Emam, K. et al. A secure distributed logistic regression protocol for the detection of rare adverse drug events. J. Am. Med. Inf. Assoc. 20, 453–461 (2012).
Fienberg, S. E., Karr, A. F., Nardi, Y. & Slavkovic, A. Secure logistic regression with multi-party distributed databases. In Proc. of the 56th Session of the ISI, 3506–3513 (The Bulletin of the International Statistical Institute, 2007).
Slavković, A. B., Nardi, Y. & Tibbits, M. M. Secure logistic regression of horizontally and vertically partitioned distributed databases. In Proc. of Workshop on Privacy and Security Aspects of Data Mining. 723–728 (IEEE Computer Society Press, 2007).
Lu, C. L. et al. WebDISCO: a web service for distributed cox model learning without patient-level data sharing. J. Am. Med. Inf. Assoc. 22, 1212–1219 (2015).
Gaye, A. et al. DataSHIELD: taking the analysis to the data, not the data to the analysis. Int J. Epidemiol. 43, 1929–1944 (2014).
Her, Q. L. et al. A query workflow design to perform automatable distributed regression analysis in large distributed data networks. EGEMS (Wash. DC) 6, 11 (2018).
Toh, S. et al. Confounding adjustment in comparative effectiveness research conducted within distributed research networks. Med. Care 51, S4–S10 (2013).
Toh, S., Shetterly, S., Powers, J. D. & Arterburn, D. Privacy-preserving analytic methods for multisite comparative effectiveness and patient-centered outcomes research. Med. Care 52, 664–668 (2014).
Toh, S. et al. Multivariable confounding adjustment in distributed data networks without sharing of patient-level data. Pharmacoepidemiol. Drug Saf. 22, 1171–1177 (2013).
Li, X. et al. Validity of privacy-protecting analytical methods that use only aggregate-level information to conduct multivariable-adjusted analysis in distributed data networks. Am. J. Epidemiol. 188, 709–723 (2019).
Acknowledgements
This work was supported through the Patient-Centered Outcomes Research Institute (PCORI) Program Award (OBS-1505-30699). All statements in this manuscript are solely those of the authors and do not necessarily represent the views of PCORI, its Board of Governors, or its Methodology Committee. The PCORnet Antibiotics and Childhood Growth Study Team includes a diverse group of investigators, research staff, clinicians, community members, and parent caregivers. All members of the team including the study’s Executive Antibiotic Stakeholder Advisory Group (EASAG) contributed to the study design, data acquisition, and interpretation of results. The Study Team would like to thank the leaders of the participating PCORnet Clinical Data Research Networks (CDRNs) and PCORnet Coordinating Center as well as members of the PCORI team for their support and commitment to this project. The funding organization was not involved in the design of the study; the collection, analysis, and interpretation of the data; or the decision to approve publication of the finished manuscript.
PCORnet Antibiotics and Childhood Growth Study Group:
Brad Appelhans6, David Arterburn7, Janne Boone-Heinenon8, Andrew L. Brickman9, H. Timothy Bunnell10, F. Sessions Cole, III11, Matthew F. Daley12, Amanda Dempsey13, Jonathan Finkelstein14, Stephanie L. Fitzpatrick15, William Heerman16, Michael Horberg17, Carmen R. Isasi18, Melanie Jay19, Elyse Kharbanda20, Ritu Khare21, Dominick Lemas22, Simon M. Lin23, Mary Jo Messito24, Allison O’Neill25, Holly Landrum Peay26, Micah Prochaska27, Daksha Ranade28, Goutham Rao29, Maria Rayas30, Juliane S. Reynolds31, Marc Rosenman32, Bradley Taylor33, Zachary Willis34
Author information
Contributions
S.T., S.L.R.S., L.C.B., C.B.F., C.E.H., D.L., E.M., J.L.S., J.G.Y., J.P.B., and the PCORnet Antibiotics and Childhood Growth Study Group contributed substantially to conception and design, acquisition of data, or analysis and interpretation of data; S.T., J.P.B. and P.I.L. drafted the article or revised it critically for important intellectual content; and S.T., J.P.B., L.C.B. and C.B.F. granted final approval of the version to be published.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Members of the “PCORnet Antibiotics and Childhood Growth Study Group” are listed above the Acknowledgements.
Cite this article
Toh, S., Rifas-Shiman, S.L., Lin, PI.D. et al. Privacy-protecting multivariable-adjusted distributed regression analysis for multi-center pediatric study. Pediatr Res 87, 1086–1092 (2020). https://doi.org/10.1038/s41390-019-0596-0
DOI: https://doi.org/10.1038/s41390-019-0596-0