1.1 Real-World Data and Real-World Evidence: Big Data in Practice
In this new era of advanced digital technologies, a huge amount of data is rapidly generated from all walks of life. In particular, the data generated from biomedical research fields and patient care practice are extremely valuable in improving public health and quality of life. Big Data is often a product of real-world practices and daily business operations and not specifically collected for research purposes. In order to use (or reuse) these real-world data to extract meaningful insights for scientific discoveries, there are many technical, analytical, and interpretation challenges and issues that need to be addressed. At the same time, great opportunities exist.
Real-world data, such as data from electronic medical record (EMR) or electronic health record (EHR) systems, could provide real-world evidence for important clinical and scientific questions. These data can complement data from well-designed experiments and randomized controlled trials (RCTs), which are considered the gold standard in clinical research and evidence-based medicine (Van Poucke et al. 2016, Frieden 2017). The environment and conditions of designed experiments and RCTs usually differ from the real world, and the inclusion/exclusion criteria of many RCTs are very restrictive, because these studies try to control noise levels and balance potential confounding factors between study groups. Because randomization balances confounding factors, including unknown or unmeasured ones, between the treatment and control groups on average, the conclusions from well-designed experiments and RCTs are statistically and scientifically valid. However, well-designed experiments and RCTs also have limitations. First, their generalizability may be limited for the following reasons: 1) the inclusion/exclusion criteria may be too restrictive; 2) the environment of designed experiments and RCTs may be different from the real-world situation; 3) subjects who consent to participate in a clinical trial may differ significantly in their characteristics from the general population; and 4) the sample size is limited, often within a specific population, because of subject recruitment, study implementation, and ethical considerations. In addition, the conclusions from RCTs are usually based on the population mean effect, and the individual effects of treatments or interventions are ignored. Moreover, RCTs are often expensive in labor and time, and are sometimes infeasible because of ethical issues and other reasons.
That is why RCTs currently support only 10-20% of clinical decisions (Mills, Thorlund, and Ioannidis 2013, Tricoci et al. 2009, McGinnis et al. 2013). It is necessary to use data from observational studies and real-world data to provide complementary evidence for clinical decisions. Some literature suggests that the results from RCTs and nonrandomized observational studies agree closely (Anglemyer, Horvath, and Bero 2014, Ioannidis et al. 2001). In addition, we expect that real-world data could be used to better design and inform RCTs.
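To make the point about randomization and confounding concrete, the following is a minimal simulation sketch, not an analysis of any real data. It assumes a single hypothetical unmeasured confounder (`severity`, uniform on 0 to 1) and compares its mean between treated and untreated subjects under two assignment rules: a coin flip, as in an RCT, and a rule in which sicker patients are more likely to be treated, as can happen in routine care. All names and parameters are illustrative assumptions.

```python
import random
import statistics


def simulate(n=10000, seed=0):
    """Return the treated-minus-control gap in a confounder under two designs."""
    rng = random.Random(seed)

    # Hypothetical unmeasured confounder, e.g. baseline severity on a 0-1 scale.
    severity = [rng.random() for _ in range(n)]

    # RCT-style assignment: a fair coin flip, independent of severity.
    rct_treated = [rng.random() < 0.5 for _ in range(n)]

    # Observational-style assignment: probability of treatment equals severity,
    # so sicker patients are treated more often (an assumption for illustration).
    obs_treated = [rng.random() < s for s in severity]

    def gap(treated):
        in_treated = [s for s, t in zip(severity, treated) if t]
        in_control = [s for s, t in zip(severity, treated) if not t]
        return statistics.mean(in_treated) - statistics.mean(in_control)

    return gap(rct_treated), gap(obs_treated)


rct_gap, obs_gap = simulate()
print(f"confounder gap under randomization:   {rct_gap:+.3f}")
print(f"confounder gap under observational:   {obs_gap:+.3f}")
```

Under randomization the gap should be close to zero, while under the severity-driven rule the treated group is systematically sicker. This is the imbalance that analyses of real-world data have to handle explicitly.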
The use of large hospital discharge records, disease registry databases, Intensive Care Unit (ICU) databases and insurance claim databases has recently become popular for clinical and epidemiological research. However, the complete and large EMR or EHR databases from real-world clinical practice, including both in-patient and out-patient data from many hospitals and clinics, are not easily accessible to researchers. This type of EHR data could provide a more complete profile of patients' health records. Harnessing big EHR databases allows researchers to investigate rare diseases, low-frequency conditions and long-term adverse events, and to address many important questions from personalized treatment and precision medicine perspectives.
Real-world data often have the features of "Big Data" (Laney 2001), including big volume, big variety and big velocity ("3V" features of Big Data). At the same time, the real-world data are often noisy with a big variability and we also need to evaluate their reliability (veracity) in order to ensure the reproducibility of analysis results. We expect that the real-world data could produce a big value if appropriate analyses are performed, and bias and confounding factors are carefully handled.
In this book, we focus on EHR data, which are real-world data coming from patient care systems, including in-patient hospitals and out-patient clinics. EMR and EHR systems were originally designed for clinical practice, documentation, and billing purposes. Thus, EMR and EHR data were not collected for research or analysis purposes. To reuse these data to address clinical or scientific questions, we need to overcome many barriers and deal with many problems in the real-world data. In this book, we share our experience in addressing these issues for scientific discoveries, covering data extraction, cleaning, pre-processing, preparation, analysis, and modeling.
This book is based on the authors' experience, mostly with the Cerner Health Facts database (Cerner 2020), a de-identified, longitudinal electronic health record (EHR) database intended to facilitate research. Our current version of the Cerner EHR database covers all of the healthcare records for 85 systems with 750 hospitals and healthcare facilities in the United States from 2000 to 2018 (19 years of records). The main patient-level data in Cerner include longitudinal encounters with detailed records of diagnoses, medications, clinical events, procedures and lab procedures. It represents a total of 69 million unique patients across the United States. Of the 69 million patients, 52% are female and 42% are male (6% are gender-unidentified). The racial makeup of the 69 million patients is 52.8% Caucasian, 11.8% African American, 3.0% Hispanic, 1.8% Asian, 0.7% Native American, 6.1% other races (including Asian/Pacific Islander, Biracial, Mid Eastern Indian, Pacific Islander), and 23.7% racial status unknown. In total, the database includes 487 million unique encounters with 939 million diagnoses, coded in International Classification of Diseases (ICD-9 and ICD-10) codes. The database has 674 million medication records, 118 million procedure records, 5.3 billion clinical event records and 4.2 billion lab procedure records.
By 2016, in collaboration with biomedical informaticians, we began to work on an earlier version of the Cerner EHR database. We started an EHR project with predictions of heart failure using the Recurrent Neural Network (RNN) machine learning model (Rasmy et al. 2018). Another collaboration project was to establish a disease-disease network based on the disease comorbidities of all patients in the Cerner EHR database. In Fall 2017, we formally established an EHR Working Group with the support of the newly established Center for Big Data in Health Sciences in the Department of Biostatistics and Data Science, School of Public Health, University of Texas Health Science Center at Houston (UTHealth). This working group consists of statisticians, biomedical informaticians, data scientists, data managers, computer programmers, epidemiologists, clinicians and other investigators from the School of Public Health, School of Biomedical Informatics, McGovern Medical School and Cizik School of Nursing in UTHealth. This multidisciplinary team, including faculty, research associates, students and research staff, started to work together to harness the Big EHR database for scientific discoveries. The EHR Working Group initiated the first project on evaluation of vasopressor treatments for subarachnoid hemorrhage (SAH) patients, which was proposed by our clinical collaborator, Dr. George Williams (MD and FCCP, Associate Professor, Director of Critical Care, Department of Anesthesiology, McGovern Medical School, UTHealth) in November 2017. In Summer 2018, four PhD students were recruited to start the Cerner EHR data cleaning for the SAH project. This SAH project has resulted in publications in both biomedical journals and statistical journals (Williams et al. 2020, Yu et al. 2020, Brown et al. 2020). 
In April 2019, we acquired a new version of the Cerner EHR database with data records from 2000 to 2018. Several new EHR projects, including projects on HIV/AIDS, hospital-acquired infections and antibiotic resistance, and diabetes, were initiated based on this new version. This book is mainly based on our accumulated experience from these completed and ongoing projects over the past years.
1.2 Use of EMR/EHR Database for Research and Scientific Discoveries: Procedure and Life Cycle
Electronic medical records (EMRs) are a digital version of the paper charts of patients in the hospital, clinic or clinician's office. An EMR system usually contains the medical and treatment history of patients to support the clinician's decisions on diagnosis and treatment for patient care. EMRs allow clinicians to better track patients' data over time, identify and remind patients of preventive checkups and disease screenings, monitor patients and improve healthcare quality. Electronic health records (EHRs) are much broader than EMRs and contain all relevant health data of patients in addition to EMRs, which may include the data from laboratories, specialists, nursing homes and other healthcare providers. EHR systems are also designed to share the patient's data with all authorized clinicians, caregivers, stakeholders, and even the patients themselves. Thus, a fully functional EHR system enables all authorized healthcare providers to access the latest information on patients anywhere and at any time, so that more coordinated and patient-centered care can be provided to patients in a timely manner. At the same time, EHRs also serve as documentation for administration and billing purposes. Recently, EHR data have become one of the major sources of real-world evidence to evaluate treatments, improve diagnosis and healthcare quality, reduce side effects and adverse events of drugs, predict disease risks and treatment outcomes, and optimize and personalize patient care (MIT 2016).
Since EHR data are very complex and noisy, analysis and interpretation require sophisticated statistical methods and data science techniques as well as multidisciplinary collaborations between data scientists and domain experts. In addition, a novel data-driven research paradigm and state-of-the-art approaches from a systematic perspective are necessary in order to harness a big EHR database and translate it into clinical knowledge for best practice. Based on our experience and from a systematic perspective, we summarize the procedure and the life cycle to use the EHR database for research and scientific discoveries in the following steps:
- Initiate a project: proposing a research topic with some potential high-impact biomedical/clinical questions or hypotheses
- Data queries and data extraction
- Data cleaning
- Data pre-processing or processing
- Data preparation
- Data analysis, modeling and prediction
- Result validation
- Result interpretation
- Publication and dissemination
This procedure is quite similar to the data mining procedure for knowledge discovery in databases (KDD) (McLachlan 2017, Fayyad, Piatetsky-Shapiro, and Smyth 1996, Fernández-Arteaga et al. 2016, Holzinger, Dehmer, and Jurisica 2014, Mitra, Pal, and Mitra 2002). We will provide the details and explanation for each of these steps in the following sections.
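As a rough illustration of how the data-handling steps of this life cycle chain together, the sketch below runs toy records through three stages (extraction, cleaning, pre-processing) and logs how many records survive each one. The field names (`patient_id`, `diagnosis`, `age`), the stage functions, and the thresholds are hypothetical and chosen only for illustration; they are not the schema of any real EHR database or the procedure used in our projects.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]


def extract(records: List[Record]) -> List[Record]:
    # Data query and extraction: keep records for the condition of interest.
    return [r for r in records if r.get("diagnosis") == "SAH"]


def clean(records: List[Record]) -> List[Record]:
    # Data cleaning: drop records with a missing or implausible age.
    return [
        r for r in records
        if isinstance(r.get("age"), int) and 0 <= r["age"] <= 120
    ]


def preprocess(records: List[Record]) -> List[Record]:
    # Pre-processing: derive an analysis variable from a raw field.
    return [{**r, "elderly": r["age"] >= 65} for r in records]


PIPELINE: List[Tuple[str, Stage]] = [
    ("extraction", extract),
    ("cleaning", clean),
    ("pre-processing", preprocess),
]


def run_pipeline(records: List[Record], pipeline=PIPELINE):
    """Apply each stage in order and log the record count after each one."""
    log = []
    for name, stage in pipeline:
        records = stage(records)
        log.append((name, len(records)))
    return records, log


# Toy input, invented for this example.
raw = [
    {"patient_id": 1, "diagnosis": "SAH", "age": 70},
    {"patient_id": 2, "diagnosis": "SAH", "age": None},
    {"patient_id": 3, "diagnosis": "diabetes", "age": 50},
    {"patient_id": 4, "diagnosis": "SAH", "age": 40},
]
analysis_ready, stage_log = run_pipeline(raw)
for name, count in stage_log:
    print(f"after {name}: {count} records")
```

Keeping a per-stage count is a simple way to document how the analysis data set was derived from the raw extract, which helps when results later need to be validated or reproduced.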
1.2.1 Initiate a Project
To initiate a project, one should start by proposing a research direction or topic, usually with a focus on a particular disease, treatment, medication, or other conditions of interest. Ideally, domain-specific clinicians, epidemiologists, or biomedical scientists in the multidisciplinary team may initiate a project with some potential biomedical or clinical hypotheses or scientific questions, although it may not need to be specific. Since the EHR database usually contains data from a large number of patients and covers many different diseases, treatments and conditions, it is easy to raise many clinical, biomedical, or epidemiological questions. However, it may not be easy to identify a good question.
What is a good question? Based on our experience, a good research question based on the EHR database should satisfy the following criteria:
- Clinically or scientifically important and high-impact: If we could answer the question or prove/disprove the hypothesis, the results and conclusions would be clinically important with a high impact, so that we could publish the results in a high-impact journal.
- Appropriate to use the available EHR data to address: The EHR data are appropriate or even the best data to address the proposed question or hypothesis. Sometimes the available EHR data may not be good or the best for the question or hypothesis. It is ideal if one can justify that using EHR data is the only way to address the proposed question or hypothesis and there is no other alternative.
- Appropriate and reliable endpoint or outcome data are available or can be derived from the EHR database for the proposed question or hypothesis. For any clinical or scientific question and hypothesis, appropriate endpoints or outcomes must be defined and identified, and sometimes good biomarkers can be used. It is necessary to confirm that these endpoint or outcome data are available and reliable in the EHR database. For example, to use mortality as the outcome or endpoint to evaluate a disease treatment, the researcher needs to carefully evaluate whether the EHR system reliably captures most of the deaths related to the treatment. For chronic disease treatments, this may not be the case, since the follow-up time in the EHR system is usually not long enough to capture deaths due to the chronic diseases.
- The sample size is big enough: The sample size (the number of subjects, events and/or measurements) is usually quite large in the EHR database. However, for a particular question or hypothesis, we must screen the subjects based on the inclusion/exclusion criteria. For questions or hypotheses related to rare diseases or rare events, the sample size may still be an issue. Thus, it is also crucial to carefully define the study cohort based on the proposed question or hypothesis and to develop appropriate inclusion/exclusion criteria that leave a sufficiently large sample.
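The sample-size and follow-up checks described in these criteria can be sketched as a small cohort-screening routine. The example below is a minimal sketch under stated assumptions: the `Patient` fields, the three criteria, and the minimum cohort size are all hypothetical, and a real study would derive them from the clinical question and a formal power calculation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Patient:
    patient_id: int
    age: int
    has_condition: bool
    followup_days: int


Criterion = Tuple[str, Callable[[Patient], bool]]


def screen_cohort(patients: List[Patient], criteria: List[Criterion], min_size: int):
    """Apply criteria in order; return the cohort, attrition counts, and size check."""
    attrition = [("all patients", len(patients))]
    cohort = list(patients)
    for label, keep in criteria:
        cohort = [p for p in cohort if keep(p)]
        attrition.append((label, len(cohort)))
    return cohort, attrition, len(cohort) >= min_size


# Toy data and criteria, invented for this example.
patients = [
    Patient(1, 70, True, 400),
    Patient(2, 16, True, 500),
    Patient(3, 55, False, 900),
    Patient(4, 62, True, 30),
    Patient(5, 48, True, 365),
]
criteria = [
    ("adult (age >= 18)", lambda p: p.age >= 18),
    ("has condition of interest", lambda p: p.has_condition),
    ("follow-up >= 365 days", lambda p: p.followup_days >= 365),
]
cohort, attrition, large_enough = screen_cohort(patients, criteria, min_size=3)
for label, count in attrition:
    print(f"{label}: {count}")
print("cohort large enough:", large_enough)
```

Reporting the count after each criterion shows which criterion removes the most subjects. That is useful when a rare disease or a short follow-up window leaves too few patients to answer the question.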
The types of observational studies include 1) Case study or case report, a descriptive report on one or a series of special or unique clinical cases, which is a good source to generate hypotheses, instead of exploring any association or cause-effect relationships; 2) Cross-sectional study, the data of exposure or intervention/prevention treatments and outcomes are coll...