Electronic Health Record (EHR) and Electronic Medical Record (EMR) data are increasingly used for research. However, analyzing this type of data involves many unique complications that stem from how the data are collected and processed and from the types of questions they can answer. This book covers many important topics related to using EHR/EMR data for research, including data extraction, cleaning, processing, analysis, inference, and prediction, based on the authors’ many years of practical experience. The book carefully evaluates and compares standard statistical models and approaches with machine learning and deep learning methods, and reports unbiased comparison results for these methods in predicting clinical outcomes from EHR data.

Key Features:

Written from the hands-on experience of contributors to multidisciplinary EHR research projects, which draw on methods and approaches from statistics, computing, informatics, data science and clinical/epidemiological domains.

Documents detailed experience with EHR data extraction, cleaning and preparation.

Provides a broad view of statistical approaches and machine learning prediction models to deal with the challenges and limitations of EHR data.

Considers the complete cycle of EHR data analysis.

EHR/EMR data analysis requires close collaboration among statisticians, informaticians, data scientists and clinical/epidemiological investigators. This book reflects that multidisciplinary perspective.


1 Introduction: Use of EHR Data for Scientific Discoveries—Challenges and Opportunities

Hulin Wu

1.1 Real-World Data and Real-World Evidence: Big Data in Practice

In this new era of advanced digital technologies, a huge amount of data is rapidly generated from all walks of life. In particular, the data generated from biomedical research and patient care practice are extremely valuable for improving public health and quality of life. Big Data is often a product of real-world practices and daily business operations and is not specifically collected for research purposes. To use (or reuse) these real-world data to extract meaningful insights for scientific discoveries, many technical, analytical, and interpretation challenges need to be addressed. At the same time, great opportunities exist.

Real-world data, such as data from electronic medical record (EMR) or electronic health record (EHR) systems, could provide real-world evidence for important clinical and scientific questions. These data can complement data from well-designed experiments and randomized controlled trials (RCTs), which are considered the gold standard in clinical research and evidence-based medicine (Van Poucke et al. 2016, Frieden 2017). The environment and conditions of designed experiments and RCTs usually differ from the real world, and the inclusion/exclusion criteria of many RCTs are very restrictive, because these studies try to control noise levels and balance potential confounding factors between study groups. Because confounding factors, including unknown or unmeasured ones, are randomly balanced between treatment and control groups, the conclusions from well-designed experiments and RCTs are statistically and scientifically valid.

However, well-designed experiments and RCTs also have limitations. First, their generalizability may be limited for the following reasons: 1) the inclusion/exclusion criteria may be too restrictive; 2) the environment of designed experiments and RCTs may be different from the real-world situation; 3) subjects who consent to participate in a clinical trial may differ significantly in their characteristics from the general population; and 4) the sample size is limited within a specific population owing to the convenience of subject recruitment, study implementation, and ethical considerations. In addition, conclusions from RCTs are usually based on the population mean effect, and the individual effects of treatments or interventions are ignored. Moreover, RCTs are often expensive in labor and time, and are sometimes infeasible because of ethical issues or other reasons. That is why RCTs currently support only 10–20% of clinical decisions (Mills, Thorlund, and Ioannidis 2013, Tricoci et al. 2009, McGinnis et al. 2013). It is therefore necessary to use data from observational studies and real-world data to provide complementary evidence for clinical decisions. Some literature suggests that results from RCTs and nonrandomized observational studies agree strongly (Anglemyer, Horvath, and Bero 2014, Ioannidis et al. 2001). In addition, we expect that real-world data could be used to better design and inform RCTs.
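The distinction between the population mean effect and the individual effect can be made concrete with potential-outcomes notation. The symbols below are an assumption introduced for illustration and do not come from the text: Y_i(1) and Y_i(0) denote the outcomes subject i would have under treatment and under control.

```latex
% Notation assumed for illustration (not from the original text):
% Y_i(1), Y_i(0): potential outcomes of subject i under treatment and control.
\begin{align*}
\tau_i &= Y_i(1) - Y_i(0)
  && \text{(individual effect, never observed directly)} \\
\mathrm{ATE} &= E[\tau_i] = E[Y_i(1)] - E[Y_i(0)]
  && \text{(population mean effect)} \\
\widehat{\mathrm{ATE}} &= \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}
  && \text{(estimate under randomization)}
\end{align*}
```

Each subject receives only one of the two arms, so only one of Y_i(1) and Y_i(0) is ever observed. Randomization justifies the difference in group means as an estimate of the mean effect, but it says nothing by itself about how the individual effects vary around that mean.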

The use of large hospital discharge records, disease registry databases, Intensive Care Unit (ICU) databases and insurance claim databases has recently become popular for clinical and epidemiological research. However, complete and large EMR or EHR databases from real-world clinical practice, including both in-patient and out-patient data from many hospitals and clinics, are not easily accessible to researchers. This type of EHR data could provide a more complete profile of patients’ health records. Harnessing big EHR databases allows researchers to investigate rare diseases, low-frequency conditions and long-term adverse events, and to address many important questions from personalized treatment and precision medicine perspectives.

Real-world data often have the features of “Big Data” (Laney 2001), including big volume, big variety and big velocity (the “3V” features of Big Data). At the same time, real-world data are often noisy, with large variability, and we also need to evaluate their reliability (veracity) in order to ensure the reproducibility of analysis results. We expect that real-world data can produce big value if appropriate analyses are performed and bias and confounding factors are carefully handled.

In this book, we focus on EHR data, which are real-world data coming from patient care systems, including in-patient hospitals and out-patient clinics. EMR and EHR systems were originally designed for clinical practice, documentation, and billing purposes. Thus, EMR and EHR data were not collected for research or analysis purposes. To reuse EMR or EHR data to address clinical or scientific questions, we need to overcome many barriers and deal with many problems in the real-world data. In this book, we share our experience in addressing these issues for scientific discoveries, including data extraction, cleaning, pre-processing, preparation, analysis, and modeling.

This book is based mostly on the authors’ experience with the Cerner Health Facts database (Cerner 2020), a de-identified, longitudinal electronic health record (EHR) database designed to facilitate research. Our current version of the Cerner EHR database covers all of the healthcare records for 85 systems with 750 hospitals and healthcare facilities in the United States from 2000 to 2018 (19 years of records). The main patient-level data in Cerner include longitudinal encounters with detailed records of diagnoses, medications, clinical events, procedures and lab procedures. The database represents a total of 69 million unique patients across the United States. Of the 69 million patients, 52% are female and 42% are male (6% are gender-unidentified). The racial makeup of the 69 million patients is 52.8% Caucasian, 11.8% African American, 3.0% Hispanic, 1.8% Asian, 0.7% Native American, 6.1% other races (including Asian/Pacific Islander, Biracial, Mid Eastern Indian, Pacific Islander), and 23.7% racial status unknown. In total, the database includes 487 million unique encounters with 939 million diagnoses, coded in International Classification of Diseases (ICD-9 and ICD-10) codes. The database has 674 million medication records, 118 million procedure records, 5.3 billion clinical event records and 4.2 billion lab procedure records.

By 2016, in collaboration with biomedical informaticians, we had begun to work on an earlier version of the Cerner EHR database. We started an EHR project on predicting heart failure using the Recurrent Neural Network (RNN) machine learning model (Rasmy et al. 2018). Another collaborative project was to establish a disease-disease network based on the disease comorbidities of all patients in the Cerner EHR database. In Fall 2017, we formally established an EHR Working Group with the support of the newly established Center for Big Data in Health Sciences in the Department of Biostatistics and Data Science, School of Public Health, University of Texas Health Science Center at Houston (UTHealth). This working group consists of statisticians, biomedical informaticians, data scientists, data managers, computer programmers, epidemiologists, clinicians and other investigators from the School of Public Health, School of Biomedical Informatics, McGovern Medical School and Cizik School of Nursing in UTHealth. This multidisciplinary team, including faculty, research associates, students and research staff, started to work together to harness the big EHR database for scientific discoveries. The EHR Working Group initiated its first project, on the evaluation of vasopressor treatments for subarachnoid hemorrhage (SAH) patients, which was proposed in November 2017 by our clinical collaborator, Dr. George Williams (MD and FCCP, Associate Professor, Director of Critical Care, Department of Anesthesiology, McGovern Medical School, UTHealth). In Summer 2018, four PhD students were recruited to start the Cerner EHR data cleaning for the SAH project. This SAH project has resulted in publications in both biomedical journals and statistical journals (Williams et al. 2020, Yu et al. 2020, Brown et al. 2020). In April 2019, we acquired a new version of the Cerner EHR database with data records from 2000 to 2018, and several new EHR projects, including projects on HIV/AIDS, hospital-acquired infections and antibiotic drug resistance, and diabetes, were initiated based on this new version. This book is mainly based on our accumulated experience from these completed and ongoing projects over the past years.

1.2 Use of EMR/EHR Database for Research and Scientific Discoveries: Procedure and Life Cycle

Electronic medical records (EMRs) are a digital version of the paper charts of patients in a hospital, clinic or clinician’s office. An EMR system usually contains the medical and treatment history of patients to support the clinician’s decisions on diagnosis and treatment. EMRs allow clinicians to better track patients’ data over time, identify patients due for preventive checkups and disease screenings and remind them, monitor patients, and improve healthcare quality. Electronic health records (EHRs) are much broader than EMRs: in addition to EMR content, they contain all relevant health data of patients, which may include data from laboratories, specialists, nursing homes and other healthcare providers. EHR systems are also designed to share the patient’s data with all authorized clinicians, caregivers, stakeholders, and even the patients themselves. Thus, a fully functional EHR system enables all authorized healthcare providers to access the latest patient information anywhere and at any time, so that more coordinated and patient-centered care can be provided in a timely manner. At the same time, EHRs also serve as documentation for administration and billing purposes. Recently, EHR data have become one of the major sources of real-world evidence to evaluate treatments, improve diagnosis and healthcare quality, reduce side effects and adverse events of drugs, predict disease risks and treatment outcomes, and optimize and personalize patient care (MIT 2016).

Since EHR data are very complex and noisy, analysis and interpretation require sophisticated statistical methods and data science techniques as well as multidisciplinary collaborations between data scientists and domain experts. In addition, a novel data-driven research paradigm and state-of-the-art approaches from a systematic perspective are necessary in order to harness a big EHR database and translate it into clinical knowledge for best practice. Based on our experience and from a systematic perspective, we summarize the procedure and the life cycle to use the EHR database for research and scientific discoveries in the following steps:

1) Initiate a project: propose a research topic with some potential high-impact biomedical/clinical questions or hypotheses
2) Data queries and data extraction
3) Data cleaning
4) Data pre-processing or processing
5) Data preparation
6) Data analysis, modeling and prediction
7) Result validation
8) Result interpretation
9) Publication and dissemination

This procedure is quite similar to the data mining procedure for knowledge discoveries in databases (KDD) (McLachlan 2017, Fayyad, Piatetsky-Shapiro, and Smyth 1996, Fernández-Arteaga et al. 2016, Holzinger, Dehmer, and Jurisica 2014, Mitra, Pal, and Mitra 2002). We will provide the details and explanation for each of these steps in the following sections.
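As a rough illustration, the nine steps can be written down as an ordered enumeration so that a project tracker knows which step comes next. This is only a sketch: the names `Step`, `next_step` and `remaining_steps` are hypothetical, and the strictly linear order is a simplifying assumption, since in practice earlier steps are often revisited.

```python
from enum import IntEnum
from typing import List, Optional


class Step(IntEnum):
    """The nine life-cycle steps listed above, in their stated order."""
    INITIATE_PROJECT = 1
    DATA_QUERY_AND_EXTRACTION = 2
    DATA_CLEANING = 3
    DATA_PROCESSING = 4
    DATA_PREPARATION = 5
    ANALYSIS_MODELING_PREDICTION = 6
    RESULT_VALIDATION = 7
    RESULT_INTERPRETATION = 8
    PUBLICATION_AND_DISSEMINATION = 9


def next_step(current: Step) -> Optional[Step]:
    """Return the step after `current`, or None once the last step is reached."""
    if current is Step.PUBLICATION_AND_DISSEMINATION:
        return None
    return Step(current + 1)


def remaining_steps(current: Step) -> List[Step]:
    """Return all steps that still lie ahead of `current`, in order."""
    return [step for step in Step if step > current]


if __name__ == "__main__":
    # Hypothetical usage: a project that has just finished data cleaning.
    print(next_step(Step.DATA_CLEANING).name)
    print([step.name for step in remaining_steps(Step.RESULT_VALIDATION)])
```

An integer-backed enumeration keeps the ordering explicit, which makes it easy to record how far each project has progressed.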

1.2.1 Initiate a Project

To initiate a project, one should start by proposing a research direction or topic, usually with a focus on a particular disease, treatment, medication, or other condition of interest. Ideally, domain-specific clinicians, epidemiologists, or biomedical scientists in the multidisciplinary team initiate a project with some potential biomedical or clinical hypotheses or scientific questions, although these need not be specific at the outset. Since the EHR database usually contains data from a large number of patients and covers many different diseases, treatments and conditions, it is easy to raise many clinical, biomedical, or epidemiological questions. However, it may not be easy to identify a good question.

What is a good question? Based on our experience, a good research question based on the EHR database should satisfy the following criteria:

Clinically or scientifically important and high-impact: If we can answer the question or prove/disprove the hypothesis, the results and conclusions are clinically important and have a high impact, so that the results can be published in a high-impact journal.
Appropriate to use the available EHR data to address: The EHR data are appropriate, or even the best data, to address the proposed question or hypothesis. Sometimes the available EHR data may not be good enough, or may not be the best choice, for the question or hypothesis. It is ideal if one can justify that using EHR data is the only way to address the proposed question or hypothesis and that there is no alternative.
Appropriate and reliable endpoint or outcome data are available: The endpoint or outcome data for the proposed question or hypothesis are available in, or can be derived from, the EHR database. For any clinical or scientific question or hypothesis, appropriate endpoints or outcomes must be defined and identified, and sometimes good biomarkers can be used. It is necessary to confirm that these endpoint or outcome data are available and reliable in the EHR database. For example, to use mortality as the outcome or endpoint to evaluate a disease treatment, the researcher needs to carefully evaluate whether the EHR system reliably captures most of the deaths attributable to the treatment. For chronic disease treatments, this may not hold, since the follow-up time in the EHR system is usually not long enough to capture deaths due to the chronic disease.
The sample size is big enough: The sample size (the number of subjects, events and/or measurements) is usually quite large in the EHR database. However, for a particular question or hypothesis, we must screen the subjects based on the inclusion/exclusion criteria. For questions or hypotheses related to rare diseases or rare events, the sample size may still be an issue. Thus, it is also crucial to carefully define the study cohort based on the proposed question or hypothesis and to develop appropriate inclusion/exclusion criteria in order to ensure that the sample size is large enough.
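The last two criteria can be checked early with a simple screening pass over candidate patients. The sketch below is a minimal illustration under stated assumptions: the `Patient` fields, the placeholder codes `CODE_A` and `CODE_B`, the age threshold and the minimum sample size are all hypothetical and are not taken from the Cerner database or from any project described here.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Patient:
    # Hypothetical, simplified patient record for illustration only.
    patient_id: str
    age: int
    diagnosis_codes: FrozenSet[str]
    has_outcome_recorded: bool


def screen_cohort(
    patients: Iterable[Patient],
    target_codes: FrozenSet[str],
    min_age: int,
    min_sample_size: int,
) -> Tuple[List[Patient], Dict[str, int], bool]:
    """Apply inclusion/exclusion criteria and report whether enough subjects remain."""
    included: List[Patient] = []
    excluded = {"no_target_diagnosis": 0, "below_min_age": 0, "missing_outcome": 0}
    for patient in patients:
        if not (patient.diagnosis_codes & target_codes):
            excluded["no_target_diagnosis"] += 1  # inclusion: target diagnosis
        elif patient.age < min_age:
            excluded["below_min_age"] += 1        # exclusion: too young
        elif not patient.has_outcome_recorded:
            excluded["missing_outcome"] += 1      # outcome must be available
        else:
            included.append(patient)
    return included, excluded, len(included) >= min_sample_size


# Hypothetical example data; the codes are placeholders, not real ICD codes.
example_patients = [
    Patient("p1", 45, frozenset({"CODE_A"}), True),
    Patient("p2", 16, frozenset({"CODE_A"}), True),
    Patient("p3", 60, frozenset({"CODE_B"}), True),
    Patient("p4", 70, frozenset({"CODE_A", "CODE_B"}), False),
]
cohort, exclusion_counts, large_enough = screen_cohort(
    example_patients, frozenset({"CODE_A"}), min_age=18, min_sample_size=2
)
```

Counting exclusions by reason shows which criterion removes the most subjects, which helps when deciding whether a criterion can be relaxed for a rare disease or rare event.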

The types of observational studies include 1) Case study or case report, a descriptive report on one or a series of special or unique clinical cases, which is a good source to generate hypotheses, instead of exploring any association or cause-effect relationships; 2) Cross-sectional study, the data of exposure or intervention/prevention treatments and outcomes are coll...

Cover
Title
Copyright
Contents
Preface
About the Editors
Contributor
1 Introduction: Use of EHR Data for Scientific Discoveries—Challenges and Opportunities
2 EHR Project Management
3 EHR Databases and Data Management: Data Query and Extraction
4 EHR Data Cleaning
5 EHR Data Pre-Processing and Preparation
6 Missing Data Issues in EHR
7 Causal Inference and Analysis for EHR Data
8 EHR Data Exploration, Analysis and Predictions: Statistical Models and Methods
9 Neural Network and Deep Learning Methods for EHR Data
10 EHR Data Analytics and Predictions: Machine Learning Methods
11 Use of EHR Data for Research: Future
Index

Statistics and Machine Learning Methods for EHR Data, by Hulin Wu, Jose Miguel Yamal, Ashraf Yaseen, and Vahed Maroufy.
