1.1. Origin of medical data
In medical research, cohort, case-control, and cross-sectional studies are three special types of observational studies [1]. A clinical cohort study comprises data from a group of people who share common disease occurrences and medical conditions (e.g., a common type of chronic disease) and is useful for measuring the occurrence and progress of a disease [1]. A cohort study design can be either prospective or retrospective. In a prospective study, the cohort data are expected to be updated during the study, whereas in a retrospective study, the patient data are predefined. Prospective studies therefore require individual follow-up time points to keep track of the incoming data. Cross-sectional studies, on the other hand, measure disease occurrence at one particular time point and thus cannot capture the relationship between the occurrence and the progress of a disease. To understand the meaning of a cohort, it is necessary to understand the fundamental types and sources of medical data.
Laboratory results are a widely known source of medical data. Laboratory tests include a large number of biochemical tests [2], such as (i) hematological tests that measure the oxygen levels in the blood flow, (ii) urine tests that are usually used to detect kidney disease, liver disease, and diabetes, (iii) serological tests, i.e., blood tests that look for antibodies (e.g., to detect rubella or fungal infections), (iv) coagulation tests that are used to detect thrombophilia and hemophilia, and (v) histological tests that are employed to examine different types of tissues (e.g., muscle, nervous, epithelial). Laboratory results combined with information on medical conditions and medications offer a powerful basis for (i) understanding the progress of a disease, (ii) dividing sensitive populations into subgroups (i.e., patient stratification), and (iii) evaluating existing treatments and/or proposing new ones in large-scale population studies. Other parameters commonly found in clinical datasets include demographic information (e.g., age, gender, socioeconomic factors), vital parameters (e.g., heart rate, blood pressure), medications (e.g., antibiotics, antiseptics), medical conditions (e.g., Alzheimer's, Parkinson's), physical and mental conditions, nutrition habits, and environmental and lifestyle factors [3], among others.
Other sources of medical data include medical images that are obtained by a variety of diagnostic imaging modalities or systems, such as computed tomography, magnetic resonance, optical topography, ultrasound, positron emission tomography, and single-photon emission computed tomography. Advances in surface-rendering and volume-rendering methods have led to three-dimensional medical image visualization, which has significantly improved the quality of image interpretation. Moreover, the rapidly increasing spatial resolution of such systems, combined with technical advances in medical image processing (e.g., reconstruction, fusion), can significantly enhance the diagnostic accuracy and the consistency of image interpretation by doctors in a variety of diseases ranging from heart failure, osteoporosis, and diabetes to Alzheimer's disease and cancer [4]. Computer-aided diagnosis is thus one of the major computer-assisted technologies for medical diagnostics.
Biosignals are another domain of medical data and include a variety of biomedical signals, such as (i) electroencephalography (EEG) and (ii) electrocorticography, which capture the electrical fields produced by the activity of the brain cells, (iii) magnetoencephalography, which captures the magnetic fields produced by the electrical activity of the brain cells, (iv) electrocardiography, which records the electrical activity that arises from the depolarization and repolarization of the heart, (v) electromyography, which records the electric potential generated by the muscle cells, and (vi) electrooculography, which records the electric potential generated by the cornea and the retinal activity. Biosignals provide information with high temporal resolution about a disease's onset and progress and have been employed in a variety of diseases ranging from epilepsy and schizophrenia to heart failure and muscle atrophy [5]. Biomedical signals are usually combined with medical imaging systems (e.g., EEG and MRI) to provide both high spatial and high temporal resolution for more effective diagnosis and treatment. Advances in biomedical signal processing have made signal manipulation much easier.
The field of genetics constitutes a vast domain of medical data. Genetic data can be generated from high-throughput (next-generation) DNA and RNA sequencing. The enormous number of these sequences has given rise to the well-known field of genomics. Genetic data are generally more complex in form than the aforementioned types of medical data because they require multiple processing pipelines, each with its own input. This complexity arises from the different formats of genetic data, such as the FASTQ files used for RNA sequence analysis and the haplotypes used for haplotype analysis. In the last decade, genetic data generated from genome-wide association studies have led to thousands of robust associations between common single-nucleotide polymorphisms (SNPs) and common diseases ranging from autoimmune diseases to psychiatric disorders, quantitative traits, and genomic traits [6,7].
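To make the format-related complexity more concrete, the following minimal sketch reads records from FASTQ text. It assumes the common four-line record layout (identifier line starting with "@", sequence, separator line starting with "+", quality string) and Phred+33 quality encoding. The names FastqRecord and parse_fastq, as well as the two example reads, are hypothetical and serve only as an illustration; they are not taken from any pipeline discussed in this chapter.

```python
from io import StringIO
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One sequencing read (hypothetical container used for illustration)."""
    read_id: str
    sequence: str
    qualities: List[int]  # one Phred quality score per base


def parse_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumes Phred+33 by default)."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\n")
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality character encodes a score as ASCII code minus the offset.
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Two made-up reads, used only to exercise the parser.
EXAMPLE_FASTQ = "@read_1\nGATTACA\n+\nIIIIII#\n@read_2\nACGT\n+\n!!II\n"
records = list(parse_fastq(StringIO(EXAMPLE_FASTQ)))
for record in records:
    print(record.read_id, record.sequence, min(record.qualities))
```

Real sequencing files are usually compressed and far larger, so a production pipeline would typically rely on an established bioinformatics library rather than a hand-written parser like this one.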
The recent advances in omics technologies [8,9], such as genomics (the study of genomic information), transcriptomics (the study of all the RNA transcripts of an organism), proteomics (the study of proteins and their interactions), lipidomics (the study of lipids, i.e., biomolecules with structural diversity and complexity), and metabolomics (the study of the multitude of metabolites), have increased the demand for properly annotated and well-preserved biospecimens, which has led to the development of biobanks [10]. Biobanking involves the (i) collection, (ii) processing, (iii) storage, and (iv) quality control of biological samples along with their associated clinical information. Biobanks have been widely used for meeting scientific goals in genetics and molecular biology due to their long-term sustainability [10].
1.2. Toward medical data sharing and harmonization
Cohort, case-control, and cross-sectional studies are capable of resolving crucial scientific questions related to predictive modeling of a disease's onset and progress, the clinical significance of genetic variants, the adequate identification of high-risk individuals, and the effective selection of patients for clinical trials, among others [1]. However, because these cohorts are dispersed, important clinical information remains inaccessible, which leads to small-scale studies with reduced statistical significance and thus poor clinical importance. In addition, the rapidly increasing gap between healthcare costs and outcomes hinders the evolution of a sustainable healthcare system that is able to adapt to the technological advances of our era [1]. Traditional clinical epidemiology poses several obstacles to the development of clinical decision support systems, public health policies, and medical research in general, as it is conducted by individual researchers who do not share any common research interests, a fact that often leads to poor handling of the available clinical data and hampers health research and innovation.
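The effect of cohort size on what a study can detect can be illustrated with a rough calculation. The sketch below uses a normal approximation to the power of a two-sided, two-sample comparison of means with equal group sizes. The function name approx_power, the standardized effect size of 0.2, and the group sizes (50 per group for a single cohort, 500 per group after pooling ten such cohorts) are illustrative assumptions, not values taken from any study cited here.

```python
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample z-test.

    effect_size is the standardized mean difference (difference in means
    divided by the common standard deviation). This is a normal
    approximation, not an exact t-test power calculation.
    """
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative hypothesis.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return (standard_normal.cdf(shift - z_critical)
            + standard_normal.cdf(-shift - z_critical))


# Hypothetical scenario: one small cohort versus ten pooled cohorts.
power_single = approx_power(effect_size=0.2, n_per_group=50)
power_pooled = approx_power(effect_size=0.2, n_per_group=500)
print(f"single cohort: {power_single:.2f}, pooled cohorts: {power_pooled:.2f}")
```

Under these assumptions the single cohort would detect the effect only in a minority of replications, whereas the pooled data would detect it in most of them. The calculation ignores between-cohort heterogeneity, which in practice reduces the gain from pooling.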
On the other hand, the current technological advances in biomedical and health research significantly increase the generation of digital data, leading to vast amounts of data in a variety of disciplines ranging from finance to medicine [11]. This kind of data is widely known as big data. In the healthcare sector, big data have many sources, ranging from large-scale clinical trials, clinical registries, and electronic health records to medical imaging and genetic data. Medical big data are a powerful tool toward the establishment of an expandable and "smart" healthcare system that is able to improve the existing healthcare quality by using machine learning to extract actionable knowledge. However, several technical challenges lie behind the concept of big data in healthcare: the heterogeneity of the protocols among clinical centers, the lack of tools to interpret and visualize such a large amount of data, and the dissimilarity and incompleteness of the dataset structures, to name but a few, are issues that need to be addressed by a modern healthcare system.
All these clinical needs call for a medical data sharing and data governance framework that is capable of effectively addressing the needs for (i) sharing medical data across international heterogeneous cohorts, (ii) assessing the quality of the data, and (iii) overcoming the heterogeneity among the cohorts, toward the establishment of a secure federated cloud system [12,13]. Such a framework will be able not only to interlink heterogeneous medical cohorts but also to enable more accurate studies of rare diseases, i.e., studies with high statistical power. Federated analysis must, of course, comply with all legal, ethical, and patient privacy requirements under the technical constraints and challenges of our era. These challenges involve the requirements imposed by the General Data Protection Regulation (GDPR) in Europe [14-16] and its effect on the existing medical infrastructures, its relationship with data protection regulations on other continents, such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States [17], and finally novel methods for medical big data manipulation, visualization, and analytics for general purposes.
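As a rough illustration of needs (ii) and (iii), the following minimal sketch maps records from two cohorts with different field names and units onto a shared schema and then reports the completeness of each harmonized variable. Everything in it is a hypothetical assumption made for the example: the cohort names, the field names, the target schema, and the sample records. The only external fact used is the glucose unit conversion (1 mmol/L corresponds to about 18.016 mg/dL). It does not represent the framework of [12,13].

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

GLUCOSE_MGDL_PER_MMOLL = 18.016  # from the molar mass of glucose (180.16 g/mol)
TARGET_FIELDS = ("age", "sex", "glucose_mg_dl")  # hypothetical shared schema

# Per cohort: source field -> (target field, value converter). All hypothetical.
FieldMap = Dict[str, Tuple[str, Callable[[Any], Any]]]
MAPPINGS: Dict[str, FieldMap] = {
    "cohort_a": {
        "age_years": ("age", float),
        "sex": ("sex", lambda v: {"F": "female", "M": "male"}[v]),
        "glucose_mg_dl": ("glucose_mg_dl", float),
    },
    "cohort_b": {
        "Age": ("age", float),
        "gender": ("sex", lambda v: str(v).lower()),
        "glucose_mmol_l": ("glucose_mg_dl",
                           lambda v: float(v) * GLUCOSE_MGDL_PER_MMOLL),
    },
}


def harmonize(record: Dict[str, Any], cohort: str) -> Dict[str, Optional[Any]]:
    """Map one cohort-specific record onto the shared schema."""
    harmonized: Dict[str, Optional[Any]] = {name: None for name in TARGET_FIELDS}
    harmonized["cohort"] = cohort
    for source, (target, convert) in MAPPINGS[cohort].items():
        value = record.get(source)
        if value is not None:  # missing values stay None in the output
            harmonized[target] = convert(value)
    return harmonized


def completeness(records: List[Dict[str, Optional[Any]]]) -> Dict[str, float]:
    """Fraction of non-missing values per harmonized variable."""
    return {name: sum(r[name] is not None for r in records) / len(records)
            for name in TARGET_FIELDS}


# Made-up records, used only to exercise the functions above.
cohort_a = [{"age_years": 54, "sex": "F", "glucose_mg_dl": 99},
            {"age_years": 61, "sex": "M", "glucose_mg_dl": None}]
cohort_b = [{"Age": 47, "gender": "Female", "glucose_mmol_l": 5.0}]

pooled = ([harmonize(r, "cohort_a") for r in cohort_a]
          + [harmonize(r, "cohort_b") for r in cohort_b])
print(completeness(pooled))
```

Keeping the mapping as data rather than as code per cohort is one possible design choice: adding a cohort then means adding a dictionary entry. In a federated setting, such a transformation would run locally at each clinical center so that only harmonized or aggregated results leave the site.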
Data sharing is a complex procedure that faces several obstacles related to the heterogeneity of ethical and legal issues across different countries. Data sharing reduces the duplication of studies and provides cheaper and more transparent infrastructures for data curation and storage, and thus more efficient infrastructures for conducting clinical research. However, the fear of data abuse and of losing control over the data are the main factors that hinder the sharing of medical data. Moreover, data sharing is effective only when the scope of sharing is well defined, a fact that is not always taken into consideration in research studies. According to the former European Data Protection Supervisor, Mr. Peter J. Hustinx, privacy and data protection, i.e., "the right to respect for private life and the right to protection of one's personal data", are both expressions of "a universal idea with quite strong ethical dimensions: the dignity, autonomy and unique value of every human being" [14]. To this end, strict data protection regulations establish legal safeguards against data breaches, embezzlement, and misuse.
In the United States, the HIPAA of 1996, Public Law 104-191, which is part of the Social Security Act, aims to protect the healthcare coverage of individuals who lose or change their jobs and supports the sharing of certain patient administrative data to promote the healthcare industry [17]. To this end, federal data protection and legal obligation rules have been developed concerning the security and privacy of electronic health transactions to ensure the confidentiality, integrity, and security of health information. More specifically, the US Department of Health and Human Services (HHS), considering the fact that the technological advances could potentially erode the privacy of health i...