1.2 Overview of omic data
Omic data refers to data collected massively in a specific omic domain. The notion of an unbiased scan of numerous biological entities of similar nature comes to mind. The definition is clearly operational, given the differences in understanding biological similarity and, perhaps more challenging, the varying characteristics of different biological levels. Genomic data, for instance, is ultimately concerned with the full characterization of the DNA sequence of an individual. As such, it is highly stable across tissues and the individual’s lifespan. Some variations may arise in terms of somatic mutations that give rise to mosaicisms or to specific mutations, as found in tumor cells. By contrast, transcriptomic and epigenomic data are highly variable across tissues, and each changes on its own time scale. Transcription data is highly dynamic and responds to physiological activity, while epigenetic changes are expected to occur at developmental and aging rates.
An additional consideration is the difference in the expected coverage, or dimensionality, of each type of omic data. With current technology, for instance, the complete DNA sequence of an individual can be determined, or estimated to high accuracy; genomic data is therefore close to full coverage. Transcriptomic data, however, is currently far from giving us the full picture: the complete set of transcripts of an individual at a given time across all cell types. While transcriptomic data is clearly not complete, it is nevertheless a high-dimensional, unbiased scan of a possible state of the transcriptome; that is, the complete set of transcripts in a biological sample of the individual.
Current extensions of omic data include metabolomics and proteomics, and other domains not strictly associated with specific molecular levels. These include, for instance, phenomic and exposomic data, which record multiple phenotypes and exposures at any level: molecular, organic or population. Studies including such data, therefore, allow high dimensionality on the response, traits or environmental conditions of individuals. Here, we will be concerned with studies of single phenotypes and conditions that are controlled or can be adjusted for covariates. We are primarily interested in describing subject variability on single phenotypes at a molecular level. We, therefore, study high-dimensional data on the DNA structure and function of groups of individuals, for which the analysis methods show wide consensus. Some attention will also be given to exogenous factors captured by exposomic data, which is a massive, unbiased collection of environmental conditions.
The genome of an individual is the entire DNA content of all the individual’s chromosomes. Genomic data comprises extensive and unbiased measurements of all the chromosomes’ nucleotide sequences. Therefore, the highest possible dimensionality of genomic data is the number of nucleotides in the genome. However, it is the comparison between genomes that reveals their biologically meaningful substructures. As such, a collection of genomic data across individuals is based on the sequence variability of given structures.
The simplest and most common structural variants in the genome are single nucleotide polymorphisms (SNPs). They are changes in only one nucleotide within a short DNA sequence that is otherwise conserved across individuals. The changes considered as SNPs are single substitutions of one nucleotide for another; they are bi-allelic and not rare in the population, with allele frequencies considered to be higher than 1%. SNPs can be detected with microarrays or sequencing techniques.
Short DNA sequences, with the variant nucleotides at their ends, constitute probes that can be interrogated by their hybridization with the DNA of a given subject, which has been amplified, cut and labeled with fluorescent dyes, one for each variant nucleotide or allele. Microarrays are silicon chips of millions of immobilized probes that capture the fluorescent DNA fragments of the subject, creating an optical pattern that is given by the individual’s allele pairs, or genotypes, at each probe.
Different microarray technologies are used to genotype individuals with this approach, which is currently the most efficient and economical method to measure a substantial part of the genomic variability between individuals. The end result is an extensive coverage of SNP variants across the genomes of thousands to hundreds of thousands of individuals. For large studies, the dimensionality of this data can reach 10^5 (individuals) times 10^7 (SNPs), where the SNP variables are typically encoded as 0, 1 and 2 for homozygous for the annotated reference allele, heterozygous and homozygous for the variant allele, respectively. Annotations are complementary data on the genomic variables containing the two possible alleles at a given SNP, among adenine (A), thymine (T), cytosine (C) or guanine (G); the DNA strand, 5’ to 3’ (+) or 3’ to 5’ (-); and the allele that should be considered as reference. Other specific considerations that influence subsequent analysis include quality measurements of technical and biological conditions affecting SNPs and individuals.
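The 0, 1 and 2 encoding can be illustrated with a minimal sketch. The function names, the toy genotypes, the two-letter genotype strings and the use of None for missing calls are assumptions made here for illustration; they do not correspond to the output of any particular genotyping platform.

```python
from typing import List, Optional, Sequence


def encode_genotype(genotype: str, ref: str, alt: str) -> Optional[int]:
    """Return the number of variant alleles (0, 1 or 2) in a genotype such as 'AG'.

    Returns None when the call is missing or contains an unexpected allele.
    """
    if len(genotype) != 2 or any(allele not in (ref, alt) for allele in genotype):
        return None
    return sum(1 for allele in genotype if allele == alt)


def minor_allele_frequency(codes: Sequence[Optional[int]]) -> float:
    """Frequency of the less common allele among the called genotypes."""
    called = [c for c in codes if c is not None]
    if not called:
        raise ValueError("no called genotypes")
    # Each individual carries two alleles, hence the factor of 2.
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


# Toy data for one SNP annotated with reference allele A and variant allele G.
genotypes = ["AA", "AG", "GG", "AA", "AG", "NN"]  # "NN" stands for a failed call
codes: List[Optional[int]] = [encode_genotype(g, ref="A", alt="G") for g in genotypes]
maf = minor_allele_frequency(codes)

# A variant is usually treated as a SNP when its allele frequency exceeds 1%.
is_common = maf > 0.01
```

In a full study, one such vector of codes would form a single column of the individuals-by-SNPs matrix, and the annotation (alleles, strand, reference) would be needed to interpret it.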
A typical human SNP array assay includes a couple of million reference SNPs, out of about 85 million SNPs known in humans [23]. Neighboring SNPs are, however, highly correlated. Due to recombination, the correlation between SNPs diminishes with their distance, but it is still substantial (R^2 ~ 0.2) for SNPs as far apart as 200,000 base pairs. Blocks of correlated SNPs, namely haplotypes, in reference populations have been used to impute the value of unmeasured SNPs and thus help to increase the number of SNPs of a particular study or facilitate the merging of genomic data from multiple studies [59]. The scalability of microarray-based studies is, therefore, their biggest asset to identify the likely small independent effects of numerous SNPs on complex traits [89].
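The correlation between two SNPs can be sketched as the squared Pearson correlation of their 0, 1 and 2 genotype codes. This is a simple genotype-based approximation, assumed here for illustration; haplotype-based measures of correlation require phased data. The function name and the toy vectors are hypothetical.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two genotype vectors coded 0, 1, 2."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A SNP with no variation in the sample has no defined correlation.
        raise ValueError("monomorphic SNP in this sample")
    return sxy ** 2 / (sxx * syy)


# Toy genotypes of six individuals at three SNPs.
snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [2, 1, 0, 2, 1, 0]  # same information as snp_a with alleles swapped
snp_c = [0, 0, 0, 1, 1, 1]  # unrelated to snp_a in this toy sample

r2_ab = r_squared(snp_a, snp_b)
r2_ac = r_squared(snp_a, snp_c)
```

Because the measure is squared, it does not depend on which allele is taken as reference, which is why snp_a and snp_b are perfectly correlated here.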
SNP microarrays collect the genetic variability of individuals in known sequence variants. The known variants have been determined from reference population samples which have been fully sequenced. The extent to which the selected references can offer a complete and unbiased coverage of different population samples remains to be determined. Despite the benefits of microarray genotyping, genome sequencing is still the ultimate source of information to fully define the genomic variability of individuals.
1.2.1.3 Sequencing methods
High-throughput sequencing methods aim to sequence all the DNA content of individuals. Broadly, in these methods, DNA is cut into small fragments (~100 base pairs) that are sequenced as reads. Hundreds of millions of reads are produced, which can cover the genome a number of times (~5 to 8) and need to be assembled to reconstruct an individual genome. Specific sequence variants of individuals can then be estimated with high accuracy. Mapping the reads of different individual genomes to a reference genome recovers genomic SNP data with the greatest coverage, independently of ancestry. The scalability of genomic data obtained from sequencing is, however, limited. Current technology is expensive and computationally demanding, and a suitable increase in the number of individuals, required to detect the likely small effects of common variants, is at the moment unattainable.
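The relation between the number of reads, their length and the number of times the genome is covered can be made explicit with a short calculation. The genome size of 3.1 billion base pairs and the 5-fold target are illustrative assumptions, and the fraction of the genome covered uses the idealized approximation that reads fall uniformly at random along the genome.

```python
from math import ceil, exp

GENOME_SIZE = 3_100_000_000  # assumed approximate size of a human genome, in base pairs
READ_LENGTH = 100            # read length in base pairs, as in the text


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is covered by a read."""
    return n_reads * read_length / genome_size


def reads_needed(coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target mean coverage."""
    return ceil(coverage * genome_size / read_length)


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases covered at least once, under uniform random reads."""
    return 1 - exp(-coverage)


reads_for_5x = reads_needed(5, READ_LENGTH, GENOME_SIZE)
coverage_check = mean_coverage(reads_for_5x, READ_LENGTH, GENOME_SIZE)
fraction_at_5x = expected_fraction_covered(5)
```

Under these assumptions, a 5-fold coverage with 100 base-pair reads requires on the order of hundreds of millions of reads per individual, which gives a sense of why sequencing large cohorts is computationally demanding.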
Calling structural variants from sequencing data, therefore, remains an important tool to investigate rare variations and specific genomic architectures, while SNP arrays are most powerful in large studies of common genomic variation.
1.2.2 Genomic data for other structural variants
Genomic variation is rich, even between individuals with common recent ancestry. In a specific population, several DNA segments, of various lengths and up to the order of megabase pairs, can be found inserted, duplicated, deleted, translocated or inverted. While DNA sequencing is the best way to detect genomic variation, its price and analysis demands in large cohorts limit its use. SNP microarrays can, however, be exploited to detect many of these variants. For instance, the luminous intensities used to genotype SNPs can also be utilized to detect either regions with copy number alterations or cell populations with different genotypes (mosaicism) [167, 45]. In addition, specific haplotype patterns, which are produced by suppression of recombination, are indicative of mispairing between homologous chromosomes due to likely structural differences between them. Large and divergent haplotype groups have been associated with the suppression of recombination due to inversion polymorphisms. From genomic SNP data, inversions can be genotyped and their variability and functional impact can be studied in large cohorts [16].
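As a deliberately simplified sketch of how probe intensities can point to copy number alterations, one can compare the total intensity of each probe in a subject with a reference intensity for the usual two copies. The function names, the toy intensities and the thresholds are assumptions made for illustration; an actual analysis would need to account for probe noise and to combine the signal of neighboring probes.

```python
from math import log2
from typing import List, Sequence


def log2_ratios(sample: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Log2 ratio of observed to reference intensity at each probe."""
    if len(sample) != len(reference):
        raise ValueError("sample and reference must have the same number of probes")
    return [log2(s / r) for s, r in zip(sample, reference)]


def flag_copy_number(ratios: Sequence[float],
                     loss_threshold: float = -0.5,
                     gain_threshold: float = 0.4) -> List[str]:
    """Label each probe as 'loss', 'gain' or 'normal' using hypothetical thresholds.

    With two copies as reference, one copy gives log2(1/2) = -1 and
    three copies give log2(3/2), about 0.58, in the absence of noise.
    """
    labels = []
    for ratio in ratios:
        if ratio <= loss_threshold:
            labels.append("loss")
        elif ratio >= gain_threshold:
            labels.append("gain")
        else:
            labels.append("normal")
    return labels


# Toy total intensities at six consecutive probes.
reference_intensity = [1000.0] * 6
subject_intensity = [1000.0, 980.0, 500.0, 510.0, 1020.0, 1500.0]

ratios = log2_ratios(subject_intensity, reference_intensity)
labels = flag_copy_number(ratios)
```

In this toy example, the two consecutive probes with roughly half the reference intensity would be consistent with a deleted segment, and the last probe with a gain.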
Microarray SNP data opens the possibility to study more complex structural DNA variation in population samples across the genome. We can, therefore, exploit SNP data to have a more complete knowledge of genomic variability and to study the potential role of large structural variation in the phenotypic differences between individuals.
1.2.3 Transcriptomic data
Complex biochemical reactions are involved in the decoding, or transcription, of DNA sequences. These reactions directly produce RNA molecules, some of which are further processed to produce proteins, the basic tools of the cells’ physiology. Transcriptomic data is, therefore, a large-scale survey of the transcribed RNA repertoire of a biological sample.
The dimensionality of transcriptomic data is much smaller than that of genomic data. While in the producti...