Introduction
Epigenetics is a rapidly growing and promising field for the discovery of novel disease biomarkers and for understanding the pathophysiology of complex diseases. Epigenetic modifications regulate gene expression and gene activity without altering the underlying DNA sequence; instead, they act on chromatin structure through DNA methylation, histone modifications, miRNAs, and other noncoding RNAs [1]. These epigenetic mechanisms play important roles in embryonic development, transcriptional regulation, chromatin structure, genomic imprinting, and maintenance of genome integrity. While epigenetic changes are required for normal development and cell function, they can also be responsible for disease initiation and progression, especially in cancer. High-throughput technologies (e.g., next-generation sequencing [NGS] and microarrays) and modern bioinformatics tools have enabled the profiling and mapping of large-scale epigenomic data [1]. Computational approaches are therefore an integral part of epigenomic research, especially during experimental design, data visualization, hypothesis validation, and result interpretation. Moreover, computational modeling is required to integrate heterogeneous data sources, including differentially methylated regions, miRNA binding, chromatin modifications, gene expression, genetic variation, genomic regions, and phenotypic characteristics. Although the field of computational epigenetics is still in its infancy, the potential payoffs are enormous. Computational approaches can give insight into the mechanistic basis of human diseases even when the underlying pathophysiology is not yet well understood. With this book, we aim to provide theoretical insight, summarize practical implications, and draw attention to the emerging area of computational epigenetics and disease.
Computational Approaches in DNA Methylation
DNA methylation is one of the most intensely studied epigenetic modifications in humans. A methyl group is covalently added at the fifth position of cytosine (C) to form 5-methylcytosine (5mC), a reaction catalyzed by DNA methyltransferases (DNMTs). DNMTs are a group of enzymes that regulate DNA methylation patterns, especially during normal development and disease [2]. For instance, DNMT3a and DNMT3b play important roles in de novo methylation and embryonic development, while DNMT1 maintains DNA methylation patterns during DNA replication and mitosis. Methyl-CpG-binding domain proteins (MBDs) recruit specific components of the epigenetic machinery to read and interpret the information encoded by methylated DNA. DNA methylation can occur in repetitive genomic regions, including satellite DNA and parasitic elements (e.g., long interspersed nuclear elements [LINEs], short interspersed nuclear elements [SINEs], and endogenous retroviruses), which contain CpG dinucleotides whose cytosines can be methylated. In humans, methylation of cytosine occurs predominantly at 5′-CpG-3′ dinucleotides, and to a lesser extent at non-CpG sites (e.g., CpA, CpT, and CpC). CpG dinucleotides are highly concentrated in CpG islands (CGIs), which are often located in gene promoters, near transcription start sites, and in enhancer regions [3–5]. CGIs are typically unmethylated and may undergo dynamic methylation changes during development, differentiation, and disease [5,6]. The methylation state of a CGI can affect gene expression patterns through regulation of chromatin structure and transcription factor binding [7]. Therefore, it is crucial to measure differential DNA methylation in the CpG context.
Numerous approaches have been proposed to study DNA methylation, including bisulfite PCR sequencing, PyroMark CpG assay, Illumina's Infinium Methylation assay, quantitative MethyLight assay, luminometric methylation assay, methylated DNA immunoprecipitation (MeDIP), MeDIP coupled with high-throughput sequencing (MeDIP-seq), methyl-CpG-binding domain coupled with high-throughput sequencing (MBD-seq), methylation-sensitive restriction enzyme sequencing (MRE-seq), reduced representation bisulfite sequencing (RRBS), and whole genome bisulfite sequencing (WGBS) [8–11].
Bisulfite sequencing remains the gold standard for profiling the DNA methylome, helped by the increasing throughput and falling cost of NGS technologies. The mapping and alignment of bisulfite reads from NGS (e.g., RRBS, Agilent SureSelect Human Methyl-Seq, NimbleGen SeqCap Epi CpGiant, and WGBS) are more complicated than for regular sequence reads. This task is made less burdensome by computational tools for filtering, quality control, and alignment of reads, such as BALM, Bismark, BRAT-nova, BS-seeker, BSMAP, MAQ, MOABS, MACAU, MEDIPS, RMAP, PASH, TAMeBS, and WALT [1]. Bisulfite treatment converts unmethylated cytosines to uracils, which are subsequently read as thymines in the sequencing reads, whereas methylated cytosines are protected and remain cytosines. The degree of DNA methylation can therefore be calculated from the frequency of cytosines and thymines at a specific CpG locus, by aligning the raw reads against cytosines in the reference genomic sequence [1]. In brief, wild card aligners (e.g., BSMAP, RMAP, and Pash 3.0) substitute cytosines in the reference sequence with the IUPAC letter “Y” (C or T), so that they can also match thymines in the bisulfite reads, and then align using a hashing and extension method [1]. Alternatively, three-letter aligners (e.g., Bismark, BS-seeker, and BRAT-nova) convert all cytosines to thymines in silico in both the reference sequence and the reads, followed by short-read alignment (e.g., with Bowtie or Bowtie 2) based on a three-letter DNA code (A, G, and T) [1]. Once the data are processed, the DNA methylation status of a region can be predicted from features such as the transcriptional activity of downstream genes, transcription start sites, transcription factor binding sites, presence or absence of a TATA box, and/or RNA polymerase II occupancy on DNA [3]. Such computational predictions [3] are useful, particularly where experimental data are still lacking [11,12], and they represent a first step toward quantitative analysis of DNA methylation data.
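The three-letter conversion and the estimation of methylation from cytosine and thymine counts described above can be illustrated with a minimal Python sketch. It is not how Bismark or any other tool named above is implemented. The function names (`to_three_letter`, `methylation_level`), the toy reference, and the toy reads are hypothetical, and the sketch assumes gap-free forward-strand alignments that are already known.

```python
def to_three_letter(seq: str) -> str:
    """In-silico C-to-T conversion, as used by three-letter aligners (forward strand only)."""
    return seq.upper().replace("C", "T")


def methylation_level(reference: str, aligned_reads: list[tuple[int, str]]) -> dict[int, float]:
    """Estimate methylation at each CpG cytosine as C / (C + T).

    aligned_reads holds (start, sequence) pairs: hypothetical, gap-free
    forward-strand alignments of bisulfite reads to `reference`.
    """
    reference = reference.upper()
    # 0-based positions of cytosines followed by a guanine (CpG context)
    cpg_positions = [i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"]
    counts = {pos: [0, 0] for pos in cpg_positions}  # pos -> [C count, T count]
    for start, read in aligned_reads:
        for offset, base in enumerate(read.upper()):
            pos = start + offset
            if pos in counts:
                if base == "C":      # protected from conversion: methylated
                    counts[pos][0] += 1
                elif base == "T":    # converted to uracil, read as T: unmethylated
                    counts[pos][1] += 1
    # Report only CpG sites covered by at least one informative read
    return {pos: c / (c + t) for pos, (c, t) in counts.items() if c + t > 0}


# Toy data (assumed for illustration only)
reference = "ATCGTACGA"
reads = [(0, "ATCGTATGA"), (0, "ATTGTATGA"), (2, "CGTATGA")]
levels = methylation_level(reference, reads)
```

In this toy example the CpG at position 2 is supported by two C reads and one T read, giving an estimated methylation level of about 0.67, while the CpG at position 6 is covered only by T reads and is estimated as unmethylated. After `to_three_letter` is applied, every read matches the converted reference exactly, which is why alignment in the three-letter alphabet is insensitive to methylation state.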
When no a priori knowledge is available about the methylation of a candidate gene, it is more appropriate to assess methylated regions comprising a number of cytosines, such as a “CpG island,” than individual sites. Although several statistical methods have been applied in the detection of differentially methylated regions [13], Fisher's exact test and paired nonparametric tests are the most common methods for comparing the methylation levels of the cytosines within the regions of interest. The resulting p-values must be corrected for multiple testing, for example by controlling the false discovery rate with the Benjamini–Hochberg procedure. Alternatively, probabilistic and less biased methods such as hidden Markov models (HMMs) can be used for this segmentation problem. Additionally, a multivariate statistical model has been proposed for analyzing epigenetic data [14]. Such approaches are more realistic than marginal models and can improve the interpretation of the resulting epigenetic data.
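As a sketch of the per-cytosine testing and multiple-testing correction described above, the following Python code applies a two-sided Fisher's exact test to methylated and unmethylated read counts in two groups and then computes Benjamini–Hochberg adjusted p-values. It uses only the standard library. The function names and the read counts are hypothetical, and a real analysis would normally rely on an established statistics package rather than this minimal implementation.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g., case and control); columns are methylated and
    unmethylated read counts at one cytosine.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x methylated reads in the first group
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts per CpG: (methylated, unmethylated) in cases, then in controls
tables = [(18, 2, 5, 15), (10, 10, 9, 11), (20, 0, 0, 20)]
p_values = [fisher_exact_two_sided(*t) for t in tables]
q_values = benjamini_hochberg(p_values)
```

A cytosine would then be called differentially methylated when its adjusted value in `q_values` falls below the chosen false discovery rate threshold. Testing each cytosine independently in this way is exactly the kind of marginal analysis that the HMM and multivariate approaches mentioned above aim to improve on.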
Computational Approaches in Histone Modifications
In addition to DNA methylation, histone modifications are also widely studied epigenetic mechanisms. DNA is wrapped around an octamer of core histones to form nucleosomes, which are subsequently organized into chromatin. Each nucleosome is composed of two copies each of the four histone proteins H2A, H2B, H3, and H4. The overall structure of chromatin can be altered through posttranslational modifications of the histone N-terminal tails, such as methylation, phosphorylation, acetylation, ubiquitination, SUMOylation, ADP ribosylation, biotinylation, deamination, and proline isomerization [15]. Notably, histone acetylation, methylation, phosphorylation, and ubiquitination are involved in gene activation, whereas methylation, ubiquitination, SUMOylation, biotinylation, deamination, and proline isomerization are involved in gene repression. These histone modifications act as docking sites that recruit histone chaperones and nucleosome remodellers, which subsequently alter the chromatin architecture for transcriptional activity and ge...