
- 376 pages
- English
Omic Association Studies with R and Bioconductor
About this book
After the great expansion of genome-wide association studies, their scientific methodology and, notably, their data analysis have matured in recent years, and they are now a keystone of large epidemiological studies. Newcomers to the field are confronted with a wealth of data, resources and methods. This book presents current methods to perform informative analyses using real and illustrative data with established bioinformatics tools, and guides the reader through the use of publicly available data. It includes clear, readable code that readers can reproduce and adapt to their own data.
- Emphasises extracting biologically meaningful associations between traits of interest and genomic, transcriptomic and epigenomic data
- Uses up-to-date methods to exploit omic data
- Presents methods through specific examples and computing sessions
- Supplemented by a website, including code, datasets, and solutions
Authors: Juan R. González and Alejandro Cáceres
1 Introduction
CONTENTS
- 1.1 Book overview
- 1.2 Overview of omic data
- 1.2.1 Genomic data
- 1.2.1.1 Genomic SNP data
- 1.2.1.2 SNP arrays
- 1.2.1.3 Sequencing methods
- 1.2.2 Genomic data for other structural variants
- 1.2.3 Transcriptomic data
- 1.2.3.1 Microarrays
- 1.2.3.2 RNA-seq
- 1.2.4 Epigenomic data
- 1.2.5 Exposomic data
- 1.3 Association studies
- 1.3.1 Genome-wide association studies
- 1.3.2 Whole transcriptome profiling
- 1.3.3 Epigenome-wide association studies
- 1.3.4 Exposome-wide association studies
- 1.4 Publicly available resources
- 1.4.1 dbGaP
- 1.4.2 EGA
- 1.4.3 GEO
- 1.4.4 1000 Genomes
- 1.4.5 GTEx
- 1.4.6 TCGA
- 1.4.7 Others
- 1.5 Bioconductor
- 1.5.1 R
- 1.5.2 Omic data in Bioconductor
- 1.6 Book’s outline
1.1 Book overview
This book is concerned with the analysis of high-dimensional data acquired in specific biological domains. The aim of the analyses is to explain phenotypic differences among individuals. We therefore search for endogenous and exogenous factors that may influence such differences. The endogenous domains to which we turn our attention are those at the molecular level involving basic DNA structure and function, which have been labeled with the omic suffix. In particular, we will describe current methods to analyze genomic data, which is high-dimensional at the gene (DNA sequence) level; transcriptomic data, involving the transcription of DNA into mRNA; and epigenomic/methylomic data, which relate to the epigenetic modifications of DNA. Many of the methods used in each domain overlap due to the biological nature and high dimensionality of the data. However, important specificities remain, some derived from the acquisition of data and others from differences in the underlying biological processes. Within the exogenous domain, we study the high-dimensional acquisition of exposure factors that are believed to influence the development or progression of individual traits.
1.2 Overview of omic data
Omic data refers to data collected massively in a specific omic domain. The notion of an unbiased scan of numerous biological entities of similar nature comes to mind. The definition is clearly operational, given the differences in understanding biological similarity and, perhaps more challenging, the varying characteristics of different biological levels. Genomic data, for instance, is ultimately concerned with the full characterization of the DNA sequence of an individual. As such, it is highly stable across tissues and the individual’s lifespan. Some variations may arise in terms of somatic mutations that give rise to mosaicisms or to specific mutations, as found in tumorous cells. By contrast, transcriptomic and epigenomic data are highly variable across tissues, and each changes on a different time scale. Transcription data is highly dynamic and responds to physiological activity, while epigenetic changes are expected to occur at developmental and aging rates.
An additional consideration is the difference in the expected coverage, or dimensionality, of each type of omic data. Nowadays, for instance, one can expect from current technology that the complete DNA sequence of an individual may be determined, or estimated to high accuracy; genomic data is therefore close to full coverage. However, transcriptomic data is currently far from giving us the full picture: the complete set of transcripts of an individual at a given time across all cell types. While transcriptomic data is clearly not complete, it is nevertheless a high-dimensional, unbiased scan of a possible state of the transcriptome; that is, the complete set of transcripts in a biological sample of the individual.
Current extensions of omic data include metabolomics and proteomics, and other domains not strictly associated with specific molecular levels. These include, for instance, phenomic and exposomic data, which record multiple phenotypes and exposures at any level: molecular, organic or population. Studies including such data, therefore, allow high dimensionality on the response, traits or environmental conditions of individuals. Here, we will be concerned with studies of single phenotypes and conditions that are controlled or can be adjusted for covariates. We are primarily interested in describing subject variability on single phenotypes at a molecular level. We, therefore, study high dimensional data of DNA structure and function of groups of individuals, whose analysis methods show wide consensus. Some attention will also be given to exogenous factors given by exposomic data, which is a massive collection of environmental conditions in an unbiased manner.
1.2.1 Genomic data
The genome of an individual is the entire DNA content of all the individual’s chromosomes. Genomic data comprises extensive and unbiased measurements of all the chromosomes’ nucleotide sequences. Therefore, the highest possible dimensionality of genomic data is the number of nucleotides in the genome. However, it is the comparison between genomes that reveals their biologically meaningful substructures. As such, a collection of genomic data across individuals is based on the sequence variability of given structures.
1.2.1.1 Genomic SNP data
The simplest and most common structural variants in the genome are single nucleotide polymorphisms (SNPs). They are changes in only one nucleotide within a short DNA sequence that is otherwise conserved across individuals. The changes considered as SNPs are those given by a single substitution of one nucleotide for another; they are bi-allelic mutations that are not rare in the population, with allele frequencies higher than 1%. SNPs can be detected with microarrays or sequencing techniques.
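The 1% frequency criterion can be illustrated with a minimal base R sketch. The genotype matrix below is made up for illustration, and the allele counts are assumed to be complete (no missing calls); it is not taken from the book's datasets.

```r
# Toy genotype matrix (hypothetical values): rows are individuals,
# columns are SNPs, entries count copies of one allele (0, 1 or 2).
geno <- matrix(
  c(0, 1, 2, 0,
    0, 0, 1, 0,
    1, 0, 2, 0,
    0, 1, 1, 0,
    2, 0, 2, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("id", 1:5), paste0("snp", 1:4))
)

# Frequency of the counted allele: total copies / (2 * number of individuals)
allele_freq <- colSums(geno) / (2 * nrow(geno))

# The minor allele frequency (MAF) is the smaller of the two allele frequencies
maf <- pmin(allele_freq, 1 - allele_freq)

# Keep only variants whose MAF exceeds the 1% threshold
common <- maf > 0.01
geno_common <- geno[, common, drop = FALSE]
```

In this toy example the last column is monomorphic, so it does not qualify as a SNP under the frequency criterion and is dropped.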
1.2.1.2 SNP arrays
Short DNA sequences, with their variant nucleotides at their ends, constitute probes that can be interrogated by their hybridization with the DNA of a given subject, which has been amplified, cut and marked with fluorescent dyes, one for each variant nucleotide or allele. Microarrays are silicon chips of millions of immobilized probes that capture the luminous DNA fragments of the subject, creating an optical pattern that is given by the individual’s allele pairs, or genotypes, at each probe.
Different microarray technologies are used to genotype individuals with this approach, which is currently the most efficient and economical method to measure a substantial part of the genomic variability between individuals. The end result is an extensive coverage of SNP variants across the genomes of thousands or hundreds of thousands of individuals. For large studies, the dimensionality of this data can reach 10^5 (individuals) times 10^7 (SNPs), where the SNP variables are typically encoded as 0, 1 and 2 for annotated homozygous, heterozygous and variant homozygous, respectively. Annotations are complementary data on the genomic variables containing the two possible alleles at a given SNP, among adenine (A), thymine (T), cytosine (C) or guanine (G); the DNA strand, 5’ to 3’ (+) or 3’ to 5’ (-); and the allele that should be considered as reference. Other specific considerations that influence subsequent analysis include quality measurements of technical and biological conditions affecting SNPs and individuals.
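As a rough illustration of the 0/1/2 encoding and of the data size it implies, consider the following base R sketch. The genotype calls, the reference allele and the helper `encode_snp()` are hypothetical, introduced only for this example; the storage figures are simple arithmetic under the stated assumptions.

```r
# Hypothetical genotype calls for one SNP with alleles A and G
calls <- c("AA", "AG", "GG", "GA", "AA")
ref   <- "A"  # allele annotated as reference (assumed for this example)

# Count the non-reference alleles in each two-letter genotype call
encode_snp <- function(calls, ref) {
  alleles <- strsplit(calls, "")
  vapply(alleles, function(a) sum(a != ref), integer(1))
}

coded <- encode_snp(calls, ref)

# Size of a large study: 10^5 individuals times 10^7 SNPs
n_ind   <- 1e5
n_snp   <- 1e7
entries <- n_ind * n_snp

# Storage under two assumed encodings
bytes_2bit   <- entries * 2 / 8  # 2 bits per genotype (compact encoding)
bytes_double <- entries * 8      # 8 bytes per genotype (plain numeric matrix)
```

Under these assumptions, a plain numeric matrix would need about 8 terabytes, which is one reason compact genotype encodings are commonly used for data of this size.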
A typical human SNP array assay includes a couple of million reference SNPs, from about 85 million SNPs existing in humans [23]. Neighboring SNPs are, however, highly correlated. Due to recombination, the correlation between SNPs diminishes with their distance, but it is still substantial (R² ~ 0.2) for SNPs as far apart as 200,000 base pairs. Blocks of correlated SNPs, namely haplotypes, in reference populations have been used to impute the value of unmeasured SNPs and thus help to increase the number of SNPs of a particular study or facilitate the merging of genomic data from multiple studies [59]. The scalability of microarray-based studies is, therefore, their biggest asset to identify the likely small independent effects of numerous SNPs on complex traits [89].
SNP microarrays collect the genetic variability of individuals in known sequence variants. The known variants have been determined from reference population samples which have been fully sequenced. The extent to which the selected references can offer a complete and unbiased coverage of different population samples remains to be determined. Despite the benefits of microarray genotyping, genome sequencing is still the ultimate source of information to fully define the genomic variability of individuals.
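The correlation between SNPs mentioned above can be sketched in base R as the squared Pearson correlation of 0/1/2 genotype codes. This is a simplification: it works on unphased genotypes rather than haplotypes, and the values below are invented for illustration.

```r
# Toy genotypes (hypothetical) for three SNPs in eight individuals
geno <- cbind(
  snp1 = c(0, 1, 2, 0, 1, 2, 0, 1),
  snp2 = c(0, 1, 2, 0, 1, 2, 0, 2),  # nearly identical to snp1
  snp3 = c(2, 0, 1, 1, 0, 2, 1, 0)   # unrelated pattern
)

# Pairwise squared correlation between genotype codes
r2 <- cor(geno)^2

# Compare a strongly correlated pair with a weakly correlated one
r2["snp1", "snp2"]
r2["snp1", "snp3"]
```

A pair with high R² carries largely redundant information, which is the property that imputation from reference haplotypes relies on.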
1.2.1.3 Sequencing methods
High-throughput sequencing methods aim to sequence all the DNA content of individuals. Broadly, in these methods, DNA is cut into small fragments (~ 100 base pairs), or reads. Hundreds of millions of reads are then produced, which can cover the genome a number of times (~ 5/8), and need to be assembled to reconstruct an individual genome. Specific sequence variants of individuals can be estimated with high accuracy. The mapping of the reads of different individual genomes to a reference genome recovers genomic SNP data with the greatest coverage, not conditioned on ancestry. The scalability of genomic data obtained from sequencing is, however, limited. Current technology is expensive and computationally demanding, and a suitable increase in the number of individuals, required to detect the likely small effects of common variants, is at the moment unattainable.
Sequence-based calling of structural variants, therefore, remains an important tool to investigate rare variations and specific genomic architectures, while SNP arrays are most powerful in large studies of common genomic variation.
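The relation between read length, number of reads and the number of times the genome is covered can be made concrete with a short calculation. The sketch below assumes a human genome size of roughly 3.1 billion base pairs and a target coverage of 5, both chosen for illustration; the function names are hypothetical.

```r
# Average coverage: total sequenced bases divided by genome size
coverage <- function(n_reads, read_len, genome_size) {
  n_reads * read_len / genome_size
}

# Number of reads needed to reach a target average coverage
reads_needed <- function(target_cov, read_len, genome_size) {
  target_cov * genome_size / read_len
}

genome_size <- 3.1e9  # assumed approximate human genome size, in base pairs
read_len    <- 100    # read length in base pairs, as in the text

n_reads <- reads_needed(5, read_len, genome_size)
n_reads                                   # on the order of 10^8 reads
coverage(n_reads, read_len, genome_size)  # recovers the target coverage
```

With these assumed values, a five-fold coverage already requires more than a hundred million reads, in line with the orders of magnitude described in the text.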
1.2.2 Genomic data for other structural variants
Genomic variation is rich, even between individuals with common recent ancestry. In a specific population, several DNA segments, of various lengths and up to the order of megabase pairs, can be found inserted, duplicated, deleted, translocated or inverted. While DNA sequencing is the best way to detect genomic variation, its price and analysis demands in large cohorts limit its use. SNP microarrays can, however, be exploited to detect many of these variants. For instance, the luminous intensities used to genotype SNPs can also be utilized to detect either regions with copy number alterations or cell populations with different genotypes (mosaicism) [167, 45]. In addition, specific haplotype patterns, which are produced by suppression of recombination, are indicative of mispairing between homologous chromosomes due to likely structural differences between them. Large and divergent haplotype groups have been associated with the suppression of recombination due to inversion polymorphisms. From genomic SNP data, inversion genotyping can be performed and the variability and functional impact of inversions can be studied in large cohorts [16].
Microarray SNP data opens the possibility to study more complex structural DNA variation in population samples across the genome. We can, therefore, exploit SNP data to have a more complete knowledge of genomic variability and to study the potential role of large structural variation in the phenotypic differences between individuals.
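The idea of using probe intensities to detect copy number alterations can be sketched, in a very simplified form, as follows. This is not the method of the cited references: the intensities, the window width and the threshold are all hypothetical, and real callers model noise and allele-specific signals far more carefully.

```r
# Hypothetical probe intensities along a chromosome, relative to the
# intensity expected for two copies (1 = normal). Probes 8 to 13 mimic a
# three-copy gain (ratio 1.5); probe 3 is an isolated noisy measurement.
ratio <- rep(1, 20)
ratio[8:13] <- 1.5
ratio[3] <- 1.6

# Log2 ratio: 0 for two copies, about 0.58 for three copies
log_ratio <- log2(ratio)

# A running median (window of 5 probes) removes isolated outliers
smoothed <- as.numeric(runmed(log_ratio, k = 5))

# Flag probes whose smoothed log2 ratio exceeds an assumed threshold
gain <- smoothed > 0.3
which(gain)
```

The smoothing step reflects the fact that a true copy number change affects a run of neighboring probes, whereas a single deviant probe is more likely to be noise.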
1.2.3 Transcriptomic data
Complex biochemical reactions are involved in the decoding, or transcription, of DNA sequences. A direct product of these reactions is the production of RNA molecules, some of which are further processed to produce proteins, the basic tools of the cells’ physiology. Transcriptomic data is, therefore, a large-scale survey of the transcribed RNA repertoire of a biological sample.
The dimensionality of transcriptomic data is much smaller than that of genomic data. While in the producti...
Table of contents
- Cover
- Half Title
- Title Page
- Copyright Page
- Dedication Page
- Contents
- Preface
- 1 Introduction
- 2 Case examples
- 3 Dealing with omic data in Bioconductor
- 4 Genetic association studies
- 5 Genomic variant studies
- 6 Addressing batch effects
- 7 Transcriptomic studies
- 8 Epigenomic studies
- 9 Exposomic studies
- 10 Enrichment analysis
- 11 Multiomic data analysis
- Bibliography
- Index