eBook - ePub

Computational Genomics with R

Name: Computational Genomics with R
Author: Altuna Akalin

Altuna Akalin

Share book

440 pages
English
ePUB (mobile friendly)
Available on iOS & Android

eBook - ePub

Computational Genomics with R

Altuna Akalin

Book details

Book preview

Table of contents

Citations

About This Book

Computational Genomics with R provides a starting point for beginners in genomic data analysis and also guides more advanced practitioners to sophisticated data analysis techniques in genomics. The book covers topics from R programming, to machine learning and statistics, to the latest genomic data analysis techniques. The text provides accessible information and explanations, always with the genomics context in the background. This also contains practical and well-documented examples in R so readers can analyze their data by simply reusing the code presented. As the field of computational genomics is interdisciplinary, it requires different starting points for people with different backgrounds. For example, a biologist might skip sections on basic genome biology and start with R programming, whereas a computer scientist might want to start with genome biology.

After reading:

You will have the basics of R and be able to dive right into specialized uses of R for computational genomics such as using Bioconductor packages.
You will be familiar with statistics, supervised and unsupervised learning techniques that are important in data modeling, and exploratory analysis of high-dimensional data.
You will understand genomic intervals and operations on them that are used for tasks such as aligned read counting and genomic feature annotation.
You will know the basics of processing and quality checking high-throughput sequencing data.
You will be able to do sequence analysis, such as calculating GC content for parts of a genome or finding transcription factor binding sites.
You will know about visualization techniques used in genomics, such as heatmaps, meta-gene plots, and genomic track visualization.
You will be familiar with analysis of different high-throughput sequencing data sets, such as RNA-seq, ChIP-seq, and BS-seq.
You will know basic techniques for integrating and interpreting multi-omics datasets.

Altuna Akalin is a group leader and head of the Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center, Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. He has published an extensive body of work in this area. The framework for this book grew out of the yearly computational genomics courses he has been organizing and teaching since 2015.

Frequently asked questions

How do I cancel my subscription?

Simply head over to the account section in settings and click on “Cancel Subscription” - it’s as simple as that. After you cancel, your membership will stay active for the remainder of the time you’ve paid for. Learn more here.

Can/how do I download books?

At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.

What is the difference between the pricing plans?

Both plans give you full access to the library and all of Perlego’s features. The only differences are the price and subscription period: With the annual plan you’ll save around 30% compared to 12 months on the monthly plan.

What is Perlego?

We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.

Do you support text-to-speech?

Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.

Is Computational Genomics with R an online PDF/ePUB?

Yes, you can access Computational Genomics with R by Altuna Akalin in PDF and/or ePUB format, as well as other popular books in Mathematics & Probability & Statistics. We have over one million books available in our catalogue for you to explore.

Information

Publisher

Chapman and Hall/CRC

Year

2020

ISBN

9780429532764

Edition

Topic

Mathematics

Subtopic

Probability & Statistics

Index

Mathematics

1

Introduction to Genomics

The aim of this chapter is to provide the reader with some of the fundamentals required for understanding genome biology. By no means, is this a complete overview of the subject, but just a summary that will help the non-biologist reader understand the recurring biological concepts in computational genomics. Readers that are well-versed in genome biology and modern genome-wide quantitative assays should feel free to skip this chapter or skim it through.

1.1 Genes, DNA and central dogma

A central concept that will come up again and again is “the gene”. Before we can explain that, we need to introduce a few other concepts that are important to understand the gene concept. The human body is made up of billions of cells. These cells specialize in different tasks. For example, in the liver there are cells that help produce enzymes to break toxins. In the heart, there are specialized muscle cells that make the heart beat. Yet, all these different kinds of cells come from a single-celled embryo. All the instructions to make different kinds of cells are contained within that single cell and with every division of that cell, those instructions are transmitted to new cells. These instructions can be coded into a string – a molecule of DNA, a polymer made of recurring units called nucleotides. The four nucleotides in DNA molecules, Adenine, Guanine, Cytosine and Thymine (coded as four letters: A, C, G, and T) in a specific sequence, store the information for life. DNA is organized in a double-helix form where two complementary polymers interlace with each other and twist into the familiar helical shape.

1.1.1 What is a genome?

The full DNA sequence of an organism, which contains all the hereditary information, is called a genome. The genome contains all the information to build and maintain an organism. Genomes come in different sizes and structures. Our genome is not only a naked stretch of DNA. In eukaryotic cells, DNA is wrapped around proteins (histones) forming higher-order structures like nucleosomes which make up chromatins and chromosomes (see Figure 1.1).

**Figure 1.1:**Chromosome structure in animals.

There might be several chromosomes depending on the organism. However, in some species (such as most prokaryotes) DNA is stored in a circular form. The size of the genome between species differs too. The human genome has 46 chromosomes and over 3 billion base-pairs, whereas the wheat genome has 42 chromosomes and 17 billion base-pairs; both genome size and chromosome numbers are variable between different organisms. Genome sequences of organisms are obtained using sequencing technology. With this technology, fragments of the DNA sequence from the genome, called reads, are obtained. Larger chunks of the genome sequence are later obtained by stitching the initial fragments to larger ones by using the overlapping reads. The latest sequencing technologies made genome sequencing cheaper and faster. These technologies output more reads, longer reads and more accurate reads.

The estimated cost of the first human genome was $300 million in 1999–2000; today a high-quality human genome can be obtained for $1500. Since the costs are going down, researchers and clinicians can generate more data. This drives up the costs for data storage and also drives up the demand for qualified people to analyze genomic data. This was one of the motivations behind writing this book.

1.1.2 What is a gene?

In the genome, there are specific regions containing the precise information that encodes for physical products of genetic information. A region in the genome with this information is traditionally called a “gene”. However, the precise definition of the gene is still developing. According to the classical textbooks in molecular biology, a gene is a segment of a DNA sequence corresponding to a single protein or to a single catalytic and structural RNA molecule (Alberts et al., 2002). A modern definition is: “A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript” (Eilbeck et al., 2005). No matter how variable the definitions are, all agree on the fact that genes are basic units of heredity in all living organisms.

All cells use their hereditary information in the same way most of the time; the DNA is replicated to transfer the information to new cells. If activated, the genes are transcribed into messenger RNAs (mRNAs) in the nucleus (in eukaryotes), followed by mRNAs (if the gene is protein coding) getting translated into proteins in the cytoplasm. This is essentially a process of information transfer between information-carrying polymers; DNA, RNA and proteins, known as the “central dogma” of molecular biology (see Figure 1.2 for a summary). Proteins are essential elements for life. The growth and repair, functioning and structure of all living cells depend on them. This is why the gene is a central concept in genome biology, because a gene can encode information for proteins and other functional molecules. How genes are controlled and activated dictates everything about an organism. From the identity of a cell to response to an infection, how cells develop and behave against certain stimuli is governed by the activity of the genes and the functional molecules they encode. The liver cell becomes a liver cell because certain genes are activated and their functional products are produced to help the liver cell achieve its tasks.

**Figure 1.2:**Central Dogma: replication, transcription, translation.

1.1.3 How are genes controlled? Transcriptional and post-transcriptional regulation

In order to answer this question, we have to dig a little deeper into the transcription concept we introduced via the central dogma. The first step in a process of information transfer - the production of an RNA copy of a part of the DNA sequence - is called transcription. This task is carried out by the RNA polymerase enzyme. RNA polymerase-dependent initiation of transcription is enabled by the existence of a specific region in the sequence of DNA - a core promoter. Core promoters are regions of DNA that promote transcription and are found upstream from the start site of transcription. In eukaryotes, several proteins, called general transcription factors, recognize and bind to core promoters and form a pre-initiation complex. RNA polymerases recognize these complexes and initiate synthesis of RNAs, the polymerase travels along the template DNA and makes an RNA copy (Hager et al., 2009). After mRNA is produced it is often spliced by spliceosome. The sections, called ‘introns’, are removed and sections called ‘exons’ left in. Then, the remaining mRNA is translated into proteins. Which exons will be part of the final mature transcript can also be regulated and creates diversity in protein structure and function (See Figure 1.3).

**Figure 1.3:**Transcription can be followed by splicing, which creates different transcript isoforms. This will in return create different protein isoforms since the information required to produce the protein is encoded in the transcripts. Differences in transcripts of the same gene can give rise to different protein isoforms.

Contrary to protein coding genes, non-coding RNA (ncRNAs) genes are processed and assume their functional structures after transcription and without going into translation, hence the name: non-coding RNAs. Certain ncRNAs can also be spliced but still not translated. ncRNAs and other RNAs in general can form complementary base-pairs within the RNA molecule which gives them additional complexity. This self-complementarity-based structure, termed the RNA secondary structure, is often necessary for functions of many ncRNA species.

In summary, the set of processes, from transcription initiation to production of the functional product, is referred to as gene expression. Gene expression quantification and regulation is a fundamental topic in genome biology.

1.1.4 What does a gene look like?

Before we move forward, it will be good to discuss how we can visualize genes. As someone interested in computational genomics, you will frequently encounter a gene on a computer screen, and how it is represented on the computer will be equivalent to what you imagine when you hear the word “gene”. In the online databases, the genes will appear as a sequence of letters or as a series of connected boxes showing exon-intron structure, which may include the direction of transcription as well (see Figure 1.4). You will encounter more with the latter, so this is likely what will pop into your mind when you think of genes.

As we have mentioned, DNA has two strands. A gene can be located on either of them, and the direction of transcription will depend on that. In Figure 1.4, you can see arrows on introns (lines connecting boxes) indicating the direction of the gene.

1.2 Elements of gene regulation

The mechanisms regulating gene expression are essential for all living organisms as they dictate where and how much of a gene product (it may be protein or ncRNA) should be manufactured. This regulation could occur at the pre- and co-transcriptional level b...