R Bioinformatics Cookbook

Use R and Bioconductor to perform RNAseq, genomics, data visualization, and bioinformatic analysis

Dan MacLean

316 pages
English

About the book

Over 60 recipes to model and handle real-life biological data using modern libraries from the R ecosystem

Key Features

Apply modern R packages to handle biological data using real-world examples
Represent biological data with advanced visualizations suitable for research and publications
Handle real-world problems in bioinformatics such as next-generation sequencing, metagenomics, and automating analyses

Book Description

Handling biological data effectively requires an in-depth knowledge of machine learning techniques and computational skills, along with an understanding of how to use tools such as edgeR and DESeq. With the R Bioinformatics Cookbook, you'll explore all this and more, tackling common and not-so-common challenges in the bioinformatics domain using real-world examples.

This book will use a recipe-based approach to show you how to perform practical research and analysis in computational biology with R. You will learn how to effectively analyze your data with the latest tools in Bioconductor, ggplot, and tidyverse. The book will guide you through the essential tools in Bioconductor to help you understand and carry out protocols in RNAseq, phylogenetics, genomics, and sequence analysis. As you progress, you will get up to speed with how machine learning techniques can be used in the bioinformatics domain. You will gradually develop key computational skills such as creating reusable workflows in R Markdown and packages for code reuse.

By the end of this book, you'll have gained a solid understanding of the most important and widely used techniques in bioinformatic analysis and the tools you need to work with real biological data.

What you will learn

Employ Bioconductor to determine differential expression in RNAseq data
Run SAMtools and develop pipelines to find single nucleotide polymorphisms (SNPs) and Indels
Use ggplot to create and annotate a range of visualizations
Query external databases with Ensembl to find functional genomics information
Execute large-scale multiple sequence alignment with DECIPHER to perform comparative genomics
Use d3.js and Plotly to create dynamic and interactive web graphics
Use k-nearest neighbors, support vector machines and random forests to find groups and classify data

Who this book is for

This book is for bioinformaticians, data analysts, researchers, and R developers who want to address intermediate-to-advanced biological and bioinformatics problems by learning through a recipe-based approach. Working knowledge of R programming language and basic knowledge of bioinformatics are prerequisites.

Information

Publisher: Packt Publishing
Year: 2019
ISBN: 9781789955590
Subject: Computer Science
Category: Bioinformatics

Performing Quantitative RNAseq

The technology of RNAseq has revolutionized the study of transcript abundances, bringing high-sensitivity detection and high-throughput analysis. Bioinformatic analysis pipelines using RNAseq data typically start with a read quality control step followed by either alignment to a reference or the assembly of sequence reads into longer transcripts de novo. After that, transcript abundances are estimated with read counting and statistical models and differential expression between samples is assessed. Naturally, there are many technologies available for all steps of this pipeline. The quality control and read alignment steps will usually take place outside of R, so analysis in R will begin with a file of transcript or gene annotations (such as GFF and BED files) and a file of aligned reads (such as BAM files).

The tools in R for performing analysis are powerful and flexible. Many of them are part of the Bioconductor suite and, as such, integrate very nicely. The key question researchers wish to answer with RNAseq is usually "Which transcripts are differentially expressed?" In this chapter, we'll look at some recipes for that in standard cases where we already know the genomic positions of the genes we're interested in, and in cases where we need to find unannotated transcripts. We'll also look at other important recipes that help answer the questions "How many replicates are enough?" and "Which allele is expressed more?"

In this chapter, we will cover the following recipes:

Estimating differential expression with edgeR
Estimating differential expression with DESeq2
Power analysis with powsimR
Finding unannotated transcribed regions with GRanges objects
Finding regions showing high expression ab initio with bumphunter
Differential peak analysis
Estimating batch effects using SVA
Finding allele-specific expression with AllelicImbalance
Plotting and presenting RNAseq data

Technical requirements

The sample data you'll need is available from this book's GitHub repository: https://github.com/PacktPublishing/R-Bioinformatics_Cookbook. If you want to use the code examples as they are written, then you will need to make sure that this data is in a sub-directory of whatever your working directory is.

Here are the R packages that you'll need. Most of these will install with install.packages(); others are a little more complicated:

Bioconductor
- AllelicImbalance
- bumphunter
- csaw
- DESeq2
- edgeR
- IRanges
- Rsamtools
- rtracklayer
- sva
- SummarizedExperiment
- VariantAnnotation
dplyr
extRemes
forcats
magrittr
powsimR
readr

Bioconductor is huge and has its own installation manager. You can install it with the following code:

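# Install the BiocManager package from CRAN if it is not already available
if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
# Install (or update) the core Bioconductor packages
BiocManager::install()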

Further information is available at https://www.bioconductor.org/install/.
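Before installing anything, it can help to see which of the packages are already present. The following is a minimal sketch, not part of the original recipe: the package names are copied from the list above (with DESeq2 assumed for the DESeq2 recipe), the helper name is_installed() is made up for illustration, and powsimR is left out on the assumption that it is installed separately from its GitHub repository.

```r
# Packages named in the list above, split by where they are distributed.
bioc_pkgs <- c("AllelicImbalance", "bumphunter", "csaw", "DESeq2", "edgeR",
               "IRanges", "Rsamtools", "rtracklayer", "sva",
               "SummarizedExperiment", "VariantAnnotation")
cran_pkgs <- c("dplyr", "extRemes", "forcats", "magrittr", "readr")

# Hypothetical helper: TRUE where a package can already be loaded.
is_installed <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

bioc_status <- is_installed(bioc_pkgs)
cran_status <- is_installed(cran_pkgs)

# Names of the packages that still need installing.
missing_bioc <- names(bioc_status)[!bioc_status]
missing_cran <- names(cran_status)[!cran_status]
print(missing_bioc)
print(missing_cran)

# Uncomment to install whatever is missing (needs network access):
# if (length(missing_bioc) > 0) BiocManager::install(missing_bioc)
# if (length(missing_cran) > 0) install.packages(missing_cran)
```

The install calls are commented out so that the check itself can be run safely at any time.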

Normally, in R, a user will load a library and use the functions directly by name. This is great in interactive sessions but it can cause confusion when many packages are loaded. To clarify which package and function I'm using at a given moment, I will occasionally use the packageName::functionName() convention.
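As a small illustration of why the convention helps, the sketch below simulates a name clash by defining a function called median() in the workspace. The clashing function is invented for this example; only stats::median() is a real library function.

```r
x <- c(2, 4, 4, 6, 9)

# Fully qualified call: always the median() from the stats package,
# no matter what else is loaded or defined.
qualified <- stats::median(x)

# Simulate a name clash by defining our own median() in the workspace.
median <- function(x) "not the median you wanted"

unqualified <- median(x)         # now resolves to the function defined above
still_stats <- stats::median(x)  # unaffected by the clash

print(qualified)
print(unqualified)
print(still_stats)

rm(median)  # remove the masking definition again
```

The same thing happens silently when two loaded packages export a function with the same name, which is why the qualified form is the safer choice in longer scripts.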

Sometimes, in the middle of a recipe, I'll interrupt the code so you can see some intermediate output or the structure of an object that is important to understand. Whenever that happens, you'll see a code block where each line of output begins with ## (double hash symbols). Consider the following command:

letters[1:5]

This will give us output as follows:

## [1] "a" "b" "c" "d" "e"

Note that the output lines are prefixed with ##.

Estimating differential expression with edgeR

edgeR is a widely used and powerful package that implements negative binomial models suitable for sparse count data such as RNAseq data. It fits these models in a generalized linear model framework, which is powerful for describing and understanding count relationships, and it also provides exact tests for multi-group experiments. It uses a weighted normalization method called TMM (trimmed mean of M-values): the weighted mean of the log ratios between a sample and a reference sample, taken after removing genes with extreme average expression and outlying log ratios. The TMM value should be close to one and is used as a correction factor applied to the overall library sizes.
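To make the role of the TMM factor concrete, here is a minimal sketch on simulated counts. The data, sample names, and group labels are all invented for illustration; DGEList() and calcNormFactors() are the edgeR functions that build the count object and compute the factors.

```r
library(edgeR)

set.seed(1)
n_genes <- 1000

# Simulated counts: genes in rows, four samples in columns.
counts <- matrix(rnbinom(n_genes * 4, mu = 50, size = 5), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", 1:4)))
# Give sample4 twice the sequencing depth of the others.
counts[, 4] <- counts[, 4] * 2L

group <- factor(c("control", "control", "treated", "treated"))
dge <- DGEList(counts = counts, group = group)

# Compute one TMM normalization factor per sample.
dge <- calcNormFactors(dge, method = "TMM")
print(dge$samples[, c("lib.size", "norm.factors")])

# The factor scales the raw library size into an effective library size.
effective_lib_size <- dge$samples$lib.size * dge$samples$norm.factors
print(effective_lib_size)
```

The difference in sequencing depth is carried by lib.size, so the factors themselves describe only compositional differences between samples. edgeR also rescales them so that their product is one.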

In this recipe, we'll go from preparing read counts for annotated regions in an R object to identifying the differentially expressed features in a genome. Usually, there is an upstream step in which we take high-throughput sequence reads, align them to a reference, and produce files describing those alignments, such as .bam files. With those files prepared, we'd fire up R and start to analyze. So that we can concentrate on the differential expression analysis part of the process, we'll use a prepared dataset for which all of the data is ready. If you want to know how to go from raw data to this stage, see Chapter 8, Working with Databases and Remote Data Sources.
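The general shape of such an analysis can be sketched on simulated data. This is not the recipe's own code: the counts, the two groups A and B, and the choice of the quasi-likelihood test are assumptions made for illustration, and the first 100 genes are deliberately simulated as four-fold higher in group B.

```r
library(edgeR)

set.seed(42)
n_genes <- 2000
group <- factor(rep(c("A", "B"), each = 3))

# Simulated mean expression: the same per-gene mean in all six samples.
mu <- matrix(rep(runif(n_genes, 5, 500), times = 6), nrow = n_genes)
# Make the first 100 genes four-fold higher in group B (columns 4 to 6).
mu[1:100, 4:6] <- mu[1:100, 4:6] * 4

# Negative binomial counts: genes in rows, samples in columns.
counts <- matrix(rnbinom(n_genes * 6, mu = mu, size = 10), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", 1:6)))

dge <- DGEList(counts = counts, group = group)
keep <- filterByExpr(dge)                   # drop genes with too few reads
dge <- dge[keep, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge)                 # TMM normalization
design <- model.matrix(~ group)
dge <- estimateDisp(dge, design)            # negative binomial dispersions
fit <- glmQLFit(dge, design)
result <- glmQLFTest(fit, coef = 2)         # test group B against group A
top <- topTags(result, n = 10)
print(top)
```

With real data, only the construction of counts and group changes; the remaining steps operate on the DGEList object in the same way.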

As there are many different tools and methods for getting those alignments of reads, we will look at starting the process with two common input object types. We'll use a count table, like the one we would have if we were loading from a text file, and we'll use an ExpressionSet (eset) object, which is an object type common in Bioconductor.
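The relationship between the two input types can be seen in a toy example. Everything here is invented for illustration (four genes, three samples, and a hypothetical stage column); ExpressionSet(), AnnotatedDataFrame(), exprs(), and pData() come from the Biobase package.

```r
library(Biobase)

# A tiny count matrix: genes in rows, samples in columns.
counts <- matrix(c(10L, 0L, 25L, 3L,
                   12L, 1L, 30L, 5L,
                    8L, 2L, 40L, 0L),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))

# Sample annotation; row names must match the matrix column names.
pheno <- data.frame(stage = c("embryo", "embryo", "adult"),
                    row.names = colnames(counts))

# Input type 1: an ExpressionSet bundling counts and sample annotation.
eset <- ExpressionSet(assayData = counts,
                      phenoData = AnnotatedDataFrame(pheno))
count_matrix <- exprs(eset)   # the count matrix back out of the object
sample_info <- pData(eset)    # the sample annotation as a data frame

# Input type 2: a count table, as it might look after reading a text file
# with a leading gene identifier column.
count_table <- data.frame(gene = rownames(counts), counts)
from_table <- as.matrix(count_table[, -1])
rownames(from_table) <- count_table$gene

print(count_matrix)
print(sample_info)
```

Either route ends in the same numeric matrix of counts, which is what the downstream analysis needs.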

Our prepared dataset will be the modencodefly data from the NHGRI model organism Encyclopedia of DNA Elements (modENCODE) project for the model organism Drosophila melanogaster. You can read about this project at www.modencode.org. The dataset contains 147 different samples for D. melanogaster, a fruit fly with an approximately 110 Mbp genome, annotated with about 15,000 gene features.

Getting ready

The data is provided as both a count matrix and an ExpressionSet object; see the Appendix at the end of this book for further information on these object types. The data is in this book's code and data repository at https://github.com/PacktPublishing/R_Bioinformatics_Cookbook under datasets/ch1/modencodefly_eset.RData, datasets/ch1/modencodefly_count_table.txt, and datasets/ch1/modencodefly_phenodata.txt. We'll also use the edgeR (from Bioconductor), readr, and magrittr libraries.

How to do it...

We will see two ways of estimating differential expression with edgeR.

Using edgeR from a count table

To estimate differential expression with edgeR from a count table (for example, in a text file), we will use the following steps:

Load the count data:

count_dataframe <- readr::read_tsv(file.path(getwd(), "datasets", "ch1"...