eBook - ePub

R Bioinformatics Cookbook

Name: R Bioinformatics Cookbook
Author: Dan MacLean

Use R and Bioconductor to perform RNAseq, genomics, data visualization, and bioinformatic analysis

Dan MacLean

316 pages
English

About this book

Over 60 recipes to model and handle real-life biological data using modern libraries from the R ecosystem

Key Features

Apply modern R packages to handle biological data using real-world examples
Represent biological data with advanced visualizations suitable for research and publications
Handle real-world problems in bioinformatics such as next-generation sequencing, metagenomics, and automating analyses

Book Description

Handling biological data effectively requires an in-depth knowledge of machine learning techniques and computational skills, along with an understanding of how to use tools such as edgeR and DESeq. With the R Bioinformatics Cookbook, you'll explore all this and more, tackling common and not-so-common challenges in the bioinformatics domain using real-world examples.

This book will use a recipe-based approach to show you how to perform practical research and analysis in computational biology with R. You will learn how to effectively analyze your data with the latest tools in Bioconductor, ggplot, and tidyverse. The book will guide you through the essential tools in Bioconductor to help you understand and carry out protocols in RNAseq, phylogenetics, genomics, and sequence analysis. As you progress, you will get up to speed with how machine learning techniques can be used in the bioinformatics domain. You will gradually develop key computational skills such as creating reusable workflows in R Markdown and packages for code reuse.

By the end of this book, you'll have gained a solid understanding of the most important and widely used techniques in bioinformatic analysis and the tools you need to work with real biological data.

What you will learn

Employ Bioconductor to determine differential expression in RNAseq data
Run SAMtools and develop pipelines to find single nucleotide polymorphisms (SNPs) and Indels
Use ggplot to create and annotate a range of visualizations
Query external databases with Ensembl to find functional genomics information
Execute large-scale multiple sequence alignment with DECIPHER to perform comparative genomics
Use d3.js and Plotly to create dynamic and interactive web graphics
Use k-nearest neighbors, support vector machines, and random forests to find groups and classify data

Who this book is for

This book is for bioinformaticians, data analysts, researchers, and R developers who want to address intermediate-to-advanced biological and bioinformatics problems by learning through a recipe-based approach. A working knowledge of the R programming language and a basic knowledge of bioinformatics are prerequisites.

Information

Publisher: Packt Publishing
Year: 2019
ISBN: 9781789955590
Subject: Computer Science
Subtopic: Bioinformatics

Performing Quantitative RNAseq

The technology of RNAseq has revolutionized the study of transcript abundances, bringing high-sensitivity detection and high-throughput analysis. Bioinformatic analysis pipelines using RNAseq data typically start with a read quality control step, followed by either alignment to a reference or the de novo assembly of sequence reads into longer transcripts. After that, transcript abundances are estimated with read counting and statistical models, and differential expression between samples is assessed. Naturally, there are many technologies available for all steps of this pipeline. The quality control and read alignment steps will usually take place outside of R, so analysis in R will begin with a file of transcript or gene annotations (such as GFF and BED files) and a file of aligned reads (such as BAM files).

The tools in R for performing analysis are powerful and flexible. Many of them are part of the Bioconductor suite and, as such, integrate together very nicely. The key question researchers wish to answer with RNAseq is usually: "Which transcripts are differentially expressed?" In this chapter, we'll look at some recipes for that in standard cases where we already know the genomic positions of genes we're interested in, and in cases where we need to find unannotated transcripts. We'll also look at other important recipes that help answer the questions "How many replicates are enough?" and "Which allele is expressed more?"

In this chapter, we will cover the following recipes:

Estimating differential expression with edgeR
Estimating differential expression with DESeq2
Power analysis with powsimR
Finding unannotated transcribed regions with GRanges objects
Finding regions showing high expression ab initio with bumphunter
Differential peak analysis
Estimating batch effects using SVA
Finding allele-specific expression with AllelicImbalance
Plotting and presenting RNAseq data

Technical requirements

The sample data you'll need is available from this book's GitHub repository: https://github.com/PacktPublishing/R-Bioinformatics_Cookbook. If you want to use the code examples as they are written, then you will need to make sure that this data is in a sub-directory of whatever your working directory is.

Here are the R packages that you'll need. Most of these will install with install.packages(); others are a little more complicated:

Bioconductor
- AllelicImbalance
- bumphunter
- csaw
- DESeq2
- edgeR
- IRanges
- Rsamtools
- rtracklayer
- sva
- SummarizedExperiment
- VariantAnnotation
dplyr
extRemes
forcats
magrittr
powsimR
readr

Bioconductor is huge and has its own installation manager. You can install it with the following code:

if (!requireNamespace("BiocManager")) install.packages("BiocManager")
BiocManager::install()

Further information is available at https://www.bioconductor.org/install/.
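Individual Bioconductor packages are installed through the same manager. The following is a minimal sketch, assuming you only want to install what is not already present. The helper names missing_packages() and install_missing_bioc() are hypothetical and are not part of BiocManager; only requireNamespace() and BiocManager::install() are real calls.

```r
# Hypothetical helper: return the packages in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only the missing packages through BiocManager.
install_missing_bioc <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    BiocManager::install(to_install)
  }
  invisible(to_install)
}

# Example call (commented out because it contacts the Bioconductor servers):
# install_missing_bioc(c("edgeR", "sva", "Rsamtools"))
```

Checking first avoids reinstalling large packages every time a script is run.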

Normally, in R, a user will load a library and use the functions directly by name. This is great in interactive sessions but it can cause confusion when many packages are loaded. To clarify which package and function I'm using at a given moment, I will occasionally use the packageName::functionName() convention.
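To see why the explicit form helps, here is a small sketch using only base R. The masking function is a made-up stand-in for what happens when two loaded packages export a function with the same name.

```r
# A user-defined function that masks stats::median() in this session.
median <- function(x) "my own median"

# The bare name now resolves to the masking definition above.
masked_result <- median(c(1, 5, 9))

# The package-qualified name always resolves to the stats package version.
explicit_result <- stats::median(c(1, 5, 9))

# Remove the masking definition so later code sees stats::median() again.
rm(median)
```

Here masked_result holds the string returned by the local function, while explicit_result holds the numeric median, 5.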

Sometimes, in the middle of a recipe, I'll interrupt the code so you can see some intermediate output or the structure of an object it's important to understand. Whenever that happens, you'll see a code block where each line begins with ## (double hash symbols). Consider the following command:

letters[1:5]

This will give us output as follows:

## [1] "a" "b" "c" "d" "e"

Note that the output lines are prefixed with ##.

Estimating differential expression with edgeR

edgeR is a widely used and powerful package that implements negative binomial models suitable for sparse count data such as RNAseq data. It works within a generalized linear model framework, which is powerful for describing and understanding count relationships, and it also provides exact tests for multi-group experiments. It uses a weighted normalization called TMM (trimmed mean of M-values): the weighted mean of the log ratios between a sample and a reference sample, calculated after removing genes with high counts and outlying log ratios. The TMM value should be close to one, and it is used as a correction factor applied to the overall library sizes.
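The idea behind the normalization described above can be sketched in base R. This is a simplified illustration and not edgeR's implementation: the function name tmm_factor_sketch() is hypothetical, the trimming uses plain quantiles, and the default trim fractions (30% of log ratios and 5% of abundances from each end) are assumptions chosen for the example. In practice, you would let edgeR compute the factors for you.

```r
# Simplified TMM-style factor for one sample against a reference sample.
tmm_factor_sketch <- function(sample_counts, ref_counts,
                              trim_m = 0.3, trim_a = 0.05) {
  lib_sample <- sum(sample_counts)
  lib_ref <- sum(ref_counts)

  # Keep only genes observed in both samples so the log ratios are finite.
  keep <- sample_counts > 0 & ref_counts > 0
  y <- sample_counts[keep]
  r <- ref_counts[keep]

  # M: log2 ratio of library-size-scaled counts; A: average log2 abundance.
  m <- log2((y / lib_sample) / (r / lib_ref))
  a <- 0.5 * log2((y / lib_sample) * (r / lib_ref))

  # Approximate variance of each M value, used for inverse-variance weights.
  v <- (lib_sample - y) / (lib_sample * y) + (lib_ref - r) / (lib_ref * r)

  # Trim genes with extreme log ratios (M) or extreme abundance (A).
  in_range <- function(x, trim) {
    x >= stats::quantile(x, trim, names = FALSE) &
      x <= stats::quantile(x, 1 - trim, names = FALSE)
  }
  kept <- in_range(m, trim_m) & in_range(a, trim_a)

  # Weighted mean of the remaining log ratios, returned on the ratio scale.
  2^(sum(m[kept] / v[kept]) / sum(1 / v[kept]))
}

# Same composition at twice the sequencing depth: the factor is close to 1.
ref <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
same_composition <- tmm_factor_sketch(ref * 2, ref)

# One dominant gene inflates the library size, so the factor falls below 1.
one_dominant_gene <- tmm_factor_sketch(c(rep(100, 19), 10000), rep(100, 20))
```

The second case shows why the correction matters: a single very highly expressed gene takes up most of the reads, which would otherwise make every other gene look under-expressed.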

In this recipe, we'll look at the steps from preparing read counts for annotated regions to identifying the differentially expressed features in a genome. Usually, there is an upstream step in which we take high-throughput sequence reads, align them to a reference, and produce files describing those alignments, such as .bam files. With those files prepared, we'd fire up R and start to analyze. So that we can concentrate on the differential expression analysis part of the process, we'll use a prepared dataset for which all of the data is ready. If you want to see how to go from raw data to this stage, see Chapter 8, Working with Databases and Remote Data Sources.

As there are many different tools and methods for getting those alignments of reads, we will look at starting the process with two common input object types: a count table, like the one we would have if we were loading from a text file, and an ExpressionSet (eset) object, which is an object type common in Bioconductor.
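As a rough picture of the first input type, the sketch below builds a tiny count table from tab-separated text and turns it into a numeric matrix with one row per feature and one column per sample. The gene names, sample names, and counts are invented for illustration and are not taken from the modencodefly data.

```r
# Hypothetical tab-separated count table, as it might appear in a text file.
toy_lines <- c(
  "gene\tsample_A\tsample_B\tsample_C",
  "gene_1\t10\t12\t0",
  "gene_2\t250\t300\t275",
  "gene_3\t0\t1\t3"
)

# Read the text as a data frame; the first column holds the feature names.
toy_dataframe <- utils::read.delim(text = toy_lines, stringsAsFactors = FALSE)

# Drop the name column and keep the counts as a matrix with named rows.
toy_matrix <- as.matrix(toy_dataframe[, -1])
rownames(toy_matrix) <- toy_dataframe$gene

toy_matrix["gene_2", "sample_B"]
```

A matrix of this shape, with features in rows and samples in columns, is the general layout that count-based tools expect.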

Our prepared dataset will be the modencodefly data from the NHGRI Encyclopedia of DNA Elements project for the model organism Drosophila melanogaster. You can read about this project at www.modencode.org. The dataset contains 147 different samples for D. melanogaster, a fruit fly with a genome of approximately 110 Mbp, annotated with about 15,000 gene features.

Getting ready

The data is provided as both a count matrix and an ExpressionSet object; see the Appendix at the end of this book for further information on these object types. The data is in this book's code and data repository at https://github.com/PacktPublishing/R_Bioinformatics_Cookbook under datasets/ch1/modencodefly_eset.RData, datasets/ch1/modencodefly_count_table.txt, and datasets/ch1/modencodefly_phenodata.txt. We'll also use the edgeR (from Bioconductor), readr, and magrittr libraries.
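Before starting, it can help to confirm that the data files are where the code expects them. The following is a minimal sketch, assuming the repository's datasets directory sits under your working directory; the helper name missing_data_files() is hypothetical.

```r
# Hypothetical helper: report which expected files are absent from a directory.
missing_data_files <- function(files,
                               data_dir = file.path(getwd(), "datasets", "ch1")) {
  paths <- file.path(data_dir, files)
  files[!file.exists(paths)]
}

# Two of the files named above; an empty result means both were found.
missing_data_files(c("modencodefly_eset.RData", "modencodefly_count_table.txt"))
```

If the result is not empty, check your working directory with getwd() before running the recipe.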

How to do it...

We will see two ways of estimating differential expression with edgeR.

Using edgeR from a count table

To estimate differential expression with edgeR from a count table (for example, in a text file), we will use the following steps:

Load the count data:

count_dataframe <- readr::read_tsv(file.path(getwd(), "datasets", "ch1"...