R Bioinformatics Cookbook
eBook - ePub

R Bioinformatics Cookbook

Use R and Bioconductor to perform RNAseq, genomics, data visualization, and bioinformatic analysis

Dan MacLean

Partager le livre
  1. 316 pages
  2. English
  3. ePUB (adapté aux mobiles)
  4. Disponible sur iOS et Android
eBook - ePub

R Bioinformatics Cookbook

Use R and Bioconductor to perform RNAseq, genomics, data visualization, and bioinformatic analysis

Dan MacLean

DĂ©tails du livre
Aperçu du livre
Table des matiĂšres
Citations

À propos de ce livre

Over 60 recipes to model and handle real-life biological data using modern libraries from the R ecosystem

Key Features

  • Apply modern R packages to handle biological data using real-world examples
  • Represent biological data with advanced visualizations suitable for research and publications
  • Handle real-world problems in bioinformatics such as next-generation sequencing, metagenomics, and automating analyses

Book Description

Handling biological data effectively requires an in-depth knowledge of machine learning techniques and computational skills, along with an understanding of how to use tools such as edgeR and DESeq. With the R Bioinformatics Cookbook, you'll explore all this and more, tackling common and not-so-common challenges in the bioinformatics domain using real-world examples.

This book will use a recipe-based approach to show you how to perform practical research and analysis in computational biology with R. You will learn how to effectively analyze your data with the latest tools in Bioconductor, ggplot, and tidyverse. The book will guide you through the essential tools in Bioconductor to help you understand and carry out protocols in RNAseq, phylogenetics, genomics, and sequence analysis. As you progress, you will get up to speed with how machine learning techniques can be used in the bioinformatics domain. You will gradually develop key computational skills such as creating reusable workflows in R Markdown and packages for code reuse.

By the end of this book, you'll have gained a solid understanding of the most important and widely used techniques in bioinformatic analysis and the tools you need to work with real biological data.

What you will learn

  • Employ Bioconductor to determine differential expressions in RNAseq data
  • Run SAMtools and develop pipelines to find single nucleotide polymorphisms (SNPs) and Indels
  • Use ggplot to create and annotate a range of visualizations
  • Query external databases with Ensembl to find functional genomics information
  • Execute large-scale multiple sequence alignment with DECIPHER to perform comparative genomics
  • Use d3.js and Plotly to create dynamic and interactive web graphics
  • Use k-nearest neighbors, support vector machines and random forests to find groups and classify data

Who this book is for

This book is for bioinformaticians, data analysts, researchers, and R developers who want to address intermediate-to-advanced biological and bioinformatics problems by learning through a recipe-based approach. Working knowledge of R programming language and basic knowledge of bioinformatics are prerequisites.

Foire aux questions

Comment puis-je résilier mon abonnement ?
Il vous suffit de vous rendre dans la section compte dans paramĂštres et de cliquer sur « RĂ©silier l’abonnement ». C’est aussi simple que cela ! Une fois que vous aurez rĂ©siliĂ© votre abonnement, il restera actif pour le reste de la pĂ©riode pour laquelle vous avez payĂ©. DĂ©couvrez-en plus ici.
Puis-je / comment puis-je télécharger des livres ?
Pour le moment, tous nos livres en format ePub adaptĂ©s aux mobiles peuvent ĂȘtre tĂ©lĂ©chargĂ©s via l’application. La plupart de nos PDF sont Ă©galement disponibles en tĂ©lĂ©chargement et les autres seront tĂ©lĂ©chargeables trĂšs prochainement. DĂ©couvrez-en plus ici.
Quelle est la différence entre les formules tarifaires ?
Les deux abonnements vous donnent un accĂšs complet Ă  la bibliothĂšque et Ă  toutes les fonctionnalitĂ©s de Perlego. Les seules diffĂ©rences sont les tarifs ainsi que la pĂ©riode d’abonnement : avec l’abonnement annuel, vous Ă©conomiserez environ 30 % par rapport Ă  12 mois d’abonnement mensuel.
Qu’est-ce que Perlego ?
Nous sommes un service d’abonnement Ă  des ouvrages universitaires en ligne, oĂč vous pouvez accĂ©der Ă  toute une bibliothĂšque pour un prix infĂ©rieur Ă  celui d’un seul livre par mois. Avec plus d’un million de livres sur plus de 1 000 sujets, nous avons ce qu’il vous faut ! DĂ©couvrez-en plus ici.
Prenez-vous en charge la synthÚse vocale ?
Recherchez le symbole Écouter sur votre prochain livre pour voir si vous pouvez l’écouter. L’outil Écouter lit le texte Ă  haute voix pour vous, en surlignant le passage qui est en cours de lecture. Vous pouvez le mettre sur pause, l’accĂ©lĂ©rer ou le ralentir. DĂ©couvrez-en plus ici.
Est-ce que R Bioinformatics Cookbook est un PDF/ePUB en ligne ?
Oui, vous pouvez accĂ©der Ă  R Bioinformatics Cookbook par Dan MacLean en format PDF et/ou ePUB ainsi qu’à d’autres livres populaires dans Computer Science et Bioinformatics. Nous disposons de plus d’un million d’ouvrages Ă  dĂ©couvrir dans notre catalogue.

Informations

Année
2019
ISBN
9781789955590
Édition
1
Sous-sujet
Bioinformatics

Performing Quantitative RNAseq

The technology of RNAseq has revolutionized the study of transcript abundances, bringing high-sensitivity detection and high-throughput analysis. Bioinformatic analysis pipelines using RNAseq data typically start with a read quality control step followed by either alignment to a reference or the assembly of sequence reads into longer transcripts de novo. After that, transcript abundances are estimated with read counting and statistical models and differential expression between samples is assessed. Naturally, there are many technologies available for all steps of this pipeline. The quality control and read alignment steps will usually take place outside of R, so analysis in R will begin with a file of transcript or gene annotations (such as GFF and BED files) and a file of aligned reads (such as BAM files).
The tools in R for performing analysis are powerful and flexible. Many of them are part of the Bioconductor suite and, as such, integrate together very nicely. The key question researchers wish to answer with RNAseq is usually: Which transcripts are differentially expressed? In this chapter, we'll look at some recipes for that in standard cases where we already know the genomic positions of genes we're interested in, and in cases where we need to find unannotated transcripts. We'll also look at other important recipes that help answer the questions How many replicates are enough? and Which allele is expressed more?
In this chapter, we will cover the following recipes:
  • Estimating differential expression with edgeR
  • Estimating differential expression with DESeq2
  • Power analysis with powsimR
  • Finding unannotated transcribed regions with GRanges objects
  • Finding regions showing high expression ab initio with bumphunter
  • Differential peak analysis
  • Estimating batch effects using SVA
  • Finding allele-specific expression with AllelicImbalance
  • Plotting and presenting RNAseq data

Technical requirements

The sample data you'll need is available from this book's GitHub repository: https://github.com/PacktPublishing/R-Bioinformatics_Cookbook. If you want to use the code examples as they are written, then you will need to make sure that this data is in a sub-directory of whatever your working directory is.
Here are the R packages that you'll need. Most of these will install with install.packages(); others are a little more complicated:
  • Bioconductor
    • AllelicImbalance
    • bumphunter
    • csaw
    • DESeq
    • edgeR
    • IRanges
    • Rsamtools
    • rtracklayer
    • sva
    • SummarizedExperiment
    • VariantAnnotation
  • dplyr
  • extRemes
  • forcats
  • magrittr
  • powsimR
  • readr
Bioconductor is huge and has its own installation manager. You can install it with the following code:
if (!requireNamespace("BiocManager")) install.packages("BiocManager") BiocManager::install()
Further information is available at https://www.bioconductor.org/install/.
Normally, in R, a user will load a library and use the functions directly by name. This is great in interactive sessions but it can cause confusion when many packages are loaded. To clarify which package and function I'm using at a given moment, I will occasionally use the packageName::functionName() convention.
Sometimes, in the middle of a recipe, I'll interrupt the code so you can see some intermediate output or the structure of an object it's important to understand. Whenever that happens, you'll see a code block where each line begins with ## (double hash symbols). Consider the following command:

letters[1:5]
This will give us output as follows:
## a b c d e
Note that the output lines are prefixed with ##.

Estimating differential expression with edgeR

edgeR is a widely used and powerful package that implements negative binomial models suitable for sparse count data such as RNAseq data in a general linear model framework, which are powerful for describing and understanding count relationships and exact tests for multi-group experiments. It uses a weighted style normalization called TMM, which is the weighted mean of log ratio between sample and control, after removal of genes with high counts and outlying log ratios. The TMM value should be close to one, but can be used as a correction factor to be applied to overall library sizes
In this recipe, we'll look at some options from preparing read counts for annotated regions in some object to identifying the differentially expressed features in a genome. Usually, there is an upstream step requiring us to take high-throughput sequence reads, align them to a reference and produce files describing those alignments, such as .bam files. With those files prepared, we'd fire up R and start to analyze. So that we can concentrate on the differential expression analysis part of the process, we'll use a prepared dataset for which all of the data is ready. Chapter 8, Working with Databases and Remote Data Sources, shows you how to go from raw data to this stage if you're looking for how to do that step.
As there are many different tools and methods for getting those alignments of reads, we will look at starting the process with two common input object types. We'll use a count table, like that we would have if we were loading from a text file and we'll use an ExpressionSet (eset) object, which is an object type common in Bioconductor.
Our prepared dataset will be the modencodefly data from the NHGRI encyclopedia of DNA elements project for the model organism, Drosophila melanogaster. You can read about this project at www.modencode.org. The dataset contains 147 different samples for D. melanogaster, a fruit fly with an approximately 110 Mbp genome, annotated with about 15,000 gene features.

Getting ready

The data is provided as both a count matrix and an ExpressionSet object and you can see the Appendix at the end of this book for further information on these object types. The data is in this book's code and data repository at https://github.com/PacktPublishing/R_Bioinformatics_Cookbook under datasets/ch1/modencodefly_eset.RData, datasets/ch1/modencodefly_count_table.txt, and datasets/ch1/modencodelfy_phenodata.txt . We'll also use the edgeR (from Bioconductor), readr, and magrittr libraries.

How to do it...

We will see two ways of estimating differential expressions with edgeR.

Using edgeR from a count table

For estimating differential expressions with edgeR from a count table (for example, in a text file), we will use the following steps:
  1. Load the count data:
count_dataframe <- readr::read_tsv(file.path(getwd(), "datasets", "ch1"...

Table des matiĂšres