1 Assessing the Microbiome: Current and Future Technologies and Applications
Thomas Gurry, Shrish Budree, Alim Ladha, Bharat Ramakrishna and Zain Kassam
CONTENTS
Methods for Sequencing the Microbiome
16S rRNA Sequencing
Shotgun Metagenomic Sequencing
Generating Data from Samples
Common Descriptive Analysis Techniques for Microbiome Data
Diversity Analysis
Relative Abundance Plots
Functional Genomics
Critical Appraisal of Microbiome Data
Simple, Community-Level Analyses: 16S rRNA Sequencing
Detailed Metagenomic Analyses: Shotgun Metagenomic Sequencing
References
Andrew is an energetic 7-year-old, but his parents suspect something isn't quite right. His eyes always dart away, never quite meeting their gaze. He watches the same repeating eight seconds of a bright red monster truck squashing the shells of abandoned cars. He dutifully watches every day for exactly three hours. Andrew's parents are worried and bring him to see his pediatrician, Dr. Sara McDonald, who suspects that Andrew has autism spectrum disorder (ASD), a complex neurodevelopmental disorder. In the past, Dr. McDonald would have had to perform an in-depth, time-consuming evaluation of diagnostic criteria because there were no laboratory or diagnostic tests for ASD. But today, in 2040, the world is different. Dr. McDonald sends Andrew's stool sample off to a lab for microbiome sequencing, and the microbial profile comes back with decreased Lachnospiraceae and Dialister bacteria, a finding which supports a diagnosis of ASD. Crazy as it may seem, a world where a patient's microbial signature deeply impacts their care is not far from reality. The goal of this chapter is to highlight the current methods for sequencing the microbiome, including common analyses, to understand potential future advances in the field, and to arm clinicians with practical knowledge for critically appraising extant and future literature surrounding the microbiome.
Recent advances in microbiome sequencing technology have led to an explosion of intestinal microbiome research (Lloyd-Price et al. 2016 and Jovel et al. 2016). Innovation in bacterial DNA sequencing methods has allowed researchers to describe the intestinal microbial community with unparalleled ease and precision. Previously, the identification of intestinal microbes was performed using culture-based techniques, which were limited both in resolution and throughput. Furthermore, bacterial sequencing has led to the identification of many bacterial species that were previously unculturable (Knight et al. 2017). These technological breakthroughs have opened new avenues through which to explore the relationship between the gut microbiome and the human host, including the role of gut bacteria in the pathogenesis of disease.
Although multiple sequencing technologies were at the forefront of this developing field, the Illumina sequencing platform has undoubtedly outperformed all others in terms of cost, reliability, user interface, and data quality. Illumina is now considered the gold standard technology for performing the two most common methods of bacterial DNA sequencing: 16S rRNA sequencing and shotgun metagenomic sequencing (Knight et al. 2017).
Methods for Sequencing the Microbiome
The term "sequencing" describes the scientific technique of determining the order of nucleotides in a given sample's genomic material (e.g., bacterial DNA/RNA). This genetic information can, in turn, be used to describe the identity, population distribution, and, as discussed later in this chapter, the complex functional characteristics of the host microbiome. 16S rRNA sequencing and shotgun metagenomic sequencing differ significantly in cost, resolution, and difficulty. Therefore, investigators deciding which technique to use must consider numerous factors, including the experimental question being explored, the samples being analyzed, and the total budget.
16S rRNA Sequencing
In bacteria and archaea, the 16S rRNA gene contains both highly conserved and variable regions. Although this gene is always present in bacteria, differences in its variable regions correlate with specific bacterial taxa. 16S rRNA sequencing works by leveraging the known sequence of highly conserved regions of the 16S rRNA gene to amplify and sequence the variable regions (often regions V3, V4, and V5) in order to accurately characterize a sample's microbial community (Knight et al. 2017 and Olsen 2016).
16S rRNA sequencing is a relatively simple technique with many advantages over more complex sequencing strategies, including its low cost, standardized protocols (covering sample preparation, sequencing, and downstream analysis), and high-quality reference databases against which to map obtained sequence data. However, limiting the scope of sequenced DNA to the 16S gene generally allows identification only at the level of bacterial genera, without species-level resolution. For example, a sequence read of the V4 region may include hits to multiple species in a reference database, restricting an investigator's ability to draw conclusions about the specific species associated with the sequence read (Jovel et al. 2016 and Olsen 2016).
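The ambiguity described in the V4 example can be made concrete with a small sketch. It assumes a read has already been matched equally well to several reference entries, and it simply reports the deepest taxonomic rank on which all hits agree. The function name, the rank list, and the two example hits are illustrative assumptions, not the output of any particular classifier or database.

```python
# Ranks from broadest to most specific (an assumed seven-rank lineage).
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def consensus_taxonomy(hits):
    """Return the part of the lineage shared by every hit, broadest rank first."""
    shared = []
    for level in zip(*hits):
        if len(set(level)) == 1:
            shared.append(level[0])
        else:
            break  # the hits disagree from this rank downward
    return shared


# Hypothetical case: one read matches two species of the same genus equally well.
hits = [
    ["Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
     "Bacteroidaceae", "Bacteroides", "Bacteroides fragilis"],
    ["Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
     "Bacteroidaceae", "Bacteroides", "Bacteroides thetaiotaomicron"],
]

shared = consensus_taxonomy(hits)
rank = RANKS[len(shared) - 1] if shared else None
print(rank, shared[-1] if shared else None)
```

In this hypothetical case the read can be reported only as genus Bacteroides, which is the kind of genus-level ceiling the paragraph above describes.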
Shotgun Metagenomic Sequencing
Shotgun metagenomic sequencing, commonly referred to as whole-genome sequencing, determines the nucleotide sequence of all the genomic material present in a sample. The DNA in a sample is too lengthy to amplify and sequence in one piece, so sample DNA is typically fragmented before being sequenced. The process of piecing sequence fragments back together requires both deep expertise and significant computational power. Once the fragments are reassembled, the whole-genome sequence reads can be mapped against a reference database of known bacterial sequences to determine the microbial community. Marker genes, specific genes that are well characterized and sequenced across multiple bacterial strains and species, are mapped to a reference database to identify microbes. Because this method sequences entire bacterial genomes rather than a single gene, it enables much higher-resolution characterization of the microbial community, allowing investigators to draw more confident conclusions about the species, and in some cases, the specific strains, present in a sample (Franzosa et al. 2016).
While shotgun metagenomics can increase the sequence resolution, it also generates vast amounts of "noisy" data. Therefore, significant computational expertise is required to clean and filter the resulting data into a more usable form. Furthermore, the added complexity of analysis means that significant variation can be introduced into the results by divergent analysis techniques. Shotgun metagenomics is also relatively expensive, often limiting study size and, consequently, the statistical power of studies employing this sequencing technique.
Generating Data from Samples
Processing biological samples (e.g., stool or skin) for sequencing begins with the extraction of DNA from bacterial cells. This can be achieved by chemically dissolving the bacterial membrane, bursting the membrane using physical force, or a combination of the two (Olsen 2016). This process is commonly referred to as bacterial cell lysis. Cell lysis is also often accompanied by methods aimed at separating DNA from other components inside the cell membranes, including proteins, lipids, and other cell lysates. It is important to note that one of the main sources of variation in sequencing data stems from the different approaches to DNA extraction, as this step is a primary determinant of DNA purity and integrity (Debelius et al. 2016).
DNA extraction is followed by polymerase chain reaction (PCR) amplification. In the case of 16S rRNA sequencing, primers first bind to the conserved regions of the 16S gene and are subsequently extended into the 16S variable regions using a specifically engineered DNA polymerase enzyme. This process creates amplified sequences of the variable 16S regions, referred to as amplicons. In shotgun metagenomics, PCR is used to amplify the fragmented DNA sequences from the sample. Amplified sequences are then tagged with sample-specific barcodes, which facilitate multiplexing, a process in which multiple samples are run in a single Illumina sequencing lane, significantly increasing the sequencing throughput and reducing the overall cost of analysis. In the final preparatory step, adapters, which are required for binding to the Illumina flow cell, are added to the amplicon sequences. Once this is completed, the sample is ready for sequencing.
In a process similar to PCR, the Illumina platform identifies the exact nucleotide sequence of amplicons and amplified fragments by monitoring fluorescent output. Barcoded and adapter-modified nucleotide sequences are amplified using fluorescently labeled nucleotides, emitting a unique fluorescent pattern that can be directly translated into a sequence readout. The Illumina sequencing platform outputs FASTQ files, which contain the "raw" data consisting of both sequence reads and accompanying quality scores. Finally, using open-source computational pipelines, the raw data can be quality trimmed and filtered.
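To illustrate what quality filtering of FASTQ data involves, the following is a minimal sketch rather than a substitute for an established pipeline. It assumes the common four-line FASTQ record layout and Phred+33 quality encoding, and it discards reads whose mean quality score falls below a cutoff. The function names, the cutoff of 20, and the two example reads are assumptions made for illustration.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of file
        sequence = handle.readline().strip()
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().strip()
        yield header[1:], sequence, quality  # drop the leading '@'


def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality):
    scores = phred_scores(quality)
    return sum(scores) / len(scores) if scores else 0.0


def filter_reads(records, min_mean_quality=20):
    """Keep only reads whose mean Phred score meets the cutoff."""
    return [(rid, seq) for rid, seq, qual in records
            if mean_quality(qual) >= min_mean_quality]


# Two made-up reads: 'I' encodes Phred 40 (high quality), '#' encodes Phred 2 (low quality).
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGTTTGA\n+\n########\n"
)
kept = filter_reads(parse_fastq(example))
print(kept)  # only read1 passes the cutoff
```

Real pipelines typically also trim low-quality ends and remove adapter and barcode sequences; this sketch shows only the whole-read filtering step.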
In 16S rRNA sequencing, characterization of the microbial community using the sequence data begins by either clustering the 16S reads or comparing individual reads to a reference database. Clustering can be done using various computational methods, but the output is generally the same: groups of sequences (called operational taxonomic units or OTUs) that meet a threshold criterion for similarity (usually 97%). The defined OTUs are then mapped to a reference database to assign a taxonomic classification (Jovel et al. 2016 and Olsen 2016). Alternatively, sequence reads can be directly mapped to a reference database without previous clustering to identify groups of OTUs and their most likely taxonomic classifications. Both methods are valid but often produce divergent results. The final product, in either case, is referred to as an OTU table, which contains the abundance of each identified OTU and its corresponding taxonomic classification. Using common descriptive analysis techniques, which are described later, the processed data are analyzed to answer experimental questions. More advanced comparative statistics, beyond the scope of this discussion, including linear modeling, may also be performed to identify statistically significant differences in microbial communities between samples or across clinical covariates.
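The clustering route to an OTU table can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm used by any specific tool: it assumes all reads have the same length, measures similarity as the fraction of matching positions without alignment, and greedily assigns each distinct sequence (most abundant first) to the first existing cluster centroid that meets the 97% threshold. Taxonomic assignment of the resulting OTUs is omitted. All function names and example reads are hypothetical.

```python
from collections import Counter, defaultdict


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_otus(reads, threshold=0.97):
    """Greedy centroid clustering of (sample_id, sequence) pairs.

    Returns the centroid sequences and an OTU table of per-sample read counts.
    """
    abundance = Counter(seq for _, seq in reads)
    centroids = []
    assignment = {}
    for seq, _ in abundance.most_common():  # most abundant sequences seed clusters first
        for idx, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                assignment[seq] = idx
                break
        else:
            centroids.append(seq)  # no centroid is close enough: start a new OTU
            assignment[seq] = len(centroids) - 1

    table = defaultdict(Counter)
    for sample, seq in reads:
        table[f"OTU_{assignment[seq] + 1}"][sample] += 1
    return centroids, {otu: dict(counts) for otu, counts in table.items()}


# Made-up 40-base reads: 'variant' differs from 'base' at one position (97.5% identity).
base = "ACGT" * 10
variant = base[:-1] + "A"
other = "TTGCA" * 8
reads = [("s1", base), ("s1", base), ("s1", variant),
         ("s2", base), ("s2", other), ("s2", other)]

centroids, otu_table = cluster_otus(reads)
print(otu_table)
```

Here the one-mismatch variant is folded into the same OTU as its parent sequence, while the unrelated sequence forms its own OTU. Because the result depends on the processing order and on the similarity measure, different clustering methods can yield different OTU tables from the same reads, consistent with the divergence noted above.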
Common Descriptive Analysis Techniques for Microbiome Data
Diversity Analysis
Diversity measures provide information about the composition of a microbial community. Diversity analysis, in the context of the intestinal microbiome, is classified into alpha diversity and beta diversity. Alpha diversity is a metric used to quantify microbial diversity within a single sample (Jovel et al. 2016 and Olsen 2016). It may refer to the richness (the number of different species present), the abundance of different species, or the distribution of different species in the sample. The most common method of reporting alpha diversity is the Shannon Diversity Index, which is calculated from the proportion of each species relative to the total number of organisms in the community: each proportion is multiplied by its logarithm (commonly the natural logarithm), and the negative of the sum is reported. It therefore reflects both how many species are present and how evenly they are distributed. Typically, studies will compare the alpha diversity between covariates under investigation, such as a comparison of the alpha diversity between a diseased group and a healthy control group. Numerous studies have correlated low alpha diversity with poorer health outcomes (Jovel et al. 2016 and Knight et al. 2017).
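As a worked illustration of alpha diversity, the sketch below computes richness and the Shannon Diversity Index, H = -sum(p_i * ln p_i), from a list of per-species counts in one sample. The function names and the two example communities are assumptions chosen for illustration, and the natural logarithm is used by convention.

```python
import math


def richness(counts):
    """Number of species observed at least once in the sample."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Shannon Diversity Index using the natural logarithm."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample contains no observations")
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Two made-up communities with the same richness but different evenness.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]

print(richness(even), shannon_index(even))      # index equals ln(4) for four equally abundant species
print(richness(skewed), shannon_index(skewed))  # same richness, much lower index
```

Both communities contain four species, yet the community dominated by a single species scores far lower, which shows why the index captures more than a simple species count.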
On the other hand, beta diversity is an analysis technique used to compare diversity between samples (Jovel et al. 2016 and Olsen 2016). It is typically used to determine "how different" samples are from each other by effectively measuring the distance between samples, with similar samples lying "closer" together. This technique can be done with phylogenetic information (e.g., UniFrac) or without it (e.g., Bray-Curtis dissimilarity). Once the beta ...