Bioinformatics with Python Cookbook
eBook - ePub

Learn how to use modern Python bioinformatics libraries and applications to do cutting-edge research in computational biology, 2nd Edition

Tiago Antao

360 pages · English

About This Book

Discover modern next-generation sequencing libraries from the Python ecosystem to analyze large amounts of biological data

Key Features

  • Perform complex bioinformatics analysis using the most important Python libraries and applications
  • Implement next-generation sequencing, metagenomics, analysis automation, population genetics, and more
  • Explore various statistical and machine learning techniques for bioinformatics data analysis

Book Description

Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data.

This book covers next-generation sequencing, genomics, metagenomics, population genetics, phylogenetics, and proteomics. You'll learn modern programming techniques to analyze large amounts of biological data. With the help of real-world examples, you'll convert, analyze, and visualize datasets using various Python tools and libraries.

This book will help you get a better understanding of working with a Galaxy server, which is the most widely used bioinformatics web-based pipeline system. This updated edition also includes advanced next-generation sequencing filtering techniques. You'll also explore topics such as SNP discovery using statistical approaches under high-performance computing frameworks such as Dask and Spark.

By the end of this book, you'll be able to use and implement modern programming techniques and frameworks to deal with the ever-increasing deluge of bioinformatics data.

What you will learn

  • Learn how to process large next-generation sequencing (NGS) datasets
  • Work with genomic datasets using the FASTQ, BAM, and VCF formats
  • Learn to perform sequence comparison and phylogenetic reconstruction
  • Perform complex analysis with proteomics data
  • Use Python to interact with Galaxy servers
  • Use high-performance computing techniques with Dask and Spark
  • Visualize protein interaction datasets using Cytoscape
  • Apply PCA and decision trees, two machine learning techniques, to biological datasets

Who this book is for

This book is for data scientists, bioinformatics analysts, researchers, and Python developers who want to address intermediate-to-advanced biological and bioinformatics problems using a recipe-based approach. Working knowledge of the Python programming language is expected.


Information

Year: 2018
ISBN: 9781789349986

Next-Generation Sequencing

In this chapter, we will cover the following recipes:
  • Accessing GenBank and moving around NCBI databases
  • Performing basic sequence analysis
  • Working with modern sequence formats
  • Working with alignment data
  • Analyzing data in VCF
  • Studying genome accessibility and filtering SNP data
  • Processing NGS data with HTSeq

Introduction

Next-generation sequencing (NGS) is one of the fundamental technological developments of the decade in the life sciences. Whole genome sequencing (WGS), RAD-Seq, RNA-Seq, ChIP-Seq, and several other technologies are routinely used to investigate important biological problems. These are also called high-throughput sequencing technologies, and with good reason: they generate vast amounts of data that need to be processed. NGS is the main reason that computational biology has become a big-data discipline. More than anything else, this is a field that requires strong bioinformatics techniques.
Here, we will not discuss each individual NGS technique per se (this would require a whole book on its own). We will use an existing WGS dataset and the 1,000 Genomes Project to illustrate the most common steps necessary to analyze genomic data. The recipes presented here will be easily applicable to other genomic sequencing approaches. Some of them can also be used for transcriptomic analysis (for example, RNA-Seq). The recipes are also species-independent, so you will be able to apply them to any other species for which you have sequenced data. The biggest difference in processing data from different species is related to genome size, diversity, and the quality of the assembled genome (if it exists for your species). These will not affect the automated Python part of NGS processing much. In any case, we will discuss different genomes in the next chapter, Chapter 3, Working with Genomes.
As this is not an introductory book, you are expected to know at least what FASTA, FASTQ, Binary Alignment Map (BAM), and Variant Call Format (VCF) files are. I will also make use of the basic genomic terminology without introducing it (such as exomes, nonsynonymous mutations, and so on). You are required to be familiar with basic Python. We will leverage this knowledge to introduce the fundamental libraries in Python to perform the NGS analysis. Here, we will follow the flow of a standard bioinformatics pipeline.
However, before we delve into real data from a real project, let's get comfortable with accessing existing genomic databases and basic sequence processing—a simple start before the storm.

Accessing GenBank and moving around NCBI databases

Although you may have your own data to analyze, you will probably need existing genomic datasets. Here, we will look at how to access such databases at the National Center for Biotechnology Information (NCBI). We will not only discuss GenBank, but also other databases at NCBI. Many people refer (wrongly) to the whole set of NCBI databases as GenBank, but NCBI includes the nucleotide database and many others, for example, PubMed.
As sequencing analysis is a long subject, and this book targets intermediate to advanced users, we will not be very exhaustive with a topic that is, at its core, not very complicated. Nonetheless, it's a good warm-up for the more complex recipes that we will see at the end of this chapter.

Getting ready

We will use Biopython, which you installed in Chapter 1, Python and the Surrounding Software Ecology. Biopython provides an interface to Entrez, the data retrieval system made available by NCBI. This recipe is made available in the Chapter02/Accessing_Databases.ipynb Notebook.
You will be accessing a live API from NCBI. Note that the performance of the system may vary during the day. Furthermore, you are expected to be a "good citizen" while using it. You will find some recommendations at http://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Usage_Guidelines_and_Requiremen. Notably, you are required to specify an email address with your query. You should try to avoid a large number of requests (100 or more) during peak times (between 9.00 a.m. and 5.00 p.m. American Eastern Time on weekdays), and do not post more than three queries per second (Biopython will take care of this for you). It's not only good citizenship; you risk getting blocked if you overuse NCBI's servers (a good reason to give a real email address, because NCBI may try to contact you).
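The three-queries-per-second guideline can be pictured as a small client-side limiter. Biopython's Entrez module already throttles requests for you, so this is only an illustration of the idea; the class and method names here are my own, not Biopython API:

```python
import time

class RateLimiter:
    """Minimal sketch of a client-side rate limiter: allow at most
    max_per_sec calls per second by sleeping between calls.
    Hypothetical helper for illustration only."""

    def __init__(self, max_per_sec=3):
        self.min_interval = 1.0 / max_per_sec
        self.last_call = 0.0

    def wait(self, now=None, sleep=time.sleep):
        # `now` and `sleep` are injectable so the logic can be tested
        # without real clocks; in normal use, call wait() with no args.
        now = time.monotonic() if now is None else now
        delay = self.min_interval - (now - self.last_call)
        if delay > 0:
            sleep(delay)
            now += delay
        self.last_call = now
        return now
```

In real use you would call `limiter.wait()` before each Entrez request, but again, with Biopython this is unnecessary.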

How to do it...

Now, let's look at how we can search and fetch data from NCBI databases:
  1. We will start by importing the relevant module and configuring the email address:
from Bio import Entrez, SeqIO
Entrez.email = '[email protected]'
We will also import the module to process sequences. Do not forget to put in the correct email address.
  2. We will now try to find the chloroquine resistance transporter (CRT) gene in Plasmodium falciparum (the parasite that causes the deadliest form of malaria) on the nucleotide database:
handle = Entrez.esearch(db='nucleotide', term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]')
rec_list = Entrez.read(handle)
# Entrez returns Count and RetMax as strings, so compare them as integers
if int(rec_list['RetMax']) < int(rec_list['Count']):
    handle = Entrez.esearch(db='nucleotide', term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]', retmax=rec_list['Count'])
    rec_list = Entrez.read(handle)
We will search the nucleotide database for our gene and organism (for the syntax of the search string, check the NCBI website). Then, we will read the result that is returned. Note that the standard search will limit the number of record references to 20, so if you have more, you may want to repeat the query with an increased maximum limit. In our case, we will actually override the default limit with retmax. The Entrez system provides quite a few sophisticated ways to retrieve large numbers of results (for more information, check the Biopython or NCBI Entrez documentation). Although you now have the IDs of all of the records, you still need to retrieve the records properly.
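One of those sophisticated ways is paging: Entrez's esearch accepts a retstart offset alongside retmax, so very large result sets can be pulled in windows. A sketch of computing the page windows (the helper name is my own, not Biopython API):

```python
def page_windows(total, page_size):
    """Return (retstart, retmax) pairs covering `total` records in
    pages of at most `page_size`. Each pair could be passed to
    Entrez.esearch as retstart/retmax. Illustrative helper only."""
    windows = []
    for start in range(0, total, page_size):
        windows.append((start, min(page_size, total - start)))
    return windows

# Sketch of use (not run here; requires network access):
# for retstart, retmax in page_windows(int(rec_list['Count']), 100):
#     handle = Entrez.esearch(db='nucleotide', term='...',
#                             retstart=retstart, retmax=retmax)
```

Paging keeps each individual request small, which is friendlier to NCBI's servers than one huge retmax.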
  3. Now, let's try to retrieve all of these records. The following query will download all matching nucleotide sequences from GenBank (481 of them at the time of writing this book). You probably won't want to do this all the time:
id_list = rec_list['IdList']
hdl = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb')
Well, in this case, go ahead and do it. However, be careful with this technique, because you will retrieve a large amount of complete records, and some of them will have fairly large sequences inside. You risk downloading a lot of data (which would be a strain both on your side and on NCBI servers).
There are several ways around this. One way is to make a more restrictive query and/or download just a few records at a time, stopping when you have found the one that you need. The precise strategy will depend on what you are trying to achieve. In any case, we will retrieve a list of records in the GenBank format (which includes sequences, plus a lot of interesting metadata).
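The "download a few at a time and stop early" strategy can be sketched as a small loop. The helper names here are hypothetical, and the actual fetch function is injected so the stopping logic is independent of the network layer:

```python
def fetch_until(id_list, fetch_one, wanted):
    """Fetch records one at a time and stop as soon as the predicate
    `wanted` matches, avoiding a bulk download. `fetch_one` is any
    callable mapping a record ID to a record. Illustrative sketch."""
    for rec_id in id_list:
        rec = fetch_one(rec_id)
        if wanted(rec):
            return rec
    return None

# With Biopython it might be wired up roughly like this (not run here):
# def fetch_one(rec_id):
#     handle = Entrez.efetch(db='nucleotide', id=rec_id, rettype='gb')
#     return SeqIO.read(handle, 'gb')
# rec = fetch_until(id_list, fetch_one, lambda r: r.name == 'KM288867')
```

The per-record round trips are slower than one bulk efetch, but you transfer only as much as you actually need.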
  4. Let's read and parse the result:
recs = list(SeqIO.parse(hdl, 'gb'))
Note that we have converted an iterator (the result of SeqIO.parse) to a list. The advantage of doing this is that we can use the result as many times as we want (for example, iterate over it repeatedly) without repeating the query on the server, which saves time, bandwidth, and server usage. The disadvantage is that it will allocate memory for all records, so this will not work for very large datasets; you might not want to do this conversion genome-wide, as in the next chapter, Chapter 3, Working with Genomes. We will return to this topic in the last part of this book. If you are doing interactive computing, you will probably prefer to have a list (so that you can analyze and experiment with it multiple times), but if you are developing a library, an iterator will probably be the best approach.
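The iterator-versus-list trade-off can be demonstrated with any Python iterator; a generator expression stands in for SeqIO.parse here:

```python
# An iterator is consumed by a single pass; a list can be traversed
# repeatedly. Materializing with list() trades memory for re-use,
# which is exactly what happens with SeqIO.parse results.
records = (x * x for x in range(3))   # stands in for SeqIO.parse(...)
first_pass = [r for r in records]
second_pass = [r for r in records]    # the iterator is already exhausted

as_list = list(x * x for x in range(3))  # materialized once, reusable
```

This is why re-running an analysis over a SeqIO.parse iterator silently yields nothing the second time, while a list works every time.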
  5. We will now just concentrate on a single record. This will only work if you used the exact same preceding query:
for rec in recs:
    if rec.name == 'KM288867':
        break
print(rec.name)
print(rec.description)
The rec variable now has our record of interest, and rec.description contains its human-readable description.
  6. Now, let's extract some sequence features, which contain information such as gene products and exon positions on the sequence:
for feature in rec.features:
    if feature.type == 'gene':
        print(feature.qualifiers['gene'])
    elif feature.type == 'exon':
        loc = feature.location
        print(loc.start, loc.end, loc.strand)
    else:
        print('not processed:\n%s' % feature)
If the feature.type is gene, we will print its name, which will be in the qualifiers dictionary. We will also print all the locations of exons. Exons, as with all features, have locations in this sequence: a start, an end, and the strand from where they are read. While all the start and end positions for our exons are ExactPosition, note that Biopython supports many other types of positions. One type of position is BeforePosition, which specifies that a location point is before a certain sequence position. Another type of position is BetweenPosition...
