Bioinformatics with Python Cookbook

Learn how to use modern Python bioinformatics libraries and applications to do cutting-edge research in computational biology, 2nd Edition

Tiago Antao

  • 360 pages
  • English

About This Book

Discover modern next-generation sequencing libraries from the Python ecosystem to analyze large amounts of biological data

Key Features

  • Perform complex bioinformatics analysis using the most important Python libraries and applications
  • Implement next-generation sequencing, metagenomics, automating analysis, population genetics, and more
  • Explore various statistical and machine learning techniques for bioinformatics data analysis

Book Description

Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data.

This book covers next-generation sequencing, genomics, metagenomics, population genetics, phylogenetics, and proteomics. You'll learn modern programming techniques to analyze large amounts of biological data. With the help of real-world examples, you'll convert, analyze, and visualize datasets using various Python tools and libraries.

This book will help you get a better understanding of working with a Galaxy server, which is the most widely used bioinformatics web-based pipeline system. This updated edition also includes advanced next-generation sequencing filtering techniques. You'll also explore topics such as SNP discovery using statistical approaches under high-performance computing frameworks such as Dask and Spark.

By the end of this book, you'll be able to use and implement modern programming techniques and frameworks to deal with the ever-increasing deluge of bioinformatics data.

What you will learn

  • Learn how to process large next-generation sequencing (NGS) datasets
  • Work with genomic datasets using the FASTQ, BAM, and VCF formats
  • Learn to perform sequence comparison and phylogenetic reconstruction
  • Perform complex analysis with proteomics data
  • Use Python to interact with Galaxy servers
  • Use high-performance computing techniques with Dask and Spark
  • Visualize protein interaction datasets using Cytoscape
  • Use PCA and decision trees, two machine learning techniques, with biological datasets

Who this book is for

This book is for data scientists, bioinformatics analysts, researchers, and Python developers who want to address intermediate-to-advanced biological and bioinformatics problems using a recipe-based approach. Working knowledge of the Python programming language is expected.


Information

Year: 2018
ISBN: 9781789349986

Next-Generation Sequencing

In this chapter, we will cover the following recipes:
  • Accessing GenBank and moving around NCBI databases
  • Performing basic sequence analysis
  • Working with modern sequence formats
  • Working with alignment data
  • Analyzing data in VCF
  • Studying genome accessibility and filtering SNP data
  • Processing NGS data with HTSeq

Introduction

Next-generation sequencing (NGS) is one of the fundamental technological developments of the decade in the life sciences. Whole genome sequencing (WGS), RAD-Seq, RNA-Seq, ChIP-Seq, and several other technologies are routinely used to investigate important biological problems. These are also called high-throughput sequencing technologies, and with good reason: they generate vast amounts of data that need to be processed. NGS is the main reason that computational biology has become a big-data discipline. More than anything else, this is a field that requires strong bioinformatics techniques.
Here, we will not discuss each individual NGS technique per se (this would require a whole book on its own). We will use an existing WGS dataset and the 1,000 Genomes Project to illustrate the most common steps necessary to analyze genomic data. The recipes presented here will be easily applicable to other genomic sequencing approaches. Some of them can also be used for transcriptomic analysis (for example, RNA-Seq). The recipes are also species-independent, so you will be able to apply them to any other species for which you have sequenced data. The biggest difference in processing data from different species is related to genome size, diversity, and the quality of the assembled genome (if it exists for your species). These will not affect the automated Python part of NGS processing much. In any case, we will discuss different genomes in the next chapter, Chapter 3, Working with Genomes.
As this is not an introductory book, you are expected to know at least what FASTA, FASTQ, Binary Alignment Map (BAM), and Variant Call Format (VCF) files are. I will also make use of the basic genomic terminology without introducing it (such as exomes, nonsynonymous mutations, and so on). You are required to be familiar with basic Python. We will leverage this knowledge to introduce the fundamental libraries in Python to perform the NGS analysis. Here, we will follow the flow of a standard bioinformatics pipeline.
However, before we delve into real data from a real project, let's get comfortable with accessing existing genomic databases and basic sequence processing—a simple start before the storm.

Accessing GenBank and moving around NCBI databases

Although you may have your own data to analyze, you will probably need existing genomic datasets. Here, we will look at how to access such databases at the National Center for Biotechnology Information (NCBI). We will not only discuss GenBank, but also other databases at NCBI. Many people refer (wrongly) to the whole set of NCBI databases as GenBank, but NCBI includes the nucleotide database and many others, for example, PubMed.
As sequencing analysis is a long subject, and this book targets intermediate to advanced users, we will not be very exhaustive with a topic that is, at its core, not very complicated. Nonetheless, it's a good warm-up for the more complex recipes that we will see at the end of this chapter.

Getting ready

We will use Biopython, which you installed in Chapter 1, Python and the Surrounding Software Ecology. Biopython provides an interface to Entrez, the data retrieval system made available by NCBI. This recipe is made available in the Chapter02/Accessing_Databases.ipynb Notebook.
You will be accessing a live API from NCBI. Note that the performance of the system may vary during the day. Furthermore, you are expected to be a "good citizen" while using it. You will find some recommendations at http://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Usage_Guidelines_and_Requiremen. Notably, you are required to specify an email address with your query. You should try to avoid a large number of requests (100 or more) during peak times (between 9.00 a.m. and 5.00 p.m. American Eastern Time on weekdays), and do not post more than three queries per second (Biopython will take care of this for you). It's not only good citizenship; you also risk getting blocked if you overuse NCBI's servers (a good reason to give a real email address, because NCBI may try to contact you).

How to do it...

Now, let's look at how we can search and fetch data from NCBI databases:
  1. We will start by importing the relevant module and configuring the email address:
from Bio import Entrez, SeqIO
Entrez.email = '[email protected]'
We will also import the module to process sequences. Do not forget to put in the correct email address.
  2. We will now try to find the chloroquine resistance transporter (CRT) gene in Plasmodium falciparum (the parasite that causes the deadliest form of malaria) on the nucleotide database:
handle = Entrez.esearch(db='nucleotide', term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]')
rec_list = Entrez.read(handle)
if int(rec_list['RetMax']) < int(rec_list['Count']):
    handle = Entrez.esearch(db='nucleotide', term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]', retmax=rec_list['Count'])
    rec_list = Entrez.read(handle)
We will search the nucleotide database for our gene and organism (for the syntax of the search string, check the NCBI website). Then, we will read the result that is returned. Note that the standard search will limit the number of record references to 20, so if you have more, you may want to repeat the query with an increased maximum limit. In our case, we will actually override the default limit with retmax. Also note that RetMax and Count come back as strings, so we cast them to integers before comparing them. The Entrez system provides quite a few sophisticated ways to retrieve large numbers of results (for more information, check the Biopython or NCBI Entrez documentation). Although you now have the IDs of all of the records, you still need to retrieve the records properly.
  3. Now, let's try to retrieve all of these records. The following query will download all matching nucleotide sequences from GenBank, which is 481 at the time of writing this book. You probably won't want to do this all the time:
id_list = rec_list['IdList']
hdl = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb')
Well, in this case, go ahead and do it. However, be careful with this technique, because you will retrieve a large amount of complete records, and some of them will have fairly large sequences inside. You risk downloading a lot of data (which would be a strain both on your side and on NCBI servers).
There are several ways around this. One way is to make a more restrictive query and/or download just a few at a time and stop when you have found the one that you need. The precise strategy will depend on what you are trying to achieve. In any case, we will retrieve a list of records in the GenBank format (which includes sequences, plus a lot of interesting metadata).
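The download-a-few-at-a-time strategy can be sketched as a small batching helper. Note that `chunked`, `fetch_all`, `fetch_batch`, and `batch_size` are illustrative names of my own, not Biopython API; with Biopython, the fetch callable would typically wrap Entrez.efetch (and SeqIO.parse) for each chunk of IDs:

```python
# Hypothetical helpers for batched retrieval; none of these names come from
# Biopython. fetch_batch is any callable turning a chunk of IDs into records.

def chunked(ids, batch_size):
    """Yield successive slices of at most batch_size IDs."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]

def fetch_all(ids, fetch_batch, batch_size=50):
    """Fetch records chunk by chunk, accumulating them into one list."""
    records = []
    for chunk in chunked(ids, batch_size):
        records.extend(fetch_batch(chunk))
    return records
```

With Biopython, fetch_batch could be, for example, a small function that calls Entrez.efetch(db='nucleotide', id=chunk, rettype='gb') and parses the handle with SeqIO.parse, keeping each request well under NCBI's size limits.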
  4. Let's read and parse the result:
recs = list(SeqIO.parse(hdl, 'gb'))
Note that we have converted an iterator (the result of SeqIO.parse) to a list. The advantage of doing this is that we can use the result as many times as we want (for example, iterate over it several times) without repeating the query on the server, which saves time, bandwidth, and server usage. The disadvantage is that it will allocate memory for all records, so it will not work for very large datasets; you would not want to do this conversion genome-wide, as in the next chapter, Chapter 3, Working with Genomes. We will return to this topic in the last part of this book. If you are doing interactive computing, you will probably prefer to have a list (so that you can analyze and experiment with it multiple times), but if you are developing a library, an iterator will probably be the best approach.
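The single-use nature of iterators, which drives this trade-off, can be seen with a plain Python generator standing in for SeqIO.parse (the record names below are made up for illustration):

```python
# A toy generator standing in for SeqIO.parse: like any iterator,
# it can only be consumed once.
def parse_records():
    for name in ('rec1', 'rec2', 'rec3'):
        yield name

it = parse_records()
first_pass = list(it)    # consumes the iterator: ['rec1', 'rec2', 'rec3']
second_pass = list(it)   # already exhausted: []

recs = list(parse_records())  # materialize once into memory...
again = list(recs)            # ...then reuse as often as needed
```

The same logic applies to the real SeqIO.parse result: once you have looped over the handle, a second loop sees nothing unless you materialized the records first.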
  5. We will now just concentrate on a single record. This will only work if you used the exact same preceding query:
for rec in recs:
    if rec.name == 'KM288867':
        break
print(rec.name)
print(rec.description)
The rec variable now holds our record of interest, and rec.description contains its human-readable description.
  6. Now, let's extract some sequence features, which contain information such as gene products and exon positions on the sequence:
for feature in rec.features:
    if feature.type == 'gene':
        print(feature.qualifiers['gene'])
    elif feature.type == 'exon':
        loc = feature.location
        print(loc.start, loc.end, loc.strand)
    else:
        print('not processed:\n%s' % feature)
If the feature.type is gene, we will print its name, which will be in the qualifiers dictionary. We will also print all the locations of exons. Exons, as with all features, have locations in this sequence: a start, an end, and the strand from where they are read. While all the start and end positions for our exons are ExactPosition, note that Biopython supports many other types of positions. One type of position is BeforePosition, which specifies that a location point is before a certain sequence position. Another type of position is BetweenPosition...
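These position classes can also be constructed directly, which makes their behavior easy to inspect. This is a minimal sketch assuming Biopython is installed; the coordinates are invented, whereas in practice they come from parsed records:

```python
# Building a location by hand to show Biopython's position classes.
# The coordinates are made up; real ones come from parsed GenBank records.
from Bio.SeqFeature import BeforePosition, ExactPosition, FeatureLocation

loc = FeatureLocation(ExactPosition(5), BeforePosition(20), strand=1)
print(loc.start, loc.end, loc.strand)        # the fuzzy end renders as <20
print(isinstance(loc.start, ExactPosition))  # True
print(isinstance(loc.end, BeforePosition))   # True
```

Both position classes behave as integers for slicing and arithmetic, so code that only needs coordinates can treat fuzzy and exact positions uniformly.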
