eBook - ePub

Hands on Data Science for Biologists Using Python

ISBN: 9781000345506

Yasha Hasija, Rajkumar Chakraborty

224 pages
English

About this book

Hands-on Data Science for Biologists using Python has been conceptualized to address the massive data handling needs of modern-day biologists. With the advent of high-throughput technologies and the consequent availability of omics data, biological science has become a data-intensive field. This hands-on textbook has been written with the intention of easing data analysis by providing an interactive, problem-based instructional approach in the Python programming language.

The book starts with an introduction to Python and steadily delves into rigorous techniques of data handling, preprocessing, and visualization. The book concludes with machine learning algorithms and their applications in biological data science. Each topic has an intuitive explanation of concepts and is accompanied by biological examples.

Features of this book:

The book contains standard templates for data analysis using Python, suitable for beginners as well as advanced learners.

This book shows working implementations of data handling and machine learning algorithms using real-life biological datasets and problems, such as gene expression analysis, disease prediction, image recognition, and SNP association with phenotypes and diseases.

Considering the importance of visualization for data interpretation, especially in biological systems, a dedicated chapter covers data visualization and plotting.

Every chapter is designed to be interactive and is accompanied by a Jupyter notebook to prompt readers to practice on their local systems.

Another distinctive component of the book is the inclusion of a machine learning project, wherein various machine learning algorithms are applied to identify genes associated with age-related disorders. A systematic understanding of data analysis steps has always been an important element of biological research. This book is a readily accessible resource that can be used as a handbook for data analysis, as well as a collection of standard code templates for building models.

1. Python: Introduction and Environment Setup

Why Learn Python

Before getting to know Python, we should first understand why people working in the life sciences should learn to program. In this era of information technology, we have seen a massive explosion in biological data such as sequences, annotations, interactions, and biologically active compounds. For instance, when this chapter was being written in April 2019, GenBank (NCBI), one of the largest databases for nucleotide sequences, contained 212 million sequences in its repository (https://www.ncbi.nlm.nih.gov/genbank/statistics/). EMBL, which is also a raw nucleotide sequence repository, contained 2,253.8 million annotated sequences, a number expected to double in about 19.9 months (https://www.ebi.ac.uk/ena/about/statistics). This extensive data is being generated by high-throughput technologies, and to analyze it we need the help of computers. A computer consists of a central processing unit (CPU), a primary memory, and a secondary memory storage device. The CPU is the component that performs operations on the data stored in primary and secondary memory. Primary memory is designed to keep up with the speed of the CPU, but it loses its contents as soon as the power is switched off. A secondary memory storage device retains data after the computer shuts down. Together, these make up our digital assistant, which is fast and accurate in its tasks and does not get bored with repetitive jobs. However, to assign a job to a computer and receive the desired output, we need to speak its language, known as a programming language. Every biological research project involves different datasets and has unique problems to solve, from filtering, merging, and subsetting data to finding commonalities between lists, and may even require custom data formats for preserving and using information. Programming gives users a free hand to devise and implement innovative algorithms and solve such problems.
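As a rough illustration of the list operations mentioned above (filtering, merging, and finding commonalities), the following sketch uses Python's built-in sets. The gene symbols and the names experiment_a and experiment_b are invented for illustration and do not come from any dataset in this book.

```python
# Hypothetical gene lists from two experiments (illustrative values only)
experiment_a = ["TP53", "BRCA1", "EGFR", "MYC", "KRAS"]
experiment_b = ["EGFR", "PTEN", "TP53", "AKT1"]

# Commonalities: genes reported in both experiments
common = sorted(set(experiment_a) & set(experiment_b))

# Merging: every gene reported in at least one experiment, without duplicates
merged = sorted(set(experiment_a) | set(experiment_b))

# Filtering: genes found only in the first experiment, keeping original order
genes_in_b = set(experiment_b)
only_a = [gene for gene in experiment_a if gene not in genes_in_b]

print(common)
print(merged)
print(only_a)
```

Sets are used here because membership tests and intersections stay fast even when the lists grow to thousands of entries; the list comprehension is kept for the filtering step so that the original order is preserved.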

Over time, data science has also found applications in the life sciences. Data science helps find patterns in huge amounts of structured or unstructured data, which can provide valuable insights in almost all frontiers of biology: finding putative variations, predicting the consequences of amino acid substitutions, diagnosing diseases quickly, predicting lead drug toxicity, predicting pharmacophores, personalized or precision medicine, predicting protein secondary and tertiary structure, predicting microRNA interactions with their targets, epigenetics, and more. The very first step in generating a hypothesis from a large amount of data is the curation of large datasets. Curating data is tedious and time-consuming work that consists of repeatedly searching database websites, the literature, and other sources. Here our digital assistant comes to the rescue, since it can work much faster than humans can think and perform tasks manually. A 3.0-gigahertz CPU, for example, runs through 3 billion clock cycles per second, which gives a sense of the tremendous power of computing.

The central theme of this book is to provide biologists with a practical approach to applying data science techniques to omics data. Data science usually consists of data preparation, data analysis, data visualization, Machine Learning, and more. We will discuss each aspect in relation to relevant biological problems along with their solutions, starting with basic Python programming so that readers can get accustomed to programming terminology.

Programming skills are a valuable asset for any biologist. Many programming languages have been developed. Some are meant for specific tasks such as instantaneous computation, website creation, or database generation, while others are general-purpose programming languages developed for use in a variety of application domains. Python is one example of a general-purpose programming language. Guido van Rossum developed it as a hobby in the Netherlands around 30 years ago and named it after the famous British comedy group Monty Python, known for the television series “Monty Python’s Flying Circus”. Today, Python has applications in various domains such as data science, web development, data visualization, and desktop applications, to name a few. Python is one of the most popular programming languages in data science and Machine Learning, and it is community-driven. Since it has a gentle learning curve, many experts recommend it to beginners as their first programming language. Above all, Python has a simple, English-like syntax that is easy to read. For example, if one wants to find the proportion of the amino acid leucine, denoted by the symbol “L”, in a protein sequence, the following Python code will do that:

Protein = "MKLFWLLFTIGFCWAQYSSNTQQGRTSIVHLFEWRWVDIALECERY"

Leu_contain = Protein.count('L')/len(Protein)

print(Leu_contain)

The code reads much like English. The first line stores the protein sequence. The second line calculates the proportion of leucine residues (denoted by the letter “L”) by counting the number of times “L” appears in the sequence and dividing that count by the total length of the sequence. Finally, the third line prints the value: the sequence contains 5 leucines among 46 residues, so the result is approximately 0.109.
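The same idea extends from one residue to the whole sequence. The sketch below is not from the book; it assumes the standard-library collections.Counter and a hypothetical helper named residue_fractions to compute the fraction of every residue at once.

```python
from collections import Counter

protein = "MKLFWLLFTIGFCWAQYSSNTQQGRTSIVHLFEWRWVDIALECERY"


def residue_fractions(sequence):
    """Return a dict mapping each residue to its fraction of the sequence."""
    counts = Counter(sequence)  # number of occurrences of each letter
    total = len(sequence)
    return {residue: count / total for residue, count in counts.items()}


fractions = residue_fractions(protein)

# Leucine: 5 of the 46 residues, about 0.109 when rounded to three decimals
print(round(fractions["L"], 3))
```

Counting every residue in a single pass avoids calling count() once per amino acid, which matters little for one short protein but adds up over whole proteomes.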

Thanks to the readability of Python code, learners can concentrate on programming concepts and on the problem at hand rather than on the syntax of the language. Because Python is community-driven and has one of the largest communities, it has accumulated many important libraries that are either pre-installed or freely available to install. These libraries help in the quick and efficient development of complex applications, because their functionality does not need to be written from scratch.

Another advantage of learning Python is that it can be used for various purposes due to the development of popular libraries, such as:

• Frameworks like Django, Flask, and Pylons are used for creating static and dynamic websites.

• Libraries like Pandas, NumPy, and Matplotlib are available for data science and visualization.

• Scikit-Learn and TensorFlow are advanced libraries for Machine Learning and deep learning.

• Desktop applications can be built using packages like PyQt, PyGObject (for GTK), and wxPython (for wxWidgets), among others.

• Modules like BeeWare or Kivy are taking the lead in mobile applications.

Learning to program is like learning a new language: we first have to understand the vocabulary and syntax. Next, we learn how to construct some meaningful but terse sentences. Using those sentences, we then form paragraphs, and finally, we write our own story. In this book, we will start with Python syntax and vocabulary. Then, we will construct small programs with biological relevance to help biologists learn programming through problems that are important to them.

Installing Python

We are using Python 3.7, which was the current stable version of Python at the time of writing. Most operating systems either already have Python installed by default, or it can be downloaded from the Python Software Foundation’s website (https://www.python.org/), where it is freely available. After installing Python, open the Python Shell in Windows or type “python3” in the terminal of Mac or Linux. A banner like the following appears:

Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] on win32

Type "help", "copyright", "credits" or "license()" for more information.

>>>

Instructions are typed after “>>>”. Let us type our first instruction and press Enter.

>>> print('Welcome to Python')

Welcome to Python

Our first instruction was simple - to print “Welcome to Python”. If it runs correctly, then Python has been successfully installed and we are all set and ready to go!
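Beyond printing a greeting, it can be useful to confirm which interpreter version is actually running. The following is a minimal sketch using the standard sys module; the helper name version_ok and the constant MINIMUM_VERSION are illustrative choices, with the threshold set to the Python 3.7 used in this book.

```python
import sys

# Threshold chosen to match the Python 3.7 used in this book (an assumption
# of this sketch, not an official requirement of any package).
MINIMUM_VERSION = (3, 7)


def version_ok(version_info=sys.version_info, minimum=MINIMUM_VERSION):
    """Return True when the given interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


print("Running Python", sys.version.split()[0])
print("Meets the 3.7 requirement:", version_ok())
```

Comparing tuples such as (3, 10) >= (3, 7) avoids the pitfalls of comparing version strings, where "3.10" would sort before "3.7".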

Installing Anaconda Distribution

As we have discussed, Python has various packages that help us write fewer lines of code. Installing each package one by one is time-consuming. Moreover, because this book is centered on data science applications, we will require many widely used packages along with their dependencies. To spend less time setting up the coding environment, we will install the Anaconda distribution of Python. The Anaconda distribution comes with preinstalled packages for data science and is very popular among data scientists. Most of the statistics, data visualization, and Machine Learning packages are included with the installation. It is essentially Python bundled with a set of useful tools and packages. We will also get IPython (an interactive Python shell) and Jupyter Notebook along with it. Jupyter Notebook, an interactive notebook built on IPython, will be used throughout this book for writing and executing code. As a server-client application, the Jupyter Notebook App lets us write, edit, and run our code in notebooks through an internet browser. The application can run on a personal computer even without internet access. It also offers conveniences found in an Integrated Development Environment (IDE), such as autocompletion for variables and packages. Jupyter Notebook is also an easy way to share code, so the code used in this book may be downloaded and executed on readers’ own machines.

For more information about the Anaconda distribution, visit the official website (https://www.Anaconda.com/distribution/). To install Anaconda, go to (https://www.Anaconda.com/distribution/#download-section). Choose the Python 3.x version, where x is equal to or greater than 7, and then download the graphical installer for your operating system (Windows, Linux, or Mac OS). Follow the instructions of the graphical installer and keep all of the default options ticked.
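Once the installer has finished, one way to check that the data science packages are available is to ask Python whether it can locate them. The sketch below uses importlib.util.find_spec from the standard library; the list of import names is an assumption based on the packages named in this chapter (for example, Scikit-Learn is imported as sklearn), and missing_packages is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names of packages mentioned in this chapter (assumed list; adjust
# it to whatever your own work needs).
PACKAGES = ["numpy", "pandas", "matplotlib", "sklearn", "IPython", "notebook"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(PACKAGES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All listed packages are available.")
```

This only checks that each package can be found; it does not import the packages or verify their versions.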

Running the Jupyter Notebook

After installing the Anaconda distribution, we can now open the Jupyter Notebook and write our first line of code. To do this, open the Anaconda command prompt in Windows, or the terminal for Linux or Mac OS users...

Cover
Half Title
Title Page
Copyright Page
Contents
Preface
Author Bio
1. Python: Introduction and Environment Setup
2. Basic Python Programming
3. Biopython
4. Python for Data Analysis
5. Python for Data Visualization
6. Principal Component Analysis
7. Hands-On Projects
8. Machine Learning and Linear Regression
9. Logistic Regression
10. K-Nearest Neighbors (K-NN)
11. Decision Trees and Random Forests
12. Support Vector Machines
13. Neural Nets and Deep Learning
14. The Machine Learning Project
15. Natural Language Processing
16. K-Means Clustering
Index
