eBook - ePub

Hands on Data Science for Biologists Using Python

Name: Hands on Data Science for Biologists Using Python
Author: Yasha Hasija, Rajkumar Chakraborty


224 pages
English
ePUB (mobile friendly)
Available on iOS and Android

About the book

Hands-on Data Science for Biologists using Python was conceived to address the massive data handling needs of modern-day biologists. With the advent of high-throughput technologies and the consequent availability of omics data, biological science has become a data-intensive field. This hands-on textbook was written with the aim of easing data analysis by providing an interactive, problem-based instructional approach in the Python programming language.

The book starts with an introduction to Python and steadily moves into rigorous techniques for data handling, preprocessing, and visualization. It concludes with machine learning algorithms and their applications in biological data science. Each topic has an intuitive explanation of concepts and is accompanied by biological examples.

Features of this book:

The book contains standard templates for data analysis using Python, suitable for beginners as well as advanced learners.
The book shows working implementations of data handling and machine learning algorithms using real-life biological datasets and problems, such as gene expression analysis, disease prediction, image recognition, and SNP association with phenotypes and diseases.
Considering the importance of visualization for data interpretation, especially in biological systems, there is a dedicated chapter on data visualization and plotting.
Every chapter is designed to be interactive and is accompanied by a Jupyter notebook so that readers can practice on their local systems.

Another distinctive component of the book is the inclusion of a machine learning project, in which various machine learning algorithms are applied to identify genes associated with age-related disorders. A systematic understanding of data analysis steps has always been an important element of biological research. This book is a readily accessible resource that can be used as a handbook for data analysis, as well as a collection of standard code templates for building models.

Information

Publisher: CRC Press
Year: 2021
ISBN: 9781000345506
Category: Computer Science
Category: Python Programming

1

Python: Introduction and Environment Setup

Why Learn Python

Before learning about Python, we should first understand why people working in the life sciences should learn to program. In the era of information technology, we have seen a massive explosion in biological data such as sequences, annotations, interactions, and biologically active compounds. For instance, when this chapter was being written in April 2019, GenBank (NCBI) - one of the largest databases of nucleotide sequences - contained 212 million sequences in its repository (https://www.ncbi.nlm.nih.gov/genbank/statistics/). EMBL, which is also a raw nucleotide sequence repository, contained 2,253.8 million annotated sequences, a number expected to double in about 19.9 months (https://www.ebi.ac.uk/ena/about/statistics). This extensive data is being generated by high-throughput technologies, and analyzing it requires the help of computers.

A computer consists of a central processing unit (CPU), a primary memory, and a secondary memory storage device. The CPU is the component that performs operations on the data stored in primary and secondary memory. Primary memory is designed to keep pace with the CPU, but it loses its contents as soon as the power is switched off. A secondary memory storage device retains data after the computer shuts down. Together these make up our digital assistant, which is fast and accurate in its tasks and does not get bored with repetitive jobs. However, to assign a job to a computer and receive the desired output, we need to speak its language, which is known as a programming language. Every biological research project uses different datasets and has unique problems to solve - filtering, merging, subsetting, finding commonalities between lists - and may even require custom data formats for preserving and using information. Programming gives users a free hand to devise and implement innovative algorithms and to solve such problems.
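As a small illustration of the list operations mentioned above (removing duplicates, merging, and finding commonalities between lists), the following sketch uses Python's built-in sets. The gene lists and variable names are hypothetical and chosen only for illustration; they do not come from a real experiment.

```python
# Hypothetical gene lists from two studies (illustrative values only)
upregulated_in_study_a = ["TP53", "BRCA1", "MYC", "EGFR", "TP53"]
upregulated_in_study_b = ["MYC", "KRAS", "EGFR", "PTEN"]

# Converting a list to a set removes duplicate entries such as the repeated "TP53"
set_a = set(upregulated_in_study_a)
set_b = set(upregulated_in_study_b)

# Commonalities between the two lists (set intersection)
common_genes = sorted(set_a & set_b)

# Genes reported only in study A (set difference)
only_in_a = sorted(set_a - set_b)

# Merged, duplicate-free list of all genes (set union)
all_genes = sorted(set_a | set_b)

print("Common to both:", common_genes)
print("Only in study A:", only_in_a)
print("All genes:", all_genes)
```

Sets are used here because membership tests and intersections on them stay fast even for lists with many thousands of identifiers.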

Over time, data science has also found applications in the life sciences. It helps find patterns in large amounts of structured or unstructured data, which can provide valuable insights in almost all frontiers of biology: finding putative variations, predicting the consequences of amino acid substitutions, diagnosing diseases quickly, predicting lead drug toxicity, predicting pharmacophores, personalized or precision medicine, predicting protein secondary and tertiary structure, predicting microRNA interactions with their targets, epigenetics, and so on. The very first step in generating a hypothesis from a large amount of data is the curation of large datasets. Curating data is tedious and time-consuming work: it involves repetitive searching of database websites, the literature, and other sources. Here our digital assistant comes to the rescue, because it can work much faster than humans can search and process things manually. A 3.0-gigahertz CPU runs through 3 billion clock cycles per second, which gives a sense of the tremendous power of computing.

The central theme of this book is to give biologists a practical approach to applying data science techniques to omics data. Data science usually consists of data preparation, data analysis, data visualization, Machine Learning, and more. We will discuss each aspect in relation to relevant biological problems along with their solutions, starting with basic Python programming so that readers can get accustomed to programming terminology.

Programming skills are a valuable asset for any biologist. Many programming languages have been developed. Some are designed for specific tasks such as numerical computation, website creation, or database generation, and some are general-purpose languages developed for use in a variety of application domains. Python is one example of a general-purpose programming language. Guido van Rossum began developing it as a hobby in the Netherlands in the late 1980s and named it after the British comedy group Monty Python, known for the television show “Monty Python’s Flying Circus”. Today, Python has applications in domains such as data science, web development, data visualization, and desktop applications, to name a few. Python is one of the most popular programming languages in data science and Machine Learning, and it is community-driven. Because it has a gentle learning curve, many experts recommend it to beginners as their first programming language. Above all, Python has a simple, English-like, readable syntax that users can easily understand. For example, to find the proportion of the amino acid Leucine, denoted by the symbol “L”, in a protein sequence, the following Python code will do:

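Protein = "MKLFWLLFTIGFCWAQYSSNTQQGRTSIVHLFEWRWVDIALECERY"  # protein sequence, one letter per residue
Leu_contain = Protein.count("L") / len(Protein)  # number of "L" residues divided by the sequence length
print(Leu_contain)  # display the Leucine fraction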

The code reads much like English. The first line stores the protein sequence. The second line calculates the proportion of Leucine residues (denoted by the letter “L”) by counting the number of times “L” appears in the sequence and dividing that count by the total length of the sequence. The last line prints the value, which turns out to be approximately 0.109 (5 Leucine residues out of 46).
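The same idea extends from Leucine to every residue in the sequence. The sketch below is one possible way to do this with `collections.Counter` from the standard library; the function name `residue_fractions` is a hypothetical helper introduced here for illustration and is not taken from the book.

```python
from collections import Counter

protein = "MKLFWLLFTIGFCWAQYSSNTQQGRTSIVHLFEWRWVDIALECERY"


def residue_fractions(sequence):
    """Return a dict mapping each residue letter to its fraction of the sequence."""
    total = len(sequence)
    if total == 0:
        # An empty sequence has no composition; avoid dividing by zero
        return {}
    counts = Counter(sequence)  # counts occurrences of every letter in one pass
    return {residue: count / total for residue, count in counts.items()}


fractions = residue_fractions(protein)

# Leucine fraction: the same quantity computed by Protein.count("L") / len(Protein)
print(round(fractions["L"], 3))

# List residues from most to least frequent
for residue, fraction in sorted(fractions.items(), key=lambda item: item[1], reverse=True):
    print(residue, round(fraction, 3))
```

Counting all residues in a single pass avoids calling `count` once per amino acid, which matters little for one short protein but helps when many long sequences are processed.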

Thanks to the readability of Python code, learners can concentrate on programming concepts and on the problem at hand rather than on the syntax of the language. Because Python is community-driven and has one of the largest communities, it has accumulated many important libraries that are either pre-installed or freely available to install. These libraries help in the quick and efficient development of complex applications, because their functionality does not need to be written from scratch.

Another advantage of learning Python is that it can be used for various purposes due to the development of popular libraries, such as:

• Frameworks like Django, Flask, and Pylons are used for creating static and dynamic websites.

• Libraries like Pandas, NumPy, and Matplotlib are available for data science and visualization.

• Scikit-Learn and TensorFlow are advanced libraries for Machine Learning and deep learning.

• Desktop applications can be built using packages like PyQt, PyGObject (for Gtk), and wxPython (for wxWidgets), among others.

• Toolkits like BeeWare or Kivy are taking the lead in mobile applications.

Learning to program is like learning a new language: we first have to understand the vocabulary and syntax. Next, we learn how to construct short but meaningful sentences. Using those sentences, we then form paragraphs, and finally, we write our own story. In this book, we will start with Python syntax and vocabulary. Then, we will construct small programs with biological relevance, so that biologists learn programming through problems that are important to them.

Installing Python

We are using Python 3.7, which was the current stable version of Python at the time of writing. Most operating systems either have Python installed by default, or it can be downloaded free of charge from the Python Software Foundation’s website (https://www.python.org/). After installing Python, open the Python Shell in Windows or type “python3” in the terminal on Mac or Linux. A banner like the following appears:

Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>>

Instructions are typed after the “>>>” prompt. Let us type our first instruction and press Enter.

>>> print('Welcome to Python')
Welcome to Python

Our first instruction was simple - to print “Welcome to Python”. If it runs correctly, then Python has been successfully installed and we are all set and ready to go!
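Beyond printing a message, it can be useful to confirm which Python version is actually running, since the book assumes Python 3.7 or later. The following is a minimal sketch using the standard `sys` module; the helper name `is_supported` is hypothetical and introduced only for this illustration.

```python
import sys


def is_supported(minimum=(3, 7)):
    """Return True if the running interpreter is at least the given (major, minor) version."""
    # sys.version_info[:2] is a (major, minor) tuple, and tuples compare element by element
    return sys.version_info[:2] >= tuple(minimum)


# sys.version starts with the version number, e.g. "3.7.3 (...)"
print("Running Python", sys.version.split()[0])
print("Meets the 3.7 minimum:", is_supported())
```

The code can be typed into the Python shell or saved in a file and run; if the last line reports False, an older interpreter is being picked up.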

Installing Anaconda Distribution

As we have discussed, Python has various packages that help us write fewer lines of code. Installing each package one by one is time-consuming. Moreover, because this book is centered on data science applications, we will require many widely used packages along with their dependencies. To spend less time setting up the coding environment, we will install the Anaconda distribution of Python. The Anaconda distribution comes with preinstalled packages for data science, and it is very popular among data scientists. Most of the statistics, data visualization, and Machine Learning packages are included with the Anaconda installation. It is basically Python with a set of useful tools and packages preinstalled. We will also get IPython (an interactive Python shell) and packages such as Jupyter Notebook along with it. Jupyter Notebook will be used throughout this book for writing and executing code. Jupyter Notebook is a kind of interactive notebook built on IPython. As a server-client application, the Jupyter Notebook App enables us to write, edit, and run our code in notebooks through an internet browser. The application can be run on a personal computer even without internet access. It provides conveniences similar to those of an Integrated Development Environment (IDE), such as autocompletion for variables and packages. The Jupyter Notebook is also an easy way to share code, so the code used in this book may be downloaded and executed on readers’ machines.

For more information about the Anaconda distribution, visit the official website (https://www.Anaconda.com/distribution/). To install Anaconda on your computer, go to (https://www.Anaconda.com/distribution/#download-section). Choose the Python 3.x version, where x is equal to or greater than 7, and then download the graphical installer for your operating system (i.e. Windows, Linux, or Mac OS). Follow the instructions of the graphical installer and keep all of the default options ticked.
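Once the installer has finished, one way to confirm that the data science packages are available is to ask Python whether it can locate them. The sketch below uses `importlib.util.find_spec` from the standard library. The list of import names is an assumption about what a typical Anaconda installation provides (for example, Scikit-Learn is imported as `sklearn` and Jupyter Notebook as `notebook`), and `installed_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Import names assumed to ship with a typical Anaconda installation
packages = ["numpy", "pandas", "matplotlib", "sklearn", "notebook"]


def installed_packages(names):
    """Return a dict mapping each top-level import name to True if Python can find it."""
    # find_spec returns None when a top-level module cannot be located
    return {name: find_spec(name) is not None for name in names}


for name, available in installed_packages(packages).items():
    print(f"{name:12s} {'installed' if available else 'missing'}")
```

Checking with `find_spec` locates each package without importing it, so the check stays quick even for large libraries.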

Running the Jupyter Notebook

After installing the Anaconda distribution, we can now open the Jupyter Notebook and write our first line of code. To do this, open the Anaconda command prompt in Windows, or the terminal for Linux or Mac OS users...