Why Learn Python
Before getting to know Python, we should first understand why people working in the life sciences should learn to program. In the era of information technology, we have seen a massive explosion in biological data such as sequences, annotations, interactions, and biologically active compounds. For instance, when this chapter was being written in April 2019, GenBank (NCBI) - one of the largest databases of nucleotide sequences - contained 212 million sequences in its repository (https://www.ncbi.nlm.nih.gov/genbank/statistics/). The European Nucleotide Archive (ENA) at EMBL-EBI, which is also a raw nucleotide sequence repository, contained 2,253.8 million annotated sequences, a number expected to double in about 19.9 months (https://www.ebi.ac.uk/ena/about/statistics). This volume of data is a result of the advent of high-throughput technologies.
To analyze this massive amount of data, we need the help of computers. A computer consists of a central processing unit (CPU), a primary memory, and a secondary memory storage device. The CPU is the component that performs operations on the data held in primary and secondary memory. Primary memory is designed to be fast enough to keep pace with the CPU, but it loses its contents as soon as the power is switched off. A secondary memory storage device retains data after the computer shuts down. Together, these make up our digital assistant, which is fast and accurate in its tasks and does not get bored with repetitive jobs. However, in order to assign a job to a computer and receive the desired output, we need to communicate in a language it understands, known as a programming language. Every biological research project involves different datasets and has unique problems to solve - from filtering, merging, and subsetting data to finding commonalities between lists - and may even require custom data formats for preserving and using information.
Programming gives a free hand to users to think and implement innovative algorithms and solve various problems.
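As a small illustration of the list operations mentioned above (finding commonalities, filtering, and merging), the following sketch uses Python's built-in set type. The gene lists and variable names are hypothetical and chosen only for illustration; real data would typically be read from files.

```python
# Hypothetical gene lists from two experiments (illustrative names only).
experiment_a = ["BRCA1", "TP53", "EGFR", "MYC"]
experiment_b = ["TP53", "KRAS", "MYC", "PTEN"]

set_a = set(experiment_a)
set_b = set(experiment_b)

# Commonalities between lists: genes present in both experiments.
common_genes = sorted(set_a & set_b)

# Filtering: genes seen only in experiment A, keeping their original order.
only_in_a = [gene for gene in experiment_a if gene not in set_b]

# Merging: all distinct genes from both experiments.
all_genes = sorted(set_a | set_b)

print(common_genes)
print(only_in_a)
print(all_genes)
```

Sets are used here because membership tests and intersections on them are fast even for long lists, which matters when the lists hold thousands of identifiers.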
Over time, data science has also found applications in the life sciences. Data science helps in finding patterns in huge amounts of structured or unstructured data, which can provide valuable insights in almost all frontiers of biology - ranging from finding putative variations, predicting the consequences of amino acid substitutions, diagnosing diseases quickly, predicting lead drug toxicity, and predicting pharmacophores, to personalized or precision medicine, prediction of protein secondary and tertiary structure, microRNA interactions with their targets, epigenetics, and more. The very first step in generating a hypothesis from a large amount of data is the curation of large datasets. Curating data is tedious and time-consuming work. It consists of repetitive searching for data on database websites, in the literature, and in other sources. Here our digital assistant comes to the rescue, saving us from this tedious job, as it can work much faster than humans can think and perform things manually. A 3.0-gigahertz CPU runs through 3 billion clock cycles per second and can execute instructions at a comparable rate - an example of the tremendous power of computing.
The central theme of this book is to provide biologists with a practical approach to applying data science techniques to omics data. Data science usually consists of data preparation, data analysis, data visualization, Machine Learning, and more. We will discuss each aspect in relation to relevant biological problems along with their solutions - starting with basic Python programming so that readers can get accustomed to programming terminology.
Programming skills are a valuable asset for any biologist. Many programming languages have been developed. Some are intended for specific tasks such as instantaneous computation, website creation, or database generation, while others are general-purpose programming languages developed for use in a variety of application domains. Python is one example of a general-purpose programming language. Guido van Rossum began developing it as a hobby in the Netherlands around 30 years ago and named it after the British comedy series “Monty Python’s Flying Circus”. Today, Python has applications in various domains like data science, web development, data visualization, and desktop applications, to name a few. Python is one of the most popular programming languages in data science and Machine Learning, and it is community-driven. Since it has a gentle learning curve, many experts recommend it to beginners as their first programming language. Above all, Python has a simple, readable, English-like syntax that is easy to understand. For example, if one wants to find the proportion of the amino acid Leucine, denoted by the symbol “L”, in a protein sequence, the following Python code will do that:
Protein = "MKLFWLLFTIGFCWAQYSSNTQQGRTSIVHLFEWRWVDIALECERY"
Leu_contain = Protein.count('L') / len(Protein)
print(Leu_contain)
The code reads much like English. The first line stores the protein sequence. The second line calculates the proportion of Leucine residues (denoted by the letter “L”) by counting the number of times “L” appears in the sequence and then dividing that count by the total length of the sequence. The last line prints the value, which turns out to be approximately 0.109 (5 Leucine residues out of 46).
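The same idea extends naturally from one residue to the whole sequence. The sketch below, which is only one possible approach, uses Counter from Python's standard collections module to compute the proportion of every amino acid in the sequence from the example above. The variable names are illustrative.

```python
from collections import Counter

protein = "MKLFWLLFTIGFCWAQYSSNTQQGRTSIVHLFEWRWVDIALECERY"

# Count how many times each amino acid letter occurs in the sequence.
residue_counts = Counter(protein)

# Convert each count into a proportion of the total sequence length.
composition = {
    residue: count / len(protein)
    for residue, count in residue_counts.items()
}

# The proportion of Leucine matches the earlier single-residue calculation.
print(round(composition["L"], 3))  # prints 0.109
```

Because composition is an ordinary dictionary, the proportion of any other residue can be looked up in the same way, for example composition["W"] for Tryptophan.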
Thanks to the readability of Python code, learners can concentrate on programming concepts and on the problems at hand rather than on the syntax of the language. As Python is community-driven and has one of the largest communities, it has evolved to include many important libraries that are either pre-installed or freely available to install. These libraries help in the quick and efficient development of complex applications, because common functionality does not need to be written from scratch.
Another advantage of learning Python is that it can be used for various purposes due to the development of popular libraries, such as:
•Frameworks like Django, Flask, and Pylons are used for creating static and dynamic websites.
•Libraries like Pandas, NumPy, and Matplotlib are available for data science and visualization.
•Scikit-Learn and TensorFlow are advanced libraries for Machine Learning and deep learning.
•Desktop applications can be built using packages like PyQt, PyGObject (GTK), and wxPython (wxWidgets), among others.
•Frameworks like BeeWare and Kivy can be used to build mobile applications.
Learning to program is much like learning a new language: we first have to understand the vocabulary and syntax. Next, we learn how to construct some meaningful but terse sentences. Using those sentences, we then form paragraphs, and finally, we write our own story. In this book, we will start with Python syntax and vocabulary. Then, we will construct small programs with biological relevance to help biologists learn programming through problems that are important to them.