1.1 Who Should Read This Book
This book is for the life science researcher who wants to learn how to program. He/she may have previous exposure to computer programming, but this is not necessary to understand this book (although it surely helps).
This book is designed to be useful to several separate but related audiences: students, graduates, postdocs, and staff scientists. All of them can benefit from knowing how to program.
Exposing students to programming at an early stage in their careers helps to boost their creativity and logical thinking, and both skills can be applied in research. In order to ease the learning process for students, all subjects are introduced with minimal prerequisites. There are also questions at the end of each chapter. They can be used for self-assessing how much you’ve learned. The answers are available to teachers in a separate guide.
Graduates and staff scientists with actual programming needs should find the book's many real-world examples and abundant reference material extremely valuable.
1.1.1 What the Reader Should Already Know
Since this book is called Python for Bioinformatics, it has been written with the following assumptions in mind:
• No programming knowledge is assumed, but the reader is required to have enough computer proficiency to use a text editor and handle basic tasks in their operating system (OS). Since Python is multi-platform, most instructions in this book apply to the most common operating systems (Windows, macOS, and Linux); when a command or procedure applies only to a specific OS, it will be clearly noted.
• The reader should be working (or at least planning to work) with bioinformatics tools. Even small-scale manual jobs, such as using NCBI BLAST to identify a sequence, aligning proteins, searching for primers, or estimating a phylogenetic tree, will be useful for following the examples. The more familiar the reader is with bioinformatics, the better they will be able to apply the concepts learned in this book.
1.2.1 Typographical Conventions
There are some typographical conventions I have tried to use in a uniform way throughout the book. They should aid readability and were chosen to tell apart user-made names (or variables) from language keywords. This comes in handy when learning a new computer language.
Bold: Objects provided by Python and by third-party modules. With this notation it should be clear that round is part of the language and not a user-defined name. Bold is also used to highlight parts of the text. There is no way to confuse one bold usage with the other.
Mono-spaced font: User declared variables, code, and filenames. For example: sequence = 'MRVLLVALALLALAASATS'.
Italics: In commands, it is used to denote a variable that can take different values. For example, in len(iterable), “iterable” can take different values. Used in text, it marks a new word or concept. For example “One such fundamental data structure is a dictionary.”
The content of lines starting with $ (dollar sign) is meant to be typed in your operating system console (also called command prompt in Windows or terminal in macOS).
↲: Break line. Some lines are longer than the available space in a printed page, so this symbol is inserted to mean that what is on the next line in the page represents the same line on the computer screen. Inside code, the symbol used is <=.
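As a small illustration of the distinction these conventions draw, the sketch below mixes names provided by Python (len and round) with user-defined names. The variable sequence is the example used above; length and alanine_fraction are hypothetical names chosen only for this illustration.

```python
# User-defined name: the example protein sequence from the conventions above.
sequence = 'MRVLLVALALLALAASATS'

# len and round are provided by Python; they are not user-defined names.
length = len(sequence)

# Fraction of alanine (A) residues, rounded to two decimal places.
alanine_fraction = round(sequence.count('A') / length, 2)

print(length)             # 19
print(alanine_fraction)   # 0.32
```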
The current version of Python at this moment is 3.6.1. There is a 2.7.12 version that is maintained¹ because there are still a sizable number of applications in production using the 2.7 branch. Versions 3.x and 2.x differ enough to be incompatible with each other. Python 3 is more efficient than Python 2 in many respects. Large websites such as Instagram migrated from Python 2.7 to Python 3.6 and reduced CPU and memory consumption by up to 30%. This book uses Python 3.6.
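To give a feeling for why the two branches are incompatible, here is a minimal sketch of a few well-known behaviors in Python 3; the comments describe how Python 2 differed. The variable names are hypothetical and serve only this illustration.

```python
import sys

# In Python 3, print is a function; in Python 2 it was a statement
# ("print 7 / 2"), which is a syntax error in Python 3.

# In Python 3 the "/" operator always performs true division.
# In Python 2, 7 / 2 between integers evaluated to 3.
true_division = 7 / 2      # 3.5
floor_division = 7 // 2    # 3, explicit floor division

# In Python 3, str holds Unicode text and bytes is a separate type.
text = 'ACGT'
raw = text.encode('ascii')

# Check which major version is running this script.
is_python3 = sys.version_info.major >= 3

print(true_division, floor_division, type(raw).__name__, is_python3)
```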
The only scenario where you may need to use Python 2.7, apart from maintenance of old code, is when a specific library is not available for Python 3. In this case, before starting a project in Python 2.7, try to search for a replacement library. For example, suppose you want to connect to a MySQL database and you are told to use MySQLdb. Since this package is not Python 3 compatible, instead of falling back to Python 2.7, use mysqlclient or mysql-connector-python, both of which work with Python 3.
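One way to check which of these replacement libraries is installed, without connecting to any database, is sketched below. The helper first_available is a hypothetical name introduced for this example. It relies on the fact that mysqlclient installs a module named MySQLdb, while mysql-connector-python installs mysql.connector.

```python
import importlib.util


def first_available(candidates):
    """Return the first importable module name in candidates, or None."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except ModuleNotFoundError:
            # Raised for dotted names whose parent package is missing.
            continue
    return None


# Module names installed by mysqlclient and mysql-connector-python.
driver = first_available(['MySQLdb', 'mysql.connector'])
if driver is None:
    print('No MySQL driver found; install mysqlclient or mysql-connector-python')
else:
    print('Using MySQL driver module:', driver)
```

This only reports which module can be imported; opening a connection still requires the chosen library's own connect function and a running server.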
Python source code that appears in this book is presented as listings. Each line of these listings is numbered. These numbers are not intended to be typed; they are used to reference each line in the text. You don’t need to copy the code from the book, since it can be downloaded from the GitHub repository at https://github.com/Serulab/Py4Bio.
Code can be formatted in several ways and still be valid to the Python interpreter. The following code is syntactically correct:
def GetAverage(X):
    avG=sum(X)/len(X)
    " Calculate the average "
    return avG
Also this one:
def get_average(items):
    """ Calculate the average """
    average = sum(items) / len(items)
    return average
The latter code sample follows the most widely accepted coding style for Python.² Throughout the book you will mostly find code formatted like the second sample. Some code in the book will not follow accepted coding styles, for the following reasons:
• There are some instances where the most didactic way to show a particular piece of code conflicts with the style guide. On those few occasions, I choose to deviate from the style guide in favor of clarity.
• Due to size limitations in a printed book, some names were shortened and other minor deviations from the coding style have been introduced.
•To show th...