eBook - ePub

Beginners Guide to Bioinformatics for High Throughput Sequencing

Name: Beginners Guide to Bioinformatics for High Throughput Sequencing
Author: Eric Lee, T W Tan

276 pages
English

About this book

Biologists often find computing bewildering, yet they are expected to process the voluminous data produced by the machines they buy and the datasets that have accumulated in genomic databanks worldwide. It is increasingly difficult for them to avoid dealing with large volumes of data, and the work goes beyond simple manual programming.

Most books in this realm are full of equations and complex code, but this book gives a much gentler entry point, particularly for biologists, with code snippets that readers can cut and paste and run on their Linux or Mac OS X operating system or cloud instance. It also provides step-by-step installation instructions that are easy to follow. Those who work in genome sequencing and are already familiar with the analysis procedures may also find this book useful for closing some knowledge gaps.

High throughput sequencing requires high throughput and high performance computing. This book provides a gentle entry to high throughput sequencing by dealing with simple skills that the average biologist is increasingly required to master. You will find this book a breeze to read, and some of its suggestions may be new to you, something you might want to try out.

Contents:

Preparing Your Computing Environment
Learning Basic Linux Commands
Checking Sequence Quality
Sequence Alignment
Speeding-up with GPUs
Establishing a Research Workflow Pipeline
Using a Bioinformatics Cloud Computing Platform
Appendix: Learning Regular Expressions through Practising Simple Data Processing

Readership: Students and researchers in bioinformatics, biocomputing, computational biology, genetics and genomics.

CHAPTER 1

Preparing Your Computing Environment

1.1 Buying Your Own Computer

As a bioinformatics researcher, it is perfectly reasonable to wish for an ideal computer as a tool for your work. In reality, the main purpose of your computer may merely be to serve as a terminal. With rapid progress in information technology and biotechnology, the volume of data in bioinformatics is now so large that most computational work can no longer be done on a PC; one has to rely on supercomputers. If a routine sequence assembly has to run on large servers continuously for many hours, or even for days or weeks, how can we expect a personal computer to do the job?

For these reasons, when selecting a functional computer without exceptional performance, the most important consideration is the stability of the terminal interface at run time. If you are buying a personal computer for practical use as a standalone system, we recommend one with more disk storage and more memory. The larger disk is needed because you may temporarily store raw data and analyse results on your own computer, and the extra memory speeds up analysis. It is also useful to be able to open a results file in Excel and carry out statistical analyses as the next step, or to export it to a free, open-source statistical platform such as R.

A further expectation would be to opt for a 64-bit multicore computer, because some bioinformatics applications only work in a 64-bit environment. Such a personal computer is easily attainable, but 64-bit hardware is of little use unless a 64-bit operating system is also installed.
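As a rough illustration of the points above, the following Python sketch reports the processor architecture, the number of cores and the free disk space of the machine it runs on. The function name `describe_machine` and the default path `/` are our own assumptions for illustration. Note also that `python_bits` describes the Python interpreter itself, which can be 32-bit even on a 64-bit operating system.

```python
import os
import platform
import shutil
import struct


def describe_machine(path="/"):
    """Collect the basics discussed above: word size, cores, free disk."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "machine": platform.machine(),             # for example x86_64 or arm64
        "python_bits": struct.calcsize("P") * 8,   # pointer size of this interpreter
        "cpu_cores": os.cpu_count(),               # may be None if undetectable
        "disk_free_gb": round(usage.free / 1e9, 1),
    }


if __name__ == "__main__":
    for key, value in describe_machine().items():
        print(f"{key}: {value}")
```

Running the script on your own computer gives a quick impression of whether it is closer to a terminal or to a standalone analysis machine.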

In practical terms, as mentioned earlier, most calculations are completed on the server, which is usually also the experimental “infrastructure” and provides an internal, open, shared computing environment. If you are not a computer hardware aficionado, do not worry about this for now, as asking you to install Linux immediately may be unrealistic. Given the focus of this section, we leave these details to the chapters and sections on servers.

However, if you still want a computer that supports a 64-bit operating system, and you have no mental block against non-Windows operating systems, we recommend buying an Apple Mac. Apple’s current machines have 64-bit multi-core hardware and come preloaded with macOS (formerly named OS X), an operating system with a UNIX foundation that draws on FreeBSD.

After booting, you immediately have a complete environment in which to learn Unix shell commands and to create a remote connection to the server, without any problems. Many bioinformatics programs are also available as native Mac applications. In our experience, their mode of operation, including the graphical interface, performs better than the Windows versions, and the command-line operations are more user-friendly than the Linux versions. This is personal practical experience that we share with you. We really believe that spending too much time on setting up a system just to establish a working environment is wasteful.

Today’s generation is a cloud computing generation. Besides purchasing a bare metal machine as described above, “Infrastructure as a Service” (IaaS) is a new choice. You pay only for your actual usage time and get a private computing resource that is directly accessible from any computer anywhere, or even from a mobile device. We did a quick survey of general IaaS providers when we updated the book and found that fees start from US$5 per month. If you just want to practise Unix shell commands, a low-end service is enough. At a more advanced stage, you can pay a higher fee for a more powerful dedicated server with optimized memory, a solid state drive and a GPU. The flexibility is tremendous compared to buying your own bare metal machine for your home or office.

Figure 1.1 In today’s retail shops, both desktop PCs and notebook computers are available as 64-bit systems (for example, a 4-processor Intel Core i5 notebook computer). There is also the option of GPU capability in your PC (notwithstanding the crucial issue of whether one can use it properly). Running basic parallel computing on such a platform is not a problem, and using it as a dumb terminal would be more than sufficient.

This kind of cloud service usually provides a control panel in a Web browser, so that you can “create” a machine with the desired operating system installed in seconds. With the snapshot feature, rolling back to the initial state is simple, so there is no need to worry about minor mistakes: you can easily undo any damage you accidentally cause.

Alternatively, you may apply for publicly available computing resources from your government or research institution. Sometimes these are free.

Note that whether you opt for cloud services or institutional resources, you will still need your own computer or device to access them. In fact, many of us are increasingly using our mobile devices, whether a smartphone or a tablet computer, to access these powerful and flexible remote computers, reducing the need for a desktop.
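Remote machines of this kind are commonly reached with the `ssh` client, although the text above does not name a specific tool, so treat that as our assumption. The sketch below only builds the command line; the user name, host name and the helper `ssh_command` are hypothetical placeholders.

```python
import shlex


def ssh_command(user, host, port=22, key_file=None):
    """Build the argument list for an ssh login to a remote machine."""
    argv = ["ssh", "-p", str(port)]      # -p selects the remote port
    if key_file:
        argv += ["-i", key_file]         # -i selects a private key file
    argv.append(f"{user}@{host}")
    return argv


if __name__ == "__main__":
    # Hypothetical account and server, shown only for illustration.
    print(shlex.join(ssh_command("alice", "server.example.org")))
```

Printing the command rather than running it lets you check it first, then paste it into a terminal on whichever device you use to reach the server.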

1.2 Setting up a Computing Server

Having been introduced to choosing and purchasing a standalone PC and to using remote computing resources, some of you may already have a reasonable conceptual understanding of these matters. Some of you may already be familiar with the Linux operating system, or consider yourselves power users or even UNIX gurus. It may just be that you are not too familiar with the environment of bioinformatics applications and need additional affirmation. Or perhaps you have reached a level of competence at which you have been assigned responsibility for system administration and maintenance. In this chapter, we would like to share some of our experience with Linux servers in the bioinformatics applications space.

In fact, the various flavours of Linux that you will typically encounter, that is, Linux distributions (or distros, as they are fondly called) such as Ubuntu, Fedora, Red Hat, CentOS and Debian, originate from a common base. Most commands, functionality and usage are the same, with the Linux kernel itself serving as the unifying specification. The only difference lies in the development kit, where developers have acquired their own tastes and evolved their own design philosophies and architectural preferences. As long as system administrators and IT managers pick a flavour they are familiar and comfortable with, it should suffice.

The only remaining caveat is the requirements of the end-user applications. In particular, the obvious limitation is the operating system environment. Recall that the basic entry requirement for a server is a 64-bit operating system; the minimum memory requirement is also very important. Before installing software, one should look at its specification requirements; a subsequent chapter on software operations explains how to assess such criteria.
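Checking a server against a program's stated requirements can be sketched as a simple comparison. The `Spec` class, the function `unmet_requirements` and all the numbers below are hypothetical and chosen only for illustration; real requirements come from the documentation of the software you intend to install.

```python
from dataclasses import dataclass


@dataclass
class Spec:
    bits: int         # 32 or 64
    memory_gb: int    # installed (or required) memory
    disk_gb: int      # free (or required) disk space


def unmet_requirements(system, required):
    """Return a list of human-readable problems; empty means all is well."""
    problems = []
    if system.bits < required.bits:
        problems.append(f"needs a {required.bits}-bit operating system")
    if system.memory_gb < required.memory_gb:
        problems.append(f"needs at least {required.memory_gb} GB of memory")
    if system.disk_gb < required.disk_gb:
        problems.append(f"needs at least {required.disk_gb} GB of free disk")
    return problems


if __name__ == "__main__":
    server = Spec(bits=64, memory_gb=16, disk_gb=500)     # hypothetical server
    assembler = Spec(bits=64, memory_gb=32, disk_gb=100)  # hypothetical tool
    print(unmet_requirements(server, assembler))
```

Returning a list of problems, rather than a single yes or no, makes it clear which resource has to be upgraded before installation.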

Returning to the limitations of the software itself, one should check whether the necessary supporting software libraries are already installed. This matters because most software builds on a pre-existing framework or on previously completed subroutine modules. Software libraries organize this framework so that, whenever the need arises, a call to the library can be conveniently made, cutting down the time developers need to debug or troubleshoot their programs. So what types of library support are there? One popular run-time library is the Java Runtime Environment, also known as JRE, which you should install. Look also at the version number of the software and note its backward compatibility, so that you can install updates right up to the latest version.
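One way to check for a Java runtime and its version is sketched below. It assumes the usual banner printed by `java -version`, such as `openjdk version "11.0.2"` or the older style `java version "1.8.0_292"`; the function names are our own, and other Java distributions may format their banner differently.

```python
import re
import shutil
import subprocess


def java_major_version(banner):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Releases up to Java 8 were labelled 1.x (for example 1.8.0_292).
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Return the major version of the `java` on the PATH, or None if absent."""
    if shutil.which("java") is None:
        return None
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    # The banner is normally written to standard error.
    return java_major_version(result.stderr + result.stdout)


if __name__ == "__main__":
    print(installed_java_major())
```

Comparing the returned number with the minimum version a program asks for tells you whether an update is needed before installation.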

Next, let us look at the actual software installation process, which is usually a point of confusion for newcomers. Today’s smartphone users are used to going to Google Play or the Apple App Store to search for and install apps from the respective repository, with the installation process made simple and almost foolproof. On Linux, by contrast, the installation instructions are not necessarily the same from one distribution to the next. For example, Ubuntu uses apt-get to install from its software repository, whereas Fedora uses rpm packages instead. For JRE, whether it is the Oracle or the open-source version, nothing works if we use the wrong version of the software package. This diversity of software package management systems is a common source of great confusion among beginners.
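To see which packaging family a given server belongs to, one can read the `ID` and `ID_LIKE` fields of `/etc/os-release`, a file present on most modern distributions. The sketch below is a minimal illustration under that assumption; the family tables are deliberately short and the function names are our own.

```python
import os

DEB_FAMILY = {"debian", "ubuntu"}
RPM_FAMILY = {"fedora", "rhel", "centos", "suse", "opensuse", "sles"}


def parse_os_release(text):
    """Turn KEY=value lines from an os-release file into a dictionary."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip("\"'")  # drop optional quotes
    return fields


def package_format(fields):
    """Return 'deb', 'rpm' or 'unknown' for the parsed os-release fields."""
    names = [fields.get("ID", "")] + fields.get("ID_LIKE", "").split()
    names = [name.lower() for name in names if name]
    if any(name in DEB_FAMILY for name in names):
        return "deb"
    if any(name in RPM_FAMILY for name in names):
        return "rpm"
    return "unknown"


if __name__ == "__main__":
    if os.path.exists("/etc/os-release"):
        with open("/etc/os-release") as handle:
            print(package_format(parse_os_release(handle.read())))
```

Knowing the family first avoids the common beginner mistake of following installation instructions written for a different distribution.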

Next, regarding Linux system maintenance, you may wonder why Ubuntu and Debian both use “apt-get” for software installation, while Fedora, RHEL (Red Hat Enterprise Linux) and SUSE use “rpm”. The reason is that Ubuntu and Debian belong to the same family of Linux distributions and use the .deb format with tools such as apt, apt-get and dpkg, whereas CentOS, SUSE, Fedora and others of the Red Hat family use the .rpm format and the corresponding installation ...
