CHAPTER 1
Preparing Your Computing Environment
1.1 Buying Your Own Computer
As a bioinformatics researcher, it is perfectly reasonable to wish for an ideal computer as a tool for your work. In reality, the main purpose of your computer may merely be to serve as a terminal. Owing to rapid progress in information technology and biotechnology, the volume of data in bioinformatics is now so large that most computational work can no longer be done on a PC; one has to rely on large servers or supercomputers. In routine practice, a sequence assembly job may run continuously on a large server for many hours, or even for days or weeks. How can we expect a personal computer to do that job?
Perhaps for these reasons, when selecting a working computer of modest performance, the most important consideration is the stability of the terminal session at run time. If you are purchasing a personal computer for practical standalone use, we recommend one with more disk storage and more memory. The larger disk is needed because you may have to store raw data temporarily and analyse results on your own machine; the extra memory makes that analysis faster. At a minimum, you should be able to open a results file in Excel and carry out statistical analyses as the next step, or export it to a free, open-source statistical platform such as R.
A further expectation would be to opt for a 64-bit multicore computer, because some bioinformatics applications only run in a 64-bit environment. Such a requirement is easily met by a personal computer today, but 64-bit hardware is of little use unless a 64-bit operating system is also installed.
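As a quick illustration of the point above, the following minimal sketch reports whether the running Python interpreter is 64-bit and how many CPU cores the operating system exposes. It assumes Python 3 is available; the function names `pointer_bits` and `describe_machine` are our own, introduced only for this example. Note that a 64-bit interpreter implies a 64-bit operating system, whereas a 32-bit interpreter may still be running on a 64-bit system.

```python
import os
import platform
import struct
import sys


def pointer_bits() -> int:
    """Return the pointer width of the running Python interpreter, in bits."""
    return struct.calcsize("P") * 8


def describe_machine() -> dict:
    """Collect a few facts relevant to the 64-bit, multicore requirement."""
    return {
        # Architecture string reported by the OS, e.g. "x86_64" or "arm64".
        "machine": platform.machine(),
        "interpreter_bits": pointer_bits(),
        # True only when the interpreter itself is a 64-bit build.
        "is_64bit_interpreter": sys.maxsize > 2**32,
        # Logical cores visible to the OS (may be None if undeterminable).
        "cpu_cores": os.cpu_count(),
    }


if __name__ == "__main__":
    for key, value in describe_machine().items():
        print(f"{key}: {value}")
```

Running the script prints one line per item; the values naturally depend on the machine on which it is run.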
In practical terms, as mentioned earlier, most calculations are completed on a server, which is usually part of the laboratory “infrastructure” and provides an internal shared computing environment. If you are not a computer hardware aficionado, you need not worry about this for now, as asking you to install Linux immediately may be unrealistic. Given the focus of this section, I will leave the details to the chapters and sections on servers.
However, if you still want a computer with a 64-bit operating system and have no mental block against non-Windows operating systems, we recommend buying an Apple Mac directly. Apple’s current machines use 64-bit multi-core hardware, and macOS (formerly named OS X) comes preloaded as a UNIX operating system whose foundation, Darwin, includes components derived from FreeBSD.
After booting, you immediately have a complete environment in which to learn Unix shell commands and to create a remote connection to a server; in our tests, this worked without problems. In addition, many bioinformatics programs are available as native Mac applications. In our experience, their operation, including the graphical interface, performs better than the Windows versions, and the command-line operations are more user-friendly than the Linux versions. This is personal practical experience that we share with you. We really believe that spending too much time on setting up a system merely to establish a working environment is wasteful and pointless.
Today’s generation is a cloud computing generation. Besides purchasing a bare-metal machine as described above, “Infrastructure as a Service” (IaaS) is a new choice. You pay only for your actual usage time, and you get a private computing resource that is directly accessible from any computer anywhere, or even from a mobile device. We did a quick survey of general IaaS providers when we updated the book, and found that fees start from US$5 per month. If you just want to practise Unix shell commands, the lowest service tier is enough. At a more advanced stage, you can pay a higher fee for a more powerful dedicated server with optimized memory, a solid-state drive and a GPU. The flexibility is tremendous compared with buying your own bare-metal machine for your home or office.
Figure 1.1 In today’s retail shops, whether a desktop PC or a notebook computer, there will be 64-bit systems (for example, Intel Core i5 4-processor notebook computer). There is also the option of getting a GPU graphics processor capability in your PC (notwithstanding the crucial issue of whether one can use them properly). Running basic parallel computing on such a platform is not a problem, and using it as a dumb terminal would be more than sufficient.
This kind of cloud service usually provides a control panel through a Web browser, so that you can “create” a machine with the desired operating system installed in seconds. With the snapshot feature, rolling back to the initial state is simple, so there is no need to worry about minor mistakes: you can easily undo any damage you accidentally cause.
Alternatively, you may apply for publicly available computing resources from your government or research institution. Sometimes these are free.
Note that whether you opt for cloud services or institutional resources, you will still need your own computer or device to access them. In fact, many of us increasingly use mobile devices, whether a smartphone or a tablet computer, to access these powerful and flexible remote computers, reducing the need for a desktop.
1.2 Setting up a Computing Server
Having been introduced to how to choose and purchase a standalone PC or to use remote computing resources, some of you may already have a reasonable conceptual understanding of these matters. Some may already be familiar with the Linux operating system, or consider themselves power users or even UNIX gurus. It may just be that you are not too familiar with the environment of bioinformatics applications and need additional reassurance. Or perhaps you have reached a level of competence at which you have been assigned responsibility for system administration and maintenance. We would like to share with you in this chapter some of our experience with Linux servers in the bioinformatics applications space.
In fact, the various flavours of Linux that you will typically encounter, i.e. Linux distributions (or distros, as they are fondly called) such as Ubuntu, Fedora, Red Hat, CentOS and Debian, share a common base, the Linux kernel, so most commands, functionalities and usage are the same. The main differences lie in the tooling bundled around that base, where developers have followed their own tastes, design philosophies and architectural preferences. As long as system administrators and IT managers pick a flavour that is familiar and comfortable to use, it should suffice.
The only remaining caveat concerns the requirements of the end-user applications. In particular, the obvious limitation is the operating system environment. Recall that the basic entry requirement for a server is a 64-bit operating system; the minimum memory requirement is also very important. Before installing any software, one should look at its specification requirements; a subsequent chapter on software operations explains how to assess such criteria.
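To make the idea of checking specification requirements concrete, here is a minimal sketch that compares a machine against a set of minimum requirements. The `Requirements` class, the `unmet` function and the figures used (16 GB of memory, 100 GB of free disk) are hypothetical and chosen purely for illustration; real values should come from the documentation of the software concerned. The memory lookup relies on `os.sysconf`, which we assume is available, as it is on Linux; where it is not, the helper returns `None`.

```python
import os
import shutil
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Requirements:
    """Hypothetical minimum requirements, as read from a tool's documentation."""
    min_memory_gb: float
    min_free_disk_gb: float
    needs_64bit: bool = True


def unmet(req: Requirements, memory_gb: float, free_disk_gb: float,
          is_64bit: bool) -> List[str]:
    """Return the names of the requirements that the given machine fails."""
    failures = []
    if req.needs_64bit and not is_64bit:
        failures.append("64-bit")
    if memory_gb < req.min_memory_gb:
        failures.append("memory")
    if free_disk_gb < req.min_free_disk_gb:
        failures.append("disk")
    return failures


def total_memory_gb() -> Optional[float]:
    """Total physical memory in GiB, or None where sysconf cannot report it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / 1024**3


if __name__ == "__main__":
    req = Requirements(min_memory_gb=16, min_free_disk_gb=100)  # made-up figures
    memory = total_memory_gb()
    free_disk = shutil.disk_usage(".").free / 1024**3
    if memory is None:
        print("could not determine total memory on this platform")
    else:
        failures = unmet(req, memory, free_disk, sys.maxsize > 2**32)
        print(failures or "all requirements met")
```

Keeping the comparison in a pure function such as `unmet` means the same check can be applied to figures copied from a server's specification sheet, not only to the machine on which the script runs.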
Returning to the limitations of the software itself, one should check whether the necessary supporting software libraries are already installed. This is because most software builds on a pre-existing framework or on previously completed subroutine modules. Software libraries organize this framework so that, whenever the need arises, a call to the library can conveniently be made, cutting down the time developers need to write, debug or troubleshoot their programs. So what types of library support are there? One popular run-time environment is the Java Runtime Environment, also known as the JRE, which you should install. Look also at the version number that the software requires, and note backward compatibility, so that you know whether you can install updates right up to the latest version.
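One way to check whether a Java runtime is present, and which major version it is, can be sketched as follows. This is an illustrative example rather than a definitive tool: the function names are our own, and it assumes that the `java` command, if installed, is on the `PATH` and prints a line such as `openjdk version "17.0.9"` or `java version "1.8.0_391"` when called with `-version`. Java 8 and earlier report themselves as `1.x`, which the parser takes into account.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier use the legacy "1.x" scheme, e.g. "1.8.0_391" is Java 8.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major() -> Optional[int]:
    """Return the major version of the `java` found on PATH, or None if absent."""
    java = shutil.which("java")
    if java is None:
        return None
    try:
        result = subprocess.run([java, "-version"], capture_output=True,
                                text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    # `java -version` conventionally writes to stderr; check both streams.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    print("no usable Java runtime found" if major is None else f"Java {major}")
```

The reported major version can then be compared with the version that a given bioinformatics package states in its documentation.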
Next, let us look at the actual software installation process, which is often a point of confusion for newcomers. Today’s smartphone users are used to going to Google Play or the Apple App Store to search for and install apps from the respective repository, with the installation process made simple and almost foolproof. On Linux, by contrast, installation instructions are not necessarily the same from one distribution or package to the next. For example, Ubuntu uses apt-get to install from its software repository, whereas Fedora packages software in the rpm format and installs it from its repository with dnf (yum on older releases). For the JRE, whether it is the Oracle or the open-source version, if we install the wrong version of the package, the software that depends on it may not work. This diversity of package management systems is a common source of confusion among beginners.
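To see which packaging family a given Linux machine belongs to, one can read the `/etc/os-release` file that most current distributions provide. The sketch below is a simplified illustration under that assumption; the function names and the two sets of distribution identifiers are our own and are far from exhaustive.

```python
from pathlib import Path
from typing import Dict, Optional

# Illustrative, incomplete lists of distribution identifiers.
DEB_IDS = {"debian", "ubuntu"}
RPM_IDS = {"fedora", "rhel", "centos", "suse", "opensuse"}


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse KEY=value lines (values may be quoted) into a dictionary."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip("\"'")
    return fields


def package_family(fields: Dict[str, str]) -> Optional[str]:
    """Return "deb", "rpm" or None, judged from the ID and ID_LIKE fields."""
    ids = {fields.get("ID", "").lower()}
    ids |= set(fields.get("ID_LIKE", "").lower().split())
    if ids & DEB_IDS:
        return "deb"
    if ids & RPM_IDS:
        return "rpm"
    return None


if __name__ == "__main__":
    path = Path("/etc/os-release")
    if path.exists():
        print(package_family(parse_os_release(path.read_text())))
    else:
        print("no /etc/os-release found; probably not a Linux system")
```

On a "deb" system one would then reach for apt-get, and on an "rpm" system for the rpm-based tools of that distribution.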
Next, regarding Linux system maintenance, you may wonder why Ubuntu and Debian both use “apt-get” for software installation, while Fedora, RHEL (Red Hat Enterprise Linux) and SUSE use “rpm”. The reason is that Ubuntu and Debian belong to the same family of Linux distributions and use the .deb format with tools such as apt, apt-get and dpkg, whereas CentOS, Fedora and others of the Red Hat family, as well as SUSE, use the .rpm format and the corresponding installation ...