Why R?
If you work in the field of biodata analysis, or if you are interested in getting a bioinformatics job, you will see a large number of related job advertisements targeting young professionals. One requirement keeps coming back in those ads: they demand "a high degree of familiarity with R/Bioconductor." (Here, I am quoting an actual recent ad from Monster.com.)
Besides, when we have to create and analyze a large amount of data during our bio-researcher career, sooner or later we realize that simple approaches using spreadsheets (aka the Excel part of MS Office) are no longer flexible enough to meet the needs of our projects. In these situations, we start to look for dedicated statistical software tools, and soon we encounter countless alternatives to choose from. The R statistical environment is one of the possibilities.
With the exponential spread of high-throughput experimental methods, including microarray and next-generation sequencing (NGS)-based experiments, the skills related to large-scale analysis of data from biological experiments are becoming more and more valuable. R and Bioconductor offer a free and flexible tool-set for these types of analyses; therefore, many research groups and companies select them as their data analysis platform.
R is open-source software licensed under the GNU General Public License (GPL). One advantage of this is that you can install R for free on your desktop computer, regardless of whether you use Windows, Mac OS X, or a Linux distribution.
Introducing all the features of R thoroughly at a general level exceeds the scope and purpose of this book, which focuses on molecular biology-specific applications. For those who are interested in a deeper introduction to R itself, we suggest the book R for Beginners by Emmanuel Paradis as a reference guide. It is an excellent general guide, which can be found online (Paradis 2005). In the course, we use more biology-oriented examples to illustrate the most important topics. The other recommended book for this chapter is R in a Nutshell by Joseph Adler (2012).
Installing R
The first task in analyzing data with R is to install R on the computer. There is a nice discussion on bioinformatics blogs about why people so seldom use the knowledge they acquire in short bioinformatics courses. One of the main explanations offered is that the greatest challenge is installing the software in question.
There is plenty of information available on the web about how to install R, but the most authoritative source is the website of the R project itself. This site collects the official documentation, installers, and other related links from the developers of R themselves. The first step is to navigate to the download section of the site and find the mirror closest to your location.
However, the installation process differs somewhat depending on the operating system of the computer in use. Windows users should find the Windows installer for their system on the download pages. Look for the base installer, not the contributed libraries. In the case of a Linux distribution, R can be installed via the package manager. Several Linux distributions provide R (and many R libraries) as a part of their repositories. This way, the package manager can take care of the updates. Mac OS X users and Apple fans can find the pkg file containing the R framework, the 64-bit graphical user interface (GUI) (R.app), and the Tcl/Tk 8.6.0 X11 libraries for installing the R base system on their computer. Brave users of other UNIX systems (e.g., FreeBSD or OpenBSD) can use R, but they should compile it from the source. This is not a beginner topic. On a computer owned by a company, university, or library, the installation of R (just like that of many other programs) most often requires superuser rights.
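Once the installation has finished, it can be useful to confirm which R build is actually running. The following is a minimal sketch using only base R objects (R.version.string, .Platform, .Machine, and R.home()); the variable names os_type and pointer_bytes are chosen here for illustration and do not come from the book.

```r
# Print the version string of the running R installation.
print(R.version.string)

# Operating system family R was built for: "unix" or "windows".
os_type <- .Platform$OS.type
print(os_type)

# Pointer size in bytes: 8 indicates a 64-bit build, 4 a 32-bit build.
pointer_bytes <- .Machine$sizeof.pointer
print(pointer_bytes * 8)

# Directory in which R itself is installed.
print(R.home())
```

The printed values depend on the machine, so no fixed output is shown here.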
Interacting with R
The interface of R is somewhat different from that of other software used for statistics, such as SPSS, S-plus, Prism, or MS Excel (which is not a statistical software tool!). There are neither icons nor sophisticated menus to perform analyses. Instead, commands are typed at the "command prompt" of R, which is marked with >. In this book, the commands to be typed at the prompt are set in a fixed-width (monospaced) font:
> citation()
After typing in a command (and hitting Enter), the results turn up either under the command or, in the case of graphics, in a separate window. If a command returns nothing, NULL is printed as the result. Mistyping a command or making an error in its parameters leads to an error message with some information about what went wrong.
> c()
NULL
> a * 5
Error: object 'a' not found
From now on, we will omit the > prompt character from the code samples so you can just copy/paste the commands. To leave R, use the quit() function or its shorter alias q().
quit(save='no')
q()
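The NULL result and the error message shown earlier can also be checked from code rather than read off the screen. The sketch below relies only on base R functions (is.null(), tryCatch(), and conditionMessage()); the object name undefined_object_xyz is hypothetical and is assumed not to exist in the session.

```r
# c() called without arguments returns NULL; is.null() tests for it.
empty_result <- c()
print(is.null(empty_result))

# tryCatch() intercepts the error so that the session can continue.
# 'undefined_object_xyz' is a hypothetical name assumed to be undefined.
message_text <- tryCatch(
  undefined_object_xyz * 5,
  error = function(e) conditionMessage(e)
)

# The captured text is the same message R would print after "Error:".
print(message_text)
```

Catching the error this way keeps a longer sequence of commands from stopping at the first mistake, which is convenient while experimenting.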
Graphical interfaces and integrated development environment (IDE) integration
A command-line interface is enough for performing the practices. However, some prefer to have a GUI. There are multiple choices depending on the operating system in use. The Windows and Mac versions of R start with a very simple GUI, while the Linux/UNIX versions start only with a command-line interface. The Java GUI for R is available for any platform capable of running Java, and it sports simple, functional menus to perform the most basic tasks related to an analysis (Helbig, Urbanek, and Fellows 2013).
For a more advanced GUI, one can experiment with RStudio or R Commander (Fox 2005). There are also several plugins that integrate R into established coding tools, such as Emacs (with the Emacs Speaks Statistics add-on), Eclipse (with StatET for R), and many others.
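Whichever front end is chosen, R itself can report how it was started. This is a small sketch using the base R function interactive() and the .Platform$GUI entry; the variable names are illustrative, and the exact string reported depends on the front end and platform.

```r
# TRUE when R is used interactively at a prompt,
# FALSE when a script is run non-interactively (e.g., with Rscript).
is_interactive <- interactive()
print(is_interactive)

# Name of the front end R was started from. The value varies by
# platform and interface, so it is printed rather than assumed.
gui_name <- .Platform$GUI
print(gui_name)
```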
Scripting and sourcing
Doing data analysis in R means typing in commands and experimenting with parameters suitable for the given set of data. At a later stage, the procedure will be repeated either on the same data with slight modifications in the course o...