Gene Expression Data Analysis
eBook - ePub

Gene Expression Data Analysis

A Statistical and Machine Learning Perspective

Pankaj Barah, Dhruba Kumar Bhattacharyya, Jugal Kumar Kalita

Share book
  1. 392 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android
eBook - ePub

Gene Expression Data Analysis

A Statistical and Machine Learning Perspective

Pankaj Barah, Dhruba Kumar Bhattacharyya, Jugal Kumar Kalita

Book details
Book preview
Table of contents
Citations

About This Book

Development of high-throughput technologies in molecular biology during the last two decades has contributed to the production of tremendous amounts of data. Microarray and RNA sequencing are two such widely used high-throughput technologies for simultaneously monitoring the expression patterns of thousands of genes. Data produced from such experiments are voluminous (both in dimensionality and numbers of instances) and evolving in nature. Analysis of huge amounts of data toward the identification of interesting patterns that are relevant for a given biological question requires high-performance computational infrastructure as well as efficient machine learning algorithms. Cross-communication of ideas between biologists and computer scientists remains a big challenge.

Gene Expression Data Analysis: A Statistical and Machine Learning Perspective has been written with a multidisciplinary audience in mind. The book discusses gene expression data analysis from molecular biology, machine learning, and statistical perspectives. Readers will be able to acquire both theoretical and practical knowledge of methods for identifying novel patterns of high biological significance. To measure the effectiveness of such algorithms, we discuss statistical and biological performance metrics that can be used in real life or in a simulated environment. This book discusses a large number of benchmark algorithms, tools, systems, and repositories that are commonly used in analyzing gene expression data and validating results. This book will benefit students, researchers, and practitioners in biology, medicine, and computer science by enabling them to acquire in-depth knowledge in statistical and machine-learning-based methods for analyzing gene expression data.

Key Features:

  • An introduction to the Central Dogma of molecular biology and information flow in biological systems


  • A systematic overview of the methods for generating gene expression data


  • Background knowledge on statistical modeling and machine learning techniques


  • Detailed methodology of analyzing gene expression data with an example case study


  • Clustering methods for finding co-expression patterns from microarray, bulkRNA, and scRNA data


  • A large number of practical tools, systems, and repositories that are useful for computational biologists to create, analyze, and validate biologically relevant gene expression patterns


  • Suitable for multidisciplinary researchers and practitioners in computer science and biological sciences

Frequently asked questions

How do I cancel my subscription?
Simply head over to the account section in settings and click on “Cancel Subscription” - it’s as simple as that. After you cancel, your membership will stay active for the remainder of the time you’ve paid for. Learn more here.
Can/how do I download books?
At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.
What is the difference between the pricing plans?
Both plans give you full access to the library and all of Perlego’s features. The only differences are the price and subscription period: With the annual plan you’ll save around 30% compared to 12 months on the monthly plan.
What is Perlego?
We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.
Do you support text-to-speech?
Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.
Is Gene Expression Data Analysis an online PDF/ePUB?
Yes, you can access Gene Expression Data Analysis by Pankaj Barah, Dhruba Kumar Bhattacharyya, Jugal Kumar Kalita in PDF and/or ePUB format, as well as other popular books in Computer Science & Computer Science General. We have over one million books available in our catalogue for you to explore.

Information

Year
2021
ISBN
9781000425758
Edition
1

Chapter 1 Introduction

DOI: 10.1201/9780429322655-1

1.1 Introduction

An exciting area of significant scientific and technological innovation of recent times is bioinformatics. The field integrates diverse disciplines, including computer science and informatics, biology, statistics, applied mathematics and artificial intelligence to provide solutions to crucial biological problems at the molecular level. With the help of machine learning techniques and statistical methods, it has become possible to organize, analyze and interpret voluminous biological data with an eye to uncovering interesting patterns of great consequence. One major area within bioinformatics is analysis of gene expression data of disparate kinds such as microarrays, gene ontologies, protein-protein interactions and various flavors of genome sequence data or combinations. The genome provides only static information whereas gene expression data analysis produced from the microarray and sequencing technologies provide dynamic information about cell function. The measurement of the activity (expression) of thousands of genes at once so as to create a global picture of cellular function is known as gene expression profiling. Analysis and interpretation of such gene expression data using appropriate machine learning or statistical methods can help extract intrinsic patterns or knowledge, which may be of use towards uncovering causes of critical diseases.
Bio-medical science has been battling against many deadly diseases including cancer for many years, and grand successes have been promised but have been limited, in general. The number of humans affected by such deadly diseases is increasing day by day. Early detection and treatment using modern medical technology has been beneficial in combating the scourge of such diseases and increasing survival rates. Machine learning is an exciting area of research and practice, which has been applied successfully in bioinformatics to uncover many interesting, yet previously unknown patterns towards identification of biomarker genes for such critical diseases.

1.2 Central Dogma

Genes are the primary factors that control traits of various characteristics in an organism. These characteristics may be associated with certain diseases or normal development processes. There are two major phases associated with the pathways through which genes control characteristics of an organism. In the first phase, genetic code is transferred from genes to proteins through a phenomenon called the Central Dogma.
The Central Dogma of Molecular Biology describes the formation of a protein molecule inside a living organism, as shown in Figure 1.1. The double-stranded DNA molecule is partially unzipped and an enzyme called RNA polymerase copies the gene's nucleotides one by one into an RNA molecule, called the messenger RNA or mRNA. This process is called transcription. The mRNA is a small, single-stranded sequence of nucleotides which moves out of the nucleus. Outside the nucleus, another set of proteins reads the sequence of the mRNA and gathers free floating amino acids to fuse them into a chain. The nucleotide sequence of the mRNA determines the order in which an amino acid is incorporated into the growing protein. The process of translating the mRNA sequence into a protein sequence is called translation.
Figure 1.1:
Figure 1.1:Central Dogma: An illustration.
The Central Dogma explains the biological process that results in the flow of genetic information into proteins from the information encoded in nucleotide sequences of DNA segments or genes. A protein is a biological macromolecule associated with almost all biological processes, typically governing the traits of various phenotypic and non-phenotypic characteristics in an organism. Hence, as shown in Figure 1.2, genes are the keys that drive protein structure and all biological processes, and thus traits of various characteristics in an organism.
Figure 1.2:
Figure 1.2:Flow of control from genes to traits in an organism.

1.3 Measuring Gene Expression

The magnitude of expression of a gene depends on a number of factors, including inter-gene regulatory relationships. The expression level of a gene is a major determinant of the presence of the corresponding governed characteristics in an organism. A number of technologies developed, including revolutionary microarray and sequencing technologies, help determine the expression levels of thousands of genes in a single experiment. Gene expression data generated by these technologies provide an ample resource from which useful biological knowledge can be extracted. Computational analysis of such data can be of great use to biologists. Figure 1.3 shows the exponential growth in the quantity of gene expression data collected over a period of ten years 1. Figure 1.4 is another example, showing the growth of protein data over a period of ten years 2.
Figure 1.3:
Figure 1.3:Growth statistics of gene expression data.
Figure 1.4:
Figure 1.4:Growth statistics of protein data.
Using appropriate microarray or sequencing technology, it is possible to simultaneously examine the expression levels of thousands of genes across developmental stages, clinical conditions or time points. The real-valued gene expression data are obtained in the form of a matrix where the rows refer to the genes and the columns represent the conditions, stages or time points. Genes, which are the primary repository of biological information, help the growth and maintenance of an organism's cells. Required activities include construction and regulation of proteins as well as other molecules that determine the growth and functioning of the living organism, and ultimately to the transfer of genetic traits to the next generation.
RNA-Seq is a recent and a robust sequencing technology to measure the expression levels of nucleotide sequences corresponding to genes [147]. Determination of how nucleotides are strung together in a DNA molecule is called DNA sequencing. The term next-generation sequencing refers to a number of modern advanced high-throughput DNA sequencing techniques [293]. Pyrosequencing [14], DNA colony sequencing [216], massively parallel signature sequencing [68], illumina sequencing [418], DNA nanoball sequencing [444], and heliscope-single-molecule sequencing [334] are some examples of this family of techniques. In RNA-Seq technology, mRNA molecules are sequenced to short nucleotide base sequences. These sequences are then aligned with known nucleotide sequences corresponding to genes to determine expression levels of the genes.

1.4 Representation of Gene Expression Data

The widespread use of the technologies mentioned above has led to generation of an enormous amount of gene expression data that are witnesses to numerous biological phenomena in living organisms. Various types of gene expression data correspond to how measurements are carried out and represented. If expression levels of genes are detected in multiple samples collected from different organisms, a two-dimensional gene expression dataset is produced where rows correspond to genes and columns correspond to samples or vice versa [329]. Certain expression datasets store expression levels of genes in one or more samples at various time points. Such specialized gene expression data are called time-series gene expression data [270]. A special form of time series gene expression data contains expression levels of multiple samples at multiple time points to form a three-dimensional structure. Such data are called gene sample time (GST) expression data [269]. Figure 1.5 presents the structure of a two-dimensional gene expression dataset, whereas three-dimensional GST expression is shown in Figure 1.6. There are numerous online repositories that store and maintain ever-growing gene expression datasets. ArrayTrack [407], ArrayExpress at EBI and Gene Expression Omnibus-NCBI [66] are three very widely used repositories of gene expression data.
Figure 1.5:
Figure 1.5:2-D gene expression data.
Figure 1.6:
Figure 1.6:3-D gene expression data.
_________________________
1http:/​/​www.ncbi.nlm.nih.gov/​geo/​
2http:/​/​www.rcsb.org/​

1.5 Gene Expression Data Analysis: Applications

Gene expression data witness biological phenomena taking place in an organism and hence, they represent a raw resource from which ample biological knowledge can potentially be unearthed. Proper analysis of such data extracts information about underlying biological phenomena. Some problems that can be addressed by gene expression data analysis are briefly discussed below.
  1. (a)P...

Table of contents