Gene Expression Data Analysis
eBook - ePub

Gene Expression Data Analysis

A Statistical and Machine Learning Perspective

Pankaj Barah, Dhruba Kumar Bhattacharyya, Jugal Kumar Kalita

Buch teilen
  1. 392 Seiten
  2. English
  3. ePUB (handyfreundlich)
  4. Über iOS und Android verfügbar
eBook - ePub

Gene Expression Data Analysis

A Statistical and Machine Learning Perspective

Pankaj Barah, Dhruba Kumar Bhattacharyya, Jugal Kumar Kalita

Angaben zum Buch
Buchvorschau
Inhaltsverzeichnis
Quellenangaben

Über dieses Buch

Development of high-throughput technologies in molecular biology during the last two decades has contributed to the production of tremendous amounts of data. Microarray and RNA sequencing are two such widely used high-throughput technologies for simultaneously monitoring the expression patterns of thousands of genes. Data produced from such experiments are voluminous (both in dimensionality and numbers of instances) and evolving in nature. Analysis of huge amounts of data toward the identification of interesting patterns that are relevant for a given biological question requires high-performance computational infrastructure as well as efficient machine learning algorithms. Cross-communication of ideas between biologists and computer scientists remains a big challenge.

Gene Expression Data Analysis: A Statistical and Machine Learning Perspective has been written with a multidisciplinary audience in mind. The book discusses gene expression data analysis from molecular biology, machine learning, and statistical perspectives. Readers will be able to acquire both theoretical and practical knowledge of methods for identifying novel patterns of high biological significance. To measure the effectiveness of such algorithms, we discuss statistical and biological performance metrics that can be used in real life or in a simulated environment. This book discusses a large number of benchmark algorithms, tools, systems, and repositories that are commonly used in analyzing gene expression data and validating results. This book will benefit students, researchers, and practitioners in biology, medicine, and computer science by enabling them to acquire in-depth knowledge in statistical and machine-learning-based methods for analyzing gene expression data.

Key Features:

  • An introduction to the Central Dogma of molecular biology and information flow in biological systems


  • A systematic overview of the methods for generating gene expression data


  • Background knowledge on statistical modeling and machine learning techniques


  • Detailed methodology of analyzing gene expression data with an example case study


  • Clustering methods for finding co-expression patterns from microarray, bulkRNA, and scRNA data


  • A large number of practical tools, systems, and repositories that are useful for computational biologists to create, analyze, and validate biologically relevant gene expression patterns


  • Suitable for multidisciplinary researchers and practitioners in computer science and biological sciences

Häufig gestellte Fragen

Wie kann ich mein Abo kündigen?
Gehe einfach zum Kontobereich in den Einstellungen und klicke auf „Abo kündigen“ – ganz einfach. Nachdem du gekündigt hast, bleibt deine Mitgliedschaft für den verbleibenden Abozeitraum, den du bereits bezahlt hast, aktiv. Mehr Informationen hier.
(Wie) Kann ich Bücher herunterladen?
Derzeit stehen all unsere auf Mobilgeräte reagierenden ePub-Bücher zum Download über die App zur Verfügung. Die meisten unserer PDFs stehen ebenfalls zum Download bereit; wir arbeiten daran, auch die übrigen PDFs zum Download anzubieten, bei denen dies aktuell noch nicht möglich ist. Weitere Informationen hier.
Welcher Unterschied besteht bei den Preisen zwischen den Aboplänen?
Mit beiden Aboplänen erhältst du vollen Zugang zur Bibliothek und allen Funktionen von Perlego. Die einzigen Unterschiede bestehen im Preis und dem Abozeitraum: Mit dem Jahresabo sparst du auf 12 Monate gerechnet im Vergleich zum Monatsabo rund 30 %.
Was ist Perlego?
Wir sind ein Online-Abodienst für Lehrbücher, bei dem du für weniger als den Preis eines einzelnen Buches pro Monat Zugang zu einer ganzen Online-Bibliothek erhältst. Mit über 1 Million Büchern zu über 1.000 verschiedenen Themen haben wir bestimmt alles, was du brauchst! Weitere Informationen hier.
Unterstützt Perlego Text-zu-Sprache?
Achte auf das Symbol zum Vorlesen in deinem nächsten Buch, um zu sehen, ob du es dir auch anhören kannst. Bei diesem Tool wird dir Text laut vorgelesen, wobei der Text beim Vorlesen auch grafisch hervorgehoben wird. Du kannst das Vorlesen jederzeit anhalten, beschleunigen und verlangsamen. Weitere Informationen hier.
Ist Gene Expression Data Analysis als Online-PDF/ePub verfügbar?
Ja, du hast Zugang zu Gene Expression Data Analysis von Pankaj Barah, Dhruba Kumar Bhattacharyya, Jugal Kumar Kalita im PDF- und/oder ePub-Format sowie zu anderen beliebten Büchern aus Computer Science & Computer Science General. Aus unserem Katalog stehen dir über 1 Million Bücher zur Verfügung.

Information

Jahr
2021
ISBN
9781000425758

Chapter 1 Introduction

DOI: 10.1201/9780429322655-1

1.1 Introduction

An exciting area of significant scientific and technological innovation of recent times is bioinformatics. The field integrates diverse disciplines, including computer science and informatics, biology, statistics, applied mathematics and artificial intelligence to provide solutions to crucial biological problems at the molecular level. With the help of machine learning techniques and statistical methods, it has become possible to organize, analyze and interpret voluminous biological data with an eye to uncovering interesting patterns of great consequence. One major area within bioinformatics is analysis of gene expression data of disparate kinds such as microarrays, gene ontologies, protein-protein interactions and various flavors of genome sequence data or combinations. The genome provides only static information whereas gene expression data analysis produced from the microarray and sequencing technologies provide dynamic information about cell function. The measurement of the activity (expression) of thousands of genes at once so as to create a global picture of cellular function is known as gene expression profiling. Analysis and interpretation of such gene expression data using appropriate machine learning or statistical methods can help extract intrinsic patterns or knowledge, which may be of use towards uncovering causes of critical diseases.
Bio-medical science has been battling against many deadly diseases including cancer for many years, and grand successes have been promised but have been limited, in general. The number of humans affected by such deadly diseases is increasing day by day. Early detection and treatment using modern medical technology has been beneficial in combating the scourge of such diseases and increasing survival rates. Machine learning is an exciting area of research and practice, which has been applied successfully in bioinformatics to uncover many interesting, yet previously unknown patterns towards identification of biomarker genes for such critical diseases.

1.2 Central Dogma

Genes are the primary factors that control traits of various characteristics in an organism. These characteristics may be associated with certain diseases or normal development processes. There are two major phases associated with the pathways through which genes control characteristics of an organism. In the first phase, genetic code is transferred from genes to proteins through a phenomenon called the Central Dogma.
The Central Dogma of Molecular Biology describes the formation of a protein molecule inside a living organism, as shown in Figure 1.1. The double-stranded DNA molecule is partially unzipped and an enzyme called RNA polymerase copies the gene's nucleotides one by one into an RNA molecule, called the messenger RNA or mRNA. This process is called transcription. The mRNA is a small, single-stranded sequence of nucleotides which moves out of the nucleus. Outside the nucleus, another set of proteins reads the sequence of the mRNA and gathers free floating amino acids to fuse them into a chain. The nucleotide sequence of the mRNA determines the order in which an amino acid is incorporated into the growing protein. The process of translating the mRNA sequence into a protein sequence is called translation.
Figure 1.1:
Figure 1.1:Central Dogma: An illustration.
The Central Dogma explains the biological process that results in the flow of genetic information into proteins from the information encoded in nucleotide sequences of DNA segments or genes. A protein is a biological macromolecule associated with almost all biological processes, typically governing the traits of various phenotypic and non-phenotypic characteristics in an organism. Hence, as shown in Figure 1.2, genes are the keys that drive protein structure and all biological processes, and thus traits of various characteristics in an organism.
Figure 1.2:
Figure 1.2:Flow of control from genes to traits in an organism.

1.3 Measuring Gene Expression

The magnitude of expression of a gene depends on a number of factors, including inter-gene regulatory relationships. The expression level of a gene is a major determinant of the presence of the corresponding governed characteristics in an organism. A number of technologies developed, including revolutionary microarray and sequencing technologies, help determine the expression levels of thousands of genes in a single experiment. Gene expression data generated by these technologies provide an ample resource from which useful biological knowledge can be extracted. Computational analysis of such data can be of great use to biologists. Figure 1.3 shows the exponential growth in the quantity of gene expression data collected over a period of ten years 1. Figure 1.4 is another example, showing the growth of protein data over a period of ten years 2.
Figure 1.3:
Figure 1.3:Growth statistics of gene expression data.
Figure 1.4:
Figure 1.4:Growth statistics of protein data.
Using appropriate microarray or sequencing technology, it is possible to simultaneously examine the expression levels of thousands of genes across developmental stages, clinical conditions or time points. The real-valued gene expression data are obtained in the form of a matrix where the rows refer to the genes and the columns represent the conditions, stages or time points. Genes, which are the primary repository of biological information, help the growth and maintenance of an organism's cells. Required activities include construction and regulation of proteins as well as other molecules that determine the growth and functioning of the living organism, and ultimately to the transfer of genetic traits to the next generation.
RNA-Seq is a recent and a robust sequencing technology to measure the expression levels of nucleotide sequences corresponding to genes [147]. Determination of how nucleotides are strung together in a DNA molecule is called DNA sequencing. The term next-generation sequencing refers to a number of modern advanced high-throughput DNA sequencing techniques [293]. Pyrosequencing [14], DNA colony sequencing [216], massively parallel signature sequencing [68], illumina sequencing [418], DNA nanoball sequencing [444], and heliscope-single-molecule sequencing [334] are some examples of this family of techniques. In RNA-Seq technology, mRNA molecules are sequenced to short nucleotide base sequences. These sequences are then aligned with known nucleotide sequences corresponding to genes to determine expression levels of the genes.

1.4 Representation of Gene Expression Data

The widespread use of the technologies mentioned above has led to generation of an enormous amount of gene expression data that are witnesses to numerous biological phenomena in living organisms. Various types of gene expression data correspond to how measurements are carried out and represented. If expression levels of genes are detected in multiple samples collected from different organisms, a two-dimensional gene expression dataset is produced where rows correspond to genes and columns correspond to samples or vice versa [329]. Certain expression datasets store expression levels of genes in one or more samples at various time points. Such specialized gene expression data are called time-series gene expression data [270]. A special form of time series gene expression data contains expression levels of multiple samples at multiple time points to form a three-dimensional structure. Such data are called gene sample time (GST) expression data [269]. Figure 1.5 presents the structure of a two-dimensional gene expression dataset, whereas three-dimensional GST expression is shown in Figure 1.6. There are numerous online repositories that store and maintain ever-growing gene expression datasets. ArrayTrack [407], ArrayExpress at EBI and Gene Expression Omnibus-NCBI [66] are three very widely used repositories of gene expression data.
Figure 1.5:
Figure 1.5:2-D gene expression data.
Figure 1.6:
Figure 1.6:3-D gene expression data.
_________________________
1http:/​/​www.ncbi.nlm.nih.gov/​geo/​
2http:/​/​www.rcsb.org/​

1.5 Gene Expression Data Analysis: Applications

Gene expression data witness biological phenomena taking place in an organism and hence, they represent a raw resource from which ample biological knowledge can potentially be unearthed. Proper analysis of such data extracts information about underlying biological phenomena. Some problems that can be addressed by gene expression data analysis are briefly discussed below.
  1. (a)P...

Inhaltsverzeichnis