Structural Bioinformatics

Jenny Gu, Philip E. Bourne

About This Book

Structural Bioinformatics was the first major effort to show the application of the principles and basic knowledge of the larger field of bioinformatics to questions focusing on macromolecular structure, such as the prediction of protein structure and how proteins carry out cellular functions, and how the application of bioinformatics to these life science issues can improve healthcare by accelerating drug discovery and development. Designed primarily as a reference, the first edition nevertheless saw widespread use as a textbook in graduate and undergraduate university courses dealing with the theories and associated algorithms, resources, and tools used in the analysis, prediction, and theoretical underpinnings of DNA, RNA, and proteins.

This new edition contains not only thorough updates of the advances in structural bioinformatics since publication of the first edition, but also features eleven new chapters dealing with frontier areas of high scientific impact, including: sampling and search techniques; use of mass spectrometry; genome functional annotation; and much more.

Offering detailed coverage for practitioners while remaining accessible to the novice, Structural Bioinformatics, Second Edition is a valuable resource and an excellent textbook for a range of readers in the bioinformatics and advanced biology fields.

Praise for the previous edition:

"This book is a gold mine of fundamental and practical information in an area not previously well represented in book form."
—Biochemistry and Molecular Education

"... destined to become a classic reference work for workers at all levels in structural bioinformatics...recommended with great enthusiasm for educators, researchers, and graduate students."
—BAMBED

"...a useful and timely summary of a rapidly expanding field."
—Nature Structural Biology

"...a terrific job in this timely creation of a compilation of articles that appropriately addresses this issue."
—Briefings in Bioinformatics

Information

Year: 2011
ISBN: 9781118210567
Edition: 2
Section I
DATA COLLECTION, ANALYSIS, AND VISUALIZATION
1
DEFINING BIOINFORMATICS AND STRUCTURAL BIOINFORMATICS
Russ B. Altman and Jonathan M. Dugan
WHAT IS BIOINFORMATICS?
The precise definition of bioinformatics is a matter of debate. Some define it narrowly as the development of databases to store and manipulate genomic information. Others define it broadly as encompassing all of computational biology. Based on its current use in the scientific literature, bioinformatics can be defined as the study of two information flows in molecular biology (Altman, 1998). The first information flow is based on the central dogma of molecular biology: DNA sequences are transcribed into mRNA sequences; mRNA sequences are translated into protein sequences; and protein sequences fold into three-dimensional structures that have functions. These functions are selected, in a Darwinian sense, by the environment of the organism, which drives the evolution of the DNA sequence within a population. The first class of bioinformatics applications, then, can address the transfer of information at any stage in the central dogma, including the organization and control of genes in the DNA sequence, the identification of transcriptional units in DNA, the prediction of protein structure from sequence, and the analysis of molecular function. These applications include the emergence of system-wide analyses of biological phenomena, now called systems biology. Systems biology aims to achieve quantitative understanding not only of the individual players in a biological system but also of the properties of the system itself that emerge from the interaction of all its parts. This class of applications also includes the new field of metagenomics, in which entire ecosystems of interacting organisms are studied. In the same way that systems biology studies how the molecular entities in a cell combine to make the cell work, metagenomics studies how the individual organisms within an ecological system combine to create that ecology. The initial forays into metagenomics are based on high-throughput sequencing not of individual species (which generally cannot be isolated) but of the mixture of species that create an ecosystem.
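The first two steps of the central dogma described above (transcription and translation) can be illustrated with a minimal Python sketch. The function names `transcribe` and `translate` are hypothetical, the input is assumed to be the coding (sense) strand of the DNA, and the codon table is the standard genetic code written compactly in TCAG order. Folding, the third step, is far harder and is not attempted here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (assumption: sense strand given)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA from its first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed on DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTGGTAA")
    print(mrna, translate(mrna))
```

The sketch ignores reading-frame detection, introns, and alternative genetic codes; it is only meant to make the direction of information flow concrete.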
The second information flow is based on the scientific method: we create hypotheses regarding biological activity, design experiments to test these hypotheses, evaluate the resulting data for compatibility with the hypotheses, and extend or modify the hypotheses in response to the data. The second class of bioinformatics applications addresses the transfer of information within this protocol, including systems that generate hypotheses, design experiments, store and organize the data from these experiments in databases, test the compatibility of the data with models, and modify hypotheses. The emergence of, and emphasis on, systems-level modeling and interactions in both systems biology and metagenomics create major new challenges for our field.
The explosion of interest in bioinformatics has been driven by the emergence of experimental techniques that generate data in a high-throughput fashion, such as high-throughput DNA sequencing, mass spectrometry, or microarray expression analysis (Miranker, 2000; Altman and Raychaudhuri, 2001; The Genome International Sequencing Consortium, 2001; Venter et al., 2001). Bioinformatics depends on the availability of large data sets that are too complex to allow manual analysis. The rapid increase in the number of three-dimensional macromolecular structures available in databases such as the Protein Data Bank (PDB; Chapter 11; Berman et al., 2000) has driven the emergence of a subdiscipline of bioinformatics: structural bioinformatics. Structural bioinformatics is the subdiscipline of bioinformatics that focuses on the representation, storage, retrieval, analysis, and display of structural information at the atomic and subcellular spatial scales.
Structural bioinformatics, like many other subdisciplines within bioinformatics, is characterized by two goals: the creation of general purpose methods for manipulating information about biological macromolecules and the application of these methods to solve problems in biology and create new knowledge. These two goals are intricately linked because part of the validation of new methods involves their successful use in solving real problems. At the same time, the current challenges in biology demand the development of new methods that can handle the volume of data now available and the complexity of models that scientists must create to explain these data.
Structural Bioinformatics Has Been Catalyzed by Large Amounts of Data
Biology has attracted computational scientists over the past 30 years in two distinct ways. First, the increasing availability of sequence data has been a magnet for those with an interest in string analysis, algorithms, and probabilistic models (Gusfield, 1997; Durbin et al., 1998). The major accomplishments have been the development of algorithms for pair-wise sequence alignment, multiple alignment, the definition and discovery of sequence motifs, and the use of probabilistic models, such as hidden Markov models to find genes (Burge and Karlin, 1997), align sequences (Hughey and Krogh, 1996), and summarize protein families (Bateman et al., 2000). Second, the increasing availability of structural data has been a magnet for those with an interest in computational geometry, computer graphics, and algorithms for analyzing crystallographic data (Chapter 4) and NMR data (Chapter 5) to create credible molecular models. Structural bioinformatics has its roots in this second group. The development of molecular graphics was one of the first applications of computer graphics (Langridge and Gomatos, 1963). The elucidation of the structure of DNA in the mid-1950s and the publication of the first protein crystal structures in the early 1960s created a demand for computerized methods for examining these complex molecules. At the same time, the need for computational algorithms to deconvolute X-ray crystallographic data and fit the resulting electron densities to the more manageable ball-and-stick models created a cadre of structural biologists who were very well versed in computational technologies. The challenges of interpreting NMR-derived distance constraints into three-dimensional structures further introduced computational technologies to biological structure. 
As the number of three-dimensional structures increased, the need to create methods for storing and disseminating these data led to the creation of the PDB, one of the earliest scientific databases. In the past 10 years, we have seen a third wave of interest in biological problems from a group that was not engaged by the availability of 1D sequence data or 3D structural data. This third wave has arisen in response to the increased availability of RNA expression data and has captured the interest of computational scientists with an interest in statistical analysis and machine learning, particularly in clustering methodologies and classification techniques. The problems posed by these data are different from those seen in both sequence and structural analysis. The recent introduction of high-throughput DNA sequencing technologies that produce short snippets of DNA sequence (25-50 bases long) is re-energizing the sequence analysis community with new challenges.
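Pairwise sequence alignment, listed above among the major accomplishments of the sequence-analysis community, can be sketched with the classic Needleman-Wunsch dynamic programming scheme for global alignment. The function name and the scoring values below (match +1, mismatch -1, linear gap penalty -1) are illustrative assumptions, not parameters taken from the text; practical tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions (simple linear gap penalty).
    """
    n, m = len(a), len(b)

    def pair_score(i, j):
        # Score for aligning a[i-1] with b[j-1] (1-based matrix indices).
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two residues
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("GATTACA", "GATACA"))
```

The table has (n+1) x (m+1) entries, so time and memory grow with the product of the sequence lengths, which is one reason very large numbers of short reads pose new algorithmic challenges.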
Structural bioinformatics is now in a renaissance with the success of the genome sequencing projects, the emergence of high-throughput methods for expression analysis, and identification of compounds via mass spectrometry. There are now organized efforts in structural genomics (Chapter 40) to collect and analyze macromolecular structures in a high-throughput manner (Teichmann, Chothia, and Gerstein, 1999; Teichmann, Murzin, and Chothia, 2001). These efforts include challenges in the selection of molecules to study, the robotic preparation and manipulation of samples to find crystallization conditions, the analysis of X-ray diffraction data, and the annotation of these structures as they are stored in databases (Section II). In addition, there have been advancements in the capabilities of NMR structure determination, which previously could only study proteins in a limited range of sizes. The solution of the malate synthase G complex from E. coli with 731 residues has pushed the frontier for NMR spectroscopy and suggests that NMR is having its own renaissance (Tugarinov et al., 2005). The PDB now has a critical mass of structures that allows (indeed requires!) statistical analysis to learn the rules of how active and binding sites are constructed, which in turn allows us to develop knowledge-based methods for the prediction of structure and function. Finally, the emergence of this structural information, when linked to the increasing amount of genomic information and expression data, provides opportunities for linking structural information to other data sources to understand how cellular pathways and processes work at a molecular level.
Toward a High-Resolution Understanding of Biology
The great promise of structural bioinformatics is predicated on the belief that the availability of high-resolution structural information about biological systems will allow us to precisely reason about the function of these systems and the effects of modifications or perturbations. The genetic analyses can only associate genetic sequences with their functional consequences, whereas the structural biological analyses offer the additional promise of ultimate insight into the mechanisms of these consequences, and therefore a more profound understanding of how biological function follows from the structure. The promise for structural bioinformatics lies in four areas: (1) creating an infrastructure for building up structural models from component parts, (2) gaining the ability to understand the design principles of proteins, so that new functionalities can be created, (3) learning how to design drugs efficiently based on structural knowledge of their target, and (4) catalyzing the development of simulation models that can give insight into function based on structural simulations. Each of these areas has already seen success, and the structural genomics projects promise to create data sets sufficient to catalyze accelerated progress in all these areas.
With respect to creating an infrastructure for modeling larger structural ensembles, we are already seeing the emergence of a new generation of structures larger by an order of magnitude than the structures submitted to the PDB a few years ago. Some achievements in recent years include (1) the elucidation of the structure of the bacterial ribosome (with more than 250,000 atoms) (Ban et al., 2000; Clemons Jr et al., 2001; Yusupov et al., 2001), (2) the publication of the RNA polymerase structure (with about 500,000 atoms) (Cramer et al., 2000), and (3) the increased ability to solve the structure of membrane proteins (transporters and receptors, in particular) that have proven technically difficult in the past. Each of these allows us to examine the principles of how a large number of component protein and nucleic acid structures can assemble to create macromolecular machines. With these successes, we can now target numerous other cellular ensembles for structural studies.
The design principles of proteins are now in reach both because we have a large “training set” of example proteins to study and because methods for structure prediction are beginning to allow us to identify structures that are unlikely to be stable. There have been preliminary successes in the design of four-helix bundle proteins (DeGrado, Regan, and Ho, 1987) and in the engineering of TIM barrels (Silverman, Balakrishnan, and Harbury, 2001). There has been interesting work in “reverse folding” in which a set of amino acid side chains is collected to stabilize a desired protein backbone conformation (Koehl and Levitt, 1999).
Rational drug design has not been the primary way for discovering major therapeutics (Chapters 27, 34 and 35). However, recent successes in this area give reason to expect that drug discovery projects will increasingly be structure based. One of the most famous examples of rational drug design was the creation of HIV protease inhibitors based on the known three-dimensional crystal structure (Kempf, 1994; Vacca, 1994). Methods for matching combinatorial libraries of chemicals against protein binding sites have matured and are in routine use at most pharmaceutical companies.
The simulation of biological macromolecular dynamics dates almost as far back as the elucidation of the first protein structure (Doniach and Eastman, 1999). These simulations are based on the integration of classical equations of motion and computation of electrostatic forces between atoms in a molecule. Methods for simulation now routinely include water molecules and are able to remain stable (the molecule does not fall apart) and reproduce experimental measurements with some fidelity. The simulation of larger ensembles and structural variants (such as those based on known genetic variations in sequence) should lead to a more profound understanding of how structural properties produce functional behavior. The NIH has recognized the importance of simulation and created a national center devoted to physics-based simulation of biological structure (SIMBIOS, http://simbios.stanford.edu/).
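The phrase "integration of classical equations of motion" can be made concrete with a toy example. The sketch below integrates Newton's equation for a single particle in one dimension using the velocity Verlet scheme, a common choice in molecular dynamics. As a simplifying assumption, the force is a harmonic (spring-like) restoring force rather than the electrostatic forces mentioned above, and all names, units, and parameter values are hypothetical.

```python
def harmonic_force(x, k=1.0, x0=0.0):
    """Restoring force of a spring with stiffness k and rest position x0 (toy model)."""
    return -k * (x - x0)


def velocity_verlet(x, v, force, mass=1.0, dt=0.01, steps=1000):
    """Integrate m * d2x/dt2 = force(x); return the list of (x, v) states."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory


def total_energy(x, v, mass=1.0, k=1.0, x0=0.0):
    """Kinetic plus harmonic potential energy, used to check stability."""
    return 0.5 * mass * v * v + 0.5 * k * (x - x0) ** 2


if __name__ == "__main__":
    states = velocity_verlet(x=1.0, v=0.0, force=harmonic_force)
    energies = [total_energy(x, v) for x, v in states]
    print(min(energies), max(energies))
```

Monitoring the total energy is a simple way to see what "remaining stable" means for an integrator: with a sufficiently small time step the energy fluctuates within a narrow band instead of drifting.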
Special Challenges in Computing with Structural Data
Structural bioinformatics must overcome some special challenges that are either not present or not dominant in other types of bioinformatics domains (such as the analysis of sequence or microarray data). It is important to remember these challenges when assessing the opportunities in the field. They include the following:
  • Structural data are not linear and therefore not easily amenable to algorithms based on strings. In addition to this obvious nonlinearity, there are nonlinear relationships between atoms (the forces are not linear). This means that most computations on structure need to either make approximations or be very expensive.
  • The search space for most structural problems is continuous. Structures are represented generally by atomic Cartesian coordinates (or internal angular coordinates) that are continuous variables. Thus, there are infinite search spaces for algorithms attempting to assign atomic coordinate values. Many simplifications can be applied (such as lattice models for 3D structure; Hinds and Levitt, 1994), but these are attempts to manage the inherent continuous nature of these problems.
  • There is a fundamental connection between molecular structure and physics. While this statement seems obvious and trivial, it means that when reduced representations, such as pseudoatoms (Wuthrich, Billeter, and Braun, 1983) or lattice models are applied, they become more difficult to relate to the underlying physics that governs the interactions. The need to keep structural calculations physically reasonable is an important constraint.
  • Reasoning about structure requires visualization. As mentioned above, the creation of computer graphics was driven, in part, by the need of structural biologists to look at molecules (Chapter 9). This is both a benefit and a detriment; structure is well defined, and well-designed visualizations can provide insight into structural problems. However, graphical displays have a human user as a target and are not easily parsed or understood by computers, and thus represent something of a computational “dead end.” Expressive data structures underlying these visualizations are therefore needed: they allow the information to be understood and analyzed by computer programs and thus open the possibility of further downstream analysis.
  • Structural data, like all biological data, can be noisy and imperfect. Despite some amazing successes in the elucidation of very high-resolution structures, the precision of our knowledge about many structures is likely to be limited by their flexibility, dynamics, or experimental noise (Chapters 14, 15, 37, and 38). Understanding a protein’s structural disorder may be critical for understanding its function. Thus, we must be comfortable in reasoning about structures for which we only have partial knowledge.
  • Protein and nucleic acid structures are generally conserved more than their associated sequence. Thus, sequences will accumulate mutations over time that may make identification of their similarities more difficult, while their structures may remain essentially identical. This is a challenge because sequence information is still much more abundant than structural information, and so for many molecules it is the sequence information that is readily available. The need to identify distant sequential similarities to gain structural insight...
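The list above notes that structures are generally represented by atomic Cartesian coordinates or by internal angular coordinates, both of which are continuous. As a small illustration of the relationship between the two, the following Python sketch computes one internal coordinate, a dihedral (torsion) angle, from the Cartesian positions of four atoms. The function names and the sign convention are assumptions of this sketch, and the points in the usage example are made-up coordinates, not data from any real structure.

```python
import math


def sub(p, q):
    """Component-wise difference p - q of two 3D points."""
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def cross(u, v):
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points, in the range [-180, 180]."""
    b0 = sub(p0, p1)          # bond pointing back from the central axis
    b1 = sub(p2, p1)          # central bond (rotation axis)
    b2 = sub(p3, p2)          # bond pointing forward from the central axis
    norm = math.sqrt(dot(b1, b1))
    axis = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    d0, d2 = dot(b0, axis), dot(b2, axis)
    v = (b0[0] - d0 * axis[0], b0[1] - d0 * axis[1], b0[2] - d0 * axis[2])
    w = (b2[0] - d2 * axis[0], b2[1] - d2 * axis[1], b2[2] - d2 * axis[2])
    x = dot(v, w)
    y = dot(cross(axis, v), w)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Hypothetical coordinates chosen so the answer is easy to check by eye.
    print(dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)))
```

Because the angle depends only on relative positions, translating or rotating all four points together leaves it unchanged, which is one reason internal coordinates are convenient for describing conformation.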
