Introduction to Protein Structure Prediction
Methods and Algorithms

About this book

A look at the methods and algorithms used to predict protein structure

A thorough knowledge of the function and structure of proteins is critical for the advancement of biology and the life sciences as well as the development of better drugs, higher-yield crops, and even synthetic bio-fuels. To that end, this reference sheds light on the methods used for protein structure prediction and reveals the key applications of modeled structures. This indispensable book covers the applications of modeled protein structures and unravels the relationship between pure sequence information and three-dimensional structure, which continues to be one of the greatest challenges in molecular biology.

With this resource, readers will find an all-encompassing examination of the problems, methods, tools, servers, databases, and applications of protein structure prediction and they will acquire unique insight into the future applications of the modeled protein structures. The book begins with a thorough introduction to the protein structure prediction problem and is divided into four themes: a background on structure prediction, the prediction of structural elements, tertiary structure prediction, and functional insights. Within those four sections, the following topics are covered:

  • Databases and resources that are commonly used for protein structure prediction
  • The structure prediction flagship assessment (CASP) and the Protein Structure Initiative (PSI)
  • Definitions of recurring substructures and the computational approaches used for solving sequence problems
  • Difficulties with contact map prediction and how sophisticated machine learning methods can solve those problems
  • Structure prediction methods that rely on homology modeling, threading, and fragment assembly
  • Hybrid methods that achieve high-resolution protein structures
  • Parts of the protein structure that may be conserved and used to interact with other biomolecules
  • How the loop prediction problem can be used for refinement of the modeled structures
  • The computational model that detects the differences between a protein structure and its modeled mutant

Whether working in the field of bioinformatics or molecular biology research or taking courses in protein modeling, readers will find the content in this book invaluable.

Introduction to Protein Structure Prediction, by Huzefa Rangwala and George Karypis, is available in PDF and/or ePUB format and is catalogued under Biological Sciences & Molecular Biology.

CHAPTER 1
INTRODUCTION TO PROTEIN STRUCTURE PREDICTION
HUZEFA RANGWALA
Department of Computer Science
George Mason University
Fairfax, VA
GEORGE KARYPIS
Department of Computer Science
University of Minnesota
Minneapolis, MN
Proteins have a vast influence on the molecular machinery of life. Stunningly complex networks of proteins perform innumerable functions in every living cell. Knowing the function and structure of proteins is crucial for the development of improved drugs, better crops, and even synthetic biofuels. As such, knowledge of protein structure and function leads to crucial advances in life sciences and biology.
With recent advances in large-scale sequencing technologies, we have seen exponential growth in protein sequence information. Protein structures are primarily determined using X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, but these methods are time-consuming, expensive, and not feasible for all proteins. The experimental approaches used to determine protein function (e.g., gene knockout, targeted mutation, and inhibition of gene expression) are low-throughput in nature [1,2]. As such, our ability to produce sequence information far outpaces the rate at which we can produce structural and functional information.
Consequently, researchers are increasingly reliant on computational approaches to extract useful information from experimentally determined three-dimensional (3D) structures and functions of proteins. Unraveling the relationship between pure sequence information and 3D structure and/or function remains one of the fundamental challenges in molecular biology.
Function prediction is generally approached through inheritance by homology [2]; that is, proteins with similar sequences (common evolutionary ancestry) frequently carry out similar functions. However, several studies [2–4] have shown that a stronger correlation exists between structure conservation and function (structure implies function), and that a higher correlation exists between sequence conservation and structure (sequence implies structure). Together these give the chain sequence → structure → function.
1.1. INTRODUCTION TO PROTEIN STRUCTURES
In this section we introduce the basic definitions and facts about protein structure, describe the four levels of protein structure, and provide details about protein structure databases.
1.1.1. Protein Structure Levels
Within each structural entity called a protein lies a set of recurring substructures, and within these substructures are smaller substructures still. As an example, consider hemoglobin, the oxygen-carrying molecule in human blood. Hemoglobin has four subunits (polypeptide chains) that come together to form its quaternary structure. Each subunit folds independently to form a tertiary structure. These tertiary structures are composed of multiple secondary structure elements, which in hemoglobin's case are α-helices. α-Helices (and their counterpart β-sheets) have elegant repeating patterns that depend on the sequence of amino acids.
1.1.1.1. Primary Structure.
Amino acids form the basic building blocks of proteins. An amino acid consists of a central carbon atom (Cα) attached to an amino (NH2) group, a carboxyl (COOH) group, and a side chain (R) group. The side chain group differentiates the various amino acids. In proteins, there are primarily 20 different amino acids that form the building blocks. A protein is a chain of amino acids linked by peptide bonds. Each peptide bond forms between the amino group of one amino acid and the carboxyl group of the next. This polypeptide chain of amino acids is known as the primary structure or the protein sequence.
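As a minimal sketch of how a primary structure is commonly handled in software, the snippet below treats a protein as a string over the 20 standard one-letter amino acid codes and counts the peptide bonds in a linear chain. The function names and the example peptide are hypothetical and chosen only for illustration.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def validate_sequence(sequence: str) -> str:
    """Return the upper-cased sequence, or raise if it has non-standard letters."""
    seq = sequence.strip().upper()
    invalid = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"non-standard residue codes: {invalid}")
    return seq


def peptide_bond_count(sequence: str) -> int:
    """A linear chain of n residues is joined by n - 1 peptide bonds."""
    seq = validate_sequence(sequence)
    return max(len(seq) - 1, 0)


# Hypothetical 8-residue peptide, used only as an example.
example = "MKTAYIAK"
bonds = peptide_bond_count(example)
```

Real sequence files often contain extra symbols (for example X for an unknown residue), so a production parser would need a more permissive alphabet than this sketch assumes.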
1.1.1.2. Secondary Structure.
A sequence of characters representing the secondary structure of a protein describes the general 3D form of local regions. These regions organize themselves independently from the rest of the protein into patterns of repeatedly occurring structural fragments. The most dominant local conformations of polypeptide chains are α-helices and β-sheets. These local structures have a certain regularity in their form, attributed to the hydrogen bond interactions between various residues. An α-helix has a coil-like structure, whereas a β-sheet consists of strands of residues lying side by side, in either parallel or antiparallel orientation. In addition to regular secondary structure elements, irregular shapes form an important part of the structure and function of proteins. These elements are typically termed coil regions.
Secondary structure can be divided into several types, although usually at least three classes (α-helix, coil, and β-sheet) are used. No unique method exists for assigning residues to a particular secondary structure state from atomic coordinates, although the most widely accepted protocol is based on the Dictionary of Protein Secondary Structure (DSSP) algorithm [5]. DSSP uses the following structural classes: H (α-helix), G (3₁₀-helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend), and – (other). Several other secondary structure assignment algorithms use a reduction scheme that converts this eight-state assignment down to three states by assigning H and G to the helix state (H), E and B to the strand state (E), and the rest (I, T, S, and –) to a coil state (C). This is the format generally used in structure databases.
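The eight-to-three-state reduction described above can be sketched as a simple lookup table. The function name is hypothetical, and the snippet assumes that the "other" state is written as an ASCII hyphen; actual DSSP output files may use a different character for that state.

```python
# Reduction scheme from the text: H, G -> H; E, B -> E; I, T, S, - -> C.
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H",                      # helix state
    "E": "E", "B": "E",                      # strand state
    "I": "C", "T": "C", "S": "C", "-": "C",  # coil state
}


def reduce_dssp(eight_state: str) -> str:
    """Map an eight-state DSSP string to the three-state (H/E/C) alphabet."""
    reduced = []
    for symbol in eight_state:
        if symbol not in DSSP_TO_THREE_STATE:
            raise ValueError(f"unknown DSSP symbol: {symbol!r}")
        reduced.append(DSSP_TO_THREE_STATE[symbol])
    return "".join(reduced)


# Hypothetical eight-state assignment for a 13-residue fragment.
three_state = reduce_dssp("HHHGGTT-EEBSI")
```

Other reduction schemes exist (for example, ones that treat G or B as coil), so the mapping should be stated explicitly whenever three-state results are compared.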
1.1.1.3. Tertiary Structure.
The tertiary structure of a protein is defined as its global 3D structure, represented by 3D coordinates for each atom. These tertiary structures are composed of multiple secondary structure elements, and the 3D structure is a function of the interacting side chains between the different amino acids. Hence, the linear ordering of amino acids forms secondary structure; arranging secondary structures yields tertiary structure.
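To illustrate what "represented by 3D coordinates" means in practice, the following sketch stores one coordinate triple per residue (taken here to be the Cα atom) and derives a residue-residue contact map from pairwise distances. The coordinates are made up, and the 8 Å cutoff is a commonly used convention assumed here rather than a value taken from this chapter.

```python
import math

# Hypothetical C-alpha coordinates in angstroms, one (x, y, z) tuple per residue.
ca_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (20.0, 0.0, 0.0),
]


def contact_map(coords, threshold=8.0):
    """Return an n x n boolean matrix: True where two residues lie within threshold."""
    n = len(coords)
    return [
        [math.dist(coords[i], coords[j]) <= threshold for j in range(n)]
        for i in range(n)
    ]


contacts = contact_map(ca_coords)
```

The resulting matrix is symmetric, and its diagonal is always True because every residue is at distance zero from itself.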
1.1.1.4. Quaternary Structure.
Quaternary structures represent the interaction between multiple polypeptide chains. The interaction between the various chains is largely due to non-covalent interactions between the atoms of the different chains. Examples of these interactions include hydrogen bonding, van der Waals interactions, and ionic bonding; covalent disulfide bonds can also link chains.
Research in computational structure prediction concerns itself mainly with predicting secondary and tertiary structures from known experimentally determined primary structure or sequence. This is due to the relative ease of determining primary structure and the complexity involved in quaternary structure.
1.1.2. Protein Sequence and Structure Databases
The large amount of protein sequence information, experimentally determined structure information, and structural classification information is stored in publicly available databases. In this section we review some of the databases that are used in this field, and provide their availability information in Table 1.1.
TABLE 1.1 Protein Sequence and Structure Databases
Database | Information | Availability Link
UniProt | Sequence | http://www.pir.uniprot.org/
UniRef | Cluster sequences | http://www.pir.uniprot.org/
NCBI nr | Nonredundant sequences | ftp://ftp.ncbi.nlm.nih.gov/blast/db/
PDB | Structure | http://www.rcsb.org/
SCOP | Structure classification | http://scop.mrc-lmb.cam.ac.uk/scop/
CATH | Structure classification | http://www.cathdb.info/
FSSP | Structure classification | http://www.ebi.ac.uk/dali/fssp/
ASTRAL | Compendium | http://astral.berkeley.edu/
The databases referred to in this table are among the most popular sources of protein structure-related information.
1.1.2.1. Sequence Databases.
The Universal Protein Resource (UniProt) [6] is the most comprehensive warehouse of information about protein sequences and their annotation. It is a database of protein sequences and their functions, formed by aggregating the information present in the Swiss-Prot, TrEMBL, and Protein Information Resource (PIR) databases. Version 13.2 of UniProtKB (released on April 8, 2008) consists of 5,939,836 protein sequence entries (Swiss-Prot providing 362,782 entries and TrEMBL providing 5,577,054 entries).
However, many proteins have high pairwise sequence identity, which leads to redundant information. UniProt [6] therefore provides subsets of sequences in which the sequence identity between every pair of sequences is below a predetermined threshold. Specifically, UniProt provides the UniRef100, UniRef90, and UniRef50 subsets, within which the sequence identity between any pair of sequences is less than 100%, 90%, and 50%, respectively.
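The idea of a subset in which no pair of sequences reaches a given identity threshold can be illustrated with a greedy filter. This is only a sketch under strong assumptions: it is not the procedure UniProt uses to build UniRef, the function names are hypothetical, and identity is computed position by position on equal-length sequences that are assumed to be already aligned.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (assumed pre-aligned)")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def greedy_nonredundant(sequences, threshold: float):
    """Keep a sequence only if its identity to every kept sequence is below threshold."""
    representatives = []
    for seq in sequences:
        if all(percent_identity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return representatives


# Hypothetical toy sequences: the second is 90% and the third 50% identical to the first.
toy = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFAAAAA", "WWWWWWWWWW"]
subset_90 = greedy_nonredundant(toy, 90.0)
subset_50 = greedy_nonredundant(toy, 50.0)
```

The result of a greedy filter depends on the order in which sequences are visited, which is one reason practical tools sort sequences (for example by length) before clustering.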
The National Center for Biotechnology Information (NCBI) also provides a nonredundant database of protein sequences (NCBI nr), compiled from a wide variety of sources. This database removes all exact duplicates but can still contain pairs of proteins with high sequence identity. NCBI nr version 2.2.18 (released on March 2, 2008) contains 6,441,864 protein sequences.
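Removing only exact duplicates, as opposed to filtering by an identity threshold, can be sketched by grouping identifiers under their literal sequence string. The function name, identifiers, and sequences below are hypothetical, and the snippet is an illustration of the idea rather than a description of how NCBI builds its database.

```python
def merge_exact_duplicates(records):
    """Group identifiers by exact sequence.

    records: iterable of (identifier, sequence) pairs.
    Returns a dict mapping each distinct sequence to the list of identifiers
    that carry it, in first-seen order.
    """
    merged = {}
    for identifier, sequence in records:
        merged.setdefault(sequence, []).append(identifier)
    return merged


# Hypothetical records: the first two share an identical sequence, while the
# third differs from them at a single position and is therefore kept separately.
records = [
    ("src1|P1", "MKTAYIAK"),
    ("src2|Q7", "MKTAYIAK"),
    ("src3|Z9", "MKTAYIAR"),
]
nonredundant = merge_exact_duplicates(records)
```

Here the two near-identical sequences both survive, which mirrors the point that exact-duplicate removal leaves highly similar pairs in the database.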
1.1.2.2. Protein Data Bank (PDB).
The Research Collaboratory for Structural Bioinformatics (RCSB) PDB [7] stores experimentally determined 3D structures of biological macromolecules, including nucleic acids and proteins. As of April 20, 2008, this database consists of 46,287 protein structures determined using X-ray crystallography (90%), NMR (9%), and other methods such as cryo-electron microscopy (cryo-EM). These experimental methods are time-consuming and expensive, and X-ray crystallography additionally requires the protein to crystallize.
1.1.2.3. Structure Classification Databases.
Various methods have been proposed to categorize protein structures. These methods are based on the pairwise structural similarity between protein structures, as well as on the topological and geometric arrangement of atoms and of predominant secondary-structure-like subunits. Structural Classification of Proteins (SCOP) [8], Class, Architecture, Topology, and Homologous superfamily (CATH) [9], and Families of Structurally Similar Proteins (FSSP) [10] are three widely used structure classification databases. The classification methodology involves breaking a protein chain or complex into independent folding units called domains, and then classifying these domains into a set of hierarchical classes sharing similar structural characteristics.
SCOP Database.
SCOP [8] is a manually curated database that provides a detailed and comprehensive description of the evolutionary and structural relationships between proteins whose structure is known (present in the PDB). SCOP classifies protein structures using visual inspection as well as structural comparison with a suite of automated tools. The basic unit of classification is generally a domain. SCOP classification is based on four hierarchical levels that encompass evolutionary and structural relationships [8]. In particular, proteins with a clear evolutionary relationship are classified within the same family. Generally, protein pairs within the same family have pairwise residue identities greater than 30%. Protein pairs with low sequence identity, but whose structural and functional features imply a probable common evolutionary origin, are classified within the same superfamily. Protein pairs with similar major secondary structure elements and topological arrangement of substructures (as well as favoring certain packing geometries) are classified within the same fold. Finally, protein pairs having a predominant set of secondary structures (e.g., all-α proteins) lie within the same class. These four hierarchical levels (family, superfamily, fold, and class) define the structure of the SCOP database.
Version 1.73 of the SCOP database (released on September 26, 2007) classifies 34,494 PDB entries (97,178 domains) into 1086 unique folds, 1777 unique superfamilies, and 3464 unique families.
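The four-level SCOP hierarchy lends itself to a simple comparison: given the classification of two domains, find the most specific level they share. The sketch below assumes each domain is labeled with a dotted identifier of the form class.fold.superfamily.family (for example "a.1.1.2"); the identifiers and function names are hypothetical and used only for illustration.

```python
# SCOP levels from most general to most specific.
LEVELS = ("class", "fold", "superfamily", "family")


def parse_classification(label: str):
    """Split a dotted label such as 'a.1.1.2' into its four level components."""
    parts = label.split(".")
    if len(parts) != len(LEVELS) or not all(parts):
        raise ValueError(f"expected class.fold.superfamily.family, got {label!r}")
    return tuple(parts)


def deepest_shared_level(label_a: str, label_b: str):
    """Return the most specific level two domains share, or None if classes differ."""
    shared = None
    for level, a, b in zip(LEVELS, parse_classification(label_a),
                           parse_classification(label_b)):
        if a != b:
            break
        shared = level
    return shared


# Hypothetical pair: same class, fold, and superfamily, but different families.
relation = deepest_shared_level("a.1.1.2", "a.1.1.3")
```

Pairs that share a superfamily but not a family are the remote homologs that are hardest to detect from sequence alone, which is why this distinction is often used when building benchmark sets.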
CATH Database.
The CATH database [9] is a semi-automated protein structure classification database, similar in purpose to SCOP. CATH uses a consensus of three automated classification techniques to break a chain into domains and classify them into the various structural categories [11]. Domains of proteins that are not resolved by the consensus approach are determined manually. These domains are then classified into the following hierarchical categories using manual and automated methods in conjunction.
The first level membership, class, is determined based on the secondary structure composition and packing within the structure. The second level, architecture, clusters proteins sharing the same orientation of the secondary structure elem...

Table of contents

  1. Cover
  2. WILEY SERIES ON BIOINFORMATICS: COMPUTATIONAL TECHNIQUES AND ENGINEERING
  3. Title page
  4. Copyright page
  5. PREFACE
  6. CONTRIBUTORS
  7. CHAPTER 1 INTRODUCTION TO PROTEIN STRUCTURE PREDICTION
  8. CHAPTER 2 CASP: A DRIVING FORCE IN PROTEIN STRUCTURE MODELING
  9. CHAPTER 3 THE PROTEIN STRUCTURE INITIATIVE
  10. CHAPTER 4 PREDICTION OF ONE-DIMENSIONAL STRUCTURAL PROPERTIES OF PROTEINS BY INTEGRATED NEURAL NETWORKS
  11. CHAPTER 5 LOCAL STRUCTURE ALPHABETS
  12. CHAPTER 6 SHEDDING LIGHT ON TRANSMEMBRANE TOPOLOGY
  13. CHAPTER 7 CONTACT MAP PREDICTION BY MACHINE LEARNING
  14. CHAPTER 8 A SURVEY OF REMOTE HOMOLOGY DETECTION AND FOLD RECOGNITION METHODS
  15. CHAPTER 9 INTEGRATIVE PROTEIN FOLD RECOGNITION BY ALIGNMENTS AND MACHINE LEARNING
  16. CHAPTER 10 TASSER-BASED PROTEIN STRUCTURE PREDICTION
  17. CHAPTER 11 COMPOSITE APPROACHES TO PROTEIN TERTIARY STRUCTURE PREDICTION: A CASE-STUDY BY I-TASSER
  18. CHAPTER 12 HYBRID METHODS FOR PROTEIN STRUCTURE PREDICTION
  19. CHAPTER 13 MODELING LOOPS IN PROTEIN STRUCTURES
  20. CHAPTER 14 MODEL QUALITY ASSESSMENT USING A STATISTICAL PROGRAM THAT ADOPTS A SIDE CHAIN ENVIRONMENT VIEWPOINT
  21. CHAPTER 15 MODEL QUALITY PREDICTION
  22. CHAPTER 16 LIGAND-BINDING RESIDUE PREDICTION
  23. CHAPTER 17 MODELING AND VALIDATION OF TRANSMEMBRANE PROTEIN STRUCTURES
  24. CHAPTER 18 STRUCTURE-BASED MACHINE LEARNING MODELS FOR COMPUTATIONAL MUTAGENESIS
  25. CHAPTER 19 CONFORMATIONAL SEARCH FOR THE PROTEIN NATIVE STATE
  26. CHAPTER 20 MODELING MUTATIONS IN PROTEINS USING MEDUSA AND DISCRETE MOLECULE DYNAMICS
  27. Index
  28. Color Plates