CHAPTER 1
INTRODUCTION TO PROTEIN STRUCTURE PREDICTION
HUZEFA RANGWALA
Department of Computer Science
George Mason University
Fairfax, VA
GEORGE KARYPIS
Department of Computer Science
University of Minnesota
Minneapolis, MN
Proteins have a vast influence on the molecular machinery of life. Stunningly complex networks of proteins perform innumerable functions in every living cell. Knowing the function and structure of proteins is crucial for the development of improved drugs, better crops, and even synthetic biofuels. As such, knowledge of protein structure and function leads to crucial advances in life sciences and biology.
With recent advances in large-scale sequencing technologies, we have seen an exponential growth in protein sequence information. Protein structures are primarily determined using X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, but these methods are time-consuming, expensive, and not feasible for all proteins. The experimental approaches to determine protein function (e.g., gene knockout, targeted mutation, and inhibition of gene expression) are low-throughput in nature [1,2]. As such, our ability to produce sequence information far outpaces the rate at which we can produce structural and functional information.
Consequently, researchers are increasingly reliant on computational approaches to extract useful information from experimentally determined three-dimensional (3D) structures and functions of proteins. Unraveling the relationship between pure sequence information and 3D structure and/or function remains one of the fundamental challenges in molecular biology.
Function prediction is generally approached by using inheritance through homology [2], that is, proteins with similar sequences (common evolutionary ancestry) frequently carry out similar functions. However, several studies [2–4] have shown that a stronger correlation exists between structure conservation and function, that is, structure implies function, and a higher correlation exists between sequence conservation and structure, that is, sequence implies structure (sequence → structure → function).
1.1. INTRODUCTION TO PROTEIN STRUCTURES
In this section we introduce the basic definitions and facts about protein structure, describe the four levels of protein structure, and provide details about protein structure databases.
1.1.1. Protein Structure Levels
Within each structural entity called a protein lies a set of recurring substructures, and within these substructures are smaller substructures still. As an example, consider hemoglobin, the oxygen-carrying molecule in human blood. Hemoglobin has four subunits (polypeptide chains) that come together to form its quaternary structure. Each subunit folds independently to form a tertiary structure. These tertiary structures are composed of multiple secondary structure elements, which in hemoglobin’s case are α-helices. α-Helices (and their counterpart β-sheets) have elegant repeating patterns that depend on the sequence of amino acids.
1.1.1.1. Primary Structure.
Amino acids form the basic building blocks of proteins. An amino acid consists of a central carbon atom (Cα) bonded to an amino (NH2) group, a carboxyl (COOH) group, and a side chain (R) group. The side chain group differentiates the various amino acids. In the case of proteins, there are primarily 20 different amino acids that form the building blocks. A protein is a chain of amino acids linked by peptide bonds. Each peptide bond forms between the carboxyl group of one amino acid and the amino group of the next. This polypeptide chain of amino acids is known as the primary structure or the protein sequence.
1.1.1.2. Secondary Structure.
A sequence of characters representing the secondary structure of a protein describes the general 3D form of local regions. These regions organize themselves independently from the rest of the protein into patterns of repeatedly occurring structural fragments. The most dominant local conformations of polypeptide chains are α-helices and β-sheets. These local structures have a certain regularity in their form, attributed to the hydrogen bond interactions between various residues. An α-helix has a coil-like structure, whereas a β-sheet consists of strands of residues lying side by side. In addition to regular secondary structure elements, irregular shapes form an important part of the structure and function of proteins. These elements are typically termed coil regions.
Secondary structure can be divided into several types, although usually at least three classes (α-helix, coil, and β-sheet) are used. No unique method exists for assigning residues to a particular secondary structure state from atomic coordinates, although the most widely accepted protocol is based on the Dictionary of Protein Secondary Structure (DSSP) algorithm [5]. DSSP uses the following structural classes: H (α-helix), G (3₁₀-helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend), and – (other). Several other secondary structure assignment algorithms use a reduction scheme that converts this eight-state assignment down to three states by assigning H and G to the helix state (H), E and B to the strand state (E), and the rest (I, T, S, and –) to a coil state (C). This is the format generally used in structure databases.
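The eight-to-three state reduction described above can be written as a simple lookup table. The following Python fragment is a minimal sketch: the names `DSSP_TO_THREE_STATE` and `reduce_dssp` are hypothetical, and the ASCII hyphen `-` is assumed to stand for the "other" state.

```python
# Mapping from the eight DSSP states to three states, following the
# reduction in the text: H, G -> H; E, B -> E; I, T, S, - -> C.
# Assumption: the "other" state is written as an ASCII hyphen.
DSSP_TO_THREE_STATE = {
    "H": "H",
    "G": "H",
    "E": "E",
    "B": "E",
    "I": "C",
    "T": "C",
    "S": "C",
    "-": "C",
}


def reduce_dssp(eight_state: str) -> str:
    """Convert an eight-state DSSP string to a three-state (H/E/C) string."""
    try:
        return "".join(DSSP_TO_THREE_STATE[state] for state in eight_state)
    except KeyError as err:
        # Reject characters that are not one of the eight DSSP states.
        raise ValueError(f"unknown DSSP state: {err.args[0]!r}") from None


if __name__ == "__main__":
    print(reduce_dssp("HGIEBTS-"))  # prints HHCEECCC
```

Other reduction schemes exist (for example, ones that treat B or G differently), so the table should be adjusted to match whichever convention a given data set uses.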
1.1.1.3. Tertiary Structure.
The tertiary structure of the protein is defined as the global 3D structure, represented by 3D coordinates for each atom. These tertiary structures are composed of multiple secondary structure elements, and the 3D structure is a function of the interactions between the side chains of the different amino acids. Hence, the linear ordering of amino acids forms secondary structure; arranging secondary structures yields tertiary structure.
1.1.1.4. Quaternary Structure.
Quaternary structure describes the interaction between multiple polypeptide chains. The chains are held together mainly by non-covalent interactions between their atoms, such as hydrogen bonding, van der Waals interactions, and ionic bonding; covalent disulfide bonds can also link chains.
Research in computational structure prediction concerns itself mainly with predicting secondary and tertiary structures from known experimentally determined primary structure or sequence. This is due to the relative ease of determining primary structure and the complexity involved in quaternary structure.
1.1.2. Protein Sequence and Structure Databases
The large amount of protein sequence information, experimentally determined structure information, and structural classification information is stored in publicly available databases. In this section we review some of the databases that are used in this field, and provide their availability information in Table 1.1.
TABLE 1.1 Protein Sequence and Structure Databases
| Database | Information Type | Availability |
| --- | --- | --- |
| UniProt | Sequence | http://www.pir.uniprot.org/ |
| UniRef | Cluster sequences | http://www.pir.uniprot.org/ |
| NCBI nr | Nonredundant sequences | ftp://ftp.ncbi.nlm.nih.gov/blast/db/ |
| PDB | Structure | http://www.rcsb.org/ |
| SCOP | Structure classification | http://scop.mrc-lmb.cam.ac.uk/scop/ |
| CATH | Structure classification | http://www.cathdb.info/ |
| FSSP | Structure classification | http://www.ebi.ac.uk/dali/fssp/ |
| ASTRAL | Compendium | http://astral.berkeley.edu/ |
1.1.2.1. Sequence Databases.
The Universal Protein Resource (UniProt) [6] is the most comprehensive warehouse of protein sequences and their annotation. It is a database of protein sequences and their functions, formed by aggregating the information present in the Swiss-Prot, TrEMBL, and Protein Information Resource (PIR) databases. Release 13.2 of UniProtKB (April 8, 2008) consists of 5,939,836 protein sequence entries (Swiss-Prot providing 362,782 entries and TrEMBL providing 5,577,054 entries).
However, many proteins have high pairwise sequence identity and therefore contribute redundant information. To address this, UniProt [6] provides subsets of sequences in which the sequence identity between every pair of sequences is less than a predetermined threshold. In particular, UniProt contains the UniRef100, UniRef90, and UniRef50 subsets, within which the sequence identity between any pair of sequences is less than 100%, 90%, and 50%, respectively.
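The idea of keeping only sequences that fall below an identity threshold can be illustrated with a greedy filter. The sketch below is not the procedure UniProt uses to build UniRef; it is a simplified illustration with hypothetical function names, and it assumes the input sequences are already aligned to equal length so that identity can be counted position by position.

```python
from typing import Callable, List


def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions between two pre-aligned sequences.

    Simplifying assumption: both sequences have equal length (gaps written
    as '-'). Real pipelines derive identity from a computed alignment.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not a:
        return 0.0
    # Gap-gap columns are not counted as matches.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def greedy_nonredundant(
    seqs: List[str],
    threshold: float,
    identity: Callable[[str, str], float] = percent_identity,
) -> List[str]:
    """Keep a sequence only if it is below `threshold` percent identity
    to every representative kept so far (order-dependent, greedy)."""
    representatives: List[str] = []
    for seq in seqs:
        if all(identity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return representatives


if __name__ == "__main__":
    toy = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]  # made-up sequences
    print(greedy_nonredundant(toy, threshold=90.0))
```

Because the filter is greedy, the result depends on the input order; sorting sequences by length before filtering is one common way to make the choice of representatives more stable.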
The National Center for Biotechnology Information (NCBI) also provides a nonredundant (NCBI nr) database of protein sequences drawn from a wide variety of sources. This database may still contain pairs of proteins with high sequence identity, but all exact duplicates are removed. The NCBI nr version 2.2.18 (released on March 2, 2008) contains 6,441,864 protein sequences.
1.1.2.2. Protein Data Bank (PDB).
The Research Collaboratory for Structural Bioinformatics (RCSB) PDB [7] stores experimentally determined 3D structures of biological macromolecules, including nucleic acids and proteins. As of April 20, 2008 this database consists of 46,287 protein structures determined using X-ray crystallography (90%), NMR (9%), and other methods such as cryo-electron microscopy (cryo-EM). These experimental methods are time-consuming and expensive, and X-ray crystallography additionally requires the protein to crystallize.
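Structures in the PDB have traditionally been distributed in a fixed-column text format in which each `ATOM` record carries the x, y, and z coordinates of one atom. As a rough illustration of how such coordinates are used, the sketch below extracts the Cα coordinates and measures the distance between two of them. The function names are hypothetical, the toy records are made up rather than taken from a real entry, and only the column positions needed here are handled.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]


def format_atom_line(serial: int, name: str, res_name: str, chain: str,
                     res_seq: int, x: float, y: float, z: float) -> str:
    """Build a toy ATOM record using the fixed-column PDB layout
    (only the fields up to the coordinates are written)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def ca_coordinates(pdb_lines: List[str]) -> List[Coord]:
    """Extract C-alpha coordinates from PDB-format ATOM records."""
    coords: List[Coord] = []
    for line in pdb_lines:
        # Columns 13-16 hold the atom name; 31-38, 39-46, 47-54 hold x, y, z.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


if __name__ == "__main__":
    records = [
        format_atom_line(1, " N  ", "MET", "A", 1, 0.0, 0.0, 0.0),
        format_atom_line(2, " CA ", "MET", "A", 1, 1.0, 2.0, 2.0),
        format_atom_line(9, " CA ", "GLY", "A", 2, 4.0, 6.0, 2.0),
    ]
    first, second = ca_coordinates(records)
    print(math.dist(first, second))  # Euclidean distance between the two atoms
```

For real entries, an established parser is a safer choice than hand-written column slicing, since actual files contain alternate locations, multiple models, and other records that this sketch ignores.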
1.1.2.3. Structure Classification Databases.
Various methods have been proposed to categorize protein structures. These methods are based on the pairwise structural similarity between protein structures, as well as the topological and geometric arrangement of atoms and of the predominant secondary-structure-like subunits. Structural Classification of Proteins (SCOP) [8], Class, Architecture, Topology, and Homologous superfamily (CATH) [9], and Families of Structurally Similar Proteins (FSSP) [10] are three widely used structure classification databases. The classification methodology involves breaking a protein chain or complex into independent folding units called domains, and then classifying these domains into a set of hierarchical classes sharing similar structural characteristics.
SCOP Database.
SCOP [8] is a manually curated database that provides a detailed and comprehensive description of the evolutionary and structural relationships between proteins whose structure is known (present in the PDB). SCOP classifies protein structures using visual inspection as well as structural comparison with a suite of automated tools. The basic unit of classification is generally a domain. SCOP classification is based on four hierarchical levels that encompass evolutionary and structural relationships [8]. In particular, proteins with a clear evolutionary relationship are classified within the same family. Generally, protein pairs within the same family have pairwise residue identities greater than 30%. Protein pairs with low sequence identity, but whose structural and functional features suggest a probable common evolutionary origin, are classified within the same superfamily. Protein pairs with similar major secondary structure elements and topological arrangement of substructures (as well as favoring certain packing geometries) are classified within the same fold. Finally, protein pairs having a predominant set of secondary structures (e.g., all-α-helix proteins) lie within the same class. The four hierarchical levels, that is, family, superfamily, fold, and class, define the structure of the SCOP database.
The SCOP 1.73 version database (released on September 26, 2007) classifies 34,494 PDB entries (97,178 domains) into 1086 unique folds, 1777 unique superfamilies, and 3464 unique families.
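The four SCOP levels form a strict hierarchy, so the relationship between two domains can be summarized by the deepest level they share. SCOP writes a domain's classification as a dot-separated string of the form class.fold.superfamily.family; that notation is not introduced in the text above and is assumed here. The sketch below uses hypothetical function names, and the identifiers in the usage example are illustrative strings only.

```python
from typing import NamedTuple, Optional


class ScopClassification(NamedTuple):
    scop_class: str
    fold: str
    superfamily: str
    family: str


# Levels ordered from the most general to the most specific.
LEVELS = ("class", "fold", "superfamily", "family")


def parse_classification(identifier: str) -> ScopClassification:
    """Split a 'class.fold.superfamily.family' string into its four levels."""
    parts = identifier.strip().split(".")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected four dot-separated fields: {identifier!r}")
    return ScopClassification(*parts)


def deepest_shared_level(a: str, b: str) -> Optional[str]:
    """Return the most specific level two domains share, or None if they
    differ already at the class level."""
    shared = None
    for level, x, y in zip(LEVELS, parse_classification(a), parse_classification(b)):
        if x != y:
            break
        shared = level
    return shared


if __name__ == "__main__":
    print(deepest_shared_level("a.1.1.2", "a.1.1.3"))  # prints superfamily
    print(deepest_shared_level("a.1.1.2", "a.1.2.2"))  # prints fold
```

Two domains that share a superfamily but not a family correspond to the remote-homology case described above: low sequence identity, but structural and functional evidence of common ancestry.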
CATH Database.
The CATH database [9] is, like SCOP, a protein structure classification database, but it is built semi-automatically. CATH uses a consensus of three automated classification techniques to break a chain into domains and classify them into the various structural categories [11]. Domains for proteins that are not resolved by the consensus approach are determined manually. These domains are then classified into the following hierarchical categories using a combination of manual and automated methods.
The first level membership, class, is determined based on the secondary structure composition and packing within the structure. The second level, architecture, clusters proteins sharing the same orientation of the secondary structure elem...