Historians of science may disagree about when computational evolutionary genomics started in earnest. Some may associate the starting point with the work of geneticists Alfred Sturtevant and Theodosius Dobrzhansky or statistician Ronald Fisher. Others may say that genomics is incomplete without molecular-level analysis and mark the beginning of the era with the following passage from Francis Crick (1958):
Biologists should realize that before long we shall have a subject which might be called “protein taxonomy”—the study of amino acid sequences of proteins of an organism and the comparison of them between species. It can be argued that these sequences are the most delicate expression possible of the phenotype of an organism and that vast amounts of evolutionary information may be hidden away within them.
However, I believe that most people would agree that several papers published from 1962 to 1965 by Linus Pauling and Emile Zuckerkandl were extremely important. One article in particular, “Molecules as Documents of Evolutionary History” (Zuckerkandl and Pauling, 1965), set the scene for most of the future work that is described in this book. The circumstances of its publication are also of some interest: Although written in 1963, it first appeared in 1964 as a Russian translation in a monograph dedicated to Alexander Ivanovich Oparin, a true pioneer of the experimental study of abiotic protein synthesis (Oparin, 1953) who, sadly, also endorsed and helped enforce Lysenkoist pseudo-science during his service at the Soviet Academy of Sciences from the 1940s to 1960s (Lewontin and Levins, 1976; Jukes, 1997).
The research first announced in that unlikely place (the original English language version of Zuckerkandl and Pauling’s paper followed in 1965) sounds prophetic. If we outline the main ideas of that work, the density of novel ideas in that 10-page article is staggering:
1. The authors use the root “semantics” 72 times when speaking of genes and gene products. They called DNA, RNA, and proteins “semantides,” or sense-carrying units. Unlike some of the modern uses of this word, which essentially equate semantics with postmodern relativism (e.g., “let us discuss the substance and not argue about semantics”), Pauling and Zuckerkandl took semantics seriously. So should we: By definition (and as understood by their readers in the early 1960s), semantics is the study of the meaning of sense-carrying units in a language or in other code. The meaning of words—and of genes—is exactly what we want to know.
2. There are dissimilarities between even closely related sense-carrying molecules. These dissimilarities are produced by genetic processes, such as nucleotide substitutions, insertions, deletions, and rearrangements of large DNA fragments. Sense, or meaning, of genes and their products may be extracted by comparing related molecules, detecting the differences between them, and computing something about these differences.
3. Biopolymers contain information about evolution. It is threefold: (1) the time of existence of the ancestral molecule, (2) what the sequence was, and (3) the line of descent from the ancestor to each of the contemporary molecules.
4. Some sense-carrying units carry less sense than others. For example, simple biopolymers, built by repetition of a few blocks (nucleotides or amino acids), may not be a good source of information about complex evolutionary processes.
5. Changes in biopolymers may be of different types. Some of the changes are beneficial and favored by selection, whereas others have no phenotype and are “cryptic polymorphisms.” One reason why some genetic changes have no phenotype is the degeneracy of the genetic code: The same amino acid can be coded by different combinations of nucleotides. Another reason is degeneracy of protein sequence with regard to the three-dimensional structure and, ultimately, to the protein function: The same structure and function can be achieved by different combinations of amino acids. Analysis of these different solutions to the same problem may result in a better understanding of the relationships between genotype and phenotype.
6. Gene mutations and duplications of whole genes may put some genes into a “dormant” state. It is plausible that dormant genes may be reactivated after they accumulate changes, and this reactivation may be an important source of evolutionary novelty.
7. Sequences outside the protein-coding regions may have a regulatory function and may evolve differently from the coding regions. Other noncoding regions may have no function, and mutations in these regions will be free of selection.
8. Chemical compounds may be synthesized by more than one biochemical pathway. Thus, functional convergence at the molecular level is expected, both at the level of the pathways and at the level of individual biochemical reactions.
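The degeneracy of the genetic code mentioned in the list above can be made concrete with a small sketch. The Python fragment below is my own illustration, not something taken from Zuckerkandl and Pauling's paper: it assumes the standard genetic code, two in-frame coding sequences of equal length with no gaps, and a hypothetical helper name, `classify_codon_changes`. It counts how many codon differences leave the encoded amino acid unchanged (candidates for “cryptic” changes) and how many alter it.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_codon_changes(dna1: str, dna2: str) -> tuple[int, int]:
    """Count (synonymous, nonsynonymous) codon differences.

    Assumes (for illustration only) that both sequences are in frame,
    ungapped, of equal length, and contain only the letters A, C, G, T.
    """
    dna1, dna2 = dna1.upper(), dna2.upper()
    if len(dna1) != len(dna2) or len(dna1) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")

    synonymous = nonsynonymous = 0
    for i in range(0, len(dna1), 3):
        codon1, codon2 = dna1[i:i + 3], dna2[i:i + 3]
        if codon1 == codon2:
            continue  # identical codons carry no difference to classify
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            synonymous += 1      # different nucleotides, same amino acid
        else:
            nonsynonymous += 1   # the encoded amino acid changes
    return synonymous, nonsynonymous


# Met-Leu-Glu-Gly versus Met-Leu-Asp-Gly: three codons differ,
# but only one of the differences changes the protein.
print(classify_codon_changes("ATGCTTGAAGGC", "ATGCTCGATGGA"))  # (2, 1)
```

Counting whole codons is a deliberate simplification; it says nothing about the second kind of degeneracy in the list, in which different amino acids yield the same structure and function.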
Thus, the authors cast evolutionary molecular biology as information science and thought that particular attention should be given to distinguishing signals from noise in the sense-carrying units. Biologists, chemists, engineers, mathematicians, and computer scientists who work in genome analysis today are in fact implementing the research program that, unbeknownst to some of them, was started by Zuckerkandl and Pauling.
This book is no exception. Nearly every chapter addresses an issue that can be traced back to an idea set forth in Zuckerkandl and Pauling’s seminal paper. Chapters 2 discuss practical approaches to sequence comparison (point 2 as outlined previously). Evolutionary inferences from these comparisons (point 3) and the relationship between signal and noise in sequence comparison (point 4) are discussed in nearly every chapter. The issues of functional convergence (point 8) are of central importance in Chapters 6, and 9. Cryptic polymorphism (point 5) is discussed in Chapters 9 in connection with sequence-structure-function degeneracy. Finally, “what the ancestors were” (point 3) is the central theme of Chapters 11. Even Chapter 14, which deals with genome-wide numerical data, draws inspiration from approaches to comparative sequence analysis foreseen by Pauling and Zuckerkandl.
The techniques of biological sequence comparison were not discussed at any length in “Molecules as Documents of Evolutionary History,” but the central goal of finding pairs of similar sequence fragments was stated very clearly.
Sequence similarity lies at the heart of all biology, not just comparative genomics. The following statement has even been called “the first fact of biological sequence analysis” by Dan Gusfield (1997) at the University of California at Davis:
In biomolecular sequences high sequence similarity usually implies significant functional or structural similarity.
This “first fact” may qualify as one of the most fundamental facts of our understanding of life. Most biologists, however, would not hesitate to add the following:
In biomolecular sequences, high sequence similarity also usually implies evolutionary relationship.
The two statements, though similar in form, are actually distinct, and in a quite fundamental way. The structure of a biological molecule, such as a protein, is something that can be physically defined. If we have a pure sample of this protein, a quiet place for growing crystals, and a synchrotron beamline, we can determine the structure of the protein molecule, at least in principle. Technical details aside, the same equipment would generally do the job for all proteins. Indeed, as I write this, the challenges of high-throughput protein structure determination are being met by the structural genomics projects (Chandonia and Brenner, 2006). Function, however, is not a physical characteristic but, rather, a description of some process, so function can be defined only in a biological context. At the bare minimum, the function of a protein involves interactions with other molecules, which have to be identified and included in the description of function. Often, in order to define the biological function of a sequence, we need to monitor the interactions of many components in a cellular extract, in the whole cell, in a living organism, or in an ecosystem of which this organism is a part. As the protein function is performed, its structure may change. Thus, when we casually say “structure and function,” in fact we are talking about many different things already. And the fact that sequence similarity can be used to make inferences about all those different properties of a sense-carrying unit—from physical properties of the molecule to its relationships with its environment—is not at all trivial. The “second fact” is also nontrivial: Unlike more or less directly observable structural and functional properties, the common ancestor of two molecules cannot be directly observed (with the exception of rare cases in which the ancestral DNA or protein has survived in ancient proteins or in biopsies), and yet we do not hesitate to infer such an ancestor from the sequence similarity.
Thus, on the basis of sequence similarity, we make conclusions about (1) similar structure, (2) similar function, and (3) common ancestry. These inferences are at the heart of computational biology; most biologists make them every day, and almost every theme in this book is based on such inferences. But how do we make them in practice?
At first glance, the statements about structure and function seem to follow from sequence similarity quite naturally. And without doubt, these statements are amenable to direct experimental corroboration. But in fact, structural and functional inference is inseparable from evolutionary inference. Indeed, when comparing sequences of two biopolymers, our path from sequence similarity to the conclusion about structural or functional similarity is never direct. Instead, we always infer common ancestry of these sequences first, and only from there can we proceed to making structural and functional inferences. This logic is not obvious when the similarity is very high, but if the two sequences are more distantly related to each other (as is the case with most sequence comparisons today), this chain of thought becomes explicit. Indeed, we measure similarity between sequences and immediately use statistics to compare the observed similarity with what would be expected by chance (discussed in Chapter 2). If the similarity is too high to occur by chance, this is usually sufficient for making predictions about protein function (discussed in Chapters 5) and structure (see Chapter 9). But the only reason such reasoning works is that the only way for nonrandom sequence similarity to occur is by descent from a common ancestor of the two sequences. This is the homology inference (see Chapter 3). Thus, the inference of evolutionary relationship, which seems to be the least observable of all, turns out to be a prerequisite of proposing other, directly observable, relationships, such as similarity of structure and function.
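The step of comparing observed similarity with what would be expected by chance can be illustrated with a deliberately simple sketch. The Python code below is an assumption-laden toy, not the statistical treatment developed later in this book: it takes two sequences that are already aligned without gaps, measures similarity as the fraction of identical positions, and estimates a chance expectation by shuffling one of the sequences many times. The function names `identity` and `shuffle_p_value`, the number of shuffles, and the fixed random seed are all choices made here for illustration.

```python
import random


def identity(a: str, b: str) -> float:
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shuffle_p_value(a: str, b: str, n_shuffles: int = 1000, seed: int = 0):
    """Estimate how often a shuffled copy of b matches a at least as well as b does.

    Shuffling preserves the composition of b but destroys the order of its
    residues, so the shuffled scores approximate similarity "by chance".
    Returns (observed identity, empirical p-value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = identity(a, b)
    letters = list(b)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if identity(a, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimate from ever being exactly zero.
    p_value = (at_least_as_good + 1) / (n_shuffles + 1)
    return observed, p_value


if __name__ == "__main__":
    # Two made-up peptide sequences differing at two positions.
    seq_a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    seq_b = "MKTSYIAKQRQISFVKSHFSKQLEERLGLIEVQ"
    obs, p = shuffle_p_value(seq_a, seq_b)
    print(f"identity = {obs:.3f}, empirical p-value = {p:.4f}")
```

A small p-value says only that the similarity is unlikely to be a product of chance; the step from there to common ancestry, and then to shared structure and function, is the inference described in the paragraph above.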
Consider the alignment of three sequences, A′, A″, and A′″ (here and elsewhere in this book, I use capital letters in regular font to indicate genes and italicized capitals to indicate species in which these genes are found). Suppose that the three sequences come from three different species, one from each, and only the function of A′ has been studied. Suppose that A′ and A″ are almost identical, and the third sequence, A′″, is less similar but still quite close to A′ and A″. Do we use the same information to infer common ancestry and common function of all these sequences? It seems that we do not really need every amino acid residue that is conserved between A′ and A″ to determine that they share a common ancestor; for example, we may not care about the sites conserved exclusively between A′ and A...