1.1 Introduction
Human genome informatics is the application of information theory, including computer science and statistics, to the field of human genomics. Informatics enlists computation to augment our capacity to form models of reality with diverse sources of information. When forming a model of reality, one engages in a process of abstraction. The word "abstraction" comes from the Latin abstrahere, meaning to "draw away": a metaphor, rooted in human vision, that as we back away from something, the details fall away and we form mental constructs about what we can discern from the more distant vantage point. That more distant vantage point encompasses a greater portion of reality, yet holds in mind a smaller amount of detail about that larger space.
Given the human mind's limit on the number of variables it can manage, as we form our mental models of reality, we pay attention to certain facets of reality and ignore others, perhaps leaving them to subconscious or unconscious processing mechanisms. When we form models of reality, we have a field of perception that encompasses a subset of reality at a particular scale and a particular time horizon and that includes a subset of the variables at that spatio-temporal scale. Those variables are recursively composed using abstractive processes, for instance, by scale: an atom, a base pair, a gene, a chromosome, a strand of DNA, the nucleus, a cell, a tissue, an organ, an organ system, the human body, a family, a racial group defined by geography and heredity, or all of humanity. Note this abstraction sequence was only spatial and ignored time. Because our perceivable universe is seen through the lens of three spatial and one apparently nonreversible temporal dimension, the mental models we compose describe the transformations of matter-energy forwards through space-time. Let us relate this to information theory and computer science, then bring it back to genomics.
In the 1930s, Alan Turing introduced an abstract model of computation, called the Turing machine (Turing, 1937). The machine consists of an infinite linear blank tape with a tape head that can read, write, or erase only the current symbol and can move one space to the left or right or remain stationary. This tape head is directed by a controller that contains a finite set of states and the rules for operating the tape head, based only on the current state and the current symbol on the tape (the algorithm, or program). Despite the simplicity of this model, it turns out that it can represent the full power of every algorithm that a computer can perform and is thus a universal model of computation.
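To show how little machinery the model requires, the controller can be viewed as a lookup table mapping (state, symbol) pairs to actions. The simulator and the bit-flipping rule set below are hypothetical illustrations, not drawn from Turing's paper:

```python
# Minimal Turing-machine sketch: the controller is a rule table mapping
# (state, symbol) -> (symbol to write, head move, next state).
def run_turing_machine(rules, tape, state="start", max_steps=10_000):
    cells = dict(enumerate(tape))  # sparse tape over an infinite blank background
    pos = 0
    for _ in range(max_steps):
        symbol = cells.get(pos, "_")            # "_" stands for a blank cell
        if (state, symbol) not in rules:
            break                               # halt when no rule applies
        write, move, state = rules[(state, symbol)]
        cells[pos] = write
        pos += {"L": -1, "R": 1, "S": 0}[move]  # left, right, or stationary
    return "".join(cells[i] for i in sorted(cells))

# Example rule set: flip every bit, moving right, until the first blank.
flip_bits = {("start", "0"): ("1", "R", "start"),
             ("start", "1"): ("0", "R", "start")}
```

Running `run_turing_machine(flip_bits, "0110")` yields `"1001"`; any richer behavior comes only from a larger rule table, which is precisely the point of the model's universality.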
Suppose we wanted an algorithm to write down the first billion digits of the irrational number π. We could create a Turing machine that had the billion digits embedded in the finite controller (the program), and we could run that program to write the digits to the tape one at a time. In this case, the length of the program would be proportional to the billion digits of output. This might be coded in a language like C++ as: printf("3.1415926[…]7,504,551"), with "[…]" filled in with the remaining digits. If a billion-digit number were truly random and had no regularity, this would approach being the shortest program that we could write (the information-theoretic definition of randomness). However, π is not a random number, but can be computed to an arbitrary number of digits via a truncated infinite series. An algorithm to perform a series approximation of π could thus be represented as a much shorter set of instructions.
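To make the contrast concrete, here is a short Python sketch (an illustrative implementation, not from the text) that computes π to any requested number of digits using Machin's arctangent formula, a truncated infinite series of the kind described above. The program stays a few hundred bytes long no matter how many digits it emits, unlike the billion-digit printf:

```python
def arctan_inv(x, prec):
    # Gregory series for arctan(1/x), scaled to an integer by 10**prec:
    # 1/x - 1/(3*x**3) + 1/(5*x**5) - ...
    one = 10 ** prec
    power = one // x
    total = power
    x_sq, n, sign = x * x, 3, -1
    while power > 0:
        power //= x_sq
        total += sign * (power // n)
        n += 2
        sign = -sign
    return total

def pi_digits(digits):
    # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239),
    # computed with 10 guard digits to absorb truncation error.
    prec = digits + 10
    pi_scaled = 4 * (4 * arctan_inv(5, prec) - arctan_inv(239, prec))
    return str(pi_scaled)[:digits]
```

For example, `pi_digits(15)` returns the string "314159265358979"; the same short program, run longer, yields as many digits as desired.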
In algorithmic information theory, the Kolmogorov complexity or descriptive complexity of a string is the length of the shortest Turing machine instruction set (i.e., shortest computer program) that can produce that string (Kolmogorov, 1963). We can think of the problem of modeling a subset of reality as generating a parsimonious algorithm that prints out a representation of the trajectory of a set of variables representing an abstraction of that subset of reality to some level of approximation. That is, we say, "under such and such conditions, thus and such will happen over a prescribed time period". The idea of Kolmogorov complexity motivates the use of Occam's razor, where, given two alternate explanations of reality that explain it comparably well, we will choose the simpler one.
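Kolmogorov complexity itself is uncomputable, but compressed length is a crude, computable stand-in that makes the idea tangible. In the hypothetical Python sketch below, a highly regular string compresses to a tiny fraction of its size, while (with overwhelming probability) random bytes barely compress at all:

```python
import os
import zlib

structured = b"AB" * 5_000       # 10,000 bytes with obvious regularity
random_ish = os.urandom(10_000)  # 10,000 bytes with (almost surely) none

# Compressed length approximates descriptive complexity from above.
len_structured = len(zlib.compress(structured, 9))
len_random = len(zlib.compress(random_ish, 9))

# The regular string admits a far shorter description than the random one.
assert len_structured < len_random
```

The regular string compresses to a few dozen bytes (a short "program" for reproducing it), while the random bytes require essentially their full length, echoing the information-theoretic definition of randomness above.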
In our modeling of reality, we are not generally trying to express the state space transitions of the universe down to the level of every individual atom or quark in time intervals measured by Planck time units, but rather at some level of abstraction that is useful with respect to the outcomes we value in a particular context. Also, because reality has constraints (i.e., laws), and thus regularity, we can, from observing a small spatio-temporal subset of reality, form models that not only describe that observed behavior, but also generalize to predict the behavior of a broader subset of reality. That is, we don't just model specific concrete observables in the here and now, but we model abstract notions of observables that can be applied beyond the here and now.
The most powerful models are the most universal, such as laws of physics, which are hypothesized to hold over all of reality and can thus be falsified if any part of reality fails to behave according to those laws, and yet, cannot be proven because all reality would have to be observed over all time. This then forms the basis of the scientific method where we form and falsify hypotheses but can never prove them. Unlike with hydrogen atoms or billiard balls where the units of observation may be considered in most contexts as near-identical, when we operate on abstractions such as cells, or people, we create units of observation that may have enormous differences.
1.2 From Informatics to Bioinformatics and Genome Informatics
In biology, we often blithely assume that the notion of ceteris paribus (all things being equal) holds, but it can lead us astray (Lambert and Black, 2012; Meehl, 1990). For instance, while genetics exists at a scale where ceteris paribus generally holds, we are nevertheless trying to draw relations with genetic variations at the molecular scale, with fuzzy phenotypes at the level of populations of nonidentical people.
So unlike our previous example of writing a program to generate the first billion digits of π, which has a very precise answer, our use of abstraction to model biology involves leaving out variables of small effect, which nevertheless, when left unaccounted for, may result in error when we extrapolate our projections of the future with abstract models. We would do well to mind the words of George Box, "all models are wrong, but some are useful":
Since all models are wrong, the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary, following William of Occam, he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist, overelaboration and overparameterization is often the mark of mediocrity (Box, 1976).
How then do we choose what variables to study at what level of abstraction over what time scale? To begin to answer this question, it is useful to talk about control in the context of goal-directedness and to turn to a field that preceded and contributed to the development of computer science, namely Cybernetics. In 1958, Ross Ashby introduced the Law of Requisite Variety (Ashby, 1958). Variety is measured as the logarithm of the number of states available to a system. Control, when stripped of its negative connotations of coercion, can be defined as restricting the variety of a system to a subset of states that are valued and preventing the other states from being visited. For instance, an organism will seek to restrict its state space to healthy and alive ones. For every disturbance that can move a system from its current state to an undesirable one, the system must have a means of acting upon or regulating that disturbance. Ashby's example of a fencer staving off attack is helpful:
Again, if a fencer faces an opponent who has various modes of attack available, the fencer must be provided with at least an equal number of modes of defense if the outcome is to have the single value: attack parried.
(Ashby, 1958)
The law of requisite variety says that "variety absorbs variety," and thus that the number of states of the regulator or control mechanism whose job is to keep a system in desirable states (i.e., absorb or reduce the variety of outcomes) must be at least as large as the number of disturbances that could put the system in an undesirable state. All organisms engage in goal-directed activity, the primary one being sustaining existence or survival. The fact that humanity has dominated as a species reflects our capacity to control our environment: to both absorb and enlist the variety of our environment in the service of sustaining health and life.
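In counting terms, the law says that a regulator with R available responses facing D equally likely disturbances cannot force fewer than ceil(D / R) distinct outcomes. The small sketch below (a hypothetical illustration, not Ashby's own notation) applies this to the fencer example:

```python
import math

def min_outcome_states(n_disturbances, n_regulator_states):
    # Ashby's law in count form: outcome variety is at least the
    # ceiling of disturbances divided by available regulator responses.
    return math.ceil(n_disturbances / n_regulator_states)

# A fencer with 8 defenses against 8 modes of attack can force the single
# outcome "attack parried"; with only 4 defenses, at least 2 outcomes
# (some of them unparried attacks) remain possible.
assert min_outcome_states(8, 8) == 1
assert min_outcome_states(8, 4) == 2
```

Taking logarithms of these counts recovers the statement in terms of variety: the variety of outcomes is at least the variety of disturbances minus the variety of the regulator.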
In computing, a universal Turing machine is a Turing machine that can simulate any Turing machine on arbitrary input. If DNA is the computer program for the "Turing machine of life," the field of human genome informatics is metaphorically moving towards the goal of a universal Turing machine that can answer "what-if" questions about modifying the governing variables of life. Note, the computer science concept of self-modifying code also enriches this metaphor. In particular, cancer genomics addresses the situation where the DNA program goes haywire, creating cancer cells with distorted copies where portions of the genome are deleted, copied extra times, and/or rearranged. Self-modifying code in computer science is enormously difficult to debug and is usually discouraged. Similarly, in cancer, we acknowledge that it is too difficult to repair rapidly replicating agents of chaos, and thus, most treatments involve killing or removing the offending cancer cells. Also, with the advent of emerging technologies such as CRISPR genome editing, humanity is now poised on the threshold of directly modifying our genome (Cong et al., 2013). Such technologies, guided by understanding of the genome, have the potential to recode portions of the program of life in order to cure genetic diseases.
With the human genome having a state space of three billion base pairs times two sets of chromosomes, compounded by ...