Nicole T. Perna, Department of Animal Health and Biomedical Sciences, University of Wisconsin–Madison, Madison, Wisconsin
Jeremy D. Glasner, Department of Genetics, University of Wisconsin–Madison, Madison, Wisconsin
Valerie Burland, Department of Genetics, University of Wisconsin–Madison, Madison, Wisconsin
Guy Plunkett, III, Department of Genetics, University of Wisconsin–Madison, Madison, Wisconsin
INTRODUCTION
Escherichia coli K-12 holds a special place in the hearts and laboratories of experimental molecular biologists and microbiologists. It is perhaps not surprising that this model organism–laboratory reagent–industrial workhorse was among the first microorganisms targeted for genome sequencing. Given the sheer volume of biologic research targeted at or using E. coli K-12, what is perhaps more surprising is how much has been learned through relatively simple analysis of the genome sequence, and how many questions that analysis has raised. The sequence makes clear that much remains unknown about what features are encoded within this genome, how networks of regulation orchestrate the genes to produce a dynamic living organism, and how the E. coli K-12 genome relates to those of the phenotypically diverse group of organisms we know as E. coli. Fortunately, many of these questions are already under investigation: genome-scale approaches are being used to ascertain function, comparative genomics is revealing evolutionary and population history, and laboratories worldwide are using the genome sequence as a resource to accelerate progress on more directed projects. Here we attempt to provide an overview of the history, current state, and future of E. coli genomics.
HISTORY OF E. COLI GENOMICS
Currently, there are published and publicly available E. coli genome sequences for E. coli K-12 [1], a benign laboratory model organism, and E. coli O157:H7 [2,3], an enterohemorrhagic pathogen. Additional genome sequencing projects are ongoing at the University of Wisconsin addressing E. coli strains that exhibit distinct phenotypes, at least with respect to their potential to cause human diseases. A genome sequence of an E. coli strain associated with urinary tract infections, CFT073, is nearly complete, and data have been available via a Web server for more than 1 year (www.genome.wisc.edu). Data collection is nearing completion for RS218, a strain associated with neonatal sepsis and meningitis. Related genomes also are being explored, ranging from the closely related Shigella flexneri, the causative agent of dysentery, to Yersinia pestis, the causative agent of the black plague and a more distantly related member of the family Enterobacteriaceae. Other enterobacterial genome sequencing projects either have been completed or are underway elsewhere, including several Salmonella isolates (Washington University–St. Louis, University of Illinois, and Sanger Centre), another Y. pestis strain (Sanger Centre), Y. pseudotuberculosis (Lawrence Livermore National Laboratory/Institut Pasteur), and Klebsiella pneumoniae (Washington University–St. Louis). For a list of completed and ongoing microbial genome projects, see the TIGR Web site at www.tigr.org. In all probability, this group will remain one of the best-sampled clades of closely related organisms.
The E. coli K-12 Strain MG1655 Genome-Sequencing Project at the University of Wisconsin–Madison
A systematic E. coli K-12 genome-sequencing project began with genome mapping in the late 1980s in the laboratory of Dr. Frederick R. Blattner at the University of Wisconsin–Madison. The 4,639,221-basepair (bp) chromosome of strain MG1655 published by this group in 1997 represented the largest genome sequence completed in a single laboratory at that time [1]. The fully annotated sequence is deposited in GenBank (U00096) and is available as a single sequence through the Entrez Genomes Division; it is split into 400 records of approximately 11,500 bp each through the Entrez Nucleotide Database (Accession Numbers AE000111–AE000510). These sequence data are also available online directly from the Wisconsin group at www.genome.wisc.edu.
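As a rough check on the record size quoted above, the following Python sketch divides the 4,639,221-bp chromosome into 400 contiguous pieces and reports their mean length. The function name `split_into_records` and the near-equal, non-overlapping partitioning scheme are assumptions made only for illustration; the actual GenBank record boundaries need not follow this scheme.

```python
GENOME_LENGTH_BP = 4_639_221  # MG1655 chromosome length, as given in the text
RECORD_COUNT = 400            # number of Entrez Nucleotide records, as given in the text


def split_into_records(total_bp: int, n_records: int) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) coordinates for n near-equal records.

    Assumption (hypothetical scheme): records are contiguous, non-overlapping,
    and as equal in length as possible.
    """
    base, extra = divmod(total_bp, n_records)
    records = []
    start = 1
    for i in range(n_records):
        # The first `extra` records absorb one leftover base pair each.
        length = base + (1 if i < extra else 0)
        records.append((start, start + length - 1))
        start += length
    return records


records = split_into_records(GENOME_LENGTH_BP, RECORD_COUNT)
mean_length = GENOME_LENGTH_BP / RECORD_COUNT
print(f"{len(records)} records, mean length {mean_length:.0f} bp")
print("first:", records[0], "last:", records[-1])
```

Under these assumptions the mean record length is about 11,600 bp, consistent with the approximate figure quoted in the text.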
The original E. coli K-12 strain was isolated in 1922 from a convalescent patient suffering from an unrelated pathology. As an experimental model organism, derivatives of this strain underwent extensive handling in countless laboratories. E. coli K-12 strains have been subjected to repeated experimental mutagenic strategies, including treatment with ultraviolet light, EMS, x-rays, and acridine dyes to cure them of the F-plasmid, excise phage lambda, and procure variants with phenotypes of value in the laboratory environment [4]. After the records of Barbara Bachmann and the E. coli Genetic Stock Center at Yale University were examined, MG1655 was selected as the available strain that was most similar to the original isolate and had undergone a minimum of mutagenic assaults [4].
The basic strategy employed in this project varied over its history as sequencing technologies developed. Approximately one-third of the genome was determined from a series of mapped overlapping lambda clones (∼10 kbp each) proceeding counterclockwise from the 0/100 minute region of the chromosome. Analyses of subsections of completed genome sequence were released in a series of publications beginning in 1992 [5–10] and deposited into GenBank (Accession Numbers U00039, L10328, M87049, L19201, U00006, U14003, and U18997). The remaining two-thirds of the genome sequence was determined from random-shotgun libraries of larger fragments (100–200 kbp). These fragments were recovered by pulsed-field gel electrophoresis (PFGE) of restriction endonuclease–digested genomic DNA from MG1655-derived strains, which had been constructed by introducing novel rare restriction sites into the chromosome using a mini-Tn10-derivative transposon vector [11]. The PFGE fragment isolation was subject to approximately 15% contamination from other areas of the genome, and sequences collected from these extraneous DNA templates were used to bolster coverage genome-wide.
Other E. coli K-12 Genome-Sequencing Projects
Even before any concerted effort to complete a genome sequence of E. coli K-12, a substantial amount of sequence data was already available. By 1989, about 20% of the total chromosome sequence was known from the piecemeal contributions of independent laboratories [12]. In contrast to the several large-scale sequencing efforts initiated after this point, these sequences were derived from a number of different K-12 strains and were distributed throughout the chromosome according to the research interests of each individual group.
An evolving consortium of Japanese laboratories has collected most of the genome sequence of a second E. coli K-12 strain, W3110 [13–22]. The bulk of this sequence was determined from the Kohara lambda clone library of E. coli K-12 strain W3110 [23]. This group began by sequencing mapped clones starting at the 0/100 minute mark and moving clockwise around the chromosome, focusing on filling gaps in the already available sequence data. Anal...