Escherichia coli K-12 MG1655 sequence and annotations
U00096.3 (aka version 3)


Sequence update (September 26, 2013)

    The original Escherichia coli K-12 strain MG1655 genome sequence from 1997 (U00096.1) was updated in 2004 (U00096.2), based on additional data from a comparison with the newly sequenced K-12 strain W3110. These pre-WGS-era data came from a variety of clones and subclones. As MG1655 came to be used as a testbed and control in various next-generation sequencing studies, it became obvious that the GenBank entry did not correspond precisely to any specific MG1655 isolate, a situation best described by Freddolino et al. [1]. After much discussion, we decided to update the sequence to correspond to the isolate we had deposited with both ATCC (ATCC 700926) and the Coli Genetic Stock Center (CGSC 7740) as the sequenced strain. These corrections are due not to sequencing errors per se, but rather to moving from a chimeric sequence to the sequence of a single isolate. All areas of disagreement were re-sequenced [2], and the final 4,641,652 bp consensus sequence (ASAP v3) was the source of U00096.3, deposited on September 26, 2013.

The changes in U00096.3 [see spreadsheet for details] consist of:

  • an IS1 insertion (reverse orientation; 8 bp target site duplication) in crl, interrupting the gene (new pseudogene)
  • a 1 bp SNP and a 1 bp insertion in ylbE, restoring a gene previously thought to be a pseudogene
  • an IS5 insertion (forward orientation; 4 bp target site duplication) in the oppA-ychE intergenic region; no feature was impacted
  • a 2 bp insertion in gatC, interrupting the gene (new pseudogene)
  • a 1 bp deletion in glpR, interrupting the gene (new pseudogene)
  • a 1 bp SNP in the ppiC-yifN intergenic region; no feature was impacted
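
The effect of these changes on sequence length can be tallied with a short sketch. The insertion sequence lengths are not stated above; the values used here (768 bp for IS1, 1,195 bp for IS5) are commonly cited figures and should be treated as assumptions, as should the `Change` record and function names, which are purely illustrative. Each insertion is counted as the element plus its target site duplication, and SNPs contribute no length change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    locus: str       # gene or intergenic region named in the list above
    kind: str        # short description of the change
    delta_bp: int    # net effect on sequence length, in bp


# ASSUMPTION: element lengths are not given in the text; these are
# commonly cited values and may not match the deposited sequence exactly.
IS1_LEN = 768
IS5_LEN = 1195

CHANGES = [
    Change("crl", "IS1 insertion + 8 bp target site duplication", IS1_LEN + 8),
    Change("ylbE", "1 bp SNP", 0),
    Change("ylbE", "1 bp insertion", 1),
    Change("oppA-ychE intergenic", "IS5 insertion + 4 bp target site duplication", IS5_LEN + 4),
    Change("gatC", "2 bp insertion", 2),
    Change("glpR", "1 bp deletion", -1),
    Change("ppiC-yifN intergenic", "1 bp SNP", 0),
]

V3_LENGTH = 4_641_652  # final consensus length stated in the text


def net_length_change(changes):
    """Sum the per-change length deltas."""
    return sum(c.delta_bp for c in changes)


delta = net_length_change(CHANGES)
print(f"net change: {delta:+d} bp")
print(f"implied length before the update: {V3_LENGTH - delta:,} bp")
```

Under these assumptions the implied pre-update length can be compared against the length recorded for U00096.2 as a sanity check; a mismatch would most likely point to the assumed element lengths.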


Annotation updates (most recent September 23, 2020)

    Subsequent annotation updates from ASAP and EcoGene were submitted on November 15, 2013 and on July 30, 2014. Acknowledging that annotation of the genome is an ongoing task that benefits from the work of all end-users of the sequence, on February 11, 2016 EcoCyc was designated as the submitter of record for further updates to the GenBank entry. Periodic updates are generated by a collaboration that includes EcoCyc, our group here at UW-Madison, UniProtKB/Swiss-Prot, and the National Center for Biotechnology Information (NCBI). After an extended series of discussions among all parties involved, the GenBank entry was updated on September 24, 2018. The most recent annotations to the GenBank entry were released on September 23, 2020. Suggestions for updates can be sent to EcoCyc.

A note on nomenclature: gene names, b-numbers, ECK numbers

    The standard genetic nomenclature for E. coli is that of Demerec et al. 1966, as subsequently amended through use, and as described in Instructions to Authors for the Journal of Bacteriology. Provisional y-names for uncharacterized ORFs are based on a systematic nomenclature described by Kenn Rudd (Rudd 1998; also see this archived page from EcoGene). Briefly, the first three letters of a "y" name are based on the map position of an ORF at the time the name was assigned, in a manner analogous to the "z" naming system for transposon insertions (Chumley et al. 1979). The y-names are not reused if an ORF is given a new gene name or if an ORF becomes defunct. According to the original scheme, once a function was established for an E. coli gene the provisional y-name would be abandoned and a new gene name chosen. We strongly encourage experimental scientists to continue this practice. However, since the y-names have been widely used in the literature, the annotations do retain them as synonyms when a gene is renamed.
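
As a rough illustration, provisional y-names can be picked out of a gene list with a simple pattern check. The pattern below (a lowercase "y", two more lowercase letters, then one uppercase letter) is an assumption inferred from names such as ylbE, ychE, and yifN mentioned on this page; it is a heuristic, not the formal definition from Rudd 1998, and the function name is hypothetical.

```python
import re

# ASSUMPTION: a provisional name looks like "y" + two lowercase letters
# + one uppercase letter. This is a heuristic, not an official rule.
Y_NAME = re.compile(r"y[a-z]{2}[A-Z]")


def is_provisional_y_name(name: str) -> bool:
    """Return True if the whole name matches the assumed y-name shape."""
    return Y_NAME.fullmatch(name) is not None


# Gene names taken from the sequence-update list on this page.
genes = ["crl", "ylbE", "oppA", "ychE", "gatC", "glpR", "ppiC", "yifN"]
provisional = [g for g in genes if is_provisional_y_name(g)]
print(provisional)
```

A check like this only flags candidates; whether a name is still provisional depends on the current annotation, since renamed genes keep their y-names as synonyms.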

    Beginning with our publication of the complete genome sequence, we have assigned each gene (protein- or RNA-encoding) a unique systematic numeric identifier (locus tag) beginning with a "b": the so-called b-numbers or Blattner numbers. These designations remain constant through further updates, gene identifications, etc., provided the gene is not substantially changed. The initial set of locus tags was assigned sequentially according to gene order in the linearized genome sequence. Subsequent assignments are made using the next available number. De-accessioned (retired) locus tags are not reused.
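
The assignment rules described above (sequential initial tags, next available number for later additions, no reuse of retired tags) can be sketched as a small registry. This is a toy model, not the procedure the annotation group actually uses; the class name, its methods, and the zero-padded four-digit format are assumptions made for illustration.

```python
class LocusTagRegistry:
    """Toy registry: tags are issued in increasing order and never reissued."""

    def __init__(self, prefix="b", width=4):
        # ASSUMPTION: tags are formatted as prefix + zero-padded number.
        self.prefix = prefix
        self.width = width
        self._last_number = 0   # highest number issued so far
        self.active = set()
        self.retired = set()

    def _format(self, number):
        return f"{self.prefix}{number:0{self.width}d}"

    def assign_next(self):
        """Issue the next available number; retired numbers are skipped
        automatically because the counter only moves forward."""
        self._last_number += 1
        tag = self._format(self._last_number)
        self.active.add(tag)
        return tag

    def assign_initial(self, gene_count):
        """Issue tags sequentially for genes listed in genome order."""
        return [self.assign_next() for _ in range(gene_count)]

    def retire(self, tag):
        """De-accession a tag; it stays on record and is never reused."""
        if tag not in self.active:
            raise KeyError(f"{tag} is not an active locus tag")
        self.active.remove(tag)
        self.retired.add(tag)


registry = LocusTagRegistry()
initial_tags = registry.assign_initial(3)
registry.retire(initial_tags[1])
new_tag = registry.assign_next()
print(initial_tags, new_tag, sorted(registry.retired))
```

Because the counter never moves backward, retiring a tag leaves a permanent gap in the numbering, which is the behavior the text describes.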

    One aim of a series of annotation workshops was to produce a gene identification system for E. coli K-12 genes that was consistent between strains over the vast regions where they are essentially identical while also making accessible those genes that are strain specific or have different map locations. The solution was to provide a multi-part system of identifiers for each annotated feature: strain-specific locus tags for each isolate (e.g., b-numbers for MG1655, JW-numbers for W3110), and ECK (E. coli K-12) numbers for reference to E. coli K-12 as a composite strain.
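
One way to picture the multi-part identifier system is as a record keyed by ECK number that carries a locus tag per strain, with a lookup index that accepts any of the identifiers. The sketch below is illustrative only: the `GeneIdentifiers` class, the `index_by_any_id` function, the gene names, and all identifier values are hypothetical placeholders, not real annotations.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class GeneIdentifiers:
    eck: str                                   # composite K-12 identifier
    name: str                                  # gene name or y-name
    locus_tags: Dict[str, str] = field(default_factory=dict)  # strain -> tag

    def tag_for(self, strain: str) -> Optional[str]:
        """Return the strain-specific locus tag, or None if the gene
        is not annotated in that strain."""
        return self.locus_tags.get(strain)


def index_by_any_id(records):
    """Build a lookup from every ECK number and locus tag to its record."""
    index = {}
    for rec in records:
        index[rec.eck] = rec
        for tag in rec.locus_tags.values():
            index[tag] = rec
    return index


# HYPOTHETICAL placeholder identifiers, chosen only to show the structure.
records = [
    GeneIdentifiers("ECK9001", "exaA", {"MG1655": "b9001", "W3110": "JW9001"}),
    GeneIdentifiers("ECK9002", "exaB", {"MG1655": "b9002"}),  # strain-specific
]
index = index_by_any_id(records)

print(index["JW9001"].tag_for("MG1655"))
print(index["b9002"].tag_for("W3110"))
```

Keeping the per-strain tags in a mapping lets a shared gene and a strain-specific gene use the same record type; a missing entry simply means the gene has no tag in that strain.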

    When describing a new gene in E. coli K-12 we strongly encourage researchers to contact EcoCyc so that the annotation group can coordinate naming and avoid duplications, etc. This group will also assign b-numbers, ECK numbers, and y-names if no gene name is suggested.

