FlyBase:Gene Model Annotation Guidelines

G.8. Gene Model Annotation Guidelines (2012)

Note on annotation of different gene classes

Originally, manual annotation efforts concentrated on models of protein-coding genes. With the availability of RNA-Seq coverage data, long non-coding RNAs (lncRNAs) are a growing class of new gene models requiring manual annotation.

FlyBase relies on outside expert annotations for the various small non-protein-coding classes. Annotations of tRNAs have been stable since r3.2 and were based on tRNAscan-SE (Lowe and Eddy, 1997, NAR 25:955-964). Annotations of miRNAs are based on data compiled by miRBase (www.mirbase.org) and are periodically updated. Annotations of other small RNA classes are based primarily on published sources; these include snoRNAs (FBrf0199239 and others), snRNAs (FBrf0128209, FBrf0193533 and others), and 5S rRNA genes (FBrf0041596).

Note on transcript and protein ID changes

The following policy was established in 2008: Any change to an existing transcript that results in a change to the CDS will be accompanied by changes to the transcript and protein symbols and IDs.

Implementation of changes to annotation guidelines

Aside from minor changes, the previous annotation guidelines were written in 2007 (see section G.7., above) and stood us in good stead until various new classes of high-throughput data became available. Reannotation of existing gene models based on the new 2012 guidelines is a work in progress. The release number corresponding to the last time a gene model was reviewed is now noted in the Gene Model Comment section; these guidelines are reflected in gene models reviewed during release 5.45 and later.

Curator comments (see section G.8.5.).

The Apollo annotation tool allows for the inclusion of comments associated with an annotated gene or a specific transcript of an annotated gene. We make extensive use of this capability, including standardized comments as well as free text comments. In the sections that follow, there are frequent references to curator comments; further explanation and a complete list of current standardized comments may be found in section G.8.5.

G.8.1. Types of data that inform gene model annotation (2012)

Initial annotation efforts relied upon three primary data types:

cDNA sequence data (high-throughput cDNA/EST data or data generated by the community)
Gene prediction data, including conserved protein signatures
BLASTX homologies

Starting in 2010, new high-throughput data sets, primarily from the modENCODE project, have had a significant impact on gene model annotation:

RNA-Seq coverage data
Stranded RNA-Seq coverage data
RNA-Seq exon junction data
Transcription start site (TSS) data
Translation stop-codon read-through predictions

These data can be viewed in GBrowse in the aligned evidence tracks; more information about a specific dataset may be found via links in the GBrowse data tracks listings and in the FlyBase collection reports. Some smaller-scale datasets may not be presented in GBrowse; in these cases, an explanation and reference are provided in the gene or transcript comment section (see ‘Curator comments’). Exceptions include data from publications, including supplementary data, which have not been submitted to a sequence database.

G.8.2. Rules and criteria for annotation (2012)

Classification of a new gene as coding or non-coding

Most new annotations are based on RNA-Seq junction or coverage data. These data were used by modENCODE to isolate cDNAs by inverse-PCR, so there may also be new cDNA data.

Many new annotations are small genes that may be non-coding (lncRNA genes) or may encode small polypeptides. Current knowledge of both of these categories is rudimentary; thus, FlyBase annotators often must make judgment calls. A primary consideration in this process is whether a potential ORF shows a pattern of conservation among the species in the melanogaster subgroup. New annotations that are difficult to categorize are flagged with a comment stating that the opposite case may be true; see ‘Curator comments,’ below.

A cDNA or an RNA-Seq junction may support the possibility of an antisense RNA gene. If there is also support from stranded RNA-Seq data, a non-coding gene annotation is created and a comment identifying it as antisense is appended; see ‘Curator comments,’ below.

Structure of the transcript(s)

Transcription start site: TSSs rarely consist of a single definitive nucleotide location.

  • For cases with data from modENCODE mapping the TSS frequency distributions (FBrf0213250), the 5’ ends of all overlapping transcripts are set to the 90% TSS point. This is the point at which a summation algorithm hits 0.9 (starting from the 3’-most TSS and moving 5’).
  • If no modENCODE TSS data are available, the 5’ extent of the 5’-most EST or cDNA is used for all overlapping transcripts.
  • Short-capped RNA data may be used (FBrf0209722); if so, a comment is appended (see ‘Curator comments’).
  • If none of the above are available, but there are robust RNA-Seq data, an estimate based on RNA-Seq coverage data is made.
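
The 90% rule in the first bullet can be sketched as a cumulative sum. This is a minimal illustration only, not the modENCODE or FlyBase implementation: the function name, the input format (a mapping from genomic coordinate to TSS read count), and the strand handling are assumptions made for the example.

```python
def tss_90_percent_point(tss_counts, strand="+", fraction=0.9):
    """Return the coordinate at which the cumulative TSS signal reaches `fraction`.

    tss_counts: dict mapping genomic coordinate -> TSS read count
                (hypothetical input format).
    The sum starts at the 3'-most TSS and moves 5', as described in the text.
    """
    total = sum(tss_counts.values())
    if total <= 0:
        raise ValueError("no TSS signal in this region")
    # On the plus strand the 3'-most TSS has the largest coordinate;
    # on the minus strand it has the smallest.
    positions = sorted(tss_counts, reverse=(strand == "+"))
    running = 0
    for pos in positions:
        running += tss_counts[pos]
        if running >= fraction * total:
            return pos
    return positions[-1]


counts = {100: 5, 105: 40, 110: 50, 120: 5}
print(tss_90_percent_point(counts))              # plus-strand gene
print(tss_90_percent_point(counts, strand="-"))  # minus-strand gene
```

With these made-up counts, the plus-strand sum passes 90% of the signal at coordinate 105, so that coordinate would be used as the shared 5' end.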

Internal intron/exon structure:

  • cDNAs are the primary data source for internal gene structure, with alternative transcripts based also on EST and RNA-Seq junction data (see ‘Alternative transcripts and the permutation problem,’ below).
  • Some gene models are still primarily supported by gene prediction or protein alignment data, but these have significantly dropped in number.
  • Non-canonical splices require a high level of support. With the exception of the AT-AC splice pair, thus far all supported non-canonical splice sites vary by only one nucleotide from the most common, GT-AG, and the first nucleotide of all donor sites is ‘G’. Whenever splice sites other than GT/AG or GC/AG are annotated, a comment is appended to the transcript.
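
The splice-site pattern described in the last bullet can be expressed as a small check. This is a sketch under stated assumptions: the function names and category labels are invented for illustration, and the comment text follows the standardized transcript comment listed in section G.8.5. It does not assess the level of support, which remains a curator judgment.

```python
def classify_splice_pair(donor, acceptor):
    """Classify an intron by its donor/acceptor dinucleotides.

    Returns (category, needs_comment). Category labels are hypothetical.
    """
    site = (donor + acceptor).upper()
    if len(site) != 4:
        raise ValueError("donor and acceptor must each be two nucleotides")
    if site == "GTAG":
        return ("canonical", False)
    if site == "GCAG":
        return ("GC-AG variant", False)
    if site == "ATAC":
        return ("AT-AC pair", True)
    # Other supported cases differ from GT-AG at one position and keep G first.
    mismatches = sum(a != b for a, b in zip(site, "GTAG"))
    if mismatches == 1 and site[0] == "G":
        return ("one-nucleotide variant", True)
    return ("outside observed pattern", True)


def splice_comment(donor, acceptor):
    """Return the standardized transcript comment, or None if none is needed."""
    _, needs_comment = classify_splice_pair(donor, acceptor)
    if not needs_comment:
        return None
    return f"Unconventional splice site postulated ({donor.upper()}-{acceptor.upper()})."


print(classify_splice_pair("GT", "TG"))
print(splice_comment("at", "ac"))
```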

3’ terminus: 3’ UTRs may have many polyadenylation sites; no attempt is made to annotate transcripts representing all possibilities.

  • If a polyadenylated cDNA is available, most transcripts are extended 3' to the last non-A nucleotide of the cDNA. A comment is added to each transcript so defined (see ‘Curator comments’).
  • If RNA-Seq coverage data support 3’ UTR sequences beyond that present in a cDNA, at least one transcript is extended 3’ to the approximate terminus supported by the RNA-Seq data and an explanatory comment is added (see ‘Curator comments’).
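
Locating the last non-A nucleotide of a polyadenylated cDNA, as in the first bullet, amounts to trimming the trailing run of A residues. The sketch below assumes the cDNA is supplied as a string in the sense orientation of the transcript; the function name is hypothetical.

```python
def last_non_a_index(cdna):
    """Return the 0-based index of the last non-A base, or None if all bases are A."""
    trimmed = cdna.upper().rstrip("A")
    return len(trimmed) - 1 if trimmed else None


print(last_non_a_index("ACGTTAAAAA"))
```

Note that a genomically encoded A immediately upstream of the tail cannot be distinguished from the tail using the cDNA sequence alone, so a terminus found this way may fall slightly 5' of the true cleavage site.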

Annotation of the coding region

The Apollo annotation tool sets the translation start site to the 5'-most in-frame ATG. In cases supported by the literature (including conservation patterns across Drosophila species), a non-ATG translation start site or a downstream ATG may be used; an explanatory comment is appended (see ‘Curator comments’).
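
As an illustration of what "5'-most in-frame ATG" means, the sketch below walks upstream from a known stop codon, one codon at a time, and records the most 5' ATG reached before an in-frame stop. This is not a description of Apollo's internal logic; the function name and the choice to anchor the frame on the stop codon position are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def five_prime_most_inframe_atg(transcript, stop_index):
    """Return the 0-based index of the 5'-most in-frame ATG, or None.

    transcript: spliced transcript sequence (DNA alphabet).
    stop_index: index of the first base of the stop codon that ends the ORF.
    """
    seq = transcript.upper()
    start = None
    i = stop_index - 3
    while i >= 0:
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            break  # an upstream in-frame stop closes the open reading frame
        if codon == "ATG":
            start = i  # keep moving 5'; a later hit replaces this one
        i -= 3
    return start


print(five_prime_most_inframe_atg("TAGATGAAAATGCCCTAA", 15))
```

In the example sequence there are two in-frame ATGs (at indexes 3 and 9); the one at index 3 is chosen because it is the most 5' and is not separated from the stop by another in-frame stop codon.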

The ‘Exceptional cases’ section below discusses non-ATG starts, stop-codon readthrough and other atypical phenomena affecting the defined coding region.

Partial annotations are avoided except in heterochromatic regions where there may be sequence gaps or genomic sequence mis-assembly. In the past, there were some cases for which it was not possible to identify a likely ATG start codon; translation was started at the 5'-most internal in-frame codon and an explanatory comment added. All such cases in the euchromatin have now been resolved.

Alternative transcripts and the permutation problem

Alternative transcripts are annotated based on cDNA/EST data, RNA-Seq data, and community data. Originally, we also annotated an alternative transcript if there was convincing gene prediction evidence and/or BLASTX evidence; however, almost all alternative transcripts are now supported by RNA-based data.

Frequently, RNA-Seq junction data support many alternative splices within the 5’ UTR of a gene. For a given TSS, all such splices may not be annotated. If this is the case, a comment is included in the Gene Model Comment section (see ‘Curator comments’).

RNA-Seq junctions that are of much lower frequency than alternative junctions may not be annotated (see ‘Assessment of supporting data,’ below). If this is the case, a comment is included in the Gene Model Comment section (see ‘Curator comments’).
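
One way to picture "much lower frequency than alternative junctions" is to compare each junction's read count with the best-supported junction that shares its donor or acceptor. The sketch below does this with a cutoff of 5%; that cutoff, the input format, and the function name are assumptions made for illustration, since no numeric threshold is specified in these guidelines.

```python
LOW_FREQUENCY_COMMENT = "Low-frequency RNA-Seq exon junction(s) not annotated."


def flag_low_frequency(junction_counts, min_ratio=0.05):
    """Return junctions whose support is low relative to competing junctions.

    junction_counts: dict mapping (donor_coord, acceptor_coord) -> read count.
    min_ratio: hypothetical cutoff, not a FlyBase-defined value.
    """
    low = []
    for (donor, acceptor), count in junction_counts.items():
        # Competitors share a donor or an acceptor (the junction itself included).
        competitors = [
            c for (d, a), c in junction_counts.items()
            if d == donor or a == acceptor
        ]
        if count < min_ratio * max(competitors):
            low.append((donor, acceptor))
    return sorted(low)


junctions = {(100, 200): 500, (100, 250): 4, (300, 400): 3}
skipped = flag_low_frequency(junctions)
print(skipped)
if skipped:
    print(LOW_FREQUENCY_COMMENT)
```

The junction (300, 400) has few reads but no competitor, so it is not flagged; only junctions that are weak relative to an alternative are candidates for omission.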

If non-contiguous data, such as RNA-Seq junction, EST, and TSS data, support alternative exons in several regions of a gene, it is usually not possible to determine which of all possible combinations actually exist in vivo. We call this the “permutation problem.” Combinations supported by full-length cDNAs are annotated. The number of additional transcripts to be created is at the discretion of the annotator. Excluding low-frequency junctions, all alternative splices within the CDS and all promoters are represented, but not necessarily all possible combinations. If all combinations are not represented, a comment is included in the Gene Model Comment section (see ‘Curator comments’).
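
The arithmetic behind the permutation problem can be shown with a toy gene. The sketch below enumerates every possible combination of alternatives and, for contrast, builds a small set in which each alternative appears at least once. The region and exon names are invented, and the covering set is only one of many possible choices; in practice the selection is made by the annotator, starting from combinations supported by full-length cDNAs.

```python
from itertools import product


def all_combinations(regions):
    """Every combination of one alternative per region."""
    return list(product(*regions))


def covering_set(regions):
    """A small set of combinations in which each alternative appears at least once."""
    if not regions:
        return []
    n = max(len(region) for region in regions)
    # Transcript k takes alternative k (wrapping around) in every region.
    return [tuple(region[k % len(region)] for region in regions) for k in range(n)]


regions = [
    ["promoter1", "promoter2"],
    ["exon3a", "exon3b", "exon3c"],
    ["exon7a", "exon7b"],
]
print(len(all_combinations(regions)))
for transcript in covering_set(regions):
    print(transcript)
```

Here two promoters, three forms of one exon, and two forms of another give 2 x 3 x 2 = 12 possible transcripts, while three transcripts are enough to represent every alternative once.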

Alternative transcripts: cases that disrupt the CDS

Due to a retained intron or alternative splice, cDNA, EST, or RNA-Seq junction data may support an alternative transcript that would result in a premature stop codon or a downstream start – it usually cannot be determined which. A higher level of support is required for the annotation of such a transcript (multiple cDNA/ESTs or a high-frequency junction), and a comment is appended to the transcript (see ‘Curator comments’). However, rather than continue to annotate truncated proteins corresponding to these transcripts, FlyBase is developing a proposal to reclassify them as non-coding transcripts produced from protein-coding gene loci.

Merges, splits, and “splerges”

Gene splits or merges are a common annotation correction and are based upon RNA-Seq coverage or junction data, cDNA/EST data, BLASTX homologies, or corrections submitted by the community. A comment is placed in the gene record indicating that a merge or split has occurred, in which release, along with an indication of the type of data supporting the change.

Generally, FlyBase considers transcripts for which any portion of the predicted protein is in common to represent a single gene. Examples are considered on a case-by-case basis; some exceptions to this rule have been made for well-characterized genes that exist as separate unrelated entities in other phylogenetic groups.

G.8.3. Exceptional cases (2012)

Many of the curator comments (see section G.8.5.) flag exceptional types of annotations.

Dicistronic and polycistronic annotations: Proteins encoded by a multicistronic transcript are considered to represent different genes. Preferably, a polycistronic transcript is supported by more than one spanning cDNA or EST; care must be taken not to misinterpret cases of overlapping UTRs. Alternative explanations, such as a mutant in the strain or stop-codon suppression, must be ruled out. Each postulated protein should have additional support. This is not a particularly rare “exceptional” class: there are currently more than 130 dicistronic pairs annotated, 4 tricistronic sets and 2 tetracistronic sets.

Stop-codon suppression/readthrough: This class requires support from the literature, including evolutionary comparative data. As a result of work from one of the modENCODE groups (FBrf0216845), this is no longer a rare class: there are currently more than 300 genes with a transcript annotated with a stop-codon readthrough.

Non-ATG starts and translational frameshifts: These require a high level of support, such as detailed treatment in the literature or unambiguous homology data. There are currently 11 genes annotated with a non-ATG start and one with a translational frameshift (Oda). To date, all non-canonical starts vary from ‘AUG’ by one nucleotide.
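
The observation that all non-canonical starts differ from AUG by one nucleotide defines a set of nine candidate codons, which includes the CUG start that has its own SO term in section G.8.5. The sketch below enumerates that set and checks a codon against it; the function names are hypothetical, and membership in this set is not by itself evidence for a non-AUG start.

```python
def near_cognate_start_codons(reference="AUG"):
    """List every codon that differs from `reference` at exactly one position."""
    variants = []
    for i, original in enumerate(reference):
        for base in "ACGU":
            if base != original:
                variants.append(reference[:i] + base + reference[i + 1:])
    return variants


def differs_by_one(codon, reference="AUG"):
    """True if `codon` (RNA or DNA alphabet) is one substitution away from `reference`."""
    codon = codon.upper().replace("T", "U")
    if len(codon) != len(reference):
        return False
    return sum(a != b for a, b in zip(codon, reference)) == 1


print(near_cognate_start_codons())
print(differs_by_one("CTG"))
```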

Trans-spliced transcripts: One gene in Dmel undergoes extensive trans-splicing (mod(mdg4)); others may undergo lower levels. If there is sufficient evidence, the trans-splicing precursors should also be annotated.

Pseudogenes: A non-functional gene that (1) has a related gene in the genome and (2) has more than one compromising lesion, is classified as a pseudogene. If there is only one lesion, it is described as a mutant in the strain (see below); polymorphic pseudogenes are treated as mutations in the strain. Retrotransposed pseudogenes are relatively rare: 5 are currently flagged as such. (Retrogenes exist -- at least 98 have been identified -- but nearly all appear to be functional.)

Mutations in the strain: These are now relatively easily assessed, since there is sequence information for multiple Dmel wild-type strains and for closely related species. Cases for which some wild-type strains carry a functional allele and others carry the mutant allele are flagged with a comment that they represent polymorphic pseudogenes.

Chimeric genes: These are occasionally created at the site of an aberration (usually a tandem duplication). Evidence for expression of such a gene is often ambiguous, since ESTs and RNA-Seq data corresponding to the component genes also may align to the chimeric copy. If a CDS appears to be supported, the gene is classified as coding, but is flagged as “Gene model uncertain.” If it appears unlikely that a protein product is produced, the gene is classified as a pseudogene, and also flagged as “Gene model uncertain.”

Stretching the definition of a gene: As described above, transcripts for which any portion of the predicted protein is in common are usually considered to be products of a single gene. A number of genes with very complex alternative splicing patterns have pairs of transcripts with coding regions that fail to overlap each other, but both of which overlap the coding region of a third transcript. We flag these cases with the following comment: "Gene model includes transcripts encoding non-overlapping portions of the full CDS."
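
The situation described here, in which two transcripts with non-overlapping coding regions are joined through a third, corresponds to grouping transcripts into connected components by CDS overlap. The sketch below illustrates this with a simple union-find. It is a simplification: it compares genomic coding intervals only and ignores reading frame, whereas the guideline refers to a shared portion of the predicted protein. The data layout and function names are assumptions.

```python
def cds_overlap(a, b):
    """True if any coding interval in `a` overlaps any in `b` (half-open coordinates)."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def group_transcripts(cds_by_transcript):
    """Group transcripts linked directly or indirectly by overlapping CDS intervals."""
    names = list(cds_by_transcript)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cds_overlap(cds_by_transcript[a], cds_by_transcript[b]):
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=sorted)


cds = {
    "RA": [(100, 200)],
    "RB": [(300, 400)],
    "RC": [(150, 350)],   # overlaps both RA and RB
    "RD": [(1000, 1100)],
}
print(cds_overlap(cds["RA"], cds["RB"]))
print(group_transcripts(cds))
```

In this toy example RA and RB do not overlap each other, but both overlap RC, so all three fall into one group, which is the case flagged with the comment quoted above.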

Ambiguous cases: Generally, if a case is ambiguous, a comment is added rather than creating an exceptional annotation.

G.8.4. Assessment of supporting data (2012)

One of the most difficult aspects of annotation is assessing the validity of the supporting data. Most problems are recognized only after a period of time, during which a pattern emerges. Examples: the RE and RH cDNA libraries have a higher frequency of genomically primed clones; the GenScan prediction algorithm tends to overpredict, creating long annotations with many small exons (it is more appropriate for vertebrates); unspliced ESTs must be viewed with caution, especially those from the Exelixis libraries; the IP cDNAs exhibit a higher frequency of chimeric or atypically spliced clones.

The new high-throughput RNA-Seq, RNA-Seq junction, and TSS data must be viewed with the same caveats. We are still developing annotation guidelines for dealing with these datasets; it is an advantage that they include some quantitative measures of validity and/or frequency. Low-frequency junctions, for example, may not be used in a gene model; if this is the case, a comment is added (see ‘Curator comments’). Any type of aligned data is problematic in regions of repeats; this is also true of RNA-Seq coverage and junction data.

G.8.5. Curator comments: Current list of standardized annotation comments (2012)

If curator comments exist for a particular annotation, they can be found on the Gene Report, in the GENE MODEL AND FEATURES section, in a field labeled Comments on Gene Model. If comments exist for a particular transcript, they can be found on the annotated transcript report in a section called COMMENTS. This section will appear on the transcript report only if there are comments associated with the transcript. SO terms and ID numbers are included whenever appropriate (see TermLink).

Annotation Comments (apply to the whole gene model)

   Annotated transcripts do not represent all possible combinations of alternative exons and/or alternative promoters.
   Annotated transcripts do not represent all supported alternative splices within 5' UTR.
   Low-frequency RNA-Seq exon junction(s) not annotated.
   Supported by RNA-Seq data.
   Supported by strand-specific RNA-Seq data.
   Probable lncRNA gene; may encode small polypeptide(s).
   Possible non-coding RNA gene.
   Antisense: overlaps [] on opposite strand.
   Antisense (in part): overlaps [] on opposite strand.
   gene_with_dicistronic_mRNA ; SO:0000722
   gene_with_polycistronic_transcript ; SO:0000690
   May be component of a dicistronic gene; available data inconclusive.
   Shares 5' UTR with upstream gene.
   Shares 5' UTR with downstream gene.
   Gene model includes transcripts encoding non-overlapping portions of the full CDS.
   Pseudogene similar to []; proximate; partial; created by tandem duplication.
   Pseudogene similar to []; transposed.
   Apparent introns not annotated: probable artifact due to repetitive sequence.
   miRNA(s) located within the transcribed region of this non-coding RNA gene.
   Alternative translation stop created by use of multiphasic reading frames within coding region.
   Variable use of small exon; supported combination results in frameshift and premature stop in downstream exon.
   Multiphase exon postulated: exon reading frame differs in alternative transcripts.
   Multiphase exon postulated: reading frame of first coding exon differs in alternative transcripts.
   Mutation in sequenced strain: [*].
   Polymorphic pseudogene: intact in some individuals or strains, disrupted by mutation in others.
   Gene model uncertain: []
   Gene model uncertain: chimeric gene.
   Gene model is incomplete due to []
   gene_with_transcript_with_translational_frameshift ; SO:0000712
   Translational frameshifting postulated (FBrfnnnnnnn): -1 [+1] frameshift reflected in aa sequence of predicted polypeptide[s].
   gene_with_stop_codon_redefined_as_selenocysteine ; SO:0000710
   Stop-codon suppression (UGA as Sec) postulated (FBrfnnnnnnn).
   gene_with_stop_codon_read_through ; SO:0000697
   Stop-codon suppression (Uxx) postulated; FBrfnnnnnnn.
   gene_with_unconventional_translation_start_codon ; SO:0001739
   gene_with_translation_start_codon_CUG ; SO:0001740
   Unconventional translation start (XYZ) postulated; FBrfnnnnnnn.
   gene_with_trans_spliced_transcript ; SO:0000459
   Multiphase exon postulated: this gene shares a region of coding sequence with an overlapping gene, but different reading frames are utilized in the overlapping coding region.
   Bidirectional region of coding sequence postulated: a portion of the CDS of this gene overlaps a portion of the CDS of a gene on opposite strand.
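
Several entries in the list above are SO terms written as a term name, a semicolon, and an SO identifier. For anyone processing these comments programmatically, the sketch below separates such entries from free-text comments. The regular expression and function name are assumptions based only on the format shown in this list.

```python
import re

# Matches lines such as "gene_with_dicistronic_mRNA ; SO:0000722".
SO_LINE = re.compile(r"^\s*(?P<term>[A-Za-z0-9_]+)\s*;\s*(?P<so_id>SO:\d{7})\s*$")


def parse_so_comment(line):
    """Return (term, SO id) for an SO-style comment line, or None for free text."""
    match = SO_LINE.match(line)
    if match is None:
        return None
    return (match["term"], match["so_id"])


print(parse_so_comment("   gene_with_dicistronic_mRNA ; SO:0000722"))
print(parse_so_comment("   Supported by RNA-Seq data."))
```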

Transcript Comments (apply to a specific transcript)

   Transcript terminates at site supported by polyadenylated cDNA.
   Extended 3' UTR based on RNA-Seq and/or EST data.
   UTR(s) based on RNA-Seq data.
   Transcriptional initiation is supported by short-capped RNA data (FBrf0209722).
   Evidence supports alternative splice leading to premature stop codon and/or downstream start; may or may not produce functional polypeptide.
   Based on cDNA(s) with retained intron; results in premature stop codon and/or downstream start; may or may not produce functional polypeptide.
   Unconventional splice site postulated (XY-WZ).
   Non-coding alternative transcript supported, [retained intron/alternative splice] (FBrfnnnnnnn).
   Truncated polypeptide supported, [retained intron/alternative splice] (FBrfnnnnnnn).
   Truncated polypeptide supported, [alternative downstream AUG/alternative terminal exon] (FBrfnnnnnnn).
   Monocistronic transcript; alternative dicistronic transcript(s) exist.
   Dicistronic transcript; alternative monocistronic transcript(s) exist.
   Dicistronic transcript.
   Polycistronic transcript.
   Unconventional splice site invoked (XY-WZ); sequence altered due to transposon insertion; this splice may not occur in vivo.
   Unconventional splice site(s) invoked due to gap in genomic sequence; this splice does not occur in vivo.
   Stop-codon suppression (UGA as Sec) postulated (FBrfnnnnnnn); reflected in aa sequence of predicted polypeptide.
   Stop-codon suppression (Uxx) postulated (FBrfnnnnnnn); reflected in aa sequence of predicted polypeptide.
   Unconventional translation start postulated (XYZ encoding Met); FBrfnnnnnnn.
   Downstream translation start supported by comparative analysis across Drosophila species.
   Downstream translation start supported by [FBrfnnnnnnn].
   Transcript postulated to overlap transposable element.