Revision as of 19:38, 20 December 2017
Gene Model Annotation Guidelines (2012)
Two publications in 2015 describe the use of these annotation guidelines and analyses of the resulting R6.04 D. melanogaster gene model set:
Matthews BB, dos Santos G, Crosby MA, Emmert DB, St Pierre SE, Gramates LS, Zhou P, Schroeder AJ, Falls K, Strelets V, Russo SM, Gelbart WM, and the FlyBase Consortium. (2015) Gene Model Annotations for Drosophila melanogaster: Impact of High-Throughput Data. G3 (Bethesda) 5:1721-1736. (FBrf0229216)
Crosby MA, Gramates LS, dos Santos G, Matthews BB, St Pierre SE, Zhou P, Schroeder AJ, Falls K, Emmert DB, Russo SM, Gelbart WM, and the FlyBase Consortium. (2015) Gene Model Annotations for Drosophila melanogaster: The Rule-Benders. G3 (Bethesda) 5:1737-1749. (FBrf0229217)
Note on annotation of different gene classes
Originally, manual annotation efforts concentrated on models of protein-coding genes. With the availability of RNA-Seq coverage data, long non-coding RNAs (lncRNAs) are a growing class of new gene models requiring manual annotation.
FlyBase relies on outside expert annotations for the various small non-protein-coding classes. Annotations of tRNAs have been stable since r3.2 and were based on tRNAscan (Lowe and Eddy, 1997, NAR 25:955-964). Annotations of miRNAs are based on data compiled by miRBase and periodically updated. Annotations of other small RNA classes are based primarily on published sources; these include snoRNAs (FBrf0199239 and others), snRNAs (FBrf0128209, FBrf0193533 and others), and 5S rRNA genes (FBrf0041596).
Note on transcript and protein ID changes
The following policy was established in 2008: Any change to an existing transcript that results in a change to the CDS will be accompanied by changes to the transcript and protein symbols and IDs.
Implementation of changes to annotation guidelines
Aside from minor changes, the previous annotation guidelines were written in 2007 (see section 2 below) and stood us in good stead until various new classes of high throughput data started to be made available. Reannotation of existing gene models based on the new 2012 guidelines is a work in progress. The release number corresponding to the last time a gene model was reviewed is now noted in the Gene Model Comment section; these guidelines are reflected in gene models reviewed during release 5.45 and later.
Curator Comments
The Apollo annotation tool allows for the inclusion of Curator comments associated with an annotated gene or a specific transcript of an annotated gene. We make extensive use of this capability, including standardized comments as well as free text comments. In the sections that follow, there are frequent references to curator comments; further explanation and a complete list of current standardized comments may be found at Curator comments.
Types of data that inform gene model annotation (2012)
Initial annotation efforts relied upon three primary data types:
- cDNA sequence data (high-throughput cDNA/EST data or data generated by the community)
- Gene prediction data, including conserved protein signatures
- BLASTX homologies
Starting in 2010, new high-throughput data sets, primarily from the modENCODE project, have had a significant impact on gene model annotation:
- RNA-Seq coverage data
- Stranded RNA-Seq coverage data
- RNA-Seq exon junction data
- Transcription start site (TSS) data
- Translation stop-codon read-through predictions
These data can be viewed in GBrowse in the aligned evidence tracks; more information about a specific dataset may be found via links in the GBrowse data tracks listings and in the FlyBase collection reports. Some smaller-scale datasets may not be presented in GBrowse; in these cases, an explanation and reference are provided in the gene or transcript comment section (see Curator comments). Exceptions include data from publications, including supplementary data, which have not been submitted to a sequence database.
Rules and criteria for annotation (2012)
Classification of a new gene as coding or non-coding
Most new annotations are based on RNA-Seq junction or coverage data. These data were used by modENCODE to isolate cDNAs by inverse-PCR, so there may also be new cDNA data.
Many new annotations are small genes that may be non-coding (lncRNA genes) or encode small polypeptides. Current knowledge of both of these categories is rudimentary, thus FlyBase annotators often must make judgment calls. A primary consideration in this process is whether a potential ORF shows a pattern of conservation among the species in the melanogaster subgroup. New annotations that are difficult to categorize are flagged with a comment stating that the opposite case may be true; see Curator comments.
A cDNA or an RNA-Seq junction may support the possibility of an antisense RNA gene. If there is also support from stranded RNA-Seq data, a non-coding gene annotation is created and a comment identifying it as antisense is appended; see Curator comments.
Structure of the transcript(s)
Transcription start site: TSSs rarely consist of a single definitive nucleotide location.
- For cases with modENCODE data mapping the TSS frequency distribution (Hoskins et al.), the 5’ ends of all overlapping transcripts are set to the 90% TSS point. This is the point at which a running summation of TSS frequency reaches 0.9 (starting from the 3’-most TSS and moving 5’).
- If no modENCODE TSS data are available, the 5’ extent of the 5’-most EST or cDNA is used for all overlapping transcripts.
- Short-capped RNA data (Nechaev et al.) may be used; if so, a comment is appended (see Curator comments).
- If none of the above are available, but there are robust RNA-Seq data, an estimate based on RNA-Seq coverage data is made.
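The 90% TSS point described above can be illustrated with a short sketch. This is only one reading of the rule as stated here, not the modENCODE or FlyBase implementation: the function name, the input format (a mapping from genomic coordinate to TSS read count), and the strand handling are assumptions made for illustration.

```python
from typing import Dict


def tss_90_percent_point(tss_counts: Dict[int, float],
                         strand: str = "+",
                         threshold: float = 0.9) -> int:
    """Return the coordinate at which the running sum of TSS frequency,
    accumulated from the 3'-most TSS toward the 5' end, first reaches
    `threshold` (hypothetical helper; input format is an assumption)."""
    if not tss_counts:
        raise ValueError("no TSS data supplied")
    total = sum(tss_counts.values())
    if total <= 0:
        raise ValueError("TSS counts must sum to a positive value")
    # On the plus strand the 3'-most TSS has the highest coordinate;
    # on the minus strand it has the lowest.
    positions = sorted(tss_counts, reverse=(strand == "+"))
    running = 0.0
    for pos in positions:
        running += tss_counts[pos]
        if running / total >= threshold:
            return pos
    # Only reached through floating-point rounding when threshold is 1.0.
    return positions[-1]
```

With counts of 5, 15, 60 and 20 at positions 100, 105, 110 and 120 on the plus strand, the running fraction is 0.20 at 120, 0.80 at 110 and 0.95 at 105, so the 5’ end would be set to 105.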
Internal intron/exon structure:
- cDNAs are the primary data source for internal gene structure, with alternative transcripts based also on EST and RNA-Seq junction data (see ‘Alternative transcripts and the permutation problem,’ below).
- Some gene models are still primarily supported by gene prediction or protein alignment data, but these have significantly dropped in number.
- Non-canonical splices require a high level of support. With the exception of the AT-AC splice pair, thus far all supported non-canonical splice sites vary by only one nucleotide from the most common, GT-AG, and the first nucleotide of all donor sites is ‘G’. Whenever splice sites other than GT/AG or GC/AG are annotated, a comment is appended to the transcript.
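The splice-site rule in the last item can be sketched as a simple check. The function names are hypothetical, and the comment text is taken from the standardized transcript comment "Unconventional splice site postulated (XY-WZ)"; this does not represent how Apollo or FlyBase pipelines actually generate comments.

```python
from typing import Optional

# Splice pairs that are annotated without a comment, per the guideline above.
UNCOMMENTED_PAIRS = {("GT", "AG"), ("GC", "AG")}


def splice_comment(donor: str, acceptor: str) -> Optional[str]:
    """Return the transcript comment for a donor/acceptor dinucleotide pair,
    or None when the pair is GT/AG or GC/AG (hypothetical helper)."""
    d, a = donor.upper(), acceptor.upper()
    if len(d) != 2 or len(a) != 2:
        raise ValueError("donor and acceptor must be dinucleotides")
    if (d, a) in UNCOMMENTED_PAIRS:
        return None
    return f"Unconventional splice site postulated ({d}-{a})."


def fits_observed_pattern(donor: str, acceptor: str) -> bool:
    """True if a non-GT-AG pair matches the pattern seen so far: either
    AT-AC, or exactly one mismatch from GT-AG with a donor starting in G."""
    d, a = donor.upper(), acceptor.upper()
    if (d, a) == ("AT", "AC"):
        return True
    mismatches = sum(x != y for x, y in zip(d + a, "GTAG"))
    return mismatches == 1 and d[0] == "G"
```

A pair that fails `fits_observed_pattern` is not necessarily wrong, but under the guideline it would call for an especially high level of support before being annotated.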
3’ terminus: 3’ UTRs may have many polyadenylation sites; no attempt is made to annotate transcripts representing all possibilities.
- If a polyadenylated cDNA is available, most transcripts are extended 3' to the last non-A nucleotide of the cDNA. A comment is added to each transcript so defined (see Curator comments).
- If RNA-Seq coverage data support 3’ UTR sequences beyond that present in a cDNA, at least one transcript is extended 3’ to the approximate terminus supported by the RNA-Seq data and an explanatory comment is added (see Curator comments).
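Locating the last non-A nucleotide of a polyadenylated cDNA can be sketched as follows. The minimum tail length used to decide that a cDNA is polyadenylated is an assumption for illustration (the guidelines do not state one), and the function name is hypothetical. The index returned is a position within the cDNA sequence; mapping it to a genomic coordinate is left to the alignment.

```python
from typing import Optional


def last_non_a_index(cdna: str, min_tail: int = 10) -> Optional[int]:
    """Return the 0-based index of the last non-A nucleotide if the cDNA
    ends in a run of at least `min_tail` A's, otherwise None.
    `min_tail` is an assumed cutoff, not a FlyBase parameter."""
    seq = cdna.strip().upper()
    trimmed = seq.rstrip("A")
    tail_length = len(seq) - len(trimmed)
    if tail_length < min_tail or not trimmed:
        return None
    return len(trimmed) - 1
```

Note that genomically encoded A's immediately upstream of the tail cannot be distinguished from the tail itself in the cDNA alone, so the terminus found this way is the most 5’ position consistent with the clone.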
Annotation of the coding region
The Apollo annotation tool sets the translation start site to the 5'-most in-frame ATG. In cases supported by the literature (including conservation patterns across Drosophila species), a non-ATG translation start site or a downstream ATG may be used; an explanatory comment is appended (see Curator comments).
The “Exceptional cases” section below discusses non-ATG starts, stop-codon readthrough and other atypical phenomena affecting the defined coding region.
Partial annotations are avoided except in heterochromatic regions where there may be sequence gaps or genomic sequence mis-assembly. In the past, there were some cases for which it was not possible to identify a likely ATG start codon; translation was started at the 5'-most internal in-frame codon and an explanatory comment added. All such cases in the euchromatin have now been resolved.
Alternative transcripts and the permutation problem
Alternative transcripts are annotated based on cDNA/EST data, RNA-Seq data, and community data. Originally, we also annotated an alternative transcript if there was convincing gene prediction evidence and/or BLASTX evidence; however, almost all alternative transcripts are now supported by RNA-based data.
Frequently, RNA-Seq junction data support many alternative splices within the 5’ UTR of a gene. For a given TSS, all such splices may not be annotated. If this is the case, a comment is included in the Gene Model Comment section (see Curator comments).
RNA-Seq junctions that are of much lower frequency than alternative junctions may not be annotated (see ‘Assessment of supporting data,’ below). If this is the case, a comment is included in the Gene Model Comment section (see Curator comments).
If non-contiguous data, such as RNA-Seq junction, EST, and TSS data, support alternative exons in several regions of a gene, it is usually not possible to determine which of all possible combinations actually exist in vivo. We call this the “permutation problem.” Combinations supported by full-length cDNAs are annotated. The number of additional transcripts to be created is at the discretion of the annotator. Excluding low-frequency junctions, all alternative splices within the CDS and all promoters are represented, but not necessarily all possible combinations. If all combinations are not represented, a comment is included in the Gene Model Comment section (see Curator comments).
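The scale of the permutation problem is easy to see with a small calculation. The sketch below counts all possible combinations of alternatives across independent regions of a gene, and builds one small set of transcripts in which every alternative appears at least once. The round-robin construction is purely illustrative; in practice the choice of transcripts is at the discretion of the annotator, and the names used here are hypothetical.

```python
from math import prod
from typing import List, Sequence, Tuple


def count_combinations(regions: Sequence[Sequence[str]]) -> int:
    """Number of transcripts needed to represent every combination."""
    return prod(len(alternatives) for alternatives in regions)


def covering_set(regions: Sequence[Sequence[str]]) -> List[Tuple[str, ...]]:
    """Build a minimal-size set of combinations in which each alternative
    of each region is used at least once (round-robin, illustrative only)."""
    if not regions or any(len(alts) == 0 for alts in regions):
        raise ValueError("each region needs at least one alternative")
    n = max(len(alts) for alts in regions)
    return [tuple(alts[i % len(alts)] for alts in regions) for i in range(n)]
```

For a gene with two promoters, three alternative forms of one exon and two of another, there are 12 possible combinations, but three transcripts are enough to represent every alternative once.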
Alternative transcripts: cases that disrupt the CDS
Due to a retained intron or alternative splice, cDNA, EST, or RNA-Seq junction data may support an alternative transcript that would result in a premature stop codon or a downstream start – it usually cannot be determined which. A higher level of support is required for the annotation of such a transcript (multiple cDNA/ESTs or a high-frequency junction), and a comment is appended to the transcript (see Curator comments). However, rather than continue to annotate truncated proteins corresponding to these transcripts, FlyBase is developing a proposal to reclassify them as non-coding transcripts produced from protein-coding gene loci.
Merges, splits, and “splerges”
Gene splits or merges are a common annotation correction and are based upon RNA-Seq coverage or junction data, cDNA/EST data, BLASTX homologies, or corrections submitted by the community. A comment is placed in the gene record indicating that a merge or split has occurred, in which release, along with an indication of the type of data supporting the change.
Generally, FlyBase considers transcripts for which any portion of the predicted protein is in common to represent a single gene. Examples are considered on a case-by-case basis; some exceptions to this rule have been made for well-characterized genes that exist as separate unrelated entities in other phylogenetic groups.
Exceptional cases (2012)
Many of the Curator comments flag exceptional types of annotations.
Dicistronic and polycistronic annotations: Proteins encoded by a multicistronic transcript are considered to represent different genes. Preferably, a polycistronic transcript is supported by more than one spanning cDNA or EST; care must be taken not to misinterpret cases of overlapping UTRs. Alternative explanations, such as a mutant in the strain or stop-codon suppression, must be ruled out. Each postulated protein should have additional support. This is not a particularly rare “exceptional” class: there are currently more than 130 dicistronic pairs annotated, 4 tricistronic sets and 2 tetracistronic sets.
Stop-codon suppression/readthrough: This class requires support from the literature, including evolutionary comparative data. As a result of work from one of the modENCODE groups (Jungreis et al.), this is no longer a rare class: there are currently more than 300 genes with a transcript annotated with a stop-codon readthrough.
Non-ATG starts and translational frameshifts: These require a high level of support, such as detailed treatment in the literature or unambiguous homology data. There are currently 11 genes annotated with a non-ATG start and one with a translational frameshift (Oda). To date, all non-canonical starts vary from ‘AUG’ by one nucleotide.
Trans-spliced transcripts: One gene in Dmel undergoes extensive trans-splicing (mdg4); others may undergo lower levels. If there is sufficient evidence, the trans-splicing precursors should also be annotated.
Pseudogenes: A non-functional gene that (1) has a related gene in the genome and (2) has more than one compromising lesion, is classified as a pseudogene. If there is only one lesion, it is described as a mutant in the strain (see below); polymorphic pseudogenes are treated as mutations in the strain. Retrotransposed pseudogenes are relatively rare: 5 are currently flagged as such. (Retrogenes exist -- at least 98 have been identified -- but nearly all appear to be functional.)
Mutations in the strain: These are now relatively easily assessed, since there is sequence information for multiple Dmel wild-type strains and for closely related species. Cases for which some wild-type strains carry a functional allele and others carry the mutant allele are flagged with a comment that they represent polymorphic pseudogenes.
Chimeric genes: These are occasionally created at the site of an aberration (usually a tandem duplication). Evidence for expression of such a gene is often ambiguous, since ESTs and RNA-Seq data corresponding to the component genes also may align to the chimeric copy. If a CDS appears to be supported, the gene is classified as coding, but is flagged as “Gene model uncertain.” If it appears unlikely that a protein product is produced, the gene is classified as a pseudogene, and also flagged as “Gene model uncertain.”
Stretching the definition of a gene: As described above, transcripts for which any portion of the predicted protein is in common are usually considered to be products of a single gene. A number of genes with very complex alternative splicing patterns have pairs of transcripts with coding regions that fail to overlap each other, but both of which overlap the coding region of a third transcript. We flag these cases with the following comment: "Gene model includes transcripts encoding non-overlapping portions of the full CDS."
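The single-gene rule and the stretched case described above amount to grouping transcripts by transitive CDS overlap. The sketch below treats each transcript's CDS as a list of inclusive genomic intervals and groups transcripts with a small union-find. It is an illustration under simplifying assumptions: reading frame is ignored, and the function names and data layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # inclusive genomic start and end of a CDS segment


def cds_overlap(a: List[Interval], b: List[Interval]) -> bool:
    """True if any CDS segment of one transcript overlaps one of the other."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def group_into_genes(cds: Dict[str, List[Interval]]) -> List[Set[str]]:
    """Group transcripts so that any two linked by a chain of CDS overlaps
    fall into the same gene (frame is not considered in this sketch)."""
    parent = {name: name for name in cds}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    names = list(cds)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if cds_overlap(cds[first], cds[second]):
                parent[find(first)] = find(second)

    groups: Dict[str, Set[str]] = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())
```

If transcript RA has CDS 100-200, RB has CDS 400-500 and RC has CDS 150-450, RA and RB do not overlap each other, but both overlap RC, so all three are placed in one gene. This is the situation flagged with the comment quoted above.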
Ambiguous cases: Generally, if a case is ambiguous, rather than creating an exceptional annotation a comment is added.
Assessment of supporting data (2012)
One of the most difficult aspects of annotation is assessing the validity of the supporting data. Most problems are recognized only after a period of time, during which a pattern emerges. Examples: the RE and RH cDNA libraries have a higher frequency of genomically primed clones; the GenScan prediction algorithm tends to overpredict, creating long annotations with many small exons (it is more appropriate for vertebrates); unspliced ESTs must be viewed with caution, especially those from the Exelixis libraries; the IP cDNAs exhibit a higher frequency of chimeric or atypically spliced clones.
The new high throughput RNA-Seq, RNA-Seq junction, and TSS data must be viewed with the same caveats. We are still developing annotation guidelines for dealing with these datasets; it is an advantage that they include some quantitative measures of validity and/or frequency. Low-frequency junctions, for example, may not be used in a gene model; if this is the case a comment is added (see Curator comments). Any type of aligned data is problematic in regions of repeats; this is true of RNA-Seq coverage and junction data.
Curator comments
Current list of standardized annotation comments (2012)
If curator comments exist for a particular annotation, they can be found on the Gene Report, in the GENE MODEL AND FEATURES section, in a field labeled Comments on Gene Model. If comments exist for a particular transcript, they can be found on the annotated transcript report in a section called COMMENTS. This section will appear on the transcript report only if there are comments associated with the transcript. SO terms and ID numbers are included whenever appropriate (see Vocabularies).
Annotation Comments (apply to the whole gene model)
- Annotated transcripts do not represent all possible combinations of alternative exons and/or alternative promoters.
- Annotated transcripts do not represent all supported alternative splices within 5' UTR.
- Low-frequency RNA-Seq exon junction(s) not annotated.
- Supported by RNA-Seq data.
- Supported by strand-specific RNA-Seq data.
- Probable lncRNA gene; may encode small polypeptide(s).
- Possible non-coding RNA gene.
- Antisense: overlaps [] on opposite strand.
- Antisense (in part): overlaps [] on opposite strand.
- gene_with_dicistronic_mRNA ; SO:0000722
- gene_with_polycistronic_transcript ; SO:0000690
- May be component of a dicistronic gene; available data inconclusive.
- Shares 5' UTR with upstream gene.
- Shares 5' UTR with downstream gene.
- Gene model includes transcripts encoding non-overlapping portions of the full CDS.
- Pseudogene similar to []; proximate; partial; created by tandem duplication.
- Pseudogene similar to []; transposed.
- Apparent introns not annotated: probable artifact due to repetitive sequence.
- miRNA(s) located within the transcribed region of this non-coding RNA gene.
- Alternative translation stop created by use of multiphasic reading frames within coding region.
- Variable use of small exon; supported combination results in frameshift and premature stop in downstream exon.
- Multiphase exon postulated: exon reading frame differs in alternative transcripts.
- Multiphase exon postulated: reading frame of first coding exon differs in alternative transcripts.
- Mutation in sequenced strain: [*].
- Polymorphic pseudogene: intact in some individuals or strains, disrupted by mutation in others.
- Gene model uncertain: []
- Gene model uncertain: chimeric gene.
- Gene model is incomplete due to []
- gene_with_transcript_with_translational_frameshift ; SO:0000712
- Translational frameshifting postulated (FBrfnnnnnnn): -1 [+1] frameshift reflected in aa sequence of predicted polypeptide[s].
- gene_with_stop_codon_redefined_as_selenocysteine ; SO:0000710
- Stop-codon suppression (UGA as Sec) postulated (FBrfnnnnnnn).
- gene_with_stop_codon_read_through ; SO:0000697
- Stop-codon suppression (Uxx) postulated; FBrfnnnnnnn.
- gene_with_unconventional_translation_start_codon ; SO:0001739
- gene_with_translation_start_codon_CUG ; SO:0001740
- Unconventional translation start (XYZ) postulated; FBrfnnnnnnn.
- gene_with_trans_spliced_transcript ; SO:0000459
- Multiphase exon postulated: this gene shares a region of coding sequence with an overlapping gene, but different reading frames are utilized in the overlapping coding region.
- Bidirectional region of coding sequence postulated: a portion of the CDS of this gene overlaps a portion of the CDS of a gene on opposite strand.
Transcript Comments (apply to a specific transcript)
- Transcript terminates at site supported by polyadenylated cDNA.
- Extended 3' UTR based on RNA-Seq and/or EST data.
- UTR(s) based on RNA-Seq data.
- Transcriptional initiation is supported by short-capped RNA data (FBrf0209722).
- Evidence supports alternative splice leading to premature stop codon and/or downstream start; may or may not produce functional polypeptide.
- Based on cDNA(s) with retained intron; results in premature stop codon and/or downstream start; may or may not produce functional polypeptide.
- Unconventional splice site postulated (XY-WZ).
- Non-coding alternative transcript supported, [retained intron/alternative splice] (FBrfnnnnnnn).
- Truncated polypeptide supported, [retained intron/alternative splice] (FBrfnnnnnnn).
- Truncated polypeptide supported, [alternative downstream AUG/alternative terminal exon] (FBrfnnnnnnn).
- Monocistronic transcript; alternative dicistronic transcript(s) exist.
- Dicistronic transcript; alternative monocistronic transcript(s) exist.
- Dicistronic transcript.
- Polycistronic transcript.
- Unconventional splice site invoked (XY-WZ); sequence altered due to transposon insertion; this splice may not occur in vivo.
- Unconventional splice site(s) invoked due to gap in genomic sequence; this splice does not occur in vivo.
- Stop-codon suppression (UGA as Sec) postulated (FBrfnnnnnnn); reflected in aa sequence of predicted polypeptide.
- Stop-codon suppression (Uxx) postulated (FBrfnnnnnnn); reflected in aa sequence of predicted polypeptide.
- Unconventional translation start postulated (XYZ encoding Met); FBrfnnnnnnn.
- Downstream translation start supported by comparative analysis across Drosophila species.
- Downstream translation start supported by [FBrfnnnnnnn].
- Transcript postulated to overlap transposable element.
Gene Model Annotation Guidelines (2007)
Updated annotation guidelines are described above.
Criteria for Annotation
Purpose: To determine whether existing gene models are correct and complete and to determine if there is evidence for additional genes or transcripts not already represented by the existing models.
Determine whether a protein-coding gene exists in a region.
Gene prediction algorithms are sufficiently robust that this is rarely an issue for larger genes (200aa or greater), unless the gene consists of many small dispersed exons. To make a judgment in cases of small genes or genes comprised of small exons, available evidence is examined further. Three types of evidence are considered:
- Matches to cDNA sequence data (BDGP cDNA/EST data or data generated by the community). Considered more significant if it includes an intron with consensus splice sites.
- Gene prediction data, including conserved protein signatures.
- BLASTX homology; matches with expected value less than 1 x 10^-7 (1e-7) are considered.
For gene models with only one of these three types of supporting data, models with a predicted CDS greater than 100aa are created or retained. If there are two or more types of supporting data, a gene model is created if the predicted CDS exceeds 50aa. If there is BLASTX homology to a similar small gene in other species, a smaller size limit is accepted.
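The 2007 size thresholds can be written out as a small decision function. This is a sketch of the rule as stated in the paragraph above, with hypothetical function names. The exception for BLASTX homology to a small gene in other species is deliberately not encoded, because no specific smaller limit is given.

```python
def size_threshold_aa(n_evidence_types: int) -> int:
    """Minimum predicted CDS length (aa) that must be exceeded, given how
    many of the three evidence types support the model."""
    if n_evidence_types < 1:
        raise ValueError("at least one type of supporting data is required")
    return 100 if n_evidence_types == 1 else 50


def passes_size_rule(cds_length_aa: int, n_evidence_types: int) -> bool:
    """True if the predicted CDS is long enough to create or retain a model."""
    return cds_length_aa > size_threshold_aa(n_evidence_types)


def blastx_hit_is_considered(expect_value: float) -> bool:
    """BLASTX matches count as evidence only below an E-value of 1e-7."""
    return expect_value < 1e-7
```

For example, a 75aa prediction supported only by a gene prediction algorithm would not be retained, while the same prediction with an additional spliced EST match would be.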
Is there one gene or several?
Gene splits or merges are a common annotation correction and are based upon cDNA/EST data, BLASTX homologies, or corrections submitted by the community. A comment indicating that a merge or split has occurred, along with an indication of the type of data supporting the change, is placed in the annotation record.
FlyBase considers transcripts for which any portion of the predicted protein is in common to represent a single gene. An overlap of a single amino acid is sufficient, but overlap of a multiphasic coding exon (different ORFs used) is not.
Determine the structure of the transcript(s).
Internal intron-exon structures are based primarily upon EST/cDNA data. If these data are absent, we rely on gene prediction data. In a few cases, approximate gene structures are inferred from BLASTX alignments. In practice, many annotations are based upon a combination of these data types. Examples:
- When cDNA/ESTs only cover the termini, internal structures will be based upon gene prediction data. The 5' terminus of a transcript is extended to the start of the overlapping EST that extends furthest 5'. Unspliced ESTs generally are not considered.
- If there is no 5' cDNA/EST data, the transcript is extended to the first in-frame ATG consistent with the gene prediction or BLASTX data. Similarly, if an annotation supported by 5' EST data does not contain an in-frame start codon, the annotation is extended to the first such start.
- The 3' terminus is extended to the 3' end of a complete cDNA, if available, or to the 3' end of an overlapping 3' EST. Unspliced ESTs generally are not considered.
- Starting in 2008: full-length cDNAs are checked for terminal polyA's. Annotated transcripts are extended 3' to the last non-A nucleotide and the following comment appended: "Transcript terminates at site supported by polyadenylated cDNA."
- If there is no 3' cDNA/EST data, the transcript is extended to the first stop codon consistent with the gene prediction data or BLASTX alignment.
- Whenever splice sites other than GT/AG are annotated, a comment is appended to the transcript.
Determine the extent of the coding region.
The Apollo annotation tool sets the translation start site to the 5'-most in-frame ATG. But, in cases supported by the literature (including conservation patterns across Drosophila species), a non-ATG translation start site, or a downstream ATG may be used.
In some cases, especially for annotations supported only by BLASTX data, it is not possible to identify a likely ATG start codon. In such cases, translation is started at the 5'-most internal in-frame codon and an explanatory comment is added.
The following policy was established in 2008: Any change to an existing transcript that results in a change to the CDS will be accompanied by changes to the transcript and protein symbols and IDs.
How many alternative transcripts exist?
We annotate as many alternative transcripts as are supported by cDNA/EST and community data. We will also annotate an alternative transcript if there is convincing gene prediction evidence and/or BLASTX evidence.
If non-contiguous EST data support alternative exons in several regions of the gene, it is not always possible to determine which of all possible combinations actually exist in vivo. The number of such alternative transcripts to be created is at the discretion of the annotator; generally, when there are more than 6-8 transcripts, all alternative exons are represented, but not all possible combinations. In such cases, alternative termini are usually associated with the most common transcript pattern.
Protein conservation data may support additional internal exons. These are assessed for homology to adjacent exons, thus indicating a pattern of exon shuffling.
Partial annotations are avoided except in extreme circumstances; the exception is failure to find a likely ATG start codon if it is encoded in a small or distant exon (see above).
Curator comments (see section 1.5).
The Apollo annotation tool allows for the inclusion of comments associated with an annotated gene or a specific transcript of an annotated gene. We make extensive use of this capability, including controlled comments as well as free text comments. The collection of controlled comments was developed during the initial re-annotation stages, and is used as often as possible to facilitate consistency and to provide a means of tracking or querying for various atypical gene structures. For example, all predicted splices that fail to use the canonical GT/AG donor and acceptor splice site dinucleotides are noted, as are genes that have been reported to make use of non-ATG translation starts, genes that have a dicistronic transcript, and genes known to be or appearing to be mutant in the sequenced strain.
Many of the controlled comments address the weaknesses or anomalies in the annotation: an unusual alternative transcript supported by a single EST, incomplete supporting data requiring extension of a gene model to the nearest translation start or stop, or that an ATG translation start codon could not be identified. Genes that are split or merged are commented and the type of evidence supporting the change indicated. Finally, cDNA clones that failed to accurately reflect the annotation (typically those that are incomplete or appear to include intronic sequences) are designated as problematic and have a comment attached.
If such comments exist for a particular annotation, they can be found on the Gene Report, in the GENE MODEL AND FEATURES section, in a field labeled Comments on Gene Model. If comments exist for a particular transcript, they can be found on the annotated transcript report in a section called COMMENTS. This section will only appear on the report if there are comments attached to the transcript.
Evidence used for gene model annotation as of March 2007
Since the publication of the description of the r3.1 reannotation effort (Misra, et al., 2002), a number of new and expanded data sets allow much more accurate assessment of gene models in D. melanogaster. These include:
- Expanded collection of high quality full-length cDNA sequences provided by the BDGP
- Expanded collection of predicted protein sequences, especially from insects (provided to and aligned for FlyBase by NCBI).
- Additional EST collections, including from Exelixis
- Additional gene prediction algorithms including Augustus (Stanke, et al. 2006, BMC Bioinformatics 7:62), Contrast (Gross, Do, and Batzoglou, 2005, BCATS 2005 Symposium Proceedings, p. 82), GeneID (Parra, Blanco, and Guigo, 2000, Genome Research, 10: 511-515), NCBI gnomon (Souvorov, et al. 2006), and SNAP (Korf, 2004, BMC Bioinformatics 5:59).
- The exon prediction algorithm, CONGO, based on conserved protein-coding signatures. (submitted by M. Lin and M. Kellis)
- Proteomics analysis contributed by the Center for Model Organism Proteomes, SystemsX and Research Priority Project of the University of Zurich, Switzerland.
- Community submissions of corrections and other data, including non-coding RNA gene models.
Exceptional cases
For genes with data supporting atypical collections of transcripts, a useful rule of thumb is to consider the data for each transcript in isolation, ignoring other transcripts annotated for the same gene and adjacent genes. This helps reduce our bias against the new and unusual.
The majority of annotation comments (see section 7.5) flag exceptional types of annotations.
Dicistronic annotations: Adhering to the definition described above, the proteins encoded by a multicistronic transcript are considered to represent different genes. Preferably, a dicistronic transcript is supported by more than one spanning cDNA or EST; care must be taken not to misinterpret cases of overlapping UTRs. Alternative explanations, such as a mutant in the strain, must be ruled out. Each postulated protein should have additional support, especially if it is small.
Atypical cDNAs with retained introns: Initially, cDNAs with retained introns were flagged as problematic clones and a corresponding transcript annotation was not created. But given the unexpected frequency with which we observe such cDNAs, and the fact that there are well characterized systems [such as su(w[a])] with experimental support for such transcripts, we have changed our treatment of these cases. A transcript is created and the following comment appended: "Based on cDNA(s) that contain premature stop codon; may or may not produce functional polypeptide."
Non-ATG starts, stop-codon suppression, and translational frameshifts: These require a high level of support, such as detailed treatment in the literature or unambiguous homology data.
Mutations in the strain: These are now relatively easily assessed, since comparisons to closely related species are informative.
Pseudogenes: A non-functional gene that (1) has a related gene in the genome and (2) has more than one compromising lesion is classified as a pseudogene. (If there is only one lesion, it is described as a mutation in the strain.) Retrotransposed pseudogenes appear to be rare. (Retrogenes exist -- at least 98 have been identified -- but nearly all appear to be functional.)
Chimeric genes: These are occasionally created at the site of an aberration (usually a tandem duplication). They are currently annotated as protein-coding genes, usually based on gene prediction algorithms; these cases need to be reassessed and a consistent comment applied.
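The pseudogene rule can be read as a small decision procedure. The sketch below is only an illustration, not FlyBase tooling: the function name, its arguments, and the "unresolved" outcome (for a gene with several lesions but no related gene, a case the guideline does not address) are all assumptions.

```python
def classify_nonfunctional_gene(has_related_gene: bool, lesion_count: int) -> str:
    """Classify a gene already judged non-functional (hypothetical helper).

    has_related_gene: True if a related gene exists elsewhere in the genome.
    lesion_count: number of compromising lesions observed (must be >= 1).
    """
    if lesion_count < 1:
        raise ValueError("a non-functional gene needs at least one lesion")
    if lesion_count == 1:
        # A single lesion is described as a mutation in the strain.
        return "mutation in strain"
    if has_related_gene:
        # Related gene present and more than one lesion: pseudogene.
        return "pseudogene"
    # Assumption: the guideline is silent on this case, so leave it to the curator.
    return "unresolved"
```

For example, `classify_nonfunctional_gene(True, 3)` gives "pseudogene", while `classify_nonfunctional_gene(True, 1)` gives "mutation in strain".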
Stretching the definition of a gene: As described above, transcripts for which any portion of the predicted protein is in common, even a single amino acid, are considered to be products of a single gene. A number of genes with very complex alternative splicing patterns have pairs of transcripts with coding regions that fail to overlap each other, but both of which overlap the coding region of a third transcript. We flag these cases with the following comment: "Gene model includes transcripts encoding non-overlapping portions of the full CDS."
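The situation that triggers this comment can be sketched as a pairwise overlap check. This is a simplified illustration under stated assumptions, not the procedure curators actually use: each transcript's CDS is represented as a list of half-open genomic intervals, overlapping coordinates are taken to mean shared amino acids (reading frame is ignored), and the function and transcript names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open genomic interval: start <= pos < end


def cds_overlap(a: List[Interval], b: List[Interval]) -> bool:
    """True if any coding interval in a overlaps any coding interval in b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def bridged_non_overlapping_pairs(
    cds: Dict[str, List[Interval]]
) -> List[Tuple[str, str, List[str]]]:
    """Find transcript pairs whose CDSs do not overlap each other but
    both overlap the CDS of at least one other transcript of the gene."""
    hits = []
    for x, y in combinations(sorted(cds), 2):
        if cds_overlap(cds[x], cds[y]):
            continue  # this pair shares coding sequence directly
        bridges = [
            z for z in sorted(cds)
            if z not in (x, y)
            and cds_overlap(cds[x], cds[z])
            and cds_overlap(cds[y], cds[z])
        ]
        if bridges:
            hits.append((x, y, bridges))
    return hits


# Hypothetical gene with three transcripts: RA and RB are disjoint, RC spans both.
example = {"RA": [(100, 200)], "RB": [(300, 400)], "RC": [(150, 350)]}
flagged = bridged_non_overlapping_pairs(example)
```

A non-empty result would suggest the gene is a candidate for the "non-overlapping portions of the full CDS" comment; a real check would also need to confirm that the overlapping regions are translated in the same frame.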
Ambiguous cases: Generally, if the case is ambiguous, rather than creating an exceptional annotation a comment is added. Examples: "May be component of a dicistronic gene; available data inconclusive." "3' UTR contains an ORF that is conserved among close Drosophila species (potential peptide [*] aa in length); possible stop-codon suppression."
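The "[*]" in the second example appears to mark a slot that the curator fills in for each case (here, the peptide length). A minimal sketch of that substitution follows; the helper name and the rule that a template must contain exactly one slot are assumptions made for illustration.

```python
PLACEHOLDER = "[*]"


def fill_comment(template: str, value: str) -> str:
    """Substitute a case-specific value into a comment template (hypothetical helper)."""
    if template.count(PLACEHOLDER) != 1:
        # Assumption: each template carries exactly one slot.
        raise ValueError("template must contain exactly one '[*]' placeholder")
    return template.replace(PLACEHOLDER, value)


comment = fill_comment(
    "3' UTR contains an ORF that is conserved among close Drosophila species "
    "(potential peptide [*] aa in length); possible stop-codon suppression.",
    "42",  # illustrative length only, not a real annotation
)
```

Rejecting templates without a slot guards against silently emitting a comment that still contains the literal "[*]".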
Assessment of supporting data
One of the most difficult aspects of annotation is assessing the validity of the supporting data. Most problems are recognized only after a period of time, during which a pattern emerges. Examples: the RE and RH cDNA libraries have a higher frequency of genomically primed clones; the GenScan prediction algorithm tends to overpredict, creating long annotations with many small exons (it is more appropriate for vertebrates); unspliced ESTs must be viewed with caution, especially those from the Exelixis libraries; the IP cDNAs exhibit a higher frequency of chimeric or atypically spliced clones.
We make note of problematic clones, most commonly those that are chimeric, that contain an RTase error resulting in a frameshift, or that appear to be genomically primed. We use the description "Suspect" for clones that appear to be aberrantly spliced (or are unspliced) and do not support a CDS of any size. This information may be found on the FlyBase cDNA clone reports in the "Known Problems" field.
Current list of "canned" annotation comments
An "AnnotationComment" is one that applies to the whole gene model; a "TranscriptComment" applies to a specific transcript.
AnnotationComments
- "gene_with_dicistronic_processed_transcript ; SO:0000722"
- "May be component of a dicistronic gene; available data inconclusive."
- "gene_with_dicistronic_primary_transcript ; SO:0000721"
- "Shares 5' UTR with upstream gene."
- "Shares 5' UTR with downstream gene."
- "Gene merge based on protein alignment (BLASTX) data."
- "Gene merge based on EST/cDNA data."
- "Gene split based on protein alignment (BLASTX) data."
- "Gene split based on EST/cDNA data."
- "Known mutation in sequenced strain."
- "Probable mutation in sequenced strain: premature stop."
- "Probable mutation in sequenced strain: [*]."
- "Gene model includes transcripts encoding non-overlapping portions of the full CDS."
- "3' UTR contains an ORF that is conserved among close Drosophila species (potential peptide [*] aa in length)."
- "3' UTR contains an ORF that is conserved among close Drosophila species (potential peptide [*] aa in length); possible stop-codon suppression."
- "Multiphase exon postulated: exon reading frame varies in alternative transcripts."
TranscriptComments
- "GC splice donor site postulated."
- "Unconventional splice site postulated."
- "5' exon not determined (no ATG translation start identified)."
- "Transcript terminates at site supported by polyadenylated cDNA."
- "Unconventional (non-ATG) translation start supported by [*]."
- "Stop codon suppression supported by [*]."
- "Occurrence of translational frameshift supported by [*]."
- "Downstream translation start supported by [*]."
- "Transcript model based on protein alignment (BLASTX); no experimental evidence for splice sites."
- "Monocistronic transcript; alternative dicistronic transcript(s) exist."
- "Dicistronic transcript; alternative monocistronic transcript(s) exist."
- "Dicistronic transcript."
- "Based on cDNA(s) that contain premature stop codon; may or may not produce functional polypeptide."
- "Transcript postulated to overlap transposable element."