Difference between revisions of "FlyBase:Gene Ontology (GO) Annotation"

From FlyBase Wiki
 
(63 intermediate revisions by 5 users not shown)
Line 1: Line 1:
=G.2. Controlled vocabularies used by FlyBase=
+
=G.3. Classification of Gene Products using Gene Ontology (GO) terms=
  
For many reasons several of the fields in FlyBase use structured controlled vocabularies (aka ontologies). This makes it much easier (and more robust) to make links within the database, as well as making it much easier to search the database for information. Moreover, several of these controlled vocabularies are shared with other databases, and this provides a degree of integration between them. The controlled vocabularies are only implemented in certain fields in FlyBase.
+
FlyBase uses '''Gene Ontology (GO)''' controlled vocabulary (CV) terms for cellular component, biological process and molecular function to describe properties of gene products. Although GO terms are intended to describe the properties of gene products, FlyBase currently assigns GO terms to genes rather than protein or RNA.
  
The controlled vocabularies currently used by FlyBase are:
+
FlyBase is one of the founding members of the [http://geneontology.org/ Gene Ontology (GO) Consortium] and follows the general guidelines for GO annotation as described in the GO documentation.
  
* '''The Gene Ontology (GO)'''. This provides structured controlled vocabularies for the annotation of gene products (although FlyBase at present annotates genes with GO terms, as a surrogate for their products). The GO has three domains: the molecular function of gene products, the biological process (i.e. roles) in which they are involved and their cellular component (location).
+
Also, see the related '''video tutorial''' [https://www.youtube.com/watch?v=XFCR8BRfGp0 Finding related genes in FlyBase: The Gene Ontology].
* '''Anatomy'''. A structured controlled vocabulary of the anatomy of Drosophila melanogaster, used, for example, for the description of phenotypes and where a gene is expressed.
 
* '''Development'''. A structured controlled vocabulary of the development of Drosophila melanogaster, used, for example, for the description of phenotypes and when a gene is expressed.
 
* '''The Sequence Ontology (SO)'''. A structured controlled vocabulary for sequence annotation, for the exchange of annotation data and for the description of sequence objects in databases. Its use by FlyBase means that the various components of the genome are described in a consistent and rigorous manner.
 
* '''FlyBase controlled vocabulary'''. A structured controlled vocabulary used for the annotation of various objects in FlyBase, including publications (by their type), alleles (for their mutagen etc). Although some of these domains will probably always remain local to FlyBase, in time, community ontologies will be available for others (e.g. chemical compounds for mutagens) and FlyBase will then use these.
 
  
All of these structured controlled vocabularies are in the same format, that used by the Open Biomedical Ontology group. This format is called the OBO format and files using it have the suffix '.obo', e.g. gene_ontology.obo. The OBO format is designed to be used with the freely-downloadable OBO-Edit tool.
+
==G.3.1. FlyBase GO data==
 
 
Users should be aware that controlled vocabularies undergo continual development; terms and definitions are refined, added, merged, split and obsoleted in an effort to improve the way they represent their various subjects.
 
 
 
Both the current 'live' versions of each controlled vocabulary and the static versions taken at the time data for this FlyBase release was frozen are available to download from the Precomputed files download page under the Files menu of the Navigation bar.
 
 
 
The detail of each controlled vocabulary term is displayed in a CV Term Report in FlyBase. Individual CV Term Reports can be reached either by clicking on the controlled vocabulary term where it is displayed in a report page (e.g. the GENE ONTOLOGY: Function, Process, and Cellular component section of the Gene Report), or by using the TermLink tool, which allows users to search directly for controlled vocabulary terms from any of the controlled vocabularies used by FlyBase.
 
  
Controlled vocabulary terms can also be searched using the QueryBuilder tool, via their links to objects (such as genes) in FlyBase. If you wish to search using a controlled vocabulary term in QueryBuilder, you should select the GO/Anatomy CV DB dataset in the query segment box (see the QUERY BUILDER HELP section at the bottom of the QueryBuilder page for more details).
+
GO data is displayed in the Gene Ontology: Function, Process, and Cellular component section of individual [[FlyBase:Gene_Report|Gene Reports]].
  
=G.3. Classification of Gene Products using Gene Ontology (GO) terms=
+
In addition, the current release of GO data for all Drosophila melanogaster FlyBase genes can be found in the tab delimited text file gene_association.fb.  
  
FlyBase uses Gene Ontology (GO) controlled vocabulary (CV) terms for cellular component, biological process and molecular function to describe properties of gene products. Although GO terms are intended to describe the properties of gene products, FlyBase currently assigns GO terms to genes rather than protein or RNA.
+
The following provides a brief description of the columns in the [http://{{flybaseorg}}/static_pages/downloads/current/go/gene_association.fb.gz gene_association.fb] file.
  
FlyBase is one of the founding members of the Gene Ontology (GO) Consortium and follows the general guidelines for GO annotation as described in the GO documentation. FlyBase also participates in the GO reference genome project.
+
'''Gene Association file columns'''
  
==G.3.1. FlyBase GO data==
 
 
GO data is displayed in the GENE ONTOLOGY: Function, Process, and Cellular component section of individual Gene Reports.
 
 
In addition, the current release of GO data for all Drosophila melanogaster FlyBase genes can be found in the tab delimited text file gene_association.fb. The following provides a brief description of the columns in the [http://flybase.org/static_pages/downloads/current/go/gene_association.fb.gz gene_association.fb] file.
 
 
   
 
 
:'''1. DB''' The database contributing the gene_association file
 
:'''1. DB''' The database contributing the gene_association file
 
:FB File: always "FB" for gene_association.fb.
 
:FB File: always "FB" for gene_association.fb.
Line 46: Line 29:
 
:Example: dpp
 
:Example: dpp
  
:'''4. Qualifier''' (this field is optional)
+
:'''4. Qualifier'''
:One or more of 'NOT', 'contributes_to' or 'colocalizes_with' as qualifier(s) for a GO annotation.
+
:For each GO annotation, one of the following [[FlyBase:Gene_Ontology_(GO)_Annotation#G.3.2.2._Use_of_Gene_Product_To_Term_Relations|gene product to term relations]] is used:
:Multiple qualifiers are separated by a pipe (|).
+
:  'acts_upstream_of', 'acts_upstream_of_negative_effect',
:FB File: 'contributes_to' or 'colocalizes_with' are not currently
+
:   'acts_upstream_of_positive_effect', 'colocalizes_with',
:displayed in gene_association.fb, but they will be displayed in the
+
:   'contributes_to', 'enables', 'involved_in', 'is_active_in', 'located_in', 'part_of'.
:next release of the FlyBase gene_association file.
+
:This column may also contain the 'NOT' qualifier, separated by a pipe (|) from the gene product to term relation, which makes the annotation statement a negation.
  
 
:'''5. GO ID'''
 
:'''5. GO ID'''
Line 67: Line 50:
  
 
:'''7. Evidence'''
 
:'''7. Evidence'''
:The evidence code for the GO annotation; one of IMP, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA
+
:The evidence code for the GO annotation; one of IMP, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA, HDA, HMP, HGI, HEP, IBA
  
 
:'''8. With (or) From'''
 
:'''8. With (or) From'''
Line 75: Line 58:
 
:symbol and identifier, or a sequence (protein or nucleic acid)
 
:symbol and identifier, or a sequence (protein or nucleic acid)
 
:identifier. For IC, the GO identifier of the term used as the basis of
 
:identifier. For IC, the GO identifier of the term used as the basis of
:a curator inference is given. With statements for IC are not currently
+
:a curator inference is given.
:displayed in gene_association.fb, but they will be displayed in the
 
:next release of the FlyBase gene_association file.
 
 
:IGI example: FLYBASE:rpr; FB:FBgn0011706
 
:IGI example: FLYBASE:rpr; FB:FBgn0011706
 
:ISS example: UniProt:P35569
 
:ISS example: UniProt:P35569
Line 118: Line 99:
 
:'''15. Assigned_by'''
 
:'''15. Assigned_by'''
 
:The source of the GO annotation.
 
:The source of the GO annotation.
:FB File: One of either FB or UniProtKB.
 
  
 
The latest version of this data is also available for download here from the Gene Ontology consortium site. The accompanying README document includes a detailed description of the file format, FlyBase GO annotation policy and sources used for FlyBase GO annotations.
 
The latest version of this data is also available for download here from the Gene Ontology consortium site. The accompanying README document includes a detailed description of the file format, FlyBase GO annotation policy and sources used for FlyBase GO annotations.
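The column layout described above can be read with a few lines of Python. The following is a minimal sketch, not a FlyBase tool: the function name and dictionary keys are hypothetical, only the columns named on this page are extracted, and skipping lines that begin with "!" assumes the usual header convention of gene association files. The sample row is synthetic, assembled from the individual column examples on this page (the evidence code in it is a placeholder), and is not a real annotation.

```python
import csv
import io

# 0-based positions of the columns described on this page
# (column 1 = index 0, ..., column 15 = index 14). Key names are illustrative.
GAF_COLUMNS = {
    "db": 0,
    "db_object_id": 1,
    "db_object_symbol": 2,
    "qualifier": 3,
    "go_id": 4,
    "db_reference": 5,
    "evidence": 6,
    "with_from": 7,
    "assigned_by": 14,
}


def read_gene_association(handle):
    """Yield one dict per annotation line of a tab-delimited association file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: header/comment lines start with "!" and carry no data.
        if not row or row[0].startswith("!"):
            continue
        yield {
            name: (row[idx] if idx < len(row) else "")
            for name, idx in GAF_COLUMNS.items()
        }


# Synthetic row: values are taken from the per-column examples on this page,
# with empty strings for columns that are not described here.
synthetic_row = [
    "FB", "FBgn0000490", "dpp", "enables", "GO:0005160",
    "FB:FBrf0136863|PMID:11432817", "IDA", "",
] + [""] * 6 + ["FB"]
sample = "!example header line\n" + "\t".join(synthetic_row) + "\n"

records = list(read_gene_association(io.StringIO(sample)))
# Column 6 may hold an FBrf and a PubMed ID separated by a pipe.
references = records[0]["db_reference"].split("|")
print(records[0]["db_object_symbol"], records[0]["go_id"], references)
```

For the downloaded gene_association.fb.gz file, the same function could be given a handle opened with `gzip.open(path, "rt")`.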
Line 124: Line 104:
 
Note that the GO data available from FlyBase will not necessarily be identical to that found on the GO website. GO validate the data FlyBase submits and remove lines of data that are no longer valid e.g. when a GO term becomes obsolete.
 
Note that the GO data available from FlyBase will not necessarily be identical to that found on the GO website. GO validate the data FlyBase submits and remove lines of data that are no longer valid e.g. when a GO term becomes obsolete.
  
QueryBuilder can be used to identify all the genes associated with a particular GO term. The [http://www.godatabase.org/cgi-bin/amigo/go.cgi?search_constraint=terms&action=replace_tree AmiGO] and [http://www.ebi.ac.uk/ego/search.html QuickGO] browsing tools can be used to find GO terms of interest.
+
QueryBuilder can be used to identify all the genes associated with a particular GO term. The [http://amigo.geneontology.org/amigo AmiGO] and [http://www.ebi.ac.uk/ego/search.html QuickGO] browsing tools can be used to find GO terms of interest. See '''video tutorial''' [https://www.youtube.com/watch?v=XFCR8BRfGp0 Finding related genes in FlyBase: The Gene Ontology].
  
 
==G.3.2. Evidence==
 
==G.3.2. Evidence==
  
Evidence for a GO term consists of an evidence code that describes the type of analysis carried out together with, in some cases, a reference to another database object that supports the evidence (see with/from Supporting Evidence below).
+
Evidence for a GO term consists of an evidence code that describes the type of analysis carried out together with, in some cases, a reference to another database object that supports the evidence.
  
Evidence codes: The Gene Ontology [http://www.geneontology.org/GO.evidence.shtml Guide to GO Evidence Codes] contains comprehensive descriptions of the evidence codes used in GO annotation. FlyBase uses the following evidence codes when assigning GO data:
+
===G.3.2.1. GO Evidence Codes===
  
 +
The Gene Ontology [http://geneontology.org/docs/guide-go-evidence-codes/ Evidence Codes] guide contains comprehensive descriptions of the evidence codes used in GO annotation. FlyBase uses the following evidence codes when assigning GO data:
 +
:    inferred from experiment (EXP)
 
:    inferred from mutant phenotype (IMP)
 
:    inferred from mutant phenotype (IMP)
 
:    inferred from genetic interaction (IGI)
 
:    inferred from genetic interaction (IGI)
Line 137: Line 119:
 
:    inferred from physical interaction (IPI)
 
:    inferred from physical interaction (IPI)
 
:    inferred from expression pattern (IEP)
 
:    inferred from expression pattern (IEP)
 +
:    inferred from high throughput experiment (HTP)
 +
:    inferred from high throughput mutant phenotype (HMP)
 +
:    inferred from high throughput genetic interaction (HGI)
 +
:    inferred from high throughput direct assay (HDA)
 +
:    inferred from high throughput expression pattern (HEP)
 
:    inferred from sequence or structural similarity (ISS)
 
:    inferred from sequence or structural similarity (ISS)
 +
:    inferred from sequence orthology (ISO)
 +
:    inferred from sequence alignment (ISA)
 +
:    inferred from biological aspect of ancestor (IBA)
 
:    inferred from electronic annotation (IEA)
 
:    inferred from electronic annotation (IEA)
 
:    inferred from reviewed computational analysis (RCA)
 
:    inferred from reviewed computational analysis (RCA)
Line 145: Line 135:
 
:    no biological data available (ND)
 
:    no biological data available (ND)
  
===G.3.2.1. Use of evidence codes===
+
===G.3.2.2. Use of evidence codes===
  
Consistent with the aims of the GO reference genome project, FlyBase prefers to assign GO terms based on experimental evidence codes (IMP, IGI, IDA, IPI, IEP). Of these five codes, FlyBase uses IEP relatively infrequently since expression patterns generally provide less direct evidence for GO terms than the other four codes. FlyBase does use IEP where an author explicitly states that expression data is the evidence for a term.
+
Consistent with the aims of the GO reference genome project, FlyBase prefers to assign GO terms based on experimental evidence codes (IMP, IGI, IDA, IPI, IEP). Infrequently, GO annotations will be associated with genes from high throughput (HTP) experiments. HTP experiments are only annotated when they satisfy a number of rules, to ensure that minimal false positives are propagated by GO annotation. These experimental annotations are marked by the evidence codes HTP, HDA, HMP, HGI and HEP, so that they can be distinguished from low-throughput, hypothesis-driven experiments.
  
Evidence codes based on computer predictions (ISS, IEA, RCA), author statements (NAS, TAS) and curator inference (IC) will continue to be used in the absence of experimental data for the same or a more specific GO term. However, we aim to remove GO data with these codes when experimental evidence for the term is curated.
+
Evidence codes based on computer predictions (IEA, RCA), author statements (NAS, TAS) and curator inference (IC) will continue to be used in the absence of experimental data for the same or a more specific GO term. However, we aim to remove GO data with these codes when experimental evidence for the term is curated.
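The grouping of evidence codes described above can be written down as plain sets. The sketch below is illustrative only: the function name and the sample gene and term identifiers are hypothetical placeholders, it treats only the five low-throughput codes as "experimental" (an assumption about how the policy is applied), and it matches on the identical GO term, since handling "a more specific GO term" would require the ontology graph.

```python
# Evidence code groups as listed in the text above.
EXPERIMENTAL = {"IMP", "IGI", "IDA", "IPI", "IEP"}
HIGH_THROUGHPUT = {"HTP", "HDA", "HMP", "HGI", "HEP"}
NON_EXPERIMENTAL = {"IEA", "RCA", "NAS", "TAS", "IC"}


def removal_candidates(annotations):
    """Return non-experimental annotations that duplicate an experimental one.

    `annotations` is a list of (gene_id, go_id, evidence_code) tuples.
    Only identical (gene, term) pairs are compared; more specific terms
    are not considered in this simplified sketch.
    """
    experimentally_backed = {
        (gene, term) for gene, term, code in annotations if code in EXPERIMENTAL
    }
    return [
        (gene, term, code)
        for gene, term, code in annotations
        if code in NON_EXPERIMENTAL and (gene, term) in experimentally_backed
    ]


# Placeholder identifiers, not real FlyBase genes or GO terms.
sample_annotations = [
    ("geneA", "GO:term1", "IEA"),
    ("geneA", "GO:term1", "IDA"),
    ("geneA", "GO:term2", "TAS"),
    ("geneB", "GO:term1", "IEA"),
]
candidates = removal_candidates(sample_annotations)
print(candidates)
```

Here only the IEA annotation of geneA to GO:term1 is flagged, because it is the only non-experimental annotation with an experimental counterpart for the same gene and term.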
  
The evidence code ND (no biological data available) is used for annotations to the three unknown GO terms: "molecular_function unknown ; GO:0005554", "biological_process unknown ; GO:0000004" and "cellular_component unknown ; GO:0008372". In FlyBase the use of any of these three GO terms, attributed to reference FBrf0159398 and supported by the ND evidence code, signifies that a curator has examined the available literature and sequence for this gene and that, as of the date of the annotation to the term, there is no information supporting an annotation to any more specific GO term in that ontology. Recently, GO removed the unknown terms and changed to using the root terms "molecular_function ; GO:0003674", "biological_process ; GO:0008150" or "cellular_component ; GO:0005575" with the ND evidence code; this provides a more accurate ontological representation of the current knowledge about the gene products. FlyBase will implement this change in the next release.
+
The evidence code ND (no biological data available) is used in combination with the root GO terms "molecular_function ; GO:0003674", "biological_process ; GO:0008150" or "cellular_component ; GO:0005575". In FlyBase the use of any of these three GO terms, attributed to reference FBrf0159398 and supported by the ND evidence code, signifies that a curator has examined the available literature and sequence for this gene and that, as of the date of the annotation to the term, there is no information supporting an annotation to a specific GO term.
 
 
Additional information about the way FlyBase uses evidence codes can be found in the README document.
 
 
 
with/from Supporting Evidence
 
  
 +
'''with/from Supporting Evidence'''
 
Some evidence codes (IGI, IPI, ISS, IEA, IC) are used in conjunction 'with' supporting data in the form of a reference to another database object. These objects are identified by their database abbreviation followed by a colon and the unique identifier for the object in that database. A list of current database abbreviations can be found in the GO.xrf_abbs file. See the GO Annotation Guide for more details.
 
Some evidence codes (IGI, IPI, ISS, IEA, IC) are used in conjunction 'with' supporting data in the form of a reference to another database object. These objects are identified by their database abbreviation followed by a colon and the unique identifier for the object in that database. A list of current database abbreviations can be found in the GO.xrf_abbs file. See the GO Annotation Guide for more details.
  
ISS and IEA 'with'
+
'''ISS and IEA 'with''''
  
 
FlyBase captures GO data based on similarity to other gene products that are known to have that attribute. Since October 1st 2006, it has been mandatory for ISS annotations to include an identifier for the sequence used to make the annotation; earlier FlyBase ISS annotations that do not include identifiers will be updated gradually. In line with current guidelines for reference genomes, curators now check that the similar sequence can be annotated to the GO term with experimental evidence (IDA, IMP, IGI, IPI, IEP) before making an ISS annotation. This policy was adopted to avoid circular similarity-based annotations. Consequently, GO terms are not curated based on multiple sequence alignments if none of the sequences in the alignment have been experimentally verified. Annotations made before October 2006 have not necessarily been checked in this way.
 
FlyBase captures GO data based on similarity to other gene products that are known to have that attribute. Since October 1st 2006, it has been mandatory for ISS annotations to include an identifier for the sequence used to make the annotation; earlier FlyBase ISS annotations that do not include identifiers will be updated gradually. In line with current guidelines for reference genomes, curators now check that the similar sequence can be annotated to the GO term with experimental evidence (IDA, IMP, IGI, IPI, IEP) before making an ISS annotation. This policy was adopted to avoid circular similarity-based annotations. Consequently, GO terms are not curated based on multiple sequence alignments if none of the sequences in the alignment have been experimentally verified. Annotations made before October 2006 have not necessarily been checked in this way.
 
For example, the Drosophila gene bigmax is annotated with the GO term 'regulation of transcription' based on sequence similarity to Max. This annotation is legitimate because Max has been shown to regulate transcription in a direct assay.
 
  
 
The combined evidence appears on the gene report in the format:
 
The combined evidence appears on the gene report in the format:
Line 171: Line 156:
 
In this case we have given two identifiers (symbol and gene ID) for the same sequence; identifiers for the same sequence are separated by a semi-colon. If more than one sequence is used to make the annotation then the identifiers for the different sequences are separated by a comma. Note that this use of multiple identifiers is different from that for IGI and IPI.
 
In this case we have given two identifiers (symbol and gene ID) for the same sequence; identifiers for the same sequence are separated by a semi-colon. If more than one sequence is used to make the annotation then the identifiers for the different sequences are separated by a comma. Note that this use of multiple identifiers is different from that for IGI and IPI.
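The separator convention for ISS 'with' statements (semi-colon between identifiers for the same sequence, comma between different sequences, and a colon between the database abbreviation and the identifier) can be illustrated with a short parser. This is a minimal sketch under those stated rules; the function name is hypothetical, and the combined two-sequence string at the end is constructed purely for illustration from the two single examples given on this page.

```python
def parse_with(text):
    """Split a 'with' string into sequences, each a list of (db, identifier).

    Commas separate different sequences; semi-colons separate alternative
    identifiers for the same sequence; the first colon separates the
    database abbreviation from the identifier.
    """
    sequences = []
    for sequence_part in text.split(","):
        identifiers = []
        for item in sequence_part.split(";"):
            item = item.strip()
            if not item:
                continue
            db, _, ident = item.partition(":")
            identifiers.append((db, ident))
        if identifiers:
            sequences.append(identifiers)
    return sequences


# Single sequence with two identifiers (symbol and gene ID).
one_sequence = parse_with("FLYBASE:rpr; FB:FBgn0011706")

# Illustrative combination of two sequences; not taken from a real annotation.
two_sequences = parse_with("UniProt:P35569, FLYBASE:rpr; FB:FBgn0011706")

print(one_sequence)
print(two_sequences)
```

Returning a list of lists keeps the two levels of grouping apart, so a caller can tell "two names for one sequence" from "two sequences".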
  
Where the database object used to make IEA annotations can be identified then this is included in the same way. However, the majority of FlyBase annotations with IEA do not yet include such a reference. Most IEA annotations in FlyBase are based on the presence of protein domains that are mapped to GO terms. The identifiers for the protein domains will be included in future releases.
+
IEA annotations in FlyBase are based on the presence of InterPro protein domains that are mapped to GO terms provided by [https://www.ebi.ac.uk/GOA/InterPro2GO EMBL-EBI Gene Ontology Annotation InterPro2GO] or, for non-coding RNAs, on RNA sequence families that are mapped to GO terms provided by [https://rfam.org/ Rfam].
  
IGI and IPI 'with'
+
'''IGI, HGI and IPI 'with''''
  
 
For both IGI and IPI, 'with' has a special meaning. All annotations inferred from genetic interaction (IGI) include an identifier for the interacting gene. If the GO term is inferred based on multiple genes interacting simultaneously then all interacting genes are identified using 'with' (separated by commas). However, if the GO term is inferred from multiple pairwise interactions these are treated as separate pieces of experimental evidence and appear with separate evidence codes on the gene report.
 
For both IGI and IPI, 'with' has a special meaning. All annotations inferred from genetic interaction (IGI) include an identifier for the interacting gene. If the GO term is inferred based on multiple genes interacting simultaneously then all interacting genes are identified using 'with' (separated by commas). However, if the GO term is inferred from multiple pairwise interactions these are treated as separate pieces of experimental evidence and appear with separate evidence codes on the gene report.
Line 187: Line 172:
 
Similar notation is used for IPI where the interacting gene product is identified using 'with'. Where several gene products interact simultaneously they are recorded in a single annotation (separated by commas after the evidence code). Pairwise physical interactions are recorded independently using separate evidence codes.
 
Similar notation is used for IPI where the interacting gene product is identified using 'with'. Where several gene products interact simultaneously they are recorded in a single annotation (separated by commas after the evidence code). Pairwise physical interactions are recorded independently using separate evidence codes.
  
IC 'from' Evidence inferred by curator (IC) is the case that includes 'from'. Curators use this code for those cases where an annotation is not supported by any evidence, but can be reasonably inferred from other GO annotations, for which evidence is available. The object identified in the IC evidence is always a GO term identifier.
+
'''IC 'from''''
 +
 
 +
Evidence inferred by curator (IC) is the case that includes 'from'. Curators use this code for those cases where an annotation is not supported by any evidence, but can be reasonably inferred from other GO annotations, for which evidence is available. The object identified in the IC evidence is always a GO term identifier.
  
 
For example, a protein shown to have transcription factor activity in a direct assay could be annotated with the GO term 'general RNA polymerase II transcription factor' (GO:0016251). In the absence of any evidence for the cellular location of that protein, it would be reasonable for the curator to infer that it is (at least sometimes) located in the nucleus. This would lead to the annotation, nucleus inferred by curator from GO:0016251; the annotation is attributed to the reference that contains evidence for transcription factor activity.
 
For example, a protein shown to have transcription factor activity in a direct assay could be annotated with the GO term 'general RNA polymerase II transcription factor' (GO:0016251). In the absence of any evidence for the cellular location of that protein, it would be reasonable for the curator to infer that it is (at least sometimes) located in the nucleus. This would lead to the annotation, nucleus inferred by curator from GO:0016251; the annotation is attributed to the reference that contains evidence for transcription factor activity.
  
===G.3.2.2. Use of Qualifiers===
+
==G.3.3. Gene Product To Term Relations==
 +
 
 +
Gene product to term relations are used to modify the interpretation of an annotation by adding contextual information. On the gene report page, qualifiers precede the GO term in the CV column. More information about using qualifiers is available in the [http://geneontology.org/docs/go-annotations/ GO Annotation Guide].
 +
 
 +
 
 +
 
 +
'''Gene product to term relations used in FlyBase'''
 +
 
  
Qualifiers are used as flags that modify the interpretation of an annotation. Allowable values are NOT, contributes_to, and colocalizes_with. On the gene report page, qualifiers precede the GO term in the CV column. More information about using qualifiers is available in the GO Annotation Guide.
+
{| class="wikitable" style="width: 100%;"
 +
|-
 +
! gp2term relation
 +
! GO aspect
 +
! Meaning
 +
! Relations Ontology Mapping ID
 +
! Made available from release:
 +
|-
 +
| enables
 +
| molecular function
 +
| gene product directly performs this molecular function
 +
| RO:0002327
 +
| FB2020_06
 +
|-
 +
| contributes_to
 +
| molecular function
 +
| gene product is part of an indivisible molecular machine that performs this molecular function
 +
| RO:0002326
 +
| FB2020_06
 +
|-
 +
| involved_in
 +
| biological process
 +
| gene product directly participates in a particular biological program
 +
| RO:0002331
 +
| FB2020_06
 +
|-
 +
| acts_upstream_of
 +
| biological process
 +
| gene product takes part in a process that precedes a particular biological program
 +
| RO:0002263
 +
| FB2021_01
 +
|-
 +
| acts_upstream_of_positive_effect
 +
| biological process
 +
| gene product takes part in a process that precedes and up-regulates the activity of a particular biological program
 +
| RO:0004034
 +
| FB2021_01
 +
|-
 +
| acts_upstream_of_negative_effect
 +
| biological process
 +
| gene product takes part in a process that precedes and down-regulates the activity of a particular biological program
 +
| RO:0004035
 +
| FB2021_01
 +
|-
 +
| located_in
 +
| cellular component
 +
| gene product localizes to a particular cellular compartment (may be active or inactive in this component)
 +
| RO:0001025
 +
| FB2020_06
  
NOT
+
|-
 +
| part_of
 +
| cellular component
 +
| gene product is a subunit of a protein-containing complex
 +
| BFO:0000050
 +
| FB2020_06
 +
|-
 +
| is_active_in
 +
| cellular component
 +
| gene product localizes to a particular cellular compartment and is active here
 +
| RO:0002432
 +
| FB2021_01
 +
|-
 +
| colocalizes_with
 +
| cellular component
 +
| gene product localization is proximal to a cellular component
 +
| RO:0002325
 +
| FB2020_06
 +
|}
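The relation-to-aspect pairing in the table above lends itself to a simple lookup. The sketch below copies the relations, aspects and Relations Ontology IDs from the table; the dictionary and function names are hypothetical and are not part of any FlyBase or GO software.

```python
# relation -> (GO aspect, Relations Ontology mapping ID), copied from the table.
GP2TERM_RELATIONS = {
    "enables": ("molecular function", "RO:0002327"),
    "contributes_to": ("molecular function", "RO:0002326"),
    "involved_in": ("biological process", "RO:0002331"),
    "acts_upstream_of": ("biological process", "RO:0002263"),
    "acts_upstream_of_positive_effect": ("biological process", "RO:0004034"),
    "acts_upstream_of_negative_effect": ("biological process", "RO:0004035"),
    "located_in": ("cellular component", "RO:0001025"),
    "part_of": ("cellular component", "BFO:0000050"),
    "is_active_in": ("cellular component", "RO:0002432"),
    "colocalizes_with": ("cellular component", "RO:0002325"),
}


def relation_matches_aspect(relation, aspect):
    """Return True if `relation` is listed for the given GO aspect."""
    entry = GP2TERM_RELATIONS.get(relation)
    return entry is not None and entry[0] == aspect


def relations_for_aspect(aspect):
    """List the relations the table gives for one GO aspect, sorted by name."""
    return sorted(
        name for name, (rel_aspect, _) in GP2TERM_RELATIONS.items()
        if rel_aspect == aspect
    )


print(relation_matches_aspect("enables", "molecular function"))
print(relation_matches_aspect("enables", "cellular component"))
print(relations_for_aspect("molecular function"))
```

A check of this kind could be used to flag an annotation whose relation does not belong to the aspect of its GO term, for example 'enables' paired with a cellular component term.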
  
NOT may be used with terms from any of the three GO ontologies (cellular component, biological process, molecular function).
+
==G.3.4. Use of negation==
  
NOT is used to make an explicit note that the gene product is not associated with the GO term. This is particularly important in cases where associating a GO term with a gene product should be avoided (but might otherwise be made, especially by an automated method).
+
'''NOT''' is a qualifier used to indicate that a gene or its product does not do something that it would be assumed to do.
  
For example, if a protein has sequence similarity to an enzyme such as galactosyltransferase, but has been shown experimentally not to have the galactosyltransferase activity, it can be annotated as NOT galactosyltransferase activity (GO molecular function term: GO:0008378).
+
'''NOT''' is used when a GO term might otherwise be expected to apply to a gene product, but an experiment, sequence analysis, etc. proves otherwise; it is not generally used for negative or inconclusive experimental results.
  
NOT can also be used when a cited reference explicitly says so (e.g. "our favorite protein is not found in the nucleus"). Prefixing a GO term with the string NOT allows curators to state that a particular gene product is NOT associated with a particular GO term. This usage of NOT was introduced to allow curators to document conflicting claims in the literature.
+
e.g. if a protein has sequence similarity to galactosyltransferases, but has been shown experimentally not to have the galactosyltransferase activity, it can be annotated with '''NOT''' galactosyltransferase activity (GO:0008378).
  
Note that NOT is used when a GO term might otherwise be expected to apply to a gene product, but an experiment, sequence analysis, etc. proves otherwise; it is not generally used for negative or inconclusive experimental results.
+
'''NOT''' is used to make an explicit note that the gene product is not associated with the GO term. This is particularly important in cases where associating a GO term with a gene product should be avoided (but might otherwise be made, especially by an automated method).  
  
colocalizes_with
+
In FlyBase, annotations where negation is used are excluded from searches.
  
colocalizes_with is used only with cellular component terms.
+
'''NOT''' can be used to document conflicting claims in the literature (e.g. one paper says that an enzyme is a galactosyltransferase and another paper shows that the enzyme does not possess this activity) when curators cannot use other means (e.g. other supporting evidence) to resolve the conflict.
  
Gene products that are transiently or peripherally associated with an organelle or complex are annotated to the relevant cellular component term, using the colocalizes_with qualifier. This qualifier is also used in cases where the resolution of an assay is not accurate enough to say that the gene product is a bona fide component member.
+
'''NOT''' may be used with terms from any of the three GO ontologies (cellular component, biological process, molecular function).
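Because 'NOT' shares the Qualifier column with the gene product to term relation (separated by a pipe), a consumer of the file has to split that column before deciding whether an annotation is a negation. The following is a minimal sketch of that step and of excluding negated annotations from a search-style listing; the function names and the gene symbol are hypothetical, and the sample records are invented for illustration around the galactosyltransferase example above.

```python
def split_qualifier(qualifier):
    """Return (negated, relations) for a Qualifier column value.

    'NOT' may appear on either side of the pipe; everything else is
    treated as a gene product to term relation.
    """
    parts = [part for part in qualifier.split("|") if part]
    negated = "NOT" in parts
    relations = [part for part in parts if part != "NOT"]
    return negated, relations


def exclude_negated(annotations):
    """Keep only annotations whose qualifier does not contain 'NOT'."""
    return [a for a in annotations if not split_qualifier(a["qualifier"])[0]]


# Invented records: 'geneX' is a placeholder symbol, not a real gene.
sample_annotations = [
    {"symbol": "geneX", "qualifier": "NOT|enables", "go_id": "GO:0008378"},
    {"symbol": "geneX", "qualifier": "located_in", "go_id": "GO:0005575"},
]

negated, relations = split_qualifier("NOT|enables")
kept = exclude_negated(sample_annotations)
print(negated, relations)
print([a["go_id"] for a in kept])
```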
  
contributes_to
+
==G.3.5. Topic-specific GO annotation guidelines==
  
contributes_to is used only with molecular function terms.
+
Guidelines used by FlyBase curators in GO curation.
  
An individual gene product that is part of a complex is annotated to terms that describe the function of the complex. Many such function annotations include the qualifier contributes_to:
+
[[Media:FB_Pathway_Curation_Manual.pdf|<b>Signaling Pathway Curation Manual</b>]]
  
Annotating individual gene products according to attributes of a complex is especially useful for molecular function annotations in cases where a complex has an activity, but not all of the individual subunits do. (For example, there may be a known catalytic subunit and one or more additional subunits, or the activity may only be present when the complex is assembled.) Molecular function annotations of complex subunits that are not known to possess the activity of the complex include the qualifier contributes_to.
+
[[Media:PiRNA_processing_annotation_clues.pdf|<b>Guide to annotation of piRNA processing factors to the most specific GO term</b>]]
  
Note that contributes_to is not used to annotate a catalytic subunit. Furthermore, contributes_to may be used for any non-catalytic subunit, whether the subunit is essential for the activity of the complex or not.
+
[[Category:DONE]]

Latest revision as of 15:52, 28 September 2023

G.3. Classification of Gene Products using Gene Ontology (GO) terms

FlyBase uses Gene Ontology (GO) controlled vocabulary (CV) terms for cellular component, biological process and molecular function to describe properties of gene products. Although GO terms are intended to describe the properties of gene products, FlyBase currently assigns GO terms to genes rather than protein or RNA.

FlyBase is one of the founding members of the Gene Ontology (GO) Consortium and follows the general guidelines for GO annotation as described in the GO documentation.

Also, see the related video tutorial Finding related genes in FlyBase: The Gene Ontology.

G.3.1. FlyBase GO data

GO data is displayed in the Gene Ontology: Function, Process, and Cellular component section of individual Gene Reports.

In addition, the current release of GO data for all Drosophila melanogaster FlyBase genes can be found in the tab delimited text file gene_association.fb.

The following provides a brief description of the columns in the gene_association.fb file.

Gene Association file columns

1. DB The database contributing the gene_association file
FB File: always "FB" for gene_association.fb.
2. DB_Object_ID A unique identifier in the database for the item being annotated.
FB File: This is always the primary FlyBase identifier number for a Drosophila gene.
Example: FBgn0000490
3. DB_Object_Symbol
A (unique and valid) symbol to which the DB_Object_ID is matched.
FB File: This is always the valid gene symbol for a Drosophila gene.
Example: dpp
4. Qualifier
For each GO annotation, one of the following gene product to term relations is used:
'acts_upstream_of', 'acts_upstream_of_negative_effect',
'acts_upstream_of_positive_effect', 'colocalizes_with',
'contributes_to', 'enables', 'involved_in', 'is_active_in', 'located_in', 'part_of'.
This column may also contain the 'NOT' qualifier, separated by a pipe (|) from the gene product to term relation, which makes the annotation statement a negation.
5. GO ID
The unique GO identifier for the GO term attributed to the DB_Object_ID.
Example: GO:0005160
6. DB:Reference
The unique identifier for the reference to which the GO annotation is attributed.
FB File: Each FlyBase reference including published literature,
conference abstracts, personal communications, sequence records and
computer files has a unique 7 digit identifier (an FBrf). Where this
reference is a published paper with a PubMed identifier, the PubMed ID
is also listed in column 6, separated from the FBrf with a pipe (|).
Example: FB:FBrf0136863|PMID:11432817
7. Evidence
The evidence code for the GO annotation; one of IMP, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA, HDA, HMP, HGI, HEP, IBA
8. With (or) From
FB File: This column contains the identifier for annotations where the
evidence code is IGI, IPI, ISS, IEA or IC. For IGI the database gene
symbol and identifier are listed. For ISS and IPI the identifier can be a gene
symbol and identifier, or a sequence (protein or nucleic acid)
identifier. For IC, the GO identifier of the term used as the basis of
a curator inference is given.
IGI example: FLYBASE:rpr; FB:FBgn0011706
ISS example: UniProt:P35569
ISS example: EMBL:AF064523
ISS example: SGD_LOCUS:COP1; SGD:S0002304
IC example: GO:0045298
9. Aspect
Which ontology the GO term belongs to: Function (F), Process (P) or Component (C).
Example: P
10. DB_Object_Name
FB File: The full name of the FlyBase gene.
Example: decapentaplegic
Where a FlyBase gene has no full name (eg Pten), this field is left blank.
11. DB_Object_Synonym
Alternative names by which the database object is known.
FB File: Multiple synonyms of a FlyBase gene are separated by a pipe (|).
Example: M(2)LS1|shortvein|Dm-DPP|dpp|Dpp|DPP|CG9885|
TGF-beta|TGF-&bgr;|TGF-b|Hin-d|l(2)10638|shv|
DPP-C|ho|M(2)23AB|blk|l(2)22Fa|l(2)k17036|Tg|TGF&bgr;
12. DB_Object_Type
The type of object being annotated. Always a gene for FlyBase data.
FB file: always "gene" for gene_association.fb.
13. taxon
The taxonomic identifier of the species encoding the gene product
Example: taxon:7227
14. Date
The date of last annotation update, in the format 'YYYYMMDD'. At
present this date is the same for all annotations and corresponds to
the date of the latest FlyBase update; we are in the process of
changing our system so that dates more accurately reflect the date the
annotation is made.
Example: 20040821
15. Assigned_by
The source of the GO annotation.
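The column layout above can be sketched as a small parser. This is a minimal sketch, not FlyBase's own tooling: the snake_case column names are chosen here for illustration, `SAMPLE_LINE` is assembled from the per-column examples above rather than copied from a real record, and the `Assigned_by` value "FlyBase" is an assumption.

```python
# Column names follow the 15 columns described above (names are illustrative).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence", "with_from", "aspect", "db_object_name",
    "db_object_synonym", "db_object_type", "taxon", "date", "assigned_by",
]


def parse_gaf_line(line):
    """Split one tab-delimited annotation line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(GAF_COLUMNS):
        raise ValueError(
            f"expected at least {len(GAF_COLUMNS)} columns, got {len(fields)}"
        )
    # zip() stops at the 15th column, so any extra trailing columns are ignored.
    record = dict(zip(GAF_COLUMNS, fields))
    # Column 4: an optional 'NOT' is separated from the relation by a pipe.
    qualifier_parts = record["qualifier"].split("|")
    record["negated"] = "NOT" in qualifier_parts
    record["relation"] = next((q for q in qualifier_parts if q != "NOT"), "")
    # Columns 6 and 11 hold pipe-separated lists.
    record["references"] = record["db_reference"].split("|")
    record["synonyms"] = [s for s in record["db_object_synonym"].split("|") if s]
    return record


# Illustrative line built from the example values above; not a real annotation.
SAMPLE_LINE = "\t".join([
    "FB", "FBgn0000490", "dpp", "enables", "GO:0005160",
    "FB:FBrf0136863|PMID:11432817", "IDA", "", "F", "decapentaplegic",
    "shortvein|Dm-DPP", "gene", "taxon:7227", "20040821", "FlyBase",
])

record = parse_gaf_line(SAMPLE_LINE)
print(record["db_object_symbol"], record["relation"], record["references"])
```

When reading a whole file, header or comment lines would need to be skipped before calling `parse_gaf_line`; how those lines are marked should be checked against the accompanying README.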

The latest version of this data is also available for download here from the Gene Ontology consortium site. The accompanying README document includes a detailed description of the file format, FlyBase GO annotation policy and sources used for FlyBase GO annotations.

Note that the GO data available from FlyBase will not necessarily be identical to that found on the GO website. GO validate the data FlyBase submits and remove lines of data that are no longer valid e.g. when a GO term becomes obsolete.

QueryBuilder can be used to identify all the genes associated with a particular GO term. The AmiGO and QuickGO browsing tools can be used to find GO terms of interest. See video tutorial Finding related genes in FlyBase: The Gene Ontology.

G.3.2. Evidence

Evidence for a GO term consists of an evidence code that describes the type of analysis carried out together with, in some cases, a reference to another database object that supports the evidence.

G.3.2.1. GO Evidence Codes

Evidence codes: The Gene Ontology Evidence Codes page contains comprehensive descriptions of the evidence codes used in GO annotation. FlyBase uses the following evidence codes when assigning GO data:

inferred from experiment (EXP)
inferred from mutant phenotype (IMP)
inferred from genetic interaction (IGI)
inferred from direct assay (IDA)
inferred from physical interaction (IPI)
inferred from expression pattern (IEP)
inferred from high throughput experiment (HTP)
inferred from high throughput mutant phenotype (HMP)
inferred from high throughput genetic interaction (HGI)
inferred from high throughput direct assay (HDA)
inferred from high throughput expression pattern (HEP)
inferred from sequence or structural similarity (ISS)
inferred from sequence orthology (ISO)
inferred from sequence alignment (ISA)
inferred from biological aspect of ancestor (IBA)
inferred from electronic annotation (IEA)
inferred from reviewed computational analysis (RCA)
traceable author statement (TAS)
non-traceable author statement (NAS)
inferred by curator (IC)
no biological data available (ND)

G.3.2.2. Use of evidence codes

Consistent with the aims of the GO reference genome project, FlyBase prefers to assign GO terms based on experimental evidence codes (IMP, IGI, IDA, IPI, IEP). Infrequently, GO annotations will be associated with genes from high throughput (HTP) experiments. HTP experiments are only annotated when they satisfy a number of rules, to ensure that minimal false positives are propagated by GO annotation. These experimental annotations are marked by the evidence codes HTP, HDA, HMP, HGI and HEP, so that they can be distinguished from low-throughput, hypothesis-driven experiments.

Evidence codes based on computer predictions (IEA, RCA), author statements (NAS, TAS) and curator inference (IC) will continue to be used in the absence of experimental data for the same or a more specific GO term. However, we aim to remove GO data with these codes when experimental evidence for the term is curated.
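The grouping of evidence codes described above can be sketched as a lookup table plus a check for non-experimental annotations that have an experimental counterpart. This is a rough sketch under stated assumptions: the group names, the tuple layout `(gene, go_id, code)` and the placeholder identifiers are invented for illustration, and the "more specific GO term" case is not handled because it would need the ontology graph.

```python
# Evidence code groups as named in the two paragraphs above.
EVIDENCE_GROUPS = {
    "experimental": {"IMP", "IGI", "IDA", "IPI", "IEP"},
    "high_throughput": {"HTP", "HDA", "HMP", "HGI", "HEP"},
    "computational": {"IEA", "RCA"},
    "author_statement": {"NAS", "TAS"},
    "curator": {"IC"},
}

# Groups that are candidates for removal once experimental evidence is curated.
REPLACEABLE = {"computational", "author_statement", "curator"}


def evidence_group(code):
    """Return the group name for an evidence code, or 'other' if ungrouped here."""
    for name, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return name
    return "other"


def superseded(annotations):
    """Return replaceable annotations whose exact (gene, term) pair also has
    experimental evidence. Annotations are (gene, go_id, code) tuples."""
    experimental_pairs = {
        (gene, go_id)
        for gene, go_id, code in annotations
        if evidence_group(code) == "experimental"
    }
    return [
        (gene, go_id, code)
        for gene, go_id, code in annotations
        if evidence_group(code) in REPLACEABLE
        and (gene, go_id) in experimental_pairs
    ]


# Placeholder identifiers, not real FlyBase or GO records.
example = [
    ("gene_a", "term_x", "IEA"),
    ("gene_a", "term_x", "IDA"),
    ("gene_a", "term_y", "TAS"),
]
print(superseded(example))
```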

The evidence code ND (no biological data available) is used in combination with the root GO terms "molecular_function ; GO:0003674", "biological_process ; GO:0008150" or "cellular_component ; GO:0005575". In FlyBase the use of any of these three GO terms, attributed to reference FBrf0159398 and supported by the ND evidence code, signifies that a curator has examined the available literature and sequence for this gene and that, as of the date of the annotation to the term, there is no information supporting an annotation to a specific GO term.

with/from Supporting Evidence

Some evidence codes (IGI, IPI, ISS, IEA, IC) are used in conjunction 'with' supporting data in the form of a reference to another database object. These objects are identified by their database abbreviation followed by a colon and the unique identifier for the object in that database. A list of current database abbreviations can be found in the GO.xrf_abbs file. See the GO Annotation Guide for more details.

ISS and IEA 'with'

FlyBase captures GO data based on similarity to other gene products that are known to have that attribute. Since October 1st 2006, it has been mandatory for ISS annotations to include an identifier for the sequence used to make the annotation; earlier FlyBase ISS annotations that do not include identifiers will be updated gradually. In line with current guidelines for reference genomes, curators now check that the similar sequence can be annotated to the GO term with experimental evidence (IDA, IMP, IGI, IPI, IEP) before making an ISS annotation. This policy was adopted to avoid circular similarity-based annotations. Consequently, GO terms are not curated based on multiple sequence alignments if none of the sequences in the alignment have been experimentally verified. Annotations made before October 2006 have not necessarily been checked in this way.

The combined evidence appears on the gene report in the format:

inferred from sequence or structural similarity with FLYBASE:Max; FB:FBgn0017578

In this case we have given two identifiers (symbol and gene ID) for the same sequence; identifiers for the same sequence are separated by a semi-colon. If more than one sequence is used to make the annotation then the identifiers for the different sequences are separated by a comma. Note that this use of multiple identifiers is different from that for IGI and IPI.
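The separator rules for ISS 'with' data (semi-colon between identifiers for the same sequence, comma between different sequences, colon between database abbreviation and identifier) can be sketched as below. This is an illustrative parser only; the function name is invented here, and it assumes that symbols never contain a comma or semi-colon.

```python
def parse_with(with_text):
    """Return a list of sequences; each sequence is a list of
    (database_abbreviation, identifier) pairs."""
    sequences = []
    # A comma separates different sequences.
    for sequence_text in with_text.split(","):
        identifiers = []
        # A semi-colon separates identifiers for the same sequence.
        for item in sequence_text.split(";"):
            item = item.strip()
            if not item:
                continue
            # Split on the first colon only, so identifiers may contain colons.
            db, _, identifier = item.partition(":")
            identifiers.append((db, identifier))
        if identifiers:
            sequences.append(identifiers)
    return sequences


print(parse_with("FLYBASE:Max; FB:FBgn0017578"))
# → [[('FLYBASE', 'Max'), ('FB', 'FBgn0017578')]]
```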

IEA annotations in FlyBase are based on the presence of InterPro protein domains that are mapped to GO terms, provided by EMBL-EBI Gene Ontology Annotation InterPro2GO, or, for non-coding RNAs, on RNA sequence families that are mapped to GO terms, provided by Rfam.

IGI, HGI and IPI 'with'

For both IGI and IPI, the use of multiple identifiers in 'with' has a special meaning. All annotations inferred from genetic interaction (IGI) include an identifier for the interacting gene. If the GO term is inferred based on multiple genes interacting simultaneously then all interacting genes are identified using 'with' (separated by commas). However, if the GO term is inferred from multiple pairwise interactions these are treated as separate pieces of experimental evidence and appear with separate evidence codes on the gene report.

For example, Bruce is annotated with the GO term 'programmed cell death' based on two different pairwise genetic interaction experiments; the evidence appears on the gene report as:

inferred from genetic interaction with FLYBASE:grim; FB:FBgn0015946 AND inferred from genetic interaction with FLYBASE:rpr; FB:FBgn0011706

Contrast this with the following, which would imply that all three genes had to interact together to provide evidence for the annotation:

inferred from genetic interaction with FLYBASE:grim; FB:FBgn0015946, FLYBASE:rpr; FB:FBgn0011706

Similar notation is used for IPI, where the interacting gene product is identified using 'with'. Where several gene products interact simultaneously they are recorded in a single annotation (separated by commas after the evidence code). Pairwise physical interactions are recorded independently using separate evidence codes.
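The difference between pairwise and simultaneous interactions in the gene report notation can be sketched as follows. This is a hedged illustration, not FlyBase code: the function name and the returned dictionary keys are invented here, and the sketch assumes that " AND " and " with " only appear as separators in the evidence string.

```python
def split_evidence(display_text):
    """Split a gene-report evidence string into independent pieces of evidence.

    Each piece records the evidence phrase and the list of interaction
    partners named after 'with' (comma-separated partners interact together).
    """
    pieces = []
    # ' AND ' joins separate pieces of experimental evidence.
    for piece in display_text.split(" AND "):
        phrase, _, with_text = piece.partition(" with ")
        partners = [p.strip() for p in with_text.split(",") if p.strip()]
        pieces.append({"evidence": phrase.strip(), "partners": partners})
    return pieces


pairwise = (
    "inferred from genetic interaction with FLYBASE:grim; FB:FBgn0015946 "
    "AND inferred from genetic interaction with FLYBASE:rpr; FB:FBgn0011706"
)
simultaneous = (
    "inferred from genetic interaction with "
    "FLYBASE:grim; FB:FBgn0015946, FLYBASE:rpr; FB:FBgn0011706"
)

print(len(split_evidence(pairwise)))      # two independent pieces of evidence
print(len(split_evidence(simultaneous)))  # one piece with two partners
```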

IC 'from'

Inferred by curator (IC) is the evidence code that includes 'from'. Curators use this code for those cases where an annotation is not directly supported by any evidence, but can be reasonably inferred from other GO annotations for which evidence is available. The object identified in the IC evidence is always a GO term identifier.

For example, a protein shown to have transcription factor activity in a direct assay could be annotated with the GO term 'general RNA polymerase II transcription factor' (GO:0016251). In the absence of any evidence for the cellular location of that protein, it would be reasonable for the curator to infer that it is (at least sometimes) located in the nucleus. This would lead to the annotation 'nucleus inferred by curator from GO:0016251'; the annotation is attributed to the reference that contains evidence for transcription factor activity.

G.3.3. Gene Product To Term Relations

Gene product to term relations are used to modify the interpretation of an annotation by adding contextual information. On the gene report page, qualifiers precede the GO term in the CV column. More information about using qualifiers is available in the GO Annotation Guide.


Gene product to term relations used in FlyBase


gp2term relation | GO aspect | Meaning | Relations Ontology mapping ID | Available from release
enables | molecular function | gene product directly performs this molecular function | RO:0002327 | FB2020_06
contributes_to | molecular function | gene product is part of an indivisible molecular machine that performs this molecular function | RO:0002326 | FB2020_06
involved_in | biological process | gene product directly participates in a particular biological program | RO:0002331 | FB2020_06
acts_upstream_of | biological process | gene product takes part in a process that precedes a particular biological program | RO:0002263 | FB2021_01
acts_upstream_of_positive_effect | biological process | gene product takes part in a process that precedes and up-regulates the activity of a particular biological program | RO:0004034 | FB2021_01
acts_upstream_of_negative_effect | biological process | gene product takes part in a process that precedes and down-regulates the activity of a particular biological program | RO:0004035 | FB2021_01
located_in | cellular component | gene product localizes to a particular cellular compartment (may be active or inactive in this component) | RO:0001025 | FB2020_06
part_of | cellular component | gene product is a subunit of a protein-containing complex | BFO:0000050 | FB2020_06
is_active_in | cellular component | gene product localizes to a particular cellular compartment and is active here | RO:0002432 | FB2021_01
colocalizes_with | cellular component | gene product localization is proximal to a cellular component | RO:0002325 | FB2020_06
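The pairing of each relation with a single GO aspect, as tabulated above, lends itself to a simple consistency check. The following is a minimal sketch: the dictionary and function names are invented for illustration, and the one-letter aspect codes (F, P, C) are those used in the Aspect column of the gene_association.fb file.

```python
# relation -> (aspect code, Relations Ontology mapping ID), from the table above.
RELATIONS = {
    "enables": ("F", "RO:0002327"),
    "contributes_to": ("F", "RO:0002326"),
    "involved_in": ("P", "RO:0002331"),
    "acts_upstream_of": ("P", "RO:0002263"),
    "acts_upstream_of_positive_effect": ("P", "RO:0004034"),
    "acts_upstream_of_negative_effect": ("P", "RO:0004035"),
    "located_in": ("C", "RO:0001025"),
    "part_of": ("C", "BFO:0000050"),
    "is_active_in": ("C", "RO:0002432"),
    "colocalizes_with": ("C", "RO:0002325"),
}


def relation_matches_aspect(relation, aspect):
    """Return True if the relation is one used with the given aspect code."""
    if relation not in RELATIONS:
        raise KeyError(f"unknown gene product to term relation: {relation}")
    expected_aspect, _ro_id = RELATIONS[relation]
    return expected_aspect == aspect


print(relation_matches_aspect("enables", "F"))     # True
print(relation_matches_aspect("located_in", "P"))  # False
```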

G.3.4. Use of negation

NOT is a qualifier used to indicate that a gene or its product does not do something that it would be assumed to do.

NOT is used when a GO term might otherwise be expected to apply to a gene product, but an experiment, sequence analysis, etc. proves otherwise; it is not generally used for negative or inconclusive experimental results.

e.g. if a protein has sequence similarity to galactosyltransferases, but has been shown experimentally not to have the galactosyltransferase activity, it can be annotated with NOT galactosyltransferase activity (GO:0008378).

NOT is used to make an explicit note that the gene product is not associated with the GO term. This is particularly important in cases where associating a GO term with a gene product should be avoided (but might otherwise be made, especially by an automated method).

In FlyBase, annotations where negation is used are excluded from searches.

NOT can be used to document conflicting claims in the literature (e.g. one paper says an enzyme is a galactosyltransferase and another paper shows that the enzyme does not possess this activity) when curators cannot use other means (e.g. other supporting evidence) to resolve the conflict.

NOT may be used with terms from any of the three GO ontologies (cellular component, biological process, molecular function).
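Excluding negated annotations from a search, as described above, can be sketched as a filter on the Qualifier column. This is an illustration only: the function names, the dictionary layout and the gene name `gene_x` are hypothetical, and FlyBase's own search implementation is not shown in this document.

```python
def is_negated(qualifier):
    """Return True if the pipe-separated qualifier includes 'NOT'."""
    return "NOT" in qualifier.split("|")


def searchable(annotations):
    """Keep only annotations that are not negated."""
    return [a for a in annotations if not is_negated(a["qualifier"])]


# Hypothetical annotations; 'gene_x' is a placeholder, not a real gene.
annotations = [
    {"gene": "gene_x", "qualifier": "NOT|enables", "go_id": "GO:0008378"},
    {"gene": "gene_x", "qualifier": "located_in", "go_id": "GO:0005575"},
]

print([a["go_id"] for a in searchable(annotations)])
```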

G.3.5. Topic-specific GO annotation guidelines

Guidelines used by FlyBase curators in GO curation.

Signaling Pathway Curation Manual

Guide to annotation of piRNA processing factors to the most specific GO term