Difference between revisions of "FlyBase:Downloads Overview"

From FlyBase Wiki
 
(118 intermediate revisions by 9 users not shown)
Line 1: Line 1:
 
==Introduction==
 
==Introduction==
  
The "[http://{{flybaseorg}}/cgi-bin/get_static_page.pl?file=bulkdata7.html&title=Current%20Release Current Release page]" (referred to herein as the "Precomputed files" page) lists data files generated from the current release of FlyBase and links to the [ftp://ftp.flybase.net/releases/current/ current FTP repository]. Data from the previous five releases can be found on the "[http://{{flybaseorg}}/cgi-bin/get_static_page.pl?file=archivedata3.html&title=Archived%20Releases Archived data files for recent releases]" page (referred to herein as the "Archived data" page), as well as links to servers hosting older releases of FlyBase, and all the release notes and news archives from FB2006_01. If you are looking for old data and cannot find it on the "Archived data" page please follow a link to the [ftp://ftp.flybase.net/releases/ FTP repository].
+
===Browse Current Release Page===
 +
The [http://{{flybaseorg}}/downloads/bulkdata Current Release page] is a web interface allowing easy access to the main directories and the individual bulk data files available at the [ftp://ftp.flybase.net/releases/current/ current FlyBase FTP repository]. Files can be downloaded directly through the web interface.
  
===Known issues===
+
===Browse FTP Files===
 +
Users can also browse files on our FTP site, either for the [http://ftp.flybase.net/releases/current current release] or for [http://ftp.flybase.net/releases/ past releases]. It is also possible to browse the FTP files by [http://ftp.flybase.net/genomes genomes].</br>
 +
'''Note that the Safari browser does not support browsing of FTP site directories, though it does allow download of individual files.'''
  
Safari does not connect properly to the FlyBase FTP site. You should be able to download individual files using Safari but you will not be able to browse the FTP repository.
+
===Programmatic Download===
 +
The FTP client wget accepts wildcard patterns, which means you can use a query to obtain the latest file without having to specify the FlyBase release number. However, you will need to know the sub-directory in which the file resides, for example "genes" or "orthologs".</br>
  
==The FTP archive==
+
Here are some examples:</br>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/genes/fbgn_annotation_ID_*.tsv.gz</nowiki></code></br>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/orthologs/dmel_human_orthologs_disease_fb_*.tsv.gz</nowiki></code></br>
  
The "Main Data Set" section of the "[http://{{flybaseorg}}/cgi-bin/get_static_page.pl?file=bulkdata7.html&title=Current%20Release Precomputed files]" page provides links to the FlyBase FTP repository. The "Chado database" link leads to the psql directory of the current FTP repository, where you can obtain a dump of the PostgreSQL Chado database. If you have a PostgreSQL client application installed and would like to access the latest FlyBase release without installing the database, you can connect to the FlyBase public read-only Chado database as:
+
===Opening compressed files===
 +
Most of the files are compressed with the [http://www.gzip.org/ GNU gzip] program and have the suffix '.gz'. Most modern computers will unpack and open these files automatically after download. Alternatively, the gunzip command may be used on machines running Apple OS X or Unix. On a Windows machine we suggest you use the program [http://www.7-zip.org/ 7-zip] to open these files as several people have reported problems using WinZip. The resulting file should open with any standard text editor.
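For scripted workflows, a '.gz' file can also be read directly, without unpacking it first. The following is a minimal Python sketch using only the standard library. The helper name <code>read_gz_lines</code> and the small stand-in file it writes are illustrative assumptions, not part of FlyBase.

```python
import gzip
import os
import tempfile


def read_gz_lines(path):
    """Return the lines of a gzip-compressed text file, without line endings."""
    # "rt" opens the compressed stream in text mode, so no separate gunzip step is needed.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Write a tiny stand-in file (hypothetical content) so the sketch runs offline.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "example.tsv.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write("## hypothetical header\nvalue_a\tvalue_b\n")

sample_lines = read_gz_lines(sample_path)
print(sample_lines)
```

To use it on a real download, pass the path of the downloaded '.gz' file instead of the stand-in file.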
  
$ psql -h chado.flybase.org -U flybase flybase
+
===Archived Data===
 +
Data files from previous releases, as well as links to servers hosting older releases of FlyBase, can be accessed via the [http://{{flybaseorg}}/downloads/archivedata Archived Data] webpage.
  
The version running on this service is identical to the current web site release.
+
Using an FTP client, data files from previous releases can be obtained by including the FlyBase release in the path /releases/<RELEASE_NUMBER>/. For example, to retrieve the 'fbgn_annotation_ID' file for the FB2018_06 release, type:
  
The "Drosophila Data" section contains links to other sections of the FTP repository. The data can be accessed either by FlyBase version (the "[ftp://ftp.flybase.net/releases/current/ Current FTP repository]") or, for sequence data, by the annotation release of a particular Drosophila species (see the "[ftp://ftp.flybase.net/genomes/ Genomes FTP archive]").
+
<code>
 +
wget <nowiki>"ftp://ftp.flybase.net/releases/FB2018_06/precomputed_files/genes/fbgn_annotation_ID_*.tsv.gz"</nowiki>
 +
</code>
  
==General information about the available files==
+
or more specifically:
  
===File names===
+
<code>
The first part of a filename always describes the content of the file, for example the file fbgn_annotation_ID_fb_2008_10.tsv.gz maps the primary FlyBase identifiers to the annotation symbols used for genes.
+
wget <nowiki>ftp://ftp.flybase.net/releases/FB2018_06/precomputed_files/genes/fbgn_annotation_ID_fb_2018_06.tsv.gz</nowiki>
 +
</code>
  
Many of the precomputed data filenames also contain a release or version number. In the example above "fb_2008_10" denotes version FB2008_10 of FlyBase. Sequence data files, such as dmel-all-CDS-r5.13.fasta.gz, denote the annotation release that the data relates to as '-r5.13'. The 'r5' indicates the data is taken from the release 5 of the sequence assembly and the '13' after the decimal point specifies the annotation release.
+
The /releases/current/ path will always point to the latest FlyBase release, and this directory will have only one copy of the file.
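The archived-release path can also be assembled in a script. The sketch below is a hedged illustration in Python: it assumes that a precomputed filename ends in the lower-case release suffix (e.g. 'fb_2018_06' for FB2018_06), as in the 'fbgn_annotation_ID' example, which may not hold for every file. The function names are hypothetical.

```python
FTP_ROOT = "ftp://ftp.flybase.net/releases"


def release_suffix(release):
    """Convert a release name such as 'FB2018_06' into the 'fb_2018_06' filename suffix."""
    if not release.startswith("FB"):
        raise ValueError(f"unexpected release name: {release!r}")
    return "fb_" + release[2:]


def precomputed_file_url(release, subdirectory, file_prefix, extension="tsv.gz"):
    """Build the URL of a precomputed file for one archived release."""
    # Assumed naming pattern: <prefix>_<release suffix>.<extension>
    filename = f"{file_prefix}_{release_suffix(release)}.{extension}"
    return f"{FTP_ROOT}/{release}/precomputed_files/{subdirectory}/{filename}"


url = precomputed_file_url("FB2018_06", "genes", "fbgn_annotation_ID")
print(url)
```

The resulting URL can be passed to wget or any other FTP-capable client.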
  
The following notation is used for different file formats:
+
==Main Data Set==
  
:'''Tab separated files''' are indicated by the extension ''''.tsv''''
+
This section contains links to top-level directories of the [ftp://ftp.flybase.net/releases/current/ FlyBase FTP repository].
:'''[http://www.geneontology.org/GO.format.shtml#oboflat OBO format files]''', which are used for the ontologies, are denoted ''''.obo''''
 
:'''Plain text''' files are given the extension ''''.txt''''
 
:Files containing nucleic acid or polypeptide sequence data in the '''FASTA''' format are listed as ''''.fasta''''
 
:'''GFF''' files include the suffix ''''.gff''''
 
:'''GTF''' files include the suffix ''''.gtf''''
 
:'''XML''' data files are denoted ''''.xml''''
 
  
Most of the files listed on the "Precomputed files" page and the "Archived data" page are compressed with the [http://www.gzip.org/ GNU gzip] program. These files end with the suffix '.gz'. The ontology files are compressed in the ZIP file format, as indicated by the suffix '.zip'.
+
===Postgres Chado Database Dump===
  
===Accessing files===
+
The Chado database link leads to the psql directory of the current FTP repository, where you can obtain a dump of the PostgreSQL Chado database. If you have a PostgreSQL client application installed and would like to access the latest FlyBase release without installing the database, you can connect to the FlyBase public read-only Chado database as:
Files can be downloaded either directly through the web interface, or by using an ftp client such as wget to obtain the file from the FTP repository. Please note that at present Safari fails to connect to the FTP repository correctly, so we recommend that you use another browser if you wish to access the files through the web interface. The ftp client wget accepts wild card patterns which means you can use a query of the following kind to obtain the latest file without having to specify the FlyBase release number:
+
$ psql -h chado.flybase.org -U flybase flybase
  
$ wget ftp://ftp.flybase.net/releases/current/precomputed_files/genes/fbgn_annotation_ID_*.tsv.gz
+
The version running on this service is identical to the current web site release.
  
The /releases/current/ path will always point to the latest FlyBase release, and this directory will have only one copy of the file. Archived copies of the files from previous releases can be obtained by including the FlyBase release in the path /releases/<RELEASE_NUMBER>/. For example, to retrieve this file for the FB2008_05 release, type:
+
===Drosophila Data===
  
$ wget ftp://ftp.flybase.net/releases/FB2008_05/precomputed_files/genes/fbgn_annotation_ID_*.tsv.gz
+
This section contains links to:
 +
* the [ftp://ftp.flybase.net/releases/current/ current FTP repository], containing all files for the current FlyBase release
  
or more specifically:
+
* the [ftp://ftp.flybase.net/releases/current/chado-xml current Chado-XML repository], containing the chado XML files generated from the PostgreSQL database for each FlyBase data class for the current FlyBase release. These files contain all the information used to generate FlyBase report pages and reflect the organization of the data in the database. The DTDs for these XML files, listing the structure of the files, are included in this directory.
  
$ wget ftp://ftp.flybase.net/releases/FB2008_05/precomputed_files/genes/fbgn_annotation_ID_fb_2008_05.tsv.gz
+
* the [ftp://ftp.flybase.net/genomes/ Genomes FTP repository], containing genome and genome annotation data files (including FASTA, GFF and GTF files) for D. melanogaster and other Drosophila species, organized by genome/FlyBase release number. For releases FB2018_05 and earlier, data are available for each of the original 12 sequenced Drosophila species. For releases FB2018_06 to FB2020_02, data are available only for D. melanogaster, D. simulans, D. ananassae, D. pseudoobscura and D. virilis. From release FB2020_03 onward, data are available only for D. melanogaster.
  
 +
==Bulk data files==
  
===Opening compressed files===
+
The remaining sections of the [https://flybase.org/downloads/bulkdata Current Release page] are organized by  data class/type and provide direct downloads of the current bulk data files from the FTP site. Most files are from the [ftp://ftp.flybase.net/releases/current/precomputed_files/ current precomputed files] directory of the FTP site and contain useful data for the specified data type (described in detail below).  The [https://wiki.flybase.org/wiki/FlyBase:Downloads_Overview#Genomes:_Annotation_and_Sequence Genomes] files are from the [ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current current D. melanogaster FTP genomes directory] or the current files for selected other Drosophila species.
Microsoft and Apple include built-in ZIP support in later versions of their operating systems. These files can also be opened with the free program [http://www.7-zip.org/ 7-zip].
 
  
On OS X or Unix a GZIP compressed file can be extracted with the gunzip command. On a Windows machine we suggest that you use the program [http://www.7-zip.org/ 7-zip] to open these files, because several people have reported problems using WinZip. You should be able to open and read the resulting file with any text editor.  
+
The first part of a filename always describes the content of the file, and the second part may contain a FlyBase or genome annotation version number. For example, the file "fbgn_annotation_ID_fb_2018_06.tsv.gz" maps the primary FlyBase gene identifiers (FBgn) to their annotation IDs for the FB2018_06 release of FlyBase. The "dmel-all-CDS-r6.25.fasta.gz" file contains the coding sequences for all D. melanogaster genes from release 6 of the sequence assembly, annotation release 25.
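The version information in a filename can be extracted with a pair of regular expressions. This is a minimal Python sketch based only on the two filename examples given here; the function name and the returned dictionary keys are illustrative assumptions, and other files may use different patterns.

```python
import re

# '_fb_2018_06.' style suffix used by precomputed files.
FLYBASE_RELEASE = re.compile(r"_fb_(\d{4})_(\d{2})\.")
# '-r6.25.' style suffix used by sequence files (assembly release, annotation release).
ANNOTATION_RELEASE = re.compile(r"-r(\d+)\.(\d+)\.")


def describe_filename(filename):
    """Return the version information found in a filename, or an empty dict."""
    match = FLYBASE_RELEASE.search(filename)
    if match:
        year, number = match.groups()
        return {"flybase_release": f"FB{year}_{number}"}
    match = ANNOTATION_RELEASE.search(filename)
    if match:
        assembly, annotation = match.groups()
        return {"assembly_release": int(assembly), "annotation_release": int(annotation)}
    return {}


print(describe_filename("fbgn_annotation_ID_fb_2018_06.tsv.gz"))
print(describe_filename("dmel-all-CDS-r6.25.fasta.gz"))
```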
  
 +
At the top and bottom of each tab separated text file there are a few lines that describe the file. These lines start with a '#' symbol. The line immediately before the start of the data contains headings for each of the tab separated columns in the file. The file can also include some blank lines to separate information about the version of the file from the description of data in the file.
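The layout described above can be handled with a short parser. The Python sketch below assumes that the column-heading line also begins with '#', like the other descriptive lines; that is an assumption to check against the file you download. The function name and the sample lines are hypothetical.

```python
def parse_precomputed_tsv(lines):
    """Split a precomputed TSV file into (column headings, data rows).

    The last '#' line seen before the first data row is treated as the heading line.
    Blank lines and '#' lines after the data has started are ignored.
    """
    header = None
    rows = []
    last_comment = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank separator line
        if line.startswith("#"):
            if not rows:
                last_comment = line  # still in the block above the data
            continue
        if header is None and last_comment is not None:
            header = last_comment.lstrip("#").strip().split("\t")
        rows.append(line.split("\t"))
    return header, rows


# Made-up file content, for illustration only.
example = [
    "## hypothetical description of the file",
    "",
    "#gene_symbol\tprimary_FBgn",
    "Pten\tFBgn0026379",
    "Akt1\tFBgn0010379",
    "## hypothetical closing line",
]
header, rows = parse_precomputed_tsv(example)
print(header, rows)
```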
  
==Types of files available==
+
Superscripts and subscripts are represented in the precomputed data files in the ASCII text format used by FlyBase, which is described in [[FlyBase:Nomenclature#10.3|section 10.3]] of the Nomenclature document.
  
===XML files===
+
Each precomputed data file contains the complete data set for the FlyBase release. If you are looking for information on a defined subset of genes or other FlyBase data type, you can use the [http://{{flybaseorg}}/batchdownload Batch Download] tool to query the precomputed data files and thus obtain only the data you require. This approach is described in more detail [https://wiki.flybase.org/wiki/FlyBase:Batch_Download here].
  
These files are generated from the PostgreSQL Chado database for each release of FlyBase. The Chado XML files contain all the information used to generate the report pages and reflect the organization of the data in the database. The DTDs for these XML files, listing the structure of the files, are posted in the chado-xml directory of the FTP repository for each release. For the latest versions of the DTDs please see:
 
  
ftp://ftp.flybase.net/releases/current/chado-xml/
+
===Synonyms===
  
The XML files can also be obtained directly from the "Precomputed files" page by clicking on the "download" link under the ChadoXML heading in the appropriate section. The Chado XML files are available for genes, alleles, stocks, transcripts, polypeptides, insertions, transgenic (recombinant) constructs, aberrations, balancers, clones, and references.
+
Files described in this section are in the "synonyms" subdirectory of the FTP site. Download the latest file using a query of this form:</br>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/synonyms/fb_synonym_*.tsv.gz</nowiki></code></br>
  
===Ontology files===
+
====FlyBase Synonyms (fb_synonym_*.tsv)====
 +
The file reports current symbols and synonyms for the following objects in FlyBase: genes (FBgn), alleles (FBal), balancers (FBba), aberrations (FBab), transgenic constructs (FBtp), insertions (FBti), transcripts (FBtr), and proteins (FBpp).
  
The [http://{{flybaseorg}}/static_pages/docs/refman/refman-G.html#G.2. controlled vocabularies] (aka ontologies) used by FlyBase are available under the Ontology Terms section of the "[http://{{flybaseorg}}/static_pages/downloads/bulkdata7.html Precomputed files]" page. Each controlled vocabulary is described in detail in [[FlyBase:Controlled vocabularies used by FlyBase|section G.2]] of the Reference Manual. The files are in the [http://www.geneontology.org/GO.format.shtml#oboflat OBO format] used by the [http://www.obofoundry.org/ Open Biomedical Ontology] group, and are designed to be used with the free [http://www.oboedit.org/ OBO-Edit] tool.
+
The file includes:
  
Controlled vocabularies undergo continual development; terms and definitions are refined, added, merged, split and obsoleted in an effort to improve the way they represent their various subjects. On the "[http://{{flybaseorg}}/static_pages/downloads/bulkdata7.html Precomputed files]" page the frozen versions of the controlled vocabularies used for the current release of FlyBase are available, and there are also links to the current 'live' versions maintained by the [http://www.obofoundry.org/ Open Biomedical Ontology] group.
+
* nuclear genes located to the sequence
 +
* mitochondrial genes
 +
* genes not located to the sequence
 +
* genes from drosophilid species and genes from non-drosophilids that have been introduced into transgenic flies
  
Frozen versions of the controlled vocabularies used for previous releases of FlyBase are available on the "[http://{{flybaseorg}}/static_pages/downloads/archivedata3.html Archived data]" page, and in the following directories of the FTP repository:
+
File format:
  
ftp://ftp.flybase.net/releases/<RELEASE_NUMBER>/precomputed_files/ontologies/
+
{| class= "wikitable"
 
+
!Column heading
For example see:
+
!Content Description
 
+
|-
ftp://ftp.flybase.net/releases/FB2008_03/precomputed_files/ontologies/
+
|'''primary_FBid'''
 
+
|Primary FlyBase identifier for the object.
===FASTA files===
+
|-
 
+
|'''organism_abbreviation'''
The FlyBase FASTA files generally follow the [http://en.wikipedia.org/wiki/Fasta_format FASTA format] guidelines, with one exception: our header lines sometimes exceed the 80 character limit. The FASTA filenames follow these formats:
+
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin.
 +
|-
 +
|'''current_symbol'''
 +
|Current symbol used in FlyBase for the object.
 +
|-
 +
|'''current_fullname'''
 +
|Current full name used in FlyBase for the object.
 +
|-
 +
|'''fullname_synonym(s)'''
 +
|Non-current full name(s) associated with the object (pipe separated values).
 +
|-
 +
|'''symbol_synonym(s)'''
 +
|Non-current symbol(s) associated with the object (pipe separated values).
 +
|-
 +
|}
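As an illustration of the pipe-separated columns, the Python sketch below turns one data row of the synonyms file into a dictionary. It assumes that the columns appear in the order listed in the table above; the function names and the sample row (including its identifier) are made up for the example.

```python
def split_synonyms(field):
    """Split a pipe-separated synonym field; an empty field gives an empty list."""
    return [value for value in field.split("|") if value]


def parse_synonym_row(line):
    """Map one tab-separated row onto the six documented column headings."""
    columns = line.rstrip("\n").split("\t")
    # Pad short rows so that missing trailing columns read as empty strings.
    columns += [""] * (6 - len(columns))
    return {
        "primary_FBid": columns[0],
        "organism_abbreviation": columns[1],
        "current_symbol": columns[2],
        "current_fullname": columns[3],
        "fullname_synonyms": split_synonyms(columns[4]),
        "symbol_synonyms": split_synonyms(columns[5]),
    }


# Hypothetical row: none of these values are real FlyBase data.
record = parse_synonym_row("FBgn9999999\tDmel\texmpl\texample gene\told name one|old name two\tex1|ex2")
print(record)
```

Note that a plain split on '|' is only a sketch; it would mis-split any synonym that itself contains a pipe character.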
 +
 
 +
===Genes===
  
'''dmel-all-<data type>-r<release-number>.fasta.gz'''
+
Files described in this section are in the "genes" subdirectory of the FTP site. Download the latest file using a query of this form:</br>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/genes/fbgn_annotation_ID_*.tsv.gz</nowiki></code></br>
  
or
+
====Genes data (Chado XML)====
 +
The chado XML file generated from the FlyBase PostgreSQL database for the 'genes' data class.
  
'''dmel-<chromosome_arm>-<data_type>-r<release-number>.fasta.gz'''
+
====Genetic interaction table (gene_genetic_interactions_*.tsv)====
 +
The file reports the summary of gene-level genetic interactions in FlyBase. This data is computed from the allele-level genetic interaction data captured by FlyBase curators.
  
Where '''data_type''' is one of the entries in the table below. The '''all''' files contain sequences for that data type on all chromosome arms, whereas the chromosome-arm-specific files contain only the features on that particular chromosome arm. 
+
The file includes information for Dmel genes only.
  
{| class= "wikitable"
+
Interactions involving any of the following kinds of allele are considered when the gene-level genetic interaction data is computed:
!Data Type
+
 
!Content Description
+
* classical mutations
 +
* alleles carried on transgenic constructs
 +
* loss-of-function mutations
 +
* gain-of-function mutations
 +
 
 +
File format:
 +
 
 +
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 
|-
 
|-
|'''aligned '''
+
|'''Starting_gene(s)_symbol'''
|The region of genomic sequence that analysis features align to.
+
|Current FlyBase symbol of gene(s) involved in the starting genotype.
 
|-
 
|-
|'''CDS'''
+
|'''Starting_gene(s)_FBgn'''
|The contiguous protein coding sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.
+
|Current FlyBase identifier (FBgn#) of gene(s) involved in the starting genotype.
 
|-
 
|-
|'''chromosome'''
+
|'''Interacting_gene(s)_symbol'''
|The sequence of each chromosome arm.
+
|Current FlyBase symbol of gene(s) involved in the interacting genotype.
 
|-
 
|-
|'''clones'''
+
|'''Interacting_gene(s)_FBgn'''
|The sequence of full length cDNA, 3' and 5' ESTs, and partial length clones.
+
|Current FlyBase identifier (FBgn#) of gene(s) involved in the interacting genotype.
 
|-
 
|-
|'''exon '''
+
|'''Interaction_type'''
|The sequence of each exon split up into individual FASTA records.
+
|Type of interaction observed, either 'suppressible' or 'enhanceable'.
 
|-
 
|-
|'''five_prime_UTR'''
+
|'''Publication_FBrf'''
|The sequence of 5' untranslated regions.
+
|Current FlyBase identifier (FBrf#) of publication from which the data came.
 
|-
 
|-
|'''gene'''
+
|}
|The sequence of the gene span.
+
 
|-
+
 
|'''gene_extended2000'''
+
Notes:
|The sequence of the gene span with 2000 base pairs added upstream and downstream.
+
 
|-
+
* Each row contains information from a single reference.  Thus if the same genetic interaction has been reported in multiple references, multiple rows will exist for that genetic interaction in the file.
|'''intergenic'''
+
 
|The sequence of chromosomal regions between genes that do not contain known gene models.
+
* 'suppressible' in column 5 indicates that phenotypes caused by mutation of the gene(s) listed in the starting genotype (column 1) are suppressed by mutation of the gene(s) listed in the interacting genotype (column 3).
|-
+
 
|'''intron'''
+
* 'enhanceable' in column 5 indicates that phenotypes caused by mutation of the gene(s) listed in the starting genotype (column 1) are enhanced by mutation of the gene(s) listed in the interacting genotype (column 3).
|The sequence of each intron split up into individual FASTA records.
+
 
 +
''e.g.''
 +
 
 +
Pten&emsp;FBgn0026379&emsp;Akt1&emsp;FBgn0010379&emsp;suppressible&emsp;FBrf0127089
 +
 
 +
indicates that phenotype(s) caused by a mutation of Pten are suppressed by a mutation of Akt1.
 +
 
 +
* Where multiple genes are simultaneously mutated in the starting genotype, the interacting genotype, or both, the genes involved are separated by a '|' in the relevant columns.  The symbols and IDs are listed in the same order in columns 1 and 2 (and likewise in columns 3 and 4), so that the FBgn corresponding to the symbol for each gene can easily be identified.
 +
 
 +
''e.g.''
 +
 
 +
robo1|sli&emsp;FBgn0005631|FBgn0264089&emsp;RhoGAP93B&emsp;FBgn0038853&emsp;enhanceable&emsp;FBrf0191476
 +
 
 +
indicates that:
 +
* phenotype(s) caused by a robo1, sli double mutant combination are enhanced by a mutation of RhoGAP93B.
 +
* FBgn0005631 corresponds to robo1, FBgn0264089 corresponds to sli
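The pairing of symbols and identifiers described in these notes can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes the six tab-separated columns appear in the order given in the table above; the sample row is the robo1/sli example from this section.

```python
def parse_interaction_row(line):
    """Pair each gene symbol with its FBgn for one row of the genetic interaction table."""
    fields = line.rstrip("\n").split("\t")
    start_symbols, start_ids, inter_symbols, inter_ids, interaction_type, fbrf = fields[:6]
    return {
        # Symbols and IDs are listed in the same order, so zip() pairs them up.
        "starting": list(zip(start_symbols.split("|"), start_ids.split("|"))),
        "interacting": list(zip(inter_symbols.split("|"), inter_ids.split("|"))),
        "interaction_type": interaction_type,
        "publication": fbrf,
    }


row = "robo1|sli\tFBgn0005631|FBgn0264089\tRhoGAP93B\tFBgn0038853\tenhanceable\tFBrf0191476"
parsed = parse_interaction_row(row)
print(parsed["starting"])
```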
 +
 
 +
 
 +
====RNA-Seq RPKM values (gene_rpkm_report_fb_*.tsv.gz)====
 +
This file reports gene expression values based on RNA-Seq experiments, calculated as reads per kilobase per million reads (RPKM). RPKM values are calculated only for the unique exonic regions of the gene (excluding segments that overlap other genes), except for genes derived from dicistronic/polycistronic transcripts, in which case all exon regions are used in the RPKM expression calculation.
 +
 
 +
File format:
 +
 
 +
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 
|-
 
|-
|'''miRNA'''
+
|'''Release_ID'''
|The sequence of transcripts that are typed as micro RNAs.
+
|The D. melanogaster annotation set version from which the gene model used in the analysis derives.
 
|-
 
|-
|'''miscRNA'''
+
|'''FBgn#'''
|The sequence of transcripts that are typed as small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), or ribosomal RNA (rRNA). May also contain other transcript types that do not exist in their own individual files.
+
|The unique FlyBase gene ID for this gene.
 
|-
 
|-
|'''ncRNA'''
+
|'''GeneSymbol'''
|The sequence of transcripts that are typed as non coding RNAs (ncRNA).
+
|The official FlyBase symbol for this gene.
 
|-
 
|-
|'''predicted'''
+
|'''Parent_library_FBlc#'''
|The sequence of various features that are derived from a variety of prediction algorithms. These can encompass analyses conducted by FlyBase or by 3rd party groups.
+
|The unique FlyBase ID for the dataset project to which the RNA-Seq experiment belongs.
 
|-
 
|-
|'''pseudogene'''
+
|'''Parent_library_name'''
|The sequence of transcripts that are typed as pseudogenes.
+
|The official FlyBase symbol for the dataset project to which the RNA-Seq experiment belongs.
 
|-
 
|-
|'''sequence_features'''
+
|'''RNASource_FBlc#'''
|The sequence of sequence features, which currently describe data about RNAi reagents. In the future, it will also contain natural genomic features (aside from transcribed regions), such as replication origins, transcription factor binding sites and boundary elements, and other experimental reagents that map to the genome, such as microarray oligonucleotides and rescue fragments.
+
|The unique FlyBase ID for the RNA-Seq experiment used for RPKM expression calculation.
 
|-
 
|-
|'''synteny'''
+
|'''RNASource_name'''
|The sequence of syntenic regions between two species.
+
|The official FlyBase symbol for the RNA-Seq experiment used for RPKM expression calculation.
 
|-
 
|-
|'''three_prime_UTR'''
+
|'''RPKM_value'''
|The sequence of 3' untranslated regions.
+
|The RPKM expression value for the gene in the specified RNA-Seq experiment.
 
|-
 
|-
|'''transcript'''
+
|'''Bin_value'''
|The sequence of transcripts that are typed as messenger RNAs (mRNA).
+
|The expression bin classification of this gene in this RNA-Seq experiment, based on RPKM value. Bins range from 1 (no/extremely low expression) to 8 (extremely high expression).
 +
|-
 +
|'''Unique_exon_base_count'''
 +
|The number of exonic bases unique to the gene (not overlapping exons of other genes). Field will be blank for genes derived from dicistronic/polycistronic transcripts.
 
|-
 
|-
|'''translation'''
+
|'''Total_exon_base_count'''
|The resulting protein sequence from protein coding transcripts.
+
|The number of bases in all exons of this gene.
 
|-
 
|-
|'''transposon'''
+
|'''Count_used'''
|The sequence of transposable elements.
+
|Indicates if the RPKM expression value was calculated using only the exonic regions unique to the gene and not overlapping exons of other genes (Unique), or, if the RPKM expression value was calculated based on all exons of the gene regardless of overlap with other genes (Total). RPKM expression values are typically reported for the "Unique" count, except for genes on dicistronic/polycistronic transcripts, in which case the "Total" count is reported.
 
|-
 
|-
|'''tRNA'''
 
|The sequence of transcripts that are typed as transfer RNAs (tRNA).
 
 
|}
 
|}
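For readers unfamiliar with the unit, the Python sketch below shows the standard RPKM arithmetic (reads per kilobase of exon per million mapped reads) and how the Count_used column selects between the two base counts. It is not a description of the FlyBase pipeline, whose read-counting details are not given here; the function names and the numbers in the example are hypothetical.

```python
def rpkm(read_count, exon_base_count, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    if exon_base_count <= 0 or total_mapped_reads <= 0:
        raise ValueError("base count and total read count must be positive")
    kilobases = exon_base_count / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_of_reads)


def exon_bases_for(count_used, unique_exon_base_count, total_exon_base_count):
    """Pick the base count named by the Count_used column ('Unique' or 'Total')."""
    if count_used == "Unique":
        return unique_exon_base_count
    if count_used == "Total":
        return total_exon_base_count
    raise ValueError(f"unexpected Count_used value: {count_used!r}")


# Made-up numbers: 500 reads over 2000 unique exonic bases, 10 million mapped reads.
bases = exon_bases_for("Unique", 2000, 3500)
value = rpkm(500, bases, 10_000_000)
print(value)
```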
  
 +
====RNA-Seq RPKM values matrix (gene_rpkm_matrix_fb_*.tsv.gz)====
 +
A simpler, spreadsheet-friendly version of the "gene_rpkm_report_fb_*.tsv.gz" file. This file provides a gene by expression value matrix based on RNA-Seq experiments. Expression values are given as reads per kilobase per million reads (RPKM). RPKM values are calculated only for the unique exonic regions of the gene (excluding segments that overlap other genes), except for genes derived from dicistronic/polycistronic transcripts, in which case all exon regions are used in the RPKM expression calculation. This RPKM matrix lacks the details of how RPKM was calculated for each gene.
  
====FASTA header format====
+
Note - In addition to FlyBase calculated RPKM RNA-Seq expression values, FlyAtlas2 data have been incorporated into this file. These data are in FPKM units, calculated by the FlyAtlas group [https://preview.flybase.org/reports/FBrf0258027.html Gillen, 2023].
  
The typical format of our FASTA header begins with an ID followed by any number of fields that follow this format
 
  
'''field_name=value;'''
+
File format:
  
Multiple field values are separated by commas
+
{| class= "wikitable"
 
+
!Column heading
'''field_name=value1,value2;'''
+
!Content Description
 
+
|-
This table describes some of the field names found in our FASTA headers
+
|'''gene_primary_id'''
 +
|The unique FlyBase gene ID for this gene.
 +
|-
 +
|'''gene_symbol'''
 +
|The official FlyBase symbol for this gene.
 +
|-
 +
|'''gene_fullname'''
 +
| The official full name for this gene.
 +
|-
 +
|'''gene_type'''
 +
| The type of gene: e.g., protein_coding_gene, non_protein_coding_gene.
 +
|-
 +
|'''DATASAMPLE_NAME_(DATASET_ID)'''
 +
| Each subsequent column reports the RNA-Seq gene expression value for the sample listed in the header. The dataset "FBlc" ID is listed in parentheses, and can be pasted into FlyBase search to access more information on the sample from the "dataset" report. Expression in most cases was calculated by FlyBase in RPKM units, with the exception of FlyAtlas2 data, which was calculated by the FlyAtlas group and is expressed in FPKM units.
 +
|-
 +
|}
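The dataset ID can be pulled out of a sample column heading with a regular expression. The Python sketch below takes the "DATASAMPLE_NAME_(DATASET_ID)" pattern literally (sample name, underscore, FBlc ID in parentheses), which is an assumption to verify against the real file; the function name and the sample heading are hypothetical.

```python
import re

# Sample name, an underscore, then the FBlc dataset ID in parentheses.
SAMPLE_HEADER = re.compile(r"^(?P<name>.*)_\((?P<dataset_id>FBlc\d+)\)$")


def split_sample_header(column_heading):
    """Return (sample name, FBlc ID) for a sample column, or None for other columns."""
    match = SAMPLE_HEADER.match(column_heading)
    if match is None:
        return None
    return match.group("name"), match.group("dataset_id")


# Hypothetical heading, not a real FlyBase sample.
print(split_sample_header("example_sample_(FBlc0000000)"))
print(split_sample_header("gene_symbol"))
```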
  
{|class = "wikitable"
+
====Single Cell RNA-Seq Gene Expression (scRNA-Seq_gene_expression_fb_*.tsv.gz)====
 +
This file reports summarized gene expression levels from cell clusters observed in single cell RNA-Seq experiments; these data are processed from data at the EBI Single Cell Expression Atlas. The "Mean_Expression" is the average level of expression of the gene across all cells of the cluster in which the gene is detected at all; the "Spread" is the proportion of cells in the cluster in which the gene is detected. Please see the dataset reports for more experimental details and for links to other data repositories for raw and alternatively processed data.
 +
 
 +
File format:
  
!Field Name
+
{| class= "wikitable"
!Description
+
!Column heading
 +
!Content Description
 +
|-
 +
|'''Pub_ID'''
 +
|The FlyBase FBrf ID for the reference in which the expression was reported.
 
|-
 
|-
|'''type'''
+
|'''Pub_miniref'''
|The feature type of the FASTA sequence record.
+
|The FlyBase citation for the publication in which the expression was reported.
 
|-
 
|-
|'''loc'''
+
|'''Clustering_Analysis_ID'''
|The genomic location given in the NCBI's feature location format. Please see the [ftp://ftp.ncbi.nih.gov/genbank/docs/ NCBI's] site for more information.
+
|The FlyBase FBlc ID for the dataset representing the clustering analysis.
 
|-
 
|-
|'''ID'''
+
|'''Clustering_Analysis_Name'''
|A unique ID. IDs in the form of FBxx[0-9]+ are a unique FlyBase object identifier.
+
|The FlyBase name for the dataset representing the clustering analysis.
 
|-
 
|-
|'''name'''
+
|'''Source_Tissue_Sex'''
|The name or symbol of the feature.
+
|The sex of the source tissue used for the experiment: male, female or mixed.
 
|-
 
|-
|'''dbxref'''
+
|'''Source_Tissue_Stage'''
|Database cross references relating to the FASTA record. The dbxref values use a 'dbname:dbid' format.
+
|The life stage of the source tissue used for the experiment, using only high-level terms: embryonic stage, larval stage, pupal stage, adult stage or mixed.
 
|-
 
|-
|'''MD5'''
+
|'''Source_Tissue_Anatomy'''
|An [http://en.wikipedia.org/wiki/MD5 MD5] checksum calculated from the sequence that can be used to identify identical sequences.
+
|The anatomical region of the source tissue used for the experiment; only "mixed" is shown if many
 
|-
 
|-
|'''length'''
+
|'''Cluster_ID'''
|The length of the sequence found in the FASTA record.
+
|The FlyBase FBlc ID for the dataset representing the cell cluster.
 +
|-
 +
|'''Cluster_Name'''
 +
|The FlyBase name for the dataset representing the cell cluster.
 +
|-
 +
|'''Cluster_Cell_Type_ID'''
 +
|The FlyBase FBbt ID for the cell type represented by the cell cluster.
 +
|-
 +
|'''Cluster_Cell_Type_Name'''
 +
|The FlyBase name for the cell type represented by the cell cluster.
 +
|-
 +
|'''Gene_ID'''
 +
|The FlyBase FBgn ID for the expressed gene.
 +
|-
 +
|'''Gene_Symbol'''
 +
|The FlyBase symbol for the expressed gene (ASCII-format).
 +
|-
 +
|'''Mean_Expression'''
 +
|The average level of expression of the gene across all cells of the cluster in which the gene is detected at all.
 
|-
 
|-
|'''release'''
+
|'''Spread'''
|The release number denotes the annotation release which this FASTA record corresponds to.
+
|The proportion of cells in the cluster in which the gene is detected.
 
|-
 
|-
|'''species'''
 
|The species abbreviation that this FASTA record corresponds to.
 
 
|}
 
|}
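The definitions of "Mean_Expression" and "Spread" can be illustrated with a small Python sketch. It assumes that a gene counts as "detected" in a cell when its expression value is greater than zero, which is an interpretation rather than a documented rule; the function name and the per-cell values are made up.

```python
def summarize_cluster(expression_per_cell):
    """Return (mean expression over detecting cells, proportion of detecting cells)."""
    total_cells = len(expression_per_cell)
    if total_cells == 0:
        raise ValueError("a cluster must contain at least one cell")
    # Assumption: a value above zero means the gene was detected in that cell.
    detected = [value for value in expression_per_cell if value > 0]
    spread = len(detected) / total_cells
    mean_expression = sum(detected) / len(detected) if detected else 0.0
    return mean_expression, spread


# Four hypothetical cells; the gene is detected in two of them.
mean_expression, spread = summarize_cluster([0.0, 2.0, 4.0, 0.0])
print(mean_expression, spread)
```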
  
 +
====Fly Cell Atlas gene expression in high-level cell types (FlyCellAtlas_slimmed_gene_expression_fb_*.tsv.gz)====
 +
This file provides the data used to generate the “Fly Cell Atlas Cell Type Expression Data” bar chart displayed on our Gene Report pages. For each gene that was found expressed in the Fly Cell Atlas dataset, it provides the mean expression level and the proportion of positive cells in the same 22 high level cell types displayed in the aforementioned bar chart. These data are calculated from FlyCellAtlas scRNA-Seq data for higher resolution cell clusters (having more detailed cell type classifications). For more detailed FlyCellAtlas data, and other scRNA-Seq data, please see the "Single Cell RNA-Seq Gene Expression" file.
  
===GFF files===
+
NOTE: Not yet available; coming in the FB2023_06 release.
  
The FlyBase GFF files follow the [http://www.sequenceontology.org/gff3.shtml GFF v3] specification. The GFF files contain feature line definitions for gene models, predicted features, alignments, and many other features. The GFF files are produced for each species and can be downloaded from our FTP site using this URL form:
+
File format:
  
ftp://ftp.flybase.org/genomes/<species abbreviation>/current/gff/
+
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 +
|-
 +
|'''gene_id'''
 +
|The unique FlyBase gene ID for this gene.
 +
|-
 +
|'''gene_Symbol'''
 +
|The official FlyBase symbol for this gene.
 +
|-
 +
|'''<cell_type>'''
 +
|Two colon-separated values: the mean expression level of the gene in <cell_type>, and the proportion of <cell_type> expressing the gene (percent).
 +
|-
  
e.g. ftp://ftp.flybase.org/genomes/dmel/current/gff/
+
|}
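As a minimal sketch of how a per-cell-type field in the FlyCellAtlas slimmed file might be unpacked, the Python snippet below splits the two colon-separated values into numbers. The function name and the sample value "12.5:40.0" are illustrative assumptions, not taken from a real file, and the handling of empty fields is not covered because their representation is not described here.

```python
def parse_cell_type_value(field):
    """Split a '<mean expression>:<percent of cells>' field into two floats.

    Assumes the field always holds exactly two colon-separated numbers.
    """
    mean_text, percent_text = field.split(":")
    return float(mean_text), float(percent_text)


# Hypothetical field value, for illustration only.
mean_expression, percent_expressing = parse_cell_type_value("12.5:40.0")
```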
  
For D. melanogaster, three GFF files are distributed:
+
====High-Throughput Gene Expression (high-throughput_gene_expression_fb_*.tsv.gz)====
 +
This file reports most high-throughput gene expression data that is featured in the High-Throughput Expression Data section of the FlyBase gene report. Data is sorted first by the expression section in which the dataset is displayed, then by sample ID, then by gene ID. Additional information about the dataset or the sample can be obtained by searching FlyBase with the appropriate FBlc dataset/sample ID (columns 2 and 4). Note that scRNA-Seq data is not included in this file, as it is structured differently; scRNA-Seq data is available in other download files. This file includes the testis specificity index score, as calculated by [http://flybase.org/reports/FBrf0240104.htm Vedelek et al. (2018)].
  
:'''dmel-all-r<release-number>.gff.gz'''
+
File format:
::Contains all chromosome arms
 
:'''dmel-all-no-analysis-r<release-number>.gff.gz'''
 
::Same as above except all match and match_part features have been removed
 
:'''dmel-<chromosome_arm>-r<release-number>.gff.gz'''
 
::Contains only a single chromosome arm as identified by the filename
 
  
For the other species, the all-chromosome-arm file is provided, together with a gzipped tar file containing the individual scaffolds. Please note that the tarball contains thousands of files in a single directory level, so extracting them may result in filesystem performance issues.
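One way to avoid unpacking thousands of scaffold files is to read the tarball as a stream. The Python sketch below uses the standard-library tarfile module to visit each member in turn and count its non-comment lines without writing anything to disk. The function name and the choice of counting lines are illustrative assumptions; any per-file processing could take its place.

```python
import tarfile


def count_data_lines(tarball_path):
    """Count non-blank, non-comment lines in each file of a .tar.gz archive.

    Members are read one at a time from the archive, so nothing is
    extracted to the filesystem.
    """
    counts = {}
    with tarfile.open(tarball_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = archive.extractfile(member)
            counts[member.name] = sum(
                1
                for line in handle
                if line.strip() and not line.startswith(b"#")
            )
    return counts
```

Calling count_data_lines on a downloaded scaffold tarball returns a dictionary keyed by member name; the path to pass in depends on the species and release being used.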
+
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 +
|-
 +
|'''<High_Throughput_Expression_Section>'''
 +
| The name of the Gene report High-Throughput Expression Data section in which the data is reported.
 +
|-
 +
|'''<Dataset_ID>'''
 +
| The FBlc ID of the dataset.
 +
|-
 +
|'''<Dataset_Name>'''
 +
| The name of the dataset.
 +
|-
 +
|'''<Sample_ID>'''
 +
| The FBlc ID of the sample.
 +
|-
 +
|'''<Sample_Name>'''
 +
| The name of the sample.
 +
|-
 +
|'''<Gene_ID>'''
 +
| The FBgn ID of the gene.
 +
|-
 +
|'''<Gene_Symbol>'''
 +
| The gene symbol.
 +
|-
 +
|'''<Expression_Unit>'''
 +
| The unit of expression: e.g., RPKM, RPMM, TPM, LFQ_geom_mean_intensity, testis_specificity_index_score
 +
|-
 +
|'''<Expression_Value>'''
 +
|The gene expression value.
 +
|-
 +
|}
  
===GTF files===
+
====Physical interaction MITAB file (physical_interactions_mitab_fb_*.tsv.gz)====
 +
This file reports each individual experiment curated by FlyBase that supports a physical interaction between two gene products. There can be multiple experiments (multiple rows in the file) between products of the same gene pair. Interaction molecule types currently curated are protein-protein, protein-RNA or RNA-RNA.
  
The FlyBase GTF files follow the [http://mblab.wustl.edu/GTF22.html GTF v2.2] specification. The GTF files contain feature line definitions for gene models. The GTF files are produced for each species and can be downloaded from our FTP site using this URL form:
+
This file is in PSI-MI TAB format, a tab-delimited format developed by the HUPO Proteomics Standards Initiative (PSI) Molecular Interactions (MI) working group to facilitate interactomics data comparison and exchange. Details on the general MITAB format can be found [https://psicquic.github.io/MITAB27Format.html here]. The file makes use of the Molecular Interactions ontology which can be searched or browsed [https://www.ebi.ac.uk/ols/ontologies/mi here].  Fields are filled with “-” if values are missing  or not relevant.
  
ftp://ftp.flybase.org/genomes/<species abbreviation>/current/gtf/
 
  
e.g. ftp://ftp.flybase.org/genomes/dmel/current/gtf/
+
File format:
  
===Precomputed data text files===
+
{| class= "wikitable"
 
+
!Column number
Precomputed data files that contain useful sets of data are generated for every release of FlyBase. For example, the file fbgn_NAseq_Uniprot_fb_2008_10.tsv.gz contains the mapping between valid  [[FlyBase:RefMan_F.#FlyBase_Identifier_Numbers|FlyBase identifiers]] and the corresponding nucleic acid and protein accession numbers used by DDBJ/EMBL/GenBank and UniprotKB/Swiss-Prot/TrEMBL. These files can be found under the "Other" heading of each section of the "[http://{{flybaseorg}}/static_pages/downloads/bulkdata7.html Precomputed files]" and "[http://{{flybaseorg}}/static_pages/downloads/archivedata3.html Archived data]" pages, and are also available under the precomputed_files directory of each FlyBase release in the [ftp://ftp.flybase.net/releases/ FTP repository].
+
!Column heading
 
+
!General format
Superscripts and subscripts are represented in the precomputed data files in the ASCII text format used by FlyBase, which is described in [[FlyBase:Nomenclature#10.3|section 10.3]] of the Nomenclature document.
+
!FlyBase example
 
+
!Content description
At the top and bottom of each tab separated text file there are a few lines that describe the file. These lines start with a '#' symbol. The line immediately before the start of the data contains headings for each of the tab separated columns in the file. The file can also include some blank lines to separate information about the version of the file from the description of data in the file.
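The layout described above can be read with a few lines of Python. This is a minimal sketch that assumes the column-heading line itself starts with '#' (as the other descriptive lines do) and that it is the last '#' line before the first data row. The function name is hypothetical.

```python
import gzip


def read_precomputed_tsv(path):
    """Return (header, rows) from a precomputed tab-separated file.

    Lines starting with '#' are treated as descriptive lines; the last
    such line before the first data row is taken as the column headings.
    Blank lines are skipped. Works for plain and gzipped files.
    """
    opener = gzip.open if path.endswith(".gz") else open
    header = None
    rows = []
    last_comment = None
    with opener(path, mode="rt", encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.rstrip("\r\n")
            if not line.strip():
                continue
            if line.startswith("#"):
                if not rows:
                    last_comment = line  # candidate heading line
                continue
            if header is None and last_comment is not None:
                header = last_comment.lstrip("#").strip().split("\t")
            rows.append(line.split("\t"))
    return header, rows
```

Descriptive lines at the bottom of the file are ignored because they appear after data rows have already been collected.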
 
 
 
Each precomputed data file available for download on the "[http://{{flybaseorg}}/static_pages/downloads/bulkdata7.html Precomputed files]" page contains the complete data set for the FlyBase release. Please note, if you are only looking for information on a defined subset of genes, or other FlyBase data type, you can query the current set of precomputed data files through the [http://{{flybaseorg}}/batchdownload Batch Download] tool to obtain the data you require. This approach is described in more detail in [https://wiki.flybase.org/wiki/FlyBase:Batch_Download this] help document.
 
 
 
==Contents of the precomputed data text files listed by section==
 
 
 
===Main data set===
 
 
 
====Postgres Chado database dump====
 
=====Chado database (ftp://ftp.flybase.net/releases/current/psql)=====
 
The entire SQL Chado database is available for download. Follow the "README" directions herein.
 
 
 
====Drosophila data====
 
=====Current FTP repository (ftp://ftp.flybase.net/releases/current/)=====
 
All files for this current FlyBase release are available on this FTP site.
 
 
 
=====Current Chado-XML repository (ftp://ftp.flybase.net/releases/current/chado-xml)=====
 
All Chado XML files for this current FlyBase release are available on this FTP site.
 
 
 
=====Genomes FTP archive (ftp://ftp.flybase.net/genomes/)=====
 
All FlyBase genome and genome annotation files are available for various Drosophila species. Formats include Chado XML, DNA, FASTA, GFF and GTF. Files from both the current release and previous FlyBase releases are offered. For release FB2018_05 and earlier, data is available for each of the original 12 sequenced Drosophila species. From release FB2018_06 onward, data is available only for D. melanogaster, D. simulans, D. ananassae, D. pseudoobscura and D. virilis.
 
 
 
===Synonyms===
 
 
 
====FlyBase Synonyms (fb_synonym_*.tsv)====
 
The file reports current symbols and synonyms for the following objects in FlyBase: genes (FBgn), alleles (FBal), balancers (FBba), aberrations (FBab), transgenic constructs (FBtp), insertions (FBti), transcripts (FBtr), and proteins (FBpp).
 
 
 
The file includes:
 
 
 
* nuclear genes located to the sequence
 
* mitochondrial genes
 
* genes not located to the sequence
 
* genes from drosophilid species and genes from non-drosophilids that have been introduced into transgenic flies
 
 
 
File format:
 
 
 
{| class= "wikitable"
 
!Column heading
 
!Content Description
 
 
|-
 
|-
|'''primary_FBid'''
+
|'''1'''
|Primary FlyBase identifier for the object.
+
|'''ID(s) Interactor A'''
 +
|database:identifier
 +
|flybase:FBgn0002121
 +
|The unique FlyBase identifier for the first gene of the interacting pair.
 
|-
 
|-
|'''organism_abbreviation'''
+
|'''2'''
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin.
+
|'''ID(s) Interactor B'''
 +
|”
 +
|”
 +
|The unique FlyBase identifier for the second gene of the interacting pair.
 
|-
 
|-
|'''current_symbol'''
+
|'''3'''
|Current symbol used in FlyBase for the object.
+
|'''Alt ID(s) Interactor A'''
 +
|database:identifier
 +
|<nowiki>flybase:CG2671|entrez gene/locuslink:33156</nowiki>
 +
|<nowiki>The alternative gene identifiers currently provided are FlyBase annotation IDs (CG#) and NCBI’s Entrez Gene ID separated by “|”.</nowiki>
 
|-
 
|-
|'''current_fullname'''
+
|'''4'''
|Current full name used in FlyBase for the object.
+
|'''Alt ID(s) Interactor B'''
 +
|”
 +
|”
 +
|”
 
|-
 
|-
|'''fullname_synonym(s)'''
+
|'''5'''
|Non-current full name(s) associated with the object (comma separated values).
+
|'''Alias(es) Interactor A'''
 +
|database:name(alias type)
 +
|flybase:l(2)gl(gene name)
 +
|The official FlyBase gene symbol. It is referred to as “gene name” to adhere to the psi-mi ontology.
 
|-
 
|-
|'''symbol_synonym(s)'''
+
|'''6'''
|Non-current symbol(s) associated with the object (comma separated values).
+
|'''Alias(es) Interactor B'''
 +
|
 +
|”
 +
|”
 +
|-
 +
|'''7'''
 +
|'''Interaction Detection Method(s)'''
 +
|ontology:identifier(method name)
 +
|psi-mi:"MI:0006"(anti bait coimmunoprecipitation)
 +
|The assay used to detect the interaction, taken from the psi-mi ontology.
 
|-
 
|-
|}
+
|'''8'''
 
+
|'''Publication 1st Author(s)'''
===Genes===
+
|surname initial(s) (publication year)
 
+
|Betschinger K. (2003)
====Genes data (Chado XML)====
+
|The first author and year of the publication where the interaction is described.
 
 
====Genetic interaction table (gene_genetic_interactions_*.tsv)====
 
The file reports the summary of gene-level genetic interactions in FlyBase. This data is computed from the allele-level genetic interaction data captured by FlyBase curators.
 
 
 
The file includes information for Dmel genes only.
 
 
 
Interactions involving any of the following kinds of allele are considered when the gene-level genetic interaction data is computed:
 
 
 
* classical mutations
 
* alleles carried on transgenic constructs
 
* loss-of-function mutations
 
* gain-of-function mutations
 
 
 
File format:
 
 
 
{| class= "wikitable"
 
!Column heading
 
!Content Description
 
 
|-
 
|-
|'''Starting_gene(s)_symbol'''
+
|'''9'''
|Current FlyBase symbol of gene(s) involved in the starting genotype.
+
|'''Publication ID(s)'''
 +
|database:identifier
 +
|<nowiki>flybase:FBrf0157155|pubmed:12629552</nowiki>
 +
|<nowiki>The unique FlyBase identifier for the publication followed by the unique PubMed identifier (if there is one) separated by “|”.</nowiki>
 
|-
 
|-
|'''Starting_gene(s)_FBgn'''
+
|'''10'''
|Current FlyBase identifier (FBgn#) of gene(s) involved in the starting genotype.
+
|'''Taxid Interactor A'''
 +
|taxid:identifier
 +
|taxid:7227("Drosophila melanogaster")
 +
|The NCBI taxonomy identifier for the source organism of the interactor. The vast majority of interactors in FlyBase come from D. melanogaster. There are, however, a few interspecies interactions consisting of a D. melanogaster interactor and an interactor of a different species.
 
|-
 
|-
|'''Interacting_gene(s)_symbol'''
+
|'''11'''
|Current FlyBase symbol of gene(s) involved in the interacting genotype.
+
|'''Taxid Interactor B'''
 +
|”
 +
|”
 +
|”
 
|-
 
|-
|'''Interacting_gene(s)_FBgn'''
+
|'''12'''
|Current FlyBase identifier (FBgn#) of gene(s) involved in the interacting genotype.
+
|'''Interaction Type(s)'''
 +
|ontology:identifier(interaction type)
 +
|psi-mi:"MI:0915"(physical association)
 +
|Taken from the psi-mi ontology. Most often “physical association” for FlyBase.
 
|-
 
|-
|'''Interaction_type'''
+
|'''13'''
|Type of interaction observed, either 'suppressible' or 'enhanceable'.
+
|'''Source Database(s)'''
 +
|ontology:identifier(database name)
 +
|psi-mi:"MI:0478"(flybase)
 +
|All interactions are curated by FlyBase.
 
|-
 
|-
|'''Publication_FBrf'''
+
|'''14'''
|Current FlyBase identifier (FBrf#) of publication from which the data came.
+
|'''Interaction Identifier(s)'''
 +
|database:identifier
 +
|flybase:FBrf0157155-13.coIP.WB
 +
|The unique FlyBase identifier for this interaction.
 
|-
 
|-
|}
+
|'''15'''
 
+
|'''Confidence Value(s)'''
 
+
|
Notes:
+
|
 
+
|Not applicable
* Each row contains information from a single reference.  Thus if the same genetic interaction has been reported in multiple references, multiple rows will exist for that genetic interaction in the file.
+
|-
 
+
|'''16'''
* 'suppressible' in column 5 indicates that phenotypes caused by mutation of the gene(s) listed in the starting genotype (column 1) are suppressed by mutation of the gene(s) listed in the interacting genotype (column 3).
+
|'''Expansion Method(s)'''
 
+
|
* 'enhanceable' in column 5 indicates that phenotypes caused by mutation of the gene(s) listed in the starting genotype (column 1) are enhanced by mutation of the gene(s) listed in the interacting genotype (column 3).
+
|
 
+
|Not applicable
''e.g.''
+
|-
 
+
|'''17'''
Pten&emsp;FBgn0026379&emsp;Akt1&emsp;FBgn0010379&emsp;suppressible&emsp;FBrf0127089
+
|'''Biological Role(s) Interactor A'''
 
+
|
indicates that phenotype(s) caused by a mutation of Pten are suppressed by a mutation of Akt1.
+
|
 
+
|Not applicable
* Where multiple genes are simultaneously mutated in the starting genotype, the interacting genotype, or both, the genes involved are separated by a '|' in the relevant columns. The symbols in column 1 and the IDs in column 2 are listed in the same order, as are those in columns 3 and 4, so the FBgn corresponding to each gene symbol can easily be identified.
 
 
 
''e.g.''
 
 
 
robo1|sli&emsp;FBgn0005631|FBgn0264089&emsp;RhoGAP93B&emsp;FBgn0038853&emsp;enhanceable&emsp;FBrf0191476
 
 
 
indicates that:
 
* phenotype(s) caused by a robo1, sli double mutant combination are enhanced by a mutation of RhoGAP93B.
 
* FBgn0005631 corresponds to robo1, FBgn0264089 corresponds to sli
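Because symbols and FBgn IDs are listed in the same order, they can be paired positionally. The following Python sketch splits one data row of the genetic interaction table and zips each symbol with its ID; the function name and the dictionary keys are illustrative choices, and the row is the robo1|sli example given above.

```python
def parse_interaction_row(line):
    """Split one genetic interaction row and pair each symbol with its FBgn."""
    (start_symbols, start_ids, inter_symbols, inter_ids,
     interaction_type, reference) = line.rstrip("\n").split("\t")
    return {
        "starting": list(zip(start_symbols.split("|"), start_ids.split("|"))),
        "interacting": list(zip(inter_symbols.split("|"), inter_ids.split("|"))),
        "type": interaction_type,
        "reference": reference,
    }


example_row = parse_interaction_row(
    "robo1|sli\tFBgn0005631|FBgn0264089\tRhoGAP93B\tFBgn0038853"
    "\tenhanceable\tFBrf0191476"
)
# example_row["starting"] == [("robo1", "FBgn0005631"), ("sli", "FBgn0264089")]
```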
 
 
 
 
 
 
 
====RNA-Seq RPKM values (gene_rpkm_report_fb_*.tsv.gz)====
 
This file reports gene expression values based on RNA-Seq experiments, calculated as reads per kilobase per million reads (RPKM). RPKM values are calculated only for the unique exonic regions of the gene (excluding segments that overlap other genes), except for genes derived from dicistronic/polycistronic transcripts, in which case all exon regions are used in the RPKM expression calculation.
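For orientation, the standard RPKM definition can be written as a one-line Python function. This is a sketch of the general formula only; the exact read counting and normalisation used to produce the FlyBase values is not described here, and the parameter names and numbers in the example are assumptions for illustration.

```python
def rpkm(read_count, region_length_bases, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads.

    Equivalent to read_count / (length in kb) / (total reads in millions).
    """
    return read_count * 1e9 / (region_length_bases * total_mapped_reads)


# Illustrative numbers: 500 reads over 2,000 exonic bases in a library
# of 10 million mapped reads gives an RPKM of 25.
example_rpkm = rpkm(500, 2000, 10_000_000)
```

For most genes the region length corresponds to the unique exonic bases; for genes on dicistronic/polycistronic transcripts it corresponds to all exonic bases, as described above.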
 
 
 
File format:
 
 
 
{| class= "wikitable"
 
!Column heading
 
!Content Description
 
 
|-
 
|-
|'''Release_ID'''
+
|'''18'''
|The D. melanogaster annotation set version from which the gene model used in the analysis derives.
+
|'''Biological Role(s) Interactor B'''
 +
|
 +
|
 +
|Not applicable
 
|-
 
|-
|'''FBgn#'''
+
|'''19'''
|The unique FlyBase gene ID for this gene.
+
|'''Experimental Role(s) Interactor A'''
 +
|ontology:identifier(experimental role name)
 +
|psi-mi:"MI:0496"(bait)
 +
|The role played by the interactor in the experiment. Taken from the psi-mi ontology.
 
|-
 
|-
|'''GeneSymbol'''
+
|'''20'''
|The official FlyBase symbol for this gene.
+
|'''Experimental Role(s) Interactor B'''
 +
|”
 +
|”
 +
|”
 
|-
 
|-
|'''Parent_library_FBlc#'''
+
|'''21'''
|The unique FlyBase ID for the dataset project to which the RNA-Seq experiment belongs.
+
|'''Type(s) Interactor A'''
 +
|ontology:identifier(interactor type name)
 +
|psi-mi:"MI:0326"(protein)
 +
|The molecule type. For FlyBase, these are limited to protein or ribonucleic acid. Taken from the psi-mi ontology.
 
|-
 
|-
|'''Parent_library_name'''
+
|'''22'''
|The official FlyBase symbol for the dataset project to which the RNA-Seq experiment belongs.
+
|'''Type(s) Interactor B'''
|-
+
|
|'''RNASource_FBlc#'''
+
|”
|The unique FlyBase ID for the RNA-Seq experiment used for RPKM expression calculation.
+
|”
 
|-
 
|-
|'''RNASource_name'''
+
|'''23'''
|The official FlyBase symbol for the RNA-Seq experiment used for RPKM expression calculation.
+
|'''Xref(s) Interactor A'''
 +
|
 +
|
 +
|Not applicable
 
|-
 
|-
|'''RPKM_value'''
+
|'''24'''
|The RPKM expression value for the gene in the specified RNA-Seq experiment.
+
|'''Xref(s) Interactor B'''
 +
|
 +
|
 +
|Not applicable
 
|-
 
|-
|'''Bin_value'''
+
|'''25'''
|The expression bin classification of this gene in this RNA-Seq experiment, based on RPKM value. Bins range from 1 (no/extremely low expression) to 8 (extremely high expression).
+
|'''Interaction Xref(s)'''
 +
|database:identifier
 +
|flybase:FBig0000000103
 +
|<nowiki>Cross references for the interactions. For FlyBase, these include an interaction group identifier (FBig) and possibly a collection identifier (FBlc) separated by “|”. All experiments that show an interaction between the products of gene A and gene B are compiled into an A-B interaction group, such that all interactions are associated with an interaction group identified by an FBig number. Interactions identified as part of a large scale study are also associated with the collection identifier, or FBlc number.</nowiki>
 
|-
 
|-
|'''Unique_exon_base_count'''
+
|'''26'''
|The number of exonic bases unique to the gene (not overlapping exons of other genes). Field will be blank for genes derived from dicistronic/polycistronic transcripts.
+
|'''Annotation(s) Interactor A'''
 +
|topic:text
 +
|isoform-comment:a isoform
 +
|Information on whether the interaction is specific to a particular interactor isoform.
 
|-
 
|-
|'''Total_exon_base_count'''
+
|'''27'''
|The number of bases in all exons of this gene.
+
|'''Annotation(s) Interactor B'''
 +
|”
 +
|”
 +
|”
 +
|-
 +
|'''28'''
 +
|'''Interaction Annotation(s)'''
 +
|topic:text
 +
|molecular source:Source was cell extract of S2 cell line; bait produced from endogenous gene; prey produced from endogenous gene.|comment:Phosphorylated isoforms of @l(2)gl@ are absent when @aPKC@ is knocked down by RNAi.
 +
|Describes the source(s) of the interaction participants and includes free text comments about the interaction.
 
|-
 
|-
|'''Count_used'''
+
|'''29'''
|Indicates if the RPKM expression value was calculated using only the exonic regions unique to the gene and not overlapping exons of other genes (Unique), or, if the RPKM expression value was calculated based on all exons of the gene regardless of overlap with other genes (Total). RPKM expression values are typically reported for the "Unique" count, except for genes on dicistronic/polycistronic transcripts, in which case the "Total" count is reported.
+
|'''Host Organism(s)'''
 +
|
 +
|
 +
|Not applicable
 
|-
 
|-
|}
+
|'''30'''
 
+
|'''Interaction Parameters'''
 
+
|
 
+
|
====Physical interaction table (physical_interactions_fb_*.tsv.gz)====
+
|Not applicable
This file reports unique gene pairs with curated support for some type of physical interaction. The file does not currently distinguish between genes that are involved in protein-protein or RNA-protein interactions (or both).
 
 
 
File format:
 
 
 
{| class= "wikitable"
 
!Column heading
 
!Content Description
 
 
|-
 
|-
|'''gene_FBgn1'''
+
|'''31'''
|The unique FlyBase gene ID for the first gene of the interacting pair.
+
|'''Creation Date'''
 +
|
 +
|
 +
|Not applicable
 
|-
 
|-
|'''gene_symbol1'''
+
|'''32'''
|The official FlyBase symbol for the first gene of the interacting pair.
+
|'''Update Date'''
 +
|
 +
|
 +
|Not applicable
 
|-
 
|-
|'''gene_FBgn2'''
+
|'''33'''
|The unique FlyBase gene ID for the second gene of the interacting pair.
+
|'''Checksum Interactor A'''
 +
|
 +
|
 +
|Not applicable
 
|-
 
|-
|'''gene_symbol2'''
+
|'''34'''
|The official FlyBase symbol for the second gene of the interacting pair.
+
|'''Checksum Interactor B'''
 +
|
 +
|
 +
|Not applicable
 
|-
 
|-
|'''FBrf(s)'''
+
|'''35'''
|The unique FlyBase IDs for the publications supporting this interaction.
+
|'''Interaction Checksum'''
 +
|
 +
|
 +
|Not applicable
 
|-
 
|-
|'''FBig_id'''
+
|'''36'''
|The unique FlyBase ID for this pairwise interaction.
+
|'''Negative'''
 +
|
 +
|FALSE
 +
|All interactions in FlyBase are positive.
 
|-
 
|-
|'''#_reported_interactions'''
+
|'''37'''
|The number of distinct experiments in support of this interaction.
+
|'''Feature(s) Interactor A'''
 +
|feature_type:range(text)
 +
|sufficient binding region:aa 1-58(N-terminal region)
 +
|Describes features of Interactor A such as binding sites, mutations that disrupt the interaction, epitope tags, etc.
 
|-
 
|-
|}
+
|'''38'''
 
+
|'''Feature(s) Interactor B'''
 
+
|”
====Physical interaction MITAB file (physical_interactions_mitab_fb_*.tsv.gz)====
+
|”
This file reports each individual experiment curated by FlyBase that supports a physical interaction between two gene products. There can be multiple experiments (multiple rows in the file) between products of the same gene pair. Interaction molecule types currently curated are protein-protein, protein-RNA or RNA-RNA.
+
|”
 
+
|-
This file is in PSI-MI TAB format, a tab-delimited format developed by the HUPO Proteomics Standards Initiative (PSI) Molecular Interactions (MI) working group to facilitate interactomics data comparison and exchange. Details on the general MITAB format can be found [https://psicquic.github.io/MITAB27Format.html here]. The file makes use of the Molecular Interactions ontology which can be searched or browsed [https://www.ebi.ac.uk/ols/ontologies/mi here].  Fields are filled with  “-” if values are missing  or not relevant.
+
|'''39'''
 +
|'''Stoichiometry Interactor A'''
 +
|
 +
|
 +
|Not applicable
 +
|-
 +
|'''40'''
 +
|'''Stoichiometry Interactor B'''
 +
|
 +
|
 +
|Not applicable
 +
|-
 +
|'''41'''
 +
|'''Identification Method(s) Participant A'''
 +
|
 +
|
 +
|Not applicable
 +
|-
 +
|'''42'''
 +
|'''Identification Method(s) Participant B'''
 +
|
 +
|
 +
|Not applicable
 +
|-
 +
|}
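The field formats listed in the table above (database:identifier, ontology:identifier(name), and "|"-separated lists) can be unpacked with simple string operations. The Python sketch below is an assumption-laden illustration rather than a full MITAB parser: it treats the text in the final pair of parentheses as the label, so an identifier that itself ends in ")" and has no label would be split incorrectly, and it assumes "|" never occurs inside a value. The function names are hypothetical.

```python
def parse_mitab_entry(entry):
    """Split 'database:identifier(label)' into (database, identifier, label)."""
    database, _, rest = entry.partition(":")
    label = None
    if rest.endswith(")") and "(" in rest:
        # Take the last parenthesised group as the label, which keeps
        # gene symbols such as l(2)gl intact.
        rest, _, label = rest[:-1].rpartition("(")
        label = label.strip('"')
    return database, rest.strip('"'), label


def parse_mitab_field(field):
    """Return a list of parsed entries; '-' marks a missing value."""
    if field == "-":
        return []
    return [parse_mitab_entry(entry) for entry in field.split("|")]


method = parse_mitab_field('psi-mi:"MI:0006"(anti bait coimmunoprecipitation)')
alias = parse_mitab_field("flybase:l(2)gl(gene name)")
```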
 +
 
  
 +
====Functional complementation table (gene_functional_complementation_*.tsv)====
 +
 +
This file reports when functional complementation of Dmel genes by non-Dmel orthologs has been observed. This data is computed by FlyBase using a combination of the orthology data obtained from DIOPT and OrthoDB and the allele-level genetic interaction data curated from the literature. The file contains a list of Dmel gene to non-Dmel ortholog pairs where a transgenic construct/mutant allele of the non-Dmel ortholog has been shown to at least partially suppress mutant phenotype(s) of an allele of the Dmel gene.
  
 
File format:
 
File format:
Line 464: Line 652:
 
{| class= "wikitable"
 
{| class= "wikitable"
 
!Column number
 
!Column number
!Column heading
+
!Column heading
!General format
+
!Content Description
!FlyBase example
 
!Content description
 
 
|-
 
|-
 
|'''1'''
 
|'''1'''
|'''ID(s) Interactor A'''
+
|'''Dmel gene (symbol)'''
|database:identifier
+
|Current FlyBase symbol of Dmel gene.
|flybase:FBgn0002121
 
|The unique FlyBase identifier for the first gene of the interacting pair.
 
 
|-
 
|-
 
|'''2'''
 
|'''2'''
|'''ID(s) Interactor B'''
+
|'''Dmel gene (FBgn)'''
|
+
|Current FlyBase identifier (FBgn#) of Dmel gene in column 1.
|”
 
|The unique FlyBase identifier for the second gene of the interacting pair.
 
 
|-
 
|-
 
|'''3'''
 
|'''3'''
|'''Alt ID(s) Interactor A'''
+
|'''Functionally complementing ortholog (symbol)'''
|database:identifier
+
|Current FlyBase symbol of a non-Dmel ortholog of the Dmel gene in column 1 where this non-Dmel gene has been shown to functionally complement the Dmel gene.
|<nowiki>flybase:CG2671|entrez gene/locuslink:33156</nowiki>
 
|<nowiki>The alternative gene identifiers currently provided are FlyBase annotation IDs (CG#) and NCBI’s Entrez Gene ID separated by “|”.</nowiki>
 
 
|-
 
|-
 
|'''4'''
 
|'''4'''
|'''Alt ID(s) Interactor B'''
+
|'''Functionally complementing ortholog (FBgn#)'''
|
+
|Current FlyBase identifier (FBgn#) of a non-Dmel ortholog of the Dmel gene in column 1 where this non-Dmel gene has been shown to functionally complement the Dmel gene.
|”
 
|”
 
 
|-
 
|-
 
|'''5'''
 
|'''5'''
|'''Alias(es) Interactor A'''
+
|'''Supporting_FBrf'''
|database:name(alias type)
+
|Current FlyBase identifier (FBrf#) of the publication that provides support for the functional complementation statement (the publication that reported the suppression of a mutant phenotype of the Dmel gene by a transgenic construct/mutant allele of the non-Dmel ortholog).
|flybase:l(2)gl(gene name)
 
|The official FlyBase gene symbol. It is referred to as “gene name” to adhere to the psi-mi ontology.
 
 
|-
 
|-
|'''6'''
+
|}
|'''Alias(es) Interactor B'''
+
 
|”
+
Notes:
|”
+
 
|”
+
* Each row contains information from a single reference.  Thus if multiple references support the same functional complementation statement, multiple rows will exist for that statement in the file.
|-
+
 
|'''7'''
+
 
|'''Interaction Detection Method(s)'''
+
====FBgn <=> DB Accession IDs (fbgn_NAseq_Uniprot_*.tsv)====
|ontology:identifier(method name)
+
The file reports EMBL/GenBank/DDBJ nucleotide and protein accessions, UniProtKB/SwissProt/TrEMBL protein accessions, NCBI Entrez gene IDs and NCBI RefSeq transcript and protein accessions associated with FlyBase genes.
|psi-mi:"MI:0006"(anti bait coimmunoprecipitation)
+
 
|The assay used to detect the interaction, taken from the psi-mi ontology.
+
The file includes:
 +
* nuclear genes with sequence accession numbers
 +
* mitochondrial genes
 +
 
 +
it excludes:
 +
* genes without sequence accession numbers
 +
 
 +
File format:
 +
 
 +
{| class= "wikitable"
 +
!Column number
 +
!Column heading
 +
!Content Description
 
|-
 
|-
|'''8'''
+
|'''1'''
|'''Publication 1st Author(s)'''
+
|'''gene_symbol'''
|surname initial(s) (publication year)
+
|Current symbol of gene.
|Betschinger K. (2003)
 
|The first author and year of the publication where the interaction is described.
 
 
|-
 
|-
|'''9'''
+
|'''2'''
|'''Publication ID(s)'''
+
|'''organism_abbreviation'''
|database:identifier
+
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the gene.
|<nowiki>flybase:FBrf0157155|pubmed:12629552</nowiki>
 
|<nowiki>The unique FlyBase identifier for the publication followed by the unique PubMed identifier (if there is one) separated by “|”.</nowiki>
 
 
|-
 
|-
|'''10'''
+
|'''3'''
|'''Taxid Interactor A'''
+
|'''primary_FBgn#'''
|taxid:identifier
+
|Current FlyBase identifier (FBgn#) of gene.
|taxid:7227("Drosophila melanogaster")
+
|-
|The NCBI taxonomy identifier for the source organism of the interactor. The vast majority of interactors in FlyBase come from D. melanogaster. There are, however, a few interspecies interactions consisting of a D. melanogaster interactor and an interactor of a different species.
+
|'''4'''
 +
|'''nucleotide_accession'''
 +
|EMBL/GenBank/DDBJ nucleotide accession associated with the gene.
 
|-
 
|-
|'''11'''
+
|'''5'''
|'''Taxid Interactor B'''
+
|'''na_based_protein_accession'''
|
+
|EMBL/GenBank/DDBJ protein accession associated with the gene and the nucleotide accession in the preceding 'nucleotide_accession' column.
|”
 
|”
 
 
|-
 
|-
|'''12'''
+
|'''6'''
|'''Interaction Type(s)'''
+
|'''UniprotKB/Swiss-Prot/TrEMBL_accession'''
|ontology:identifier(interaction type)
+
|UniProtKB/SwissProt/TrEMBL protein accession associated with the gene.
|psi-mi:"MI:0915"(physical association)
 
|Taken from the psi-mi ontology. Most often “physical association” for FlyBase.
 
 
|-
 
|-
|'''13'''
+
|'''7'''
|'''Source Database(s)'''
+
|'''EntrezGene_ID'''
|ontology:identifier(database name)
+
|NCBI Entrez ID associated with the gene.
|psi-mi:"MI:0478"(flybase)
 
|All interactions are curated by FlyBase.
 
 
|-
 
|-
|'''14'''
+
|'''8'''
|'''Interaction Identifier(s)'''
+
|'''RefSeq_transcripts'''
|database:identifier
+
|NCBI RefSeq transcript accession associated with the gene.
|flybase:FBrf0157155-13.coIP.WB
 
|The unique FlyBase identifier for this interaction.
 
 
|-
 
|-
|'''15'''
+
|'''9'''
|'''Confidence Value(s)'''
+
|'''RefSeq_proteins'''
|
+
|NCBI RefSeq protein accession associated with the gene and the transcript accession in the preceding 'RefSeq_transcripts' column.
|
 
|Not applicable
 
 
|-
 
|-
|'''16'''
+
|}
|'''Expansion Method(s)'''
+
 
|
+
Notes:
|
+
 
|Not applicable
+
* Each row contains information about a single accession associated with a gene, thus if a gene has multiple accessions associated with it, multiple rows will exist for that gene in the file.
 +
 
 +
* A single row contains '''only''' information about an EMBL/GenBank/DDBJ accession '''or''' information about a UniProtKB/SwissProt/TrEMBL accession '''or''' an NCBI Entrez gene ID '''or''' an NCBI RefSeq transcript accession.
 +
 
 +
* For rows containing information about an EMBL/GenBank/DDBJ accession, a nucleotide accession associated with the gene is listed in column 4 ('nucleotide_accession'). If there is also an EMBL/GenBank/DDBJ protein accession associated with that gene '''and''' with the nucleotide accession in column 4, this protein accession is listed in column 5 ('na_based_protein_accession'). In this case, columns 6, 7, 8 and 9 are always empty.
 +
 
 +
* For rows containing information about a UniProtKB/SwissProt/TrEMBL protein accession, a protein accession associated with the gene is listed in column 6 ('UniprotKB/Swiss-Prot/TrEMBL_accession'). In this case, columns 4, 5, 7, 8 and 9 are always empty.
 +
 
 +
* For rows containing information about an NCBI Entrez gene, an ID associated with the gene is listed in column 7 ('EntrezGene_ID'). In this case, columns 4, 5, 6, 8 and 9 are always empty.
 +
 
 +
* For rows containing information about an NCBI RefSeq accession, a transcript accession associated with the gene is listed in column 8 ('RefSeq_transcripts'). If there is also an NCBI RefSeq protein accession associated with that gene '''and''' with the transcript accession in column 8, this protein accession is listed in column 9 ('RefSeq_proteins'). In this case, columns 4, 5, 6 and 7 are always empty.
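The row rules in the notes above amount to checking which of columns 4 to 9 is filled. The Python sketch below classifies a row on that basis and returns the accession together with any paired protein accession. The function name, the returned labels and the accession values in the example are hypothetical.

```python
def classify_accession_row(columns):
    """Classify one nine-column row by which accession columns are filled.

    Returns (source, accession, paired_protein_accession_or_None).
    """
    # Pad short rows in case trailing empty columns were dropped.
    padded = list(columns) + [""] * (9 - len(columns))
    nucleotide, na_protein, uniprot, entrez, refseq_tx, refseq_pp = padded[3:9]
    if nucleotide:
        return ("EMBL/GenBank/DDBJ", nucleotide, na_protein or None)
    if uniprot:
        return ("UniProtKB", uniprot, None)
    if entrez:
        return ("EntrezGene", entrez, None)
    if refseq_tx:
        return ("RefSeq", refseq_tx, refseq_pp or None)
    return ("unknown", None, None)


# Hypothetical row: only the RefSeq transcript and protein columns are filled.
example_kind = classify_accession_row(
    ["geneA", "Dmel", "FBgn0000000", "", "", "", "", "NM_000000", "NP_000000"]
)
```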
 +
 
 +
 
 +
====FBgn <=> Annotation ID (fbgn_annotation_ID_*.tsv)====
 +
The file reports current and secondary FlyBase identifiers associated with FlyBase genes, including current and secondary gene identifiers (FBgn#), and current and secondary annotation identifiers (CG#).
 +
 
 +
The file includes:
 +
* nuclear genes located to the sequence
 +
* mitochondrial genes
 +
 
 +
it excludes:
 +
* genes not located to the sequence
 +
 
 +
File format:
 +
 
 +
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 
|-
 
|-
|'''17'''
+
|'''gene_symbol'''
|'''Biological Role(s) Interactor A'''
+
|Current symbol of gene.
|
 
|
 
|Not applicable
 
 
|-
 
|-
|'''18'''
+
|'''organism_abbreviation'''
|'''Biological Role(s) Interactor B'''
+
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the gene.
|
 
|
 
|Not applicable
 
 
|-
 
|-
|'''19'''
+
|'''primary_FBgn#'''
|'''Experimental Role(s) Interactor A'''
+
|Current FlyBase identifier (FBgn#) of gene.
|ontology:identifier(experimental role name)
 
|psi-mi:"MI:0496"(bait)
 
|The role played by the interactor in the experiment. Taken from the psi-mi ontology.
 
 
|-
 
|-
|'''20'''
+
|'''secondary_FBgn#(s)'''
|'''Experimental Role(s) Interactor B'''
+
|Secondary FlyBase identifier(s) (FBgn#) associated with the gene (comma separated values).
|”
 
|”
 
|”
 
 
|-
 
|-
|'''21'''
+
|'''annotation_ID'''
|'''Type(s) Interactor A'''
+
|Current annotation identifier associated with the gene.
|ontology:identifier(interactor type name)
 
|psi-mi:"MI:0326"(protein)
 
|The molecule type. For FlyBase, these are limited to protein or ribonucleic acid. Taken from the psi-mi ontology.
 
 
|-
 
|-
|'''22'''
+
|'''secondary_annotation_ID(s)'''
|'''Type(s) Interactor B'''
+
|Secondary annotation identifier(s) associated with the gene (comma separated values).
|”
 
|”
 
|”
 
 
|-
 
|-
|'''23'''
+
|}
|'''Xref(s) Interactor A'''
+
Notes:
|
+
 
|
+
* If a gene has multiple secondary identifiers, all the values are stored within one tab separated column and are separated by commas (for example as: FBgn0034701,FBgn0034702).
|Not applicable
+
 
|-
+
 
|'''24'''
+
====FBgn <=> GLEANR IDs (fbgn_gleanr_*.tsv)====
|'''Xref(s) Interactor B'''
+
This file reports the relationship between the symbols and gene identifiers used by FlyBase for non-melanogaster genes identified by the AAA consortium, and the GLEANR identifier assigned to the gene during the initial annotation of the genome sequence.
|
+
 
|
+
The file includes:
|Not applicable
+
* non-melanogaster genes located to the sequence
 +
 
 +
it excludes:
 +
* ''D. melanogaster'' genes
 +
* non-melanogaster genes not located to the sequence
 +
 
 +
File format:
 +
 
 +
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 +
|-
 +
|'''organism_abbreviation'''
 +
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the gene.
 
|-
 
|-
|'''25'''
+
|'''gene_symbol'''
|'''Interaction Xref(s)'''
+
|Current FlyBase gene symbol.
|database:identifier
 
|flybase:FBig0000000103
 
|<nowiki>Cross references for the interactions. For FlyBase, these include an interaction group identifier (FBig) and possibly a collection identifier (FBlc) separated by “|”. All experiments that show an interaction between the products of gene A and gene B are compiled into an A-B interaction group, such that all interactions are associated with an interaction group identified by an FBig number. Interactions identified as part of a large scale study are also associated with the collection identifier, or FBlc number.</nowiki>
 
 
|-
 
|-
|'''26'''
+
|'''primary_FBgn#'''
|'''Annotation(s) Interactor A'''
+
|Current FlyBase identifier (FBgn#) of the gene.
|topic:text
 
|isoform-comment:a isoform
 
|Information on whether the interaction is specific to a particular interactor isoform.
 
 
|-
 
|-
|'''27'''
+
|'''GLEANR_ID'''
|'''Annotation(s) Interactor B'''
+
|GLEANR identifier assigned by the AAA Consortium.
|”
 
|”
 
|”
 
 
|-
 
|-
|'''28'''
+
|}
|'''Interaction Annotation(s)'''
+
 
|topic:text
+
 
|molecular source:Source was cell extract of S2 cell line; bait produced from endogenous gene; prey produced from endogenous gene.|comment:Phosphorylated isoforms of @l(2)gl@ are absent when @aPKC@ is knocked down by RNAi.
+
====FBgn <=> FBtr <=> FBpp IDs (fbgn_fbtr_fbpp_*.tsv)====
|Describes the source(s) of the interaction participants and includes free text comments about the interaction.
+
This file reports the relationship of gene identifiers used by FlyBase for sequence localized genes, and the identifiers used for the transcript and polypeptide products of these genes.
|-
+
 
|'''29'''
+
The file includes:
|'''Host Organism(s)'''
+
* genes located to the sequence
|
+
 
|
+
it excludes:
|Not applicable
+
* genes not located to the sequence
 +
 
 +
File format:
 +
 
 +
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 +
|-
 +
|'''FlyBase_FBgn'''
 +
|Current FlyBase identifier (FBgn#) of the gene.
 
|-
 
|-
|'''30'''
+
|'''FlyBase_FBtr'''
|'''Interaction Parameters'''
+
|Current FlyBase identifier (FBtr#) of a transcript encoded by the gene listed in the preceding 'FlyBase_FBgn' column.
|
 
|
 
|Not applicable
 
 
|-
 
|-
|'''31'''
+
|'''FlyBase_FBpp'''
|'''Creation Date'''
+
|Current FlyBase identifier (FBpp#) of a polypeptide encoded by the transcript listed in the preceding 'FlyBase_FBtr' column, where this is relevant.
|
 
|
 
|Not applicable
 
 
|-
 
|-
|'''32'''
+
|}
|'''Update Date'''
+
 
|
+
Notes:
|
+
 
|Not applicable
+
* Each row contains information about a single transcript and the polypeptide it encodes (if relevant). Thus if a gene encodes multiple isoforms, multiple rows will exist for that gene in the file.
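Since a gene with several isoforms spans several rows, a common first step is to regroup the rows by gene. The sketch below assumes the three-column order described above (FlyBase_FBgn, FlyBase_FBtr, FlyBase_FBpp) and that comment lines start with "#"; the identifiers in the sample are invented.

```python
from collections import defaultdict


def group_products(lines):
    """Map each FBgn to a list of (FBtr, FBpp or None) pairs, one per row."""
    genes = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumed comment/header convention
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a usable data row
        fbgn, fbtr = fields[0], fields[1]
        # The polypeptide column is empty for non-coding transcripts.
        fbpp = fields[2] if len(fields) > 2 and fields[2] else None
        genes[fbgn].append((fbtr, fbpp))
    return dict(genes)


# Made-up identifiers, for illustration only.
sample_lines = [
    "# FlyBase_FBgn\tFlyBase_FBtr\tFlyBase_FBpp\n",
    "FBgn0000001\tFBtr0000001\tFBpp0000001\n",
    "FBgn0000001\tFBtr0000002\tFBpp0000002\n",
    "FBgn0000002\tFBtr0000003\t\n",
]
products = group_products(sample_lines)
```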
 +
 
 +
 
 +
====FBgn <=> FBtr <=> FBpp IDs (expanded) (fbgn_fbtr_fbpp_expanded_*.tsv)====
 +
This expanded version of the "FBgn <=> FBtr <=> FBpp IDs" file adds organism, symbol and type information to the identifiers for sequence localized genes and their related transcript and protein products.
 +
 
 +
The file includes:
 +
* sequence localized nuclear genes with transcript/polypeptide annotations.
 +
* sequence localized mitochondrial genes with transcript/polypeptide annotations.
 +
 
 +
it excludes:
 +
* genes that have not been localized to the reference genome assembly for a given species.
 +
 
 +
File format:
 +
 
 +
{| class= "wikitable"
 +
!Column number
 +
!Column heading
 +
!Content Description
 
|-
 
|-
|'''33'''
+
|'''1'''
|'''Checksum Interactor A'''
+
|'''organism'''
|
+
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the gene.
|
 
|Not applicable
 
 
|-
 
|-
|'''34'''
+
|'''2'''
|'''Checksum Interactor B'''
+
|'''gene_type'''
|
+
|The type of gene, represented by a Sequence Ontology term.
|
 
|Not applicable
 
 
|-
 
|-
|'''35'''
+
|'''3'''
|'''Interaction Checksum'''
+
|'''gene_ID'''
|
+
|Current "FBgn" identifier of gene.
|
 
|Not applicable
 
 
|-
 
|-
|'''36'''
+
|'''4'''
|'''Negative'''
+
|'''gene_symbol'''
|
+
|Current symbol of the gene.
|FALSE
 
|All interactions in FlyBase are positive.
 
 
|-
 
|-
|'''37'''
+
|'''5'''
|'''Feature(s) Interactor A'''
+
|'''gene_fullname'''
|feature_type:range(text)
+
|Current full name of the gene.
|sufficient binding region:aa 1-58(N-terminal region)
 
|Describes features of Interactor A such as binding sites, mutations that disrupt the interaction, epitope tags, etc.
 
 
|-
 
|-
|'''38'''
+
|'''6'''
|'''Feature(s) Interactor B'''
+
|'''annotation_ID'''
|
+
|Current FlyBase annotation identifier of the gene.
|
+
|-
|
+
|'''7'''
 +
|'''transcript_type'''
 +
|The type of transcript, represented by a Sequence Ontology term.
 
|-
 
|-
|'''39'''
+
|'''8'''
|'''Stoichiometry Interactor A'''
+
|'''transcript_ID'''
|
+
|Current FlyBase annotation identifier of the transcript.
|
 
|Not applicable
 
 
|-
 
|-
|'''40'''
+
|'''9'''
|'''Stoichiometry Interactor B'''
+
|'''transcript_symbol'''
|
+
|Current symbol of the transcript.
|
 
|Not applicable
 
 
|-
 
|-
|'''41'''
+
|'''10'''
|'''Identification Method(s) Participant A'''
+
|'''polypeptide_ID'''
|
+
|Current FlyBase annotation identifier of the polypeptide.
|
 
|Not applicable
 
 
|-
 
|-
|'''42'''
+
|'''11'''
|'''Identification Method(s) Participant B'''
+
|'''polypeptide_symbol'''
|
+
|Current symbol of the polypeptide.
|
 
|Not applicable
 
 
|-
 
|-
 
|}
 
|}
  
 +
Notes:
 +
 +
* Each row contains information about a single transcript annotation, and if applicable, its associated polypeptide annotation.
 +
* Multiple rows may exist for a given gene in the file.
 +
* The "polypeptide_ID" and "polypeptide_symbol" columns are blank for non-mRNA transcript types.
 +
* For non-melanogaster annotations derived from NCBI Gnomon, some genes may be associated with a mix of coding and non-coding transcripts.
 +
* For D. melanogaster annotations, annotation IDs have a "CG" prefix for coding genes, or a "CR" prefix for non-protein-coding genes.
 +
* For non-melanogaster annotations, the annotation ID prefix varies by organism: "GD" for D. simulans ("Dsim"), "GF" for D. ananassae ("Dana"), "GA" for D. pseudoobscura ("Dpse") and "GJ" for D. virilis ("Dvir").
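The prefix conventions in these notes can be expressed as a small lookup. The following is a sketch only: it covers just the organisms named above, the function name is hypothetical, and the annotation IDs in the usage lines are invented.

```python
# Annotation ID prefixes as listed in the notes above; other organisms are not covered.
EXPECTED_PREFIXES = {
    "Dmel": ("CG", "CR"),  # CG = coding, CR = non-protein-coding
    "Dsim": ("GD",),
    "Dana": ("GF",),
    "Dpse": ("GA",),
    "Dvir": ("GJ",),
}


def has_expected_prefix(organism, annotation_id):
    """Return True/False for listed organisms, or None if the organism is not listed."""
    prefixes = EXPECTED_PREFIXES.get(organism)
    if prefixes is None:
        return None
    return annotation_id.startswith(prefixes)


# Made-up annotation IDs, for illustration only.
checks = [
    has_expected_prefix("Dmel", "CG00001"),
    has_expected_prefix("Dsim", "GD00001"),
    has_expected_prefix("Dvir", "GA00001"),
    has_expected_prefix("Dxxx", "GX00001"),
]
```

A check like this can be applied to the "organism" and "annotation_ID" columns (columns 1 and 6) to flag rows that do not follow the expected naming.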
 +
 +
====FBgn exons <=> Affy1 (fbgn_exons2affy1_overlaps.tsv)====
 +
The file is generated by testing for overlaps, no matter how small, of the locations of Affy1 oligos in the genome with the locations of gene exons, as defined by the '''Dmel''' gene models for the current release of FlyBase. If the location of an Affy1 oligo shows any kind of overlap with an exon of a gene, a Gene=>Affy reference is recorded in this file.
 +
 +
The extent of the overlap has no influence on the inclusion of a cross-reference in this file. The overlap might be just one nucleotide, or it could be an exact match to the exon. For interpretation of the significance of a partial overlap please contact Affymetrix.
 +
 +
The file includes the following '''Dmel''' genes:
 +
* nuclear genes located to the sequence
 +
 +
it excludes:
 +
* genes not located to the sequence
 +
* mitochondrial genes
 +
 +
Notes:
 +
 +
* Each line of the file '''can contain many''' tab separated columns:
 +
 +
* '''The first column of a line''' contains the valid [[FlyBase:RefMan_F.#FlyBase_Identifier_Numbers|FlyBase identifiers]] of a gene.
 +
* '''Subsequent columns:''' Each '''Affy1 ID''' that overlaps with an exon of the gene, as described above, is listed in an additional tab separated column. Thus, this file does not contain a predefined number of columns.
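Because the number of columns varies from line to line, this file is easier to read line by line than with a fixed-width table reader. A minimal sketch, assuming comment lines start with "#" and using invented gene and probe identifiers:

```python
def parse_overlaps(lines):
    """Map each gene ID (first column) to the list of Affy IDs on the same line."""
    overlaps = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0] or fields[0].startswith("#"):
            continue  # blank line or assumed comment/header line
        # Every remaining non-empty column is one overlapping Affy ID.
        overlaps[fields[0]] = [field for field in fields[1:] if field]
    return overlaps


# Made-up gene and probe identifiers, for illustration only.
sample_lines = [
    "# gene\tAffy IDs...\n",
    "FBgn0000001\tprobe_1_at\tprobe_2_at\tprobe_3_at\n",
    "FBgn0000002\tprobe_4_at\n",
]
overlaps = parse_overlaps(sample_lines)
```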
 +
 +
 +
====FBgn exons <=> Affy2 (fbgn_exons2affy2_overlaps.tsv)====
 +
The file is generated from the location of Affy2 oligos exactly as [[FlyBase:FilesOverview#5.2.6|described for Affy1 oligos]] above.
  
====Functional complementation table (gene_functional_complementation_*.tsv)====
 
  
This file reports cases where functional complementation of Dmel genes by non-Dmel orthologs has been observed. These data are computed by FlyBase using a combination of the orthology data obtained from DIOPT and OrthoDB and the allele-level genetic interaction data curated from the literature. The file lists pairs of a Dmel gene and a non-Dmel ortholog for which a transgenic construct/mutant allele of the non-Dmel ortholog has been shown to at least partially suppress mutant phenotype(s) of an allele of the Dmel gene.
+
====Genes Sequence Ontology (SO) data (dmel_gene_sequence_ontology_annotations_fb_*.tsv.gz)====
 +
This file provides SO term annotations for ''D. melanogaster'' genes that have been mapped to the current genome assembly. It will be available beginning with the FB2021_02 release.
  
 
File format:
 
File format:
  
 
{| class= "wikitable"
 
{| class= "wikitable"
!Column number
 
 
!Column heading
 
!Column heading
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''1'''
+
|'''gene_primary_id'''
|'''Dmel gene (symbol)'''
+
|The unique FlyBase gene ID for this gene.
|Current FlyBase symbol of Dmel gene.
 
 
|-
 
|-
|'''2'''
+
|'''gene_symbol'''
|'''Dmel gene (FBgn)'''
+
|The official FlyBase symbol for this gene.
|Current FlyBase identifier (FBgn#) of Dmel gene in column 1.
 
 
|-
 
|-
|'''3'''
+
|'''so_term_name'''
|'''Functionally complementing ortholog (symbol)'''
+
|The SO term name.
|Current FlyBase symbol of a non-Dmel ortholog of the Dmel gene in column 1 where this non-Dmel gene has been shown to functionally complement the Dmel gene.
 
 
|-
 
|-
|'''4'''
+
|'''so_term_id'''
|'''Functionally complementing ortholog (FBgn#)'''
+
|The SO term primary identifier.
|Current FlyBase identifier (FBgn#) of a non-Dmel ortholog of the Dmel gene in column 1 where this non-Dmel gene has been shown to functionally complement the Dmel gene.
 
|-
 
|'''5'''
 
|'''Supporting_FBrf'''
 
|Current FlyBase identifier (FBrf#) of the publication that provides support for the functional complementation statement (the publication that reported the suppression of a mutant phenotype of the Dmel gene by a transgenic construct/mutant allele of the non-Dmel ortholog).
 
 
|-
 
|-
 
|}
 
|}
  
Notes:
+
====Genes map table (gene_map_table_*.tsv)====
 +
The file reports available localization information for FlyBase genes.
  
* Each row contains information from a single reference.  Thus if multiple references support the same functional complementation statement, multiple rows will exist for that statement in the file.
+
It includes:
  
 +
* nuclear genes located to the sequence
 +
* mitochondrial genes
 +
* genes not located to the sequence
  
====FBgn <=> DB Accession IDs (fbgn_NAseq_Uniprot_*.tsv)====
+
File format:
The file reports EMBL/GenBank/DDBJ nucleotide and protein accessions, UniProtKB/SwissProt/TrEMBL protein accessions, NCBI Entrez gene IDs and NCBI RefSeq transcript and protein accessions associated with FlyBase genes.
 
 
 
The file includes:
 
* nuclear genes with sequence accession numbers
 
* mitochondrial genes
 
 
 
it excludes:
 
* genes without sequence accession numbers
 
 
 
File format:
 
  
 
{| class= "wikitable"
 
{| class= "wikitable"
!Column number
 
 
!Column heading
 
!Column heading
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''1'''
 
|'''gene_symbol'''
 
|Current symbol of gene.
 
|-
 
|'''2'''
 
 
|'''organism_abbreviation'''
 
|'''organism_abbreviation'''
 
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the gene.
 
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the gene.
 
|-
 
|-
|'''3'''
+
|'''current_symbol'''
|'''primary_FBgn#'''
+
|Current FlyBase gene symbol.
 +
|-
 +
|'''primary_FBid'''
 
|Current FlyBase identifier (FBgn#) of gene.
 
|Current FlyBase identifier (FBgn#) of gene.
 
|-
 
|-
|'''4'''
+
|'''recombination_loc'''
|'''nucleotide_accession'''
+
|recombination map location.
|EMBL/GenBank/DDBJ nucleotide accession associated with the gene.
 
 
|-
 
|-
|'''5'''
+
|'''cytogenetic_loc'''
|'''na_based_protein_accession'''
+
|cytogenetic location.
|EMBL/GenBank/DDBJ protein accession associated with the gene and the nucleotide accession in the preceding 'nucleotide_accession' column.
+
|-
 +
|'''sequence_loc'''
 +
|genomic location.
 +
|-
 +
|}
 +
 
 +
====Best gene summaries (best_gene_summary*.tsv)====
 +
The single best available gene summary is reported for each D. melanogaster gene (available in the FB2022_05 release).<br/>
 +
Gene summaries are taken from the following sources, in order of decreasing rank:
 +
* FlyBase gene snapshots
 +
* UniProtKB functional descriptions
 +
* InteractiveFly summaries
 +
* Alliance of Genome Resources automated descriptions
 +
* FlyBase automatically generated summaries
 +
For non-D. melanogaster genes, please see FlyBase's "automated_gene_summaries.tsv.gz" file.
 +
 
 +
File format:
 +
 
 +
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 
|-
 
|-
|'''6'''
+
|'''FBgn_ID'''
|'''UniprotKB/Swiss-Prot/TrEMBL_accession'''
+
|Current FlyBase identifier number for the gene.
|UniProtKB/SwissProt/TrEMBL protein accession associated with the gene.
 
 
|-
 
|-
|'''7'''
+
|'''Gene_Symbol'''
|'''EntrezGene_ID'''
+
|Current FlyBase symbol of the gene.
|NCBI Entrez ID associated with the gene.
 
 
|-
 
|-
|'''8'''
+
|'''Summary_Source'''
|'''RefSeq_transcripts'''
+
|The source of the gene summary.
|NCBI RefSeq transcript accession associated with the gene.
 
 
|-
 
|-
|'''9'''
+
|'''Summary'''
|'''RefSeq_proteins'''
+
|The gene summary text.
|NCBI RefSeq protein accession associated with the gene and the transcript accession in the preceding 'RefSeq_transcripts' column.
 
 
|-
 
|-
 
|}
 
|}
  
Notes:
+
====Automated gene summaries (automated_gene_summaries.tsv)====
 +
The file contains the summaries found on gene report pages and the pop-ups in JBrowse and Interactions Browser in plain text.
  
* Each row contains information about a single accession associated with a gene, thus if a gene has multiple accessions associated with it, multiple rows will exist for that gene in the file.
+
It includes:
 
 
* A single row contains '''only''' information about an EMBL/GenBank/DDBJ accession '''or''' information about a UniProtKB/SwissProt/TrEMBL accession '''or''' an NCBI Entrez gene ID '''or''' an NCBI RefSeq transcript accession.
 
 
 
* For rows containing information about an EMBL/GenBank/DDBJ accession, a nucleotide accession associated with the gene is listed in column 4 ('nucleotide_accession'). If there is also an EMBL/GenBank/DDBJ protein accession associated with that gene '''and''' with the nucleotide accession in column 4, this protein accession is listed in column 5 ('na_based_protein_accession'). In this case, columns 6, 7, 8 and 9 are always empty.
 
  
* For rows containing information about a UniProtKB/SwissProt/TrEMBL protein accession, a protein accession associated with the gene is listed in column 6 ('UniprotKB/Swiss-Prot/TrEMBL_accession'). In this case, columns 4, 5, 7, 8 and 9 are always empty.
 
 
* For rows containing information about an NCBI Entrez gene, an ID associated with the gene is listed in column 7 ('EntrezGene_ID'). In this case, columns 4, 5, 6, 8 and 9 are always empty.
 
 
* For rows containing information about an NCBI RefSeq accession, a transcript accession associated with the gene is listed in column 8 ('RefSeq_transcripts'). If there is also an NCBI RefSeq protein accession associated with that gene '''and''' with the transcript accession in column 8, this protein accession is listed in column 9 ('RefSeq_proteins'). In this case, columns 4, 5, 6 and 7 are always empty.
 
 
 
 
====FBgn <=> Annotation ID (fbgn_annotation_ID_*.tsv)====
 
The file reports current and secondary FlyBase identifiers associated with FlyBase genes, including current and secondary gene identifiers (FBgn#), and current and secondary annotation identifiers (CG#).
 
 
The file includes:
 
 
* nuclear genes located to the sequence
 
* nuclear genes located to the sequence
 
* mitochondrial genes
 
* mitochondrial genes
 
it excludes:
 
 
* genes not located to the sequence
 
* genes not located to the sequence
  
Line 849: Line 1,053:
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''gene_symbol'''
+
|'''-'''
|Current symbol of gene.
+
|FlyBase ID. The valid FlyBase identifier number for the gene.
 
|-
 
|-
|'''organism_abbreviation'''
+
|'''-'''
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the gene.
+
|The gene summary as a string of plain text.
|-
 
|'''primary_FBgn#'''
 
|Current FlyBase identifier (FBgn#) of gene.
 
|-
 
|'''secondary_FBgn#(s)'''
 
|Secondary FlyBase identifier(s) (FBgn#) associated with the gene (comma separated values).
 
|-
 
|'''annotation_ID'''
 
|Current annotation identifier associated with the gene.
 
|-
 
|'''secondary_annotation_ID(s)'''
 
|Secondary annotation identifier(s) associated with the gene (comma separated values).
 
 
|-
 
|-
 
|}
 
|}
Notes:
 
  
* If a gene has multiple secondary identifiers, all the values are stored within one tab separated column and are separated by commas (for example as: FBgn0034701,FBgn0034702).  
+
====Gene Snapshots (gene_snapshots_*.tsv)====
 +
The file contains in plain text the gene snapshot information visible on gene report pages.
 +
 +
It includes only Dmel protein coding genes.
  
 +
File format:
  
 +
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 +
|-
 +
|'''FBgn_ID'''
 +
|Current FlyBase identifier number for the gene.
 +
|-
 +
|'''GeneSymbol'''
 +
|Current FlyBase symbol of the gene.
 +
|-
 +
|'''GeneName'''
 +
|Current FlyBase name of the gene.
 +
|-
 +
|'''datestamp'''
 +
|Date on which the information was last reviewed.
 +
|-
 +
|'''gene_snapshot_text'''
 +
|Gene snapshot information for the gene. Cases that are in progress or are deemed to have insufficient data to summarize are stated as such.
 +
|-
 +
|}
 +
 +
====Unique protein isoforms (dmel_unique_protein_isoforms_fb_*.tsv.gz)====
 +
The file reports ''D. melanogaster'' genes and their unique protein isoforms.
  
====FBgn <=> GLEANR IDs (fbgn_gleanr_*.tsv)====
 
This file reports the relationship between the symbols and gene identifiers used by FlyBase for non-melanogaster genes identified by the AAA consortium, and the GLEANR identifier assigned to the gene during the initial annotation of the genome sequence.
 
 
 
The file includes:
 
The file includes:
* non-melanogaster genes located to the sequence
+
* melanogaster genes located to the sequence
  
 
it excludes:
 
it excludes:
* ''D. melanogaster'' genes
+
* melanogaster genes not located to the sequence
* non-melanogaster genes not located to the sequence
+
* non-melanogaster genes
  
 
File format:
 
File format:
Line 890: Line 1,105:
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''organism_abbreviation'''
+
|'''FBgn'''
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the gene.
+
|Current FlyBase identifier (FBgn#) of the gene.
 
|-
 
|-
|'''gene_symbol'''
+
|'''FB_gene_symbol'''
|Current FlyBase gene symbol.
+
|Current FlyBase gene symbol of the gene.
 
|-
 
|-
|'''primary_FBgn#'''
+
|'''representative_protein'''
|Current FlyBase identifier (FBgn#) of the gene.
+
|Current FlyBase protein symbol of the representative protein isoform.
 
|-
 
|-
|'''GLEANR_ID'''
+
|'''identical_protein(s)'''
|GLEANR identifier assigned by the AAA Consortium.
+
|Current FlyBase protein symbol(s) of identical protein isoforms.
 
|-
 
|-
 
|}
 
|}
  
  
 +
====Non-coding RNAs (JSON) (ncRNA_genes_fb_*.json.gz)====
 +
This file reports all ncRNAs with gene models supported by FlyBase in JSON format, as submitted to [http://rnacentral.org/ RNAcentral]. Pseudogenes are excluded. In addition to the symbols and IDs for ncRNAs, this file also includes their associated gene, genomic location, sequence, Sequence Ontology classification, etc.  The full schema for this file is available [https://github.com/RNAcentral/rnacentral-data-schema/blob/master/sections/ncrna.json here].
 +
 +
Note - from release FB2020_03 onward, this file reports only ncRNAs for D. melanogaster; earlier files include ncRNAs for D. ananassae, D. pseudoobscura pseudoobscura, D. simulans and D. virilis.
  
====FBgn <=> FBtr <=> FBpp IDs (fbgn_fbtr_fbpp_*.tsv)====
+
 
This file reports the relationship of gene identifiers used by FlyBase for sequence localized genes, and the identifiers used for the transcript and polypeptide products of these genes.
+
====Enzyme data (dmel_enzyme_data_fb_*.tsv.gz)====
 +
This file reports nomenclature and functional data (GO annotations, EC annotations, gene group membership) for ''D. melanogaster'' genes encoding enzymes, as defined by membership of the ENZYMES [https://flybase.org/reports/FBgg0001715 (FBgg0001715)] gene group. If a gene is a member of multiple enzyme gene groups, then that gene has separate entries for each group of which it is a member.
  
 
The file includes:
 
The file includes:
* genes located to the sequence
+
* melanogaster genes located to the sequence
  
 
it excludes:
 
it excludes:
* genes not located to the sequence
+
* melanogaster genes not located to the sequence
 +
* non-melanogaster genes
  
 
File format:
 
File format:
Line 921: Line 1,142:
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''FlyBase_FBgn'''
+
|'''group_id'''
|Current FlyBase identifier (FBgn#) of the gene.
+
|FlyBase gene group (FBgg) ID of the relevant terminal group within the ENZYMES (FBgg0001715) hierarchy (only terminal groups contain members).
 +
|-
 +
|'''group_name'''
 +
|FlyBase gene group (FBgg) name of relevant terminal group within the ENZYMES (FBgg0001715) hierarchy (only terminal groups contain members).
 +
|-
 +
|'''group_GO_ID'''
 +
|The GO molecular function term ID on the given gene group. Multiple entries are separated with a pipe.
 
|-
 
|-
|'''FlyBase_FBtr'''
+
|'''group_GO_name'''
|Current FlyBase identifier (FBtr#) of a transcript encoded by the gene listed in the preceding 'FlyBase_FBgn' column.
+
|The GO molecular function term name on the given gene group. Multiple entries are separated with a pipe.
 
|-
 
|-
|'''FlyBase_FBpp'''
+
|'''group_EC_number'''
|Current FlyBase identifier (FBpp#) of a polypeptide encoded by the transcript listed in the preceding 'FlyBase_FBtr' column, where this is relevant.
+
|The EC number on the given gene group, if present. (This is computed, corresponding to the EC cross-reference on the GO molecular function term.)
 
|-
 
|-
|}
+
|'''group_EC_name'''
 +
|The EC name on the given gene group, if present. (This is computed, corresponding to the EC cross-reference on the GO molecular function term.)
 +
|-
 +
|'''gene_id'''
 +
|The current FlyBase gene ID (FBgn) of the gene.
 +
|-
 +
|'''gene_symbol'''
 +
|The current FlyBase symbol of the gene.
 +
|-
 +
|'''gene_name'''
 +
|The current FlyBase name of the gene.
 +
|-
 +
|'''gene_EC_number'''
 +
|The EC number(s) associated with the gene, if present. Multiple entries are separated with a pipe. (This is computed, corresponding to the EC cross-reference(s) on any positive GO molecular function term(s) annotated to the gene.)
 +
|-
 +
|'''gene_EC_name'''
 +
|The EC name(s) associated with the gene, if present. Multiple entries are separated with a pipe. (This is computed, corresponding to the EC cross-reference(s) on any positive GO molecular function term(s) annotated to the gene.)
 +
|-
 +
|}
  
Notes:
+
===Gene Ontology annotation files (go)===
  
* Each row contains information about a single transcript and the polypeptide it encodes (if relevant). Thus if a gene encodes multiple isoforms, multiple rows will exist for that gene in the file.
+
Files described in this section are in the "go" subdirectory of the FTP site. Download the latest file using a query of this form:<br/>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/go/gene_association.fb.gz</nowiki></code><br/>
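Once downloaded, the gzipped file can be read without unpacking it first. The sketch below assumes the standard GAF column layout, in which column 2 is the DB Object ID and column 5 is the GO ID, and in which header lines start with "!". To keep the example self-contained it reads a small made-up, truncated sample compressed in memory rather than the real file.

```python
import gzip
import io
from collections import defaultdict


def go_terms_by_gene(handle):
    """Map DB Object ID (GAF column 2) to the set of GO IDs (GAF column 5)."""
    terms = defaultdict(set)
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # "!" marks header/comment lines in GAF
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5:
            terms[fields[1]].add(fields[4])
    return dict(terms)


# Made-up, truncated GAF-like rows standing in for the downloaded file.
fake_gaf = (
    "!gaf-version: 2.2\n"
    "FB\tFBgn0000001\tgeneA\tenables\tGO:0000001\n"
    "FB\tFBgn0000001\tgeneA\tinvolved_in\tGO:0000002\n"
    "FB\tFBgn0000002\tgeneB\tlocated_in\tGO:0000003\n"
)
compressed = io.BytesIO(gzip.compress(fake_gaf.encode("utf-8")))
with gzip.open(compressed, "rt", encoding="utf-8") as handle:
    go_by_gene = go_terms_by_gene(handle)
```

For the real download, the in-memory buffer would be replaced by the path to the file, for example <code>gzip.open("gene_association.fb.gz", "rt")</code>.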
  
 +
====Gene Association File - GAF (gene_association.fb.gz)====
  
 +
The file contains the [http://www.geneontology.org/ Gene Ontology] (GO) controlled vocabulary (CV) terms assigned to FlyBase genes.
  
====FBgn exons <=> Affy1 (fbgn_exons2affy1_overlaps.tsv)====
+
The file includes the following Dmel genes:
The file is generated by testing for overlaps, no matter how small, of the locations of Affy1 oligos in the genome with the locations of gene exons, as defined by the '''Dmel''' gene models for the current release of FlyBase. If the location of an Affy1 oligo shows any kind of overlap with an exon of a gene, a Gene=>Affy reference is recorded in this file.
 
  
The extent of the overlap has no influence on the inclusion of a cross-reference in this file. The overlap might be just one nucleotide, or it could be an exact match to the exon. For interpretation of the significance of a partial overlap please contact Affymetrix.
 
 
The file includes the following '''Dmel''' genes:
 
 
* nuclear genes located to the sequence
 
* nuclear genes located to the sequence
 
+
* mitochondrial genes
it excludes:
 
 
* genes not located to the sequence
 
* genes not located to the sequence
* mitochondrial genes
 
  
Notes:
+
The columns of the file are described in [[FlyBase:Gene Ontology (GO) Annotation|section G.3.1.]] of the Reference manual.
  
* Each line of the file '''can contain many''' tab separated columns:
+
====Gene Product Information - GPI (gp_information.fb.gz)====
  
* '''The first column of a line''' contains the valid [[FlyBase:RefMan_F.#FlyBase_Identifier_Numbers|FlyBase identifiers]] of a gene.
+
This file maps FlyBase Dmel protein coding genes to UniProtKB IDs, in the format specified by the [http://geneontology.org/docs/gene-product-information-gpi-format/ GO consortium].
* '''Subsequent columns:''' Each '''Affy1 ID''' that overlaps with an exon of the gene, as described above, is listed in an additional tab separated column. Thus, this file does not contain a predefined number of columns.
 
  
 +
===Gene groups===
  
 +
Files described in this section are in the "genes" subdirectory of the FTP site. Download the latest file using a query of this form:<br/>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/genes/gene_group_data_fb_*tsv.gz</nowiki></code><br/>
  
====FBgn exons <=> Affy2 (fbgn_exons2affy2_overlaps.tsv)====
+
====Gene group data (gene_group_data_fb_*.tsv)====
The file is generated from the location of Affy2 oligos exactly as [[FlyBase:FilesOverview#5.2.6|described for Affy1 oligos]] above.
+
This file reports Gene Groups in FlyBase, together with their hierarchical relationships (where relevant) and member genes. Note that as of FB2022_06, this file no longer contains Pathway groups, which can be found in a separate file (pathway_group_data_fb_*.tsv).
 
 
 
 
 
 
====Genes GO data (gene_association.fb)====
 
The file contains the [http://www.geneontology.org/ Gene Ontology] (GO) controlled vocabulary (CV) terms assigned to FlyBase genes.
 
 
 
The file includes the following Dmel genes:
 
 
 
* nuclear genes located to the sequence
 
* mitochondrial genes
 
* genes not located to the sequence
 
 
 
The columns of the file are described in [[FlyBase:Gene Ontology (GO) Annotation|section G.3.1.]] of the Reference manual.
 
 
 
 
 
 
 
====Genes map table (gene_map_table_*.tsv)====
 
The file reports available localization information for FlyBase genes.
 
 
 
It includes:
 
 
 
* nuclear genes located to the sequence
 
* mitochondrial genes
 
* genes not located to the sequence
 
  
 
File format:
 
File format:
Line 992: Line 1,212:
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''organism_abbreviation'''
+
|'''FB_group_id'''
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the gene.
+
|Current FlyBase identifier (FBgg##) of Gene Group.
 
|-
 
|-
|'''current_symbol'''
+
|'''FB_group_symbol'''
|Current FlyBase gene symbol.
+
|Current FlyBase symbol of Gene Group.
 +
|-
 +
|'''FB_group_name'''
 +
|Current FlyBase full name of Gene Group.
 
|-
 
|-
|'''primary_FBid'''
+
|'''Parent_FB_group_id'''
|Current FlyBase identifier (FBgn#) of gene.
+
|Current FlyBase identifier (FBgg##) of parent of given Gene Group (if relevant).
 
|-
 
|-
|'''recombination_loc'''
+
|'''Parent_FB_group_symbol'''
|recombination map location.
+
|Current FlyBase symbol of parent of given Gene Group (if relevant).
 
|-
 
|-
|'''cytogenetic_loc'''
+
|'''Group_member_FB_gene_id'''
|cytogenetic location.
+
|Current FlyBase identifier (FBgn##) of member gene (if terminal group).
 
|-
 
|-
|'''sequence_loc'''
+
|'''Group_member_FB_gene_symbol'''
|genomic location.
+
|Current FlyBase symbol of member gene (if terminal group).
 
|-
 
|-
 
|}
 
|}
  
 +
Notes:
  
 +
* Where groups are arranged into hierarchies:
 +
** the member genes are only associated with the terminal subgroups,
 +
** the immediate parent of any subgroup is identified in the 'Parent_FB_group_id' and 'Parent_FB_group_symbol' columns.
  
====Automated gene summaries (automated_gene_summaries.tsv)====
+
* Separate lines are used for each member gene, meaning that each terminal group is listed multiple times (equal to the number of member genes).
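Because member genes are attached only to terminal subgroups, listing every gene under a higher-level group means walking down the hierarchy. The sketch below assumes the seven-column order given in the table above, assumes comment lines start with "#", and assumes the hierarchy contains no cycles; all identifiers in the sample are invented.

```python
from collections import defaultdict


def load_groups(lines):
    """Return (children, members): parent ID -> child group IDs, group ID -> gene IDs."""
    children = defaultdict(set)
    members = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumed comment/header convention
        fields = line.split("\t")
        fields += [""] * (7 - len(fields))  # pad short rows
        group_id, parent_id, gene_id = fields[0], fields[3], fields[5]
        if parent_id:
            children[parent_id].add(group_id)
        if gene_id:
            members[group_id].add(gene_id)
    return children, members


def all_members(group_id, children, members):
    """Collect member genes of a group and of all its descendant subgroups."""
    found = set(members.get(group_id, ()))
    for child in children.get(group_id, ()):
        found |= all_members(child, children, members)
    return found


# Made-up groups and genes, for illustration only.
sample_lines = [
    "## made-up rows\n",
    "FBgg0000001\tTOP\tTop group\t\t\t\t\n",
    "FBgg0000002\tSUB-A\tSubgroup A\tFBgg0000001\tTOP\tFBgn0000001\tgeneA\n",
    "FBgg0000002\tSUB-A\tSubgroup A\tFBgg0000001\tTOP\tFBgn0000002\tgeneB\n",
    "FBgg0000003\tSUB-B\tSubgroup B\tFBgg0000001\tTOP\tFBgn0000003\tgeneC\n",
]
children, members = load_groups(sample_lines)
top_genes = all_members("FBgg0000001", children, members)
```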
The file contains the summaries found on gene report pages and the pop-ups in GBrowse and Interactions Browser in plain text.
 
  
It includes:
+
====Gene groups with HGNC IDs (gene_groups_HGNC_fb_*.tsv)====
 
+
This file reports all Gene Groups in FlyBase, together with the corresponding HGNC 'gene family' ID (where relevant).
* nuclear genes located to the sequence
 
* mitochondrial genes
 
* genes not located to the sequence
 
  
 
File format:
 
File format:
Line 1,029: Line 1,252:
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''-'''
+
|'''FB_group_id'''
|FlyBase ID. The Valid FlyBase identifier number for the gene.
+
|Current FlyBase identifier (FBgg##) of Gene Group.
 +
|-
 +
|'''FB_group_symbol'''
 +
|Current FlyBase symbol of Gene Group.
 +
|-
 +
|'''FB_group_name'''
 +
|Current FlyBase full name of Gene Group.
 
|-
 
|-
|'''-'''
+
|'''HGNC_family_ID'''
|The gene summary as a string of plain text.
+
|HGNC ID of equivalent human 'gene family'.
 
|-
 
|-
 
|}
 
|}
  
 +
Notes:
 +
 +
* The absence of an HGNC_family_ID entry indicates there is no equivalent HGNC gene family for that FlyBase gene group.
 +
 +
* Because of different sub-group structures (etc), a single HGNC family may be associated with multiple FlyBase gene groups.
 +
 +
* Similarly, a single FlyBase gene group may be associated with multiple HGNC gene families - these are shown on separate lines.
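The many-to-many relationship described in the notes above can be handled by collecting HGNC family IDs per gene group. The following is a minimal sketch, assuming the four columns appear in the order given in the table and that header or comment lines start with '#'; the sample rows and IDs are invented for illustration.

```python
from collections import defaultdict

# Hypothetical rows (FB_group_id, FB_group_symbol, FB_group_name, HGNC_family_ID).
SAMPLE = (
    "FBgg0000001\tGRP-A\tGROUP A\t100\n"
    "FBgg0000001\tGRP-A\tGROUP A\t200\n"
    "FBgg0000002\tGRP-B\tGROUP B\t\n"
    "FBgg0000003\tGRP-C\tGROUP C\t100\n"
)


def hgnc_families_by_group(text):
    """Map each FB_group_id to the set of HGNC family IDs listed for it."""
    families = defaultdict(set)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) '#' header lines
        fields = line.split("\t")
        group_id = fields[0]
        hgnc_id = fields[3] if len(fields) > 3 else ""
        families[group_id]  # register the group even if it has no HGNC ID
        if hgnc_id:
            families[group_id].add(hgnc_id)
    return dict(families)


mapping = hgnc_families_by_group(SAMPLE)
```

A group with an empty set has no equivalent HGNC family; a group with several entries was listed on several lines.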
  
  
====Gene Snapshots (gene_snapshots_*.tsv)====
+
====Pathway group data (pathway_group_data_fb_*.tsv)====
The file contains in plain text the gene snapshot information visible on gene report pages.
+
This file reports all Pathway Gene Groups in FlyBase, together with their hierarchical relationships (where relevant) and member genes.
 
It includes only Dmel protein coding genes.
 
  
 
File format:
 
File format:
Line 1,050: Line 1,285:
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''FBgn_ID'''
+
|'''FB_group_id'''
|Current FlyBase identifier number for the gene.
+
|Current FlyBase identifier (FBgg##) of Pathway Gene Group.
 +
|-
 +
|'''FB_group_symbol'''
 +
|Current FlyBase symbol of Pathway Gene Group.
 +
|-
 +
|'''FB_group_name'''
 +
|Current FlyBase full name of Pathway Gene Group.
 
|-
 
|-
|'''GeneSymbol'''
+
|'''Parent_FB_group_id'''
|Current FlyBase symbol of the gene.
+
|Current FlyBase identifier (FBgg##) of parent of given Pathway Gene Group (if relevant).
 
|-
 
|-
|'''GeneName'''
+
|'''Parent_FB_group_symbol'''
|Current FlyBase name of the gene.
+
|Current FlyBase symbol of parent of given Pathway Gene Group (if relevant).
 
|-
 
|-
|'''datestamp'''
+
|'''Group_member_FB_gene_id'''
|Date on which the information was last reviewed.
+
|Current FlyBase identifier (FBgn##) of member gene (if terminal group).
 
|-
 
|-
|'''gene_snapshot_text'''
+
|'''Group_member_FB_gene_symbol'''
|Gene snapshot information for the gene. Cases that are in progress or are deemed to have insufficient data to summarize are stated as such.
+
|Current FlyBase symbol of member gene (if terminal group).
 
|-
 
|-
 
|}
 
|}
  
 +
Notes:
  
 +
* Where pathway groups are arranged into hierarchies:
 +
** the member genes are only associated with the terminal pathway subgroups,
 +
** the immediate parent of any subgroup is identified in the 'Parent_FB_group_id' and 'Parent_FB_group_symbol' columns.
  
====Unique protein isoforms (dmel_unique_protein_isoforms_fb_*.tsv.gz)====
+
* Separate lines are used for each member gene, meaning that each terminal group is listed multiple times (equal to the number of member genes).
The file reports ''D. melanogaster'' genes and their unique protein isoforms.
 
  
The file includes:
+
===Alleles and Stocks===
* melanogaster genes located to the sequence
 
  
it excludes:
+
Files described in this section are in the "alleles" or "stocks" subdirectory of the FTP site. Download the latest file using a command of this form:</br>
* melanogaster genes not located to the sequence
+
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/alleles/allele_genetic_interactions_*tsv.gz</nowiki></code></br>
* non-melanogaster genes
+
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/stocks/stocks_*.tsv.gz</nowiki></code></br>
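The '*' in these file names stands for the release-specific part of the name. As a rough illustration of how such a pattern resolves to a concrete file, the sketch below matches a pattern against a directory listing with Python's standard fnmatch module. The listing and the release-style names in it are hypothetical, and the helper name is invented for this example.

```python
from fnmatch import fnmatch


def pick_release_file(filenames, pattern):
    """Return the last name (in sorted order) matching the wildcard pattern, or None."""
    matches = sorted(name for name in filenames if fnmatch(name, pattern))
    return matches[-1] if matches else None


# Hypothetical directory listing; real file names differ per release.
listing = [
    "README",
    "stocks_FB2099_01.tsv.gz",
    "allele_genetic_interactions_fb_2099_01.tsv.gz",
]

stocks_file = pick_release_file(listing, "stocks_*.tsv.gz")
```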
  
File format:
+
====Allele data (Chado XML)====
 +
The chado XML file generated from the FlyBase PostgreSQL database for the 'alleles' data class.
  
{| class= "wikitable"
 
!Column heading
 
!Content Description
 
|-
 
|'''FBgn'''
 
|Current FlyBase identifier (FBgn#) of the gene.
 
|-
 
|'''FB_gene_symbol'''
 
|Current FlyBase gene symbol of the gene.
 
|-
 
|'''representative_protein'''
 
|Current FlyBase protein symbol of the representative protein isoform.
 
|-
 
|'''identical_protein(s)'''
 
|Current FlyBase protein symbol(s) of identical protein isoforms.
 
|-
 
|}
 
  
 +
====Stock data (Chado XML)====
 +
The chado XML file generated from the FlyBase PostgreSQL database for the 'stocks' data class.
  
  
====Non-coding RNA genes (TSV) (ncRNA_genes_fb_*.tsv.gz)====
+
====Stock data (stocks_*.tsv.gz)====
This file reports all genes encoding ncRNAs for D. melanogaster and 11 other sequenced Drosophila species in TSV format. Pseudogenes are excluded.
+
This file reports genetic components and related information about Stocks in FlyBase.
 
 
Columns are:
 
  
 
File format:
 
File format:
Line 1,111: Line 1,338:
 
!Column heading
 
!Column heading
 
!Content Description
 
!Content Description
 +
!Example
 
|-
 
|-
|'''accession_id'''
+
|'''FBst'''
|INSDC accession ID.
+
|The unique identifier assigned to this stock by FlyBase.
|-
+
|FBst0000002
|'''FB_gene_ID'''
 
|Current FlyBase gene identifier (FBgn#).
 
 
|-
 
|-
|'''species_FB_annotation_ID(locus_tag)'''
+
|'''collection_short_name'''
|Current FlyBase annotation ID, in the form "<species abbreviation>_<annotation_ID>", which equates to the 'locus tag' field in INSDC records.
+
|A short name for the stock collection that holds the stock.
 +
|Bloomington
 +
|-
 +
|'''stock_type_cv'''
 +
|The controlled vocabulary term and unique identifier that describe the state of the stock.
 +
|living stock ; FBsv:0000002
 +
|-
 +
|'''species'''
 +
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of the stock.
 +
|Dmel
 +
|-
 +
|'''FB_genotype'''
 +
|Genetic components of the stock corresponding to alleles, aberrations, balancers, or insertions in FlyBase. May be empty.
 +
|w[*]; betaTub60D[2] Kr[If-1]/CyO
 +
|-
 +
|'''description'''
 +
|Genetic components of the stock as provided to FlyBase by the collection that holds the stock.
 +
|FlyTrap: ZCL1796 III
 +
|-
 +
|'''stock_number'''
 +
|The stock identifier provided to FlyBase by the collection that holds the stock. May be empty.
 +
|110818
 
|-
 
|-
 
|}
 
|}
  
  
 +
====Genetic interactions (allele_genetic_interactions_*.tsv)====
 +
The file reports controlled vocabulary (i.e. not free text) genetic interaction data associated with alleles. This is the data reported in the "Phenotypic Class" and "Phenotype Manifest in" subsections of the "Interactions" section of each Allele Report.
  
====Non-coding RNAs (JSON) (ncRNA_genes_fb_*.json.gz)====
+
File format:
This file reports all ncRNAs for D. melanogaster and 11 other sequenced Drosophila species in JSON format, as submitted to [http://rnacentral.org/ RNAcentral]. Pseudogenes are excluded.  In addition to the symbols and IDs for ncRNAs, this file also includes their associated gene, genomic location, sequence, Sequence Ontology classification, etc.  The full schema for this file is available [https://github.com/RNAcentral/rnacentral-data-schema/blob/master/sections/ncrna.json here].
 
 
 
===Gene groups===
 
====Gene group data (gene_group_data_*.tsv)====
 
This file reports all Gene Groups in FlyBase, together with their hierarchical relationships (where relevant) and member genes.
 
 
 
File format:
 
  
 
{| class= "wikitable"
 
{| class= "wikitable"
Line 1,138: Line 1,380:
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''FB_group_id'''
+
|'''allele_symbol'''
|Current FlyBase identifier (FBgg##) of Gene Group.
+
|Current FlyBase allele symbol.
 
|-
 
|-
|'''FB_group_symbol'''
+
|'''allele_FBal#'''
|Current FlyBase symbol of Gene Group.
+
|Current FlyBase identifier (FBal#) of allele.
 
|-
 
|-
|'''FB_group_name'''
+
|'''interaction'''
|Current FlyBase full name of Gene Group.
+
|Interaction information associated with allele.
 
|-
 
|-
|'''Parent_FB_group_id'''
+
|'''FBrf#'''
|Current FlyBase identifier (FBgg##) of parent of given Gene Group (if relevant).
+
|Current FlyBase identifier (FBrf#) of publication from which data came.
|-
 
|'''Parent_FB_group_symbol'''
 
|Current FlyBase symbol of parent of given Gene Group (if relevant).
 
|-
 
|'''Group_member_FB_gene_id'''
 
|Current FlyBase identifier (FBgn##) of member gene (if terminal group).
 
|-
 
|'''Group_member_FB_gene_symbol'''
 
|Current FlyBase symbol of member gene (if terminal group).
 
 
|-
 
|-
 
|}
 
|}
Line 1,163: Line 1,396:
 
Notes:
 
Notes:
  
* Where groups are arranged into hierarchies:
+
* Each row contains information about a single interaction from a single reference.  Thus if multiple genetic interactions have been reported for a given allele, or if multiple references report the same interaction for a given allele, multiple rows will exist for that allele in the file.
** the member genes are only associated with the terminal subgroups,
 
** the immediate parent of any subgroup is identified in the 'Parent_FB_group_id' and 'Parent_FB_group_symbol' columns.
 
 
 
* Separate lines are used for each member gene, meaning that each terminal group is listed multiple times (equal to the number of member genes).
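Because member genes are attached only to terminal subgroups, listing every gene under a higher-level group means walking down the hierarchy. The following is a minimal sketch, assuming the seven columns appear in the order given in the table, that each group has at most one immediate parent, and that non-terminal groups have empty member columns; all IDs shown are invented.

```python
from collections import defaultdict


def build_hierarchy(rows):
    """Return (parent_of, members) from rows of the seven documented columns."""
    parent_of = {}
    members = defaultdict(set)
    for row in rows:
        group_id, parent_id, gene_id = row[0], row[3], row[5]
        if parent_id:
            parent_of[group_id] = parent_id
        if gene_id:
            members[group_id].add(gene_id)
    return parent_of, dict(members)


def all_member_genes(group_id, parent_of, members):
    """Collect the genes of a group and of all of its descendant subgroups."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    genes, stack = set(), [group_id]
    while stack:
        current = stack.pop()
        genes |= members.get(current, set())
        stack.extend(children.get(current, []))
    return genes


# Hypothetical rows: one parent group with two terminal subgroups.
rows = [
    ["FBgg0000001", "TOP", "TOP GROUP", "", "", "", ""],
    ["FBgg0000002", "SUB-A", "SUBGROUP A", "FBgg0000001", "TOP", "FBgn0000001", "geneA"],
    ["FBgg0000002", "SUB-A", "SUBGROUP A", "FBgg0000001", "TOP", "FBgn0000002", "geneB"],
    ["FBgg0000003", "SUB-B", "SUBGROUP B", "FBgg0000001", "TOP", "FBgn0000003", "geneC"],
]
parent_of, members = build_hierarchy(rows)
top_genes = all_member_genes("FBgg0000001", parent_of, members)
```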
 
  
  
 +
====Phenotypic data (genotype_phenotype_data_*.tsv)====
  
====Gene groups with HGNC IDs (gene_groups_HGNC_*.tsv)====
+
The file reports controlled vocabulary (i.e. not free text) phenotypic data associated with genotypes. This is the data reported in the [[FlyBase:Allele Report#Phenotypic_Class|Phenotypic Class]] and [[FlyBase:Allele Report#Phenotype_Manifest_In|Phenotype Manifest in]] subsections of the [[FlyBase:Allele Report#Phenotypic_Data|Phenotypic Data]] section of each Allele Report.
This file reports all Gene Groups in FlyBase, together with the corresponding HGNC 'gene family' ID (where relevant).
 
  
 
File format:
 
File format:
Line 1,180: Line 1,409:
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''FB_group_id'''
+
|'''genotype_symbols'''
|Current FlyBase identifier (FBgg##) of Gene Group.
+
|Current FlyBase symbol(s) of the components that make up the genotype.
 +
|-
 +
|'''genotype_FBids'''
 +
|Current FlyBase identifier(s) of the components that make up the genotype.
 +
|-
 +
|'''phenotype_name'''
 +
|Phenotypic name associated with the genotype.
 +
|-
 +
|'''phenotype_id'''
 +
|Phenotypic identifier associated with the genotype.
 
|-
 
|-
|'''FB_group_symbol'''
+
|'''qualifier_names'''
|Current FlyBase symbol of Gene Group.
+
|Qualifier name(s) associated with phenotypic data for genotype.
 
|-
 
|-
|'''FB_group_name'''
+
|'''qualifier_ids'''
|Current FlyBase full name of Gene Group.
+
|Qualifier identifier(s) associated with phenotypic data for genotype.
 
|-
 
|-
|'''HGNC_family_ID'''
+
|'''reference'''
|HGNC ID of equivalent human 'gene family'.
+
|Current FlyBase identifier (FBrf#) of publication from which data came.
 
|-
 
|-
 
|}
 
|}
  
Notes:
+
Notes:  
  
* The absence of an HGNC_family_ID entry indicates there is no equivalent HGNC gene family for that FlyBase gene group.
+
* Each row contains information about a single phenotype from a single reference.  Thus if multiple phenotypes have been reported for a given genotype, or if multiple references report the same phenotype for a given genotype, multiple rows will exist for that genotype in the file.
  
* Because of different sub-group structures (etc), a single HGNC family may be associated with multiple FlyBase gene groups.
+
* Where the genotype contains more than one component, the components are separated as follows (columns 1 and 2):
  
* Similarly, a single FlyBase gene group may be associated with multiple HGNC gene families - these are shown on separate lines.
+
** Homozygous or transheterozygous combinations of classical/insertional alleles at a single locus are separated by a '/'.
  
 +
** Hemizygous combinations affecting a single locus (classical/insertional allele over a deficiency for that locus) are separated by a '/'.
  
 +
** Heterozygosity for a classical/insertional allele or aberration is represented by '/+'.
  
===Alleles and Stocks===
+
** In all other cases, other genotype components (e.g. drivers, transgenic alleles) are separated by a space.
====Allele data (Chado XML)====
 
  
====Stock data (Chado XML)====
+
* Where multiple qualifiers are used to add information to phenotypic data, these are separated by a pipe '|' (columns 5 and 6).
  
====Stock data (stocks_*.tsv.gz)====
+
* Where a column can hold multiple entries, the order and separators of the symbols and of the IDs are preserved within each column pair, i.e. columns 1 and 2 for genotype components and columns 5 and 6 for qualifiers.
This file reports genetic components and related information about Stocks in FlyBase.
+
 
 +
 
 +
*Note: this file replaces 'allele_phenotypic_data_*.tsv' from FB2023_01 onward.
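The separator rules in the notes above can be applied with plain string splitting. The following is a minimal sketch, assuming that symbols themselves contain no spaces, '/' or '|' characters (not guaranteed for every FlyBase symbol); the function names, symbols and IDs are invented for illustration.

```python
def split_genotype(field):
    """Split a genotype field into components, each a list of '/'-separated parts."""
    return [component.split("/") for component in field.split(" ") if component]


def pair_qualifiers(names_field, ids_field):
    """Pair qualifier names (column 5) with qualifier IDs (column 6) by position."""
    names = names_field.split("|") if names_field else []
    ids = ids_field.split("|") if ids_field else []
    if len(names) != len(ids):
        raise ValueError("qualifier names and IDs are out of step")
    return list(zip(names, ids))


# Hypothetical values for columns 1, 2, 5 and 6 of a single row.
symbols = split_genotype("a[1]/a[2] b[3]/+")
fbids = split_genotype("FBal0000001/FBal0000002 FBal0000003/+")
qualifiers = pair_qualifiers("qualifier one|qualifier two", "FBcv:0000001|FBcv:0000002")
```

Because order and separators are preserved between the paired columns, the nth part of `symbols` corresponds to the nth part of `fbids`.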
 +
 
 +
====Alleles <=> Genes (fbal_to_fbgn_fb_*.tsv)====
 +
This file reports the relationship between gene identifiers and the identifiers used for alleles of these genes.
  
 
File format:
 
File format:
Line 1,217: Line 1,461:
 
!Column heading
 
!Column heading
 
!Content Description
 
!Content Description
!Example
 
 
|-
 
|-
|'''FBst'''
+
|'''AlleleID'''
|The unique identifier assigned to this stock by FlyBase.
+
|Current FlyBase identifier (FBal#) of the allele.
|FBst0000002
 
 
|-
 
|-
|'''collection_short_name'''
+
|'''AlleleSymbol'''
|A short name for the stock collection that holds the stock.
+
|Current symbol of the allele.
|Bloomington
 
 
|-
 
|-
|'''stock_type_cv'''
+
|'''GeneID'''
|The controlled vocabulary term and unique identifier that describe the state of the stock.
+
|Current FlyBase identifier (FBgn#) of the gene.
|living stock ; FBsv:0000002
 
 
|-
 
|-
|'''species'''
+
|'''GeneSymbol'''
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of the stock.
+
|Current symbol of the gene.
|Dmel
 
 
|-
 
|-
|'''FB_genotype'''
+
|}
|Genetic components of the stock corresponding to alleles, aberrations, balancers, or insertions in FlyBase. May be empty.
+
 
|w[*]; betaTub60D[2] Kr[If-1]/CyO
+
===Homologs===
|-
 
|'''description'''
 
|Genetic components of the stock as provided to FlyBase by the collection that holds the stock.
 
|FlyTrap: ZCL1796 III
 
|-
 
|'''stock_number'''
 
|The stock identifier provided to FlyBase by the collection that holds the stock. May be empty.
 
|110818
 
|-
 
|}
 
  
 +
Files described in this section are in the "orthologs" subdirectory of the FTP site. Download the latest file using a command of this form:</br>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/orthologs/dmel_paralogs_fb_*.tsv.gz</nowiki></code></br>
  
 
+
====Drosophila Paralogs (dmel_paralogs_fb_*.tsv.gz)====
====Genetic interactions (allele_genetic_interactions_*.tsv)====
+
The file reports ''D. melanogaster'' genes and their paralogs, as provided by DIOPT. (The version of DIOPT currently being used is shown in the 'Paralogs' -> 'Paralogs (via DIOPT)' section of a Gene Report.)
The file reports controlled vocabulary (i.e. not free text) genetic interaction data associated with alleles. This is the data reported in the "Phenotypic Class" and "Phenotype Manifest in" subsections of the "Interactions" section of each Allele Report.
 
  
 
File format:
 
File format:
Line 1,260: Line 1,490:
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''allele_symbol'''
+
|'''FBgn_ID'''
|Current FlyBase allele symbol.
+
|Current FlyBase identifier (FBgn#) of the ''D. melanogaster'' gene.
 
|-
 
|-
|'''allele_FBal#'''
+
|'''GeneSymbol'''
|Current FlyBase identifier (FBal#) of allele.
+
|Current FlyBase gene symbol of the ''D. melanogaster'' gene.
 
|-
 
|-
|'''interaction'''
+
|'''Arm/Scaffold'''
|Interaction information associated with allele.
+
|Arm upon which the ''D. melanogaster'' gene is localized.
 
|-
 
|-
|'''FBrf#'''
+
|'''Location'''
|Current FlyBase identifier (FBrf#) of publication from which data came.
+
|Location of ''D. melanogaster'' gene on the arm.
 +
|-
 +
|'''Strand'''
 +
|Strand of ''D. melanogaster'' gene ('1' indicates the positive strand, '-1' indicates the negative strand).
 +
|-
 +
|'''Paralog_FBgn_ID'''
 +
|Current FlyBase identifier (FBgn#) of the paralogous gene.
 +
|-
 +
|'''Paralog_GeneSymbol'''
 +
|Current FlyBase gene symbol of the paralogous gene.
 +
|-
 +
|'''Paralog_Arm/Scaffold'''
 +
|Arm upon which the paralogous gene is localized.
 +
|-
 +
|'''Paralog_Location'''
 +
|Location of paralogous gene on the arm.
 +
|-
 +
|'''Paralog_Strand'''
 +
|Strand of paralogous gene ('1' indicates the positive strand, '-1' indicates the negative strand).
 +
|-
 +
|'''DIOPT_score'''
 +
|DIOPT 'score' for the paralog call (i.e. the number of individual algorithms that support the call).
 
|-
 
|-
 
|}
 
|}
Line 1,276: Line 1,527:
 
Notes:
 
Notes:
  
* Each row contains information about a single interaction from a single reference. Thus if multiple genetic interactions have been reported for a given allele, or if multiple references report the same interaction for a given allele, multiple rows will exist for that allele in the file.
+
* Each row is a pair-wise association between a given ''D. melanogaster'' gene and a paralog. Thus, two rows exist for each paralogous pair in the file.
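Since every paralogous pair is listed once in each direction, counting pairs requires collapsing the reciprocal rows. The following is a minimal sketch, assuming the eleven columns appear in the order given in the table (so FBgn_ID is the first field and Paralog_FBgn_ID the sixth); the rows are invented for illustration.

```python
def unique_paralog_pairs(rows):
    """Collapse reciprocal rows into a set of unordered (FBgn, FBgn) pairs."""
    pairs = set()
    for row in rows:
        gene_id, paralog_id = row[0], row[5]
        pairs.add(tuple(sorted((gene_id, paralog_id))))
    return pairs


# Hypothetical reciprocal rows for one paralogous pair.
rows = [
    ["FBgn0000001", "geneA", "2L", "1..100", "1",
     "FBgn0000002", "geneB", "3R", "5..90", "-1", "4"],
    ["FBgn0000002", "geneB", "3R", "5..90", "-1",
     "FBgn0000001", "geneA", "2L", "1..100", "1", "4"],
]
pairs = unique_paralog_pairs(rows)
```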
  
 
+
====Human Orthologs (dmel_human_orthologs_disease_fb_*.tsv.gz)====
 
+
This file reports the human orthologs of ''D. melanogaster'' genes using the DIOPT dataset. Each line reports a single orthologous pair, so each human or ''D. melanogaster'' gene can appear on multiple lines. Note that ortholog calls supported by only 1 or 2 algorithms (DIOPT score < 3) have been removed. Human genes are also associated with diseases (OMIM phenotypes) using the OMIM dataset.
====Phenotypic data (allele_phenotypic_data_*.tsv)====
 
The file reports controlled vocabulary (i.e. not free text) phenotypic data associated with alleles. This is the data reported in the "Phenotypic Class" and "Phenotype Manifest in" subsections of the "Phenotypic Data" section of each Allele Report.
 
  
 
File format:
 
File format:
Line 1,289: Line 1,538:
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''allele_symbol'''
+
|'''Dmel_gene_ID'''
|Current FlyBase allele symbol.
+
|Current FlyBase identifier (FBgn#) of the ''D. melanogaster'' gene.
 
|-
 
|-
|'''allele_FBal#'''
+
|'''Dmel_gene_symbol'''
|Current FlyBase identifier (FBal#) of allele.
+
|Current FlyBase gene symbol of the ''D. melanogaster'' gene.
 
|-
 
|-
|'''phenotype'''
+
|'''Human_gene_HGNC_ID'''
|Phenotypic data associated with allele.
+
|HGNC ID of orthologous human gene.
 
|-
 
|-
|'''FBrf#'''
+
|'''Human_gene_OMIM_ID'''
|Current FlyBase identifier (FBrf#) of publication from which data came.
+
|OMIM ID of orthologous human gene.
 +
|-
 +
|'''Human_gene_symbol'''
 +
|HGNC gene symbol of orthologous human gene.
 +
|-
 +
|'''DIOPT_score'''
 +
|DIOPT 'score' for orthology call (i.e. the number of individual algorithms that support the call).
 +
|-
 +
|'''OMIM_Phenotype_IDs'''
 +
|OMIM Phenotype ID of orthologous human gene (comma separated values).
 +
|-
 +
|'''OMIM_Phenotype_IDs[name]'''
 +
|OMIM Phenotype ID of orthologous human gene (with the corresponding OMIM name in square brackets). Multiple phenotype[name] entries are separated by a comma.
 
|-
 
|-
 
|}
 
|}
  
Notes:
+
===Human disease===
 
 
* Each row contains information about a single phenotype from a single reference.  Thus if multiple phenotypes have been reported for a given allele, or if multiple references report the same phenotype for a given allele, multiple rows will exist for that allele in the file.
 
  
 +
Files described in this section are in the "human_disease" subdirectory of the FTP site. Download the latest file using a command of this form:</br>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/human_disease/disease_model_annotations_fb_*.tsv.gz</nowiki></code></br>
  
 
+
====Human disease model data (disease_model_annotations_fb_*.tsv.gz)====
====Alleles <=> Genes (fbal_to_fbgn_fb_*.tsv)====
+
This file reports (i) all experimental-based disease model annotations, associated with alleles; and (ii) all 'potential' disease models based on orthology to human disease genes in OMIM (see [http://flybase.org/reports/FBrf0241599 FBrf0241599] for more information on this pipeline) for ''D. melanogaster''. 'Alleles' encompass both classical alleles and transgenic alleles; the latter may relate to transgenic constructs of ''D. melanogaster'' genes or non-''D. melanogaster'' genes (often human genes) inserted into the ''D. melanogaster'' genome.  These disease model annotations are reported in the "Human Disease Model Data" -> "Disease Ontology (DO) Annotations" section of the Gene and Allele Reports.
This file reports the relationship between gene identifiers and the identifiers used for alleles of these genes.
 
  
 
File format:
 
File format:
Line 1,318: Line 1,578:
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''AlleleID'''
+
|'''FBgn ID'''
|Current FlyBase identifier (FBal#) of the allele.
+
|Current FlyBase identifier (FBgn#) of the gene associated with the allele of an experimental annotation, or the D. melanogaster ortholog of a human gene associated with a disease in OMIM.
 +
|-
 +
|'''Gene symbol'''
 +
|Current FlyBase symbol of the gene in column 1.
 
|-
 
|-
|'''AlleleSymbol'''
+
|'''HGNC ID'''
|Current symbol of the allele.
+
|HGNC ID of the gene identified in column 1 where it is a human gene (experimental-based annotations only).
 
|-
 
|-
|'''GeneID'''
+
|'''DO qualifier'''
|Current FlyBase identifier (FBgn#) of the gene.
+
|Type of association between the object of annotation and the disease - one of 'model of', 'ameliorates', 'exacerbates', 'DOES NOT model', 'DOES NOT ameliorate' or 'DOES NOT exacerbate'.
 
|-
 
|-
|'''GeneSymbol'''
+
|'''DO ID'''
|Current symbol of the gene.
+
|Disease Ontology (DO) ID.
 +
|-
 +
|'''DO term'''
 +
|Disease Ontology (DO) term.
 +
|-
 +
|'''Allele used in model (FBal ID)'''
 +
|Current FlyBase identifier (FBal#) of allele (experimental-based annotations only).
 +
|-
 +
|'''Allele used in model (symbol)'''
 +
|Current FlyBase symbol of allele (experimental-based annotations only).
 +
|-
 +
|'''Based on orthology with (HGNC ID)'''
 +
|HGNC ID of the human ortholog used for annotations based on orthology to human disease genes.
 +
|-
 +
|'''Based on orthology with (symbol)'''
 +
|HGNC gene symbol of the human ortholog used for annotations based on orthology to human disease genes.
 +
|-
 +
|'''Evidence/interacting alleles'''
 +
|Evidence code, with interacting allele(s) where appropriate. For experimental-based annotations, the evidence code is one of: 'inferred from mutant phenotype', 'in combination with', 'modeled by', 'is ameliorated by', 'is exacerbated by', 'is NOT ameliorated by' or 'is NOT exacerbated by'.  Interacting alleles are given as 'FLYBASE:<allele_symbol>; FB:<FBal_ID>', with multiple alleles separated by a comma.  For orthology-based annotations, the evidence code is 'inferred from electronic annotation'.
 +
|-
 +
|'''Reference (FBrf ID)'''
 +
|Current FlyBase identifier (FBrf#) of the source publication.
 
|-
 
|-
 
|}
 
|}
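The 'Evidence/interacting alleles' column combines an evidence code with zero or more allele references in the 'FLYBASE:<allele_symbol>; FB:<FBal_ID>' form. The following is a minimal sketch of one way to separate them with a regular expression. It assumes the evidence code is the text preceding the first 'FLYBASE:' token, which is an assumption about the exact layout; the sample values are invented.

```python
import re

# Each interacting allele is written as 'FLYBASE:<allele_symbol>; FB:<FBal_ID>'.
ALLELE_RE = re.compile(r"FLYBASE:(.+?); FB:(FBal\d+)")


def parse_evidence(field):
    """Return (evidence_code, [(allele_symbol, FBal_ID), ...]) for one field."""
    alleles = ALLELE_RE.findall(field)
    code = field.split("FLYBASE:", 1)[0].strip()
    return code, alleles


# Hypothetical field values.
with_alleles = parse_evidence(
    "is ameliorated by FLYBASE:geneA[1]; FB:FBal0000001,"
    "FLYBASE:geneB[2]; FB:FBal0000002"
)
without_alleles = parse_evidence("inferred from mutant phenotype")
```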
  
 +
====Human Orthologs (dmel_human_orthologs_disease_fb_*.tsv.gz)====
  
 +
This file reports the human orthologs of ''D. melanogaster'' genes using the DIOPT dataset. Each line reports a single orthologous pair, so each human or ''D. melanogaster'' gene can appear on multiple lines. Note that ortholog calls supported by only 1 or 2 algorithms (DIOPT score < 3) have been removed. Human genes are also associated with diseases (OMIM phenotypes) using the OMIM dataset.
 +
 +
This is identical to the file of the same name listed under the 'Orthologs' section above.
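In the 'OMIM_Phenotype_IDs[name]' column of this file, each entry is an OMIM phenotype ID followed by its name in square brackets, and entries are separated by commas. Because OMIM names can themselves contain commas, splitting on ',' alone is unreliable. The following is a minimal sketch using a regular expression instead, assuming that IDs are purely numeric and that names contain no ']' character; the sample IDs and names are invented.

```python
import re

# One entry looks like '<OMIM phenotype ID>[<OMIM name>]'.
ENTRY_RE = re.compile(r"(\d+)\[([^\]]*)\]")


def parse_omim_phenotypes(field):
    """Return a list of (phenotype_id, phenotype_name) tuples from one field."""
    return ENTRY_RE.findall(field)


# Hypothetical field value with a comma inside the first name.
phenotypes = parse_omim_phenotypes(
    "111111[Example syndrome, type 1],222222[Another condition]"
)
```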
 +
 +
===Organisms===
  
===Orthologs===
+
Files described in this section are in the "species" subdirectory of the FTP site. Download the latest file using a command of this form:</br>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/species/organism_list_fb*.tsv.gz</nowiki></code></br>
  
====Drosophila Orthologs (dmel_orthologs_in_drosophila_species_fb_*.tsv.gz)====
+
====Species list (organism_list_*.tsv.gz)====
The file reports ''D. melanogaster'' genes and their orthologs in other sequenced Drosophila genomes, as determined by OrthoDB.  (The version of OrthoDB currently being used is shown in the 'Orthologs' -> 'Orthologs (via OrthoDB)' section of a Gene Report.)
 
  
The file includes:
+
This file lists all the species for which FlyBase has some information.
* nuclear genes located to the sequence
+
 
 +
FlyBase includes gene reports for genes derived from species within the family Drosophilidae, as well as gene reports for non-drosophilid genes that have been introduced into a Drosophila genome via either transposable-element based transgenic constructs or via targeted insertion of DNA by a technique such as homologous recombination or CRISPR/Cas9. In this case, there will be a species 'Abbreviation' in the table, a standard prefix that is used in FlyBase as the first part of the symbol (before the '\') of any object, e.g. a gene or allele, that originates from this species.
 +
 
 +
In addition, information about non-drosophilid species is also included in orthology data that is displayed on gene reports and on G/JBrowse. In this case, a species 'Abbreviation' is not automatically generated in the database for the species, and thus the column in the table may be blank.
 +
 
 +
The file thus includes information for both Drosophilid and non-Drosophilid species.
  
it excludes:
 
* genes not located to the sequence
 
* mitochondrial genes
 
  
File format:
+
File format:  
  
 
{| class= "wikitable"
 
{| class= "wikitable"
Line 1,352: Line 1,644:
 
!Content Description
 
!Content Description
 
|-
 
|-
|'''FBgn_ID'''
+
|'''Genus'''
|Current FlyBase identifier (FBgn#) of the ''D. melanogaster'' gene.
+
|The genus designation of the organism.
 
|-
 
|-
|'''GeneSymbol'''
+
|'''Species name'''
|Current FlyBase gene symbol of the ''D. melanogaster'' gene.
+
|The species designation of the organism.
 
|-
 
|-
|'''Arm/Scaffold'''
+
|'''Abbreviation'''
|Arm upon which the ''D. melanogaster'' gene is localized.
+
|The standard FlyBase prefix for the species. This abbreviation is used in FlyBase as the first part of the symbol (before the '\') of any object, e.g. a gene or allele, that originates from this species. This column may be blank if no individual report page exists for that species in FlyBase.
 
|-
 
|-
|'''Location'''
+
|'''Common name'''
|Location of ''D. melanogaster'' gene on the arm.
+
|The [https://www.ncbi.nlm.nih.gov/taxonomy/ NCBI Taxonomy Database] common name of the organism. This column may be blank.
 
|-
 
|-
|'''Strand'''
+
|'''Ncbi-taxon-id'''
|Strand of ''D. melanogaster'' gene ('1' indicates the positive strand, '-1' indicates the negative strand).
+
|The [https://www.ncbi.nlm.nih.gov/taxonomy/ NCBI Taxonomy Database] Taxon ID for the organism. This column may be blank.
|-
 
|'''Ortholog_FBgn_ID'''
 
|Current FlyBase identifier (FBgn#) of the non-melanogaster orthologous gene.
 
 
|-
 
|-
|'''Ortholog_GeneSymbol'''
+
|'''drosophilid'''
|Current FlyBase gene symbol of the non-melanogaster orthologous gene.
+
|If the species is from the family Drosophilidae, this column is filled in with 'y'.
|-
 
|'''Ortholog_Arm/Scaffold'''
 
|Arm upon which the non-melanogaster orthologous gene is localized.
 
|-
 
|'''Ortholog_Location'''
 
|Location of non-melanogaster orthologous gene on the arm.
 
|-
 
|'''Ortholog_Strand'''
 
|Strand of non-melanogaster orthologous gene ('1' indicates the positive strand, '-1' indicates the negative strand).
 
|-
 
|'''OrthoDB_Group_ID'''
 
|OrthoDB orthology group ID to which the pair-wise association belongs.
 
 
|-
 
|-
 
|}
 
|}
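The 'drosophilid' column gives a simple way to restrict the species list to the family Drosophilidae. The following is a minimal sketch, assuming the six columns appear in the order given in the table, that header or comment lines start with '#', and that the last column is empty for non-drosophilid species; the two sample rows are illustrative only.

```python
# Illustrative rows: Genus, Species name, Abbreviation, Common name,
# Ncbi-taxon-id, drosophilid.
SAMPLE = (
    "Drosophila\tmelanogaster\tDmel\tfruit fly\t7227\ty\n"
    "Homo\tsapiens\tHsap\thuman\t9606\t\n"
)


def drosophilid_species(text):
    """Return (binomial name, abbreviation) for rows flagged 'y' in the last column."""
    result = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) '#' header lines
        fields = line.split("\t")
        fields += [""] * (6 - len(fields))  # pad rows with missing trailing columns
        if fields[5].strip() == "y":
            result.append((f"{fields[0]} {fields[1]}", fields[2]))
    return result


species = drosophilid_species(SAMPLE)
```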
  
Notes:
+
===Ontology Terms===
  
* Each row is a pair-wise association between a ''D. melanogaster'' gene and a non-melanogaster ortholog. Thus, multiple rows exist for each ''D. melanogaster'' gene in the file.
+
The [http://{{flybaseorg}}/static_pages/docs/refman/refman-G.html#G.2. ontology files] used by FlyBase are in the [http://www.geneontology.org/GO.format.shtml#oboflat OBO format] used by the [http://www.obofoundry.org/ Open Biomedical Ontology] group, and may be viewed using the free [http://www.oboedit.org/ OBO-Edit] tool.
  
 +
Ontologies undergo continual development. Links are provided to the 'frozen versions' used for the current release of FlyBase, together with links to the current 'live' versions at external sites.
  
 +
====Frozen files used for this release of FlyBase====
  
 +
List of ontologies available for download:
 +
 +
* FBbt: fly_anatomy
 +
* FBdv: fly_development
 +
* FBcv: flybase controlled vocabulary
 +
* FBsv: stock ontology
 +
* GO: gene ontology
 +
* FBbi: image ontology
 +
* SO: sequence ontology
 +
* DO: human disease ontology
  
====Human Orthologs (dmel_human_orthologs_disease_fb_*.tsv.gz)====
 
This file reports the human orthologs of ''D. melanogaster'' genes using the DIOPT dataset. Each line reports a single orthologous pair, so each human or ''D. melanogaster'' gene can appear on multiple lines. Note that ortholog calls supported by only 1 or 2 algorithms (DIOPT score < 3) have been removed. Human genes are also associated with diseases (OMIM phenotypes) using the OMIM dataset.
 
  
File format:
+
====Current 'Live' Files====
  
{| class= "wikitable"
+
List of ontologies available for download:
!Column heading
 
!Content Description
 
|-
 
|'''Dmel_gene_ID'''
 
|Current FlyBase identifier (FBgn#) of the ''D. melanogaster'' gene.
 
|-
 
|'''Dmel_gene_symbol'''
 
|Current FlyBase gene symbol of the ''D. melanogaster'' gene.
 
|-
 
|'''Human_gene_HGNC_ID'''
 
|HGNC ID of orthologous human gene.
 
|-
 
|'''Human_gene_OMIM_ID'''
 
|OMIM ID of orthologous human gene.
 
|-
 
|'''Human_gene_symbol'''
 
|HGNC gene symbol of orthologous human gene.
 
|-
 
|'''DIOPT_score'''
 
|DIOPT 'score' for orthology call (i.e. the number of individual algorithms that support the call).
 
|-
 
|'''OMIM_Phenotype_IDs'''
 
|OMIM Phenotype ID of orthologous human gene (comma separated values).
 
|-
 
|'''OMIM_Phenotype_IDs[name]'''
 
|OMIM Phenotype ID of orthologous human gene (with the corresponding OMIM name in square brackets). Multiple phenotype[name] entries are separated by a comma.
 
|-
 
|}
 
  
 +
* FBbt: fly_anatomy
 +
''Note: link points to the ontology version fbbt-simple.obo, which lacks a few minor FlyBase specific changes that are present in the 'fly_anatomy.obo' version''
 +
* FBdv: fly_development
 +
''Note: link points to the ontology version fbdv-simple.obo, which lacks a few minor FlyBase specific changes that are present in the 'fly_development.obo' version''
 +
* FBcv: flybase controlled vocabulary
 +
''Note: link points to the ontology version fbcv-simple.obo, which lacks a few minor FlyBase specific changes that are present in the 'flybase_controlled_vocabulary.obo' version''
 +
* FBsv: stock ontology
 +
* GO: gene ontology
 +
* FBbi: image ontology
 +
* SO: sequence ontology
 +
* DO: human disease ontology
  
  
===Human disease===
+
===Genomes: Annotation and Sequence===
 +
====All Sequenced Drosophila Species====
 +
 
 +
Links are available to the following FTP repositories:
 +
 
 +
* Current FTP repository
 +
* Current FastA repository
 +
* Current GFF repository
 +
 
 +
* FTP archive (previous releases)
 +
 
 +
* Current list of individual FASTA files
 +
* Current list of individual GFF files
 +
 
 +
 
 +
====Individual Sequenced Drosophila Species====
  
====Human disease model data (allele_human_disease_model_data_fb_*.tsv.gz)====
+
From release FB2020_03 onward, the above links are available for downloading only D. melanogaster data.
This file reports all experimental-based disease model annotations, associated with alleles, that have been curated for ''D. melanogaster''. 'Alleles' encompasses both classical alleles and transgenic alleles; the latter may relate to transgenic constructs of ''D. melanogaster'' genes or non-''D. melanogaster'' genes, often human genes.  These are the data reported in the "Human Disease Model Data" -> "Disease Ontology" section of the Allele Report, which are repeated in the "Human Disease Model Data" -> "Alleles Reported to Model Human Disease (Disease Ontology)" section of the Gene Report.
 
  
File format:
+
For releases FB2018_06 to FB2020_02, the above links are available for the following sequenced Drosophila species:
  
 
{| class= "wikitable"
 
{| class= "wikitable"
!Column heading
+
!Species name
!Content Description
+
!Abbreviation
 
|-
 
|-
|'''FBal_ID'''
+
|Drosophila melanogaster
|Current FlyBase identifier (FBal#) of allele.
+
|Dmel
 
|-
 
|-
|'''AlleleSymbol'''
+
|Drosophila ananassae
|Current FlyBase symbol of allele.
+
|Dana
 
|-
 
|-
|'''DOID_qualifier'''
+
|Drosophila pseudoobscura pseudoobscura
|Annotation qualifier - one of 'model of', 'ameliorates', 'exacerbates', 'DOES NOT model', 'DOES NOT ameliorate' or 'DOES NOT exacerbate'.
+
|Dpse
 
|-
 
|-
|'''DOID_term'''
+
|Drosophila simulans
|Disease Ontology term.
+
|Dsim
 
|-
 
|-
|'''DOID_ID'''
+
|Drosophila virilis
|Disease Ontology ID.
+
|Dvir
|-
 
|'''Evidence/interacting_alleles'''
 
|Evidence code, with interacting allele(s) where appropriate. Evidence code is one of: 'inferred from mutant phenotype', 'in combination with', 'modeled by', 'is ameliorated by', 'is exacerbated by', 'is NOT ameliorated by' or 'is NOT exacerbated by'.  Interacting alleles are given as 'FLYBASE:<allele_symbol>; FB:<FBal_ID>', with multiple alleles separated by a comma.
 
|-
 
|'''Reference_FBid'''
 
|Current FlyBase identifier (FBrf#) of the publication from which the data came.
 
 
|-
 
|-
 
|}
 
|}
  
  
 
+
For earlier archived releases, the above links are also available for these additional species (other members of the original 12 sequenced Drosophila species):
====Human Orthologs (dmel_human_orthologs_disease_fb_*.tsv.gz)====
 
 
 
This file reports the human orthologs of ''D. melanogaster'' genes using the DIOPT dataset. Each line reports a single orthologous pair, which means that each human and D. melanogaster gene can appear in multiple lines.  Note that ortholog calls supported by only 1 or 2 algorithms (DIOPT score <3) have been removed. Human genes are also associated with diseases (OMIM phenotypes) using the OMIM dataset.
 
 
 
This is identical to the file of the same name listed under the 'Orthologs' section above.
 
 
 
 
 
 
 
===Nomenclature===
 
 
 
====Species abbreviation list (species-ab.gz)====
 
 
 
The species-abbreviations.txt file lists all the species for which FlyBase has some information. FlyBase includes gene reports for genes derived from species within the family Drosophilidae, as well as gene reports for non-drosophilid genes ("foreign genes") that have been introduced into Drosophila via transgenic constructs and for engineered objects such as a fusion gene between two ''D.melanogaster'' genes. In addition, information about non-Drosophilid species is also displayed in GBrowse, for example in the "Similarity: Proteins" evidence tier. Thus, the file contains information for both Drosophilid and non-Drosophilid species.
 
 
 
There are 8 columns of data in the file, each separated by " | ".
 
  
 
{| class= "wikitable"
 
{| class= "wikitable"
!Column heading
+
!Species name
!Content Description
+
!Abbreviation
 
|-
 
|-
|'''Internal_id'''
+
|Drosophila erecta
|The Primary  [[FlyBase:RefMan_F.#FlyBase_Identifier_Numbers|FlyBase identifier]] of the organism.
+
|Dere
 
|-
 
|-
|'''Taxgroup'''
+
|Drosophila grimshawi
|A grouping term, currently one of "drosophilid", "non-drosophilid eukaryote", "prokaryote", "transposable element" or "virus".
+
|Dgri
 
|-
 
|-
|'''Abbreviation'''
+
|Drosophila mojavensis
|The standard FlyBase prefix for the species. This abbreviation is used in FlyBase as the first part of the symbol (before the '\') of any object, e.g. a gene or allele, that originates from this species. This column may be blank, if data from a species is displayed in an evidence tier on [http://{{flybaseorg}}/cgi-bin/gbrowse/dmel/ GBrowse] but no individual report page exists for that species in FlyBase.
+
|Dmoj
 
|-
 
|-
|'''Genus'''
+
|Drosophila persimilis
|The genus name of the organism.
+
|Dper
 
|-
 
|-
|'''Species name'''
+
|Drosophila sechellia
|The species name of the organism.
+
|Dsec
 
|-
 
|-
|'''Common name'''
+
|Drosophila willistoni
|The common name of the organism. This column may be blank.
+
|Dwil
 
|-
 
|-
|'''Comment'''
+
|Drosophila yakuba
|A free text field for additional comments. This column may be blank.
+
|Dyak
|-
 
|'''Ncbi-taxon-id'''
 
|The [http://www.ncbi.nlm.nih.gov/taxonomy NCBI Taxonomy Database] Taxon ID for the organism. This column may be blank.
 
 
|-
 
|-
 
|}
 
|}
  
An html version of this file is also available - see the [[FlyBase:Abbreviations|Species Abbreviations]] page.
+
====FASTA files====
  
 +
The FlyBase FASTA files generally follow the [http://en.wikipedia.org/wiki/Fasta_format FASTA format] guidelines, with one exception: our header lines sometimes exceed the 80 character limit. The FASTA filenames follow these formats:
  
 +
'''dmel-all-<data type>-r<release-number>.fasta.gz'''
  
===Ontology Terms===
+
or
====Frozen files used for this release of FlyBase====
 
  
List of ontologies available for download:
+
'''dmel-<chromosome_arm>-<data_type>-r<release-number>.fasta.gz'''
  
* FBbt: fly_anatomy
+
Where '''data_type''' is one of the entries in the table below. The '''all''' files contain sequences for that data type on all chromosome arms, whereas the chromosome-arm-specific files contain only the features for that particular chromosome arm.
* FBdv: fly_development
 
* FBcv: flybase controlled vocabulary
 
* FBsv: stock ontology
 
* GO: gene ontology
 
* FBbi: image ontology
 
* SO: sequence ontology
 
* DO: human disease ontology
 
  
====Current 'Live' Files====
+
{| class= "wikitable"
 
+
!Data Type
List of ontologies available for download:
+
!Content Description
 
+
|-
* FBbt: fly_anatomy
+
|'''aligned '''
''Note: link points to the ontology version fbbt-simple.obo, which lacks a few minor FlyBase specific changes that are present in the 'fly_anatomy.obo' version''
+
|The region of genomic sequence that analysis features align to.
* FBdv: fly_development
+
|-
''Note: link points to the ontology version fbdv-simple.obo, which lacks a few minor FlyBase specific changes that are present in the 'fly_development.obo' version''
+
|'''CDS'''
* FBcv: flybase controlled vocabulary
+
|The contiguous protein coding sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.
''Note: link points to the ontology version fbcv-simple.obo, which lacks a few minor FlyBase specific changes that are present in the 'flybase_controlled_vocabulary.obo' version''
 
* FBsv: stock ontology
 
* GO: gene ontology
 
* FBbi: image ontology
 
* SO: sequence ontology
 
* DO: human disease ontology
 
 
 
 
 
===Genomes: Annotation and Sequence===
 
====All Sequenced Drosophila Species====
 
 
 
Links are available to the following FTP repositories:
 
 
 
* Current FTP repository
 
* Current FastA repository
 
* Current GFF repository
 
 
 
* FTP archive (previous releases)
 
 
 
* Current list of individual FASTA files
 
* Current list of individual GFF files
 
 
 
For release FB2018_06 onward, the above links are available for the following sequenced Drosophila species:
 
 
 
 
 
{| class= "wikitable"
 
!Species name
 
!Abbreviation
 
 
|-
 
|-
|Drosophila melanogaster
+
|'''chromosome'''
|Dmel
+
|The sequence of each chromosome arm.
 
|-
 
|-
|Drosophila ananassae
+
|'''clones'''
|Dana
+
|The sequence of full length cDNA, 3' and 5' ESTs, and partial length clones.
 
|-
 
|-
|Drosophila pseudoobscura pseudoobscura
+
|'''exon '''
|Dpse
+
|The sequence of each exon split up into individual FASTA records.
 
|-
 
|-
|Drosophila simulans
+
|'''five_prime_UTR'''
|Dsim
+
|The sequence of 5' untranslated regions.
 
|-
 
|-
|Drosophila virilis
+
|'''gene'''
|Dvir
+
|The sequence of the gene span.
 
|-
 
|-
|}
+
|'''gene_extended2000'''
 
+
|The sequence of the gene span with 2000 base pairs added upstream and downstream.
 
 
For earlier archived releases, the above links are also available for these additional species (other members of the original 12 sequenced Drosophila species):
 
 
 
{| class= "wikitable"
 
!Species name
 
!Abbreviation
 
 
|-
 
|-
|Drosophila erecta
+
|'''intergenic'''
|Dere
+
|The sequence of chromosomal regions between genes that do not contain known gene models.
 
|-
 
|-
|Drosophila grimshawi
+
|'''intron'''
|Dgri
+
|The sequence of each intron split up into individual FASTA records.
 
|-
 
|-
|Drosophila mojavensis
+
|'''miRNA'''
|Dmoj
+
|The sequence of transcripts that are typed as micro RNAs.
 
|-
 
|-
|Drosophila persimilis
+
|'''miscRNA'''
|Dper
+
|The sequence of transcripts that are typed as small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), or ribosomal RNA (rRNA). May also contain other transcript types that do not exist in their own individual files.
 
|-
 
|-
|Drosophila sechellia
+
|'''ncRNA'''
|Dsec
+
|The sequence of transcripts that are typed as non coding RNAs (ncRNA).
 
|-
 
|-
|Drosophila willistoni
+
|'''predicted'''
|Dwil
+
|The sequence of various features that are derived from a variety of prediction algorithms. These can encompass analyses conducted by FlyBase or by 3rd party groups.
 
|-
 
|-
|Drosophila yakuba
+
|'''pseudogene'''
|Dyak
+
|The sequence of transcripts that are typed as pseudogenes.
 
|-
 
|-
|}
+
|'''sequence_features'''
 +
|The sequence of sequence features, which currently describe data about RNAi reagents. In the future, it will also contain natural genomic features (aside from transcribed regions), such as replication origins, transcription factor binding sites and boundary elements, and other experimental reagents that map to the genome, such as microarray oligonucleotides and rescue fragments.
 +
|-
 +
|'''synteny'''
 +
|The sequence of syntenic regions between two species.
 +
|-
 +
|'''three_prime_UTR'''
 +
|The sequence of 3' untranslated regions.
 +
|-
 +
|'''transcript'''
 +
|The sequence of transcripts that are typed as messenger RNAs (mRNA).
 +
|-
 +
|'''translation'''
 +
|The resulting protein sequence from protein coding transcripts.
 +
|-
 +
|'''transposon'''
 +
|The sequence of transposable elements.
 +
|-
 +
|'''tRNA'''
 +
|The sequence of transcripts that are typed as transfer RNAs (tRNA).
 +
|}
  
===Transcripts and Polypeptides===
 
  
====Transcript data (Chado XML)====
+
The typical format of our FASTA header begins with an ID followed by any number of fields that follow this format
  
====Polypeptide data (Chado XML)====
+
'''field_name=value;'''
  
====Non-coding RNAs (JSON) (ncRNA_genes_fb_*.json.gz)====
+
Multiple field values are separated by commas
This file reports all ncRNAs for D. melanogaster and 11 other sequenced Drosophila species in JSON format, as submitted to [http://rnacentral.org/ RNAcentral]. Pseudogenes are excluded.  In addition to the symbols and IDs for ncRNAs, this file also includes their associated gene, genomic location, sequence, Sequence Ontology classification, etc.  The full schema for this file is available [https://github.com/RNAcentral/rnacentral-data-schema/blob/master/sections/ncrna.json here].
 
  
===Transposons, Transgenic Constructs, and Insertions===
+
'''field_name=value1,value2;'''
  
====Insertions (Chado XML)====
+
This table describes some of the field names found in our FASTA headers
====Transgenic Constructs (Chado XML)====
 
  
 +
{|class = "wikitable"
  
====Transgenic construct maps (construct_maps.zip)====
+
!Field Name
The construct_maps.zip file unpacks as a directory containing maps of recombinant constructs and transgenic transposons generated by FlyBase, that are based on the compiled sequence data curated by FlyBase. The name of each PNG image in the directory corresponds to the  [[FlyBase:RefMan_F.#FlyBase_Identifier_Numbers|FlyBase identifier]] of the respective recombinant construct or transgenic transposon.
+
!Description
 
 
'''Please note:''' For transgenic transposons, the image may be a map of the corresponding plasmid form.
 
 
 
====Map data for insertions (insertion_mapping_*.tsv)====
 
The insertion mapping table reports available localization information for '''Dmel''' insertions.
 
 
 
File format:
 
 
 
{| class= "wikitable"
 
!Column heading
 
!Content Description
 
 
|-
 
|-
|'''insertion_symbol'''
+
|'''type'''
|Current symbol of insertion.
+
|The feature type of the FASTA sequence record.
 
|-
 
|-
|'''FBti#'''
+
|'''loc'''
|Current FlyBase identifier (FBti#) of insertion
+
|The genomic location given in the NCBI's feature location format. Please see the [ftp://ftp.ncbi.nih.gov/genbank/docs/ NCBI's] site for more information.
.
 
 
|-
 
|-
|'''genomic_location'''
+
|'''ID'''
|Genomic location of insertion.
+
|A unique ID. IDs in the form of FBxx[0-9]+ are a unique FlyBase object identifier.
 +
|-
 +
|'''name'''
 +
|The name or symbol of the feature.
 
|-
 
|-
|'''range'''
+
|'''dbxref'''
|Range (t/f) indicates whether genomic location is range or single base.
+
|Database cross references relating to the FASTA record. The dbxref values use a 'dbname:dbid' format.
 
|-
 
|-
|'''orientation'''
+
|'''MD5'''
|Orientation (1/0) indicates orientation of insertion on chromosome.
+
|An [http://en.wikipedia.org/wiki/MD5 MD5] checksum calculated from the sequence that can be used to identify identical sequences.
 
|-
 
|-
|'''estimated_cytogenetic_location'''
+
|'''length'''
|Estimated cytogenetic location based on correlation of genomic location and estimated genomic location of cytological bands.
+
|The length of the sequence found in the FASTA record.
 
|-
 
|-
|'''observed_cytogenetic_location'''
+
|'''release'''
|Observed cytogenetic location reported in the literature.
+
|The release number denotes the annotation release which this FASTA record corresponds to.
 
|-
 
|-
 +
|'''species'''
 +
|The species abbreviation that this FASTA record corresponds to.
 
|}
 
|}
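
The header layout described above (an ID followed by '''field_name=value;''' pairs) can be split into its parts with a few lines of code. The following is a minimal Python sketch, not a FlyBase tool: the function names are illustrative, and the sample header in the usage note is made up. Values are kept as raw strings because an NCBI-style '''loc''' value can itself contain commas (for example inside join(...)), so comma splitting is left to the caller for fields known to hold several values, such as '''dbxref'''.

```python
def parse_fasta_header(header):
    """Split a FASTA header line into (record_id, {field_name: raw_value})."""
    text = header.lstrip(">").strip()
    # The ID is everything up to the first space; the rest holds the fields.
    record_id, _, rest = text.partition(" ")
    fields = {}
    for chunk in rest.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, sep, value = chunk.partition("=")
        if sep:  # ignore fragments that are not field_name=value pairs
            fields[key.strip()] = value.strip()
    return record_id, fields


def split_values(value):
    """Split a multi-valued field (value1,value2) into a list of values."""
    return [item.strip() for item in value.split(",") if item.strip()]
```

For a made-up header such as <code>>FBxx0000001 type=mRNA; dbxref=dbA:1,dbB:2; length=101;</code>, the first function returns the ID and a dictionary of raw field values, and <code>split_values(fields["dbxref"])</code> yields the two cross references separately.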
  
  
 +
====GFF files====
  
====Transposable elements (canonical set) (transposon_sequence_set.embl.txt)====
+
The FlyBase GFF files follow the [http://www.sequenceontology.org/gff3.shtml GFF v3] specification. The GFF files contain feature line definitions for gene models, predicted features, alignments, and many other features.
This is a file of 'canonical' sequences of the transposable elements from Drosophila maintained by M. Ashburner.
 
  
The first section of the file outlines the history and revisions to the file and also lists the current set of elements, their size and whether the subsequent sequence data is complete.
+
For ''D. melanogaster'', four GFF files are distributed:
  
The second section of the file, which is separated from the first by a line of "_" characters contains the sequence data of all the elements in [ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/FT_current.html EMBL format]. The record for each element starts with a line prefixed by "ID" and ends with a line containing "//".
+
:'''dmel-all-r<release-number>.gff.gz'''
 +
::Contains all major chromosome arms (X, 2L, 2R, 3L, 3R, 4, Y, mitochondrion_genome) and ~1,860 minor scaffolds.
 +
:'''dmel-all-no-analysis-r<release-number>.gff.gz'''
 +
::Same as 'dmel-all' except all match and match_part features have been removed.
 +
:'''dmel-all-filtered-r<release-number>.gff.gz'''
 +
::Same as 'dmel-all' except all trans-spliced (SO:0000459) and dicistronic (SO:0000722) features have been removed.
 +
:'''dmel-<chromosome_arm>-r<release-number>.gff.gz'''
 +
::Contains only a single chromosome arm or minor scaffold as identified by the filename. Included within the '''dmel-gff_all_scaffolds-r<release-number>.gff.gz''' folder.
  
====Frequently-used GAL4 drivers table (JSON) (fu_gal4_table_fb_2018_06.json.gz)====
 
This file reports a list of all GAL4 drivers that have been curated to at least 21 references and/or are among 150 most frequently requested GAL4 stocks from the [https://bdsc.indiana.edu/ Bloomington Drosophila Stock Center], in JSON format. In addition to the symbols and IDs for Scer\GAL4 alleles, this file also includes their associated transposon or insertion, associated gene, expression pattern in controlled vocabulary stage and anatomy terms, stocks, and publications, all with IDs, as well as free text expression pattern descriptions. This file, except for publications and stocks,  is also available in TSV format [http://flybase.org/GAL4/freq_used_drivers.tsv here].
 
  
===Aberrations===
+
The other species have the all-chromosome-arm file and a gzipped tar archive containing the individual scaffolds. Please note that the tarball holds thousands of files in a single directory level, so extracting it may cause filesystem performance issues.
====Aberration data (Chado XML)====
 
====Balancer data (Chado XML)====
 
  
===Large dataset metadata===
+
The GFF files are produced for each species and can be downloaded from our FTP site using this URL form:
====Dataset metadata members (dataset_metadata_fb_*.tsv.gz)====
 
This file lists all features that are associated with a dataset/collection (e.g., genes, cDNA clones, TF_binding_sites, Affymetrix probes).
 
  
File format:
+
ftp://ftp.flybase.org/genomes/<species abbreviation>/current/gff/
  
{| class= "wikitable"
+
e.g. ftp://ftp.flybase.org/genomes/dmel/current/gff/
!Column heading
 
!Content Description
 
|-
 
|'''Dataset_Metadata_ID'''
 
|The unique FlyBase ID for the dataset.
 
|-
 
|'''Dataset_Metadata_Name'''
 
|The official FlyBase symbol for the dataset.
 
|-
 
|'''Item_ID'''
 
|The unique FlyBase ID for the feature associated with this dataset.
 
|-
 
|'''Item_Name'''
 
|The official FlyBase symbol for the feature associated with this dataset.
 
|-
 
|}
 
  
 +
====GTF files====
  
 +
The FlyBase GTF files follow the [http://mblab.wustl.edu/GTF22.html GTF v2.2] specification.  The GTF files contain feature line definitions for gene models.
  
===Clones===
+
The GTF files are produced for each species and can be downloaded from our FTP site using this URL form:
  
====Clone data (Chado XML)====
+
ftp://ftp.flybase.org/genomes/<species abbreviation>/current/gtf/
 
+
 
====cDNAs: FBcl <=> acc. ID (cDNA_clone_data_*.tsv)====
+
e.g. ftp://ftp.flybase.org/genomes/dmel/current/gtf/
The file reports basic cDNA clone data in FlyBase.
+
 
 
+
 
File format:
+
===Transcripts and Polypeptides===
 
+
 
{| class= "wikitable"
+
====Transcript data (Chado XML)====
!Column heading
+
The chado XML file generated from the FlyBase PostgreSQL database for the 'transcripts' data class.
!Content Description
+
 
|-
+
 
|'''FBcl#'''
+
====Polypeptide data (Chado XML)====
|Current FlyBase identifier (FBcl#) of cDNA clone.
+
The chado XML file generated from the FlyBase PostgreSQL database for the 'polypeptide' data class.
|-
+
 
|'''organism_abbreviation'''
+
 
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the clone.
+
====Non-coding RNAs (JSON) (ncRNA_genes_fb_*.json.gz)====
|-
+
This file reports all ncRNAs with gene models supported by FlyBase in JSON format, as submitted to [http://rnacentral.org/ RNAcentral]. Pseudogenes are excluded. In addition to the symbols and IDs for ncRNAs, this file also includes their associated gene, genomic location, sequence, Sequence Ontology classification, etc.  The full schema for this file is available [https://github.com/RNAcentral/rnacentral-data-schema/blob/master/sections/ncrna.json here].
|'''clone_name'''
+
 
|Clone name.
+
Note - from release FB2020_03 onward, this file reports only ncRNAs for D. melanogaster; earlier files include ncRNAs for D. ananassae, D. pseudoobscura pseudoobscura, D. simulans and D. virilis.
|-
+
 
|'''dataset_metadata_name'''
+
===Transposons, Transgenic Constructs, and Insertions===
|Name of dataset associated with clone.
+
 
|-
+
Files described in this section are in the "insertions" subdirectory of the FTP site (unless otherwise noted). Download the latest file using a query of this form:</br>
|'''cDNA_accession(s)'''
+
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/insertions/insertion_mapping_fb_*.tsv.gz</nowiki></code></br>
|EMBL/GenBank/DDBJ cDNA accession number.
+
 
|-
+
====Insertions (Chado XML)====
|'''EST_accession(s)'''
+
The chado XML file generated from the FlyBase PostgreSQL database for the 'insertions' data class.
|EMBL/GenBank/DDBJ EST accession number.
+
 
|-
+
 
 +
====Transgenic Constructs (Chado XML)====
 +
The chado XML file generated from the FlyBase PostgreSQL database for the 'transgenic constructs' data class.
 +
 
 +
 
 +
====Transgenic construct maps (construct_maps.zip)====
 +
The construct_maps.zip file unpacks as a directory containing maps of recombinant constructs and transgenic transposons generated by FlyBase, that are based on the compiled sequence data curated by FlyBase. The name of each PNG image in the directory corresponds to the  [[FlyBase:RefMan_F.#FlyBase_Identifier_Numbers|FlyBase identifier]] of the respective recombinant construct or transgenic transposon.
 +
 
 +
'''Please note:''' For transgenic transposons, the image may be a map of the corresponding plasmid form.
 +
 
 +
 
 +
====Map data for insertions (insertion_mapping_*.tsv)====
 +
The insertion mapping table reports available localization information for '''Dmel''' insertions.
 +
 
 +
File format:
 +
 
 +
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 +
|-
 +
|'''insertion_symbol'''
 +
|Current symbol of insertion.
 +
|-
 +
|'''FBti#'''
 +
|Current FlyBase identifier (FBti#) of insertion
 +
.
 +
|-
 +
|'''genomic_location'''
 +
|Genomic location of insertion.
 +
|-
 +
|'''range'''
 +
|Range (t/f) indicates whether genomic location is range or single base.
 +
|-
 +
|'''orientation'''
 +
|Orientation (1/0) indicates orientation of insertion on chromosome.
 +
|-
 +
|'''estimated_cytogenetic_location'''
 +
|Estimated cytogenetic location based on correlation of genomic location and estimated genomic location of cytological bands.
 +
|-
 +
|'''observed_cytogenetic_location'''
 +
|Observed cytogenetic location reported in the literature.
 +
|-
 +
|}
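
As an illustration of the column layout above, the following minimal Python sketch turns one tab-separated data line into a named record. It assumes the columns appear in the order listed in the table and that '''range''' is written as "t" or "f"; the record and function names are illustrative, and '''orientation''' is kept as the raw string because the meaning of 1 and 0 is not spelled out here.

```python
from typing import NamedTuple


class InsertionRow(NamedTuple):
    insertion_symbol: str
    fbti_id: str
    genomic_location: str
    is_range: bool          # True when the 'range' column is "t"
    orientation: str        # raw "1" or "0", left uninterpreted
    estimated_cytogenetic_location: str
    observed_cytogenetic_location: str


def parse_insertion_row(line):
    """Parse one data line of the insertion mapping table (assumed column order)."""
    cells = line.rstrip("\n").split("\t")
    cells += [""] * (7 - len(cells))  # pad rows with missing trailing columns
    return InsertionRow(
        cells[0], cells[1], cells[2], cells[3] == "t",
        cells[4], cells[5], cells[6],
    )
```

Rows with empty trailing columns are padded with empty strings, so every field can be accessed by name.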
 +
 
 +
 
 +
====Transposable elements (canonical set) (transposon_sequence_set.*)====
 +
These files, in FASTA or GFF format, represent 'canonical' sequences of transposable elements of Drosophila species (primarily but not exclusively of D. melanogaster), including the protein sequences of encoded genes. Based on a file originally compiled by Michael Ashburner; currently maintained by Casey Bergman.</br>
 +
To download the latest files:
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/transposons/transposon_sequence_set.fa.gz</nowiki></code></br>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/transposons/transposon_sequence_set.gff.gz</nowiki></code></br>
 +
 
 +
====Frequently-used GAL4 drivers table (JSON) (fu_gal4_table_fb_2018_06.json.gz)====
 +
This file reports a list of all GAL4 drivers that have been curated to at least 21 references and/or are among 150 most frequently requested GAL4 stocks from the [https://bdsc.indiana.edu/ Bloomington Drosophila Stock Center], in JSON format. In addition to the symbols and IDs for Scer\GAL4 alleles, this file also includes their associated transposon or insertion, associated gene, expression pattern in controlled vocabulary stage and anatomy terms, stocks, and publications, all with IDs, as well as free text expression pattern descriptions. This file, except for publications and stocks,  is also available in TSV format [http://flybase.org/GAL4/freq_used_drivers.tsv here].
 +
 
 +
===Aberrations===
 +
====Aberration data (Chado XML)====
 +
The chado XML file generated from the FlyBase PostgreSQL database for the 'aberrations' data class.
 +
 
 +
 
 +
====Balancer data (Chado XML)====
 +
The chado XML file generated from the FlyBase PostgreSQL database for the 'balancers' data class.
 +
 
 +
 
 +
===Large dataset metadata===
 +
 
 +
Files described in this section are in the "metadata" subdirectory of the FTP site. Download the latest file using a query of this form:</br>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/metadata/dataset_metadata_fb_*.tsv.gz</nowiki></code></br>
 +
 
 +
 
 +
====Dataset metadata members (dataset_metadata_fb_*.tsv.gz)====
 +
This file lists all features that are associated with a dataset/collection (e.g., genes, cDNA clones, TF_binding_sites, Affymetrix probes).
 +
 
 +
File format:
 +
 
 +
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 +
|-
 +
|'''Dataset_Metadata_ID'''
 +
|The unique FlyBase ID for the dataset.
 +
|-
 +
|'''Dataset_Metadata_Name'''
 +
|The official FlyBase symbol for the dataset.
 +
|-
 +
|'''Item_ID'''
 +
|The unique FlyBase ID for the feature associated with this dataset.
 +
|-
 +
|'''Item_Name'''
 +
|The official FlyBase symbol for the feature associated with this dataset.
 +
|-
 +
|}
 +
 
 +
===Clones===
 +
 
 +
Files described in this section are in the "clones" subdirectory of the FTP site. Download the latest file using a query of this form:</br>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/clones/cDNA_clone_data_fb_*.tsv.gz</nowiki></code></br>
 +
 
 +
====Clone data (Chado XML)====
 +
The chado XML file generated from the FlyBase PostgreSQL database for the 'clones' data class.
 +
 
 +
 
 +
====cDNAs: FBcl <=> acc. ID (cDNA_clone_data_*.tsv)====
 +
The file reports basic cDNA clone data in FlyBase.
 +
 
 +
File format:
 +
 
 +
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 +
|-
 +
|'''FBcl#'''
 +
|Current FlyBase identifier (FBcl#) of cDNA clone.
 +
|-
 +
|'''organism_abbreviation'''
 +
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the clone.
 +
|-
 +
|'''clone_name'''
 +
|Clone name.
 +
|-
 +
|'''dataset_metadata_name'''
 +
|Name of dataset associated with clone.
 +
|-
 +
|'''cDNA_accession(s)'''
 +
|EMBL/GenBank/DDBJ cDNA accession number.
 +
|-
 +
|'''EST_accession(s)'''
 +
|EMBL/GenBank/DDBJ EST accession number.
 +
|-
 +
|}
 +
 
 +
 
 +
====Genomic: FBcl <=> acc. ID (genomic_clone_data_*.tsv)====
 +
 
 +
The file reports basic genomic clone data in FlyBase.
 +
 
 +
File format:
 +
 
 +
{| class= "wikitable"
 +
!Column heading
 +
!Content Description
 +
|-
 +
|'''FBcl#'''
 +
|Current FlyBase identifier (FBcl#) of genomic clone.
 +
|-
 +
|'''organism_abbreviation'''
 +
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the clone.
 +
|-
 +
|'''clone_name'''
 +
|Clone name.
 +
|-
 +
|'''accession'''
 +
|EMBL/GenBank/DDBJ cDNA accession number.
 +
|-
 
|}
 
|}
  
 +
===References===
  
 +
Files described in this section are in the "references" subdirectory of the FTP site. Download the latest file using a query of this form:</br>
 +
<code><nowiki>wget ftp://ftp.flybase.net/releases/current/precomputed_files/references/fbrf_pmid_pmcid_doi_fb*.tsv.gz</nowiki></code></br>
  
====Genomic: FBcl <=> acc. ID (genomic_clone_data_*.tsv)====
+
====Combined reference data (Chado XML)====
 
+
The chado XML file generated from the FlyBase PostgreSQL database for the 'references' data class.
The file reports basic genomic clone data in FlyBase.
 
 
 
File format:
 
  
{| class= "wikitable"
 
!Column heading
 
!Content Description
 
|-
 
|'''FBcl#'''
 
|Current FlyBase identifier (FBcl#) of genomic clone.
 
|-
 
|'''organism_abbreviation'''
 
|Abbreviation (from the [[FlyBase:Abbreviations|Species Abbreviations]] list) indicating the species of origin of the clone.
 
|-
 
|'''clone_name'''
 
|Clone name.
 
|-
 
|'''accession'''
 
|EMBL/GenBank/DDBJ cDNA accession number.
 
|-
 
|}
 
 
 
 
===References===
 
====Combined reference data (Chado XML)====
 
 
====FlyBase FBrf <=> PubMed ID <=> PMCID <=> DOI (fbrf_pmid_pmcid_doi_fb_*.tsv.gz)====
 
====FlyBase FBrf <=> PubMed ID <=> PMCID <=> DOI (fbrf_pmid_pmcid_doi_fb_*.tsv.gz)====
  
Line 1,806: Line 2,154:
 
|-
 
|-
 
|}
 
|}
 
 
 
===Drosophila researchers===
 
Addresses of Drosophila researchers are copyrighted (GSA) material and only provided for official business of the Fly Board.
 
  
 
===Map conversion tables===
 
===Map conversion tables===
Line 1,868: Line 2,211:
  
 
An html version of this file is also available - see the [http://{{flybaseorg}}/static_pages/docs/cytotable3.html Map Conversion Table] page.
 
An html version of this file is also available - see the [http://{{flybaseorg}}/static_pages/docs/cytotable3.html Map Conversion Table] page.
 
  
  

Latest revision as of 13:45, 2 September 2024

Introduction

Browse Current Release Page

The Current Release page is a web interface allowing easy access to the main directories and the individual bulk data files available at the current FlyBase FTP repository. Files can be downloaded directly through the web interface.

Browse FTP Files

Users can also browse files on our FTP site, either for the current release or for past releases. It's also possible to browse the FTP files by genome.
Note that the Safari browser does not support browsing of FTP site directories, though it does allow download of individual files.

Programmatic Download

The command-line download tool wget accepts wildcard patterns in FTP URLs, which means you can use a query to obtain the latest file without having to specify the FlyBase release number. However, you will need to know the sub-directory in which the file resides: e.g., "genes", "orthologs", etc.

Here are some examples:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/genes/fbgn_annotation_ID_*.tsv.gz
wget ftp://ftp.flybase.net/releases/current/precomputed_files/orthologs/dmel_human_orthologs_disease_fb_2022_04.tsv.gz

Opening compressed files

Most of the files are compressed with the GNU gzip program and have the suffix '.gz'. Most modern computers will unpack and open these files automatically after download. Alternatively, the gunzip command may be used on machines running Apple OS X or Unix. On a Windows machine, we suggest you use the program 7-zip to open these files, as several people have reported problems using WinZip. The resulting file should open with any standard text editor.
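
For scripted access, a '.gz' file can also be read without unpacking it first. The following is a minimal Python sketch using the standard gzip module; it assumes the file is UTF-8 text, which is an assumption rather than something stated on this page, and read_gz_lines is an illustrative helper name.

```python
import gzip


def read_gz_lines(path, limit=5):
    """Return the first `limit` lines of a gzip-compressed text file."""
    lines = []
    # "rt" opens the compressed stream in text mode and decompresses on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if len(lines) >= limit:
                break
            lines.append(line.rstrip("\n"))
    return lines
```

Reading only the first few lines is a quick way to inspect the descriptive lines at the top of a large file before processing it in full.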

Archived Data

Data files from previous releases, as well as links to servers hosting older releases of FlyBase, can be accessed via the Archived Data webpage.

Using an FTP client, data files from previous releases can be obtained by including the FlyBase release in the path /releases/<RELEASE_NUMBER>/. For example, to retrieve the 'fbgn_annotation_ID' file for the FB2018_06 release, type:

wget "ftp://ftp.flybase.net/releases/FB2018_06/precomputed_files/genes/fbgn_annotation_ID_*.tsv.gz"

or more specifically:

wget ftp://ftp.flybase.net/releases/FB2018_06/precomputed_files/genes/fbgn_annotation_ID_fb_2018_06.tsv.gz

The /releases/current/ path will always point to the latest FlyBase release, and this directory holds only one copy of each file.
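
The path scheme used in the wget examples above can be expressed as a small helper. This is a minimal Python sketch, assuming the /releases/<RELEASE_NUMBER>/precomputed_files/<sub-directory>/ layout shown in those examples; the function names are illustrative, and the matching helper works on a list of file names you have already obtained rather than contacting the server.

```python
from fnmatch import fnmatchcase

# Host and top-level path as quoted in the wget examples.
FTP_ROOT = "ftp://ftp.flybase.net/releases"


def precomputed_url(subdir, filename, release="current"):
    """Build the URL of a precomputed file; `filename` may contain wildcards."""
    return f"{FTP_ROOT}/{release}/precomputed_files/{subdir}/{filename}"


def matching_names(listing, pattern):
    """Return the names in a directory listing that match a wildcard pattern."""
    return sorted(name for name in listing if fnmatchcase(name, pattern))
```

Calling precomputed_url("genes", "fbgn_annotation_ID_*.tsv.gz") reproduces the first wget example, and passing release="FB2018_06" switches to the archived path.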

Main Data Set

This section contains links to top-level directories of the FlyBase FTP repository.

Postgres Chado Database Dump

The Chado database link leads to the psql directory of the current FTP repository, where you can obtain a dump of the PostgreSQL Chado database. If you have a PostgreSQL client application installed and would like to access the latest FlyBase release without installing the database, you can connect to the FlyBase public read-only Chado database as:

$ psql -h chado.flybase.org -U flybase flybase

The version running on this service is identical to the current web site release.

Drosophila Data

This section contains links to:

  • the current Chado-XML repository, containing the chado XML files generated from the PostgreSQL database for each FlyBase data class for the current FlyBase release. These files contain all the information used to generate FlyBase report pages and reflect the organization of the data in the database. The DTDs for these XML files, listing the structure of the files, are included in this directory.
  • the Genomes FTP repository, containing genome and genome annotation data files (including FASTA, GFF and GTF files) for D. melanogaster and other Drosophila species, organized by genome/FlyBase release number. For releases FB2018_05 and earlier, data are available for each of the original 12 sequenced Drosophila species. For releases FB2018_06 to FB2020_02, data are available only for D. melanogaster, D. simulans, D. ananassae, D. pseudoobscura and D. virilis. From release FB2020_03 onward, data are available only for D. melanogaster.

Bulk data files

The remaining sections of the Current Release page are organized by data class/type and provide direct downloads of the current bulk data files from the FTP site. Most files are from the current precomputed files directory of the FTP site and contain useful data for the specified data type (described in detail below). The Genomes files are from the current D. melanogaster FTP genomes directory or the current files for selected other Drosophila species.

The first part of a filename always describes the content of the file, and the second part may contain a FlyBase or genome annotation version number. For example, the file "fbgn_annotation_ID_fb_2018_06.tsv.gz" maps the primary FlyBase gene identifiers (FBgn) to their annotation IDs for the FB2018_06 release of FlyBase. The "dmel-all-CDS-r6.25.fasta.gz" file contains the coding sequences for all D. melanogaster genes from release 6 of the sequence assembly, annotation release 25.

At the top and bottom of each tab separated text file there are a few lines that describe the file. These lines start with a '#' symbol. The line immediately before the start of the data contains headings for each of the tab separated columns in the file. The file can also include some blank lines to separate information about the version of the file from the description of data in the file.
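The layout described above can be read with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the column-heading line itself starts with '#', so that the last '#' line seen before the first data row is taken as the headings. If a particular file does not follow that pattern, the returned header will need checking by eye.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def read_flybase_tsv(lines):
    """Return (header, rows) from an iterable of lines.

    Blank lines are skipped. Lines starting with '#' are descriptive;
    the last such line before the first data row is treated as the
    column headings (an assumption, see the text above).
    """
    header = None
    rows = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            if not rows:  # still above the data block
                header = line.lstrip("#").strip().split("\t")
            continue  # '#' lines after the data are footer lines
        rows.append(line.split("\t"))
    return header, rows


if __name__ == "__main__":
    import sys

    with open_text(sys.argv[1]) as handle:
        header, rows = read_flybase_tsv(handle)
    print(header)
    print(len(rows), "data rows")
```

Only leading '#' characters are stripped from the heading line, so column names that end in '#' (such as primary_FBgn#) are preserved.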

Superscripts and subscripts are represented in the precomputed data files in the ASCII text format used by FlyBase, which is described in section 10.3 of the Nomenclature document.

Each precomputed data file contains the complete data set for the FlyBase release. If you are looking for information on a defined subset of genes or other FlyBase data type, you can use the Batch Download tool to query the precomputed data files and thus obtain only the data you require. This approach is described in more detail here.


Synonyms

Files described in this section are in the "synonyms" subdirectory of the FTP site. Download the latest file using a command of this form:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/synonyms/fb_synonym_*.tsv.gz

FlyBase Synonyms (fb_synonym_*.tsv)

The file reports current symbols and synonyms for the following objects in FlyBase: genes (FBgn), alleles (FBal), balancers (FBba), aberrations (FBab), transgenic constructs (FBtp), insertions (FBti), transcripts (FBtr), and proteins (FBpp).

The file includes:

  • nuclear genes located to the sequence
  • mitochondrial genes
  • genes not located to the sequence
  • genes from drosophilid species and genes from non-drosophilids that have been introduced into transgenic flies

File format:

Column heading Content Description
primary_FBid Primary FlyBase identifier for the object.
organism_abbreviation Abbreviation (from the Species Abbreviations list) indicating the species of origin.
current_symbol Current symbol used in FlyBase for the object.
current_fullname Current full name used in FlyBase for the object.
fullname_synonym(s) Non-current full name(s) associated with the object (pipe separated values).
symbol_synonym(s) Non-current symbol(s) associated with the object (pipe separated values).
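As a sketch of how the pipe-separated synonym column might be used, the Python below builds a lookup from any symbol (current or non-current) to the primary FlyBase identifiers that carry it. The function name is hypothetical, rows are assumed to be already split on tabs in the column order listed above, and a set is used because the same synonym may be attached to more than one object.

```python
def build_synonym_index(rows):
    """Map each current symbol and symbol synonym to a set of primary IDs.

    Each row is a list of column values in the documented order:
    primary_FBid, organism_abbreviation, current_symbol,
    current_fullname, fullname_synonym(s), symbol_synonym(s).
    """
    index = {}
    for row in rows:
        # Pad short rows so that missing trailing columns read as empty.
        cols = list(row) + [""] * (6 - len(row))
        primary_id, current_symbol, symbol_synonyms = cols[0], cols[2], cols[5]
        names = [current_symbol] + symbol_synonyms.split("|")
        for name in names:
            if name:
                index.setdefault(name, set()).add(primary_id)
    return index
```

Looking up a name in the returned dictionary gives every primary identifier associated with it, which makes ambiguous synonyms easy to spot.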

Genes

Files described in this section are in the "genes" subdirectory of the FTP site. Download the latest file using a command of this form:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/genes/fbgn_annotation_ID_*.tsv.gz

Genes data (Chado XML)

The chado XML file generated from the FlyBase PostgreSQL database for the 'genes' data class.

Genetic interaction table (gene_genetic_interactions_*.tsv)

The file reports the summary of gene-level genetic interactions in FlyBase. This data is computed from the allele-level genetic interaction data captured by FlyBase curators.

The file includes information for Dmel genes only.

Interactions involving any of the following kinds of allele are considered when the gene-level genetic interaction data is computed:

  • classical mutations
  • alleles carried on transgenic constructs
  • loss-of-function mutations
  • gain-of-function mutations

File format:

Column heading Content Description
Starting_gene(s)_symbol Current FlyBase symbol of gene(s) involved in the starting genotype.
Starting_gene(s)_FBgn Current FlyBase identifier (FBgn#) of gene(s) involved in the starting genotype.
Interacting_gene(s)_symbol Current FlyBase symbol of gene(s) involved in the interacting genotype.
Interacting_gene(s)_FBgn Current FlyBase identifier (FBgn#) of gene(s) involved in the interacting genotype.
Interaction_type Type of interaction observed, either 'suppressible' or 'enhanceable'.
Publication_FBrf Current FlyBase identifier (FBrf#) of publication from which the data came.


Notes:

  • Each row contains information from a single reference. Thus if the same genetic interaction has been reported in multiple references, multiple rows will exist for that genetic interaction in the file.
  • 'suppressible' in column 5 indicates that phenotypes caused by mutation of the gene(s) listed in the starting genotype (column 1) are suppressed by mutation of the gene(s) listed in the interacting genotype (column 3).
  • 'enhanceable' in column 5 indicates that phenotypes caused by mutation of the gene(s) listed in the starting genotype (column 1) are enhanced by mutation of the gene(s) listed in the interacting genotype (column 3).

e.g.

Pten FBgn0026379 Akt1 FBgn0010379 suppressible FBrf0127089

indicates that phenotype(s) caused by a mutation of Pten are suppressed by a mutation of Akt1.

  • Where multiple genes are simultaneously mutated in the starting genotype, the interacting genotype, or both, the genes involved are separated by a '|' in the relevant columns. The symbols in column 1 and the IDs in column 2 (and likewise those in columns 3 and 4) are listed in the same order, so that the FBgn corresponding to the symbol for each gene can easily be identified.

e.g.

robo1|sli FBgn0005631|FBgn0264089 RhoGAP93B FBgn0038853 enhanceable FBrf0191476

indicates that:

  • phenotype(s) caused by a robo1, sli double mutant combination are enhanced by a mutation of RhoGAP93B.
  • FBgn0005631 corresponds to robo1, FBgn0264089 corresponds to sli
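The pairing of symbols and identifiers described in the notes above can be sketched in Python as follows. The function name and the returned dictionary keys are hypothetical; the sketch assumes one tab-separated data line with the six documented columns and relies on the stated rule that symbols and FBgn IDs are listed in the same order.

```python
def parse_genetic_interaction(line):
    """Parse one data line of the gene-level genetic interaction table."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 6:
        raise ValueError(f"expected 6 columns, found {len(cols)}")
    start_sym, start_id, inter_sym, inter_id, interaction_type, fbrf = cols[:6]

    def pair(symbols, ids):
        # '|' separates genes mutated together; order is shared by both columns.
        symbol_list = symbols.split("|")
        id_list = ids.split("|")
        if len(symbol_list) != len(id_list):
            raise ValueError("symbol and FBgn lists differ in length")
        return list(zip(symbol_list, id_list))

    return {
        "starting": pair(start_sym, start_id),
        "interacting": pair(inter_sym, inter_id),
        "type": interaction_type,
        "reference": fbrf,
    }


if __name__ == "__main__":
    example = "robo1|sli\tFBgn0005631|FBgn0264089\tRhoGAP93B\tFBgn0038853\tenhanceable\tFBrf0191476"
    print(parse_genetic_interaction(example))
```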


RNA-Seq RPKM values (gene_rpkm_report_fb_*.tsv.gz)

This file reports gene expression values based on RNA-Seq experiments, calculated as reads per kilobase per million reads (RPKM). RPKM values are calculated only for the unique exonic regions of the gene (excluding segments that overlap other genes), except for genes derived from dicistronic/polycistronic transcripts, in which case all exon regions are used in the RPKM expression calculation.

File format:

Column heading Content Description
Release_ID The D. melanogaster annotation set version from which the gene model used in the analysis derives.
FBgn# The unique FlyBase gene ID for this gene.
GeneSymbol The official FlyBase symbol for this gene.
Parent_library_FBlc# The unique FlyBase ID for the dataset project to which the RNA-Seq experiment belongs.
Parent_library_name The official FlyBase symbol for the dataset project to which the RNA-Seq experiment belongs.
RNASource_FBlc# The unique FlyBase ID for the RNA-Seq experiment used for RPKM expression calculation.
RNASource_name The official FlyBase symbol for the RNA-Seq experiment used for RPKM expression calculation.
RPKM_value The RPKM expression value for the gene in the specified RNA-Seq experiment.
Bin_value The expression bin classification of this gene in this RNA-Seq experiment, based on RPKM value. Bins range from 1 (no/extremely low expression) to 8 (extremely high expression).
Unique_exon_base_count The number of exonic bases unique to the gene (not overlapping exons of other genes). Field will be blank for genes derived from dicistronic/polycistronic transcripts.
Total_exon_base_count The number of bases in all exons of this gene.
Count_used Indicates if the RPKM expression value was calculated using only the exonic regions unique to the gene and not overlapping exons of other genes (Unique), or, if the RPKM expression value was calculated based on all exons of the gene regardless of overlap with other genes (Total). RPKM expression values are typically reported for the "Unique" count, except for genes on dicistronic/polycistronic transcripts, in which case the "Total" count is reported.

RNA-Seq RPKM values matrix (gene_rpkm_matrix_fb_*.tsv.gz)

A simpler, spreadsheet-friendly version of the "gene_rpkm_report_fb_*.tsv.gz" file. This file provides a gene by expression value matrix based on RNA-Seq experiments. Values are calculated as reads per kilobase per million reads (RPKM). RPKM values are calculated only for the unique exonic regions of the gene (excluding segments that overlap other genes), except for genes derived from dicistronic/polycistronic transcripts, in which case all exon regions are used in the RPKM expression calculation. This RPKM matrix lacks the details of how RPKM was calculated for each gene.

Note: In addition to the RPKM RNA-Seq expression values calculated by FlyBase, FlyAtlas2 data have been incorporated into this file. These data are in FPKM units, as calculated by the FlyAtlas group (Gillen, 2023).


File format:

Column heading Content Description
gene_primary_id The unique FlyBase gene ID for this gene.
gene_symbol The official FlyBase symbol for this gene.
gene_fullname The official full name for this gene.
gene_type The type of gene: e.g., protein_coding_gene, non_protein_coding_gene.
DATASAMPLE_NAME_(DATASET_ID) Each subsequent column reports the RNA-Seq gene expression value for the sample listed in the header. The dataset "FBlc" ID is listed in parentheses, and can be pasted into FlyBase search to access more information on the sample from the "dataset" report. Expression in most cases was calculated by FlyBase in RPKM units, with the exception of FlyAtlas2 data, which was calculated by the FlyAtlas group and is expressed in FPKM units.

Single Cell RNA-Seq Gene Expression (scRNA-Seq_gene_expression_fb_*.tsv.gz)

This file reports summarized gene expression levels from cell clusters observed in single cell RNA-Seq experiments; these data are processed from data at the EBI Single Cell Expression Atlas. The "Mean_Expression" is the average level of expression of the gene across all cells of the cluster in which the gene is detected at all; the "Spread" is the proportion of cells in the cluster in which the gene is detected. Please see the dataset reports for more experimental details and for links to other data repositories for raw and alternatively processed data.

File format:

Column heading Content Description
Pub_ID The FlyBase FBrf ID for the reference in which the expression was reported.
Pub_miniref The FlyBase citation for the publication in which the expression was reported.
Clustering_Analysis_ID The FlyBase FBlc ID for the dataset representing the clustering analysis.
Clustering_Analysis_Name The FlyBase name for the dataset representing the clustering analysis.
Source_Tissue_Sex The sex of the source tissue used for the experiment: male, female or mixed.
Source_Tissue_Stage The life stage of the source tissue used for the experiment, using only high-level terms: embryonic stage, larval stage, pupal stage, adult stage or mixed.
Source_Tissue_Anatomy The anatomical region of the source tissue used for the experiment; only "mixed" is shown if many
Cluster_ID The FlyBase FBlc ID for the dataset representing the cell cluster.
Cluster_Name The FlyBase name for the dataset representing the cell cluster.
Cluster_Cell_Type_ID The FlyBase FBbt ID for the cell type represented by the cell cluster.
Cluster_Cell_Type_Name The FlyBase name for the cell type represented by the cell cluster.
Gene_ID The FlyBase FBgn ID for the expressed gene.
Gene_Symbol The FlyBase symbol for the expressed gene (ASCII-format).
Mean_Expression The average level of expression of the gene across all cells of the cluster in which the gene is detected at all.
Spread The proportion of cells in the cluster in which the gene is detected.

Fly Cell Atlas gene expression in high-level cell types (FlyCellAtlas_slimmed_gene_expression_fb_*.tsv.gz)

This file provides the data used to generate the “Fly Cell Atlas Cell Type Expression Data” bar chart displayed on our Gene Report pages. For each gene that was found expressed in the Fly Cell Atlas dataset, it provides the mean expression level and the proportion of positive cells in the same 22 high level cell types displayed in the aforementioned bar chart. These data are calculated from FlyCellAtlas scRNA-Seq data for higher resolution cell clusters (having more detailed cell type classifications). For more detailed FlyCellAtlas data, and other scRNA-Seq data, please see the "Single Cell RNA-Seq Gene Expression" file.

NOTE: Not yet available; coming in the FB2023_06 release.

File format:

Column heading Content Description
gene_id The unique FlyBase gene ID for this gene.
gene_Symbol The official FlyBase symbol for this gene.
<cell_type> Two colon-separated values: the mean expression level of the gene in <cell_type>, and the proportion of <cell_type> expressing the gene (percent).

High-Throughput Gene Expression (high-throughput_gene_expression_fb_*.tsv.gz)

This file reports most high-throughput gene expression data that is featured in the High-Throughput Expression Data section of the FlyBase gene report. Data is sorted first by the expression section in which the dataset is displayed, then by sample ID, then by gene ID. Additional information about the dataset or the sample can be obtained by searching FlyBase with the appropriate FBlc dataset/sample ID (columns 2 and 4). Note that scRNA-Seq data is not included in this file, as it is structured differently; scRNA-Seq data is available in other download files. This file includes the testis specificity index score, as calculated by Vedelek et al. (2018).

File format:

Column heading Content Description
<High_Throughput_Expression_Section> The name of the Gene report High-Throughput Expression Data section in which the data is reported.
<Dataset_ID> The FBlc ID of the dataset.
<Dataset_Name> The name of the dataset.
<Sample_ID> The FBlc ID of the sample.
<Sample_Name> The name of the sample.
<Gene_ID> The FBgn ID of the gene.
<Gene_Symbol> The gene symbol.
<Expression_Unit> The unit of expression: e.g., RPKM, RPMM, TPM, LFQ_geom_mean_intensity, testis_specificity_index_score
<Expression_Value> The gene expression value.

Physical interaction MITAB file (physical_interactions_mitab_fb_*.tsv.gz)

This file reports each individual experiment curated by FlyBase that supports a physical interaction between two gene products. There can be multiple experiments (multiple rows in the file) between products of the same gene pair. Interaction molecule types currently curated are protein-protein, protein-RNA or RNA-RNA.

This file is in PSI-MI TAB format, a tab-delimited format developed by the HUPO Proteomics Standards Initiative (PSI) Molecular Interactions (MI) working group to facilitate interactomics data comparison and exchange. Details on the general MITAB format can be found here. The file makes use of the Molecular Interactions ontology which can be searched or browsed here. Fields are filled with “-” if values are missing or not relevant.


File format:

Column number Column heading General format FlyBase example Content description
1 ID(s) Interactor A database:identifier flybase:FBgn0002121 The unique FlyBase identifier for the first gene of the interacting pair.
2 ID(s) Interactor B The unique FlyBase identifier for the second gene of the interacting pair.
3 Alt ID(s) Interactor A database:identifier flybase:CG2671|entrez gene/locuslink:33156 The alternative gene identifiers currently provided are FlyBase annotation IDs (CG#) and NCBI’s Entrez Gene IDs, separated by “|”.
4 Alt ID(s) Interactor B
5 Alias(es) Interactor A database:name(alias type) flybase:l(2)gl(gene name) The official FlyBase gene symbol. It is referred to as “gene name” to adhere to the psi-mi ontology.
6 Alias(es) Interactor B
7 Interaction Detection Method(s) ontology:identifier(method name) psi-mi:"MI:0006"(anti bait coimmunoprecipitation) The assay used to detect the interaction, taken from the psi-mi ontology.
8 Publication 1st Author(s) surname initial(s) (publication year) Betschinger K. (2003) The first author and year of the publication where the interaction is described.
9 Publication ID(s) database:identifier flybase:FBrf0157155|pubmed:12629552 The unique FlyBase identifier for the publication followed by the unique PubMed identifier (if there is one) separated by “|”.
10 Taxid Interactor A taxid:identifier taxid:7227("Drosophila melanogaster") The NCBI taxonomy identifier for the source organism of the interactor. The vast majority of interactors in FlyBase come from D. melanogaster. There are, however, a few interspecies interactions consisting of a D. melanogaster interactor and an interactor of a different species.
11 Taxid Interactor B
12 Interaction Type(s) ontology:identifier(interaction type) psi-mi:"MI:0915"(physical association) Taken from the psi-mi ontology. Most often “physical association” for FlyBase.
13 Source Database(s) ontology:identifier(database name) psi-mi:"MI:0478"(flybase) All interactions are curated by FlyBase.
14 Interaction Identifier(s) database:identifier flybase:FBrf0157155-13.coIP.WB The unique FlyBase identifier for this interaction.
15 Confidence Value(s) Not applicable
16 Expansion Method(s) Not applicable
17 Biological Role(s) Interactor A Not applicable
18 Biological Role(s) Interactor B Not applicable
19 Experimental Role(s) Interactor A ontology:identifier(experimental role name) psi-mi:"MI:0496"(bait) The role played by the interactor in the experiment. Taken from the psi-mi ontology.
20 Experimental Role(s) Interactor B
21 Type(s) Interactor A ontology:identifier(interactor type name) psi-mi:"MI:0326"(protein) The molecule type. For FlyBase, these are limited to protein or ribonucleic acid. Taken from the psi-mi ontology.
22 Type(s) Interactor B
23 Xref(s) Interactor A Not applicable
24 Xref(s) Interactor B Not applicable
25 Interaction Xref(s) database:identifier flybase:FBig0000000103 Cross references for the interactions. For FlyBase, these include an interaction group identifier (FBig) and possibly a collection identifier (FBlc) separated by “|”. All experiments that show an interaction between the products of gene A and gene B are compiled into an A-B interaction group, such that all interactions are associated with an interaction group identified by an FBig number. Interactions identified as part of a large scale study are also associated with the collection identifier, or FBlc number.
26 Annotation(s) Interactor A topic:text isoform-comment:a isoform Information on whether the interaction is specific to a particular interactor isoform.
27 Annotation(s) Interactor B
28 Interaction Annotation(s) topic:text comment:Phosphorylated isoforms of @l(2)gl@ are absent when @aPKC@ is knocked down by RNAi. Describes the source(s) of the interaction participants and includes free text comments about the interaction.
29 Host Organism(s) Not applicable
30 Interaction Parameters Not applicable
31 Creation Date Not applicable
32 Update Date Not applicable
33 Checksum Interactor A Not applicable
34 Checksum Interactor B Not applicable
35 Interaction Checksum Not applicable
36 Negative FALSE All interactions in FlyBase are positive.
37 Feature(s) Interactor A feature_type:range(text) sufficient binding region:aa 1-58(N-terminal region) Describes features of Interactor A such as binding sites, mutations that disrupt the interaction, epitope tags, etc.
38 Feature(s) Interactor B
39 Stoichiometry Interactor A Not applicable
40 Stoichiometry Interactor B Not applicable
41 Identification Method(s) Participant A Not applicable
42 Identification Method(s) Participant B Not applicable
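The field formats listed in the table above (database:identifier, ontology:"identifier"(name), multiple values separated by "|", and "-" for missing values) can be unpacked with a small Python helper. This is a sketch under stated assumptions rather than a full MITAB parser: the function names are hypothetical, free-text columns such as the interaction annotations may themselves contain "|" or parentheses and are not handled specially, and for unquoted values the last parenthesised group is taken as the descriptive text so that symbols such as l(2)gl survive intact.

```python
def parse_mitab_value(value):
    """Split one MITAB value into (database, identifier, text or None)."""
    database, _, rest = value.partition(":")
    text = None
    if rest.startswith('"'):
        # Quoted identifier, e.g. psi-mi:"MI:0006"(method name)
        end = rest.index('"', 1)
        identifier, tail = rest[1:end], rest[end + 1:]
        if tail.startswith("(") and tail.endswith(")"):
            text = tail[1:-1]
    elif rest.endswith(")") and "(" in rest:
        # Unquoted identifier followed by (text); use the last '(' so that
        # identifiers containing parentheses, such as l(2)gl, stay whole.
        cut = rest.rfind("(")
        identifier, text = rest[:cut], rest[cut + 1:-1]
    else:
        identifier = rest
    if text is not None:
        text = text.strip('"')
    return database, identifier, text


def parse_mitab_field(field):
    """Parse a whole column value; '-' means missing or not relevant."""
    if field == "-":
        return []
    return [parse_mitab_value(value) for value in field.split("|")]


if __name__ == "__main__":
    print(parse_mitab_field("flybase:CG2671|entrez gene/locuslink:33156"))
    print(parse_mitab_value('psi-mi:"MI:0006"(anti bait coimmunoprecipitation)'))
```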


Functional complementation table (gene_functional_complementation_*.tsv)

This file reports cases where functional complementation of Dmel genes by non-Dmel orthologs has been observed. This data is computed by FlyBase using a combination of the orthology data obtained from DIOPT and OrthoDB and the allele-level genetic interaction data curated from the literature. The file contains a list of Dmel gene to non-Dmel ortholog gene pairs where a transgenic construct/mutant allele of the non-Dmel ortholog has been shown to at least partially suppress mutant phenotype(s) of an allele of the Dmel gene.

File format:

Column number Column heading Content Description
1 Dmel gene (symbol) Current FlyBase symbol of Dmel gene.
2 Dmel gene (FBgn) Current FlyBase identifier (FBgn#) of Dmel gene in column 1.
3 Functionally complementing ortholog (symbol) Current FlyBase symbol of a non-Dmel ortholog of the Dmel gene in column 1 where this non-Dmel gene has been shown to functionally complement the Dmel gene.
4 Functionally complementing ortholog (FBgn#) Current FlyBase identifier (FBgn#) of a non-Dmel ortholog of the Dmel gene in column 1 where this non-Dmel gene has been shown to functionally complement the Dmel gene.
5 Supporting_FBrf Current FlyBase identifier (FBrf#) of the publication that provides support for the functional complementation statement (the publication that reported the suppression of a mutant phenotype of the Dmel gene by a transgenic construct/mutant allele of the non-Dmel ortholog).

Notes:

  • Each row contains information from a single reference. Thus if multiple references support the same functional complementation statement, multiple rows will exist for that statement in the file.


FBgn <=> DB Accession IDs (fbgn_NAseq_Uniprot_*.tsv)

The file reports EMBL/GenBank/DDBJ nucleotide and protein accessions, UniProtKB/SwissProt/TrEMBL protein accessions, NCBI Entrez gene IDs and NCBI RefSeq transcript and protein accessions associated with FlyBase genes.

The file includes:

  • nuclear genes with sequence accession numbers
  • mitochondrial genes

it excludes:

  • genes without sequence accession numbers

File format:

Column number Column heading Content Description
1 gene_symbol Current symbol of gene.
2 organism_abbreviation Abbreviation (from the Species Abbreviations list) indicating the species of origin of the gene.
3 primary_FBgn# Current FlyBase identifier (FBgn#) of gene.
4 nucleotide_accession EMBL/GenBank/DDBJ nucleotide accession associated with the gene.
5 na_based_protein_accession EMBL/GenBank/DDBJ protein accession associated with the gene and the nucleotide accession in the preceding 'nucleotide_accession' column.
6 UniprotKB/Swiss-Prot/TrEMBL_accession UniProtKB/SwissProt/TrEMBL protein accession associated with the gene.
7 EntrezGene_ID NCBI Entrez ID associated with the gene.
8 RefSeq_transcripts NCBI RefSeq transcript accession associated with the gene.
9 RefSeq_proteins NCBI RefSeq protein accession associated with the gene and the transcript accession in the preceding 'RefSeq_transcripts' column.

Notes:

  • Each row contains information about a single accession associated with a gene, thus if a gene has multiple accessions associated with it, multiple rows will exist for that gene in the file.
  • A single row contains only information about an EMBL/GenBank/DDBJ accession or information about a UniProtKB/SwissProt/TrEMBL accession or an NCBI Entrez gene ID or an NCBI RefSeq transcript accession.
  • For rows containing information about an EMBL/GenBank/DDBJ accession, a nucleotide accession associated with the gene is listed in column 4 ('nucleotide_accession'). If there is also an EMBL/GenBank/DDBJ protein accession associated with that gene and with the nucleotide accession in column 4, this protein accession is listed in column 5 ('na_based_protein_accession'). In this case, columns 6, 7, 8 and 9 are always empty.
  • For rows containing information about a UniProtKB/SwissProt/TrEMBL protein accession, a protein accession associated with the gene is listed in column 6 ('UniprotKB/Swiss-Prot/TrEMBL_accession'). In this case, columns 4, 5, 7, 8 and 9 are always empty.
  • For rows containing information about an NCBI Entrez gene, an ID associated with the gene is listed in column 7 ('EntrezGene_ID'). In this case, columns 4, 5, 6, 8 and 9 are always empty.
  • For rows containing information about an NCBI RefSeq accession, a transcript accession associated with the gene is listed in column 8 ('RefSeq_transcripts'). If there is also an NCBI RefSeq protein accession associated with that gene and with the transcript accession in column 8, this protein accession is listed in column 9 ('RefSeq_proteins'). In this case, columns 4, 5, 6 and 7 are always empty.
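Because each row carries only one kind of accession, the kind can be recovered from which columns are filled. The Python sketch below follows the notes above; the function name and the returned labels are hypothetical, and the row is assumed to be a list of column values already split on tabs (short rows are padded, since trailing empty columns may be missing after splitting).

```python
def accession_kind(row):
    """Classify a row of the FBgn <=> DB accession file by its filled columns.

    Returns one of 'EMBL/GenBank/DDBJ', 'UniProtKB', 'EntrezGene',
    'RefSeq', or None when columns 4 and 6 to 8 are all empty.
    """
    cols = list(row) + [""] * (9 - len(row))
    if cols[3]:  # column 4: nucleotide_accession
        return "EMBL/GenBank/DDBJ"
    if cols[5]:  # column 6: UniprotKB/Swiss-Prot/TrEMBL_accession
        return "UniProtKB"
    if cols[6]:  # column 7: EntrezGene_ID
        return "EntrezGene"
    if cols[7]:  # column 8: RefSeq_transcripts
        return "RefSeq"
    return None
```

Grouping rows by the gene identifier in column 3 and then by this label collects all accessions of one kind for a gene.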


FBgn <=> Annotation ID (fbgn_annotation_ID_*.tsv)

The file reports current and secondary FlyBase identifiers associated with FlyBase genes, including current and secondary gene identifiers (FBgn#), and current and secondary annotation identifiers (CG#).

The file includes:

  • nuclear genes located to the sequence
  • mitochondrial genes

it excludes:

  • genes not located to the sequence

File format:

Column heading Content Description
gene_symbol Current symbol of gene.
organism_abbreviation Abbreviation (from the Species Abbreviations list) indicating the species of origin of the gene.
primary_FBgn# Current FlyBase identifier (FBgn#) of gene.
secondary_FBgn#(s) Secondary FlyBase identifier(s) (FBgn#) associated with the gene (comma separated values).
annotation_ID Current annotation identifier associated with the gene.
secondary_annotation_ID(s) Secondary annotation identifier(s) associated with the gene (comma separated values).

Notes:

  • If a gene has multiple secondary identifiers, all the values are stored within one tab separated column and are separated by commas (for example as: FBgn0034701,FBgn0034702).
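A common use of this file is to bring old gene identifiers up to date. The Python sketch below builds a lookup from every primary and secondary FBgn to the current primary FBgn(s); the function name is hypothetical, rows are assumed to be lists of column values in the order listed above, and sets are used on the assumption that a secondary identifier could be associated with more than one current gene.

```python
def build_fbgn_lookup(rows):
    """Map primary and secondary FBgn IDs to sets of current primary IDs.

    Column order assumed: gene_symbol, organism_abbreviation,
    primary_FBgn#, secondary_FBgn#(s), annotation_ID,
    secondary_annotation_ID(s).
    """
    lookup = {}
    for row in rows:
        primary = row[2]
        lookup.setdefault(primary, set()).add(primary)
        secondary_field = row[3] if len(row) > 3 else ""
        # Secondary identifiers share one column and are comma separated.
        for secondary in secondary_field.split(","):
            secondary = secondary.strip()
            if secondary:
                lookup.setdefault(secondary, set()).add(primary)
    return lookup
```

An identifier that maps to more than one primary ID is worth reviewing by hand rather than converting automatically.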


FBgn <=> GLEANR IDs (fbgn_gleanr_*.tsv)

This file reports the relationship between the symbols and gene identifiers used by FlyBase for non-melanogaster genes identified by the AAA consortium, and the GLEANR identifier assigned to the gene during the initial annotation of the genome sequence.

The file includes:

  • non-melanogaster genes located to the sequence

it excludes:

  • D. melanogaster genes
  • non-melanogaster genes not located to the sequence

File format:

Column heading Content Description
organism_abbreviation Abbreviation (from the Species Abbreviations list) indicating the species of origin of the gene.
gene_symbol Current FlyBase gene symbol.
primary_FBgn# Current FlyBase identifier (FBgn#) of the gene.
GLEANR_ID GLEANR identifier assigned by the AAA Consortium.


FBgn <=> FBtr <=> FBpp IDs (fbgn_fbtr_fbpp_*.tsv)

This file reports the relationship of gene identifiers used by FlyBase for sequence localized genes, and the identifiers used for the transcript and polypeptide products of these genes.

The file includes:

  • genes located to the sequence

it excludes:

  • genes not located to the sequence

File format:

Column heading Content Description
FlyBase_FBgn Current FlyBase identifier (FBgn#) of the gene.
FlyBase_FBtr Current FlyBase identifier (FBtr#) of a transcript encoded by the gene listed in the preceding 'FlyBase_FBgn' column.
FlyBase_FBpp Current FlyBase identifier (FBpp#) of a polypeptide encoded by the transcript listed in the preceding 'FlyBase_FBtr' column, where this is relevant.

Notes:

  • Each row contains information about a single transcript and the polypeptide it encodes (if relevant). Thus if a gene encodes multiple isoforms, multiple rows will exist for that gene in the file.
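Since a gene with several isoforms spans several rows, it is often convenient to regroup the rows by gene. A minimal Python sketch follows; the function name is hypothetical, and rows are assumed to be lists holding the three documented columns, with the polypeptide column empty or absent for transcripts that do not encode a polypeptide.

```python
from collections import defaultdict


def products_by_gene(rows):
    """Group (FBtr, FBpp) pairs under their FBgn.

    The FBpp entry is None when the transcript has no polypeptide.
    """
    genes = defaultdict(list)
    for row in rows:
        fbgn, fbtr = row[0], row[1]
        fbpp = row[2] if len(row) > 2 and row[2] else None
        genes[fbgn].append((fbtr, fbpp))
    return dict(genes)
```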


FBgn <=> FBtr <=> FBpp IDs (expanded) (fbgn_fbtr_fbpp_expanded_*.tsv)

This expanded version of the "FBgn <=> FBtr <=> FBpp IDs" file adds organism, symbol and type information to the identifiers for sequence localized genes and their related transcript and protein products.

The file includes:

  • sequence localized nuclear genes with transcript/polypeptide annotations.
  • sequence localized mitochondrial genes with transcript/polypeptide annotations.

it excludes:

  • genes that have not been localized to the reference genome assembly for a given species.

File format:

Column number Column heading Content Description
1 organism Abbreviation (from the Species Abbreviations list) indicating the species of origin of the gene.
2 gene_type The type of gene, represented by a Sequence Ontology term.
3 gene_ID Current "FBgn" identifier of gene.
4 gene_symbol Current symbol of the gene.
5 gene_fullname Current full name of the gene.
6 annotation_ID Current FlyBase annotation identifier of the gene.
7 transcript_type The type of transcript, represented by a Sequence Ontology term.
8 transcript_ID Current FlyBase annotation identifier of the transcript.
9 transcript_symbol Current symbol of the transcript.
10 polypeptide_ID Current FlyBase annotation identifier of the polypeptide.
11 polypeptide_symbol Current symbol of the polypeptide.

Notes:

  • Each row contains information about a single transcript annotation, and if applicable, its associated polypeptide annotation.
  • Multiple rows may exist for a given gene in the file.
  • The "polypeptide_ID" and "polypeptide_symbol" columns are blank for non-mRNA transcript types.
  • For non-melanogaster annotations derived from NCBI Gnomon, some genes may be associated with a mix of coding and non-coding transcripts.
  • For D. melanogaster annotations, annotation IDs have a "CG" prefix for coding genes, or a "CR" prefix for non-protein-coding genes.
  • For non-melanogaster annotations, the annotation ID prefix varies by organism: "GD" for D. simulans ("Dsim"), "GF" for D. ananassae ("Dana"), "GA" for D. pseudoobscura ("Dpse") and "GJ" for D. virilis ("Dvir").

FBgn exons <=> Affy1 (fbgn_exons2affy1_overlaps.tsv)

The file is generated by testing for overlaps, no matter how small, of the locations of Affy1 oligos in the genome with the locations of gene exons, as defined by the Dmel gene models for the current release of FlyBase. If the location of an Affy1 oligo shows any kind of overlap with an exon of a gene, a Gene=>Affy reference is recorded in this file.

The extent of the overlap has no influence on the inclusion of a cross-reference in this file. The overlap might be just one nucleotide, or it could be an exact match to the exon. For interpretation of the significance of a partial overlap please contact Affymetrix.

The file includes the following Dmel genes:

  • nuclear genes located to the sequence

it excludes:

  • genes not located to the sequence
  • mitochondrial genes

Notes:

  • Each line of the file can contain many tab separated columns:
  • The first column of a line contains the valid FlyBase identifiers of a gene.
  • Subsequent columns: Each Affy1 ID that overlaps with an exon of the gene, as described above, is listed in an additional tab separated column. Thus, this file does not contain a predefined number of columns.
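Because the number of columns varies from line to line, this file is easier to read line by line than with a fixed-width table reader. The Python sketch below is illustrative: the function names are hypothetical, and skipping lines that start with '#' is a precaution based on the general description of precomputed files, not something stated for this file.

```python
def parse_affy_overlaps(lines):
    """Return {FBgn: [Affy IDs]} from lines with a variable column count."""
    gene_to_probes = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # First column is the gene; every further non-empty column is an Affy ID.
        gene_to_probes[fields[0]] = [field for field in fields[1:] if field]
    return gene_to_probes


def invert_overlaps(gene_to_probes):
    """Return {Affy ID: [FBgn, ...]} for lookups in the other direction."""
    probe_to_genes = {}
    for gene, probes in gene_to_probes.items():
        for probe in probes:
            probe_to_genes.setdefault(probe, []).append(gene)
    return probe_to_genes
```

The inverted mapping shows directly which oligos overlap exons of more than one gene.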


FBgn exons <=> Affy2 (fbgn_exons2affy2_overlaps.tsv)

The file is generated from the location of Affy2 oligos exactly as described for Affy1 oligos above.


Genes Sequence Ontology (SO) data (dmel_gene_sequence_ontology_annotations_fb_*.tsv.gz)

This file provides SO term annotations for D. melanogaster genes that have been mapped to the current genome assembly. It will be available beginning with the FB2021_02 release.

File format:

Column heading Content Description
gene_primary_id The unique FlyBase gene ID for this gene.
gene_symbol The official FlyBase symbol for this gene.
so_term_name The SO term name.
so_term_id The SO term primary identifier.

Genes map table (gene_map_table_*.tsv)

The file reports available localization information for FlyBase genes.

It includes:

  • nuclear genes located to the sequence
  • mitochondrial genes
  • genes not located to the sequence

File format:

Column heading Content Description
organism_abbreviation Abbreviation (from the Species Abbreviations list) indicating the species of origin of the gene.
current_symbol Current FlyBase gene symbol.
primary_FBid Current FlyBase identifier (FBgn#) of gene.
recombination_loc recombination map location.
cytogenetic_loc cytogenetic location.
sequence_loc genomic location.

Best gene summaries (best_gene_summary*.tsv)

The single best available gene summary is reported for each D. melanogaster gene (available in the FB2022_05 release).
Gene summaries are taken from the following sources, in order of decreasing rank:

  • FlyBase gene snapshots
  • UniProtKB functional descriptions
  • InteractiveFly summaries
  • Alliance of Genome Resources automated descriptions
  • FlyBase automatically generated summaries
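
The ranking above amounts to "take the first source, in rank order, that has a summary". The following is a minimal sketch of that selection logic only; it is not the FlyBase pipeline, and the source labels are hypothetical stand-ins that may not match the values in the Summary_Source column.

```python
# Hypothetical source labels, listed from highest to lowest rank.
SOURCE_RANK = [
    "FlyBase gene snapshot",
    "UniProtKB functional description",
    "InteractiveFly summary",
    "Alliance automated description",
    "FlyBase automated summary",
]

def best_summary(candidates):
    """Return (source, text) for the highest-ranked source with non-empty text, or None."""
    for source in SOURCE_RANK:
        text = candidates.get(source)
        if text:
            return source, text
    return None

# Placeholder texts, for illustration only.
example = {
    "FlyBase automated summary": "Placeholder automated text.",
    "UniProtKB functional description": "Placeholder UniProtKB text.",
}
choice = best_summary(example)
```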

For non-D. melanogaster genes, please see FlyBase's "automated_gene_summaries.tsv.gz" file.

File format:

Column heading Content Description
FBgn_ID Current FlyBase identifier number for the gene.
Gene_Symbol Current FlyBase symbol of the gene.
Summary_Source The source of the gene summary.
Summary The gene summary text.

Automated gene summaries (automated_gene_summaries.tsv)

The file contains the summaries found on gene report pages and the pop-ups in JBrowse and Interactions Browser in plain text.

It includes:

  • nuclear genes located to the sequence
  • mitochondrial genes
  • genes not located to the sequence

File format:

Column heading Content Description
- FlyBase ID. The Valid FlyBase identifier number for the gene.
- The gene summary as a string of plain text.

Gene Snapshots (gene_snapshots_*.tsv)

The file contains in plain text the gene snapshot information visible on gene report pages.

It includes only Dmel protein coding genes.

File format:

Column heading Content Description
FBgn_ID Current FlyBase identifier number for the gene.
GeneSymbol Current FlyBase symbol of the gene.
GeneName Current FlyBase name of the gene.
datestamp Date on which the information was last reviewed.
gene_snapshot_text Gene snapshot information for the gene. Cases that are in progress or are deemed to have insufficient data to summarize are stated as such.

Unique protein isoforms (dmel_unique_protein_isoforms_fb_*.tsv.gz)

The file reports D. melanogaster genes and their unique protein isoforms.

The file includes:

  • melanogaster genes located to the sequence

it excludes:

  • melanogaster genes not located to the sequence
  • non-melanogaster genes

File format:

Column heading Content Description
FBgn Current FlyBase identifier (FBgn#) of the gene.
FB_gene_symbol Current FlyBase gene symbol of the gene.
representative_protein Current FlyBase protein symbol of the representative protein isoform.
identical_protein(s) Current FlyBase protein symbol(s) of identical protein isoforms.


Non-coding RNAs (JSON) (ncRNA_genes_fb_*.json.gz)

This file reports all ncRNAs with gene models supported by FlyBase in JSON format, as submitted to RNAcentral. Pseudogenes are excluded. In addition to the symbols and IDs for ncRNAs, this file also includes their associated gene, genomic location, sequence, Sequence Ontology classification, etc. The full schema for this file is available here.

Note - from release FB2020_03 onward, this file reports only ncRNAs for D. melanogaster; earlier files include ncRNAs for D. ananassae, D. pseudoobscura pseudoobscura, D. simulans and D. virilis.


Enzyme data (dmel_enzyme_data_fb_*.tsv.gz)

This file reports nomenclature and functional data (GO annotations, EC annotations, gene group membership) for D. melanogaster genes encoding enzymes, as defined by membership of the ENZYMES (FBgg0001715) gene group. If a gene is a member of multiple enzyme gene groups, then that gene has separate entries for each group of which it is a member.

The file includes:

  • melanogaster genes located to the sequence

it excludes:

  • melanogaster genes not located to the sequence
  • non-melanogaster genes

File format:

Column heading Content Description
group_id FlyBase gene group (FBgg) ID of the relevant terminal group within the ENZYMES (FBgg0001715) hierarchy (only terminal groups contain members).
group_name FlyBase gene group (FBgg) name of relevant terminal group within the ENZYMES (FBgg0001715) hierarchy (only terminal groups contain members).
group_GO_ID The GO molecular function term ID on the given gene group. Multiple entries are separated with a pipe.
group_GO_name The GO molecular function term name on the given gene group. Multiple entries are separated with a pipe.
group_EC_number The EC number on the given gene group, if present. (This is computed, corresponding to the EC cross-reference on the GO molecular function term.)
group_EC_name The EC name on the given gene group, if present. (This is computed, corresponding to the EC cross-reference on the GO molecular function term.)
gene_id The current FlyBase gene ID (FBgn) of the gene.
gene_symbol The current FlyBase symbol of the gene.
gene_name The current FlyBase name of the gene.
gene_EC_number The EC number(s) associated with the gene, if present. Multiple entries are separated with a pipe. (This is computed, corresponding to the EC cross-reference(s) on any positive GO molecular function term(s) annotated to the gene.)
gene_EC_name The EC name(s) associated with the gene, if present. Multiple entries are separated with a pipe. (This is computed, corresponding to the EC cross-reference(s) on any positive GO molecular function term(s) annotated to the gene.)

Gene Ontology annotation files (go)

Files described in this section are in the "go" subdirectory of the FTP site. Download the latest file using a query of this form:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/go/gene_association.fb.gz

Gene Association File - GAF (gene_association.fb.gz)

The file contains the Gene Ontology (GO) controlled vocabulary (CV) terms assigned to FlyBase genes.

The file includes the following Dmel genes:

  • nuclear genes located to the sequence
  • mitochondrial genes
  • genes not located to the sequence

The columns of the file are described in section G.3.1. of the Reference manual.

Gene Product Information - GPI (gp_information.fb.gz)

This file contains mapping information for FlyBase D. melanogaster protein-coding genes to UniProtKB IDs, as specified by the GO Consortium.

Gene groups

Files described in this section are in the "genes" subdirectory of the FTP site. Download the latest file using a query of this form:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/genes/gene_group_data_fb_*tsv.gz

Gene group data (gene_group_data_fb_*.tsv)

This file reports Gene Groups in FlyBase, together with their hierarchical relationships (where relevant) and member genes. Note that as of FB2022_06, this file no longer contains Pathway groups, which can be found in a separate file (pathway_group_data_fb_*.tsv).

File format:

Column heading Content Description
FB_group_id Current FlyBase identifier (FBgg##) of Gene Group.
FB_group_symbol Current FlyBase symbol of Gene Group.
FB_group_name Current FlyBase full name of Gene Group.
Parent_FB_group_id Current FlyBase identifier (FBgg##) of parent of given Gene Group (if relevant).
Parent_FB_group_symbol Current FlyBase symbol of parent of given Gene Group (if relevant).
Group_member_FB_gene_id Current FlyBase identifier (FBgn##) of member gene (if terminal group).
Group_member_FB_gene_symbol Current FlyBase symbol of member gene (if terminal group).

Notes:

  • Where groups are arranged into hierarchies:
    • the member genes are only associated with the terminal subgroups,
    • the immediate parent of any subgroup is identified in the 'Parent_FB_group_id' and 'Parent_FB_group_symbol' columns.
  • Separate lines are used for each member gene, meaning that each terminal group is listed multiple times (equal to the number of member genes).
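
Since each terminal group is repeated once per member gene, the rows need to be collapsed to recover the hierarchy and the membership lists. The following is a minimal sketch, assuming the seven columns appear in the order listed above and that comment lines start with '#'. Parents are stored as a set per group in case a group appears with more than one parent. The function name and the sample rows are made up.

```python
from collections import defaultdict

def load_gene_groups(lines):
    """Return (parents, members): group ID -> set of parent IDs, group ID -> list of gene IDs."""
    parents = defaultdict(set)
    members = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        cols += [""] * (7 - len(cols))  # pad short rows so indexing is safe
        group_id, parent_id, gene_id = cols[0], cols[3], cols[5]
        if parent_id:
            parents[group_id].add(parent_id)
        if gene_id and gene_id not in members[group_id]:
            members[group_id].append(gene_id)
    # Drop keys that were created but never filled.
    members = {g: ids for g, ids in members.items() if ids}
    return dict(parents), members

# Made-up rows: one parent group and one terminal subgroup with two member genes.
rows = [
    "FBgg0000001\tGRP-A\tGROUP A\t\t\t\t\n",
    "FBgg0000002\tGRP-A1\tGROUP A1\tFBgg0000001\tGRP-A\tFBgn0000010\tgeneX\n",
    "FBgg0000002\tGRP-A1\tGROUP A1\tFBgg0000001\tGRP-A\tFBgn0000011\tgeneY\n",
]
parents, members = load_gene_groups(rows)
```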

Gene groups with HGNC IDs (gene_groups_HGNC_fb_*.tsv)

This file reports all Gene Groups in FlyBase, together with the corresponding HGNC 'gene family' ID (where relevant).

File format:

Column heading Content Description
FB_group_id Current FlyBase identifier (FBgg##) of Gene Group.
FB_group_symbol Current FlyBase symbol of Gene Group.
FB_group_name Current FlyBase full name of Gene Group.
HGNC_family_ID HGNC ID of equivalent human 'gene family'.

Notes:

  • The absence of an HGNC_family_ID entry indicates there is no equivalent HGNC gene family for that FlyBase gene group.
  • Because of different sub-group structures (etc), a single HGNC family may be associated with multiple FlyBase gene groups.
  • Similarly, a single FlyBase gene group may be associated with multiple HGNC gene families - these are shown on separate lines.

Pathway group data (pathway_group_data_fb_*.tsv)

This file reports all Pathway Gene Groups in FlyBase, together with their hierarchical relationships (where relevant) and member genes.

File format:

Column heading Content Description
FB_group_id Current FlyBase identifier (FBgg##) of Pathway Gene Group.
FB_group_symbol Current FlyBase symbol of Pathway Gene Group.
FB_group_name Current FlyBase full name of Pathway Gene Group.
Parent_FB_group_id Current FlyBase identifier (FBgg##) of parent of given Pathway Gene Group (if relevant).
Parent_FB_group_symbol Current FlyBase symbol of parent of given Pathway Gene Group (if relevant).
Group_member_FB_gene_id Current FlyBase identifier (FBgn##) of member gene (if terminal group).
Group_member_FB_gene_symbol Current FlyBase symbol of member gene (if terminal group).

Notes:

  • Where pathway groups are arranged into hierarchies:
    • the member genes are only associated with the terminal pathway subgroups,
    • the immediate parent of any subgroup is identified in the 'Parent_FB_group_id' and 'Parent_FB_group_symbol' columns.
  • Separate lines are used for each member gene, meaning that each terminal group is listed multiple times (equal to the number of member genes).

Alleles and Stocks

Files described in this section are in the "alleles" or "stocks" subdirectory of the FTP site. Download the latest file using a query of this form:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/alleles/allele_genetic_interactions_*tsv.gz
wget ftp://ftp.flybase.net/releases/current/precomputed_files/stocks/stocks_*.tsv.gz

Allele data (Chado XML)

The chado XML file generated from the FlyBase PostgreSQL database for the 'alleles' data class.


Stock data (Chado XML)

The chado XML file generated from the FlyBase PostgreSQL database for the 'stocks' data class.


Stock data (stocks_*.tsv.gz)

This file reports genetic components and related information about Stocks in FlyBase.

File format:

Column heading Content Description Example
FBst The unique identifier assigned to this stock by FlyBase. FBst0000002
collection_short_name A short name for the stock collection that holds the stock. Bloomington
stock_type_cv The controlled vocabulary term and unique identifier that describe the state of the stock. living stock ; FBsv:0000002
species Abbreviation (from the Species Abbreviations list) indicating the species of the stock. Dmel
FB_genotype Genetic components of the stock corresponding to alleles, aberrations, balancers, or insertions in FlyBase. May be empty. w[*]; betaTub60D[2] Kr[If-1]/CyO
description Genetic components of the stock as provided to FlyBase by the collection that holds the stock. FlyTrap: ZCL1796 III
stock_number The stock identifier provided to FlyBase by the collection that holds the stock. May be empty. 110818


Genetic interactions (allele_genetic_interactions_*.tsv)

The file reports controlled vocabulary (i.e. not free text) genetic interaction data associated with alleles. This is the data reported in the "Phenotypic Class" and "Phenotype Manifest in" subsections of the "Interactions" section of each Allele Report.

File format:

Column heading Content Description
allele_symbol Current FlyBase allele symbol.
allele_FBal# Current FlyBase identifier (FBal#) of allele.
interaction Interaction information associated with allele.
FBrf# Current FlyBase identifier (FBrf#) of publication from which data came.

Notes:

  • Each row contains information about a single interaction from a single reference. Thus if multiple genetic interactions have been reported for a given allele, or if multiple references report the same interaction for a given allele, multiple rows will exist for that allele in the file.


Phenotypic data (genotype_phenotype_data_*.tsv)

The file reports controlled vocabulary (i.e. not free text) phenotypic data associated with genotypes. This is the data reported in the Phenotypic Class and Phenotype Manifest in subsections of the Phenotypic Data section of each Allele Report.

File format:

Column heading Content Description
genotype_symbols Current FlyBase symbol(s) of the components that make up the genotype.
genotype_FBids Current FlyBase identifier(s) of the components that make up the genotype.
phenotype_name Phenotypic name associated with the genotype.
phenotype_id Phenotypic identifier associated with the genotype.
qualifier_names Qualifier name(s) associated with phenotypic data for genotype.
qualifier_ids Qualifier identifier(s) associated with phenotypic data for genotype.
reference Current FlyBase identifier (FBrf#) of publication from which data came.

Notes:

  • Each row contains information about a single phenotype from a single reference. Thus if multiple phenotypes have been reported for a given genotype, or if multiple references report the same phenotype for a given genotype, multiple rows will exist for that genotype in the file.
  • Where the genotype contains more than one component, the components are separated as follows (columns 1 and 2):
    • Homozygous or transheterozygous combinations of classical/insertional alleles at a single locus are separated by a '/'.
    • Hemizygous combinations affecting a single locus (classical/insertional allele over a deficiency for that locus) are separated by a '/'.
    • Heterozygosity for a classical/insertional allele or aberration is represented by '/+'.
    • In all other cases, genotype components (e.g. drivers, transgenic alleles) are separated by a space.
  • Where multiple qualifiers are used to add information to phenotypic data, these are separated by a pipe '|' (columns 5 and 6).
  • Where a column can hold multiple entries, the order and separation of the symbols and of the IDs are preserved across the paired columns, i.e. columns 1 and 2 for genotypes and columns 5 and 6 for qualifiers.


  • Note: this file replaces 'allele_phenotypic_data_*.tsv' from FB2023_01 onward.
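
The separator rules in the notes above can be applied to the paired columns in parallel. The following is a minimal sketch, assuming that symbols themselves contain no spaces or '/' characters (an assumption that may not hold for every symbol) and that the ID column mirrors the symbol column exactly, as the notes state. Function names and sample values are made up.

```python
def split_genotype(symbols, fbids):
    """Split columns 1 and 2 into loci; each locus is a list of (symbol, id) pairs."""
    loci = []
    # A space separates independent components; '/' separates alleles at one locus.
    for sym_group, id_group in zip(symbols.split(" "), fbids.split(" ")):
        loci.append(list(zip(sym_group.split("/"), id_group.split("/"))))
    return loci

def split_qualifiers(names, ids):
    """Split columns 5 and 6 into (name, id) pairs; empty columns give an empty list."""
    if not names:
        return []
    return list(zip(names.split("|"), ids.split("|")))

# Made-up symbols and identifiers, for illustration only.
loci = split_genotype("abc[1]/abc[2] xyz[3]", "FBal0000001/FBal0000002 FBal0000003")
quals = split_qualifiers("qualifier one|qualifier two", "FBcv0000001|FBdv00000002")
```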

Alleles <=> Genes (fbal_to_fbgn_fb_*.tsv)

This file reports the relationship between gene identifiers and the identifiers used for alleles of these genes.

File format:

Column heading Content Description
AlleleID Current FlyBase identifier (FBal#) of the allele.
AlleleSymbol Current symbol of the allele.
GeneID Current FlyBase identifier (FBgn#) of the gene.
GeneSymbol Current symbol of the gene.

Homologs

Files described in this section are in the "orthologs" subdirectory of the FTP site. Download the latest file using a query of this form:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/orthologs/dmel_paralogs_fb_*.tsv.gz

Drosophila Paralogs (dmel_paralogs_fb_*.tsv.gz)

The file reports D. melanogaster genes and their paralogs, as provided by DIOPT. (The version of DIOPT currently being used is shown in the 'Paralogs' -> 'Paralogs (via DIOPT)' section of a Gene Report.)

File format:

Column heading Content Description
FBgn_ID Current FlyBase identifier (FBgn#) of the D. melanogaster gene.
GeneSymbol Current FlyBase gene symbol of the D. melanogaster gene.
Arm/Scaffold Arm upon which the D. melanogaster gene is localized.
Location Location of D. melanogaster gene on the arm.
Strand Strand of D. melanogaster gene ('1' indicates the positive strand, '-1' indicates the negative strand).
Paralog_FBgn_ID Current FlyBase identifier (FBgn#) of the paralogous gene.
Paralog_GeneSymbol Current FlyBase gene symbol of the paralogous gene.
Paralog_Arm/Scaffold Arm upon which the paralogous gene is localized.
Paralog_Location Location of paralogous gene on the arm.
Paralog_Strand Strand of paralogous gene ('1' indicates the positive strand, '-1' indicates the negative strand).
DIOPT_score DIOPT 'score' for the paralog call (i.e. the number of individual algorithms that support the call).

Notes:

  • Each row is a pair-wise association between a given D. melanogaster gene and a paralog. Thus, two rows exist for each paralogous pair in the file.
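
Because every paralogous pair is listed in both directions, counting rows overstates the number of pairs. The following is a minimal sketch of collapsing the file to unordered pairs, assuming the columns appear in the order listed above (FBgn_ID first, Paralog_FBgn_ID sixth, DIOPT_score eleventh), that the score is an integer, and that comment lines start with '#'. The function and helper names are made up.

```python
def unique_paralog_pairs(lines):
    """Map each unordered {gene, paralog} pair to its DIOPT score."""
    pairs = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        gene_id, paralog_id, score = cols[0], cols[5], cols[10]
        # frozenset ignores direction, so A->B and B->A share one key.
        pairs.setdefault(frozenset((gene_id, paralog_id)), int(score))
    return pairs

def make_row(gene, paralog, score):
    # Build an 11-column row; the unused columns hold placeholder values.
    cols = [gene, "symA", "2L", "1..2", "1", paralog, "symB", "3R", "3..4", "-1", str(score)]
    return "\t".join(cols) + "\n"

# Made-up identifiers: one pair reported in both directions.
rows = [make_row("FBgn0000001", "FBgn0000002", 5), make_row("FBgn0000002", "FBgn0000001", 5)]
pairs = unique_paralog_pairs(rows)
```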

Human Orthologs (dmel_human_orthologs_disease_fb_*.tsv.gz)

This file reports the human orthologs of D. melanogaster genes using the DIOPT dataset. Each line reports a single orthologous pair, which means that each human and D. melanogaster gene can appear in multiple lines. Note that ortholog calls supported by only 1 or 2 algorithms (DIOPT score <3) have been removed. Human genes are also associated with diseases (OMIM phenotypes) using the OMIM dataset.

File format:

Column heading Content Description
Dmel_gene_ID Current FlyBase identifier (FBgn#) of the D. melanogaster gene.
Dmel_gene_symbol Current FlyBase gene symbol of the D. melanogaster gene.
Human_gene_HGNC_ID HGNC ID of orthologous human gene.
Human_gene_OMIM_ID OMIM ID of orthologous human gene.
Human_gene_symbol HGNC gene symbol of orthologous human gene.
DIOPT_score DIOPT 'score' for orthology call (i.e. the number of individual algorithms that support the call).
OMIM_Phenotype_IDs OMIM Phenotype ID of orthologous human gene (comma separated values).
OMIM_Phenotype_IDs[name] OMIM Phenotype ID of orthologous human gene (with the corresponding OMIM name in square brackets). Multiple phenotype[name] entries are separated by a comma.

Human disease

Files described in this section are in the "human_disease" subdirectory of the FTP site. Download the latest file using a query of this form:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/human_disease/disease_model_annotations_fb_*.tsv.gz

Human disease model data (disease_model_annotations_fb_*.tsv.gz)

This file reports (i) all experimental-based disease model annotations, associated with alleles; and (ii) all 'potential' disease models based on orthology to human disease genes in OMIM (see FBrf0241599 for more information on this pipeline) for D. melanogaster. 'Alleles' encompass both classical alleles and transgenic alleles; the latter may relate to transgenic constructs of D. melanogaster genes or non-D. melanogaster genes (often human genes) inserted into the D. melanogaster genome. These disease model annotations are reported in the "Human Disease Model Data" -> "Disease Ontology (DO) Annotations" section of the Gene and Allele Reports.

File format:

Column heading Content Description
FBgn ID Current FlyBase identifier (FBgn#) of the gene associated with the allele of an experimental annotation, or the D. melanogaster ortholog of a human gene associated with a disease in OMIM.
Gene symbol Current FlyBase symbol of the gene in column 1.
HGNC ID HGNC ID of the gene identified in column 1 where it is a human gene (experimental-based annotations only).
DO qualifier Type of association between the object of annotation and the disease - one of 'model of', 'ameliorates', 'exacerbates', 'DOES NOT model', 'DOES NOT ameliorate' or 'DOES NOT exacerbate'.
DO ID Disease Ontology (DO) ID.
DO term Disease Ontology (DO) term.
Allele used in model (FBal ID) Current FlyBase identifier (FBal#) of allele (experimental-based annotations only).
Allele used in model (symbol) Current FlyBase symbol of allele (experimental-based annotations only).
Based on orthology with (HGNC ID) HGNC ID of the human ortholog used for annotations based on orthology to human disease genes.
Based on orthology with (symbol) HGNC gene symbol of the human ortholog used for annotations based on orthology to human disease genes.
Evidence/interacting alleles Evidence code, with interacting allele(s) where appropriate. For experimental-based annotations, the evidence code is one of: 'inferred from mutant phenotype', 'in combination with', 'modeled by', 'is ameliorated by', 'is exacerbated by', 'is NOT ameliorated by' or 'is NOT exacerbated by'. Interacting alleles are given as 'FLYBASE:<allele_symbol>; FB:<FBal_ID>', with multiple alleles separated by a comma. For orthology-based annotations, the evidence code is 'inferred from electronic annotation'.
Reference (FBrf ID) Current FlyBase identifier (FBrf#) of the source publication.

Human Orthologs (dmel_human_orthologs_disease_fb_*.tsv.gz)

This file reports the human orthologs of D. melanogaster genes using the DIOPT dataset. Each line reports a single orthologous pair, which means that each human and D. melanogaster gene can appear in multiple lines. Note that ortholog calls supported by only 1 or 2 algorithms (DIOPT score <3) have been removed. Human genes are also associated with diseases (OMIM phenotypes) using the OMIM dataset.

This is identical to the file of the same name listed under the 'Homologs' section above.

Organisms

Files described in this section are in the "species" subdirectory of the FTP site. Download the latest file using a query of this form:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/species/organism_list_fb*.tsv.gz

Species list (organism_list_*.tsv.gz)

This file lists all the species for which FlyBase has some information.

FlyBase includes gene reports for genes derived from species within the family Drosophilidae, as well as gene reports for non-drosophilid genes that have been introduced into a Drosophila genome via either transposable-element based transgenic constructs or via targeted insertion of DNA by a technique such as homologous recombination or CRISPR/Cas9. In this case, there will be a species 'Abbreviation' in the table, a standard prefix that is used in FlyBase as the first part of the symbol (before the '\') of any object, e.g. a gene or allele, that originates from this species.

In addition, information about non-Drosophilid species is also included in orthology data that is displayed on gene reports and on G/JBrowse. In this case, a species 'Abbreviation' is not automatically generated in the database for the species, and thus the column in the table may be blank.

The file thus includes information for both Drosophilid and non-Drosophilid species.


File format:

Column heading Content Description
Genus The genus designation of the organism.
Species name The species designation of the organism.
Abbreviation The standard FlyBase prefix for the species. This abbreviation is used in FlyBase as the first part of the symbol (before the '\') of any object, e.g. a gene or allele, that originates from this species. This column may be blank, if no individual report page exists for that species in FlyBase.
Common name The NCBI Taxonomy Database common name of the organism. This column may be blank.
Ncbi-taxon-id The NCBI Taxonomy Database Taxon ID for the organism. This column may be blank.
drosophilid If the species is from the family Drosophilidae, this column is filled in with 'y'.

Ontology Terms

The ontology files used by FlyBase are in the OBO format used by the Open Biomedical Ontology group, and may be viewed using the free OBO-Edit tool.

Ontologies undergo continual development. Links are provided to the 'frozen versions' used for the current release of FlyBase, together with links to the current 'live' versions at external sites.

Frozen files used for this release of FlyBase

List of ontologies available for download:

  • FBbt: fly_anatomy
  • FBdv: fly_development
  • FBcv: flybase controlled vocabulary
  • FBsv: stock ontology
  • GO: gene ontology
  • FBbi: image ontology
  • SO: sequence ontology
  • DO: human disease ontology


Current 'Live' Files

List of ontologies available for download:

  • FBbt: fly_anatomy

Note: link points to the ontology version fbbt-simple.obo, which lacks a few minor FlyBase specific changes that are present in the 'fly_anatomy.obo' version

  • FBdv: fly_development

Note: link points to the ontology version fbdv-simple.obo, which lacks a few minor FlyBase specific changes that are present in the 'fly_development.obo' version

  • FBcv: flybase controlled vocabulary

Note: link points to the ontology version fbcv-simple.obo, which lacks a few minor FlyBase specific changes that are present in the 'flybase_controlled_vocabulary.obo' version

  • FBsv: stock ontology
  • GO: gene ontology
  • FBbi: image ontology
  • SO: sequence ontology
  • DO: human disease ontology


Genomes: Annotation and Sequence

All Sequenced Drosophila Species

Links are available to the following FTP repositories:

  • Current FTP repository
  • Current FastA repository
  • Current GFF repository
  • FTP archive (previous releases)
  • Current list of individual FASTA files
  • Current list of individual GFF files


Individual Sequenced Drosophila Species

From release FB2020_03 onward, the above links are available for downloading only D. melanogaster data.

For releases FB2018_06 to FB2020_02, the above links are available for the following sequenced Drosophila species:

Species name Abbreviation
Drosophila melanogaster Dmel
Drosophila ananassae Dana
Drosophila pseudoobscura pseudoobscura Dpse
Drosophila simulans Dsim
Drosophila virilis Dvir


For earlier archived releases, the above links are also available for these additional species (other members of the original 12 sequenced Drosophila species):

Species name Abbreviation
Drosophila erecta Dere
Drosophila grimshawi Dgri
Drosophila mojavensis Dmoj
Drosophila persimilis Dper
Drosophila sechellia Dsec
Drosophila willistoni Dwil
Drosophila yakuba Dyak

FASTA files

The FlyBase FASTA files generally follow the FASTA format guidelines, with one exception: our header lines sometimes exceed the 80 character limit. The FASTA filenames follow these formats:

dmel-all-<data_type>-r<release-number>.fasta.gz

or

dmel-<chromosome_arm>-<data_type>-r<release-number>.fasta.gz

Where data_type is one of the entries in the table below. The 'all' files contain sequences of that data type from all chromosome arms, whereas the files for a specific chromosome arm contain only the features on that arm.

Data Type Content Description
aligned The region of genomic sequence that analysis features align to.
CDS The contiguous protein coding sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.
chromosome The sequence of each chromosome arm.
clones The sequence of full length cDNA, 3' and 5' ESTs, and partial length clones.
exon The sequence of each exon split up into individual FASTA records.
five_prime_UTR The sequence of 5' untranslated regions.
gene The sequence of the gene span.
gene_extended2000 The sequence of the gene span with 2000 base pairs added upstream and downstream.
intergenic The sequence of chromosomal regions between genes that do not contain known gene models.
intron The sequence of each intron split up into individual FASTA records.
miRNA The sequence of transcripts that are typed as micro RNAs.
miscRNA The sequence of transcripts that are typed as small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), or ribosomal RNA (rRNA). May also contain other transcript types that do not exist in their own individual files.
ncRNA The sequence of transcripts that are typed as non coding RNAs (ncRNA).
predicted The sequence of various features that are derived from a variety of prediction algorithms. These can encompass analyses conducted by FlyBase or by 3rd party groups.
pseudogene The sequence of transcripts that are typed as pseudogenes.
sequence_features The sequence of sequence features, which currently describe data about RNAi reagents. In the future, it will also contain natural genomic features (aside from transcribed regions), such as replication origins, transcription factor binding sites and boundary elements, and other experimental reagents that map to the genome, such as microarray oligonucleotides and rescue fragments.
synteny The sequence of syntenic regions between two species.
three_prime_UTR The sequence of 3' untranslated regions.
transcript The sequence of transcripts that are typed as messenger RNAs (mRNA).
translation The resulting protein sequence from protein coding transcripts.
transposon The sequence of transposable elements.
tRNA The sequence of transcripts that are typed as transfer RNAs (tRNA).


The typical format of our FASTA header begins with an ID followed by any number of fields that follow this format

field_name=value;

Multiple field values are separated by commas

field_name=value1,value2;

This table describes some of the field names found in our FASTA headers

Field Name Description
type The feature type of the FASTA sequence record.
loc The genomic location given in the NCBI's feature location format. Please see the NCBI's site for more information.
ID A unique ID. IDs in the form of FBxx[0-9]+ are a unique FlyBase object identifier.
name The name or symbol of the feature.
dbxref Database cross references relating to the FASTA record. The dbxref values use a 'dbname:dbid' format.
MD5 An MD5 checksum calculated from the sequence that can be used to identify identical sequences.
length The length of the sequence found in the FASTA record.
release The release number denotes the annotation release which this FASTA record corresponds to.
species The species abbreviation that this FASTA record corresponds to.
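
The header layout described above (an ID followed by 'field_name=value;' entries, with multiple values separated by commas) can be split mechanically. The following is a minimal sketch, assuming that field values never contain ';' and that only 'loc' may contain commas that are not value separators (NCBI location strings such as join(...) do). The function name and the sample header are made up.

```python
def parse_fasta_header(header):
    """Return (record_id, fields), where fields maps each field name to a list of values."""
    header = header.lstrip(">").strip()
    record_id, _, rest = header.partition(" ")
    fields = {}
    for chunk in rest.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        name, _, value = chunk.partition("=")
        # Keep 'loc' whole: location strings like join(1..2,3..4) contain commas.
        fields[name] = [value] if name == "loc" else value.split(",")
    return record_id, fields

# Made-up header, for illustration only.
sample = (">FBtr0000001 type=mRNA; loc=2L:join(100..200,300..400); ID=FBtr0000001; "
          "name=abc-RA; dbxref=FlyBase:FBtr0000001,FlyBase_Annotation_IDs:CG00001-RA; "
          "length=202; release=r6.00; species=Dmel;")
record_id, fields = parse_fasta_header(sample)
```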


GFF files

The FlyBase GFF files follow the GFF v3 specification. The GFF files contain feature line definitions for gene models, predicted features, alignments, and many other features.

For melanogaster, there are 4 GFF files distributed:

dmel-all-r<release-number>.gff.gz
Contains all major chromosome arms (X, 2L, 2R, 3L, 3R, 4, Y, mitochondrion_genome) and ~1,860 minor scaffolds.
dmel-all-no-analysis-r<release-number>.gff.gz
Same as 'dmel-all' except all match and match_part features have been removed.
dmel-all-filtered-r<release-number>.gff.gz
Same as 'dmel-all' except all trans-spliced (SO:0000459) and dicistronic (SO:0000722) features have been removed.
dmel-<chromosome_arm>-r<release-number>.gff.gz
Contains only a single chromosome arm or minor scaffold as identified by the filename. Included within the dmel-gff_all_scaffolds-r<release-number>.gff.gz folder.
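
For readers who already hold the full 'dmel-all' file, the effect of the 'no-analysis' variant can be approximated locally by dropping rows whose type column (the third column in GFF3) is match or match_part. The following is a minimal sketch and not the procedure FlyBase uses to build its file; it assumes any embedded sequence follows a '##FASTA' directive, as the GFF3 specification allows. The function name and sample lines are made up.

```python
ANALYSIS_TYPES = {"match", "match_part"}

def drop_analysis_features(lines):
    """Yield GFF3 lines, skipping feature rows typed as match or match_part."""
    in_fasta = False
    for line in lines:
        if line.startswith("##FASTA"):
            in_fasta = True
        # Pass directives, comments, blank lines and embedded sequence through untouched.
        if in_fasta or line.startswith("#") or not line.strip():
            yield line
            continue
        cols = line.split("\t")
        if len(cols) >= 3 and cols[2] in ANALYSIS_TYPES:
            continue
        yield line

# Made-up feature lines, for illustration only.
lines = [
    "##gff-version 3\n",
    "2L\tFlyBase\tgene\t100\t200\t.\t+\t.\tID=FBgn0000001\n",
    "2L\tblastx\tmatch\t100\t150\t.\t+\t.\tID=m1\n",
    "2L\tblastx\tmatch_part\t100\t120\t.\t+\t.\tID=m1p;Parent=m1\n",
    "##FASTA\n",
    ">2L\n",
    "ACGT\n",
]
kept = list(drop_analysis_features(lines))
```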


The other species have the 'all' chromosome arm file and also a tarred and gzipped file containing the individual scaffolds. Please note that the tarball contains thousands of files in a single directory level, so extracting them may result in filesystem performance issues.

The GFF files are produced for each species and can be downloaded from our FTP site using this URL form:

ftp://ftp.flybase.org/genomes/<species abbreviation>/current/gff/

e.g. ftp://ftp.flybase.org/genomes/dmel/current/gff/

GTF files

The FlyBase GTF files follow the GTF v2.2 specification. The GTF files contain feature line definitions for gene models.

The GTF files are produced for each species and can be downloaded from our FTP site using this URL form:

ftp://ftp.flybase.org/genomes/<species abbreviation>/current/gtf/

e.g. ftp://ftp.flybase.org/genomes/dmel/current/gtf/


Transcripts and Polypeptides

Transcript data (Chado XML)

The chado XML file generated from the FlyBase PostgreSQL database for the 'transcripts' data class.


Polypeptide data (Chado XML)

The chado XML file generated from the FlyBase PostgreSQL database for the 'polypeptide' data class.


Non-coding RNAs (JSON) (ncRNA_genes_fb_*.json.gz)

This file reports all ncRNAs with gene models supported by FlyBase in JSON format, as submitted to RNAcentral. Pseudogenes are excluded. In addition to the symbols and IDs for ncRNAs, this file also includes their associated gene, genomic location, sequence, Sequence Ontology classification, etc. The full schema for this file is available here.

Note - from release FB2020_03 onward, this file reports only ncRNAs for D. melanogaster; earlier files include ncRNAs for D. ananassae, D. pseudoobscura pseudoobscura, D. simulans and D. virilis.

Transposons, Transgenic Constructs, and Insertions

Files described in this section are in the "insertions" subdirectory of the FTP site (unless otherwise noted). Download the latest file using a query of this form:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/insertions/insertion_mapping_fb_*.tsv.gz

Insertions (Chado XML)

The chado XML file generated from the FlyBase PostgreSQL database for the 'insertions' data class.


Transgenic Constructs (Chado XML)

The chado XML file generated from the FlyBase PostgreSQL database for the 'transgenic constructs' data class.


Transgenic construct maps (construct_maps.zip)

The construct_maps.zip file unpacks as a directory containing maps of recombinant constructs and transgenic transposons generated by FlyBase, that are based on the compiled sequence data curated by FlyBase. The name of each PNG image in the directory corresponds to the FlyBase identifier of the respective recombinant construct or transgenic transposon.

Please note: For transgenic transposons, the image may be a map of the corresponding plasmid form.


Map data for insertions (insertion_mapping_*.tsv)

The insertion mapping table reports available localization information for Dmel insertions.

File format:

Column heading Content Description
insertion_symbol Current symbol of insertion.
FBti# Current FlyBase identifier (FBti#) of insertion.
genomic_location Genomic location of insertion.
range Range (t/f) indicates whether genomic location is range or single base.
orientation Orientation (1/0) indicates orientation of insertion on chromosome.
estimated_cytogenetic_location Estimated cytogenetic location based on correlation of genomic location and estimated genomic location of cytological bands.
observed_cytogenetic_location Observed cytogenetic location reported in the literature.
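
The columns listed above can be read into one dictionary per insertion. The following Python sketch assumes the file is tab-separated with the columns in the order given, and that any header or comment lines begin with '#'; that last point is an assumption, not something stated in this description. The function name is hypothetical.

```python
import csv

# Column names in the order listed in the table above.
INSERTION_COLUMNS = [
    "insertion_symbol",
    "FBti#",
    "genomic_location",
    "range",
    "orientation",
    "estimated_cytogenetic_location",
    "observed_cytogenetic_location",
]


def read_insertion_mapping(lines):
    """Yield one dict per data row of an insertion mapping table."""
    # Assumption: header/comment lines start with '#'; blank lines are ignored.
    data_lines = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Pad short rows so that missing trailing columns become empty strings.
        row = row + [""] * (len(INSERTION_COLUMNS) - len(row))
        yield dict(zip(INSERTION_COLUMNS, row))
```

Values are kept as strings, so the 'range' column remains 't' or 'f' and the 'orientation' column remains '1' or '0' as described above. A gzipped download can be opened with `gzip.open(path, "rt")` and passed to the function.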


Transposable elements (canonical set) (transposon_sequence_set.*)

These files, in FASTA or GFF format, represent 'canonical' sequences of transposable elements of Drosophila species (primarily but not exclusively of D. melanogaster), including the protein sequences of encoded genes. Based on a file originally compiled by Michael Ashburner; currently maintained by Casey Bergman.
To download the latest files:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/transposons/transposon_sequence_set.fa.gz
wget ftp://ftp.flybase.net/releases/current/precomputed_files/transposons/transposon_sequence_set.gff.gz

Frequently-used GAL4 drivers table (JSON) (fu_gal4_table_fb_2018_06.json.gz)

This file reports, in JSON format, a list of all GAL4 drivers that have been curated to at least 21 references and/or are among the 150 most frequently requested GAL4 stocks from the Bloomington Drosophila Stock Center. In addition to the symbols and IDs for Scer\GAL4 alleles, this file also includes their associated transposon or insertion, associated gene, expression pattern in controlled vocabulary stage and anatomy terms, stocks, and publications, all with IDs, as well as free text expression pattern descriptions. This file, except for publications and stocks, is also available in TSV format here.

Aberrations

Aberration data (Chado XML)

The chado XML file generated from the FlyBase PostgreSQL database for the 'aberrations' data class.


Balancer data (Chado XML)

The chado XML file generated from the FlyBase PostgreSQL database for the 'balancers' data class.


Large dataset metadata

Files described in this section are in the "metadata" subdirectory of the FTP site. Download the latest file using a query of this form:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/metadata/dataset_metadata_fb_*.tsv.gz


Dataset metadata members (dataset_metadata_fb_*.tsv.gz)

This file lists all features that are associated with a dataset/collection (e.g., genes, cDNA clones, TF_binding_sites, Affymetrix probes).

File format:

Column heading Content Description
Dataset_Metadata_ID The unique FlyBase ID for the dataset.
Dataset_Metadata_Name The official FlyBase symbol for the dataset.
Item_ID The unique FlyBase ID for the feature associated with this dataset.
Item_Name The official FlyBase symbol for the feature associated with this dataset.
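
Because each row pairs one dataset with one associated feature, a common first step is to group the features by dataset. The following Python sketch does this under the assumption that the file is tab-separated with the four columns in the order listed above and that header or comment lines begin with '#'. The function name is hypothetical.

```python
from collections import defaultdict


def group_items_by_dataset(lines):
    """Map each Dataset_Metadata_ID to the list of Item_IDs associated with it."""
    members = defaultdict(list)
    for line in lines:
        # Assumption: header/comment lines start with '#'; blank lines are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip rows that do not have all four columns
        dataset_id, _dataset_name, item_id, _item_name = fields[:4]
        members[dataset_id].append(item_id)
    return dict(members)
```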

Clones

Files described in this section are in the "clones" subdirectory of the FTP site. Download the latest file using a query of this form:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/clones/cDNA_clone_data_fb_*.tsv.gz

Clone data (Chado XML)

The chado XML file generated from the FlyBase PostgreSQL database for the 'clones' data class.


cDNAs: FBcl <=> acc. ID (cDNA_clone_data_*.tsv)

The file reports basic cDNA clone data in FlyBase.

File format:

Column heading Content Description
FBcl# Current FlyBase identifier (FBcl#) of cDNA clone.
organism_abbreviation Abbreviation (from the Species Abbreviations list) indicating the species of origin of the clone.
clone_name Clone name.
dataset_metadata_name Name of dataset associated with clone.
cDNA_accession(s) EMBL/GenBank/DDBJ cDNA accession number.
EST_accession(s) EMBL/GenBank/DDBJ EST accession number.


Genomic: FBcl <=> acc. ID (genomic_clone_data_*.tsv)

The file reports basic genomic clone data in FlyBase.

File format:

Column heading Content Description
FBcl# Current FlyBase identifier (FBcl#) of genomic clone.
organism_abbreviation Abbreviation (from the Species Abbreviations list) indicating the species of origin of the clone.
clone_name Clone name.
accession EMBL/GenBank/DDBJ accession number.

References

Files described in this section are in the "references" subdirectory of the FTP site. Download the latest file using a query of this form:
wget ftp://ftp.flybase.net/releases/current/precomputed_files/references/fbrf_pmid_pmcid_doi_fb*.tsv.gz

Combined reference data (Chado XML)

The chado XML file generated from the FlyBase PostgreSQL database for the 'references' data class.

FlyBase FBrf <=> PubMed ID <=> PMCID <=> DOI (fbrf_pmid_pmcid_doi_fb_*.tsv.gz)

This file lists all publications in the FlyBase bibliography that have a PubMed ID. Additional identifiers are listed as applicable.

File format:

Column heading Content Description
FBrf The unique FlyBase ID for this publication.
PMID The unique PubMed ID for this publication.
PMCID The unique PubMed Central ID for this publication, if applicable.
DOI The digital object identifier assigned to the publication.
pub_type The publication type (for example, paper, review, erratum, abstract, book, etc.)
miniref A short citation listing the first author, year of publication, journal, volume, issue and page numbers.
pmid_added The FlyBase release in which the publication was first incorporated into the FlyBase bibliography. Note: as this report was first generated for the fb_2012_01 release, all publications associated with a PubMed ID prior to this release have pmid_added = fb_2011_10.
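
A typical use of this file is to translate PubMed IDs into FlyBase reference IDs. The following Python sketch builds such a lookup, assuming the file is tab-separated with FBrf in the first column and PMID in the second, as listed above, and that header or comment lines begin with '#'. The function name is hypothetical.

```python
def build_pmid_to_fbrf(lines):
    """Return a dict mapping each PMID (as a string) to its FBrf identifier."""
    lookup = {}
    for line in lines:
        # Assumption: header/comment lines start with '#'; blank lines are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip rows without both an FBrf and a PMID column
        fbrf, pmid = fields[0], fields[1]
        lookup[pmid] = fbrf
    return lookup
```

PMIDs are kept as strings so that they can be compared directly with the values in the file.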

Map conversion tables

Cytological <=> Sequence (genome-cyto-seq.txt)

This is a tab-delimited file that FlyBase uses to relate sequence coordinates from release 5 of the Drosophila melanogaster sequence assembly to published cytogenetic map positions. A description of how this is calculated is provided in section G.5.1. of the Reference manual.

The data for each chromosome arm is separated by a line starting with a '#' that lists the name of the chromosome arm and corresponding sequence scaffold.

The columns in the file are:

Column heading Content Description
- Cytogenetic map position as described by Bridges.
- First sequence coordinate for this map position in the sequence scaffold corresponding to this chromosome arm.
- Last sequence coordinate for this map position in the sequence scaffold corresponding to this chromosome arm.
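
The layout described above ('#' lines introducing each chromosome arm, followed by three tab-delimited columns) can be read as follows. This Python sketch keeps the text of each '#' line as the key, since its exact wording is not specified here, and it assumes that both coordinates are plain integers; a row with non-integer coordinates would raise a ValueError. The function name is hypothetical.

```python
def parse_genome_cyto_seq(lines):
    """Return {arm header text: [(map position, first coordinate, last coordinate), ...]}."""
    table = {}
    current_arm = None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            # A '#' line starts the block for a new chromosome arm.
            current_arm = line.lstrip("#").strip()
            table[current_arm] = []
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip rows without all three columns
        position, first, last = fields[:3]
        table.setdefault(current_arm, []).append((position, int(first), int(last)))
    return table
```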


Cytological <=> Genetic (cytotable.txt)

This is the table that FlyBase uses to infer a genetic map position from a published cytogenetic map position for Drosophila melanogaster.

The first six lines of the file describe the contents of the file or are blank. The data in the file is organized with the cytological position in the first four characters of a line, followed by a run of spaces and then the genetic map position.
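
The fixed layout described above can be read with simple string slicing. This Python sketch skips the first six lines, takes the first four characters of each remaining line as the cytological position, and treats the rest of the line (with surrounding spaces removed) as the genetic map position. The function name is hypothetical, and the genetic map position is kept as text because its exact notation is not described here.

```python
def parse_cytotable(lines, header_lines=6):
    """Return a dict mapping cytological position to genetic map position."""
    mapping = {}
    for index, line in enumerate(lines):
        if index < header_lines:
            continue  # the first six lines are descriptive or blank
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # The first four characters hold the cytological position.
        cytological = line[:4].strip()
        # The remainder, after the run of spaces, is the genetic map position.
        genetic = line[4:].strip()
        mapping[cytological] = genetic
    return mapping
```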


Cyto <=> Genetic <=> Seq (cyto-genetic-seq.tsv)

This is a tab-separated file generated from the cytotable.txt and genome-cyto-seq.txt files that infers the relationship between published cytogenetic map positions, genetic map positions and release 6 sequence assembly coordinates for Drosophila melanogaster. Please note that band numbers are not given in this file because they are absent in cytotable.txt.

File format:

Column heading Content Description
Cytogenetic map position Cytogenetic map position.
Genetic map position Genetic map position.
Sequence coordinates (release 6) Sequence coordinates (release 6) for the interval.
R6 conversion notes


An HTML version of this file is also available - see the Map Conversion Table page.


Genes map table (gene_map_table_fb_*.tsv)

This is identical to the file listed under the genes section above.