Difference between revisions of "FlyBase:Downloads Overview"

From FlyBase Wiki

Revision as of 20:53, 20 June 2017

Introduction

The "Precomputed data files for the current release" page (referred to herein as the "Precomputed files" page) lists data files generated from the current release of FlyBase and links to the current FTP repository. The "Archived data files for recent releases" page (referred to herein as the "Archived data" page) provides data from the previous five releases, links to servers hosting older releases of FlyBase, and all the release notes and news archives from FB2006_01. If you are looking for old data and cannot find it on the "Archived data" page, please follow a link to the FTP repository.

Known issues

Safari does not connect properly to the FlyBase FTP site. You should be able to download individual files using Safari but you will not be able to browse the FTP repository.


The FTP archive

The "Main Data Set" section of the "Precomputed files" page provides links to the FlyBase FTP repository. The "Chado database" link leads to the psql directory of the current FTP repository, where you can obtain a dump of the PostgreSQL Chado database. If you have a PostgreSQL client application installed and would like to access the latest FlyBase release without installing the database, you can connect to the FlyBase public read-only Chado database as:

$ psql -h chado.flybase.org -U flybase flybase

The version running on this service is identical to the current web site release.

The "Drosophila Data" section contains links to other sections of the FTP repository. The data can be accessed either by the version of FlyBase (the "Current FTP repository") or, for sequence data, by the annotation release of a particular Drosophila species (see the "Genomes FTP archive").

General information about the available files

File names

The first part of a filename always describes the content of the file, for example the file fbgn_annotation_ID_fb_2008_10.tsv.gz maps the primary FlyBase identifiers to the annotation symbols used for genes.

Many of the precomputed data filenames also contain a release or version number. In the example above "fb_2008_10" denotes version FB2008_10 of FlyBase. Sequence data files, such as dmel-all-CDS-r5.13.fasta.gz, denote the annotation release that the data relates to as '-r5.13'. The 'r5' indicates the data is taken from the release 5 of the sequence assembly and the '13' after the decimal point specifies the annotation release. Please see "A more frequent FlyBase update cycle" for further discussion of the difference between the current release of FlyBase and a genome annotation release.
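The naming scheme above can be read programmatically. The following Python sketch is only an illustration: the two regular expressions are inferred from the example filenames in this section, and the function name describe_filename is hypothetical, so files that follow a different pattern will simply return no release information.

```python
import re

# Patterns inferred from the two example filenames in the text above;
# other FlyBase files may not follow them (assumption).
FB_RELEASE = re.compile(r"fb_(\d{4})_(\d{2})")
ANNOTATION_RELEASE = re.compile(r"-r(\d+)\.(\d+)")


def describe_filename(filename):
    """Return the release information encoded in a FlyBase filename."""
    info = {}
    match = FB_RELEASE.search(filename)
    if match:
        # e.g. "fb_2008_10" denotes FlyBase release FB2008_10
        info["flybase_release"] = "FB{}_{}".format(*match.groups())
    match = ANNOTATION_RELEASE.search(filename)
    if match:
        # e.g. "-r5.13": assembly release 5, annotation release 13
        info["assembly_release"] = int(match.group(1))
        info["annotation_release"] = int(match.group(2))
    return info


print(describe_filename("fbgn_annotation_ID_fb_2008_10.tsv.gz"))
print(describe_filename("dmel-all-CDS-r5.13.fasta.gz"))
```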

The following notation is used for different file formats:

Tab separated files are indicated by the extension '.tsv'
OBO format files, which are used for the ontologies, are denoted '.obo'
Plain text files are given the extension '.txt'
Files containing nucleic acid or polypeptide sequence data in the FASTA format are listed as '.fasta'
GFF files include the suffix '.gff'
GTF files include the suffix '.gtf'
XML data files are denoted '.xml'

Most of the files listed on the "Precomputed files" page and the "Archived data" page are compressed with the GNU gzip program. These files end with the suffix '.gz'. The ontology files are compressed in the ZIP file format, as indicated by the suffix '.zip'.


Accessing files

Files can be downloaded either directly through the web interface, or by using an FTP client such as wget to obtain the file from the FTP repository. Please note that at present Safari fails to connect to the FTP repository correctly, so we recommend that you use another browser if you wish to access the files through the web interface. The wget client accepts wildcard patterns, which means you can use a query of the following kind to obtain the latest file without having to specify the FlyBase release number:

$ wget ftp://ftp.flybase.net/releases/current/precomputed_files/genes/fbgn_annotation_ID_*.tsv.gz

The /releases/current/ path will always point to the latest FlyBase release, and this directory will have only one copy of the file. Archived copies of the files from previous releases can be obtained by including the FlyBase release in the path /releases/<RELEASE_NUMBER>/. For example, to retrieve this file for the FB2008_05 release, type:

$ wget ftp://ftp.flybase.net/releases/FB2008_05/precomputed_files/genes/fbgn_annotation_ID_*.tsv.gz

or more specifically:

$ wget ftp://ftp.flybase.net/releases/FB2008_05/precomputed_files/genes/fbgn_annotation_ID_fb_2008_05.tsv.gz


Opening compressed files

Microsoft and Apple include built-in ZIP support in later versions of their operating systems. These files can also be opened with the free program 7-zip.

On OS X or Unix a GZIP compressed file can be extracted with the gunzip command. On a Windows machine we suggest that you use the program 7-zip to open these files, because several people have reported problems using WinZip. You should be able to open and read the resulting file with any text editor.
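For scripted access, a gzip-compressed file can also be read without extracting it first. The Python sketch below uses only the standard library; the helper name read_gzip_lines, the UTF-8 encoding, and the sample content (including the identifier FBgn0000001) are assumptions made for illustration.

```python
import gzip
import io


def read_gzip_lines(source):
    """Yield text lines from a gzip-compressed path or binary file object."""
    # UTF-8 is an assumption; adjust the encoding if a file needs it.
    with gzip.open(source, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\r\n")


# Demonstration on an in-memory gzip stream, so nothing has to be downloaded.
# The content is a made-up placeholder, not real FlyBase data.
demo = io.BytesIO(gzip.compress(b"# example comment\nFBgn0000001\tvalue\n"))
for line in read_gzip_lines(demo):
    print(line)
```

For a downloaded file, pass its path instead of the in-memory stream, for example read_gzip_lines("fbgn_annotation_ID_fb_2008_10.tsv.gz").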


Types of files available

XML files

These files are generated from the PostgreSQL Chado database for each release of FlyBase. The Chado XML files contain all the information used to generate the report pages and reflect the organization of the data in the database, whereas the Reporting XML files contain a more compact version of the Chado XML. The DTDs for these XML files, listing the structure of the files, are posted in the chado-xml and reporting-xml directories of the FTP repository for each release. For the latest versions of the DTDs please see:

ftp://ftp.flybase.net/releases/current/chado-xml/

The XML files can also be obtained directly from the "Precomputed files" page by clicking on the "download" link under the ChadoXML heading in the appropriate section. The Chado XML files are available for genes, alleles, stocks, transcripts, polypeptides, insertions, transgenic (recombinant) constructs, aberrations, balancers, clones, and references.

Please note:

The chado_genes.xml.gz file is used to generate the FlyBase gene report pages. Data for the GMOD Common Gene Page project is provided in the chado_common_gene_page.xml.gz file, which is only available as Chado XML from the FTP repository.


Ontology files

The controlled vocabularies (aka ontologies) used by FlyBase are available under the Ontology Terms section of the "Precomputed files" page. Each controlled vocabulary is described in detail in section G.2. of the Reference Manual. The files are in the OBO format used by the Open Biomedical Ontology group, and are designed to be used with the free OBO-Edit tool.

Controlled vocabularies undergo continual development; terms and definitions are refined, added, merged, split and obsoleted in an effort to improve the way they represent their various subjects. On the "Precomputed files" page the frozen versions of the controlled vocabularies used for the current release of FlyBase are available, and there are also links to the current 'live' versions maintained by the Open Biomedical Ontology group.

Frozen versions of the controlled vocabularies used for previous releases of FlyBase are available on the "Archived data" page, and in the following directories of the FTP repository:

ftp://ftp.flybase.net/releases/<RELEASE_NUMBER>/precomputed_files/ontologies/

For example see:

ftp://ftp.flybase.net/releases/FB2008_03/precomputed_files/ontologies/

FASTA files

The FlyBase FASTA files generally follow the FASTA format guidelines, with one exception: our header lines sometimes exceed the 80 character limit. The FASTA filenames follow these formats:

dmel-all-<data_type>-r<release-number>.fasta.gz

or

dmel-<chromosome_arm>-<data_type>-r<release-number>.fasta.gz

Where data_type is one of the entries in the table below. The "all" files contain sequences of the given data type for all chromosome arms, whereas the chromosome arm files contain only the features for that particular chromosome arm.

Data Type Content Description
aligned The region of genomic sequence that analysis features align to.
CDS The contiguous protein coding sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.
chromosome The sequence of each chromosome arm.
clones The sequence of full length cDNA, 3' and 5' ESTs, and partial length clones.
exon The sequence of each exon split up into individual FASTA records.
five_prime_UTR The sequence of 5' untranslated regions.
gene The sequence of the gene span.
gene_extended2000 The sequence of the gene span with 2000 base pairs added upstream and downstream.
intergenic The sequence of chromosomal regions between genes that do not contain known gene models.
intron The sequence of each intron split up into individual FASTA records.
miRNA The sequence of transcripts that are typed as micro RNAs.
miscRNA The sequence of transcripts that are typed as small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), or ribosomal RNA (rRNA). May also contain other transcript types that do not exist in their own individual files.
ncRNA The sequence of transcripts that are typed as non coding RNAs (ncRNA).
predicted The sequence of various features that are derived from a variety of prediction algorithms. These can encompass analyses conducted by FlyBase or by 3rd party groups.
pseudogene The sequence of transcripts that are typed as pseudogenes.
sequence_features The sequence of sequence features, which currently describe data about RNAi reagents. In the future, it will also contain natural genomic features (aside from transcribed regions), such as replication origins, transcription factor binding sites and boundary elements, and other experimental reagents that map to the genome, such as microarray oligonucleotides and rescue fragments.
synteny The sequence of syntenic regions between two species.
three_prime_UTR The sequence of 3' untranslated regions.
transcript The sequence of transcripts that are typed as messenger RNAs (mRNA).
translation The resulting protein sequence from protein coding transcripts.
transposon The sequence of transposable elements.
tRNA The sequence of transcripts that are typed as transfer RNAs (tRNA).


FASTA header format

The typical FASTA header begins with an ID, followed by any number of fields in this format:

field_name=value;

Multiple field values are separated by commas

field_name=value1,value2;

This table describes some of the field names found in our FASTA headers

Field Name Description
type The feature type of the FASTA sequence record.
loc The genomic location given in the NCBI's feature location format. Please see the NCBI's site for more information.
ID A unique ID. IDs in the form of FBxx[0-9]+ are a unique FlyBase object identifier.
name The name or symbol of the feature.
dbxref Database cross references relating to the FASTA record. The dbxref values use a 'dbname:dbid' format.
MD5 An MD5 checksum calculated from the sequence that can be used to identify identical sequences.
length The length of the sequence found in the FASTA record.
release The release number denotes the annotation release which this FASTA record corresponds to.
species The species abbreviation that this FASTA record corresponds to.
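The header layout described above can be split into an ID and a dictionary of fields. The Python sketch below is a rough illustration under stated assumptions: the sample header is invented (it is not a real FlyBase record, and ExampleDB is a made-up database name), no field value is assumed to contain a semicolon, and the function names are hypothetical. Note that a 'loc' value can itself contain commas, so it is kept as a single string rather than split into multiple values.

```python
def parse_fasta_header(header):
    """Split a FASTA header of the form described above into (ID, fields)."""
    text = header.lstrip(">").strip()
    record_id, _, rest = text.partition(" ")
    fields = {}
    for chunk in rest.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the final ';'
        name, _, value = chunk.partition("=")
        fields[name] = value
    return record_id, fields


def split_values(fields, name):
    """Return a comma-separated field as a list (not intended for 'loc')."""
    value = fields.get(name, "")
    return value.split(",") if value else []


# Invented header for illustration only.
sample = (">FBtr0000001 type=mRNA; loc=2L:join(100..200,300..400); "
          "ID=FBtr0000001; name=example-RA; "
          "dbxref=FlyBase:FBtr0000001,ExampleDB:X1; length=202; "
          "release=r5.13; species=Dmel;")
record_id, fields = parse_fasta_header(sample)
print(record_id, fields["type"], fields["loc"])
print(split_values(fields, "dbxref"))
```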


GFF files

The FlyBase GFF files follow the GFF v3 specification. The GFF files contain feature line definitions for gene models, predicted features, alignments, and many other features. The GFF files are produced for each species and can be downloaded from our FTP site using this URL form:

ftp://ftp.flybase.org/genomes/<species abbreviation>/current/gff/

e.g. ftp://ftp.flybase.org/genomes/dmel/current/gff/

For D. melanogaster, three kinds of GFF file are distributed:

dmel-all-r<release-number>.gff.gz
Contains all chromosome arms
dmel-all-no-analysis-r<release-number>.gff.gz
Same as above except all match and match_part features have been removed
dmel-<chromosome_arm>-r<release-number>.gff.gz
Contains only a single chromosome arm, as identified by the filename

The other species have the file containing all chromosome arms and also a gzipped tar file containing the individual scaffolds. Please note that the tarball contains thousands of files in a single directory level, so extracting them may result in filesystem performance issues.

GTF files

The FlyBase GTF files follow the GTF v2.2 specification. The GTF files contain feature line definitions for gene models. The GTF files are produced for each species and can be downloaded from our FTP site using this URL form:

ftp://ftp.flybase.org/genomes/<species abbreviation>/current/gtf/

e.g. ftp://ftp.flybase.org/genomes/dmel/current/gtf/

Precomputed data text files

Precomputed data files that contain useful sets of data are generated for every release of FlyBase. For example, the file fbgn_NAseq_Uniprot_fb_2008_10.tsv.gz contains the mapping between valid FlyBase identifiers and the corresponding nucleic acid and protein accession numbers used by DDBJ/EMBL/GenBank and UniprotKB/Swiss-Prot/TrEMBL. These files can be found under the "Other" heading of each section of the "Precomputed files" and "Archived data" pages, and are also available under the precomputed_files directory of each FlyBase release in the FTP repository.

Superscripts and subscripts are represented in the precomputed data files in the ASCII text format used by FlyBase, which is described in section 10.3 of the Nomenclature document.

At the top and bottom of each tab separated text file there are a few lines that describe the file. These lines start with a '#' symbol. The line immediately before the start of the data contains headings for each of the tab separated columns in the file. The file can also include some blank lines to separate information about the version of the file from the description of data in the file.
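The layout described above (comment lines starting with '#', optional blank lines, and a heading line immediately before the data) can be handled with a small reader. The Python sketch below is an illustration only: it assumes the heading line itself starts with '#', the function name parse_precomputed_tsv is hypothetical, and the column names in the demonstration are invented.

```python
def parse_precomputed_tsv(lines):
    """Return (column_names, rows) from the lines of a tab separated file."""
    header = None
    last_comment = None
    rows = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # blank separator line
        if line.startswith("#"):
            last_comment = line  # remember the most recent '#' line
            continue
        if header is None and last_comment is not None:
            # The '#' line immediately before the first data line
            # holds the column headings.
            header = last_comment.lstrip("#").strip().split("\t")
        rows.append(line.split("\t"))
    return header, rows


# Made-up file content; the column names are placeholders.
demo_lines = [
    "## Example description line",
    "",
    "##gene_symbol\tprimary_FBgn",
    "Pten\tFBgn0026379",
    "Akt1\tFBgn0010379",
    "## Finished",
]
header, rows = parse_precomputed_tsv(demo_lines)
print(header)
print(rows)
```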

Each precomputed data file available for download on the "Precomputed files" page contains the complete data set for the FlyBase release. Please note, if you are only looking for information on a defined subset of genes, or other FlyBase data type, you can query the current set of precomputed data files through the Batch Download tool to obtain the data you require. This approach is described in more detail in the "How to Download Field Data" section of the help document at the bottom of the Batch Download page.


Contents of the precomputed data text files listed by section

Main data set

Postgres Chado database dump

Chado database (ftp://ftp.flybase.net/releases/FB*/psql)

The entire SQL Chado database is available for download. Follow the "README" directions herein.

Drosophila data

Current FTP repository (ftp://ftp.flybase.net/releases/FB*/)

All files for this current FlyBase release are available on this FTP site.

Current Chado-XML repository (ftp://ftp.flybase.net/releases/FB*/chado-xml)

All Chado XML files for this current FlyBase release are available on this FTP site.

Current Reporting-XML repository (ftp://ftp.flybase.net/releases/FB*/reporting-xml)

All Reporting XML files for this current FlyBase release are available on this FTP site.

Genomes FTP archive (ftp://ftp.flybase.net/genomes/)

All FlyBase genome and genome annotation files are available for each of the 12 sequenced Drosophila species. Formats include Chado XML, DNA, FASTA, GFF and GTF. Files from both the current release and previous FlyBase releases are offered.

Synonyms

FlyBase synonyms (fb_synonym_*.tsv)

The file reports current symbols and synonyms for the following objects in FlyBase: genes, alleles, balancers, aberrations, insertions, recombinant constructs, transcripts, and proteins.

The file includes:

nuclear genes located to the sequence
mitochondrial genes
genes not located to the sequence

Columns are:

  1. Primary FlyBase identifier for the object
  2. Current symbol used in FlyBase for the object
  3. Current full name used in FlyBase for the object
  4. Non-current full name(s) associated with the object
  5. Non-current symbol(s) associated with the object


Genes

Genes data (Chado XML or Reporting XML)

Genetic interaction table (gene_genetic_interactions_*.tsv)

The file reports the summary of gene-level genetic interactions in FlyBase. This data is computed from the allele-level genetic interaction data captured by FlyBase curators.

The file includes information for Dmel genes.

Interactions involving any of the following kinds of allele are considered when the gene-level genetic interaction data is computed:

classical mutations,
alleles carried on transgenic constructs,
loss-of-function mutations,
gain-of-function mutations.

Columns are:

  1. Starting_gene(s)_symbol
    Current FlyBase symbol of gene(s) involved in the starting genotype.
  2. Starting_gene(s)_FBgn
    Current FlyBase identifier (FBgn#) of gene(s) involved in the starting genotype.
  3. Interacting_gene(s)_symbol
    Current FlyBase symbol of gene(s) involved in the interacting genotype.
  4. Interacting_gene(s)_FBgn
    Current FlyBase identifier (FBgn#) of gene(s) involved in the interacting genotype.
  5. Interaction_type
    Type of interaction observed, either 'suppressible' or 'enhanceable'.
  6. Publication_FBrf
    Current FlyBase identifier (FBrf#) of publication from which the data came.

'suppressible' in column 5 indicates that phenotypes caused by mutation of the gene(s) listed in the starting genotype (column 1) are suppressed by mutation of the gene(s) listed in the interacting genotype (column 3).

'enhanceable' in column 5 indicates that phenotypes caused by mutation of the gene(s) listed in the starting genotype (column 1) are enhanced by mutation of the gene(s) listed in the interacting genotype (column 3).

e.g.

Pten FBgn0026379 Akt1 FBgn0010379 suppressible FBrf0127089

indicates that phenotype(s) caused by a mutation of Pten are suppressed by a mutation of Akt1.

Where multiple genes are simultaneously mutated in the starting genotype, the interacting genotype, or both, the genes involved are separated by a '|' in the relevant columns. The symbols in column 1 and the IDs in column 2 are listed in the same order, as are those in columns 3 and 4, so that the FBgn corresponding to the symbol for each gene can easily be identified.

e.g.

robo1|sli FBgn0005631|FBgn0264089 RhoGAP93B FBgn0038853 enhanceable FBrf0191476

indicates that:

  • phenotype(s) caused by a robo1, sli double mutant combination are enhanced by a mutation of RhoGAP93B.
  • FBgn0005631 corresponds to robo1, FBgn0264089 corresponds to sli

Each row contains information from a single reference. Thus, if the same genetic interaction has been reported in multiple references, multiple rows will exist for that genetic interaction in the file.
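Because symbols and identifiers are listed in the same order within '|'-separated columns, they can be paired up directly. The Python sketch below illustrates this with the second example row from this section; the function name parse_interaction and the shape of the returned dictionary are arbitrary choices, not part of the file format.

```python
def parse_interaction(line):
    """Pair gene symbols with FBgn IDs for one genetic interaction row."""
    cols = line.rstrip("\r\n").split("\t")
    start_symbols, start_ids, inter_symbols, inter_ids, kind, fbrf = cols[:6]
    return {
        # zip() relies on the documented matching order of symbols and IDs
        "starting": list(zip(start_symbols.split("|"), start_ids.split("|"))),
        "interacting": list(zip(inter_symbols.split("|"), inter_ids.split("|"))),
        "type": kind,
        "reference": fbrf,
    }


row = ("robo1|sli\tFBgn0005631|FBgn0264089\tRhoGAP93B\tFBgn0038853"
       "\tenhanceable\tFBrf0191476")
parsed = parse_interaction(row)
print(parsed["starting"])
print(parsed["interacting"], parsed["type"], parsed["reference"])
```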

RNA-Seq RPKM values (gene_rpkm_report_fb_*.tsv.gz)

This file reports gene expression values based on RNA-Seq experiments, calculated as reads per kilobase per million reads (RPKM). RPKM values are calculated only for the unique exonic regions of the gene (excluding segments that overlap other genes), except for genes derived from dicistronic/polycistronic transcripts, in which case all exon regions are used in the RPKM expression calculation.

Columns are:

  1. Release_ID
    The D. melanogaster annotation set version from which the gene model used in the analysis derives.
  2. FBgn#
    The unique FlyBase gene ID for this gene.
  3. GeneSymbol
    The official FlyBase symbol for this gene.
  4. Parent_library_FBlc#
    The unique FlyBase ID for the dataset project to which the RNA-Seq experiment belongs.
  5. Parent_library_name
    The official FlyBase symbol for the dataset project to which the RNA-Seq experiment belongs.
  6. RNASource_FBlc#
    The unique FlyBase ID for the RNA-Seq experiment used for RPKM expression calculation.
  7. RNASource_name
    The official FlyBase symbol for the RNA-Seq experiment used for RPKM expression calculation.
  8. RPKM_value
    The RPKM expression value for the gene in the specified RNA-Seq experiment.
  9. Bin_value
    The expression bin classification of this gene in this RNA-Seq experiment, based on RPKM value. Bins range from 1 (no/extremely low expression) to 8 (extremely high expression).
  10. Unique_exon_base_count
    The number of exonic bases unique to the gene (not overlapping exons of other genes). Field will be blank for genes derived from dicistronic/polycistronic transcripts.
  11. Total_exon_base_count
    The number of bases in all exons of this gene.
  12. Count_used
    Indicates if the RPKM expression value was calculated using only the exonic regions unique to the gene and not overlapping exons of other genes (Unique), or, if the RPKM expression value was calculated based on all exons of the gene regardless of overlap with other genes (Total). RPKM expression values are typically reported for the "Unique" count, except for genes on dicistronic/polycistronic transcripts, in which case the "Total" count is reported.
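For orientation, the general definition of RPKM (reads per kilobase per million reads) can be written as a short calculation. The Python sketch below shows only that textbook definition together with the choice between the "Unique" and "Total" base counts described in columns 10 to 12. It is not the FlyBase pipeline: how reads are counted and normalized there is not described in this document, and the function names and numbers are invented for illustration.

```python
def rpkm(read_count, exon_bases, total_mapped_reads):
    """Generic RPKM: reads per kilobase of exon per million mapped reads."""
    return read_count * 1e9 / (exon_bases * total_mapped_reads)


def exon_bases_used(count_used, unique_exon_base_count, total_exon_base_count):
    """Pick the base count named by the Count_used column (Unique or Total)."""
    if count_used == "Unique":
        chosen = unique_exon_base_count
    else:
        # Column 10 may be blank for dicistronic/polycistronic genes,
        # in which case the Total count applies.
        chosen = total_exon_base_count
    return int(chosen)


# Invented numbers: 500 reads over 2000 unique exonic bases, 10 million reads.
bases = exon_bases_used("Unique", "2000", "3000")
print(rpkm(500, bases, 10_000_000))
```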

Physical interaction table (physical_interactions_fb_*.tsv.gz)

This file reports unique gene pairs with curated support for some type of physical interaction. The file does not currently distinguish between genes that are involved in protein-protein or RNA-protein interactions (or both).

Columns are:

  1. gene_FBgn1
    The unique FlyBase gene ID for the first gene of the interacting pair.
  2. gene_symbol1
    The official FlyBase symbol for the first gene of the interacting pair.
  3. gene_FBgn2
    The unique FlyBase gene ID for the second gene of the interacting pair.
  4. gene_symbol2
    The official FlyBase symbol for the second gene of the interacting pair.
  5. FBrf(s)
    The unique FlyBase IDs for the publications supporting this interaction.
  6. FBig_id
    The unique FlyBase ID for this pairwise interaction.
  7. #_reported_interactions
    The number of distinct experiments in support of this interaction.

FBgn <=> DB accession IDs (fbgn_NAseq_Uniprot_*.tsv)

The file reports EMBL/GenBank/DDBJ nucleotide and protein accessions, UniProtKB/SwissProt/TrEMBL protein accessions, NCBI Entrez gene IDs and NCBI RefSeq transcript and protein accessions associated with FlyBase genes.

The file includes:

nuclear genes with sequence accession numbers
mitochondrial genes

it excludes:

genes without sequence accession numbers

Columns are:

  1. Current symbol of gene
  2. Current FlyBase identifier (FBgn#) of gene
  3. EMBL/GenBank/DDBJ nucleotide accession associated with the gene
  4. EMBL/GenBank/DDBJ protein accession associated with the gene and the nucleotide accession in column 3
  5. UniProtKB/SwissProt/TrEMBL protein accession associated with the gene
  6. NCBI Entrez ID associated with the gene
  7. NCBI RefSeq transcript accession associated with the gene
  8. NCBI RefSeq protein accession associated with the gene and the transcript accession in column 7

Each row contains information about a single accession associated with a gene, thus if a gene has multiple accessions associated with it, multiple rows will exist for that gene in the file.

A single row contains only information about an EMBL/GenBank/DDBJ accession or information about a UniProtKB/SwissProt/TrEMBL accession or an NCBI Entrez gene ID or an NCBI RefSeq transcript accession.

For rows containing information about an EMBL/GenBank/DDBJ accession, a nucleotide accession associated with the gene is listed in column 3. If there is also an EMBL/GenBank/DDBJ protein accession associated with that gene and with the nucleotide accession in column 3, this protein accession is listed in column 4. In this case, columns 5, 6, 7 and 8 are always empty.

For rows containing information about a UniProtKB/SwissProt/TrEMBL protein accession, a protein accession associated with the gene is listed in column 5. In this case, columns 3, 4, 6, 7 and 8 are always empty.

For rows containing information about an NCBI Entrez gene, an ID associated with the gene is listed in column 6. In this case, columns 3, 4, 5, 7 and 8 are always empty.

For rows containing information about an NCBI RefSeq accession, a transcript accession associated with the gene is listed in column 7. If there is also an NCBI RefSeq protein accession associated with that gene and with the transcript accession in column 7, this protein accession is listed in column 8. In this case, columns 3, 4, 5 and 6 are always empty.
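Since each row carries only one kind of accession, a row can be labeled by checking which columns are filled. The Python sketch below follows the column rules described above; the function name, the labels it returns, and the placeholder accession strings (NA_ACC, PROT_ACC, UNIPROT_ACC) are invented for illustration.

```python
def classify_accession_row(cols):
    """Label one row by the kind of accession it carries."""
    # Pad short rows so that trailing empty columns can be unpacked.
    padded = list(cols) + [""] * max(0, 8 - len(cols))
    (symbol, fbgn, na_acc, na_prot,
     uniprot, entrez, refseq_tr, refseq_pp) = padded[:8]
    if na_acc:
        kind, values = "INSDC", (na_acc, na_prot)       # columns 3 and 4
    elif uniprot:
        kind, values = "UniProtKB", (uniprot,)          # column 5
    elif entrez:
        kind, values = "Entrez", (entrez,)              # column 6
    elif refseq_tr:
        kind, values = "RefSeq", (refseq_tr, refseq_pp)  # columns 7 and 8
    else:
        kind, values = "unknown", ()
    return {"symbol": symbol, "FBgn": fbgn, "kind": kind,
            "accessions": [v for v in values if v]}


# Placeholder accessions, not real database entries.
print(classify_accession_row(["Pten", "FBgn0026379", "NA_ACC", "PROT_ACC"]))
print(classify_accession_row(["Pten", "FBgn0026379", "", "", "UNIPROT_ACC"]))
```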

FBgn <=> Annotation ID (fbgn_annotation_ID_*.tsv)

The file reports current and secondary FlyBase identifiers associated with FlyBase genes, including current and secondary gene identifiers (FBgn#), and current and secondary annotation identifiers (CG#).

The file includes:

nuclear genes located to the sequence
mitochondrial genes

it excludes:

genes not located to the sequence

Columns are:

  1. Current symbol of gene
  2. Current FlyBase identifier (FBgn#) of gene
  3. Secondary FlyBase identifier(s) (FBgn#) associated with the gene
  4. Current annotation identifier associated with the gene
  5. Secondary annotation identifier(s) associated with the gene

Please Note: If a gene has multiple secondary identifiers, all the values are stored within one tab separated column and are separated by commas (for example as: FBgn0034701,FBgn0034702).
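A common use of this file is to translate an old (secondary) FBgn into a current one. The Python sketch below builds such a lookup from already-split rows; it is an illustration only. The function name is hypothetical, the example row is invented apart from the two secondary IDs quoted above, and each secondary ID maps to a list because the same secondary ID could in principle be associated with more than one current gene (an assumption).

```python
def build_secondary_id_map(rows):
    """Map each secondary FBgn (column 3) to the current FBgn(s) (column 2)."""
    mapping = {}
    for row in rows:
        current = row[1]
        secondary = row[2] if len(row) > 2 else ""
        for old_id in secondary.split(","):
            old_id = old_id.strip()
            if old_id:
                mapping.setdefault(old_id, []).append(current)
    return mapping


# Invented row: symbol, current FBgn, and annotation IDs are placeholders.
rows = [["geneA", "FBgn0000001", "FBgn0034701,FBgn0034702", "CG00001", ""]]
lookup = build_secondary_id_map(rows)
print(lookup)
```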

FBgn <=> GLEANR IDs (fbgn_gleanr_*.tsv)

This file reports the relationship between the symbols and gene identifiers used by FlyBase for non-melanogaster genes identified by the AAA consortium, and the GLEANR identifier assigned to the gene during the initial annotation of the genome sequence.

The file includes:

non-melanogaster genes located to the sequence

it excludes:

D. melanogaster genes
non-melanogaster genes not located to the sequence

Columns are:

  1. Current FlyBase gene symbol
  2. Current FlyBase identifier (FBgn#) of the gene
  3. GLEANR identifier assigned by the AAA Consortium.

FBgn <=> FBtr <=> FBpp IDs (fbgn_fbtr_fbpp_*.tsv)

This file reports the relationship of gene identifiers used by FlyBase for sequence localized genes, and the identifiers used for the transcript and polypeptide products of these genes.

The file includes:

genes located to the sequence

it excludes:

genes not located to the sequence

Columns are:

  1. Current FlyBase identifier (FBgn#) of the gene
  2. Current FlyBase identifier (FBtr#) of a transcript encoded by the gene listed in column 1.
  3. Current FlyBase identifier (FBpp#) of a polypeptide encoded by the transcript listed in column 2, where this is relevant.


FBgn exons <=> Affy1 (fbgn_exons2affy1_overlaps.tsv)

The file is generated by testing for overlaps, no matter how small, of the locations of Affy1 oligos in the genome with the locations of gene exons, as defined by the Dmel gene models for the current release of FlyBase. If the location of an Affy1 oligo shows any kind of overlap with an exon of a gene, a Gene=>Affy reference is recorded in this file.

The extent of the overlap has no influence on the inclusion of a crossreference in this file. The overlap might be just one nucleotide, or it could be an exact match to the exon. For interpretation of the significance of a partial overlap please contact Affymetrix.

The file includes the following Dmel genes:

nuclear genes located to the sequence

it excludes:

genes not located to the sequence
mitochondrial genes

Each line of the file can contain many tab separated columns:

The first column of a line contains the valid FlyBase identifier of a gene. Each Affy1 ID that overlaps with an exon of the gene, as described above, is listed in an additional tab separated column. Thus, this file does not contain a predefined number of columns.
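Because the number of columns varies from line to line, this file is easier to read line by line than with a fixed-width table reader. The Python sketch below shows one way to do that; the function name and the Affy IDs in the demonstration are invented placeholders, and skipping lines that start with '#' is an assumption about the file.

```python
def parse_affy_overlaps(lines):
    """Map each FBgn (first column) to the list of overlapping Affy IDs."""
    overlaps = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and any comment lines (assumption)
        fbgn, *affy_ids = line.split("\t")
        overlaps[fbgn] = [affy_id for affy_id in affy_ids if affy_id]
    return overlaps


# Placeholder identifiers, not real probe set IDs.
demo = [
    "FBgn0000001\tAFFY_A\tAFFY_B\tAFFY_C",
    "FBgn0000002\tAFFY_D",
]
print(parse_affy_overlaps(demo))
```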

FBgn exons <=> Affy2 (fbgn_exons2affy2_overlaps.tsv)

The file is generated from the location of Affy2 oligos exactly as described for Affy1 oligos above.

Genes GO data (gene_association.fb)

The file contains the Gene Ontology (GO) controlled vocabulary (CV) terms assigned to FlyBase genes.

The file includes the following Dmel genes:

nuclear genes located to the sequence
mitochondrial genes
genes not located to the sequence

The columns of the file are described in section G.3.1. of the Reference manual.

Genes map table (gene_map_table_*.tsv)

The file reports available localization information for FlyBase genes.

It includes:

nuclear genes located to the sequence
mitochondrial genes
genes not located to the sequence

Columns are:

  1. Current FlyBase gene symbol
  2. Current FlyBase identifier (FBgn#) of gene
  3. recombination map location
  4. cytogenetic location
  5. genomic location

Automated gene summaries (automated_gene_summaries.tsv)

The file contains the summaries found on gene report pages and the pop-ups in GBrowse and Interactions Browser in plain text.

It includes:

nuclear genes located to the sequence
mitochondrial genes
genes not located to the sequence

The tab delimited file contains two columns:

  1. FlyBase ID. The valid FlyBase identifier number for the gene.
  2. The gene summary as a string of plain text.

Gene Snapshots (gene_snapshots_*.tsv)

The file contains in plain text the gene snapshot information visible on gene report pages.

It includes only Dmel protein coding genes.

Columns are:

  1. FBgn_ID
    Current FlyBase identifier number for the gene
  2. GeneSymbol
    Current FlyBase symbol of the gene
  3. GeneName
    Current FlyBase name of the gene
  4. datestamp
    Date on which the information was reviewed
  5. gene_snapshot_text
    Gene snapshot information on the gene. Cases that are in progress or are deemed to have insufficient data to summarize are stated as such.

Unique protein isoforms (dmel_unique_protein_isoforms_fb_*.tsv.gz)

The file reports D. melanogaster genes and their unique protein isoforms.

The file includes:

melanogaster genes located to the sequence

it excludes:

melanogaster genes not located to the sequence
non-melanogaster genes

Columns are:

  1. Current FlyBase identifier (FBgn#) of the D. melanogaster gene
  2. Current FlyBase gene symbol of the D. melanogaster gene
  3. Current FlyBase protein symbol of the representative protein isoform
  4. Current FlyBase protein symbol(s) of identical protein isoforms

Non-coding RNA genes (ncRNA_genes_fb_*.tsv.gz)

This file reports all genes encoding ncRNAs for D. melanogaster and 11 other sequenced Drosophila species, as submitted to RNAcentral (http://rnacentral.org/). Pseudogenes are excluded.

Columns are:

  1. accession_id
    INSDC accession ID
  2. FB_gene_ID
    Current FlyBase gene identifier (FBgn#)
  3. species_FB_annotation_ID(locus_tag)
    Current FlyBase annotation ID, in the form "<species abbreviation>_<annotation_ID>", which equates to the 'locus tag' field in INSDC records.

Gene groups

Gene group data (gene_group_data_*.tsv)

This file reports all Gene Groups in FlyBase, together with their hierarchical relationships (where relevant) and member genes.

Where groups are arranged into hierarchies, note that: i) the member genes are only associated with the terminal subgroups; and ii) the immediate parent of any subgroup is identified in the 'Parent_FB_group_id' and 'Parent_FB_group_symbol' columns.

Also note that separate lines are used for each member gene, meaning that each terminal group is listed multiple times (equal to the number of member genes).

Columns are:

  1. FB_group_id
    Current FlyBase identifier (FBgg##) of Gene Group
  2. FB_group_symbol
    Current FlyBase symbol of Gene Group
  3. FB_group_name
    Current FlyBase full name of Gene Group
  4. Parent_FB_group_id
    Current FlyBase identifier (FBgg##) of parent of given Gene Group (if relevant)
  5. Parent_FB_group_symbol
    Current FlyBase symbol of parent of given Gene Group (if relevant)
  6. Group_member_FB_gene_id
    Current FlyBase identifier (FBgn##) of member gene (if terminal group)
  7. Group_member_FB_gene_symbol
    Current FlyBase symbol of member gene (if terminal group)
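
Because member genes are attached only to terminal subgroups, collecting every gene under a higher-level group means walking down the hierarchy. The following is a minimal Python sketch of that walk, not an official FlyBase tool. It assumes the file is tab-separated with the seven columns in the order listed above, that header or comment lines begin with '#', and that each group has at most one immediate parent; the function names are illustrative.

```python
from collections import defaultdict

def load_gene_groups(lines):
    """Build (parent_of, members) from rows of the gene group file."""
    parent_of = {}               # FBgg ID -> immediate parent FBgg ID, or None
    members = defaultdict(set)   # terminal FBgg ID -> set of member FBgn IDs
    for line in lines:
        # Assumption: header/comment lines begin with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\r\n").split("\t")
        cols += [""] * (7 - len(cols))   # pad rows whose trailing columns are empty
        group_id, parent_id, gene_id = cols[0], cols[3], cols[5]
        parent_of[group_id] = parent_id or None
        if gene_id:
            members[group_id].add(gene_id)
    return parent_of, members

def all_members(group_id, parent_of, members):
    """Member genes of a group plus those of every subgroup beneath it."""
    genes = set(members.get(group_id, ()))
    for child, parent in parent_of.items():
        if parent == group_id:
            genes |= all_members(child, parent_of, members)
    return genes
```

Any iterable of text lines can be passed to load_gene_groups, for example an open file handle.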

Gene groups with HGNC IDs (gene_groups_HGNC_*.tsv)

This file reports all Gene Groups in FlyBase, together with the corresponding HGNC 'gene family' ID (where relevant).

The absence of an HGNC_family_ID entry indicates there is no equivalent HGNC gene family for that FlyBase gene group.

Because of different sub-group structures (etc), a single HGNC family may be associated with multiple FlyBase gene groups. Similarly, a single FlyBase gene group may be associated with multiple HGNC gene families - these are shown on separate lines.

Columns are:

  1. FB_group_id
    Current FlyBase identifier (FBgg##) of Gene Group
  2. FB_group_symbol
    Current FlyBase symbol of Gene Group
  3. FB_group_name
    Current FlyBase full name of Gene Group
  4. HGNC_family_ID
    HGNC ID of equivalent human 'gene family'
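
Since the relationship between FlyBase gene groups and HGNC families can be many-to-many, it may help to index the file in both directions. The sketch below does this under the assumption that the file is tab-separated with the four columns above and that comment lines begin with '#'; the function name is hypothetical.

```python
from collections import defaultdict

def map_groups_to_hgnc(lines):
    """Return (fb_to_hgnc, hgnc_to_fb) as dictionaries of sets."""
    fb_to_hgnc = defaultdict(set)   # FBgg ID -> HGNC family IDs (may be empty)
    hgnc_to_fb = defaultdict(set)   # HGNC family ID -> FBgg IDs
    for line in lines:
        # Assumption: header/comment lines begin with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\r\n").split("\t")
        group_id = cols[0]
        hgnc_id = cols[3].strip() if len(cols) > 3 else ""
        if hgnc_id:
            fb_to_hgnc[group_id].add(hgnc_id)
            hgnc_to_fb[hgnc_id].add(group_id)
        else:
            # No equivalent HGNC family: record the group with an empty set.
            fb_to_hgnc.setdefault(group_id, set())
    return fb_to_hgnc, hgnc_to_fb
```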

Alleles and stocks

Allele data (Chado XML or Reporting XML)

Stock data (e.g. stocks_*.tsv.gz)

This file reports genetic components and related information about Stocks in FlyBase. The following provides a brief description of the columns in the file:

  1. FBst. The unique identifier assigned to this stock by FlyBase. Example: FBst0025115
  2. collection_short_name. A short name for the stock collection that holds the stock. Example: Szeged
  3. stock_type_cv. The controlled vocabulary term and unique identifier that describe the state of the stock. Example: living_stock ; FBcv:0010000
  4. species. The FlyBase four-letter Species Abbreviations for the species of the stock. Example: Dmel
  5. FB_genotype. Genetic components of the stock corresponding to alleles, aberrations, balancers, or insertions in FlyBase. May be empty. Example: P{EP}wun2[EP2217]
  6. description. Genetic components of the stock as provided to FlyBase by the collection that holds the stock. Example: EP(2)2217
  7. stock_number. The stock identifier provided to FlyBase by the collection that holds the stock. May be empty. Example: 0000-1006.01

Genetic interactions (allele_genetic_interactions_*.tsv)

The file reports controlled vocabulary (i.e. not free text) genetic interaction data associated with alleles. This is the data reported in the "Phenotypic Class" and "Phenotype Manifest in" subsections of the "Interactions" section of each Allele Report.

Columns are:

  1. Current FlyBase allele symbol
  2. Current FlyBase identifier (FBal#) of allele
  3. Interaction information associated with allele
  4. Current FlyBase identifier (FBrf#) of publication from which data came

Phenotypic data (allele_phenotypic_data_*.tsv)

The file reports controlled vocabulary (i.e. not free text) phenotypic data associated with alleles. This is the data reported in the "Phenotypic Class" and "Phenotype Manifest in" subsections of the "Phenotypic Data" section of each Allele Report.

Columns are:

  1. Current FlyBase allele symbol
  2. Current FlyBase identifier (FBal#) of allele
  3. Phenotypic data associated with allele
  4. Current FlyBase identifier (FBrf#) of publication from which data came

Alleles <=> Genes (fbal_to_fbgn_fb_*.tsv)

This file reports the relationship between gene identifiers and the identifiers used for alleles of these genes.

Columns are:

  1. Current FlyBase identifier (FBal#) of the allele
  2. Current symbol of the allele
  3. Current FlyBase identifier (FBgn#) of the gene
  4. Current symbol of the gene

Orthologs

Drosophila Orthologs (dmel_orthologs_in_drosophila_species_fb_*.tsv.gz)

The file reports D. melanogaster genes and their orthologs in other sequenced Drosophila genomes, as determined by OrthoDB. (The version of OrthoDB currently being used is shown in the 'Orthologs' -> 'Orthologs (via OrthoDB)' section of a Gene Report.)

The file includes:

nuclear genes located to the sequence

it excludes:

genes not located to the sequence
mitochondrial genes

Columns are:

  1. Current FlyBase identifier (FBgn#) of D. melanogaster gene
  2. Current FlyBase gene symbol of D. melanogaster gene
  3. Arm upon which D. melanogaster gene is localized
  4. Location of D. melanogaster gene on the arm
  5. Strand of D. melanogaster gene ('1' indicates the positive strand, '-1' indicates the negative strand)
  6. Current FlyBase identifier (FBgn#) of non-melanogaster orthologous gene
  7. Current FlyBase gene symbol of non-melanogaster orthologous gene
  8. Arm upon which non-melanogaster orthologous gene is localized
  9. Location of non-melanogaster orthologous gene on the arm
  10. Strand of non-melanogaster orthologous gene ('1' indicates the positive strand, '-1' indicates the negative strand)
  11. OrthoDB orthology group ID to which the pair-wise association belongs.

Each row is a pair-wise association between a D. melanogaster gene and a non-melanogaster ortholog. Thus, multiple rows exist for each D. melanogaster gene in the file.
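
As the rows are pair-wise, a common first step is to regroup them by D. melanogaster gene. The sketch below does so and also splits the orthologs by species, taking the species abbreviation from the part of the ortholog symbol before the '\'. It assumes a tab-separated file with the eleven columns above and comment lines beginning with '#'; the function name is illustrative.

```python
from collections import defaultdict

def orthologs_by_species(lines):
    """Map Dmel FBgn -> {species abbreviation -> [ortholog FBgn, ...]}."""
    result = defaultdict(lambda: defaultdict(list))
    for line in lines:
        # Assumption: header/comment lines begin with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < 11:
            continue   # skip malformed rows
        dmel_id, ortho_id, ortho_symbol = cols[0], cols[5], cols[6]
        # The species prefix precedes the backslash, e.g. "Dana" in "Dana\...".
        if "\\" in ortho_symbol:
            species = ortho_symbol.split("\\", 1)[0]
        else:
            species = "unknown"
        result[dmel_id][species].append(ortho_id)
    return result
```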

Human Orthologs (dmel_human_orthologs_disease_fb_*.tsv.gz)

This file reports the human orthologs of D. melanogaster genes using the DIOPT dataset. Each line reports a single orthologous pair, which means that each human and D. melanogaster gene can appear in multiple lines. Note that ortholog calls supported by only 1 or 2 algorithms (DIOPT score <3) have been removed. Human genes are also associated with diseases (OMIM phenotypes) using the OMIM dataset.

Columns are:

  1. Current FlyBase identifier (FBgn#) of D. melanogaster gene
  2. Current FlyBase gene symbol of D. melanogaster gene
  3. HGNC ID of orthologous human gene
  4. OMIM ID of orthologous human gene
  5. HGNC gene symbol of orthologous human gene
  6. DIOPT 'score' for orthology call (i.e. the number of individual algorithms that support the call)
  7. OMIM Phenotype ID (and name in parentheses) - multiple phenotypes are separated by a comma
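
Because each gene can appear on several lines, one way to use the file is to keep only the highest-scoring human ortholog for each fly gene. The sketch below assumes a tab-separated file with the columns above, an integer in the DIOPT score column, and comment lines beginning with '#'; the function name and the min_score parameter are illustrative.

```python
def best_human_orthologs(lines, min_score=3):
    """Map Dmel FBgn -> (HGNC symbol, DIOPT score) for the top-scoring call."""
    best = {}
    for line in lines:
        # Assumption: header/comment lines begin with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < 6:
            continue
        fbgn, hgnc_symbol = cols[0], cols[4]
        try:
            score = int(cols[5])
        except ValueError:
            continue   # score column is not an integer; skip the row
        if score < min_score:
            continue
        # Keep the first row seen among ties.
        if fbgn not in best or score > best[fbgn][1]:
            best[fbgn] = (hgnc_symbol, score)
    return best
```

Raising min_score gives a stricter set of calls; ties are resolved in favour of the first row encountered.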

Human disease

Human disease model data (allele_human_disease_model_data_fb_*.tsv.gz)

This file reports all experimental-based disease model annotations, associated with alleles, that have been curated for D. melanogaster. 'Alleles' encompasses both classical alleles and transgenic alleles; the latter may relate to transgenic constructs of D. melanogaster genes or non-D. melanogaster genes, often human genes. These are the data reported in the "Human Disease Model Data" -> "Disease Ontology" section of the Allele Report, which are repeated in the "Human Disease Model Data" -> "Alleles Reported to Model Human Disease (Disease Ontology)" section of the Gene Report.

Columns are:

  1. Current FlyBase identifier (FBal#) of allele
  2. Current FlyBase symbol of allele
  3. Annotation qualifier - one of 'model of', 'ameliorates', 'exacerbates', 'DOES NOT model', 'DOES NOT ameliorate' or 'DOES NOT exacerbate'
  4. Disease Ontology term
  5. Disease Ontology ID
  6. Evidence code, with interacting allele(s) where appropriate. Evidence code is one of: 'inferred from mutant phenotype', 'in combination with', 'modeled by', 'is ameliorated by', 'is exacerbated by', 'is NOT ameliorated by' or 'is NOT exacerbated by'. Interacting alleles are given as 'FLYBASE:<allele_symbol>; FB:<FBal_ID>', with multiple alleles separated by a comma
  7. Current FlyBase identifier (FBrf#) of the publication from which the data came
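
The evidence column (column 6) combines an evidence code with zero or more interacting alleles, so it usually needs to be split before use. The sketch below is one possible way to do that with a regular expression. It assumes that the evidence code precedes the first 'FLYBASE:' token and that allele IDs consist of 'FBal' followed by digits; the function name is illustrative.

```python
import re

# Matches one 'FLYBASE:<allele_symbol>; FB:<FBal_ID>' entry.
ALLELE_PATTERN = re.compile(r"FLYBASE:(.+?); FB:(FBal\d+)")

def parse_evidence(text):
    """Split an evidence field into (evidence code, [(symbol, FBal ID), ...])."""
    # Assumption: the evidence code is the text before the first allele entry.
    code = text.split("FLYBASE:", 1)[0].strip()
    alleles = ALLELE_PATTERN.findall(text)
    return code, alleles
```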


Human Orthologs (dmel_human_orthologs_disease_fb_*.tsv.gz)

This is identical to the file listed under the Orthologs section above.

Nomenclature

Species abbreviation list (species-ab.gz)

The species-abbreviations.txt file lists all the species for which FlyBase has some information. FlyBase includes gene reports for genes derived from species within the family Drosophilidae, as well as gene reports for non-drosophilid genes ("foreign genes") that have been introduced into Drosophila via transgenic constructs and for engineered objects such as a fusion gene between two D. melanogaster genes. In addition, information about non-drosophilid species is also displayed in GBrowse, for example in the "Similarity: Proteins" evidence tier. Thus, the file contains information for both drosophilid and non-drosophilid species.

There are 8 columns of data in the file, each separated by " | ".

  1. Internal_id. The Primary FlyBase identifier of the organism.
  2. Taxgroup. A grouping term, currently one of "drosophilid", "non-drosophilid eukaryote", "prokaryote", "transposable element" or "virus".
  3. Abbreviation. The standard FlyBase prefix for the species. This abbreviation is used in FlyBase as the first part of the symbol (before the '\') of any object, e.g. a gene or allele, that originates from this species. This column may be blank, if data from a species is displayed in an evidence tier on GBrowse but no individual report page exists for that species in FlyBase.
  4. Genus. The genus name of the organism.
  5. Species name. The species name of the organism.
  6. Common name. The common name of the organism. This column may be blank.
  7. Comment. A free text field for additional comments. This column may be blank.
  8. Ncbi-taxon-id. The NCBI Taxonomy Database Taxon ID for the organism. This column may be blank.
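
A line of this file can be split into its eight fields as sketched below. This is a minimal example, assuming that the "|" character never occurs inside a field and that comment lines begin with '#'; the Species type and the function name are illustrative, not part of FlyBase.

```python
from typing import NamedTuple, Optional

class Species(NamedTuple):
    internal_id: str
    taxgroup: str
    abbreviation: str    # may be blank
    genus: str
    species: str
    common_name: str     # may be blank
    comment: str         # may be blank
    ncbi_taxon_id: str   # may be blank

def parse_species_line(line: str) -> Optional[Species]:
    """Parse one ' | '-separated line; return None for non-data lines."""
    # Assumption: header/comment lines begin with '#'.
    if not line.strip() or line.startswith("#"):
        return None
    # Split on the bar and trim the padding so blank columns become "".
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 8:
        return None
    return Species(*fields)
```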

An html version of this file is also available - see the Species Abbreviations page.

Ontology Terms

Frozen files used for this release of FlyBase

FBbt: fly_anatomy (fly_anatomy.obo.gz)
FBdv: fly_development (fly_development.obo.gz)
FBcv: controlled vocab (flybase_controlled_vocabulary.obo.gz)
FBsv: stock ontology (flybase_stock_vocabulary.obo.gz)
GO: gene ontology (go-basic.obo.gz)
FBbi: image ontology (image.obo.gz)
SO: sequence ontology (so.obo.gz)
DO: human disease ontology (doid.obo.gz)

Current 'Live' Files

FBbt: fly_anatomy (fly_anatomy.obo)

This link points to the ontology version fbbt-simple.obo, which lacks a few minor FlyBase-specific changes that are present in fly_anatomy.obo.

FBdv: fly_development (fly_development.obo)

This link points to the ontology version fbdv-simple.obo, which lacks a few minor FlyBase-specific changes that are present in fly_development.obo.

FBcv: controlled vocab (flybase_controlled_vocabulary.obo)

This link points to the ontology version fbcv-simple.obo, which lacks a few minor FlyBase-specific changes that are present in flybase_controlled_vocabulary.obo.

GO: gene ontology (go-basic.obo)
FBbi: image ontology (image.obo)
SO: sequence ontology (so-xp.obo)
DO: human disease ontology (doid.obo)

Genomes: annotation and sequence

All sequenced Drosophila species

Current FTP repository (ftp://ftp.flybase.net/releases/FB*)
FTP archive (ftp://ftp.flybase.net/genomes/)

Drosophila melanogaster (Dmel)

Drosophila ananassae (Dana)

Drosophila erecta (Dere)

Drosophila grimshawi (Dgri)

Drosophila mojavensis (Dmoj)

Drosophila persimilis (Dper)

Drosophila pseudoobscura pseudoobscura (Dpse)

Drosophila sechellia (Dsec)

Drosophila simulans (Dsim)

Drosophila virilis (Dvir)

Drosophila willistoni (Dwil)

Drosophila yakuba (Dyak)

Transcripts and polypeptides

Transcript data (Chado XML and Reporting XML)

Polypeptide data (Chado XML and Reporting XML)

Transposons, transgenic constructs, and insertions

Transgenic construct maps (construct_maps.zip)

The construct_maps.zip file unpacks as a directory containing maps of recombinant constructs and transgenic transposons generated by FlyBase, that are based on the compiled sequence data curated by FlyBase. The name of each PNG image in the directory corresponds to the FlyBase identifier of the respective recombinant construct or transgenic transposon.

Please note: For transgenic transposons, the image may be a map of the corresponding plasmid form.

Map data for insertions (insertion_mapping_*.tsv)

The insertion mapping table reports available localization information for Dmel insertions.

Columns are:

  1. Current symbol of insertion
  2. Current FlyBase identifier (FBti#) of insertion
  3. Genomic location of insertion
  4. Range (t/f) indicates whether genomic location is range or single base
  5. Orientation (1/0) indicates orientation of insertion on chromosome
  6. Estimated cytogenetic location based on correlation of genomic location and estimated genomic location of cytological bands
  7. Observed cytogenetic location reported in the literature

Transposable elements (canonical set) (transposon_sequence_set.embl.txt)

This is a file of 'canonical' sequences of the transposable elements from Drosophila maintained by M. Ashburner.

The first section of the file outlines the history and revisions to the file and also lists the current set of elements, their size and whether the subsequent sequence data is complete.

The second section of the file, which is separated from the first by a line of "_" characters, contains the sequence data of all the elements in EMBL format. The record for each element starts with a line prefixed by "ID" and ends with a line containing "//".
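
The two-section layout described above can be read with a small generator that skips the history section and then yields one record per element. This is a sketch only: it assumes the separator line consists solely of "_" characters and that the terminator line contains nothing but "//". The function name is illustrative.

```python
def iter_embl_records(lines):
    """Yield each EMBL record (a list of lines) from the sequence section."""
    in_sequence_section = False
    record = None
    for line in lines:
        text = line.rstrip("\r\n")
        if not in_sequence_section:
            # Assumption: the separator is a line made up only of underscores.
            stripped = text.strip()
            if stripped and set(stripped) == {"_"}:
                in_sequence_section = True
            continue
        if text.startswith("ID"):
            record = [text]          # start of a new element record
        elif record is not None:
            record.append(text)
            if text.strip() == "//":
                yield record         # end of the current record
                record = None
```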

Aberrations

Aberration data (Chado XML and Reporting XML)

Balancer data (Chado XML and Reporting XML)

Large dataset metadata

Dataset metadata members (dataset_metadata_fb_*.tsv.gz)

This file lists all features that are associated with a dataset/collection (e.g., genes, cDNA clones, TF_binding_sites, Affymetrix probes).

  1. Dataset_metadata_ID
    The unique FlyBase ID for the dataset.
  2. Dataset_metadata_name
    The official FlyBase symbol for the dataset.
  3. Item_ID
    The unique FlyBase ID for the feature associated with this dataset.
  4. Item_name
    The official FlyBase symbol for the feature associated with this dataset.

Clones

Clone data (Chado XML and reporting XML)

cDNAs: FBcl <=> acc. ID (cDNA_clone_data_*.tsv)

The file reports basic cDNA clone data in FlyBase.

Columns are:

  1. Current FlyBase identifier (FBcl#) of cDNA clone
  2. Clone name
  3. Name of library associated with clone
  4. EMBL/GenBank/DDBJ cDNA accession number
  5. EMBL/GenBank/DDBJ EST accession number

Each row contains information about a single cDNA or EST associated with a gene, thus if a gene has multiple cDNAs or ESTs associated with it, multiple rows will exist for that gene in the file.

A single row contains either information about a cDNA associated with the gene in column 1 or information about an EST associated with the gene in column 1, not both.

Genomic: FBcl <=> acc. ID (genomic_clone_data_*.tsv)

The file reports basic genomic clone data in FlyBase.

Columns are:

  1. Current FlyBase identifier (FBcl#) of genomic clone
  2. Clone name
  3. EMBL/GenBank/DDBJ accession number

References

Combined reference data (Chado XML and Reporting XML)

FlyBase FBrf <=> PubMed ID <=> PMCID <=> DOI (fbrf_pmid_pmcid_doi_fb_*.tsv.gz)

This file lists all publications in the FlyBase bibliography that have a PubMed ID. Additional identifiers are listed as applicable. The FlyBase release in which each publication was first incorporated into the bibliography is also given (see the pmid_added column).

Columns are:

  1. FBrf
    The unique FlyBase ID for this publication.
  2. PMID
    The unique PubMed ID for this publication.
  3. PMCID
    The unique PubMed Central ID for this publication, if applicable.
  4. DOI
    The digital object identifier assigned to the publication.
  5. pub_type
    The publication type (for example, paper, review, erratum, abstract, book, etc.)
  6. miniref
    A short citation listing the first author, year of publication, journal, volume, issue and page numbers.
  7. pmid_added
    The FlyBase release in which the publication was first incorporated into the FlyBase bibliography. As this report was first generated for the fb_2012_01 release, all publications associated with a PubMed ID prior to this release have pmid_added = fb_2011_10.
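
A typical use of this file is to translate PubMed IDs into FlyBase reference IDs. The sketch below builds such a lookup, assuming a tab-separated file with the columns in the order above and comment lines beginning with '#'. The function name is illustrative, and the file name in the trailing comment is a placeholder for the actual release file.

```python
def pmid_to_fbrf(lines):
    """Map PubMed ID (as a string) -> FBrf ID."""
    mapping = {}
    for line in lines:
        # Assumption: header/comment lines begin with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) >= 2 and cols[1]:
            mapping[cols[1]] = cols[0]
    return mapping

# Possible usage with the gzipped download (placeholder file name):
#   import gzip
#   with gzip.open("fbrf_pmid_pmcid_doi_fb_RELEASE.tsv.gz", "rt", encoding="utf-8") as handle:
#       lookup = pmid_to_fbrf(handle)
```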

Drosophila researchers

Addresses of Drosophila researchers are copyrighted (GSA) material and only provided for official business of the Fly Board.

Map conversion tables

Cytological <=> Sequence (genome-cyto-seq.txt)

This is a tab delimited file that FlyBase uses to relate sequence coordinates from release 5 of the Drosophila melanogaster sequence assembly to published cytogenetic map positions. A description of how this is calculated is provided in section G.5.1. of the Reference manual.

The data for each chromosome arm is separated by a line starting with a '#' that lists the name of the chromosome arm and corresponding sequence scaffold.

The columns in the file are:

  1. Cytogenetic map position as described by Bridges.
  2. First sequence coordinate for this map position in the sequence scaffold corresponding to this chromosome arm.
  3. Last sequence coordinate for this map position in the sequence scaffold corresponding to this chromosome arm.
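
The per-arm layout described above can be read into a dictionary keyed by the '#' header line, which then allows a coordinate to be looked up. This is a rough sketch: the exact wording of the '#' lines is not assumed, so the full header text (minus the '#') is used as the key, and the function names are illustrative.

```python
def parse_cyto_seq(lines):
    """Map each '#' header -> list of (map position, first coord, last coord)."""
    sections = {}
    current = None
    for line in lines:
        text = line.rstrip("\r\n")
        if not text.strip():
            continue
        if text.startswith("#"):
            # A new chromosome arm begins; keep the header text as the key.
            current = text.lstrip("#").strip()
            sections[current] = []
            continue
        cols = text.split("\t")
        if current is None or len(cols) < 3:
            continue
        try:
            start, end = int(cols[1]), int(cols[2])
        except ValueError:
            continue   # coordinates are not integers; skip the row
        sections[current].append((cols[0], start, end))
    return sections

def bands_at(sections, header, coordinate):
    """Return the map positions whose interval contains the coordinate."""
    return [band for band, start, end in sections.get(header, [])
            if start <= coordinate <= end]
```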

Cytological <=> Genetic (cytotable.txt)

This is the table that FlyBase uses to infer a genetic map position from a published cytogenetic map position for Drosophila melanogaster.

The first six lines of the file describe the contents of the file or are blank. The data in the file is organized with the cytological position in the first four characters of a line, followed by a run of spaces and then the genetic map position.
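
The fixed layout described above can be read as sketched below: skip the first six lines, then take the first four characters of each line as the cytological position and the rest as the genetic map position. The genetic position is kept as a string because its exact format is not assumed here; the function name is illustrative.

```python
def parse_cytotable(lines):
    """Return a list of (cytological position, genetic map position) pairs."""
    table = []
    for number, line in enumerate(lines, start=1):
        if number <= 6:
            continue   # descriptive or blank header lines
        text = line.rstrip("\r\n")
        if not text.strip():
            continue
        cyto = text[:4].strip()       # first four characters
        genetic = text[4:].strip()    # remainder after the run of spaces
        table.append((cyto, genetic))
    return table
```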


Cyto <=> Genetic <=> Seq (cyto-genetic-seq.tsv)

This is a tab separated file generated from the cytotable.txt and genome-cyto-seq.txt files that infers the relationship between published cytogenetic map positions, genetic map positions and release 5 sequence assembly coordinates for Drosophila melanogaster. Please note that band numbers are not given in this file because they are absent in cytotable.txt.

  1. Cytogenetic map position
  2. Genetic map position
  3. Sequence coordinates (release 5) for the interval

An html version of this file is also available - see the Map Conversion Table page.

Genes map table (gene_map_table_fb_*.tsv)

This is identical to the file listed under the genes section above.