Difference between revisions of "FlyBase:Computed cytological data"
Lauraponting
Revision as of 15:46, 18 October 2022
1. Computed cytological locations of objects which have been mapped to the genome.
Objects which have been precisely mapped to the genome (such as genes with annotations, or insertions of transposable elements with flanking sequence) have an inferred cytological location which is computed by FlyBase based on their sequence location.
The system is based on Sorsa's published estimates of the size in kb of each polytene band (Heino et al., 1993; FBrf0195567). These estimates can be summed to give the length in kb (according to Sorsa) of a region between two very well-mapped entities ('anchors') that are also identified on the genome. The genome sequence gives a different number for that length, so we apply a scaling factor: the cytology of each mapped object in the region between the anchors is calculated by interpolation from its sequence coordinates.
The anchors are a set of over 1200 P-element insertions that have been localised on the genome by sequencing flanking DNA, and on polytene chromosomes by Todd Laverty of the Berkeley Drosophila Genome Project. The scaling works out to be slightly different for each inter-anchor region, but we estimate that even in the middle of a region the error in the computed location should not exceed a band or two. As the remaining gaps in the genome sequence are filled, some currently unmappable stretches of sequence (especially near centromeres) will be joined up with the main sequence, and this will shift all the coordinates. Smaller changes will occur as a result of other gap-filling in the middle of arms. These changes will be reflected in updates to the map locations.
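The interpolation between anchors can be illustrated with a minimal Python sketch. It is not FlyBase's actual code: the anchor values, the name `interpolate_map_position` and the idea of expressing the map position as cumulative kb on Sorsa's scale are assumptions made for illustration only.

```python
from bisect import bisect_right

# Hypothetical anchors: (sequence coordinate in bp, cumulative map position in
# kb on Sorsa's scale). The numbers are invented for illustration.
ANCHORS = [(100_000, 90.0), (300_000, 250.0), (700_000, 650.0)]


def interpolate_map_position(coord, anchors=ANCHORS):
    """Linearly interpolate a map position for coord between its flanking anchors."""
    coords = [c for c, _ in anchors]
    if not coords[0] <= coord <= coords[-1]:
        raise ValueError("coordinate lies outside the anchored region")
    # Index of the anchor at or to the left of coord, clamped so that a
    # right-hand anchor always exists.
    i = min(bisect_right(coords, coord) - 1, len(anchors) - 2)
    (c0, m0), (c1, m1) = anchors[i], anchors[i + 1]
    scale = (m1 - m0) / (c1 - c0)  # scaling factor for this inter-anchor region
    return m0 + (coord - c0) * scale
```

Each inter-anchor region gets its own `scale`, which mirrors the statement that the scaling differs slightly from region to region. Converting the interpolated position into a band name would need a table of band sizes, which is not shown here.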
FlyBase currently only computes cytological data in this way for objects that have been mapped to the D. melanogaster genome.
Cytology computed in this way is currently displayed on FlyBase in the following places on the relevant Report:
Gene Report
- Cytogenetic map field of the GENOMIC LOCATION section.
- FLYBASE COMPUTED CYTOLOGICAL LOCATION section of the DETAILED MAPPING DATA section.
Insertion Report
- Cytological location (computed by FlyBase) field of the DETAILED MAPPING DATA section, marked in parentheses with "inferred by FlyBase from sequence location".
GBrowse
- Cytologic band evidence tier of the GBrowse display.
2. Computed cytological location of insertions based on the gene in which they are inserted.
Insertions of transposable elements that do not have flanking sequence may have a computed cytological location which is based on the computed cytological location of the gene into which they have inserted in the genome (displayed in the Affected gene(s) section of the Insertion Report).
If the affected gene has a computed cytological location based on its sequence location (as described in 1. above) then this is displayed in the Insertion Report in the following field:
- Cytological location (computed by FlyBase) field of the DETAILED MAPPING DATA section, marked in parentheses with "near gene of known cytology".
3. Computed cytological locations of objects based upon data from the literature.
Genes that do not have a computed cytology based on their mapping to the genome (described in 1. above) may instead have a computed cytology based upon data from the literature. Aberrations may also have computed cytological breakpoints based upon data from the literature.
Five categories of information are used to compute the cytological location of genes and aberration breakpoints:
- Polytene localization of genes by chromosome in situ hybridization (reported in the EXPERIMENTALLY DETERMINED CYTOLOGICAL LOCATION section of the DETAILED MAPPING DATA section of the relevant Gene Report).
- Polytene localization of aberration breakpoints (orcein data) (reported in the Breakpoints field of the NATURE OF THE ABERRATION section of the relevant Aberration Report).
- Genetic (recombination) mapping data on gene order (reported in the EXPERIMENTALLY DETERMINED RECOMBINATION DATA section of the DETAILED MAPPING DATA section of the relevant Gene Report).
- Complementation data between alleles and aberrations (reported in the GENE DELETION & DUPLICATION DATA section of the relevant Aberration Report).
- Molecular data on gene order and proximity (reported in the MOLECULAR MAP DATA section of the DETAILED MAPPING DATA section of the relevant Gene Report).
Recombination, complementation and molecular information do not reveal polytene locations directly, but can be combined with orcein and in situ data to infer them. FlyBase has developed software that synthesizes the primary data into a computed cytological location: a best guess of the polytene location of each gene or aberration breakpoint for which any relevant data are known to FlyBase. However, since this type of analysis is non-trivial when conducted on a large dataset, the computed statements should be treated with caution, and users should also consult the five categories of information listed above to see the full extent of the primary data.
The computed cytological location is presented as a range of uncertainty, whose ends are either polytene bands (such as 22F1) or lettered subdivisions (such as 22F). Heterochromatic bands (such as h41) are also used.
Wherever possible, the computed range of uncertainty of a gene or breakpoint is the range consistent with ALL the data known to FlyBase. Thus, if in one publication a gene has been reported to lie in 35B1-4, and in another publication it is reported to lie in 35B3-6, and there is no other relevant information in FlyBase, the computed location will be 35B3-4. More complex situations arise from complementation and recombination data. For example, if Df(1)xyz is stated to have its proximal breakpoint at 15A1-4, and Df(1)pqr is stated to have its distal breakpoint at 15A3-6, and the deficiencies are known to overlap (because there is a gene, abc, that they both delete), then both those breakpoints will be computed to lie in 15A3-4 -- as will the gene abc itself.
If however two publications report cytological ranges that do NOT overlap, a choice must be made regarding which report to prioritize. This is done case-by-case, going back to the original literature. Certain guidelines are used: for example, genetic data on deficiencies are usually favored over cytological data, since point lesions very near to a deficiency are rare. However, inevitably some decisions are wrong -- especially when there is nothing to favor one report over another.
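The simple case of two overlapping reports amounts to an interval intersection. The sketch below assumes, purely for illustration, that every report is a band range within a single lettered subdivision (such as 35B1-4); the names `parse_range` and `consistent_range` are hypothetical and do not come from FlyBase software.

```python
import re

# Matches simple ranges such as "35B1-4": subdivision, first band, last band.
_RANGE = re.compile(r"^(\d+[A-F])(\d+)-(\d+)$")


def parse_range(text):
    """Split a range like '35B1-4' into ('35B', 1, 4)."""
    match = _RANGE.match(text)
    if match is None:
        raise ValueError(f"unsupported range: {text!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def consistent_range(reports):
    """Return the range consistent with all reports, or None if they conflict."""
    parsed = [parse_range(r) for r in reports]
    subdivisions = {sub for sub, _, _ in parsed}
    if len(subdivisions) != 1:
        raise ValueError("this sketch only handles one lettered subdivision")
    left = max(lo for _, lo, _ in parsed)   # tightest left limit
    right = min(hi for _, _, hi in parsed)  # tightest right limit
    if left > right:
        return None  # non-overlapping reports are resolved case-by-case
    sub = subdivisions.pop()
    return f"{sub}{left}" if left == right else f"{sub}{left}-{right}"
```

For the reports 35B1-4 and 35B3-6, `consistent_range` gives 35B3-4. A `None` result marks the non-overlapping case, which needs a manual decision rather than a computation.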
Because of the inherent complexity of these computations, the basis for the computed range is often not obvious at first sight. FlyBase therefore includes one-line descriptions of the primary data from which each end of the range was determined.
Some examples:
For gene abc:
- Computed cytological location: 15A3-4
- Left limit from inclusion in Df(1)pqr (FBrf0012345)
- Right limit from inclusion in Df(1)xyz (FBrf0054321)
For Df(1)xyz:
- Computed breakpoints: 14D;15A3-4
- Limits of break 1 from polytene analysis (FBrf0013579)
- Left limit of break 2 from inclusion of abc (FBrf0056789)
- Right limit of break 2 from polytene analysis (FBrf0098765)
For Df(1)pqr:
- Computed breakpoints: 15A3-4;15D
- Left limit of break 1 from polytene analysis (FBrf0034567)
- Limits of break 2 from polytene analysis (FBrf0097531)
Note that there is no requirement that any two data items derive from the same reference.
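The way each limit acquires its own line of evidence can be sketched as a small constraint propagation. This is an illustrative reconstruction of the Df(1)xyz / Df(1)pqr / abc example, not FlyBase's software: the `Limit` and `Range` classes, the `constrain_order` function and the use of plain band numbers within 15A are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    band: int      # band number within subdivision 15A (a simplification)
    evidence: str  # one-line description of where this limit came from


@dataclass
class Range:
    left: Limit
    right: Limit

    def label(self) -> str:
        return f"15A{self.left.band}-{self.right.band}"


def constrain_order(a: Range, b: Range, why_a: str, why_b: str) -> bool:
    """Enforce that a lies at or to the left of b; return True if a limit moved."""
    changed = False
    if a.right.band > b.right.band:
        a.right = Limit(b.right.band, why_a)
        changed = True
    if b.left.band < a.left.band:
        b.left = Limit(a.left.band, why_b)
        changed = True
    return changed


# Reported data: Df(1)pqr distal break at 15A3-6, Df(1)xyz proximal break at 15A1-4.
pqr_break1 = Range(Limit(3, "polytene analysis"), Limit(6, "polytene analysis"))
xyz_break2 = Range(Limit(1, "polytene analysis"), Limit(4, "polytene analysis"))
abc = Range(Limit(1, "no direct data"), Limit(6, "no direct data"))

# abc is deleted by both deficiencies, so pqr_break1 <= abc <= xyz_break2.
changed = True
while changed:
    c1 = constrain_order(pqr_break1, abc, "inclusion of abc", "inclusion in Df(1)pqr")
    c2 = constrain_order(abc, xyz_break2, "inclusion in Df(1)xyz", "inclusion of abc")
    changed = c1 or c2

print(abc.label(), "|", abc.left.evidence, "|", abc.right.evidence)
```

Limits that are never tightened keep their original evidence, so one end of a range can come from polytene analysis while the other comes from inclusion of a gene, as in the examples above.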
Notation
If a computed cytological range is inferred from recombination data (for genes) or complementation data (for breakpoints), it is enclosed in square brackets when no range (even a wider one) can be determined by other means; square brackets thus specifically denote the unavailability of any direct data. This is most commonly found for breakpoints of cytologically invisible deficiencies and for genes which were mapped by recombination but never cloned or mapped by complementation.
'One-ended' limits. The commonest example of this is when a deficiency is stated to delete certain genes, thus giving it a minimum extent, but no flanking undeleted genes are specified, so no 'maximum extent' can be computed. In such cases, if there is also no explicit cytology for the deficiency (and if it is also not stated to be cytologically invisible -- see below) the 'half-open' range is denoted by 'less than' and 'greater than' signs, as follows:
For a deficiency that deletes three genes, all localized to 28D-E:
- Computed breakpoints: <28E;>28D
- Right limit of break 1 from inclusion of abc (FBrf0076543)
- Left limit of break 2 from inclusion of abc (FBrf0056789)
Note that there is no 'limit line' for the left limit of break 1 or the right limit of break 2. Note also the superficially odd, but logically sound, mention of 28E for the left break and 28D for the right break.
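The apparent reversal of 28E and 28D follows directly from how the one-ended limits are built, as this small sketch shows. The function name `half_open_breakpoints` is hypothetical, and the sketch only formats the result; it does not attempt the underlying analysis.

```python
def half_open_breakpoints(gene_left: str, gene_right: str) -> str:
    """Format one-ended limits for a deficiency known only by the genes it deletes.

    Break 1 lies to the left of the deleted genes, so its only known limit is
    the right end of the genes' range. Break 2 lies to their right, so its only
    known limit is the left end of that range.
    """
    return f"<{gene_right};>{gene_left}"


print(half_open_breakpoints("28D", "28E"))
```

The right end of the gene range bounds the left break, and the left end bounds the right break, which is why 28E appears first.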
Proximity rather than order
There are two cases in which locations are computed based on close proximity of a pair of objects, rather than on their chromosomal order. One is when two genes are reported to lie within 20 kb of each other on a molecular map. For example, if a gene xyz is stated to lie in 22F1-2 and a second gene, pqr, is stated to lie a few kilobases away from xyz (and there is no other relevant information in FlyBase), the computed location of pqr will be 22F1-2, even if there is no information on the chromosomal order of the two genes.
The other case concerns cytologically invisible deficiencies. If a deficiency is stated to be cytologically invisible, the computation makes the assumption that it is less than a band in extent, so that the ranges of uncertainty of the left and right breakpoint should be identical. For example: if the deficiency in the previous example, which deletes a gene in 28D-E, were said to be cytologically invisible then its computed data would appear as follows:
Computed breakpoints: [28D-E];[28D-E]
- Left limit of break 1 from cytological invisibility (FBrf0002468)
- Right limit of break 1 from inclusion of abc (FBrf0076543)
- Left limit of break 2 from inclusion of abc (FBrf0056789)
- Right limit of break 2 from cytological invisibility (FBrf0002468)
Note the use of square brackets as described under "Notation", since this is a case where no explicit cytology is available. A statement that a deficiency is less than 20kb long is, for this purpose, treated as a statement that it is cytologically invisible.
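Both proximity rules reduce to simple checks against the 20 kb threshold. The following sketch is an assumption-laden illustration: the function names are hypothetical, and locations are passed around as plain strings rather than parsed.

```python
PROXIMITY_KB = 20  # threshold quoted in the text


def location_by_proximity(neighbour_location, distance_kb):
    """Return the neighbour's location if the genes lie within 20 kb, else None."""
    return neighbour_location if distance_kb <= PROXIMITY_KB else None


def treated_as_invisible(stated_invisible, length_kb=None):
    """A deficiency stated to be shorter than 20 kb counts as cytologically invisible."""
    return stated_invisible or (length_kb is not None and length_kb < PROXIMITY_KB)


def invisible_deficiency_breakpoints(gene_range):
    """Both breaks share the range of the deleted gene, in square brackets."""
    return f"[{gene_range}];[{gene_range}]"


print(location_by_proximity("22F1-2", 5))
print(invisible_deficiency_breakpoints("28D-E"))
```

The square brackets are added unconditionally here because, in this case, no explicit cytology is available for either break.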
Cytology computed in this way is currently displayed on FlyBase in the following places on the relevant Report:
Gene Report
- FLYBASE COMPUTED CYTOLOGICAL LOCATION section of the DETAILED MAPPING DATA section.
Note: the one-line description of the primary data from which the range was determined is displayed in the Evidence for location column of the above section.
Aberration Report
- Computed Breakpoints include field.
Note: the one-line description of the primary data from which the range was determined is displayed in the COMMENTS ON CYTOLOGY section.
Tools
Map-based searches using CytoSearch use computed cytological locations, rather than the primary data reported in the literature. For this reason, it is always advisable to search using a slightly broader range than the one of interest, so as to match entities which have been placed by multiple investigators in slightly varying locations.
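The benefit of searching a slightly broader range can be seen in a toy overlap query. This is not how CytoSearch is implemented; the object names, the numeric band indices and the `pad` parameter are all invented for illustration.

```python
def overlaps(a, b):
    """True if closed intervals a and b, given as (start, end), share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def search(objects, query, pad=0):
    """Return names of objects whose computed range overlaps the widened query."""
    lo, hi = query
    widened = (lo - pad, hi + pad)
    return sorted(name for name, rng in objects.items() if overlaps(rng, widened))


# Hypothetical computed locations, expressed as band indices along one arm.
OBJECTS = {"geneA": (10, 12), "geneB": (13, 14), "geneC": (20, 22)}

print(search(OBJECTS, (11, 12)))         # exact range of interest
print(search(OBJECTS, (11, 12), pad=1))  # slightly broader range
```

Widening the query by one band on each side also picks up an object whose computed location starts just outside the range of interest.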
The Cytolocation Advanced Search option in GBrowse uses computed cytological locations of objects which have been mapped to the genome (as described in 1. above).