Difference between revisions of "WBGene information and status pipeline"
Revision as of 20:08, 5 November 2013
Table Summarizing Current/Future Postgres Population
AceDB tag | Postgres table | Old (pre-10/2013) - Nameserver nightly dump | Old (pre-10/2013) - WS bimonthly release | Current - Nameserver nightly dump | Current - Geneace nightly dump | Current - WS bimonthly release | Use - Paper or meeting abstract gene connection | Use - OA data type curation | Use - Dumping scripts (could be wrong, but I don't think any gin_ tables are used in dumping scripts since we store WBGene IDs, except maybe gin_dead if people want those suppressed, or want some kind of error message, or want to map to Historical_gene or something like that) | Use - Protein2GO data conversion | Use - GSA Markup | Comment |
---|---|---|---|---|---|---|---|---|---|---|---|---|
WBGene identifier | gin_wbgene | Yes | Yes (First line in each entry) | Yes | Yes | Yes | ||||||
CGC_name | gin_locus | Yes (CGC) | If it has this tag, gene is considered good (What does 'good' mean?) | Yes (CGC) | No | No | Yes | No | ||||
Other_name | gin_synonyms | Yes (checking for different CGC name - see lines 132-137 in script) | Yes | Yes (checking for different CGC name) | Yes - add to what is already populated from nightly nameserver | No | Yes | No | ||||
Sequence_name | gin_seqname | Yes (Sequence) | No | Yes (Sequence) | No | No | Yes | No | ||||
Status | gin_dead | Yes (0) | only if value is dead and species ~ elegans$ | Yes (0) | only if value is dead in the nameserver nightly; populate with Merged_into and Split_into values | No | Yes | Yes | Yes | |||
Suppressed | gin_dead | If Status from nightly geneace = Suppressed, populate gin_dead with Dead Suppressed (if Status in nightly nameserver is 0 (dead)) or populate gin_dead with Suppressed (if Status in nightly nameserver is 1 (live)). Also add Merged_into and/or Split_into if nightly nameserver Status is 0 (dead). | ||||||||||
Merged_into | gin_dead | No | Yes | No | Yes - add only when status is dead from nightly nameserver | No | Historical_gene tag uses this when dumping files | |||||
Split_into | gin_dead | No | Yes | No | Yes - add only when status is dead from nightly nameserver | No | ||||||
Corresponding_transcript | gin_sequence | No | Yes | No | No | Yes | ||||||
Corresponding_CDS | gin_sequence + gin_seqprot | No | Yes | No | No | Yes | ||||||
Corresponding_protein | gin_protein, gin_seqprot | No | Yes | No | No | Yes | Yes, but we'll need isoform data in WB | |||||
Molecular_name | gin_molname | No | Yes | No | No | Yes | No | Maybe | ||||
Species | This could perhaps be used to populate a future species tag for papers, but this is not an immediate need. Other use cases? | |||||||||||
Version_change | No | Yes, to make sure we don't attach GO annotations to pseudogenes. | One use case would be to know when genes change class, e.g. CDS ->Pseudogene. We may not need to actually store this in postgres, though. | |||||||||
Public_name | gin_wbgene | Yes (but only when no CGC_name or Sequence_name) | If it has this tag, gene is considered good | Don't need (Public_name also in Other_name - confirm this is always the case) | No | Not if also in Other_name | Not if also in Other_name | Not if also in Other_name | No | I think we can now ignore the Public_name tag as long as there's always an Other_name value as well. If there is no Other_name, would we look at Public_name instead? (Looking at the script, we're not doing anything with this value.) |
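The Status, Suppressed, Merged_into and Split_into rows above combine into a single gin_dead value. A minimal sketch of that decision logic in Python follows (the real scripts are Perl). The function name, the argument names, and the exact text format of the stored value (separator, lowercase merged_into / split_into labels) are assumptions for illustration only, not taken from the scripts.

```python
from typing import List, Optional


def compose_gin_dead(nameserver_status: int,
                     geneace_status: Optional[str] = None,
                     merged_into: Optional[str] = None,
                     split_into: Optional[List[str]] = None) -> Optional[str]:
    """Return a hypothetical gin_dead value, or None when no row is needed.

    nameserver_status: 0 = dead, 1 = live (nightly nameserver dump).
    geneace_status:    Status from the nightly geneace dump, e.g. "Suppressed".
    The output string format is an assumption, not the real column format.
    """
    is_dead = nameserver_status == 0
    is_suppressed = geneace_status == "Suppressed"

    if not is_dead:
        # Live genes get no gin_dead entry unless geneace marks them Suppressed.
        return "Suppressed" if is_suppressed else None

    parts = ["Dead Suppressed" if is_suppressed else "Dead"]
    # Merged_into / Split_into are only added when the nameserver says dead.
    if merged_into:
        parts.append(f"merged_into {merged_into}")
    for target in split_into or []:
        parts.append(f"split_into {target}")
    return " / ".join(parts)
```

For example, `compose_gin_dead(1)` yields no entry at all, while a dead gene with a merge target gets both the Dead label and the target ID in one value.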
Previous (pre-nameserver move) Scripts:
- /home/acedb/cron/populate_gin_locus.pl - updates information from nameserver nightly dumps
- /home/acedb/cron/populate_gin.pl - updates information from WS releases
How this all works (10/2013)
Location of files:
The nightly nameserver and geneace dumps are on the Sanger FTP site:
ftp://ftp.sanger.ac.uk/pub2/wormbase/STAFF/mh6/nightly_geneace/
There are two json dumps (genes and variations) from the respective nameservers, and six .ace dumps (clones, genes, laboratories, rearrangements, strains, and variations) from geneace.
Note that the gene nameserver and geneace are not synchronized. Some changes are evident in the nightly nameserver dump before they are included in the geneace dump.
For updating from nightly gene nameserver dumps:
/home/postgres/work/pgpopulation/obo_oa_ontologies/geneace/populate_gin_nightly.pl
Called from:
0 20 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/update_obo_oa_ontologies.pl
Went live on tazendra 2013 10 25.
This script reads the genes.json and genes.ace.gz files to update WBGene IDs, CGC names, Sequence names, Synonyms and the Status. Synonyms are not taken from Other_name here; they come from a comparison between the CGC name in the json dump and the existing locus and synonym names in postgres.
This script is not species-restricted.
Key features of this script:
- Locus and Synonym Names (Lines 72-78) - this part of the script handles populating the gin_locus and gin_synonyms tables.
- The script queries postgres for a list of locus names and synonyms and puts these into a hash.
- New synonym entries are appended. This was done so that we wouldn't lose new synonyms that were evident in the nameserver dump (from a changed CGC_name value) but not yet included in the nightly geneace dump (as an Other_name value).
- When a gene is killed and merged into another, synonyms are removed from the dead gene and added to the gin_synonyms table for the live gene into which it was merged.
- Status
- Live genes have no entry in the gin_dead table.
- Dead genes have different entries depending on the information that is available for them:
- Dead
- Dead, Merged_into
- Dead, Merged_into, Split_into
- Suppressed genes are a special case.
- They can be Dead Suppressed or Live Suppressed
- Right now (10/2013), we only have Dead Suppressed
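The locus and synonym handling described in the list above can be sketched roughly as follows, in Python rather than the Perl of the actual script. The two dictionaries stand in for the hashes built from the gin_locus and gin_synonyms queries. The function names are hypothetical, and treating the previous locus name as the new synonym when the CGC name changes is an assumption about how the comparison works.

```python
from typing import Dict, Set


def apply_cgc_name(locus: Dict[str, str],
                   synonyms: Dict[str, Set[str]],
                   gene_id: str,
                   new_cgc_name: str) -> None:
    """Update a gene's locus name from the nameserver dump.

    Assumption: when the CGC name changes, the previous locus name is
    appended to the gene's synonyms so that it is not lost before geneace
    lists it under Other_name.
    """
    old_name = locus.get(gene_id)
    if old_name and old_name != new_cgc_name:
        synonyms.setdefault(gene_id, set()).add(old_name)
    locus[gene_id] = new_cgc_name


def move_synonyms_on_merge(synonyms: Dict[str, Set[str]],
                           dead_id: str,
                           live_id: str) -> None:
    """Remove synonyms from a dead, merged gene and add them to the live gene."""
    moved = synonyms.pop(dead_id, set())
    if moved:
        synonyms.setdefault(live_id, set()).update(moved)
```

Sets are used so that appending a synonym that already exists is harmless, which matches the append-only behaviour described above.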
For updating from nightly geneace dumps:
/home/postgres/work/pgpopulation/obo_oa_ontologies/geneace/nightly_geneace.pl
Called from:
0 20 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/update_obo_oa_ontologies.pl
This script is not species-restricted.
For updating from latest WS release:
There are 2 cronjobs and 3 scripts:
in the acedb account:
cronjob 1 on script 1 (runs daily at 3am PST) : 0 3 * * * /home/acedb/cron/update_ws_tazendra.pl
It also triggers the second script, /home/acedb/cron/dump_from_ws.sh (which dumps .ace files from WS to update postgres tables),
and writes the date to a timestamp file: my $dateOfWsDumpFile = /home3/acedb/cron/dump_from_ws/files/latestDate; echo "$date" >> $dateOfWsDumpFile;
in the postgres account :
cronjob 2 on script 3 (runs daily at 5am PST) : 0 5 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/populate_pg_from_ws.pl
Script 3 generates files to populate postgres tables (some gin_ tables, plus obo_ tables for exprcluster), but only when the date in the timestamp file is more recent than the latest timestamp in gin_molname.
This script is species-restricted for C. elegans only (we can change this, if needed).
This script populates information that can only be derived from the WS build:
- Corresponding_transcript
- Corresponding_CDS
- Corresponding_protein
- Molecular_name
We'll know whether this pipeline works automatically when WS141 comes out (should be mid-to-late December 2013) and cronjob 1 picks it up.
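The check that decides whether the postgres-side script does any work (timestamp file date versus latest gin_molname timestamp) can be sketched like this, in Python rather than the Perl of the real script. The function names and the YYYY-MM-DD date format are assumptions; the actual format written to the latestDate file is not documented here. Since dates are appended to the file, the sketch reads the last non-empty line.

```python
from datetime import datetime
from typing import Optional

DATE_FORMAT = "%Y-%m-%d"  # assumed format of each line in the timestamp file


def latest_dump_date(timestamp_text: str) -> Optional[datetime]:
    """Parse the last non-empty line of the timestamp file contents."""
    lines = [line.strip() for line in timestamp_text.splitlines() if line.strip()]
    if not lines:
        return None
    return datetime.strptime(lines[-1], DATE_FORMAT)


def should_repopulate(timestamp_text: str,
                      latest_gin_molname: Optional[datetime]) -> bool:
    """True when the WS dump is newer than the newest gin_molname row."""
    dump_date = latest_dump_date(timestamp_text)
    if dump_date is None:
        return False  # no dump recorded yet, nothing to load
    if latest_gin_molname is None:
        return True   # table is empty, so any dump is newer
    return dump_date > latest_gin_molname
```

In practice the second argument would come from something like the maximum timestamp in gin_molname, and the first from reading the latestDate file.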
Some Relevant Postgres Queries:
SELECT * FROM gin_dead WHERE gin_dead ~ 'merged' AND gin_dead ~ 'split';
SELECT * FROM gin_dead WHERE gin_dead ~ 'split';