Difference between revisions of "WBGene information and status pipeline"

From WormBaseWiki
 
(110 intermediate revisions by 3 users not shown)
Line 1: Line 1:
=Table Summarizing Current/Future Postgres Population=
+
=Table Summarizing Former and Current Postgres Gene Information (gin_ ) Population=
 
{| class="wikitable" border="1"
 
{| class="wikitable" border="1"
 
!AceDB tag
 
!AceDB tag
 
!Postgres table
 
!Postgres table
 +
!Old (pre-10/2013) - Nameserver nightly dump
 +
!Old (pre-10/2013) - WS bimonthly release
 
!Current - Nameserver nightly dump
 
!Current - Nameserver nightly dump
 +
!Current - Geneace nightly dump
 
!Current - WS bimonthly release
 
!Current - WS bimonthly release
!Future - Geneace nightly dump
 
!Future - WS bimonthly release
 
 
!Use - Paper or meeting abstract gene connection
 
!Use - Paper or meeting abstract gene connection
 
!Use - OA data type curation
 
!Use - OA data type curation
 
!Use - Dumping scripts -- could be wrong, but I don't think any gin_ tables are used in dumping scripts since we store WBGene IDs. except maybe gin_dead if people want those suppressed or to have some kind of error message or to map to Historical_gene or something like that)
 
!Use - Dumping scripts -- could be wrong, but I don't think any gin_ tables are used in dumping scripts since we store WBGene IDs. except maybe gin_dead if people want those suppressed or to have some kind of error message or to map to Historical_gene or something like that)
 
!Use - Protein2GO data conversion
 
!Use - Protein2GO data conversion
 +
!Use - Textpresso and/or GSA Markup
 
!Comment
 
!Comment
 
|-
 
|-
Line 17: Line 19:
 
| Yes
 
| Yes
 
|
 
|
| Yes
+
| Yes (First line in each entry)
 +
|
 
|  
 
|  
 
| Yes
 
| Yes
 
| Yes
 
| Yes
 
|  
 
|  
 +
| Yes
 
| Yes
 
| Yes
 
|
 
|
Line 27: Line 31:
 
| CGC_name
 
| CGC_name
 
| gin_locus
 
| gin_locus
| Yes
+
| Yes (CGC)
| If it has this tag, gene is considered good     
+
| If it has this tag, gene is considered good (What does 'good' mean?)      
| Yes
+
| Yes (CGC)
|  
+
| No
| Yes
+
| No
 
| Yes
 
| Yes
 
|     
 
|     
 
| No
 
| No
 +
|
 +
| Yes
 
|
 
|
 
|-
 
|-
 
| Other_name
 
| Other_name
 
| gin_synonyms
 
| gin_synonyms
| No
+
| Yes (checking for different CGC name - see lines 132-137 in script)
| Yes
 
 
| Yes
 
| Yes
 +
| Yes (checking for different CGC name)
 +
| Yes - add to what is already populated from nightly nameserver
 
| No
 
| No
| Yes
 
 
| Yes
 
| Yes
 
|  
 
|  
 
| No
 
| No
 +
|
 +
| Yes
 
|
 
|
 
|-
 
|-
 
| Sequence_name
 
| Sequence_name
 
| gin_seqname
 
| gin_seqname
| Yes
+
| Yes (Sequence)
 +
| No
 +
| Yes (Sequence)
 +
| No
 
| No
 
| No
| Yes
 
|
 
| Yes
 
 
| Yes
 
| Yes
 
|  
 
|  
 
| No
 
| No
 
|
 
|
|-
+
| Yes
| Public_name
 
| gin_wbgene
 
| Yes (but only when no CGC_name or Sequence_name)
 
| If it has this tag, gene is considered good
 
| Don't need (Public_name also in Other_name - confirm this is ''always'' the case)
 
 
|
 
|
| Not if also in Other_name
 
| Not if also in Other_name
 
| Not if also in Other_name
 
| No
 
| I think we can now ignore the Public_name tag as long as there's always an Other_name value as well -- so if there is no Other_name then we'd look at Public_name ? looking at the script, we're not doing anything with this value)
 
 
|-
 
|-
| Molecular_name
+
| Status
| gin_molname
+
| gin_dead
 +
| Yes (0)
 +
| only if value is dead and species ~ elegans$
 +
| Yes (0)
 +
| only if value is dead in the nameserver nightly; populate with Merged_into and Split_into values
 
| No
 
| No
 
| Yes
 
| Yes
| No
 
 
| Yes
 
| Yes
 
| Yes
 
| Yes
| No
 
 
|
 
|
| Maybe
+
|
 
|
 
|
 
|-
 
|-
| Status
+
| Suppressed
 
| gin_dead
 
| gin_dead
| Yes
+
|
| only if value is dead
+
|
| Yes
+
|
|  
+
|If Status from nightly geneace = Suppressed, populate gin_dead with Dead Suppressed (if Status in nightly nameserver is 0 (dead)) or populate gin_dead with Suppressed (if Status in nightly nameserver is 1 (live)).  Also add Merged_into and/or Split_into if nightly nameserver Status is 0 (dead).
| Yes
+
|
| Yes
+
|
| Yes
+
|
| Yes
+
|
 +
|
 +
|
 
|
 
|
 
|-
 
|-
Line 101: Line 103:
 
| No
 
| No
 
| Yes
 
| Yes
| Yes
+
| No
 +
| Yes - add only when status is dead from nightly nameserver
 
| No
 
| No
 
|  
 
|  
|
+
| Historical_gene tag uses this when dumping files
| Historical_gene tag?
 
 
|  
 
|  
 
|  
 
|  
 +
|
 +
|
 
|-
 
|-
 
| Split_into
 
| Split_into
Line 113: Line 117:
 
| No
 
| No
 
| Yes
 
| Yes
| Yes
+
| No
 +
| Yes - add only when status is dead from nightly nameserver
 
| No
 
| No
 
|  
 
|  
 
|  
 
|  
| Historical_gene tag?
 
 
|  
 
|  
 +
|
 
|  
 
|  
 +
|
 
|-
 
|-
 
| Corresponding_transcript
 
| Corresponding_transcript
Line 125: Line 131:
 
| No
 
| No
 
| Yes
 
| Yes
 +
| No
 
| No
 
| No
 
| Yes
 
| Yes
| ''Confirm''
 
 
|
 
|
 
|
 
|
 
|
 
|
 +
|
 +
| Yes
 
|
 
|
 
|-
 
|-
Line 137: Line 145:
 
| No
 
| No
 
| Yes
 
| Yes
 +
| No
 
| No
 
| No
 
| Yes
 
| Yes
| ''Confirm''
 
 
|
 
|
 
|
 
|
 
|
 
|
 +
|
 +
| Yes
 
|
 
|
 
|-
 
|-
 
| Corresponding_protein
 
| Corresponding_protein
| gin_protein, gin_seqprotein (Need to check about this. -- it's gin_seqprot)
+
| gin_protein, gin_seqprot
 
| No
 
| No
 
| Yes
 
| Yes
 +
| No
 
| No
 
| No
 
| Yes
 
| Yes
| ''Confirm''
 
 
|
 
|
 
|
 
|
 
| Yes, but we'll need isoform data in WB
 
| Yes, but we'll need isoform data in WB
 +
|
 +
| Yes
 +
|
 +
|-
 +
| Molecular_name
 +
| gin_molname
 +
| No
 +
| Yes
 +
| No
 +
| No
 +
| Yes
 +
| No
 +
|
 +
| Maybe
 +
|
 +
| Yes
 
|
 
|
 
|-
 
|-
 
| Species
 
| Species
 +
| gin_species
 +
| No
 +
| No
 +
| No
 +
| Yes
 
|
 
|
 
|
 
|
| used in gin_dead only if value matches ending in elegans
+
| Yes - added to display of WBGene info in Term information window of OA
 +
| May be used to add species information to automated concise description files and also to add taxon IDs to BioGrid interaction file
 
|
 
|
 
|
 
|
 +
|
 +
|-
 +
| Pseudogene status
 +
| gin_pseudogene
 +
| No
 +
| No
 +
| No
 +
| ??
 +
|
 +
|
 
|
 
|
 
|
 
|
 
|
 
|
|
+
|  
| This could perhaps be used to populate a future species tag for papers, but this is not an immediate need.  Other use cases?
 
 
|-
 
|-
 
| Version_change
 
| Version_change
Line 179: Line 220:
 
|
 
|
 
| Yes, to make sure we don't attach GO annotations to pseudogenes.
 
| Yes, to make sure we don't attach GO annotations to pseudogenes.
 +
|
 +
|
 
| One use case would be to know when genes change class, e.g. CDS ->Pseudogene.  We may not need to actually store this in postgres, though.
 
| One use case would be to know when genes change class, e.g. CDS ->Pseudogene.  We may not need to actually store this in postgres, though.
 +
|-
 +
| Public_name
 +
| gin_wbgene
 +
| Yes (but only when no CGC_name or Sequence_name)
 +
| If it has this tag, gene is considered good
 +
| Don't need (Public_name also in Other_name - confirm this is ''always'' the case)
 +
| No
 +
| Not if also in Other_name
 +
| Not if also in Other_name
 +
| Not if also in Other_name
 +
| No
 +
|
 +
|
 +
| I think we can now ignore the Public_name tag as long as there's always an Other_name value as well -- so if there is no Other_name then we'd look at Public_name ? looking at the script, we're not doing anything with this value)
 
|-
 
|-
 
|}
 
|}
  
='''Current Scripts:'''=
+
='''Previous (pre-nameserver move) Scripts:'''=
 +
 
 +
# /home/acedb/cron/populate_gin_locus.pl - updates information from nameserver nightly dumps
 +
# /home/acedb/cron/populate_gin.pl - updates information from WS releases
 +
 
 +
='''How this all works (10/2013 - updated 04/2015)'''=
 +
 
 +
'''Location of files:'''
 +
 
 +
The nightly nameserver and geneace dumps are on the sanger ftp site:
 +
 
 +
ftp://ftp.sanger.ac.uk/pub/consortia/wormbase/STAFF/mh6/nightly_geneace/
 +
 
 +
There are two json dumps (genes and variations) from the respective nameservers, and six .ace dumps (clones, genes, laboratories, rearrangements, strains, and variations) from geneace.
 +
 
 +
Note that the gene nameserver and geneace are not synchronized.  Some changes are evident in the nightly nameserver dump before they are included in the geneace dump.
 +
 
 +
 
 +
'''For updating from nightly gene nameserver and geneace dumps:'''
 +
 
 +
/home/postgres/work/pgpopulation/obo_oa_ontologies/geneace/populate_gin_nightly.pl
 +
 
 +
Called from:
 +
 
 +
0 20 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/update_obo_oa_ontologies.pl
 +
 
 +
Went live on tazendra 2013 10 25.
 +
 
 +
This script is looking at the genes.json and genes.ace.gz files to update WBGene IDs, CGC names, Sequence names, Synonyms (not from Other_name but from a comparison between the CGC name in the json dump and the existing locus and synonym names in postgres) and the Status.
 +
 
 +
This script is not species-restricted.
 +
 
 +
Key features of this script:
 +
 
 +
*Locus and Synonym Names
 +
**The script queries postgres for a list of locus names and synonyms and puts these into a hash.
 +
**New synonym entries are appended.  This was done so that we wouldn't lose new synonyms because they were picked up in the nameserver dump (from a changed CGC_name value) but not yet included in the nightly geneace dump (the Other_name value). 
 +
**When a gene is killed and merged into another, synonyms are removed from the Dead gene and added to the gin_synonyms table for the live gene into which it was merged.
 +
 
 +
*Species
 +
**On 2015-04-02 Michael Paulini added Species information to the genes.ace file dumped nightly from geneace.
 +
**The script now populates a gin_species table in postgres to record species names, e.g. Caenorhabditis elegans.
 +
 
 +
*Status
 +
**Live genes have no entry in the gin_dead table.
 +
**Dead genes have different entries depending on the information that is available for them:
 +
***Dead
 +
***Dead, Merged_into
 +
***Dead, Merged_into, Split_into
 +
**Suppressed genes are a special case.
 +
***They can have Dead Suppressed or Live Suppressed
 +
***Right now (10/2013), we only have Dead Suppressed
 +
 
 +
*Sanity Check
 +
**Since this script wipes out and re-populates postgres data, there is a sanity check so that if the nameserver file contains information on less than 20,000 genes, the script will not run and an email alert is sent.  20,000 is admittedly arbitrary, but hopefully is a reasonable cut-off.
 +
 
 +
 
 +
 
 +
'''For updating from latest WS release:'''
 +
 
 +
See [[Updating_Postgres_with_New_WS_Information | this page]] for details
 +
 
 +
There are 2 cronjobs and 3 scripts:
 +
 
 +
in the acedb account:
 +
 
 +
cronjob 1 on script 1 (runs daily at 3am PST) :  0 3 * * * /home/acedb/cron/update_ws_tazendra.pl
 +
 
 +
Also triggers to run 2nd script :
 +
/home/acedb/cron/dump_from_ws.sh          # dump .ace files from WS to update postgres tables
 +
 
 +
and write to a timestamp file
 +
my $dateOfWsDumpFile = /home3/acedb/cron/dump_from_ws/files/latestDate;
 +
echo "$date" >> $dateOfWsDumpFile;
 +
 
 +
in the postgres account :
 +
 
 +
cronjob 2 on script 3 (runs daily at 5am PST) :
 +
0 5 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/populate_pg_from_ws.pl
 +
 
 +
This script went live on tazendra on 2013 10 21
 +
 
 +
This script generates files to populate postgres tables for some gin_ tables and obo_ tables for exprcluster when the date in the timestamp file (see above) is more recent than the latest timestamp in gin_molname.  In other words, if WS is more recent than the postgres timestamp, the tables are updated (Lines 34 - 55).
 +
 
 +
For genes (gin_ tables), this script is species-restricted for C. elegans only (we can change this, if needed).
 +
 
 +
This script populates information that can only be derived from the WS build:
 +
 
 +
#Corresponding_transcript
 +
#Corresponding_CDS
 +
#Corresponding_protein
 +
#Molecular_name
 +
 
 +
If the WS timestamp is more recent than the postgres timestamp, then the subroutine processGeneCds (Line 57) will run.
 +
 
 +
Values and timestamps are written to a file, then the postgres tables are deleted, values from the files are copied to the postgres tables, and then the files are closed.
 +
 
 +
For populating gin_protein, the subroutine looks at the CDS objects (WSCDS.ace) and retrieves the Corresponding_protein.  To map genes to proteins, CDS and protein are matched (gin_seqprot), genes are matched to CDS (gin_seqprot) and then mapped to protein (gin_protein).
 +
 
 +
For populating gin_molname (Molecular_name), the subroutine looks at gene objects (WSGene.ace) and adds the value of Molecular_name to gin_molname.
 +
 
 +
For populating gin_sequence (Corresponding_transcript and Corresponding_CDS) the subroutine looks at gene objects (WSGene.ace) and adds the value to gin_sequence.
 +
 
 +
 
 +
We'll know when WS141 comes out (should be mid-to-late December 2013) and cronjob 1 picks it up, if this pipeline works automatically.
 +
 
 +
='''Some Relevant Postgres Queries:'''=
 +
 
 +
SELECT * FROM gin_dead WHERE gin_dead ~ 'merged' AND gin_dead ~ 'split';
 +
                                         
 +
SELECT * FROM gin_dead WHERE gin_dead ~ 'split';
 +
 
  
# /home/acedb/cron/populate_gin_locus.pl
 
# /home/acedb/cron/populate_gin.pl
 
  
  
  
='''New Scripts:'''=
+
Back to [[Caltech documentation]]

Latest revision as of 14:10, 29 May 2015

Table Summarizing Former and Current Postgres Gene Information (gin_ ) Population

AceDB tag Postgres table Old (pre-10/2013) - Nameserver nightly dump Old (pre-10/2013) - WS bimonthly release Current - Nameserver nightly dump Current - Geneace nightly dump Current - WS bimonthly release Use - Paper or meeting abstract gene connection Use - OA data type curation Use - Dumping scripts -- could be wrong, but I don't think any gin_ tables are used in dumping scripts since we store WBGene IDs. except maybe gin_dead if people want those suppressed or to have some kind of error message or to map to Historical_gene or something like that) Use - Protein2GO data conversion Use - Textpresso and/or GSA Markup Comment
WBGene identifier gin_wbgene Yes Yes (First line in each entry) Yes Yes Yes Yes
CGC_name gin_locus Yes (CGC) If it has this tag, gene is considered good (What does 'good' mean?) Yes (CGC) No No Yes No Yes
Other_name gin_synonyms Yes (checking for different CGC name - see lines 132-137 in script) Yes Yes (checking for different CGC name) Yes - add to what is already populated from nightly nameserver No Yes No Yes
Sequence_name gin_seqname Yes (Sequence) No Yes (Sequence) No No Yes No Yes
Status gin_dead Yes (0) only if value is dead and species ~ elegans$ Yes (0) only if value is dead in the nameserver nightly; populate with Merged_into and Split_into values No Yes Yes Yes
Suppressed gin_dead If Status from nightly geneace = Suppressed, populate gin_dead with Dead Suppressed (if Status in nightly nameserver is 0 (dead)) or populate gin_dead with Suppressed (if Status in nightly nameserver is 1 (live)). Also add Merged_into and/or Split_into if nightly nameserver Status is 0 (dead).
Merged_into gin_dead No Yes No Yes - add only when status is dead from nightly nameserver No Historical_gene tag uses this when dumping files
Split_into gin_dead No Yes No Yes - add only when status is dead from nightly nameserver No
Corresponding_transcript gin_sequence No Yes No No Yes Yes
Corresponding_CDS gin_sequence + gin_seqprot No Yes No No Yes Yes
Corresponding_protein gin_protein, gin_seqprot No Yes No No Yes Yes, but we'll need isoform data in WB Yes
Molecular_name gin_molname No Yes No No Yes No Maybe Yes
Species gin_species No No No Yes Yes - added to display of WBGene info in Term information window of OA May be used to add species information to automated concise description files and also to add taxon IDs to BioGrid interaction file
Pseudogene status gin_pseudogene No No No ??
Version_change No Yes, to make sure we don't attach GO annotations to pseudogenes. One use case would be to know when genes change class, e.g. CDS ->Pseudogene. We may not need to actually store this in postgres, though.
Public_name gin_wbgene Yes (but only when no CGC_name or Sequence_name) If it has this tag, gene is considered good Don't need (Public_name also in Other_name - confirm this is always the case) No Not if also in Other_name Not if also in Other_name Not if also in Other_name No I think we can now ignore the Public_name tag as long as there's always an Other_name value as well -- so if there is no Other_name then we'd look at Public_name ? looking at the script, we're not doing anything with this value)

Previous (pre-nameserver move) Scripts:

  1. /home/acedb/cron/populate_gin_locus.pl - updates information from nameserver nightly dumps
  2. /home/acedb/cron/populate_gin.pl - updates information from WS releases

How this all works (10/2013 - updated 04/2015)

Location of files:

The nightly nameserver and geneace dumps are on the sanger ftp site:

ftp://ftp.sanger.ac.uk/pub/consortia/wormbase/STAFF/mh6/nightly_geneace/

There are two json dumps (genes and variations) from the respective nameservers, and six .ace dumps (clones, genes, laboratories, rearrangements, strains, and variations) from geneace.

Note that the gene nameserver and geneace are not synchronized. Some changes are evident in the nightly nameserver dump before they are included in the geneace dump.


For updating from nightly gene nameserver and geneace dumps:

/home/postgres/work/pgpopulation/obo_oa_ontologies/geneace/populate_gin_nightly.pl

Called from:

0 20 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/update_obo_oa_ontologies.pl

Went live on tazendra 2013 10 25.

This script reads the genes.json and genes.ace.gz files to update WBGene IDs, CGC names, Sequence names, Synonyms (not from Other_name, but from a comparison between the CGC name in the json dump and the existing locus and synonym names in postgres), and the Status.
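The CGC-name comparison described above can be sketched roughly as follows. This is a minimal illustration, not the logic of populate_gin_nightly.pl itself: the function name `update_names`, the dict layout, and the gene ID and names in the usage note are hypothetical, and it assumes that when a CGC name changes, the replaced name is the one appended as a synonym.

```python
def update_names(locus, synonyms, wbgene, new_cgc_name):
    """Record the CGC name seen in the nameserver dump for one gene.

    locus:    dict of WBGene ID -> locus (CGC) name, as previously loaded from postgres
    synonyms: dict of WBGene ID -> set of synonym names, as previously loaded from postgres
    """
    old_name = locus.get(wbgene)
    if old_name is not None and old_name != new_cgc_name:
        # Assumption: the replaced CGC name is kept as a synonym so that it is
        # not lost before geneace lists it under Other_name.
        synonyms.setdefault(wbgene, set()).add(old_name)
    locus[wbgene] = new_cgc_name
```

For example, starting from `locus = {"WBGene00000001": "abc-1"}` and an empty `synonyms` dict, calling `update_names(locus, synonyms, "WBGene00000001", "abc-2")` sets the locus name to `abc-2` and appends `abc-1` to that gene's synonyms. Calling it again with the same name changes nothing.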

This script is not species-restricted.

Key features of this script:

  • Locus and Synonym Names
    • The script queries postgres for a list of locus names and synonyms and puts these into a hash.
    • New synonym entries are appended. This was done so that we don't lose new synonyms that have been picked up in the nameserver dump (from a changed CGC_name value) but are not yet included in the nightly geneace dump (the Other_name value).
    • When a gene is killed and merged into another, synonyms are removed from the Dead gene and added to the gin_synonyms table for the live gene into which it was merged.
  • Species
    • On 2015-04-02 Michael Paulini added Species information to the genes.ace file dumped nightly from geneace.
    • The script now populates a gin_species table in postgres to record species names, e.g. Caenorhabditis elegans.
  • Status
    • Live genes have no entry in the gin_dead table.
    • Dead genes have different entries depending on the information that is available for them:
      • Dead
      • Dead, Merged_into
      • Dead, Merged_into, Split_into
    • Suppressed genes are a special case.
      • They can have Dead Suppressed or Live Suppressed
      • Right now (10/2013), we only have Dead Suppressed
  • Sanity Check
    • Since this script wipes out and re-populates postgres data, there is a sanity check: if the nameserver file contains information on fewer than 20,000 genes, the script will not run and an email alert is sent. The 20,000 threshold is admittedly arbitrary, but is hopefully a reasonable cut-off.
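The Status and Sanity Check rules above can be sketched as two small functions. This is only an illustration under stated assumptions: the function names are hypothetical, and the exact text the real script stores in gin_dead (separator, capitalization, how the target gene is written) is not given on this page, so the strings built here are placeholders. The live-but-suppressed case follows the Suppressed row of the table (gin_dead is populated with "Suppressed").

```python
MIN_GENES = 20000  # cut-off quoted in the text; admittedly arbitrary


def gin_dead_value(live, suppressed=False, merged_into=None, split_into=None):
    """Return an illustrative gin_dead entry, or None when the gene needs no entry."""
    if live:
        # Live genes have no gin_dead entry unless geneace marks them Suppressed.
        return "Suppressed" if suppressed else None
    parts = ["Dead Suppressed" if suppressed else "Dead"]
    # Merged_into / Split_into are only added for dead genes.
    if merged_into:
        parts.append(f"Merged_into {merged_into}")
    if split_into:
        parts.append(f"Split_into {split_into}")
    return ", ".join(parts)  # assumed format, not the script's actual one


def enough_genes(gene_count, minimum=MIN_GENES):
    """Sanity check: refuse to wipe and repopulate when the dump looks truncated."""
    return gene_count >= minimum
```

A caller would check `enough_genes(...)` on the number of genes in the nameserver file before deleting anything, and send the alert email instead of proceeding when it returns False.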


For updating from latest WS release:

See the wiki page Updating_Postgres_with_New_WS_Information for details

There are 2 cronjobs and 3 scripts:

in the acedb account:

cronjob 1 on script 1 (runs daily at 3am PST):

0 3 * * * /home/acedb/cron/update_ws_tazendra.pl

This also triggers a run of the 2nd script:

/home/acedb/cron/dump_from_ws.sh    # dump .ace files from WS to update postgres tables

and a write to a timestamp file:

my $dateOfWsDumpFile = /home3/acedb/cron/dump_from_ws/files/latestDate;
echo "$date" >> $dateOfWsDumpFile;

in the postgres account :

cronjob 2 on script 3 (runs daily at 5am PST):

0 5 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/populate_pg_from_ws.pl

This script went live on tazendra on 2013 10 21

This script generates files to populate postgres tables for some gin_ tables and obo_ tables for exprcluster when the date in the timestamp file (see above) is more recent than the latest timestamp in gin_molname. In other words, if WS is more recent than the postgres timestamp, the tables are updated (Lines 34 - 55).
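The timestamp comparison described above can be sketched like this. It is a minimal illustration, not the code of populate_pg_from_ws.pl: the function names are hypothetical, and it assumes that the latestDate file holds one date per line in YYYY-MM-DD form with the newest entry last (the file is appended to), and that the latest gin_molname timestamp has already been fetched from postgres as a datetime.

```python
from datetime import datetime


def latest_ws_dump_date(file_text):
    """Return the last non-empty line of the timestamp file as a datetime."""
    lines = [line.strip() for line in file_text.splitlines() if line.strip()]
    # Assumption: dates are written as YYYY-MM-DD, newest last.
    return datetime.strptime(lines[-1], "%Y-%m-%d")


def needs_update(file_text, latest_pg_timestamp):
    """True when the WS dump is more recent than the newest postgres timestamp."""
    return latest_ws_dump_date(file_text) > latest_pg_timestamp
```

With a file containing `2013-10-01` and `2013-12-20` and a postgres timestamp of 21 October 2013, `needs_update` returns True, so the tables would be repopulated. With only the `2013-10-01` line it returns False and nothing is touched.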

For genes (gin_ tables), this script is species-restricted for C. elegans only (we can change this, if needed).

This script populates information that can only be derived from the WS build:

  1. Corresponding_transcript
  2. Corresponding_CDS
  3. Corresponding_protein
  4. Molecular_name

If the WS timestamp is more recent than the postgres timestamp, then the subroutine processGeneCds (Line 57) will run.

Values and timestamps are written to a file, then the postgres tables are deleted, values from the files are copied to the postgres tables, and then the files are closed.
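The delete-then-copy step can be sketched as below. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for postgres, the two-column table layout and the function name `reload_table` are hypothetical, and wrapping both steps in one transaction is a choice made for this sketch (the page does not say whether the real script does so).

```python
import sqlite3


def reload_table(conn, table, rows):
    """Empty `table` and refill it from `rows` (two-value tuples) in one transaction.

    `table` is interpolated into the SQL text, so only pass trusted table names.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(f"DELETE FROM {table}")
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
```

Because the delete and the inserts share a transaction, a failure part-way through the reload leaves the previous contents of the table in place rather than an empty table.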

For populating gin_protein, the subroutine looks at the CDS objects (WSCDS.ace) and retrieves the Corresponding_protein. To map genes to proteins, CDS and protein are matched (gin_seqprot), genes are matched to CDS (gin_seqprot) and then mapped to protein (gin_protein).
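The gene-to-protein mapping described above is essentially a two-step join, which can be sketched as follows. This is a minimal illustration rather than the processGeneCds subroutine itself: the function name, the dict inputs, and the identifiers in the usage note are hypothetical.

```python
def map_genes_to_proteins(gene_to_cds, cds_to_protein):
    """Join gene -> CDS with CDS -> protein to obtain gene -> proteins.

    gene_to_cds:    dict of gene ID -> list of CDS names
    cds_to_protein: dict of CDS name -> Corresponding_protein value
    """
    gene_to_protein = {}
    for gene, cds_names in gene_to_cds.items():
        # A gene can have several CDSs (isoforms); collect each distinct protein once.
        proteins = sorted({cds_to_protein[c] for c in cds_names if c in cds_to_protein})
        if proteins:
            gene_to_protein[gene] = proteins
    return gene_to_protein
```

For instance, a gene with CDSs `X1.1a` and `X1.1b` that map to proteins `P1` and `P2` ends up with both proteins, while a gene whose CDSs have no Corresponding_protein is simply left out of the result.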

For populating gin_molname (Molecular_name), the subroutine looks at gene objects (WSGene.ace) and adds the value of Molecular_name to gin_molname.

For populating gin_sequence (Corresponding_transcript and Corresponding_CDS) the subroutine looks at gene objects (WSGene.ace) and adds the value to gin_sequence.


We'll know whether this pipeline works automatically when WS141 comes out (expected mid-to-late December 2013) and cronjob 1 picks it up.

Some Relevant Postgres Queries:

SELECT * FROM gin_dead WHERE gin_dead ~ 'merged' AND gin_dead ~ 'split';

SELECT * FROM gin_dead WHERE gin_dead ~ 'split';



Back to Caltech documentation