SOP for generating GO files for citace and GO consortium uploads

Revision as of 14:45, 13 May 2019

1 List of Files for Upload
2 Generating GO files for citace
3 Generating a gene association file (since Nov 2013) for GOC upload
- 3.1 Uploading the gene association file to the GO consortium repository
- 3.2 File Uploads and Revisions
4 Generating a Gene Association File from UniProtKB with a 1:1 WBGene:Protein Mapping: 2014-08-12
5 Old SOPs
- 5.1 Generating a .ace file
- 5.2 Generating a gene association file
6 Updates for WS267
- 6.1 go_gpad_parser modifications
- 6.2 go_oa_parser modifications

List of Files for Upload

These files are deposited to citpub@spica in /home/citpub/Data_for_citace/Data_from_Kimberly/
The order of file production matters, as after the first file is dumped, the parsing scripts look to the previous .ace file to begin assigning GO annotation ids
- gp_annotation.ace (from Protein2GO gpad - this file gets dumped first, and a copy put in /home/acedb/kimberly/citace_upload/go so the next script can access the final annotation number)
- gocam_annotation.ace (from Noctua models - this file gets dumped second, and a copy put in /home/acedb/kimberly/citace_upload/go so the next script can access the final annotation number)
- go_annotation.ace (from OA - orphan genes and ncRNA annotations - this file gets dumped third and last)
  - Note that as of WS261, the Phenotype2GO annotations will no longer be included in the uploads, although they are still in the OA for reference
  - Also, the phenotype2go_mappings_WSNNN.ace file (containing the current WPO-to-GO mappings) is no longer needed
- go_terms.ace (.ace file of GO obo)
- ro_terms.ace

This file is deposited to citpub@spica in /home/citpub/citace/Data_for_Ontology/

gene_ontology.WSNNN.obo

Generating GO files for citace

Generating a .ace file for annotations in Protein2GO (Nov 2013 onwards)

All scripts and files: /home/acedb/ranjana/citace_upload/go_curation

Converting from gpad to ace:

--the gpad files from protein2GO are available for download here: ftp://ftp.ebi.ac.uk/pub/contrib/goa/

--the WB gpad file is listed towards the bottom of the page as: gp_association.6239_wormbase.gz. This file is dumped weekly, on Mondays.

--download and unzip the WB gpad file and transfer it to /home/acedb/ranjana/citace_upload/go_curation/ptgo_to_ace

--for the conversion script (gpToAce.pl) rename the WB gpad file to gp_association.wb (I've been making a directory for each upload's files for archive purposes, e.g. 2014_January)

--from /ptgo_to_ace, run the gpToAce.pl script; this generates gp_association.ace (annotations from Protein2GO)

--right now, the script will print to the screen error messages for GO_REFs that don't map to a WBPaper ID and Assigned_by values that don't map to WB. It will also give a few errors for not mapping to a curator ID. Another source of errors is when annotations have been made to a specific UniProt isoform ID that is not represented in the gp2protein file. Right now, I'm adding these isoform IDs as an additional line in the gp2protein file; hopefully with a new model, they will be handled appropriately.

--check the resulting .ace file for any annotations that are at the top of the file and map to an unknown gene, i.e. Gene : " ". These are usually the result of an obsolete or out-of-sync mapping between WB and UniProtKB IDs

--for now, the easiest way to fix these seems to be to check the annotations to the associated paper in Protein2GO (by entering the PMID into the search box) and then update the corresponding IDs in the gp2protein.wb file that is used for the gpToAce.pl script and is in the same directory

--scp gp_association.ace to your local machine and rename in the format: gp_association.WS243.ace

--read in an empty citace mirror with the release-appropriate models file already read in and saved in the mirror

Generating a .ace file for annotations in the OA

--run the wrapper.pl script at /home/acedb/ranjana/citace_upload/go_curation; this generates go.ace.<date> in /go_dumper_files (annotations that are left in the OA: RNA genes, uncloned genes)

--the count_stuff_for_ace.pl script at /go_dumper_files generates numbers for the go.ace.<date> file.

--scp go.ace.<date> to your local machine and rename as go_oa_WSXXX.ace

--Test the syntax of the files and the number of objects in a local citace mirror

--scp files to citpub@spica.caltech.edu:/home/citace/Data_for_citace/Data_from_Ranjana/.

Generating a .ace for the GO terms (ontology)

A new GO terms ontology file is generated every build: at /home/acedb/kimberly/citace_upload/go/ontology2ace/go_obo2ace
The script to run is: go_obo_to_go_ace.pl
This script will download the latest ontology file at: http://current.geneontology.org/ontology/go.obo
- (Used to be from: http://www.geneontology.org/ontology/obo_format_1_2/gene_ontology_ext.obo)
Once downloaded, the script will generate a new .ace output file, go_terms.ace
Rename the output file with the upload number, as go_terms_WSXXX.ace
scp the file to citpub on spica, /home/citpub/Data_for_citace/Data_from_Kimberly

Obo file of GO terms for citace upload

--At /home/citpub/Data_for_Ontology/ at citpub@spica.caltech.edu, use 'wget' to get gene_ontology_edit.obo file from:

http://current.geneontology.org/ontology/go.obo

(Old file we used to upload: http://www.geneontology.org/ontology/obo_format_1_2/gene_ontology_ext.obo)

--Rename file as gene_ontology.WSnnn.obo.

List of files for citace upload

These files are deposited to citpub@spica in /home/citpub/Data_for_citace/Data_from_Kimberly/

1. go_annotation.ace (manual annotations in the OA)

2. gp_association.ace (manual annotations from Protein2GO)

3. go_terms_WSXXX.ace (GO ontology)

4. phenotype2go_mappings.ace (consolidated phenotype2go mappings for any given build).

Submitted to /home/citpub/Data_for_Ontology/:

5. gene_ontology.WSXXX.obo.

No longer submitted:

WBPaper00038491_genes.ace: this file added gene-to-paper connections for Daniel Shaye. These genes were added to the paper editor, so this file is no longer manually being put into citace.

Numbers for citace upload

As reported by testing files in the local empty citace mirror:

GO file numbers
	WS242	WS243	WS244	WS245	WS246
gp_association	2,749 genes 43,578 lines	2,765 genes 43,914 lines	2,909 genes 45,691 lines	3,006 genes 46,470 lines	3,069 genes 47,731 lines
go_oa	169 genes 1,293 lines	169 genes 1,293 lines	180 genes 1,347 lines	180 genes 1,348 lines	181 genes 1,368 lines
total #genes	2,918 genes	2,934 genes	3,089 genes	3,186 genes	3,250 genes
go_terms	40,402 terms 1,928,733 lines	40,647 terms 1,944,916 lines	41,096 terms 1,968,936 lines	41,392 terms 1,990,179 lines	41,865 terms 2,022,149 lines

Generating a gene association file (since Nov 2013) for GOC upload

--In the tmp directory on a local machine:

--Download elegans annotation file, 9.C_elegans from the UniProt ftp site (annotations from Protein2GO): ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/

2014-08-05 - convert all UniProtKB IDs to WBGene IDs to allow for more streamlined view of C. elegans annotations in AmiGO and easier use of annotations in enrichment analyses

Will need a script that does this:
- Using the gp2protein.wb and 9.C_elegans.goa files as input files, converts values in the 9.C_elegans.goa file as follows:
  - Column 1 in 9.C_elegans.goa from UniProtKB -> WB
  - Column 2 in 9.C_elegans.goa from six-character UniProtKB identifier to corresponding WBGene identifier in gp2protein file
    - Error message for conversion errors: "UniProtKB identifier "nnnnnn" cannot be converted to a WBGene"
    - This error message could be at the top of the file (or an errors output file - doesn't matter to me)
  - Column 12 in 9.C_elegans.goa from protein -> gene
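
The conversion described above could look roughly like the following Python sketch. It is not the convert_uniprot_to_wb.pl script itself; the function names are hypothetical, the gp2protein lines are assumed to be whitespace-separated with semicolon-separated accessions, and rows that cannot be converted are assumed to be left out of the output and reported separately.

```python
def load_gp2protein(lines):
    """Map each UniProtKB accession to its WBGene ID (both without prefixes)."""
    acc_to_gene = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or line.startswith("!"):
            continue
        gene = fields[0].split(":", 1)[-1]          # strip the 'WB:' prefix
        for acc in fields[1].split(";"):
            acc_to_gene[acc.split(":", 1)[-1]] = gene  # strip 'UniProtKB:'
    return acc_to_gene


def convert_goa(goa_lines, acc_to_gene):
    """Rewrite columns 1, 2 and 12 of each annotation row; collect errors."""
    converted, errors = [], []
    for line in goa_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            converted.append(line)                  # keep header lines as they are
            continue
        cols = line.split("\t")
        gene = acc_to_gene.get(cols[1])
        if gene is None:
            # Assumption: unconvertible rows are skipped and reported.
            errors.append(
                f'UniProtKB identifier "{cols[1]}" cannot be converted to a WBGene'
            )
            continue
        cols[0] = "WB"                              # column 1: UniProtKB -> WB
        cols[1] = gene                              # column 2: accession -> WBGene
        if cols[11] == "protein":
            cols[11] = "gene"                       # column 12: protein -> gene
        converted.append("\t".join(cols))
    return converted, errors
```

Returning the errors as a separate list makes it easy to write them either to the top of the output or to their own file.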

2014-08-06 - above script (convert_uniprot_to_wb.pl) works well, but we need to try to get the number of mapping errors down, if we can.
We could try a new pipeline for generating our gp2protein file using the UniProtKB idmapping file (available here: ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/). Other issues (e.g., pseudogenes still represented as proteins in UniProt) will have to be dealt with on a case-by-case basis.
We'll need a new script to generate the gp2protein file from the UniProtKB idmapping file. Here are specs for that script:
- Input file: 6239.idmapping on tazendra in /home/acedb/ranjana/citace_upload/go_curation/ptgo_to_go/gp2protein
- Output file: gp2protein.wb (can go in the same gp2protein directory as the input file)
- For each WBGene in column 3 of input file, prefix the value with 'WB:' and make an entry in column 1 of output file
- For each WBGene in column 3 of input file, prefix the corresponding value in column 1 of input file with 'UniProtKB:' and place in column 2 of output file
- When a WBGene in column 3 of input file has more than one row in the input file (i.e., more than one corresponding UniProtKB identifier), place each UniProtKB id in column 2 of the output file, separated by a semi-colon.
- Examples:
  - Input file: A0A9S0 WormBase WBGene00044742
  - Output file: WB:WBGene00044742 UniProtKB:A0A9S0
  - Input file: H2L2A7 WormBase WBGene00001650
  - Input file: Q19791 WormBase WBGene00001650
  - Output file: WB:WBGene00001650 UniProtKB:H2L2A7;UniProtKB:Q19791
- Also, for the errors output of convert_uniprot_to_wb.pl, I think, given the numbers, it would be better to write them to a separate file.
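
As an illustration of the specs above, here is a minimal Python sketch of the gp2protein generation. The function name is hypothetical, and the sketch assumes the idmapping rows are whitespace-separated and that any row whose third column starts with 'WBGene' is a WormBase mapping row; all other rows are ignored.

```python
def build_gp2protein(idmapping_lines):
    """Group UniProtKB accessions by WBGene and emit gp2protein lines."""
    gene_to_accs = {}
    for line in idmapping_lines:
        fields = line.split()
        # Keep only rows that carry a WBGene ID in column 3.
        if len(fields) < 3 or not fields[2].startswith("WBGene"):
            continue
        accs = gene_to_accs.setdefault(fields[2], [])
        if fields[0] not in accs:       # avoid listing an accession twice
            accs.append(fields[0])
    return [
        f"WB:{gene}\t" + ";".join(f"UniProtKB:{acc}" for acc in accs)
        for gene, accs in gene_to_accs.items()
    ]
```

Genes are emitted in the order they first appear in the input; sorting the result would be a one-line change if a sorted gp2protein.wb is preferred.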

--For phenotype2go annotations: Download the Hinxton gene association file for Ce from the appropriate folder, e.g. ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS242/ONTOLOGY/, grep for 'WBPhenotype', and deposit the matching lines in a file, 'phenotype2go'

--For manual annotations left in the OA: If wrapper.pl has already been run for the citace upload, the gene association file for annotations in the OA has been dumped as go.go.<date> on Tazendra in /home/acedb/ranjana/citace_upload/go_curation/go_dumper_files/. scp the file to the tmp directory on the local machine

--Concatenate the files into gene_association.wb: cat 9.C_elegans phenotype2go go.go.<date> > gene_association.wb

--remove the header from the go.go file, place it at the top of the concatenated file, and correct the spacing within $Date:$
!Version: $Revision: $
!Organism: Caenorhabditis elegans
!date: $Date:$
!From: WormBase
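
The header step can also be done programmatically. The sketch below is only an illustration with a hypothetical function name; it assumes that every line starting with '!' in the concatenated file is a header line and simply moves those lines to the top, leaving the annotation rows in their original order. It does not correct the spacing in the $Date:$ line.

```python
def hoist_headers(lines):
    """Move all '!' header lines to the top, keeping the order of both groups."""
    headers, data = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                    # drop empty lines left by concatenation
        if line.startswith("!"):
            headers.append(line)
        else:
            data.append(line)
    return headers + data
```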

--gzip file

Uploading the gene association file to the GO consortium repository

Use SVN commands to upload to the GO:

maya:tmp ranjana$ svn co svn+ssh://ranjana@ext.geneontology.org/share/go/svn/trunk/gene-associations/submission

Prompts for password:
Downloads the 'submission/' directory into the tmp directory.
Copy the new gene_association.wb.gz into the submission directory. This overwrites the older file.

Commit the submission directory (from the base tmp directory):

maya:tmp ranjana$ svn commit submission/ -m "upload of wormbase file"
ranjana@ext.geneontology.org's password:
Sending submission/gene_association.wb.gz
Transmitting file data.

For updating the README:

maya:tmp ranjana$ svn co svn+ssh://ranjana@ext.geneontology.org/share/go/svn/trunk/gene-associations/readme
cd to readme, make changes to WormBase.README, save, cd back up to tmp, and then commit:
maya:tmp ranjana$ svn commit readme/ -m "changed date and version numbers only"
Transmitting file data .

File Uploads and Revisions

Annotations file:

Committed revision 4045.
Committed revision 16362, March 11, 2014
Committed revision 16371, March 11, 2014, now includes phenotype2go annotations from Hinxton file.
Committed revision 16756, March 31st, 2014, with new release of UniProt file only, other files stay the same, now has sequence identifier for dhs-16, Github #2523
Committed revision 18045, new upload for June 4th, 2014, with WS243 phenotype2go and OA go annots, and UniProt file dated 5/12/14.
Committed revision 20708, new upload for Sept. 30th, 2014, with WS245 phenotype2go, and WS246 UniProt and OA go annots, UniProt file dated 9/2/2014 (no changes for README)

README file:

Committed revision 4046.
Committed revision 16363, March 11, 2014.
For the June 4th upload, nothing to change in the README, will leave as is, so no revision change.

Generating a Gene Association File from UniProtKB with a 1:1 WBGene:Protein Mapping: 2014-08-12

This file is needed to circumvent errors generated by the GOTermFinder enrichment tool as a result of multiple mappings between a WBGene ID in Column 2 and a Gene Symbol in Column 3 of the WBGene-based GAF.
This file also reduces the redundancy that results when using UniProtKB accessions to perform enrichment analyses (although there may be experiments where annotations to specific transcripts or gene products are the appropriate set of annotations to use).
Thoughts - supply multiple files: WBGene only; UniProtKB only (would lose ncRNA annotations); both files with annotations only to the most granular terms; add ND annotations to all gene products that have no references.
Files needed:
- Tab-delimited text file from UniProtKB with accessions, status, and length: http://www.uniprot.org/uniprot/?query=organism%3a6239+keyword%3a1185&format=tab&columns=id,entry%20name,reviewed,protein%20names,genes,organism,length
  - File available on tazendra here: ranjana/citace_upload/go_curation/ptgo_to_go/wb_uniprot_single_mapping/uniprot_summary.wb
- Tab-delimited gp2protein file for WBGene to UniProtKB accession mappings
  - File available on tazendra here: ranjana/citace_upload/go_curation/ptgo_to_go/wb_uniprot_single_mapping/gp2protein.wb
General idea: Creating another version of the gp2protein file that has single mappings between WBGenes and UniProtKB Accessions.
- Key issue is how to select the single UniProtKB entry we will use.
- Output file will be another gp2protein file; we can call this one gp2protein_single.wb
- I think about this as creating a file or table with all the possible values and then pruning the table to have a single representation of each WBGene and UniProtKB pair and then lastly creating the gp2protein_single.wb file as output.

Step 1: Create this mapping:

WBGene ID (from gp2protein.wb, column 1, populate based on entry in gp2protein.wb, column 2 match entry after 'UniProtKB:' prefix)	UniProtKB Accession (from uniprot_summary.wb, column 1, Entry)	Status (from uniprot_summary.wb, column 3, Status)	Length (from uniprot_summary.wb, column 7, Length)
WB:WBGene00000464	Q9BL02	reviewed	1273
WB:WBGene00000464	Q8IA98	reviewed	621
WB:WBGene00000536	P48376	reviewed	187
	V6CJ36	unreviewed	1090
	V6CK82	unreviewed	1079
WB:WBGene00000942	P19826	reviewed	1010
WB:WBGene00000942	Q8MPS2	unreviewed	999
WB:WBGene00000942	Q9BI31	unreviewed	369
WB:WBGene00001377	O17670	unreviewed	503
WB:WBGene00001377	Q564S5	unreviewed	469
WB:WBGene00021697	B3CJ34	unreviewed	2651

Step 2: Prune the mapping to eliminate rows without a WBGene and eliminate redundant WBGene entries according to the following criteria:
- If one WBGene entry only, leave that one entry.
- If more than one entry and all entries are either reviewed or all entries are unreviewed, keep one entry with the longest length.
- If more than one entry and one is reviewed, but others are not, keep the entry that is reviewed.
- If more than one entry and some are reviewed and some are not reviewed, keep the longest, reviewed entry.

WBGene ID (from gp2protein.wb, column 1, populate based on entry in gp2protein.wb, column 2 match entry after 'UniProtKB:' prefix)	UniProtKB Accession (from uniprot_summary.wb, column 1, Entry)	Status (from uniprot_summary.wb, column 3, Status)	Length (from uniprot_summary.wb, column 7, Length)
WB:WBGene00000464	Q9BL02	reviewed	1273
WB:WBGene00000536	P48376	reviewed	187
WB:WBGene00000942	P19826	reviewed	1010
WB:WBGene00001377	O17670	unreviewed	503
WB:WBGene00021697	B3CJ34	unreviewed	2651

Step 3: Final output is a gp2protein file that contains the entries in Columns 1 and 2, with column 2 prefixed with 'UniProtKB:', sorted in ascending order based on WBGene ID.
- Other outputs: two separate files listing entries that were kept and entries that were pruned in Step 2, i.e. those with no WBGene or those that were redundant, with all four columns of data.

    WB:WBGene00000464     UniProtKB:Q9BL02
    WB:WBGene00000536     UniProtKB:P48376
    WB:WBGene00000942     UniProtKB:P19826
    WB:WBGene00001377     UniProtKB:O17670
    WB:WBGene00021697     UniProtKB:B3CJ34
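
Steps 1 to 3 can be sketched in Python as follows. This is an illustration of the pruning criteria rather than an existing script: the function names are hypothetical, rows are assumed to be (WBGene ID, accession, status, length) tuples with an empty string for a missing WBGene, and ties in length are broken by input order, which the criteria above do not specify.

```python
def pick_representative(entries):
    """Prefer reviewed entries; among the preferred pool keep the longest."""
    reviewed = [e for e in entries if e[1] == "reviewed"]
    pool = reviewed or entries          # fall back to all entries if none reviewed
    return max(pool, key=lambda e: e[2])


def prune(rows):
    """Return ({gene: kept entry}, [pruned rows]) for the Step 1 mapping."""
    by_gene, pruned = {}, []
    for gene, acc, status, length in rows:
        if not gene:
            pruned.append((gene, acc, status, length))  # no WBGene: drop
            continue
        by_gene.setdefault(gene, []).append((acc, status, length))
    kept = {}
    for gene, entries in by_gene.items():
        best = pick_representative(entries)
        kept[gene] = best
        pruned.extend((gene, *e) for e in entries if e is not best)
    return kept, pruned


def to_gp2protein_single(kept):
    """Emit gp2protein lines sorted in ascending order by WBGene ID."""
    return [f"{gene}\tUniProtKB:{kept[gene][0]}" for gene in sorted(kept)]
```

The pruned list carries all four columns, so it can be written out directly as the "entries that were pruned" file.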

Step 4: The resulting gp2protein file will be used as the input for the script that converts the UniProtKB goa file to a WB .go file (I'll rename it for the purposes of running the script). I'll run this script manually.

Old SOPs

Generating a .ace file

On Tazendra, acedb account:

--run the ./wrapper.pl script at /home/acedb/ranjana/citace_upload/go_curation/

--./wrapper.pl dumps both go.ace and go.go files under /home/acedb/ranjana/citace_upload/go_curation/go_dumper_files/ with dates appended

-- go.go.20090731 and go.ace.20090731.091726 files created under /go_dumper_files

--Run the check_go_ace.pl script as './check_go_ace.pl filename' (NOTE: THIS SCRIPT IS NO LONGER RUN). The script strips out errors that don't have to do with the Gene header and puts all errors in error_files/go.err.time (if the input is in the go.ace.time format, it replaces the 'ace' part with 'err')

--As of now the script removes only the erroneous line, not the associated curator_confirmed line directly under it, which needs to be removed manually. Need to think about this.

--Run count_stuff_for_ace.pl on the file to get the numbers. Note: worked with JC to modify check_go_ace.pl; that script is no longer relevant and can be skipped, since we are using the OA.

--scp file to maya.caltech.edu and rename file in format: 032107_WS174_go_dump.ace

--Manually remove these annotations, which are actually 'NOT' annotations:

mtm-9 WBGene00003479 GO:0004438

vha-2 WBGene00006911 GO:0009790--looks like annotation was removed manually, no longer in dump

vha-3 WBGene00006912 GO:0009790--looks like annotation was removed manually, no longer in dump

hsp-60 WBGene00002025 GO:0009408 (added from WS194 upload)

hsp-12.3 WBGene00002012 GO:0051082 (added from WS202 upload)

hsp-12.6 WBGene00002013 GO:0051082 and GO:0006950

--Test file syntax and number of objects in local citace mirror on Juno:

Read in file for syntax errors

Count the number of WBGenes, Papers, and WBPersons before and after loading the ace file

--scp file to citace@spica.caltech.edu:/home/citace/Data_for_citace/Data_from_Ranjana/.

The following files are submitted to the citace account on citace@spica.caltech.edu every build:

To: /home/citace/Data_for_citace/Data_from_Ranjana/

1. date_WSXXX_go_dump.ace (dumped from postgres, from the manual curation via Phenote)

2. variation2goterm_VarID.ace. This is the file where allele names have been converted to WBVarIDs by Wen. Use this file until this data is read into Postgres.

3. phenotype2go_mappings.ace (consolidated phenotype2go mappings for any given build).

4. A new GO terms ontology file, generated at /home/acedb/ranjana/citace_upload/go_curation/go_obo2ace using the go_obo_to_go_ace.pl script; rename with the upload number, e.g. go_terms_WS240.ace. (We used to submit a WSXXXGOterms.ace file that Wen dumped; it is no longer used.)

All of the above files are submitted to: /home/citace/Data_for_Ontology/ at citace@spica.caltech.edu

NOTE: These genes were added to the paper editor, so this file is no longer manually being put into citace.

5. WBPaper00038491_genes.ace added genes to paper connection for Daniel Shaye

Change directory to: Data_for_Ontology/, under /home/citace/.

Here use 'wget' to get gene_ontology_edit.obo file from

http://www.geneontology.org/ontology/obo_format_1_2/gene_ontology.1_2.obo.

Rename file in the format: gene_ontology.WS231.obo.

Generating a gene association file

In the acedb user account on Tazendra at /home/acedb/ranjana/GO:

--Use ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS211/ONTOLOGY/gene_association.WS211.wb.ce

--use 'grep IEA gene_association.WSXXX.wb.ce > gene_association.wb.electronic' to separate the IEAs.

--grep WBPhenotype gene_association.WSXXX.wb.ce > gene_association.wb.rnai2go (to get both Erich's earlier RNAi2GO annotations and the new associations based on allele phenotypes that went into WormBase WS186).

--copy the right go.go.<date> file from /home/acedb/ranjana/citace_upload/go_curation/go_dumper_files/ to this directory, change name to gene_association.wb.manual.

--new GOA elegans file, from 04.02.12, for external annots (use 'wget ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/9.C_elegans.goa')

--Run the ./wrapper.pl script. Output will include the various error types

--Run ./strip_errors_and_concatenate.pl

Scp the generated gene association file to a local machine for post-processing and upload to the GOC.

In the tmp directory on Maya:

--scp file to Maya

--removed 'NOT' annotations from mtm-9, vha-2, vha-3, hsp-60, hsp-12.3, hsp-12.6. (We do not take out NOT annotations anymore)

--removed header from the middle of concatenated file in two places (on top of UniProt file too, search for 'gaf-version') and placed on top of file (correct minor mistake in header--space after the $ on one of the lines)

--And move the following header from the middle of file to the top of file:

!Version: $Revision: $

!Organism: Caenorhabditis elegans

!date: $Date: $

!From: WormBase

--Add these two lines at the bottom of header:

!DataBase_Project_Name: WormBase WS215/WS216

!gaf-version: 2.0

--Remove the header 'gaf 2.0', from the top of the UniProt file

--gzip file

--Copy file to the tmp directory

Use SVN commands to upload to the GO, also update README file every upload.

Updates for WS267

These changes were made to accommodate the use of Relations Ontology terms in the 'Annotation_relation' and 'Annotation_relation_not' fields in the ?GO_annotation model

go_gpad_parser modifications

/home/acedb/kimberly/citace_upload/go/gpad2ace/2018_June_test/go_gpad_parser.pl

Line 130 - Will instead need a mapping between annotation relation and RO term id

When just relation -> Annotation_relation

When NOT|relation -> Annotation_relation_not

This may be temporary as it was proposed at the NYC GO meeting to start using RO ids in place of text in the qualifier/relation column of at least the GPAD file

Eventually this should also be the case for the Annotation extension relations, but there are some used in AEs that are not in RO

Mappings (as of June 12th):

colocalizes_with RO:0002325 110
contributes_to RO:0002326 277
enables RO:0002327 22903
involved_in RO:0002331 32498
part_of BFO:0000050 33772

NOT annotations 224
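
A minimal Python sketch of the relation handling described above (the parser itself is go_gpad_parser.pl; this is only an illustration). The function name is hypothetical, the qualifier is assumed to be a pipe-separated string with an optional NOT and exactly one relation, and the identifiers are the ones listed in the mappings above.

```python
RELATION_TO_RO = {
    "colocalizes_with": "RO:0002325",
    "contributes_to": "RO:0002326",
    "enables": "RO:0002327",
    "involved_in": "RO:0002331",
    "part_of": "BFO:0000050",
}


def relation_tag(qualifier):
    """Return (?GO_annotation tag, relation term ID) for a qualifier value."""
    parts = qualifier.split("|")
    negated = "NOT" in parts
    relations = [p for p in parts if p != "NOT"]
    if len(relations) != 1:
        raise ValueError(f"expected exactly one relation in {qualifier!r}")
    term_id = RELATION_TO_RO[relations[0]]   # KeyError flags an unmapped relation
    tag = "Annotation_relation_not" if negated else "Annotation_relation"
    return tag, term_id
```

For example, relation_tag("NOT|enables") selects the Annotation_relation_not tag with RO:0002327. Letting an unmapped relation raise an error is a design choice here, so that new relations are noticed rather than silently dropped.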

go_oa_parser modifications

/home/postgres/work/citace_upload/go_curation/get_go_annotation_ace.pm

Line 56 - relations are stored in the gop_qualifier table

Will need to output the same as above

Mappings (as of June 12th):

colocalizes_with RO:0002325 0
contributes_to RO:0002326 0
enables RO:0002327 20
involved_in RO:0002331 334
part_of BFO:0000050 29

NOT annotations 0
