Difference between revisions of "Transgene curation pipeline"

From WormBaseWiki

Latest revision as of 21:59, 3 May 2020


Importing transgenes from textpresso

Every day at 4 am, a cron job runs:

/home/postgres/work/pgpopulation/textpresso/wrapper.sh

which calls:

/home/postgres/work/pgpopulation/textpresso/transgene/update_textpresso_transgene.pl
/home/postgres/work/pgpopulation/textpresso/antibody/update_textpresso_antibody.pl
/home/postgres/work/pgpopulation/afp_papers/find_passwd_@.pl
/home/postgres/public_html/cgi-bin/data/ccc_gocuration/get_newset.pl

The relevant script for transgenes is:

/home/postgres/work/pgpopulation/textpresso/transgene/update_textpresso_transgene.pl

It gets its data from:

http://textpresso-dev.caltech.edu/transgene/transgenes_in_regular_papers.out

--kjy 19:35, 24 May 2012 (UTC)

Curating transgenes

All transgenes entered through the Textpresso pipeline are annotated with Arun as curator - these can be pulled out by searching for Arun in the curator field.

Transgenes are also entered by other curators in the course of curating Expression pattern (Daniela), Gene Regulation (Xiaodong), and Gene interaction and overexpression of gene interaction (Chris, Gary, Karen).

Curators should only enter transgenes that have a public name that follows the accepted nomenclature. These transgenes can be retrieved by searching for the curator names in the curator field.

Curators should look for information on new transgenes in the paper and supplementary files.

A daily cron job looks for new transgenes entered into the trp tables. Its queries retrieve rows entered in the last 24 hours from trp_curator, trp_paper and cns_newtransgene; the trp_curator query excludes entries attributed to the transgene curator (WBPerson712 in the query):

SELECT * FROM trp_curator WHERE trp_curator != 'WBPerson712' AND trp_timestamp > now() - interval '1 day';

SELECT * FROM trp_paper WHERE trp_timestamp > now() - interval '1 day';

SELECT * FROM cns_newtransgene WHERE cns_timestamp > now() - interval '1 day';
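The first query can be tried out without touching the production database. The following is a minimal sketch that uses an in-memory SQLite table as a stand-in for postgres, so `datetime('now', '-1 day')` replaces `now() - interval '1 day'`. The three-column table layout and the curator ID WBPerson9999 are assumptions made for illustration only.

```python
import sqlite3

# Curator ID excluded by the trp_curator query in the text.
TRANSGENE_CURATOR = "WBPerson712"


def recent_foreign_curator_rows(conn):
    """Return trp_curator rows from the last day not attributed to TRANSGENE_CURATOR."""
    sql = (
        "SELECT joinkey, trp_curator FROM trp_curator "
        "WHERE trp_curator != ? "
        "AND trp_timestamp > datetime('now', '-1 day') "
        "ORDER BY joinkey"
    )
    return conn.execute(sql, (TRANSGENE_CURATOR,)).fetchall()


def demo_connection():
    """Build an in-memory table with hypothetical rows (schema is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trp_curator (joinkey TEXT, trp_curator TEXT, trp_timestamp TEXT)"
    )
    rows = [
        ("101", "WBPerson712", "-1 hour"),    # recent, but by the transgene curator
        ("102", "WBPerson9999", "-2 hours"),  # recent, by another (hypothetical) curator
        ("103", "WBPerson9999", "-3 days"),   # too old
    ]
    for joinkey, curator, offset in rows:
        conn.execute(
            "INSERT INTO trp_curator VALUES (?, ?, datetime('now', ?))",
            (joinkey, curator, offset),
        )
    return conn
```

Only the row for joinkey 102 satisfies both conditions, which mirrors what the daily check is meant to surface.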



If authors do not use proper nomenclature for their transgene:

  • look in referenced papers for the name (be sure to check whether the transgene already exists in postgres under that name; if so, attach the paper to it).
  • assign a WBPaper<##>Is# or WBPaper<##>Ex# name
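The fallback naming pattern above can be sketched as a small helper. This is an illustration only: the function names are hypothetical, the paper digits in the usage are made up, and the number of digits in the WBPaper part is deliberately left open because the text gives it only as <##>.

```python
import re

# Pattern for the fallback names described above: WBPaper<##>Is# or WBPaper<##>Ex#.
FALLBACK_NAME = re.compile(r"^WBPaper(\d+)(Is|Ex)(\d+)$")


def fallback_transgene_name(paper_digits, integrated, number):
    """Compose a fallback name; Is marks an integrated array, Ex an extrachromosomal one."""
    kind = "Is" if integrated else "Ex"
    return f"WBPaper{paper_digits}{kind}{number}"


def is_fallback_name(name):
    """Return True if the name follows the WBPaper<##>Is#/Ex# pattern."""
    return FALLBACK_NAME.match(name) is not None
```

For example, `fallback_transgene_name("00012345", True, 1)` composes a name for the first integrated transgene of a hypothetical paper, and `is_fallback_name` can flag such names later when a proper public name becomes available.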

Here is some controlled vocabulary for the transgene summary field:

  • Remark "Conflicting mapping info: ..."
  • Remark "Conflicting genotype: ..."
  • Remark "No transgene info in original publication."
  • Remark "Other integration method: ..."
  • Remark "Clone = "
  • Remark "Mapping info: "
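A remark can be checked against this vocabulary with a simple prefix test. The sketch below is hypothetical and not part of any existing script; the prefixes are copied from the list above, and the clone name in the usage is invented.

```python
# Controlled-vocabulary prefixes copied from the list above.
CONTROLLED_PREFIXES = (
    "Conflicting mapping info:",
    "Conflicting genotype:",
    "No transgene info in original publication.",
    "Other integration method:",
    "Clone =",
    "Mapping info:",
)


def controlled_prefix(remark):
    """Return the controlled prefix a remark starts with, or None for free text."""
    text = remark.strip()
    for prefix in CONTROLLED_PREFIXES:
        if text.startswith(prefix):
            return prefix
    return None
```

`controlled_prefix("Clone = pXY1")` returns `"Clone ="`, while an unstructured remark returns `None`.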

Assigning WBTransgene IDs

All transgenes should have a unique WBTransgeneID, except those that have been annotated as FALSE objects. IDs are assigned automatically when creating a NEW line in the OA. If an object is duplicated in the OA, make sure the WBTransgeneID is not copied along with the other information; if it was, delete the copied WBTransgeneID. In these cases, the WBTransgeneID is assigned by a cron job that runs every morning at 4 am. The WBTransgeneID is assigned based on the pgid.

0 4 * * * /home/acedb/karen/cronjobs/assign_transgene_IDs.pl
/home/postgres/work/pgpopulation/transgene/20121004_assign_transgene_IDs/assign_transgene_IDs.pl

The script looks at data from trp_name, trp_objpap_falsepos, and trp_curator. Anything that exists in trp_curator and has neither a trp_name nor a trp_objpap_falsepos gets an ID assigned by padding the joinkey to 8 digits, adding WBTransgene in front, and adding the result to trp_name and trp_name_hst.

NOTE: if we ever change any of those table names, this script will not work properly.

NOTE: interaction and protein call the "False Positive" tables 'falsepositive' instead of objpap_falsepos.
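The assignment rule can be illustrated in a few lines. This is a sketch of the logic as described here, not the code of assign_transgene_IDs.pl; the function names are hypothetical, and plain Python collections of joinkeys stand in for the postgres tables.

```python
def make_transgene_id(joinkey):
    """Pad the joinkey to 8 digits and put WBTransgene in front."""
    return "WBTransgene" + str(int(joinkey)).zfill(8)


def joinkeys_needing_ids(curator_keys, named_keys, falsepos_keys):
    """Joinkeys present in trp_curator with neither a trp_name nor a trp_objpap_falsepos entry."""
    pending = set(curator_keys) - set(named_keys) - set(falsepos_keys)
    return sorted(pending, key=int)


def assign_ids(curator_keys, named_keys, falsepos_keys):
    """Map each pending joinkey to the ID that would be written to trp_name and trp_name_hst."""
    pending = joinkeys_needing_ids(curator_keys, named_keys, falsepos_keys)
    return {key: make_transgene_id(key) for key in pending}
```

With joinkeys 1, 2 and 3 in trp_curator, 1 already named and 3 marked as a false positive, only joinkey 2 receives an ID.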

--kjy (talk) 00:23, 20 February 2014 (UTC)

Dumping Transgenes

Transgenes are dumped every Sunday at 4 am and are picked up by spica every Monday at 8 am.

0 4 * * sun cd /home/acedb/karen/transgene; ./use_package.pl
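For readers less familiar with crontab syntax, the entry above splits into five schedule fields followed by the command. The helper below is a hypothetical sketch for illustration; it only separates the fields and does not validate them.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron_entry(entry):
    """Split a crontab line into its five schedule fields and the command."""
    parts = entry.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    return schedule, parts[5]
```

Applied to the dump entry, the schedule reads minute 0, hour 4, any day of the month, any month, day of week sun, which matches the Sunday 4 am dump.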

The transgene use_package.pl script writes to /home/postgres/work/citace_upload/transgene/transgene.ace and transgene.ace<date>.

transgene.ace in the directory /home/acedb/public_html/karen/WS_upload_scripts/transgene is a symlink:

transgene.ace -> /home/acedb/karen/transgene/transgene.ace

Transgene OA

  • Autocomplete
  • Multi-ontology
  • Big text = free text, editable expanded box
  • Selection list
  • Selection list multiple list field
  • multi-drop down
  • Free text, pipe separated values

Tab 1 (Transgene)

  • Pgdbid -- postgres database ID, autogenerated
  • Name -- trp_name -- assigned WBTransgeneID
  • Public_name -- trp_public_name -- text
  • Synonym -- trp_synonym -- text, pipe separated values
  • Summary -- trp_summary -- bigtext
  • Construct
  • Coinjection construct
  • Coinjection marker -- trp_coinjection
  • Construction Summary
  • Remark -- trp_remark -- bigtext, pipe separated values

Tab 2 (Isolation)

  • Integrated from
  • Integration method -- trp_integration_method -- multi-drop-down list (formerly "Integrated by")
  • Corresponding variation
  • Map -- trp_map -- multi-drop-down list
  • Map Paper -- trp_map_paper -- multiontology paper
  • Map Person -- trp_map_person -- multiontology person
  • Laboratory -- trp_laboratory -- multiontology laboratory (formerly called Location)
  • Strain -- trp_strain -- text, pipe separated values

Tab 3 (Expression)

  • Curator -- trp_curator -- dropdown
  • Marker for -- trp_marker_for -- text
  • Marker Paper -- trp_maker_for_paper -- multiontology paper
  • Species -- trp_species -- text
  • Fail -- trp_objpap_falsepos -- toggle, for marking transgenes that are falsely attributed to a given paper
  • CGC remark*
  • Person -- trp_person -- multiontology
  • Paper -- trp_paper -- multiontology paper

--Kyook (talk) 23:53, 29 December 2014 (UTC)

Original Phenote transgene.cfg

Invoke the phenote transgene configuration interface and access postgres.

Go to the directory with phenote and run:
 $ ./phenote -c worm-transgene.cfg

Field codes: T = free text; M = multiple values, separated with a pipe (|); S = selection list
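The M convention (several values in one field, separated by pipes) can be handled with two small helpers. This is a hypothetical sketch: trimming whitespace around each value and joining with " | " are assumptions, since the text does not say how spacing is treated. The strain names in the example are made up.

```python
def split_pipe_values(field_text):
    """Split a pipe-separated field into its values, dropping empty entries."""
    return [part.strip() for part in field_text.split("|") if part.strip()]


def join_pipe_values(values):
    """Join values back into a single pipe-separated field."""
    return " | ".join(values)
```

For instance, `split_pipe_values("AB123 | CD456|")` yields the two strain names, and `join_pipe_values` writes them back in a uniform form.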

Tab 1

  • Pgdbid- postgres database ID, entered automatically when curator enters a new transgene
  • Name(T)-approved name following Lab-prefix (or WBPaperID), Is or Ex, number
  • Summary(T)- genotype, including co-injection marker and relevant information about making the construct. If papers report conflicting genotypes, use the Remark field and the controlled vocabulary "Conflicting genotype: ...". If there is no information, enter "No transgene info in original publication." in the Remark field.
  • Driven by Gene(T,M)- enter WBGeneID used for promoters in every promoter driven construct of the transgene
  • Reporter Product(S,M)- list has common reporter genes, GFP, RFP, LacZ, etc.
  • Other Reporter(T,M)- enter other products encoded as reporters that do not appear in the drop down list
  • Gene(T,M)- enter WBGeneID for protein output of construct, which isn't considered a reporter product
  • Integrated by(S)- choose integration method if known, use 'not integrated' for Ex transgenes, if integration method is not listed, use Remark field and controlled vocabulary: "Other integration method: ..."
  • Strain(T,M)- not used consistently, enter approved strain names for those strains that contain the transgene
  • Map(S,M)- choose LG(s) of integrated array if known, if papers report differing map positions use Remark field and controlled vocabulary "Conflicting mapping info: ..."

Tab 2

  • Map Paper(T,M?)- WBPaperID for paper that reports mapping info
  • Map Person(T,M)- WBPersonID? or Name, person evidence
  • Marker for(T,M)- not used, Wen's expression data
  • Marker Paper(T,M)- WBPaperID, not used, Wen's expression data
  • Reference(T,M)- WBPaperID, generally autofilled by Textpresso cron job script and bulk upload of Ex search script, add new paper if necessary
  • Remark(T,M)- catch all used for clarifying info from other fields, and for entering construct specifics, in some cases use controlled vocabulary
    • "Conflicting mapping info: ..."
    • "Conflicting genotype: ..."
    • "No transgene info in original publication."
    • "Other integration method: ..."
    • "Clone = "
    • "Mapping info: "
  • Species(T,M?)- not used?
  • Synonym(T,M)- other names for the transgene or construct
  • Driven by Construct(T,M)- not sure what this is
  • Location(T,M)- Lab designations for people who have the transgene, not sure about this.

Tab 3

  • Movie(T)- used?
  • Picture(T)- used?
  • Search New Transgene(T)- use to retrieve all transgenes that do not have any summary or remark data
  • SQL-use?


Way in the future

  • Genomic Expression->New field - for developing a standardized transgene expression nomenclature; eventually we will want a script to fill this in by composing it from other fields (promoters, reporters, clones)