Difference between revisions of "Transgenes"
(11 intermediate revisions by the same user not shown) | |||
Line 2: | Line 2: | ||
==Transgene objects== | ==Transgene objects== | ||
− | We curate all integrated (Is, In, Si), and | + | We curate all integrated (Is, In, Si), and extrachromosomal (Ex) transgenes. Note: Ti objects are considered transposons and are covered under variations, and In is sometimes ambiguous with Inversions, which should be in Rearragements. |
For each transgene, we extract the following information:<br> | For each transgene, we extract the following information:<br> | ||
Line 21: | Line 21: | ||
* gene product(s) | * gene product(s) | ||
* etc. | * etc. | ||
+ | |||
+ | [[User:Kyook|Kyook]] ([[User talk:Kyook|talk]]) 17:28, 24 October 2024 (UTC) | ||
==Model Changes== | ==Model Changes== | ||
Line 67: | Line 69: | ||
==Transgene first pass:== | ==Transgene first pass:== | ||
+ | NOTE: Updating scripts for the dockerized textpresso-curation machine - Valerio will generate a script to find all In/Is/Si|Ex transgenes, append WBPaperIDs to the matched transgenes in the Postgres tables, then enter unmatched transgenes on their own line (new pgid, new trp_name WBTransgeneID), filling in trp_publicname, trp_paper, and trp_curator "WBPerson4793" (Arun) | ||
+ | |||
+ | From Valerio:This is the new repo: https://github.com/WormBase/entity-extraction-transgene. This is the ticket for the work to do: https://github.com/WormBase/entity-extraction-transgene/issues/1 and this is the ticket to update ACKnowledge: https://github.com/WormBase/ACKnowledge/issues/315 | ||
+ | [[User:Kyook|Kyook]] ([[User talk:Kyook|talk]]) 17:34, 24 October 2024 (UTC) | ||
+ | |||
===In/Is/Si transgenes=== | ===In/Is/Si transgenes=== | ||
Arun and Wen have automated the identification of papers that contain transgenes by using Textpresso to scan the C. elegans corpus of papers for the regular expressions (1-3 capital letter)Is, In, or Si (1-4 digits). This script will miss any transgenes that do not have a standard name using fitting this regex. | Arun and Wen have automated the identification of papers that contain transgenes by using Textpresso to scan the C. elegans corpus of papers for the regular expressions (1-3 capital letter)Is, In, or Si (1-4 digits). This script will miss any transgenes that do not have a standard name using fitting this regex. | ||
Line 97: | Line 104: | ||
All bad transgene objects are now entered into the trp tables and appended with "FAIL". Transgenes that match a pre-existing Transgene public_name-paper association marked as FAIL are ignored during trp table population. | All bad transgene objects are now entered into the trp tables and appended with "FAIL". Transgenes that match a pre-existing Transgene public_name-paper association marked as FAIL are ignored during trp table population. | ||
+ | |||
+ | postgres query to find all PGIDs with FAIL in trp_objpap_falsepos | ||
+ | SELECT * FROM trp_objpap_falsepos WHERE joinkey IN ( | ||
+ | SELECT joinkey FROM trp_publicname WHERE trp_publicname IN ( | ||
+ | SELECT trp_publicname AS count FROM trp_publicname GROUP BY trp_publicname HAVING COUNT(*) > 1)); | ||
==Curation== | ==Curation== | ||
Line 112: | Line 124: | ||
3. enters paper into trp_paper | 3. enters paper into trp_paper | ||
4. enters Arun into trp_curator | 4. enters Arun into trp_curator | ||
+ | |||
+ | * Dealing with Duplicates | ||
+ | **Postgres query to find all transgenes with the same public_names and that are not erroneous/typos by authors (those with trp_objpap_falsepos value FAIL) | ||
+ | |||
+ | SELECT * FROM trp_publicname WHERE | ||
+ | trp_publicname IN ( | ||
+ | SELECT trp_publicname AS count FROM trp_publicname GROUP BY trp_publicname HAVING COUNT(*) > 1) | ||
+ | AND trp_publicname NOT IN ( | ||
+ | SELECT trp_publicname FROM trp_publicname WHERE joinkey IN ( | ||
+ | SELECT joinkey FROM trp_objpap_falsepos WHERE joinkey IN ( | ||
+ | SELECT joinkey FROM trp_publicname WHERE trp_publicname IN ( | ||
+ | SELECT trp_publicname AS count FROM trp_publicname GROUP BY trp_publicname HAVING COUNT(*) > 1)))) | ||
+ | ORDER BY trp_publicname; | ||
[[All_OA_tables#trp_tables_Transgene | trp tables]] | [[All_OA_tables#trp_tables_Transgene | trp tables]] | ||
Line 152: | Line 177: | ||
| |Promoter||myo-3||Driven by gene||cns_drivenbygene||ontology||gin_table||as in uP form||Yes- there needs to be a value in at least one of the three fields, promoter, gene or reporter- that is all of these fields can't be simultaneously blank|||| | | |Promoter||myo-3||Driven by gene||cns_drivenbygene||ontology||gin_table||as in uP form||Yes- there needs to be a value in at least one of the three fields, promoter, gene or reporter- that is all of these fields can't be simultaneously blank|||| | ||
|- | |- | ||
− | | |Expressed gene||RIN-1||Gene||cns_gene||ontology||gin_table||as in uP form||Yes- there needs to be a value in at least one of the three fields, promoter, gene or reporter- that is all of these fields can't be simultaneously blank|||| | + | | |Expressed gene||RIN-1||Gene||cns_gene||ontology||gin_table||as in uP form||Yes- there needs to be a value in at least one of the three fields, promoter, gene or reporter- that is all of these fields can't be simultaneously blank||||"Only enter genes found in WormBase" |
+ | |- | ||
+ | | | Non-WormBase gene||human alpha-synuclein||Other reporter||cns_otherreporter||text||-||No||-||"Enter any genes in your construct from species other than those represented in WormBase" | ||
|- | |- | ||
| |Reporter||GFP||Reporter||cns_reporter||ontology||cns_reporter||-||Yes- there needs to be a value in at least one of the three fields, promoter, gene or reporter- that is all of these fields can't be simultaneously blank||if more than one value is entered (comma separated) enter all of them into the cns_reporter table||like in uP form: cns_reporter list, add "If you need to enter more than one reporter, please separate them with commas." | | |Reporter||GFP||Reporter||cns_reporter||ontology||cns_reporter||-||Yes- there needs to be a value in at least one of the three fields, promoter, gene or reporter- that is all of these fields can't be simultaneously blank||if more than one value is entered (comma separated) enter all of them into the cns_reporter table||like in uP form: cns_reporter list, add "If you need to enter more than one reporter, please separate them with commas." | ||
+ | |- | ||
+ | | | Non-WormBase reporter||-||Other reporter||cns_otherreporter||text||-||No||-||"Enter any reporter in your construct that isn't included on the reporter list above" | ||
|- | |- | ||
| |3'UTR||pie-1||Construct-3UTR||cns_threeutr||ontology||gin_tables||as in uP form||no|||| | | |3'UTR||pie-1||Construct-3UTR||cns_threeutr||ontology||gin_tables||as in uP form||no|||| | ||
Line 179: | Line 208: | ||
==Transgene .ace dumper== | ==Transgene .ace dumper== | ||
− | The transgene.ace dumper was written by Juancarlos and Wen to translate the transgene postgres data into .ace format for uploading into AcEDB. | + | Before running the dumper, clean up the tables of duplicates. Find all duplicate public names for transgenes with the postgres query |
− | The script is located on tazendra | + | <pre> SELECT trp_publicname, COUNT(*) AS count FROM trp_publicname GROUP BY trp_publicname HAVING COUNT(*) > 1;</pre> |
− | The output file is dumped into the same directory. | + | |
+ | Also consider resolving objects with the same synonym | ||
+ | <pre> SELECT trp_synonym, COUNT(*) AS count FROM trp_synonym GROUP BY trp_synonym HAVING COUNT(*) > 1; </pre> | ||
+ | |||
+ | |||
+ | The ./use_package.pl transgene.ace dumper was written by Juancarlos and Wen to translate the transgene trp postgres data into .ace format for uploading into AcEDB. | ||
+ | The script is located on tazendra | ||
+ | The output file transgene.ace is dumped weekly on Sundays into the same directory. | ||
+ | Three files are output : transgene.ace.<date>, transgene.ace –overwriting the file from the last time the script was run, and an err.out file. | ||
+ | The err.out file currently contains transgenes with | ||
+ | * invalid WBPaperIDs - these are mainly due to paper IDs that were merged into other papers. The transgenes in postgres have already been attributed with the current paper ID however, they still retain the invalid paper ID, so that should be removed when there is time. These transgenes are still dumped with the correct IDs. | ||
+ | * public names but are not attributed with an array status of Extrachromosomal or Integrated. In these cases the issue is mainly a poorly formed name, which needs fixing. | ||
Test the .ace in CitaceMinus<br> | Test the .ace in CitaceMinus<br> | ||
Line 246: | Line 286: | ||
==Cross curation with other curators== | ==Cross curation with other curators== | ||
− | Transgenes are used by other datatypes; Expr_pattern, Phenotype, Gene_regulation, Interaction. All curators that use transgenes will need to run a script before each upload to make sure all their transgenes are valid objects. | + | Transgenes are used by other datatypes; Expr_pattern, Phenotype, Gene_regulation, Interaction. <br> |
+ | '''All curators that use transgenes will need to run a script before each upload to make sure all their transgenes are valid objects.''' | ||
<br> | <br> | ||
Line 263: | Line 304: | ||
The interaction curator can create new transgenes using the Transgene OA or request them from the transgene curator. | The interaction curator can create new transgenes using the Transgene OA or request them from the transgene curator. | ||
− | ===Strain=== | + | ===CGC Transgenes from Strain via Hinxton=== |
− | ==== | + | ====Processing the periodically released transgene_report==== |
− | + | Strain data from the CGC has relevant remarks about the transgene in the strain. Often, there are discrepancies between how an author describes a transgene's genotype and how a CGC strain submitter enters the transgene details. | |
− | + | ||
− | Strain data from the CGC has relevant remarks about the transgene in the strain. Often, there are discrepancies between how an author describes a transgene's genotype and how a CGC strain submitter enters the transgene details. | + | Quarterly, Mary Ann processes a strain report from the CGC and creates a file '''transgene_report.<date>.txt''' that maps transgenes mentioned in the CGC file to WBTransgene IDs, then sends the file to the WB transgene curator. <br> |
− | + | ||
+ | '''/home/postgres/work/pgpopulation/transgene/20131204_maryann_transgene/populate_maryann_transgene.pl*''' <br> | ||
+ | ''NOTE: This script was written before the construct model took over many of the tags in the transgene model''<br> | ||
+ | 5/9/17 Changes to the script: | ||
+ | *comment out any part of the script that refers to these missing tables: trp_gene and trp_driven_by_gene | ||
+ | |||
+ | The script only populates the trp tables and does not create a corresponding construct ID; for any new transgene, the construct has to be created manually and its details curated. | ||
+ | |||
+ | The transgene_report comes formatted like this: | ||
+ | <pre> | ||
+ | {\rtf1\ansi\ansicpg1252\cocoartf1265\cocoasubrtf210 | ||
+ | {\fonttbl\f0\fswiss\fcharset0 Helvetica;} | ||
+ | {\colortbl;\red255\green255\blue255;} | ||
+ | \paperw11900\paperh16840\margl1440\margr1440\vieww10800\viewh8400\viewkind0 | ||
+ | \pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural | ||
+ | |||
+ | \f0\fs24 \cf0 Transgene : shEx34\ | ||
+ | Strain : ABR14\ | ||
+ | Description : shEx34 [myo-3p::mCherry]. Pick mCherry+ to maintain. This strain serves as a control strain to ABR16. Reference: Han S, et al. Nature 2017 doi: 10.1038/nature21686.\ | ||
+ | WBID : NONE\ | ||
+ | \ | ||
+ | Transgene : shEx1\ | ||
+ | Strain : ABR16\ | ||
+ | Description : shEx1 [ges-1p::fat-7 + myo-3p::mCherry]. Pick mCherry+ to maintain. FAT-7 over-expressing strain. ABR14 serves as a control strain for this strain. Reference: Han S, et al. Nature 2017 doi: 10.1038/nature21686.\ | ||
+ | WBID : NONE\ | ||
+ | </pre> | ||
+ | |||
+ | The input file for the script needs to look like this: | ||
<pre> | <pre> | ||
Transgene : staIs1 | Transgene : staIs1 | ||
Line 277: | Line 345: | ||
WBID : WBTransgene00017210 | WBID : WBTransgene00017210 | ||
</pre> | </pre> | ||
+ | |||
NOTE: extraneous characters such as "\" need to be removed before running the script.<br> | NOTE: extraneous characters such as "\" need to be removed before running the script.<br> | ||
The file needs to be placed on tazendara (note name change) postgres/work/pgpopulation/transgene/'''transgene_report.txt'''<br> | The file needs to be placed on tazendara (note name change) postgres/work/pgpopulation/transgene/'''transgene_report.txt'''<br> | ||
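The cleanup step described above (dropping the RTF wrapper and the trailing "\" characters) could be sketched roughly as follows. This is a minimal Python sketch, not part of populate_maryann_transgene.pl; the function name `clean_report` is hypothetical, and it assumes every record line carries one of the four labels shown in the examples (Transgene, Strain, Description, WBID). RTF escape sequences inside a description are not handled.

```python
import re

# Assumption: each useful line contains one of these four labels followed by " : ".
FIELD_RE = re.compile(r"(Transgene|Strain|Description|WBID) : .*")


def clean_report(rtf_text):
    """Keep only the labelled record lines, without RTF prefixes or trailing backslashes."""
    kept = []
    for raw in rtf_text.splitlines():
        line = raw.rstrip()
        if line.endswith("\\"):
            line = line[:-1].rstrip()  # drop the RTF line-break marker
        match = FIELD_RE.search(line)
        if match:
            kept.append(match.group(0))  # text from the label onward
    return "\n".join(kept)
```

The cleaned text would still need to be checked by eye against the expected input format before the script is run.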
Line 282: | Line 351: | ||
The populate_maryann_transgene.pl script<br> | The populate_maryann_transgene.pl script<br> | ||
− | 1. Enters transgenes from CGC file that are missing from WB. <br> | + | 1. Enters transgenes from the CGC file that are missing from WB. <br> |
Start with transgene name, map to WBTransgene ID from trp_name/trp_publicname. | Start with transgene name, map to WBTransgene ID from trp_name/trp_publicname. | ||
− | *If no transgene match, add trp_name (generate trp_name (WBTransgeneID)if it didn't exist already) trp_publicname, trp_strain, trp_summary. | + | *If no transgene match, add trp_name (generate trp_name (WBTransgeneID) if it didn't exist already) trp_publicname, trp_strain, trp_summary. |
*If there is a transgene match only add/change whatever is new/different <br> | *If there is a transgene match only add/change whatever is new/different <br> |
Latest revision as of 17:40, 24 October 2024
Transgene objects
We curate all integrated (Is, In, Si) and extrachromosomal (Ex) transgenes. Note: Ti objects are considered transposons and are covered under variations, and In is sometimes ambiguous with Inversions, which should be in Rearrangements.
For each transgene, we extract the following information:
- name of the transgene*
- genomic expression summary
- construct <as of WS245>
- co-injection markers (as noted as part of the genomic expression summary)
- method of integration, and to what LG it was mapped, if known
- which papers use the transgene as part of an experiment
- other names for the transgene reported by different authors
{*In cases where a transgene has not been assigned a name, or one that does not adhere to standard nomenclature, we assign a name based on the WBPaperID.}
The construct contains the specific information for the array elements
- plasmid name
- promoter(s)
- reporters (GFP, YFP, LacZ, etc.),
- gene product(s)
- etc.
Kyook (talk) 17:28, 24 October 2024 (UTC)
Model Changes
WS251
Added tag for associating transgenes with disease
Transgene_for_disease ?DO_term XREF Associated_transgene #Evidence
WS245
?Transgene Evidence #Evidence Public_name UNIQUE ?Text Summary UNIQUE ?Text Synonym ?Text Corresponding_variation UNIQUE ?Variation XREF Corresponding_transgene //put in to unambiguously associate the allele/transgene Construction Construct ?Construct XREF Transgene_construct Coinjection ?Construct XREF Transgene_coinjection Coinjection_other ?Text //for coinjection markers that are not specified as a construct Integration_method UNIQUE ?Text Integrated_from ?Transgene XREF Transgene_derivation Laboratory ?Laboratory #Lab_Location Author ?Author Construction_summary ?Text Genetic_information Extrachromosomal Integrated Map ?Map #Map_position //needed for transgenes with no granular mapping, e.g., just mapped to a LG Mapping_data 2_point ?2_point_data //deleted for WS245, rolled back for WS246 Map_evidence #Evidence Phenotype ?Phenotype XREF Transgene #Phenotype_info Phenotype_not_observed ?Phenotype XREF Not_in_Transgene #Phenotype_info Used_for Transgene_derivation ?Transgene XREF Integrated_from Expr_pattern ?Expr_pattern XREF Transgene Marker_for ?Text #Evidence Interactor ?Interaction Associated_with Marked_rearrangement ?Rearrangement XREF By_transgene Strain ?Strain XREF Transgene Reference ?Paper XREF Transgene Species UNIQUE ?Species Remark ?Text #Evidence Historical_gene ?Gene #Evidence
--Kyook (talk) 22:25, 29 December 2014 (UTC)
Transgene first pass:
NOTE: Updating scripts for the dockerized textpresso-curation machine - Valerio will generate a script to find all In/Is/Si|Ex transgenes, append WBPaperIDs to the matched transgenes in the Postgres tables, then enter unmatched transgenes on their own line (new pgid, new trp_name WBTransgeneID), filling in trp_publicname, trp_paper, and trp_curator "WBPerson4793" (Arun)
From Valerio: This is the new repo: https://github.com/WormBase/entity-extraction-transgene. This is the ticket for the work to do: https://github.com/WormBase/entity-extraction-transgene/issues/1 and this is the ticket to update ACKnowledge: https://github.com/WormBase/ACKnowledge/issues/315
Kyook (talk) 17:34, 24 October 2024 (UTC)
In/Is/Si transgenes
Arun and Wen have automated the identification of papers that contain transgenes by using Textpresso to scan the C. elegans corpus of papers for names matching the pattern (1-3 letters)Is, In, or Si(1-4 digits). This script will miss any transgene whose name does not fit this pattern.
Transgene names are extracted during a cron job and stored in
http://textpresso-dev.caltech.edu/transgene/transgenes_in_regular_papers.out
"...paper.sup.1" means the transgene name was mentioned in the supplementary file.
Transgene-paper information is picked up by a cron job every Monday at 4am and entered into the trp tables with curator "Arun". Script is found here:
/home/postgres/work/pgpopulation/textpresso/transgene/update_textpresso_transgene.pl
Only the 'Is' and 'Si' lines are entered into postgres. For 'In' lines, only paper connections are attached to pre-existing objects. 'In' is not an accepted nomenclature for transgenes; however, it was used in the past, so there are a handful of valid 'In' transgenes that should be curated consistently for papers that use or report them.
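A minimal sketch of this kind of name matching, written in Python rather than taken from the Textpresso scripts. The pattern is an assumption built from the description above and from names that appear on this page (syIs13, oxSi19, juIs32); `find_transgenes` is a hypothetical helper.

```python
import re

# Assumed pattern: 1-3 lowercase prefix letters, then Is, In or Si, then 1-4 digits.
TRANSGENE_RE = re.compile(r"\b[a-z]{1,3}(?:Is|In|Si)[0-9]{1,4}\b")


def find_transgenes(text):
    """Return the distinct standard-looking integrated transgene names in text, sorted."""
    return sorted(set(TRANSGENE_RE.findall(text)))


sample = "Strains carrying syIs13, oxSi19 and shEx34 were crossed to juIs32."
print(find_transgenes(sample))  # Ex arrays such as shEx34 are not matched by this pattern
```

A name that does not follow the standard form is silently skipped, which is the limitation noted above.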
Ex transgenes
Ex transgenes are curated in association with expression patterns, overexpression gene function phenotype, genetic interaction and gene regulation experiments. Although a script was employed to find all Ex transgenes along with genomic summaries in the C. elegans corpus, these have not been consistently curated as a rule. When an extrachromosomal array with a standardized name has been used, it should be curated like all integrated arrays.
This output file can be viewed here:
http://textpresso-dev.caltech.edu/transgene/transgenes_summary.out
Dealing with false positive transgenes
False positive transgenes include
- "In" lines that are chromosomal inversions rather than transgenes.
- typos in transgene names
- formatting-induced textpresso extraction errors of transgene names, such as when transgenes are referred to as syIs13-19 in the paper, in which case Textpresso reports it as syIs1319 (i.e., the hyphen disappears sometimes during pdf2Text conversion).
Initially, these transgenes were added to:
/home/acedb/wen/phenote_transgene/ObsoleteTg.txt
which was called by update_textpresso_transgene.pl to identify objects that shouldn't be entered into the trp tables.
All bad transgene objects are now entered into the trp tables and appended with "FAIL". Transgenes that match a pre-existing Transgene public_name-paper association marked as FAIL are ignored during trp table population.
postgres query to find all PGIDs with FAIL in trp_objpap_falsepos
SELECT * FROM trp_objpap_falsepos WHERE joinkey IN (
  SELECT joinkey FROM trp_publicname WHERE trp_publicname IN (
    SELECT trp_publicname AS count FROM trp_publicname GROUP BY trp_publicname HAVING COUNT(*) > 1));
Curation
- The Textpresso transgene scripts
textpresso-dev.cacr.caltech.edu:/data1/Users/arunr/Curator_related/transgene/transgenes_summary.pl
textpresso-dev.cacr.caltech.edu:/data1/Users/arunr/Curator_related/transgene/02transgenes.pl
identifies transgenes based on regex /\b([a-z]{1,2}(Is|In)[0-9]+([a-z]{1}?))\b/
extracts the transgene and deposits transgene and paper into http://textpresso-dev.caltech.edu/transgene/transgenes_in_regular_papers.out
- postgres script
- matches transgene names to trp_publicname and attaches paper to associated trp_paper
- if no match then
1. creates new pgid and new trp_name
2. enters transgene name from file to trp_publicname
3. enters paper into trp_paper
4. enters Arun into trp_curator
- Dealing with Duplicates
- Postgres query to find all transgenes that share a public name, excluding erroneous names and author typos (those with a trp_objpap_falsepos value of FAIL)
SELECT * FROM trp_publicname WHERE
trp_publicname IN (
  SELECT trp_publicname AS count FROM trp_publicname GROUP BY trp_publicname HAVING COUNT(*) > 1)
AND trp_publicname NOT IN (
  SELECT trp_publicname FROM trp_publicname WHERE joinkey IN (
    SELECT joinkey FROM trp_objpap_falsepos WHERE joinkey IN (
      SELECT joinkey FROM trp_publicname WHERE trp_publicname IN (
        SELECT trp_publicname AS count FROM trp_publicname GROUP BY trp_publicname HAVING COUNT(*) > 1))))
ORDER BY trp_publicname;
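The population steps listed under "postgres script" in this section, together with the FAIL rule for false positives, can be sketched in Python as below. This is an illustration only, not the logic of update_textpresso_transgene.pl: the in-memory `tables` dict stands in for the trp tables, `populate` is a hypothetical function, the WBTransgene ID scheme is a placeholder, and the special handling of 'In' names is left out.

```python
def populate(pairs, tables, failed_pairs):
    """Apply Textpresso (name, paper) pairs to an in-memory stand-in for the trp tables.

    tables: dict mapping pgid -> row dict with trp_name, trp_publicname,
            trp_paper (a set) and trp_curator.
    failed_pairs: set of (name, paper) pairs already marked FAIL; these are skipped.
    """
    by_name = {row["trp_publicname"]: pgid for pgid, row in tables.items()}
    next_pgid = max(tables, default=0) + 1
    for name, paper in pairs:
        if (name, paper) in failed_pairs:
            continue  # known false positive for this paper
        pgid = by_name.get(name)
        if pgid is None:
            # No match: create a new row for the unmatched name.
            pgid = next_pgid
            next_pgid += 1
            tables[pgid] = {
                "trp_name": f"WBTransgene{pgid:08d}",  # placeholder ID, an assumption
                "trp_publicname": name,
                "trp_paper": set(),
                "trp_curator": "WBPerson4793",  # Arun
            }
            by_name[name] = pgid
        tables[pgid]["trp_paper"].add(paper)  # attach the paper in both cases
    return tables
```

Keeping the FAIL check first means a rejected name-paper pair never creates a new row, which matches the rule that such pairs are ignored during population.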
Community submission form
Form label | Grayed out example text | OA label | OA pgtable | field/ontology_type | ontology_table | Term Info | Required field | Further actions | help instructions |
Your name | Andrew Chisholm | Person | trp_person, cns_person | ontology | WBPerson ontology | as in uP form | YES and this form should be restricted to people who have a true WBPersonID | Should trigger the entry of Community Curator in trp_curator and cns_curator | "If you do not have a WBPerson ID please contact WormBase to have one assigned." |
Your e-mail address | chisholm@ucsd.edu | E-mail address | trp_email, cns_email | smalltext | - | - | yes | tables need to be created | |
PubMed ID | e.g.,4366476 Please enter only one ID | Paper | ? | ? | ? | as in PubMed ID in phenotype.cgi | either this OR uP needs to be filled | should be mapped to WBPaperID and entered into trp_paper and cns_paper | as for PubMed ID in phenotype.cgi |
For uPublication? | Micropublication | trp_micropublication | toggle | - | - | either this OR Paper needs to be filled | should trigger uP specific response upon submission that highlights transgene name to use for uP form and returns user to uP form when transgene is submitted | ||
Transgene name | juIs32 or pwEx135, or oxSi19 | Public name | trp_publicname | smalltext | - | - | yes | create pgid and assign trp_name | "Please assign a name to your transgene using your lab’s allele designation, and Ex, Is or Si for extrachromosomal, integrated, and single site insertions, respectively i.e., otEx103, utIs18, oxSi48" |
Genotype Summary | [Ppie-1::GFP::RAB-7; unc-119(+)] | Summary | trp_summary | smalltext | - | - | yes | when pulled into table please put it in square brackets if they aren't already there | |
Strain | RT733 | Strain | trp_strain | text | - | - | no | ||
Coinjected with | Cbr-unc-119(+), pRF4 | Coinjection | trp_coinjectionconstruct | text | - | - | no | ||
Integrated by | Particle bombardment | Integrated by | trp_integration_method | ontology | trp_integration_method | - | yes | as in uP form: cns_integrationmethod | |
Laboratory | CZ | Laboratory | trp_laboratory | ontology | WB laboratory | as in uP form | yes | "Start typing the PI name and select an entry from the list. Click here to request a lab designation." | |
Construct name | pCZ255 | Public name | cns_publicname, trp_construct | text | - | - | yes | 1. create pgid for construct and assign WBConstructID cns_name; 2. trigger population of trp_construct with new cns_name value 3. trp_construct is a multiontology field, when more than one construct is entered in this form, each construct needs to be added to trp_construct | |
Promoter | myo-3 | Driven by gene | cns_drivenbygene | ontology | gin_table | as in uP form | Yes- there needs to be a value in at least one of the three fields, promoter, gene or reporter- that is all of these fields can't be simultaneously blank | ||
Expressed gene | RIN-1 | Gene | cns_gene | ontology | gin_table | as in uP form | Yes- there needs to be a value in at least one of the three fields, promoter, gene or reporter- that is all of these fields can't be simultaneously blank | "Only enter genes found in WormBase" | |
Non-WormBase gene | human alpha-synuclein | Other reporter | cns_otherreporter | text | - | No | - | "Enter any genes in your construct from species other than those represented in WormBase" | |
Reporter | GFP | Reporter | cns_reporter | ontology | cns_reporter | - | Yes- there needs to be a value in at least one of the three fields, promoter, gene or reporter- that is all of these fields can't be simultaneously blank | if more than one value is entered (comma separated) enter all of them into the cns_reporter table | like in uP form: cns_reporter list, add "If you need to enter more than one reporter, please separate them with commas." |
Non-WormBase reporter | - | Other reporter | cns_otherreporter | text | - | No | - | "Enter any reporter in your construct that isn't included on the reporter list above" | |
3'UTR | pie-1 | Construct-3UTR | cns_threeutr | ontology | gin_tables | as in uP form | no | ||
DNA sequence | Sequence(s) of relevant amplified region(s) | Construct-DNA | cns_dna | multi-bigtext allow comma separation | - | - | Yes | - | "Enter the DNA sequence used to drive reporter expression -excluding backbone vector and reporter itself. If you used a translational fusion, please add all pertinent DNA sequence in the box. If you want to enter 2 non-contiguous DNA sequences please enter them both in the DNA sequence box, comma separate the two sequences. If you have primers you can use the WormBase e-PCR tool located here." Note: the e-PCR tool link is http://www.wormbase.org/tools/epcr |
Fusion Type | Translational fusion | Construct type | cns_constructtype | ontology | cns_constructtype | - | yes | as in uP form: cns_fusiontype list | |
Backbone Vector | pPD89.03 | Clone | cns_clone | ontology | clone tables | as in construct OA clone field: id: D1001 name: "D1001"remark: "C. elegans genomic clone not used in the final reference genome, submitted by The C. elegans Genome Sequencing Consortium."type: "Cosmid" | no | ||
Injection Concentration | - | Construction Details | cns_constructionsummary | text, will be put after whatever text there is adding pipe | - | - | yes | ||
Construction Details | Construction summary | cns_constructionsummary | bigtext | - | - | no | "Example: [pkd-2::GFP] translational fusion. The pkd-2-GFP plasmid was made using plasmid pPD95.75 as parent vector, and a fusion of a long range PCR fragment of genomic pkd-2 (promoter and 5’-end) with a 3’-end fragment derived from yk219e1 to produce a 7.153-kb fusion containing the full-length pkd-2 gene." | ||
Construct Comments | - | Remark | cns_remark | bigtext | - | no | |||
Submit button |
mappings from construct tables to trp tables
Curation tools
Transgene .ace dumper
Before running the dumper, clean up the tables of duplicates. Find all duplicate public names for transgenes with the postgres query
SELECT trp_publicname, COUNT(*) AS count FROM trp_publicname GROUP BY trp_publicname HAVING COUNT(*) > 1;
Also consider resolving objects with the same synonym
SELECT trp_synonym, COUNT(*) AS count FROM trp_synonym GROUP BY trp_synonym HAVING COUNT(*) > 1;
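To see what the duplicate check returns before running it on the real database, the public-name query can be tried on a throwaway in-memory SQLite table. The sketch below is Python with the standard sqlite3 module; the two-column table layout is assumed from the column names used in the query, and the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trp_publicname (joinkey TEXT, trp_publicname TEXT)")
conn.executemany(
    "INSERT INTO trp_publicname VALUES (?, ?)",
    [("1", "syIs13"), ("2", "syIs13"), ("3", "oxSi19")],  # invented sample rows
)

# Same query as above: public names that occur on more than one row.
duplicates = conn.execute(
    "SELECT trp_publicname, COUNT(*) AS count FROM trp_publicname "
    "GROUP BY trp_publicname HAVING COUNT(*) > 1"
).fetchall()

# Follow-up: list the pgids (joinkeys) that carry a duplicated name.
rows_to_resolve = conn.execute(
    "SELECT joinkey, trp_publicname FROM trp_publicname "
    "WHERE trp_publicname IN ("
    "  SELECT trp_publicname FROM trp_publicname "
    "  GROUP BY trp_publicname HAVING COUNT(*) > 1) "
    "ORDER BY joinkey"
).fetchall()

print(duplicates)
print(rows_to_resolve)
```

The second query is an addition for convenience: it shows which rows need to be merged or renamed, not just which names are duplicated.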
The ./use_package.pl transgene.ace dumper was written by Juancarlos and Wen to translate the transgene trp postgres data into .ace format for uploading into AcEDB.
The script is located on tazendra
The output file transgene.ace is dumped weekly on Sundays into the same directory.
Three files are output: transgene.ace.<date>; transgene.ace, which overwrites the file from the last time the script was run; and an err.out file.
The err.out file currently contains transgenes with
- invalid WBPaperIDs. These are mainly due to paper IDs that were merged into other papers. The transgenes in postgres have already been attributed with the current paper ID; however, they still retain the invalid paper ID, which should be removed when there is time. These transgenes are still dumped with the correct IDs.
- public names but no array status of Extrachromosomal or Integrated. In these cases the issue is mainly a poorly formed name, which needs fixing.
Test the .ace in CitaceMinus
- make sure the file reads in fine.
- look at all the transgene objects to make sure there are no strange looking ones.
- do a count of objects before and after the read-in to make sure the number of new objects is reasonable.
A cron job set up by Juancarlos and Wen runs the Transgene .ace dumper script on Thursday mornings at 6am and deposits it on citace at 8am. If there has been any new data or changes in data between testing the file and the Thursday morning dump, make sure to rerun the script and transfer the new data dump to citace for upload.
Requested changes
Changes recorded on https://bitbucket.org/kyook/ky_wbprojects/wiki/transgene_dump_ace.pl
--kjy 23:26, 20 June 2012 (UTC)
- change dump cron job to Wed morning, spica still calls it at 8am on Thursday
- fix source of allele codes for obo_laboratory, URL for the lab-allele designations
http://www.cbs.umn.edu/cgc/lab-allele Juancarlos will parse out the page and create a local copy. A cron job will compare the page to the local copy, if the page is altered in format the table will revert to the local copy.
Handling Dead Genes During Dump Process
The dumper script will now (as of May, 2013) run an automatic check for dead genes in any gene field (for transgenes this applies to "Driven_by_gene", "Gene" and "3_UTR" fields/tags). Any genes that are considered dead that are referenced in a Transgene object in the OA will be handled in the following manner:
1) If there is a replacement for the gene (i.e. the gene has merged into another gene), the dead gene will be dumped into a "Historical_gene" field in the .ACE file, the replacement gene will fill the original gene field. A comment will be added to the Historical_gene field via the #Evidence hash. The original gene field (now with the updated gene reference) will be printed with an "Inferred_automatically" tag after the gene. So, for example, if WBGene00001234 is now a dead gene that has been merged into WBGene00002345:
Gene "WBGene00001234"
becomes
Gene "WBGene00002345" Inferred_automatically
Historical_gene "WBGene00001234" Remark "Note: This object originally referred to WBGene00001234. WBGene00001234 is now considered dead and has been merged into WBGene00002345. WBGene00002345 has replaced WBGene00001234 accordingly."
2) If there is no replacement for the gene (Dead or Suppressed), we would dump the following:
Historical_gene "WBGene00001234" Remark "Note: This object originally referred to a gene (WBGene00001234) that is now considered dead. Please interpret with discretion."
OR
Historical_gene "WBGene00001234" Remark "Note: This object originally referred to a gene (WBGene00001234) that has been suppressed. Please interpret with discretion."
and lastly,
3) If the gene has undergone a split, the transgene will still get dumped, but such genes will be printed out in the error output file of the dumping script for a curator to go back and manually change according to best judgement.
Gene Examples:
A split gene: WBGene00012507
A merged gene: WBGene00007524
A dead gene: WBGene00007814
A suppressed gene: WBGene00015490
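The three cases above can be summarized in a short Python sketch. It is not the dumper's code: `dump_gene` and its `status` values are hypothetical, the gene status is assumed to have been looked up already, and emitting the gene line unchanged for a split gene is an assumption, since the text only says the object is still dumped and the gene is reported in the error file.

```python
def dump_gene(tag, gene, status, replacement=None):
    """Return (ace_lines, error_messages) for one gene reference.

    tag is the .ace tag ("Gene", "Driven_by_gene" or "3_UTR");
    status is one of "live", "merged", "dead", "suppressed", "split".
    """
    if status == "live":
        return [f'{tag} "{gene}"'], []
    if status == "merged":
        remark = (
            f"Note: This object originally referred to {gene}. "
            f"{gene} is now considered dead and has been merged into {replacement}. "
            f"{replacement} has replaced {gene} accordingly."
        )
        return [
            f'{tag} "{replacement}" Inferred_automatically',
            f'Historical_gene "{gene}" Remark "{remark}"',
        ], []
    if status in ("dead", "suppressed"):
        reason = "is now considered dead" if status == "dead" else "has been suppressed"
        remark = (
            f"Note: This object originally referred to a gene ({gene}) that {reason}. "
            "Please interpret with discretion."
        )
        return [f'Historical_gene "{gene}" Remark "{remark}"'], []
    if status == "split":
        # Assumption: the reference is dumped as-is and flagged for manual review.
        return [f'{tag} "{gene}"'], [f"{gene} has been split; review manually"]
    raise ValueError(f"unknown gene status: {status}")
```

Returning the error messages separately mirrors the split between the .ace file and the error output file described above.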
Cross curation with other curators
Transgenes are used by other datatypes: Expr_pattern, Phenotype, Gene_regulation, Interaction.
All curators that use transgenes will need to run a script before each upload to make sure all their transgenes are valid objects.
Gene_regulation
The gene regulation curator can create new transgenes using the Transgene OA or request them from the transgene curator.
Expr_pattern
The expression pattern curator requires many transgene objects to be created on the fly. Rather than impeding the curation flow, when the expression pattern curator needs a transgene that has not been created already, they enter the relevant information for the transgene in their Reporter gene text box. A manually run script checks for lines that have text in the reporter gene field but no value in the transgene field. For each such line, the script creates a new object in the transgene OA with the corresponding paper ID, the expression curator as curator, and the remark field populated with the reporter gene text. A synonym is created from the expression pattern value with an appended _Ex. This temporary name is deposited in the synonym field of the transgene OA for the newly created object. See this wiki page for more information: http://wiki.wormbase.org/index.php/Expression_Pattern#Exporting_Reporter_Gene_description_from_Expr_pattern_OA_to_Transgene_OA
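As a rough illustration of what that script does, here is a minimal Python sketch. It works on plain dictionaries rather than the real OA tables, and the key names (expr_pattern, paper, reporter_gene, transgene) are hypothetical stand-ins for the actual fields.

```python
# Hypothetical sketch; dictionary keys are assumptions, not real table names.
def new_transgenes_from_expr_rows(rows, curator):
    """Build transgene records for rows with reporter text but no transgene."""
    created = []
    for row in rows:
        reporter = (row.get("reporter_gene") or "").strip()
        transgene = (row.get("transgene") or "").strip()
        if reporter and not transgene:
            created.append(
                {
                    "paper": row["paper"],
                    "curator": curator,
                    # The reporter gene text is carried over as the remark.
                    "remark": reporter,
                    # Temporary name: expression pattern value plus "_Ex".
                    "synonym": f'{row["expr_pattern"]}_Ex',
                }
            )
    return created
```

Rows that already have a transgene, or that have no reporter gene text, are left alone.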
The transgene curator needs to
- verify that the object created by the expression pattern curator is not a duplicate transgene. If it is a duplicate, the transgene curator will merge the transgene into the preexisting one; this makes the new transgene invalid, and its information will not be dumped. The new transgene ID and other synonyms will be pipe-appended to the synonym list of the pre-existing object
- assign a public name if it exists or if needed
- fill in all other relevant information
Interaction
The interaction curator can create new transgenes using the Transgene OA or request them from the transgene curator.
CGC Transgenes from Strain via Hinxton
Processing the periodically released transgene_report
Strain data from the CGC has relevant remarks about the transgene in the strain. Often, there are discrepancies between how an author describes a transgene's genotype and how a CGC strain submitter enters the transgene details.
Quarterly, Mary Ann processes a strain report from the CGC and creates a file transgene_report.<date>.txt that maps transgenes mentioned in the CGC file to WBTransgene IDs, then sends the file to the WB transgene curator.
/home/postgres/work/pgpopulation/transgene/20131204_maryann_transgene/populate_maryann_transgene.pl*
NOTE: This script was written before the construct model took over many of the tags in the transgene model.
5/9/17 Changes to the script:
- comment out any part of the script that refers to these missing tables: trp_gene and trp_driven_by_gene
The script only populates trp tables and does not create any corresponding construct ID. For any new transgene, the construct has to be created manually and its details curated.
The transgene_report comes formatted like this:
{\rtf1\ansi\ansicpg1252\cocoartf1265\cocoasubrtf210
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
\paperw11900\paperh16840\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural
\f0\fs24 \cf0 Transgene : shEx34\
Strain : ABR14\
Description : shEx34 [myo-3p::mCherry]. Pick mCherry+ to maintain. This strain serves as a control strain to ABR16. Reference: Han S, et al. Nature 2017 doi: 10.1038/nature21686.\
WBID : NONE\
\
Transgene : shEx1\
Strain : ABR16\
Description : shEx1 [ges-1p::fat-7 + myo-3p::mCherry]. Pick mCherry+ to maintain. FAT-7 over-expressing strain. ABR14 serves as a control strain for this strain. Reference: Han S, et al. Nature 2017 doi: 10.1038/nature21686.\
WBID : NONE\
The input file for the script needs to look like this:
Transgene : staIs1
Strain : ABR5
Description : staIs1 [pie-1p::GFP + unc-119(+)]. Superficially wild-type. Maintain under normal conditions. Reference: This strain is used as the empty vector control in Greer EL et al Nature 2010 doi: 10.1038/nature09195.
WBID : WBTransgene00017210
NOTE: extraneous characters such as "\" need to be removed before running the script.
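One way to do this cleanup is sketched below in Python. It is not part of populate_maryann_transgene.pl; the function name is hypothetical, and it assumes the script accepts one field per line. It drops the RTF header lines, strips leading control words and the trailing "\" from each line, and keeps only the four field lines. RTF escapes inside a Description (for example accented characters) are not handled.

```python
import re

# Lines worth keeping start with one of the four field labels.
FIELD = re.compile(r"^(Transgene|Strain|Description|WBID) :")
# Leading RTF control words such as "\f0\fs24 \cf0 ".
CONTROL_PREFIX = re.compile(r"^(?:\\[a-zA-Z]+-?\d* ?)+")


def clean_report(rtf_text):
    """Return the field lines of a transgene_report with RTF debris removed."""
    cleaned = []
    for raw in rtf_text.splitlines():
        line = raw.strip()
        if line.endswith("\\"):
            # RTF marks a line break with a trailing backslash.
            line = line[:-1].rstrip()
        line = CONTROL_PREFIX.sub("", line).strip()
        if FIELD.match(line):
            cleaned.append(line)
    return cleaned
```

The result can be written out with "\n".join(...) to produce the input file.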
The file needs to be placed on tazendara (note the name change) at postgres/work/pgpopulation/transgene/transgene_report.txt
Run script populate_maryann_transgene.pl
The populate_maryann_transgene.pl script
1. Enters transgenes from the CGC file that are missing from WB.
Start with transgene name, map to WBTransgene ID from trp_name/trp_publicname.
- If no transgene match, add trp_name (generate trp_name (WBTransgeneID) if it didn't exist already) trp_publicname, trp_strain, trp_summary.
- If there is a transgene match only add/change whatever is new/different
- Compare and add strains to trp_strain; multiple values are pipe-separated (' | '). Output goes to out.strains.
nIs407 10031 PG CGC MT19635 trp_strain NOW MT19635
nIs408 8234 PG MT20298 CGC MT19756 | MT20298 trp_strain NOW MT19756 | MT20298
nIs425 NEW PG CGC MT19851 trp_strain NOW MT19851
Line 1: no strain existed for pgid 10031 nIs407, so the CGC strain was added to trp_strain.
Line 2: a different strain existed for pgid 8234 nIs408, so the CGC strain was merged in and pipe-separated.
Line 3: nIs425 did not exist in pg, so a line was created and the CGC strain was added to trp_strain.
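The strain merge can be pictured with the following Python sketch. The function name is hypothetical, and sorting the merged values is an assumption made only so the result is deterministic; the real script may order them differently.

```python
# Hypothetical sketch of the trp_strain merge; not the actual Perl script.
def merge_strains(pg_value, cgc_strains):
    """Merge CGC strains into an existing pipe-separated trp_strain value."""
    existing = [s.strip() for s in (pg_value or "").split("|") if s.strip()]
    # Union of old and new strains, without duplicates.
    merged = sorted(set(existing) | set(cgc_strains))
    return " | ".join(merged)
```

For a transgene with no existing strain, merge_strains(None, ["MT19635"]) gives just the CGC strain; for an existing value, the CGC strains are folded in and pipe-separated.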
2. Extract CGC transgene genotype(s), compare it to WB transgene trp_summary. trp_table changes are recorded in out.summary.
The transgene genotype should be in square brackets after the transgene name in the Description field (an optional space or the words 'contains' or 'is' may sit between the transgene name and the genotype). There are many cases where the same transgene is described differently in the CGC. The script designates the longest CGC description as the canonical one and pipe-separates the remaining genotypes in trp_synonym, together with the original trp_summary genotype if it doesn't match the chosen CGC description.
If no transgene name in the Description matches the Transgene value, the script treats the entry as having no genotype, but still a strain and a transgene. All unmatched transgenes and transgenes with no recognizable genotype are written to the file cgc_transgene_errors. If the transgene does not exist in pg, a new line with pgid, trp_publicname, trp_strain, trp_laboratory 'CGC' and trp_curator 'Mary Ann' will be created, but no genotype will be assigned.
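The genotype extraction described above could look roughly like this in Python. The regular expression is an assumption based on the description here, not the pattern used in the actual script, and it does not handle nested square brackets.

```python
import re


# Hypothetical sketch; the real script's pattern may differ.
def extract_genotype(transgene, description):
    """Return the bracketed genotype that follows the transgene name, or None."""
    pattern = re.compile(
        r"\b" + re.escape(transgene)      # the transgene name itself
        + r"\s*(?:(?:contains|is)\s*)?"   # optional space, 'contains' or 'is'
        + r"(\[[^\]]*\])"                 # the genotype in square brackets
    )
    match = pattern.search(description)
    return match.group(1) if match else None
```

When the function returns None, the entry would go to cgc_transgene_errors.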
Genotype goes to trp_summary, anything that was in summary goes to trp_synonym aggregating with what was there before and separating with ' | '. There can only be one genotype in trp_summary, and it cannot match anything in trp_synonym.
- Get the CGC genotypes and sort them; the longest one becomes the genotype.
- Put the trp_summary values into an aggregate list.
- Put the trp_synonym values into the aggregate list.
- Put the remaining CGC genotypes (those that are not the longest one) into the aggregate list.
- If the CGC genotype differs from the pg summary, update trp_summary and add the old pg summary to the aggregate list.
- If the aggregate list differs from the pg synonyms, replace trp_synonym with the aggregate list.
asIs2 72 PG [pie-1::mcherry-egg-3] PGSYN
CGC [pie-1::mcherry::egg-3] | [unc-119(+) + ppie-1::GFP::egg-1]
72 trp_summary NOW [unc-119(+) + ppie-1::GFP::egg-1]
72 trp_synonym NOW [pie-1::mcherry-egg-3] | [pie-1::mcherry::egg-3]
Line 1: transgene, pgid, original trp_summary (PG), and original trp_synonym (PGSYN).
Line 2: CGC genotypes, aggregated and pipe-separated.
Line 3: pgid, with trp_summary replaced by the longest CGC genotype.
Line 4: pgid, with trp_synonym as the aggregate of the original trp_summary (PG), the original trp_synonyms (PGSYN), and the remaining CGC genotypes.
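The summary and synonym update can be sketched in Python as below. The function name is hypothetical, and breaking length ties alphabetically is an assumption; the source only says the longest CGC genotype is chosen.

```python
# Hypothetical sketch of the trp_summary / trp_synonym update.
def update_summary_and_synonyms(pg_summary, pg_synonyms, cgc_genotypes):
    """Return (new trp_summary, new pipe-separated trp_synonym)."""
    if not cgc_genotypes:
        # Nothing from the CGC: leave the pg values as they are.
        return pg_summary, " | ".join(pg_synonyms)
    # Longest CGC genotype wins; ties are broken alphabetically (assumption).
    canonical = sorted(cgc_genotypes, key=lambda g: (-len(g), g))[0]
    aggregate = []
    for genotype in [pg_summary, *pg_synonyms, *cgc_genotypes]:
        # The summary must not repeat in the synonyms, and no duplicates.
        if genotype and genotype != canonical and genotype not in aggregate:
            aggregate.append(genotype)
    return canonical, " | ".join(aggregate)
```

Called with the asIs2 values above, the longer CGC genotype becomes the summary and the old summary plus the shorter CGC genotype become the synonyms.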
First run was done on all genotypes. In future runs only do it if the genotype is different from the trp_summary.
3. Map WB annotated genes with CGC genotypes, output in out.summary.
In many cases, CGC genotypes use a different form of the gene name than what was published. To make it easier to compare the CGC genotypes with WB transgene genotypes, it is helpful to have a gene name mapping output for these elements.
Look at existing trp_gene and trp_drivenbygene, and compare with genes in all gin_locus, gin_synonyms, gin_seqname.
The output has three columns: which genes matched somewhere in the genotype, which had no match, and whether there was no match at all.
arIs36 45 PG [phsp::ssGFP] PGSYN
CGC [phsp::ssGFP]
45 trp_summary NOW [phsp::ssGFP]
45 trp_synonym NOW
45 Matched Gene: None
45 Gene Synonyms: None

asIs2 72 PG [pie-1::mcherry-egg-3] PGSYN
CGC [pie-1::mcherry::egg-3] | [unc-119(+) + ppie-1::GFP::egg-1]
72 trp_summary NOW [unc-119(+) + ppie-1::GFP::egg-1]
72 trp_synonym NOW [pie-1::mcherry-egg-3] | [pie-1::mcherry::egg-3]
72 Matched Gene: WBGene00004027 pie-1
72 Matched Gene: WBGene00009701 egg-3
72 Gene Synonyms: WBGene00004027 CELE_Y49E10.14, Y49E10.14, pic-1
72 Gene Synonyms: WBGene00009701 CELE_F44F4.2, F44F4.2

axEx1125 78 PG [pie-1::gfp-mex-5] PGSYN
CGC [pKR2.04 + pRF4 + N2 genomic DNA]
78 trp_summary NOW [pKR2.04 + pRF4 + N2 genomic DNA]
78 trp_synonym NOW [pie-1::gfp-mex-5]
78 Matched Gene: None
78 Gene Synonyms: WBGene00003230 CELE_W02A2.7, W02A2.7, mex-5
78 Gene Synonyms: WBGene00004027 CELE_Y49E10.14, Y49E10.14, pic-1, pie-1

axIs1427 5676 PG [pie-1prom::LAP::DCAP-1] PGSYN
CGC [pCG26/ LAP-tag DCP-1]
5676 trp_summary NOW [pCG26/ LAP-tag DCP-1]
5676 trp_synonym NOW [pie-1prom::LAP::DCAP-1]
5676 Matched Gene: None
5676 Gene Synonyms: WBGene00004027 CELE_Y49E10.14, Y49E10.14, pic-1, pie-1
5676 Gene Synonyms: WBGene00021929 CELE_Y55F3AM.12, Y55F3AM.12, dcap-1, dcp1
Full entries in out.summary.
For arIs36, the CGC genotype matched the trp_summary value, so there is no need to compare genes.
For asIs2, the CGC genotype did not match the trp_summary value, so genes within the old and new trp_summary values need comparing. Matched genes are noted. Gene synonyms are supplied.
For axEx1125, trp_summary was replaced, but no genes match and no gene or gene synonym is recognized in the CGC genotype. This transgene needs resolving.
For axIs1427, trp_summary was replaced; gene synonyms match for DCAP-1 and DCP-1.
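A very rough Python sketch of the name matching is given below. The real script's matching rules are not spelled out on this page, so everything here is an assumption: the function name, the input shapes, and the use of case-insensitive substring matching (which can give false positives for short gene names).

```python
# Hypothetical sketch; the matching rule (substring, case-insensitive)
# is an assumption and not taken from the actual script.
def find_gene_matches(genotypes, gene_names):
    """Split annotated genes into those named in the genotypes and the rest.

    genotypes  -- list of genotype strings, e.g. ["[pie-1::mcherry::egg-3]"]
    gene_names -- dict of WBGene ID -> list of names (locus, synonyms, seqname)
    """
    text = " ".join(genotypes).lower()
    matched, unmatched = {}, []
    for gene_id, names in gene_names.items():
        hits = [name for name in names if name.lower() in text]
        if hits:
            matched[gene_id] = hits
        else:
            unmatched.append(gene_id)
    return matched, unmatched
```

Genes that end up in the unmatched list correspond to the cases that need a curator to look at the genotype by hand.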
4. The errors file in the output lists pg entries that can't be compared, for example when there are duplicate transgene public names and neither of them has been labeled 'FAIL'. These objects need to be resolved or the script will break.
5. All changes to the postgres tables are listed in out.pg. Use these grep searches to display the number of changes to the different tables:
>grep " trp_strain VALUES " out.pg | wc -l
902
>grep " trp_summary VALUES " out.pg | wc -l
4450
>grep " trp_synonym VALUES " out.pg | wc -l
4396
>grep " trp_publicname VALUES " out.pg | wc -l
55
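The same counts can be collected in one pass with a short Python sketch. The function name is hypothetical; it only assumes, as the grep searches do, that each change appears on a line containing the table name followed by VALUES with spaces on both sides.

```python
TABLES = ("trp_strain", "trp_summary", "trp_synonym", "trp_publicname")


# Hypothetical helper; mirrors the grep searches, one counter per table.
def count_table_changes(lines, tables=TABLES):
    """Count the lines that mention ' <table> VALUES ' for each table."""
    counts = {table: 0 for table in tables}
    for line in lines:
        for table in tables:
            # The surrounding spaces keep trp_strain from matching trp_strain_hst.
            if f" {table} VALUES " in line:
                counts[table] += 1
    return counts
```

It could be used as count_table_changes(open("out.pg")), and the _hst tables can be added to the tables argument in the same way.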
Other tables that can be queried from the out.pg are:
- trp_summary_hst
- trp_strain_hst
- trp_synonym_hst
- trp_publicname_hst NOTE: the trp_publicname and trp_publicname_hst tables change when there is a new entry
- trp_laboratory NOTE: the trp_laboratory and trp_laboratory_hst tables change when there is a new entry
- trp_laboratory_hst
NOTE: No diff file is necessary between transgene_reports. Always run the script on everything; it only makes changes when something is new.
Web Displays
Page for mapped transgenes http://wiki.wormbase.org/index.php/Mapped_Transgenes