Difference between revisions of "Transgenes"

From WormBaseWiki
Revision as of 22:20, 6 December 2013


Transgene objects

We curate both integrated (Is, In) and extrachromosomal (Ex) transgene arrays. For each transgene, we extract the following information:

  • name of the transgene*
  • genomic expression summary
  • promoter(s)
  • co-injection markers (as noted as part of the genomic expression summary)
  • reporters (GFP, YFP, LacZ, etc.)
  • other gene product(s)
  • if the transgene is integrated: the method of integration and the linkage group (LG) to which it was mapped, if known
  • which papers use the transgene as part of an experiment
  • any other names for the transgene, which might be reported by the different authors.

{*In cases where a transgene has not been assigned a name, or one that does not adhere to standard nomenclature, we assign a name based on the WBPaperID.}

Transgene model

?Transgene	Evidence	#Evidence
	Public_name //added for WS234
	Summary	UNIQUE	?Text
	Synonym	?Text
	Promoter	Driven_by_gene	?Gene	XREF	Drives_Transgene
		Driven_by_construct	?Text
	Reporter	Reporter_product	?Text
		Gene	?Gene	XREF	Transgene_product	Text
		3_UTR	?Gene
	Reporter_type	?Text
	Construction	Fragment	?Text
		Coinjection_marker	?Text
		Integration_method	UNIQUE	?Text
		Laboratory	?Laboratory	#Lab_Location
		Author	?Author
		Person	?Person
	Genetic_information	Extrachromosomal
		Integrated
		Map	?Map	#Map_position
		Map_evidence	#Evidence
		Mapping_data	2_point	?2_point_data
			Multi_point	?Multi_pt_data
		Phenotype	?Phenotype	XREF	Transgene	#Phenotype_info
		Phenotype_not_observed	?Phenotype	XREF	Not_in_Transgene	#Phenotype_info
	Used_for	
		Expr_pattern	?Expr_pattern	XREF	Transgene
		Marker_for	?Text	#Evidence
		Gene_regulation	?Gene_regulation	XREF	Transgene
		Interactor	?Interaction
	Associated_with	Marked_rearrangement	?Rearrangement	XREF	By_transgene
		Clone	?Clone	XREF	Transgene	Text
		Strain	?Strain	XREF	Transgene
	Reference	?Paper	XREF	Transgene
	Species	UNIQUE	?Species
	Remark	?Text	#Evidence

--kjy 00:16, 26 May 2012 (UTC)

Transgene first pass:

In/Is transgenes

Arun and Wen have automated the identification of papers that contain transgenes by using Textpresso to scan the C. elegans corpus of papers for the regular expressions (1-3 capital letter)Is or In (1-4 digits). This script will miss any transgenes that do not have a standard name using "Is" or "In".
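The naming pattern described above can be illustrated with a short Python sketch. This is not the Textpresso script itself; it assumes a lab prefix of one to three letters (matching names cited on this page such as staIs1, asIs2 and syIs13), followed by "Is" or "In" and one to four digits. The function name find_transgene_names is hypothetical.

```python
import re

# Assumed pattern: 1-3 letter lab prefix, then "Is" or "In", then 1-4 digits.
# This only illustrates the naming rule; it is not the Textpresso search itself.
TRANSGENE_RE = re.compile(r"\b[A-Za-z]{1,3}(?:Is|In)\d{1,4}\b")


def find_transgene_names(text):
    """Return distinct Is/In-style transgene names in order of first appearance."""
    seen = []
    for match in TRANSGENE_RE.finditer(text):
        name = match.group(0)
        if name not in seen:
            seen.append(name)
    return seen
```

As the text notes, a pattern like this cannot find transgenes that lack a standard "Is" or "In" name, and it will also pick up "In" names that turn out to be chromosomal inversions.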

Transgene names are extracted and sent to populate the transgene postgres table. These entries do not have a summary (genotype) or remark, so they can be retrieved through Phenote by hitting <Retrieve> next to the "Search New Transgene" field on tab three.

The output from the Textpresso search script is here:

http://textpresso-dev.caltech.edu/transgene/transgenes_in_regular_papers.out

All transgene-paper links are listed on this page. "...paper.sup.1" means the transgene name was mentioned in the supplementary file.

All the transgene-paper links will be entered into postgres database automatically by one of Juancarlos' scripts.

At this moment, only the 'Is' lines from the Textpresso script are entered into postgres. 'In' lines are not entered into postgres unless the transgene name already matches something that is in the database (which means it has already been confirmed as a valid transgene).

Ex transgenes

We have recently begun extracting and curating Ex transgenes en masse. Previously, we only curated those Ex transgenes that were associated with an expression pattern, phenotype, or gene regulation experiment. A script was employed to find all Ex transgenes along with genomic summaries in the C. elegans corpus. The output lists all the papers that use a specific transgene, with the associated genomic expression for that transgene in each paper, sorted by transgene. This output file can be viewed here:

http://textpresso-dev.caltech.edu/transgene/transgenes_summary.out

Obsolete transgene objects

Some false positive transgene hits get extracted by the Textpresso scan. These objects need to be deleted from postgres and added to the transgene object exclusion list so that they will not be picked up again during future transgene object scans.

When is this file called, and by which scripts?
This file is important and needs to be edited every time a new false positive is discovered.
Some examples of false positives include:

  • "In" lines that are chromosomal inversions rather than transgenes.
  • Typos in transgene names that get published. These should not go to postgres as their own entities, but should be noted in the remark or listed as a synonym, if appropriate, for the real transgene.
  • Textpresso mishandling of transgene names, such as when transgenes are referred to as syIs13-19 in the paper, in which case Textpresso reports it as syIs1319 (i.e., the hyphen sometimes disappears during pdf2Text conversion). Curators should enter all the transgene objects into postgres and obsolete the syIs1319 object.

Curation

The Textpresso transgene search deposits the transgene name and all new paper instances of the transgene directly into the transgene table.

Access to this table is currently through Phenote, but we will be migrating to the WB in-house Ontology Annotator (OA) at some point.

Curation tools

Transgene .ace dumper

The transgene.ace dumper was written by Juancarlos and Wen to translate the transgene postgres data into .ace format for uploading into AcEDB. The script is located on tazendra. The output file is dumped into the same directory.

Test the .ace in CitaceMinus

  • make sure the file reads in fine.
  • look at all the transgene objects to make sure there are no strange looking ones.
  • do a count of objects before and after the read-in to make sure the number of new objects is reasonable.

A cron job set up by Juancarlos and Wen runs the Transgene .ace dumper script on Thursday mornings at 6am and deposits the output on citace at 8am. If there has been any new data or changes in data between testing the file and the Thursday morning dump, make sure to rerun the script and transfer the new data dump to citace for upload.

Requested changes

Changes recorded on https://bitbucket.org/kyook/ky_wbprojects/wiki/transgene_dump_ace.pl

--kjy 23:26, 20 June 2012 (UTC)

  • change dump cron job to Wed morning, spica still calls it at 8am on Thursday
  • fix source of allele codes for obo_laboratory, URL for the lab-allele designations

http://www.cbs.umn.edu/cgc/lab-allele Juancarlos will parse out the page and create a local copy. A cron job will compare the page to the local copy; if the page is altered in format, the table will revert to the local copy.


Handling Dead Genes During Dump Process

The dumper script will now (as of May 2013) run an automatic check for dead genes in any gene field (for transgenes this applies to the "Driven_by_gene", "Gene" and "3_UTR" fields/tags). Any dead gene that is referenced in a Transgene object in the OA will be handled in the following manner:

1) If there is a replacement for the gene (i.e. the gene has merged into another gene), the dead gene will be dumped into a "Historical_gene" field in the .ACE file, the replacement gene will fill the original gene field. A comment will be added to the Historical_gene field via the #Evidence hash. The original gene field (now with the updated gene reference) will be printed with an "Inferred_automatically" tag after the gene. So, for example, if WBGene00001234 is now a dead gene that has been merged into WBGene00002345:

Gene  "WBGene00001234"

becomes

Gene  "WBGene00002345"  Inferred_automatically
Historical_gene  "WBGene00001234"  Remark  "Note: This object originally referred to WBGene00001234.
WBGene00001234 is now considered dead and has been merged into WBGene00002345. WBGene00002345 has 
replaced WBGene00001234 accordingly."

2) If there is no replacement for the gene (Dead or Suppressed), we would dump the following:

Historical_gene  "WBGene00001234"  Remark  "Note: This object originally referred to a gene
 (WBGene00001234) that is now considered dead. Please interpret with discretion."

OR

Historical_gene  "WBGene00001234"  Remark  "Note: This object originally referred to a gene
 (WBGene00001234) that has been suppressed. Please interpret with discretion."

and lastly,

3) If the gene has undergone a split, the transgene will still get dumped, but such genes will be printed out in the error output file of the dumping script for a curator to go back and manually change according to best judgement.


Gene Examples:
A split gene: WBGene00012507
A merged gene: WBGene00007524
A dead gene: WBGene00007814
A suppressed gene: WBGene00015490
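
The three cases above can be sketched in Python as follows. This is an illustration only, not the dumper script: the function name dump_gene_lines and the status labels ("live", "merged", "dead", "suppressed", "split") are hypothetical, and emitting the Gene line unchanged for a split gene is an assumption, since the text only says that the object is still dumped and the gene is reported in the error output.

```python
def dump_gene_lines(gene_id, status, replacement=None):
    """Return (ace_lines, error_message) for one gene reference.

    status is a hypothetical label: "live", "merged", "dead",
    "suppressed" or "split".
    """
    if status == "live":
        return [f'Gene  "{gene_id}"'], None
    if status == "merged":
        if not replacement:
            raise ValueError("a merged gene needs a replacement gene ID")
        remark = (
            f"Note: This object originally referred to {gene_id}. "
            f"{gene_id} is now considered dead and has been merged into "
            f"{replacement}. {replacement} has replaced {gene_id} accordingly."
        )
        return [
            f'Gene  "{replacement}"  Inferred_automatically',
            f'Historical_gene  "{gene_id}"  Remark  "{remark}"',
        ], None
    if status in ("dead", "suppressed"):
        state = ("is now considered dead" if status == "dead"
                 else "has been suppressed")
        remark = (
            f"Note: This object originally referred to a gene ({gene_id}) "
            f"that {state}. Please interpret with discretion."
        )
        return [f'Historical_gene  "{gene_id}"  Remark  "{remark}"'], None
    if status == "split":
        # Assumption: the gene line is dumped as-is and flagged for a curator.
        return [f'Gene  "{gene_id}"'], f"{gene_id} has been split; needs manual review"
    raise ValueError(f"unknown gene status: {status}")
```

Returning the error message separately mirrors the description of a distinct error output file that a curator reviews after the dump.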

Cross curation with other curators

Transgenes are used by other datatypes: Expr_pattern, Phenotype, Gene_regulation, and Interaction. All curators that use transgenes will need to run a script before each upload to make sure all their transgenes are valid objects.

Gene_regulation

The gene regulation curator can create new transgenes using the Transgene OA or request them from the transgene curator.

Expr_pattern

The expression pattern curator requires many transgene objects to be created on the fly. Rather than impeding the curation flow, when the expression pattern curator needs a transgene that has not been created already, they enter the relevant information for the transgene in their Reporter gene text box. A manually run script checks for lines that have text in the reporter gene field but no value in the transgene field. The script creates a new object in the transgene OA containing the corresponding paper ID, the expression curator as curator, and the remark field populated with the reporter gene text. A synonym is created from the expression pattern value with an appended _Ex. This temporary name is deposited in the synonym field of the transgene OA for the newly created object. See this wiki page for more information: http://wiki.wormbase.org/index.php/Expression_Pattern#Exporting_Reporter_Gene_description_from_Expr_pattern_OA_to_Transgene_OA
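
A minimal Python sketch of the selection logic described above, for illustration only. The real script works against the OA postgres tables; here each expression pattern line is a plain dictionary, and the key names (expr_pattern, paper, curator, reporter_gene, transgene) as well as the function name are hypothetical.

```python
def transgenes_to_create(expr_rows):
    """Build new transgene records for rows that have reporter gene text
    but no transgene value. Key names are hypothetical stand-ins for
    the OA fields."""
    new_objects = []
    for row in expr_rows:
        reporter = (row.get("reporter_gene") or "").strip()
        transgene = (row.get("transgene") or "").strip()
        if reporter and not transgene:
            new_objects.append({
                "paper": row["paper"],
                "curator": row["curator"],
                "remark": reporter,
                # Temporary name: expression pattern value with _Ex appended.
                "synonym": row["expr_pattern"] + "_Ex",
            })
    return new_objects
```

Rows that already have a transgene value, or that have no reporter gene text, are skipped, so rerunning the check on already processed lines would not create new objects for them as long as the transgene field has been filled in.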

The transgene curator needs to

  • verify that the object created by the expression pattern curator is not a duplicate transgene. If it is a duplicate, the transgene curator will merge the new transgene into the preexisting one; this makes the new transgene invalid, so its information will not be dumped. The new transgene ID and any other synonyms will be added, pipe-separated, to the synonym list of the preexisting object
  • assign a public name if it exists or if needed
  • fill in all other relevant information

Interaction

The interaction curator can create new transgenes using the Transgene OA or request them from the transgene curator.

Strain

Strain data from the CGC has relevant remarks about the transgene in the strain. There are always discrepancies between how an author describes a transgene's genotype and how a CGC strain submitter enters the transgene details. Near the end of every data upload, Mary Ann processes a strain report from the CGC and creates a file transgene_report.<date>.txt that maps transgenes mentioned in the file to WBTransgene IDs. She then sends the file to the WB transgene curator.
In the transgene_report, each entry looks like this:

Transgene : staIs1
Strain : ABR5
Description : staIs1 [pie-1p::GFP + unc-119(+)]. Superficially wild-type. Maintain under normal conditions. Reference: This strain is used as the empty vector control in Greer EL et al Nature 2010 doi: 10.1038/nature09195.
WBID : WBTransgene00017210
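
As an illustration, entries in this layout could be read with a short Python sketch like the one below. It assumes every line has the form "Key : value" and that each entry starts with a Transgene line, as in the example above; the function name parse_transgene_report is hypothetical and this is not the code used by populate_maryann_transgene.pl.

```python
def parse_transgene_report(text):
    """Split a transgene_report into a list of dicts, one per entry.

    Assumes "Key : value" lines and that each entry begins with a
    "Transgene" line.
    """
    records = []
    current = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Split on the first " : " only, so colons inside the value survive.
        key, sep, value = line.partition(" : ")
        if not sep:
            continue
        if key == "Transgene" and current:
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records
```

Splitting on the first " : " keeps text such as "doi: 10.1038/nature09195" intact inside the Description value.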

The file needs to be placed on tazendra in postgres/work/pgpopulation/transgene/???
Then run the script populate_maryann_transgene.pl

The populate_maryann_transgene.pl script
1. Enter transgenes from CGC file that are missing from WB.
Start with transgene name, map to WBTransgene ID from trp_name/trp_publicname.

If there is no transgene match, add trp_name (generating a new WBTransgene ID if one didn't exist already), trp_publicname, trp_strain, trp_cgcremark and trp_summary.
If there is a transgene match, only add or change whatever is new or different.
For now, do this for all genotypes. In future runs, only do it if the genotype is different from the trp_summary.

2. Extract the CGC transgene genotype. The genotype should be in square brackets after the transgene name in the Description field (an optional space or the words 'contains' or 'is' may appear between the transgene name and the genotype).

If no transgene name mentioned in the Description matches the Transgene value, treat the entry as having no genotype, but still record the strain and transgene.
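
The extraction rule can be illustrated with a small Python sketch. It is an assumption-laden approximation rather than the script's actual regular expression: it looks for the transgene name, optional whitespace, an optional 'contains' or 'is', and then the first bracketed span, and it does not handle nested square brackets. The function name extract_cgc_genotype is hypothetical.

```python
import re


def extract_cgc_genotype(transgene, description):
    """Return the bracketed genotype that follows the transgene name in a
    CGC Description, or None if the name is not followed by one."""
    pattern = re.compile(
        r"(?<![A-Za-z0-9])"             # name must not be part of a longer token
        + re.escape(transgene)
        + r"\s*(?:(?:contains|is)\s*)?"  # optional linking word
        + r"(\[[^\]]*\])"                # genotype in square brackets
    )
    match = pattern.search(description)
    return match.group(1) if match else None
```

Returning None corresponds to the case described above, where the entry is treated as having no genotype while the strain and transgene are still recorded.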

The genotype goes to trp_summary. Anything that was previously in the summary goes to trp_synonym, aggregated with what was there before and separated with ' | '. The summary is only the genotype. Add only if it's different.

Get the cgcgenotypes
sort them
get the longest one to be the genotype
get the trp_summary values into an aggregate list
get the trp_synonyms values into the aggregate list
get the remaining cgcgenotypes that are not the longest one into the
  aggregate list
if cgcgenotype different from pgsummary, update, and add pgsummary to
  aggregate list
if aggregate list different from pg synonyms, replace trp_synonym with
  aggregate list

asIs2 72	PG [pie-1::mcherry-egg-3]	CGC [unc-119(+) + ppie-1::GFP::egg-1]	NOW [unc-119(+) + ppie-1::GFP::egg-1]
asIs2 72	PGSUM [pie-1::mcherry-egg-3]	PGSYN 	CGC [pie-1::mcherry::egg-3] | [unc-119(+) + ppie-1::GFP::egg-1]	trp_synonyms NOW [pie-1::mcherry-egg-3] | [pie-1::mcherry::egg-3]
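
One way to read the steps above is sketched in Python below. It is an interpretation, not the script: the function name merge_cgc_genotypes is hypothetical, ties in length are broken alphabetically (an assumption), and the new summary is never repeated among the synonyms.

```python
def merge_cgc_genotypes(cgc_genotypes, pg_summary, pg_synonyms):
    """Pick the longest CGC genotype as the new summary and collect the
    old summary, old synonyms and remaining CGC genotypes as synonyms.

    Returns (new_summary, new_synonyms)."""
    # Longest first; ties broken alphabetically (an assumption).
    ordered = sorted(set(cgc_genotypes), key=lambda g: (-len(g), g))
    if not ordered:
        return pg_summary, list(pg_synonyms)
    new_summary = ordered[0]
    candidates = ([pg_summary] if pg_summary else []) + list(pg_synonyms) + ordered[1:]
    new_synonyms = []
    for value in candidates:
        # Keep only values that differ from the summary and are not repeats.
        if value != new_summary and value not in new_synonyms:
            new_synonyms.append(value)
    return new_summary, new_synonyms
```

Joining the returned synonym list with ' | ' gives the aggregated value described above. With the asIs2 values from the example, the summary becomes [unc-119(+) + ppie-1::GFP::egg-1] and the synonyms become [pie-1::mcherry-egg-3] | [pie-1::mcherry::egg-3].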

3. Map WB annotated genes with CGC genotypes. In many cases, CGC genotypes use a different form of the gene name than what was published. To make it easier to compare the CGC genotypes with WB transgene genotypes, it is helpful to have a gene name mapping output for these elements.
Look at the existing trp_gene and trp_drivenbygene values, and compare all of their gin_locus, gin_synonyms and gin_seqname names against the genotype.
Output three columns: which genes matched somewhere in the genotype, which had no match, and whether there was no match at all.
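
A rough Python sketch of such a mapping is shown below. It assumes the gene names have already been collected into a dictionary keyed by gene ID, and it uses a case-insensitive substring test, which is a simplification chosen for illustration; the function name and the placeholder gene IDs in the usage example are hypothetical.

```python
def map_genes_to_genotype(genotype, gene_names):
    """Compare each annotated gene's known names against a CGC genotype.

    gene_names maps a gene ID to its names (locus, synonyms, sequence
    name). Returns (matched_ids, unmatched_ids, no_match_at_all)."""
    lowered = genotype.lower()
    matched, unmatched = [], []
    for gene_id, names in gene_names.items():
        # Substring test is a simplification: "pie-1" matches "ppie-1".
        if any(name and name.lower() in lowered for name in names):
            matched.append(gene_id)
        else:
            unmatched.append(gene_id)
    return matched, unmatched, not matched


# Usage with placeholder gene IDs (not real WBGene identifiers):
matched, unmatched, none_matched = map_genes_to_genotype(
    "[unc-119(+) + ppie-1::GFP::egg-1]",
    {"geneA": ["pie-1"], "geneB": ["unc-119"], "geneC": ["aap-1"]},
)
print("\t".join([",".join(matched), ",".join(unmatched), str(none_matched)]))
```

A substring test is permissive enough to catch promoter-style spellings such as ppie-1, but it can also produce false matches for short names, so the output is best treated as a hint for the curator rather than a final mapping.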

4. Assign Mary Ann as curator for new entries. No need to tag updates to existing entries.

NOTE: No diff file is necessary between transgene_reports. Always run the script on everything; it only makes changes when something is new.

Web Displays

Page for mapped transgenes http://wiki.wormbase.org/index.php/Mapped_Transgenes