Difference between revisions of "Transgenes"

From WormBaseWiki

Revision as of 02:54, 26 June 2010

Caltech curation Data_types

Transgene objects

We curate both integrated (Is, In) and extrachromosomal (Ex) transgene arrays. For each transgene, we extract the following information:

  • name of the transgene*
  • genomic expression summary
  • promoter(s)
  • co-injection markers (as noted as part of the genomic expression summary)
  • reporters (GFP, YFP, LacZ, etc.),
  • other gene product(s)
  • whether or not it is integrated and if so, the method of integration, and to what LG it was mapped, if known.
  • which papers use the transgene as part of an experiment
  • any other names for the transgene, which might be reported by the different authors.

{*In cases where a transgene has not been assigned a name, or has a name that does not adhere to standard nomenclature, we assign a name based on the WBPaperID.}

Transgene first pass:

We curate both integrated (Is, In) and, more recently, extrachromosomal (Ex) transgenic constructs.

In/Is transgenes

Arun and Wen have automated the identification of papers that contain transgenes by using Textpresso to scan the C. elegans corpus of papers for the regular expressions (1-3 capital letter)Is or In (1-4 digits). This script will miss any transgenes that do not have a standard name using "Is" or "In".
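The name scan described above can be sketched roughly as follows. This is only an illustration: the actual expression used by the Textpresso script is not shown on this page, so the pattern below (a 1-3 letter prefix of either case, then "Is" or "In", then 1-4 digits) and the function name `find_transgenes` are assumptions.

```python
import re

# Assumed pattern: 1-3 letter prefix, then "Is" or "In", then 1-4 digits.
# The real Textpresso script may use a different expression.
TRANSGENE_RE = re.compile(r"\b[A-Za-z]{1,3}(?:Is|In)\d{1,4}\b")


def find_transgenes(text):
    """Return the distinct Is/In transgene names in text, in first-seen order."""
    seen = []
    for name in TRANSGENE_RE.findall(text):
        if name not in seen:
            seen.append(name)
    return seen


# Made-up sentence for illustration only.
sample = "Strains carrying syIs13 or hIn1 were crossed; syIs13 was outcrossed."
print(find_transgenes(sample))
```

As the paragraph above notes, a pattern of this kind cannot find transgenes whose names do not follow the standard "Is"/"In" form.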

Transgene names are extracted and used to populate the transgene postgres table. These entries do not have a summary (genotype) or remark, so they can be retrieved through Phenote by hitting <Retrieve> next to the "Search New Transgene" field on tab three.

The output from the Textpresso search script is here: http://textpresso-dev.caltech.edu/wen/transgenes_in_regular_papers.out All transgene-paper links are listed on this page. "...paper.sup.1" means the transgene name was mentioned in the supplementary file.

All the transgene-paper links are entered into the postgres database automatically by a script that Juancarlos runs, and they then show up in Phenote. The script is located in /home/acedb/wen/, but I do not remember which file it is.

At this moment, only the 'Is' lines are entered into postgres from the Textpresso script. 'In' lines are not entered unless the transgene name already matches something that is in the database (which means that it has already been confirmed as a valid transgene).

Ex transgenes

We've recently begun extracting and curating Ex transgenes en masse. Previously, we only curated those Ex transgenes that were associated with an expression pattern, phenotype, or gene regulation experiment. A script was employed to find all Ex transgenes along with genomic summaries in the C. elegans corpus. The output lists all the papers that use a specific transgene, with the associated genomic expression for that transgene for that paper, all sorted by transgene. This output file can be viewed here: http://textpresso-dev.caltech.edu/transgene/transgenes_summary.out

Remove obsolete transgene objects

Some transgene hits extracted by the Textpresso scan are false positives. These objects need to be deleted from postgres and added to the transgene object exclusion list so that they are not picked up again during future transgene object scans. The transgene exclusion (false positive) list lives on tazendra at:

/home/acedb/wen/phenote_transgene/ObsoleteTg.txt

Open question: when is this file read, and by which scripts?

This file is important and needs to be edited every time a new false positive is discovered. Some examples of false positives include:

  • "In" lines that are chromosomal inversions rather than transgenes.
  • Typos in transgene names that get published. These should not go to postgres as their own entities, but should be noted in the remark, or listed as a synonym if appropriate, for the real transgene.
  • Transgene names that Textpresso mishandles. For example, when transgenes are referred to as syIs13-19 in the paper (meaning syIs13, syIs14, ... syIs19), Textpresso sometimes reports syIs1319 because the hyphen disappears during pdf2Text conversion. Curators should enter all the individual transgene objects into postgres and obsolete the syIs1319 object.
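For the hyphenated-range case, expanding a name such as syIs13-19 into its individual objects could look like the sketch below. The function name `expand_range` and the pattern are assumptions for illustration. A fused name such as syIs1319 cannot be split reliably by a script and still has to be checked against the paper.

```python
import re

# Assumed form: prefix + "Is"/"In"/"Ex" + start number, a hyphen, then an end number.
RANGE_RE = re.compile(r"^([A-Za-z]{1,3}(?:Is|In|Ex))(\d+)-(\d+)$")


def expand_range(name):
    """Expand 'syIs13-19' into ['syIs13', ..., 'syIs19'].

    Names that are not a well-formed ascending range are returned unchanged
    as a one-element list.
    """
    match = RANGE_RE.match(name)
    if not match:
        return [name]
    prefix, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if end < start:
        return [name]
    return [f"{prefix}{number}" for number in range(start, end + 1)]


print(expand_range("syIs13-19"))
```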

//hIn1 is on the obsolete list but it still looks like it is in the textpresso output file on http://textpresso-dev.caltech.edu/wen/transgenes_in_regular_papers.out (ky)
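A quick way to spot cases like the one in the note above is to compare the exclusion list with the scan output. The sketch below assumes that ObsoleteTg.txt holds one transgene name per line and that names appear as plain tokens in the output file. Neither format is documented on this page, and the sample lines and paper IDs are invented.

```python
import re

# Same assumed name pattern as for the Is/In scan; not taken from the real script.
NAME_RE = re.compile(r"\b[A-Za-z]{1,3}(?:Is|In)\d{1,4}\b")


def read_exclusions(lines):
    """Collect names from the exclusion list, assuming one name per line."""
    return {line.strip() for line in lines if line.strip()}


def excluded_but_reported(exclusion_lines, output_text):
    """Return excluded names that still occur in the scan output, sorted."""
    excluded = read_exclusions(exclusion_lines)
    reported = set(NAME_RE.findall(output_text))
    return sorted(excluded & reported)


# Invented sample data for illustration only.
exclusions = ["hIn1\n", "syIs1319\n"]
scan_output = "hIn1 WBPaper00000001\nsyIs13 WBPaper00000002\n"
print(excluded_but_reported(exclusions, scan_output))
```

In practice the two inputs would be read from the exclusion file on tazendra and from the downloaded output file.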

IV. Curation

1. Transgene curation goes through Phenote. Assuming Phenote is already installed, the command to link to the transgene tables is: ./phenote -c worm-transgene.cfg

The "Search New Transgene" button (tab 3) allows the curator to look for new transgene objects. New transgenes are identified as objects having no Summary or Remark. Usually they are already linked to paper objects, since they were entered from the Textpresso search.

Curators should look for information on new transgenes in the paper document provided by Textpresso (main paper or supplementary file).

Sometimes a paper gives only the name of a transgene and no other information. In that case, "No transgene info in original publication." should be entered into the Remark field so that the object is not identified as a new transgene again.

Here is the controlled vocabulary for the transgene remark field:

  • Remark "Conflicting mapping info: ..."
  • Remark "Conflicting genotype: ..."
  • Remark "No transgene info in original publication."
  • Remark "Other integration method: ..."
  • Remark "Clone = "
  • Remark "Mapping info: "
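A remark can be checked against this controlled vocabulary with a simple prefix test. The sketch below is illustrative only: the helper name `is_controlled_remark` is hypothetical, and nothing on this page says that Phenote enforces the vocabulary this way.

```python
# Prefixes taken from the controlled vocabulary listed above.
REMARK_PREFIXES = (
    "Conflicting mapping info:",
    "Conflicting genotype:",
    "No transgene info in original publication.",
    "Other integration method:",
    "Clone =",
    "Mapping info:",
)


def is_controlled_remark(remark):
    """Return True if the remark starts with one of the controlled prefixes."""
    return remark.strip().startswith(REMARK_PREFIXES)


print(is_controlled_remark("No transgene info in original publication."))
print(is_controlled_remark("integrated somehow"))
```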


V. Transgene .ace dumper

The transgene.ace dumper was written by Juancarlos and Wen to translate the transgene postgres data into .ace format for uploading into AceDB. The script is located on tazendra at:

/home/acedb/wen/phenote_transgene/transgene_dump_ace.pl

The output file is dumped into the same directory and is called transgene.ace.200XXXXX. A copy is also made, called transgene.ace.

Test the .ace in CitaceMinus
a. make sure the file reads in fine.
b. look at all the transgene objects to make sure there are no strange looking ones.
c. do a count of objects before and after the read-in to make sure the number of new objects is reasonable.
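For step c, one way to estimate how many new objects to expect before the read-in is to compare the new dump with the previous one. This sketch assumes each object in the dump starts with a line of the form `Transgene : "name"`. The helper names and the sample text are invented, and comparing two dump files is a suggestion rather than the documented procedure.

```python
import re

# Assumed object header line in the .ace dump: Transgene : "name"
OBJECT_RE = re.compile(r'^Transgene\s*:\s*"([^"]+)"', re.MULTILINE)


def transgene_names(ace_text):
    """Return the set of Transgene object names found in .ace text."""
    return set(OBJECT_RE.findall(ace_text))


def new_objects(previous_ace, current_ace):
    """Return names present in the current dump but not in the previous one."""
    return sorted(transgene_names(current_ace) - transgene_names(previous_ace))


# Invented sample dumps for illustration only.
previous = 'Transgene : "syIs13"\nRemark "No transgene info in original publication."\n\n'
current = previous + 'Transgene : "syIs14"\nRemark "Mapping info: ..."\n\n'
print(len(transgene_names(current)), new_objects(previous, current))
```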

A cron job set up by Juancarlos and Wen runs the Transgene.ace dumper script on Thursday mornings at 6am and deposits it on spica at 8am into:

citace@spica.caltech.edu:/home/citace/Data_for_citace/Data_from_Karen/ 

If there has been any new data or changes in data between testing the file and the Thursday morning dump, make sure to rerun the script and transfer the new data dump to spica for upload.


VI: Interaction with other curators

Transgenes are used by many other datatypes, such as Expr_pattern, Phenotype, Gene_regulation.


Gene_regulation: Xiaodong sends out a file at the beginning of every month requesting new transgene IDs. The transgene curator should fill in the new IDs and send the file back to her.

Phenotype: Transgene IDs for Phenotype curation are requested via the webform: http://tazendra.caltech.edu/~postgres/cgi-bin/new_objects.cgi?action=Update+Transgene+%21 An email is sent to the transgene curator when there are transgene names pending approval.

Expr_pattern: At the moment, I can log into Phenote to curate new transgenes by myself. However, in the future, we can probably combine all three data types and do it via the webform.


//(ky)

    • Strain:

Strain data seems to have relevant remarks about the transgene in the strain. Is there a way we can link strain info to transgene curation? Possible ways we could do this: 1. automate the display of strain info when curating transgenes to sync up data and paper links. 2. dump all strains containing transgenes and use the dump to import (or manually update) the transgene info.