Sequence Feature


Flagging papers

Send them to worm-bug@sanger.ac.uk. This is where papers identified by SVMs/pattern matching are sent. We will be moving away from this ticketing system, but for the time being they will all be in the same place.


Rules for marking up regions (from GW)

  • If a region is necessary and sufficient to drive a reporter gene, then mark it as an 'enhancer' or 'silencer'.

(I don't think these are the classic definitions for enhancer/silencer, RL)

  • If a region is both an enhancer and a silencer, then it should have the SO_term tags for both of these.
  • If mobility shift experiments or similar experimental evidence is available to assert that a short region is a TF binding site, then mark it as a TF_binding_site.
  • Similarity to a known binding motif is not evidence of being a TF_binding_site.
  • If there is no evidence for a TF binding site and it has an effect on expression when mutated or deleted, but is not sufficient to drive a reporter gene, then we cannot assert that it is an enhancer or a TF binding site. Mark it as an anonymous 'regulatory_region'.
  • If a region has the properties of being both a TF binding site and an enhancer then mark it up as two Features, one a TF_binding_site and one an enhancer.
  • If a region is asserted to be a promoter region in the paper, is within 200bp (or thereabouts?) of the 5' end of the target gene, and is necessary and sufficient to drive a reporter gene, mark it as a promoter. If in doubt, consider marking it as an enhancer.
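The rules above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading, not a curation tool: the `RegionEvidence` class, its field names, and the `features_to_create` function are hypothetical, and the 200bp cutoff is treated as a hard limit even though the rule says "or thereabouts". Each inner list in the result stands for one Feature and holds the SO term names it would carry.

```python
from dataclasses import dataclass
from typing import List, Optional

PROMOTER_WINDOW_BP = 200  # the rule says "200bp (or thereabouts?)"; a hard cutoff is an assumption

@dataclass
class RegionEvidence:
    # All field names are hypothetical, chosen for this sketch.
    activates_reporter: bool = False      # necessary and sufficient to drive a reporter
    represses_reporter: bool = False      # necessary and sufficient to silence a reporter
    binding_assay: bool = False           # mobility shift or similar experimental evidence
    motif_similarity: bool = False        # recorded but deliberately ignored by the rules
    mutation_affects_expression: bool = False
    asserted_promoter: bool = False       # the paper calls it a promoter
    bp_from_gene_5prime: Optional[int] = None

def features_to_create(ev: RegionEvidence) -> List[List[str]]:
    """Return one list of SO term names per Feature that the rules call for."""
    features = []
    regulatory = []
    is_promoter = (
        ev.asserted_promoter
        and ev.activates_reporter
        and ev.bp_from_gene_5prime is not None
        and ev.bp_from_gene_5prime <= PROMOTER_WINDOW_BP
    )
    if is_promoter:
        regulatory.append("promoter")
    elif ev.activates_reporter:
        regulatory.append("enhancer")      # also the fallback when a promoter call is in doubt
    if ev.represses_reporter:
        regulatory.append("silencer")      # enhancer + silencer: one Feature, both SO terms
    if regulatory:
        features.append(regulatory)
    if ev.binding_assay:
        features.append(["TF_binding_site"])  # always a separate Feature
    if not features and ev.mutation_affects_expression:
        features.append(["regulatory_region"])  # anonymous region, nothing stronger can be asserted
    return features

# The TRA-1 site in egl-1: a silencer that is also an experimentally supported binding site.
print(features_to_create(RegionEvidence(represses_reporter=True, binding_assay=True)))
```

Note that `motif_similarity` never influences the result, which mirrors the rule that similarity to a known motif is not evidence of a TF_binding_site.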


Example for sequence feature curation

The example is from WBPaper00003631.


Feature : "egl-1_temp_1.1"
Sequence VF23B12L
Mapping_target VF23B12L
Flanking_sequences cagctcaattattaaattttattgggtattgttta cataaaattctattgtcccagatttaggatacatcg
DNA_text CTCCTAACCGGGTGGTC
Description "This is a TRA-1 binding site that represses egl-1."
Remark "This is the TF_binding_site for TRA-1 which silences egl-1. 
N.B. a 'silencer' Feature has also been made at this location to aid expression and interaction curation
[2013-07-23 gw3]"
Associated_with_gene WBGene00001170 // egl-1
Bound_by_product_of WBGene00006604 // tra-1
Transcription_factor WBTranscriptionFactor000029 // tra-1
Method  TF_binding_site
SO_term "SO:0000235" // TF_binding_site
Defined_by_paper WBPaper00003631
Public_name "TRA-1 binding site"

Feature : "egl-1_temp_1.2"
Sequence VF23B12L
Mapping_target VF23B12L
Flanking_sequences cagctcaattattaaattttattgggtattgttta cataaaattctattgtcccagatttaggatacatcg
DNA_text CTCCTAACCGGGTGGTC
Description "This is the silencer of egl-1, containing a single TF_binding_site bound by TRA-1."
Remark "Made this 'silencer' feature in addition to the TRA-1 TF_binding_site Feature to aid expression 
and interaction curation [2013-07-23 gw3]"
Associated_with_gene WBGene00001170 // egl-1
Method  silencer
SO_term "SO:0000625" // silencer
Defined_by_paper WBPaper00003631
Public_name "TRA-1 binding site silencer"

Most Expr_pattern and Interaction objects will be attached to the 'enhancer/silencer' Features rather than the TF_binding_site Features
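Because the TF_binding_site and silencer Features share the same location and provenance, the pair can be generated from one set of shared tags. The sketch below is a minimal illustration in Python; `render_feature` and `SHARED` are hypothetical names, and the tag values are copied from the example above (Description, Remark and Public_name are left out for brevity).

```python
def render_feature(name, tags):
    """Render one Feature object as .ace text from (tag, value) pairs."""
    lines = ['Feature : "%s"' % name]
    lines.extend("%s %s" % (tag, value) for tag, value in tags)
    return "\n".join(lines) + "\n"

# Location and provenance shared by both Features (values from the example above).
SHARED = [
    ("Sequence", "VF23B12L"),
    ("Mapping_target", "VF23B12L"),
    ("Flanking_sequences",
     "cagctcaattattaaattttattgggtattgttta cataaaattctattgtcccagatttaggatacatcg"),
    ("DNA_text", "CTCCTAACCGGGTGGTC"),
    ("Associated_with_gene", "WBGene00001170"),
    ("Defined_by_paper", "WBPaper00003631"),
]

# Only the binding-site Feature records what binds it.
binding_site = render_feature("egl-1_temp_1.1", SHARED + [
    ("Bound_by_product_of", "WBGene00006604"),
    ("Method", "TF_binding_site"),
    ("SO_term", '"SO:0000235"'),
])

# The silencer Feature is the one Expr_pattern and Interaction objects attach to.
silencer = render_feature("egl-1_temp_1.2", SHARED + [
    ("Method", "silencer"),
    ("SO_term", '"SO:0000625"'),
])

print(binding_site)
print(silencer)
```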

Link to Gene Regulation/Regulatory interactions

Two types of gene_regulation can be linked to feature:

  • trans-regulation: TF A regulates target B through element C

In this situation, our current interaction model already accommodates this data and links the feature object via:

?Interaction

Interaction_associated_feature  ?Feature  XREF Associated_with_Interaction //trans-regulation

  • cis-regulation: enhancer element C (cis-regulator) cis-regulates gene B

The current interaction model needs to be modified to accommodate this type of data by adding a new tag:

?Interaction

Feature_interactor  ?Feature  XREF  Interacting_feature  #Interactor_info //cis-regulation


We will propose a corresponding feature model change so that there is a one-to-one XREF between the models. The intention is that interactions that explicitly state a sequence feature object as an interactor in a physical or regulatory interaction can refer to a ?Feature object as a "Feature_interactor". Alternatively, when there is less direct evidence or the association is vaguer, we would use the "Interaction_associated_feature" tag. The XREFs will then link to the appropriate tags in the corresponding objects.

  • proposed model change:

?Interaction

   Feature_interactor  ?Feature  XREF  Interacting_feature  #Interactor_info //cis-regulation

   Interaction_associated_feature  ?Feature  XREF Associated_with_Interaction //trans-regulation

?Feature

   Interacting_feature  ?Interaction  XREF  Feature_interactor //cis-regulation

   Associated_with_Interaction  ?Interaction  XREF Interaction_associated_feature //trans-regulation
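In an ACeDB model an XREF means that filling a tag on one object populates the named tag on the referenced object as well. The toy Python sketch below mimics that behaviour for the four proposed tags, only to make the pairing explicit; `ToyStore` is a hypothetical in-memory stand-in (not ACeDB), and `WBsf000001` is a made-up Feature ID.

```python
from collections import defaultdict

# (class, tag) -> (target class, reciprocal tag), taken from the proposed model change.
XREFS = {
    ("Interaction", "Feature_interactor"): ("Feature", "Interacting_feature"),
    ("Interaction", "Interaction_associated_feature"): ("Feature", "Associated_with_Interaction"),
}
# Add the reverse direction so either side can be edited.
XREFS.update({target: source for source, target in list(XREFS.items())})

class ToyStore:
    """Hypothetical in-memory store that keeps both ends of an XREF in sync."""

    def __init__(self):
        # (class, object id) -> tag -> set of referenced object ids
        self.objects = defaultdict(lambda: defaultdict(set))

    def add(self, cls, obj_id, tag, target_id):
        target_cls, back_tag = XREFS[(cls, tag)]
        self.objects[(cls, obj_id)][tag].add(target_id)
        self.objects[(target_cls, target_id)][back_tag].add(obj_id)

store = ToyStore()
# trans-regulation: the Feature is only associated with the interaction.
store.add("Interaction", "WBInteraction000501966",
          "Interaction_associated_feature", "WBsf000001")
# cis-regulation: the Feature itself is an interactor.
store.add("Feature", "WBsf000001", "Interacting_feature", "WBInteraction000520204")

print(dict(store.objects[("Feature", "WBsf000001")]))
```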

Link to Expression pattern

When and how do we link sequence features to Expression Pattern objects?


Example 1 -from WBPaper00003631:

"The egl-1 gene appears to be expressed in the HSNs in males." The construct used is [Pegl-1::gfp] transcriptional fusion.

  • Curator creates an Expression object for egl-1 in the male's HSN and links it to pegl-1::GFP transgene.
Expr_pattern : "Expr11092"
Anatomy_term	"WBbt:0004757" Certain //HSNR
Anatomy_term	"WBbt:0004758" Certain //HSNL
Anatomy_term	"WBbt:0007850" Certain //male
Gene	"WBGene00001170"//egl-1
Pattern	"The egl-1 gene appears to be expressed in the HSNs in males, in which the HSNs normally undergo 
programmed cell death, but not in hermaphrodites, in which the HSNs normally survive."
Reference	"WBPaper00003631"
Reporter_gene	"[Pegl-1::gfp] transcriptional fusion. To construct Pegl-1::gfp, bases +174 to +5820 (5'-3') 
downstream of the stop codon of the egl-1 gene and bases -1914 to -837 (5'-3') upstream of the stop codon were
amplified with appropriate primers and cloned into the SpeI-ApaI (5'-3') and PstI-BamHI (5'-3') sites of 
vector pPD95.69, respectively (A. Fire et al., personal communication). --precise ends."

  • Sequence curator creates a sequence feature for that object -we are not there yet but we should aim for it.
  • In the sequence feature object there will be a link to the expression.

note that in this expression object we have, as per the Expression_pattern model

Expr_pattern	Expression_of	Gene ?Gene XREF Expr_pattern
				
  • The Expression pattern object in this case is linked to the gene as the authors hypothesize that the transcriptional fusion expression is the endogenous egl-1 expression.


Example 2 from WBPaper00003631 (a hypothetical, made-up example; this specific paper contains no such evidence, but it is a plausible scenario):

"This specific sequence of 80bp is expressed in the HSNL." The construct used is [80bp-egl-1::gfp].

1) One way to go is to link the expression to the sequence rather than to the gene. From the Expr_pattern model:

Expr_pattern	Expression_of	Gene ?Gene XREF Expr_pattern
				Sequence   ?Sequence XREF Expr_pattern 

Expr_pattern : "Expr11093"
Anatomy_term	"WBbt:0004758" Certain //HSNL
Sequence	"???"
Pattern	"This particular sequence::GFP was expressed in HSNL"
Reference	"WBPaper00003631"
Reporter_gene	"[80bp-egl-1::gfp]. To construct 80bp-egl-1::gfp..."
				
  • Sequence curator creates a sequence feature for that object.
  • In the sequence feature object there will be a link to the expression.
  • The Expression pattern object in this case is linked to the sequence as the artificial construct might not resemble the endogenous egl-1 expression.
  • It will generally be hard to determine where the boundary lies between artificial and endogenous expression if no other experimental evidence (IHC, ISH) is available.

* If we curate the objects this way we should determine how to display them on the site. Separate from other expression objects?

2) Another option would be to include those objects in Gene regulation rather than in expression: "That specific sequence is responsible for expression in..."

How were these kinds of objects curated in the past? Was it via gene_regulation Cis_regulated_seq?

Although 'Cis_regulated_seq' existed in the old gene_regulation model, it was never used for any objects by either Wen or Xiaodong. In the new Interaction model, this tag is gone. --XW

3) A third possibility is to add Drives_expression_in in the feature object

               Drives_expression_in
			
				Life_stage   ?Life_stage  
				Anatomy_term ?Anatomy_term 
				GO_term      ?GO_term     

This approach is attractive because it does not "contaminate" the expression pattern class while still capturing where the enhancer drives expression. In REDfly (Regulatory Element Database for Drosophila, http://redfly.ccr.buffalo.edu/) the enhancer region is annotated to the anatomy terms, but that expression is not listed under the classic expression patterns. See for example the decapentaplegic gene (dpp) construct dpp_303lacZ.

In the example of Hwang and Sternberg, 2004 (WBPaper00006370), the feature object will be

Feature : 
Public_name "lin-3 enhancer"
Sequence F36H1
Description "lin-3 enhancer region, driving anchor cell (AC) specific expression"
Flanking_sequences "ctagaacttcccgtctctccctattcaatg" "cttaccaatgtctcaggcatttttggaaaa" 
Mapping_target F36H1
Associated_with_gene WBGene00002992 // lin-3
Species "Caenorhabditis elegans"
Defined_by_paper WBPaper00006370
SO_term SO:0000165 // enhancer
Method enhancer 
Associated_with_Interaction WBInteraction000501966// hlh-2 binds to lin-3
Associated_with_Interaction WBInteraction000520204// nhr-25 binds to lin-3
Anatomy_term	"WBbt:0004522"//Anchor cell

4) We could simply generate an Expr_pattern object and add the Associated_feature ?Feature. For display purposes on the site we can display objects that have Associated_feature in a separate section

Example 3 from WBPaper00003631:

"The egl-1 gene appears to be expressed in the HSNs in males (Pegl-1::GFP reporter)...if tra-1 is bound to egl-1 the expression in HSNs is repressed"

  • The region of tra-1 binding to egl-1 is known and 2 sequence features are created for it, one as TF_binding_site and one for silencer.
  • A gene regulation object is created -> egl-1 downregulation in HSN.
  • The object is added in the silencer sequence feature object.

Should we create an expression object for the tra-1 binding site? In this case we would have to create a negative expression: egl-1 is NOT expressed in HSNs if bound by tra-1. This falls under gene regulation to me -DR

Should we link to the existing expression pattern Expr_pattern : "Expr11092" -see above? This might not be appropriate as Expr11092 depicts expression in male HSNs. If we want to pull out that info we could do it anyway through the gene regulation object -DR

Should we just leave the gene regulation association?

As of now few Expression Patterns are linked to the Genome Browser (Vancouver set is the only data set). The ultimate goal is to map, whenever we can, expression constructs to the genome browser.

Top down approach

We are brainstorming in order to develop a model that will be suitable for accommodating curation of all the above.


The potential model should contain the following info

for Expression
  • sequence - the sequence could be any stretch of DNA from few bp to kbs

(?Feature, 1 or more)

  • reporter -GFP, RFP, YFP, mCherry, Venus,...

(+ Other: text, including when endogenous gene is used as the (part of, e.g. gfp fused in) reporter)

  • gene (the gene immediately downstream of the sequence) non unique because it could be associated to more than one gene

(NOT annotate gene because 1. the base model is about describing the pattern of expression, 2. location information intrinsically informs possible cis-targets, 3. if author asserts relevant genes, that should go in some ?Regulation)

  • Reflects_endogenous_expression_of ?Gene #if the author assume that expression reflects the endogenous then we put it otherwise not
  • anatomy term
  • life stage
  • (sex will be encoded in life stage and anatomy)
  • WBPaper
  • experimental info?
  • other info will be textual

After brainstorming (people involved Xiaodong, Raymond, Wen, Daniela) we agreed the current Expr_model can accommodate most of the changes proposed above. The only modification that should be done is to add the

  • Reflects_endogenous_expression_of ?Gene #if the authors assume that the expression reflects the endogenous one we put it otherwise not

For all the *artificial* constructs we will not populate the tag. Daniela will start curation and see if everything fits with the proposal. If so, she will request a model change.


for Regulation

Next topics: capture regulation, post-transcriptional regulation. Agreement has been reached for gene regulation objects and is summarized in a section above.

Sequence Feature Model

?Feature SMap S_parent UNIQUE Sequence UNIQUE ?Sequence XREF Feature_object
	 Name Public_name UNIQUE ?Text
	      Other_name ?Text
         Sequence_details Flanking_sequences UNIQUE Text UNIQUE Text
			  Mapping_target UNIQUE ?Sequence 
			  Source_location UNIQUE Int UNIQUE ?Sequence UNIQUE Int UNIQUE Int UNIQUE #Evidence //source data, <WSversion> ?Sequence pos1 pos2 Evidence(Paper/person etc. remarks)
	 DNA_text UNIQUE ?Text       // for storing the sequence of the feature...can use IUPAC codes to be able
				     // store consensus sequences, e.g. binding site consensus sequence
	 Origin     Species UNIQUE ?Species  //added by pad, as we are moving towards multi species readyness.
		    Strain UNIQUE ?Strain//added by pad, as we are moving towards multi strain readyness.
	 History    Merged_into	UNIQUE ?Feature XREF Acquires_merge #Evidence
		    Acquires_merge ?Feature XREF Merged_into #Evidence
		    Deprecated Text #Evidence 
         Visible    Description ?Text
		    SO_term ?SO_term
	 Defined_by Defined_by_sequence ?Sequence XREF Defines_feature #Evidence
		    Defined_by_paper ?Paper XREF Feature #Evidence
		    Defined_by_person ?Person
		    Defined_by_author ?Author
		    Defined_by_analysis ?Analysis Int
	 Score Float Text #Evidence // this would be a log score as indicated by the analysis used in gff dump
	 Associations Associated_with_gene ?Gene XREF Associated_feature #Evidence // richard
                      Associated_with_CDS ?CDS XREF Associated_feature #Evidence // richard
                      Associated_with_transcript ?Transcript XREF Associated_feature #Evidence // richard
                      Associated_with_pseudogene ?Pseudogene XREF Associated_feature #Evidence // richard
                      Associated_with_transposon ?Transposon XREF Associated_feature #Evidence //richard
		      Associated_with_variation ?Variation XREF Feature #Evidence
		      Associated_with_Position_Matrix ?Position_Matrix XREF Associated_feature #Evidence
		      Associated_with_operon ?Operon XREF Associated_feature #Evidence
		      Associated_with_Interaction ?Interaction XREF Feature_interactor
		      Associated_with_expression_pattern ?Expr_pattern XREF Associated_feature #Evidence 
		      Associated_with_Feature ?Feature XREF Associated_with_Feature #Evidence
		      Associated_with_construct ?Construct XREF Sequence_feature
	 Bound_by_product_of ?Gene XREF Gene_product_binds #Evidence //pad added this to show what gene it binds
	 Transcription_factor UNIQUE ?Transcription_factor XREF Binding_site 
	 Annotation UNIQUE ?LongText // added for data attribution [030220 dl]
	 Confidential_remark ?Text //pad
	 Remark ?Text #Evidence
         Method UNIQUE ?Method

OA interface

Tab1

  • PGID Pgdbid -- no table -- postgres database ID
  • Feature ID -- sqf_name -- ontology on features. In term info displays: publicname othername description dnatext species paper wbgene boundbyproduct trascriptionfactor method analysis and interaction, regulation and expression objects related to that feature
  • Public_name -- sqf_publicname -- text
  • Other Name -- sqf_othername -- text -- seems there is no tag with othername in the file but we can keep it in case they will ever add one
  • Description -- sqf_description -- text
  • Species -- sqf_species -- dropdown as in Expression OA; if it gives a problem when parsing data, keep as text
  • Deprecated -- sqf_deprecated -- text
  • Defined_by_paper -- sqf_paper -- Multiontology on papers
  • Defined_by_person -- sqf_person -- multiontology on people
  • Defined_by_analysis -- sqf_analysis -- text

Tab2

  • Method -- sqf_method -- dropdown on the following controlled vocabulary: binding_site, binding_site_region, enhancer, promoter, regulatory_region, TF_binding_site, TF_binding_site_region, silencer (not yet any in geneace). NC_conserved_region (not yet in geneace)
  • SO_term -- sqf_soterm -- ontology http://sourceforge.net/p/song/svn/HEAD/tree/trunk/so-xp-simple.obo
  • DNA_text -- sqf_dnatext -- text
  • Flanking sequence A -- sqf_flanka -- text
  • Flanking sequence B -- sqf_flankb -- text
  • Mapping_target -- sqf_target -- text
  • Sequence -- sqf_sequence -- text
  • Merged_into -- sqf_mergedinto -- ontology on sequence features
  • Curator -- sqf_curator -- dropdown put Gary Williams WBPerson4025

Tab3

  • Gene -- sqf_wbgene -- multiontology on genes tag: Associated_with_gene
  • Expression_pattern(readonly) -- sqf_exprpattern -- multiontology on Expr_pattern (ExprID, exp_name in the expr_pattern table) tag: Associated_with_expression_pattern
  • Interaction(readonly) -- sqf_intid -- multiontology on interactions (WBInteractionxxxxxxxxx) tag: Associated_with_Interaction
  • CDS -- sqf_cds -- multiontology on sequences(?) as Trans Regulated Seq field in tab 3 in genereg OA tag: Associated_with_CDS <- Daniela, should this be text ?
  • Operon --sqf_operon -- text tag: Associated_with_operon
  • Construct(readonly) -- sqf_construct -- multiontology on constructs like in tab 2 for expr_pattern OA
  • Bound By Product Of -- sqf_boundbyproduct -- multiontology on genes
  • Transcription Factor -- sqf_trascriptionfactor -- text
  • Confidential_remark --sqf_confidential -- text
  • Remark -- sqf_remark -- text
  • Score -- sqf_score -- text


Some unused tags, do not generate anything for now -we will put in place a check that will throw an error if an unused tag starts to get populated:

  • Transcript -?Transcript (multiontology), sf_transcript, Dumps as Associated_with_transcript
  • Pseudogene -?Pseudogene (multiontology), sf_pseudogene, Dumps as Associated_with_pseudogene
  • Transposon -?Transposon (multiontology), sf_transposon, Dumps as Associated_with_transposon
  • Variation -?Variation (multiontology), sf_variation, Dumps as Associated_with_variation
  • Position_Matrix -?Position_Matrix (multiontology), sf_pwm, Dumps as Associated_with_Position_Matrix
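The check for unused tags could look something like the sketch below. It is written in Python purely for illustration (the parsing script itself is Perl), and `find_unused_tag_usage` is a hypothetical helper: it scans .ace text and reports every line whose tag is one of the five listed above.

```python
# Tags that are in the ?Feature model but are not expected to be populated yet.
UNUSED_TAGS = {
    "Associated_with_transcript",
    "Associated_with_pseudogene",
    "Associated_with_transposon",
    "Associated_with_variation",
    "Associated_with_Position_Matrix",
}

def find_unused_tag_usage(ace_text):
    """Return (line number, tag, line) for each line that starts with an unused tag."""
    hits = []
    for lineno, line in enumerate(ace_text.splitlines(), start=1):
        parts = line.split(None, 1)  # the tag is the first whitespace-delimited token
        if parts and parts[0] in UNUSED_TAGS:
            hits.append((lineno, parts[0], line))
    return hits

sample = (
    'Feature : "WBsf000001"\n'            # made-up ID for the example
    "Associated_with_gene WBGene00001170\n"
    "Associated_with_variation WBVar00000001\n"
)
for lineno, tag, line in find_unused_tag_usage(sample):
    print("line %d: unused tag %s is populated: %s" % (lineno, tag, line))
```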

GeneAce link

ftp://ftp.sanger.ac.uk/pub2/wormbase/STAFF/mh6/nightly_geneace/


Parsing and Alerting script

the script is here: /home/postgres/work/pgpopulation/sqf_sequencefeature/populate_from_geneace/parse_seqfeat.pl

cronjob that runs at 8pm daily

If the parser encounters the Associated_with_expression or associated_with_interaction fields, it will ignore them. We changed this in Dec 2014 because these associations are no longer dumped at the Hinxton side but through Expression and regulation objects at Caltech. It works this way:

It gets the existing name and paper data from postgres, so that it can later see whether there are new names or changed papers:

   SELECT * FROM sqf_name
   SELECT * FROM sqf_paper

We get the new file from ftp://ftp.sanger.ac.uk/pub/wormbase/STAFF/mh6/nightly_geneace/features.ace.gz

We read the previous file from /home/postgres/work/pgpopulation/sqf_sequencefeature/prev_features.ace.gz

If the two files are the same, nothing happens. If they are different, we replace the prev_features file with the Sanger file.
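As a rough illustration of the "same file or not" step, the Python sketch below compares the decompressed contents of the downloaded and previous files and only then promotes the download. This is not the Perl script's code: `content_digest`, `has_changed` and `promote` are hypothetical names, and comparing decompressed content rather than the raw .gz bytes is an assumption made here so that a changed gzip header alone does not count as a change.

```python
import gzip
import hashlib
import shutil
from pathlib import Path

def content_digest(path):
    """SHA-256 of the decompressed .ace content of a gzip file."""
    with gzip.open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()

def has_changed(downloaded, previous):
    """True when there is no previous file or its content differs from the download."""
    if not Path(previous).exists():
        return True
    return content_digest(downloaded) != content_digest(previous)

def promote(downloaded, previous):
    """Replace the previous file with the downloaded one."""
    shutil.copyfile(downloaded, previous)
```

A caller would read the old entries it still needs from the previous file before calling `promote`, since the per-object comparison described further down depends on them.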

$goodMethods{"binding_site"}++;
$goodMethods{"binding_site_region"}++;
$goodMethods{"DNAseI_hypersensitive_site"}++;
$goodMethods{"enhancer"}++;
$goodMethods{"histone_binding_site_region"}++;
$goodMethods{"promoter"}++;
$goodMethods{"regulatory_region"}++;
$goodMethods{"TF_binding_site"}++;
$goodMethods{"TF_binding_site_region"}++;
$goodMethods{"history_feature"}++;
mapping of .ace tags to sqf tables:
$tagToField{"Public_name"}                          = 'publicname';
$tagToField{"Other_name"}                           = 'othername';
$tagToField{"Description"}                          = 'description';
$tagToField{"Species"}                              = 'species';
$tagToField{"Deprecated"}                           = 'deprecated';
$tagToField{"Defined_by_paper"}                     = 'paper';
$tagToField{"Defined_by_person"}                    = 'person';
$tagToField{"Defined_by_analysis"}                  = 'analysis';
$tagToField{"Method"}                               = 'method';
$tagToField{"SO_term"}                              = 'soterm';
$tagToField{"DNA_text"}                             = 'dnatext';
$tagToField{"Flanking_sequences"}                   = 'flanka';
$tagToField{"Mapping_target"}                       = 'target';
$tagToField{"Sequence"}                             = 'sequence';
$tagToField{"Associated_with_gene"}                 = 'wbgene';
$tagToField{"Associated_with_CDS"}                  = 'cds';
$tagToField{"Associated_with_operon"}               = 'operon';
$tagToField{"Construct"}                            = 'construct';
$tagToField{"Bound_by_product_of"}                  = 'boundbyproduct';
$tagToField{"Transcription_factor"}                 = 'trascriptionfactor ';
$tagToField{"Confidential_remark"}                  = 'confidential';
$tagToField{"Remark"}                               = 'remark';
$tagToField{"Score"}                                = 'score';
$tagToField{"Merged_into"}                          = 'mergedinto';

we ignore the $tagToIgnore{"Acquires_merge"}++; as we have the merged_into

these fields are the multiontology:
$isMulti{"paper"}++;
$isMulti{"person"}++;
$isMulti{"wbgene"}++;
$isMulti{"exprpattern"}++;
$isMulti{"intid"}++;
$isMulti{"cds"}++;
$isMulti{"construct"}++;
$isMulti{"boundbyproduct"}++;

We take the prev_features file and break it up into one .ace entry per object, and we do the same for the Sanger file. We check the Method; if it is not a good method we skip the object.

Then we check that the object has a WBsfID; if not, we skip the object. If there was an entry for that object in the previous file and the two entries are the same, we skip it as well.

At this point we have only the .ace entries that differ from the previous .ace entries. If an entry already existed in postgres, we get its pgid and add it to the list of pgids whose data needs to be removed from the sqf tables. If the entry is completely new, we add it to the list of objects to be e-mailed to DR and XW and we get a new pgid.
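The filtering described in the last three paragraphs can be sketched as follows. This is a Python illustration under several assumptions, not a translation of parse_seqfeat.pl: entries are assumed to be separated by blank lines, a WBsfID is assumed to look like `WBsf` followed by digits, and `split_entries` and `changed_entries` are hypothetical names. The set of good methods is copied from the `%goodMethods` list above.

```python
import re

GOOD_METHODS = {
    "binding_site", "binding_site_region", "DNAseI_hypersensitive_site",
    "enhancer", "histone_binding_site_region", "promoter", "regulatory_region",
    "TF_binding_site", "TF_binding_site_region", "history_feature",
}

HEADER = re.compile(r'^Feature\s*:\s*"([^"]+)"')
METHOD = re.compile(r'^Method\s+"?([^"\s]+)"?', re.MULTILINE)
WBSF_ID = re.compile(r"WBsf\d+")  # assumed ID shape

def split_entries(ace_text):
    """Map object name -> full .ace entry, assuming blank lines separate entries."""
    entries = {}
    for block in re.split(r"\n\s*\n", ace_text.strip()):
        block = block.strip()
        match = HEADER.match(block)
        if match:
            entries[match.group(1)] = block
    return entries

def changed_entries(prev_text, new_text):
    """Return the new-file entries that pass the checks and differ from the previous file."""
    previous = split_entries(prev_text)
    changed = {}
    for name, block in split_entries(new_text).items():
        method = METHOD.search(block)
        if not method or method.group(1) not in GOOD_METHODS:
            continue  # not a good method: skip the object
        if not WBSF_ID.fullmatch(name):
            continue  # no WBsfID: skip the object
        if previous.get(name) == block:
            continue  # identical to the previous entry: nothing to do
        changed[name] = block
    return changed
```

Entries returned by `changed_entries` whose name is absent from the previous file would be the "completely new" ones that get a new pgid and go into the e-mail.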


we store in curator WBPerson4025 and in name WBsfID

for each line in the .ace we get the tag and the data, if it's in the list of the ones to ignore we skip it

   next if ($tagToIgnore{$tag});


If the tag does not exist in a table, it will send an e-mail saying that there is an invalid tag:

   unless ($tagToField{$tag}) { $errorEmail .= "$wbsfid invalid tag $tag : $line\n"; next; }


if it's a flanking sequence it will extract flank A and flank B

For the ontology objects it extracts only the object IDs, to avoid storing the whole evidence hash. For anything else it stores everything.

We join all the remarks together separating them with pipes.
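The per-line handling just described (ignored tags, invalid tags, flanking sequences, ID-only fields, joined remarks) is sketched below in Python for illustration. It is not the Perl script: `parse_entry` and `ID_FIELDS` are hypothetical names, the exact pipe separator `" | "` is an assumption, and the mapping is the `%tagToField` table above with `Flanking_sequences` handled separately.

```python
TAG_TO_FIELD = {
    "Public_name": "publicname", "Other_name": "othername",
    "Description": "description", "Species": "species",
    "Deprecated": "deprecated", "Defined_by_paper": "paper",
    "Defined_by_person": "person", "Defined_by_analysis": "analysis",
    "Method": "method", "SO_term": "soterm", "DNA_text": "dnatext",
    "Flanking_sequences": "flanka", "Mapping_target": "target",
    "Sequence": "sequence", "Associated_with_gene": "wbgene",
    "Associated_with_CDS": "cds", "Associated_with_operon": "operon",
    "Construct": "construct", "Bound_by_product_of": "boundbyproduct",
    "Transcription_factor": "trascriptionfactor",
    "Confidential_remark": "confidential", "Remark": "remark",
    "Score": "score", "Merged_into": "mergedinto",
}
TAGS_TO_IGNORE = {"Acquires_merge"}
# Fields that hold object IDs; any trailing evidence is dropped (assumed subset of %isMulti).
ID_FIELDS = {"paper", "person", "wbgene", "cds", "construct", "boundbyproduct"}

def parse_entry(wbsfid, lines):
    """Turn the tag lines of one .ace entry into {field: [values]} plus error messages."""
    fields, errors = {}, []
    for line in lines:
        parts = line.split(None, 1)
        if not parts:
            continue
        tag = parts[0]
        data = parts[1].strip() if len(parts) > 1 else ""
        if tag in TAGS_TO_IGNORE:
            continue
        if tag not in TAG_TO_FIELD:
            errors.append("%s invalid tag %s : %s" % (wbsfid, tag, line))
            continue
        if tag == "Flanking_sequences":
            flanks = data.replace('"', "").split()
            if len(flanks) >= 2:
                fields["flanka"], fields["flankb"] = [flanks[0]], [flanks[1]]
            continue
        field = TAG_TO_FIELD[tag]
        if field in ID_FIELDS and data:
            data = data.split()[0].strip('"')  # keep the object ID, drop the evidence
        fields.setdefault(field, []).append(data)
    if "remark" in fields:
        fields["remark"] = [" | ".join(fields["remark"])]  # separator is an assumption
    return fields, errors
```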

We add the data to postgres

if the field is a paper and the WBsfID is not a new object then we see if there were old paper data in postgres and if it has changed it will add to the list of changed papers to e-mail to DR/XW
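The paper comparison amounts to a set difference per object. The Python sketch below is one reading of the sentence above, with hypothetical names (`changed_papers` and its arguments): an object is reported only if it is not new, already had paper data in postgres, and that data differs from what was just parsed.

```python
def changed_papers(pg_papers, parsed_papers, new_objects):
    """Return {WBsfID: (old papers, new papers)} for existing objects whose papers changed.

    pg_papers and parsed_papers map a WBsfID to a list of WBPaper IDs;
    new_objects is the set of WBsfIDs that were not in postgres before.
    """
    changed = {}
    for wbsfid, papers in parsed_papers.items():
        if wbsfid in new_objects:
            continue  # new objects are reported in the new-object list instead
        old = set(pg_papers.get(wbsfid, []))
        new = set(papers)
        if old and old != new:  # only when there was old paper data and it differs
            changed[wbsfid] = (sorted(old), sorted(new))
    return changed

# IDs of the Features are made up for the example.
print(changed_papers(
    {"WBsf000001": ["WBPaper00003631"]},
    {"WBsf000001": ["WBPaper00003631", "WBPaper00006370"], "WBsf000002": ["WBPaper00006370"]},
    {"WBsf000002"},
))
```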

then it starts populating postgres, see if there are differences and populate postgres

my $email = 'draciti@caltech.edu, xdwang@its.caltech.edu';

if there are any new objects it will e-mail you the list of objects. If they are less than 100 it lists the objects, if >100 it will send a list

same for the papers

it will also send an e-mail if there was an error in the parsing

Cronjob

This is the cronjob. It runs every night at 8pm:

   0 20 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/update_obo_oa_ontologies.pl

It calls the next 2 scripts:

`/home/postgres/work/pgpopulation/obo_oa_ontologies/geneace/nightly_geneace.pl`; the one that gets the data and populates the table

`/home/postgres/work/pgpopulation/sqf_sequencefeature/populate_from_geneace/parse_seqfeat.pl`;