Sequence Feature
Flagging papers
Send them to worm-bug@sanger.ac.uk. This is where papers identified by SVMs/pattern matching are sent. We will be moving away from this ticketing system, but for the time being they will all be in the same place.
Rules for marking up regions (from GW)
- If a region is necessary and sufficient to drive a reporter gene, then mark it as an 'enhancer' or 'silencer'.
(I don't think these are the classic definitions for enhancer/silencer, RL)
- If a region is both an enhancer and a silencer, then it should have the SO_term tags for both of these.
- If mobility shift experiments or similar experimental evidence is available to assert that a short region is a TF binding site, then mark it as a TF_binding_site.
- Similarity to a known binding motif is not evidence of being a TF_binding_site.
- If there is no evidence for a TF binding site and it has an effect on expression when mutated or deleted, but is not sufficient to drive a reporter gene, then we cannot assert that it is an enhancer or a TF binding site. Mark it as an anonymous 'regulatory_region'.
- If a region has the properties of being both a TF binding site and an enhancer then mark it up as two Features, one a TF_binding_site and one an enhancer.
- If a region is asserted to be a promoter region in the paper, it is within 200bp (or thereabouts?) of the 5' end of the target gene, and it is necessary and sufficient to promote a reporter gene, mark it as a promoter. If in doubt, consider marking it as an enhancer.
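The rules above can be read as a small decision procedure. The following is a minimal sketch of that reading, not a curation tool: the `Evidence` fields, the function name `features_for` and the 200bp default are illustrative assumptions, and "necessary and sufficient" is reduced to two boolean flags. Each inner list is one Feature and holds the term(s) it would carry.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Evidence:
    """Hypothetical summary of what a paper shows for one region."""
    enhances_reporter: bool = False   # necessary and sufficient to drive a reporter
    silences_reporter: bool = False   # necessary and sufficient to repress a reporter
    binding_assay: bool = False       # e.g. mobility shift experiment
    motif_similarity: bool = False    # similarity to a known motif; never evidence
    affects_expression: bool = False  # expression changes when mutated or deleted
    asserted_promoter: bool = False   # the paper calls it a promoter
    distance_to_5prime_bp: Optional[int] = None


def features_for(ev: Evidence, promoter_window_bp: int = 200) -> List[List[str]]:
    """Return one list of terms per Feature that the rules would create."""
    features: List[List[str]] = []
    near_gene = (ev.distance_to_5prime_bp is not None
                 and ev.distance_to_5prime_bp <= promoter_window_bp)
    if ev.asserted_promoter and ev.enhances_reporter and near_gene:
        features.append(["promoter"])
    elif ev.enhances_reporter or ev.silences_reporter:
        # A region that is both gets a single Feature with both terms.
        terms = []
        if ev.enhances_reporter:
            terms.append("enhancer")
        if ev.silences_reporter:
            terms.append("silencer")
        features.append(terms)
    if ev.binding_assay:
        # A TF binding site is always its own Feature.
        features.append(["TF_binding_site"])
    if not features and ev.affects_expression:
        features.append(["regulatory_region"])
    return features
```

For the egl-1/TRA-1 case curated in the next section, `features_for(Evidence(silences_reporter=True, binding_assay=True))` yields two Features, a silencer and a TF_binding_site. Motif similarity on its own yields nothing.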
Example for sequence feature curation
the example is from WBPaper00003631
Feature : "egl-1_temp_1.1"
Sequence VF23B12L
Mapping_target VF23B12L
Flanking_sequences cagctcaattattaaattttattgggtattgttta cataaaattctattgtcccagatttaggatacatcg
DNA_text CTCCTAACCGGGTGGTC
Description "This is a TRA-1 binding site that represses egl-1."
Remark "This is the TF_binding_site for TRA-1 which silences egl-1. N.B. a 'silencer' Feature has also been made at this location to aid expression and interaction curation [2013-07-23 gw3]"
Associated_with_gene WBGene00001170 // egl-1
Bound_by_product_of WBGene00006604 // tra-1
Transcription_factor WBTranscriptionFactor000029 // tra-1
Method TF_binding_site
SO_term "SO:0000235" // TF_binding_site
Defined_by_paper WBPaper00003631
Public_name "TRA-1 binding site"

Feature : "egl-1_temp_1.2"
Sequence VF23B12L
Mapping_target VF23B12L
Flanking_sequences cagctcaattattaaattttattgggtattgttta cataaaattctattgtcccagatttaggatacatcg
DNA_text CTCCTAACCGGGTGGTC
Description "This is the silencer of egl-1, containing a single TF_binding_site bound by TRA-1."
Remark "Made this 'silencer' feature in addition to the TRA-1 TF_binding_site Feature to aid expression and interaction curation [2013-07-23 gw3]"
Associated_with_gene WBGene00001170 // egl-1
Method silencer
SO_term "SO:0000625" // silencer
Defined_by_paper WBPaper00003631
Public_name "TRA-1 binding site silencer"
Most Expr_pattern and Interaction objects will be attached to the 'enhancer/silencer' Features rather than the TF_binding_site Features
Link to Gene Regulation/Regulatory interactions
Two types of gene_regulation can be linked to feature:
- trans-regulation: TF A regulates target B through element C
In this situation, our current interaction model already accommodates this data and links the feature object via:
?Interaction
Interaction_associated_feature ?Feature XREF Associated_with_Interaction //trans-regulation
- cis-regulation: enhancer element C (cis-regulator) cis-regulates gene B
The current interaction model needs to be modified to accommodate this type of data by adding a new tag:
?Interaction
Feature_interactor ?Feature XREF Interacting_feature #Interactor_info //cis-regulation
We will propose a corresponding ?Feature model change so that there is a one-to-one XREF between the two models. The intention is that interactions that explicitly state a sequence feature object as an interactor in a physical or regulatory interaction can refer to a ?Feature object as a "Feature_interactor". Alternatively, when there is less direct evidence or the association is more vague, we would make use of the "Interaction_associated_feature" tag. The XREFs will then link to the appropriate tags in the corresponding objects.
- proposed model change:
?Interaction
Feature_interactor ?Feature XREF Interacting_feature #Interactor_info //cis-regulation
Interaction_associated_feature ?Feature XREF Associated_with_Interaction //trans-regulation
?Feature
Interacting_feature ?Interaction XREF Feature_interactor //cis-regulation
Associated_with_Interaction ?Interaction XREF Interaction_associated_feature //trans-regulation
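As a rough illustration of the proposed tag pairing, the sketch below picks the ?Interaction tag from how direct the evidence is and shows the tag that the XREF would populate on the ?Feature side. The function name `link_lines` and the feature ID `WBsf000001` are hypothetical. In ACeDB the XREF side is filled in automatically, so the second string is shown only to make the pairing visible.

```python
# Pairing taken from the proposed model change: ?Interaction tag -> ?Feature tag.
XREF_TAG = {
    "Feature_interactor": "Interacting_feature",                     # cis-regulation
    "Interaction_associated_feature": "Associated_with_Interaction",  # trans-regulation
}


def link_lines(interaction_id: str, feature_id: str, explicit_interactor: bool):
    """Return (.ace text for the Interaction, the mirrored Feature-side text)."""
    if explicit_interactor:
        tag = "Feature_interactor"
    else:
        tag = "Interaction_associated_feature"
    interaction_ace = f'Interaction : "{interaction_id}"\n{tag} "{feature_id}"'
    feature_ace = f'Feature : "{feature_id}"\n{XREF_TAG[tag]} "{interaction_id}"'
    return interaction_ace, feature_ace


if __name__ == "__main__":
    # WBsf000001 is a made-up feature ID used only for this example.
    for text in link_lines("WBInteraction000501966", "WBsf000001", False):
        print(text, end="\n\n")
```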
Link to Expression pattern
When, and how, do we link sequence features to Expression Pattern objects?
Example 1 -from WBPaper00003631:
"The egl-1 gene appears to be expressed in the HSNs in males." The construct used is [Pegl-1::gfp] transcriptional fusion.
- Curator creates an Expression object for egl-1 in the male's HSN and links it to pegl-1::GFP transgene.
Expr_pattern : "Expr11092"
Anatomy_term "WBbt:0004757" Certain //HSNR
Anatomy_term "WBbt:0004758" Certain //HSNL
Anatomy_term "WBbt:0007850" Certain //male
Gene "WBGene00001170" //egl-1
Pattern "The egl-1 gene appears to be expressed in the HSNs in males, in which the HSNs normally undergo programmed cell death, but not in hermaphrodites, in which the HSNs normally survive."
Reference "WBPaper00003631"
Reporter_gene "[Pegl-1::gfp] transcriptional fusion. To construct Pegl-1::gfp, bases +174 to +5820 (5'-3') downstream of the stop codon of the egl-1 gene and bases -1914 to -837 (5'-3') upstream of the stop codon were amplified with appropriate primers and cloned into the SpeI-ApaI (5'-3') and PstI-BamHI (5'-3') sites of vector pPD95.69, respectively (A. Fire et al., personal communication). --precise ends."
- Sequence curator creates a sequence feature for that object -we are not there yet but we should aim for it.
- In the sequence feature object there will be a link to the expression.
note that in this expression object we have, as per the Expression_pattern model
Expr_pattern Expression_of Gene ?Gene XREF Expr_pattern
- The Expression pattern object in this case is linked to the gene as the authors hypothesize that the transcriptional fusion expression is the endogenous egl-1 expression.
Example 2 from WBPaper00003631 (a hypothetical, made-up example: this specific paper contains no such evidence, but it is a plausible scenario):
"This specific sequence of 80bp is expressed in the HSNL." The construct used is [80bp-egl-1::gfp].
1) One way to go is to link the expression to the sequence rather than to the gene. From the Expr_pattern model:
Expr_pattern Expression_of Gene ?Gene XREF Expr_pattern
                           Sequence ?Sequence XREF Expr_pattern

Expr_pattern : "Expr11093"
Anatomy_term "WBbt:0004758" Certain //HSNL
Sequence "???"
Pattern "This particular sequence::GFP was expressed in HSNL"
Reference "WBPaper00003631"
Reporter_gene "[80bp-egl-1::gfp]. To construct 80bp-egl-1::gfp..."
- Sequence curator creates a sequence feature for that object.
- In the sequence feature object there will be a link to the expression.
- The Expression pattern object in this case is linked to the sequence as the artificial construct might not resemble the endogenous egl-1 expression.
- It will generally be hard to determine the boundary between artificial and endogenous expression if no other experimental evidence (IHC, ISH) is available.
* If we curate the objects this way we should determine how to display them on the site. Separate from other expression objects?
2) Another option would be to include those objects in Gene regulation rather than in expression: that specific sequence is responsible for expression in..
How were these kinds of objects curated in the past? Was it via gene_regulation Cis_regulated_seq?
Although 'Cis_regulated_seq' existed in the old gene_regulation model, it was never used for any object by either Wen or Xiaodong. In the new Interaction model, this tag is gone. --XW
3) A third possibility is to add Drives_expression_in in the feature object
Drives_expression_in Life_stage ?Life_stage
                     Anatomy_term ?Anatomy_term
                     GO_term ?GO_term
This is a favorable way as it will not "contaminate" the expression pattern class and at the same time the info of expression of the enhancer is captured. In REDfly (Regulatory Element Database for Drosophila, http://redfly.ccr.buffalo.edu/) the enhancer region is annotated to the anatomy terms but that expression is not listed under the classic expression patterns. See for example the decapentaplegic gene (dpp) construct dpp_303lacZ.
In the example of Hwang and Sternberg, 2004 (WBPaper00006370), the feature object will be
Feature :
Public_name "lin-3 enhancer"
Sequence F36H1
Description "lin-3 enhancer region, driving anchor cell (AC) specific expression"
Flanking_sequences "ctagaacttcccgtctctccctattcaatg" "cttaccaatgtctcaggcatttttggaaaa"
Mapping_target F36H1
Associated_with_gene WBGene00002992 // lin-3
Species "Caenorhabditis elegans"
Defined_by_paper WBPaper00006370
SO_term SO:0000165 // enhancer
Method enhancer
Associated_with_Interaction WBInteraction000501966 // hlh-2 binds to lin-3
Associated_with_Interaction WBInteraction000520204 // nhr-25 binds to lin-3
Anatomy_term "WBbt:0004522" //Anchor cell
4) We could simply generate an Expr_pattern object and add the Associated_feature ?Feature. For display purposes on the site we can display objects that have Associated_feature in a separate section
Example 3 from WBPaper00003631:
"The egl-1 gene appears to be expressed in the HSNs in males (Pegl-1::GFP reporter)...if tra-1 is bound to egl-1 the expression in HSNs is repressed"
- The region of tra-1 binding to egl-1 is known and 2 sequence features are created for it, one as TF_binding_site and one for silencer.
- A gene regulation object is created -> egl-1 downregulation in HSN.
- The object is added in the silencer sequence feature object.
Should we create an expression object for the tra-1 binding site? In this case we would have to create a negative expression: egl-1 is NOT expressed in HSNs if bound by tra-1. To me this falls under gene regulation -DR
Should we link to the existing expression pattern Expr_pattern : "Expr11092" -see above? This might not be appropriate as Expr11092 depicts expression in male HSNs. If we want to pull out that info we could do it anyway through the gene regulation object -DR
Should we just leave the gene regulation association?
As of now few Expression Patterns are linked to the Genome Browser (Vancouver set is the only data set). The ultimate goal is to map, whenever we can, expression constructs to the genome browser.
Top down approach
We are brainstorming in order to develop a model that will be suitable for accommodating curation of all the above.
The potential model should contain the following info
for Expression
- sequence - the sequence could be any stretch of DNA from few bp to kbs
(?Feature, 1 or more)
- reporter -GFP, RFP, YFP, mCherry, Venus,...
(+ Other: text, including when endogenous gene is used as the (part of, e.g. gfp fused in) reporter)
- gene (the gene immediately downstream of the sequence) non unique because it could be associated to more than one gene
(NOT annotate gene because 1. the base model is about describing the pattern of expression, 2. location information intrinsically informs possible cis-targets, 3. if author asserts relevant genes, that should go in some ?Regulation)
- Reflects_endogenous_expression_of ?Gene #if the author assumes that the expression reflects the endogenous one we populate it, otherwise not
- anatomy term
- life stage
- (sex will be encoded in life stage and anatomy)
- WBPaper
- experimental info?
- other info will be textual
After brainstorming (people involved: Xiaodong, Raymond, Wen, Daniela) we agreed that the current Expr_model can accommodate most of the changes proposed above. The only modification needed is to add
- Reflects_endogenous_expression_of ?Gene #if the authors assume that the expression reflects the endogenous one we populate it, otherwise not
For all the *artificial* constructs we will not populate the tag. Daniela will start curation and see whether everything fits the proposal. If so, she will request a model change.
for Regulation
Next topics: capture regulation, post-transcriptional regulation. Agreement has been reached for gene regulation objects and is summarized in a chapter above.
Sequence Feature Model
?Feature SMap S_parent UNIQUE Sequence UNIQUE ?Sequence XREF Feature_object
         Name Public_name UNIQUE ?Text
              Other_name ?Text
         Sequence_details Flanking_sequences UNIQUE Text UNIQUE Text
                          Mapping_target UNIQUE ?Sequence
                          Source_location UNIQUE Int UNIQUE ?Sequence UNIQUE Int UNIQUE Int UNIQUE #Evidence //source data, <WSversion> ?Sequence pos1 pos2 Evidence(Paper/person etc. remarks)
                          DNA_text UNIQUE ?Text // for storing the sequence of the feature...can use IUPAC codes to be able
                                                // store consensus sequences, e.g. binding site consensus sequence
         Origin Species UNIQUE ?Species //added by pad, as we are moving towards multi species readyness.
                Strain UNIQUE ?Strain //added by pad, as we are moving towards multi strain readyness.
         History Merged_into UNIQUE ?Feature XREF Acquires_merge #Evidence
                 Acquires_merge ?Feature XREF Merged_into #Evidence
                 Deprecated Text #Evidence
         Visible Description ?Text
                 SO_term ?SO_term
         Defined_by Defined_by_sequence ?Sequence XREF Defines_feature #Evidence
                    Defined_by_paper ?Paper XREF Feature #Evidence
                    Defined_by_person ?Person
                    Defined_by_author ?Author
                    Defined_by_analysis ?Analysis Int
         Score Float Text #Evidence // this would be a log score as indicated by the analysis used in gff dump
         Associations Associated_with_gene ?Gene XREF Associated_feature #Evidence // richard
                      Associated_with_CDS ?CDS XREF Associated_feature #Evidence // richard
                      Associated_with_transcript ?Transcript XREF Associated_feature #Evidence // richard
                      Associated_with_pseudogene ?Pseudogene XREF Associated_feature #Evidence // richard
                      Associated_with_transposon ?Transposon XREF Associated_feature #Evidence //richard
                      Associated_with_variation ?Variation XREF Feature #Evidence
                      Associated_with_Position_Matrix ?Position_Matrix XREF Associated_feature #Evidence
                      Associated_with_operon ?Operon XREF Associated_feature #Evidence
                      Associated_with_Interaction ?Interaction XREF Feature_interactor
                      Associated_with_expression_pattern ?Expr_pattern XREF Associated_feature #Evidence
                      Associated_with_Feature ?Feature XREF Associated_with_Feature #Evidence
                      Associated_with_construct ?Construct XREF Sequence_feature
         Bound_by_product_of ?Gene XREF Gene_product_binds #Evidence //pad added this to show what gene it binds
         Transcription_factor UNIQUE ?Transcription_factor XREF Binding_site
         Annotation UNIQUE ?LongText // added for data attribution [030220 dl]
         Confidential_remark ?Text //pad
         Remark ?Text #Evidence
         Method UNIQUE ?Method
OA interface
Tab1
- PGID Pgdbid -- no table -- postgres database ID
- Feature ID -- sqf_name -- ontology on features. In term info displays: publicname othername description dnatext species paper wbgene boundbyproduct trascriptionfactor method analysis and interaction, regulation and expression objects related to that feature
- Public_name -- sqf_publicname -- text
- Other Name -- sqf_othername -- text -- there seems to be no Other_name tag populated in the file, but we can keep the field in case one is ever added
- Description -- sqf_description -- text
- Species -- sqf_species -- dropdown as in the Expression OA; if this causes a problem when parsing the data, keep it as text
- Deprecated -- sqf_deprecated -- text
- Defined_by_paper -- sqf_paper -- Multiontology on papers
- Defined_by_person -- sqf_person -- multiontology on people
- Defined_by_analysis -- sqf_analysis -- text
Tab2
- Method -- sqf_method -- dropdown on the following controlled vocabulary: binding_site, binding_site_region, enhancer, promoter, regulatory_region, TF_binding_site, TF_binding_site_region, silencer (not yet any in geneace), NC_conserved_region (not yet in geneace)
- SO_term -- sqf_soterm -- ontology http://sourceforge.net/p/song/svn/HEAD/tree/trunk/so-xp-simple.obo
- DNA_text -- sqf_dnatext -- text
- Flanking sequence A -- sqf_flanka -- text
- Flanking sequence B -- sqf_flankb -- text
- Mapping_target -- sqf_target -- text
- Sequence -- sqf_sequence -- text
- Merged_into -- sqf_mergedinto -- ontology on sequence features
- Curator -- sqf_curator -- dropdown put Gary Williams WBPerson4025
Tab3
- Gene -- sqf_wbgene -- multiontology on genes tag: Associated_with_gene
- Expression_pattern(readonly) -- sqf_exprpattern -- multiontology on Expr_pattern (ExprID, exp_name in the expr_pattern table) tag: Associated_with_expression_pattern
- Interaction(readonly) -- sqf_intid -- multiontology on interactions (WBInteractionxxxxxxxxx) tag: Associated_with_Interaction
- CDS -- sqf_cds -- multiontology on sequences(?) as Trans Regulated Seq field in tab 3 in genereg OA tag: Associated_with_CDS <- Daniela, should this be text ?
- Operon -- sqf_operon -- text tag: Associated_with_operon
- Construct(readonly) -- sqf_construct -- multiontology on constructs like in tab 2 for expr_pattern OA
- Bound By Product Of -- sqf_boundbyproduct -- multiontology on genes
- Transcription Factor -- sqf_trascriptionfactor -- text
- Confidential_remark -- sqf_confidential -- text
- Remark -- sqf_remark -- text
- Score -- sqf_score -- text
Some unused tags, do not generate anything for now -we will put in place a check that will throw an error if an unused tag starts to get populated:
- Transcript -?Transcript (multiontology), sf_transcript, Dumps as Associated_with_transcript
- Pseudogene -?Pseudogene (multiontology), sf_pseudogene, Dumps as Associated_with_pseudogene
- Transposon -?Transposon (multiontology), sf_transposon, Dumps as Associated_with_transposon
- Variation -?Variation (multiontology), sf_variation, Dumps as Associated_with_variation
- Position_Matrix -?Position_Matrix (multiontology), sf_pwm, Dumps as Associated_with_Position_Matrix
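The check mentioned above for unused tags could look roughly like the sketch below. It is an assumption about how such a check might be written, not the check that is actually in place: the function name `check_unused_tags` is hypothetical, and the tag names are the "Dumps as" names from the list above.

```python
# Tags that are currently unused; populating any of them should raise an error.
UNUSED_TAGS = {
    "Associated_with_transcript",
    "Associated_with_pseudogene",
    "Associated_with_transposon",
    "Associated_with_variation",
    "Associated_with_Position_Matrix",
}


def check_unused_tags(ace_text: str) -> None:
    """Raise ValueError if any line of a .ace dump starts with an unused tag."""
    offending = []
    for line in ace_text.splitlines():
        parts = line.split(None, 1)  # first whitespace-delimited token is the tag
        if parts and parts[0] in UNUSED_TAGS:
            offending.append(line.strip())
    if offending:
        raise ValueError("unused tag populated: " + "; ".join(offending))
```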
GeneAce link
ftp://ftp.sanger.ac.uk/pub2/wormbase/STAFF/mh6/nightly_geneace/
Parsing and Alerting script
the script is here: /home/postgres/work/pgpopulation/sqf_sequencefeature/populate_from_geneace/parse_seqfeat.pl
cronjob that runs at 8pm daily
If the parser encounters the Associated_with_expression or associated_with_interaction fields it will ignore them. We changed this in Dec 2014 because these associations are no longer dumped on the Hinxton side; they now come through the Expression and regulation objects at Caltech. The script works this way:
It reads the existing name and paper data from postgres (SELECT * FROM sqf_name; SELECT * FROM sqf_paper) so it can tell whether there are new names or changed papers.
We get the new file from ftp://ftp.sanger.ac.uk/pub/wormbase/STAFF/mh6/nightly_geneace/features.ace.gz
We read the previous file from /home/postgres/work/pgpopulation/sqf_sequencefeature/prev_features.ace.gz
If the files are the same, nothing happens. If the two are different, we replace the prev_features file with the Sanger one.
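A minimal sketch of the "same file, do nothing; different file, replace" step follows. It is written in Python for illustration and is not taken from parse_seqfeat.pl. The function name `refresh_previous_dump` is hypothetical, and the comparison is byte-for-byte, which is an assumption: two gzip files with identical contents can still differ in their headers.

```python
import filecmp
import os
import shutil


def refresh_previous_dump(new_path: str, prev_path: str) -> bool:
    """Return True when the new dump differs and prev_path was replaced."""
    if os.path.exists(prev_path) and filecmp.cmp(new_path, prev_path, shallow=False):
        return False  # identical files: nothing to do
    # A real pipeline would need to read the old entries before this point,
    # since they are compared entry by entry with the new ones afterwards.
    shutil.copyfile(new_path, prev_path)
    return True
```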
$goodMethods{"binding_site"}++;
$goodMethods{"binding_site_region"}++;
$goodMethods{"DNAseI_hypersensitive_site"}++;
$goodMethods{"enhancer"}++;
$goodMethods{"histone_binding_site_region"}++;
$goodMethods{"promoter"}++;
$goodMethods{"regulatory_region"}++;
$goodMethods{"TF_binding_site"}++;
$goodMethods{"TF_binding_site_region"}++;
$goodMethods{"history_feature"}++;
mapping of .ace tags to sqf tables:
$tagToField{"Public_name"} = 'publicname';
$tagToField{"Other_name"} = 'othername';
$tagToField{"Description"} = 'description';
$tagToField{"Species"} = 'species';
$tagToField{"Deprecated"} = 'deprecated';
$tagToField{"Defined_by_paper"} = 'paper';
$tagToField{"Defined_by_person"} = 'person';
$tagToField{"Defined_by_analysis"} = 'analysis';
$tagToField{"Method"} = 'method';
$tagToField{"SO_term"} = 'soterm';
$tagToField{"DNA_text"} = 'dnatext';
$tagToField{"Flanking_sequences"} = 'flanka';
$tagToField{"Mapping_target"} = 'target';
$tagToField{"Sequence"} = 'sequence';
$tagToField{"Associated_with_gene"} = 'wbgene';
$tagToField{"Associated_with_CDS"} = 'cds';
$tagToField{"Associated_with_operon"} = 'operon';
$tagToField{"Construct"} = 'construct';
$tagToField{"Bound_by_product_of"} = 'boundbyproduct';
$tagToField{"Transcription_factor"} = 'trascriptionfactor ';
$tagToField{"Confidential_remark"} = 'confidential';
$tagToField{"Remark"} = 'remark';
$tagToField{"Score"} = 'score';
$tagToField{"Merged_into"} = 'mergedinto';
We ignore Acquires_merge ($tagToIgnore{"Acquires_merge"}++;) because we already have Merged_into.
these fields are the multiontology:
$isMulti{"paper"}++;
$isMulti{"person"}++;
$isMulti{"wbgene"}++;
$isMulti{"exprpattern"}++;
$isMulti{"intid"}++;
$isMulti{"cds"}++;
$isMulti{"construct"}++;
$isMulti{"boundbyproduct"}++;
We take the prev_features file and break it up into one .ace entry per object, and we do the same for the Sanger file. We check the Method: if it is not a good method we skip the object.
Then we check that the object has a WBsfID, otherwise we skip the object. If there was data for that object in the previous file and the two entries are the same, we skip it as well.
At this point we have only the .ace entries that differ from the previous .ace entries. If the entry already existed in postgres, we get its pgid and add it to the list of pgids whose data needs to be removed from the sqf tables. If the entry is completely new, we add it to the list of objects to be e-mailed to DR and XW and we get a new pgid.
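The splitting and filtering steps above can be sketched as follows. This is an illustrative Python reading of the prose, not the Perl script itself. It assumes that .ace entries are separated by blank lines, that a WBsfID is recognised by the `WBsf` prefix, and that the Method appears on its own line. The names `split_entries`, `method_of` and `changed_entries` are hypothetical.

```python
import re
from typing import Dict, Optional

# Same list as %goodMethods in the script.
GOOD_METHODS = {
    "binding_site", "binding_site_region", "DNAseI_hypersensitive_site",
    "enhancer", "histone_binding_site_region", "promoter", "regulatory_region",
    "TF_binding_site", "TF_binding_site_region", "history_feature",
}


def split_entries(ace_text: str) -> Dict[str, str]:
    """Map each Feature ID to the text of its .ace entry."""
    entries = {}
    for block in re.split(r"\n\s*\n", ace_text.strip()):
        block = block.strip()
        match = re.match(r'Feature\s*:\s*"([^"]+)"', block)
        if match:
            entries[match.group(1)] = block
    return entries


def method_of(entry: str) -> Optional[str]:
    """Return the value of the Method tag, or None if the entry has none."""
    match = re.search(r'^Method\s+"?([^"\s]+)"?', entry, re.MULTILINE)
    return match.group(1) if match else None


def changed_entries(prev_text: str, new_text: str) -> Dict[str, str]:
    """Entries with a good Method and a WBsfID that are new or differ from before."""
    prev = split_entries(prev_text)
    changed = {}
    for feature_id, entry in split_entries(new_text).items():
        if method_of(entry) not in GOOD_METHODS:
            continue  # not a method we curate
        if not feature_id.startswith("WBsf"):
            continue  # no WBsfID
        if prev.get(feature_id) == entry:
            continue  # unchanged since the previous dump
        changed[feature_id] = entry
    return changed
```

Whether a changed ID is new or already in postgres (and so needs its old rows removed) would then be decided by looking the ID up in sqf_name.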
We store WBPerson4025 in curator
and the WBsfID in name.
For each line in the .ace entry we get the tag and the data; if the tag is in the list of tags to ignore we skip it:
next if ($tagToIgnore{$tag});
If the tag does not map to a table, the script sends an e-mail saying that there is an invalid tag:
unless ($tagToField{$tag}) { $errorEmail .= "$wbsfid invalid tag $tag : $line\n"; next; }
If the line is a flanking sequence, the script extracts flank A and flank B.
For the ontology objects it extracts only the object IDs, to avoid storing the whole evidence hash. For anything else it stores everything.
We join all the remarks together, separating them with pipes.
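The per-line handling described above can be sketched like this. It is a Python illustration under stated assumptions, not a translation of the Perl. `TAG_TO_FIELD` is only a subset of the mapping, `ONTOLOGY_FIELDS` is a guess at which fields keep just the object ID, the `" | "` separator is an assumption about how the pipes are spaced, and `parse_entry` expects the lines after the `Feature : "..."` header.

```python
import re
from typing import Dict, List, Tuple

TAG_TO_FIELD = {  # subset of %tagToField, for illustration only
    "Public_name": "publicname",
    "Defined_by_paper": "paper",
    "Associated_with_gene": "wbgene",
    "Flanking_sequences": "flanka",
    "Remark": "remark",
    "Method": "method",
}
TAGS_TO_IGNORE = {"Acquires_merge"}
ONTOLOGY_FIELDS = {"paper", "wbgene"}  # assumed: fields that store IDs only


def parse_entry(lines: List[str]) -> Tuple[Dict[str, object], List[str]]:
    """Return (field -> data, error messages) for the tag lines of one entry."""
    data: Dict[str, object] = {}
    errors: List[str] = []
    remarks: List[str] = []
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        tag = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if tag in TAGS_TO_IGNORE:
            continue
        if tag not in TAG_TO_FIELD:
            errors.append(f"invalid tag {tag} : {line}")
            continue
        field = TAG_TO_FIELD[tag]
        if tag == "Flanking_sequences":
            flanks = value.replace('"', "").split()
            if len(flanks) == 2:
                data["flanka"], data["flankb"] = flanks
            else:
                errors.append(f"bad flanking sequences : {line}")
        elif field in ONTOLOGY_FIELDS:
            # Keep only the object ID and drop any trailing evidence.
            match = re.match(r'"?([^"\s]+)"?', value)
            if match:
                data.setdefault(field, []).append(match.group(1))
        elif field == "remark":
            remarks.append(value)
        else:
            data[field] = value
    if remarks:
        data["remark"] = " | ".join(remarks)
    return data, errors
```

The returned error list corresponds to the text that would be accumulated for the error e-mail.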
We add the data to postgres
If the field is a paper and the WBsfID is not a new object, we check whether there was old paper data in postgres; if it has changed, the object is added to the list of changed papers to e-mail to DR/XW.
The script then checks for differences and populates postgres.
my $email = 'draciti@caltech.edu, xdwang@its.caltech.edu';
If there are any new objects it will e-mail you the list of objects. If there are fewer than 100 it lists the objects; if >100 it will send a list.
same for the papers
it will also send an e-mail if there was an error in the parsing
Cronjob
This is the cronjob: 0 20 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/update_obo_oa_ontologies.pl
It runs every night at 8pm.
It calls the following 2 scripts:
- /home/postgres/work/pgpopulation/obo_oa_ontologies/geneace/nightly_geneace.pl -- the one that gets the data and populates the table
- /home/postgres/work/pgpopulation/sqf_sequencefeature/populate_from_geneace/parse_seqfeat.pl