Sequence Feature
Flagging papers
Send them to worm-bug@sanger.ac.uk. This is where papers identified by SVMs/pattern matching are sent. We will be moving away from this ticketing system, but for the time being they will all be in the same place.
Rules for marking up regions (from GW)
- If a region is necessary and sufficient to drive a reporter gene, then mark it as an 'enhancer' or 'silencer'.
(I don't think these are the classic definitions for enhancer/silencer, RL)
- If a region is both an enhancer and a silencer, then it should have the SO_term tags for both of these.
- If mobility shift experiments or similar experimental evidence is available to assert that a short region is a TF binding site, then mark it as a TF_binding_site.
- Similarity to a known binding motif is not evidence of being a TF_binding_site.
- If there is no evidence for a TF binding site and it has an effect on expression when mutated or deleted, but is not sufficient to drive a reporter gene, then we cannot assert that it is an enhancer or a TF binding site. Mark it as an anonymous 'regulatory_region'.
- If a region has the properties of being both a TF binding site and an enhancer then mark it up as two Features, one a TF_binding_site and one an enhancer.
- If a region is asserted to be a promoter region in the paper, it is within 200bp (or thereabouts?) of the 5' end of the target gene, and it is necessary and sufficient to promote a reporter gene, mark it as a promoter. If in doubt, consider marking it as an enhancer.
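The rules above can be read as a small decision procedure. The following is only a minimal sketch of that reading, not an agreed curation tool: the `Evidence` class, its field names and `suggest_features` are hypothetical, and splitting "necessary and sufficient" into an activating (`enhancer`) and a repressing (`silencer`) flag is an assumption. Each inner list is one Feature and holds the SO term names that Feature would carry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    # Hypothetical flags a curator might record for one region.
    enhancer: bool = False            # necessary and sufficient, activating
    silencer: bool = False            # necessary and sufficient, repressing
    binding_assay: bool = False       # mobility shift or similar experiment
    motif_similarity: bool = False    # similarity to a known motif only
    affects_expression: bool = False  # effect when mutated or deleted
    promoter_claimed: bool = False    # paper asserts a promoter near the 5' end


def suggest_features(ev: Evidence) -> List[List[str]]:
    """Return one inner list of SO term names per Feature to create."""
    features: List[List[str]] = []
    # motif_similarity is deliberately never consulted: it is not evidence.
    if ev.binding_assay:
        features.append(["TF_binding_site"])
    regulatory = []
    if ev.enhancer:
        regulatory.append("enhancer")
    if ev.silencer:
        regulatory.append("silencer")
    if ev.promoter_claimed and ev.enhancer:
        features.append(["promoter"])
    elif regulatory:
        # Enhancer and silencer together stay on a single Feature.
        features.append(regulatory)
    elif ev.affects_expression and not ev.binding_assay:
        features.append(["regulatory_region"])
    return features


tra1_site = suggest_features(Evidence(binding_assay=True, silencer=True))
motif_only = suggest_features(Evidence(motif_similarity=True))
```

Under these assumptions, a region with binding evidence that also acts as a silencer yields two Features, while motif similarity alone yields none.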
Example for sequence feature curation
the example is from WBPaper00003631
Feature : "egl-1_temp_1.1"
Sequence VF23B12L
Mapping_target VF23B12L
Flanking_sequences cagctcaattattaaattttattgggtattgttta cataaaattctattgtcccagatttaggatacatcg
DNA_text CTCCTAACCGGGTGGTC
Description "This is a TRA-1 binding site that represses egl-1."
Remark "This is the TF_binding_site for TRA-1 which silences egl-1. N.B. a 'silencer' Feature has also been made at this location to aid expression and interaction curation [2013-07-23 gw3]"
Associated_with_gene WBGene00001170 // egl-1
Bound_by_product_of WBGene00006604 // tra-1
Transcription_factor WBTranscriptionFactor000029 // tra-1
Method TF_binding_site
SO_term "SO:0000235" // TF_binding_site
Defined_by_paper WBPaper00003631
Public_name "TRA-1 binding site"

Feature : "egl-1_temp_1.2"
Sequence VF23B12L
Mapping_target VF23B12L
Flanking_sequences cagctcaattattaaattttattgggtattgttta cataaaattctattgtcccagatttaggatacatcg
DNA_text CTCCTAACCGGGTGGTC
Description "This is the silencer of egl-1, containing a single TF_binding_site bound by TRA-1."
Remark "Made this 'silencer' feature in addition to the TRA-1 TF_binding_site Feature to aid expression and interaction curation [2013-07-23 gw3]"
Associated_with_gene WBGene00001170 // egl-1
Method silencer
SO_term "SO:0000625" // silencer
Defined_by_paper WBPaper00003631
Public_name "TRA-1 binding site silencer"
Most Expr_pattern and Interaction objects will be attached to the 'enhancer/silencer' Features rather than the TF_binding_site Features
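Because the TF_binding_site and silencer Features in this example share the same location, the two records could be generated from one set of shared rows. This is a rough sketch only: `to_ace` is a hypothetical helper, not an existing WormBase script, and it simply joins tag and value strings without any validation. All identifiers and sequences are taken from the example above.

```python
def to_ace(obj_class, name, rows):
    """Render one object as .ace text: a header line, then one tag per line."""
    lines = ['%s : "%s"' % (obj_class, name)]
    for tag, *values in rows:
        lines.append(" ".join([tag, *values]))
    return "\n".join(lines) + "\n"


# Rows common to both Features at this location (values from the example).
shared = [
    ("Sequence", "VF23B12L"),
    ("Mapping_target", "VF23B12L"),
    ("Flanking_sequences",
     "cagctcaattattaaattttattgggtattgttta",
     "cataaaattctattgtcccagatttaggatacatcg"),
    ("DNA_text", "CTCCTAACCGGGTGGTC"),
    ("Associated_with_gene", "WBGene00001170"),
    ("Defined_by_paper", "WBPaper00003631"),
]

binding_site = to_ace("Feature", "egl-1_temp_1.1", shared + [
    ("Bound_by_product_of", "WBGene00006604"),
    ("Transcription_factor", "WBTranscriptionFactor000029"),
    ("Method", "TF_binding_site"),
    ("SO_term", '"SO:0000235"'),
])

silencer = to_ace("Feature", "egl-1_temp_1.2", shared + [
    ("Method", "silencer"),
    ("SO_term", '"SO:0000625"'),
])

print(binding_site)
print(silencer)
```

Keeping the location rows in one place means the two Features cannot drift apart if the flanking sequences are later corrected.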
Link to Gene Regulation/Regulatory interactions
Two types of gene_regulation can be linked to feature:
- trans-regulation: TF A regulates target B through element C
In this situation, our current Interaction model already accommodates the data and links the Feature object via:
?Interaction
Interaction_associated_feature ?Feature XREF Associated_with_Interaction //trans-regulation
- cis-regulation: enhancer element C (cis-regulator) cis-regulates gene B
The current Interaction model needs to be modified to accommodate this type of data by adding a new tag:
?Interaction
Feature_interactor ?Feature XREF Interacting_feature #Interactor_info //cis-regulation
We will propose a corresponding Feature model change to have one-to-one XREFs between the models. The intention
is that interactions that explicitly state a sequence feature object as an
interactor in a physical or regulatory interaction can refer to a ?Feature
object as a "Feature_interactor". Alternatively, when there is less direct
evidence or the association is more vague, we would make use of the
"Interaction_associated_feature" tag. The XREFs will then link to the
appropriate tags in the corresponding objects.
- proposed model change:
?Interaction
Feature_interactor ?Feature XREF Interacting_feature #Interactor_info //cis-regulation
Interaction_associated_feature ?Feature XREF Associated_with_Interaction //trans-regulation
?Feature
Interacting_feature ?Interaction XREF Feature_interactor //cis-regulation
Associated_with_Interaction ?Interaction XREF Interaction_associated_feature //trans-regulation
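To illustrate what the one-to-one XREFs in the proposed model change imply, here is a toy in-memory sketch in which setting a tag on one object automatically sets the paired tag on the other. It does not reproduce how ACeDB implements XREF; the `Store` class and the Feature name `WBsf_example` are hypothetical, while the tag names come from the proposal above.

```python
from collections import defaultdict

# (class, tag) -> (target class, back tag), taken from the proposed model change.
XREF = {
    ("Interaction", "Feature_interactor"): ("Feature", "Interacting_feature"),
    ("Interaction", "Interaction_associated_feature"): ("Feature", "Associated_with_Interaction"),
}
# Add the reverse direction so a link can be made from either side.
XREF.update({target: source for source, target in list(XREF.items())})


class Store:
    """Toy object store: objects[(class, name)][tag] is a set of linked names."""

    def __init__(self):
        self.objects = defaultdict(lambda: defaultdict(set))

    def link(self, cls, name, tag, target):
        target_cls, back_tag = XREF[(cls, tag)]
        self.objects[(cls, name)][tag].add(target)
        # The XREF: the paired tag is filled in on the target object.
        self.objects[(target_cls, target)][back_tag].add(name)


store = Store()
# cis-regulation: the Feature itself is an interactor.
store.link("Interaction", "WBInteraction000501966", "Feature_interactor", "WBsf_example")
# trans-regulation, entered from the Feature side this time.
store.link("Feature", "WBsf_example", "Associated_with_Interaction", "WBInteraction000520204")
```

Either entry point leaves both objects pointing at each other, which is the behaviour the paired tags are meant to guarantee.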
Link to Expression pattern
When and how do we link sequence features to Expression Pattern objects?
Example 1 -from WBPaper00003631:
"The egl-1 gene appears to be expressed in the HSNs in males." The construct used is [Pegl-1::gfp] transcriptional fusion.
- Curator creates an Expression object for egl-1 in the male's HSN and links it to pegl-1::GFP transgene.
Expr_pattern : "Expr11092"
Anatomy_term "WBbt:0004757" Certain //HSNR
Anatomy_term "WBbt:0004758" Certain //HSNL
Anatomy_term "WBbt:0007850" Certain //male
Gene "WBGene00001170" //egl-1
Pattern "The egl-1 gene appears to be expressed in the HSNs in males, in which the HSNs normally undergo programmed cell death, but not in hermaphrodites, in which the HSNs normally survive."
Reference "WBPaper00003631"
Reporter_gene "[Pegl-1::gfp] transcriptional fusion. To construct Pegl-1::gfp, bases +174 to +5820 (5'-3') downstream of the stop codon of the egl-1 gene and bases -1914 to -837 (5'-3') upstream of the stop codon were amplified with appropriate primers and cloned into the SpeI-ApaI (5'-3') and PstI-BamHI (5'-3') sites of vector pPD95.69, respectively (A. Fire et al., personal communication). --precise ends."
- Sequence curator creates a sequence feature for that object -we are not there yet but we should aim for it.
- In the sequence feature object there will be a link to the expression.
note that in this expression object we have, as per the Expression_pattern model
Expr_pattern Expression_of Gene ?Gene XREF Expr_pattern
- The Expression pattern object in this case is linked to the gene as the authors hypothesize that the transcriptional fusion expression is the endogenous egl-1 expression.
Example 2 from WBPaper00003631 (a hypothetical, made-up example: this specific paper contains no such evidence, but the scenario might arise):
"This specific sequence of 80bp is expressed in the HSNL." The construct used is [80bp-egl-1::gfp].
1) One option is to link the expression to the sequence rather than to the gene. From the Expr_pattern model:
Expr_pattern Expression_of Gene ?Gene XREF Expr_pattern
                           Sequence ?Sequence XREF Expr_pattern

Expr_pattern : "Expr11093"
Anatomy_term "WBbt:0004758" Certain //HSNL
Sequence "???"
Pattern "This particular sequence::GFP was expressed in HSNL"
Reference "WBPaper00003631"
Reporter_gene "[80bp-egl-1::gfp]. To construct 80bp-egl-1::gfp..."
- Sequence curator creates a sequence feature for that object.
- In the sequence feature object there will be a link to the expression.
- The Expression pattern object in this case is linked to the sequence as the artificial construct might not resemble the endogenous egl-1 expression.
- It will generally be hard to determine the boundary between artificial and endogenous expression if no other experimental evidence (e.g. IHC, ISH) is available.
* If we curate the objects this way we should determine how to display them on the site. Separate from other expression objects?
2) Another option would be to include those objects in Gene regulation rather than in expression: that specific sequence is responsible for expression in...
How were these kinds of objects curated in the past? Was it via gene_regulation Cis_regulated_seq?
Although 'Cis_regulated_seq' existed in the old gene_regulation model, it was never used for any objects by either Wen or Xiaodong. In the new Interaction model, this tag is gone. --XW
3) A third possibility is to add Drives_expression_in in the feature object
Drives_expression_in Life_stage ?Life_stage
                     Anatomy_term ?Anatomy_term
                     GO_term ?GO_term
This is a favorable way as it will not "contaminate" the expression pattern class and at the same time the info of expression of the enhancer is captured. In REDfly (Regulatory Element Database for Drosophila, http://redfly.ccr.buffalo.edu/) the enhancer region is annotated to the anatomy terms but that expression is not listed under the classic expression patterns. See for example the decapentaplegic gene (dpp) construct dpp_303lacZ.
In the example of Hwang and Sternberg, 2004 (WBPaper00006370), the feature object will be
Feature :
Public_name "lin-3 enhancer"
Sequence F36H1
Description "lin-3 enhancer region, driving anchor cell (AC) specific expression"
Flanking_sequences "ctagaacttcccgtctctccctattcaatg" "cttaccaatgtctcaggcatttttggaaaa"
Mapping_target F36H1
Associated_with_gene WBGene00002992 // lin-3
Species "Caenorhabditis elegans"
Defined_by_paper WBPaper00006370
SO_term SO:0000165 // enhancer
Method enhancer
Associated_with_Interaction WBInteraction000501966 // hlh-2 binds to lin-3
Associated_with_Interaction WBInteraction000520204 // nhr-25 binds to lin-3
Anatomy_term "WBbt:0004522" //Anchor cell
4) We could simply generate an Expr_pattern object and add the Associated_feature ?Feature. For display purposes on the site we can display objects that have Associated_feature in a separate section
Example 3 from WBPaper00003631:
"The egl-1 gene appears to be expressed in the HSNs in males (Pegl-1::GFP reporter)...if tra-1 is bound to egl-1 the expression in HSNs is repressed"
- The region of tra-1 binding to egl-1 is known and 2 sequence features are created for it, one as TF_binding_site and one for silencer.
- A gene regulation object is created -> egl-1 downregulation in HSN.
- The object is added in the silencer sequence feature object.
Should we create an expression object for the tra-1 binding site? In this case we would have to create a negative expression: egl-1 is NOT expressed in HSNs if bound by tra-1. This falls under gene regulation to me -DR
Should we link to the existing expression pattern Expr_pattern : "Expr11092" -see above? This might not be appropriate as Expr11092 depicts expression in male HSNs. If we want to pull out that info we could do it anyway through the gene regulation object -DR
Should we just leave the gene regulation association?
As of now few Expression Patterns are linked to the Genome Browser (Vancouver set is the only data set). The ultimate goal is to map, whenever we can, expression constructs to the genome browser.
Top down approach
We are brainstorming in order to develop a model that will be suitable for accommodating curation of all the above.
The potential model should contain the following info
for Expression
- sequence - the sequence could be any stretch of DNA from few bp to kbs
(?Feature, 1 or more)
- reporter -GFP, RFP, YFP, mCherry, Venus,...
(+ Other: text, including when endogenous gene is used as the (part of, e.g. gfp fused in) reporter)
- gene (the gene immediately downstream of the sequence) non unique because it could be associated to more than one gene
(NOT annotate gene because 1. the base model is about describing the pattern of expression, 2. location information intrinsically informs possible cis-targets, 3. if author asserts relevant genes, that should go in some ?Regulation)
- Reflects_endogenous_expression_of ?Gene #populated only if the authors assume that the expression reflects the endogenous one
- anatomy term
- life stage
- (sex will be encoded in life stage and anatomy)
- WBPaper
- experimental info?
- other info will be textual
After brainstorming (people involved: Xiaodong, Raymond, Wen, Daniela) we agreed that the current Expr_model can accommodate most of the changes proposed above. The only modification needed is to add the tag
- Reflects_endogenous_expression_of ?Gene #populated only if the authors assume that the expression reflects the endogenous one
For all the *artificial* constructs we will not populate the tag. Daniela will start curation and see whether everything fits with the proposal. If so, she will request a model change.
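The agreed rule for the new tag could be sketched as follows. This is a hypothetical illustration only: `expr_pattern_rows` and its parameters are invented names, and Reflects_endogenous_expression_of is still a proposed tag rather than part of the current model.

```python
def expr_pattern_rows(reference, anatomy_terms, gene=None, reflects_endogenous=False):
    """Build (tag, value, ...) rows for an Expr_pattern object.

    The proposed tag is written only when a gene is given AND the authors
    assume the reporter reflects endogenous expression of that gene.
    """
    rows = [("Reference", '"%s"' % reference)]
    for term in anatomy_terms:
        rows.append(("Anatomy_term", '"%s"' % term, "Certain"))
    if gene is not None and reflects_endogenous:
        rows.append(("Reflects_endogenous_expression_of", '"%s"' % gene))
    return rows


# Example 1 above: authors treat Pegl-1::gfp as reflecting endogenous egl-1.
endogenous = expr_pattern_rows(
    "WBPaper00003631", ["WBbt:0004757", "WBbt:0004758"],
    gene="WBGene00001170", reflects_endogenous=True,
)
# Artificial construct (the hypothetical 80bp case): the tag is left out.
artificial = expr_pattern_rows("WBPaper00003631", ["WBbt:0004758"])
```

Leaving the tag empty for artificial constructs means a display layer can separate the two kinds of object by checking for the tag alone.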
for Regulation
Next topics: capture regulation, post-transcriptional regulation. Agreement has been reached for gene regulation objects and is summarized in a section above.
Sequence Feature Model
?Feature SMap S_parent UNIQUE Sequence UNIQUE ?Sequence XREF Feature_object
         Name Public_name UNIQUE ?Text
              Other_name ?Text
         Sequence_details Flanking_sequences UNIQUE Text UNIQUE Text
                          Mapping_target UNIQUE ?Sequence
                          Source_location UNIQUE Int UNIQUE ?Sequence UNIQUE Int UNIQUE Int UNIQUE #Evidence //source data, <WSversion> ?Sequence pos1 pos2 Evidence(Paper/person etc. remarks)
                          DNA_text UNIQUE ?Text // for storing the sequence of the feature...can use IUPAC codes to be able
                                                // store consensus sequences, e.g. binding site consensus sequence
         Origin Species UNIQUE ?Species //added by pad, as we are moving towards multi species readyness.
                Strain UNIQUE ?Strain //added by pad, as we are moving towards multi strain readyness.
         History Merged_into UNIQUE ?Feature XREF Acquires_merge #Evidence
                 Acquires_merge ?Feature XREF Merged_into #Evidence
                 Deprecated Text #Evidence
         Visible Description ?Text
                 SO_term ?SO_term
         Defined_by Defined_by_sequence ?Sequence XREF Defines_feature #Evidence
                    Defined_by_paper ?Paper XREF Feature #Evidence
                    Defined_by_person ?Person
                    Defined_by_author ?Author
                    Defined_by_analysis ?Analysis Int
         Score Float Text #Evidence // this would be a log score as indicated by the analysis used in gff dump
         Associations Associated_with_gene ?Gene XREF Associated_feature #Evidence // richard
                      Associated_with_CDS ?CDS XREF Associated_feature #Evidence // richard
                      Associated_with_transcript ?Transcript XREF Associated_feature #Evidence // richard
                      Associated_with_pseudogene ?Pseudogene XREF Associated_feature #Evidence // richard
                      Associated_with_transposon ?Transposon XREF Associated_feature #Evidence //richard
                      Associated_with_variation ?Variation XREF Feature #Evidence
                      Associated_with_Position_Matrix ?Position_Matrix XREF Associated_feature #Evidence
                      Associated_with_operon ?Operon XREF Associated_feature #Evidence
                      Associated_with_Interaction ?Interaction XREF Feature_interactor
                      Associated_with_expression_pattern ?Expr_pattern XREF Associated_feature #Evidence
                      Associated_with_Feature ?Feature XREF Associated_with_Feature #Evidence
                      Associated_with_construct ?Construct XREF Sequence_feature
         Bound_by_product_of ?Gene XREF Gene_product_binds #Evidence //pad added this to show what gene it binds
         Transcription_factor UNIQUE ?Transcription_factor XREF Binding_site
         Annotation UNIQUE ?LongText // added for data attribution [030220 dl]
         Confidential_remark ?Text //pad
         Remark ?Text #Evidence
         Method UNIQUE ?Method
OA interface
Tab1
- PGID Pgdbid -- no table -- postgres database ID
- Feature ID -- sf_name -- multiontology on Features (WBsfIDs)
- Public_name -- sf_publicname -- text
- Other Name -- sf_othername -- text -- there seem to be no tags with othername in the file, but we can keep it in case one is ever added
- Description -- sf_description -- text
DNA_text Defined_by_person Defined_by_paper Mapping_target Associated_with_expression_pattern Deprecated
Species Associated_with_CDS Flanking_sequences Method Confidential_remark Bound_by_product_of Transcription_factor Associated_with_operon SO_term Score Defined_by_analysis Associated_with_gene Remark
Associated_with_Interaction Sequence
Tab1
- Description, text, Dumps as Description
- Curator - (Dropdown) sf_curator Dumps as: N/A
- Paper - (Multiontology) sf_paper Dumps as: Defined_by_paper <Paper>
- Species
- Strain
- Merged_into
- Acquires_merge
- Deprecated, text
- Author sf_author, Dumps as : Defined_by_author
- Not sure about the rest of 'Defined_by' tags (person, analysis, sequence)
Tab2
- S-parent
- Flanking sequences
- Mapping target
- Source location
- SO terms
- Methods, text, (Dropdown), sf_method, Dumps as Method
- Sequence, Dumps as Defined_by_sequence?
Tab3
- Gene - ?Gene (multiontology), sf_gene, Dumps as Associated_with_gene
- CDS - ?CDS (multiontology), sf_CDS, Dumps as Associated_with_CDS
- Transcript -?Transcript (multiontology), sf_transcript, Dumps as Associated_with_transcript
- Pseudogene -?Pseudogene (multiontology), sf_pseudogene, Dumps as Associated_with_pseudogene
- Transposon -?Transposon (multiontology), sf_transposon, Dumps as Associated_with_transposon
- Variation -?Variation (multiontology), sf_variation, Dumps as Associated_with_variation
- Position_Matrix -?Position_Matrix (multiontology), sf_pwm, Dumps as Associated_with_Position_Matrix
- Operon -?Operon (multiontology), sf_operon, Dumps as Associated_with_operon
- Interaction -?Interaction (multiontology), sf_interaction, Dumps as Associated_with_Interaction
- Expression -?Expr_pattern (multiontology), sf_expr, Dumps as Associated_with_expression_pattern
- Construct - ?Construct (multiontology), sf_construct, Dumps as Associated_with_construct
- Bound By Product Of -?Gene (multiontology), sf_bound_by_product, Dumps as Bound_by_product_of
- Transcription Factor -?Transcription_factor (multiontology), Dumps as Transcription_factor
- Remark, text, sf_remark, Dumps as Remark
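The "Dumps as" column above amounts to a lookup from OA table name to .ace tag. A minimal sketch of a dumper driven by such a lookup is shown below; it is not the actual OA dumper. `DUMP_TAGS` covers only a subset of the fields listed, `dump_feature` and the ID `WBsf_example` are hypothetical, and quoting every value is an assumption.

```python
# OA table name -> .ace tag, copied from a subset of the tab listings above.
DUMP_TAGS = {
    "sf_publicname": "Public_name",
    "sf_paper": "Defined_by_paper",
    "sf_author": "Defined_by_author",
    "sf_method": "Method",
    "sf_gene": "Associated_with_gene",
    "sf_interaction": "Associated_with_Interaction",
    "sf_expr": "Associated_with_expression_pattern",
    "sf_remark": "Remark",
    "sf_curator": None,  # listed as "Dumps as: N/A", so never written
}


def dump_feature(feature_id, row):
    """Render one OA row (table name -> value or list of values) as .ace text."""
    lines = ['Feature : "%s"' % feature_id]
    for table, values in row.items():
        tag = DUMP_TAGS.get(table)
        if tag is None:
            continue  # unmapped or non-dumping field
        if isinstance(values, str):
            values = [values]
        for value in values:
            lines.append('%s "%s"' % (tag, value))
    return "\n".join(lines) + "\n"


ace = dump_feature("WBsf_example", {
    "sf_publicname": "lin-3 enhancer",
    "sf_curator": "some curator",
    "sf_paper": "WBPaper00006370",
    "sf_method": "enhancer",
    "sf_gene": ["WBGene00002992"],
    "sf_interaction": ["WBInteraction000501966", "WBInteraction000520204"],
})
print(ace)
```

Multiontology fields are passed as lists and produce one line per value, while the curator field is silently skipped.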