Working Group:Sequence Features

From WormBaseWiki
Revision as of 22:19, 8 September 2014 by Draciti (talk | contribs)

Topics

  • Display of Sequence Features on the website
    • And Transcription factors and Gene_product binds.
  • Streamline the workflow
    • How can we automatically identify Sequence Feature papers?
    • How do papers come in now? SVM? Textpresso string matches?
  • Improving data flow
    • How can we make the data immediately available to all curators?
      • Add Paper and Public_name fields to the Features in the Nameserver?
      • geneace is available for download and is updated every day; a copy is taken by Caltech.
  • Sample papers curation
    • Prepare a paper list from each person
  • Assign meaningful names to the Public_name field, e.g. 'distal enhancer 1' instead of 'DE1'
    • gw3 - I disagree with this. We should use the name used to describe the region in the paper, rather than making up our own names. I have had to go back and re-annotate regions after Xiaodong found problems with my annotations several months after I originally made them, and I really needed to be able to identify unambiguously which region the paper was talking about by matching the Public_name field to the name of the region in the paper. Users may well need to do this too if they are reading the paper and looking at the Features marking up the regions.
    • dr - ok sounds good to me Gary

Pre-Jamboree prep

  • Suggestions for work
    • Curate some (5? 10?) papers from our lists and collate data as follows
      • How long did each paper take to curate
      • How many new objects did you add to WormBase
      • Where in the paper was the information (e.g. supplementary data, figure legends, within text)
      • Did you need to contact the author for more information
      • Did you come across data for other curators
        • gw3 - I don't know in detail what data other curators require. It would be useful to know this so I can summarise data for others.

The Jamboree 15-17th Sept 2014

  • Suggestions for work
    • Discuss results from above
    • Work through some papers which have already been curated and have different data types (e.g. promoter and gene regulation) to identify best curation practices.
    • browsing capabilities through WormMine?


Matters Arising

  • things noted - not necessarily to do with the topic in hand
    • There are many duplicates of Interaction objects in geneace, with or without two leading digits. Sometimes there is only the shorter form.
    • The Features WBsf919641 and WBsf919607 are at the same position. Both are DAF-3 binding sites. WBsf919641 has Method "binding_site"; WBsf919607 has Method "TF_binding_site", which I think is more correct as DAF-3 is a TF. WBsf919641 used the paper WBPaper00004526; WBsf919607 used the paper WBPaper00003384. WBsf919641 is associated with WBGene00202278; WBsf919607 is associated with WBGene00003514 (myo-2). The paper WBPaper00004526 appears to be about a PEB-1 binding site, not a DAF-3 site. I made the Feature WBsf919609 for the PEB-1 site based on WBPaper00004526 in the last round of curation. Something is not right here.
    • The Features WBsf019227 and WBsf038813 are in the same position, but on opposite strands. Both are for a LAG-1 binding site associated with gene lin-11. One is from paper WBPaper00005357, the other from WBPaper00032298. WBsf019227 needs to be retired and merged into WBsf038813.
    • WBsf038819 has a Method of 'regulatory_region', but the Public_name and Description describe it as an Enhancer. Which is it?
    • WBsf919669 is an Enhancer, but it has a tag for a Transcription_factor WBTranscriptionFactor000101. It is far too large to be a binding site for a transcription factor. I think this tag should be removed.
    • 393 out of the 500 TF_binding_site Features have an incorrect SO_term tag. It should be set to: SO:0000235
    • 10 Features with Method = "TF_binding_site" are lacking a Transcription_factor tag.
  • How many Methods are there for Features? enhancer, regulatory_region, TF_binding_site, promoter…?
    • When displaying on JBrowse/GBrowse, do we want to show the Feature IDs (WBsfxxxxx) or the Methods?
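
The two TF_binding_site problems noted above (SO_term should be SO:0000235, and a Transcription_factor tag should be present) could be checked routinely. The following is only a minimal Python sketch, assuming the Features have already been exported as simple dictionaries; the key names (`method`, `so_term`, `transcription_factor`) and the example IDs are hypothetical and do not reflect the actual geneace model.

```python
EXPECTED_SO_TERM = "SO:0000235"  # the SO_term the notes say TF_binding_site Features should carry


def check_tf_binding_site(feature):
    """Return a list of problems for one Feature record (empty list if none apply)."""
    if feature.get("method") != "TF_binding_site":
        return []  # the two checks only apply to TF_binding_site Features
    problems = []
    if feature.get("so_term") != EXPECTED_SO_TERM:
        problems.append("SO_term is not " + EXPECTED_SO_TERM)
    if not feature.get("transcription_factor"):
        problems.append("missing Transcription_factor tag")
    return problems


# Hypothetical records for illustration only; not real WormBase objects.
EXAMPLES = [
    {"id": "example_1", "method": "TF_binding_site",
     "so_term": "SO:0000235", "transcription_factor": "example_TF"},
    {"id": "example_2", "method": "TF_binding_site",
     "so_term": "wrong_term", "transcription_factor": None},
    {"id": "example_3", "method": "enhancer",
     "so_term": None, "transcription_factor": None},
]

for record in EXAMPLES:
    for problem in check_tf_binding_site(record):
        print(record["id"], "-", problem)
```

Features with any other Method are skipped, so the same loop can be run over a full export without reporting enhancers or regulatory regions.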


Duplicated Features

The following are a set of duplicated regulatory Features. I (Gary) am probably the main culprit in not checking to see if there is a pre-existing Feature. How can we avoid this in future?

  • These two need to be merged - I suggest WBsf019227 be retired as it is in the opposite sense to its gene.
    • CHROMOSOME_I TF_binding_site TF_binding_site 10245805 10245811 . - . Feature "WBsf019227" ; TF_ID "WBTranscriptionFactor000101" ; TF_name "LAG-1"
    • CHROMOSOME_I TF_binding_site TF_binding_site 10245805 10245811 . + . Feature "WBsf038813" ; TF_ID "WBTranscriptionFactor000101" ; TF_name "LAG-1"
  • These two need to be merged - I suggest WBsf019221 be retired as it doesn't have the Interaction objects.
    • CHROMOSOME_I TF_binding_site TF_binding_site 10245835 10245841 . + . Feature "WBsf019221" ; TF_ID "WBTranscriptionFactor000101" ; TF_name "LAG-1"
    • CHROMOSOME_I TF_binding_site TF_binding_site 10245835 10245841 . + . Feature "WBsf038814" ; TF_ID "WBTranscriptionFactor000101" ; TF_name "LAG-1"
  • These two need to be merged - no difference between them. The SO_term is wrong - needs to be SO:0000235
    • CHROMOSOME_II TF_binding_site TF_binding_site 10950131 10950138 . + . Feature "WBsf019124" ; TF_ID "WBTranscriptionFactor000014" ; TF_name "SKN-1"
    • CHROMOSOME_II TF_binding_site TF_binding_site 10950131 10950138 . + . Feature "WBsf019126" ; TF_ID "WBTranscriptionFactor000014" ; TF_name "SKN-1"
  • Two papers - one says this is a MEX-1 recognition element and the other says this is a MEX-3 recognition element - are they both right?
    • CHROMOSOME_III binding_site binding_site 4804863 4804877 . - . Feature "WBsf899528"
    • CHROMOSOME_III binding_site binding_site 4804863 4804877 . - . Feature "WBsf899543"
  • Two papers - one says this is a MEX-1 recognition element and the other says this is a MEX-3 recognition element - are they both right?
    • CHROMOSOME_III binding_site binding_site 4804911 4804928 . - . Feature "WBsf899526"
    • CHROMOSOME_III binding_site binding_site 4804911 4804928 . - . Feature "WBsf899542"
  • These are from the same paper - one says it is "PUF-8 recognition element (PRE-1)" and the other says it is "PUF-8 recognition element (PRE-2)". Are they both right?
    • CHROMOSOME_III binding_site binding_site 4805138 4805145 . - . Feature "WBsf899537"
    • CHROMOSOME_III binding_site binding_site 4805138 4805145 . - . Feature "WBsf899538"
  • These are from the same paper - one says it is "TF LIN-1 binding site S11 in pJW5" and the other is "TF LIN-1 binding site S20 in pJW5" - looks like an error to me - Gary to redo this.
    • CHROMOSOME_III TF_binding_site TF_binding_site 7540351 7540361 . - . Feature "WBsf919592" ; TF_ID "WBTranscriptionFactor000135" ; TF_name "LIN-1"
    • CHROMOSOME_III TF_binding_site TF_binding_site 7540351 7540361 . - . Feature "WBsf919594" ; TF_ID "WBTranscriptionFactor000135" ; TF_name "LIN-1"
  • These two need to be merged - I suggest WBsf047654 be retired as it contains less information.
    • CHROMOSOME_III regulatory_region misc_feature 7540972 7540992 . - . Feature "WBsf047654"
    • CHROMOSOME_III regulatory_region misc_feature 7540972 7540992 . - . Feature "WBsf919589"
  • These two need to be merged - they are from different papers. I suggest WBsf047505 be retired as it has an incorrect SO_term - needs to be SO:0000235
    • CHROMOSOME_IV TF_binding_site TF_binding_site 2306593 2306606 . - . Feature "WBsf047478" ; TF_ID "WBTranscriptionFactor000052" ; TF_name "DAF-19"
    • CHROMOSOME_IV TF_binding_site TF_binding_site 2306593 2306606 . - . Feature "WBsf047505" ; TF_ID "WBTranscriptionFactor000052" ; TF_name "DAF-19"
  • These two need to be merged - they are from different papers. I suggest WBsf019088 be retired as it has an incorrect SO_term - needs to be SO:0000235
    • CHROMOSOME_V TF_binding_site TF_binding_site 10671938 10671944 . + . Feature "WBsf019088" ; TF_ID "WBTranscriptionFactor000061" ; TF_name "CEH-22"
    • CHROMOSOME_V TF_binding_site TF_binding_site 10671938 10671944 . + . Feature "WBsf919536" ; TF_ID "WBTranscriptionFactor000061" ; TF_name "CEH-22"
  • These three need to be merged. They are identical. The SO_term is wrong - needs to be SO:0000235
    • CHROMOSOME_V TF_binding_site TF_binding_site 11882353 11882360 . + . Feature "WBsf216760" ; TF_ID "WBTranscriptionFactor000126" ; TF_name "CEH-6"
    • CHROMOSOME_V TF_binding_site TF_binding_site 11882353 11882360 . + . Feature "WBsf216762" ; TF_ID "WBTranscriptionFactor000126" ; TF_name "CEH-6"
    • CHROMOSOME_V TF_binding_site TF_binding_site 11882353 11882360 . + . Feature "WBsf216764" ; TF_ID "WBTranscriptionFactor000126" ; TF_name "CEH-6"
  • These three need to be merged. They are identical. The SO_term is wrong - needs to be SO:0000235
    • CHROMOSOME_X TF_binding_site TF_binding_site 4100019 4100026 . - . Feature "WBsf216754" ; TF_ID "WBTranscriptionFactor000126" ; TF_name "CEH-6"
    • CHROMOSOME_X TF_binding_site TF_binding_site 4100019 4100026 . - . Feature "WBsf216755" ; TF_ID "WBTranscriptionFactor000126" ; TF_name "CEH-6"
  • These two need to be merged. WBsf899549 should be retired as it is incorrect - this is not a TF binding site, it is a regulatory region.
    • CHROMOSOME_X regulatory_region misc_feature 5100882 5100888 . + . Feature "WBsf919622"
    • CHROMOSOME_X TF_binding_site TF_binding_site 5100882 5100888 . + . Feature "WBsf899549"
  • These two need to be merged. WBsf899545 should be retired as it is incorrect - this is not a TF binding site, it is a regulatory region.
    • CHROMOSOME_X regulatory_region misc_feature 5100919 5100925 . + . Feature "WBsf919623"
    • CHROMOSOME_X TF_binding_site TF_binding_site 5100919 5100925 . + . Feature "WBsf899545"
  • These two need to be merged. WBsf899547 should be retired as it is incorrect - this is not a TF binding site, it is a regulatory region.
    • CHROMOSOME_X regulatory_region misc_feature 5100970 5100976 . + . Feature "WBsf919625"
    • CHROMOSOME_X TF_binding_site TF_binding_site 5100970 5100976 . + . Feature "WBsf899547"
  • These two need to be merged. WBsf899548 should be retired as it is incorrect - this is not a TF binding site, it is a regulatory region.
    • CHROMOSOME_X regulatory_region misc_feature 5100985 5100991 . + . Feature "WBsf919626"
    • CHROMOSOME_X TF_binding_site TF_binding_site 5100985 5100991 . + . Feature "WBsf899548"
  • WBsf047482 cites paper WBPaper00025203 - this has now been merged with paper WBPaper00026601 and should really be updated to this in all 11 Features which use it.
  • WBsf042312 cites paper WBPaper00025051
  • These two Features are otherwise nearly identical and should be merged.
    • CHROMOSOME_X TF_binding_site TF_binding_site 8940207 8940220 . - . Feature "WBsf042312" ; TF_ID "WBTranscriptionFactor000052" ; TF_name "DAF-19"
    • CHROMOSOME_X TF_binding_site TF_binding_site 8940207 8940220 . - . Feature "WBsf047482" ; TF_ID "WBTranscriptionFactor000052" ; TF_name "DAF-19"
  • These two need to be merged. WBsf919641 should be retired as it is incorrect - this is a TF binding site, it is not just a binding_site.
    • CHROMOSOME_X binding_site binding_site 12467733 12467737 . + . Feature "WBsf919641"
    • CHROMOSOME_X TF_binding_site TF_binding_site 12467733 12467737 . + . Feature "WBsf919607" ; TF_ID "WBTranscriptionFactor000472" ; TF_name "DAF-3"
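
One possible way to catch duplicates like those listed above before a new Feature is created is to group existing Features by position. The following is a rough Python sketch, not an existing WormBase script: it assumes whitespace-separated lines in the form shown in this list (sequence, Method, type, start, end, score, strand, phase, attributes), and the function names are made up for illustration. Strand is deliberately left out of the grouping key so that pairs on opposite strands are also reported.

```python
import re
from collections import defaultdict

FEATURE_ID = re.compile(r'Feature "([^"]+)"')


def parse_line(line):
    """Split one line into the fields needed for duplicate detection."""
    seqid, method, _ftype, start, end, _score, strand, _phase, attrs = line.split(None, 8)
    match = FEATURE_ID.search(attrs)
    if match is None:
        raise ValueError(f"no Feature ID found in: {line!r}")
    return {"id": match.group(1), "seqid": seqid, "method": method,
            "start": int(start), "end": int(end), "strand": strand}


def find_duplicates(lines):
    """Group Features by (sequence, start, end); keep groups with more than one member."""
    by_position = defaultdict(list)
    for line in lines:
        if line.strip():
            feature = parse_line(line)
            by_position[(feature["seqid"], feature["start"], feature["end"])].append(feature)
    return {pos: group for pos, group in by_position.items() if len(group) > 1}


def describe(group):
    """Note how the members of a duplicate group differ from each other."""
    notes = []
    if len({f["strand"] for f in group}) > 1:
        notes.append("opposite strands")
    if len({f["method"] for f in group}) > 1:
        notes.append("different Methods")
    return notes


# Lines copied from the list above; WBsf019221 is included without its partner.
SAMPLE = [
    'CHROMOSOME_I TF_binding_site TF_binding_site 10245805 10245811 . - . Feature "WBsf019227" ; TF_ID "WBTranscriptionFactor000101" ; TF_name "LAG-1"',
    'CHROMOSOME_I TF_binding_site TF_binding_site 10245805 10245811 . + . Feature "WBsf038813" ; TF_ID "WBTranscriptionFactor000101" ; TF_name "LAG-1"',
    'CHROMOSOME_I TF_binding_site TF_binding_site 10245835 10245841 . + . Feature "WBsf019221" ; TF_ID "WBTranscriptionFactor000101" ; TF_name "LAG-1"',
    'CHROMOSOME_X regulatory_region misc_feature 5100882 5100888 . + . Feature "WBsf919622"',
    'CHROMOSOME_X TF_binding_site TF_binding_site 5100882 5100888 . + . Feature "WBsf899549"',
]

for position, group in find_duplicates(SAMPLE).items():
    print(position, [f["id"] for f in group], describe(group))
```

A check of this kind only reports candidates; whether to merge, and which Feature to retire, still needs a curator's decision as in the notes above.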


Papers to curate

  • I have taken the Sequence Feature Papers from RT and assigned the four of us 10 papers each.
  • Just in case there is any paper here that has already been curated for Interaction, I've added the number of Interaction objects linked to by the paper. If this is not helpful, feel free to remove it.
WBPaper00002925 Daniela Interaction: 8
WBPaper00004568 Daniela Interaction: 2
WBPaper00005842 Daniela
WBPaper00024328 Daniela Interaction: 4
WBPaper00028802 Daniela
WBPaper00028915 Daniela
WBPaper00029140 Daniela
WBPaper00029255 Daniela Interaction: 55
WBPaper00030829 Daniela Interaction: 5
WBPaper00030933 Daniela Interaction: 1
WBPaper00003929 Gary Interaction: 25
                Time: 4 hours
                New objects: none - 6 existing Features were corrected and updated. (WBsf019182, WBsf019181, WBsf019179, WBsf019177, WBsf019178, WBsf019180)
                Location of information: body text and in a figure in the supplemental table of the paper WBPaper00006376
                Other curator data: "lin-41 and lin-42 are negatively regulated by let-7"
                Comments: This is Gary Ruvkun's (Nature 2000) let-7 paper that started the miRNA field!
                Comments: This is being marked as a pair of 'binding_site' Features because this is miRNA not TF binding.
                See also: let-7, lin-41 binding WBPaper00006376 PubMed: 14729570 (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC324419/) - This is the paper that defined the binding sites.
                Comments (XW): curated one new cis-regulation object with Gary's WBsfs

WBPaper00005044 Gary Interaction: 20
                Time: 3 hours
                New objects: made ire-1 binding site Feature (WBsf977530)
                Location of information: body of this paper and WBPaper00005036
                Other curator data: IRE-1 is a stress-activated endonuclease resident in the ER that is conserved in all known eukaryotes. IRE-1 mediated unconventional splicing of an intron from xbp-1 mRNA controls expression of the encoded transcription factor and is required for upregulation of most UPR target genes.
                See also: WBPaper00005036
WBPaper00005971 Gary Interaction: 16 This has already been done by Gary: Feature WBsf718850
WBPaper00024440 Gary Interaction: 1
                Time: 3 hours
                New objects: Made WBsf977532, WBsf977533 M-2 motif/daf-12 binding sites for ceh-22
                New objects: Made WBsf977534, WBsf977535 M-2 motif/daf-12 binding sites for myo-2
                Location of information: body of this paper and Figure Supplemental 3
                Other curator data: "We conclude that regulation of myo-2 and ceh-22 during dauer development depends critically on the M-2 motif."
                Comments: lots of hypothetical binding sites for daf-12 in 90-odd genes, but these have not been experimentally confirmed.
WBPaper00028816 Gary
                Time: 30 mins
                New objects: none
                Location of information: body of this paper
                Other curator data: "recruitment sites are widely distributed along X to bind the DCC and nucleate DCC spreading to X regions lacking recruitment sites"
                Comments: This paper determines the A and B motifs of the Dosage Compensation Complex (DCC). There are hundreds of thousands of sites of varying strength.
WBPaper00028986 Gary Interaction: 16
WBPaper00029181 Gary
WBPaper00029327 Gary Interaction: 33
WBPaper00030849 Gary
WBPaper00031355 Gary Interaction: 5
WBPaper00004181 Mary Ann Interaction: 8 
                No Features. Took 5 mins to read. 
WBPaper00005056 Mary Ann 
                Time: 35 mins
                New objects: 37 - not yet curated.
                Location of information: Mainly in fig. 1, but also in body text.
                Comments: TRTTKRY element in promoter region of T05E11.3, D2096.6, C44H4.1, ZK816.4, ceh-22, 
                          tph-1, M05B5.2, myo-2. Bound by pha-4 
WBPaper00006429 Mary Ann Interaction: 3
                Time: 30 mins
                New objects: 1
                Location of information: In body of text and fig. 3B
                Comments: Alludes to GATA binding sites, but no experimental data. 
WBPaper00024505 Mary Ann
                Time: 
                New objects: 
                Location of information: 
                Comments: 
WBPaper00028849 Mary Ann
                Time: 45 mins
                New objects: 1
                Location of information: In body of text and fig. 3B
                Comments: 
WBPaper00029058 Mary Ann Interaction: 50
WBPaper00029190 Mary Ann
WBPaper00029406 Mary Ann Interaction: 145
WBPaper00030877 Mary Ann Interaction: 8
WBPaper00031471 Mary Ann
WBPaper00004482 Xiaodong Interaction: 2
                Time: two hours
                New objects: request new features for more interaction objects
                Location of information: figure 5A
                Comments: HOX/CEH-20 binding sites in hlh-8 promoter. Features will be used in cis-regulation and physical interaction objects
                Duplication: yes. Only one Interaction object, WBInteraction000001291, is currently associated with the paper
WBPaper00005609 Xiaodong Interaction: 20 This has already been done by Margaret: Feature WBsf019097, WBsf038788, WBsf038789, WBsf038790, WBsf038791, WBsf038793, WBsf038794, WBsf019098, WBsf038792
                Time: 50 mins
                New objects: three interaction objects
                Location of information: In body of text and fig. 5
                Comments: only WBsf038790 and WBsf038791 are useful. Not sure why the other Features exist; they seem to be redundant.
WBPaper00024189 Xiaodong
                Time: 10 mins
                Comments: no features in this paper
WBPaper00024981 Xiaodong
                Time: 40 mins
                New objects: request through RT
                Location of information: figure 2
                Comments: MED-1 binding sites in end-1 and end-3 promoters
WBPaper00028910 Xiaodong Interaction: 7
                Time: 30 mins
                New objects: request through RT
                Location of information: in body text and figure 4
                Comments: MEF-2 direct binding site in str-1 promoter and a few minimal regulatory regions mentioned in body text

WBPaper00029109 Xiaodong

WBPaper00029229 Xiaodong Interaction: 3
WBPaper00030809 Xiaodong
WBPaper00030931 Xiaodong Interaction: 18
WBPaper00031565 Xiaodong Interaction: 1
                Time: 30 mins
                New objects: request through RT
                Location of information: in body text and figure 2 and figure 3
                Comments: PBC-HOX binding sites S1 and S2; cis-regulatory elements E1 and E2

  • I suggest we each curate 5 of our papers before the jamboree, taking notes as described above and on anything else that arises during curation.