CCC Form 2.0 Specifications

From WormBaseWiki

This page documents specifications for the next version of the Textpresso for Cellular Component Curation (CCC) tool. The changes to the tool and pipeline were suggested by curators and are also part of the broader plan for Textpresso-based curation pipelines and the GO's Common Annotation Framework.


Textpresso search specifications

  • Frequency of searches - this will vary by group (e.g., weekly for dictyBase)
  • Corpus - this will also vary by group
  • Categories - gene/protein name, CCC, assay term, verb
  • Filtering (Textpresso)
  1. Journal
  2. Date
  3. Document IDs
  • Filtering (non-Textpresso)
  1. SVM
  2. Gene Ontology Gene Association File
  • Ranking search results - e.g., highest scoring papers presented first
  • Naming search results file
  • Storing search histories
  1. Recording versions of pdf2text conversion
  2. Recording version of categories used
  3. Recording search criteria, i.e. categories, corpus, filters
  4. Recording curator or group and date of search

Curation form

Overall Workflow

  • Curator will login
  • Directed to a blank annotation page with drop down menu for Textpresso sentence file selection
  • Curators will select a sentence file for annotation
  • If a completely new sentences file, curators can begin to create annotations and classify sentences (if desired)
  • If a partially annotated file, curators can modify existing annotations (still thinking about this possibility) or continue to add new annotations and classify new sentences (if desired) - but curators can still see all previous work (this is a fundamental difference between the new and old forms)
  • Any new annotations can be sent to Protein2GO annotation tool when curator is ready - not all curators using the form are also using Protein2GO (WB and dictyBase will be, TAIR not yet), so submitting to Protein2GO would have to be an option that the curator consciously selected
  • Searches can also be performed using gene names, paper IDs, component terms in sentences, and GO terms - searches could be performed within a file or across all results files, if possible. For example, if a curator was interested in viewing all of the returned sentences that mentioned 'PAR-3', they could do that.

Detailed Workflow

  • The form presents sentences for annotation - some sentences will lead to a GO annotation, some will not
  • Not all sentences will be classified (although it'd be great if they were)
    • Does that mean that we need a way to permanently skip sentences, or can they be set as "not classified" to skip it (thus classifying it)?
      • Sentences without a classification could be stored as 'unclassified'. We could add 'unclassified' as a sentence classification check box option?

Data to be Stored

  • What data to store for a GO annotation
    • What is unique about an annotation? - File, Paper, Sentence, Gene product, GO term, GO evidence code, Curator, Date
  1. Name of search results file
  2. Paper identifier - this will come from the Textpresso sentences file and will be a PMID or DOI; hopefully we won't get the same paper as different instances with different IDs - I don't think this would happen, since DOIs would likely only be used if no PMID was available.
  3. Gene/gene product identifier - this will be derived from the match between the gene product in the sentence and the information in the 'gpi' file that maps a gene product name to a database identifier
  4. Textpresso component term - column 2 in current form
  5. GO component term - column 3 in current form
  6. Evidence code - previously all annotations via the CCC form used the IDA evidence code; the new form will also allow for the IPI evidence code.
  7. Qualifier
  8. With string
  9. Annotation extension
  10. Sentence ID
  11. Sentence classification
  12. Curator
  13. Annotation date - recorded upon pressing submit button
  14. Annotation history - this would be a way to track how often an annotation or classification was touched and by whom, akin to the curator history field on the concise description OA. It is sometimes helpful to be able to see the curation history for an annotated object, particularly if there is ever a question about an annotation. If the history could be displayed as part of the annotation when it is displayed, that would be fine.
  • What data to store if no GO annotation:
  1. Name of search results file
  2. Paper identifier
  3. Sentence ID
  4. Sentence classification
  5. Curator
  6. Annotation date
  7. Annotation history
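
The question above about what makes an annotation unique can be made concrete with a small sketch. This is only an illustration: the class name, field names, and helper are hypothetical and do not reflect the form's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    # Hypothetical field names that mirror the list above, not a real table.
    source_file: str
    paper_id: str
    sentence_id: int
    gene_product_id: str
    go_id: str
    evidence_code: str
    curator: str
    annotation_date: str
    qualifier: Optional[str] = None
    with_string: Optional[str] = None
    annotation_extension: Optional[str] = None

    def unique_key(self) -> Tuple:
        """Fields proposed above as making an annotation unique."""
        return (self.source_file, self.paper_id, self.sentence_id,
                self.gene_product_id, self.go_id, self.evidence_code,
                self.curator, self.annotation_date)


def find_duplicates(annotations):
    """Return annotations whose unique key has already been seen."""
    seen, duplicates = set(), []
    for annotation in annotations:
        key = annotation.unique_key()
        if key in seen:
            duplicates.append(annotation)
        seen.add(key)
    return duplicates
```

With this choice of key, two annotations that differ only in qualifier, With string, or annotation extension would count as duplicates; whether that is the desired behavior is still an open question.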

Curation Workflow

Detailed curation workflows and search scenarios

  • Curator login
    • Curator login would be used to determine which files would be available for annotation; I don't know which is easier: 1) a separate login page for each MOD, or 2) a mapping of curator to files
      • Mapping of curators to files:
        • dictyBase - Petra Fey and Robert Dodson
        • TAIR - Tanya Berardini and Donghui Li
        • WormBase - Kimberly Van Auken, Ranjana Kishore (others?)
  • Import of search results files
    • Can this be automated? dictyBase searches will be run weekly, can the results files be automatically transferred to tazendra?
      • If textpresso does this in a way where the output is always in a specific public_html/ directory with a sensible filename structure, then we can have a cronjob pick them up.
      • File names will need to follow the format: date_modname_some_meaningful_text
  • Organization of search results file
    • If we have many results files, perhaps we can organize them in a neater way than just one long list. See the Textpresso categories for an example of cascading menus, one possible solution.
      • One possibility would be by date of search - from top to bottom level, the menu would progress from year to month to day
  • Selection of search results file for curation - this would be one starting point for curation, while more specific searches for gene product, paper ID, Textpresso component term, GO term searches might be another. If curators selected a sentence file, they would progress through that file and annotate; if they opted to start their curation with a gene search, they would search for sentences that mention a gene and then annotate from those sentences.
  • Display of paper bibliographic information
    • This refers to adding more information than what we currently display, as well as how we display it. I think we could get the additional information from Textpresso and then just pretty up the display by adding some spacing and some bold text, etc. See PubMed for one possible example.
    • For C. elegans papers, the information would be in postgres, but for dictyBase and TAIR papers (and any other group that used the form), that information would come from Textpresso. We could make getting this information uniform across all groups, though, and just get everything from Textpresso.
  • Display of Source File Sentences - Highlighting Matched Terms
    • Each sentence will generally have at least one match to each of the four search categories: 1) gene product, 2) verbs, 3) assay term, 4) component.
      • The matching terms from each sentence are displayed in different colors to indicate the category matched. Here are the current (5/2013) category names and color matches:
        • CCC_TAIR = red/brown + underlined
        • localization_cell_components_082208 = red/brown + underlined
        • localization_cell_components_2011-02-11 = red/brown + underlined
        • protein_celegans = blue
        • genes_arabidopsis = blue
        • dicty_genes = blue
        • localization_verbs_082008 = green
        • localization_verbs_082208 = green
        • localization_other_082008 = orange
        • localization_experimental_082008 = orange
        • localization_experimental_082208 = orange
    • There are cases where phrases of two or more terms contain matches to more than one category. For example:
      • <CCC_TAIR><localization_experimental_082208>Protein</localization_experimental_082208> complex </CCC_TAIR>
      • Because of these cases, we'll first look for <CCC_TAIR> with <localization_experimental_082208> within it and convert the <localization_experimental_082208> to italics. Then convert all the XML tags to color.
  • Display of Source Sentence Files - Order by Section from Publication
  • Search functionality on form - this includes some new features, allowing curators to search previous annotations, individual papers, etc.
  1. Gene - search for all sentences that mention a specified gene and/or synonym (all sentences or specific sentence classification)
  2. Paper - search for sentences from a paper (all or specific sentence classifications)
  3. Curator - search for all sentences classified by a given curator (all or specific sentence classification)
  4. Annotation date - search for all work done for a given date (use wild cards)
  5. Component term in sentence - search for all sentences that matched a given Textpresso component term (all or sentence classification)
  6. GO term used for annotation - search for all sentences that used a specific GO term for annotation (Curate or Already Curated)
  • Curation when all entities are recognized - straightforward
    • Autocomplete for GO term using term string, synonym string, and GO ID
  • Curation when one or more entities is not recognized - add a value to either of first two columns - Error checking to implement?
  1. Enter a new gene name and database identifier - how to feedback to MOD?
  2. Enter a new component term in sentence - how to feed back to Textpresso?
  3. Perhaps something like an 'End of Session Report' if any new entities are added?
  • Feedback from form to Textpresso - this will have to be worked out with Textpresso; we would have to establish a mechanism for automatically updating a category - To be implemented later?
  1. Add gene name or synonym plus database identifier - supply as a report to curators so they can update their gpi file?
  2. Add component term to Textpresso cellular component category
  • Evidence codes
    • IDA (default), IPI (complex membership) - only one evidence code is allowed per annotation, the curator would have to select from a drop down list of evidence codes and ECO IDs?
    • For the IPI evidence codes, curators must add a value in the With box.
    • If annotations are being sent to Protein2GO, then the identifier in the With box should be of the form UniProtKB:nnnnn.
    • If there is more than one identifier contained within the With field, then the identifiers should be pipe-separated: UniProtKB:nnnnn|UniProtKB:nnnnn
  • Qualifiers - curators can select none, one, or both
    • not
    • colocalizes_with
  • If a single sentence contained evidence for more than one annotation, could we have duplicate functionality? How best to handle this?
    • It's hard to really say how many possible annotations there might be from a single sentence. Would this work the way the old concise description form used to work? We displayed four boxes and then there was an option to add another box? See below.
  1. Curate - select one or more entities from each column, will add new GO annotations to database
  2. Already curated - select one or more entities from each column, will record as already annotated, mark in red, filter, different location on page? I'm honestly not sure whether it's better to handle this at the level of the curation form or by filtering the Textpresso results files according to curator specifications. I'm leaning towards the latter at this point.
  • Sentence classification - should be check boxes, as it is possible, for example, to have a run-on sentence that is also a false positive. These could be grouped into distinct areas/fields on the form.
  1. False positive (classification stored as: falspos)
  2. Positive for localization, but not for GO (classification stored as: poslocneggo)
  3. Run-on sentence (classification stored as: runon)
  4. Scrambled sentence
  • Edit a previous annotation
    • Change gene annotated, change component term used, change GO term assigned, change evidence code
    • This may best be done through the Protein2GO tool; perhaps include a Caltech pgid in the annotation comment we send to Protein2GO and then an exported GAF from Protein2GO also with a postgres annotation ID could be used to update the annotation stored at Caltech
  • Edit relationship index
    • Currently, the Textpresso component - GO term index is stored in a table: ccc_component_go_index
    • Editing or adding to this index would be a separate functionality, but would allow a curator to view and edit the relationship index if needed.
    • It would be a way to add, for example, a new relationship even if it didn't directly result from an annotation, but was instead a relationship a curator thought might be useful.
    • This would be via a separate window. If we do this, I'd like to store some information about the editing and version the relationship index.
    • While creating the new form, we mapped the existing GO term strings in the relationship index to GO IDs. There were seven cases where the existing GO term string did not map to a GO ID (using the term name):
      • Membrane -> plasma membranes should be GO:0005886 = plasma membrane (probably a curator typo)
      • Vacuolar membranes -> plant-type vacuolar membrane should be GO:0009705 = plant-type vacuole membrane (probably a curator typo)
      • cell bodies -> cell soma should be GO:0044297 = cell body (change in term name, old name now exact synonym)
      • cell body -> cell soma should be GO:0044297 = cell body (change in term name, old name now exact synonym)
      • chromosomal axes -> axial element should be GO:0000800 = lateral element (may have also been a change in term name, not sure)
      • leading pseudopod -> pseudopod should be GO:0031143 pseudopodium (possibly curator error)
      • vacuole -> tonoplast should be GO:0009705 = plant-type vacuole membrane (may have also been a change in term name, not sure)
  • Delete a search results file
    • This could be tricky. We'd need to make sure there are no annotations associated with that search file.
      • If this didn't happen often, it would probably be easiest if you did it manually through the shell. If this happens often, we'd probably have two directories: 1) stuff to use 2) stuff not to use and a way to list all of them and move them back and forth assuming that Textpresso files will always have a unique name and we never want to get the same file twice.
        • I don't honestly know how often this might happen. Perhaps having a separate drop down for files used and files deleted would be reasonable?
  • Delete an annotation - we could add a delete button next to each annotation.
  • Comment field - we could have one comment field per sentence.
  • Annotation IDs - could be assigned sequentially to each annotation.
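
The two-step highlighting rule described under "Display of Source File Sentences - Highlighting Matched Terms" could be sketched as follows. The category names and colors are taken from the list in that item; the HTML span output, the exact CSS values, and the function name are assumptions made for illustration only.

```python
import re

# Category -> style. Colors follow the list above; the CSS strings are assumed.
COMPONENT = "color:brown;text-decoration:underline"
CATEGORY_STYLE = {
    "CCC_TAIR": COMPONENT,
    "localization_cell_components_082208": COMPONENT,
    "localization_cell_components_2011-02-11": COMPONENT,
    "protein_celegans": "color:blue",
    "genes_arabidopsis": "color:blue",
    "dicty_genes": "color:blue",
    "localization_verbs_082008": "color:green",
    "localization_verbs_082208": "color:green",
    "localization_other_082008": "color:orange",
    "localization_experimental_082008": "color:orange",
    "localization_experimental_082208": "color:orange",
}

INNER = re.compile(
    r"<localization_experimental_082208>(.*?)"
    r"</localization_experimental_082208>")


def highlight(sentence):
    """Convert Textpresso category tags in a sentence to styled HTML."""

    # Step 1: inside <CCC_TAIR>, turn the nested experimental tag into italics.
    def italicize_inner(match):
        return "<CCC_TAIR>" + INNER.sub(r"<i>\1</i>", match.group(1)) + "</CCC_TAIR>"

    sentence = re.sub(r"<CCC_TAIR>(.*?)</CCC_TAIR>", italicize_inner, sentence)

    # Step 2: convert every remaining category tag to a colored span.
    for category, style in CATEGORY_STYLE.items():
        sentence = sentence.replace(f"<{category}>", f'<span style="{style}">')
        sentence = sentence.replace(f"</{category}>", "</span>")
    return sentence
```

Doing the nested case first means the inner term keeps a visible marker (italics) while the whole phrase still takes the component color.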

Error Checking

  • Can we build in any error checking when the Submit button is pressed?
  • This would be basic checks for having selected at least one entity from each of the three columns and also for having selected Curate when adding an annotation (perhaps this latter check could be done automatically upon submission, since entering an annotation means that the curator has classified the sentence as curatable?).
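
A minimal sketch of these checks, assuming the submit handler receives the selections from the three columns as lists. The function name, argument names, and the "curate" label are hypothetical.

```python
def check_submission(genes, components, go_terms, classifications):
    """Check one sentence's submission.

    Returns (errors, classifications). If every column has a selection,
    "curate" is added automatically, as suggested above.
    """
    errors = []
    columns = {
        "gene product": genes,
        "component term": components,
        "GO term": go_terms,
    }
    for name, selected in columns.items():
        if not selected:
            errors.append(f"Select at least one {name}.")

    classifications = set(classifications)
    if not errors:
        # Entering a complete annotation implies the sentence is curatable.
        classifications.add("curate")
    return errors, classifications
```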

Export annotations

  • Initially not all groups will be using Protein2GO, so we will need to have different options for exporting the annotations.
    • Is it better to just automatically write annotations to tazendra as they are made? I think dumping the annotation file or sending it to Protein2GO should be a conscious, clickable action.
    • Can a curator's login session be tracked so that any annotations made during a given login session are what is sent to Protein2GO?
    • If we send them a pgid in the Comment field, then we can track each annotation?
  • To a MOD - GO Gene Association File (GAF) (preferred) or three-column file (Gene ID, GO ID, Paper ID) like we have now with the Dump Annotation File option.
  • To Protein2GO - using web services; see Protein2GO Web Services
    • This would happen whenever a curator clicked on a 'Submit to Protein2GO' button.
    • Taxon ID will be fixed for each curator/database.
      • C. elegans taxon ID: 6239
      • Arabidopsis thaliana (TAIR) taxon ID: 3702
      • Dictyostelium discoideum (dictyBase) taxon ID: 44689
    • Userids are curator-specific and take the form "database:email", where database is either "test" or "production", and email is the email address associated with the curator's Protein2GO account.
    • While CCC/P2G integration is in the testing phase, you should use the test P2G database, so that means that your userid for the web service would be "test:email@address.edu" (without the quotes).
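
The userid format above, together with the With field format described under "Evidence codes", could be checked before anything is sent to Protein2GO. This is a sketch under stated assumptions: the function names are hypothetical, the accession pattern is deliberately loose (the source only gives "UniProtKB:nnnnn"), and no actual Protein2GO web service call is shown.

```python
import re

# Taxon IDs as listed above, keyed by database name.
TAXON_IDS = {"WormBase": 6239, "TAIR": 3702, "dictyBase": 44689}

# Loose, assumed pattern for "UniProtKB:nnnnn"; not a full accession check.
WITH_ID = re.compile(r"^UniProtKB:[A-Za-z0-9]+$")


def make_userid(email, database="test"):
    """Build a userid of the form "database:email"."""
    if database not in ("test", "production"):
        raise ValueError("database must be 'test' or 'production'")
    return f"{database}:{email}"


def validate_with(with_string, evidence_code):
    """Return a list of problems with the With field (empty if none)."""
    if evidence_code != "IPI":
        return []
    if not with_string:
        return ["IPI annotations require a value in the With box."]
    return [
        f"Not of the form UniProtKB:nnnnn: {identifier}"
        for identifier in with_string.split("|")
        if not WITH_ID.match(identifier)
    ]
```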

Files needed (see Mapping Files and Source Files below)

  • Mapping file for gene names and synonyms to MOD identifier and UniProtKB identifier
  • GO's gpi file format would have all of the information we need
  • The basic file format is a header and then a tab-delimited file of database identifiers, names, synonyms, etc.
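
As a rough sketch, a name-to-identifier lookup could be built from a gpi file as below. It assumes the GPI 1.x column order (DB, DB_Object_ID, DB_Object_Symbol, DB_Object_Name, DB_Object_Synonym, ...), header lines starting with "!", and pipe-separated synonyms; these should be checked against the gpi specification actually in use. The function name is hypothetical.

```python
from collections import defaultdict


def load_gpi_names(lines):
    """Map each symbol or synonym to the set of 'DB:ID' identifiers using it."""
    name_to_ids = defaultdict(set)
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # not enough columns to hold symbol and synonyms
        db, object_id, symbol, _name, synonyms = cols[:5]
        identifier = f"{db}:{object_id}"
        for name in [symbol] + synonyms.split("|"):
            if name:
                name_to_ids[name].add(identifier)
    return dict(name_to_ids)
```

A set is used for the values because the same name or synonym may map to more than one identifier, which the form would need to show to the curator rather than resolve silently.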


Other issues

  • What to do with old data - can we map all old data onto new tables? Some information may be missing in old data, is that okay?


  1. curator login, maps to mods
  2. main page, search options, fields, one search button to find all fields
  3.   checkboxes for annotated and for not annotated (to allow both selected)
  4.   option of how many annotations to make to a given paper-sentence (say 3)
  5. search page searches all fields and returns sentences that match and all their annotations, regardless of whether each annotation matches the search (like searching for a GO Term and getting all the annotations, even those without the matching GO Term).
  6.  - source_file, paper, gene, component  search sentence files
  7.  - go term, classification, curator, date, annotation_extension, with_string, p2goID  search postgres
  8.  - evidence code, qualifier  don't search
  9. search results with all papers listed at the top as links to anchors in page.
  10. each paper has own form and submit this paper button
  11. for each paper show bibliography and list all sentences
  12. for each sentence show 1 set (of annotation options), hide all others, if any data in first set show next one.
  13. for each sentence, show the paperID, sentence ID in the source file, sentence, classification multiselect (store this in separate table ?)
  14. submitting to paper-sentence with blank ID gets new ID, existing pairs already have an ID
  15. hide postgres id
  16. show Gene (list of IDs from textpresso)
  17. show component free text
  18. show component (list of terms from textpresso)
  19. show GO term free text
  20. show GO term (list of gene_comp_go mappings)
  21. evidence code (ida OR ipi - in the future, if we include other data types, may need to expand to the whole list of GO evidence codes) for this, include text description of evidence code and ECO (Evidence Code Ontology) ID
    Sample from eco.obo file, we will display what comes after an xref tag where the text after xref starts with GOECO:
    We will also display the ECO ID, found after the id tag.
    [Term]
    id: ECO:0000021
    name: physical interaction evidence
    def: "Experimental evidence that is based on characterization of an interaction between a gene product and another molecule." [ECO:MCC]
    comment: Molecules interacted with might include protein, nucleic acid, ion, or complex.
    synonym: "inferred from physical interaction" RELATED [GOECO:IPI]
    synonym: "IPI" RELATED [GOECO:IPI]
    xref: GOECO:IPI "inferred from physical interaction"
    xref: PSI-MI:MI\:0045 "experimental interaction detection"
    is_a: ECO:0000006 ! experimental evidence
    Example: GOECO:IPI "inferred from physical interaction" ECO:0000021
  22. qualifiers (not / colocalizes_with) multiselect
  23. with_string free text (for now)
  24. annotation_extension free text (for now)
  25. curator   (display, don't allow manual change)
  26. p2go ID   (display, don't allow manual change)
  27. timestamp (display, don't allow manual change)
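
The evidence code display described above (text after an xref tag starting with GOECO:, plus the ECO ID from the id tag) could be extracted along these lines. This is a sketch that assumes stanzas in the format of the eco.obo sample above, with the id line appearing before the xref lines; the function names are hypothetical.

```python
import re

# Matches lines such as: xref: GOECO:IPI "inferred from physical interaction"
XREF = re.compile(r'^xref: GOECO:(\S+) "([^"]*)"')


def parse_eco_terms(obo_text):
    """Map a GO evidence code (e.g. 'IPI') to (description, ECO ID)."""
    codes = {}
    for stanza in obo_text.split("[Term]")[1:]:
        eco_id = None
        for line in stanza.splitlines():
            line = line.strip()
            if line.startswith("id: "):
                eco_id = line[len("id: "):]
            else:
                match = XREF.match(line)
                if match and eco_id:
                    codes[match.group(1)] = (match.group(2), eco_id)
    return codes


def format_evidence(code, codes):
    """Build the display string for one evidence code."""
    description, eco_id = codes[code]
    return f'GOECO:{code} "{description}" {eco_id}'
```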

in postgres

  1. annotations : pg_annotation_id, paperId, sentNum, gene (id), component, go (id), evidence, qualifier, with, annot, curator, p2go, timestamp
  2. sentences_classification : paperId, sentNum, classification, curator, timestamp
  3.  ?? sentences_to_files : paperId, sentNum, files-pipe-separated
  4. biblio for paper - generate once, store somewhere - could the bibliographic information be part of the sentences file, say a header or footer?

http://mangolassi.caltech.edu/~azurebrd/cgi-bin/forms/ccc/ccc.cgi

Mapping Files

  1. gpi file - maps gene names and synonyms to database identifiers
  2. specifications for WB gpi file

Source Files

  1. The source files will be generated by Textpresso and transferred to tazendra.
  2. specifications for source files
  3. Test files for C. elegans are on mangolassi here: /home/acedb/kimberly/ccc_2_testing/20130426_WB_test_files
  4. TODO : Need cronjob to update http://textpresso-dev.caltech.edu/ccc_results/accession to map PMID to each MOD's paper ID
  5. for sentence/abstract display, get mappings of punctuation code from textpresso-dev /data2/svn-checkout-dev/textpresso2.0/branches/celegans/perlmodules/TextpressoGeneralGlobals.pm
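
A cronjob that picks up source files would need to recognize the date_modname_some_meaningful_text naming format described under "Import of search results files". The sketch below assumes the date is written as YYYYMMDD, as in the test directory name 20130426_WB_test_files; the function names are hypothetical. The second function returns the year, month, day path suggested for a cascading menu.

```python
from datetime import datetime


def parse_results_filename(filename):
    """Split 'date_modname_description' into (date, mod, description).

    Raises ValueError if the name does not follow the format.
    """
    parts = filename.split("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Unexpected file name: {filename}")
    date_text, mod, description = parts
    # Assumed date format: YYYYMMDD.
    search_date = datetime.strptime(date_text, "%Y%m%d").date()
    return search_date, mod, description


def menu_path(filename):
    """Return (year, month, day) for a date-based cascading menu."""
    search_date, _mod, _description = parse_results_filename(filename)
    return (search_date.year, search_date.month, search_date.day)
```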

Testing

Testing Search Results - 20130509

How the Form Works

  1. This documentation is written based on using the form on a Mac running OS X 10.6.8 and Firefox 20.0.
  2. Curator goes to login page, selects first name from list of curators, clicks on Login!
  3. Curators are then taken to a Search page that allows for either selecting one or more source files for annotation or searching across source files. The source files available for curation and searching are MOD-specific, meaning that WormBase files will load for a WormBase curator, TAIR files for the TAIR curators, and dictyBase files for the dicty curators.
  4. The sentence-curation drop down allows a curator to specify what types of sentences will be returned with respect to curation. There are three options: 1) search all sentences, 2) exclude curated (i.e., search only uncurated sentences), 3) exclude uncurated (i.e., search only curated sentences).
  5. Papers to show indicates the number of matching papers that will be returned for a given search. The number to show probably depends on the curator and the size of the source files (i.e., the number of papers in the source file). We may need to get some feedback from curators after they've used the form a bit to see if the current number of 10 is a good target or if they're finding that they need to change it a lot. I'd also thought about a drop down here of 10, 25, 50, 100 as an alternative.
  6. Source lists all of the currently available source files for curating and searching. To select one file, click on the file name and it will be highlighted. To select all files in the list, click on the first file, hold down the shift key and then click on the last file in the list. To select two consecutive files, click on the first file, hold down the shift key, and then click on the next file. To select non-consecutive files, click on the first file, hold down the command (control on PC and Linux) key, and then click on the second file.
  7. When you select a file name and click on Search, the form will return the number of papers specified in the papers to show box, but will also alert the curator to the total number of papers and sentences that matched the search criteria. For example, if you enter 20 papers and select a source file that has 54 papers, the first 20 papers (based upon paper ID in ascending order) and their associated sentences will show along with a message saying: The above search has 54 papers with 100 sentences, here are 20 papers :. This indicates to the curator that there are additional papers and sentences that matched the search criteria.
  • Searches that could be performed on Textpresso data, annotated data, or both
  1. Gene Product Search - Can search for name as well as MOD or UniProtKB IDs; substring search will match on any of them. As an example: ZFP-1|WBGene00006975|UniProtKB:P34447 - only needs to search Textpresso results since no free text gene product can be entered.
  2. Paper
  3. Component
  4. GO Term
  • Searches that could be performed on annotated data only:
  1. Annotation Curator
  2. Annotation Date
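
The paging behavior described in step 7 above could be sketched as follows. The function name and the dictionary input are assumptions; paper IDs are sorted as plain strings here, which matches ascending order only if the IDs are zero-padded or otherwise sort consistently.

```python
def page_results(sentences_by_paper, papers_to_show=10):
    """Return the paper IDs to display and the summary message.

    sentences_by_paper maps a paper ID to its list of matching sentences.
    """
    paper_ids = sorted(sentences_by_paper)  # ascending paper ID
    total_sentences = sum(len(s) for s in sentences_by_paper.values())
    shown = paper_ids[:papers_to_show]
    message = (
        f"The above search has {len(paper_ids)} papers with "
        f"{total_sentences} sentences, here are {len(shown)} papers :"
    )
    return shown, message
```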

