CCC Form 2.0 Specifications

From WormBaseWiki

This page is intended to document specifications for the next version of the Textpresso for Cellular Component Curation (CCC) tool. The changes to the tool, and pipeline, have been suggested by curators and are also part of the broader plan for Textpresso-based curation pipelines and the GO's Common Annotation Framework.

Textpresso search specifications

  • Frequency of searches - this will vary by group (e.g., weekly for dictyBase)
  • Corpus - this will also vary by group
  • Categories - gene/protein name, CCC, assay term, verb
  • Filtering (Textpresso)
  1. Journal
  2. Date
  3. Document IDs
  • Filtering (non-Textpresso)
  1. SVM
  2. Gene Ontology Gene Association File
  • Ranking search results - e.g., highest scoring papers presented first
  • Naming search results file
  • Storing search histories
  1. Recording versions of pdf2text conversion
  2. Recording version of categories used
  3. Recording search criteria, i.e. categories, corpus, filters
  4. Recording curator or group and date of search
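The four items recorded for each search could be kept together as one record per results file. The following is a minimal sketch only: the class name `SearchRecord`, its field names, and all example values are hypothetical and are not part of the current pipeline.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class SearchRecord:
    """One stored search history entry (hypothetical field names)."""
    results_file: str        # name of the search results file
    pdf2text_version: str    # version of the pdf2text conversion
    category_versions: dict  # category name -> version label
    corpus: str              # corpus searched
    filters: dict            # e.g. journal, date, document IDs
    searched_by: str         # curator or group
    search_date: date        # date of the search

    def to_json(self) -> str:
        """Serialize the record so it can be stored next to the results file."""
        data = asdict(self)
        data["search_date"] = self.search_date.isoformat()
        return json.dumps(data, sort_keys=True)


# Example values are made up for illustration.
record = SearchRecord(
    results_file="20130426_WB_test_files",
    pdf2text_version="unknown",
    category_versions={"localization_verbs": "082208"},
    corpus="C. elegans",
    filters={"journal": None, "date": None, "document_ids": []},
    searched_by="WormBase",
    search_date=date(2013, 4, 26),
)
print(record.to_json())
```

Keeping the record as plain JSON would let it be written alongside the results file or loaded into a database table later, whichever turns out to be easier.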

Curation form

Overall Workflow

  • Curator will log in
  • Directed to a blank annotation page with drop down menu for Textpresso sentence file selection
  • Curators will select a sentence file for annotation
  • If a completely new sentences file, curators can begin to create annotations and classify sentences (if desired)
  • If a partially annotated file, curators can modify existing annotations (still thinking about this possibility) or continue to add new annotations and classify new sentences (if desired) - but curators can still see all previous work (this is a fundamental difference between the new and old forms)
  • Any new annotations can be sent to Protein2GO annotation tool when curator is ready - not all curators using the form are also using Protein2GO (WB and dictyBase will be, TAIR not yet), so submitting to Protein2GO would have to be an option that the curator consciously selected
  • Searches can also be performed using gene names, paper IDs, component terms in sentences, and GO terms - searches could be performed within a file or across all results files, if possible. For example, if a curator was interested in viewing all of the returned sentences that mentioned 'PAR-3', they could do that.

Detailed Workflow

  • The form presents sentences for annotation - some sentences will lead to a GO annotation, some will not
  • Not all sentences will be classified (although it'd be great if they were)
    • Does that mean that we need a way to permanently skip sentences, or can they be set as "not classified" to skip it (thus classifying it)?
      • Sentences without a classification could be stored as 'unclassified'. We could add 'unclassified' as a sentence classification check box option?

Data to be Stored

  • What data to store for a GO annotation
    • What is unique about an annotation? - File, Paper, Sentence, Gene product, GO term, GO evidence code, Curator, Date
  1. Name of search results file
  2. Paper identifier - this will come from the Textpresso sentences file and will be a PMID or doi; hopefully we won't get the same papers as different instances with different IDs - I don't think this would happen; dois would likely only get used if there wasn't a PMID available.
  3. Gene/gene product identifier - this will be derived from the match between the gene product in the sentence and the information in the 'gpi' file that maps a gene product name to a database identifier
  4. Textpresso component term - column 2 in current form
  5. GO component term - column 3 in current form
  6. Evidence code - previously all annotations via the CCC form used the IDA evidence code; the new form will also allow for the IPI evidence code.
  7. Qualifier
  8. With string
  9. Annotation extension
  10. Sentence ID
  11. Sentence classification
  12. Curator
  13. Annotation date - recorded upon pressing submit button
  14. Annotation history - this would be a way to track how often an annotation or classification was touched and by whom; akin to the curator history field on the concise description oa. It is helpful sometimes to be able to see the curation history for an annotated object, particularly if there is ever a question about an annotation. If history could be displayed as part of the annotation when it is displayed, that would be fine.
  • What data to store if no GO annotation:
  1. Name of search results file
  2. Paper identifier
  3. Sentence ID
  4. Sentence classification
  5. Curator
  6. Annotation date
  7. Annotation history
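The question of what makes an annotation unique can be made concrete by treating the eight listed fields as a composite key. This is a sketch under assumptions: the names `AnnotationKey` and `find_duplicates` are hypothetical, and the identifier values (including the placeholder PMID and sentence ID) are invented for illustration.

```python
from typing import NamedTuple


class AnnotationKey(NamedTuple):
    """The fields proposed above as jointly identifying one annotation."""
    source_file: str
    paper_id: str
    sentence_id: str
    gene_product_id: str
    go_id: str
    evidence_code: str
    curator: str
    annotation_date: str


def find_duplicates(keys):
    """Return every key that repeats an earlier key in the sequence."""
    seen = set()
    duplicates = []
    for key in keys:
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
    return duplicates


# Placeholder values; only the WBGene ID appears elsewhere on this page.
first = AnnotationKey("20130426_WB_test_files", "PMID:0000000", "s1",
                      "WB:WBGene00004804", "GO:0005634", "IDA",
                      "curator_a", "2013-06-11")
second = first._replace(evidence_code="IPI")
print(find_duplicates([first, second, first]))
```

Whether curator and date really belong in the key is a design decision: leaving them out would flag the same annotation made twice by different curators as a duplicate.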

Curation Workflow

Detailed curation workflows and search scenarios

  • Curator login
    • Curator login would be used to determine which files would be available for annotation; I don't know which is easier: 1) a separate login page for each MOD, or 2) a mapping of curator to files
      • Mapping of curators to files:
        • dictyBase - Petra Fey and Robert Dodson
        • TAIR - Tanya Berardini and Donghui Li
        • WormBase - Kimberly Van Auken, Ranjana Kishore (others?)
  • Import of search results files
    • Can this be automated? dictyBase searches will be run weekly, can the results files be automatically transferred to tazendra?
      • If textpresso does this in a way where the output is always in a specific public_html/ directory with a sensible filename structure, then we can have a cronjob pick them up.
      • File names will need to follow the format: date_modname_some_meaningful_text
  • Organization of search results file
    • If we have many results files, perhaps we can organize them in a neater way than just one long list. See the Textpresso categories for an example of cascading menus, one possible solution.
      • One possibility would be by date of search - from top to bottom level, the menu would progress from year to month to day
  • Selection of search results file for curation - this would be one starting point for curation, while more specific searches for gene product, paper ID, Textpresso component term, GO term searches might be another. If curators selected a sentence file, they would progress through that file and annotate; if they opted to start their curation with a gene search, they would search for sentences that mention a gene and then annotate from those sentences.
  • Display of paper bibliographic information
    • This refers to adding more information than what we currently display, as well as how we display it. I think we could get the additional information from Textpresso and then just pretty up the display by adding some spacing and some bold text, etc. See PubMed for one possible example: [1]
    • For C. elegans papers, the information would be in postgres, but for dictyBase and TAIR papers (and any other group that used the form), that information would come from Textpresso. We could make getting this information uniform across all groups, though, and just get everything from Textpresso.
  • Display of Source File Sentences - Highlighting Matched Terms
    • Each sentence will have at least one match to each of the four search categories, generally: 1) gene product, 2) verbs, 3) assay term, 4) component.
      • The matching terms from each sentence are displayed in different colors to indicate the category matched. Here are the current (5/2013) category names and color matches:
        • CCC_TAIR = red/brown + underlined
        • localization_cell_components_082208 = red/brown + underlined
        • localization_cell_components_2011-02-11 = red/brown + underlined
        • protein_celegans = blue
        • genes_arabidopsis = blue
        • dicty_genes = blue
        • localization_verbs_082008 = green
        • localization_verbs_082208 = green
        • localization_other_082008 = orange
        • localization_experimental_082008 = orange
        • localization_experimental_082208 = orange
    • There are cases where phrases of two or more terms contain matches to more than one category. For example:
      • <CCC_TAIR><localization_experimental_082208>Protein</localization_experimental_082208> complex </CCC_TAIR>
      • Because of these cases, we'll first look for <CCC_TAIR> with <localization_experimental_082208> within it and convert the <localization_experimental_082208> to italics. Then convert all the XML tags to color.
  • Display of Source Sentence Files - Order by Section from Publication
  • Search functionality on form - this includes some new features, allowing curators to search previous annotations, individual papers, etc.
  1. Gene - search for all sentences that mention a specified gene and/or synonym (all sentences or specific sentence classification)
  2. Paper - search for sentences from a paper (all or specific sentence classifications)
  3. Curator - search for all sentences classified by a given curator (all or specific sentence classification)
  4. Annotation date - search for all work done for a given date (use wild cards)
  5. Component term in sentence - search for all sentences that matched a given Textpresso component term (all or sentence classification)
  6. GO term used for annotation - search for all sentences that used a specific GO term for annotation (Curate or Already Curated)
  • Curation when all entities are recognized - straightforward
    • Autocomplete for GO term using term string, synonym string, and GO ID
  • Curation when one or more entities is not recognized - add a value to either of first two columns - Error checking to implement?
  1. Enter a new gene name and database identifier - how to feedback to MOD?
  2. Enter a new component term in sentence - how to feed back to Textpresso?
  3. Perhaps something like an 'End of Session Report' if any new entities are added?
  • Feedback from form to Textpresso - this will have to be worked out with Textpresso; we would have to establish a mechanism for automatically updating a category - To be implemented later?
  1. Add gene name or synonym plus database identifier - supply as a report to curators so they can update their gpi file?
  2. Add component term to Textpresso cellular component category
  • Evidence codes
    • IDA (default), IPI (complex membership) - only one evidence code is allowed per annotation, the curator would have to select from a drop down list of evidence codes and ECO IDs?
    • For the IPI evidence codes, curators must add a value in the With box.
    • If annotations are being sent to Protein2GO, then the identifier in the With box should be of the form UniProtKB:nnnnn.
    • If there is more than one identifier contained within the With field, then the identifiers should be pipe-separated: UniProtKB:nnnnn|UniProtKB:nnnnn
  • Qualifiers - curators can select none, one, or both
    • not
    • colocalizes_with
  • If a single sentence contained evidence for more than one annotation, could we have duplicate functionality? How best to handle this?
    • It's hard to really say how many possible annotations there might be from a single sentence. Would this work the way the old concise description form used to work? We displayed four boxes and then there was an option to add another box? See below.
  1. Curate - select one or more entities from each column, will add new GO annotations to database
  2. Already curated - select one or more entities from each column, will record as already annotated, mark in red, filter, different location on page? I'm honestly not sure whether it's better to handle this at the level of the curation form or by filtering the Textpresso results files according to curator specifications. I'm leaning towards the latter at this point.
  • Sentence classification - should be check boxes, as it is possible, for example, to have a run-on sentence that is also a false positive. These could be grouped into distinct areas/fields on the form.
  1. False positive (classification stored as: falsepos)
  2. Positive for localization, but not for GO (classification stored as: poslocneggo)
  3. Run-on sentence (classification stored as: runon)
  4. Scrambled sentence
  • Edit a previous annotation
    • Change gene annotated, change component term used, change GO term assigned, change evidence code
    • This may best be done through the Protein2GO tool; perhaps include a Caltech pgid in the annotation comment we send to Protein2GO and then an exported GAF from Protein2GO also with a postgres annotation ID could be used to update the annotation stored at Caltech
  • Edit relationship index
    • Currently, the Textpresso component - GO term index is stored in a table: ccc_component_go_index
    • Editing or adding to this index would be a separate functionality, but would allow a curator to view and edit the relationship index if needed.
    • It would be a way to add, for example, a new relationship even if it didn't directly result from an annotation, but was instead a relationship a curator thought might be useful.
    • This would be via a separate window. If we do this, I'd like to store some information about the editing and version the relationship index.
    • While creating the new form, we mapped the existing GO term strings in the relationship index to GO IDs. There were seven cases where the existing GO term string did not map to a GO ID (using the term name):
      • Membrane -> plasma membranes should be GO:0005886 = plasma membrane (probably a curator typo)
      • Vacuolar membranes -> plant-type vacuolar membrane should be GO:0009705 = plant-type vacuole membrane (probably a curator typo)
      • cell bodies -> cell soma should be GO:0044297 = cell body (change in term name, old name now exact synonym)
      • cell body -> cell soma should be GO:0044297 = cell body (change in term name, old name now exact synonym)
      • chromosomal axes -> axial element should be GO:0000800 = lateral element (may have also been a change in term name, not sure)
      • leading pseudopod -> pseudopod should be GO:0031143 pseudopodium (possibly curator error)
      • vacuole -> tonoplast should be GO:0009705 = plant-type vacuole membrane (may have also been a change in term name, not sure)
  • Delete a search results file
    • This could be tricky. We'd need to make sure there are no annotations associated with that search file.
      • If this didn't happen often, it would probably be easiest if you did it manually through the shell. If this happens often, we'd probably have two directories: 1) stuff to use 2) stuff not to use and a way to list all of them and move them back and forth assuming that Textpresso files will always have a unique name and we never want to get the same file twice.
        • I don't honestly know how often this might happen. Perhaps having a separate drop down for files used and files deleted would be reasonable?
  • Delete an annotation - we could add a delete button next to each annotation.
  • Comment field - we could have one comment field per sentence.
  • Annotation IDs - could be assigned sequentially to each annotation.
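The two-step handling of nested category tags described under "Display of Source File Sentences - Highlighting Matched Terms" (italicize an experimental-term tag found inside <CCC_TAIR>, then convert all remaining tags to color) can be sketched as follows. This is an illustration, not the form's actual code: the function name `highlight` and the specific CSS values are assumptions, and "red/brown" is rendered simply as brown.

```python
import re

COMPONENT_STYLE = "color:brown;text-decoration:underline"
CATEGORY_STYLES = {
    "CCC_TAIR": COMPONENT_STYLE,
    "localization_cell_components_082208": COMPONENT_STYLE,
    "localization_cell_components_2011-02-11": COMPONENT_STYLE,
    "protein_celegans": "color:blue",
    "genes_arabidopsis": "color:blue",
    "dicty_genes": "color:blue",
    "localization_verbs_082008": "color:green",
    "localization_verbs_082208": "color:green",
    "localization_other_082008": "color:orange",
    "localization_experimental_082008": "color:orange",
    "localization_experimental_082208": "color:orange",
}

OUTER = re.compile(r"<CCC_TAIR>.*?</CCC_TAIR>", re.DOTALL)
INNER = re.compile(
    r"<localization_experimental_082208>(.*?)"
    r"</localization_experimental_082208>",
    re.DOTALL,
)


def highlight(sentence: str) -> str:
    """Convert Textpresso category tags in a sentence to styled HTML."""
    # Step 1: inside each <CCC_TAIR> span, turn the nested experimental
    # tag into italics so the phrase keeps a single color.
    sentence = OUTER.sub(
        lambda match: INNER.sub(r"<i>\1</i>", match.group(0)), sentence
    )
    # Step 2: convert every remaining category tag to a colored span.
    for category, style in CATEGORY_STYLES.items():
        sentence = sentence.replace(f"<{category}>", f'<span style="{style}">')
        sentence = sentence.replace(f"</{category}>", "</span>")
    return sentence


tagged = ("<CCC_TAIR><localization_experimental_082208>Protein"
          "</localization_experimental_082208> complex </CCC_TAIR>")
print(highlight(tagged))
```

Doing the nested case first matters: if all tags were converted to spans in a single pass, the inner orange span would override the component color for part of the phrase.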

Error Checking

  • Can we build in any error checking when the Submit button is pressed?
  • This would be basic checks for having selected at least one entity from each of the three columns and also for having selected Curate when adding an annotation (perhaps this latter check could be done automatically upon submission, since entering an annotation means that the curator has classified the sentence as curatable?).
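A sketch of what the submit-time checks might look like, assuming the form hands the selections to a single function. The name `check_submission`, its parameters, and the message texts are hypothetical; the rule that IPI needs a With value is taken from the evidence code notes above.

```python
def check_submission(genes, components, go_terms,
                     evidence_code="IDA", with_string=""):
    """Return a list of problems; an empty list means the annotation can be saved."""
    errors = []
    columns = (
        ("gene product", genes),
        ("component term", components),
        ("GO term", go_terms),
    )
    # At least one entity must be selected from each of the three columns.
    for label, selected in columns:
        if not selected:
            errors.append(f"Select at least one {label}.")
    if evidence_code not in ("IDA", "IPI"):
        errors.append(f"Unsupported evidence code: {evidence_code}")
    elif evidence_code == "IPI" and not with_string.strip():
        errors.append("IPI annotations need a value in the With box.")
    return errors


# A submission with no GO term selected is rejected with one message.
print(check_submission(["WB:WBGene00004804"], ["nucleus"], []))
```

If the function returns no errors, the form could then mark the sentence as Curate automatically rather than asking the curator to tick it.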

Export annotations

  • Initially not all groups will be using Protein2GO, so we will need to have different options for exporting the annotations.
    • Is it better to just automatically write annotations to tazendra as they are made? I think dumping the annotation file or sending it to Protein2GO should be a conscious, clickable action.
    • Can a curator's login session be tracked so that any annotations made during a given login session are what is sent to Protein2GO?
    • If we send them a pgid in the Comment field, then we can track each annotation?
  • To a MOD - GO Gene Association File (GAF)(preferred) or three-column file (Gene ID, GO ID, Paper ID) like we have now with the Dump Annotation File option.
  • To Protein2GO - using web services See Protein2GO Web Services
    • This would happen whenever a curator clicked on a 'Submit to Protein2GO' button.
    • Taxon ID will be fixed for each curator/database.
      • C. elegans taxon ID: 6239
      • Arabidopsis thaliana (TAIR) taxon ID: 3702
      • Dictyostelium discoideum (dictyBase) taxon ID: 44689
    • User IDs are curator-specific and take the form "database:email", where database is either "test" or "production", and email is the email address associated with the curator's Protein2GO account.
    • While CCC/P2G integration is in the testing phase, you should use the test P2G database, so that means that your userid for the web service would be "" (without the quotes).
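The fixed pieces of the export (taxon IDs, the "database:email" user ID, and the three-column dump) could be assembled as below. This sketch deliberately does not call the Protein2GO web service, whose interface is not described here; the function names and the example email address are hypothetical.

```python
import csv
import io

# Taxon IDs as listed above, keyed by database.
TAXON_IDS = {"WormBase": 6239, "TAIR": 3702, "dictyBase": 44689}


def p2go_userid(database: str, email: str) -> str:
    """Build a Protein2GO user ID of the form 'database:email'."""
    if database not in ("test", "production"):
        raise ValueError("database must be 'test' or 'production'")
    return f"{database}:{email}"


def dump_three_column(annotations) -> str:
    """Write (gene ID, GO ID, paper ID) tuples as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for gene_id, go_id, paper_id in annotations:
        writer.writerow([gene_id, go_id, paper_id])
    return buffer.getvalue()


# The email address and PMID are placeholders.
print(p2go_userid("test", "curator@example.org"))
print(dump_three_column([("WBGene00004804", "GO:0005634", "PMID:0000000")]))
```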

Files needed (see Mapping Files and Source Files below)

  • Mapping file for gene names and synonyms to MOD identifier and UniProtKB identifier
  • GO's gpi file format would have all of the information we need
  • The basic file format is a header and then a tab-delimited file of database identifiers, names, synonyms, etc.
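Reading such a mapping file into a name-to-identifier lookup might look like the sketch below. The column positions are an assumption based on the GPI 1.1 layout (database, object ID, symbol, name, pipe-separated synonyms) and should be checked against the gpi version each group actually supplies; the function name and the sample row's synonym are made up.

```python
def load_gpi_names(lines, db_col=0, id_col=1, symbol_col=2, synonym_col=4):
    """Map lower-cased symbols and synonyms to 'DB:ID' identifiers."""
    name_to_ids = {}
    for line in lines:
        line = line.rstrip("\n")
        # Header lines start with '!' and carry no gene product data.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.split("\t")
        if len(cols) <= max(db_col, id_col, symbol_col, synonym_col):
            continue  # skip rows too short to hold the expected columns
        identifier = f"{cols[db_col]}:{cols[id_col]}"
        names = [cols[symbol_col]] + cols[synonym_col].split("|")
        for name in names:
            if name:
                name_to_ids.setdefault(name.lower(), set()).add(identifier)
    return name_to_ids


# Illustrative input; the synonym is invented.
SAMPLE = [
    "!gpi-version: 1.1\n",
    "WB\tWBGene00004804\tskn-1\t\texample-synonym\tprotein\ttaxon:6239\n",
]
print(load_gpi_names(SAMPLE))
```

Storing a set of identifiers per name allows for the case where one synonym is shared by more than one gene product, which the form would then need to present as a choice.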

Other issues

  • What to do with old data - can we map all old data onto new tables? Some information may be missing in old data, is that okay?

  1. curator login, maps to mods
  2. main page, search options, fields, one search button to find all fields
  3.   checkboxes for annotated and for not annotated (to allow both selected)
  4.   option of how many annotations to make to a given paper-sentence (say 3)
  5. search page searches all fields and returns the sentences that match, along with all of their annotations, regardless of whether each annotation matches the search (e.g., searching for a GO Term returns all annotations on the matching sentences, even those without the matching GO Term).
  6.  - source_file, paper, gene, component  search sentence files
  7.  - go term, classification, curator, date, annotation_extension, with_string, p2goID  search postgres
  8.  - evidence code, qualifier  don't search
  9. search results with all papers listed at the top as links to anchors in page.
  10. each paper has own form and submit this paper button
  11. for each paper show bibliography and list all sentences
  12. for each sentence show 1 set (of annotation options), hide all others, if any data in first set show next one.
  13. for each sentence, show the paperID, sentence ID in the source file, sentence, classification multiselect (store this in separate table ?)
  14. submitting to paper-sentence with blank ID gets new ID, existing pairs already have an ID
  15. hide postgres id
  16. show Gene (list of IDs from textpresso)
  17. show component free text
  18. show component (list of terms from textpresso)
  19. show GO term free text
  20. show GO term (list of gene_comp_go mappings)
  21. evidence code (ida OR ipi - in the future, if we include other data types, may need to expand to the whole list of GO evidence codes) for this, include text description of evidence code and ECO (Evidence Code Ontology) ID
    Sample from eco.obo file, we will display what comes after an xref tag where the text after xref starts with GOECO:
    We will also display the ECO ID, found after the id tag.
    id: ECO:0000021
    name: physical interaction evidence
    def: "Experimental evidence that is based on characterization of an interaction between a gene product and another molecule." [ECO:MCC]
    comment: Molecules interacted with might include protein, nucleic acid, ion, or complex.
    synonym: "inferred from physical interaction" RELATED [GOECO:IPI]
    synonym: "IPI" RELATED [GOECO:IPI]
    xref: GOECO:IPI "inferred from physical interaction"
    xref: PSI-MI:MI\:0045 "experimental interaction detection"
    is_a: ECO:0000006 ! experimental evidence
    Example: GOECO:IPI "inferred from physical interaction" ECO:0000021
  22. qualifiers (not / with) multiselect
  23. with_string free text (for now)
  24. annotation_extension free text (for now)
  25. curator   (display, don't allow manual change)
  26. p2go ID   (display, don't allow manual change)
  27. timestamp (display, don't allow manual change)
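The evidence code display described in the eco.obo sample (show the text after an xref tag when it starts with GOECO:, plus the ECO ID from the id tag) can be sketched as below. The function name `goeco_display` is hypothetical, and the stanza is a shortened copy of the sample above.

```python
STANZA = r'''
id: ECO:0000021
name: physical interaction evidence
synonym: "IPI" RELATED [GOECO:IPI]
xref: GOECO:IPI "inferred from physical interaction"
xref: PSI-MI:MI\:0045 "experimental interaction detection"
is_a: ECO:0000006 ! experimental evidence
'''


def goeco_display(stanza: str):
    """Return one display string per GOECO xref, followed by the ECO ID."""
    eco_id = None
    goeco_xrefs = []
    for line in stanza.splitlines():
        line = line.strip()
        if line.startswith("id:"):
            eco_id = line[len("id:"):].strip()
        elif line.startswith("xref:"):
            value = line[len("xref:"):].strip()
            # Only xrefs in the GOECO namespace are shown to curators.
            if value.startswith("GOECO:"):
                goeco_xrefs.append(value)
    return [f"{xref} {eco_id}" for xref in goeco_xrefs]


print(goeco_display(STANZA))
```

The PSI-MI xref is skipped because it does not start with GOECO:, which matches the display rule stated above.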

in postgres

  1. annotations : pg_annotation_id, paperId, sentNum, gene (id), component, go (id), evidence, qualifier, with, annot, curator, p2go, timestamp
  2. sentences_classification : paperId, sentNum, classification, curator, timestamp
  3.  ?? sentences_to_files : paperId, sentNum, files-pipe-separated
  4. biblio for paper - generate once, store somewhere - could the bibliographic information be part of the sentences file, say a header or footer?
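To try out the first two tables, the sketch below uses Python's built-in sqlite3 module as a stand-in for postgres. The column types, the auto-incrementing key, and the helper name `add_annotation` are assumptions; only the table and column names come from the list above. Note that "with" is a reserved word in SQL and has to be quoted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the postgres database
conn.executescript('''
CREATE TABLE annotations (
    pg_annotation_id INTEGER PRIMARY KEY AUTOINCREMENT,
    paperId TEXT NOT NULL,
    sentNum INTEGER NOT NULL,
    gene TEXT, component TEXT, go TEXT, evidence TEXT, qualifier TEXT,
    "with" TEXT, annot TEXT, curator TEXT, p2go TEXT,
    timestamp TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE sentences_classification (
    paperId TEXT NOT NULL,
    sentNum INTEGER NOT NULL,
    classification TEXT NOT NULL,
    curator TEXT,
    timestamp TEXT DEFAULT CURRENT_TIMESTAMP
);
''')


def add_annotation(conn, paper_id, sent_num, gene, component, go,
                   evidence, curator):
    """Insert one annotation and return its sequentially assigned ID."""
    cursor = conn.execute(
        "INSERT INTO annotations "
        "(paperId, sentNum, gene, component, go, evidence, curator) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (paper_id, sent_num, gene, component, go, evidence, curator),
    )
    return cursor.lastrowid


# Placeholder paper ID and curator; two annotations from one sentence.
ids = [
    add_annotation(conn, "PMID:0000000", 1, "WB:WBGene00004804",
                   "nucleus", "GO:0005634", "IDA", "curator_a"),
    add_annotation(conn, "PMID:0000000", 1, "WB:WBGene00004804",
                   "cytoplasm", "GO:0005737", "IDA", "curator_a"),
]
# One row per selected classification allows multi-select per sentence.
conn.executemany(
    "INSERT INTO sentences_classification "
    "(paperId, sentNum, classification, curator) VALUES (?, ?, ?, ?)",
    [("PMID:0000000", 2, "runon", "curator_a"),
     ("PMID:0000000", 2, "falsepos", "curator_a")],
)
count = conn.execute(
    "SELECT COUNT(*) FROM sentences_classification").fetchone()[0]
print(ids, count)
```

Storing one row per classification, rather than a delimited string, keeps searches by classification simple.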

Mapping Files

  1. gpi file - maps gene names and synonyms to database identifiers
  2. specifications for WB gpi file

Source Files

  1. The source files will be generated by Textpresso and transferred to tazendra.
  2. specifications for source files
  3. Test files for C. elegans are on mangolassi here: /home/acedb/kimberly/ccc_2_testing/20130426_WB_test_files
  4. TODO : Need cronjob to update to map PMID to each MODs's paper ID
  5. for sentence/abstract display, get mappings of punctuation code from textpresso-dev /data2/svn-checkout-dev/textpresso2.0/branches/celegans/perlmodules/
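Since source file names are expected to follow the format date_modname_some_meaningful_text, a cronjob or the form itself could split a name into its parts as sketched below. The YYYYMMDD date format is an assumption inferred from the test directory name 20130426_WB_test_files; the names `SourceFileName` and `parse_source_file_name` are hypothetical.

```python
from datetime import date, datetime
from typing import NamedTuple


class SourceFileName(NamedTuple):
    search_date: date
    mod: str
    description: str


def parse_source_file_name(name: str) -> SourceFileName:
    """Split 'date_modname_some_meaningful_text' into its three parts."""
    parts = name.split("_", 2)  # the description may itself contain '_'
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected file name: {name!r}")
    # strptime raises ValueError if the date part is not YYYYMMDD.
    search_date = datetime.strptime(parts[0], "%Y%m%d").date()
    return SourceFileName(search_date, parts[1], parts[2])


parsed = parse_source_file_name("20130426_WB_test_files")
print(parsed.search_date.year, parsed.search_date.month, parsed.mod)
```

The parsed date gives the year, month, and day needed for a cascading menu organized by date of search, and the MOD name could drive which curators see the file.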


Testing Search Results - 20130509

Testing Form - 20130611

How the Form Works

  • This documentation is written based on using the form on a Mac running OS X 10.6.8 and Firefox 20.0.


  • Curator goes to login page, selects first name from list of curators, clicks on Login!

List Component-GO Term

  • Clicking on this button will take curators to a new page that lists the index that maps cellular component terms from the Textpresso category to GO terms.
    • The index is created solely from previous annotations and is updated when new relationships between component terms and GO terms are created from new annotations. To confirm - the mapping gets immediately updated on the curation form - is it also getting updated on the HTML page that lists the index?
    • The list of mappings shows component, GO term name, and GO term id.
    • GO term names listed in red do not map to a GO term id. This is either because the text string does not precisely match a GO term (e.g., plasma membranes), the GO term name is no longer the primary name (e.g., cell soma is now cell body), or the GO term has become obsolete (no example yet).
    • Component - GO term mappings that do not also include a valid GO id are not offered in the curation form as suggested annotations.

Selecting Source Files

  • After curators log in, they are taken to a Search page.
  • The search page allows curators to select one or more source files for annotation or to search across one or more source files.
  • The source files available for curation and searching are MOD-specific, meaning that WormBase files will load for a WormBase curator, TAIR files for the TAIR curators, and dictyBase files for the dicty curators.
  • Selecting one file for curation:
    • From the list of source files, click on the file you want for annotation and then click on Search!
    • When a curator selects a file name and clicks on Search!, the form will return the number of papers specified in the "papers to show box", but will also alert the curator to the total number of papers and sentences that matched the search criteria. For example, if you enter 10 papers (the current default setting) and select a source file that has 14 papers, the first 10 papers and their associated sentences will show along with a message saying: The above search has 14 papers with 50 sentences, here are 10 papers :. This indicates to the curator that there are additional papers and sentences that matched the search criteria.
    • The default number of papers to show is 10, but if your file has more than that, you can change the number of papers to show in the "papers to show" box to whatever value you want and then press Search! again.
    • Papers are listed by paper identifier (usually PMID) in ascending order.
    • Sentences are currently listed according to their position in the source file which I believe is alphabetical based upon section title, e.g., abstract, discussion, references, results. I'd like Yuling to change this in the source file so that the sentences are sorted according to the order in which they appear in the paper, e.g., abstract, results, discussion, references.
  • Selecting more than one source file for curation:
    • Curators can select more than one source file for curation.
    • To select two consecutive files, click on the first file, hold down the shift key, and then click on the next file.
    • To select non-consecutive files, click on the first file, hold down the command (control on PC and Linux) key, and then click on the second file.
    • To select all files in the list, click on the first file, hold down the shift key and then click on the last file in the list.
  • Filtering the sentences shown from selected source files
    • Curators can filter what types of sentences, with respect to curation, are shown when selecting a source file by using the sentence-curation dropdown. There are three options:
      • Search all sentences will return all sentences in a file, both curated and uncurated.
      • Exclude curated will return only those sentences that haven't been curated yet.
      • Exclude uncurated will return only those sentences that have already been curated.
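The behavior of the sentence-curation dropdown and the "papers to show" box can be summarized in a short sketch. The function names, the mode strings, and the dictionary layout of a sentence are hypothetical; the message text follows the example given above, and the ordering here is a plain string sort of the paper identifiers.

```python
def filter_sentences(sentences, mode="all"):
    """Apply the sentence-curation dropdown to a list of sentence records."""
    if mode == "all":
        return list(sentences)
    if mode == "exclude_curated":
        return [s for s in sentences if not s["curated"]]
    if mode == "exclude_uncurated":
        return [s for s in sentences if s["curated"]]
    raise ValueError(f"unknown mode: {mode}")


def first_papers(sentences, papers_to_show=10):
    """Pick the first papers in ascending ID order and build the summary line."""
    paper_ids = sorted({s["paper"] for s in sentences})
    shown = paper_ids[:papers_to_show]
    message = (f"The above search has {len(paper_ids)} papers with "
               f"{len(sentences)} sentences, here are {len(shown)} papers :")
    return shown, message


# Made-up data: 50 sentences spread over 14 papers.
sentences = [{"paper": f"PMID:{1000 + i % 14}", "curated": i % 2 == 0}
             for i in range(50)]
shown, message = first_papers(filter_sentences(sentences))
print(message)
```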

Searching for Specific Data

  • The form also allows curators to search for specific aspects of GO data such as paper, GO term, curator, etc.
  • Data type searches can be performed on one or more source files.
  • As with selecting source files, data type searches can be filtered to exclude curated or uncurated data.
  • Depending on what is being searched, the search will either look through the Textpresso source file, the Textpresso source file plus annotations, or just the annotations.
  • Searches that will be performed on Textpresso data (meaning there are no free text entries allowed for this data type)
    • Gene Product Search
      • The gene product search is case-insensitive.
      • The gene product search will match substrings.
        • SKN-1 - returns sentences that refer to the SKN-1 gene product.
        • skn-1 - returns sentences that refer to the SKN-1 gene product.
        • 804 - returns sentences that refer to the SKN-1 gene product's corresponding WBGene ID, WBGene00004804.
        • 04 - returns sentences that refer to the SKN-1 WBGene ID, WBGene00004804; the RHO-1 WBGene ID, WBGene00004357; the PKC-3 WBGene ID, WBGene00004034; the SMF-3 WBGene ID, WBGene00004878; the DDR-1 WBGene ID, WBGene00016104; the UNC-97 UniProtKB ID, P50464; the CEH-13 WBGene ID, WBGene00000437; the CEH-20 WBGene ID, WBGene00000443; the SNB-1 WBGene ID, WBGene00004897, etc.
    • Paper
      • Each Textpresso source file will include paper IDs.
      • Most of the time, the IDs will be PMIDs, but we could also have dois and MOD-specific IDs.
      • Note that only annotations using a PMID or doi can be sent to Protein2GO; annotations made with MOD-specific IDs will not be accepted by Protein2GO.
      • The paper search is case-insensitive.
      • The paper search will match substrings.
        • Searching on 2306 - returns all paper identifiers that contain the numbers 2306.
        • Searching on PMID:2306 - returns all paper identifiers that begin with PMID:2306
        • Searching on pmid:2306 - returns all paper identifiers that begin with PMID:2306
  • Searches that can be performed on Textpresso data or annotated data (meaning that these data types can have new data entered in addition to what came with the Textpresso source files)
    • Component
      • The component search is case-insensitive.
      • The component search can match a substring.
        • Searching on cytoplasmic - returns sentences for which the Textpresso or annotated component term is cytoplasmic.
        • Searching on cytopl - returns sentences for which the Textpresso or annotated component term is cytoplasm or cytoplasmic, or any other component term that contains the text string 'cytopl'.
    • GO Term
      • GO Term searches can be performed using a GO ID or a GO Term Name. Is this correct, specifically the part about the ID search?
      • The GO Term search is case-insensitive.
      • The GO term search can match a substring.
        • Searching on the term neuron - returns sentences associated with the GO term neuron projection.
        • Searching on the term Synaptic - returns sentences associated with GO terms such as synaptic vesicle, presynaptic active zone, etc.
        • Searching on the GO ID
  • Searches that can be performed on annotated data only (these data don't exist until an annotation is created)
    • Annotation Curator
      • The annotation curator search is case-insensitive.
      • The annotation curator search can match a substring.
    • Annotation Date
      • The annotation date is in the form of a time stamp.
      • The annotation date search can match a substring.
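All of the searches above share the same matching rule: case-insensitive substring matching. A minimal sketch, assuming the searchable values are plain strings; the function name `substring_match` is hypothetical and the paper identifier is a placeholder.

```python
def substring_match(query, values):
    """Return the values that contain the query, ignoring case."""
    needle = query.lower()
    return [value for value in values if needle in value.lower()]


# Identifiers taken from the gene product examples above.
gene_values = ["SKN-1", "WBGene00004804", "RHO-1", "WBGene00004357", "P50464"]
print(substring_match("skn-1", gene_values))
print(substring_match("804", gene_values))
print(substring_match("04", gene_values))
print(substring_match("pmid:2306", ["PMID:23060000", "PMID:19990000"]))
```

As the "04" example shows, short queries match many identifiers, so curators will usually want to search on a full name or ID.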

Sentence Classification

  • Curators can select as many as apply.
  • Data is stored in the ccc_sentenceclassification table.
  • False positive (falsepos) - The sentence does not refer to subcellular localization at all.
  • Positive for localization, but not for GO (poslocneggo) - The sentence refers to subcellular localization but would not typically be curated for GO. Examples of this include negative results or localization in a mutant or otherwise non-wild type background.
  • Run-on sentence (runon) - The returned sentence is actually two or more complete sentences strung together. Run-on sentences can be true or false positives.
  • Scrambled sentence (scrambled) - The returned sentence includes words from one or more sentences but they may be incomplete and the contents scrambled with other text. This category includes Tables.
  • Curated - if an annotation is made from a sentence, the sentence is automatically regarded as curated. Do we add anything to the ccc_sentenceclassification table for curated sentences?

Sending Annotations to Protein2GO via Web Services

  • Annotations made using the CCC tool can be sent to UniProtKB's Protein2GO annotation tool via web services.
  • Annotations successfully sent to Protein2GO will return an annotation ID and will then be entered into Caltech's curation database.
  • Annotations not successfully sent to Protein2GO will return an error message and will not be entered into Caltech's curation database.
  • The curator will need to see the error message to know that the annotation failed.
  • The CCC tool can send both IDA and IPI annotations; IPI is used for annotations made to the CC term protein complex or its children.
  • To send an IPI annotation, there must be a properly formatted entry in the With/From field of the CCC form.
    • Properly formatted With/From entries are those that contain a valid database prefix, a colon, and then a valid database accession or id.
      • Example of a properly formatted With/From entry - UniProtKB:O16850
  • Additional details on Protein2GO's web services are here:
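The With/From formatting rule (a database prefix, a colon, then an accession, with multiple entries pipe-separated) could be checked before an IPI annotation is sent. This sketch only checks the shape of each entry; it cannot tell whether a prefix or accession is actually valid. The function name and the `require_uniprot` option are hypothetical.

```python
import re

# prefix (no whitespace, pipe, or colon), a colon, then the accession
ENTRY = re.compile(r"[^\s|:]+:[^\s|]+")


def check_with_field(value: str, require_uniprot: bool = False):
    """Return a list of problems found in a pipe-separated With/From value."""
    problems = []
    for entry in value.split("|"):
        if not ENTRY.fullmatch(entry):
            problems.append(f"malformed entry: {entry!r}")
        elif require_uniprot and not entry.startswith("UniProtKB:"):
            problems.append(f"not a UniProtKB identifier: {entry!r}")
    return problems


print(check_with_field("UniProtKB:O16850"))
print(check_with_field("UniProtKB:O16850|UniProtKB:P50464"))
print(check_with_field("O16850"))
```

Setting `require_uniprot=True` would apply the stricter rule for annotations going to Protein2GO, where identifiers in the With box should take the form UniProtKB:nnnnn.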
