CCC Form 2.0 Specifications

This page documents specifications for the next version of the Textpresso for Cellular Component Curation (CCC) tool. The changes to the tool and pipeline were suggested by curators and are also part of the broader plan for Textpresso-based curation pipelines and the GO's Common Annotation Framework.

Tool Features

Textpresso search specifications

Frequency of searches - this will vary by group (e.g., weekly for dictyBase)
Corpus - this will also vary by group
Categories - gene/protein name, CCC, assay term, verb
Filtering (Textpresso)

Journal
Date
Document IDs

Filtering (non-Textpresso)

SVM
Gene Ontology Gene Association File

Ranking search results - e.g., highest scoring papers presented first
Naming search results file
Storing search histories

Recording versions of pdf2text conversion
Recording version of categories used
Recording search criteria, i.e. categories, corpus, filters
Recording curator or group and date of search
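
The search-history items above could be captured in a single record per search. The following is only a sketch: the class name `SearchRecord`, its field names, and the JSON serialization are assumptions made for illustration, not part of the current tool.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass(frozen=True)
class SearchRecord:
    """One stored search (hypothetical layout, field names are assumptions)."""
    results_file: str            # name of the search results file
    curator_or_group: str        # who ran the search
    search_date: date            # when the search was run
    corpus: str                  # corpus searched, varies by group
    categories: tuple            # e.g. gene/protein name, CCC, assay term, verb
    filters: dict = field(default_factory=dict)  # journal, date, document IDs, SVM, GAF
    pdf2text_version: str = ""   # version of the pdf2text conversion
    categories_version: str = "" # version of the categories used

    def to_json(self) -> str:
        """Serialize the record so it can be stored alongside the results file."""
        data = asdict(self)
        data["search_date"] = self.search_date.isoformat()
        return json.dumps(data, sort_keys=True)
```

Storing the record in a plain serialized form like this would let the form show, for any results file, exactly which corpus, categories, and filters produced it.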

Curation form

The form presents sentences for annotation - some sentences will lead to a GO annotation, some will not
Not all sentences will be classified (although it'd be great if they were)
What data to store for a GO annotation:

Name of search results file
Paper identifier
Gene/gene product identifier
Textpresso component term
GO component term
Evidence code
Sentence ID
Sentence classification
Curator
Annotation date
Annotation history

What data to store if no GO annotation:

Name of search results file
Paper identifier
Sentence ID
Sentence classification
Curator
Annotation date
Annotation history
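
The two field lists above differ only in the GO-specific fields, so one record type with optional GO fields might cover both cases. This is a minimal sketch under that assumption; `SentenceRecord` and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class SentenceRecord:
    """One classified sentence (hypothetical layout, names are assumptions)."""
    # Stored for every classified sentence
    results_file: str
    paper_id: str
    sentence_id: str
    classification: str
    curator: str
    annotation_date: date
    history: list = field(default_factory=list)
    # Stored only when the sentence leads to a GO annotation
    gene_id: Optional[str] = None
    textpresso_term: Optional[str] = None
    go_term: Optional[str] = None
    evidence_code: Optional[str] = None  # the form would default this to IDA

    def has_go_annotation(self) -> bool:
        return self.go_term is not None

    def validate(self) -> None:
        """Require the GO-specific fields to be all set or all empty."""
        go_fields = (self.gene_id, self.textpresso_term,
                     self.go_term, self.evidence_code)
        filled = [value is not None for value in go_fields]
        if any(filled) and not all(filled):
            raise ValueError("GO annotation fields must be all set or all empty")
```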

Curator login
Import of search results files

Can this be automated? dictyBase searches will be run weekly; can the results files be automatically transferred to tazendra?

Organization of search results file

If we have many results files, perhaps we can organize them in a neater way than just one long list. See the Textpresso categories for an example of cascading menus, one possible solution.

Selection of search results file for curation
Display of paper bibliographic information

This refers both to adding more information than we currently display and to how we display it. I think we could get the additional information from Textpresso and then tidy up the display by adding some spacing, some bold text, etc. See PubMed for one possible example: [1]

Search functionality on form - this includes some new features

Gene - search for all sentences that mention a specified gene and/or synonym (all sentences or specific sentence classification)
Paper - search for sentences from a paper (all or specific sentence classifications)
Curator - search for all sentences classified by a given curator (all or specific sentence classification)
Annotation date - search for all work done for a given date (use wild cards)
Component term in sentence - search for all sentences that matched a given Textpresso component term (all or sentence classification)
GO term used for annotation - search for all sentences that used a specific GO term for annotation (Curate or Already Curated)
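
The searches listed above all amount to filtering stored sentence records on one field, optionally restricted to a sentence classification. The sketch below assumes records are plain dictionaries with hypothetical keys such as `gene`, `paper_id`, `curator`, `annotation_date`, and `go_term`; wild cards are handled with the standard library's `fnmatch` module.

```python
from fnmatch import fnmatchcase


def search_sentences(records, classification=None, **criteria):
    """Return the records that match every criterion.

    Each criterion is a field name and a pattern; patterns may use
    shell-style wild cards, e.g. annotation_date="2010-01-*".
    """
    matches = []
    for rec in records:
        # Optional restriction to a single sentence classification
        if classification is not None and rec.get("classification") != classification:
            continue
        if all(fnmatchcase(str(rec.get(key, "")), str(pattern))
               for key, pattern in criteria.items()):
            matches.append(rec)
    return matches
```

For example, `search_sentences(records, classification="Curate", curator="curator1")` would return only the sentences that a given curator marked as Curate. Note that `fnmatch` also treats `[` and `]` as pattern characters, so names containing brackets would need escaping.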

Curation when all entities are recognized - straightforward
Curation when one or more entities are not recognized - add a value to either of the first two columns

Enter a new gene name and database identifier
Enter a new component term in sentence

Feedback from form to Textpresso

Add gene name or synonym plus database identifier
Add component term to Textpresso cellular component category

Evidence codes

IDA (default), IPI (complex membership)

Sentence classification

Curate - select one or more entities from each column, will add new GO annotations to database
Already curated - select one or more entities from each column, will record as already annotated (open question: mark in red, filter out, or show in a different location on the page?)
Scrambled sentence
Run-on sentence
Positive for localization, not for GO curation (formerly "not GO curatable")
False positive

Edit a previous annotation

Change gene annotated, change component term used, change GO term assigned, change evidence code

Edit relationship index

This would be a separate functionality, but would allow a curator to view and edit the relationship index if needed.

Delete a search results file

This could be tricky. We'd need to make sure there are no annotations associated with that search file.
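
One way to make deletion safe is to refuse it whenever any stored annotation still points at the file. The following sketch assumes annotations are dictionaries with a hypothetical `results_file` key and that the available results files are kept in a list; both are illustrative assumptions.

```python
def can_delete_results_file(results_file, annotations):
    """True only if no annotation refers to this results file."""
    return not any(a.get("results_file") == results_file for a in annotations)


def delete_results_file(results_file, files, annotations):
    """Remove the file from the list of results files, or raise if it is in use."""
    if not can_delete_results_file(results_file, annotations):
        raise ValueError(f"{results_file} still has annotations associated with it")
    files.remove(results_file)
```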

Export annotations

To a MOD
To Protein2GO
As a file - GO Gene Association File (GAF)
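
For the file export, each annotation would become one tab-delimited GAF line. The sketch below assumes the 17-column GAF 2.2 layout and should be checked against the current GAF specification before use; the function name, its parameters, and the default `located_in` qualifier are illustrative assumptions.

```python
from datetime import date


def gaf_line(db, object_id, symbol, go_id, reference, evidence,
             taxon_id, annotation_date, assigned_by,
             qualifier="located_in", with_from="", object_type="protein"):
    """Build one GAF line for a cellular component annotation.

    Column order is assumed to follow GAF 2.2; optional columns are left empty.
    """
    columns = [
        db,                                  # 1  DB
        object_id,                           # 2  DB Object ID
        symbol,                              # 3  DB Object Symbol
        qualifier,                           # 4  Qualifier (assumed default)
        go_id,                               # 5  GO ID
        reference,                           # 6  DB:Reference
        evidence,                            # 7  Evidence Code (IDA or IPI)
        with_from,                           # 8  With/From (expected for IPI)
        "C",                                 # 9  Aspect: cellular component
        "",                                  # 10 DB Object Name
        "",                                  # 11 DB Object Synonym
        object_type,                         # 12 DB Object Type
        f"taxon:{taxon_id}",                 # 13 Taxon
        annotation_date.strftime("%Y%m%d"),  # 14 Date
        assigned_by,                         # 15 Assigned By
        "",                                  # 16 Annotation Extension
        "",                                  # 17 Gene Product Form ID
    ]
    return "\t".join(columns)
```

IPI annotations for complex membership would normally also fill the With/From column, and may call for a different qualifier than the default assumed here.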

Files needed

Mapping file for gene names and synonyms to MOD identifier and UniProtKB identifier

GO's gpi file format would have all of the information we need
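
A mapping from gene names and synonyms to identifiers could be built directly from a gpi file. The sketch below assumes the tab-delimited GPI 1.2 column layout (synonyms in column 5 and cross-references in column 9, both pipe-separated) and should be verified against the header of the actual file; the function name and the returned structure are assumptions.

```python
def load_gpi_mapping(lines):
    """Map each lower-cased symbol or synonym to (MOD identifier, UniProtKB xref).

    A name can be shared by several objects, so each name maps to a list.
    """
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):   # skip blank and header lines
            continue
        cols = line.split("\t")
        cols += [""] * (10 - len(cols))         # pad short lines
        db, object_id, symbol = cols[0], cols[1], cols[2]
        synonyms = [s for s in cols[4].split("|") if s]
        # First UniProtKB cross-reference, if the line has one
        uniprot = next((x for x in cols[8].split("|")
                        if x.startswith("UniProtKB:")), None)
        for name in [symbol, *synonyms]:
            mapping.setdefault(name.lower(), []).append((f"{db}:{object_id}", uniprot))
    return mapping
```

Keeping a list per name makes ambiguous synonyms visible, so the form could ask the curator to choose rather than silently picking one identifier.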

Other issues

What to do with old data - can we map all old data onto new tables? Some information may be missing in old data; is that okay?
