Ontology Annotator

From WormBaseWiki

Description

The Ontology Annotator (OA) is a web-based curation tool developed by WormBase. It is used for a variety of curation tasks, including curating phenotypes, attaching GO terms to genes, and recording genetic interactions, transgenes, free-text descriptions of genes, and several other data types. The OA is built on CGI, JavaScript, and a PostgreSQL database. Because it runs in the browser, it avoids problems caused by operating system differences and does not require a user to install any dependent software. The OA is similar in many ways to Phenote (phenote.org). It consists mainly of an Editor, where new data are entered and existing data can be queried and edited; a Data-Table for data review; and a Term-Information panel that displays information about a term, such as IDs and synonyms. Features include term autocomplete from pre-loaded ontologies, fast AJAX loading of terms, saving to and querying from a PostgreSQL database, duplicating data, editing several lines of data at once, and filtering the data. The display can be customized to some extent: columns in the Data-Table can be rearranged by dragging, and the width of each column can be adjusted. For complex data types with several fields, the OA allows a tabbed organization.

List of WormBase OA configurations:

  • Antibody
  • Concise description
  • Disease
  • Expression pattern
  • Gene class
  • Gene ontology
  • Gene regulation
  • Interaction
  • Molecule
  • Movie
  • Phenotype
  • Picture
  • Process Term
  • RNAi
  • Topic
  • Transgene


The OA uses:

  • Perl CGI.
  • Yahoo!'s YUI library and a local JavaScript file.
  • PostgreSQL database backend (could probably be modified to work with other SQL databases).
  • Apache webserver.
  • Documentation for main CGI, JavaScript, and modules: OA docs

The above description can also be found here: Web-page for OA

Wish List

  1. Include dependencies wherever possible. For example, if making an IMP annotation for a given gene, have a gene-specific drop-down menu of alleles or RNAi experiments for the WITH column. Or, if making an IGI annotation for a paper, have a drop-down list of all genes mentioned in the paper. Similarly, when entering a GO term, have the ontology (P, F, or C) entered automatically. (From Curation Interface Meeting)
  2. Term information window: the information shown should reflect where the cursor is placed in the editor window, e.g. when the cursor is in the Paper field, the window should show paper info.


Batch upload to OA from tab-delimited file

[New as of November 2013] A script has been written that will allow curators to upload data in bulk to the OA through the submission of a properly formatted tab-delimited (TSV) file. The script is located on Mangolassi/Tazendra at:

/home/postgres/public_html/cgi-bin/oa/scripts/populate_oa_tab_file/populate_oa_tab_file.pl

Usage

Enter (cd into) the directory with the script and run by entering:

./populate_oa_tab_file.pl mangolassi #### testfile.tsv

to enter data into the Sandbox OA (on Mangolassi), where '####' is the curator's WBPerson ID and 'testfile.tsv' is a test upload file used to make sure everything is formatted properly.

./populate_oa_tab_file.pl tazendra #### realfile.tsv

to enter data into the Live OA (on Tazendra), where '####' is the curator's WBPerson ID and 'realfile.tsv' is the real upload file that has been successfully tested on the sandbox (Mangolassi).

Note: It is very important to test your batch upload file on the sandbox first, as there are many possible formatting errors (mis-spelled names, IDs, etc.). If there are any mistakes once the data have been uploaded to the OA on Tazendra, each entry will need to be edited manually, one by one.
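The two invocations above share one argument order: server name, WBPerson ID, then file. As a minimal sketch only (the helper name build_upload_command and its checks are hypothetical and are not part of the OA code), a small Python wrapper could assemble the command and reject anything other than the two documented server names. It assumes that '####' stands for the numeric part of the WBPerson ID.

```python
SCRIPT = "./populate_oa_tab_file.pl"
SERVERS = ("mangolassi", "tazendra")  # sandbox, live


def build_upload_command(server, wbperson_id, tsv_path):
    """Return the argument list for the batch upload script (hypothetical helper)."""
    if server not in SERVERS:
        raise ValueError(f"unknown server {server!r}; expected one of {SERVERS}")
    person = str(wbperson_id)
    # Assumption: '####' in the usage lines means digits only.
    if not person.isdigit():
        raise ValueError("WBPerson ID should be numeric")
    return [SCRIPT, server, person, tsv_path]


# 1234 is a placeholder ID, not a real curator.
print(" ".join(build_upload_command("mangolassi", 1234, "testfile.tsv")))
# prints: ./populate_oa_tab_file.pl mangolassi 1234 testfile.tsv
```

The returned list could be passed to subprocess.run from inside the script's directory. Building the sandbox command first keeps the test-before-live order visible in whatever wrapper is used.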

OAs capable of accepting bulk upload

As of November 2013, the OAs that can accept bulk uploads via this method are as follows:

  • Interaction
  • Phenotype
  • Process Term
  • RNAi
  • Topic
  • Transgene

Tab-delimited (TSV) file format

It is important that the TSV file be formatted properly. Each column header must be the name of a Postgres table into which data will be uploaded. All column headers on a single form should be Postgres table names for the same OA, so that each row in the spreadsheet/TSV file becomes a single entry (with a unique PGID) in the OA/Postgres.

Note: Every entry in a cell below a column header/Postgres table name MUST be entered EXACTLY as that table/field expects it, i.e. for an 'ontology' or 'multiontology' field, a drop-down menu, or any other controlled vocabulary field, the entries must be formatted appropriately. If these data are not entered in the proper format, they will still be written to the Postgres tables (incorrectly) but will not necessarily show up in the OA, making it difficult to track down erroneous data entries.

Any mis-spelled multi-ontology entries will be ignored by the OA, although the data WILL get written to Postgres. The Postgres table will be overwritten with what is seen in the OA only when the entire object is queried out in the OA and an actual change is made to the field (by adding or deleting something).

Mis-spelled single-ontology entries will show up in the OA (and be written to Postgres) AS IS and will need to be fixed manually. Note that the OA might still try to recognize the entity: for example, if the bogus paper ID "WBPaper00012X45" is entered, it will pull up term info for "WBPaper00000012" as a best guess.
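Because formatting mistakes are hard to track down after upload, a quick structural check before running the script may help. The following is only a sketch under stated assumptions: check_tsv_shape is a hypothetical helper, the header names in the example are placeholders rather than real Postgres tables, and the check does not verify that headers are valid table names or that controlled vocabulary values are spelled correctly. It reports empty headers and rows whose cell count differs from the header. Whether the upload script itself tolerates such rows is not stated here.

```python
def check_tsv_shape(text):
    """Return a list of structural problems in tab-delimited text (empty if none)."""
    problems = []
    header = None
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Skip blank lines (an assumption) and '#' comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cells = line.split("\t")
        if header is None:
            # The first remaining line is treated as the header row.
            header = cells
            if any(not name.strip() for name in header):
                problems.append(f"line {line_no}: empty column header")
            continue
        if len(cells) != len(header):
            problems.append(
                f"line {line_no}: {len(cells)} cells, expected {len(header)}"
            )
    if header is None:
        problems.append("no header row found")
    return problems


# Placeholder header names and values, for illustration only.
sample = "xyz_name\txyz_paper\nvalue1\tvalue2\n# skipped\nvalue3\n"
print(check_tsv_shape(sample))
# prints: ['line 4: 1 cells, expected 2']
```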

Google spreadsheet for generating TSV file

Any tab-delimited (TSV) file should work, but in order to standardize the submission process, a Google spreadsheet has been generated with drop down fields in each header row for selecting the correct Postgres table name:

https://docs.google.com/spreadsheet/ccc?key=0AgaLIpaBTJmSdEVoTXFLdHROOXQ1QzlkZ1VhYVpGMFE#gid=0

[Screenshot: Google Spreadsheet for OA Batch Upload, 11-19-2013]

Above is a screenshot of the Google spreadsheet form. To download the form as a TSV file, select "Download As" from the "File" menu and select "Plain Text".

[Screenshot: Google Spreadsheet Download as TSV, 11-19-2013]

Even though the file menu suggests that the file will download as a "*.txt" file, it should download the spreadsheet in tab-delimited (TSV) file format which can be directly uploaded to the OA through the batch OA submission script.

Commenting out lines

Any lines in the uploaded TSV file that begin with a hash symbol '#' will be 'commented out' and thus ignored by the script. This may be useful for testing or other purposes. To comment out lines from the Google spreadsheet, just enter a '#' as the first character in the first column of a given row.
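The comment rule can be illustrated with a short Python sketch. This is not the upload script's own code (that script is written in Perl and its exact test is not shown here); active_lines is a hypothetical name, and the sketch assumes the '#' must be the very first character of the line.

```python
def active_lines(lines):
    """Yield only the lines that do not start with '#'."""
    for line in lines:
        if not line.startswith("#"):
            yield line


# Placeholder rows, for illustration only.
rows = ["col_a\tcol_b", "#old\tentry", "new\tentry"]
print(list(active_lines(rows)))
# prints: ['col_a\tcol_b', 'new\tentry']
```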


Large text fields

There are a few points to make about large text fields (e.g. Remark):

  1. Try to avoid entering literal line breaks ('\n') in large text field submissions. Such line breaks may cause the script to crash.
  2. There is no need to enter quotes (single or double) around a large text field entry.
  3. For "DNA Text" fields that accept 'ATGC...' etc., multiple entries ought to be entered with no quotes and separated by a <space><pipe><space> so that they are recognized as distinct entries. So, for example, two DNA text entries should be entered like this:
CAGTATATCGGAGCTGAGTAGCTGATCACAGGACTGTAGCGTCAGT | CTAGTGATGCTTGATCGTATAGTCCGTAC

and not like this:

CAGTATATCGGAGCTGAGTAGCTGATCACAGGACTGTAGCGTCAGT|CTAGTGATGCTTGATCGTATAGTCCGTAC

or this:

"CAGTATATCGGAGCTGAGTAGCTGATCACAGGACTGTAGCGTCAGT" | "CTAGTGATGCTTGATCGTATAGTCCGTAC"

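The three points above can be checked mechanically before upload. The sketch below is illustrative only: check_dna_field is a hypothetical helper, not part of the OA, and it treats any quote character in a DNA text field as a problem, which is an assumption that goes slightly beyond the guidance above. It splits the field on <space><pipe><space> and reports line breaks, quotes, and pipes that lack the surrounding spaces.

```python
def check_dna_field(value):
    """Split a DNA text field into entries and list formatting problems."""
    problems = []
    if "\n" in value or "\r" in value:
        problems.append("contains a line break")
    if '"' in value or "'" in value:
        problems.append("contains quotes")
    # Distinct entries are separated by <space><pipe><space>.
    entries = value.split(" | ")
    # A pipe left inside an entry means the separator was not spaced correctly.
    if any("|" in entry for entry in entries):
        problems.append("pipe not surrounded by single spaces")
    return entries, problems


# Short placeholder sequences, for illustration only.
print(check_dna_field("ATGC | GGCA"))
# prints: (['ATGC', 'GGCA'], [])
print(check_dna_field("ATGC|GGCA"))
# prints: (['ATGC|GGCA'], ['pipe not surrounded by single spaces'])
```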
