Difference between revisions of "OA and scripts for disease data"

From WormBaseWiki
Revision as of 21:27, 18 September 2014

Ontology Annotator for disease data

Note: All disease_relevant descriptions have been removed from the concise descriptions OA and moved to the disease OA.

Fields

One gene can be attached to more than one Experimental_Model and one Disease_Relevance (and their related papers, databases and species); these are grouped together in one instance of the Editor and in one line of the data table. This is similar to a gene being attached to more than one GO term. If a gene needs to be attached to an unrelated disease, enter all data on a new line by hitting 'New' in the OA.

Editor:

Field 1 Name: (dis_wbgene) WBGene
Behavior of field: Autocomplete obo
Source: WBGene obo
Similar to: WBGene in the GO OA or concise descrips OA
As one starts typing a locus name (e.g., lin-10) or a cosmid name (e.g., C09H6), the script autocompletes and fills in the WBGene ID.

Q: So single value, not multiple?
A: Single value.

Field 2 Name: (dis_curator) Curator
Behavior of field: Auto-complete drop-down with ready values
Similar to: Curator field in GO OA

Field 3 Name: (dis_curhistory) Curator History
Behavior of field: Same as in the concise OA; this is not something that can be changed manually.
Similar to: Concise OA

Field 4 Name: (dis_humandoid) Experimental model for
Behavior:Autocomplete obo
Obo file to be used: DO_term obo
Source: https://diseaseontology.svn.sourceforge.net/svnroot/diseaseontology/trunk/HumanDO.obo
Similar to: GO term field in the GO OA.
For example, the curator starts typing 'Alz', picks 'Alzheimer's disease' from the drop-down, and the script populates the field with 'Alzheimer's disease (DOID:10652)'; similar to the GO term field in the GO OA.

Q: Updating: how do we update this obo file, and how frequently do other obo files get updated?
A: Every day at 8pm. If it has the proper .obo format, it should be easy to add to the cronjob that picks them up:
/home/postgres/work/pgpopulation/obo_oa_ontologies/update_obo_oa_ontologies.pl

Q: Single value / multivalue ?
A: Multiple value, as I may need to attach more than one DO term to a gene.

Field 5 Name: (dis_paperexpmod) Paper for Exp Mod
Behavior: Autocomplete obo
Obo file to be used: WBPaper obo
Similar to: The Paper field in the GO OA

Q: You mean the papers in the paper editor?
A: The Paper obo; I guess they all come from the Paper Editor.

Q: Single/multi?
A: Multi value.

Field 6 Name: (dis_dbexpmod) Database for Exp Mod (now called 'OMIM disease for Exp Mod')
Behavior: Free text, multiple values comma-separated

Q: Will they dump on separate lines in the output? Usually those are pipe-separated. If they'll dump literally as pasted in, then commas are good.
A: Per the latest conversation, using commas is fine as long as there will never be a comma in the data itself, which is unlikely because these are OMIM IDs.

Q:Do you want to use the same list as everyone else, and add new values to it (if they're okay with those values) ?
A: Yes, I just spoke to Daniela and adding the value 'Homo sapiens' is fine with her, if you want I can e-mail the group, but she felt adding needed values was fine.

Field 7 Name: (dis_species) Species
Behavior: Auto-complete drop-down with ready values
Similar to: Project field in the GO OA
Current values: Homo sapiens

Field 8 Name: (dis_lastupdateexpmod) Last Updated for Exp Model
Script autopopulates the date when data is entered on a new line, i.e. when the "New" button is used.

Field 9 Name: (dis_diseaserelevance) Disease relevance
Behavior: Big Text box (big text-box, keeps expanding)
Similar To: 'Description Text' field in the Concise OA.
This is the Human_disease_relevance description, which appears as one of the drop-down values for the 'Description Type' field in the 'Concise' OA.
Change needed: Human_disease_relevance will not be entered via the concise OA. We can remove 'Human_disease_relevance' from the 'Description Type' field in the concise OA.

Q: Do we start this OA by populating it from existing data in the GO OA?
A: You mean existing data in the concise description OA. We can, if that's the way you want to start, or we can do it later.

Q: If so, let me know how to transfer the data.
A: So for any given 'Human disease relevance' description in the concise OA the transfer from Concise OA to Gene-disease OA is as follows:
WBGene-->WBGene
Curator-->Curator
Curator History-->Curator History
Description Text (Human Disease Relevance)-->Disease Relevance
Reference-->Reference under Disease Relevance
Accession Evidence-->OMIM Database
Last Updated-->Last updated
PGID-->PGID
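
The mapping above can be read as a simple relabelling of fields. The following is a minimal sketch only, not the actual transfer_concise_disease.pl: it assumes each Concise OA record is available as a hash keyed by the field labels used in the list, and the helper name transfer_record is hypothetical. Only records whose description type is Human_disease_relevance are transferred.

```perl
use strict;
use warnings;

# Field mapping copied from the list above: Concise OA field => disease OA field.
my %concise_to_disease = (
    'WBGene'             => 'WBGene',
    'Curator'            => 'Curator',
    'Curator History'    => 'Curator History',
    'Description Text'   => 'Disease Relevance',
    'Reference'          => 'Reference under Disease Relevance',
    'Accession Evidence' => 'OMIM Database',
    'Last Updated'       => 'Last updated',
    'PGID'               => 'PGID',
);

# Hypothetical helper: relabel one Concise OA record (hash ref keyed by field
# label) as a disease OA record. Returns undef for any other description type.
sub transfer_record {
    my ($concise) = @_;
    my $type = $concise->{'Description Type'};
    return undef unless defined $type && $type eq 'Human_disease_relevance';
    my %disease;
    foreach my $field (keys %concise_to_disease) {
        next unless defined $concise->{$field};    # skip fields with no data
        $disease{ $concise_to_disease{$field} } = $concise->{$field};
    }
    return \%disease;
}

my $example = transfer_record({
    'Description Type' => 'Human_disease_relevance',
    'WBGene'           => 'WBGene00003058',
    'Description Text' => 'example description text',
    'Reference'        => 'WBPaper00038373',
});
print "$_ => $example->{$_}\n" foreach sort keys %$example;
```

The real transfer works on the con_ and dis_ postgres tables rather than on field labels; the labels are used here only to keep the sketch independent of table names.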

Field 10 Name: (dis_paperdisrel) Paper for Disease Rel
Behavior: Autocomplete obo
Obo file to be used: WBPaper obo
Similar to: The Paper field in the GO OA

Q: So there are two paper fields. Are they both required, must there be at least one, or is nothing required?
A: Both are required.
Q: Single/multi value?
A: Multivalue

Field 11 Name: (dis_dbdisrel) Database for Disease Rel (now called, OMIM disease for Disease Rel)
Behavior: Free text, multiple values comma-separated

Q:Same as xref Database, but a different field ?
A: Exactly, again I will pipe-separate multiple values.

Field 12 Name: OMIM gene for Disease Rel, Free text, comma separated

Field 13 Name: (dis_lastupdatedisrel) Last Updated for Disease Rel
Behavior: Script fills in the current date for a new annotation; when changed manually, the date is entered as YYYY-MM-DD.
Script autopopulates the date when it's a new data line.
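
As an illustration of the date behavior described for the 'Last Updated' fields, here is a minimal sketch. It is not the OA's own code; the helper name last_updated_value is hypothetical, and rejecting a badly formatted manual date is an assumption about how the check might work.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: return the date to store in a "Last Updated" field.
# A new line gets today's date; a manual change must already be YYYY-MM-DD.
sub last_updated_value {
    my ($is_new, $manual) = @_;
    return strftime('%Y-%m-%d', localtime) if $is_new;
    return $manual if defined $manual && $manual =~ /^\d{4}-\d{2}-\d{2}\z/;
    return;    # anything else is rejected (assumed behavior)
}

print last_updated_value(1), "\n";                  # today's date
print last_updated_value(0, '2014-09-18'), "\n";    # manual date kept as entered
```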

Field 14 Name: dis_comment Comment
Behavior: Free text

Field 15 Name: pgid

Data constraints

For curators only, at the tool level, to check whether required fields are filled.
These dis_ tables: wbgene curator humandoid paperexpmod species diseaserelevance paperdisrel lastupdatedisrel
WBGene
Curator
Experimental model for
Paper for Exp Mod
Species
Disease relevance
Paper for Disease Rel
Last Updated
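
A required-field check of this kind could look like the sketch below. It assumes one OA row is available as a hash keyed by the table suffixes listed above; the helper name missing_required is hypothetical and this is not the OA's actual constraint code.

```perl
use strict;
use warnings;

# Required dis_ tables, in the order listed above.
my @required = qw( wbgene curator humandoid paperexpmod species
                   diseaserelevance paperdisrel lastupdatedisrel );

# Hypothetical helper: given one OA row as a hash ref keyed by table suffix,
# return the required fields that are missing or contain only whitespace.
sub missing_required {
    my ($row) = @_;
    return grep { !defined $row->{$_} || $row->{$_} !~ /\S/ } @required;
}

my @missing = missing_required({
    wbgene  => 'WBGene00003058',
    curator => 'example curator',
    species => 'Homo sapiens',
});
print "Missing required fields: @missing\n" if @missing;
```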

To make live:
at : /home/postgres/work/pgpopulation/dis_disease/
create_dis_tables.pl -- create new postgres tables for dis_ disease OA
synchronize OA
transfer_concise_disease.pl -- take 95 entries that have con_desctype = 'Human_disease_relevance' and add them to dis_ tables starting with pgid 1.
Ranjana, manually delete the Human_disease_relevance entries from the concise OA.
remove the Human_disease_relevance option from the OA, resynchronize.

Dumper specifications

Dumper module in sandbox at /home/postgres/work/citace_upload/dis_disease/get_dis_disease_ace.pm Copy /home/postgres/work/citace_upload/dis_disease/use_package.pl to a directory you own and run it there.

Mapping between OA fields and acedb tags

Model:

?Gene
DB_info  Database ?Database ?Database_field Text
Disease_info Experimental_model ?DO_term XREF Gene_by_biology ?Species   #Evidence	            
             Potential_model ?DO_term XREF Gene_by_orthology ?Species #Evidence
             Disease_relevance  ?Text ?Species #Evidence

We do not fill in Potential_model tag, Sanger does.

The example is lov-1 in the disease OA in the sandbox:

Model tag: ?Gene
Use value: WBGene (take ID only)
Eg: WBGene00003058

Model tag: DB_info Database ?Database ?Database_field Text
Use value(s) in 'xref Database' and in 'OMIM database'
Eg: OMIM:173900 and OMIM:601313; do not take OMIM:173900 again from 'OMIM database', since it is a duplicate of that in 'xref Database'.

.ace:
Database	 "OMIM"	   "disease"	 "173900"
Repeat line for each value if there are multiple values

Model tag: Experimental_model ?DO_term XREF Gene_by_biology ?Species #Evidence
Use value in 'Experimental Model for'
Eg:autosomal dominant polycystic kidney (DOID:5937); take ID only
Use value in 'Species' for ?Species
Eg: Homo sapiens
Use value(s) in 'Paper for Exp Mod' for #Evidence
Eg.WBPaper00038373
Repeat .ace line for every paper if multiple papers are present.

.ace:

Experimental_model  DOID:5937  "Homo sapiens"	Paper_evidence	"WBPaper00038373"	

Model tag: Disease_relevance  ?Text ?Species #Evidence
Use value in 'Disease Relevance' for ?Text
Eg:lov-1 and pkd-2 encode the orthologs of human Polycystin-1 and Polycystin-2, which are mutated in autosomal dominant polycystic kidney disease; the polycystins regulate signaling involved in normal renal tubular structure and function; studies in the worm C. elegans have contributed extensively to the finding that cystic kidney diseases can be considered ciliopathies; in elegans lov-1 and pkd-2 are expressed in male ciliary neurons, are required for normal male mating behavior, do not seem to be required for ciliogenesis, and each polycystin may actually have a potential inhibitory function on the other for ciliary function; lov-1 and pkd-1 interact with a single-pass transmembrane protein, CWP-5, though the significance of this interaction for polycystic kidney disease is unknown.

Use value in 'Species' for ?Species
Eg. Homo sapiens

Use value in 'Paper for Disease Rel' for #Evidence
Eg: WBPaper00038373

.ace:
Disease_relevance "lov-1 and pkd-2 encode the orthologs of human Polycystin-1 and    Polycystin-2, which are mutated in autosomal dominant polycystic kidney disease; the polycystins regulate signaling involved in normal renal tubular structure and function; studies in the worm C. elegans have contributed extensively to the finding that cystic kidney diseases can be considered ciliopathies; in elegans lov-1 and pkd-2 are expressed in male ciliary neurons, are required for normal male mating behavior, do not seem to be required for ciliogenesis, and each polycystin may actually have a potential inhibitory function on the other for ciliary function; lov-1 and pkd-1 interact with a single-pass transmembrane protein, CWP-5, though the significance of this interaction for polycystic kidney disease is unknown."	"Homo sapiens"	Paper_evidence	"WBPaper00038373"

(Repeat this line for every paper, if multiple papers are present).

So put together, .ace file for lov-1 looks like:

Gene : "WBGene00003058"
Database	"OMIM"	"disease"	"173900"
Database	"OMIM"	"disease"	"601313"
Experimental_model DOID:5937 "Homo sapiens"	Paper_evidence	"WBPaper00038373"	
Disease_relevance	"lov-1 and pkd-2 encode the orthologs of human Polycystin-1 and Polycystin-2, which are mutated in autosomal dominant polycystic kidney disease; the polycystins regulate signaling involved in normal renal tubular structure and function; studies in the worm C. elegans have contributed extensively to the finding that cystic kidney diseases can be considered ciliopathies; in elegans lov-1 and pkd-2 are expressed in male ciliary neurons, are required for normal male mating behavior, do not seem to be required for ciliogenesis, and each polycystin may actually have a potential inhibitory function on the other for ciliary function; lov-1 and pkd-1 interact with a single-pass transmembrane protein, CWP-5, though the significance of this interaction for polycystic kidney disease is unknown."	"Homo sapiens"	Paper_evidence	"WBPaper00038373"

When to dump data

If data is present in Field 4 (dis_humandoid) Experimental model for, dump this field and the related fields:
Field 5 Name: (dis_paperexpmod) Paper for Exp Mod
Field 6 Name: (dis_dbexpmod) Database for Exp Mod
Field 7 Name: (dis_species) Species

If data is present in Field 9 (dis_diseaserelevance) Disease relevance, dump this field and the related fields:
Field 10 Name: (dis_paperdisrel) Paper for Disease Rel
Field 11 Name: (dis_dbdisrel) Database for Disease Rel
Field 7 Name: (dis_species) Species

Code annotation

For get_dis_disease_ace.pm


package get_dis_disease_ace;
require Exporter;

our @ISA	= qw(Exporter);
our @EXPORT	= qw( getDisease );
our $VERSION	= 1.00;

# Dumper module to dump Ranjana's dis_ disease data.  2013 01 18

use strict;
use diagnostics;
use LWP;
use LWP::Simple;
use DBI;

my $dbh = DBI->connect ( "dbi:Pg:dbname=testdb", "", "") or die "Cannot connect to database!\n"; # connect to the postgres database

my $result;

my %theHash; # holds the data read from every table in @tables; stores all postgres data to parse into .ace output
my @tables = qw( wbgene humandoid paperexpmod dbexpmod species diseaserelevance paperdisrel dbdisrel ); # list of postgres tables, e.g. dis_wbgene, dis_humandoid; all begin with the prefix dis_


my $all_entry = ''; #defining all the variables, .ace and the error text
my $err_text = '';

my %nameToIDs;							# type -> name -> ids -> count; maps WBGenes to PGids
my %ids;                                                        #just all the PGIDs that are relevant

my %deadObjects;                   #hash of all the dead objects

my %dataType;
$dataType{humandoid}   = 'multi';
$dataType{paperexpmod} = 'multi';
$dataType{paperdisrel} = 'multi';
$dataType{dbexpmod}    = 'comma';
$dataType{dbdisrel}    = 'comma';

1;

sub populateDeadObjects {
  $result = $dbh->prepare( "SELECT * FROM pap_status WHERE pap_status = 'invalid';" ); $result->execute();
  while (my @row = $result->fetchrow) { $deadObjects{paper}{invalid}{"WBPaper$row[0]"} = $row[1]; }
  $result = $dbh->prepare( "SELECT * FROM gin_dead;" ); $result->execute();
  while (my @row = $result->fetchrow) {                 # Ranjana doesn't care about hierarchy, just show her an error message
    if ($row[1]) { $deadObjects{gene}{"WBGene$row[0]"} = $row[1]; } }
} # sub populateDeadObjects    # we are getting the genes and the papers that are invalid, storing them in the dead objects hash


sub getDisease {
  my ($flag) = shift; #use all or specify the geneID

  if ( $flag eq 'all' ) { $result = $dbh->prepare( "SELECT * FROM dis_wbgene; " ); }		# get all entries for type; # get all entries for all WBGenes
    else { $result = $dbh->prepare( "SELECT * FROM dis_wbgene WHERE dis_wbgene = '$flag';" ); }	# get all entries for type of object intid; #get all entries for WBGenes with the object name being the same as flag
  
     $result->execute();
  while (my @row = $result->fetchrow) {
    if ($deadObjects{gene}{$row[1]}) { $err_text .= "pgid $row[0] has $row[1] which is $deadObjects{gene}{$row[1]}\n"; }        # add dead wbgenes to error out
      else { $theHash{object}{$row[0]} = $row[1]; $nameToIDs{object}{$row[1]}{$row[0]}++; $ids{$row[0]}++; } }          # add non-dead genes to hashes
  my $ids = ''; my $qualifier = '';  # dead genes were reported as errors above; genes that are not dead go on to be dumped

  # For a query on a specific gene, restrict every table listed above to that gene's pgids
  # by adding the qualifier WHERE joinkey IN ('$ids') to the postgres query.
  if ($flag ne 'all') { $ids = join"','", sort keys %ids; $qualifier = "WHERE joinkey IN ('$ids')"; }
  foreach my $table (@tables) { #for each of those tables we will do this query, $theHash{$table}{$row[0]} = $row[1];
    $result = $dbh->prepare( "SELECT * FROM dis_$table $qualifier;" );		# get data for table with qualifier (or not if not)
    $result->execute();	#query results stored in this hash, %theHash, the hash maps to DOID, $theHash{humandoid}{1} = 'DO:1234'
    while (my @row = $result->fetchrow) { $theHash{$table}{$row[0]} = $row[1]; }
  } # foreach my $table (@tables)
 foreach my $objName (sort keys %{ $nameToIDs{object} }) {# getting each of the objects from the nameTOID hash 
    my $entry = ''; my $has_data; #storing the .ace entry for .ace object
    $entry .= "\nGene : \"$objName\"\n"; #will dump empty gene objects, if no data present

    foreach my $pgid (sort {$a<=>$b} keys %{ $nameToIDs{object}{$objName} }) { #for each PGID that has that object name the data will be dumped
      my $species = ''; if ($theHash{species}{$pgid}) { $species = $theHash{species}{$pgid}; } #will get species value
      my %omim = (); # filter OMIM results so no duplicates
      if ($theHash{humandoid}{$pgid}) { #if human DOID
        my (@doids) = $theHash{humandoid}{$pgid} =~ m/(DOID:\d+)/g;#match for DOID: numbers, DOID:\d+
        my @papers;
        if ($theHash{paperexpmod}{$pgid}) { (@papers) = $theHash{paperexpmod}{$pgid} =~ m/(WBPaper\d+)/g; } #match for WBPaper, WBPaper\d+
        foreach my $doid (@doids) { # for each DOID
          if (scalar @papers > 0) { foreach my $paper (@papers) { $entry .= qq(Experimental_model\t"$doid"\t"$species"\tPaper_evidence\t"$paper"\n); } } #there are papers,Experimental_model\t"$doid"\t"$species"\tPaper_evidence\t"$paper"
            else { $entry .= qq(Experimental_model\t"$doid"\t"$species"\n); } } #there are no papers, Experimental_model\t"$doid"\t"$species"
        if ($theHash{dbexpmod}{$pgid}) { my (@om) = $theHash{dbexpmod}{$pgid} =~ m/OMIM:(\d+)/g; foreach (@om) { $omim{$_}++; } } #if there is data in dis_dbexpmod, we are going to match for OMIM:(\d+),but only capture the number, not the OMIM:, store in the OMIM hash
      }
      if ($theHash{diseaserelevance}{$pgid}) { # if there is disease relevance, dis_diseaserelevance, convert '-->" for acedb, 
        my $disrel = $theHash{diseaserelevance}{$pgid}; if ($disrel =~ m/\'/) { $disrel =~ s/\'/''/g; } if ($disrel =~ m/\n/) {  $disrel =~ s/\n/ /g; } #converts line breaks into spaces
        
        my @papers; my @all_papers;
        if ($theHash{paperdisrel}{$pgid}) { (@all_papers) = $theHash{paperdisrel}{$pgid} =~ m/(WBPaper\d+)/g; }  # match WBPaper\d+ in the table dis_paperdisrel
        foreach my $paper (@all_papers) {                       # get all papers and send error message for invalid papers, and add valid to list of papers
          if ($deadObjects{paper}{invalid}{$paper}) { $err_text .= "pgid $pgid has invalid paper $paper\n"; }
            else { push @papers, $paper; } }
        if (scalar @papers > 0) { foreach my $paper (@papers) { $entry .= qq(Disease_relevance\t"$disrel"\t"$species"\tPaper_evidence\t"$paper"\n); } } #same as 75 and 76, for disease relevance as opposed to DOID
          else { $entry .= qq(Disease_relevance\t"$disrel"\t"$species"\n); }
        if ($theHash{dbdisrel}{$pgid}) { my (@om) = $theHash{dbdisrel}{$pgid} =~ m/OMIM:(\d+)/g; foreach (@om) { $omim{$_}++; } } # for disease relevance as opposed to dbexpmod
      }
      foreach my $omim (sort keys %omim) { $entry .= qq(Database\t"OMIM"\t"disease"\t"$omim"\n); } #print all the unique OMIM IDs
      if ($entry) { $has_data++; }                  # if .ace object has a phenotype, append to whole list
    } # foreach my $pgid (sort {$a<=>$b} keys %{ $nameToIDs{$type}{$objName} })
    if ($has_data) { $all_entry .= $entry; }
  } # foreach my $objName (sort keys %{ $nameToIDs{$type} })
  return( $all_entry, $err_text );# returns all entries, no error checking in place for now;
} # sub getDisease

__END__

sub getData {
  my ($cur_entry, $table, $joinkey, $tag, $objName, $goodGenes_ref) = @_;
  if ($theHash{$table}{$joinkey}) {
    my $data = $theHash{$table}{$joinkey};
    if ($data =~ m/^\"/) { $data =~ s/^\"//; }
    if ($data =~ m/\"$/) { $data =~ s/\"$//; }
    if ($data =~ m/\r/) { $data =~ s/\r//g; }
    if ($data =~ m/\n/) { $data =~ s/\n/  /g; }
    if ($data =~ m/^\s+/) { $data =~ s/^\s+//g; } if ($data =~ m/\s+$/) { $data =~ s/\s+$//g; }
    my @data;
    if ($data =~ m/\",\"/) { @data = split/\",\"/, $data; }
      elsif ($pipeSplit{$table}) { @data = split/ \| /, $data; }
      else { push @data, $data; }
    foreach my $value (@data) {
      if ($value =~ m/\"/) { $value =~ s/\"/\\\"/g; }
    } # foreach my $value (@data)
  }
  return $cur_entry;
} # sub getData

use_package.pl


#!/usr/bin/perl

# use the get_paper_ace.pm module from /home/postgres/work/citace_upload/papers/ 
# to dump the papers, abstracts (LongText objects), and errors associated with
# them.  2005 07 13
#
# Change to default get all papers, not just valid ones.  2005 11 10

use strict;
use Jex;

my $date = &getSimpleSecDate();
my $start_time = time;
my $estimate_time = time + 697;
my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime($estimate_time);             # get time
if ($sec < 10) { $sec = "0$sec"; }    # add a zero if needed
print "START $date -> Estimate $hour:$min:$sec\n";

$date = &getSimpleDate();

use lib qw( /home/postgres/work/citace_upload/dis_disease/ );
use get_dis_disease_ace; #tells script where to get the perl module packages

my $outfile = 'disease_' . $date . '.ace';
my $errfile = 'err.out.' . $date; #has two outputs, .ace file and error files, changed file name to 

open (OUT, ">$outfile") or die "Cannot create $outfile : $!\n";
open (ERR, ">$errfile") or die "Cannot create $errfile : $!\n";


my ($all_entry, $err_text) = &getDisease('all'); # all, gets all objects, need to specify WBGene if only that needed

print OUT "$all_entry\n";
if ($err_text) { print ERR "$err_text"; }  #will print error file, if errors are found, otherwise not

close (OUT) or die "Cannot close $outfile : $!";
close (ERR) or die "Cannot close $errfile : $!";

$date = &getSimpleSecDate();
my $end_time = time;
my $diff_time = $end_time - $start_time;
print "DIFF $diff_time\n";
print "END $date\n";

Counting script specifications

Counting script counts numbers in Postgres at any given instance and not from the .ace file.

Script at : /home/acedb/ranjana/human_disease/count_disease.pl

1. No. of genes (dis_wbgene): Counts all genes including duplicates, lists PGIDs of duplicate genes

2. No. of unique genes : Counts all genes, only once

3. No. of Experimental Models or DO_terms (dis_humandoid): counts all DO_terms

4. No. of unique Experimental models or DO_terms: does not count repeated DO_terms

5. No. of papers for Experimental models or DO_terms (dis_paperexpmod): counts all papers

6. No. of papers for Disease Relevance (dis_paperdisrel)

7. No. of unique papers in all of disease curation: no. of papers in dis_paperexpmod + no. of papers in dis_paperdisrel, counts a paper only once in both categories, no duplicates

8. No. of disease relevance descriptions (dis_diseaserelevance)

9. No. of OMIM genes connected to (WB)genes: from OA field 12, 'OMIM gene for Disease Rel' (Postgres table dis_genedisrel); entries look like 'OMIM:607485' or just '607485'; entries are comma separated

10. No. of OMIM diseases connected to WB genes: from OA field 'OMIM disease for Exp Mod' (dis_dbexpmod) plus OA field 'OMIM disease for Disease Rel' (dis_dbdisrel); counts a disease only once if it appears in both categories; entries look like 'OMIM:607485' or just '607485'; entries are comma separated
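
Count 10 combines two comma-separated fields and accepts IDs with or without the 'OMIM:' prefix. The sketch below shows one way to do that; it is not the actual count_disease.pl. It assumes the rows have already been read from dis_dbexpmod and dis_dbdisrel into hashes, and the helper names omim_ids and count_unique_omim_diseases are hypothetical.

```perl
use strict;
use warnings;

# Pull OMIM numbers out of one comma-separated field value.
# Accepts both 'OMIM:607485' and bare '607485'; other items are ignored.
sub omim_ids {
    my ($value) = @_;
    return () unless defined $value;
    my @ids;
    foreach my $item (split /,/, $value) {
        push @ids, $1 if $item =~ /^\s*(?:OMIM:)?\s*(\d+)\s*$/;
    }
    return @ids;
}

# Count each OMIM disease once across both fields and all rows.
sub count_unique_omim_diseases {
    my (@rows) = @_;    # each row: { dbexpmod => '...', dbdisrel => '...' }
    my %seen;
    foreach my $row (@rows) {
        $seen{$_}++ foreach omim_ids($row->{dbexpmod}), omim_ids($row->{dbdisrel});
    }
    return scalar keys %seen;
}

# Example rows: 173900 appears in both fields of the first row but counts once.
print count_unique_omim_diseases(
    { dbexpmod => 'OMIM:173900, 601313', dbdisrel => '173900' },
    { dbexpmod => '607485' },
), "\n";
```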

Ontology Annotator for Disease Term

OA for disease term

Dumping data for citace upload

--All scripts are under: /home/acedb/ranjana/human_disease

--A symlink to the script has been created: ln -s /home/postgres/work/citace_upload/dis_disease/use_package.pl
--The disease ontology file for the OA is updated by a cron job that runs at 8pm every day. (Crontab entry: 0 20 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/update_obo_oa_ontologies.pl; source: http://www.berkeleybop.org/ontologies/doid.obo, 08.08.2013)


1. Ontology file:

Run parseHuman.pl:

Downloads the HumanDO.obo from http://diseaseontology.svn.sourceforge.net/viewvc/diseaseontology/trunk/HumanDO.obo and converts it to HumanDO.ace. Upload to Spica under Data_for_citace/Data_from_Ranjana/. Change name to HumanDO_WSXXX.ace (Source URL now changed to http://www.berkeleybop.org/ontologies/doid.obo, 08.08.2013)

2. Gene-disease annotation file

Run use_package.pl at /home/acedb/ranjana/human_disease:

Dumps disease data from the disease OA into disease_<date>.ace. scp the file to your local machine, rename it to disease_WSXXX.ace, and upload it to Spica at citpub, under Data_for_citace/Data_from_Ranjana/.

Also checks whether all DOIDs in postgres are valid, outputs invalid DOIDs to err.out.<date> file. Note that invalid DOIDs cannot be seen in the OA, identify by PGID and then add the valid DOID to annotation, as the invalid one will not show.

3. DO_term-Worm_model_description annotation file

Run use_package.pl at /home/acedb/ranjana/human_disease/diseaseterm

Dumps disease data from the disease term OA into diseaseterm_<date>.ace. scp the file to your local machine, rename it to diseaseterm_WSXXX.ace, and upload it to Spica at citpub, under Data_for_citace/Data_from_Ranjana/.


4. Download the HumanDO.obo file from http://www.berkeleybop.org/ontologies/doid.obo and rename as disease_ontology.WSXXX.obo.


All files should be deposited to:

/home/citpub/Data_for_citace/

/home/citpub/Data_for_Ontology/

Changes to OA May 2013

  • Database for Exp Mod changes to 'OMIM disease for Exp Mod', data can be entered as IDs without the 'OMIM:' as prefix, multiple values comma-separated.
  • 'Database for Disease Rel' changes to 'OMIM disease for Disease Rel', multiple values are comma-separated, data can be entered as IDs without the 'OMIM:' prefix.
  • Extra free-text field called 'OMIM gene for Disease Rel' added, data can be entered as IDs without the 'OMIM:' prefix, multiple values comma-separated.
  • When data is present in either the 'OMIM disease for Disease Rel' or 'OMIM gene for Disease Rel' fields, the script dumps one line in .ace for each entry ("disease" for the former field, "gene" for the latter):

Database "OMIM" "disease" "456789"
Database "OMIM" "gene" "456789"

Changes to gene-disease dumper, Sept 2014: moving OMIM Ids to Accession_evidence

  • Disable part of script that dumps OMIM ids under the 'Database' tag
  • For the Experimental_model tag, look at the Ids either entered as 'OMIM:XXXXX', or just 'XXXXX' in the 'OMIM disease for Exp Mod (dis_dbexpmod)'
  • For the Disease_relevance tag, look at the OMIM IDs, entered either as 'OMIM:XXXXX' or just 'XXXXX', in 'OMIM disease for Disease Rel (dis_dbdisrel)' and 'OMIM gene for Disease Rel (dis_genedisrel)'
  • For each unique OMIM ID the .ace syntax for the gene would be:
 
Gene : "WBGene00003052"
Experimental_model "DOID:3911" "Homo sapiens" Accession_evidence  "OMIM"  "176670"
Experimental_model "DOID:0050557" "Homo sapiens" Accession_evidence  "OMIM"  "613205"
Experimental_model "DOID:11726" "Homo sapiens" Accession_evidence  "OMIM"  "181350"
Disease_relevance   "Mutations in human lamin, LMNA, are found in several diseases referred to as the laminopathic diseases, which include Emery-Dreifuss muscular dystrophy (EDMD), LMNA-related congenital muscular dystrophy (L-CMD), limb-girdle muscular dystrophy (L-CMD), Hutchison-Gilford progeria syndrome (HGPS), dilated cardiomyopathy (DCM), Charcot-Marie-Tooth disorder and atypical Werner syndrome; elegans B-type lamin, lmn-1, performs both A and B-type vertebrate lamin functions; similar to A-type lamins, it has roles in development, organization of nuclear pore complexes, and interacts with lamina and nuclear components; similar to B-type lamins, it is expressed widely throughout development, except for sperm, and interacts with B-type lamin-binding proteins; much of the knowledge of the organization and assembly of the nuclear lamina has come from studies in elegans; disease-causing mutations in human LMNA when introduced into elegans lmn-1/lamin alter nuclear lamina organization and dynamics, leading to phenotypes such as decreased fertility and muscle lesions; a mutation found in Hutchison-Gilford progeria syndrome disrupts the supramolecular structure of the lamin filaments in elegans; LMNA mutations that are found in EDMD, DCM and HGPS, when introduced into elegans lmn-1/lamin cause disruption in lamin filament assembly and nuclear localization; also, work in elegans has revealed that lamins are involved in the normal aging process, as worms mutant for lamin age faster."  "Homo sapiens"  Accession_evidence  "OMIM"  "115200"


(The Disease_relevance line is repeated for each OMIM ID from 'OMIM disease for Disease Rel (dis_dbdisrel)' and from 'OMIM gene for Disease Rel (dis_genedisrel)'.)
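
The Sept 2014 change can be sketched as follows. This is an illustration under assumptions, not the dumper's actual code: the helper name accession_lines is hypothetical, the field values are passed in as plain strings, and tab separators are assumed to match the style of the existing dumper. It emits one line per unique OMIM ID for a given tag.

```perl
use strict;
use warnings;

# Hypothetical helper: build Accession_evidence lines for one tag.
# $tag is 'Experimental_model' or 'Disease_relevance'; $value is the DOID or
# the description text; @fields are the comma-separated OMIM field values.
sub accession_lines {
    my ($tag, $value, $species, @fields) = @_;
    my (%seen, @lines);
    foreach my $field (grep { defined $_ } @fields) {
        foreach my $item (split /,/, $field) {
            # accept 'OMIM:XXXXX' or just 'XXXXX'
            next unless $item =~ /^\s*(?:OMIM:)?\s*(\d+)\s*$/;
            my $id = $1;
            next if $seen{$id}++;    # one line per unique OMIM ID
            push @lines, qq($tag\t"$value"\t"$species"\tAccession_evidence\t"OMIM"\t"$id"\n);
        }
    }
    return @lines;
}

# Experimental_model reads one field; Disease_relevance would pass two fields.
print accession_lines('Experimental_model', 'DOID:3911', 'Homo sapiens', 'OMIM:176670, 176670');
```

For Disease_relevance, the two field values ('OMIM disease for Disease Rel' and 'OMIM gene for Disease Rel') would be passed together so that an ID present in both is dumped once. The sketch does not escape double quotes inside the description text.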

Old way of dumping OMIM IDs for genes:

Gene : "WBGene00003052"
Database	"OMIM"	"disease"	"176670"
Database	"OMIM"	"disease"	"613205"
Database	"OMIM"	"disease"	"181350"

To do

  • Need to tell the EBI team that, starting with the WS239 upload (mid-July), we will be dumping Date_last_updated and Curator_confirmed data into citace and they should pick it up.
  • Disease ontology file location has changed, need to alert JC to change the locations for OA and scripts (done, 08.08.2013):

DO group lists two locations: Sourceforge: http://sourceforge.net/p/diseaseontology/code/2599/tree/trunk/

OBO Foundry: http://www.berkeleybop.org/ontologies/doid.obo (will use this source)


