Difference between revisions of "WormMine"

From WormBaseWiki

Revision as of 19:44, 8 May 2013

Current status

Link to mine

ONLINE


Follow progress here:

GitHub commit history

Account management

Data contained in WormMine

This lists all data sources contained in WormMine

Data source | Description | Origin | Species | Data mapping
GO | Ontology terms and relationships comprising GO | GO project website | N/A |
GO Annotations | Relationships between Genes and GO | GO project website | Caenorhabditis elegans |
Genomic sequences | FASTA DNA sequences | WormBase FTP | Caenorhabditis elegans |
Protein sequences | FASTA peptide sequences | WormBase FTP | Caenorhabditis elegans |
Gene locations | Gene chromosomal coordinates | WormBase FTP GFF3 | Caenorhabditis elegans |
Transcript locations | Transcript chromosomal coordinates | WormBase FTP GFF3 | Caenorhabditis elegans |
CDS locations | CDS chromosomal coordinates | WormBase FTP GFF3 | Caenorhabditis elegans |
Gene metadata | Select fields extracted from Ace | AceDB XML dump | All in AceDB | LINK
Transcript metadata | Select fields extracted from Ace | AceDB XML dump | All in AceDB | LINK
CDS metadata | Select fields extracted from Ace | AceDB XML dump | All in AceDB | LINK
Variation metadata | Select fields extracted from Ace | AceDB XML dump | All in AceDB | LINK
Protein metadata | Select fields extracted from Ace | AceDB XML dump | All in AceDB | LINK
Phenotype metadata | Select fields extracted from Ace | AceDB XML dump | All in AceDB | LINK
Expression Pattern metadata | Select fields extracted from Ace | AceDB XML dump | All in AceDB | LINK
Anatomy Term metadata | Select fields extracted from Ace | AceDB XML dump | All in AceDB | LINK
Expression Cluster metadata | Select fields extracted from Ace | AceDB XML dump | All in AceDB | LINK
Life Stage metadata | Select fields extracted from Ace | AceDB XML dump | All in AceDB | LINK

What is this data mapping?

A loader plugin has been created for InterMine that extracts data embedded in XML files and loads it directly into an InterMine instance. Mapping files configure this plugin and describe how the AceDB XML dumps translate into InterMine fields. XPath is used to query the XML, and can be reviewed here.
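To make the idea of an XPath-driven mapping concrete, here is a minimal sketch in Python. It is not the WormMine loader and does not use its mapping file format: the XML element names, the placeholder values, and the MAPPING dictionary are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical AceDB-style XML record; element names and values are placeholders.
ACE_XML = """
<Gene>
  <Name>GENE_ID_1</Name>
  <Identity>
    <Public_name>sym-1</Public_name>
  </Identity>
</Gene>
"""

# Hypothetical mapping: InterMine field name -> XPath into the record.
MAPPING = {
    "primaryIdentifier": "./Name",
    "symbol": "./Identity/Public_name",
    "name": "./Identity/Full_name",  # deliberately absent from the record above
}


def extract_fields(xml_text, mapping):
    """Evaluate each XPath against the record and collect the text found."""
    record = ET.fromstring(xml_text)
    values = {}
    for field, xpath in mapping.items():
        text = record.findtext(xpath)  # None when the path matches nothing
        values[field] = text.strip() if text else None
    return values


fields = extract_fields(ACE_XML, MAPPING)
```

Each mapping entry pairs a target field with the XPath that locates its value, so adding a field means adding one line to the mapping rather than changing the extraction code.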

Understanding our model

The data contained in WormMine follows a central model schema. You need a working understanding of this model to query the data and create templates.

WormMine model file

This schema file contains all of the data types contained in WormMine, relationships between them, and each one's data fields.

How to read it

Looking at the protein class:

   <class name="Protein" extends="BioEntity" is-interface="true">
       <attribute  name="molecularWeight" type="java.lang.Float"/>
       <attribute  name="md5checksum" type="java.lang.String"/>
       <attribute  name="length" type="java.lang.Integer"/>
       <attribute  name="geneName" type="java.lang.String"/>
       <attribute  name="primaryAccession" type="java.lang.String"/>
       <reference  name="sequence" referenced-type="Sequence"/>
       <collection name="CDSs" referenced-type="CDS" reverse-reference="protein"/>
       <collection name="genes" referenced-type="Gene" reverse-reference="proteins"/>
       <collection name="transcripts" referenced-type="Transcript" reverse-reference="protein"/>
   </class>

Line by line:

<class name="Protein" extends="BioEntity" is-interface="true">

extends="BioEntity": Protein is a child of BioEntity, so it inherits BioEntity's data fields.

Protein's parent, BioEntity:

   <class name="BioEntity" is-interface="true">
       <attribute name="secondaryIdentifier" type="java.lang.String"/>
       <attribute name="symbol" type="java.lang.String"/>
       <attribute name="primaryIdentifier" type="java.lang.String"/>
       <attribute name="lastUpdated" type="java.util.Date"/>
       <attribute name="name" type="java.lang.String"/>
       <reference name="organism" referenced-type="Organism"/>
       <collection name="synonyms" referenced-type="Synonym" reverse-reference="subject"/>
       <collection name="publications" referenced-type="Publication" reverse-reference="bioEntities"/>
       <collection name="ontologyAnnotations" referenced-type="OntologyAnnotation" reverse-reference="subject"/>
       <collection name="phenotypesObserved" referenced-type="Phenotype" reverse-reference="observedIn"/>
       <collection name="phenotypesNotObserved" referenced-type="Phenotype" reverse-reference="notObservedIn"/>
       <collection name="crossReferences" referenced-type="CrossReference" reverse-reference="subject"/>
       <collection name="dataSets" referenced-type="DataSet" reverse-reference="bioEntities"/>
       <collection name="locatedFeatures" referenced-type="Location" reverse-reference="locatedOn"/>
       <collection name="locations" referenced-type="Location" reverse-reference="feature"/>
   </class> 

Protein contains copies of all these attributes, references, and collections for itself. If BioEntity inherits any fields itself, those are included as well.
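The inheritance rule above can be sketched in Python by walking the extends chain and collecting every field along the way. This is only an illustration: the model below is a cut-down version of the two classes shown above, the enclosing <model> element is an assumption, and the function name all_fields is hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the classes shown above; the <model> wrapper is assumed.
MODEL_XML = """
<model>
  <class name="BioEntity" is-interface="true">
    <attribute name="primaryIdentifier" type="java.lang.String"/>
    <attribute name="symbol" type="java.lang.String"/>
    <reference name="organism" referenced-type="Organism"/>
  </class>
  <class name="Protein" extends="BioEntity" is-interface="true">
    <attribute name="primaryAccession" type="java.lang.String"/>
    <reference name="sequence" referenced-type="Sequence"/>
    <collection name="CDSs" referenced-type="CDS" reverse-reference="protein"/>
  </class>
</model>
"""


def all_fields(model_root, class_name):
    """Return {field name: kind} for a class, including inherited fields."""
    classes = {c.get("name"): c for c in model_root.iter("class")}
    fields = {}

    def visit(name):
        cls = classes.get(name)
        if cls is None:  # parent not defined in this snippet
            return
        # Visit parents first so a class's own fields are recorded last.
        # Splitting on whitespace allows for more than one parent (assumption).
        for parent in (cls.get("extends") or "").split():
            visit(parent)
        for field in cls:
            fields[field.get("name")] = field.tag

    visit(class_name)
    return fields


protein_fields = all_fields(ET.fromstring(MODEL_XML), "Protein")
```

Here protein_fields contains symbol and organism from BioEntity alongside Protein's own primaryAccession, sequence, and CDSs.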

<attribute name="primaryAccession" type="java.lang.String"/>: This creates an attribute of Protein called primaryAccession (read "primary accession") whose type is a string (text). Every Protein object can hold a primaryAccession value, as can objects of any child class that inherits from Protein.

<reference name="sequence" referenced-type="Sequence"/>: This creates a reference named "sequence" to another data type, in this case Sequence. Only one Sequence object can be referenced this way at a time. A reverse-reference attribute may also appear here; it names the field in the referenced type that holds the reciprocal relationship, if one exists.

<collection name="CDSs" referenced-type="CDS" reverse-reference="protein"/>: This creates a collection named "CDSs". It can hold references to many CDS objects; each of those CDS objects in turn points back to the protein through its "protein" field (CDS.protein).
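As a rough illustration of how reverse-reference pairs the two ends of a relationship, the Python sketch below looks up the field at the other end. The CDS class definition is an assumption: the model excerpt above only tells us that CDS has a field named "protein", not whether it is a reference or a collection. The function name reverse_field is hypothetical, and inherited fields are ignored for brevity.

```python
import xml.etree.ElementTree as ET

# Protein.CDSs is taken from the excerpt above; the CDS class is assumed.
MODEL_XML = """
<model>
  <class name="Protein" is-interface="true">
    <collection name="CDSs" referenced-type="CDS" reverse-reference="protein"/>
  </class>
  <class name="CDS" is-interface="true">
    <reference name="protein" referenced-type="Protein" reverse-reference="CDSs"/>
  </class>
</model>
"""


def reverse_field(model_root, class_name, field_name):
    """Return (class, field, kind) for the other end of a relationship, or None."""
    classes = {c.get("name"): c for c in model_root.iter("class")}
    for field in classes[class_name]:
        if field.get("name") != field_name:
            continue
        target = field.get("referenced-type")
        reverse = field.get("reverse-reference")
        if reverse is None or target not in classes:
            return None  # one-way relationship, or target not in this snippet
        for other in classes[target]:
            if other.get("name") == reverse:
                return target, reverse, other.tag
    return None


model = ET.fromstring(MODEL_XML)
other_end = reverse_field(model, "Protein", "CDSs")
```

With the assumed CDS definition, a collection on the Protein side is matched by a single reference on the CDS side, which is the one-to-many shape described above.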

Using query builder

[Image: Querybuilder link.png]

Use case: How do I find all proteins for a gene?

How are they intuitively connected? Genes are transcribed into transcripts, which contain coding sequences, which are translated into proteins. Keep this chain in mind while constructing a query.

  • Select "Gene" in the box under "Select a Data Type to Begin a Query"
  • Click "Select", or double-click "Gene"

The list that appears shows the Gene data available to display. From here you can choose fields to show in the results table, follow links to other data types, and set filters to constrain the results.

What do you want your results to look like?

We want to display the IDs of each involved object, plus the gene symbol. This calls for:

  1. Gene.primaryIdentifier
  2. Gene.symbol
  3. Gene (some relationship)-> Transcript.primaryIdentifier
  4. Gene (some relationship)-> CDS.primaryIdentifier
  5. Gene (some relationship)-> Protein.primaryIdentifier
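The column list above can also be written down as dotted paths starting from Gene. The Python sketch below assembles such paths into a query element of the kind InterMine uses for path queries. Treat the details as assumptions: the collection names transcripts and CDSs on the Gene and Transcript side are guesses, the model name "genomic" is assumed, and build_query_xml is a hypothetical helper. Only CDS.protein follows from the reverse-reference shown in the Protein class.

```python
import xml.etree.ElementTree as ET

# One path per desired results column, all rooted at Gene.
VIEW = [
    "Gene.primaryIdentifier",
    "Gene.symbol",
    "Gene.transcripts.primaryIdentifier",               # collection name assumed
    "Gene.transcripts.CDSs.primaryIdentifier",          # collection name assumed
    "Gene.transcripts.CDSs.protein.primaryIdentifier",  # CDS.protein per the model
]


def build_query_xml(view, model="genomic"):
    """Serialize a list of view paths as a single <query> element."""
    roots = {path.split(".")[0] for path in view}
    if len(roots) != 1:
        raise ValueError("all view paths must start from the same data type")
    query = ET.Element("query", {"model": model, "view": " ".join(view)})
    return ET.tostring(query, encoding="unicode")


query_xml = build_query_xml(VIEW)
```

Each "(some relationship)" in the list becomes one or more path segments, which is what expanding a relationship in the query builder does interactively.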

Building the query

[Figure: Qb circled.png. Caption: click these areas]

The "DB Identifier" label represents primaryIdentifier; this is the only field name replaced in this way. Click "show" next to any field name to add that attribute to the results table as a column.

  • Show primaryIdentifier (DB Identifier) and symbol
  • Gene contains a collection of transcripts; this relationship represents transcription. Expand the relationship by clicking on the [+]

Creating templates

Creating lists

Creating widgets