Revision as of 21:37, 7 May 2013

1 Current status
2 Follow progress here:
3 Account management
4 Data contained in WormMine
- 4.1 What is this data mapping?
5 Understanding our model
- 5.1 How to read it
  - 5.1.1 Line by line:
6 Creating queries
- 6.1 Use case: How do I find all coding sequences for a gene?
  - 6.1.1 Method 1
7 Using query builder
8 Creating templates
9 Creating lists
10 Creating widgets

Current status

ONLINE

Follow progress here:

GitHub commit history: https://github.com/WormBase/intermine/commits/jw

Account management

Data contained in WormMine

The table below lists all data sources contained in WormMine.

Data set	Description	Source	Species	Data mapping
GO	Ontology terms and relationships comprising GO	GO project website.	N/A
GO Annotations	Relationships between Genes and GO	GO project website.	Caenorhabditis elegans
Genomic sequences	Fasta DNA sequences	WormBase FTP	Caenorhabditis elegans
Protein sequences	Fasta peptide sequences	WormBase FTP	Caenorhabditis elegans
Gene locations	Gene chromosomal coordinates	WormBase FTP GFF3	Caenorhabditis elegans
Transcript locations	Transcript chromosomal coordinates	WormBase FTP GFF3	Caenorhabditis elegans
CDS locations	CDS chromosomal coordinates	WormBase FTP GFF3	Caenorhabditis elegans
Gene metadata	Select fields extracted from Ace	AceDB XML dump	All in AceDB	LINK
Transcript metadata	Select fields extracted from Ace	AceDB XML dump	All in AceDB	LINK
CDS metadata	Select fields extracted from Ace	AceDB XML dump	All in AceDB	LINK
Variation metadata	Select fields extracted from Ace	AceDB XML dump	All in AceDB	LINK
Protein metadata	Select fields extracted from Ace	AceDB XML dump	All in AceDB	LINK
Phenotype metadata	Select fields extracted from Ace	AceDB XML dump	All in AceDB	LINK
Expression Pattern metadata	Select fields extracted from Ace	AceDB XML dump	All in AceDB	LINK
Anatomy Term metadata	Select fields extracted from Ace	AceDB XML dump	All in AceDB	LINK
Expression Cluster metadata	Select fields extracted from Ace	AceDB XML dump	All in AceDB	LINK
Life Stage metadata	Select fields extracted from Ace	AceDB XML dump	All in AceDB	LINK

What is this data mapping?

A loader plugin has been created for InterMine that extracts data embedded in XML files and loads it directly into an InterMine instance. Mapping files configure this plugin and describe how fields in the AceDB XML dumps translate to the InterMine model. XPath is used to query the XML, and can be reviewed here.
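As a rough illustration of the idea, the sketch below applies a small field-to-XPath mapping to one XML record using Python's standard library. The XML layout, the element names, the placeholder values, and the `MAPPING` dictionary are all invented for this example. They are assumptions, not the real AceDB dump format or the real mapping file syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical AceDB-style XML record. Element names and values are
# placeholders invented for illustration, not the real dump format.
ACE_XML = """
<Gene>
  <Identity>
    <ID>EXAMPLE_GENE_1</ID>
    <Name><Public_name>example-1</Public_name></Name>
  </Identity>
  <Species>Caenorhabditis elegans</Species>
</Gene>
"""

# Hypothetical mapping: model field name -> XPath relative to the record root.
MAPPING = {
    "primaryIdentifier": "./Identity/ID",
    "symbol": "./Identity/Name/Public_name",
    "organism.name": "./Species",
}


def extract_fields(xml_text, mapping):
    """Apply each XPath in the mapping; missing elements yield None."""
    root = ET.fromstring(xml_text)
    return {field: root.findtext(xpath) for field, xpath in mapping.items()}


record = extract_fields(ACE_XML, MAPPING)
print(record)
```

Keeping the XPath expressions in a separate mapping means the extraction code does not need to change when a new field is added. That is presumably the motivation for the mapping files described above.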

Understanding our model

All data in WormMine follows a central model schema. You need a working understanding of this model to query the data and create templates.

WormMine model file

This schema file defines every data type in WormMine, the relationships between them, and the data fields of each type.

How to read it

Consider the Protein class:

   <class name="Protein" extends="BioEntity" is-interface="true">
       <attribute  name="molecularWeight" type="java.lang.Float"/>
       <attribute  name="md5checksum" type="java.lang.String"/>
       <attribute  name="length" type="java.lang.Integer"/>
       <attribute  name="geneName" type="java.lang.String"/>
       <attribute  name="primaryAccession" type="java.lang.String"/>
       <reference  name="sequence" referenced-type="Sequence"/>
       <collection name="CDSs" referenced-type="CDS" reverse-reference="protein"/>
       <collection name="genes" referenced-type="Gene" reverse-reference="proteins"/>
       <collection name="transcripts" referenced-type="Transcript" reverse-reference="protein"/>
   </class>

Line by line:

<class name="Protein" extends="BioEntity" is-interface="true">

extends="BioEntity": Protein is a child of BioEntity, so it inherits BioEntity's data fields.

Protein's parent, BioEntity:

   <class name="BioEntity" is-interface="true">
       <attribute name="secondaryIdentifier" type="java.lang.String"/>
       <attribute name="symbol" type="java.lang.String"/>
       <attribute name="primaryIdentifier" type="java.lang.String"/>
       <attribute name="lastUpdated" type="java.util.Date"/>
       <attribute name="name" type="java.lang.String"/>
       <reference name="organism" referenced-type="Organism"/>
       <collection name="synonyms" referenced-type="Synonym" reverse-reference="subject"/>
       <collection name="publications" referenced-type="Publication" reverse-reference="bioEntities"/>
       <collection name="ontologyAnnotations" referenced-type="OntologyAnnotation" reverse-reference="subject"/>
       <collection name="phenotypesObserved" referenced-type="Phenotype" reverse-reference="observedIn"/>
       <collection name="phenotypesNotObserved" referenced-type="Phenotype" reverse-reference="notObservedIn"/>
       <collection name="crossReferences" referenced-type="CrossReference" reverse-reference="subject"/>
       <collection name="dataSets" referenced-type="DataSet" reverse-reference="bioEntities"/>
       <collection name="locatedFeatures" referenced-type="Location" reverse-reference="locatedOn"/>
       <collection name="locations" referenced-type="Location" reverse-reference="feature"/>
   </class>

Protein gets its own copy of all of these attributes, references, and collections. If BioEntity itself inherits any fields, Protein receives those as well.
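The inheritance rule can be checked mechanically. The following is a minimal sketch that parses a trimmed copy of the two classes shown above and lists every field available on Protein, own and inherited. The `<model>` wrapper element and the helper name `all_fields` are assumptions made for this example, as is the handling of several space-separated parents in `extends`.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two classes shown above. The <model> wrapper is
# only there to give the fragment a single root element.
MODEL_XML = """
<model>
  <class name="BioEntity" is-interface="true">
    <attribute name="primaryIdentifier" type="java.lang.String"/>
    <attribute name="symbol" type="java.lang.String"/>
    <reference name="organism" referenced-type="Organism"/>
  </class>
  <class name="Protein" extends="BioEntity" is-interface="true">
    <attribute name="primaryAccession" type="java.lang.String"/>
    <reference name="sequence" referenced-type="Sequence"/>
    <collection name="genes" referenced-type="Gene" reverse-reference="proteins"/>
  </class>
</model>
"""


def all_fields(classes, class_name):
    """Return field names for class_name, own fields first, then inherited."""
    cls = classes.get(class_name)
    if cls is None:  # parent class is not defined in this fragment
        return []
    fields = [child.get("name") for child in cls]
    # Assumption: extends may name several parents separated by spaces.
    for parent in cls.get("extends", "").split():
        fields += all_fields(classes, parent)
    return fields


classes = {c.get("name"): c for c in ET.fromstring(MODEL_XML).iter("class")}
protein_fields = all_fields(classes, "Protein")
print(protein_fields)
```

With this fragment, the list for Protein contains its three own fields followed by the three fields it inherits from BioEntity.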

<attribute name="primaryAccession" type="java.lang.String"/>: This creates an attribute of Protein called primaryAccession (read "primary accession") whose type is a string (text). Every Protein object can hold a primaryAccession value, and so can objects of any class that inherits from Protein.

<reference name="sequence" referenced-type="Sequence"/>: This creates a reference named "sequence" to another data type, in this case Sequence. A reference points to only one object at a time, so a Protein has a single Sequence. A reverse-reference attribute may also appear on a reference; it names the field in the referenced type that holds the reciprocal relationship, if one exists.

<collection name="CDSs" referenced-type="CDS" reverse-reference="protein"/>: This creates a collection named "CDSs". A collection can hold references to many CDS objects, and each of those CDS objects points back to the Protein through its "protein" field (CDS.protein).
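To make the reverse-reference idea concrete, here is a small sketch that checks whether two fields name each other as reverse references. It uses the Protein.genes and Gene.proteins pair, because both sides of that pair appear in the model excerpts on this page, while the CDS class itself is not shown. The `<model>` wrapper and the function name `is_reciprocal` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Only the two collections that point at each other are kept here.
MODEL_XML = """
<model>
  <class name="Protein" extends="BioEntity" is-interface="true">
    <collection name="genes" referenced-type="Gene" reverse-reference="proteins"/>
  </class>
  <class name="Gene" extends="SequenceFeature" is-interface="true">
    <collection name="proteins" referenced-type="Protein" reverse-reference="genes"/>
  </class>
</model>
"""


def is_reciprocal(root, class_name, field_name):
    """True if the field's reverse-reference points back at class_name.field_name."""
    field = root.find(f"./class[@name='{class_name}']/*[@name='{field_name}']")
    if field is None or field.get("reverse-reference") is None:
        return False
    target = field.get("referenced-type")
    reverse = field.get("reverse-reference")
    # Look up the named field on the referenced class and compare both ends.
    back = root.find(f"./class[@name='{target}']/*[@name='{reverse}']")
    return (
        back is not None
        and back.get("referenced-type") == class_name
        and back.get("reverse-reference") == field_name
    )


root = ET.fromstring(MODEL_XML)
print(is_reciprocal(root, "Protein", "genes"))
```

The same check applied to Protein.CDSs would look for a "protein" field on the CDS class that points back to Protein.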

Creating queries

Use case: How do I find all coding sequences for a gene?

Start by asking how the two are connected biologically: genes produce transcripts, which produce coding sequences. Keep this chain in mind while constructing a query.

Method 1

Open the model file and find the Gene class (text search for <class name="Gene").

   <class name="Gene" extends="SequenceFeature" is-interface="true">
       <attribute name="briefDescription" type="java.lang.String"/>
       <attribute name="operon" type="java.lang.String"/>
       <attribute name="description" type="java.lang.String"/>
       <reference name="referenceAllele" referenced-type="Allele"/>
       <reference name="downstreamIntergenicRegion" referenced-type="IntergenicRegion"/>
       <reference name="upstreamIntergenicRegion" referenced-type="IntergenicRegion"/>
       <collection name="expressionClusters" referenced-type="ExpressionCluster" reverse-reference="genes"/>
       <collection name="regulatoryRegions" referenced-type="RegulatoryRegion" reverse-reference="gene"/>
       <collection name="goAnnotation" referenced-type="GOAnnotation"/>
       <collection name="transcripts" referenced-type="Transcript" reverse-reference="gene"/>
       <collection name="CDSs" referenced-type="CDS" reverse-reference="gene"/>
       <collection name="flankingRegions" referenced-type="GeneFlankingRegion" reverse-reference="gene"/>
       <collection name="proteins" referenced-type="Protein" reverse-reference="genes"/>
       <collection name="UTRs" referenced-type="UTR" reverse-reference="gene"/>
       <collection name="exons" referenced-type="Exon" reverse-reference="gene"/>
       <collection name="expressionPatterns" referenced-type="ExpressionPattern" reverse-reference="gene"/>
       <collection name="alleles" referenced-type="Allele" reverse-reference="gene"/>
       <collection name="introns" referenced-type="Intron" reverse-reference="genes"/>
       <collection name="strains" referenced-type="Strain" reverse-reference="gene"/>
   </class>

Gene references transcripts in this line:
<collection name="transcripts" referenced-type="Transcript" reverse-reference="gene"/>
1. This is a step in the right direction: the query stub so far is Gene.transcripts.
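One way to sanity-check a query stub is to walk it through the model and see which type it ends on. The sketch below does this for Gene.transcripts against a two-line fragment of the Gene class. The `<model>` wrapper and the `resolve_path` helper are hypothetical, and the sketch deliberately ignores inherited fields.

```python
import xml.etree.ElementTree as ET

# Two fields taken from the Gene class above, enough to walk one step.
MODEL_XML = """
<model>
  <class name="Gene" extends="SequenceFeature" is-interface="true">
    <attribute name="briefDescription" type="java.lang.String"/>
    <collection name="transcripts" referenced-type="Transcript" reverse-reference="gene"/>
  </class>
</model>
"""


def resolve_path(root, path):
    """Follow a dotted path such as 'Gene.transcripts'; return the type it ends on."""
    current, *steps = path.split(".")
    for step in steps:
        field = root.find(f"./class[@name='{current}']/*[@name='{step}']")
        if field is None:
            raise KeyError(f"{current} has no field named {step!r} in this fragment")
        # References and collections lead to another class; attributes end
        # the path on a plain Java type.
        current = field.get("referenced-type", field.get("type"))
    return current


root = ET.fromstring(MODEL_XML)
print(resolve_path(root, "Gene.transcripts"))
```

Here the stub Gene.transcripts resolves to the Transcript class, so the next step of the query would start from the fields of Transcript.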


Difference between revisions of "WormMine"
