Suggested pre-release data checks

From WormBaseWiki

This page is for people to add checks that should be run to confirm the data is correct before each public release of the database from Sanger. Ideally, each entry should give a description, an acedb query and the expected outcome. I've added the one Igor sent regarding the RNAi-gene connection problems that occurred in WS155. Once checks are added here, they can be systematically included for each release.
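As a rough illustration of how one entry (description, query, expected outcome) might be used in a script, here is a minimal Python sketch. The names `Check` and `evaluate` are hypothetical and not part of any WormBase build code, and the observed count is supplied by hand rather than fetched from acedb.

```python
from dataclasses import dataclass


@dataclass
class Check:
    desc: str       # human-readable description (DESC)
    query: str      # acedb query text (QUERY)
    expected: int   # count taken from an earlier build (RESULT)


def evaluate(check: Check, observed: int) -> str:
    """Compare the count observed in the new build with the expected count."""
    if observed == check.expected:
        return f"OK    {check.desc}: {observed}"
    diff = observed - check.expected
    return (f"CHECK {check.desc}: expected {check.expected}, "
            f"got {observed} ({diff:+d})")


# Hypothetical usage: in practice the observed count would come from running
# check.query against the new build; here it is typed in by hand.
operons = Check("operons without genes", "find operon !Contains_gene", 0)
print(evaluate(operons, 0))
print(evaluate(operons, 3))
```

A mismatch is reported rather than treated as a hard failure, since many of the counts below are expected to drift a little between releases.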

-------------------------------------------------------------------------
 

DESC "The number of RNAi experiments with more than one associated Gene"

QUERY 'find rnai COUNT gene > 1 AND uniquely_mapped'

RESULT 1716

-------------------------------------------------------------------------
 

DESC "The number of microarray results with connections to genes"

QUERY 'find microarray_results gene'

RESULT 58800

-------------------------------------------------------------------------
 

DESC "The number of RNAi results with connections to genes"

QUERY 'find RNAi Gene'

RESULT 61844

-------------------------------------------------------------------------
 

DESC "PCR products overlapping CDS"

QUERY 'find PCR_product Overlaps_CDS'

RESULT 62852

-------------------------------------------------------------------------
 

DESC "The number of wormpep without pep_homol"

QUERY 'find wormpep !pep_homol'

RESULT 45

-------------------------------------------------------------------------
 

DESC "tRNAs not attached to parent properly"

QUERY 'Transcript AND NEXT AND NOT NEXT'

RESULT 0

-------------------------------------------------------------------------
 

DESC "Homol_data without waba"

QUERY 'find Homol_data *waba !DNA_homol'

RESULT 0

-------------------------------------------------------------------------
 

DESC "Homol_data without Pep_homol"

QUERY 'find Homol_data *wublastx* !Pep_homol'

RESULT 5207

-------------------------------------------------------------------------
 

DESC "Inverted repeat Feature_data without features"

QUERY 'find Feature_data *inverted !feature'

RESULT 505

-------------------------------------------------------------------------
 

DESC "TRF repeat Feature_data without features"

QUERY 'find Feature_data *TRF !Feature'

RESULT 0

-------------------------------------------------------------------------
 

DESC "Oligo_sets with overlapping_CDS"

QUERY 'find Oligo_Set Overlaps_CDS'

RESULT 74615

-------------------------------------------------------------------------
 

DESC "operons without genes"

QUERY 'find operon !Contains_gene'

RESULT 0

-------------------------------------------------------------------------
 

DESC "variation gene connection"

QUERY 'find Variation Gene'

RESULT 21489

-------------------------------------------------------------------------
 

DESC "genes with structured description"

QUERY 'find Gene Structured_description'

RESULT 4157

-------------------------------------------------------------------------
 

DESC "genes with GO_term"

QUERY 'find Gene GO_term'

RESULT 9897

-------------------------------------------------------------------------
 

These checks compare what is in the database with what is in the GFF files. They work by counting the number of lines in the GFF files that 'grep' extracts when searching for the term after GFF, and comparing that to the number of objects returned by the acedb query after QUERY. Entries with an EXPECT instead of a QUERY are ones where I couldn't come up with a simple query to get the answer, so I took the figure from the last build.
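The comparison described above could be sketched in Python roughly as follows. This is only an illustration: `count_gff_lines` and `compare_gff_to_db` are hypothetical names, the GFF lines are made up for the example, and the database count is passed in by hand instead of being obtained from acedb. The match is a plain substring test, which behaves like `grep` for the simple terms used on this page.

```python
from typing import Iterable, Tuple


def count_gff_lines(lines: Iterable[str], term: str) -> int:
    """Count lines containing `term` anywhere (plain substring match)."""
    return sum(1 for line in lines if term in line)


def compare_gff_to_db(lines: Iterable[str], term: str,
                      db_count: int) -> Tuple[bool, int]:
    """Return (counts agree?, number of matching GFF lines)."""
    gff_count = count_gff_lines(lines, term)
    return gff_count == db_count, gff_count


# Made-up GFF lines, for illustration only.
sample_lines = [
    'I\texample\toperon\t100\t900\t.\t+\t.\tOperon "EXAMPLE1"',
    'I\texample\toperon\t2000\t3500\t.\t-\t.\tOperon "EXAMPLE2"',
    'I\texample\tpolyA_site\t950\t951\t.\t+\t.\tFeature "EXAMPLE3"',
]

# db_count would be the number of objects returned by 'find Operon'.
print(compare_gff_to_db(sample_lines, "operon", db_count=2))
```

Because the match is on the whole line, a term that also appears in another column would be counted too, just as it would be with `grep`.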


DESC "Deletion and insertion alleles"

GFF "complex_change_in_nucleotide_sequence"

QUERY 'find Variation flanking_sequences AND method = "Deletion_and_insertion_allele"'

-------------------------------------------------------------------------
 

DESC "Deletion alleles"

GFF "deletion"

QUERY 'find Variation flanking_sequences AND method = "Deletion_allele"'

-------------------------------------------------------------------------
 

DESC "Substitution alleles"

GFF "substitution"

QUERY 'find Variation flanking_sequences AND method = "Substitution_allele"'

-------------------------------------------------------------------------
 

DESC "RNAi primary"

GFF "RNAi_primary"

EXPECT '134246'

-------------------------------------------------------------------------
 

DESC "RNAi secondary"

GFF "RNAi_secondary"

EXPECT '14165'

-------------------------------------------------------------------------
 

DESC "Alleles"

GFF "sequence_variant"

QUERY 'find Variation flanking_sequences AND method = "Allele"'

-------------------------------------------------------------------------
 

DESC "Vancouver fosmids"

GFF "Vancouver_fosmid"

QUERY 'find Sequence "WRM*"' #bit of a cheat but much faster!

-------------------------------------------------------------------------
 

DESC "Coding_transcripts"

GFF "protein_coding_primary_transcript"

QUERY 'find Coding_transcripts'

-------------------------------------------------------------------------
 

DESC "All PCR products"

GFF "PCR_product"

QUERY 'find PCR_product Canonical_parent'

-------------------------------------------------------------------------
 

DESC "Expression profiles"

GFF "Expr_profile"

QUERY 'find Expr_profile S_parent'

-------------------------------------------------------------------------
 

DESC "cDNA for RNAi"

GFF "cDNA_for_RNAi"

QUERY 'find RNAi Method Homol_homol follow Sequence'

-------------------------------------------------------------------------
 

DESC "mapped Oligo_set"

GFF "Oligo_set"

QUERY 'find Oligo_set'

-------------------------------------------------------------------------
 

DESC "mapped Operons"

GFF "operon"

QUERY 'find Operon'

-------------------------------------------------------------------------
 

DESC "SL1"

GFF "SL1_acceptor_site"

QUERY 'find Feature method = "SL1"'

-------------------------------------------------------------------------
 

DESC "SL2"

GFF "SL2_acceptor_site"

QUERY 'find Feature method = "SL2"'

-------------------------------------------------------------------------
 

DESC "polyA sites"

GFF "polyA_site"

QUERY 'find Feature method = "polyA_site"'
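
-------------------------------------------------------------------------

To include the entries on this page in a script, the records first have to be read into some structure. The following Python sketch shows one possible way of doing that. It assumes records are separated by lines of dashes and that each field starts with one of the keywords DESC, GFF, QUERY, RESULT or EXPECT. `parse_checks` and `unquote` are hypothetical helper names, not existing WormBase code.

```python
import re

SEPARATOR = re.compile(r"^-{5,}\s*$")
FIELD = re.compile(r"^(DESC|GFF|QUERY|RESULT|EXPECT)\b:?\s*(.*)$")


def unquote(value: str) -> str:
    """Strip surrounding quotes, dropping anything after the closing quote."""
    value = value.strip()
    if value[:1] in ("'", '"'):
        end = value.rfind(value[0])
        if end > 0:
            return value[1:end]
    return value


def parse_checks(text: str) -> list[dict[str, str]]:
    """Split the page text into one dict per record, keyed by field name."""
    checks, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if SEPARATOR.match(line):
            if current:
                checks.append(current)
            current = {}
            continue
        match = FIELD.match(line)
        if match:
            current[match.group(1)] = unquote(match.group(2))
    if current:
        checks.append(current)
    return checks


sample = """
DESC "operons without genes"

QUERY 'find operon !Contains_gene'

RESULT 0

-------------------------------------------------------------------------

DESC "Vancouver fosmids"

GFF "Vancouver_fosmid"

QUERY 'find Sequence "WRM*"' #bit of a cheat but much faster!
"""

for check in parse_checks(sample):
    print(check)
```

Values are kept as strings, so a caller would still need to convert RESULT and EXPECT to integers before comparing counts.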