Difference between revisions of "ModENCODE Integration status"

From WormBaseWiki
Revision as of 15:52, 25 April 2012


modMine projects

modMine Project categories

http://www.modencode.org/ then follow the links down the left-hand side


Chromatin structure

Nucleosome mapping

   Not in ACeDB yet because the data will be held as Features and we don't want lots of Features retired with each new version.
   Script prepared ready to convert the final data to Feature objects.
   Will need to check we have all the data and load it in.

Genome wide chromatin profiling in C. elegans

   Not in ACeDB yet because the data will be held as Features and we don't want lots of Features retired with each new version.
   Script prepared ready to convert the final data to Feature objects.
   Will need to check we have all the data and load it in.

Gene Structure

Encyclopedia of C. elegans 3' UTRs and their regulatory elements

   Worked on by Paul Davis.
   Oct 2010 - 3' UTR regions added as Features
   Sequence data to act as evidence for the 3' UTR regions has been requested from Marc Perry/Nicole Washington several times, 
   most recently at the March 2012 modENCODE meeting where Nicole said she would get it.
   27,773 PolyA sites are in this data - Not in ACeDB.

   Not progressed any further, as the remit of the project was to produce many additional Feature types (binding sites etc.)
   which never materialised in a release, so this was presumably waiting for the final wind-down of the project before taking the final set of Features.

SL Acceptor Site Confirmation in C. elegans

   May 2010 - These SL1/2 locations were sent to Gary by LaDeana.
   These locations should be checked and the 5' RACE sequences added.

Reanalysis of comprehensive set of C. elegans transcripts and expression for various stages and conditions

   This is an analysis of 63 RNASeq libraries giving TSS, TES, SL1/2, integrated CDS and transcript genes, and PolyA sites
   The PolyA sites were not added as many "PolyA sites" were found to be the end of exons with a poly-A region on the following exon.
   May 2010 - Data added not from here but directly from LaDeana's data.
   Jan 2011 - Data updated not from here but directly from LaDeana's data.
   Nov 2011 - Integrated CDS added not from here but directly from LaDeana's data ('1003' version)
   March 2012 (WS231) - Integrated CDS updated not from here but directly from LaDeana's data ('AG1110.v1201' version)

Intron Confirmation in C. elegans

   This project produced the 'FM' series of sequences in ENA.
   Automatically imported with other ENA sequences.
   Being used in sequence curation and automated transcript building.

Mass spectrometric validation of short ORFs

   Dec 2011 (WS230) - Added this set of 63,000 MacCoss mass-spectrometry peptides to ACeDB

Genelet Gap Closing

   This project produced the 'FR' series of sequences in ENA.
   Automatically imported with other ENA sequences.
   Being used in sequence curation and automated transcript building.

   Not the cleanest of data so some efforts have been made to clip poor quality data, but it's not perfect.

Identification of small RNAs in C. elegans

   Oct 2010 - Added these "7K data set" RNA genes directly from the supplementary data section of the "Integrative analysis of Functional Elements..." paper.
   Are some of these identified as miRNAs? Check this and add 'miRNA' tags.

Transcription Start Site Confirmation in C. elegans

   This project uses some of the 'FM' series of sequences in ENA as evidence for Transcription Start Sites.
   Automatically imported with other ENA sequences.
   Being used in sequence curation and automated transcript building.
   This is the same set of TSS sites as has been imported from the "Reanalysis of comprehensive set of C. elegans transcripts" project.

Histone modification and replacement

Chromatin ChIP-chip of Modified Histones in C. elegans

   Not in ACeDB yet because the data will be held as Features and we don't want lots of Features retired with each new version.
   Script prepared ready to convert the final data to Feature objects.
   Will need to check we have all the data and load it in.

Chromatin ChIP-seq

   Not in ACeDB yet because the data will be held as Features and we don't want lots of Features retired with each new version.
   Script prepared ready to convert the final data to Feature objects.
   Will need to check we have all the data and load it in.

Metadata only

Validation of C. elegans miRNAs by qRT-PCR

   Description of how to identify miRNAs
   No data.

Other chromatin binding sites

Chromatin ChIP-chip

   Not in ACeDB yet because the data will be held as Features and we don't want lots of Features retired with each new version.
   Script prepared ready to convert the final data to Feature objects.
   Will need to check we have all the data and load it in.

Chromatin ChIP-chip of non-Histone Chromosomal Proteins in C. elegans

   Not in ACeDB yet because the data will be held as Features and we don't want lots of Features retired with each new version.
   Script prepared ready to convert the final data to Feature objects.
   Will need to check we have all the data and load it in.

RNA expression profiling

Assaying RNA Expression in C. elegans

   This is a single tiling-array data-set of genomic regions in the 'embryo' life-stage - it looks like a test for a project that was never followed through.
   This should be dealt with by the Caltech expression group.
   To get the expression levels, click in the right-hand-side green box on the number after "expression level"

Transcriptional profiling of C. elegans N2 embryos

   This is a single tiling-array data-set of genomic regions in the 'young adult' life-stage - it looks like a test for a project that was never followed through.
   This should be dealt with by the Caltech expression group.
   There is no GFF data, only .wig files.

Small RNA expression in C. elegans embryos

   Data used to produce the '7K-data set'
   The data consists of fastq files of RNASeq data.
   We could investigate the possibility and problems involved in using tophat/cufflinks to align this data and get expression levels.
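
   As a starting point for that investigation, the two steps could be scripted roughly as below. This is only a sketch: the helper names, file paths and thread count are hypothetical, and it just builds the command lines (TopHat writing to an output directory, then Cufflinks run on the resulting accepted_hits.bam, optionally against a reference GTF) without running them.

```python
import os
from typing import List, Optional


def tophat_command(index_prefix: str, fastq: str, out_dir: str, threads: int = 1) -> List[str]:
    """Build a TopHat command line: align one fastq file against a Bowtie index."""
    return ["tophat", "-p", str(threads), "-o", out_dir, index_prefix, fastq]


def cufflinks_command(bam: str, out_dir: str,
                      annotation_gtf: Optional[str] = None,
                      threads: int = 1) -> List[str]:
    """Build a Cufflinks command line to estimate expression from aligned reads."""
    cmd = ["cufflinks", "-p", str(threads), "-o", out_dir]
    if annotation_gtf is not None:
        # -G restricts quantification to the supplied reference annotation.
        cmd += ["-G", annotation_gtf]
    cmd.append(bam)
    return cmd


def pipeline_commands(index_prefix: str, fastq: str, work_dir: str,
                      annotation_gtf: Optional[str] = None) -> List[List[str]]:
    """Return the two commands in order; paths under work_dir are hypothetical."""
    align_dir = os.path.join(work_dir, "tophat")
    expr_dir = os.path.join(work_dir, "cufflinks")
    bam = os.path.join(align_dir, "accepted_hits.bam")  # TopHat's alignment output
    return [
        tophat_command(index_prefix, fastq, align_dir),
        cufflinks_command(bam, expr_dir, annotation_gtf),
    ]
```

   Each returned list can be passed to subprocess.run(cmd, check=True) once the paths are real. Whether a spliced aligner is a sensible choice for small RNA reads is one of the problems that would need checking.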

Identification of tissue and stage-specific transcribed sequences with expression profile maps

   This is 47 data-sets of tiling array expression data in various tissues, cells and life-stages.
   This should be dealt with by the Caltech expression group.
   These are still only the Transcriptionally Active Regions called by the "maxgap-minrun" algorithm, not those called by the "MSTAD" program described in the paper.
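
   For reference, the general idea behind a maxgap/minrun call can be sketched as follows. This is an illustration of the usual approach (join probes above a signal threshold when they are no more than maxgap apart, then keep runs spanning at least minrun), not the code used to produce these data; the function name, threshold and parameter values are assumptions.

```python
from typing import List, Sequence, Tuple


def call_active_regions(positions: Sequence[int],
                        signals: Sequence[float],
                        threshold: float,
                        maxgap: int,
                        minrun: int) -> List[Tuple[int, int]]:
    """Call regions from tiling-array probes.

    positions: probe coordinates, sorted ascending.
    signals:   one signal value per probe.
    Probes with signal >= threshold are joined into one run while the
    distance between consecutive positive probes is <= maxgap; a run is
    reported only if it spans at least minrun bases.
    """
    regions: List[Tuple[int, int]] = []
    run_start = None
    run_end = None
    for pos, signal in zip(positions, signals):
        if signal < threshold:
            continue  # negative probes never start or extend a run
        if run_start is None:
            run_start = run_end = pos
        elif pos - run_end <= maxgap:
            run_end = pos  # close enough: extend the current run
        else:
            if run_end - run_start >= minrun:
                regions.append((run_start, run_end))
            run_start = run_end = pos  # gap too large: start a new run
    if run_start is not None and run_end - run_start >= minrun:
        regions.append((run_start, run_end))
    return regions
```

   With probes at 100, 125, 150, 175, 300, 325 and 500, signals 5, 6, 1, 7, 8, 1 and 9, a threshold of 4, maxgap of 50 and minrun of 50, the only region reported is (100, 175): the isolated positive probes at 300 and 500 fail the minrun test.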

Definition of comprehensive set of C. elegans transcripts and expression for various stages and conditions

   This is the Integrated transcript data with evaluation of the expression levels.
   This is the same gene model data as "Reanalysis of comprehensive set of C. elegans transcripts and expression for various stages and conditions" above.
   LaDeana says they have added new RNASeq libraries to SRA - check and set them up in our pipeline.
   I propose that we continue to use our RNASeq cufflinks pipeline to evaluate our own expression levels as this will continue to be updated with every release.

Definition of comprehensive set of C. briggsae transcripts and expression for various stages and conditions

   Work in progress.

Changes in expression of small RNAs under stress conditions in C. elegans

   Data used to produce the '7K-data set'
   The data consists of fastq files of RNASeq data.
   We could investigate the possibility and problems involved in using tophat/cufflinks to align this data and get expression levels.

Identification of transcribed sequences under pathogenic bacterial growth conditions with expression profile maps

   This is 6 data-sets of tiling array expression data in various tissues, cells and life-stages.
   This should be dealt with by the Caltech expression group.

Changes in expression of small RNAs during aging in C. elegans

   Data used to produce the '7K-data set'
   The data consists of fastq files of RNASeq data.
   We could investigate the possibility and problems involved in using tophat/cufflinks to align this data and get expression levels.

TF binding sites

ChIP-Seq Identification of C. elegans TF Binding Sites

   Not in ACeDB yet because the data will be held as Features and we don't want lots of Features retired with each new version.
   Script prepared ready to convert the final data to Feature objects.
   Will need to check we have all the data and load it in.

Papers

Unlocking the secrets of the genome. Celniker SE, Dillon LA, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, Kellis M, Lai EC, Lieb JD, MacAlpine DM, Micklem G, Piano F… Nature, 459, pp 927-930

   A description of the proposed modENCODE project

Broad chromosomal domains of histone modification patterns in C. elegans. Liu T, Rechtsteiner A, Egelhofer TA, Vielle A, Latorre I, Cheung MS, Ercan S, Ikegami K… Genome Research, 21, pp 227-236

   Description of the histone data methods and their distribution over the chromosomes.

High nucleosome occupancy is encoded at X-linked gene promoters in C. elegans. Ercan S, Lubling Y, Segal E, Lieb JD. Genome Research, 21, pp 237-244

   Description of the nucleosome data methods and their distribution over the chromosomes.

Diverse transcription factor binding features revealed by genome-wide ChIP-seq in C. elegans. Niu W, Lu ZJ, Zhong M, Sarov M, Murray JI, Brdlik CM, Janette J… Genome Research, 21, pp 245-254

   Description of the TF data methods and their distribution over the chromosomes.
   Need to contact the authors to get PWM files for: LIN-39, MAB-5, and EGL-5

A global analysis of C. elegans trans-splicing. Allen MA, Hillier LW, Waterston RH, Blumenthal T. Genome Research, 21, pp 255-264

  Description of methods used to detect SL1 and SL2 sites and their distribution.

Multimodal RNA-seq using single-strand, double-strand, and CircLigase-based capture yields a refined and extended description of the C. elegans transcriptome. Lamm AT… Genome Research, 21, pp 265-275

   Feb 2012 (WS231) - added the following from this paper.
       No. of new SL1 features: 504
       No. of new SL2 features: 36
       No. of new polyA_site features: 10936
       There is a new 'polysome' set of FeatureData objects.

Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data. Lu ZJ… Genome Research, 21, pp 286-300

   Description of methods used to produce the '7K' data set.

A spatial and temporal map of C. elegans gene expression. Spencer WC, Zeller G, Watson JD, Henz SR, Watkins KL, McWhirter RD, Petersen S, Sreedharan VT, Widmer C… Genome Research, 21, pp 325-341

   Description of using self-organising maps for gene expression similarity
   Looked for over-represented motifs found by the FIRE algorithm

Genome-wide analysis of alternative splicing in Caenorhabditis elegans. Ramani AK, Calarco JA, Pan Q, Mavandadi S, Wang Y, Nelson AC, Lee LJ, Morris Q, Blencowe BJ… Genome Research, 21, pp 342-348

   Description of identifying novel splice sites.
   We already do this as part of our RNASeq pipeline and curation database, and the spliced introns are in ACeDB as Feature_data objects.

Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, Yip KY… Science, 330 no. 6012, pp 1775-1787

   Added the following from this paper:
       Probable parent of pseudogenes
       the '7k-set' of probable ncRNA genes
       304 HOT regions: short binding regions that were significantly enriched in most TF ChIP-seq experiments
   Check on the following not yet added:
       Our computational and experimental analysis validated 13 previously unidentified mirtrons (6, 22).
       Small-RNA data also defined 102 additional candidate canonical miRNAs.
       We tested a number of these intergenic candidates to validate expression: RT-PCR detected RNA products for 14 of 15, and Northern blots detected expression for three of five (24)
       To further characterize TF-binding sites, we searched for 8- to 12-bp cis-regulatory motifs within the ChIP-seq peaks (6) and found strong motifs for eight TFs (BLMP-1, CEH-14, CEH-30, EGL-5, HLH-1, LIN-39, NHR-6, and PHA-4) (fig. S35). Two of these are similar to previously described motifs (PHA-4 and HLH-1).
   Useful information:
       Although most transcription factors target both protein-coding and known ncRNA genes, GEI-11 preferentially targets ncRNAs