ModENCODE Integration status
From WormBaseWiki
Revision as of 11:12, 26 April 2012 by Gwilliams (talk | contribs) (→Transcriptional profiling of C. elegans N2 embryos)
Contents
- 1 modMine projects
- 1.1 modMine Project categories
- 1.2 Chromatin structure
- 1.3 Gene Structure
- 1.3.1 Encyclopedia of C. elegans 3' UTRs and their regulatory elements
- 1.3.2 SL Acceptor Site Confirmation in C. elegans
- 1.3.3 Reanalysis of comprehensive set of C. elegans transcripts and expression for various stages and conditions
- 1.3.4 Intron Confirmation in C. elegans
- 1.3.5 Mass spectrometric validation of short ORFs
- 1.3.6 Genelet Gap Closing
- 1.3.7 Identification of small RNAs in C. elegans
- 1.3.8 Transcription Start Site Confirmation in C. elegans
- 1.4 Histone modification and replacement
- 1.5 Metadata only
- 1.6 Other chromatin binding sites
- 1.7 RNA expression profiling
- 1.7.1 Assaying RNA Expression in C. elegans
- 1.7.2 Transcriptional profiling of C. elegans N2 embryos
- 1.7.3 Small RNA expression in C. elegans embryos
- 1.7.4 Identification of tissue and stage-specific transcribed sequences with expression profile maps
- 1.7.5 Definition of comprehensive set of C. elegans transcripts and expression for various stages and conditions
- 1.7.6 Definition of comprehensive set of C. briggsae transcripts and expression for various stages and conditions
- 1.7.7 Changes in expression of small RNAs under stress conditions in C. elegans
- 1.7.8 Identification of transcribed sequences under pathogenic bacterial growth conditions with expression profile maps
- 1.7.9 Changes in expression of small RNAs during aging in C. elegans
- 1.8 TF binding sites
- 2 Papers
- 2.1 Unlocking the secrets of the genome. Celniker SE, Dillon LA, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, Kellis M, Lai EC, Lieb JD, MacAlpine DM, Micklem G, Piano F… Nature, 459, pp 927-930
- 2.2 Broad chromosomal domains of histone modification patterns in C. elegans. Liu T, Rechtsteiner A, Egelhofer TA, Vielle A, Latorre I, Cheung MS, Ercan S, Ikegami K… Genome Research, 21, pp 227-236
- 2.3 High nucleosome occupancy is encoded at X-linked gene promoters in C. elegans. Ercan S, Lubling Y, Segal E, Lieb JD. Genome Research, 21, pp 237-244
- 2.4 Diverse transcription factor binding features revealed by genome-wide ChIP-seq in C. elegans. Niu W, Lu ZJ, Zhong M, Sarov M, Murray JI, Brdlik CM, Janette J… Genome Research, 21, pp 245-254
- 2.5 A global analysis of C. elegans trans-splicing. Allen MA, Hillier LW, Waterston RH, Blumenthal T. Genome Research, 21, pp 255-264
- 2.6 Multimodal RNA-seq using single-strand, double-strand, and CircLigase-based capture yields a refined and extended description of the C. elegans transcriptome. Lamm AT… Genome Research, 21, pp 265-275
- 2.7 Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data. Lu ZJ… Genome Research, 21, pp 286-300
- 2.8 A spatial and temporal map of C. elegans gene expression. Spencer WC, Zeller G, Watson JD, Henz SR, Watkins KL, McWhirter RD, Petersen S, Sreedharan VT, Widmer C… Genome Research, 21, pp 325-341
- 2.9 Genome-wide analysis of alternative splicing in Caenorhabditis elegans. Ramani AK, Calarco JA, Pan Q, Mavandadi S, Wang Y, Nelson AC, Lee LJ, Morris Q, Blencowe BJ… Genome Research, 21, pp 342-348
- 2.10 Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, Yip KY… Science, 330 no. 6012, pp 1775-1787
modMine projects
modMine Project categories
See http://www.modencode.org/, then follow the links down the left-hand side.
Chromatin structure
Nucleosome mapping
Not in ACeDB yet. The data will be held as Features, and we don't want lots of Features retired with each new version. A script is ready to convert the final data to Feature objects. We still need to check that we have all the data and then load it.
Genome wide chromatin profiling in C. elegans
Not in ACeDB yet. The data will be held as Features, and we don't want lots of Features retired with each new version. A script is ready to convert the final data to Feature objects. We still need to check that we have all the data and then load it.
Gene Structure
Encyclopedia of C. elegans 3' UTRs and their regulatory elements
Worked on by Paul Davis.
- Oct 2010 - 3' UTR regions added as Features.
- Sequence data to act as evidence for the 3' UTR regions has been requested from Marc Perry/Nicole Washington several times, most recently at the March 2012 modENCODE meeting, where Nicole said she would get it.
- 27,773 PolyA sites are in this data - not in ACeDB.
- Not progressed any further, as the remit of the project was to produce lots of additional Feature types (binding sites etc.), which never materialised in a release. Were we waiting for the final wind-down of the project before taking the final set of features?
SL Acceptor Site Confirmation in C. elegans
May 2010 - These SL1/2 locations were sent to Gary by LaDeana. These locations should be checked and the 5' RACE sequences added.
Reanalysis of comprehensive set of C. elegans transcripts and expression for various stages and conditions
This is an analysis of 63 RNASeq libraries giving TSS, TES, SL1/2, integrated CDS and transcript genes, and PolyA sites. The PolyA sites were not added, as many "PolyA sites" were found to be the end of exons with a poly-A region on the following exon.
- May 2010 - Data added, not from here but directly from LaDeana's data.
- Jan 2011 - Data updated, not from here but directly from LaDeana's data.
- Nov 2011 - Integrated CDS added, not from here but directly from LaDeana's data ('1003' version).
- March 2012 (WS231) - Integrated CDS updated, not from here but directly from LaDeana's data ('AG1110.v1201' version).
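The PolyA problem noted above (a candidate site that is really the end of an exon, followed by an A-rich start to the next exon) could be screened for along the following lines. This is only a sketch under stated assumptions: the record keys, the 10-base window and the 80% A threshold are all hypothetical choices for illustration, not values taken from LaDeana's analysis or from our scripts.

```python
def is_a_rich(seq, min_fraction=0.8):
    """True if at least `min_fraction` of the bases in `seq` are A."""
    seq = seq.upper()
    return len(seq) > 0 and seq.count("A") / len(seq) >= min_fraction


def split_polya_candidates(candidates, window=10, min_fraction=0.8):
    """Separate candidate PolyA sites into (kept, suspect) lists of ids.

    Each candidate is a dict with hypothetical keys:
      'id'            - any identifier for the site
      'at_exon_end'   - True if the site coincides with the end of an exon
      'next_exon_seq' - transcript-strand sequence of the following exon
                        ('' if there is none)
    A site is 'suspect' when it sits at an exon end and the first `window`
    bases of the following exon are A-rich (assumed thresholds).
    """
    kept, suspect = [], []
    for site in candidates:
        head = site["next_exon_seq"][:window]
        if site["at_exon_end"] and is_a_rich(head, min_fraction):
            suspect.append(site["id"])
        else:
            kept.append(site["id"])
    return kept, suspect
```

Suspect sites would still need checking by eye; the thresholds would have to be tuned against sites already known to be real.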
Intron Confirmation in C. elegans
This project produced the 'FM' series of sequences in ENA. Automatically imported with other ENA sequences. Being used in sequence curation and automated transcript building.
Mass spectrometric validation of short ORFs
Dec 2011 (WS230) - Added this set of 63,000 MacCoss mass-spectrometry peptides to ACeDB
Genelet Gap Closing
This project produced the 'FR' series of sequences in ENA. Automatically imported with other ENA sequences. Being used in sequence curation and automated transcript building. Not the cleanest of data so some efforts have been made to clip poor quality data, but it's not perfect.
Identification of small RNAs in C. elegans
Oct 2010 - Added these "7K data set" RNA genes directly from the supplementary data section of the "Integrative analysis of Functional Elements..." paper. Are some of these identified as miRNAs? Check this and add 'miRNA' tags.
Transcription Start Site Confirmation in C. elegans
This project uses some of the 'FM' series of sequences in ENA as evidence for Transcription Start Sites. Automatically imported with other ENA sequences. Being used in sequence curation and automated transcript building. This is the same set of TSS sites as has been imported from the "Reanalysis of comprehensive set of C. elegans transcripts" project.
Histone modification and replacement
Chromatin ChIP-chip of Modified Histones in C. elegans
Not in ACeDB yet. The data will be held as Features, and we don't want lots of Features retired with each new version. A script is ready to convert the final data to Feature objects. We still need to check that we have all the data and then load it.
Chromatin ChIP-seq
Not in ACeDB yet. The data will be held as Features, and we don't want lots of Features retired with each new version. A script is ready to convert the final data to Feature objects. We still need to check that we have all the data and then load it.
Metadata only
Validation of C. elegans miRNAs by qRT-PCR
Description of how to identify miRNAs. No data.
Other chromatin binding sites
Chromatin ChIP-chip
Not in ACeDB yet. The data will be held as Features, and we don't want lots of Features retired with each new version. A script is ready to convert the final data to Feature objects. We still need to check that we have all the data and then load it.
Chromatin ChIP-chip of non-Histone Chromosomal Proteins in C. elegans
Not in ACeDB yet. The data will be held as Features, and we don't want lots of Features retired with each new version. A script is ready to convert the final data to Feature objects. We still need to check that we have all the data and then load it.
RNA expression profiling
Assaying RNA Expression in C. elegans
This is a single tiling-array data set of genomic regions in the 'embryo' life-stage - it looks like a test for a project that was never followed through. This should be dealt with by the Caltech expression group. To get the expression levels, click on the number after "expression level" in the green box on the right-hand side. There are only expression values for WS190 CDS isoform spans; there is no data that can be used to create expression values for the latest gene models.
Transcriptional profiling of C. elegans N2 embryos
This is a single tiling-array data set of genomic regions in the 'young adult' life-stage - it looks like a test for a project that was never followed through. This should be dealt with by the Caltech expression group. There are only .wig files - no peaks have been called on this data.
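If anyone does want to look at the .wig signal directly, a minimal reader for the two standard wiggle layouts (variableStep and fixedStep) might look like the sketch below. It assumes plain-text wiggle input, skips track, browser and comment lines, and does not handle bedGraph-style data lines; the function name and the tuple layout are illustrative choices only.

```python
def parse_wig(lines):
    """Yield (chrom, start, span, value) from variableStep/fixedStep WIG lines."""
    mode = chrom = pos = None
    span = step = 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if fields[0] in ("variableStep", "fixedStep"):
            # Declaration line: key=value parameters follow the keyword.
            mode = fields[0]
            params = dict(f.split("=", 1) for f in fields[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                pos = int(params["start"])
                step = int(params.get("step", 1))
            continue
        if mode == "variableStep":
            # Data line: "<position> <value>"
            yield chrom, int(fields[0]), span, float(fields[1])
        elif mode == "fixedStep":
            # Data line: "<value>"; the position advances by `step`.
            yield chrom, pos, span, float(fields[0])
            pos += step
        else:
            raise ValueError("data line before any declaration line: %r" % line)
```

Usage would be along the lines of `for chrom, start, span, value in parse_wig(open(path)): ...`, where `path` is whichever .wig file has been downloaded.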
Small RNA expression in C. elegans embryos
Data used to produce the '7K-data set'. The data consists of fastq files of RNASeq data. We could investigate the possibility of, and the problems involved in, using tophat/cufflinks to align this data and get expression levels.
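A possible starting point for that investigation is sketched below. It only builds the TopHat and Cufflinks command lines for one fastq file, so they can be inspected before anything is run. The index prefix, GTF path, output directory layout and thread count are placeholders (assumptions, not our pipeline's settings), and small RNA reads may well need different alignment parameters from these defaults.

```python
from pathlib import PurePosixPath


def build_commands(fastq, bowtie_index, gtf, out_root, threads=4):
    """Return (tophat_cmd, cufflinks_cmd) as argument lists for one library.

    All paths are hypothetical placeholders supplied by the caller.
    """
    sample = PurePosixPath(fastq).name.split(".")[0]
    tophat_out = PurePosixPath(out_root) / sample / "tophat"
    cufflinks_out = PurePosixPath(out_root) / sample / "cufflinks"

    # Usage: tophat [options] <bowtie_index> <reads>
    tophat_cmd = [
        "tophat",
        "-o", str(tophat_out),
        "-p", str(threads),
        str(bowtie_index),
        str(fastq),
    ]
    # TopHat writes its alignments to accepted_hits.bam in its output directory.
    cufflinks_cmd = [
        "cufflinks",
        "-o", str(cufflinks_out),
        "-p", str(threads),
        "-G", str(gtf),  # quantify against the supplied gene models only
        str(tophat_out / "accepted_hits.bam"),
    ]
    return tophat_cmd, cufflinks_cmd


if __name__ == "__main__":
    # Example paths are made up for illustration.
    for cmd in build_commands("reads/lib1.fastq", "index/c_elegans", "genes.gtf", "out"):
        print(" ".join(cmd))
```

The returned lists can be passed to `subprocess.run(cmd, check=True)` once the parameters have been agreed.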
Identification of tissue and stage-specific transcribed sequences with expression profile maps
These are 47 data sets of tiling-array expression data in various tissues, cells and life-stages. This should be dealt with by the Caltech expression group. These are still only the Transcriptionally Active Regions called by the "maxgap-minrun" algorithm, not those called by the mSTAD program described in the paper.
- 10 Feb 2011 - modENCODE say they have the mSTAD data and they will make it available.
- 26 Apr 2012 - Asked modENCODE again if they can make it available.
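For reference, the general idea behind a maxgap/minrun style region call can be sketched as follows. This is a minimal illustration, not the code modENCODE used: the signal threshold, the use of a single coordinate per probe and the parameter values in the example are all assumptions.

```python
def call_active_regions(probes, threshold, maxgap, minrun):
    """Call regions from an iterable of (position, signal) probes.

    Probes with signal >= threshold count as 'positive'. Consecutive positive
    probes no more than `maxgap` bp apart are joined into one region, and a
    joined region is kept only if it spans at least `minrun` bp.
    Returns a list of (start, end) tuples.
    """
    regions = []
    start = end = None
    for pos, signal in sorted(probes):
        if signal < threshold:
            continue  # negative probes are bridged if the gap allows it
        if start is None:
            start = end = pos
        elif pos - end <= maxgap:
            end = pos
        else:
            if end - start >= minrun:
                regions.append((start, end))
            start = end = pos
    if start is not None and end - start >= minrun:
        regions.append((start, end))
    return regions


if __name__ == "__main__":
    example = [(100, 5), (125, 6), (150, 1), (175, 7), (400, 8), (425, 9), (1000, 9)]
    print(call_active_regions(example, threshold=4, maxgap=50, minrun=40))
```

In the example, the probe at 150 falls below the threshold but is bridged because its positive neighbours are within `maxgap`, while the short run at 400-425 and the isolated probe at 1000 are dropped by `minrun`.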
Definition of comprehensive set of C. elegans transcripts and expression for various stages and conditions
This is the Integrated transcript data with evaluation of the expression levels. This is the same gene model data as "Reanalysis of comprehensive set of C. elegans transcripts and expression for various stages and conditions" above. LaDeana says they have added new RNASeq libraries to SRA - check and set them up in our pipeline. I propose that we continue to use our RNASeq cufflinks pipeline to evaluate our own expression levels as this will continue to be updated with every release.
Definition of comprehensive set of C. briggsae transcripts and expression for various stages and conditions
Work in progress.
Changes in expression of small RNAs under stress conditions in C. elegans
Data used to produce the '7K-data set'. The data consists of fastq files of RNASeq data. We could investigate the possibility of, and the problems involved in, using tophat/cufflinks to align this data and get expression levels.
Identification of transcribed sequences under pathogenic bacterial growth conditions with expression profile maps
These are 6 data sets of tiling-array expression data in various tissues, cells and life-stages. This should be dealt with by the Caltech expression group.
Changes in expression of small RNAs during aging in C. elegans
Data used to produce the '7K-data set'. The data consists of fastq files of RNASeq data. We could investigate the possibility of, and the problems involved in, using tophat/cufflinks to align this data and get expression levels.
TF binding sites
ChIP-Seq Identification of C. elegans TF Binding Sites
Not in ACeDB yet. The data will be held as Features, and we don't want lots of Features retired with each new version. A script is ready to convert the final data to Feature objects. We still need to check that we have all the data and then load it.
Papers
Unlocking the secrets of the genome. Celniker SE, Dillon LA, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, Kellis M, Lai EC, Lieb JD, MacAlpine DM, Micklem G, Piano F… Nature, 459, pp 927-930
A description of the proposed modENCODE project
Broad chromosomal domains of histone modification patterns in C. elegans. Liu T, Rechtsteiner A, Egelhofer TA, Vielle A, Latorre I, Cheung MS, Ercan S, Ikegami K… Genome Research, 21, pp 227-236
Description of the histone data methods and their distribution over the chromosomes.
High nucleosome occupancy is encoded at X-linked gene promoters in C. elegans. Ercan S, Lubling Y, Segal E, Lieb JD. Genome Research, 21, pp 237-244
Description of the nucleosome data methods and their distribution over the chromosomes.
Diverse transcription factor binding features revealed by genome-wide ChIP-seq in C. elegans. Niu W, Lu ZJ, Zhong M, Sarov M, Murray JI, Brdlik CM, Janette J… Genome Research, 21, pp 245-254
Description of the TF data methods and their distribution over the chromosomes. Need to contact the authors to get PWM files for LIN-39, MAB-5 and EGL-5.
A global analysis of C. elegans trans-splicing. Allen MA, Hillier LW, Waterston RH, Blumenthal T. Genome Research, 21, pp 255-264
Description of methods used to detect SL1 and SL2 sites and their distribution.
Multimodal RNA-seq using single-strand, double-strand, and CircLigase-based capture yields a refined and extended description of the C. elegans transcriptome. Lamm AT… Genome Research, 21, pp 265-275
Feb 2012 (WS231) - added the following from this paper:
- No. of new SL1 features: 504
- No. of new SL2 features: 36
- No. of new polyA_site features: 10936
There is a new 'polysome' set of FeatureData objects.
Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data. Lu ZJ… Genome Research, 21, pp 286-300
Description of methods used to produce the '7K' data set.
A spatial and temporal map of C. elegans gene expression. Spencer WC, Zeller G, Watson JD, Henz SR, Watkins KL, McWhirter RD, Petersen S, Sreedharan VT, Widmer C… Genome Research, 21, pp 325-341
Description of using self-organising maps for gene expression similarity. Looked for over-represented motifs found by the FIRE algorithm. This is the paper from David Miller's lab that describes using mSTAD to call Transcriptionally Active Regions (modENCODE still give only the data called by the "maxgap-minrun" algorithm).
Genome-wide analysis of alternative splicing in Caenorhabditis elegans. Ramani AK, Calarco JA, Pan Q, Mavandadi S, Wang Y, Nelson AC, Lee LJ, Morris Q, Blencowe BJ… Genome Research, 21, pp 342-348
Description of identifying novel splice sites. We already do this as part of our RNASeq pipeline and curation database, and the spliced introns are in ACeDB as Feature_data objects.
Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, Yip KY… Science, 330 no. 6012, pp 1775-1787
Added the following from this paper:
- Probable parent of pseudogenes
- the '7k-set' of probable ncRNA genes
- 304 HOT regions: short binding regions that were significantly enriched in most TF ChIP-seq experiments
Check on the following not yet added:
- Our computational and experimental analysis validated 13 previously unidentified mirtrons (6, 22). Small-RNA data also defined 102 additional candidate canonical miRNAs. We tested a number of these intergenic candidates to validate expression: RT-PCR detected RNA products for 14 of 15, and Northern blots detected expression for three of five (24)
- To further characterize TF-binding sites, we searched for 8- to 12-bp cis-regulatory motifs within the ChIP-seq peaks (6) and found strong motifs for eight TFs (BLMP-1, CEH-14, CEH-30, EGL-5, HLH-1, LIN-39, NHR-6, and PHA-4) (fig. S35). Two of these are similar to previously described motifs (PHA-4 and HLH-1).
Useful information:
- Although most transcription factors target both protein-coding and known ncRNA genes, GEI-11 preferentially targets ncRNAs