FANTOM5 RNA-seq
Welcome to the FANTOM5 RNA-seq analysis page. This page documents efforts to incorporate RNA-seq into the FANTOM5 project at large. RNA-seq was brought into FANTOM5 to help discern the nature of novel peaks first identified in the hCAGE data. The initial goal is to provide supplementary data for the long noncoding RNA (lncRNA) main paper and possibly the promoterome main paper. If all goes well, we intend to fold much of this analysis into the lncRNA main paper.
Specific tasks
- selection of FANTOM5 samples for RNA-seq, sequencing, RNA-seq processing, and transcript assembly (Al, Max, RIKEN OSC)
- collection and formatting of publicly-available RNA-seq data for further assistance in 'validation' (Max, open to recommendations)
- establish guidelines to determine inclusion as novel lncRNA
- incorporate novel lncRNAs into main dataset
- classification of lncRNAs incorporating RNA-seq data
Progress
Task #1
Selection of samples: The samples selected for FANTOM5 RNA-seq are listed in File:F5 RNAseq library list.xlsx. The selection criterion was simple: choose the smallest set of samples that allows the largest number of novel peaks from the hCAGE data to be sequenced.
Sequencing: Library construction is currently underway. Sequencing will begin once construction is complete and will hopefully finish at the end of December.
RNA-seq processing: to be updated.
Transcript assembly: to be updated.
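The selection criterion above (cover as many novel hCAGE peaks as possible with as few samples as possible) resembles the classic set-cover problem. The following is a minimal greedy sketch of that idea, not the procedure actually used to build the library list. The function name `greedy_sample_selection`, the `max_samples` parameter, and the input format (a mapping from sample name to the set of novel peak IDs detected in that sample) are assumptions made for illustration.

```python
def greedy_sample_selection(peaks_by_sample, max_samples=None):
    """Greedily pick samples so that each pick adds the most uncovered peaks.

    peaks_by_sample: dict mapping a sample name to an iterable of peak IDs
                     (hypothetical input format).
    max_samples:     optional cap on the number of samples to pick.
    Returns (chosen_samples_in_order, set_of_covered_peaks).
    """
    remaining = {sample: set(peaks) for sample, peaks in peaks_by_sample.items()}
    covered = set()
    chosen = []

    while remaining and (max_samples is None or len(chosen) < max_samples):
        # Sample that contributes the most new peaks; iterating over sorted
        # names makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining sample adds any new peak
        chosen.append(best)
        covered |= gain
        del remaining[best]

    return chosen, covered
```

Greedy selection does not guarantee the smallest possible sample set, but it is a common and simple approximation. Capping `max_samples` corresponds to a fixed sequencing budget.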
Task #2 (public RNA-seq)
(Max) Libraries targeted for collection:
- David Brawand et al. set from "The evolution of gene expression levels in mammalian organs", PMID: 22012392 (polyA, directional?)
- Cabili et al. set from "Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses", PMID: 21890647 (polyA, directional)
- Illumina BodyMap2 (polyA, directional)
- other large sets (suggestions, pointers welcome)
Task #3 (establish guidelines to determine inclusion as novel lncRNA)
Peaks have already been filtered to remove overlap with known coding/non-coding transcripts.
- size of largest open reading frame
- (Jia Hui & Ben Brown) translational analysis to assess coding potential
- size selection?
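As an illustration of the "size of largest open reading frame" criterion listed above, the sketch below scans an assembled transcript sequence for its longest ATG-to-stop ORF. It assumes the standard stop codons, counts only ORFs that end in an in-frame stop, and returns the length in nucleotides including the stop codon. The function names are hypothetical, and no particular length cutoff is implied.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def _reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def _longest_orf_one_strand(seq):
    """Longest ATG...stop ORF (in nt, stop included) over the three frames."""
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def longest_orf_length(seq, both_strands=False):
    """Length of the longest complete ORF in a transcript sequence.

    RNA input is accepted (U is converted to T). Set both_strands=True
    when the strand of the transcript model is unknown.
    """
    seq = seq.upper().replace("U", "T")
    best = _longest_orf_one_strand(seq)
    if both_strands:
        best = max(best, _longest_orf_one_strand(_reverse_complement(seq)))
    return best
```

For directional libraries the transcript strand is known, so scanning only the sense strand (the default here) is usually appropriate.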
Task #4
(OSC) Incorporation of novel lncRNAs into main dataset
Task #5 (classification of lncRNAs using RNA-seq & CAGE-scan data)
- (Eivind, Max, Helen) overlap with short RNA (TSS-based and other processing products), see short RNA page
- (?) overlap with other genome features
- (?) splicing
  - frequency
  - sequence composition at "strong" and "weak" splice sites
  - alternative splicing
- (CBRC) clustering of lncRNA via various metrics
- (CBRC) 2D structural clustering
- (CBRC) 2D structural motif over-representation clustering
- (CBRC) 2D structural accessibility clustering
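The overlap items in the list above (short RNA, other genome features) come down to interval intersection between lncRNA models and annotated features. The following is a minimal sketch under stated assumptions: coordinates are 0-based and half-open (BED-style), and the `Interval` record, `overlaps`, and `annotate_overlaps` are hypothetical names rather than part of any FANTOM5 pipeline.

```python
from collections import namedtuple

# Hypothetical record type; start/end are 0-based, half-open (BED-style).
Interval = namedtuple("Interval", "chrom start end strand name")


def overlaps(a, b, same_strand=True):
    """True if intervals a and b share at least one base.

    With same_strand=True the two intervals must also be on the same strand,
    which is meaningful for directional RNA-seq and CAGE data.
    """
    if a.chrom != b.chrom:
        return False
    if same_strand and a.strand != b.strand:
        return False
    return a.start < b.end and b.start < a.end


def annotate_overlaps(transcripts, features, same_strand=True):
    """Map each transcript name to the names of the features it overlaps."""
    return {
        t.name: [f.name for f in features if overlaps(t, f, same_strand)]
        for t in transcripts
    }
```

This all-against-all comparison is quadratic in the number of intervals. For genome-scale sets, an interval tree or a sorted sweep would be the usual replacement.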