Extend rat gene models with CAGEscan
Latest revision as of 19:54, 30 June 2011
Background
- Some rat gene models miss the real 5′ ends.
- HelicosCAGE and CAGEscan libraries are available from a “universal” rat RNA preparation.
Goals
Contribute experimental evidence that extends and updates the rat gene models.
Key questions:
- How many rat RefSeq TSSs are correct?
- How many rat RefSeq TSSs are shifted, either 5′ or 3′?
- Function of shifted genes: is there ontological enrichment among shifted genes compared with the correctly annotated ones?
- How many new genes have support(x)?
- Investigate high confidence TSSs (evidence from both CAGE-methods)
- Alternative promoters
- Other
Project outline
- Setup of scripting repo
- Get an overview of data
- (Mapping), processing, filtering
- Descriptive analysis: how far away (both upstream and downstream) are the RefSeq TSSs from CAGE clusters, and the other way around? (Mette will do this for the provisional clusters.)
- How to deal with triplicates. Variance filtering?
- Tag clustering
- Aggregation of existing gene models.
- Define support(x)
- Integration of existing rat RNA-seq data from SRA/GEO?
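The descriptive analysis in the outline above could start from a signed nearest-neighbour distance. The following is only a minimal sketch: the function name, the input layout (a sorted list of cluster positions on the same chromosome and strand as the TSS) and the sign convention (negative = upstream) are assumptions made for illustration, not part of the project's scripts.

```python
from bisect import bisect_left


def nearest_signed_distance(tss, strand, cluster_positions):
    """Return the signed distance from a TSS to the nearest cluster.

    Assumptions (hypothetical, for illustration only):
    - cluster_positions is a sorted list of integer positions located on
      the same chromosome and strand as the TSS;
    - negative values mean the cluster is upstream of the TSS relative to
      the transcript strand, positive values mean downstream.
    Returns None when no cluster is available.
    """
    if not cluster_positions:
        return None
    # Only the two neighbours around the insertion point can be nearest.
    i = bisect_left(cluster_positions, tss)
    candidates = cluster_positions[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda pos: abs(pos - tss))
    offset = nearest - tss
    # On the minus strand, larger coordinates are upstream, so flip the sign.
    return offset if strand == "+" else -offset
```

For "the other way around", the same function can be called with a cluster position as the query and the sorted RefSeq TSS positions as the reference list. The histogram of these distances is what would suggest a sensible cutoff.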
Definitions
- Correct annotation: find the number of correctly annotated genes/transcripts based on several data sources, for instance UCSC genes, RefSeq genes and mRNAs. Set a cutoff for how close a cluster should be to a known TSS based on the descriptive analysis (probably around ±50).
- Alternative promoter: if we have CAGE evidence for both the known promoter and a new one close by, there is probably an alternative promoter. If there is no evidence for the known one, the TSS is probably wrongly annotated.
- Definition of belonging to a known promoter: is there a maximum shift allowed (see correct annotation)? Does the 3′ CAGEscan tag fall in an exon of a known gene?
- Filter: seen with both CAGEscan and hCAGE. This could be part of the clustering.
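The definitions above can be written down as a small decision rule. This is a sketch under stated assumptions: the ±50 cutoff is the provisional value mentioned above (assumed here to be in bp), the function and label names are hypothetical, and `cluster_positions` is assumed to contain only clusters that already count as "close by" the known TSS, whatever window is finally chosen.

```python
MAX_SHIFT = 50  # provisional cutoff from the definitions; unit assumed to be bp


def classify_known_tss(known_tss, cluster_positions, max_shift=MAX_SHIFT):
    """Classify a known TSS from the CAGE clusters found near it.

    cluster_positions: positions of nearby CAGE clusters on the same
    chromosome and strand (the "close by" window is an assumption that
    must be applied before calling this function).
    """
    distances = [abs(pos - known_tss) for pos in cluster_positions]
    known_supported = any(d <= max_shift for d in distances)
    novel_nearby = any(d > max_shift for d in distances)

    if known_supported and novel_nearby:
        # Evidence for the known promoter and for a new one close by.
        return "alternative promoter"
    if known_supported:
        return "correct annotation"
    if novel_nearby:
        # CAGE signal nearby, but none at the annotated TSS.
        return "probably wrongly annotated"
    return "no CAGE evidence"
```

Requiring support from both CAGEscan and hCAGE (the filter above) could be applied before this step, by passing in only the clusters seen with both methods.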
Data
CAGEscan
Sample preparation and sequencing
10009-101B8 is the same RNA as used for CNhs10614, the ‘Universal RNA - Rat Normal Tissues’ HelicosCAGE library.
- NCig10012: 2 × 54 bp CAGEscan library, 6,903,269 reads. Index sequence GCTCAG.
- NCig10071: 2 × 36 bp experimental CAGEscan, 28,936,176 reads. Index sequences ACAGATGCTATA, ATCGTGGCTATA, CACGATGCTATA, CACTGAGCTATA, CTGACGGCTATA, GAGTGAGCTATA, GTATACGCTATA, TCGAGCGCTATA.
- NChi10001: 2 × 51 bp CAGEscan library (HiSeq test run), 9,662,576 reads. Index sequence GCTCAG.
Bzipped FASTQ files are available in <https://fantom5-collaboration.gsc.riken.jp/webdav/home/Rat_cagescan/FASTQ/>. See CAGEscan_mapping_protocol for what to trim from the reads before aligning.
Name scheme: name_lane_direction.fq.bz2. The sequencer lane is indicated but should not matter. Direction 1 is the 5′ read and direction 2 is the 3′ read.
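When pairing the files in a script, the name scheme can be split into its three parts. The sketch below assumes only what is stated above (name_lane_direction.fq.bz2, direction 1 = 5′, direction 2 = 3′). The function name and the example file name used in the usage note are hypothetical.

```python
def parse_fastq_name(filename):
    """Split a 'name_lane_direction.fq.bz2' file name into its parts."""
    suffix = ".fq.bz2"
    if not filename.endswith(suffix):
        raise ValueError("expected a file name ending in " + suffix)
    stem = filename[:-len(suffix)]
    # Split from the right so that a library name containing '_' survives.
    parts = stem.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError("expected name_lane_direction, got " + stem)
    name, lane, direction = parts
    read_ends = {"1": "5'", "2": "3'"}  # direction 1 is 5', direction 2 is 3'
    if direction not in read_ends:
        raise ValueError("direction must be 1 or 2, got " + direction)
    return {"name": name, "lane": lane, "read_end": read_ends[direction]}
```

For a made-up file name such as `NCig10012_3_1.fq.bz2`, this returns the library name `NCig10012`, lane `3` and read end `5'`. Since the lane should not matter, files can be grouped on `name` and `read_end` only.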
FASTQ sequences of sorted and trimmed reads can be retrieved from the paired-end aligned BAM files (see below), which include unmapped reads.
Mapping (rn4)
Available at: https://fantom5-collaboration.gsc.riken.jp/webdav/home/Rat_cagescan/BAM/
Alignments uploaded on 30-Mar-2011 and 01-Apr-2011 were prepared by Charles, but the standard pipeline should produce identical or very similar results.
To retrieve properly paired 5′ ends: samtools view -b -f 0x0042 [input] > [output]
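The value 0x0042 in the command above combines two SAM flag bits: 0x2 (read mapped in a proper pair) and 0x40 (first read of the pair, which here is the 5′ read). The `-f` option keeps only alignments that have all of these bits set. The small sketch below reproduces that test on a flag value, for scripts that filter SAM records themselves. The constant and function names are illustrative, not part of samtools.

```python
PROPER_PAIR = 0x2     # SAM flag bit: read mapped in a proper pair
FIRST_IN_PAIR = 0x40  # SAM flag bit: first read of the pair (the 5' read here)


def is_properly_paired_5prime(flag):
    """Return True if a SAM flag passes the equivalent of 'samtools view -f 0x0042'."""
    required = PROPER_PAIR | FIRST_IN_PAIR  # 0x42
    # All required bits must be set; other bits are ignored.
    return (flag & required) == required
```

For instance, flag 99 (0x63) has both bits set and is kept, whereas flag 147 (0x93), the second read of a proper pair, is dropped.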
Clustering (provisional)
OSC tables for Level 1 promoters: https://fantom5-collaboration.gsc.riken.jp/webdav/home/Rat_cagescan/L1/
BED12-formatted clusters of (properly) paired-end reads: https://fantom5-collaboration.gsc.riken.jp/webdav/home/Rat_cagescan/CLUSTER/
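To work with the BED12 clusters in a script, each line can be unpacked into its blocks and a strand-aware 5′ end. The sketch below relies only on the standard BED12 column layout (0-based, half-open coordinates; block starts relative to chromStart). The function name, the returned keys and the choice of reporting the 5′ end as a 0-based position are assumptions for illustration, and nothing is assumed about the content of the name and score columns in the actual files.

```python
def parse_bed12(line):
    """Parse one tab-separated BED12 line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated BED fields")
    chrom, strand = fields[0], fields[5]
    start, end = int(fields[1]), int(fields[2])
    # blockSizes and blockStarts are comma-separated, often with a trailing comma.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    offsets = [int(x) for x in fields[11].rstrip(",").split(",")]
    blocks = [(start + off, start + off + size) for off, size in zip(offsets, sizes)]
    # 0-based position of the most 5' base of the cluster on its strand.
    five_prime = start if strand == "+" else end - 1
    return {
        "chrom": chrom,
        "start": start,
        "end": end,
        "strand": strand,
        "blocks": blocks,
        "five_prime": five_prime,
    }
```

The `five_prime` value is the position that would be compared with a known TSS, and `blocks` gives the genomic intervals that could be intersected with known exons.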
HeliScopeCAGE
The RNA sample 10009-101B8, ‘Universal RNA - Rat Normal Tissues’, was also used to prepare the HeliScopeCAGE library CNhs10614, available in BAM or BED formats at https://fantom5-collaboration.gsc.riken.jp/files/data/shared/UPDATE_011/f5pipeline/rat.tissue.hCAGE/