Latest revision as of 19:54, 30 June 2011

Background

Some rat gene models miss the real 5′ ends.
HelicosCAGE and CAGEscan libraries are available from a “universal” rat RNA preparation.

Goals

Contribute experimental evidence that extends and updates the gene models in rat.

Key questions:

How many rat “RefSeq” TSSs are correct?
How many rat “RefSeq” TSSs are shifted, either 5′ or 3′?
What are the functions of the shifted genes, and are they ontologically enriched compared to the correctly annotated ones?
How many new genes have support(x)?
Investigate high-confidence TSSs (evidence from both CAGE methods)
Alternative promoters
Other

Project outline

Setup of scripting repo
Get an overview of data
(Mapping), processing, filtering
Descriptive analysis: how far away (both upstream and downstream) are the RefSeq TSSs from CAGE clusters, and the other way around? Mette will do this for the provisional clusters.
How to deal with triplicates? Variance filtering?
Tag clustering
Aggregation of existing gene models.
Define support(x)
Integration of existing rat RNA-seq data from SRA/GEO?
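
The descriptive analysis in the outline above could start from something like the sketch below. It assumes that TSSs and CAGE clusters have each been reduced to a single coordinate on the same chromosome and strand; the function name and the example coordinates are hypothetical, not part of any existing pipeline.

```python
from bisect import bisect_left


def nearest_signed_distance(tss, cluster_positions, strand="+"):
    """Return the signed distance from a TSS to the nearest cluster.

    Negative values mean the cluster is upstream (5') of the TSS and
    positive values mean downstream (3'), taking the strand into account.
    cluster_positions must be a sorted, non-empty list of coordinates.
    """
    i = bisect_left(cluster_positions, tss)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = cluster_positions[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda pos: abs(pos - tss))
    delta = nearest - tss
    return delta if strand == "+" else -delta


# Hypothetical coordinates, for illustration only.
clusters = [100, 500, 1000]
print(nearest_signed_distance(520, clusters))              # cluster upstream
print(nearest_signed_distance(520, clusters, strand="-"))  # cluster downstream
```

For "the other way around", the same function can be called with a cluster coordinate as the query and the sorted list of TSS coordinates as the reference.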

Definitions

Correct annotation: Find the number of correctly annotated genes/transcripts based on several data sources, for instance UCSC genes, RefSeq genes and mRNAs. Set a cutoff for how close the cluster should be to a known TSS based on the descriptive analysis (probably around ±50).
Alternative promoter: If we have CAGE evidence for both the known promoter and a new one close by, there is probably an alternative promoter. But if there is no evidence for the known one, the TSS is probably wrongly annotated.
Definition of belonging to a known promoter: Maximum shift allowed (see correct annotation)? Does the 3′ CAGEscan tag fall in an exon of a known gene?
Filter: Seen with both CAGEscan and hCAGE. This could be part of the clustering.
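
As a minimal sketch of how these working definitions might translate into rules: the function names, the label strings and the default cutoff of 50 (taken here to be base pairs) are assumptions for illustration, and no upper bound on "close by" is applied because none has been defined yet.

```python
def classify_cluster(shift, known_tss_supported, cutoff=50):
    """Classify a CAGE cluster relative to the nearest annotated TSS.

    shift: signed distance from the annotated TSS to the cluster.
    known_tss_supported: True if there is also CAGE evidence within the
        cutoff of the annotated TSS itself.
    cutoff: assumed tolerance; the definitions suggest about +/-50.
    """
    if abs(shift) <= cutoff:
        return "correct annotation"
    if known_tss_supported:
        # Evidence for both the known promoter and a new one.
        return "alternative promoter"
    # No evidence for the known TSS: the annotation is probably wrong.
    return "shifted annotation"


def seen_with_both_methods(cagescan_tags, hcage_tags):
    """Filter: keep a cluster only if both CAGE methods support it."""
    return cagescan_tags > 0 and hcage_tags > 0


print(classify_cluster(-20, known_tss_supported=False))
print(classify_cluster(300, known_tss_supported=True))
print(classify_cluster(300, known_tss_supported=False))
```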

Data

CAGEscan

Sample preparation and sequencing

10009-101B8 is the same RNA as used for CNhs10614, the ‘Universal RNA - Rat Normal Tissues’ HelicosCAGE library.

NCig10012: 2 × 54 bp CAGEscan library, 6,903,269 reads. Index sequence GCTCAG.
NCig10071: 2 × 36 bp experimental CAGEscan, 28,936,176 reads. Index sequences ACAGATGCTATA, ATCGTGGCTATA, CACGATGCTATA, CACTGAGCTATA, CTGACGGCTATA, GAGTGAGCTATA, GTATACGCTATA, TCGAGCGCTATA.
NChi10001: 2 × 51 bp CAGEscan library (HiSeq test run), 9,662,576 reads. Index sequence GCTCAG.

Bzipped FASTQ files are available in <https://fantom5-collaboration.gsc.riken.jp/webdav/home/Rat_cagescan/FASTQ/>. See CAGEscan_mapping_protocol for what to trim from the reads before aligning.

Name scheme: name_lane_direction.fq.bz2. The sequencer lane is indicated but should not matter. Direction 1 is 5′ and direction 2 is 3′.
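
The name scheme can be split programmatically, for example as below. This is only a sketch: the function name and the example file name are hypothetical, and it assumes the lane and direction fields themselves never contain an underscore.

```python
def parse_fastq_name(filename):
    """Split 'name_lane_direction.fq.bz2' into its three fields."""
    suffix = ".fq.bz2"
    if not filename.endswith(suffix):
        raise ValueError(f"unexpected file name: {filename}")
    # Split from the right so that underscores inside 'name' are preserved.
    name, lane, direction = filename[:-len(suffix)].rsplit("_", 2)
    if direction not in ("1", "2"):
        raise ValueError(f"unexpected direction: {direction}")
    # Direction 1 is the 5' read and direction 2 is the 3' read.
    read_end = "5'" if direction == "1" else "3'"
    return {"name": name, "lane": lane, "read_end": read_end}


# Hypothetical file name following the scheme above.
print(parse_fastq_name("NCig10012_1_2.fq.bz2"))
```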

FASTQ sequences of sorted and trimmed reads can be retrieved from the paired-end aligned BAM files (see below), which include unmapped reads.

Mapping (rn4)

Available at: https://fantom5-collaboration.gsc.riken.jp/webdav/home/Rat_cagescan/BAM/

Alignments uploaded on 30-Mar-2011 and 01-Apr-2011 were prepared by Charles, but the standard pipeline should produce identical or very similar results.

To retrieve properly paired 5′ ends: samtools view -b -f 0x0042 [input] > [output]
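
To make the flag value in the command above easier to read: 0x0042 combines two SAM flag bits, 0x2 (properly paired) and 0x40 (first read of the pair, here the 5′ read). The sketch below reproduces the `-f` test on a single flag value; the constant and function names are illustrative.

```python
PROPER_PAIR = 0x2     # each segment properly aligned according to the aligner
FIRST_IN_PAIR = 0x40  # first segment in the template (read 1, the 5' read)

REQUIRED = PROPER_PAIR | FIRST_IN_PAIR  # 0x42, the value given to -f


def passes_filter(flag):
    """Mimic 'samtools view -f 0x42': all required bits must be set."""
    return (flag & REQUIRED) == REQUIRED


print(hex(REQUIRED))
print(passes_filter(99))   # paired, proper pair, mate reverse, read 1
print(passes_filter(147))  # paired, proper pair, reverse, read 2
```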

Clustering (provisional)

OSC tables for Level 1 promoters: https://fantom5-collaboration.gsc.riken.jp/webdav/home/Rat_cagescan/L1/

BED12-formatted clusters of (properly) paired-end reads: https://fantom5-collaboration.gsc.riken.jp/webdav/home/Rat_cagescan/CLUSTER/
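
For comparing clusters with annotated TSSs, the 5′ end of each BED12 feature is usually what matters. The sketch below assumes the cluster files follow the standard tab-separated BED12 layout (0-based, half-open intervals, strand in column 6); the function name and the example line are hypothetical.

```python
def five_prime_end(bed12_line):
    """Return (chrom, 0-based position, strand) of a BED12 feature's 5' end."""
    fields = bed12_line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[5]
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand}")
    # BED intervals are half-open, so the last covered base is end - 1.
    position = start if strand == "+" else end - 1
    return chrom, position, strand


# Hypothetical cluster line, for illustration only.
line = "chr1\t1000\t1500\tcluster1\t10\t-\t1000\t1500\t0\t2\t50,60\t0,440"
print(five_prime_end(line))
```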

HeliScopeCAGE

The RNA sample 10009-101B8, ‘Universal RNA - Rat Normal Tissues’, was also used to prepare the HeliScopeCAGE library CNhs10614, available in BAM or BED formats: https://fantom5-collaboration.gsc.riken.jp/files/data/shared/UPDATE_011/f5pipeline/rat.tissue.hCAGE/

Extend rat gene models with CAGEscan: Difference between revisions
