Extend rat gene models with CAGEscan

Latest revision as of 19:54, 30 June 2011


Background

  • Some rat gene models miss the real 5′ ends.
  • HeliScopeCAGE and CAGEscan libraries are available from a “universal” rat RNA preparation.

Goals

Contribute experimental evidence that extends and updates the rat gene models.

Key questions:

  • How many rat RefSeq TSSs are correct?
  • How many rat RefSeq TSSs are shifted, either 5′ or 3′?
  • What are the functions of the shifted genes? Is there ontological enrichment among them compared with the correctly annotated ones?
  • How many new genes have support(x)?
  • Investigate high-confidence TSSs (evidence from both CAGE methods)
  • Alternative promoters
  • Other

Project outline

  • Set up a scripting repository
  • Get an overview of data
  • (Mapping), processing, filtering
  • Descriptive analysis: how far (both upstream and downstream) are the RefSeq TSSs from the CAGE clusters, and the other way around? I will do this for the provisional clusters (Mette).
  • How to deal with triplicates. Variance filtering?
  • Tag clustering
  • Aggregation of existing gene models.
  • Define support(x)
  • Integration of existing rat RNA-seq data from SRA/GEO?
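
The descriptive analysis in the outline above could start from something like the following minimal sketch. It assumes that TSSs and cluster positions have already been reduced to single coordinates on the same chromosome and strand; the function name and the sign convention (negative for upstream, positive for downstream) are illustrative choices, not part of the project.

```python
from bisect import bisect_left


def nearest_cluster_offset(tss, strand, cluster_positions):
    """Signed offset from a known TSS to the nearest CAGE cluster.

    cluster_positions must be a sorted list of coordinates on the same
    chromosome and strand as the TSS (an assumption of this sketch).
    Negative values mean the cluster is upstream of the TSS, positive
    values mean downstream, relative to the direction of transcription.
    """
    if not cluster_positions:
        return None
    i = bisect_left(cluster_positions, tss)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = cluster_positions[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda p: abs(p - tss))
    offset = nearest - tss
    # On the minus strand, a larger coordinate is upstream.
    return offset if strand == "+" else -offset


clusters = [100, 200, 400]
print(nearest_cluster_offset(210, "+", clusters))
print(nearest_cluster_offset(210, "-", clusters))
```

Swapping the roles of the two lists (cluster as query, sorted TSSs as targets) gives the "other way around" measurement.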

Definitions

  • Correct annotation: count the correctly annotated genes/transcripts using several data sources, for instance UCSC genes, RefSeq genes and mRNAs. Set a cutoff for how close a cluster must be to a known TSS, based on the descriptive analysis (probably around ±50).
  • Alternative promoter: if there is CAGE evidence for both the known promoter and a new one close by, there is probably an alternative promoter. If there is no evidence for the known one, the TSS is probably wrongly annotated.
  • Belonging to a known promoter: maximum shift allowed (see correct annotation)? Does the 3′ CAGEscan tag fall in an exon of a known gene?
  • Filter: seen with both CAGEscan and hCAGE. This could be part of the clustering.
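
As a rough illustration of how the definitions above might combine, here is a sketch that labels one known TSS from the cluster positions around it. The `cutoff` default of 50 follows the tentative value in the text; the `nearby` window of 500 that stands in for "close by", the function name and the label strings are all assumptions made for this example.

```python
def classify_known_tss(known_tss, cluster_positions, cutoff=50, nearby=500):
    """Label a known TSS using the CAGE clusters found around it.

    cutoff: maximum distance for a cluster to support the known TSS.
    nearby: hypothetical window in which another cluster counts as a
            candidate new promoter (not defined in the project notes).
    """
    distances = [abs(p - known_tss) for p in cluster_positions]
    supported = any(d <= cutoff for d in distances)
    novel_nearby = any(cutoff < d <= nearby for d in distances)
    if supported and novel_nearby:
        return "alternative promoter"
    if supported:
        return "correct annotation"
    if novel_nearby:
        return "probably misannotated"
    return "no CAGE evidence"


print(classify_known_tss(1000, [1010, 1300]))
```

Strand and chromosome are ignored here; a real analysis would compare positions only within the same chromosome and strand.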

Data

CAGEscan

Sample preparation and sequencing

10009-101B8 is the same RNA as used for CNhs10614, the ‘Universal RNA - Rat Normal Tissues’ HeliScopeCAGE library.

  • NCig10012: 2 × 54 bp CAGEscan library, 6,903,269 reads. Index sequence GCTCAG.
  • NCig10071: 2 × 36 bp experimental CAGEscan, 28,936,176 reads. Index sequences ACAGATGCTATA, ATCGTGGCTATA, CACGATGCTATA, CACTGAGCTATA, CTGACGGCTATA, GAGTGAGCTATA, GTATACGCTATA, TCGAGCGCTATA.
  • NChi10001: 2 × 51 bp CAGEscan library (HiSeq test run), 9,662,576 reads. Index sequence GCTCAG.

Bzipped FASTQ files are available in <https://fantom5-collaboration.gsc.riken.jp/webdav/home/Rat_cagescan/FASTQ/>. See CAGEscan_mapping_protocol for what to trim from the reads before aligning.

File naming scheme: name_lane_direction.fq.bz2. The sequencer lane is indicated but should not matter. Direction 1 is the 5′ read and direction 2 is the 3′ read.
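
The naming scheme can be unpacked with a few lines of code. This is only a sketch: the example file name is made up, and the exact form of the lane field is not documented here, so it is kept as a string.

```python
from typing import NamedTuple


class FastqName(NamedTuple):
    name: str
    lane: str
    direction: int
    read_end: str


def parse_fastq_name(filename):
    """Split a name_lane_direction.fq.bz2 file name into its parts."""
    suffix = ".fq.bz2"
    if not filename.endswith(suffix):
        raise ValueError(f"unexpected extension: {filename}")
    stem = filename[:-len(suffix)]
    # Split from the right so that underscores in the name part survive.
    name, lane, direction = stem.rsplit("_", 2)
    direction = int(direction)
    if direction not in (1, 2):
        raise ValueError(f"direction must be 1 or 2: {filename}")
    # Direction 1 is the 5' read, direction 2 is the 3' read.
    read_end = "5'" if direction == 1 else "3'"
    return FastqName(name, lane, direction, read_end)


# Hypothetical file name, for illustration only.
print(parse_fastq_name("NCig10012_1_2.fq.bz2"))
```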

FASTQ sequences of sorted and trimmed reads can be retrieved from the paired-end aligned BAM files (see below), which include unmapped reads.

Mapping (rn4)

Available at: https://fantom5-collaboration.gsc.riken.jp/webdav/home/Rat_cagescan/BAM/

Alignments uploaded on 30-Mar-2011 and 01-Apr-2011 were prepared by Charles, but the standard pipeline should produce identical or very similar results.

To retrieve properly paired 5′ ends: samtools view -b -f 0x0042 [input] > [output]
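
The flag value in that command is the sum of two SAM flag bits: 0x2 (each segment properly aligned) and 0x40 (first segment in the template). With -f, samtools keeps only records that have all the requested bits set. The sketch below applies the same test to a flag value in plain Python; the function name is illustrative.

```python
PROPER_PAIR = 0x2     # SAM flag bit: each segment properly aligned
FIRST_IN_PAIR = 0x40  # SAM flag bit: first segment in the template
REQUIRED = PROPER_PAIR | FIRST_IN_PAIR  # 0x42, as passed to -f


def is_properly_paired_first_read(flag):
    """True if every bit of 0x42 is set in the SAM flag."""
    return (flag & REQUIRED) == REQUIRED


# 99 = 0x1 + 0x2 + 0x20 + 0x40: paired, proper pair, mate reverse, first read.
print(is_properly_paired_first_read(99))
# 147 = 0x1 + 0x2 + 0x10 + 0x80: the matching second read, so it is excluded.
print(is_properly_paired_first_read(147))
```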

Clustering (provisional)

OSC tables for Level 1 promoters: https://fantom5-collaboration.gsc.riken.jp/webdav/home/Rat_cagescan/L1/

BED12 formatted clusters of (properly) paired-end reads: https://fantom5-collaboration.gsc.riken.jp/webdav/home/Rat_cagescan/CLUSTER/
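
For comparing these clusters with known TSSs, the 5′-most base of each BED12 record is a natural starting point. The sketch below relies only on the standard BED conventions (12 tab-separated fields, 0-based half-open coordinates, strand in column 6); whether the 5′-most base of a cluster is the right TSS proxy for these particular files is an assumption, and the example line is made up.

```python
def five_prime_end(bed12_line):
    """Return (chrom, position, strand) of the 5'-most base of a BED12 record.

    The position is 0-based. BED intervals are half-open, so on the minus
    strand the 5'-most base is chromEnd - 1.
    """
    fields = bed12_line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    chrom = fields[0]
    start, end = int(fields[1]), int(fields[2])
    strand = fields[5]
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    position = start if strand == "+" else end - 1
    return chrom, position, strand


# Made-up record with two blocks, for illustration only.
example = "chr1\t100\t500\tcluster1\t10\t+\t100\t120\t0\t2\t20,30\t0,370"
print(five_prime_end(example))
```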

HeliScopeCAGE

The RNA sample 10009-101B8, ‘Universal RNA - Rat Normal Tissues’, was also used to prepare the HeliScopeCAGE library CNhs10614, available in BAM or BED formats at <https://fantom5-collaboration.gsc.riken.jp/files/data/shared/UPDATE_011/f5pipeline/rat.tissue.hCAGE/>.