Extend rat gene models with CAGEscan

Revision as of 14:34, 29 March 2011


Background

  • Some rat gene models miss the real 5′ ends.
  • HelicosCAGE and CAGEscan libraries are available from a “universal” rat RNA preparation.

Data

Sample preparation and sequencing

10009-101B8 is the same RNA as used for CNhs10614, the ‘Universal RNA - Rat Normal Tissues’ HelicosCAGE library.

  • NCig10012: 2 × 54 bp CAGEscan library, 6,903,269 reads. Index sequence GCTCAG.
  • NCig10071: 2 × 36 bp experimental CAGEscan, 28,936,176 reads. Index sequences ACAGATGCTATA, ATCGTGGCTATA, CACGATGCTATA, CACTGAGCTATA, CTGACGGCTATA, GAGTGAGCTATA, GTATACGCTATA, TCGAGCGCTATA.
  • NChi10001: 2 × 51 bp CAGEscan library (HiSeq test run), 9,662,576 reads. Index sequence GCTCAG.

Bzipped FASTQ files are available in <https://fantom5-collaboration.gsc.riken.jp/webdav/home/plessy/FASTQ/>. See CAGEscan_mapping_protocol for what to trim from the reads before aligning.

Name scheme: name_lane_direction.fq.bz2. The sequencer lane is indicated but should not matter. Direction 1 is 5′ and direction 2 is 3′.
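
As a minimal sketch of the name scheme above, the following Python splits a file name into its three parts. The function name `parse_fastq_name`, the `FastqName` type and the example file name are hypothetical; the sketch also assumes that any extra underscores belong to the name part, which the page does not state.

```python
from typing import NamedTuple


class FastqName(NamedTuple):
    name: str       # library or sample identifier
    lane: str       # sequencer lane, kept as text
    direction: int  # 1 = 5′ read, 2 = 3′ read


def parse_fastq_name(filename: str) -> FastqName:
    """Split a 'name_lane_direction.fq.bz2' file name into its parts."""
    suffix = ".fq.bz2"
    if not filename.endswith(suffix):
        raise ValueError(f"unexpected extension: {filename}")
    stem = filename[: -len(suffix)]
    # Split from the right so that underscores inside 'name' are preserved
    # (assumption: only the last two fields are lane and direction).
    name, lane, direction = stem.rsplit("_", 2)
    if direction not in ("1", "2"):
        raise ValueError(f"direction must be 1 or 2: {filename}")
    return FastqName(name, lane, int(direction))


# Hypothetical file name, for illustration only.
example = parse_fastq_name("NCig10012_3_1.fq.bz2")
```

Grouping files by `name` and `lane` and pairing direction 1 with direction 2 would then give the 5′/3′ mates for each library.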

Mapping (rn4)

Will be uploaded here: https://fantom5-collaboration.gsc.riken.jp/webdav/home/plessy/BAM/

  • pending...
  • should Copenhagen align as well?

To retrieve properly paired 5′ ends:

samtools view -u -f 0x0040 [input] | samtools view -b -f 0x0002 - > [output]
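
The command above keeps alignments that have both the "first in pair" bit (0x0040) and the "properly paired" bit (0x0002) set in the SAM flag; since `-f` requires every listed bit, the two chained filters should select the same records as a single `-f 0x0042`. The Python below is only an illustration of that bit test, not a replacement for samtools. The function name is hypothetical, and it assumes the direction 1 (5′) reads were given to the aligner as the first mate.

```python
FIRST_IN_PAIR = 0x0040  # read is the first segment of the pair (assumed to be the 5′ read)
PROPER_PAIR = 0x0002    # aligner marked the pair as properly aligned


def is_properly_paired_first_read(flag: int) -> bool:
    """Return True when both required bits are set in a SAM flag."""
    required = FIRST_IN_PAIR | PROPER_PAIR  # 0x0042
    return (flag & required) == required


# SAM flag values chosen for illustration:
# 99  = paired, proper pair, mate reverse, first in pair
# 147 = paired, proper pair, read reverse, second in pair
# 65  = paired, first in pair, but not a proper pair
kept = [f for f in (99, 147, 65) if is_properly_paired_first_read(f)]
```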

Goals

Contribute experimental evidence that extends and updates the rat gene models.

Key questions:

  • How many rat “refseq” TSSs are correct?
  • How many rat “refseq” TSSs are shifted, either 5′ or 3′?
  • Function of shifted genes: ontological enrichment of shifted genes compared to the correctly annotated ones?
  • How many new genes have support(x)?
  • Investigate high-confidence TSSs (evidence from both CAGE methods)
  • Alternative promoters
  • Other

Project outline

  • Setup of scripting repo
  • Get an overview of data
  • (Mapping), processing, filtering
  • How to deal with triplicates? Variance filtering?
  • Tag clustering
  • Aggregation of existing gene models.
  • Define support(x)
  • Integration of existing rat RNA-seq data from SRA/GEO?

Definitions

  • Correct annotation
  • Alternative promoter
  • Criteria for belonging to a known promoter: maximum shift allowed? Does the second CAGEscan tag fall in the known gene?
  • Filter: seen with both CAGEscan and HelicosCAGE
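
Since these definitions are still open, the following Python is only a sketch of one way they could be made concrete. The thresholds `tolerance` and `max_shift`, their default values, the category labels and all function names are hypothetical placeholders, not decisions recorded on this page.

```python
def signed_shift(annotated: int, observed: int, strand: str) -> int:
    """Shift in bp of the observed TSS; negative means upstream (5′) of the annotation."""
    if strand == "+":
        return observed - annotated
    if strand == "-":
        # On the minus strand, upstream corresponds to higher coordinates.
        return annotated - observed
    raise ValueError(f"strand must be '+' or '-': {strand!r}")


def classify_tss(annotated: int, observed: int, strand: str,
                 tolerance: int = 5, max_shift: int = 500) -> str:
    """Label an observed TSS relative to an annotated one (placeholder thresholds)."""
    shift = signed_shift(annotated, observed, strand)
    if abs(shift) <= tolerance:
        return "correct"
    if abs(shift) > max_shift:
        # Too far to count as the known promoter under this placeholder rule.
        return "beyond_max_shift"
    return "shifted_5prime" if shift < 0 else "shifted_3prime"


def seen_with_both(cagescan_ids: set, helicos_ids: set) -> set:
    """Keep only the identifiers supported by both CAGE methods."""
    return cagescan_ids & helicos_ids
```

For example, with these placeholder thresholds an observed TSS 10 bp upstream of the annotation would be labelled `shifted_5prime` on either strand, and `seen_with_both` would implement the "both methods" filter as a plain set intersection.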