HeliScopeCAGE analysis

This page is to share several analysis already held, observation, and consensus.

HeliScopeCAGE library protocol

Manual operation with 8 channel multi-pipet. 16 libraries can be produced with one set of operations.
Its process has been automated recently, where 96 libraries can be produced at once.
Standard protocol requires 5ug total RNA
Low quantity protocol can be applied down to 100ng - but note that the protocol is slightly different from the standard one and the number of obtained reads will be smaller.

Sequencing machine / sequencing layout

1 sequencer has two flow cells, each of them has 25 channels (corresponding to 'lanes' - in Illumina)
4 machines are installed in RIKEN OSC (but not all of them are working ... unfortunately). Some of the recent sequencing are outsourced to Helicos (Boston)
4 channels per run (channel 13 and 25 on flow cell 1, 2) are dedicated to quality control, at minimum (see below)
Channel 25 on the both of the flow cells are dedicated to sequence Helicos Control Oligo (termed HCO), which is the artifactual oligonucleotide which we already know the sequences. The performance of the sequencers/runs are mainly checked by this sequencing.
5% error rate is on these HCO control channels is 'requirements' for standard sequencing (while it sometime shows ~5.5% or such)
In Kawaji's view, the error rate on HCO channels is slight (not huge) overestimation - since unaligned reads are discarded and calculated only from the aligned reads. Anyhow, the error rate calculated on this way is the one Helicos can guarantee, somehow, and reported on publications.
Usually we dedicate channel 13 to sequence internal control RNA (a big pool of RNA prepared from mouse E17.5; we use this to monitor the CAGE library operation continuously)
Channel 1, 13, and 25 are not very stable, in comparison with other channels (error rate can be much higher, sometimes).
Sequencing start by fill and lock - you can't see the 1st base (or first non-T, if it starts with T) See http://en.wikipedia.org/wiki/File:TSMS_DNA_Preperation_2.jpg

HeliScope's raw reads and processing

No sequence quality value (!)
The sequencing length are variable even in one channel (mainly depending on base order of the reads). Mean length are around ~33nt ('filtered' reads are typically from 20nt - 50nt)
Native format is SMS (and SRF)
Helicos software to process raw reads are HeliSphere http://open.helicosbio.com/helisphere_user_guide/
filterSMS is the tool to discard artifacts and unusable reads. Practically, it discards (i) too short sequence (<20nt), (ii) CTAG repeats (termed BAO, Base Order Addition)
Many of the 5% errors are on insertion/deletions. Accordingly, we need (full/semi) dynamic programing based algorithm to align reads. indexDP is the Helicos's software, which can perform this. However, it is (relatively) slow and don't output alignment itself (only output alignment summary; where 'alignment score' is reported). http://open.helicosbio.com/helisphere_user_guide/ch07s02.html
Timo Lassmann developped Delve, which can estimate error profile of the library first, and perform alignments. Note that it reports 'potential' alignments of all input reads, even if some of them are totally independent. You have to filter out the alignments to obtain usable ones.
Delve output the alignment as a standard format, SAM/BAM.

Observations of the HeliScopeCAGE

~50 million reads we can obtain in a good condition, however, ~80% of the data is usually unusable (in Kawaji's interpretation, this is a nature of single-molecule technologies and/or lower performance of base caller itself - it outputs a lot of artifacts; note that it doesn't assign quality value individual bases nor reads at all)
1-10 million reads are usable (or worth to consider - mapping quality(Q) > 10 ), in a good condition.
But, technical reproducibility in gene expression is extremely high (very close to poisson distribution; relative error (sqrt(overdispersion estimated by edgeR)) is < 10%). Furthermore, it correlate with qRT-PCR at the most. It means that usable reads are extremely usable.
Promoter rate is a good indicator if the CAGE protocol works well or not. Typically ~60% with Q > 10. However, the rate also depends on the samples (cells and tissues), and we have to be careful to apply this criteria to FANTOM5 samples uniquely.
Insertion/deletion is much more frequent than mismatches - error profile of a typical HeliScope CAGE library ( produced by SAMstat )
Fill and lock sequencing (see above) indicate that we can't see the 1st base. On the other hand, the 1st base are usually 'G' in CAGE - due to 1st strand cDNA synthesis during cap-trapping. Combined them, the 1st base likely to represent the 1st base of the capped RNAs.
Still there are frequent mismatches at the 1st position. One interpretation is that it is alignment artifacts (Timo). Another guess is that it is caused by addition of multiple bases in revers transcription, since RT sometimes could add multiple bases (kawaji).
Mapping quality (Q) based filtering works reasonably, but we can optimize more. Geoff demonstrated some analysis based on 'e' (read length/1+errors). See https://fantom5-collaboration.gsc.riken.jp/wiki/index.php/File:HCAGE_analysis.doc and threads [fantom5:00265] One could use percent identity instead. (same concept, in principle)

Action and Discussion on going

Let's add filtering step based on 'e' or percent identity (consensus; but needs to be done practically)
Which threshold of mapping quality is reasonable? 10 or 20? - Running IDR on replicates

Related / mentioned resources

the Overview slides of HeliScope CAGE technology
the HeliScope CAGE paper Method paper (under review) - HeliScope CAGE paper and its data
Timo's presentation in the Feb Ume meeting https://fantom5-collaboration.gsc.riken.jp/wiki/images/6/64/MappingHelicosReads.pdf
Kawaji's presentation in the Feb Ume meeting https://fantom5-collaboration.gsc.riken.jp/wiki/images/1/1b/110220-F5ume-platform-comparison-Kawaji.pdf
Additional analysis (Timo and Kawaji) - https://fantom5-collaboration.gsc.riken.jp/wiki/index.php/File:110308-hCAGE.pdf
Error profiles of a typical HeliScope CAGE library ( produced by SAMstat) https://fantom5-collaboration.gsc.riken.jp/wiki/images/5/54/CNhs11077.11273-116H4.nobarcode.bam.pdf
Geoff's spot check and Timo's response https://fantom5-collaboration.gsc.riken.jp/wiki/index.php/File:HCAGE_analysis.doc

HeliScopeCAGE analysis

Contents

HeliScopeCAGE library protocol

Sequencing machine / sequencing layout

HeliScope's raw reads and processing

Observations of the HeliScopeCAGE

Action and Discussion on going

Related / mentioned resources

Navigation menu

Page actions

Page actions

Personal tools

Menu

Search

Special topics

Resources

ZENBU genome browser

UCSC Genome Browser RIKEN mirror

Navigation

Tools