HeliScopeCAGE analysis

This page is to share several analysis held in RIKEN and observations.

HeliScopeCAGE library protocol

Manual operation with 8 channel multi-pipet. 16 libraries can be produced with one set of operations.
Its process has been automated recently, where 96 libraries can be produced at once.
Standard protocol requires 5ug total RNA
Low quantity protocol can be applied down to 100ng - but note that the protocol is slightly different from the standard one and the number of obtained reads will be smaller.

1 sequencer has two flow cells, each of them has 25 channels (corresponding to 'lanes' - in Illumina)
RIKEN OSC has 4 machines. Some of the recent sequencing are outsourced to Helicos (Boston)
Channel 25 on the both of the flow cells are dedicated to sequence Helicos Control Oligo (termed HCO). That's is the reason we can sequence 48 samples only per run.
5% error rate on these HCO control channels is 'requirements' for standard sequencing (while it sometime shows ~5.5% or such)
In Kawaji's view, the error rate on HCO channels is slight (not huge) overestimation - since unaligned reads are discarded and calculated only from the aligned reads. Anyhow, the error rate calculated on this way is the one Helicos can guarantee, somehow.
Usually we dedicate channel 13 to sequence internal control RNA (a big pool of RNA prepared from mouse E17.5; we use this to monitor the CAGE library operation continuously)
Channel 1, 13, and 25 are not very stable, in comparison with other channels (error rate can be much higher, sometimes).
Sequencing start by fill and lock - you can't see the 1st base (or first non-T, if it starts with T) See http://en.wikipedia.org/wiki/File:TSMS_DNA_Preperation_2.jpg

No sequence quality value.
The sequencing length are variable even in one channel. Mean length are around ~33nt ('filtered' reads are typically from 20nt - 50nt)
Native format is SMS (SRF)
Helicos software to process raw reads are HeliSphere http://open.helicosbio.com/helisphere_user_guide/
filterSMS is the tool to discard artifacts and unusable reads. Practically, it discards (i) too short sequence (<20nt), (ii) CTAG repeats (termed BAO, Base Order Addition)
Many of the 5% errors are on insertion/deletions. Accordingly, we need (full/semi) dynamic programing based algorithm to align reads. indexDP is the Helicos's software, which can perform this. However, it is (relatively) slow and don't output alignment itself (only output alignment summary; where 'alignment score' is reported). http://open.helicosbio.com/helisphere_user_guide/ch07s02.html
Timo Lassmann developped Delve, which can estimate error profile of the library first, and perform alignments. Note that it reports alignments of all input reads, regardless some of them are totally independent. You have to filter out the alignments to obtain usable ones.
Delve output the alignment as a standard format, SAM/BAM.

~50 million reads we can obtained in a good condition, however, ~80% of the data is usually unusable (in Kawaji's interpretation, this is caused by lower performance of base caller itself - it outputs a lot of artifacts; note that it doesn't assign quality value individual bases nor reads at all)
1-10 million reads are usable (or worth to consider - mapping quality(Q) > 10 ), in a good condition.
Technical reproducibility is pretty high (very close to poisson distribution; relative error (sqrt(overdispersion estimated by edgeR)) is < 10%) - see here
Promoter rate is a good indicator if the CAGE protocol works well or not. Typically ~60% with Q > 10. However, the rate also depends on the samples (cells and tissues), and we have to be careful.
Insertion/deletion is much more frequent than mismatches - error profile of a typical HeliScope CAGE library ( produced by SAMstat )
Fill and lock sequencing (see above) indicate that we can't see the 1st base. On the other hand, the 1st base are usually 'G' in CAGE - due to 1st strand cDNA synthesis during cap-trapping. Combined them, the 1st base likely to represent the 1st base of the capped RNAs.
Still there are frequent mismatches at the 1st position. One interpretation is that it is alignment artifacts (Timo). Another guess is that it is caused by addition of multiple bases in revers transcription, since RT sometimes add several bases (kawaji).

You can find the Overview slides of HeliScope CAGE technology, and the HeliScope CAGE paper for details, which is just submitted not accepted yet.