HeliScopeCAGE analysis

From Wiki
Jump to navigationJump to search

This page is to share several analysis already held, observation, and consensus.

HeliScopeCAGE library protocol

  • Manual operation with 8 channel multi-pipet. 16 libraries can be produced with one set of operations.
  • Its process has been automated recently, where 96 libraries can be produced at once.
  • Standard protocol requires 5ug total RNA
  • Low quantity protocol can be applied down to 100ng - but note that the protocol is slightly different from the standard one and the number of obtained reads will be smaller.

Sequencing machine / sequencing layout

  • 1 sequencer has two flow cells, each of them has 25 channels (corresponding to 'lanes' - in Illumina)
  • 4 machines are installed in RIKEN OSC (but not all of them are working ... unfortunately). Some of the recent sequencing are outsourced to Helicos (Boston)
  • 4 channels per run (channel 13 and 25 on flow cell 1, 2) are dedicated to quality control, at minimum (see below)
  • Channel 25 on the both of the flow cells are dedicated to sequence Helicos Control Oligo (termed HCO), which is the artifactual oligonucleotide which we already know the sequences. The performance of the sequencers/runs are mainly checked by this sequencing.
  • 5% error rate is on these HCO control channels is 'requirements' for standard sequencing (while it sometime shows ~5.5% or such)
  • In Kawaji's view, the error rate on HCO channels is slight (not huge) overestimation - since unaligned reads are discarded and calculated only from the aligned reads. Anyhow, the error rate calculated on this way is the one Helicos can guarantee, somehow, and reported on publications.
  • Usually we dedicate channel 13 to sequence internal control RNA (a big pool of RNA prepared from mouse E17.5; we use this to monitor the CAGE library operation continuously)
  • Channel 1, 13, and 25 are not very stable, in comparison with other channels (error rate can be much higher, sometimes).
  • Sequencing start by fill and lock - you can't see the 1st base (or first non-T, if it starts with T) See http://en.wikipedia.org/wiki/File:TSMS_DNA_Preperation_2.jpg

HeliScope's raw reads and processing

  • No sequence quality value (!)
  • The sequencing length are variable even in one channel (mainly depending on base order of the reads). Mean length are around ~33nt ('filtered' reads are typically from 20nt - 50nt)
  • Native format is SMS (and SRF)
  • Helicos software to process raw reads are HeliSphere http://open.helicosbio.com/helisphere_user_guide/
  • filterSMS is the tool to discard artifacts and unusable reads. Practically, it discards (i) too short sequence (<20nt), (ii) CTAG repeats (termed BAO, Base Order Addition)
  • Many of the 5% errors are on insertion/deletions. Accordingly, we need (full/semi) dynamic programing based algorithm to align reads. indexDP is the Helicos's software, which can perform this. However, it is (relatively) slow and don't output alignment itself (only output alignment summary; where 'alignment score' is reported). http://open.helicosbio.com/helisphere_user_guide/ch07s02.html
  • Timo Lassmann developped Delve, which can estimate error profile of the library first, and perform alignments. Note that it reports 'potential' alignments of all input reads, even if some of them are totally independent. You have to filter out the alignments to obtain usable ones.
  • Delve output the alignment as a standard format, SAM/BAM.

Observations of the HeliScopeCAGE

  • ~50 million reads we can obtain in a good condition, however, ~80% of the data is usually unusable (in Kawaji's interpretation, this is a nature of single-molecule technologies and/or lower performance of base caller itself - it outputs a lot of artifacts; note that it doesn't assign quality value individual bases nor reads at all)
  • 1-10 million reads are usable (or worth to consider - mapping quality(Q) > 10 ), in a good condition.
  • But, technical reproducibility in gene expression is extremely high (very close to poisson distribution; relative error (sqrt(overdispersion estimated by edgeR)) is < 10%). Furthermore, it correlate with qRT-PCR at the most. It means that usable reads are extremely usable.
  • Promoter rate is a good indicator if the CAGE protocol works well or not. Typically ~60% with Q > 10. However, the rate also depends on the samples (cells and tissues), and we have to be careful to apply this criteria to FANTOM5 samples uniquely.
  • Insertion/deletion is much more frequent than mismatches - error profile of a typical HeliScope CAGE library ( produced by SAMstat )
  • Fill and lock sequencing (see above) indicate that we can't see the 1st base. On the other hand, the 1st base are usually 'G' in CAGE - due to 1st strand cDNA synthesis during cap-trapping. Combined them, the 1st base likely to represent the 1st base of the capped RNAs.
  • Still there are frequent mismatches at the 1st position. One interpretation is that it is alignment artifacts (Timo). Another guess is that it is caused by addition of multiple bases in revers transcription, since RT sometimes could add multiple bases (kawaji).
  • Mapping quality (Q) based filtering works reasonably, but we can optimize more. Geoff demonstrated some analysis based on 'e' (read length/1+errors). See https://fantom5-collaboration.gsc.riken.jp/wiki/index.php/File:HCAGE_analysis.doc and threads [fantom5:00265] One could use percent identity instead. (same concept, in principle)

Action and Discussion on going

  • Let's add filtering step based on 'e' or percent identity (consensus; but needs to be done practically)
  • Which threshold of mapping quality is reasonable? 10 or 20? - Running IDR on replicates

Related / mentioned resources