CAGEscan mapping protocol: Difference between revisions
Revision as of 18:38, 4 April 2011
Sample splitting and linker removal
The input is a pair of 5′ and 3′ paired-end FASTQ files from the Illumina sequencers.
- The first 9 bases of the 5′ reads are trimmed. The first 6 are the index sequence (“barcode”) and the next 3 are the linker (GGG).
- The first 6 bases of the 3′ reads are trimmed because they derive from the random part (N6) of the reverse-transcription primer and therefore may not reflect the RNA sequence accurately: the reverse transcriptase tolerates mismatches even on the last two bases. See Mizuno et al., 1999 for an example of priming over mismatches.
We use the in-house command PipelinePairedEndExtraction.pl. It generates pairs of FASTQ files (5′ and 3′).
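The trimming rules above can be illustrated with a minimal Python sketch. This is not the code of PipelinePairedEndExtraction.pl, which is not shown on this page; the function names and the 4-lines-per-record FASTQ layout are assumptions made for the example.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a FASTQ stream.

    Assumes the simple layout with exactly 4 lines per record.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality

def split_five_prime(sequence, quality, barcode_len=6, linker_len=3):
    """Split a 5' read into barcode, linker, and the trimmed read.

    The first 6 bases are the barcode and the next 3 the linker (GGG),
    so 9 bases in total are removed from the sequence and the quality.
    """
    cut = barcode_len + linker_len
    barcode = sequence[:barcode_len]
    linker = sequence[barcode_len:cut]
    return barcode, linker, sequence[cut:], quality[cut:]

def trim_three_prime(sequence, quality, random_len=6):
    """Remove the first 6 bases of a 3' read (random N6 primer part)."""
    return sequence[random_len:], quality[random_len:]
```

In such a sketch the barcode returned by split_five_prime would decide which sample a read pair belongs to, and the linker could be compared with GGG as a sanity check.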
Artefact filtering
Each FASTQ file is filtered with TagDust, using an empty construct as library sequences:
AATGATACGGCGACCACCGAGATCTACACTAGTCGAACTGAAGGTCTCCAGCA[barcode]gggAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG
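As an illustration, the empty construct can be expanded for each barcode with a few lines of Python. This is only a sketch: the barcodes in the usage comment are invented, the function names are hypothetical, and writing the result as FASTA is an assumption about how the library sequences are supplied to TagDust.

```python
# Fixed parts of the empty construct, copied from the sequence above.
PREFIX = "AATGATACGGCGACCACCGAGATCTACACTAGTCGAACTGAAGGTCTCCAGCA"
LINKER = "GGG"
SUFFIX = "AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG"

def empty_construct(barcode):
    """Return the empty-construct sequence for one barcode."""
    return PREFIX + barcode.upper() + LINKER + SUFFIX

def constructs_as_fasta(barcodes):
    """Return FASTA text with one empty construct per barcode.

    The record names ("empty_construct_<barcode>") are hypothetical.
    """
    lines = []
    for barcode in barcodes:
        lines.append(">empty_construct_" + barcode.upper())
        lines.append(empty_construct(barcode))
    return "\n".join(lines) + "\n"

# Example with two invented barcodes:
# print(constructs_as_fasta(["ACGTAC", "TTGCAA"]))
```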
Removal of rDNA sequences
Each FASTQ file is filtered again to remove reads that match the ribosomal DNA repeated unit (rDNA), with the program rRNAdust.
Synchronisation of the FASTQ files
Because one end of a pair can pass the filters while the other end is removed, the resulting 5′ and 3′ FASTQ files may no longer contain the same reads in the same order, which makes them unsuitable for paired-end alignment. We use the in-house script sync_paired_fastq to discard unpaired reads and re-sort the FASTQ files.
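The idea of the synchronisation step can be sketched in Python as follows. This is not the sync_paired_fastq script itself: the function names are hypothetical, reads are assumed to be named with the older Illumina "/1" and "/2" suffixes, and the output simply follows the order of the 5′ file, which may differ from the ordering the in-house script produces.

```python
def pair_key(header):
    """Return the read name shared by both mates.

    Drops the leading '@', any description after the first space,
    and a trailing '/1' or '/2' mate suffix (assumed naming scheme).
    """
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name

def sync_pairs(records_5p, records_3p):
    """Keep only reads present in both files, in matching order.

    Each record is a (header, sequence, quality) tuple. The 3' records
    are indexed in memory, so this suits small files only.
    """
    index_3p = {pair_key(rec[0]): rec for rec in records_3p}
    kept_5p, kept_3p = [], []
    for rec in records_5p:
        mate = index_3p.get(pair_key(rec[0]))
        if mate is not None:
            kept_5p.append(rec)
            kept_3p.append(mate)
    return kept_5p, kept_3p
```

After this step the n-th record of the 5′ output and the n-th record of the 3′ output belong to the same pair, which is what paired-end aligners expect.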
Paired-end alignment with BWA
We use BWA with a seed_length of 32 and otherwise standard parameters. This produces a paired-end BAM file. Properly paired 5′ reads are also extracted by filtering for the hexadecimal flags 0x0040 (first read in a pair) and 0x0002 (mapped in a proper pair).
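The flag filter can be written as a bitwise test. The sketch below works on SAM text lines, where the flag is the second tab-separated field; the function names are hypothetical, and the page does not say which tool performs the filtering in practice.

```python
FIRST_IN_PAIR = 0x0040  # the read is the first read in a pair
PROPER_PAIR = 0x0002    # the read is mapped in a proper pair
REQUIRED = FIRST_IN_PAIR | PROPER_PAIR  # both bits together: 0x0042

def is_properly_paired_first_read(flag):
    """True if both required bits are set in the SAM flag."""
    return (flag & REQUIRED) == REQUIRED

def filter_sam_lines(lines):
    """Yield header lines and the alignments that pass the flag test."""
    for line in lines:
        if line.startswith("@"):
            yield line  # keep SAM header lines unchanged
            continue
        flag = int(line.split("\t")[1])
        if is_properly_paired_first_read(flag):
            yield line
```

With samtools, the same selection can be made by passing the combined value 0x42 to the -f option of samtools view, which keeps only alignments that have all the given bits set.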
Clustering
CAGEscan clusters can be created using promoters defined by the CAGEscan data itself or another CAGE library. Details will be added on a separate wiki page.