CAGEscan mapping protocol
Latest revision as of 09:41, 26 December 2016
Sample splitting and linker removal
The input is a pair of 5′ and 3′ paired-end FASTQ files from the Illumina sequencers.
- The first 9 bases of the 5′ reads are trimmed. The first 6 are the index sequence (“barcode”) and the next 3 are the linker (GGG).
- The first 6 bases of the 3′ reads are trimmed because they derive from the random part (N6) of the reverse-transcription primer, and therefore may not reflect the RNA sequences accurately, since the reverse transcriptase tolerates mismatches even on the last two bases. See Mizuno et al., 1999 for an example of priming over mismatches.
We use an in-house MOIRAI command, which generates pairs of FASTQ files (5′ and 3′).
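The trimming rules above can be illustrated with a minimal Python sketch. It is not the MOIRAI command, whose implementation is not described here; the function names and the representation of a FASTQ record as a 4-tuple of strings are assumptions made for illustration only.

```python
# Illustrative sketch only: NOT the in-house MOIRAI command.
# A FASTQ record is assumed to be a tuple (header, sequence, plus_line, qualities).

BARCODE_LEN = 6   # index sequence at the start of the 5' read
LINKER_LEN = 3    # linker (GGG) following the barcode
RANDOM_LEN = 6    # N6 part of the reverse-transcription primer (3' read)


def trim_record(record, n_trim):
    """Drop the first n_trim bases and their quality values."""
    header, seq, plus, qual = record
    return (header, seq[n_trim:], plus, qual[n_trim:])


def split_five_prime(record):
    """Return (barcode, linker, trimmed_record) for a 5' read."""
    seq = record[1]
    barcode = seq[:BARCODE_LEN]
    linker = seq[BARCODE_LEN:BARCODE_LEN + LINKER_LEN]
    return barcode, linker, trim_record(record, BARCODE_LEN + LINKER_LEN)


def trim_three_prime(record):
    """Remove the bases derived from the random primer from a 3' read."""
    return trim_record(record, RANDOM_LEN)


# Hypothetical example reads (made-up sequences).
five = ("@read1/1", "ACAGATGGGTTTTCCCCAAAA", "+", "I" * 21)
three = ("@read1/2", "NNNNNNGATTACA", "+", "I" * 13)

barcode, linker, five_trimmed = split_five_prime(five)
three_trimmed = trim_three_prime(three)
```

Returning the barcode separately is what would allow reads to be assigned to their sample before the trimmed sequence is written out.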
Artefact filtering
Each FASTQ file is filtered with TagDust, using the oligonucleotide and empty-construct sequences as the artefact library.
Removal of rDNA sequences
Each FASTQ file is filtered again to remove reads that match the ribosomal DNA repeated unit (rDNA), with the program rRNAdust.
Synchronisation of the FASTQ files
Since one end of a pair can be valid while the other end is filtered out, the resulting pairs of FASTQ files are not directly suitable for paired-end alignment. Unpaired reads are therefore discarded and the FASTQ files re-sorted using the MOIRAI in-house command matchPairedEndSeq.
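The idea of synchronisation can be sketched in Python as follows. This is not matchPairedEndSeq itself; the function names are hypothetical, and the sketch assumes that mates share a read name, optionally followed by a /1 or /2 suffix, and that the 3′ file fits in memory.

```python
# Illustrative sketch only: NOT the MOIRAI matchPairedEndSeq command.
# Records are assumed to be tuples (header, sequence, plus_line, qualities).


def read_id(header):
    """Read name without the leading '@', trailing comment, or /1, /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def sync_pairs(five_records, three_records):
    """Keep only reads present in both files, in the order of the 5' file."""
    three_by_id = {read_id(rec[0]): rec for rec in three_records}
    pairs = []
    for rec in five_records:
        mate = three_by_id.get(read_id(rec[0]))
        if mate is not None:
            pairs.append((rec, mate))
    return pairs


# Hypothetical example: read "b" lost its 3' mate during filtering,
# and the 3' file is in a different order.
five = [("@a/1", "ACGT", "+", "IIII"),
        ("@b/1", "CCCC", "+", "IIII"),
        ("@c/1", "GGGG", "+", "IIII")]
three = [("@c/2", "TTTT", "+", "IIII"),
         ("@a/2", "AAAA", "+", "IIII")]

synced = sync_pairs(five, three)
synced_ids = [read_id(p[0][0]) for p in synced]
```

Writing the two members of each retained pair to separate files in the same order gives a pair of FASTQ files in which record n of one file is the mate of record n of the other, which is what paired-end aligners expect.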
Paired-end alignment with BWA
We use BWA with a maximum_insert_size of 2,000,000 and otherwise standard parameters. This produces a paired-end BAM file. Properly paired 5′ reads are then extracted by filtering for the hexadecimal flag 0x0002, and PCR duplicates are removed.
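The flag filter can be illustrated with a short Python sketch operating on the integer FLAG field of SAM/BAM records. In the SAM format, bit 0x0002 marks a read mapped in a proper pair and bit 0x0040 marks the first read of a pair; treating the 5′ read as the first read of the pair is an assumption of this sketch, as are the function names.

```python
# Illustrative sketch: interpreting the SAM FLAG bits used for filtering.
PROPER_PAIR = 0x0002    # each segment properly aligned according to the aligner
FIRST_IN_PAIR = 0x0040  # first segment in the template (assumed to be the 5' read)


def is_properly_paired(flag):
    """True when the 0x0002 bit is set."""
    return bool(flag & PROPER_PAIR)


def is_proper_five_prime(flag):
    """True for properly paired reads that are first in their pair."""
    wanted = PROPER_PAIR | FIRST_IN_PAIR
    return (flag & wanted) == wanted


# Example FLAG values: 99 = 0x1+0x2+0x20+0x40, 147 = 0x1+0x2+0x10+0x80, 65 = 0x1+0x40.
flags = [99, 147, 65]
kept = [f for f in flags if is_proper_five_prime(f)]
```

Testing both bits with a single mask keeps the selection equivalent to requiring flag 0x0002 and excluding the second read of each pair.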
Clustering
CAGEscan clusters were created using the FANTOM5 CAGE peaks as seeds.
Note
The first round of processing (up to UPDATE_023) used slightly different parameters: maximum_insert_size was 100000, PCR duplicates were not removed, and older versions of the programs were used (for instance, pairedBamToBed12 did not correct for extra Gs). The in-house script sync_paired_fastq was used instead of the MOIRAI command to discard unpaired reads and re-sort the FASTQ files.