User:Kawaji/101105 Gene expression Table
From Wiki
Jump to navigationJump to search
Purpose
Individual CAGE reads represent the 5'-end of a single 5'-capped RNA molecules. Here we measure expressions of known coding genes based on the CAGE data.
Note that this is experimental yet. Better solution might be raised during the F5 collaboration.
Input
- the files containing the 5'-end position of the RNA and its count of the CAGE reads (referred as level1 cluster, or CTSS/CAGE Tag Starting Site)
- +/- 500bp region from the TSSs of the refseq coding transcripts (NM entries).
Output
- Gene expression table, where each row represent a refseq coding transcript and each column represent a sample, in the OSCtable format
- The unit used here is (i) counts, for which the column names start with 'raw.' , and (ii) TPM (tags per million, or reads per million), for which the column names start with 'tpm.'
- The table has 'special gene' to represent the library information, which starting with 'STAT:'. 'STAT:TOTAL_READS' represent the number of reads aligned to be some sequences (including the reference genome and ribosomal DNA repeating unit), and 'STAT:MAPPED_READS' is the number of cleanly mapped reads to the chromosomal region.
- TPM is calculated based on the 'STAT:MAPPED_READS'
Tips
In R, you can load the file like this as gipped format:
read_table_CAGE_gene_expression <- function(infile)
{
### read the data file
data = read.table(infile,sep="\t",header=T,as.is=T,na.strings="-", check.names=F, row.names=1)
### dealing with '% encoding'
colnames(data) = sapply( colnames(data) , URLdecode)
### extract appropriate tables
counts = data[ -grep( "STAT:" , rownames(data)) , grep("^raw.",colnames(data)) ]
colnames(counts) = gsub("^raw.","", colnames(counts))
tpm = data[ -grep( "STAT:" , rownames(data)) , grep("^tpm.",colnames(data)) ]
colnames(tpm) = gsub("^tpm.","", colnames(tpm))
lib.size = data[ grep( "STAT:MAPPED_READS" , rownames(data)) , grep("^raw.",colnames(data))]
### return the results
list(counts = counts, tpm = tpm, lib.size = lib.size)
}
### setting
infile = "00_mouse.tissue.hCAGE.gene.osc.txt.gz"
### loading
data = read_table_CAGE_gene_expression(infile)
### see the results
data$counts[1:3,1:3]
data$tpm[1:3,1:3]
data$lib.size[1:3]
Method
The number of CAGE tags are simply accumulated to the +/-500bp regions. When the CTSS/level1 position overlaps multiple transcripts, the counts devided by the overlaps are added.
Example
- two transcripts starting at chr1 750bp (GeneA) and chr1 1250 (GeneB).
- 10 CAGE reads starting at chr1 500bp on forward strand (C1)
- 50 CAGE reads starting at chr1 1000bp on forward strand (C2).
In this case, their overlaps are:
- (C1) is within +/- 500bp of the (GeneA)'s TSS
- (C2) is within +/- 500bp of the (GeneA)'s TSS and (GeneB)'s one.
Thus,
- (GeneA)'s expression : 10 + 50 / 2 = 35 counts
- (GeneB)'s expression : 50 / 2 = 25 counts
This ensure you can simply accumulate the read counts of each transcript when summarize the expression based on a locus (which can include multiple transcripts)
Script
Shell script based on bedTools
intersectBed -c -s -a ${BED} -b ${ANN} \
| awk 'BEGIN{OFS="\t";OFMT="%6f"}{ if ($7 > 0){print $1,$2,$3,$4,$5/$7,$6} }' \
| intersectBed -s -wa -wb -a ${ANN} -b stdin \
| sort -k 4 \
| groupBy -i stdin -grp 4 -opCols 11 -ops sum \
| awk "BEGIN{OFS=\"\t\"}{print \$1,\$2 }"