User:Kawaji/101105 Gene expression Table

From Wiki
Jump to navigationJump to search


Purpose

Individual CAGE reads represent the 5'-end of a single 5'-capped RNA molecules. Here we measure expressions of known coding genes based on the CAGE data.

Note that this is experimental yet. Better solution might be raised during the F5 collaboration.

Input

  • the files containing the 5'-end position of the RNA and its count of the CAGE reads (referred as level1 cluster, or CTSS/CAGE Tag Starting Site)
  • +/- 500bp region from the TSSs of the refseq coding transcripts (NM entries).

Output

  • Gene expression table, where each row represent a refseq coding transcript and each column represent a sample, in the OSCtable format
  • The unit used here is (i) counts, for which the column names start with 'raw.' , and (ii) TPM (tags per million, or reads per million), for which the column names start with 'tpm.'
  • The table has 'special gene' to represent the library information, which starting with 'STAT:'. 'STAT:TOTAL_READS' represent the number of reads aligned to be some sequences (including the reference genome and ribosomal DNA repeating unit), and 'STAT:MAPPED_READS' is the number of cleanly mapped reads to the chromosomal region.
  • TPM is calculated based on the 'STAT:MAPPED_READS'

Tips

In R, you can load the file like this as gipped format:

 read_table_CAGE_gene_expression <- function(infile)
{
  ### read the data file
  data = read.table(infile,sep="\t",header=T,as.is=T,na.strings="-", check.names=F, row.names=1)
  
  ### dealing with '% encoding'
  colnames(data) = sapply( colnames(data) , URLdecode)
  
  ### extract appropriate tables
  counts = data[ -grep( "STAT:" , rownames(data)) , grep("^raw.",colnames(data)) ]
  colnames(counts) = gsub("^raw.","", colnames(counts))
  tpm    = data[ -grep( "STAT:" , rownames(data)) , grep("^tpm.",colnames(data)) ]
  colnames(tpm) = gsub("^tpm.","", colnames(tpm))
  lib.size = data[ grep( "STAT:MAPPED_READS" , rownames(data)) , grep("^raw.",colnames(data))]
  
  ### return the results
  list(counts = counts, tpm = tpm, lib.size = lib.size)
}

### setting
infile = "00_mouse.tissue.hCAGE.gene.osc.txt.gz"

### loading
data = read_table_CAGE_gene_expression(infile)

### see the results
data$counts[1:3,1:3]
data$tpm[1:3,1:3]
data$lib.size[1:3]


Method

The number of CAGE tags are simply accumulated to the +/-500bp regions. When the CTSS/level1 position overlaps multiple transcripts, the counts devided by the overlaps are added.

Example

  • two transcripts starting at chr1 750bp (GeneA) and chr1 1250 (GeneB).
  • 10 CAGE reads starting at chr1 500bp on forward strand (C1)
  • 50 CAGE reads starting at chr1 1000bp on forward strand (C2).

In this case, their overlaps are:

  • (C1) is within +/- 500bp of the (GeneA)'s TSS
  • (C2) is within +/- 500bp of the (GeneA)'s TSS and (GeneB)'s one.

Thus,

  • (GeneA)'s expression : 10 + 50 / 2 = 35 counts
  • (GeneB)'s expression : 50 / 2 = 25 counts

This ensure you can simply accumulate the read counts of each transcript when summarize the expression based on a locus (which can include multiple transcripts)

Script

Shell script based on bedTools

 intersectBed -c -s  -a ${BED} -b ${ANN} \
| awk 'BEGIN{OFS="\t";OFMT="%6f"}{ if ($7 > 0){print $1,$2,$3,$4,$5/$7,$6} }' \
| intersectBed -s -wa -wb -a ${ANN} -b stdin \
| sort -k 4 \
| groupBy -i stdin -grp 4 -opCols 11 -ops sum  \
| awk "BEGIN{OFS=\"\t\"}{print \$1,\$2 }"