Expression normalization method

There are several ways to normalize CAGE expression data. The 'default' normalization has been TPM (tags per million), but more recent proposals such as TMM (trimmed mean of M-values) could also be considered. Another point to consider is whether chromosome M should be included in the library sizes. To address these points, we performed several surveys: (i) checking common dispersion, (ii) chrM / rRNA correlation.
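To make the two library-size choices concrete, here is a minimal R sketch of TPM scaling, assuming TPM means raw count divided by library size, times one million. The count matrix, the library sizes and the helper name `tpm` are made up for illustration and are not taken from the survey.

```r
# Toy raw tag counts (made-up numbers): rows are genes, columns are samples.
counts <- matrix(c(10, 30,
                   20, 20),
                 nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2")))

# Hypothetical per-sample totals: all mapped tags, and tags mapped to chrM.
lib_all  <- c(s1 = 100, s2 = 200)
lib_chrM <- c(s1 = 60,  s2 = 160)

# Illustrative helper: divide each column by its library size, scale to 1e6.
tpm <- function(counts, lib_size) sweep(counts, 2, lib_size, "/") * 1e6

tpm_all     <- tpm(counts, lib_all)             # chrM tags counted in the library size
tpm_no_chrM <- tpm(counts, lib_all - lib_chrM)  # chrM tags taken out of the library size
```

In this toy case the chrM-heavy sample s2 changes the most: geneA goes from 100000 to 500000 TPM once the chrM tags leave the denominator.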

common dispersion survey

Marina ran a series of TMM normalization experiments, taking the shallowest sample, the deepest sample, or universal RNA as the reference, each with and without chromosome M. She then checked the common dispersion for each combination.

As a result, she found that the best performance (smallest common dispersion) is obtained with:

conventional TPM (no TMM normalization),
chromosome M mapped reads taken out of the library size

Details

Marina Lizio (m.lizio@gsc.riken.jp)
August 1, 2011

A-Dataset

I selected a clean subset of human primary cells with 3 replicates (CD14 set, replicates as follows):

Monocytes                                                     
monocytes mock treated                                     
monocytes treated with B glucan                          
monocytes treated with BCG
monocytes treated with Candida                           
monocytes treated with Cryptococcus                      
monocytes treated with Group A streptococci          
monocytes treated with IFN N hexane              
monocytes treated with Salmonella                        
monocytes treated with Trehalose dimycolate (TDM)
monocytes treated with lipopolysaccharide                
CD14-CD16+ Monocytes                                                   
CD14+CD16- Monocytes
Universal RNA (Clontech)
HeLa control for Automation1

As a first trial I chose the RefSeq gene-based expression calculated on the raw counts.

B-Parameters tested

*Application of TMM normalization, with one of the following samples as the reference column:
**deepest sample
**shallowest sample
**universal RNA sample
**control sample (HeLa for automation)
*Exclusion of chrM mapped tags from the dataset prior to normalization

C-Method description

Use of the edgeR Bioconductor package (version 2.2.5, Bioconductor version 2.8, R version 2.13).

Steps

1-read the gene expression table, read the library sizes
2-select the CD14 subset of replicates
3-include one control and universal RNA samples for TMM normalization (this step is done only when TMM normalization is taken into account)
4-calculate the normalization factor for all 4 cases (shallowest sample, deepest sample, universal RNA sample, HeLa control sample); this step is skipped in the test case without TMM normalization
5-calculate the common dispersion of the resulting normalized dataset

The steps above are repeated with library sizes that do not include the chrM tags.

The normalization method assigns a normalization factor to each sample, according to the reference sample used. For every test the common dispersion was calculated, to check whether some combination of the parameters in B) minimizes the dispersion.
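As a rough illustration of what a per-sample normalization factor does, the sketch below assumes the edgeR convention that the factor rescales the library size (effective library size = library size × normalization factor). All numbers and variable names are made up; none come from the survey.

```r
# Made-up library sizes and TMM-style normalization factors for two samples.
lib_size     <- c(s1 = 1e6, s2 = 2e6)
norm_factors <- c(s1 = 0.8, s2 = 1.25)

# Effective library size: the original size rescaled by the sample's factor.
eff_lib_size <- lib_size * norm_factors

# Made-up raw counts of one gene in the two samples.
gene_counts <- c(s1 = 80, s2 = 250)

tpm_plain <- gene_counts / lib_size * 1e6      # conventional TPM
tpm_tmm   <- gene_counts / eff_lib_size * 1e6  # TPM on the effective library size
```

With these toy values the gene looks different between the samples under conventional TPM (80 versus 125) and identical once the factors are applied (100 in both), which is the kind of shift a change of reference sample can introduce.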

Pseudocode

inputs=DGEList(counts=infiles, group=my.groups, lib.size=my.counts)

The groups are those listed in A), and the library sizes (my.counts) are either the size of the whole library or the size of the library minus the number of chrM associated tags.

case of TMM normalization

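for (dex in inputs) {
    # dex: DGEList built with either library-size definition
    for (mycol in c(idx_shallowest, idx_ctrl, idx_deepest, idx_univ)) {
        # TMM normalization factors relative to the chosen reference column
        d = calcNormFactors(dex, method="TMM", refColumn=mycol)
        # common dispersion of the normalized dataset
        d.comm = estimateCommonDisp(d)
    }
}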

case of no TMM normalization

for (dex in inputs) {
    # no reference column is needed: the dispersion is estimated
    # directly on the DGEList, without TMM normalization factors
    d.comm = estimateCommonDisp(dex)
}

D-Results

In all cases of TMM normalization, the dispersion is lowest when the universal RNA sample is used as the reference, and it improves slightly further when the chrM counts are excluded from the library size. However, the lowest dispersion overall is obtained when no TMM normalization is performed. The values below are the square root of the common dispersion.

with TMM

> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-hela_control-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2419791
> load("CD14_dispersion_replicates_with_control_refcol-hela_control-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.243903
> load("CD14_dispersion_replicates_with_control_refcol-shallowest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2351437
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-shallowest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.233168
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-deepest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2397358
> load("CD14_dispersion_replicates_with_control_refcol-deepest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.241812
> load("CD14_dispersion_replicates_with_control_refcol-universalRNA-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2340208
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-universalRNA-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2321581

Without TMM

> sqrt(d.comm$common.dispersion)
[1] 0.2218773
> save(d.comm,file="CD14_dispersion_replicates_noTMM.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2208855
> save(d.comm,file="CD14_dispersion_replicates_noTMM_no_chrM.RData")
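For a side-by-side view, the ten values reported above can be collected into one named vector and sorted. This is only a convenience sketch: the numbers are copied from the sessions above, while the labels and variable names are made up here (the labels are inferred from the .RData file names).

```r
# sqrt(common dispersion) for each tested combination, copied from the output above.
sqrt_disp <- c(
  "TMM, ref HeLa control, with chrM"  = 0.243903,
  "TMM, ref HeLa control, no chrM"    = 0.2419791,
  "TMM, ref shallowest, with chrM"    = 0.2351437,
  "TMM, ref shallowest, no chrM"      = 0.233168,
  "TMM, ref deepest, with chrM"       = 0.241812,
  "TMM, ref deepest, no chrM"         = 0.2397358,
  "TMM, ref universal RNA, with chrM" = 0.2340208,
  "TMM, ref universal RNA, no chrM"   = 0.2321581,
  "no TMM, with chrM"                 = 0.2218773,
  "no TMM, no chrM"                   = 0.2208855
)

# Sort from lowest to highest and pick the combination with the smallest value.
sqrt_disp_sorted <- sort(sqrt_disp)
best <- names(sqrt_disp_sorted)[1]

print(sqrt_disp_sorted)
print(best)
```

The first entry after sorting is "no TMM, no chrM", followed by "no TMM, with chrM"; the best TMM combination (universal RNA reference, no chrM) comes third.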
