Expression normalization method

Revision as of 19:13, 1 August 2011 by Kawaji

There are several ways to normalize CAGE expression data. The 'default' normalization has been TPM (tags per million), but we could also consider more recent proposals such as TMM (trimmed mean of M-values). Another point to consider is whether chromosome M should be included in the library sizes. To address this point, we performed several surveys: (i) checking common dispersion, (ii) chrM / rRNA correlation
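To make the chrM question concrete, here is a minimal sketch in base R of how TPM changes when chrM-mapped tags are dropped from the library size. The gene names and counts are invented for illustration and are not taken from the CD14 data.

```r
# Toy tag counts for one library: three nuclear genes plus chrM-mapped tags.
# All numbers are invented for illustration only.
counts <- c(geneA = 500, geneB = 1500, geneC = 3000, chrM = 5000)
nuclear <- counts[names(counts) != "chrM"]

# Library size including and excluding chrM-mapped tags.
libsize_all     <- sum(counts)
libsize_no_chrM <- sum(nuclear)

# TPM: tag counts scaled to a library size of one million.
tpm <- function(x, libsize) x / libsize * 1e6

tpm_all     <- tpm(nuclear, libsize_all)
tpm_no_chrM <- tpm(nuclear, libsize_no_chrM)

print(rbind(tpm_all, tpm_no_chrM))
```

In this toy library half of the tags map to chrM, so removing them from the library size doubles the TPM of every nuclear gene. When the chrM fraction differs between libraries, the choice therefore changes the relative scaling of the samples.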

common dispersion survey

Marina performed a series of experiments on TMM normalization, taking the shallowest library, the deepest library, or universal RNA as the reference, each with and without chromosome M. She then compared the common dispersion across those combinations.

As a result, she found that the best performance (smallest common dispersion) is obtained with:

conventional TPM (not TMM normalization),
chromosome M mapped reads removed from the library size


Details

  Marina Lizio (m.lizio@gsc.riken.jp)
  August 1, 2011

A-Dataset

I selected a clean subset of human primary cells with 3 replicates (CD14 set, replicates as follows):

Monocytes                                                     
monocytes mock treated                                     
monocytes treated with B glucan                          
monocytes treated with BCG
monocytes treated with Candida                           
monocytes treated with Cryptococcus                      
monocytes treated with Group A streptococci          
monocytes treated with IFN N hexane              
monocytes treated with Salmonella                        
monocytes treated with Trehalose dimycolate (TDM)
monocytes treated with lipopolysaccharide                
CD14-CD16+ Monocytes                                                   
CD14+CD16- Monocytes
Universal RNA (Clontech)
HeLa control for Automation1

As a first trial I chose the RefSeq gene-based expression calculated on the raw counts.

B-Parameters tested

* Application of TMM normalization with respect to a reference column used for the normalization itself:
** deepest sample
** shallowest sample
** universal RNA sample
** control sample (HeLa for automation)
* Exclusion of chrM mapped tags from the dataset prior to normalization

C-Method description

The analysis uses the edgeR Bioconductor package (version 2.2.5, Bioconductor version 2.8, R version 2.13).

Steps

1-read the gene expression table and the library sizes
2-select the CD14 subset of replicates
3-include one control sample and the universal RNA sample for TMM normalization (this step is done only when TMM normalization is applied)
4-calculate the normalization factors for all 4 cases (shallowest sample, deepest sample, universal RNA sample, HeLa control sample) (this step is skipped in the test case without TMM normalization)
5-calculate the common dispersion of the resulting normalized dataset

The steps above are repeated for the library sizes computed without the chrM tags.

The normalization method assigns a normalization factor to each sample, relative to the reference sample used. For every test the common dispersion was calculated, to check whether some combination of the parameters in B) minimizes the dispersion.

# Pseudocode

# one DGEList per choice of library size: the groups are listed in A) and
# my.counts is either the size of the whole library or the size of the
# library without the number of chrM associated tags
inputs=DGEList(infiles, group=my.groups, lib.size=my.counts)

# case of TMM normalization

for (dex in inputs) { for (mycol in c(idx_shallowest,idx_ctrl, idx_deepest,idx_univ)) {

       # normalization factors relative to the chosen reference column
       d=calcNormFactors(dex,method="TMM",refColumn=mycol)
       d.comm=estimateCommonDisp(d)

} }

# case of no TMM normalization (no reference column is needed)

for (dex in inputs) {

       # the normalization factors keep their default value of 1
       d.comm=estimateCommonDisp(dex)

}
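The pseudocode above can be tried end to end on synthetic data. The following is a minimal sketch, assuming a current edgeR installation; the simulated counts, the group layout, and the variable names are all hypothetical and are not the CD14 dataset, and only two of the four reference choices are shown.

```r
library(edgeR)

set.seed(1)
# Synthetic counts: 1000 genes x 6 samples (3 groups x 2 replicates).
# Purely simulated; this is NOT the CD14 data.
n_genes <- 1000
groups  <- factor(rep(c("g1", "g2", "g3"), each = 2))
counts  <- matrix(rnbinom(n_genes * 6, mu = 50, size = 10), nrow = n_genes,
                  dimnames = list(paste0("gene", 1:n_genes), paste0("s", 1:6)))

libsize <- colSums(counts)
dge <- DGEList(counts = counts, group = groups, lib.size = libsize)

# Reference columns to try: the shallowest and the deepest library.
refs <- c(shallowest = unname(which.min(libsize)),
          deepest    = unname(which.max(libsize)))

# Common dispersion with TMM, one value per reference column.
disp_tmm <- sapply(refs, function(ref) {
  d <- calcNormFactors(dge, method = "TMM", refColumn = ref)
  estimateCommonDisp(d)$common.dispersion
})

# Common dispersion without TMM (normalization factors stay at 1).
disp_no_tmm <- estimateCommonDisp(dge)$common.dispersion

# Same summary as in the results: square root of the common dispersion.
print(sqrt(c(disp_tmm, noTMM = disp_no_tmm)))
```

To mirror the chrM comparison, the same calls would be repeated with a second DGEList whose lib.size excludes the chrM-mapped tags.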

D-Results

In all cases with TMM normalization, the dispersion is smallest when the universal RNA sample is taken as the reference. The dispersion decreases further if the chrM counts are excluded from the library size. However, the smallest dispersion overall is obtained when no TMM normalization is performed at all. The values below are the square root of the common dispersion.

# with TMM

> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-hela_control-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2419791
> load("CD14_dispersion_replicates_with_control_refcol-hela_control-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.243903
> load("CD14_dispersion_replicates_with_control_refcol-shallowest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2351437
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-shallowest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.233168
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-deepest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2397358
> load("CD14_dispersion_replicates_with_control_refcol-deepest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.241812
> load("CD14_dispersion_replicates_with_control_refcol-universalRNA-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2340208
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-universalRNA-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2321581

# without TMM

> sqrt(d.comm$common.dispersion)
[1] 0.2218773
> save(d.comm,file="CD14_dispersion_replicates_noTMM.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2208855
> save(d.comm,file="CD14_dispersion_replicates_noTMM_no_chrM.RData")
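For easier comparison, the values reported in the session output above can be collected into one table and sorted. This is only a convenience sketch in base R; the numbers are copied from the output, while the column names and labels are chosen here for illustration.

```r
# sqrt(common dispersion) values copied from the session output above.
# Within each reference, the first value is with chrM in the library size
# and the second is with chrM excluded.
res <- data.frame(
  reference = rep(c("HeLa control", "shallowest", "deepest",
                    "universal RNA", "none (no TMM)"), each = 2),
  chrM      = rep(c("included", "excluded"), times = 5),
  sqrt_disp = c(0.2439030, 0.2419791,
                0.2351437, 0.2331680,
                0.2418120, 0.2397358,
                0.2340208, 0.2321581,
                0.2218773, 0.2208855),
  stringsAsFactors = FALSE
)

# Sort from smallest to largest dispersion; the first row is the best case.
res <- res[order(res$sqrt_disp), ]
print(res, row.names = FALSE)
best <- res[1, ]
```

Sorted this way, the two rows without TMM come first, and within every reference the chrM-excluded library size gives the smaller value.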