Expression normalization method
There are several ways to normalize CAGE expression data. The 'default' normalization has been TPM (tags per million), but more recent proposals such as TMM (trimmed mean of M-values) could also be considered. Another question is whether chromosome M should be included in the library sizes. To address these points, we performed several surveys: (i) checking common dispersion, (ii) chrM / rRNA correlation
Common dispersion survey
Marina ran a series of TMM normalization experiments, taking the shallowest sample, the deepest sample, universal RNA, or the HeLa control as the reference, each with and without chromosome M in the library size, and compared the common dispersion across those combinations.
The best performance (smallest common dispersion) was obtained with:
conventional TPM (not TMM normalization), with chromosome M mapped reads removed from the library size
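The following is a minimal sketch of what "TPM with chrM removed from the library size" means in practice. It is not the code used in the survey; the count matrix, the tag totals, and the helper name `tpm` are all made up for illustration.

```r
# Toy tag counts (rows = genes, columns = samples); all values are made up.
counts <- matrix(c(10, 20, 30,
                   40, 50, 60),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# Hypothetical total mapped tags and chrM-mapped tags per sample.
total_tags <- c(s1 = 1000000, s2 = 2000000, s3 = 4000000)
chrM_tags  <- c(s1 =  200000, s2 =  500000, s3 = 1000000)

# Library size with the chrM-mapped tags taken out.
libsize_no_chrM <- total_tags - chrM_tags

# TPM: divide each column by its library size and scale to one million.
# 'tpm' is an illustrative helper, not a function from any package.
tpm <- function(counts, libsize) {
  sweep(counts, 2, libsize, "/") * 1e6
}

tpm_all     <- tpm(counts, total_tags)       # chrM included in library size
tpm_no_chrM <- tpm(counts, libsize_no_chrM)  # chrM excluded from library size
```

Because the library size shrinks when chrM tags are removed, every TPM value in a sample is scaled up by the same factor, and that factor differs between samples with different chrM fractions.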
Details
- Marina Lizio (m.lizio@gsc.riken.jp)
- August 1, 2011
A-Dataset
I selected a clean subset of human primary cells with 3 replicates (CD14 set, replicates as follows):
*Monocytes
*monocytes mock treated
*monocytes treated with B glucan
*monocytes treated with BCG
*monocytes treated with Candida
*monocytes treated with Cryptococcus
*monocytes treated with Group A streptococci
*monocytes treated with IFN N hexane
*monocytes treated with Salmonella
*monocytes treated with Trehalose dimycolate (TDM)
*monocytes treated with lipopolysaccharide
*CD14-CD16+ Monocytes
*CD14+CD16- Monocytes
*Universal RNA (Clontech)
*HeLa control for Automation1
As a first trial I chose the RefSeq gene based expression calculated on the raw counts.
B-Parameters tested
*Application of TMM normalization, with one of the following as the reference column for the normalization:
**deepest sample
**shallowest sample
**universal RNA sample
**control sample (HeLa for automation)
*Exclusion of chrM mapped tags from the dataset prior to normalization
C-Method description
Use of the edgeR Bioconductor package (version 2.2.5, Bioconductor version 2.8, R version 2.13).
Steps
1-read the gene expression table, read the library sizes
2-select the CD14 subset of replicates
3-include one control and the universal RNA sample for TMM normalization.
- Step 3 is done only when TMM normalization is taken into account.
4-calculate the normalization factor for all 4 cases (shallowest sample, deepest sample, universal RNA sample, HeLa control sample)
- Step 4 is skipped in the test case without TMM normalization applied.
5-calculate the common dispersion of the resulting normalized dataset.
- The steps above are repeated for the library sizes without considering chrM tags
The normalization method assigns a normalization factor to each sample, relative to the reference sample used. The common dispersion was calculated for every test, to check whether some combination of the parameters in B) minimizes the dispersion.
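To illustrate how a per-sample normalization factor acts, here is a small sketch in base R. It assumes the edgeR convention that the factor rescales the library size (effective library size = library size × normalization factor); the library sizes, factors, and counts are hypothetical and are not taken from the CD14 data.

```r
# Hypothetical library sizes and normalization factors for three samples.
# (edgeR scales its factors so that they multiply to 1; these toy values do too.)
lib_size     <- c(s1 = 1e6, s2 = 2e6, s3 = 4e6)
norm_factors <- c(s1 = 1.0, s2 = 0.8, s3 = 1.25)

# Effective library size: raw library size rescaled by the factor.
eff_lib_size <- lib_size * norm_factors

# The same raw count in every sample, for illustration only.
counts <- c(s1 = 100, s2 = 100, s3 = 100)

# Expression per million effective tags.
norm_tpm <- counts / eff_lib_size * 1e6
```

With all factors equal to 1 this reduces to conventional TPM, which corresponds to the "no TMM" test case.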
- Pseudocode
# my.table, my.groups, libsize.all and libsize.no_chrM are placeholder names:
# the groups are listed in A), and the library sizes are either the size of the
# whole library or the size of the library without the chrM associated tags
inputs = list(DGEList(counts=my.table, group=my.groups, lib.size=libsize.all),
              DGEList(counts=my.table, group=my.groups, lib.size=libsize.no_chrM))
# case of TMM normalization
for (dex in inputs) {
  for (mycol in c(idx_shallowest, idx_ctrl, idx_deepest, idx_univ)) {
    d = calcNormFactors(dex, method="TMM", refColumn=mycol)
    d.comm = estimateCommonDisp(d)
  }
}
# case of no TMM normalization (no reference column is needed)
for (dex in inputs) {
  d.comm = estimateCommonDisp(dex)
}
D-Results
Among the TMM cases, the dispersion is smallest when the universal RNA sample is used as the reference. In every case, the dispersion decreases further when the chrM counts are excluded from the library size. However, the dispersion is smallest overall when no TMM normalization is performed.
- with TMM
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-hela_control-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2419791
> load("CD14_dispersion_replicates_with_control_refcol-hela_control-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.243903
> load("CD14_dispersion_replicates_with_control_refcol-shallowest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2351437
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-shallowest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.233168
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-deepest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2397358
> load("CD14_dispersion_replicates_with_control_refcol-deepest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.241812
> load("CD14_dispersion_replicates_with_control_refcol-universalRNA-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2340208
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-universalRNA-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2321581
- Without TMM
> sqrt(d.comm$common.dispersion)
[1] 0.2218773
> save(d.comm,file="CD14_dispersion_replicates_noTMM.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2208855
> save(d.comm,file="CD14_dispersion_replicates_noTMM_no_chrM.RData")
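For easier comparison, the ten values printed above can be collected into one table and sorted. This is only a convenience sketch: the numbers are copied from the session output, while the column names and the data frame layout are my own choice.

```r
# sqrt(common dispersion) values copied from the session output above.
# Column names are illustrative; chrM_in_libsize is FALSE for the "no-chrM" runs.
res <- data.frame(
  reference = c("HeLa control", "HeLa control",
                "shallowest", "shallowest",
                "deepest", "deepest",
                "universal RNA", "universal RNA",
                "none (no TMM)", "none (no TMM)"),
  chrM_in_libsize = c(FALSE, TRUE,
                      TRUE, FALSE,
                      FALSE, TRUE,
                      TRUE, FALSE,
                      TRUE, FALSE),
  sqrt_disp = c(0.2419791, 0.243903,
                0.2351437, 0.233168,
                0.2397358, 0.241812,
                0.2340208, 0.2321581,
                0.2218773, 0.2208855)
)

# Sort so that the smallest dispersion comes first.
res  <- res[order(res$sqrt_disp), ]
best <- res[1, ]
print(res, row.names = FALSE)
```

The first row after sorting is the no-TMM run with chrM removed from the library size, in line with the conclusion stated at the top of this page.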