Expression normalization method

There are several ways on how to normalize CAGE expression data. 'default' way of normalization has been TPM, but we could think of recent proposal such as TMM (implemented in edgeR, genomebiology.com/2010/11/3/R25). Another point to be considered is if we should include chromosome M in the library sizes or not.

Common dispersion survey (at reliable TSS cluster level)

Ultimately we are going to use TSS expressions beyond gene level, and normalization methods have to be examined at the cluster level. RLE (relative log expression) method proposed by DESeq package has been implemented recently in edgeR in addition to TMM. Here we tested (i) no scaling (ii) RLE, (iii) TMM on the cluster expression.

Method

data - Promotome v1.0 - RC1 (release candidate 1, Sep 26th 2011), from https://fantom5-collaboration.gsc.riken.jp/webdav/home/kawaji/110804-dpi-clusters-expression/ only clusters with max_counts >= 50 were used
replicates - determined by recognizing "donor","donation","pool","biol_rep","tech_rep", in the sample names.

Results

Common dispersion:

no scaling of sample sizes: 0.2388622 [ sqrt(common.dispersion) = 0.4887353 ]
RLE: 0.2495027 [ sqrt(common.dispersion) = 0.4995025 ]
TMM: 0.2749048 [ sqrt(common.dispersion) = 0.5243136 ]

Interpretation:

no scaling is the best.
RLE is reasonably good, however no scaling is the best in terms of common dispersion. Only if RLE is proven to rescue some 'poor quality' libraries, RLE can be a reasonable choice.
TMM is the worst, but it might be improved by taking a better reference samples (the library whose upper quartile is closest to the mean upper quartile is used). However, the improvement would not be so drastic from our experiences below.

Note:

quantile-based scaling implemented in edgeR crashed with the data.

Common dispersion survey (at gene level)

Marina performed a series of experiments on TMM normalization (edgeR, genomebiology.com/2010/11/3/R25) with taking the shallowest, the deepest, or universal RNA as a reference and with/without chromosome M. And she simply checked common dispersion on those combinations.

As a result, she found that the best performance (smallest common dispersion) is:

conventional TPM is recommended (not TMM normalization). Removal of chromosome M from the library size would be a good option.

Here is the methods and the result by Marina Lizio (m.lizio@gsc.riken.jp) August 1, 2011

A-Dataset

I selected a clean subset of human primary cells with 3 replicates (CD14 set, replicates as follows)

Monocytes                                                     
monocytes mock treated                                     
monocytes treated with B glucan                          
monocytes reated with BCG                               
monocytes treated with Candida                           
monocytes treated with Cryptococcus                      
monocytes treated with Group A streptococci          
monocytes treated with IFN N hexane              
monocytes treated with Salmonella                        
monocytes treated with Trehalose dimycolate (TDM)
monocytes treated with lipopolysaccharide                
CD14-CD16+ Monocytes                                                   
CD14+CD16- Monocytes
Universal RNA (Clontech)
HeLa control for Automation1

As a first trial I chose the RefSeq gene based expression calculated on the raw counts.

B-Parameters tested

Application of TMM normalization with respect to a reference column used for the normalization itself:

deepest sample **shallowest sample **universal RNA sample **control sample (HeLa for automation)

And exclusion of chrM mapped tags form the dataset prior to normalization

C-Method description

Use of edgeR bioconductor package (version 2.2.5, bioconductor version 2,8, R version 2.13).

Steps 1-read the gene expression table, read the library sizes
2-select the CD14 subset of replicates
3-include one control and universal RNA samples for TMM normalization.
- Step3 is done only when TMM normalization is taken into account.
4-calculate the normalization factor for all 4 cases (shallowest sample, deepest sample, universal RNA sample, HeLa control sample)
- Step 4 is skipped in the test case without TMM normalization applied.
5-calculate the common dispersion of the resulting normalized dataset.

These steps above are repeated for the library sizes without considering chrM tags

The normalization method assigns a normalization factor to each sample, according to the reference sample used. For all the tests the common dispersion was calculated, to check whether a best combination of parameters as in B) exists so that the dispersion is minimized.

[Pseudocode]

 inputs=DGEList(infiles, groups=my.groups, libsizes=my.counts) 
 
 #the groups are listed in A) and the counts are either the size of the whole 
 #library or the size of the library without the number of chrM associated tags 
 #case of TMM normalization
 
 for dex in inputs { for (mycol in c(idx_shallowest,idx_ctrl, idx_deepest,idx_univ)) { 
       d=calcNormFactors(dex,method="TMM",refColumn=mycol)
       d.comm=estimateCommonDisp(d)
 } } 
 
 #case of no TMM normalization

 for dex in inputs { for (mycol in c(idx_shallowest,idx_ctrl, idx_deepest,idx_univ)) { 
       d.comm=estimateCommonDisp(d)
 } }

D-Results

In all cases of normalization, the dispersion is minimized if the universal RNA sample is considered as reference. Dispersion is further improved if the chrM counts are not considered in the library size. However, if no TMM normalization is performed, dispersion is minimized.

 #with TMM
 
 > load("CD14_dispersion_replicates_no-chrM_with_control_refcol-hela_control-20110726.RData") 
 > sqrt(d.comm$common.dispersion) 
 [1] 0.2419791
 > load("CD14_dispersion_replicates_with_control_refcol-hela_control-20110726.RData") 
 > sqrt(d.comm$common.dispersion) 
 [1] 0.243903 
 > load("CD14_dispersion_replicates_with_control_refcol-shallowest_lib-20110726.RData") 
 > sqrt(d.comm$common.dispersion) 
 [1] 0.2351437 
 >  load("CD14_dispersion_replicates_no-chrM_with_control_refcol-shallowest_lib-20110726.RData") 
 > sqrt(d.comm$common.dispersion) 
 [1] 0.233168 
 > load("CD14_dispersion_replicates_no-chrM_with_control_refcol-deepest_lib-20110726.RData") 
 > sqrt(d.comm$common.dispersion) 
 [1] 0.2397358 
 >  load("CD14_dispersion_replicates_with_control_refcol-deepest_lib-20110726.RData") 
 > sqrt(d.comm$common.dispersion) 
 [1] 0.241812 
 > load("CD14_dispersion_replicates_with_control_refcol-universalRNA-20110726.RData") 
 > sqrt(d.comm$common.dispersion) 
 [1] 0.2340208
 > load("CD14_dispersion_replicates_no-chrM_with_control_refcol-universalRNA-20110726.RData") 
 > sqrt(d.comm$common.dispersion) 
 [1] 0.2321581

 #Without TMM

 > sqrt(d.comm$common.dispersion) 
 [1] 0.2218773 
 > save(d.comm,file="CD14_dispersion_replicates_noTMM.RData") 
 > sqrt(d.comm$common.dispersion) 
 [1] 0.2208855 
 > save(d.comm,file="CD14_dispersion_replicates_noTMM_no_chrM.RData")

rDNA and chrM survey

Since Charles and a few people wondered if chrM mapped reads are correlated with rRNA reads, Kawaji plotted them to show their correlation. This support the strategy to remove chrM mapped reads from the library sizes.

Results

Methods

shell commands in ruby/Rake:

 FileList["#{ROOTDIR}/UPDATE_012/f5pipeline/human.primary_cell.hCAGE/*.rdna.fa.gz"].each do |infile|
   sh %! echo -ne $(basename #{infile} .nobarcode.rdna.fa.gz)"\t" >> #{t.name}!
   sh %! tmp=$(grep -c '>' #{infile}); echo -ne "$tmp\t" >> #{t.name}!
   infile = infile.sub(".nobarcode.rdna.fa.gz", ".ctss.bed.gz")
   sh %! gunzip -c #{infile} | grep ^chrM | awk 'BEGIN{sum=0}{sum += $5}END{print sum}' >> #{t.name}!
 end

small RNA normalization

Similar to above, different normalization strategies for FANTOM5 small RNA libraries have been tested including tpm (tags per million sequences), tpmm (tags per million miRNA sequences), TMM, and RLE (see above for description of TMM and RLE).

Failed small RNA libraries

Before normalization, 10 "failed" libraries were removed. These libraries had less than 300 total sequences, to give some perspective the median number of sequences for the set of libraries was ~1.2 million. Here is the list of removed libraries (sample name;sample id):

Tracheal Epithelial Cells, donor1;11292-117A5
Astrocyte - cerebellum, donor2;11580-120F5
Smooth Muscle Cells - Carotid, donor3;11434-118H3
Skeletal muscle cells differentiated into Myotubes - multinucleated, donor3;11431-118G9
Mesenchymal Stem Cells - bone marrow, donor2;11616-122A5
Preadipocyte - breast, donor1;11467-119B9
Endothelial Cells - Microvascular, donor2;11342-117G1
Renal Mesangial Cells, donor3;11679-122H5
Trabecular Meshwork Cells, donor3;11693-123A1
Fibroblast - Villous Mesenchymal, donor2;11615-122A4

Methods

The procedure was essentially the same as above, using edgeR to calculate average common dispersion across all FANTOM5 primary cell snapshot libraries with replicates. The sample whose 75%-ile is closest to the mean of 75%-iles was taken as the reference for TMM normalization (default behavior on edgeR). Counts at miRNA loci were aggregated using mirBASE v.17 hairpin definitions, which means we are not distiguishing between miRNAs derived from the 3' and 5' arms.

Results

Common dispersion:

tpm: 0.3989
tpmm: 0.2133
TMM: 0.2003
RLE: 0.1992

Estimate of the coeffecient of variation (square root of the common dispersion):

tpm: 0.6316
tpmm: 0.4619
TMM: 0.4475
RLE: 0.4463

The second list can be interpreted as the amount of variation (up or down) in a library from the "true" miRNA abundance. In other words, in the tpm method the true abundance varies 63% up or down between replicates while using tpmm normalization the true abundance varies 46% up or down.

Given this, it appears the preferred method would be something like tpm <<<<<<<<< tpmm ~= TMM ~= RLE. TMM and RLE appear to perform substantially better in at least a couple of libraries (see below). As pointed out by Al on the mailing list, the problem with tpm appears to be inherent to the protocol, perhaps stemming from fluctuations in the sequencing of different non-coding RNA populations (tRNA, rRNA, etc.), even across replicates.

Libraries with relatively poor reproducibility

Just as a warning of sorts, the following replicate sets has greater than 75% variation across replicates (poor reproducibility):

Smooth Muscle cells – Carotid
Smooth Muscle cells – Pulmonary Artery
Smooth Muscle cells – Brachiocephalioc
Prostrate Epithelial Cells
Fibroblast – Mammary
Neural stem cells

All of these libraries have 3 replicates, so it may be a matter of removing one replicate but this has not been investigated further. In theory, it could also indicate that few miRNA are expressed in these leaving very few data points.

RLE/TMM outperformance of tpmm

It should be noted that while TMM and RLE perform about the same as tpmm, there are a few libraries where performance is substantially better for TMM and RLE. Two example libraries with 10% less variation in TMM/RLE relative to tpmm: smooth muscle cells (uterine) and neural stem cells. The likely reason for this is that TMM/RLE correct for overrepresentation of a few loci in a library (see PMID id:).

Files

I am including links to 1) a file (Media:MiRNA expression table noquotes reformat.txt.gz) containing a list of raw miRNA counts for all currently-sequenced small primary cell RNA libraries and 2) a file (Media:All normalization factors.txt.gz) containing the necessary normalization information for each library; these can be used to reconstruct the results here or make quick comparisons, plots, etc. of your own. File #2 includes the library ID, the total number of sequences for all libraries (for tpm calculations), the total number of miRNA sequences (for tpmm calculations), and normalization factors for TMM and RLE. To calculate normalized values under the RLE/TMM methods, multiply the library size (total number of miRNA sequences) by the normalization factor and then divide the number of counts by this factor (and multiply by 1million, if so desired). For example:

 effective.libsize <- libsize * TMM_normalization_factor
 norm.count <- raw.count/effective.libsize*1000000

Expression normalization method

Contents

Common dispersion survey (at reliable TSS cluster level)

Method

Results

Common dispersion survey (at gene level)

A-Dataset

B-Parameters tested

C-Method description

D-Results

rDNA and chrM survey

Results

Methods

small RNA normalization

Failed small RNA libraries

Methods

Results

Libraries with relatively poor reproducibility

RLE/TMM outperformance of tpmm

Files

Navigation menu

Page actions

Page actions

Personal tools

Menu

Search

Special topics

Resources

ZENBU genome browser

UCSC Genome Browser RIKEN mirror

Navigation

Tools