Expression normalization method
There are several ways on how to normalize CAGE expression data. 'default' way of normalization has been TPM, but we could think of recent proposal such as TMM (implemented in edgeR, genomebiology.com/2010/11/3/R25). Another point to be considered is if we should include chromosome M in the library sizes or not.
Common dispersion survey (at reliable TSS cluster level)
Ultimately we are going to use TSS expressions beyond gene level, and normalization methods have to be examined at the cluster level. RLE (relative log expression) method proposed by DESeq package has been implemented recently in edgeR in addition to TMM. Here we tested (i) no scaling (ii) RLE, (iii) TMM on the cluster expression.
Method
- data - Promotome v1.0 - RC1 (release candidate 1, Sep 26th 2011), from https://fantom5-collaboration.gsc.riken.jp/webdav/home/kawaji/110804-dpi-clusters-expression/ only clusters with max_counts >= 50 were used
- replicates - determined by recognizing "donor","donation","pool","biol_rep","tech_rep", in the sample names.
Results
Common dispersion:
- no scaling of sample sizes: 0.2388622 [ sqrt(common.dispersion) = 0.4887353 ]
- RLE: 0.2495027 [ sqrt(common.dispersion) = 0.4995025 ]
- TMM: 0.2749048 [ sqrt(common.dispersion) = 0.5243136 ]
Interpretation:
- no scaling is the best.
- RLE is reasonably good, however no scaling is the best in terms of common dispersion. Only if RLE is proven to rescue some 'poor quality' libraries, RLE can be a reasonable choice.
- TMM is the worst, but it might be improved by taking a better reference samples (the library whose upper quartile is closest to the mean upper quartile is used). However, the improvement would not be so drastic from our experiences below.
Note:
- quantile-based scaling implemented in edgeR crashed with the data.
Common dispersion survey (at gene level)
Marina performed a series of experiments on TMM normalization (edgeR, genomebiology.com/2010/11/3/R25) with taking the shallowest, the deepest, or universal RNA as a reference and with/without chromosome M. And she simply checked common dispersion on those combinations.
As a result, she found that the best performance (smallest common dispersion) is:
- conventional TPM is recommended (not TMM normalization). Removal of chromosome M from the library size would be a good option.
Here is the methods and the result by Marina Lizio (m.lizio@gsc.riken.jp) August 1, 2011
A-Dataset
I selected a clean subset of human primary cells with 3 replicates (CD14 set, replicates as follows)
Monocytes monocytes mock treated monocytes treated with B glucan monocytes reated with BCG monocytes treated with Candida monocytes treated with Cryptococcus monocytes treated with Group A streptococci monocytes treated with IFN N hexane monocytes treated with Salmonella monocytes treated with Trehalose dimycolate (TDM) monocytes treated with lipopolysaccharide CD14-CD16+ Monocytes CD14+CD16- Monocytes Universal RNA (Clontech) HeLa control for Automation1
As a first trial I chose the RefSeq gene based expression calculated on the raw counts.
B-Parameters tested
Application of TMM normalization with respect to a reference column used for the normalization itself:
- deepest sample **shallowest sample **universal RNA sample **control sample (HeLa for automation)
And exclusion of chrM mapped tags form the dataset prior to normalization
C-Method description
Use of edgeR bioconductor package (version 2.2.5, bioconductor version 2,8, R version 2.13).
- Steps 1-read the gene expression table, read the library sizes
- 2-select the CD14 subset of replicates
- 3-include one control and universal RNA samples for TMM normalization.
- Step3 is done only when TMM normalization is taken into account.
- 4-calculate the normalization factor for all 4 cases (shallowest sample, deepest sample, universal RNA sample, HeLa control sample)
- Step 4 is skipped in the test case without TMM normalization applied.
- 5-calculate the common dispersion of the resulting normalized dataset.
These steps above are repeated for the library sizes without considering chrM tags
The normalization method assigns a normalization factor to each sample, according to the reference sample used. For all the tests the common dispersion was calculated, to check whether a best combination of parameters as in B) exists so that the dispersion is minimized.
[Pseudocode]
inputs=DGEList(infiles, groups=my.groups, libsizes=my.counts)
#the groups are listed in A) and the counts are either the size of the whole
#library or the size of the library without the number of chrM associated tags
#case of TMM normalization
for dex in inputs { for (mycol in c(idx_shallowest,idx_ctrl, idx_deepest,idx_univ)) {
d=calcNormFactors(dex,method="TMM",refColumn=mycol)
d.comm=estimateCommonDisp(d)
} }
#case of no TMM normalization
for dex in inputs { for (mycol in c(idx_shallowest,idx_ctrl, idx_deepest,idx_univ)) {
d.comm=estimateCommonDisp(d)
} }
D-Results
In all cases of normalization, the dispersion is minimized if the universal RNA sample is considered as reference. Dispersion is further improved if the chrM counts are not considered in the library size. However, if no TMM normalization is performed, dispersion is minimized.
#with TMM
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-hela_control-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2419791
> load("CD14_dispersion_replicates_with_control_refcol-hela_control-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.243903
> load("CD14_dispersion_replicates_with_control_refcol-shallowest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2351437
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-shallowest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.233168
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-deepest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2397358
> load("CD14_dispersion_replicates_with_control_refcol-deepest_lib-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.241812
> load("CD14_dispersion_replicates_with_control_refcol-universalRNA-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2340208
> load("CD14_dispersion_replicates_no-chrM_with_control_refcol-universalRNA-20110726.RData")
> sqrt(d.comm$common.dispersion)
[1] 0.2321581
#Without TMM
> sqrt(d.comm$common.dispersion) [1] 0.2218773 > save(d.comm,file="CD14_dispersion_replicates_noTMM.RData") > sqrt(d.comm$common.dispersion) [1] 0.2208855 > save(d.comm,file="CD14_dispersion_replicates_noTMM_no_chrM.RData")
rDNA and chrM survey
Since Charles and a few people wondered if chrM mapped reads are correlated with rRNA reads, Kawaji plotted them to show their correlation. This support the strategy to remove chrM mapped reads from the library sizes.
Results
Methods
shell commands in ruby/Rake:
FileList["#{ROOTDIR}/UPDATE_012/f5pipeline/human.primary_cell.hCAGE/*.rdna.fa.gz"].each do |infile|
sh %! echo -ne $(basename #{infile} .nobarcode.rdna.fa.gz)"\t" >> #{t.name}!
sh %! tmp=$(grep -c '>' #{infile}); echo -ne "$tmp\t" >> #{t.name}!
infile = infile.sub(".nobarcode.rdna.fa.gz", ".ctss.bed.gz")
sh %! gunzip -c #{infile} | grep ^chrM | awk 'BEGIN{sum=0}{sum += $5}END{print sum}' >> #{t.name}!
end
small RNA normalization
Similar to above, different normalization strategies for FANTOM5 small RNA libraries have been tested including tpm (tags per million sequences), tpmm (tags per million miRNA sequences), TMM, and RLE (see above for description of TMM and RLE).
Failed small RNA libraries
Before normalization, 10 "failed" libraries were removed. These libraries had less than 300 total sequences, to give some perspective the median number of sequences for the set of libraries was ~1.2 million. Here is the list of removed libraries (sample name;sample id):
- Tracheal Epithelial Cells, donor1;11292-117A5
- Astrocyte - cerebellum, donor2;11580-120F5
- Smooth Muscle Cells - Carotid, donor3;11434-118H3
- Skeletal muscle cells differentiated into Myotubes - multinucleated, donor3;11431-118G9
- Mesenchymal Stem Cells - bone marrow, donor2;11616-122A5
- Preadipocyte - breast, donor1;11467-119B9
- Endothelial Cells - Microvascular, donor2;11342-117G1
- Renal Mesangial Cells, donor3;11679-122H5
- Trabecular Meshwork Cells, donor3;11693-123A1
- Fibroblast - Villous Mesenchymal, donor2;11615-122A4
Methods
The procedure was essentially the same as above, using edgeR to calculate average common dispersion across all FANTOM5 primary cell snapshot libraries with replicates. The sample whose 75%-ile is closest to the mean of 75%-iles was taken as the reference for TMM normalization (default behavior on edgeR). Counts at miRNA loci were aggregated using mirBASE v.17 hairpin definitions, which means we are not distiguishing between miRNAs derived from the 3' and 5' arms.
Results
Common dispersion:
- tpm: 0.3989
- tpmm: 0.2133
- TMM: 0.2003
- RLE: 0.1992
Estimate of the coeffecient of variation (square root of the common dispersion):
- tpm: 0.6316
- tpmm: 0.4619
- TMM: 0.4475
- RLE: 0.4463
The second list can be interpreted as the amount of variation (up or down) in a library from the "true" miRNA abundance. In other words, in the tpm method the true abundance varies 63% up or down between replicates while using tpmm normalization the true abundance varies 46% up or down.
Given this, it appears the preferred method would be something like tpm <<<<<<<<< tpmm ~= TMM ~= RLE. TMM and RLE appear to perform substantially better in at least a couple of libraries (see below). As pointed out by Al on the mailing list, the problem with tpm appears to be inherent to the protocol, perhaps stemming from fluctuations in the sequencing of different non-coding RNA populations (tRNA, rRNA, etc.), even across replicates.
Libraries with relatively poor reproducibility
Just as a warning of sorts, the following replicate sets has greater than 75% variation across replicates (poor reproducibility):
- Smooth Muscle cells – Carotid
- Smooth Muscle cells – Pulmonary Artery
- Smooth Muscle cells – Brachiocephalioc
- Prostrate Epithelial Cells
- Fibroblast – Mammary
- Neural stem cells
All of these libraries have 3 replicates, so it may be a matter of removing one replicate but this has not been investigated further. In theory, it could also indicate that few miRNA are expressed in these leaving very few data points.
RLE/TMM outperformance of tpmm
It should be noted that while TMM and RLE perform about the same as tpmm, there are a few libraries where performance is substantially better for TMM and RLE. Two example libraries with 10% less variation in TMM/RLE relative to tpmm: smooth muscle cells (uterine) and neural stem cells. The likely reason for this is that TMM/RLE correct for overrepresentation of a few loci in a library (see PMID id:).
Files
I am including links to 1) a file (Media:MiRNA expression table noquotes reformat.txt.gz) containing a list of raw miRNA counts for all currently-sequenced small primary cell RNA libraries and 2) a file (Media:All normalization factors.txt.gz) containing the necessary normalization information for each library; these can be used to reconstruct the results here or make quick comparisons, plots, etc. of your own. File #2 includes the library ID, the total number of sequences for all libraries (for tpm calculations), the total number of miRNA sequences (for tpmm calculations), and normalization factors for TMM and RLE. To calculate normalized values under the RLE/TMM methods, multiply the library size (total number of miRNA sequences) by the normalization factor and then divide the number of counts by this factor (and multiply by 1million, if so desired). For example:
effective.libsize <- libsize * TMM_normalization_factor norm.count <- raw.count/effective.libsize*1000000
