SampleClassification: Difference between revisions
Revision as of 20:58, 1 April 2011
The link above will take you to a separate page expanding on all groups' analysis done so far, including:
- 1) Win
input data is gene expression
phylogenetic algorithm is neighbour-joining (Manhattan distance)
- 2) Robin
input data is level 2 promoter expression
phylogenetic technique is neighbour-joining (KL divergence distance)
- 3) Owen
input data is gene expression (presence/absence using a threshold)
phylogenetic technique is maximum likelihood
- 4) Owen --preliminary result--
input data is presence/absence of TF network edges based on Motif activity (same as FANTOM4 Nature Genetics paper)
phylogenetic technique is maximum likelihood
- 5) Hai
input data is domain-level expression (converted from gene-level)
phylogenetic technique is average linkage clustering
- 6) Kawaji
input data is gene expression
phylogenetic technique is average linkage clustering (Pearson correlation)
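The analyses listed above differ mainly in how they measure dissimilarity between two samples. As a rough illustration only, the measures named in the list could be sketched as follows. The function names, the pseudocount in the KL divergence, and the choice of 1 minus the correlation as the Pearson-based distance are assumptions made for this sketch, not details taken from any group's pipeline.

```python
import math

def manhattan(p, q):
    """Manhattan (city-block) distance between two expression profiles."""
    return sum(abs(x - y) for x, y in zip(p, q))

def kl_divergence(p, q, pseudocount=1e-9):
    """KL(P || Q) after normalising both profiles to sum to 1.

    The pseudocount (an assumption) avoids log(0) for unexpressed features.
    Note that KL divergence is asymmetric: KL(P || Q) != KL(Q || P) in general.
    """
    p = [x + pseudocount for x in p]
    q = [y + pseudocount for y in q]
    sp, sq = sum(p), sum(q)
    return sum((x / sp) * math.log((x / sp) / (y / sq)) for x, y in zip(p, q))

def pearson_distance(p, q):
    """1 - Pearson correlation, one common way to turn correlation into a distance."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((x - mp) * (y - mq) for x, y in zip(p, q))
    norm = math.sqrt(sum((x - mp) ** 2 for x in p) * sum((y - mq) ** 2 for y in q))
    return 1.0 - cov / norm

def binarise(profile, threshold):
    """Presence/absence call: 1 if expression reaches the threshold, else 0."""
    return [1 if x >= threshold else 0 for x in profile]
```

The `binarise` helper shows why thresholding discards information: every value above the threshold collapses to 1, whatever its magnitude.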
Comparing these trees (attached and numbered), we can see that 1 and 6 use the same input data, and 2 is very similar (adding in non-coding). 3 takes the same data as 1 and 6 but applies a threshold to convert it to binary (discarding information), and 4 uses edge information instead of node information, and only for transcription factors. 5 uses the same data as 1 and 6, but converted from genes to domains.
1 and 2 use the same phylogenetic algorithm, 3 and 4 use a different one, and 5 and 6 use a third. The information contained in the input data should make more difference than the phylogenetic algorithm applied to it.
Comparing the trees, the results are in general very similar in 1, 2, 5 and 6. They all group obvious clades such as macrophage, brain and blood. The trees in 3 and 4 fail to separate primary cells from tissues, although they do separate the obvious clades as the others do. The tree in 5 is unique in that it separates organisms perfectly, as well as separating the sample types and the obvious clades within organisms (where available).
Cross-species sample classifications based on structural domain-ome expression information
- Motivations:
Complex systems are modular. In proteins, modularity can be found at the level of structural domains. More importantly, these structural domains can carry significant functional and evolutionary signals, as defined by SCOP at the superfamily level.
For a given organism, the domain-ome is the whole collection of SCOP domains present. From the evo-devo point of view, the presence/absence binary pattern of the domain-ome is shaped by evolution, while the numeric expression pattern of the domain-ome can reflect tissue origins and cell types. Intuitively, it would be very interesting to see whether we can perform inter-species and intra-species comparative analysis based on domain-ome expression.
Ideally, we would need protein-level expression data and a protein-domain mapping matrix to generate domain-level expression data. In practice, domain-ome expression for a panel of tissues and cell types across different species can be converted from CAGE data in FANTOM 5 or from other microarray-based high-throughput data in public sources.
- Methods and Results:
Using CAGE data from FANTOM 5, we extract the CTSS (CAGE tag start sites) of known Refgenes and map them to Ensembl proteins and SUPERFAMILY domain assignments at the SCOP superfamily level, resulting in a domain-sample expression matrix. Each entry in this matrix measures the average expression level per domain, calculated as: SUM(tags for a refgene × number of domains in this refgene) divided by the total number of domains. This calculation accounts for the domain copy numbers in different genomes, which is important for comparative studies.
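A minimal sketch of this calculation, under one reading of the formula: for each superfamily, tag counts are summed over genes weighted by the number of copies of that superfamily in each gene, then divided by the total number of copies across genes. The function name, the dictionary layout and the toy gene, sample and superfamily labels are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def domain_expression(gene_tags, gene_domains):
    """Copy-number-weighted average expression per domain superfamily.

    gene_tags:    {gene: {sample: tag_count}}
    gene_domains: {gene: {superfamily: copies_in_gene}}
    Returns       {superfamily: {sample: weighted_mean_tags}}
    """
    weighted_sum = defaultdict(lambda: defaultdict(float))
    total_copies = defaultdict(int)
    for gene, tags in gene_tags.items():
        # Genes without any domain assignment contribute nothing.
        for superfamily, copies in gene_domains.get(gene, {}).items():
            total_copies[superfamily] += copies
            for sample, count in tags.items():
                weighted_sum[superfamily][sample] += count * copies
    return {
        sf: {sample: total / total_copies[sf] for sample, total in per_sample.items()}
        for sf, per_sample in weighted_sum.items()
    }

# Toy input (hypothetical labels): geneA carries two copies of sf1,
# geneB carries one copy each of sf1 and sf2.
toy_tags = {"geneA": {"s1": 10, "s2": 0}, "geneB": {"s1": 4, "s2": 8}}
toy_domains = {"geneA": {"sf1": 2}, "geneB": {"sf1": 1, "sf2": 1}}
toy_matrix = domain_expression(toy_tags, toy_domains)
# sf1 in s1: (10*2 + 4*1) / 3 = 8.0
```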
To bring the data closer to a normal distribution and to correct the systematic biases introduced by the conversion from gene level to domain level, the domain-sample expression matrix is first log-transformed and then standardized column-wise (that is, per sample), so that the median of each column is 0 and its variation equals 1.
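The transformation could look roughly like the sketch below. Several details are assumptions, because the text does not specify them: a log2 transform with a pseudocount of 1 to handle zero counts, and the standard deviation as the measure of "variation". The function name is hypothetical.

```python
import math
import statistics

def log_standardize(matrix, pseudocount=1.0):
    """Log-transform, then standardize each column (sample).

    matrix: list of rows (domains), each a list of per-sample values.
    Each column is centred on its median and divided by its standard
    deviation (assumed here to be the intended measure of variation).
    """
    logged = [[math.log2(v + pseudocount) for v in row] for row in matrix]
    out_columns = []
    for col in zip(*logged):
        med = statistics.median(col)
        sd = statistics.pstdev(col)
        if sd == 0:
            sd = 1.0  # constant column: centre only, leave unscaled
        out_columns.append([(v - med) / sd for v in col])
    # Transpose back to rows = domains, columns = samples.
    return [list(row) for row in zip(*out_columns)]
```

Centring on the median rather than the mean makes the location estimate less sensitive to a few very highly expressed domains.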
Using this transformed and standardized domain-level expression matrix, we apply a conventional hierarchical clustering algorithm for sample classification (Euclidean distance as the distance metric, average linkage as the clustering method). See the result Media:Hierarchical_tree.pdf, and its correlation matrix Media:correlation_matrix.pdf.
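For readers unfamiliar with the method, here is a small dependency-free sketch of average linkage clustering with Euclidean distance. It is a naive illustration, not the code used to produce the attached trees; the function names and the nested-tuple tree representation are assumptions of this sketch.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles: {sample_name: [values...]}
    Returns the tree as nested tuples of sample names.
    """
    # Each cluster is (subtree, list of member sample names).
    clusters = [(name, [name]) for name in profiles]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            # Average linkage: mean distance over all cross-cluster pairs.
            d = sum(
                euclidean(profiles[a], profiles[b])
                for a in members_i
                for b in members_j
            ) / (len(members_i) * len(members_j))
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

For real data sets, a library routine such as `scipy.cluster.hierarchy.linkage` with `method="average"` and `metric="euclidean"` would be the more practical choice, since this sketch recomputes all pairwise distances at every step.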
From these figures, several features stand out:
1. Samples are robustly grouped together first according to their species identity. Since this trend is so consistent, we expect it to be preserved as more samples become available for rat, dog and chicken.
2. Samples are generally further grouped by sample type. For example, in human, the primary cells are almost completely separated from tissues.
3. Looking in greater detail at samples of the same type within the same species, comparable quality can be achieved using gene-level expression data (with the same parameters for sample classification: Media:Hierarchical_tree_gene.pdf).
We also apply projection techniques (see Media:Sammon_projection.pdf and Media:SOM_projection.pdf) to capture these relationships. The relationships revealed above appear to be independent of the algorithm used.
Why domain-centric expression? One possible answer: domain-centric expression information may be another important measure of increasing phenotypic diversity, since domains are the direct basis of interacting protein networks (which are considered to be molecular carriers of phenotype information).