Relative vs Absolute: Understanding Compositional Data with SimulationsKrishna YerramsettyBlockedUnblockFollowFollowingJun 16One of the things I am constantly asked to do at my work is to compare NGS data across samples.
These could be replicates or could be coming from completely different environments or processing steps.
We all intuitively understood that these data are always constrained by the sequencing depth and therefore the conclusions need to be drawn carefully.
I lacked the vocabulary or the techniques to analyze this data until I saw the pre-print of this paper by @WarrenMcG.
That started a few days of digging a little deeper into compositional data, and I generated some synthetic data to understand the nuances a little better.
In this post, I try to lay out what I’ve learned and hopefully help spread the word about the dangers of treating compositional data as regular un-constrained data.
What are Compositional DataCompositional data is any data where the numbers make up proportions of a whole.
For example, the number of heads and tails you obtain from flipping a coin say 10 times is compositional data.
The main feature of these data is that they are ‘closed’ or constrained in the sense that if I tell you we got 6 heads from flipping a 2-sided coin, you know that there are 4 tails without me providing you with any further information.
In probability terms, this means not all components in the data are independent of each other.
Therefore, compositional data do not exist in the Euclidean space, but in a special constrained space called the simplex.
For engineers out there, you might have run into a simplex plot every time you plotted a multi-phase equilibrium plot like this one here for water plotted using the R library Ternary:In essence, a specified fixed amount of water can exist in different phases in different conditions but the total amount of steam, liquid water, and ice should always sum up to the fixed amount we started with.
Compositional data exists in more places than we think.
Some examples include voting data, nutrient information on food labels, the proportion of different gases in the air you breathe, and even the RGB values on the pixels on your screen right now.
Problems with Treating Compositional Data as Unconstrained DataThere are in general 4 different criteria that any analysis method for compositional data should satisfy according to Aitchison.
Traditional analyses generally do not satisfy all these requirements and therefore lead to erroneous results.
I will use simulated data to convince myself and hopefully you to understand these pitfalls of traditional analyses methods.
I used absSimSeq to generate some dummy RNA-seq like data.
In short, I have 4 samples from a control condition and 4 samples from a treatment condition.
Each sample is made up of 100 native genes and 92 spike-in (control) genes, and the counts of all genes for a sample sum up to approximately 20K reads.
The treatment condition is simulated to have 20% of the native genes differentially expressed (fancy biology term for the change in the number of copies of the gene inside the cells/tissues) compared to the control condition.
The exact magnitudes of differential expression do not matter for what we are trying to prove here.
Now that we set-up the data lets move onto the 4 criteria that traditional analyses methods generally fail to take into account when dealing with compositional data.
You can find all the data and code I used here: https://github.
gitScale InvarianceA common property of compositional data is that they can be represented using different magnitudes of vectors.
Ex: Air is made up of 78% Nitrogen, 21% Oxygen, and the rest is 1% by Volume.
In vector notation, this can be written as [78, 21, 1].
In terms of parts per million in volume (ppmv), this can also be specified as [780840, 209460, and 9780].
Despite these very different notations, they represent the same physical sample.
Traditional analyses methods like Euclidean distances and hierarchical clustering, however, will be fooled by this difference in representation.
Let’s prove this using our simulated RNA-Seq data.
In R do the following:load('sim_counts_matrix_100.
dist <- dist(t(counts_matrix))orig.
dendro <- as.
dendrogram(hclust(d = dist(t(counts_matrix))))# Create dendrodendro.
plot <- ggdendrogram(data = orig.
dendro, rotate = TRUE)# Preview the plotprint(dendro.
plot)Clustering of original dataAs expected, we see that the four control samples (samples 1 to 4) cluster together, and the four treatment samples (samples 5 to 8) cluster together.
Now let's scale some samples differently as shown below.
This can happen for example, if you sequenced some samples to a higher depth than usual but still want to analyze all the samples together.
# Scale the samples differentlyscaled.
counts_matrix = counts_matrix %*% diag(c(1,1,5,5,1,1,5,5))scaled.
dist <- dist(t(scaled.
dendro <- as.
dendrogram(hclust(d = dist(t(scaled.
counts_matrix))))# Create dendroscaled.
plot <- ggdendrogram(data = scaled.
dendro, rotate = TRUE)# Preview the plotprint(scaled.
plot)Clustering of scaled dataThe new dendrogram now shows the scaled samples (Samples 3,4, 7, and 8) clustered together, which is not true since the only thing that changed is the total number of reads for these samples, but their individual components are the same relative to each other.
Perturbation InvarianceThis is related to the observation that the conclusions of our analyses should not depend on the units used to represent the compositional data.
Going back to the Air composition example, we can represent the data using a mix of percentages and ppmv values like so: [78%, 21%, and 9780ppmv].
This still represents the same physical sample, and we should be able to compare this to another air-like sample, in this new coordinate system and the old coordinate system, and draw the same conclusions.
Let's use our simulated data again to calculate the distances between samples in a newer perturbed coordinate space:#Multiply each row (gene) with a scalar valueperturbed.
counts_matrix = counts_matrix * c(seq(1,192,1))colnames(perturbed.
counts_matrix) = colnames(counts_matrix)perturbed.
dist <- dist(t(perturbed.
dendro <- as.
dendrogram(hclust(d = dist(t(perturbed.
counts_matrix))))# Create dendroperturbed.
plot <- ggdendrogram(data = perturbed.
dendro, rotate = TRUE)# Preview the plotprint(perturbed.
plot)Clustering of perturbed dataIt's clear from the above plot that this is a much minor problem compared to the scale invariance problem.
The samples from the control and treatment condition cluster together, but the distances between samples is different.
This can lead to wrong conclusions depending on the question being asked.
In the context of comparing samples across conditions as in most RNA-seq or NGS data, perturbation variance isn’t much of a problem.
Subcompositional CoherenceThis property ensures that the analyses lead to the same conclusions regardless of which components of the data are included.
For example, in our Air example, we can remove the component representing the 1% volume.
Now the samples are represented in the 2-dimensional space ([78.
5%] )instead of in the original 3-dimensional space ([78%, 21%, 1%]).
Any analysis in this lower dimensional space should not differ from the comparison in the original 3-dimensional space.
This is an important property from the standpoint of Next Generation Sequencing (NGS) data where we are not always guaranteed to find all the components in our data all the time.
This could be for a lot of reasons, sometimes intended and sometimes unintended and unavoidable.
To simulate this property (or lack thereof) in traditional analyses methods, I generated a second dataset made up of the first 50 genes + 92 controls from our original dataset made up 100 genes + 92 controls.
The 50 genes in the smaller dataset have the same absolute (unconstrained) values as in the original dataset.
These unconstrained data are then closed (constrained) by making the total sum of all reads to be ~20K as in the original dataset.
The correlation coefficients among the 50 common genes in both datasets are calculated and then compared:load('sim_counts_matrix_100.
all <- counts_matrix # Load the sub-compositional data made up of only the first 50 genes (features) + 92 controls from the original data of 100 genes (features) + 92 controlsload('sim_counts_matrix_50.
comp <- counts_matrix# Get the correlation between the 50 common genescor.
all <- as.
comp <- as.
comp[1:50,])))tmp <- as.
comp))names(tmp) <- c('correlation_all', 'correlation_sub_comp')tmp$abs.
diff <- as.
factor(ifelse(abs(tmp$correlation_all – tmp$correlation_sub_comp)>0.
5,1,0))# Plotggplot(tmp,aes(correlation_all,correlation_sub_comp, color=abs.
diff)) + geom_point(size=2) + th + scale_colour_manual(values = c("1" = "Red", "0" = "Blue")) + theme(legend.
position = "none")Differences in the correlation between genes (features) depending on what other genes are present in the dataIn the plot above, we can see that the majority of correlation coefficients are similar in both datasets, with some coefficients that differ significantly (shown in red).
Depending on the questions being asked, these differences could lead to significant differences in conclusions.
This could be a major problem in gene network analyses.
Permutation InvarianceThis property means that regardless of the order of the original data, the analyses methods should lead to the same conclusions.
Most analyses methods I can think of generally follow this property.
Please let me know if you know of any analyses methods that violate this, I will include them in the post.
How to Correctly Analyze Compositional DataNow that we convinced ourselves that the traditional analyses methods do not always satisfy all 4 properties important for analyzing compositional data, what can we do?.Aitchison has done some incredible work over 3 decades to come up with better techniques.
Based on my experience dealing with NGS data, I have a feeling that his recommendations have largely fallen on deaf ears.
Hopefully, this blog post will convince others to be more cautious when analyzing compositional data.
In the next post, I will show some ways to analyze compositional data with emphasis again on NGS data and applications.