Compositional Analysis with Statistical Testing

Learn how to run a compositional analysis with a stacked bar plot and how to tailor the analysis to your own experiment.

Written by

Yi Su

Overview

Compositional analysis allows you to investigate how cell populations are distributed across different experimental conditions or biological states. This powerful visualization tool helps identify shifts in cellular composition that may indicate biological responses, disease states, or treatment effects.

Understanding the Stacked Bar Plot

The compositional analysis uses a stacked bar plot to visualize the proportion of cells across different categories:

X-axis (Primary Variable): Typically displays clusters identified from single-cell analysis
Y-axis (Secondary Variable): Shows the distribution of experimental conditions within each cluster as stacked segments
Percentages: Each segment shows the percentage of cells belonging to that category
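
The percentages behind each bar can be reproduced outside the app. The following is a minimal sketch, assuming you have one (primary, secondary) label pair per cell; the cell list and the `stacked_percentages` helper are hypothetical and are not part of the platform.

```python
from collections import Counter

# Hypothetical per-cell labels: (primary category, secondary category).
cells = [
    ("cluster 0", "healthy"), ("cluster 0", "healthy"),
    ("cluster 0", "healthy"), ("cluster 0", "COVID-19"),
    ("cluster 1", "healthy"), ("cluster 1", "COVID-19"),
    ("cluster 1", "COVID-19"), ("cluster 1", "COVID-19"),
]

def stacked_percentages(cells):
    """Return {primary: {secondary: percent}}; the segments of each bar sum to 100."""
    pair_counts = Counter(cells)
    bar_totals = Counter(primary for primary, _ in cells)
    result = {}
    for (primary, secondary), n in pair_counts.items():
        result.setdefault(primary, {})[secondary] = 100.0 * n / bar_totals[primary]
    return result

percentages = stacked_percentages(cells)
for primary, segments in percentages.items():
    print(primary, {name: round(value, 1) for name, value in segments.items()})
```

Each bar is normalised on its own, so a small cluster and a large cluster both fill the full height of the plot.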

Example Use Cases

Comparing cell type distributions between wild type and knockout samples
Analyzing how disease states affect cellular composition
Investigating cell cycle phase distribution across different cell types
Examining treatment effects on cell population dynamics

Setting Up Your Analysis

Prerequisites

Before starting your compositional analysis:

Complete initial analysis on your dataset
Add relevant metadata (genotypes, conditions, treatments, etc.)
Annotate clusters with cell types (using auto-annotation or manual annotation)

Steps

1. From the explorer page, open the stacked bar plot panel.

2. Select the primary and secondary variables. The primary variable is shown on the x-axis, and the secondary variable is shown as the stacked segments within each category of the primary variable.

3. This example plots cell types against two disease conditions, COVID-19 and healthy. The stacked bar plot gives a quick view of the cell type proportions, shown as percentages of cells.

4. The stacked bar plot can be customised to highlight specific groups of interest; this does not affect the statistical data. You can learn more about stacked bar plots from this demo: Stacked bar plot

Variable Selection

Primary Variable

The primary variable forms the x-axis categories. Common choices include:

Leiden clusters
Cell type annotations
Sample groups

Secondary Variable

The secondary variable creates the stacked segments within each bar. Examples include:

Genotype (wild type, knockout, etc.)
Disease state (healthy, diseased)
Treatment conditions
Cell cycle phase (G1, S, G2M)
Time points

Statistical Analysis

One-Sample Proportion Z-Test

The tool tests every combination of primary and secondary variable categories for significant enrichment or depletion of cells. For each combination, it calculates:

P-value: Indicates statistical significance of the observed difference
Odds Ratio (OR): Measures the magnitude and direction of the effect

How It Works

For each secondary variable category within each primary variable category, the test determines whether that category is significantly over- or under-represented relative to its overall proportion across all primary categories.

Example: Is the proportion of wild type cells in Cluster 1 significantly different from the overall wild type distribution across all clusters?
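
The comparison in this example can be sketched with a textbook one-sample proportion z-test. This is an illustration only: the article does not state the exact formulas the platform uses, so the two-sided p-value, the odds ratio definition (odds within the cluster divided by overall odds), and the function name `proportion_z_test` are all assumptions.

```python
import math

def proportion_z_test(k, n, p0):
    """Textbook one-sample proportion z-test (assumed form, not the platform's code).

    k  : cells of the secondary category inside the primary category
    n  : all cells in the primary category
    p0 : overall proportion of the secondary category (must satisfy 0 < p0 < 1)
    Returns (z statistic, two-sided p-value, odds ratio).
    """
    p_hat = k / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    if p_hat == 1.0:
        odds_ratio = math.inf  # every cell in the cluster belongs to the category
    else:
        # Assumed definition: odds in this cluster relative to the overall odds.
        odds_ratio = (p_hat / (1.0 - p_hat)) / (p0 / (1.0 - p0))
    return z, p_value, odds_ratio

# Hypothetical counts: 120 of the 200 cells in Cluster 1 are wild type,
# while wild type cells make up 40% of the whole dataset.
z, p_value, odds_ratio = proportion_z_test(k=120, n=200, p0=0.40)
print(f"z = {z:.2f}, p = {p_value:.2e}, OR = {odds_ratio:.2f}")
```

With these made-up counts the cluster is 60% wild type against a 40% baseline, which gives an odds ratio of 2.25 and a very small p-value.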

Understanding Significance Criteria

Dual Threshold System

For a difference to be marked as significant (indicated by a star ★ on the plot), BOTH conditions must be met:

Statistical Significance
- P-value < significance level (default: 0.005)
- Adjustable via the dropdown menu
Biological Significance
- For Enrichment: Odds Ratio > 2.0 (odds more than 2× higher than overall)
- For Depletion: Odds Ratio < 0.05 (odds more than 20× lower than overall)

This dual-threshold approach ensures that only meaningful biological differences are highlighted, filtering out statistically significant but trivial changes.
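
The rule above can be written as a small predicate. The thresholds are the ones listed in this article; the function name `is_starred` and its parameters are hypothetical and chosen only for illustration.

```python
def is_starred(p_value, odds_ratio, mode="enriched", alpha=0.005):
    """Return True when a combination would meet both criteria described above.

    mode  : "enriched" or "depleted", mirroring the radio buttons
    alpha : significance level (0.005 is the documented default)
    """
    if p_value >= alpha:
        return False  # fails the statistical criterion
    if mode == "enriched":
        return odds_ratio > 2.0
    if mode == "depleted":
        return odds_ratio < 0.05
    raise ValueError("mode must be 'enriched' or 'depleted'")

# A tiny p-value alone is not enough: the effect size must also pass.
print(is_starred(p_value=1e-6, odds_ratio=1.3))                    # False
print(is_starred(p_value=1e-6, odds_ratio=2.5))                    # True
print(is_starred(p_value=1e-6, odds_ratio=0.01, mode="depleted"))  # True
```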

Using the Interface

Control Panel Features

Display Options

Show % text: Toggle percentage labels on/off
% display cutoff: Hide labels below a certain percentage threshold for cleaner visualization

Statistical Controls

P value checkbox: Enable statistical analysis display
Significance level dropdown: Adjust the p-value threshold (default: 0.005)
Enriched/Depleted radio buttons:
- Select “Enriched” to highlight over-represented populations (OR > 2)
- Select “Depleted” to highlight under-represented populations (OR < 0.05)

Customization Options

Value size: Adjust text size for percentages
Width/Height: Resize the plot dimensions
Customize labels: Modify category names for clarity
Grid: Toggle gridlines on/off

Interpreting Results

Visual Indicators

Star symbols (★): Appear next to categories that meet both p-value and odds ratio criteria
Percentage labels: Show the exact proportion of cells in each category
Color coding: Different colors represent different secondary variable categories

Statistical Interpretation

No stars: The distribution is not significantly different from expected
Stars with “Enriched” selected: That population is significantly over-represented
Stars with “Depleted” selected: That population is significantly under-represented

Downloading Results

Statistical Data Export

When you enable the P value checkbox, two download options become available:

Download p-values (csv): Export the complete p-value table
Download odds ratios (csv): Export the complete odds ratio table

Important: Downloaded files contain ALL statistical results regardless of UI filter settings (enriched/depleted selection or significance threshold). This ensures you have complete data for publication or further analysis.

CSV File Format

Both exported CSV files follow the same structure:

Rows: One row for each primary variable category (e.g., each Leiden cluster)
Columns:
- First column: Primary variable name (e.g., “Leiden clusters”)
- Remaining columns: One for each secondary variable category (e.g., “G1”, “G2M”, “S” for cell cycle phases)
Values:
- P-value file: Statistical significance values for each combination
- Odds ratio file: Effect size values for each combination

Example Output Structure

For a dataset with 15 clusters and 3 cell cycle phases, you’ll receive:

P-values CSV: 15 rows × 4 columns (cluster name + 3 phase p-values)
Odds ratios CSV: 15 rows × 4 columns (cluster name + 3 phase odds ratios)

These tables provide all the statistical information needed to identify enriched or depleted populations across your entire dataset.
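
Because the exports are unfiltered, you can re-apply any thresholds you like offline. The sketch below assumes the layout described above and that every cell holds a plain number; the inline CSV text, its values, and the `read_table` helper are hypothetical stand-ins for your downloaded files.

```python
import csv
import io

# Hypothetical file contents in the documented layout (2 clusters x 3 phases).
P_VALUES_CSV = """Leiden clusters,G1,G2M,S
0,0.0001,0.2,0.5
1,0.3,0.00001,0.04
"""
ODDS_RATIOS_CSV = """Leiden clusters,G1,G2M,S
0,2.8,0.9,0.7
1,0.8,3.5,1.6
"""

def read_table(handle):
    """Parse an export into {primary category: {secondary category: value}}."""
    reader = csv.reader(handle)
    header = next(reader)
    categories = header[1:]  # the first column holds the primary category name
    return {
        row[0]: dict(zip(categories, map(float, row[1:])))
        for row in reader if row
    }

p_values = read_table(io.StringIO(P_VALUES_CSV))
odds_ratios = read_table(io.StringIO(ODDS_RATIOS_CSV))

# Re-apply the enrichment criteria: p < 0.005 and odds ratio > 2.
enriched = [
    (primary, secondary)
    for primary, row in p_values.items()
    for secondary, p in row.items()
    if p < 0.005 and odds_ratios[primary][secondary] > 2.0
]
print(enriched)  # [('0', 'G1'), ('1', 'G2M')]
```

To read real downloads, replace the `io.StringIO(...)` arguments with opened file handles, for example `open(path, newline="")`.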

Practical Tips

Choosing the Right Variables

Use biologically meaningful groupings as your secondary variable
Ensure adequate cell numbers in each category for reliable statistics
Consider biological relevance when setting significance thresholds

Advanced Analysis

For complex comparisons involving multiple conditions:

Use the lasso tool to select specific cell populations
Create custom cell sets (e.g., “healthy blood samples” vs “diseased blood samples”)
Apply compositional analysis to these refined groups

Best Practices

Always check both p-values and odds ratios when interpreting results
Consider biological context when evaluating statistical significance
Use the download feature to preserve complete statistical results
Document your significance thresholds for reproducibility

Troubleshooting

No Significant Results?

Check if your significance threshold is too stringent
Ensure sufficient cell numbers in each category
Verify that your metadata is correctly assigned
Consider whether biological differences exist between conditions

Too Many Significant Results?

Use a more stringent significance level (e.g., lower it from 0.005 to 0.001)
Focus on larger effect sizes by prioritising the combinations with the strongest odds ratios
Consider biological relevance of the findings

Summary

The compositional analysis tool provides a statistically rigorous yet user-friendly way to explore cellular population dynamics across experimental conditions. By combining intuitive visualization with robust statistical testing, researchers can confidently identify and validate biologically meaningful changes in their single-cell data.

💡 Extra tips:

What if I want to compare cells across multiple conditions? For example, suppose you have samples from different tissues (blood and bone marrow) and different conditions (disease and healthy), and you want to visualise the cell type composition of blood samples only, for both the disease and healthy conditions.

Example steps:

Use the lasso tool and Cell sets to create and save groups of cells from multiple conditions, for example ‘healthy blood sample’ and ‘disease blood sample’.
Create a new custom categorical from the cell sets (learn more in: Cell Selections and Custom Categories).
Use the Hide cells function to hide any uncategorised cells that are not of interest, so that they are excluded from the plot in the next step.
Use Stacked bar to plot the custom categorical against cell type.
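
The logic of these steps can be mimicked offline as a rough sketch. The per-cell records, the field names, and the `custom_category` helper are hypothetical; in the app the same grouping is done with cell sets and custom categoricals.

```python
# Hypothetical per-cell metadata records.
cells = [
    {"tissue": "blood", "condition": "healthy", "cell_type": "T cell"},
    {"tissue": "blood", "condition": "disease", "cell_type": "B cell"},
    {"tissue": "bone marrow", "condition": "healthy", "cell_type": "T cell"},
    {"tissue": "blood", "condition": "disease", "cell_type": "T cell"},
]

def custom_category(cell):
    """Combine tissue and condition; cells outside blood stay uncategorised (None)."""
    if cell["tissue"] != "blood":
        return None
    return f"{cell['condition']} blood sample"

# "Hide" uncategorised cells, keeping (custom category, cell type) pairs to plot.
visible = [
    (custom_category(cell), cell["cell_type"])
    for cell in cells
    if custom_category(cell) is not None
]
print(visible)
```

The bone marrow cell drops out, leaving only the two blood groups to compare by cell type.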

Yi Su

Bioinformatician
