Compositional Analysis with Statistical Testing

Learn how to run a compositional analysis with a stacked bar plot and how to customize the analysis for your own experiment.

Overview

Compositional analysis allows you to investigate how cell populations are distributed across different experimental conditions or biological states. This powerful visualization tool helps identify shifts in cellular composition that may indicate biological responses, disease states, or treatment effects.

Understanding the Stacked Bar Plot

The compositional analysis uses a stacked bar plot to visualize the proportion of cells across different categories:

  • X-axis (Primary Variable): Typically displays clusters identified from single-cell analysis
  • Y-axis (Secondary Variable): Shows the distribution of experimental conditions within each cluster as stacked segments
  • Percentages: Each segment shows the percentage of cells belonging to that category
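The percentages behind such a plot can be reproduced outside the interface. The following is a minimal sketch, assuming per-cell metadata held in a pandas DataFrame; the column names `cluster` and `condition` and the toy values are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy per-cell metadata (made-up values): one row per cell.
cells = pd.DataFrame({
    "cluster":   ["0", "0", "0", "0", "1", "1", "1", "1", "1", "1"],
    "condition": ["WT", "WT", "WT", "KO", "WT", "KO", "KO", "KO", "KO", "KO"],
})

# Rows = primary variable (x-axis), columns = secondary variable (segments).
counts = pd.crosstab(cells["cluster"], cells["condition"])

# Normalize each row so the segments of every bar sum to 100%.
percentages = counts.div(counts.sum(axis=1), axis=0) * 100

print(percentages.round(1))
```

With matplotlib installed, `percentages.plot(kind="bar", stacked=True)` draws a comparable stacked bar chart from this table.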

Example Use Cases

  • Comparing cell type distributions between wild type and knockout samples
  • Analyzing how disease states affect cellular composition
  • Investigating cell cycle phase distribution across different cell types
  • Examining treatment effects on cell population dynamics

Setting Up Your Analysis

Prerequisites

Before starting your compositional analysis:

  1. Complete the initial analysis of your dataset
  2. Add relevant metadata (genotypes, conditions, treatments, etc.)
  3. Annotate clusters with cell types (using auto-annotation or manual annotation)

Steps

1. From the explorer page, open the stacked bar plot panel.

2. Select the primary and secondary variables. The primary variable is shown on the x-axis, and the secondary variable is shown as stacked segments within each category of the primary variable.

3. In this example, we look at cell types across two disease conditions: COVID-19 and healthy. The stacked bar plot gives a quick view of the cell type proportions as percentages of cells.

4. The stacked bar plot can be customised to highlight specific groups of interest; the statistical data will not be affected. You can learn more about stacked bar plots from this demo: Stacked bar plot

Variable Selection

Primary Variable

The primary variable forms the x-axis categories. Common choices include:

  • Leiden clusters
  • Cell type annotations
  • Sample groups

Secondary Variable

The secondary variable creates the stacked segments within each bar. Examples include:

  • Genotype (wild type, knockout, etc.)
  • Disease state (healthy, diseased)
  • Treatment conditions
  • Cell cycle phase (G1, S, G2M)
  • Time points

Statistical Analysis

One-Sample Proportion Z-Test

The system performs rigorous statistical testing to identify significant enrichments or depletions of cell populations. For each combination of primary and secondary variable categories, we calculate:

  1. P-value: Indicates statistical significance of the observed difference
  2. Odds Ratio (OR): Measures the magnitude and direction of the effect

How It Works

For each secondary variable category within each primary variable category, the test determines whether that category is significantly over- or under-represented compared to its overall proportion across all primary categories.

Example: Is the proportion of wild type cells in Cluster 1 significantly different from the overall wild type distribution across all clusters?
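The question in this example can be worked through by hand. The sketch below is one textbook way to compute a one-sample proportion z-test with a two-sided p-value. The odds ratio definition (odds within the cluster divided by overall odds) is an assumption, since this article does not state the exact formula the system uses. The function names and cell counts are hypothetical.

```python
import math

def proportion_z_test(k, n, p0):
    """One-sample proportion z-test: k cells of a category out of n cells
    in a cluster, tested against the overall proportion p0.
    Returns (z, two-sided p-value)."""
    if n <= 0 or not 0 < p0 < 1:
        raise ValueError("need n > 0 and 0 < p0 < 1")
    p_hat = k / n
    se = math.sqrt(p0 * (1 - p0) / n)            # standard error under H0
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p_value

def odds_ratio(k, n, p0):
    """Assumed definition: odds inside the cluster / overall odds."""
    if k == n:
        return math.inf                          # every cell is in the category
    return (k / (n - k)) / (p0 / (1 - p0))

# Made-up counts: 120 of 200 cells in Cluster 1 are wild type,
# versus 4000 of 10000 cells across all clusters.
p0 = 4000 / 10000
z, p = proportion_z_test(120, 200, p0)
effect = odds_ratio(120, 200, p0)
print(f"z={z:.3f}, p={p:.2e}, OR={effect:.2f}")
```

A z-test of this kind relies on a normal approximation, so very small clusters can give unreliable p-values.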

Understanding Significance Criteria

Dual Threshold System

For a difference to be marked as significant (indicated by a star ★ on the plot), BOTH conditions must be met:

  1. Statistical Significance
    • P-value < significance level (default: 0.005)
    • Adjustable via the dropdown menu
  2. Biological Significance
    • For Enrichment: Odds Ratio > 2.0 (at least 2× more likely)
    • For Depletion: Odds Ratio < 0.05 (at least 20× less likely)

This dual-threshold approach ensures that only meaningful biological differences are highlighted, filtering out statistically significant but trivial changes.
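The two criteria can be expressed as a single check. This is a minimal sketch using the default thresholds listed above; the function name `is_starred` and its parameters are hypothetical and do not correspond to any exposed API.

```python
def is_starred(p_value, odds_ratio, mode="enriched",
               alpha=0.005, enrich_or=2.0, deplete_or=0.05):
    """Return True only when BOTH the p-value and odds ratio criteria are met."""
    if mode not in ("enriched", "depleted"):
        raise ValueError("mode must be 'enriched' or 'depleted'")
    if p_value >= alpha:
        return False                  # fails the statistical criterion
    if mode == "enriched":
        return odds_ratio > enrich_or
    return odds_ratio < deplete_or

print(is_starred(1e-6, 2.25))                    # both criteria met
print(is_starred(1e-6, 1.5))                     # significant but small effect
print(is_starred(0.01, 3.0))                     # large effect, not significant
print(is_starred(1e-6, 0.01, mode="depleted"))   # strong depletion
```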

Using the Interface

Control Panel Features

Display Options

  • Show % text: Toggle percentage labels on/off
  • % display cutoff: Hide labels below a certain percentage threshold for cleaner visualization

Statistical Controls

  • P value checkbox: Enable statistical analysis display
  • Significance level dropdown: Adjust the p-value threshold (default: 0.005)
  • Enriched/Depleted radio buttons:
    • Select “Enriched” to highlight over-represented populations (OR > 2)
    • Select “Depleted” to highlight under-represented populations (OR < 0.05)

Customization Options

  • Value size: Adjust text size for percentages
  • Width/Height: Resize the plot dimensions
  • Customize labels: Modify category names for clarity
  • Grid: Toggle gridlines on/off

Interpreting Results

Visual Indicators

  • Star symbols (★): Appear next to categories that meet both p-value and odds ratio criteria
  • Percentage labels: Show the exact proportion of cells in each category
  • Color coding: Different colors represent different secondary variable categories

Statistical Interpretation

  • No stars: The distribution is not significantly different from expected
  • Stars with “Enriched” selected: That population is significantly over-represented
  • Stars with “Depleted” selected: That population is significantly under-represented

Downloading Results

Statistical Data Export

When you enable the P value checkbox, two download options become available:

  1. Download p-values (csv): Export the complete p-value table
  2. Download odds ratios (csv): Export the complete odds ratio table

Important: Downloaded files contain ALL statistical results regardless of UI filter settings (enriched/depleted selection or significance threshold). This ensures you have complete data for publication or further analysis.

CSV File Format

Both exported CSV files follow the same structure:

  • Rows: One row for each primary variable category (e.g., each Leiden cluster)
  • Columns:
    • First column: Primary variable name (e.g., “Leiden clusters”)
    • Remaining columns: One for each secondary variable category (e.g., “G1”, “G2M”, “S” for cell cycle phases)
  • Values:
    • P-value file: Statistical significance values for each combination
    • Odds ratio file: Effect size values for each combination

Example Output Structure

For a dataset with 15 clusters and 3 cell cycle phases, you’ll receive:

  • P-values CSV: 15 rows × 4 columns (cluster name + 3 phase p-values)
  • Odds ratios CSV: 15 rows × 4 columns (cluster name + 3 phase odds ratios)

These tables provide all the statistical information needed to identify enriched or depleted populations across your entire dataset.
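Because both files share the same layout, they can be combined into one long table for downstream filtering. The sketch below assumes pandas and uses in-memory stand-ins with made-up values in place of the real downloads. In practice, pass the paths of your downloaded files to `pd.read_csv`.

```python
import io
import pandas as pd

# Stand-ins for the two downloaded files (made-up values, 2 clusters x 3 phases).
p_csv = io.StringIO(
    "Leiden clusters,G1,G2M,S\n0,0.2,0.0001,0.5\n1,0.001,0.3,0.0002\n")
or_csv = io.StringIO(
    "Leiden clusters,G1,G2M,S\n0,1.1,2.6,0.9\n1,0.04,1.2,3.1\n")

p_wide = pd.read_csv(p_csv)
or_wide = pd.read_csv(or_csv)
primary = p_wide.columns[0]          # first column holds the primary variable

# Reshape each table to one row per (primary, secondary) combination.
p_long = p_wide.melt(id_vars=primary, var_name="secondary", value_name="p_value")
or_long = or_wide.melt(id_vars=primary, var_name="secondary", value_name="odds_ratio")
results = p_long.merge(or_long, on=[primary, "secondary"])

# Re-apply the default thresholds from this article to the complete export.
enriched = results[(results["p_value"] < 0.005) & (results["odds_ratio"] > 2.0)]
depleted = results[(results["p_value"] < 0.005) & (results["odds_ratio"] < 0.05)]
print(enriched)
print(depleted)
```

Keeping the merged table alongside the chosen thresholds makes it easier to report both values for every combination.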

Practical Tips

Choosing the Right Variables

  • Use biologically meaningful groupings as your secondary variable
  • Ensure adequate cell numbers in each category for reliable statistics
  • Consider biological relevance when setting significance thresholds

Advanced Analysis

For complex comparisons involving multiple conditions:

  1. Use the lasso tool to select specific cell populations
  2. Create custom cell sets (e.g., “healthy blood samples” vs “diseased blood samples”)
  3. Apply compositional analysis to these refined groups

Best Practices

  • Always check both p-values and odds ratios when interpreting results
  • Consider biological context when evaluating statistical significance
  • Use the download feature to preserve complete statistical results
  • Document your significance thresholds for reproducibility

Troubleshooting

No Significant Results?

  • Check if your significance threshold is too stringent
  • Ensure sufficient cell numbers in each category
  • Verify that your metadata is correctly assigned
  • Consider whether biological differences exist between conditions

Too Many Significant Results?

  • Use a more stringent significance level (e.g., lower it from 0.005 to 0.001)
  • Focus on larger effect sizes by prioritizing the most extreme odds ratios
  • Consider biological relevance of the findings

Summary

The compositional analysis tool provides a statistically rigorous yet user-friendly way to explore cellular population dynamics across experimental conditions. By combining intuitive visualization with robust statistical testing, researchers can confidently identify and validate biologically meaningful changes in their single-cell data.

💡 Extra tips:

What if I want to compare cells across multiple conditions? For example, I have samples from different tissues (blood and bone marrow) and different conditions (disease and healthy), and I want to visualise the cell type composition in blood samples only, for both the disease and healthy conditions.

Example steps:

  • Use the lasso tool and Cell sets to create and save groups of cells from multiple conditions, for example ‘healthy blood sample’ and ‘disease blood sample’.
  • Create a new custom categorical from the cell sets (learn more in: Cell Selections and Custom Categories).
  • Use the Hide cells function to hide any uncategorised cells that are not of interest; this removes them from the plot in the next step.
  • Use Stacked bar to plot the custom categorical and cell type.

Yi Su

Bioinformatician