Overview
Compositional analysis allows you to investigate how cell populations are distributed across different experimental conditions or biological states. This powerful visualization tool helps identify shifts in cellular composition that may indicate biological responses, disease states, or treatment effects.
Understanding the Stacked Bar Plot
The compositional analysis uses a stacked bar plot to visualize the proportion of cells across different categories:
- X-axis (Primary Variable): Typically displays clusters identified from single-cell analysis
- Y-axis (Secondary Variable): Shows the distribution of experimental conditions within each cluster as stacked segments
- Percentages: Each segment shows the percentage of cells belonging to that category
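The percentages behind such a plot can be sketched in a few lines of Python. This is only an illustration, assuming each bar is normalised to 100% within its primary category; the toy cell list and the helper name `stacked_percentages` are hypothetical and not part of the tool.

```python
from collections import Counter, defaultdict

# Toy per-cell metadata as (primary, secondary) pairs; values are illustrative only.
cells = [
    ("Cluster 0", "healthy"), ("Cluster 0", "healthy"),
    ("Cluster 0", "healthy"), ("Cluster 0", "COVID-19"),
    ("Cluster 1", "healthy"), ("Cluster 1", "COVID-19"),
    ("Cluster 1", "COVID-19"), ("Cluster 1", "COVID-19"),
]

def stacked_percentages(pairs):
    """Return {primary: {secondary: percent}}, with each bar summing to 100."""
    counts = defaultdict(Counter)
    for primary, secondary in pairs:
        counts[primary][secondary] += 1
    result = {}
    for primary, segments in counts.items():
        total = sum(segments.values())
        result[primary] = {sec: 100.0 * n / total for sec, n in segments.items()}
    return result

percentages = stacked_percentages(cells)
for primary, segments in percentages.items():
    print(primary, {sec: round(pct, 1) for sec, pct in segments.items()})
```

Each inner dictionary corresponds to one bar, and each value to the height of one stacked segment.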
Example Use Cases
- Comparing cell type distributions between wild type and knockout samples
- Analyzing how disease states affect cellular composition
- Investigating cell cycle phase distribution across different cell types
- Examining treatment effects on cell population dynamics
Setting Up Your Analysis
Prerequisites
Before starting your compositional analysis:
- Complete initial analysis on your dataset
- Add relevant metadata (genotypes, conditions, treatments, etc.)
- Annotate clusters with cell types (using auto-annotation or manual annotation)
Steps
1. From the explorer page, open the stacked bar plot panel.
2. Select the primary and secondary variables. The primary variable is shown on the x-axis, and the secondary variable is shown as the stacked segments within each primary variable category.
3. Review the plot. In this example, we look at the cell types across two disease conditions: COVID-19 and healthy. The stacked bar plot lets you quickly visualise the proportion of each cell type as a percentage of cells.
4. Customise the stacked bar plot to highlight specific groups of interest; the statistical data will not be affected. You can learn more about stacked bar plots from this demo: Stacked bar plot
Variable Selection
Primary Variable
The primary variable forms the x-axis categories. Common choices include:
- Leiden clusters
- Cell type annotations
- Sample groups
Secondary Variable
The secondary variable creates the stacked segments within each bar. Examples include:
- Genotype (wild type, knockout, etc.)
- Disease state (healthy, diseased)
- Treatment conditions
- Cell cycle phase (G1, S, G2M)
- Time points
Statistical Analysis
One-Sample Proportion Z-Test
The tool tests each combination of primary and secondary variable categories for significant enrichment or depletion of cell populations. For each combination, it calculates:
- P-value: Indicates statistical significance of the observed difference
- Odds Ratio (OR): Measures the magnitude and direction of the effect
How It Works
For each secondary variable category within each primary variable category, the test determines whether that category is significantly over- or under-represented relative to its overall proportion across all primary categories.
Example: Is the proportion of wild type cells in Cluster 1 significantly different from the overall wild type distribution across all clusters?
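The wild type question above can be worked through with a textbook one-sample proportion z-test. The sketch below is not the tool's actual implementation. It assumes a two-sided test, a reference proportion taken from all cells, and an odds ratio defined as the odds inside the cluster divided by the reference odds. The function names and the toy counts are hypothetical.

```python
import math

def proportion_z_test(k, n, p0):
    """One-sample proportion z-test (two-sided, an assumption of this sketch).

    k  : cells of the secondary category inside the primary category
    n  : all cells inside the primary category
    p0 : reference proportion of the secondary category (0 < p0 < 1)
    Returns (z, p_value).
    """
    p_hat = k / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p_value

def odds_ratio(k, n, p0):
    """Odds in the cluster divided by the reference odds (assumed definition).

    No guard for k == n or k == 0, which give infinite or zero odds.
    """
    p_hat = k / n
    return (p_hat / (1.0 - p_hat)) / (p0 / (1.0 - p0))

# Toy numbers: 120 of 200 cells in Cluster 1 are wild type; 40% are wild type overall.
z, p = proportion_z_test(120, 200, 0.40)
ratio = odds_ratio(120, 200, 0.40)
print(f"z = {z:.2f}, p = {p:.2e}, OR = {ratio:.2f}")
```

With these toy counts, the observed proportion (60%) sits well above the reference (40%), giving z ≈ 5.77 and an odds ratio of 2.25.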
Understanding Significance Criteria
Dual Threshold System
For a difference to be marked as significant (indicated by a star ★ on the plot), BOTH conditions must be met:
- Statistical Significance
  - P-value < significance level (default: 0.005)
  - Adjustable via the dropdown menu
- Biological Significance
  - For Enrichment: Odds Ratio > 2.0 (odds more than 2× higher)
  - For Depletion: Odds Ratio < 0.05 (odds more than 20× lower)
This dual-threshold approach ensures that only meaningful biological differences are highlighted, filtering out statistically significant but trivial changes.
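The dual-threshold rule can be written as a small predicate. This is a sketch of the logic as described here, using the default values listed above; the function name `is_starred` and its parameters are hypothetical.

```python
def is_starred(p_value, odds_ratio, mode="enriched", alpha=0.005):
    """Return True only when both the p-value and the odds-ratio thresholds are met."""
    if p_value >= alpha:
        return False  # fails the statistical threshold
    if mode == "enriched":
        return odds_ratio > 2.0
    if mode == "depleted":
        return odds_ratio < 0.05
    raise ValueError("mode must be 'enriched' or 'depleted'")

print(is_starred(1e-6, 2.25))              # significant and large effect
print(is_starred(1e-6, 1.30))              # significant but small effect
print(is_starred(1e-6, 0.01, "depleted"))  # significant and strongly depleted
```

The second call shows the filtering effect: a very small p-value alone does not earn a star when the odds ratio is close to 1.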
Using the Interface
Control Panel Features
Display Options
- Show % text: Toggle percentage labels on/off
- % display cutoff: Hide labels below a certain percentage threshold for cleaner visualization
Statistical Controls
- P value checkbox: Enable statistical analysis display
- Significance level dropdown: Adjust the p-value threshold (default: 0.005)
- Enriched/Depleted radio buttons:
  - Select “Enriched” to highlight over-represented populations (OR > 2)
  - Select “Depleted” to highlight under-represented populations (OR < 0.05)
Customization Options
- Value size: Adjust text size for percentages
- Width/Height: Resize the plot dimensions
- Customize labels: Modify category names for clarity
- Grid: Toggle gridlines on/off
Interpreting Results
Visual Indicators
- Star symbols (★): Appear next to categories that meet both p-value and odds ratio criteria
- Percentage labels: Show the exact proportion of cells in each category
- Color coding: Different colors represent different secondary variable categories
Statistical Interpretation
- No stars: The distribution is not significantly different from expected
- Stars with “Enriched” selected: That population is significantly over-represented
- Stars with “Depleted” selected: That population is significantly under-represented
Downloading Results
Statistical Data Export
When you enable the P value checkbox, two download options become available:
- Download p-values (csv): Export the complete p-value table
- Download odds ratios (csv): Export the complete odds ratio table
Important: Downloaded files contain ALL statistical results regardless of UI filter settings (enriched/depleted selection or significance threshold). This ensures you have complete data for publication or further analysis.
CSV File Format
Both exported CSV files follow the same structure:
- Rows: One row for each primary variable category (e.g., each Leiden cluster)
- Columns:
  - First column: Primary variable name (e.g., “Leiden clusters”)
  - Remaining columns: One for each secondary variable category (e.g., “G1”, “G2M”, “S” for cell cycle phases)
- Values:
  - P-value file: Statistical significance values for each combination
  - Odds ratio file: Effect size values for each combination
Example Output Structure
For a dataset with 15 clusters and 3 cell cycle phases, you’ll receive:
- P-values CSV: 15 rows × 4 columns (cluster name + 3 phase p-values)
- Odds ratios CSV: 15 rows × 4 columns (cluster name + 3 phase odds ratios)
These tables provide all the statistical information needed to identify enriched or depleted populations across your entire dataset.
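Because the exports are unfiltered, the thresholds can be re-applied offline. The sketch below assumes the layout described above (first column for the primary variable, one column per secondary category). The inline CSV text is invented toy data that stands in for the real downloads, and the helper names are hypothetical.

```python
import csv
import io

# Toy stand-ins for the two downloads; in practice, open the real files instead.
# All numbers below are invented purely to illustrate the layout.
P_VALUES_CSV = """Leiden clusters,G1,G2M,S
0,0.0001,0.2,0.03
1,0.4,0.00001,0.0004
"""
ODDS_RATIOS_CSV = """Leiden clusters,G1,G2M,S
0,2.6,0.9,1.4
1,1.1,0.03,2.2
"""

def read_table(text):
    """Parse one export into {primary: {secondary: float}}."""
    reader = csv.DictReader(io.StringIO(text))
    key = reader.fieldnames[0]  # first column holds the primary variable
    return {row[key]: {col: float(val) for col, val in row.items() if col != key}
            for row in reader}

def flagged(p_table, or_table, alpha=0.005, enrich=2.0, deplete=0.05):
    """List (primary, secondary, label) combinations meeting both thresholds."""
    hits = []
    for primary, p_values in p_table.items():
        for secondary, p in p_values.items():
            ratio = or_table[primary][secondary]
            if p < alpha and ratio > enrich:
                hits.append((primary, secondary, "enriched"))
            elif p < alpha and ratio < deplete:
                hits.append((primary, secondary, "depleted"))
    return hits

hits = flagged(read_table(P_VALUES_CSV), read_table(ODDS_RATIOS_CSV))
print(hits)
```

To use real exports, replace `io.StringIO(text)` with an opened file handle for each downloaded CSV.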
Practical Tips
Choosing the Right Variables
- Use biologically meaningful groupings as your secondary variable
- Ensure adequate cell numbers in each category for reliable statistics
- Consider biological relevance when setting significance thresholds
Advanced Analysis
For complex comparisons involving multiple conditions:
- Use the lasso tool to select specific cell populations
- Create custom cell sets (e.g., “healthy blood samples” vs “diseased blood samples”)
- Apply compositional analysis to these refined groups
Best Practices
- Always check both p-values and odds ratios when interpreting results
- Consider biological context when evaluating statistical significance
- Use the download feature to preserve complete statistical results
- Document your significance thresholds for reproducibility
Troubleshooting
No Significant Results?
- Check if your significance threshold is too stringent
- Ensure sufficient cell numbers in each category
- Verify that your metadata is correctly assigned
- Consider whether biological differences exist between conditions
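One way to judge whether a category has enough cells is the common rule of thumb for z-tests on proportions: the normal approximation is usually considered reasonable when both expected counts, n × p0 and n × (1 − p0), are at least about 10. The check below is a hypothetical helper based on that rule of thumb, not a feature of the tool.

```python
def normal_approx_ok(n, p0, minimum=10):
    """Rule-of-thumb check that a proportion z-test is reasonable.

    n       : number of cells in the primary category
    p0      : reference proportion of the secondary category
    minimum : smallest acceptable expected count (10 is a common convention)
    """
    return n * p0 >= minimum and n * (1.0 - p0) >= minimum

print(normal_approx_ok(200, 0.40))  # expected counts 80 and 120
print(normal_approx_ok(30, 0.10))   # expected count of only 3
```

Categories that fail this check are better treated as inconclusive than as evidence that no difference exists.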
Too Many Significant Results?
- Use a more stringent significance level (e.g., lower it from 0.005 to 0.001)
- Focus on the largest effect sizes by prioritising the combinations with the most extreme odds ratios
- Consider biological relevance of the findings
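Ranking by effect size is easier on a log scale, where an odds ratio of 0.5 and an odds ratio of 2 are equally far from "no effect". The sketch below ranks toy odds ratios by the absolute value of log2(OR). The labels and values are invented, the helper name is hypothetical, and it assumes every odds ratio is greater than zero.

```python
import math

def strongest_effects(odds_ratios, top=3):
    """Rank (label, odds ratio) pairs by |log2(OR)|, largest effect first."""
    ranked = sorted(odds_ratios, key=lambda item: abs(math.log2(item[1])), reverse=True)
    return ranked[:top]

# Invented values, e.g. taken from the odds ratio export.
toy = [
    ("Cluster 0 / G1", 2.6),
    ("Cluster 1 / G2M", 0.03),
    ("Cluster 1 / S", 2.2),
    ("Cluster 2 / G1", 8.0),
]
top_hits = strongest_effects(toy)
for label, ratio in top_hits:
    print(f"{label}: OR = {ratio}")
```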
Summary
The compositional analysis tool provides a statistically rigorous yet user-friendly way to explore cellular population dynamics across experimental conditions. By combining intuitive visualization with robust statistical testing, researchers can confidently identify and validate biologically meaningful changes in their single-cell data.
💡 Extra tips:
What if I want to compare cells across multiple conditions? For example, suppose you have samples from different tissues (blood and bone marrow) and different conditions (disease and healthy), and you want to visualise the cell type composition in blood samples only, for both the disease and healthy conditions.
Example steps:
- Use the lasso tool and Cell sets to create and save groups of cells that combine multiple conditions, for example ‘healthy blood sample’ and ‘disease blood sample’.
- Create a new custom categorical from the cell sets (learn more in: Cell Selections and Custom Categories).
- Use the Hide cells function to hide any uncategorised cells that are not of interest, so that they are excluded from the plot in the next step.
- Use the Stacked bar plot to plot the custom categorical against cell type.