Batch Effect Correction in scRNA-Seq Data
Compare scRNA-seq normalization and batch correction methods, with current Seurat, Scanpy, Monocle 3, scran, Harmony, BBKNN and scVI workflows.
Most errors in this part of a single-cell workflow begin by treating normalization, feature scaling, batch correction, and statistical adjustment as interchangeable. They are not.
- 01 Normalize: Address cell-level sampling depth and composition.
- 02 Integrate: Change the geometry used to compare cells across datasets.
- 03 Infer: Model batch, donor, and other covariates when estimating an effect.
The same matrix should not be expected to serve all three purposes.
This guide focuses on UMI-based scRNA-seq and single-nucleus RNA-seq. Full-length protocols such as Smart-seq2 have different count properties and should not inherit droplet-based defaults without checking the assumptions. For the broader workflow around quality control, feature selection, dimensionality reduction, and clustering, see Navigating the Complexity of Single-Cell RNA-Seq Data Analysis.
The practical rules
- Keep raw counts unchanged. Store normalized values, residuals, corrected embeddings, and corrected graphs in separate assays, layers, or reductions.
- Normalize every UMI dataset, but integrate only when the scientific question requires shared geometry across samples.
- Treat donor or subject as the biological replicate. Do not remove donor structure automatically, especially when donor is the unit of inference.
- Use integrated representations for clustering, visualization, and mapping. Use raw counts with sample-aware models for formal differential expression.
- Judge integration by biological conservation as well as batch mixing. A well-mixed UMAP can still be biologically wrong.
Start with the design, not the software
A computational method cannot identify a biological effect that is perfectly confounded with batch. If every control was processed with one chemistry and every case with another, the observed difference is compatible with both explanations. No integration algorithm can recover information that the design never contained.
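A confounded design can be detected from the sample sheet before any cells are analysed. The sketch below cross-tabulates condition against batch and flags the case where no batch contains more than one condition. The function name, the column names `condition` and `batch`, and both sample sheets are hypothetical and used only for illustration.

```python
from collections import defaultdict


def confounding_report(metadata, condition_key="condition", batch_key="batch"):
    """Cross-tabulate condition against batch and flag a fully confounded design."""
    table = defaultdict(lambda: defaultdict(int))
    for row in metadata:
        table[row[batch_key]][row[condition_key]] += 1
    # A batch that contains more than one condition is what allows
    # batch and condition effects to be separated.
    mixed_batches = sorted(b for b, conds in table.items() if len(conds) > 1)
    return {
        "table": {b: dict(c) for b, c in table.items()},
        "mixed_batches": mixed_batches,
        "fully_confounded": len(mixed_batches) == 0,
    }


# Hypothetical sample sheets, one row per library.
confounded = [
    {"sample": "s1", "condition": "control", "batch": "chemistry_a"},
    {"sample": "s2", "condition": "control", "batch": "chemistry_a"},
    {"sample": "s3", "condition": "case", "batch": "chemistry_b"},
    {"sample": "s4", "condition": "case", "batch": "chemistry_b"},
]
balanced = [
    {"sample": "s1", "condition": "control", "batch": "run_1"},
    {"sample": "s2", "condition": "case", "batch": "run_1"},
    {"sample": "s3", "condition": "control", "batch": "run_2"},
    {"sample": "s4", "condition": "case", "batch": "run_2"},
]

print(confounding_report(confounded)["fully_confounded"])  # True
print(confounding_report(balanced)["fully_confounded"])    # False
```

A check like this only detects complete confounding. Partially unbalanced designs still need inspection of the full table.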
The most useful safeguards are simple:
- distribute biological conditions across processing runs where possible
- include independent biological replicates in each condition
- keep tissue handling, dissociation, capture, library preparation, and sequencing procedures consistent
- record operator, site, run, lane, chemistry version, preservation method, tissue handling time, and other technical metadata
- distinguish sample, donor, library, lane, and experimental condition rather than collapsing them into one column called batch
Donor is not merely a nuisance variable. In patient studies it is normally the replicate, and donor-to-donor variation is part of the population being studied. Correcting donor away before inference can create an apparently clean cell-level result while discarding the variance needed to support a patient-level conclusion.
What each operation changes
| Operation | Typical output | Appropriate use | Common misuse |
|---|---|---|---|
| Library-size normalization | Log-normalized expression values | Visualization, feature selection, PCA, marker inspection | Treating values as absolute transcript abundance |
| Variance stabilization | Pearson residuals or related transformed values | Highly variable gene selection, PCA, clustering | Using residuals as raw counts or direct fold changes |
| Feature scaling | Gene-wise centered and scaled values | PCA and other distance-based procedures | Calling ScaleData() or sc.pp.scale() a normalization method |
| Embedding correction | Corrected PCA or latent coordinates | Neighbour graph, clustering, UMAP, reference mapping | Using coordinates for gene-level differential expression |
| Graph correction | Batch-balanced neighbour graph | Clustering, UMAP, graph-based trajectory analysis | Assuming a corrected expression matrix exists |
| Statistical adjustment | Model coefficient and uncertainty | Differential expression and effect estimation | Substituting an integrated matrix for a sample-aware model |
This distinction explains why modern single-cell objects usually retain several representations at once.
Log normalization
Library-size log normalization divides each gene count by the total counts in the cell, scales the result to a common total, and applies a logarithmic transform. A common form is:
log(1 + scale_factor × gene_count / cell_total)
The method is fast, transparent, and interoperable. It remains a sensible baseline for many UMI datasets.
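As a minimal sketch, the formula can be written out in plain Python. This illustrates the arithmetic only and is not the implementation used by any particular package; the function name and the toy counts are hypothetical, and the scale factor of 10,000 is the CP10k convention.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Library-size log normalization.

    counts: list of cells, each a list of per-gene UMI counts.
    Returns log(1 + scale_factor * gene_count / cell_total) per entry.
    """
    normalized = []
    for cell in counts:
        cell_total = sum(cell)
        if cell_total == 0:
            # A cell with no counts has no defined proportions; keep zeros.
            normalized.append([0.0] * len(cell))
            continue
        normalized.append(
            [math.log1p(scale_factor * c / cell_total) for c in cell]
        )
    return normalized


toy_counts = [
    [10, 0, 90],    # shallow cell, 100 UMIs
    [100, 0, 900],  # same composition, sampled ten times deeper
]
norm = log_normalize(toy_counts)
print(norm[0] == norm[1])  # True: the depth difference is removed
print(norm[0][1])          # 0.0: zeros stay zero, nothing is imputed
```

The two toy cells differ only in depth, so they receive identical normalized values. That is the behaviour the method is designed to produce, and it is also the source of the assumptions discussed next.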
What it assumes
Total-count scaling treats differences in cell totals largely as sampling-depth differences. That is only an approximation. Total UMI count also reflects true RNA content, cell size, transcriptional activity, and the abundance of highly expressed programs.
The main limitation is therefore composition bias, not a generic failure to handle zeros. If one condition strongly induces a small set of transcripts, those genes consume a larger fraction of the library. After total-count scaling, unrelated genes can appear relatively lower even when their absolute molecule number has not changed. Likewise, equalizing library totals can remove a real global increase in RNA content.
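A small worked example makes the composition effect concrete. The gene labels and counts below are hypothetical: one gene is induced, while the two other genes keep exactly the same number of molecules.

```python
def cp10k(cell, scale_factor=10_000):
    """Scale one cell's counts to a common total (no log, to keep the arithmetic visible)."""
    total = sum(cell)
    return [scale_factor * c / total for c in cell]


# Hypothetical genes: [induced_gene, unrelated_a, unrelated_b]
baseline = [100, 450, 450]
induced = [1100, 450, 450]  # only the first gene changes in absolute molecules

base_scaled = cp10k(baseline)
induced_scaled = cp10k(induced)

print(base_scaled)     # [1000.0, 4500.0, 4500.0]
print(induced_scaled)  # [5500.0, 2250.0, 2250.0]
```

The unrelated genes appear to halve after scaling although their molecule counts are unchanged, purely because the induced gene takes a larger share of the library.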
Zeros in UMI data are usually expected sampling outcomes. Log normalization does not impute them, nor should it be described as a treatment for "dropout". Imputation is a separate modelling choice and can create artificial correlations if used without a clear downstream reason.
How current tools implement log normalization
SCTransform and residual-based normalization
SCTransform models UMI counts with a regularized negative binomial model and returns variance-stabilized Pearson residuals. It was designed to reduce the dependence of downstream variance on sequencing depth while combining normalization, variable-feature selection, and scaling in one procedure.
In current Seurat v5, vst.flavor = "v2" is the default. The SCT assay has three distinct representations:
- counts: depth-corrected UMI values
- data: log1p of corrected counts
- scale.data: Pearson residuals, normally restricted to variable genes
obj <- SCTransform(
  obj,
  vst.flavor = "v2",
  verbose = FALSE
)
Pearson residuals are well suited to PCA and clustering. They are not raw observations and should not be treated as ordinary normalized counts for sample-level differential expression.
vars.to.regress deserves particular care. Regressing mitochondrial percentage, cell-cycle scores, or other covariates is not automatically an improvement. A variable should be removed only when it is a nuisance for the stated question and the design contains enough information to separate it from biology. Cell cycle, for example, is a real biological programme in developmental, regenerative, and tumour datasets.
Equivalent options in other tools
- Scanpy: scanpy.experimental.pp.normalize_pearson_residuals() provides analytic Pearson residual normalization from raw counts. It is related to SCTransform, but it is not an implementation of Seurat's regularized per-gene model.
- Monocle 3: no native SCTransform workflow. Monocle can receive transformed data from another object, but trajectory and model assumptions should be checked rather than mixing representations silently.
- ScarfWeb: uses Scarf's library-size and feature-subset normalization, not SCTransform.
A Scanpy residual workflow looks like this:
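import scanpy as sc

# Keep the raw UMI counts in their own layer before any transformation.
adata.layers["counts"] = adata.X.copy()
# Compute residuals from the raw counts without overwriting that layer.
residuals = sc.experimental.pp.normalize_pearson_residuals(
    adata,
    layer="counts",
    inplace=False,
)
# Store the residuals alongside the counts, not in place of them.
adata.layers["pearson_residuals"] = residuals["X"]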
The function expects integer raw counts. Its default shared overdispersion and clipping rules differ from SCTransform, so results should not be described as interchangeable.
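To show what a shared overdispersion and a clipping rule mean in practice, here is a plain-Python sketch of the analytic Pearson residual calculation. It is an illustration under stated assumptions, not Scanpy's code: theta = 100 and a clipping bound of sqrt(number of cells) are chosen to mirror the defaults described in the Scanpy documentation and should be confirmed against the installed version. The function name, the toy matrix, and the handling of all-zero genes are hypothetical choices for this sketch.

```python
import math


def analytic_pearson_residuals(counts, theta=100.0, clip=None):
    """Pearson residuals under a negative binomial null with one shared theta.

    counts: list of cells, each a list of per-gene raw UMI counts.
    Expected value: mu = cell_total * gene_total / grand_total.
    Residual: (x - mu) / sqrt(mu + mu^2 / theta), clipped to [-clip, clip].
    """
    n_cells = len(counts)
    n_genes = len(counts[0])
    cell_totals = [sum(cell) for cell in counts]
    gene_totals = [sum(cell[g] for cell in counts) for g in range(n_genes)]
    grand_total = sum(cell_totals)
    if clip is None:
        clip = math.sqrt(n_cells)  # assumed default clipping bound

    residuals = []
    for c, cell in enumerate(counts):
        row = []
        for g, x in enumerate(cell):
            mu = cell_totals[c] * gene_totals[g] / grand_total
            if mu == 0:
                # Sketch-only choice: all-zero genes or cells get a zero residual.
                row.append(0.0)
                continue
            r = (x - mu) / math.sqrt(mu + mu * mu / theta)
            row.append(max(-clip, min(clip, r)))
        residuals.append(row)
    return residuals


# Three cells share one profile; the fourth expresses a marker gene.
toy = [[0, 100], [0, 100], [0, 100], [80, 20]]
res = analytic_pearson_residuals(toy)
print(res[3][0])  # 2.0: a large positive residual, clipped at sqrt(4)
print(res[0][0])  # -2.0: clipped at the lower bound
```

The single theta is applied to every gene, which is the main structural difference from a regularized per-gene model.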
scran deconvolution size factors
Simple library-size normalization can be biased when a few abundant genes or strongly asymmetric cell populations dominate total counts. The Bioconductor package scran addresses this by pooling counts from groups of cells, estimating size factors on the pooled profiles, and deconvolving them back to cell-specific factors.
For heterogeneous datasets, preliminary clustering relaxes the assumption that most genes are not differentially expressed: it no longer has to hold across the entire dataset, only within each cluster. Size factors are estimated within these more comparable groups and then rescaled across groups.
library(scran)
library(scuttle)
set.seed(100)
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster = clusters)
sce <- logNormCounts(sce)
The raw counts assay remains intact and normalized log-expression is stored in logcounts.
scran is particularly useful when composition bias is plausible, but it is not assumption-free. Within each pre-cluster, enough genes must still behave consistently for a stable relative size factor to be estimated. Very small or highly unusual populations need inspection rather than blind pooling.
How other tools use scran
Seurat, Scanpy, Monocle 3, and ScarfWeb do not run scran deconvolution as their default normalization. A scran-normalized matrix can be transferred between Bioconductor, AnnData, and Seurat objects, but the raw count layer and size factors should travel with it. Otherwise, later functions may mistake log-normalized values for counts.
CLR normalization is mainly for antibody tags
Centered log-ratio normalization is widely used for antibody-derived tags and hashtag oligonucleotides in CITE-seq. It should not be presented as a general replacement for RNA normalization.
In Seurat, a common ADT workflow is:
obj <- NormalizeData(
  obj,
  assay = "ADT",
  normalization.method = "CLR",
  margin = 2
)
Scarf's ADT assay uses CLR normalization by default. For RNA counts, both tools use other strategies.
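The transform itself is short. The sketch below shows a log1p-based CLR variant of the kind used in Seurat-style ADT workflows, applied to one vector of tag counts. Treat the exact form as an assumption and confirm it against the source of the tool in use. Whether the vector is one cell's tags or one tag across cells depends on the margin chosen. The function name and counts are hypothetical.

```python
import math


def clr_log1p(values):
    """CLR-style transform of one vector of non-negative tag counts.

    The denominator is a geometric-mean-like term built from log1p of the
    positive entries, averaged over the full vector length.
    """
    if not values:
        return []
    log_mean = sum(math.log1p(v) for v in values if v > 0) / len(values)
    denom = math.exp(log_mean)
    return [math.log1p(v / denom) for v in values]


# Hypothetical antibody-tag counts for one cell.
tags = [0, 10, 1000]
out = clr_log1p(tags)
print(out[0])                    # 0.0: an absent tag stays at zero
print(out[0] < out[1] < out[2])  # True: ordering within the vector is preserved
```

Because each value is expressed relative to the other tags in the same vector, the result is compositional. That suits a small antibody panel far better than a transcriptome of thousands of mostly zero genes.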
For multimodal data, modality-specific normalization is essential. RNA, antibody counts, chromatin accessibility, and spatial measurements have different noise models. Forcing all modalities through one transform is usually less defensible than integrating modality-specific representations later.
Methods that should not be routine defaults
Quantile normalization forces samples or cells toward identical distributions. That assumption is poorly matched to heterogeneous UMI data and can erase real differences in RNA composition.
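A toy example shows how strong that assumption is. This is a naive sketch of quantile normalization (ties are broken by gene order), with hypothetical counts for two cells that have very different expression profiles.

```python
def quantile_normalize(cells):
    """Naive quantile normalization across cells.

    Each cell's k-th smallest value is replaced by the mean of the
    k-th smallest values across all cells.
    """
    n_genes = len(cells[0])
    sorted_cells = [sorted(cell) for cell in cells]
    reference = [
        sum(s[k] for s in sorted_cells) / len(cells) for k in range(n_genes)
    ]
    out = []
    for cell in cells:
        # Rank genes within the cell, then map each rank to the reference value.
        order = sorted(range(n_genes), key=lambda g: cell[g])
        normalized = [0.0] * n_genes
        for rank, g in enumerate(order):
            normalized[g] = reference[rank]
        out.append(normalized)
    return out


cells = [
    [0, 5, 100],   # one dominant gene
    [10, 20, 30],  # evenly spread expression
]
qn = quantile_normalize(cells)
print(qn[0])  # [5.0, 12.5, 65.0]
print(qn[1])  # [5.0, 12.5, 65.0]
```

After the transform both cells carry exactly the same set of values, so any real difference in how expression is distributed across genes has been removed by construction.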
Imputation is not normalization. It can be useful in narrowly defined tasks, but it changes the covariance structure and can manufacture co-expression. Keep imputed values out of primary differential-expression tests unless the method and inferential target explicitly support them.
Cell-level regression of every visible covariate is not a substitute for a balanced design. Removing a variable from an embedding does not create biological replication or make a confounded contrast estimable.

Figure 1: Normalization and integration change different parts of the workflow. Match the method to the representation you need for the next analytical step.
A current tool-by-tool normalization summary
| Tool | RNA normalization used in a standard workflow | Main output | Important detail |
|---|---|---|---|
| Seurat v5 | NormalizeData(LogNormalize) or SCTransform(vst.flavor = "v2") | Log-normalized data, or SCT corrected counts and residuals | LogNormalize defaults to 10,000; SCT residuals belong in scale.data |
| Scanpy | normalize_total() then log1p() | Log-normalized adata.X or selected layer | Default target_sum=None uses the median library total; set 1e4 for CP10k |
| Monocle 3 | Size factors plus preprocess_cds(norm_method = "log") | Log2 size-factor-normalized values used for PCA | Alignment is a separate align_cds() step |
| scran | Pooling and deconvolution with computeSumFactors() | Size factors and logcounts | Useful under composition bias; preliminary clustering is recommended for heterogeneous data |
| Scarf / ScarfWeb | Library-size normalization, selected-feature re-normalization, log transform before PCA | Normalized values used for graph construction | Harmony is optional and acts after normalization |
| Seurat ADT / Scarf ADT | CLR | Normalized antibody-tag values | Appropriate for ADT or HTO data, not a general RNA default |
When batch correction is justified
Multiple samples do not automatically require integration. Integration is justified when technical structure prevents the analysis from comparing shared biological populations across datasets.
Typical reasons include:
- the same marker-defined cell type separates by run, chemistry, site, or processing protocol
- a joint atlas or common reference embedding is required
- label transfer or query mapping is a core objective
- matched cell states need to be compared across donors after shared structure has been established
Integration is risky when the biological state of interest occurs in only one condition, when cell-type composition is highly asymmetric, or when condition and technical batch are confounded. Anchor-based and nearest-neighbour methods cannot reliably match a disease-specific state to a counterpart that does not exist in the control data. Forcing that match can erase the signal.
Before integration, check that datasets use compatible gene identifiers, genome builds, and feature definitions. A large mismatch in the gene universe can look like a batch effect even when the main problem is inconsistent preprocessing.
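A quick overlap report is usually enough to catch this kind of mismatch. The sketch below compares the gene identifiers of several datasets using sets. The function name, the dataset names, and the identifiers are hypothetical placeholders, and the ENSG prefix test is only a crude hint that one dataset uses Ensembl-style IDs while another uses symbols.

```python
def gene_universe_report(datasets):
    """Summarize gene-identifier overlap.

    datasets: dict mapping dataset name -> iterable of gene identifiers
    (at least one dataset is required).
    """
    gene_sets = {name: set(genes) for name, genes in datasets.items()}
    shared = set.intersection(*gene_sets.values())
    union = set.union(*gene_sets.values())
    report = {
        "n_shared": len(shared),
        "n_union": len(union),
        "jaccard": len(shared) / len(union) if union else 0.0,
        "per_dataset": {},
    }
    for name, genes in gene_sets.items():
        n = len(genes)
        report["per_dataset"][name] = {
            "n_genes": n,
            "fraction_shared": len(shared) / n if n else 0.0,
            # Crude format hint: Ensembl-style human gene IDs start with "ENSG".
            "fraction_ensembl_like": (
                sum(g.startswith("ENSG") for g in genes) / n if n else 0.0
            ),
        }
    return report


# Placeholder identifiers: one dataset keyed by symbol, one by Ensembl-style ID.
datasets = {
    "site_a": ["CD3E", "MS4A1", "LYZ"],
    "site_b": ["ENSG_placeholder_1", "ENSG_placeholder_2", "ENSG_placeholder_3"],
}
report = gene_universe_report(datasets)
print(report["n_shared"])  # 0: nothing to integrate on until identifiers are harmonized
```

A shared fraction near zero with mixed identifier formats points to preprocessing, not biology or batch, as the first thing to fix.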
What the main integration tools actually change
| Tool | Input | What it changes | Use the result for | Main caution |
|---|---|---|---|---|
| Harmony | PCA or another low-dimensional embedding | Corrected embedding | Neighbours, clustering, UMAP | Quality depends on the upstream representation and covariate choice |
| Seurat v5 integration | Seurat assay layers and a selected integration method | Usually a corrected reduction in current IntegrateLayers() workflows | Joint clustering, visualization, mapping | CCA can be aggressive; RPCA is generally more conservative when shared states are well represented |
| BBKNN | PCA coordinates and batch labels | Batch-balanced neighbour graph | UMAP, clustering, graph analyses | Counts and PCA remain unchanged; graph topology depends on neighbour settings and batch composition |
| scVI / scANVI | Raw UMI counts plus batch and optional labels | Probabilistic latent representation | Atlas integration, clustering, reference mapping | Covariates supplied as nuisance factors are deliberately minimized in the latent space |
| Monocle 3 align_cds() | PCA or LSI coordinates | Aligned reduced coordinates | Clustering and trajectory construction | Discrete alignment assumes comparable populations; continuous regression can remove biology if mis-specified |
| ScarfWeb Harmony workflow | Scarf-normalized PCA representation | Harmony-corrected embedding used for the graph | Joint exploration and clustering | The selected batch covariate must not be the biological contrast being interpreted |
How to evaluate an integration
A clean UMAP is not evidence that integration succeeded. UMAP is a two-dimensional projection of a chosen graph or latent space, and visually attractive mixing can be produced by overcorrection.
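Evaluation should ask two separate questions: whether technical batches mix within shared populations, and whether biological structure is conserved.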
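Batch mixing and biological conservation can be scored separately on the same neighbour graph. The sketch below is a deliberately simplified stand-in for established benchmarking metrics: for each cell it takes the k nearest neighbours in an embedding and reports the fraction from a different batch and the fraction sharing the cell's label. The function names, the toy coordinates, and the labels are hypothetical, and the labels are assumed to come from an annotation that is independent of the integration.

```python
import math


def knn_indices(embedding, k):
    """Brute-force k nearest neighbours (Euclidean), excluding the cell itself."""
    neighbours = []
    for i, a in enumerate(embedding):
        dists = sorted(
            (math.dist(a, b), j) for j, b in enumerate(embedding) if j != i
        )
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def integration_scores(embedding, batches, labels, k=3):
    """Mean fraction of neighbours from another batch, and with the same label."""
    mixing, purity = 0.0, 0.0
    for i, nbrs in enumerate(knn_indices(embedding, k)):
        mixing += sum(batches[j] != batches[i] for j in nbrs) / k
        purity += sum(labels[j] == labels[i] for j in nbrs) / k
    n = len(embedding)
    return {"batch_mixing": mixing / n, "label_purity": purity / n}


batches = ["A", "A", "B", "B", "A", "A", "B", "B"]
labels = ["T", "T", "T", "T", "B", "B", "B", "B"]

# Two cell types stay apart, batches interleave within each type.
conserved = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
# Everything is interleaved on one line: batches mix, but so do cell types.
overmixed = [(0, 0), (4, 0), (2, 0), (6, 0), (3, 0), (7, 0), (1, 0), (5, 0)]

print(integration_scores(conserved, batches, labels))
print(integration_scores(overmixed, batches, labels))
```

In this toy case both embeddings reach the same batch-mixing score of about 0.67, while label purity falls from 1.0 to about 0.33 in the overmixed one. A mixing score alone would not distinguish them.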
Downstream analysis should use the right representation
A practical decision table
| Situation | Reasonable starting point | What to retain for inference |
|---|---|---|
| One routine UMI dataset, modest depth variation | Seurat LogNormalize, Scanpy CP10k plus log1p, or Scarf normalization | Raw UMI counts and sample metadata |
| Strong composition bias or heterogeneous cell populations | scran deconvolution or SCTransform, compared with a log-normalized baseline | Raw counts and estimated size factors or SCT model metadata |
| Seurat multi-sample analysis with shared populations | RPCA or Harmony after LogNormalize or SCT | RNA counts; use integrated reduction for clustering |
| Scanpy exploratory integration | BBKNN or Harmony on PCA | Raw counts; BBKNN graph or Harmony coordinates for structure |
| Large atlas or repeated reference mapping | scVI/scANVI or a Seurat reference workflow | Raw counts, model, reference labels, and held-out validation set |
| Trajectory analysis across batches | Conservative alignment only after checking stage and batch are separable | Raw counts plus uncorrected and corrected coordinates |
| CITE-seq antibody tags | CLR for ADT, separate RNA normalization | Raw RNA and ADT counts in separate modalities |
| Formal condition-level differential expression | Sample-level pseudobulk on raw counts with an estimable design | Raw counts aggregated by biological replicate |
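The last row of the table depends on one aggregation step, sketched here in plain Python. Raw counts are summed per donor and cell type so that the donor, not the cell, becomes the unit passed to a count-based model. The function name, donor IDs, cell-type labels, and counts are hypothetical.

```python
from collections import defaultdict


def pseudobulk(counts, donors, cell_types):
    """Sum raw counts per (donor, cell_type).

    counts: list of cells, each a list of per-gene raw UMI counts.
    Returns (summed profiles, number of cells contributing to each profile).
    """
    n_genes = len(counts[0])
    sums = defaultdict(lambda: [0] * n_genes)
    n_cells = defaultdict(int)
    for cell, donor, cell_type in zip(counts, donors, cell_types):
        key = (donor, cell_type)
        n_cells[key] += 1
        profile = sums[key]
        for g, c in enumerate(cell):
            profile[g] += c
    return dict(sums), dict(n_cells)


# Hypothetical raw counts for four cells and two genes.
counts = [[1, 0], [2, 3], [0, 5], [4, 4]]
donors = ["d1", "d1", "d2", "d2"]
cell_types = ["T", "T", "T", "B"]

profiles, sizes = pseudobulk(counts, donors, cell_types)
print(profiles[("d1", "T")])  # [3, 3]
print(sizes[("d1", "T")])     # 2
```

Keeping the number of contributing cells alongside each profile makes it easy to exclude donor and cell-type combinations that are too sparse to model.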
Current official resources
The following resources are preferable to secondary summaries when checking implementation details:
- Seurat normalization reference
- SCTransform reference
- Seurat v5 integration methods
- Scanpy normalize_total
- Scanpy log1p
- Scanpy analytic Pearson residuals
- Monocle 3 getting started
- Monocle 3 cell alignment
- Bioconductor OSCA normalization chapter
- Harmony repository and current usage
- BBKNN documentation
- scvi-tools SCVI reference
- Scarf API documentation
- ScarfWeb batch correction guide
For external biological references and annotation checks, use curated portals such as the Human Cell Atlas Data Portal, CZ CELLxGENE Discover, and the Cell Ontology. These are useful reference resources, but their labels are not universal ground truth. Atlas annotations inherit the sampling, preprocessing, and expert decisions of the source study.
ScarfWeb provides the Scarf normalization and optional Harmony integration workflow through a browser interface. The methodological responsibilities remain the same: inspect the uncorrected data first, choose correction variables deliberately, preserve raw counts, and validate that the biology survives the correction.