Batch Effect Correction and Normalization in scRNA-Seq

Most errors in this part of a single-cell workflow begin by treating normalization, feature scaling, batch correction, and statistical adjustment as interchangeable. They are not.

01 Normalize: Address cell-level sampling depth and composition.
02 Integrate: Change the geometry used to compare cells across datasets.
03 Infer: Model batch, donor, and other covariates when estimating an effect.

The same matrix should not be expected to serve all three purposes.

This guide focuses on UMI-based scRNA-seq and single-nucleus RNA-seq. Full-length protocols such as Smart-seq2 have different count properties and should not inherit droplet-based defaults without checking the assumptions. For the broader workflow around quality control, feature selection, dimensionality reduction, and clustering, see Navigating the Complexity of Single-Cell RNA-Seq Data Analysis.

The practical rules

Keep raw counts unchanged. Store normalized values, residuals, corrected embeddings, and corrected graphs in separate assays, layers, or reductions.
Normalize every UMI dataset, but integrate only when the scientific question requires shared geometry across samples.
Treat donor or subject as the biological replicate. Do not remove donor structure automatically, especially when donor is the unit of inference.
Use integrated representations for clustering, visualization, and mapping. Use raw counts with sample-aware models for formal differential expression.
Judge integration by biological conservation as well as batch mixing. A well-mixed UMAP can still be biologically wrong.

Start with the design, not the software

A computational method cannot identify a biological effect that is perfectly confounded with batch. If every control was processed with one chemistry and every case with another, the observed difference is compatible with both explanations. No integration algorithm can recover information that the design never contained.
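
A quick way to see this problem before any software is run is to tabulate condition against batch in the sample sheet. The sketch below is a minimal illustration under stated assumptions: the sample records, the column names, and the helper names `conditions_per_batch` and `no_batch_spans_conditions` are hypothetical, and the flag is deliberately narrow (it only reports whether any batch contains more than one condition).

```python
from collections import defaultdict

def conditions_per_batch(samples, condition_key="condition", batch_key="batch"):
    """Map each batch to the set of conditions observed in it."""
    seen = defaultdict(set)
    for sample in samples:
        seen[sample[batch_key]].add(sample[condition_key])
    return dict(seen)

def no_batch_spans_conditions(samples, condition_key="condition", batch_key="batch"):
    """Narrow confounding flag: True when every batch holds a single condition."""
    per_batch = conditions_per_batch(samples, condition_key, batch_key)
    return all(len(conditions) == 1 for conditions in per_batch.values())

# Hypothetical sample sheet: every control on one chemistry, every case on another.
confounded_design = [
    {"sample": "S1", "condition": "control", "batch": "chemistry_a"},
    {"sample": "S2", "condition": "control", "batch": "chemistry_a"},
    {"sample": "S3", "condition": "case", "batch": "chemistry_b"},
    {"sample": "S4", "condition": "case", "batch": "chemistry_b"},
]

# Hypothetical sample sheet: both conditions appear on both chemistries.
crossed_design = [
    {"sample": "S1", "condition": "control", "batch": "chemistry_a"},
    {"sample": "S2", "condition": "case", "batch": "chemistry_a"},
    {"sample": "S3", "condition": "control", "batch": "chemistry_b"},
    {"sample": "S4", "condition": "case", "batch": "chemistry_b"},
]

confounded_flag = no_batch_spans_conditions(confounded_design)
crossed_flag = no_batch_spans_conditions(crossed_design)
print(confounded_flag, crossed_flag)
```

A flagged design is a reason to revisit the experiment or to restrict the claims, not a reason to try a stronger integration method.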

The most useful safeguards are simple:

distribute biological conditions across processing runs where possible
include independent biological replicates in each condition
keep tissue handling, dissociation, capture, library preparation, and sequencing procedures consistent
record operator, site, run, lane, chemistry version, preservation method, tissue handling time, and other technical metadata
distinguish sample, donor, library, lane, and experimental condition rather than collapsing them into one column called batch

Donor is not merely a nuisance variable. In patient studies it is normally the replicate, and donor-to-donor variation is part of the population being studied. Correcting donor away before inference can create an apparently clean cell-level result while discarding the variance needed to support a patient-level conclusion.

What each operation changes

Operation	Typical output	Appropriate use	Common misuse
Library-size normalization	Log-normalized expression values	Visualization, feature selection, PCA, marker inspection	Treating values as absolute transcript abundance
Variance stabilization	Pearson residuals or related transformed values	Highly variable gene selection, PCA, clustering	Using residuals as raw counts or direct fold changes
Feature scaling	Gene-wise centered and scaled values	PCA and other distance-based procedures	Calling `ScaleData()` or `sc.pp.scale()` a normalization method
Embedding correction	Corrected PCA or latent coordinates	Neighbour graph, clustering, UMAP, reference mapping	Using coordinates for gene-level differential expression
Graph correction	Batch-balanced neighbour graph	Clustering, UMAP, graph-based trajectory analysis	Assuming a corrected expression matrix exists
Statistical adjustment	Model coefficient and uncertainty	Differential expression and effect estimation	Substituting an integrated matrix for a sample-aware model

This distinction explains why modern single-cell objects usually retain several representations at once.

Log normalization

Library-size log normalization divides each gene count by the total counts in the cell, scales the result to a common total, and applies a logarithmic transform. A common form is:

log(1 + scale_factor × gene_count / cell_total)

The method is fast, transparent, and interoperable. It remains a sensible baseline for many UMI datasets.
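
The formula can be worked through for a single cell in a few lines. This is a minimal sketch in plain Python rather than any tool's implementation; the function name `log_normalize` and the toy counts are assumptions made for illustration.

```python
import math

def log_normalize(cell_counts, scale_factor=1e4):
    """Apply log(1 + scale_factor * count / cell_total) to one cell."""
    total = sum(cell_counts)
    if total == 0:
        # An empty cell carries no information; return zeros rather than divide by zero.
        return [0.0 for _ in cell_counts]
    return [math.log1p(scale_factor * count / total) for count in cell_counts]

# Hypothetical cell with four genes and 100 UMIs in total.
shallow_cell = [0, 3, 12, 85]
# The same expression profile sampled at twice the depth.
deep_cell = [0, 6, 24, 170]

shallow_norm = log_normalize(shallow_cell)
deep_norm = log_normalize(deep_cell)
print(shallow_norm)
print(deep_norm)
```

The two cells end up with the same normalized values because only the proportions enter the formula. That is the intended depth correction, and it is also why a genuine global change in RNA content is invisible after this step.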

What it assumes

Total-count scaling treats differences in cell totals largely as sampling-depth differences. That is only an approximation. Total UMI count also reflects true RNA content, cell size, transcriptional activity, and the abundance of highly expressed programs.

The main limitation is therefore composition bias, not a generic failure to handle zeros. If one condition strongly induces a small set of transcripts, those genes consume a larger fraction of the library. After total-count scaling, unrelated genes can appear relatively lower even when their absolute molecule number has not changed. Likewise, equalizing library totals can remove a real global increase in RNA content.
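
A small numeric example makes the composition effect concrete. The counts below are invented for illustration, and `cp10k` is a hypothetical helper that performs only the total-count scaling step.

```python
def cp10k(counts, scale_factor=1e4):
    """Scale one cell's counts to a common total (no log transform)."""
    total = sum(counts)
    return [scale_factor * count / total for count in counts]

# Gene order: [induced_gene, unrelated_gene_a, unrelated_gene_b] (hypothetical).
control_cell = [100, 450, 450]      # 1,000 molecules in total
stimulated_cell = [1000, 450, 450]  # only the first gene changed

control_scaled = cp10k(control_cell)
stimulated_scaled = cp10k(stimulated_cell)

# Ratio of the unrelated gene after scaling; its absolute count did not change.
apparent_ratio = stimulated_scaled[1] / control_scaled[1]
print(control_scaled, stimulated_scaled, apparent_ratio)
```

The unrelated genes keep the same molecule count in both cells, yet their scaled values drop to roughly half in the stimulated cell because the induced gene now occupies a larger share of the library.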

Zeros in UMI data are usually expected sampling outcomes. Log normalization does not impute them, nor should it be described as a treatment for "dropout". Imputation is a separate modelling choice and can create artificial correlations if used without a clear downstream reason.

How current tools implement log normalization

Seurat

Seurat v5 uses NormalizeData() with normalization.method = "LogNormalize" by default. Counts are divided by the cell total, multiplied by 10,000, and transformed with the natural log1p function.

library(Seurat)

# Divide by the cell total, multiply by 10,000, then apply natural log1p
obj <- NormalizeData(
  obj,
  normalization.method = "LogNormalize",
  scale.factor = 1e4
)

The result is written to the assay's data layer. ScaleData() is a later gene-wise centering and scaling step, not another library-size normalization.

Scanpy

Scanpy separates depth normalization from log transformation. sc.pp.normalize_total() uses the median pre-normalization library size when target_sum=None. Set target_sum=1e4 to reproduce a CP10k-style workflow comparable to Seurat's default, then apply sc.pp.log1p().

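import scanpy as sc

# Keep the raw UMI counts before adata.X is overwritten
adata.layers["counts"] = adata.X.copy()
# Scale each cell to 10,000 total counts, then apply natural log1p
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)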

sc.pp.log1p() uses the natural logarithm unless another base is supplied. sc.pp.scale() acts later on genes and should not be confused with count-depth normalization.

Monocle 3

Monocle 3 estimates cell-specific size factors and applies them during preprocess_cds(). The default norm_method = "log" performs size-factor normalization followed by a log2 transform with a pseudocount of one.

library(monocle3)

# Size factors are applied inside preprocess_cds()
cds <- estimate_size_factors(cds)
cds <- preprocess_cds(
  cds,
  norm_method = "log",
  num_dim = 50
)

Because Monocle 3 uses log2 while Seurat and Scanpy use the natural logarithm by default, the normalized values are monotonic but not numerically identical. That matters when values are exported between tools or thresholds are copied from one ecosystem to another.
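
When only the logarithm base differs, values and thresholds can be converted by a constant factor. The sketch below assumes the value inside the logarithm is the same in both tools, which is not guaranteed because size-factor scaling and CP10k scaling differ; the helper names are hypothetical.

```python
import math

def ln1p_to_log2_1p(value):
    """Convert ln(1 + x) to log2(1 + x) for the same underlying x."""
    return value / math.log(2)

def log2_1p_to_ln1p(value):
    """Convert log2(1 + x) to ln(1 + x) for the same underlying x."""
    return value * math.log(2)

scaled_expression = 7.0                        # hypothetical depth-scaled value
natural_value = math.log1p(scaled_expression)  # ln(8)
base2_value = ln1p_to_log2_1p(natural_value)   # log2(8)

# A cutoff chosen on the natural-log scale needs the same conversion.
ln_threshold = 1.0
log2_threshold = ln1p_to_log2_1p(ln_threshold)
print(natural_value, base2_value, log2_threshold)
```

The conversion handles the base only. If the two tools also scale counts differently before the logarithm, no constant factor maps one set of values onto the other.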

Scarf and ScarfWeb

Scarf's RNA assay uses library-size normalization. During graph construction, RNA values are log-transformed by default and the selected feature set is re-normalized before PCA by default. In practice, this means the highly variable gene subset is normalized again using the counts within that selected set rather than simply reusing a whole-transcriptome CP10k matrix.

ds.make_graph(
    from_assay="RNA",
    feat_key="<selected_feature_key>",
    log_transform=True,
    renormalize_subset=True,
)

ScarfWeb exposes this Scarf-based workflow through the browser. Harmony correction is optional and occurs after normalization in the low-dimensional workflow; it is not part of the library-size normalization itself.

SCTransform and residual-based normalization

SCTransform models UMI counts with a regularized negative binomial model and returns variance-stabilized Pearson residuals. It was designed to reduce the dependence of downstream variance on sequencing depth while combining normalization, variable-feature selection, and scaling in one procedure.

In current Seurat v5, vst.flavor = "v2" is the default. The SCT assay has three distinct representations:

counts: depth-corrected UMI values
data: log1p of corrected counts
scale.data: Pearson residuals, normally restricted to variable genes

obj <- SCTransform(
  obj,
  vst.flavor = "v2",
  verbose = FALSE
)

Pearson residuals are well suited to PCA and clustering. They are not raw observations and should not be treated as ordinary normalized counts for sample-level differential expression.

vars.to.regress deserves particular care. Regressing mitochondrial percentage, cell-cycle scores, or other covariates is not automatically an improvement. A variable should be removed only when it is a nuisance for the stated question and the design contains enough information to separate it from biology. Cell cycle, for example, is a real biological programme in developmental, regenerative, and tumour datasets.

Equivalent options in other tools

Scanpy: scanpy.experimental.pp.normalize_pearson_residuals() provides analytic Pearson residual normalization from raw counts. It is related to SCTransform, but it is not an implementation of Seurat's regularized per-gene model.
Monocle 3: no native SCTransform workflow. Monocle can receive transformed data from another object, but trajectory and model assumptions should be checked rather than mixing representations silently.
ScarfWeb: uses Scarf's library-size and feature-subset normalization, not SCTransform.

A Scanpy residual workflow looks like this:

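import scanpy as sc

# Keep the integer raw counts in their own layer
adata.layers["counts"] = adata.X.copy()

# With inplace=False the function returns a dict; residuals are under "X"
residuals = sc.experimental.pp.normalize_pearson_residuals(
    adata,
    layer="counts",
    inplace=False,
)
adata.layers["pearson_residuals"] = residuals["X"]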

The function expects integer raw counts. Its default shared overdispersion and clipping rules differ from SCTransform, so results should not be described as interchangeable.
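
The idea behind analytic Pearson residuals can be sketched without any library. The code below is an illustration, not Scanpy's implementation: it assumes a null model in which the expected count is the product of cell total and gene total divided by the grand total, a single shared overdispersion `theta`, and clipping at the square root of the number of cells. The function name and the toy matrix are hypothetical.

```python
import math

def pearson_residuals(counts, theta=100.0, clip=None):
    """counts is a list of cells, each a list of per-gene raw counts."""
    n_cells = len(counts)
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(column) for column in zip(*counts)]
    grand_total = sum(cell_totals)
    if clip is None:
        clip = math.sqrt(n_cells)

    residuals = []
    for row, cell_total in zip(counts, cell_totals):
        out = []
        for observed, gene_total in zip(row, gene_totals):
            # Expected count under the depth-only null model.
            mu = cell_total * gene_total / grand_total
            if mu == 0:
                out.append(0.0)
                continue
            # Negative binomial variance with shared overdispersion theta.
            residual = (observed - mu) / math.sqrt(mu + mu * mu / theta)
            out.append(max(-clip, min(clip, residual)))
        residuals.append(out)
    return residuals

# Hypothetical 4-cell, 3-gene matrix with two distinct expression profiles.
toy_counts = [[10, 0, 5], [20, 0, 10], [1, 9, 0], [2, 18, 0]]
toy_residuals = pearson_residuals(toy_counts)

# Cells that differ only in depth are fully explained by the null model.
depth_only = pearson_residuals([[1, 2], [2, 4]])
print(toy_residuals)
print(depth_only)
```

Cells that differ only in depth produce residuals of zero, while genes that deviate from the depth-only expectation produce large residuals that are then clipped. SCTransform fits regularized per-gene parameters instead, which is one reason the two outputs are not interchangeable.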

scran deconvolution size factors

Simple library-size normalization can be biased when a few abundant genes or strongly asymmetric cell populations dominate total counts. The Bioconductor package scran addresses this by pooling counts from groups of cells, estimating size factors on the pooled profiles, and deconvolving them back to cell-specific factors.

For heterogeneous datasets, preliminary clustering relaxes the assumption that most genes are not differentially expressed across the entire dataset. Size factors are estimated within more comparable groups before being rescaled across groups.

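library(scran)
library(scuttle)

set.seed(100)
# Pre-cluster so that pooling happens among broadly similar cells
clusters <- quickCluster(sce)
# Pool, estimate, and deconvolve to per-cell size factors
sce <- computeSumFactors(sce, clusters = clusters)
# Apply the size factors; log2 values are stored in the logcounts assay
sce <- logNormCounts(sce)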

The raw counts assay remains intact and normalized log-expression is stored in logcounts.

scran is particularly useful when composition bias is plausible, but it is not assumption-free. Within each pre-cluster, enough genes must still behave consistently for a stable relative size factor to be estimated. Very small or highly unusual populations need inspection rather than blind pooling.

How other tools use scran

Seurat, Scanpy, Monocle 3, and ScarfWeb do not run scran deconvolution as their default normalization. A scran-normalized matrix can be transferred between Bioconductor, AnnData, and Seurat objects, but the raw count layer and size factors should travel with it. Otherwise, later functions may mistake log-normalized values for counts.

CLR normalization is mainly for antibody tags

Centered log-ratio normalization is widely used for antibody-derived tags and hashtag oligonucleotides in CITE-seq. It should not be presented as a general replacement for RNA normalization.

In Seurat, a common ADT workflow is:

obj <- NormalizeData(
  obj,
  assay = "ADT",
  normalization.method = "CLR",
  margin = 2
)

Scarf's ADT assay uses CLR normalization by default. For RNA counts, both tools use other strategies.
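
The transform itself is short. The sketch below shows the textbook centered log-ratio with a pseudocount of one, applied to a single vector of antibody counts. It is an assumption-laden illustration: individual tools implement variants of this formula and differ in whether the centering runs across features or across cells, so the numbers will not match any particular package exactly. The function name and counts are hypothetical.

```python
import math

def clr(values, pseudocount=1.0):
    """Centered log-ratio: log of each value minus the mean log value."""
    logs = [math.log(value + pseudocount) for value in values]
    mean_log = sum(logs) / len(logs)
    return [log_value - mean_log for log_value in logs]

# Hypothetical ADT counts for four antibodies measured in one cell.
adt_counts = [0, 9, 99, 999]
adt_clr = clr(adt_counts)
print(adt_clr)
```

The output is centered on zero, so each value is read relative to the other tags in the same vector rather than as an absolute abundance. That suits a small antibody panel with substantial background signal far better than a sparse transcriptome.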

For multimodal data, modality-specific normalization is essential. RNA, antibody counts, chromatin accessibility, and spatial measurements have different noise models. Forcing all modalities through one transform is usually less defensible than integrating modality-specific representations later.

Methods that should not be routine defaults

Quantile normalization forces samples or cells toward identical distributions. That assumption is poorly matched to heterogeneous UMI data and can erase real differences in RNA composition.
Imputation is not normalization. It can be useful in narrowly defined tasks, but it changes the covariance structure and can manufacture co-expression. Keep imputed values out of primary differential-expression tests unless the method and inferential target explicitly support them.
Cell-level regression of every visible covariate is not a substitute for a balanced design. Removing a variable from an embedding does not create biological replication or make a confounded contrast estimable.

[Infographic: scRNA-seq normalization methods (log normalization, SCTransform, scran pooling, CLR for antibody tags) compared with batch integration tools (Harmony, Seurat integration, BBKNN, scVI).]

Figure 1: Normalization and integration change different parts of the workflow. Match the method to the representation you need for the next analytical step.

A current tool-by-tool normalization summary

Tool	RNA normalization used in a standard workflow	Main output	Important detail
Seurat v5	`NormalizeData(LogNormalize)` or `SCTransform(vst.flavor = "v2")`	Log-normalized `data`, or SCT corrected counts and residuals	LogNormalize defaults to 10,000; SCT residuals belong in `scale.data`
Scanpy	`normalize_total()` then `log1p()`	Log-normalized `adata.X` or selected layer	Default `target_sum=None` uses the median library total; set `1e4` for CP10k
Monocle 3	Size factors plus `preprocess_cds(norm_method = "log")`	Log2 size-factor-normalized values used for PCA	Alignment is a separate `align_cds()` step
scran	Pooling and deconvolution with `computeSumFactors()`	Size factors and `logcounts`	Useful under composition bias; preliminary clustering is recommended for heterogeneous data
Scarf / ScarfWeb	Library-size normalization, selected-feature re-normalization, log transform before PCA	Normalized values used for graph construction	Harmony is optional and acts after normalization
Seurat ADT / Scarf ADT	CLR	Normalized antibody-tag values	Appropriate for ADT or HTO data, not a general RNA default

When batch correction is justified

Multiple samples do not automatically require integration. Integration is justified when technical structure prevents the analysis from comparing shared biological populations across datasets.

Typical reasons include:

the same marker-defined cell type separates by run, chemistry, site, or processing protocol
a joint atlas or common reference embedding is required
label transfer or query mapping is a core objective
matched cell states need to be compared across donors after shared structure has been established

Integration is risky when the biological state of interest occurs in only one condition, when cell-type composition is highly asymmetric, or when condition and technical batch are confounded. Anchor-based and nearest-neighbour methods cannot reliably match a disease-specific state to a counterpart that does not exist in the control data. Forcing that match can erase the signal.

Before integration, check that datasets use compatible gene identifiers, genome builds, and feature definitions. A large mismatch in the gene universe can look like a batch effect even when the main problem is inconsistent preprocessing.
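
A basic overlap check catches many of these problems early. The following is a minimal sketch with hypothetical helper names; the gene lists are illustrative, and the version suffixes attached to the Ensembl identifiers are made up for the example.

```python
def gene_overlap(genes_a, genes_b):
    """Summarize how two gene universes overlap."""
    set_a, set_b = set(genes_a), set(genes_b)
    shared = set_a & set_b
    union = set_a | set_b
    return {
        "shared": len(shared),
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
        "jaccard": len(shared) / len(union) if union else 1.0,
    }

def strip_ensembl_version(gene_id):
    """Drop a trailing '.N' version suffix from an Ensembl-style identifier."""
    return gene_id.split(".")[0]

# Dataset A carries versioned identifiers, dataset B does not.
dataset_a_genes = ["ENSG00000141510.18", "ENSG00000111640.15", "ENSG00000075624.17"]
dataset_b_genes = ["ENSG00000141510", "ENSG00000111640", "ENSG00000075624"]

overlap_before = gene_overlap(dataset_a_genes, dataset_b_genes)
overlap_after = gene_overlap(
    [strip_ensembl_version(gene) for gene in dataset_a_genes],
    dataset_b_genes,
)
print(overlap_before)
print(overlap_after)
```

An overlap near zero that jumps to near one after harmonizing identifiers points to a preprocessing mismatch, not to a batch effect that integration should absorb.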

What the main integration tools actually change

Tool	Input	What it changes	Use the result for	Main caution
Harmony	PCA or another low-dimensional embedding	Corrected embedding	Neighbours, clustering, UMAP	Quality depends on the upstream representation and covariate choice
Seurat v5 integration	Seurat assay layers and a selected integration method	Usually a corrected reduction in current `IntegrateLayers()` workflows	Joint clustering, visualization, mapping	CCA can be aggressive; RPCA is generally more conservative when shared states are well represented
BBKNN	PCA coordinates and batch labels	Batch-balanced neighbour graph	UMAP, clustering, graph analyses	Counts and PCA remain unchanged; graph topology depends on neighbour settings and batch composition
scVI / scANVI	Raw UMI counts plus batch and optional labels	Probabilistic latent representation	Atlas integration, clustering, reference mapping	Covariates supplied as nuisance factors are deliberately minimized in the latent space
Monocle 3 `align_cds()`	PCA or LSI coordinates	Aligned reduced coordinates	Clustering and trajectory construction	Discrete alignment assumes comparable populations; continuous regression can remove biology if mis-specified
ScarfWeb Harmony workflow	Scarf-normalized PCA representation	Harmony-corrected embedding used for the graph	Joint exploration and clustering	The selected batch covariate must not be the biological contrast being interpreted

Harmony

Harmony corrects a precomputed cell embedding, usually PCA. It does not rewrite raw counts. In a Seurat workflow, downstream steps should use the harmony reduction rather than pca.

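library(Seurat)
library(harmony)
obj <- RunPCA(obj)
# Correct the PCA embedding for the "batch" metadata column
obj <- RunHarmony(obj, group.by.vars = "batch")
# Downstream steps read the harmony reduction, not pca
obj <- FindNeighbors(obj, reduction = "harmony", dims = 1:30)
obj <- RunUMAP(obj, reduction = "harmony", dims = 1:30)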

The current Harmony generation is designed for larger integrations, but the same biological caveat remains: adding a variable to Harmony declares that variation associated with that variable should be reduced. Do not pass treatment, disease status, developmental time, or donor identity without deciding what biological information is being sacrificed.

Seurat v5 integration

Seurat v5's IntegrateLayers() interface supports CCA, RPCA, Harmony, FastMNN, and scVI integration methods. The practical choice is not simply "Seurat integration". It is a choice of algorithm and reference strategy inside the Seurat object model.

obj <- IntegrateLayers(
  object = obj,
  method = RPCAIntegration,
  orig.reduction = "pca",
  new.reduction = "integrated.rpca"
)

obj <- FindNeighbors(
  obj,
  reduction = "integrated.rpca",
  dims = 1:30
)

RPCA is often a safer starting point when batches share clearly represented cell states and preserving condition-specific differences matters. CCA can align more strongly, which may help with larger technical shifts but also increases the need to check overcorrection.

BBKNN

BBKNN replaces the ordinary Scanpy neighbour-building step. For each cell, it finds neighbours separately within each batch and combines them into one graph. It leaves the count matrix and PCA coordinates unchanged.

import bbknn

# Expects PCA coordinates in adata.obsm["X_pca"], e.g. from sc.pp.pca(adata)
bbknn.bbknn(
    adata,
    batch_key="batch"
)

BBKNN is fast and useful for exploratory joint embeddings. It does not produce batch-corrected expression values. Adding new cells or changing the reference set requires rebuilding the graph, so it is less suitable than an explicit reference-mapping model when a stable atlas must accept repeated query datasets.
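
The batch-balanced neighbour idea can be sketched in plain Python. This is a simplified illustration, not the BBKNN package's implementation: it uses exact Euclidean distances, excludes the cell itself, and skips the graph weighting and trimming that the real tool applies. The function name, the toy embedding, and the batch labels are hypothetical.

```python
import math
from collections import defaultdict

def batch_balanced_neighbours(embedding, batches, k_within_batch=1):
    """For every cell, pick its k nearest neighbours separately in each batch."""
    cells_by_batch = defaultdict(list)
    for index, batch in enumerate(batches):
        cells_by_batch[batch].append(index)

    neighbours = {}
    for i, point in enumerate(embedding):
        picked = []
        for batch in sorted(cells_by_batch):
            candidates = [j for j in cells_by_batch[batch] if j != i]
            candidates.sort(key=lambda j: math.dist(point, embedding[j]))
            picked.extend(candidates[:k_within_batch])
        neighbours[i] = picked
    return neighbours

# Toy 2-D embedding: batch B is the same structure shifted by 10 units.
toy_embedding = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
toy_batches = ["A", "A", "A", "B", "B", "B"]

toy_graph = batch_balanced_neighbours(toy_embedding, toy_batches, k_within_batch=1)
print(toy_graph)
```

Every cell receives neighbours from both batches even though an ordinary nearest-neighbour search would stay inside one batch. The same property is the risk: a cell whose population is absent from another batch is still connected to something in it.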

scVI and scANVI

scVI fits a probabilistic model directly to raw UMI counts and learns a latent representation conditioned on batch. scANVI adds cell labels in a semi-supervised model, which is useful for reference mapping and annotation.

import scvi

adata.layers["counts"] = adata.X.copy()

scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
)

model = scvi.model.SCVI(adata)
model.train()

adata.obsm["X_scVI"] = model.get_latent_representation()

The latent representation is the usual integration output for neighbours, clustering, and UMAP. get_normalized_expression() returns model-decoded expression, but it is not automatically one universal batch-corrected matrix. With transform_batch=None, current scvi-tools conditions each cell on its observed batch. A common counterfactual batch must be requested deliberately and interpreted as a model output, not as observed abundance.

Additional categorical or continuous covariates supplied to scVI are treated as nuisance factors. Do not include biological factors whose effects should remain visible.

Monocle 3 alignment

Monocle 3 keeps normalization and alignment separate. align_cds() uses mutual nearest-neighbour alignment for discrete groups and can use linear regression for continuous effects before trajectory reconstruction.

cds <- align_cds(
  cds,
  alignment_group = "batch"
)

For a continuous nuisance variable:

cds <- align_cds(
  cds,
  residual_model_formula_str = "~ percent_mito"
)

Both choices change the reduced coordinates used downstream. They do not make a confounded biological contrast identifiable. Monocle's current projection workflow also does not apply a saved alignment transform to a new query dataset, so batch-corrected query mapping requires co-embedding rather than a simple aligned projection.
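
The pair-finding step behind mutual nearest-neighbour alignment is easy to illustrate. The sketch below covers only that step, under simplifying assumptions: exact Euclidean distances, two batches, and no computation of correction vectors. It is not Monocle 3's implementation, and the function names and coordinates are hypothetical.

```python
import math

def nearest(query, reference, k):
    """Indices of the k reference points closest to the query point."""
    order = sorted(range(len(reference)), key=lambda j: math.dist(query, reference[j]))
    return set(order[:k])

def mutual_nearest_pairs(batch_a, batch_b, k=1):
    """Pairs (i, j) where j is among i's neighbours in B and i is among j's in A."""
    a_to_b = [nearest(point, batch_b, k) for point in batch_a]
    b_to_a = [nearest(point, batch_a, k) for point in batch_b]
    return sorted(
        (i, j)
        for i in range(len(batch_a))
        for j in a_to_b[i]
        if i in b_to_a[j]
    )

# Batch A has three populations; batch B lacks the one near (20, 20).
batch_a_cells = [(0, 0), (5, 5), (20, 20)]
batch_b_cells = [(1, 0), (6, 5)]

mnn_pairs = mutual_nearest_pairs(batch_a_cells, batch_b_cells, k=1)
print(mnn_pairs)
```

In this toy case the population that exists only in batch A forms no mutual pair, which is the behaviour the method relies on. With larger neighbourhoods or unbalanced composition that protection weakens, so population-specific states still need to be checked after alignment.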

How to evaluate an integration

A clean UMAP is not evidence that integration succeeded. UMAP is a two-dimensional projection of a chosen graph or latent space, and visually attractive mixing can be produced by overcorrection.

Evaluation should ask two separate questions.

Was biology conserved?

Useful measures include:

cell-type silhouette width or cLISI
NMI or ARI between known labels and clusters, when reliable labels exist
isolated-label performance for rare populations
conservation of highly variable genes, marker programmes, and trajectories
agreement with donor-level pseudobulk patterns and orthogonal assays

Was batch structure reduced?

Useful measures include:

iLISI or kBET computed on the batch label
batch silhouette width within known cell types
graph connectivity among cells that share a label

No single metric answers both questions. Strong mixing can improve iLISI while collapsing distinct cell states. Conversely, preserving every batch-specific structure can produce excellent label separation while leaving technical effects untouched.
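
The tension between the two kinds of score can be shown with a simplified diversity index. The sketch below computes an unweighted inverse Simpson index over the labels in one neighbourhood. Published LISI scores use distance-weighted neighbourhoods, so this is an approximation of the idea and not the metric itself; the function name and the label lists are hypothetical.

```python
from collections import Counter

def inverse_simpson(labels):
    """Effective number of distinct labels in a neighbourhood."""
    counts = Counter(labels)
    n = len(labels)
    return 1.0 / sum((count / n) ** 2 for count in counts.values())

# One hypothetical neighbourhood after integration, described three ways.
neighbourhood_batches = ["run1", "run2", "run1", "run2"]
conserved_cell_types = ["T cell", "T cell", "T cell", "T cell"]
collapsed_cell_types = ["T cell", "B cell", "T cell", "B cell"]

batch_mixing = inverse_simpson(neighbourhood_batches)
type_mixing_conserved = inverse_simpson(conserved_cell_types)
type_mixing_collapsed = inverse_simpson(collapsed_cell_types)
print(batch_mixing, type_mixing_conserved, type_mixing_collapsed)
```

The batch score is identical in both scenarios. Only the score computed on cell-type labels reveals that the second neighbourhood mixes two populations, which is why batch-mixing and conservation measures have to be reported together.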

At minimum, inspect the integrated and unintegrated data side by side, coloured by:

sample and donor
experimental condition
known cell type or lineage markers
library size, detected genes, mitochondrial fraction, and other relevant QC covariates

Rare and condition-specific states deserve their own checks. Global averages are easily dominated by abundant populations.

Downstream analysis should use the right representation

A practical decision table

Situation	Reasonable starting point	What to retain for inference
One routine UMI dataset, modest depth variation	Seurat LogNormalize, Scanpy CP10k plus log1p, or Scarf normalization	Raw UMI counts and sample metadata
Strong composition bias or heterogeneous cell populations	scran deconvolution or SCTransform, compared with a log-normalized baseline	Raw counts and estimated size factors or SCT model metadata
Seurat multi-sample analysis with shared populations	RPCA or Harmony after LogNormalize or SCT	RNA counts; use integrated reduction for clustering
Scanpy exploratory integration	BBKNN or Harmony on PCA	Raw counts; BBKNN graph or Harmony coordinates for structure
Large atlas or repeated reference mapping	scVI/scANVI or a Seurat reference workflow	Raw counts, model, reference labels, and held-out validation set
Trajectory analysis across batches	Conservative alignment only after checking stage and batch are separable	Raw counts plus uncorrected and corrected coordinates
CITE-seq antibody tags	CLR for ADT, separate RNA normalization	Raw RNA and ADT counts in separate modalities
Formal condition-level differential expression	Sample-level pseudobulk on raw counts with an estimable design	Raw counts aggregated by biological replicate
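
The pseudobulk aggregation in the last row can be sketched as follows. This is a minimal illustration with hypothetical cell records and a hypothetical helper name; it shows only the summation of raw counts per donor and cell type, not the downstream sample-aware model.

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum raw counts over all cells sharing a (donor, cell_type) pair."""
    n_genes = len(cells[0]["counts"])
    sums = defaultdict(lambda: [0] * n_genes)
    for cell in cells:
        key = (cell["donor"], cell["cell_type"])
        for gene_index, count in enumerate(cell["counts"]):
            sums[key][gene_index] += count
    return dict(sums)

# Hypothetical raw UMI counts for three genes in four cells.
toy_cells = [
    {"donor": "D1", "cell_type": "T cell", "counts": [1, 0, 3]},
    {"donor": "D1", "cell_type": "T cell", "counts": [2, 1, 0]},
    {"donor": "D2", "cell_type": "T cell", "counts": [0, 4, 1]},
    {"donor": "D1", "cell_type": "B cell", "counts": [5, 0, 0]},
]

pseudobulk_counts = pseudobulk(toy_cells)
print(pseudobulk_counts)
```

Each resulting profile represents one biological replicate within one cell type, so the number of observations entering the test equals the number of donors and not the number of cells.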

Current official resources

When checking implementation details, the official documentation for each tool is preferable to secondary summaries.

For external biological references and annotation checks, use curated portals such as the Human Cell Atlas Data Portal, CZ CELLxGENE Discover, and the Cell Ontology. These are useful reference resources, but their labels are not universal ground truth. Atlas annotations inherit the sampling, preprocessing, and expert decisions of the source study.

ScarfWeb provides the Scarf normalization and optional Harmony integration workflow through a browser interface. The methodological responsibilities remain the same: inspect the uncorrected data first, choose correction variables deliberately, preserve raw counts, and validate that the biology survives the correction.
