
Batch Effect Correction in scRNA-Seq Data

Compare scRNA-seq normalization and batch correction methods, with current Seurat, Scanpy, Monocle 3, scran, Harmony, BBKNN and scVI workflows.

Most errors in this part of a single-cell workflow begin by treating normalization, feature scaling, batch correction, and statistical adjustment as interchangeable. They are not.

  1. Normalize: address cell-level sampling depth and composition.
  2. Integrate: change the geometry used to compare cells across datasets.
  3. Infer: model batch, donor, and other covariates when estimating an effect.

The same matrix should not be expected to serve all three purposes.

This guide focuses on UMI-based scRNA-seq and single-nucleus RNA-seq. Full-length protocols such as Smart-seq2 have different count properties and should not inherit droplet-based defaults without checking the assumptions. For the broader workflow around quality control, feature selection, dimensionality reduction, and clustering, see Navigating the Complexity of Single-Cell RNA-Seq Data Analysis.

The practical rules

  1. Keep raw counts unchanged. Store normalized values, residuals, corrected embeddings, and corrected graphs in separate assays, layers, or reductions.
  2. Normalize every UMI dataset, but integrate only when the scientific question requires shared geometry across samples.
  3. Treat donor or subject as the biological replicate. Do not remove donor structure automatically, especially when donor is the unit of inference.
  4. Use integrated representations for clustering, visualization, and mapping. Use raw counts with sample-aware models for formal differential expression.
  5. Judge integration by biological conservation as well as batch mixing. A well-mixed UMAP can still be biologically wrong.

Start with the design, not the software

A computational method cannot identify a biological effect that is perfectly confounded with batch. If every control was processed with one chemistry and every case with another, the observed difference is compatible with both explanations. No integration algorithm can recover information that the design never contained.

The most useful safeguards are simple:

  • distribute biological conditions across processing runs where possible
  • include independent biological replicates in each condition
  • keep tissue handling, dissociation, capture, library preparation, and sequencing procedures consistent
  • record operator, site, run, lane, chemistry version, preservation method, tissue handling time, and other technical metadata
  • distinguish sample, donor, library, lane, and experimental condition rather than collapsing them into one column called batch
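
The confounding problem described above can be checked from the sample sheet before any cells are analysed. The following is a minimal sketch in plain Python; the helper names and the toy sample lists are hypothetical, and it only asks whether any batch contains more than one condition.

```python
from collections import defaultdict

def conditions_by_batch(samples):
    """Map each batch to the set of conditions observed in it.

    `samples` is a list of (batch, condition) pairs, one per sample.
    """
    table = defaultdict(set)
    for batch, condition in samples:
        table[batch].add(condition)
    return dict(table)

def has_within_batch_contrast(samples):
    """True if at least one batch contains two or more conditions."""
    return any(len(conds) >= 2 for conds in conditions_by_batch(samples).values())

# Toy designs (illustrative values only).
confounded = [("v2", "control"), ("v2", "control"), ("v3", "case"), ("v3", "case")]
balanced = [("run1", "control"), ("run1", "case"), ("run2", "control"), ("run2", "case")]

print(has_within_batch_contrast(confounded))  # False
print(has_within_batch_contrast(balanced))    # True
```

A result of False means the condition contrast cannot be estimated inside any single batch, which is the situation no integration method can repair.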

Donor is not merely a nuisance variable. In patient studies it is normally the replicate, and donor-to-donor variation is part of the population being studied. Correcting donor away before inference can create an apparently clean cell-level result while discarding the variance needed to support a patient-level conclusion.

What each operation changes

Operation Typical output Appropriate use Common misuse
Library-size normalization Log-normalized expression values Visualization, feature selection, PCA, marker inspection Treating values as absolute transcript abundance
Variance stabilization Pearson residuals or related transformed values Highly variable gene selection, PCA, clustering Using residuals as raw counts or direct fold changes
Feature scaling Gene-wise centered and scaled values PCA and other distance-based procedures Calling ScaleData() or sc.pp.scale() a normalization method
Embedding correction Corrected PCA or latent coordinates Neighbour graph, clustering, UMAP, reference mapping Using coordinates for gene-level differential expression
Graph correction Batch-balanced neighbour graph Clustering, UMAP, graph-based trajectory analysis Assuming a corrected expression matrix exists
Statistical adjustment Model coefficient and uncertainty Differential expression and effect estimation Substituting an integrated matrix for a sample-aware model

This distinction explains why modern single-cell objects usually retain several representations at once.

Log normalization

Library-size log normalization divides each gene count by the total counts in the cell, scales the result to a common total, and applies a logarithmic transform. A common form is:

log(1 + scale_factor × gene_count / cell_total)

The method is fast, transparent, and interoperable. It remains a sensible baseline for many UMI datasets.
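
The formula above can be sketched for a single cell in plain Python. The function name and the toy counts are hypothetical, and the scale factor of 10,000 is used only as a common illustrative choice.

```python
import math

def log_normalize(cell_counts, scale_factor=10_000):
    """Return log(1 + scale_factor * count / cell_total) for one cell."""
    total = sum(cell_counts)
    if total == 0:
        # An empty cell has no defined proportions; return zeros.
        return [0.0] * len(cell_counts)
    return [math.log1p(scale_factor * c / total) for c in cell_counts]

# Two toy cells with the same gene proportions but a tenfold depth difference.
shallow = [1, 0, 3]    # 4 UMIs
deep = [10, 0, 30]     # 40 UMIs

print(log_normalize(shallow))
print(log_normalize(deep))
```

Both cells give the same values, which is the intended effect when the depth difference is purely technical. The zero stays zero: nothing is imputed.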

What it assumes

Total-count scaling treats differences in cell totals largely as sampling-depth differences. That is only an approximation. Total UMI count also reflects true RNA content, cell size, transcriptional activity, and the abundance of highly expressed programs.

The main limitation is therefore composition bias, not a generic failure to handle zeros. If one condition strongly induces a small set of transcripts, those genes consume a larger fraction of the library. After total-count scaling, unrelated genes can appear relatively lower even when their absolute molecule number has not changed. Likewise, equalizing library totals can remove a real global increase in RNA content.
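
A small worked example makes the composition effect visible. This is a sketch with invented counts: genes 0 to 2 have identical molecule numbers in both cells, and only gene 3 is induced.

```python
def cp10k(cell_counts, scale_factor=10_000):
    """Scale one cell's counts so that they sum to scale_factor."""
    total = sum(cell_counts)
    return [scale_factor * c / total for c in cell_counts]

# Toy cells (illustrative values only).
baseline = [100, 100, 100, 100]
induced = [100, 100, 100, 1700]   # only gene 3 changes

print(cp10k(baseline)[0])  # 2500.0
print(cp10k(induced)[0])   # 500.0
```

Gene 0 appears five times lower in the induced cell although its absolute count is unchanged. The scaling has transferred the change in gene 3 onto every other gene.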

Zeros in UMI data are usually expected sampling outcomes. Log normalization does not impute them, nor should it be described as a treatment for "dropout". Imputation is a separate modelling choice and can create artificial correlations if used without a clear downstream reason.

How current tools implement log normalization

Seurat's NormalizeData() with LogNormalize scales each cell to 10,000 by default. Scanpy's normalize_total() followed by log1p() uses the median library total unless target_sum is set, for example to 1e4 for CP10k. Monocle 3 applies size factors and a log transform in preprocess_cds(norm_method = "log"). The tool-by-tool summary table later in this guide lists these defaults side by side.

SCTransform and residual-based normalization

SCTransform models UMI counts with a regularized negative binomial model and returns variance-stabilized Pearson residuals. It was designed to reduce the dependence of downstream variance on sequencing depth while combining normalization, variable-feature selection, and scaling in one procedure.

In current Seurat v5, vst.flavor = "v2" is the default. The SCT assay has three distinct representations:

  • counts: depth-corrected UMI values
  • data: log1p of corrected counts
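  • scale.data: Pearson residuals, normally restricted to variable genes

library(Seurat)

# Fit the regularized negative binomial model; results are stored in the "SCT" assay.
obj <- SCTransform(
  obj,
  vst.flavor = "v2",
  verbose = FALSE
)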

Pearson residuals are well suited to PCA and clustering. They are not raw observations and should not be treated as ordinary normalized counts for sample-level differential expression.

vars.to.regress deserves particular care. Regressing mitochondrial percentage, cell-cycle scores, or other covariates is not automatically an improvement. A variable should be removed only when it is a nuisance for the stated question and the design contains enough information to separate it from biology. Cell cycle, for example, is a real biological programme in developmental, regenerative, and tumour datasets.

Equivalent options in other tools

  • Scanpy: scanpy.experimental.pp.normalize_pearson_residuals() provides analytic Pearson residual normalization from raw counts. It is related to SCTransform, but it is not an implementation of Seurat's regularized per-gene model.
  • Monocle 3: no native SCTransform workflow. Monocle can receive transformed data from another object, but trajectory and model assumptions should be checked rather than mixing representations silently.
  • ScarfWeb: uses Scarf's library-size and feature-subset normalization, not SCTransform.

A Scanpy residual workflow looks like this:

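import scanpy as sc

# Preserve the raw integer counts before any transformation.
adata.layers["counts"] = adata.X.copy()

# Compute analytic Pearson residuals from the raw counts without overwriting adata.X.
residuals = sc.experimental.pp.normalize_pearson_residuals(
    adata,
    layer="counts",
    inplace=False,
)
adata.layers["pearson_residuals"] = residuals["X"]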

The function expects integer raw counts. Its default shared overdispersion and clipping rules differ from SCTransform, so results should not be described as interchangeable.
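
To show what a shared overdispersion and a clipping rule do, here is a minimal sketch of analytic Pearson residuals on a toy matrix. It is not Scanpy's or Seurat's code. The function name is hypothetical, and theta = 100 and clipping at the square root of the number of cells are assumptions chosen for illustration; check the documentation of your Scanpy version for its actual defaults.

```python
import math

def pearson_residuals(counts, theta=100.0):
    """Analytic Pearson residuals for a cells x genes list of raw counts.

    Expected count: mu = cell_total * gene_total / grand_total.
    Residual: (x - mu) / sqrt(mu + mu^2 / theta), clipped to +/- sqrt(n_cells).
    """
    n_cells = len(counts)
    n_genes = len(counts[0])
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(row[g] for row in counts) for g in range(n_genes)]
    grand_total = sum(cell_totals)
    clip = math.sqrt(n_cells)

    residuals = []
    for i, row in enumerate(counts):
        out = []
        for g, x in enumerate(row):
            mu = cell_totals[i] * gene_totals[g] / grand_total
            if mu == 0:
                out.append(0.0)  # empty gene or empty cell
                continue
            r = (x - mu) / math.sqrt(mu + mu * mu / theta)
            out.append(max(-clip, min(clip, r)))
        residuals.append(out)
    return residuals

# Toy matrix (illustrative values only): 3 cells x 3 genes.
toy = [[10, 0, 5], [20, 1, 9], [5, 3, 2]]
for row in pearson_residuals(toy):
    print([round(v, 3) for v in row])
```

When every cell has the same gene proportions, all residuals are zero, because depth alone explains the counts. Residuals are signed and centred near zero, which is why they suit PCA but cannot stand in for counts.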

scran deconvolution size factors

Simple library-size normalization can be biased when a few abundant genes or strongly asymmetric cell populations dominate total counts. The Bioconductor package scran addresses this by pooling counts from groups of cells, estimating size factors on the pooled profiles, and deconvolving them back to cell-specific factors.

For heterogeneous datasets, preliminary clustering relaxes the assumption that most genes are not differentially expressed: it needs to hold only within each cluster rather than across the entire dataset. Size factors are estimated within more comparable groups before being rescaled across groups.

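library(scran)
library(scuttle)

set.seed(100)
# Coarse pre-clustering so that size factors are estimated within comparable cells.
clusters <- quickCluster(sce)
# Pool-based deconvolution size factors, rescaled across clusters.
sce <- computeSumFactors(sce, clusters = clusters)
# Store log-transformed normalized expression in the "logcounts" assay.
sce <- logNormCounts(sce)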

The raw counts assay remains intact and normalized log-expression is stored in logcounts.

scran is particularly useful when composition bias is plausible, but it is not assumption-free. Within each pre-cluster, enough genes must still behave consistently for a stable relative size factor to be estimated. Very small or highly unusual populations need inspection rather than blind pooling.

How other tools use scran

Seurat, Scanpy, Monocle 3, and ScarfWeb do not run scran deconvolution as their default normalization. A scran-normalized matrix can be transferred between Bioconductor, AnnData, and Seurat objects, but the raw count layer and size factors should travel with it. Otherwise, later functions may mistake log-normalized values for counts.

CLR normalization is mainly for antibody tags

Centered log-ratio normalization is widely used for antibody-derived tags and hashtag oligonucleotides in CITE-seq. It should not be presented as a general replacement for RNA normalization.

In Seurat, a common ADT workflow is:

obj <- NormalizeData(
  obj,
  assay = "ADT",
  normalization.method = "CLR",
  margin = 2
)

Scarf's ADT assay uses CLR normalization by default. For RNA counts, both tools use other strategies.
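
For readers who want to see the transform itself, the following is a sketch of a textbook centered log-ratio with a pseudocount of 1, applied to one vector of tag counts. The function name and counts are hypothetical. Seurat's and Scarf's own CLR code may differ in detail, and the choice of which axis the vector runs along corresponds to the margin setting.

```python
import math

def clr(tag_counts):
    """Centered log-ratio: log(1 + x) minus the mean of log(1 + x)."""
    logs = [math.log1p(c) for c in tag_counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Toy antibody-tag counts (illustrative values only).
adt = [0, 12, 250, 3]
values = clr(adt)

print([round(v, 3) for v in values])
print(abs(sum(values)) < 1e-9)  # True: CLR values are centred on zero
```

Because the values are expressed relative to the vector's own geometric mean, they describe relative enrichment across tags, not absolute abundance.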

For multimodal data, modality-specific normalization is essential. RNA, antibody counts, chromatin accessibility, and spatial measurements have different noise models. Forcing all modalities through one transform is usually less defensible than integrating modality-specific representations later.

Methods that should not be routine defaults

  • Quantile normalization forces samples or cells toward identical distributions. That assumption is poorly matched to heterogeneous UMI data and can erase real differences in RNA composition.

  • Imputation is not normalization. It can be useful in narrowly defined tasks, but it changes the covariance structure and can manufacture co-expression. Keep imputed values out of primary differential-expression tests unless the method and inferential target explicitly support them.

  • Cell-level regression of every visible covariate is not a substitute for a balanced design. Removing a variable from an embedding does not create biological replication or make a confounded contrast estimable.

[Infographic: comparison of scRNA-seq normalization methods (log normalization, SCTransform, scran pooling, CLR for antibody tags) and batch integration tools (Harmony, Seurat integration, BBKNN, scVI)]

Figure 1: Normalization and integration change different parts of the workflow. Match the method to the representation you need for the next analytical step.

A current tool-by-tool normalization summary

| Tool | RNA normalization used in a standard workflow | Main output | Important detail |
| --- | --- | --- | --- |
| Seurat v5 | NormalizeData(LogNormalize) or SCTransform(vst.flavor = "v2") | Log-normalized data, or SCT corrected counts and residuals | LogNormalize defaults to 10,000; SCT residuals belong in scale.data |
| Scanpy | normalize_total() then log1p() | Log-normalized adata.X or selected layer | Default target_sum=None uses the median library total; set 1e4 for CP10k |
| Monocle 3 | Size factors plus preprocess_cds(norm_method = "log") | Log2 size-factor-normalized values used for PCA | Alignment is a separate align_cds() step |
| scran | Pooling and deconvolution with computeSumFactors() | Size factors and logcounts | Useful under composition bias; preliminary clustering is recommended for heterogeneous data |
| Scarf / ScarfWeb | Library-size normalization, selected-feature re-normalization, log transform before PCA | Normalized values used for graph construction | Harmony is optional and acts after normalization |
| Seurat ADT / Scarf ADT | CLR | Normalized antibody-tag values | Appropriate for ADT or HTO data, not a general RNA default |
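
The difference between a median target and a fixed CP10k target is easy to miss when moving between tools. The sketch below mimics the behaviour described in the table with a hypothetical helper, scale_to_target. It is not Scanpy's implementation and assumes that empty cells have already been filtered out.

```python
import statistics

def scale_to_target(counts, target_sum=None):
    """Scale each cell so that its total equals target_sum.

    If target_sum is None, the median of the cell totals is used.
    Assumes every cell has a non-zero total.
    """
    totals = [sum(cell) for cell in counts]
    if target_sum is None:
        target_sum = statistics.median(totals)
    return [[c * target_sum / t for c in cell] for cell, t in zip(counts, totals)]

# Toy cells with totals of 4, 40, and 200 UMIs (illustrative values only).
counts = [[2, 2], [10, 30], [100, 100]]

median_scaled = scale_to_target(counts)          # every cell sums to 40
cp10k_scaled = scale_to_target(counts, 1e4)      # every cell sums to 10,000

print([sum(cell) for cell in median_scaled])
print([sum(cell) for cell in cp10k_scaled])
```

The two outputs differ by a constant factor before the log transform, but not after it, so log-normalized values from the two conventions are not directly comparable.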

When batch correction is justified

Multiple samples do not automatically require integration. Integration is justified when technical structure prevents the analysis from comparing shared biological populations across datasets.

Typical reasons include:

  • the same marker-defined cell type separates by run, chemistry, site, or processing protocol
  • a joint atlas or common reference embedding is required
  • label transfer or query mapping is a core objective
  • matched cell states need to be compared across donors after shared structure has been established

Integration is risky when the biological state of interest occurs in only one condition, when cell-type composition is highly asymmetric, or when condition and technical batch are confounded. Anchor-based and nearest-neighbour methods cannot reliably match a disease-specific state to a counterpart that does not exist in the control data. Forcing that match can erase the signal.

Before integration, check that datasets use compatible gene identifiers, genome builds, and feature definitions. A large mismatch in the gene universe can look like a batch effect even when the main problem is inconsistent preprocessing.
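
A quick overlap report catches most identifier mismatches before they are mistaken for batch effects. This is a sketch with a hypothetical helper and toy identifiers; real checks would also compare annotation versions and feature types.

```python
def gene_overlap(genes_a, genes_b):
    """Summarize how two gene universes overlap."""
    a, b = set(genes_a), set(genes_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "jaccard": len(shared) / len(union) if union else 1.0,
    }

# Toy feature lists (illustrative values only).
dataset_1 = ["CD3E", "MS4A1", "LYZ"]
dataset_2 = ["MS4A1", "LYZ", "NKG7"]
ensembl_style = ["ENSG_A", "ENSG_B", "ENSG_C"]  # placeholder IDs, not real accessions

print(gene_overlap(dataset_1, dataset_2))
print(gene_overlap(dataset_1, ensembl_style)["jaccard"])  # 0.0
```

A Jaccard index near zero between datasets from the same species usually points to mixed identifier systems rather than biology.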

What the main integration tools actually change

| Tool | Input | What it changes | Use the result for | Main caution |
| --- | --- | --- | --- | --- |
| Harmony | PCA or another low-dimensional embedding | Corrected embedding | Neighbours, clustering, UMAP | Quality depends on the upstream representation and covariate choice |
| Seurat v5 integration | Seurat assay layers and a selected integration method | Usually a corrected reduction in current IntegrateLayers() workflows | Joint clustering, visualization, mapping | CCA can be aggressive; RPCA is generally more conservative when shared states are well represented |
| BBKNN | PCA coordinates and batch labels | Batch-balanced neighbour graph | UMAP, clustering, graph analyses | Counts and PCA remain unchanged; graph topology depends on neighbour settings and batch composition |
| scVI / scANVI | Raw UMI counts plus batch and optional labels | Probabilistic latent representation | Atlas integration, clustering, reference mapping | Covariates supplied as nuisance factors are deliberately minimized in the latent space |
| Monocle 3 align_cds() | PCA or LSI coordinates | Aligned reduced coordinates | Clustering and trajectory construction | Discrete alignment assumes comparable populations; continuous regression can remove biology if mis-specified |
| ScarfWeb Harmony workflow | Scarf-normalized PCA representation | Harmony-corrected embedding used for the graph | Joint exploration and clustering | The selected batch covariate must not be the biological contrast being interpreted |
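
The idea of a batch-balanced neighbour graph can be shown in a few lines. The sketch below picks each cell's nearest neighbours separately within every batch and leaves the coordinates untouched. It illustrates the selection step only and is not the BBKNN implementation, which uses its own neighbour search and graph weighting. The function name and coordinates are hypothetical.

```python
import math

def batch_balanced_neighbours(coords, batches, k_per_batch=1):
    """For each cell, return the indices of its k nearest cells in every batch."""
    batch_names = sorted(set(batches))
    result = []
    for i, point in enumerate(coords):
        neighbours = []
        for batch in batch_names:
            candidates = sorted(
                (math.dist(point, coords[j]), j)
                for j in range(len(coords))
                if j != i and batches[j] == batch
            )
            neighbours.extend(j for _, j in candidates[:k_per_batch])
        result.append(neighbours)
    return result

# Toy embedding (illustrative values only): batch B is shifted by 10 units in x.
coords = [(0, 0), (0, 5), (0.5, 0), (10, 0), (10, 5), (10.5, 0)]
batches = ["A", "A", "A", "B", "B", "B"]

graph = batch_balanced_neighbours(coords, batches)
print(graph[0])  # [2, 3]
```

An ordinary two-nearest-neighbour search for cell 0 would return only batch A cells (2 and 1). The balanced version links it to cell 3 in batch B, which is why a cell state present in one batch only can be forced into a false match.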

How to evaluate an integration

A clean UMAP is not evidence that integration succeeded. UMAP is a two-dimensional projection of a chosen graph or latent space, and visually attractive mixing can be produced by overcorrection.

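Evaluation should ask two separate questions: do cells of the same type from different batches now mix, and do biologically distinct populations remain distinct?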
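
One way to make the evaluation concrete is to score batch mixing and label conservation separately on the same embedding. The sketch below uses a simple nearest-neighbour count. It is a simplified stand-in for published integration metrics, not an implementation of any of them, and the function name, labels, and coordinates are hypothetical.

```python
import math

def neighbour_fractions(coords, batches, labels, k=3):
    """Return (batch mixing, label conservation) averaged over cells.

    Batch mixing: mean fraction of the k nearest neighbours from another batch.
    Label conservation: mean fraction of the k nearest neighbours with the same label.
    """
    n = len(coords)
    other_batch = 0.0
    same_label = 0.0
    for i in range(n):
        nearest = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )[:k]
        other_batch += sum(batches[j] != batches[i] for j in nearest) / k
        same_label += sum(labels[j] == labels[i] for j in nearest) / k
    return other_batch / n, same_label / n

# Eight toy cells (illustrative values only): two runs, two cell types.
batches = ["run1"] * 4 + ["run2"] * 4
labels = ["T cell", "T cell", "B cell", "B cell"] * 2

# Cell types stay apart and runs mix within each type.
conserved = [(0, 0), (0.3, 0.1), (10, 0), (10.3, 0.1),
             (0.1, 0.35), (0.4, 0.45), (10.1, 0.35), (10.4, 0.45)]
# Everything collapsed onto one line with cell types interleaved.
overcorrected = [(0.0, 0), (4.6, 0), (1.0, 0), (6.0, 0),
                 (2.1, 0), (7.5, 0), (3.3, 0), (9.1, 0)]

print(neighbour_fractions(conserved, batches, labels))
print(neighbour_fractions(overcorrected, batches, labels))
```

Both embeddings reach the same batch-mixing score of about 0.67, but label conservation falls from 1.0 to about 0.33 in the overcorrected one. A mixing score on its own cannot separate the two cases.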

Downstream analysis should use the right representation

A practical decision table

| Situation | Reasonable starting point | What to retain for inference |
| --- | --- | --- |
| One routine UMI dataset, modest depth variation | Seurat LogNormalize, Scanpy CP10k plus log1p, or Scarf normalization | Raw UMI counts and sample metadata |
| Strong composition bias or heterogeneous cell populations | scran deconvolution or SCTransform, compared with a log-normalized baseline | Raw counts and estimated size factors or SCT model metadata |
| Seurat multi-sample analysis with shared populations | RPCA or Harmony after LogNormalize or SCT | RNA counts; use integrated reduction for clustering |
| Scanpy exploratory integration | BBKNN or Harmony on PCA | Raw counts; BBKNN graph or Harmony coordinates for structure |
| Large atlas or repeated reference mapping | scVI/scANVI or a Seurat reference workflow | Raw counts, model, reference labels, and held-out validation set |
| Trajectory analysis across batches | Conservative alignment only after checking stage and batch are separable | Raw counts plus uncorrected and corrected coordinates |
| CITE-seq antibody tags | CLR for ADT, separate RNA normalization | Raw RNA and ADT counts in separate modalities |
| Formal condition-level differential expression | Sample-level pseudobulk on raw counts with an estimable design | Raw counts aggregated by biological replicate |
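
The pseudobulk row can be illustrated with a short aggregation sketch. It sums raw counts over all cells that share a donor and cell type, so that each donor contributes one profile per cell type. The function name and toy data are hypothetical, and the downstream count-based model that treats donors as replicates is not shown.

```python
from collections import defaultdict

def pseudobulk(counts, donors, cell_types):
    """Sum raw counts over cells sharing a (donor, cell type) pair."""
    n_genes = len(counts[0])
    sums = defaultdict(lambda: [0] * n_genes)
    for row, donor, cell_type in zip(counts, donors, cell_types):
        profile = sums[(donor, cell_type)]
        for g, value in enumerate(row):
            profile[g] += value
    return dict(sums)

# Toy raw counts (illustrative values only): 4 cells x 3 genes.
counts = [[1, 0, 2], [3, 1, 0], [0, 2, 2], [5, 0, 1]]
donors = ["d1", "d1", "d2", "d2"]
cell_types = ["T", "T", "T", "B"]

for key, profile in pseudobulk(counts, donors, cell_types).items():
    print(key, profile)
```

The aggregation uses raw counts, not normalized or integrated values, so the later model can handle library size and donor-level variance itself.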

Current official resources

The official documentation for Seurat, Scanpy, Monocle 3, scran, Harmony, BBKNN, and scVI is preferable to secondary summaries when checking implementation details.

For external biological references and annotation checks, use curated portals such as the Human Cell Atlas Data Portal, CZ CELLxGENE Discover, and the Cell Ontology. These are useful reference resources, but their labels are not universal ground truth. Atlas annotations inherit the sampling, preprocessing, and expert decisions of the source study.


ScarfWeb provides the Scarf normalization and optional Harmony integration workflow through a browser interface. The methodological responsibilities remain the same: inspect the uncorrected data first, choose correction variables deliberately, preserve raw counts, and validate that the biology survives the correction.
