CyteType Documentation
Everything you need to annotate single-cell data with CyteType — from installation to interpreting your report.
Getting Started
Prerequisites
Before running CyteType, make sure your data meets these requirements.
Data requirements:
- Gene symbols, not Ensembl IDs, in your feature names
- Differential expression results computed per cluster
- Clustering results stored in your object (Leiden, Louvain, or Seurat clusters)
- Normalized gene expression data (log1p-normalization recommended)
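Since CyteType expects gene symbols rather than Ensembl IDs, it can be worth checking feature names before submission. A minimal stdlib sketch (the pattern and function name are illustrative, not part of the CyteType API):

```python
import re

# Ensembl gene IDs look like ENSG00000141510 (human) or
# ENSMUSG00000026581 (mouse): "ENS", optional species code, "G", 11 digits.
ENSEMBL_GENE_RE = re.compile(r"^ENS[A-Z]*G\d{11}$")

def looks_like_ensembl(name: str) -> bool:
    """Return True if a feature name appears to be an Ensembl gene ID."""
    return bool(ENSEMBL_GENE_RE.match(name))

# Flag feature names that should be converted to symbols before running CyteType.
features = ["CD4", "ENSG00000141510", "MS4A1"]
ensembl_like = [f for f in features if looks_like_ensembl(f)]
```

If any names match, convert them to symbols (or point CyteType at a gene-symbols column) before annotating.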
Python prerequisites:
- A preprocessed AnnData object with sc.tl.rank_genes_groups results
- Python ≥ 3.12
R prerequisites:
- devtools installed for GitHub installation
- A Seurat object with FindAllMarkers() output
- R ≥ 4.1.0
Installation
Install the CyteType client for your preferred environment.
pip install cytetype
install.packages("devtools")
library(devtools)
install_github("NygenAnalytics/CyteTypeR")
Quick Start
A minimal end-to-end run. No API key required for the default configuration.
import scanpy as sc
from cytetype import CyteType
# Load your preprocessed AnnData
# adata must have clusters in adata.obs and rank_genes_groups in adata.uns
group_key = "leiden"
annotator = CyteType(
adata,
group_key=group_key,
rank_key="rank_genes_" + group_key,
n_top_genes=50,
)
adata = annotator.run(
study_context="Human PBMC from healthy donor, 10X Genomics 3' scRNA-seq"
)
# Annotations are now in adata.obs
sc.pl.umap(adata, color=f"cytetype_annotation_{group_key}")
library(Seurat)
library(CyteTypeR)
# Find markers (if not already done)
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)
# Step 1: prepare data
prepped_data <- PrepareCyteTypeR(
obj = pbmc,
marker_table = pbmc.markers,
group_key = "seurat_clusters",
n_top_genes = 50,
coordinates_key = "umap"
)
# Step 2: submit and annotate
pbmc <- CyteTypeR(
obj = pbmc,
prepped_data = prepped_data,
study_context = "Human PBMC from healthy donor, 10X Genomics 3' scRNA-seq"
)
# Annotations are now in obj@meta.data
DimPlot(pbmc, group.by = "cytetype_annotation_seurat_clusters")
A link to your interactive HTML report is printed during the run. Results are also written directly back to your object.
Dashboard
The CyteType Dashboard is a web interface for managing API tokens and annotation reports. Navigate to /dashboard on your CyteType server.
Workspaces
The dashboard supports two workspaces, selectable from the switcher at the top of the sidebar.
Personal — Your free individual account. Tokens and reports here are private by default and subject to the free-tier daily limit (three annotation runs per day).
Organization — Available if your account is linked to an organization license. Switching to this workspace shows tokens and reports scoped to your organization, with shared quota, org-wide visibility options, and team administration features.
API Tokens
API tokens authenticate your requests to the CyteType API. Open the API Tokens section from the sidebar navigation.
Creating a token:
- Click Create New Token.
- Enter a descriptive name for the token (e.g. Production Server or Local notebook).
- Optionally set a Quota Limit — the maximum number of clusters this token is allowed to annotate. Leave empty for no limit.
- Click Create Token.
The token value is displayed in the table immediately. Copy it and store it securely — it is used as the auth_token parameter in the Python and R clients:
annotator = CyteType(adata, group_key="leiden", auth_token="your-token-here")
result <- CyteTypeR(
obj = seurat_obj,
prepped_data = prepped_data,
study_context = "...",
auth_token = "your-token-here"
)
Managing existing tokens:
The token table shows each token's status, total jobs run, clusters annotated, quota, and creation date.
- To disable a token permanently, click Disable. Disabled tokens cannot be re-enabled.
- To copy a token value, click the copy button beside it.
- To rename a token or update its quota, click the edit icon next to the name or quota field.
Use the Sort dropdown to order tokens by date created, name, usage, quota, or status. The filter pills toggle between showing only active tokens and all tokens including disabled ones.
Organization admins:
In the organization workspace, admins see all tokens created across the organization. Admins can create tokens on behalf of other members by entering a member's email address in the User Email field when creating a token.
Members who are not admins cannot create tokens in the organization workspace. Contact your organization admin to request one.
Reports
The Reports section lists annotation jobs submitted under your account. Two tabs are available:
- Reports shared with me — Jobs from other organization members made visible to you (organization workspace only).
- My Reports — Jobs you own.
Controlling visibility:
Each report has a visibility setting. Click the edit icon in the Visibility column to change it.
| Setting | Who can access the report |
|---|---|
| Private | Only you and users you have explicitly shared with |
| Organization | All members of your organization |
| Public | Anyone with the report link |
Sharing a report with specific users (organization workspace):
- Select one or more reports using their checkboxes.
- Click Share in the action bar.
- Enter each person's email address and click Add.
- Click Save Changes.
You can also click the value in the Shared With column of a single report to open the share dialog for that report directly.
When sharing multiple reports at once, new users are added to each report's existing share list without removing anyone already shared. When editing a single report's shares, the list is replaced entirely — remove users by deleting them from the list before saving.
Archiving reports:
Archiving hides reports from the default view without deleting them. To archive one or more reports, select them and click Archive. To view archived reports, enable Show archived above the table. Select an archived report and click Unarchive to restore it.
Transferring ownership (organization workspace):
To transfer one or more reports to another organization member:
- Select the reports.
- Click Transfer in the action bar.
- Enter the new owner's email address.
- Click Transfer All.
Transfer is permanent. The receiving user must be a member of the same organization.
Client Reference
Preprocessing your data
CyteType requires differential expression results computed per cluster before submission.
import scanpy as sc
# Assumes adata already has normalized counts in adata.X
# and cluster labels in adata.obs["leiden"]
sc.tl.rank_genes_groups(
adata,
groupby="leiden",
method="wilcoxon",
key_added="rank_genes_leiden",
n_genes=100,
)
# If gene symbols are stored separately from var_names:
# adata.var["gene_symbols"] = adata.var_names # or set during load
library(Seurat)
library(dplyr)
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc, nfeatures = 2000)
pbmc <- ScaleData(pbmc)
pbmc <- RunPCA(pbmc)
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc <- RunUMAP(pbmc, dims = 1:10)
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE) %>%
dplyr::filter(avg_log2FC > 1)
Initializing the annotator
The initialization step validates your data and precomputes expression percentages. This can take a few minutes on large datasets but only needs to run once per object.
from cytetype import CyteType
annotator = CyteType(
adata,
group_key="leiden", # column in adata.obs with cluster labels
rank_key="rank_genes_leiden", # key in adata.uns with DE results
gene_symbols_column="gene_symbols", # column in adata.var with gene symbols
n_top_genes=50, # top marker genes per cluster
aggregate_metadata=True, # include obs metadata in context
min_percentage=10, # min % threshold for metadata
coordinates_key="X_umap", # key in adata.obsm for UMAP coordinates
max_cells_per_group=1000, # cells sampled per cluster for visualization
auth_token=None, # Bearer token (if your deployment requires one)
)
# PrepareCyteTypeR handles all preprocessing in one step
prepped_data <- PrepareCyteTypeR(
obj = seurat_obj,
marker_table = markers_df, # output of FindAllMarkers()
group_key = "seurat_clusters", # metadata column with cluster assignments
gene_symbols = "gene_symbols", # gene symbol field name
n_top_genes = 50, # top marker genes per cluster
aggregate_metadata = TRUE, # include metadata context
min_percentage = 10, # min % threshold for metadata
coordinates_key = "umap", # dimensional reduction for visualization
max_cells_per_group = 1000 # cells sampled per cluster
)
Running annotation
The study_context is the most important parameter for annotation quality. Describe your tissue, organism, disease state, and experimental setup in one or two sentences.
adata = annotator.run(
study_context="Human colorectal cancer biopsy, tumor microenvironment, 10X Genomics 5' scRNA-seq, treatment-naive patients",
metadata={
"Study": "My TME atlas",
"GEO": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123456",
"DOI": "https://doi.org/10.1038/example"
},
n_parallel_clusters=4, # increase for faster annotation (watch rate limits)
results_prefix="cytetype", # prefix for result columns
timeout_seconds=7200,
show_progress=True,
)
pbmc <- CyteTypeR(
obj = seurat_obj,
prepped_data = prepped_data,
study_context = "Human colorectal cancer biopsy, tumor microenvironment, treatment-naive patients",
metadata = list(
"Study" = "My TME atlas",
"GEO" = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123456",
"DOI" = "https://doi.org/10.1038/example"
),
n_parallel_clusters = 4L,
results_prefix = "cytetype",
timeout_seconds = 7200L,
show_progress = TRUE
)
Accessing results
After a successful run, annotations are stored in your object and can be accessed immediately.
# Annotation columns added to adata.obs:
# cytetype_annotation_leiden — cell type label
# cytetype_cellOntologyTerm_leiden — Cell Ontology term name
# cytetype_cellOntologyTermID_leiden — CL:xxxxxxx ID
# cytetype_cellState_leiden — functional state
import json
results = json.loads(adata.uns["cytetype_results"]["result"])
for ann in results["annotations"]:
print(ann["clusterId"], ann["annotation"], ann["ontologyTermID"])
# If you need to re-fetch results after disconnecting:
results = annotator.get_results()
# Annotation columns added to obj@meta.data:
# cytetype_annotation_seurat_clusters
# cytetype_cellOntologyTerm_seurat_clusters
# cytetype_cellOntologyTermID_seurat_clusters
# cytetype_cellState_seurat_clusters
# Full results table in obj@misc:
results_df <- seurat_obj@misc[["cytetype_results"]]
View(results_df)
# Available columns in results_df:
# clusterId, annotation, ontologyTerm, ontologyTermID,
# granularAnnotation, cellState, justification,
# supportingMarkers, conflictingMarkers, missingExpression, unexpectedExpression
# If you need to re-fetch results after disconnecting:
results <- GetResults(seurat_obj)
Custom LLM configuration
By default CyteType uses its own hosted model. You can bring your own LLM from any supported provider.
# Single LLM configuration
adata = annotator.run(
study_context="...",
llm_configs=[{
"provider": "openai",
"name": "gpt-4o",
"apiKey": "sk-...",
"baseUrl": "https://api.openai.com/v1", # optional
"modelSettings": {"temperature": 0.0, "max_tokens": 4096}
}]
)
# Multiple providers (different agents can use different models)
adata = annotator.run(
study_context="...",
llm_configs=[
{
"provider": "anthropic",
"name": "claude-3-5-sonnet-20241022",
"apiKey": "sk-ant-...",
"targetAgents": ["annotator", "reviewer"]
},
{
"provider": "openai",
"name": "gpt-4o-mini",
"apiKey": "sk-...",
"targetAgents": ["summarizer", "clinician"]
}
]
)
# AWS Bedrock
adata = annotator.run(
study_context="...",
llm_configs=[{
"provider": "bedrock",
"name": "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
"awsAccessKeyId": "AKIA...",
"awsSecretAccessKey": "...",
"awsDefaultRegion": "us-east-1"
}]
)
# Single LLM configuration
result <- CyteTypeR(
obj = seurat_obj,
prepped_data = prepped_data,
study_context = "...",
llm_configs = list(
provider = "openai",
name = "gpt-4o",
apiKey = "sk-...",
baseUrl = "https://api.openai.com/v1",
modelSettings = list(temperature = 0.0, max_tokens = 4096L)
)
)
# Multiple providers
result <- CyteTypeR(
obj = seurat_obj,
prepped_data = prepped_data,
study_context = "...",
llm_configs = list(
list(
provider = "anthropic",
name = "claude-3-5-sonnet-20241022",
apiKey = "sk-ant-...",
targetAgents = c("annotator", "reviewer")
),
list(
provider = "openai",
name = "gpt-4o-mini",
apiKey = "sk-...",
targetAgents = c("summarizer", "clinician")
)
)
)
Supported providers: anthropic, bedrock, fireworks, google, groq, huggingface, mistral, openai, openrouter, vertex, xai.
Authentication
For deployments that require a bearer token (private or enterprise instances):
# Pass at initialization (applies to all run() calls)
annotator = CyteType(adata, group_key="leiden", auth_token="your-token")
# Or override at run time
adata = annotator.run(study_context="...", auth_token="your-token")
# Pass auth_token to CyteTypeR
result <- CyteTypeR(
obj = seurat_obj,
prepped_data = prepped_data,
study_context = "...",
auth_token = "your-token"
)
# Or when re-fetching results
results <- GetResults(seurat_obj, auth_token = "your-token")
Understanding Your Report
Every CyteType report is structured around the same set of sections. Each answers a specific question your biology team will ask when reviewing annotations.
Ontology-anchored annotation
What it is: Each cluster is assigned a Cell Ontology (CL) term — a globally standardized identifier for cell types — alongside a confidence score and a label match score comparing the CyteType call to any labels you already had.
How to read it: The CL ID links to the official ontology definition. Confidence (0–1) reflects model certainty. Label match shows alignment with your prior annotation if you provided cluster labels.
When to act: Low confidence (<0.6) or low label match when you trust your prior labels suggests the cluster may need manual review or re-clustering.
Functional state resolution
What it is: Cell states (activation, exhaustion, ECM remodeling, antigen presentation, etc.) are resolved as distinct gene programs separate from the cell type label.
How to read it: The cell state field reports co-occurring functional programs. Multiple states can be active simultaneously. Each is supported by a specific gene set.
When to act: Unexpected states — exhaustion in naive T cells, proliferation in quiescent stromal cells — should prompt review of cluster composition. These are often the most biologically interesting signals.
Coarse lineage map
What it is: Clusters are grouped into major lineages (myeloid, lymphoid, epithelial, stromal, endothelial, etc.) for rapid high-level orientation before diving into subtypes.
How to read it: Use this for first-pass triage. Lineage groups reflect broad biology before subtype resolution. Confidence indicators show certainty at the lineage level.
When to act: Clusters misassigned to an unexpected lineage may indicate doublets, ambient RNA contamination, or marker gene quality issues.
Marker-level evidence
What it is: The supporting, missing, and unexpected gene breakdown for each annotation call. Each gene is annotated with its biological role and linked to published evidence.
How to read it:
- Unexpected markers are genes inconsistent with the label (potential contamination or annotation error)
- Missing markers are canonical genes expected but not detected (may indicate incomplete capture or a cell subtype)
- Supporting markers positively support the assigned cell type
When to act: Many unexpected markers or absent canonical markers alongside low confidence is a strong signal to question the call.
Confidence and heterogeneity QC
What it is: Badges summarizing certainty and intra-cluster diversity, with narrative reasoning describing what is solid versus mixed.
How to read it: Confidence reflects how strongly the evidence points to the assigned cell type. Heterogeneity reflects how mixed the cluster is internally.
When to act: High heterogeneity clusters are candidates for re-clustering. The narrative often names which marker groups are causing the mixed signal.
Multi-expert synthesis
What it is: Several specialized AI reviewers independently assess each annotation. Their agreements, disagreements, and alternative hypotheses are surfaced before the final label is locked.
How to read it: Reviewer consensus strengthens confidence. Disagreements are not errors — they reflect genuine ambiguity in the data. Alternative hypotheses are ordered by plausibility.
When to act: When reviewers disagree on the top call, treat the listed alternatives as equally valid candidates to investigate with orthogonal methods.
Study-aware context
What it is: The annotation model is grounded in your study_context. Disease-specific, tissue-specific, and organism-specific knowledge shapes which cell types are considered and how markers are weighted.
How to read it: Context fit shows how well the assigned label fits the biological framing of your study. Keywords extracted from the study context are shown for transparency.
When to act: If labels feel generic or off-target, revisit your study_context. A vague context ("human cells") produces generic labels. A specific context ("inflamed synovium from RA patients, synovectomy samples") produces disease-relevant annotations.
Ranked pathway signals
What it is: GO and WikiPathways enrichment results for each cluster, ranked by NES (Normalized Enrichment Score). Both up-regulated and down-regulated programs are shown.
How to read it: NES indicates strength and direction of pathway activation. Focus on pathways with |NES| > 1.5 for mechanistic interpretation. Use this to connect cell identity to cellular programs.
When to act: Unexpected pathway activation may indicate contamination, a stress response, or a biologically interesting subpopulation worth further investigation.
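The |NES| > 1.5 rule of thumb can be applied directly to an exported enrichment table. A minimal sketch over made-up pathway rows (names and scores are illustrative):

```python
# Hypothetical (pathway, NES) rows from a cluster's enrichment results.
pathways = [
    ("Interferon alpha/beta signaling", 2.1),
    ("Oxidative phosphorylation", -1.8),
    ("Ribosome biogenesis", 0.9),
]

# Keep strongly up- or down-regulated programs for mechanistic interpretation.
strong = [(name, nes) for name, nes in pathways if abs(nes) > 1.5]
```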
Linked citations trail
What it is: Inline PubMed citations link specific gene claims to published literature. Hover previews and one-click links allow rapid verification without leaving the report.
How to read it: Every marker-gene explanation is backed by a citation. The source publication, journal, and year are shown in the preview.
When to act: In regulated environments, the citation trail is part of your audit record. Flag citations from out-of-context tissues or organisms for manual verification.
Decision traceability
What it is: The full candidate evaluation funnel — what cell types were considered, why each was accepted or rejected, and the quantitative evidence (Log2FC, expression %, tissue context) driving each decision.
How to read it: Accepted candidates met the evidence threshold. Rejected candidates are listed with the reason for rejection. Scores close to the winning label indicate genuine ambiguity.
When to act: When the final label surprises you, check the rejected candidate table. If a rejected candidate had a score within 0.1 of the winner, investigate both labels before accepting the call.
Interactive Cluster Copilot
What it is: A built-in chat interface connected to that cluster's expression data and pathway context. Ask biological questions and get data-grounded answers.
How to use it: Ask it questions like: "Why does this cluster express CXCL13?" or "What differentiates this cluster from the neighboring CD8 T cells?". Answers are grounded in your actual expression matrix, not generic responses.
When to use it: Use this to resolve interpretation questions in-context before escalating to manual analysis or posting to a lab meeting.
Audit-ready export
What it is: A full annotation table exportable as CSV/TSV, containing all fields needed for downstream analysis and regulatory documentation.
Exported columns: cluster ID, CL term name, CL term ID, cell type, granular annotation, cell state, confidence score, label match score, supporting markers, conflicting markers.
When to use it: Use the export for integration into adata.obs or obj@meta.data, consortium data submissions, and clinical/regulatory sign-off packages.
How CyteType Works
The multi-agent pipeline
CyteType is not a single LLM call with a prompt. It is a coordinated pipeline of specialized agents that collaborate to produce a single annotation with a full evidence trail.
Data preparation (local): Before anything is sent to the API, the client SDK preprocesses your object locally: extracting top marker genes per cluster, computing expression percentages across all genes, aggregating observation metadata per cluster, and sampling UMAP coordinates for visualization. This preprocessing runs in your environment and does not require a network connection.
Artifact generation: Two artifacts are created and uploaded alongside the annotation request:
- obs.duckdb — a DuckDB database of cell metadata, powering metadata filtering and exploration in the report
- vars.h5 — a compressed HDF5 file of the normalized expression matrix, used by the server for on-demand gene expression lookups in the interactive report
Agent roles: Six specialized agents run per cluster:
| Agent | Role |
|---|---|
| Contextualizer | Frames the biological context from study_context, cluster metadata, and marker expression before annotation begins |
| Annotator | Proposes candidate cell types using markers, expression percentages, and Cell Ontology knowledge |
| Reviewer | Multiple independent reviewers evaluate each candidate, surfacing strengths, weaknesses, and alternatives |
| Summarizer | Synthesizes reviewer outputs into a final annotation with confidence scores |
| Clinician | Applies disease-context validation to catch biologically implausible calls |
| Chat | Powers the Cluster Copilot, staying connected to the cluster's expression data |
Cell Ontology mapping: Every annotation is mapped to a Cell Ontology (CL) term. CL is a community-maintained, hierarchical vocabulary for cell types. CL IDs enable cross-study comparison, downstream ontology-based analyses (e.g. enrichment against cell type databases), and regulatory traceability. The ID format is CL:xxxxxxx.
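Because CL IDs follow a fixed pattern ("CL:" plus seven digits), they are easy to validate before feeding into downstream ontology tooling. A small sketch (the helper name is my own):

```python
import re

# Cell Ontology term IDs: "CL:" followed by exactly seven digits.
CL_ID_RE = re.compile(r"^CL:\d{7}$")

def is_valid_cl_id(term_id: str) -> bool:
    """Return True if term_id is a well-formed Cell Ontology ID."""
    return bool(CL_ID_RE.match(term_id))
```

For example, `is_valid_cl_id("CL:0000084")` (T cell) is true, while a GO ID or a truncated CL ID is rejected.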
LLM infrastructure: Each cluster requires hundreds of LLM calls. The CyteType API handles rate limit management, automatic retries, health-aware model fallbacks, and parallel cluster processing. The n_parallel_clusters parameter controls how many clusters are annotated simultaneously.
Data flow summary
Your object (AnnData / Seurat)
↓ SDK preprocessing (local)
marker genes, expression %, metadata, UMAP coords
↓ Artifact upload
vars.h5 (expression matrix) + obs.duckdb (cell metadata)
↓ /annotate API call
payload: study context, markers, expression, metadata
↓ Multi-agent pipeline (per cluster, in parallel)
Contextualizer → Annotator → Reviewer × N → Summarizer → Clinician
↓ /results fetch
annotations, ontology terms, confidence scores, evidence
↓ Results written back to your object
adata.obs / obj@meta.data + adata.uns / obj@misc
↓ Interactive HTML report
live at prod.cytetype.nygen.io/report/{job_id}
FAQ
General
What is CyteType?
CyteType is a multi-agent AI system for automated annotation of single-cell RNA-seq data. Several agents independently evaluate marker expression, reference similarity, ontology structure, and literature context. Their outputs are merged into a final annotation with confidence scores and traceable reasoning. The full method is described in the CyteType preprint (Ahuja G et al., bioRxiv 2025).
How is CyteType different from existing tools?
Traditional approaches depend on a single reference or a set of marker genes. CyteType instead integrates several biological signals through a structured agent workflow, leading to higher robustness in rare, transitional, or disease-associated cell populations.
Why is the multi-agent workflow important?
The performance gains originate from the workflow rather than the LLM tier. Each agent contributes a different biological perspective, and a reconciliation step produces a stable, evidence-supported annotation.
Who developed CyteType?
CyteType was created by Nygen Analytics, a research-focused biotech company in Sweden working on AI systems for single-cell omics.
Technical requirements
Which programming environments are supported?
CyteType is available in Python for AnnData/Scanpy workflows and in R (CyteTypeR) for Seurat.
How do I install CyteType?
Python: pip install cytetype (Python ≥ 3.11). R: devtools::install_github("NygenAnalytics/CyteTypeR").
Which input formats can I use?
AnnData and Seurat objects. Standard Scanpy or Seurat preprocessing workflows do not require reformatting.
What resources are needed to run CyteType?
CyteType requires internet connectivity for LLM-based annotation. Jobs up to approximately 500,000 cells per request are supported. Larger datasets are automatically batched.
Dataset preparation
Does CyteType compute marker genes?
CyteType expects user-provided cluster marker genes as priors. Marker selection is often study-specific, so the tool does not override user-defined markers. CyteType supplements these priors by analysing pseudobulked cluster profiles to identify additional genes outside the provided list.
Why do I need to compute markers beforehand?
Marker gene computation is left to the user to maintain compatibility with different preprocessing workflows and clustering strategies. CyteType uses these markers as a structured informational anchor during reasoning.
Why does CyteType pseudobulk clusters?
Pseudobulk profiles allow CyteType to evaluate genes that are not included in user-provided marker lists. This expands the evidence base for annotation and improves accuracy in transitional or poorly characterised populations.
Species and tissue support
Which species can be annotated?
Human and mouse datasets are fully supported. Other species including rat and zebrafish can be analysed through ortholog mapping, subject to gene homology quality.
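Ortholog mapping is conceptually a symbol lookup. The table below is a tiny hand-rolled example for illustration only; real workflows draw on a full homology resource such as Ensembl Compara or biomaRt:

```python
# Minimal mouse-to-human ortholog table (illustrative entries only).
ortholog_map = {"Cd4": "CD4", "Ms4a1": "MS4A1", "Ncr1": "NCR1"}

mouse_genes = ["Cd4", "Ms4a1", "Xyz1"]  # Xyz1: no ortholog in this table

# Map to human symbols, keeping unmapped names unchanged.
human_genes = [ortholog_map.get(g, g) for g in mouse_genes]
```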
Are tissue-specific reference sets included?
Yes. Reference sets optimised for PBMC, brain, liver, kidney, lung, and pancreas are provided. Custom references can also be supplied.
Does CyteType annotate disease samples?
Yes. Because the method does not rely solely on healthy-tissue references, it performs well on tumour samples, perturbed systems, and transitional states.
Performance and reproducibility
How accurate is CyteType?
CyteType was benchmarked across 20 datasets and 977 clusters. Using the CyteOnto semantic similarity metric against author-assigned labels, CyteType achieved up to 3.8-fold improvement over GPTCellType, 2.68-fold over CellTypist, and 1.01-fold over SingleR. Forty-one percent of clusters received enhanced functional annotation and twenty-nine percent received refined subtype resolution relative to author labels.
How long does annotation take?
Typical runtime is two to three minutes per cluster. A dataset of about fifteen clusters usually completes in thirty to forty-five minutes. The system supports high concurrency for large studies.
Are annotations assigned at the cell or cluster level?
CyteType assigns annotations at the cluster level and propagates them to individual cells. This balances computational efficiency with biological resolution.
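The propagation step is a simple lookup from cluster ID to label. A minimal sketch with made-up labels:

```python
# Hypothetical cluster-level annotations keyed by cluster ID.
cluster_labels = {"0": "CD4-positive T cell", "1": "B cell", "2": "NK cell"}

# Per-cell cluster assignments, e.g. from adata.obs["leiden"].
cell_clusters = ["0", "0", "1", "2", "1"]

# Each cell inherits its cluster's annotation.
cell_labels = [cluster_labels[c] for c in cell_clusters]
```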
Why are repeated runs similar but not identical?
CyteType relies on LLMs, which introduce controlled stochastic variation. Textual reasoning traces may differ across runs, but the method converges to stable biological annotations. Variation occurs at the token level, not at the interpretive level. Archive query.json for full reproducibility records.
How should confidence scores be interpreted?
Confidence values are provided for cell type, subtype, and activation state. Scores above 0.8 generally indicate high reliability. Lower scores mark ambiguous or poorly supported populations that warrant manual review.
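Applying the 0.8 guideline, low-confidence clusters can be collected for manual review. A sketch over hypothetical result rows (the field names here are assumptions for illustration, not a guaranteed schema):

```python
# Hypothetical per-cluster results with confidence scores in [0, 1].
results = [
    {"clusterId": "0", "annotation": "CD4 T cell", "confidence": 0.92},
    {"clusterId": "1", "annotation": "B cell",     "confidence": 0.55},
    {"clusterId": "2", "annotation": "NK cell",    "confidence": 0.81},
]

# Clusters below the 0.8 guideline warrant manual review.
needs_review = [r["clusterId"] for r in results if r["confidence"] < 0.8]
```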
Outputs and evidence
What does CyteType return?
CyteType adds annotation metadata to the input object, including labels, ontology terms, confidence scores, and evidence summaries. A standalone HTML report provides interactive reasoning, citations, and interpretation tools.
How does Cell Ontology linking work?
CyteType maps annotations to official Cell Ontology (CL) terms. When subtype precision is limited, it proposes the closest parent term and includes ranked alternatives.
How is literature evidence generated?
The Literature and Context agent uses LLM-assisted retrieval to identify relevant publications and summarise supporting evidence. Citations are included for independent verification.
What does anomaly detection identify?
Clusters with unexpected marker signatures are flagged. These can represent doublets, low-quality populations, or potentially novel biological states.
Does the 1000-cell UMAP display affect annotation?
No. UMAP subsampling is purely for rendering. All cells contribute to cluster-level pseudobulk profiles, and annotations are based on full data.
Can CyteType restrict reasoning to specific tissues or lineages?
CyteType does not enforce strict tissue constraints. Reasoning is influenced by the study_context provided by the user, but agents are free to consider the full biological landscape. This avoids prematurely excluding plausible alternatives.
Data privacy and retention
What data is retained?
Input files and intermediate representations are stored only to support report generation and the report's interactive chat. Nygen does not access or reuse user data for model development.
How long are reports stored?
Reports remain available by default. A report management system is being introduced that will allow users to define retention periods and delete reports directly.
Is user data used for training?
No. User data is not used for training or benchmarking. Only system logs and metadata are monitored to ensure stability.
Are third-party LLM providers involved?
CyteType uses providers such as Anthropic and xAI. Some operate under zero-retention agreements. Nygen does not allow providers to store user data.
Can I delete my report?
Yes. Users may request deletion by providing the job ID to support@nygen.io. Self-service deletion will be available with the new report management backend.
What security infrastructure is used?
CyteType runs on infrastructure aligned with ISO 27001 and SOC 2 controls. Data is encrypted in transit and at rest, with strict tenant isolation.
Can I use CyteType with sensitive patient data?
For regulated environments, contact contact@nygen.io about enterprise options: on-premises deployment with customer-managed LLMs, isolated storage, and zero data retention policies. The default cloud deployment should not be used with identifiable patient data without explicit data processing agreements.
Pricing and access
Is CyteType free for academic use?
Yes. Academic and non-commercial use is free, with a limit of three annotation runs per day. CyteType is licensed under CC BY-NC-SA 4.0.
How can I exceed the daily limit?
Users may supply their own LLM API keys or run local LLMs. Provider rate limits may constrain concurrency to three to five clusters in parallel.
How is commercial use licensed?
Licensing includes an annual fee and a per-cluster annotation cost. Nygen offers Discovery, Enterprise, and Partnership tiers. Contact contact@nygen.io for details.
Are enterprise deployments available?
Yes. Options include private cloud, AWS Bedrock integration, and on-premises installations supporting air-gapped environments and local LLMs.
Workflow integration
Does CyteType fit into existing Scanpy and Seurat workflows?
Yes. CyteType writes outputs directly into object metadata (adata.obs/obj@meta.data). Existing downstream pipelines remain unchanged.
What preprocessing is required?
Standard QC, normalisation, and clustering. Apply batch correction (such as Harmony or Scanorama) when working with multi-batch datasets before submission.
Can I use custom marker lists or references?
Yes. Custom marker genes and reference datasets can be supplied for specialised tissues or perturbation studies.
Does CyteType support multimodal data?
CyteType currently expects transcriptomic input. Support for multimodal and CITE-seq data is under evaluation.
How do I add custom metadata to my report?
Pass metadata as a dict (Python) or named list (R) to the run step. Keys appear as headers in the report. Values that look like URLs are rendered as clickable links — useful for linking GEO accessions, DOIs, or internal data portals.
Troubleshooting
What if I disagree with an annotation?
The HTML report allows manual override, re-annotation requests, and reasoning queries via the Cluster Copilot. All changes are logged.
How does CyteType address ambiguous populations?
The Reviewer agent highlights ambiguity and provides ranked alternative hypotheses with supporting evidence. Low confidence scores indicate clusters that warrant closer inspection.
My job is taking a long time. Is it stuck?
Jobs continue running on the server even after your local session disconnects. Use annotator.get_results() (Python) or GetResults(obj) (R) to retrieve results after reconnecting. The report URL printed during submission updates live — check it to see per-cluster progress.
The artifact upload is failing. What should I do?
Set require_artifacts=False to skip artifact uploading and proceed with annotation-only mode. Annotation will still complete; the interactive report will lack expression lookups and metadata filtering. For large datasets, load AnnData in backed mode: sc.read_h5ad("file.h5ad", backed="r").
Does CyteType detect doublets or low-quality cells?
Potential doublets and low-quality populations are flagged through confidence scoring. Removal should occur during preprocessing before submission.
Is spatial transcriptomics supported?
Spatial support is in development and will be released after validation.
Can I use a local model with Ollama?
Yes. Expose your local Ollama instance using ngrok, then pass the public URL as baseUrl with provider="openai". The model must support tool calling. See the Ollama integration guide for step-by-step instructions.
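A hedged sketch of what the corresponding llm_configs entry could look like; the model name and tunnel URL are placeholders, and the Ollama integration guide remains the authoritative recipe:

```python
# Ollama exposes an OpenAI-compatible endpoint at /v1, so it is
# configured with provider "openai" and the tunnel URL as baseUrl.
llm_configs = [{
    "provider": "openai",
    "name": "llama3.1:70b",                              # placeholder local model
    "apiKey": "ollama",                                  # Ollama ignores the key value
    "baseUrl": "https://your-tunnel.ngrok-free.app/v1",  # placeholder ngrok URL
}]
```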
Updates and support
How often are databases and references updated?
Cell Ontology updates are synced quarterly. Literature context is refreshed regularly. Workflow improvements are deployed automatically.
How do I cite CyteType in my publication?
Please cite the bioRxiv preprint: Ahuja G et al., Multi-agent AI enables evidence-based cell annotation in single-cell transcriptomics, bioRxiv 2025. doi:10.1101/2025.11.06.686964
Where can I get support?
GitHub: github.com/NygenAnalytics/CyteType — Discord: discord.gg/V6QFM4AN — Support: support@nygen.io
Is there a tutorial or interactive notebook?
Yes. A Google Colab tutorial is available for an interactive end-to-end walkthrough: Open in Colab