CyteType Documentation
Everything you need to annotate single-cell data with CyteType — from installation to interpreting your report.
Getting Started
Prerequisites
Before running CyteType, make sure your data meets these requirements.
Data requirements:
- Gene symbols, not Ensembl IDs, in your feature names
- Differential expression results computed per cluster
- Clustering results stored in your object (Leiden, Louvain, or Seurat clusters)
- Normalized gene expression data (log1p-normalization recommended)
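Since CyteType expects gene symbols rather than Ensembl IDs, it can be worth checking feature names before submission. A minimal stdlib sketch (the pattern and function name are illustrative, not part of the CyteType API):

```python
import re

# Ensembl gene IDs look like ENSG00000141510 (human) or
# ENSMUSG00000026581 (mouse): "ENS", optional species code, "G", 11 digits.
ENSEMBL_GENE_RE = re.compile(r"^ENS[A-Z]*G\d{11}$")

def looks_like_ensembl(name: str) -> bool:
    """Return True if a feature name appears to be an Ensembl gene ID."""
    return bool(ENSEMBL_GENE_RE.match(name))

# Flag feature names that should be converted to symbols before running CyteType.
features = ["CD4", "ENSG00000141510", "MS4A1"]
ensembl_like = [f for f in features if looks_like_ensembl(f)]
```

If any names match, convert them to symbols (or point CyteType at a gene-symbols column) before annotating.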
Python prerequisites:
- A preprocessed AnnData object with sc.tl.rank_genes_groups results
- Python ≥ 3.12
R prerequisites:
- devtools installed for GitHub installation
- A Seurat object with FindAllMarkers() output
- R ≥ 4.1.0
Installation
Install the CyteType client for your preferred environment.
pip install cytetype
install.packages("devtools")
library(devtools)
install_github("NygenAnalytics/CyteTypeR")
Quick Start
A minimal end-to-end run. No API key required for the default configuration.
import scanpy as sc
from cytetype import CyteType
# Load your preprocessed AnnData
# adata must have clusters in adata.obs and rank_genes_groups in adata.uns
group_key = "leiden"
annotator = CyteType(
adata,
group_key=group_key,
rank_key="rank_genes_" + group_key,
n_top_genes=50,
)
adata = annotator.run(
study_context="Human PBMC from healthy donor, 10X Genomics 3' scRNA-seq"
)
# Annotations are now in adata.obs
sc.pl.umap(adata, color=f"cytetype_annotation_{group_key}")
library(Seurat)
library(CyteTypeR)
# Find markers (if not already done)
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)
# Step 1: prepare data
prepped_data <- PrepareCyteTypeR(
obj = pbmc,
marker_table = pbmc.markers,
group_key = "seurat_clusters",
n_top_genes = 50,
coordinates_key = "umap"
)
# Step 2: submit and annotate
pbmc <- CyteTypeR(
obj = pbmc,
prepped_data = prepped_data,
study_context = "Human PBMC from healthy donor, 10X Genomics 3' scRNA-seq"
)
# Annotations are now in obj@meta.data
DimPlot(pbmc, group.by = "cytetype_annotation_seurat_clusters")
A link to your interactive HTML report is printed during the run. Results are also written directly back to your object.
Dashboard
The CyteType Dashboard is a web interface for managing API tokens and annotation reports. Navigate to /dashboard on your CyteType server.
Workspaces
The dashboard supports two workspaces, selectable from the switcher at the top of the sidebar.
Personal — Your free individual account. Tokens and reports here are private by default and subject to the free-tier daily limit (three annotation runs per day).
Organization — Available if your account is linked to an organization license. Switching to this workspace shows tokens and reports scoped to your organization, with shared quota, org-wide visibility options, and team administration features.
API Tokens
API tokens authenticate your requests to the CyteType API. Open the API Tokens section from the sidebar navigation.
Creating a token:
- Click Create New Token.
- Enter a descriptive name for the token (e.g. Production Server or Local notebook).
- Optionally set a Quota Limit — the maximum number of clusters this token is allowed to annotate. Leave empty for no limit.
- Click Create Token.
The token value is displayed in the table immediately. Copy it and store it securely — it is used as the auth_token parameter in the Python and R clients:
annotator = CyteType(adata, group_key="leiden", auth_token="your-token-here")
result <- CyteTypeR(
obj = seurat_obj,
prepped_data = prepped_data,
study_context = "...",
auth_token = "your-token-here"
)
Managing existing tokens:
The token table shows each token's status, total jobs run, clusters annotated, quota, and creation date.
- To disable a token permanently, click Disable. Disabled tokens cannot be re-enabled.
- To copy a token value, click the copy button beside it.
- To rename a token or update its quota, click the edit icon next to the name or quota field.
Use the Sort dropdown to order tokens by date created, name, usage, quota, or status. The filter pills toggle between showing only active tokens and all tokens including disabled ones.
Organization admins:
In the organization workspace, admins see all tokens created across the organization. Admins can create tokens on behalf of other members by entering a member's email address in the User Email field when creating a token.
Members who are not admins cannot create tokens in the organization workspace. Contact your organization admin to request one.
Reports
The Reports section lists annotation jobs submitted under your account. Two tabs are available:
- Reports shared with me — Jobs from other organization members made visible to you (organization workspace only).
- My Reports — Jobs you own.
Controlling visibility:
Each report has a visibility setting. Click the edit icon in the Visibility column to change it.
| Setting | Who can access the report |
|---|---|
| Private | Only you and users you have explicitly shared with |
| Organization | All members of your organization |
| Public | Anyone with the report link |
Sharing a report with specific users (organization workspace):
- Select one or more reports using their checkboxes.
- Click Share in the action bar.
- Enter each person's email address and click Add.
- Click Save Changes.
You can also click the value in the Shared With column of a single report to open the share dialog for that report directly.
When sharing multiple reports at once, new users are added to each report's existing share list without removing anyone already shared. When editing a single report's shares, the list is replaced entirely — remove users by deleting them from the list before saving.
Archiving reports:
Archiving hides reports from the default view without deleting them. To archive one or more reports, select them and click Archive. To view archived reports, enable Show archived above the table. Select an archived report and click Unarchive to restore it.
Transferring ownership (organization workspace):
To transfer one or more reports to another organization member:
- Select the reports.
- Click Transfer in the action bar.
- Enter the new owner's email address.
- Click Transfer All.
Transfer is permanent. The receiving user must be a member of the same organization.
Client Reference
Preprocessing your data
CyteType requires differential expression results computed per cluster before submission.
import scanpy as sc
# Assumes adata already has normalized counts in adata.X
# and cluster labels in adata.obs["leiden"]
sc.tl.rank_genes_groups(
adata,
groupby="leiden",
method="wilcoxon",
key_added="rank_genes_leiden",
n_genes=100,
)
# If gene symbols are stored separately from var_names:
# adata.var["gene_symbols"] = adata.var_names # or set during load
library(Seurat)
library(dplyr)
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc, nfeatures = 2000)
pbmc <- ScaleData(pbmc)
pbmc <- RunPCA(pbmc)
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc <- RunUMAP(pbmc, dims = 1:10)
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE) %>%
dplyr::filter(avg_log2FC > 1)
Initializing the annotator
The initialization step validates your data and precomputes expression percentages. This can take a few minutes on large datasets but only needs to run once per object.
from cytetype import CyteType
annotator = CyteType(
adata,
group_key="leiden", # column in adata.obs with cluster labels
rank_key="rank_genes_leiden", # key in adata.uns with DE results
gene_symbols_column="gene_symbols", # column in adata.var with gene symbols
n_top_genes=50, # top marker genes per cluster
aggregate_metadata=True, # include obs metadata in context
min_percentage=10, # min % threshold for metadata
coordinates_key="X_umap", # key in adata.obsm for UMAP coordinates
max_cells_per_group=1000, # cells sampled per cluster for visualization
auth_token=None, # Bearer token (if your deployment requires one)
)
# PrepareCyteTypeR handles all preprocessing in one step
prepped_data <- PrepareCyteTypeR(
obj = seurat_obj,
marker_table = markers_df, # output of FindAllMarkers()
group_key = "seurat_clusters", # metadata column with cluster assignments
gene_symbols = "gene_symbols", # gene symbol field name
n_top_genes = 50, # top marker genes per cluster
aggregate_metadata = TRUE, # include metadata context
min_percentage = 10, # min % threshold for metadata
coordinates_key = "umap", # dimensional reduction for visualization
max_cells_per_group = 1000 # cells sampled per cluster
)
Running annotation
The study_context is the most important parameter for annotation quality. Describe your tissue, organism, disease state, and experimental setup in one or two sentences.
adata = annotator.run(
study_context="Human colorectal cancer biopsy, tumor microenvironment, 10X Genomics 5' scRNA-seq, treatment-naive patients",
metadata={
"Study": "My TME atlas",
"GEO": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123456",
"DOI": "https://doi.org/10.1038/example"
},
n_parallel_clusters=4, # increase for faster annotation (watch rate limits)
results_prefix="cytetype", # prefix for result columns
timeout_seconds=7200,
show_progress=True,
)
pbmc <- CyteTypeR(
obj = seurat_obj,
prepped_data = prepped_data,
study_context = "Human colorectal cancer biopsy, tumor microenvironment, treatment-naive patients",
metadata = list(
"Study" = "My TME atlas",
"GEO" = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123456",
"DOI" = "https://doi.org/10.1038/example"
),
n_parallel_clusters = 4L,
results_prefix = "cytetype",
timeout_seconds = 7200L,
show_progress = TRUE
)
Accessing results
After a successful run, annotations are stored in your object and can be accessed immediately.
# Annotation columns added to adata.obs:
# cytetype_annotation_leiden — cell type label
# cytetype_cellOntologyTerm_leiden — Cell Ontology term name
# cytetype_cellOntologyTermID_leiden — CL:xxxxxxx ID
# cytetype_cellState_leiden — functional state
import json
results = json.loads(adata.uns["cytetype_results"]["result"])
for ann in results["annotations"]:
print(ann["clusterId"], ann["annotation"], ann["ontologyTermID"])
# If you need to re-fetch results after disconnecting:
results = annotator.get_results()
# Annotation columns added to obj@meta.data:
# cytetype_annotation_seurat_clusters
# cytetype_cellOntologyTerm_seurat_clusters
# cytetype_cellOntologyTermID_seurat_clusters
# cytetype_cellState_seurat_clusters
# Full results table in obj@misc:
results_df <- seurat_obj@misc[["cytetype_results"]]
View(results_df)
# Available columns in results_df:
# clusterId, annotation, ontologyTerm, ontologyTermID,
# granularAnnotation, cellState, justification,
# supportingMarkers, conflictingMarkers, missingExpression, unexpectedExpression
# If you need to re-fetch results after disconnecting:
results <- GetResults(seurat_obj)
Custom LLM configuration
By default CyteType uses its own hosted model. You can bring your own LLM from any supported provider.
# Single LLM configuration
adata = annotator.run(
study_context="...",
llm_configs=[{
"provider": "openai",
"name": "gpt-4o",
"apiKey": "sk-...",
"baseUrl": "https://api.openai.com/v1", # optional
"modelSettings": {"temperature": 0.0, "max_tokens": 4096}
}]
)
# Multiple providers (different agents can use different models)
adata = annotator.run(
study_context="...",
llm_configs=[
{
"provider": "anthropic",
"name": "claude-3-5-sonnet-20241022",
"apiKey": "sk-ant-...",
"targetAgents": ["annotator", "reviewer"]
},
{
"provider": "openai",
"name": "gpt-4o-mini",
"apiKey": "sk-...",
"targetAgents": ["summarizer", "clinician"]
}
]
)
# AWS Bedrock
adata = annotator.run(
study_context="...",
llm_configs=[{
"provider": "bedrock",
"name": "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
"awsAccessKeyId": "AKIA...",
"awsSecretAccessKey": "...",
"awsDefaultRegion": "us-east-1"
}]
)
# Single LLM configuration
result <- CyteTypeR(
obj = seurat_obj,
prepped_data = prepped_data,
study_context = "...",
llm_configs = list(
provider = "openai",
name = "gpt-4o",
apiKey = "sk-...",
baseUrl = "https://api.openai.com/v1",
modelSettings = list(temperature = 0.0, max_tokens = 4096L)
)
)
# Multiple providers
result <- CyteTypeR(
obj = seurat_obj,
prepped_data = prepped_data,
study_context = "...",
llm_configs = list(
list(
provider = "anthropic",
name = "claude-3-5-sonnet-20241022",
apiKey = "sk-ant-...",
targetAgents = c("annotator", "reviewer")
),
list(
provider = "openai",
name = "gpt-4o-mini",
apiKey = "sk-...",
targetAgents = c("summarizer", "clinician")
)
)
)
Supported providers: anthropic, bedrock, fireworks, google, groq, huggingface, mistral, openai, openrouter, vertex, xai.
Authentication
For deployments that require a bearer token (private or enterprise instances):
# Pass at initialization (applies to all run() calls)
annotator = CyteType(adata, group_key="leiden", auth_token="your-token")
# Or override at run time
adata = annotator.run(study_context="...", auth_token="your-token")
# Pass auth_token to CyteTypeR
result <- CyteTypeR(
obj = seurat_obj,
prepped_data = prepped_data,
study_context = "...",
auth_token = "your-token"
)
# Or when re-fetching results
results <- GetResults(seurat_obj, auth_token = "your-token")
Understanding Your Report
Every CyteType report is structured around the same set of sections. Each answers a specific question your biology team will ask when reviewing annotations.
Ontology-anchored annotation
What it is: Each cluster is assigned a Cell Ontology (CL) term — a globally standardized identifier for cell types — alongside a confidence score and a label match score comparing the CyteType call to any labels you already had.
How to read it: The CL ID links to the official ontology definition. Confidence (0–1) reflects model certainty. Label match shows alignment with your prior annotation if you provided cluster labels.
When to act: Low confidence (<0.6) or low label match when you trust your prior labels suggests the cluster may need manual review or re-clustering.
Functional state resolution
What it is: Cell states (activation, exhaustion, ECM remodeling, antigen presentation, etc.) are resolved as distinct gene programs separate from the cell type label.
How to read it: The cell state field reports co-occurring functional programs. Multiple states can be active simultaneously. Each is supported by a specific gene set.
When to act: Unexpected states — exhaustion in naive T cells, proliferation in quiescent stromal cells — should prompt review of cluster composition. These are often the most biologically interesting signals.
Coarse lineage map
What it is: Clusters are grouped into major lineages (myeloid, lymphoid, epithelial, stromal, endothelial, etc.) for rapid high-level orientation before diving into subtypes.
How to read it: Use this for first-pass triage. Lineage groups reflect broad biology before subtype resolution. Confidence indicators show certainty at the lineage level.
When to act: Clusters misassigned to an unexpected lineage may indicate doublets, ambient RNA contamination, or marker gene quality issues.
Marker-level evidence
What it is: The supporting, missing, and unexpected gene breakdown for each annotation call. Each gene is annotated with its biological role and linked to published evidence.
How to read it:
- Unexpected markers are genes inconsistent with the label (potential contamination or annotation error)
- Missing markers are canonical genes expected but not detected (may indicate incomplete capture or a cell subtype)
- Supporting markers positively support the assigned cell type
When to act: Many unexpected markers or absent canonical markers alongside low confidence is a strong signal to question the call.
Confidence and heterogeneity QC
What it is: Badges summarizing certainty and intra-cluster diversity, with narrative reasoning describing what is solid versus mixed.
How to read it: Confidence reflects how strongly the evidence points to the assigned cell type. Heterogeneity reflects how mixed the cluster is internally.
When to act: High heterogeneity clusters are candidates for re-clustering. The narrative often names which marker groups are causing the mixed signal.
Multi-expert synthesis
What it is: Several specialized AI reviewers independently assess each annotation. Their agreements, disagreements, and alternative hypotheses are surfaced before the final label is locked.
How to read it: Reviewer consensus strengthens confidence. Disagreements are not errors — they reflect genuine ambiguity in the data. Alternative hypotheses are ordered by plausibility.
When to act: When reviewers disagree on the top call, treat the listed alternatives as equally valid candidates to investigate with orthogonal methods.
Study-aware context
What it is: The annotation model is grounded in your study_context. Disease-specific, tissue-specific, and organism-specific knowledge shapes which cell types are considered and how markers are weighted.
How to read it: Context fit shows how well the assigned label fits the biological framing of your study. Keywords extracted from the study context are shown for transparency.
When to act: If labels feel generic or off-target, revisit your study_context. A vague context ("human cells") produces generic labels. A specific context ("inflamed synovium from RA patients, synovectomy samples") produces disease-relevant annotations.
Ranked pathway signals
What it is: GO and WikiPathways enrichment results for each cluster, ranked by NES (Normalized Enrichment Score). Both up-regulated and down-regulated programs are shown.
How to read it: NES indicates strength and direction of pathway activation. Focus on pathways with |NES| > 1.5 for mechanistic interpretation. Use this to connect cell identity to cellular programs.
When to act: Unexpected pathway activation may indicate contamination, a stress response, or a biologically interesting subpopulation worth further investigation.
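The |NES| > 1.5 rule of thumb can be applied directly to an exported enrichment table. A minimal sketch over made-up pathway rows (names and scores are illustrative):

```python
# Hypothetical (pathway, NES) rows from a cluster's enrichment results.
pathways = [
    ("Interferon alpha/beta signaling", 2.1),
    ("Oxidative phosphorylation", -1.8),
    ("Ribosome biogenesis", 0.9),
]

# Keep strongly up- or down-regulated programs for mechanistic interpretation.
strong = [(name, nes) for name, nes in pathways if abs(nes) > 1.5]
```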
Linked citations trail
What it is: Inline PubMed citations link specific gene claims to published literature. Hover previews and one-click links allow rapid verification without leaving the report.
How to read it: Every marker-gene explanation is backed by a citation. The source publication, journal, and year are shown in the preview.
When to act: In regulated environments, the citation trail is part of your audit record. Flag citations from out-of-context tissues or organisms for manual verification.
Decision traceability
What it is: The full candidate evaluation funnel — what cell types were considered, why each was accepted or rejected, and the quantitative evidence (Log2FC, expression %, tissue context) driving each decision.
How to read it: Accepted candidates met the evidence threshold. Rejected candidates are listed with the reason for rejection. Scores close to the winning label indicate genuine ambiguity.
When to act: When the final label surprises you, check the rejected candidate table. If a rejected candidate had a score within 0.1 of the winner, investigate both labels before accepting the call.
Interactive Cluster Copilot
What it is: A built-in chat interface connected to that cluster's expression data and pathway context. Ask biological questions and get data-grounded answers.
How to use it: Ask it questions like: "Why does this cluster express CXCL13?" or "What differentiates this cluster from the neighboring CD8 T cells?". Answers are grounded in your actual expression matrix, not generic responses.
When to use it: Use this to resolve interpretation questions in-context before escalating to manual analysis or posting to a lab meeting.
Audit-ready export
What it is: A full annotation table exportable as CSV/TSV, containing all fields needed for downstream analysis and regulatory documentation.
Exported columns: cluster ID, CL term name, CL term ID, cell type, granular annotation, cell state, confidence score, label match score, supporting markers, conflicting markers.
When to use it: Use the export for integration into adata.obs or obj@meta.data, consortium data submissions, and clinical/regulatory sign-off packages.
How CyteType Works
The multi-agent pipeline
CyteType is not a single LLM call with a prompt. It is a coordinated pipeline of specialized agents that collaborate to produce a single annotation with a full evidence trail.
Data preparation (local): Before anything is sent to the API, the client SDK preprocesses your object locally: extracting top marker genes per cluster, computing expression percentages across all genes, aggregating observation metadata per cluster, and sampling UMAP coordinates for visualization. This preprocessing runs in your environment and does not require a network connection.
Artifact generation: Two artifacts are created and uploaded alongside the annotation request:
- obs.duckdb — a DuckDB database of cell metadata, powering metadata filtering and exploration in the report
- vars.h5 — a compressed HDF5 file of the normalized expression matrix, used by the server for on-demand gene expression lookups in the interactive report
Agent roles: Six specialized agents run per cluster:
| Agent | Role |
|---|---|
| Contextualizer | Frames the biological context from study_context, cluster metadata, and marker expression before annotation begins |
| Annotator | Proposes candidate cell types using markers, expression percentages, and Cell Ontology knowledge |
| Reviewer | Multiple independent reviewers evaluate each candidate, surfacing strengths, weaknesses, and alternatives |
| Summarizer | Synthesizes reviewer outputs into a final annotation with confidence scores |
| Clinician | Applies disease-context validation to catch biologically implausible calls |
| Chat | Powers the Cluster Copilot, staying connected to the cluster's expression data |
Cell Ontology mapping: Every annotation is mapped to a Cell Ontology (CL) term. CL is a community-maintained, hierarchical vocabulary for cell types. CL IDs enable cross-study comparison, downstream ontology-based analyses (e.g. enrichment against cell type databases), and regulatory traceability. The ID format is CL:xxxxxxx.
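Because CL IDs follow a fixed pattern ("CL:" plus seven digits), they are easy to validate before feeding into downstream ontology tooling. A small sketch (the helper name is my own):

```python
import re

# Cell Ontology term IDs: "CL:" followed by exactly seven digits.
CL_ID_RE = re.compile(r"^CL:\d{7}$")

def is_valid_cl_id(term_id: str) -> bool:
    """Return True if term_id is a well-formed Cell Ontology ID."""
    return bool(CL_ID_RE.match(term_id))
```

For example, `is_valid_cl_id("CL:0000084")` (T cell) is true, while a GO ID or a truncated CL ID is rejected.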
LLM infrastructure: Each cluster requires hundreds of LLM calls. The CyteType API handles rate limit management, automatic retries, health-aware model fallbacks, and parallel cluster processing. The n_parallel_clusters parameter controls how many clusters are annotated simultaneously.
Data flow summary
Your object (AnnData / Seurat)
↓ SDK preprocessing (local)
marker genes, expression %, metadata, UMAP coords
↓ Artifact upload
vars.h5 (expression matrix) + obs.duckdb (cell metadata)
↓ /annotate API call
payload: study context, markers, expression, metadata
↓ Multi-agent pipeline (per cluster, in parallel)
Contextualizer → Annotator → Reviewer × N → Summarizer → Clinician
↓ /results fetch
annotations, ontology terms, confidence scores, evidence
↓ Results written back to your object
adata.obs / obj@meta.data + adata.uns / obj@misc
↓ Interactive HTML report
live at prod.cytetype.nygen.io/report/{job_id}
FAQ
General
What is CyteType?
CyteType is a multi-agent AI system for automated annotation of single-cell RNA-seq data. Several agents independently evaluate marker expression, reference similarity, ontology structure, and literature context. Their outputs are merged into a final annotation with confidence scores and traceable reasoning. The full method is described in the CyteType preprint (Ahuja G et al., bioRxiv 2025).
How is CyteType different from existing tools?
Traditional approaches depend on a single reference or a set of marker genes. CyteType instead integrates several biological signals through a structured agent workflow, leading to higher robustness in rare, transitional, or disease-associated cell populations.
Why is the multi-agent workflow important?
The performance gains originate from the workflow rather than the LLM tier. Each agent contributes a different biological perspective, and a reconciliation step produces a stable, evidence-supported annotation.
Who developed CyteType?
CyteType was created by Nygen Analytics, a research-focused biotech company in Sweden working on AI systems for single-cell omics.
Technical requirements
Which programming environments are supported?
CyteType is available in Python for AnnData/Scanpy workflows and in R (CyteTypeR) for Seurat.
How do I install CyteType?
Python: pip install cytetype (Python ≥ 3.11). R: devtools::install_github("NygenAnalytics/CyteTypeR").
Which input formats can I use?
AnnData and Seurat objects. Standard Scanpy or Seurat preprocessing workflows do not require reformatting.
What resources are needed to run CyteType?
CyteType requires internet connectivity for LLM-based annotation. Jobs up to approximately 500,000 cells per request are supported. Larger datasets are automatically batched.
Dataset preparation
Does CyteType compute marker genes?
CyteType expects user-provided cluster marker genes as priors. Marker selection is often study-specific, so the tool does not override user-defined markers. CyteType supplements these priors by analysing pseudobulked cluster profiles to identify additional genes outside the provided list.
Why do I need to compute markers beforehand?
Marker gene computation is left to the user to maintain compatibility with different preprocessing workflows and clustering strategies. CyteType uses these markers as a structured informational anchor during reasoning.
Why does CyteType pseudobulk clusters?
Pseudobulk profiles allow CyteType to evaluate genes that are not included in user-provided marker lists. This expands the evidence base for annotation and improves accuracy in transitional or poorly characterised populations.
Species and tissue support
Which species can be annotated?
Human and mouse datasets are fully supported. Other species including rat and zebrafish can be analysed through ortholog mapping, subject to gene homology quality.
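Ortholog mapping is conceptually a symbol lookup. The table below is a tiny hand-rolled example for illustration only; real workflows draw on a full homology resource such as Ensembl Compara or biomaRt:

```python
# Minimal mouse-to-human ortholog table (illustrative entries only).
ortholog_map = {"Cd4": "CD4", "Ms4a1": "MS4A1", "Ncr1": "NCR1"}

mouse_genes = ["Cd4", "Ms4a1", "Xyz1"]  # Xyz1: no ortholog in this table

# Map to human symbols, keeping unmapped names unchanged.
human_genes = [ortholog_map.get(g, g) for g in mouse_genes]
```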
Are tissue-specific reference sets included?
Yes. Reference sets optimised for PBMC, brain, liver, kidney, lung, and pancreas are provided. Custom references can also be supplied.
Does CyteType annotate disease samples?
Yes. Because the method does not rely solely on healthy-tissue references, it performs well on tumour samples, perturbed systems, and transitional states.
Performance and reproducibility
How accurate is CyteType?
CyteType was benchmarked across 20 datasets and 977 clusters. Using the CyteOnto semantic similarity metric against author-assigned labels, CyteType achieved up to 3.8-fold improvement over GPTCellType, 2.68-fold over CellTypist, and 1.01-fold over SingleR. Forty-one percent of clusters received enhanced functional annotation and twenty-nine percent received refined subtype resolution relative to author labels.
How long does annotation take?
Typical runtime is two to three minutes per cluster. A dataset of about fifteen clusters usually completes in thirty to forty-five minutes. The system supports high concurrency for large studies.
Are annotations assigned at the cell or cluster level?
CyteType assigns annotations at the cluster level and propagates them to individual cells. This balances computational efficiency with biological resolution.
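The propagation step is a simple lookup from cluster ID to label. A minimal sketch with made-up labels:

```python
# Hypothetical cluster-level annotations keyed by cluster ID.
cluster_labels = {"0": "CD4-positive T cell", "1": "B cell", "2": "NK cell"}

# Per-cell cluster assignments, e.g. from adata.obs["leiden"].
cell_clusters = ["0", "0", "1", "2", "1"]

# Each cell inherits its cluster's annotation.
cell_labels = [cluster_labels[c] for c in cell_clusters]
```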
Why are repeated runs similar but not identical?
CyteType relies on LLMs, which introduce controlled stochastic variation. Textual reasoning traces may differ across runs, but the method converges to stable biological annotations. Variation occurs at the token level, not at the interpretive level. Archive query.json for full reproducibility records.
How should confidence scores be interpreted?
Confidence values are provided for cell type, subtype, and activation state. Scores above 0.8 generally indicate high reliability. Lower scores mark ambiguous or poorly supported populations that warrant manual review.
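Applying the 0.8 guideline, low-confidence clusters can be collected for manual review. A sketch over hypothetical result rows (the field names here are assumptions for illustration, not a guaranteed schema):

```python
# Hypothetical per-cluster results with confidence scores in [0, 1].
results = [
    {"clusterId": "0", "annotation": "CD4 T cell", "confidence": 0.92},
    {"clusterId": "1", "annotation": "B cell",     "confidence": 0.55},
    {"clusterId": "2", "annotation": "NK cell",    "confidence": 0.81},
]

# Clusters below the 0.8 guideline warrant manual review.
needs_review = [r["clusterId"] for r in results if r["confidence"] < 0.8]
```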
Outputs and evidence
What does CyteType return?
CyteType adds annotation metadata to the input object, including labels, ontology terms, confidence scores, and evidence summaries. A standalone HTML report provides interactive reasoning, citations, and interpretation tools.
How does Cell Ontology linking work?
CyteType maps annotations to official Cell Ontology (CL) terms. When subtype precision is limited, it proposes the closest parent term and includes ranked alternatives.
How is literature evidence generated?
The Literature and Context agent uses LLM-assisted retrieval to identify relevant publications and summarise supporting evidence. Citations are included for independent verification.
What does anomaly detection identify?
Clusters with unexpected marker signatures are flagged. These can represent doublets, low-quality populations, or potentially novel biological states.
Does the 1000-cell UMAP display affect annotation?
No. UMAP subsampling is purely for rendering. All cells contribute to cluster-level pseudobulk profiles, and annotations are based on full data.
Can CyteType restrict reasoning to specific tissues or lineages?
CyteType does not enforce strict tissue constraints. Reasoning is influenced by the study_context provided by the user, but agents are free to consider the full biological landscape. This avoids prematurely excluding plausible alternatives.
Data privacy and retention
What data is retained?
Input files and intermediate representations are stored only to support report generation and the report's interactive chat. Nygen does not access or reuse user data for model development.
How long are reports stored?
Reports remain available by default. A report management system is being introduced that will allow users to define retention periods and delete reports directly.
Is user data used for training?
No. User data is not used for training or benchmarking. Only system logs and metadata are monitored to ensure stability.
Are third-party LLM providers involved?
CyteType uses providers such as Anthropic and xAI. Some operate under zero-retention agreements. Nygen does not allow providers to store user data.
Can I delete my report?
Yes. Users may request deletion by providing the job ID to support@nygen.io. Self-service deletion will be available with the new report management backend.
What security infrastructure is used?
CyteType runs on infrastructure aligned with ISO 27001 and SOC 2 controls. Data is encrypted in transit and at rest, with strict tenant isolation.
Can I use CyteType with sensitive patient data?
For regulated environments, contact contact@nygen.io about enterprise options: on-premises deployment with customer-managed LLMs, isolated storage, and zero data retention policies. The default cloud deployment should not be used with identifiable patient data without explicit data processing agreements.
Pricing and access
Is CyteType free for academic use?
Yes. Academic and non-commercial use is free, with a limit of three annotation runs per day. CyteType is licensed under CC BY-NC-SA 4.0.
How can I exceed the daily limit?
Users may supply their own LLM API keys or run local LLMs. Provider rate limits may constrain concurrency to three to five clusters in parallel.
How is commercial use licensed?
Licensing includes an annual fee and a per-cluster annotation cost. Nygen offers Discovery, Enterprise, and Partnership tiers. Contact contact@nygen.io for details.
Are enterprise deployments available?
Yes. Options include private cloud, AWS Bedrock integration, and on-premises installations supporting air-gapped environments and local LLMs.
Workflow integration
Does CyteType fit into existing Scanpy and Seurat workflows?
Yes. CyteType writes outputs directly into object metadata (adata.obs/obj@meta.data). Existing downstream pipelines remain unchanged.
What preprocessing is required?
Standard QC, normalisation, and clustering. Apply batch correction (such as Harmony or Scanorama) when working with multi-batch datasets before submission.
Can I use custom marker lists or references?
Yes. Custom marker genes and reference datasets can be supplied for specialised tissues or perturbation studies.
Does CyteType support multimodal data?
CyteType currently expects transcriptomic input. Support for multimodal and CITE-seq data is under evaluation.
How do I add custom metadata to my report?
Pass metadata as a dict (Python) or named list (R) to the run step. Keys appear as headers in the report. Values that look like URLs are rendered as clickable links — useful for linking GEO accessions, DOIs, or internal data portals.
Troubleshooting
What if I disagree with an annotation?
The HTML report allows manual override, re-annotation requests, and reasoning queries via the Cluster Copilot. All changes are logged.
How does CyteType address ambiguous populations?
The Reviewer agent highlights ambiguity and provides ranked alternative hypotheses with supporting evidence. Low confidence scores indicate clusters that warrant closer inspection.
My job is taking a long time. Is it stuck?
Jobs continue running on the server even after your local session disconnects. Use annotator.get_results() (Python) or GetResults(obj) (R) to retrieve results after reconnecting. The report URL printed during submission updates live — check it to see per-cluster progress.
The artifact upload is failing. What should I do?
Set require_artifacts=False to skip artifact uploading and proceed with annotation-only mode. Annotation will still complete; the interactive report will lack expression lookups and metadata filtering. For large datasets, load AnnData in backed mode: sc.read_h5ad("file.h5ad", backed="r").
Does CyteType detect doublets or low-quality cells?
Potential doublets and low-quality populations are flagged through confidence scoring. Removal should occur during preprocessing before submission.
Is spatial transcriptomics supported?
Spatial support is in development and will be released after validation.
Can I use a local model with Ollama?
Yes. Expose your local Ollama instance using ngrok, then pass the public URL as baseUrl with provider="openai". The model must support tool calling. See the Ollama integration guide for step-by-step instructions.
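A hedged sketch of what the corresponding llm_configs entry could look like; the model name and tunnel URL are placeholders, and the Ollama integration guide remains the authoritative recipe:

```python
# Ollama exposes an OpenAI-compatible endpoint at /v1, so it is
# configured with provider "openai" and the tunnel URL as baseUrl.
llm_configs = [{
    "provider": "openai",
    "name": "llama3.1:70b",                              # placeholder local model
    "apiKey": "ollama",                                  # Ollama ignores the key value
    "baseUrl": "https://your-tunnel.ngrok-free.app/v1",  # placeholder ngrok URL
}]
```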
Updates and support
How often are databases and references updated?
Cell Ontology updates are synced quarterly. Literature context is refreshed regularly. Workflow improvements are deployed automatically.
How do I cite CyteType in my publication?
Please cite the bioRxiv preprint: Ahuja G et al., Multi-agent AI enables evidence-based cell annotation in single-cell transcriptomics, bioRxiv 2025. doi:10.1101/2025.11.06.686964
Where can I get support?
GitHub: github.com/NygenAnalytics/CyteType — Discord: discord.gg/V6QFM4AN — Support: support@nygen.io
Is there a tutorial or interactive notebook?
Yes. A Google Colab tutorial is available for an interactive end-to-end walkthrough: Open in Colab