GitHub - AstraZeneca/cellatria: An Agentic AI Framework for Ingestion and Standardization of Single-Cell RNA-seq Data Analysis

✨ Introduction

CellAtria is an agentic AI system that enables full-lifecycle, document-to-analysis automation in single-cell research. It integrates natural language interaction with a robust, graph-based, multi-actor execution framework. The system orchestrates diverse tasks, ranging from literature parsing and metadata extraction to dataset retrieval and downstream scRNA-seq analysis via the co-developed CellExpress pipeline.

Through its comprehensive interface, CellAtria empowers users to engage with a language model augmented by task-specific tools. This eliminates the need for manual command-line operations, accelerating data onboarding and the reuse of public single-cell resources.

Language model-mediated orchestration of toolchains. Upon receiving a user prompt, the CellAtria interface transfers the request to the LLM agent, which interprets intent and autonomously invokes relevant tools. Outputs are returned through the interface, completing a full cycle of context-aware execution.

💡 Key Features

Flexible Input: Accepts primary research articles as PDFs or URLs for seamless integration.
Automated Metadata Extraction: Extracts structured metadata, including sample annotations, organism, tissue type, and GEO (Gene Expression Omnibus) accession identifiers.
Intelligent Data Retrieval: Resolves and organizes GEO datasets by accessing both GSE (study-level) and GSM (sample-level) records, ensuring structured and comprehensive data retrieval.
Integrated Analysis Pipeline: Orchestrates full pipeline configuration and launches CellExpress, a containerized framework for standardized scRNA-seq analysis, ensuring reproducible results.
Enhanced User Control: Enables metadata editing, secure file transfers, and direct file system management within the agent session.
Modular & Reusable Architecture: Composes all core actions into reusable, graph-based tools that serve as callable agent nodes, fostering extensibility.

Additional details on the underlying toolkits can be found in the toolkit reference.
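As a rough illustration of the accession identifiers mentioned in the feature list above, the following sketch pulls GSE and GSM accessions out of free text with a regular expression. It assumes accessions take the form `GSE` or `GSM` followed by digits. The function name `find_geo_accessions` is hypothetical and is not part of CellAtria's toolkit.

```python
import re

# Assumption: GEO series/sample accessions look like "GSE" or "GSM" plus digits.
GEO_ACCESSION = re.compile(r"\b(GSE|GSM)(\d+)\b")


def find_geo_accessions(text: str) -> dict[str, list[str]]:
    """Return unique accessions grouped by record type, in order of first appearance."""
    found: dict[str, list[str]] = {"GSE": [], "GSM": []}
    for match in GEO_ACCESSION.finditer(text):
        accession = match.group(0)
        bucket = found[match.group(1)]  # "GSE" or "GSM"
        if accession not in bucket:
            bucket.append(accession)
    return found
```

Grouping by prefix mirrors the study-level versus sample-level distinction described above; how the agent itself resolves accessions may differ.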

🚀 Getting Started

(1) Prerequisites

Docker: Install Docker and ensure the Docker daemon is running.
Environment Configuration: Provide a .env file with credentials and parameters (see the LLM Configuration section below).

(2) Docker Images

The CellAtria repository includes a GitHub Actions workflow that builds and publishes a preconfigured Docker image to the GitHub Container Registry.

Pull the latest CellAtria Docker image using:

# Run this command in your terminal
docker pull ghcr.io/astrazeneca/cellatria:v1.0.0

This image contains all dependencies needed to run the CellAtria agent in a consistent environment.

(3) Launching Agent

Start the agent with the following command (replace paths with your actual directories):

# Run this command in your terminal
docker run -it --rm \
  -p 7860:7860 \
  -v /path/to/your/project/directory:/data \
  -v /path/to/your/env/directory:/envdir \
  ghcr.io/astrazeneca/cellatria:v1.0.0 cellatria \
  --env_path /envdir

Command Breakdown:

-p 7860:7860: Exposes the agent user interface (UI) on port 7860.
-v /path/to/your/project/directory:/data: Mounts your project directory into the container.
-v /path/to/your/env/directory:/envdir: Mounts the directory containing your .env file for configuration (see the LLM Configuration section below).
ghcr.io/astrazeneca/cellatria:v1.0.0 cellatria: Specifies the Docker image and the entrypoint command to launch the app inside the container.
--env_path /envdir: Tells the agent where to find the .env file for provider setup.

macOS users with Apple Silicon (M1/M2): You may encounter a warning due to platform mismatch. To ensure compatibility, add --platform=linux/amd64 when running the container (i.e., docker run --platform=linux/amd64 -it --rm).
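For scripted launches, the command above can be assembled programmatically. The sketch below is illustrative only: the helper `build_agent_command` is hypothetical, and it assumes that Apple Silicon Macs report `Darwin` and `arm64` through Python's `platform` module (a Python interpreter running under Rosetta may report otherwise). Only flags documented in this README are used.

```python
import platform

IMAGE = "ghcr.io/astrazeneca/cellatria:v1.0.0"


def build_agent_command(project_dir, env_dir, system=None, machine=None):
    """Assemble the documented `docker run` invocation as an argument list."""
    system = system or platform.system()
    machine = machine or platform.machine()

    cmd = ["docker", "run"]
    # Assumption: Apple Silicon is detected as Darwin/arm64.
    if system == "Darwin" and machine == "arm64":
        cmd.append("--platform=linux/amd64")

    cmd += [
        "-it", "--rm",
        "-p", "7860:7860",             # expose the agent UI
        "-v", f"{project_dir}:/data",  # project directory
        "-v", f"{env_dir}:/envdir",    # directory holding the .env file
        IMAGE, "cellatria",
        "--env_path", "/envdir",
    ]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building it as a list rather than a single string avoids shell quoting issues with paths that contain spaces.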

Once launched, the agent will initialize and provide a local URL for interaction. Simply open the link printed in your terminal to begin using CellAtria through your browser.

Mounting a Working Directory:

When running the container, any host directory you want the container to access must be explicitly mounted using Docker’s -v (volume) flag. The container can only see and interact with the directories you specify at runtime.

For example, the following command:

-v /absolute/path/on/host:/data

makes the contents of /absolute/path/on/host on your host machine available inside the container at /data.

If you set a working directory inside the container (e.g., my_project), make sure to reference it using the container’s path — for instance: /data/my_project. Attempting to access files or directories outside the mounted path from within the container will fail, as they are not visible to the container’s filesystem.
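The path translation described above can be sketched as a small helper. This is a minimal illustration, not part of CellAtria: the function name `to_container_path` is hypothetical, and it assumes both paths are absolute, POSIX-style, and already normalized (no `..` segments).

```python
from pathlib import PurePosixPath


def to_container_path(host_path, host_mount, container_mount="/data"):
    """Translate a host path into the path seen inside the container.

    Raises ValueError if the host path is not under the mounted directory,
    since such paths are not visible to the container.
    """
    host = PurePosixPath(host_path)
    mount = PurePosixPath(host_mount)
    try:
        relative = host.relative_to(mount)
    except ValueError:
        raise ValueError(
            f"{host_path} is outside the mounted directory {host_mount}"
        ) from None
    return str(PurePosixPath(container_mount) / relative)
```

For example, with `-v /absolute/path/on/host:/data`, the host directory `/absolute/path/on/host/my_project` corresponds to `/data/my_project` inside the container.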

🛠️ LLM Configuration

Quick Start

CellAtria requires a .env file to configure access to your chosen LLM provider. Download the template .env and fill in the necessary credentials and parameters. Ensure the directory containing the .env file is mounted into the container.

Supported LLM Backends

azure: Azure OpenAI (enterprise-grade access to GPT models)
openai: Standard OpenAI API (e.g., GPT-4, GPT-3.5)
anthropic: Claude models via the Anthropic API
google: Gemini models via Google Cloud / Vertex AI
local: Offline models (e.g., Llama.cpp, Ollama, Hugging Face)

Set the PROVIDER variable in your .env file to one of the supported values above. Only one provider can be active at a time.

You only need to configure the block for the provider you're using. The rest can remain commented.
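Before launching the container, it can help to check that the .env file selects a supported provider. The sketch below assumes the file uses plain `KEY=value` lines with `#` comments; the exact layout of the template is not shown here, so treat the parsing rules as an assumption. The helper `read_provider` is hypothetical, and only the `PROVIDER` variable and the five values listed above come from this README.

```python
SUPPORTED_PROVIDERS = {"azure", "openai", "anthropic", "google", "local"}


def read_provider(env_text: str) -> str:
    """Return the PROVIDER value from .env-style text, or raise ValueError."""
    provider = None
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "PROVIDER":
            # Drop surrounding whitespace and optional quotes; last entry wins.
            provider = value.strip().strip('"').strip("'")
    if provider is None:
        raise ValueError("PROVIDER is not set in the .env file")
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"Unsupported PROVIDER {provider!r}; "
            f"expected one of {sorted(SUPPORTED_PROVIDERS)}"
        )
    return provider
```

Commented-out provider blocks are ignored by this check, which matches the guidance that unused blocks can remain commented.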

🚂 CellExpress Engine

CellExpress is a companion pipeline embedded within the CellAtria framework. It delivers a reproducible and automated workflow for processing single-cell RNA-seq (scRNA-seq) datasets, from raw count matrices to comprehensive cell type annotations and report generation.

Designed to lower bioinformatics barriers, CellExpress implements a comprehensive set of state-of-the-art, Scanpy-based processing stages, including quality control (performed globally or per sample), data transformation (including normalization, highly variable gene selection, and scaling), dimensionality reduction (UMAP and t-SNE), graph-based clustering, and marker gene identification. Additional tools are integrated to support advanced analysis tasks, including doublet detection, batch correction, and automated cell type annotation using both tissue-agnostic and tissue-specific models. All analytical steps are executed sequentially under centralized control, with parameters fully configurable via a comprehensive input schema.

Run CellExpress in Standalone Mode

CellExpress is a fully standalone pipeline for comprehensive scRNA-seq data analysis. It can be orchestrated either through an agentic system, as incorporated into the CellAtria framework, or via direct command-line execution.

To execute the CellExpress pipeline directly using Docker, use the following command:

# Run this command in your terminal
docker run -it --rm \
  -v /path/to/your/local/data:/data \
  ghcr.io/astrazeneca/cellatria:v1.0.0 cellexpress \
    --input /data \
    --project your_project_name \
    --species your_species \
    --tissue your_tissue \
    --disease your_disease \
    [--additional options...]

Command Breakdown:

-v /path/to/your/local/data:/data: Mounts your project directory into the container.
ghcr.io/astrazeneca/cellatria:v1.0.0 cellexpress: Specifies the Docker image and the entrypoint command to launch CellExpress inside the container.
[--additional options...]: Optional additional arguments to configure the pipeline.

macOS users with Apple Silicon (M1/M2): You may encounter a warning due to platform mismatch. To ensure compatibility, add --platform=linux/amd64 when running the container (i.e., docker run --platform=linux/amd64 -it --rm).
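When running CellExpress across several projects, the standalone command can be generated from parameters. The following is a minimal sketch under stated assumptions: `build_cellexpress_command` is a hypothetical helper, the placeholder values are not real CellExpress settings, and any extra arguments must be taken from the CellExpress README rather than from this example.

```python
import shlex

IMAGE = "ghcr.io/astrazeneca/cellatria:v1.0.0"


def build_cellexpress_command(data_dir, project, species, tissue, disease,
                              extra_args=()):
    """Assemble the documented standalone CellExpress invocation."""
    cmd = [
        "docker", "run", "-it", "--rm",
        "-v", f"{data_dir}:/data",  # mount local data at /data
        IMAGE, "cellexpress",
        "--input", "/data",
        "--project", project,
        "--species", species,
        "--tissue", tissue,
        "--disease", disease,
    ]
    # Extra options are passed through unchanged; none are assumed here.
    cmd.extend(extra_args)
    return cmd


def render(cmd):
    """Return a shell-safe, copy-pastable string for logging or review."""
    return shlex.join(cmd)
```

Rendering with `shlex.join` quotes any value that contains spaces, so the printed command can be pasted into a terminal as-is.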

For full details, usage instructions, and configuration options, refer to the CellExpress README.

🛠️ Computing Environment

The Dockerfile defines the dedicated computing environment for executing CellAtria and the co-developed CellExpress pipeline in a consistent and reproducible manner. It includes all required Python and R dependencies, along with support for HTML reporting and visualization. Built on an Ubuntu-based system, the environment also provides essential system-level packages to support end-to-end pipeline execution.

🧠 Usage Intuition

While CellAtria supports flexible, user-driven interactions, its functionality is governed by an underlying execution narrative — a structured flow of modular actions that define how tasks are interpreted, routed, and executed. Users may invoke any module independently; however, for optimal results and seamless orchestration, we recommend following the intended workflow trajectory below.

CellAtria's internal logic integrates the following key stages:

Document Parsing - Extracts structured metadata from narrative-formatted scientific documents (article URL or PDF).
Accession Resolution - Identifies relevant GEO (Gene Expression Omnibus) accession IDs from parsed metadata.
Dataset Retrieval - Downloads datasets directly from public repositories.
File & Data Organization - Structures downloaded content into a consistent directory schema for analysis.
Pipeline Configuration - Prepares CellExpress arguments and environmental parameters for execution.
CellExpress Execution - Launches the standardized single-cell analysis pipeline in a detached mode.

This modular, agent-guided framework allows users to begin at any point while preserving logical consistency across steps.
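One way to picture the "begin at any point" behavior is an ordered list of stages from which a run picks up at a chosen entry point. The sketch below is purely conceptual: the stage identifiers and the `stages_from` helper are hypothetical and do not reflect how CellAtria's graph-based tools are implemented.

```python
# Hypothetical identifiers for the six stages listed above, in workflow order.
STAGES = [
    "document_parsing",
    "accession_resolution",
    "dataset_retrieval",
    "file_and_data_organization",
    "pipeline_configuration",
    "cellexpress_execution",
]


def stages_from(start: str) -> list[str]:
    """Return the chosen stage and every stage that follows it."""
    if start not in STAGES:
        raise ValueError(f"Unknown stage {start!r}; expected one of {STAGES}")
    return STAGES[STAGES.index(start):]
```

A user who already has the data downloaded would, in this picture, enter at `file_and_data_organization` and continue through configuration and execution.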

📖 Related Publication

If you use this repository, please cite:

Nima Nouri, et al. (2025). An Agentic AI Framework for Ingestion and Standardization of Single-Cell RNA-seq Data Analysis. bioRxiv. https://doi.org/10.1101/2025.07.31.667880

@article{nouri2025agentic,
  title={An Agentic AI Framework for Ingestion and Standardization of Single-Cell RNA-seq Data Analysis},
  author={Nouri, Nima and Artzi, Ronen and Savova, Virginia},
  journal={bioRxiv},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}

📬 Contact

Role	Name	Contact
Author/Maintainer	Nima Nouri	nima.nouri@astrazeneca.com
