HTANalyzer-LLM: Large Language Model for Analyzing HTAN Spatial Transcriptomics Data

2024 Human Tumor Atlas Network Data Jamboree

November 6, 2024 - November 8, 2024

Team members

Team Leader: Arun Das (University of Pittsburgh)

Tech Lead: Krithika Bhuvaneshwar (Advanced Biomedical Computational Science (ABCS), Frederick National Lab)

Writers:

  • Jeanna Arbesfeld-Qiu (Harvard Medical School)
  • Sanna Madan (National Cancer Institute)
  • Noam Gaiger (Yale University)

Members:

  • Ashish Mahabal (Caltech)
  • Khoa Huynh (Virginia Commonwealth University)
  • Mingyu Yang (Yale University)

Background

In situ spatial transcriptomic (ST) technologies are a breakthrough class of methods that have redefined cancer biology research by providing insight into tumor microenvironment structure, organization, and heterogeneity. ST data have also allowed researchers to understand cell-cell signaling, such as in the context of the tumor microenvironment or the formation of developmental gradients.

What's the problem?

A growing number of spatial datasets are being generated and made publicly available, yet navigating them can be challenging. For example, the Human Tumor Atlas Network (HTAN) is a valuable resource for biomedical researchers, but making its data more easily accessible would broaden its effective use. The large-scale, complex data generated by ST require dedicated tools for analysis and interpretation, and such tools currently demand the computational expertise of a trained bioinformatician and close collaboration between wet-lab and dry-lab scientists.

Our solution: HTANalyzer

HTANalyzer is a multi-agent large language model (LLM) tool designed specifically to analyze spatial transcriptomic data from the Human Tumor Atlas Network (HTAN). It lets users interact with HTAN datasets through conversational, natural-language queries, making spatial transcriptomics analysis accessible to researchers with limited bioinformatics expertise.

Approach

Our approach uses three agents to allow for:

  • Data integration with HTAN resources
  • Bioinformatics analysis using standard Python packages
  • Interactive chat interface providing human-friendly output

A reasoning engine directs the user's queries to the correct agent.
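A minimal sketch of how such a routing engine could be wired up with Vertex AI is shown below. The agent names, prompt wording, and dispatch logic are illustrative assumptions, not the exact code in our notebooks.

# Illustrative routing sketch (assumed structure, not the notebook code):
# ask the LLM which agent should handle the query, then dispatch to it.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your_project_id_here", location="us-central1")
router = GenerativeModel("gemini-1.5-flash-002")

ROUTER_PROMPT = (
    "You are a router. Answer with exactly one word: 'agent1' for HTAN/Synapse "
    "data access, 'agent2' for bioinformatics analysis, 'agent3' for explaining "
    "results in plain language.\nQuery: {query}"
)

def route_query(query, handlers):
    # handlers is a dict such as {"agent1": run_agent1, ...} (hypothetical callables)
    choice = router.generate_content(ROUTER_PROMPT.format(query=query)).text.strip().lower()
    return handlers.get(choice, handlers["agent3"])(query)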


Examples of questions you can ask HTANalyzer:

Can you download the synapse data 'syn51133602' to /content/datasets/?


Show me a UMAP of the data.


Show me a spatial plot of genes SAMD11 and NOC2L and cell types.


Calculate the cell-cell interactions network of the ANGPTL pathway and show me the cell-cell interaction between ANGPTL4 and CDH11.

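For illustration, the code Agent 1 might generate for the first query above could resemble the sketch below. It uses the standard synapseclient API and is an assumption, not output captured from HTANalyzer.

# Hypothetical generated code for:
# "Can you download the synapse data 'syn51133602' to /content/datasets/?"
import os
import synapseclient

syn = synapseclient.login(authToken="your_token_here")
os.makedirs("/content/datasets/", exist_ok=True)
# syn.get() downloads the entity's file into the requested folder
entity = syn.get("syn51133602", downloadLocation="/content/datasets/")
print(entity.path)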

Installation

Platform Requirements

HTANalyzer is designed to run on Google Cloud Platform's Vertex AI. Users need a Google Cloud project ID with billing enabled.
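If you are running from a Colab notebook, as we did during the jamboree, one way to authenticate against your Google Cloud project is sketched below (the project ID is a placeholder).

# Authenticate the Colab runtime against Google Cloud (assumes a Colab environment)
from google.colab import auth
auth.authenticate_user()

# Point subsequent Vertex AI and BigQuery calls at your project
!gcloud config set project your_project_id_here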

Dependencies

Required Python packages:

synapseclient==4.6.0
gcsfs==2024.6.1
scanpy==1.10.3
squidpy==1.6.1
pandas==2.2.2
matplotlib==3.7.1
google==2.0.3
vertexai==1.70.0
anndata==0.11.0
numpy==1.26.4
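In a notebook cell, one way to install these pinned versions in a single step:

!pip install synapseclient==4.6.0 gcsfs==2024.6.1 scanpy==1.10.3 squidpy==1.6.1 \
    pandas==2.2.2 matplotlib==3.7.1 google==2.0.3 vertexai==1.70.0 \
    anndata==0.11.0 numpy==1.26.4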

Authentication Setup

  1. Create a Synapse account and generate a personal authentication token with full view, download, and modify permissions.
  2. Use the token in your code:
import synapseclient
syn = synapseclient.login(authToken="your_token_here")

Required Imports

import pandas_gbq
from google.cloud import bigquery
import pandas as pd
import base64
import vertexai
from vertexai.generative_models import GenerativeModel, Part, SafetySetting
import scanpy
import squidpy
import matplotlib.pyplot as plt

Vertex AI Setup

vertexai.init(project="your_project_id_here", location="us-central1")
model = GenerativeModel("gemini-1.5-flash-002")
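A minimal call to the model could then look like the following; the prompt is a placeholder, and the temperature of 0.3 reflects the range we found helpful (see "Issues we encountered" below).

from vertexai.generative_models import GenerationConfig

# Lower temperature (0.3-0.5) made the agents' output more reproducible
response = model.generate_content(
    "Summarize the spatial transcriptomics data available in HTAN.",
    generation_config=GenerationConfig(temperature=0.3),
)
print(response.text)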

Our workflow

To create each agent, we engineered prompts based on the data and metadata structure, examples of Google BigQuery and Synapse client usage, tutorials of single-cell RNA-seq and 10X Visium analysis from Python packages (e.g. Squidpy, Scanpy, and COMMOT), and iterative LLM-based optimization of the prompts. The code for each agent can be found in the following Jupyter notebooks (a rough illustration of the prompt pattern follows the list):

Super Agent: demo_app_with_super_agent.ipynb

Agent 1: agent1.ipynb

Agent 2: agent2.ipynb

Agent 3: agent3.ipynb
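As a rough illustration of the prompt pattern described above (the wording and package constraints are assumptions, not the exact prompts in the notebooks):

# Hypothetical system prompt for a bioinformatics agent in the style of Agent 2,
# combining data/metadata context with package-specific instructions.
AGENT2_SYSTEM_PROMPT = (
    "You are a bioinformatics assistant for 10X Visium data. The AnnData object "
    "`adata` has spatial coordinates in adata.obsm['spatial'] and cell type labels "
    "in adata.obs['cell_type']. Write Python code using scanpy and squidpy only; "
    "return code, no explanation."
)

from vertexai.generative_models import GenerativeModel

agent2 = GenerativeModel(
    "gemini-1.5-flash-002",
    system_instruction=AGENT2_SYSTEM_PROMPT,
)
code = agent2.generate_content("Show me a UMAP of the data.").text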

Issues we encountered

  • While using Google Colab notebooks, we occasionally ran into an error in the CuPy library that prevented the import of various Python packages. We solved this by running the following line:
    !sudo apt-get install libnvidia-compute-550
    
  • The output from HTANalyzer can be stochastic. We mitigated this by reducing the LLM temperature to 0.3-0.5 and adding more detail to the prompts.

Future Directions

  • Create a recursive error-handling agent to debug code generated by Agent 1 and Agent 2.
  • Give the agents memory of a user's previous queries.
  • Improve usability by developing a chatbot frontend and packaging HTANalyzer for easy installation.
  • Improve Agent 1 (Data Integration) by adding prompts related to the clinical metadata available in HTAN Google buckets.
  • Test different LLM APIs (e.g. OpenAI, Claude).
  • Expand beyond 10X Visium to additional ST technologies and bioinformatics tools.
