HTANalyzer-LLM: Large Language Model for Analyzing HTAN Spatial Transcriptomics Data

2024 Human Tumor Atlas Network Data Jamboree

November 6, 2024 - November 8, 2024

Team members

Team Leader: Arun Das (University of Pittsburgh)

Tech Lead: Krithika Bhuvaneshwar (Advanced Biomedical Computational Science (ABCS), Frederick National Lab)

Writers:

  • Jeanna Arbesfeld-Qiu (Harvard Medical School)
  • Sanna Madan (National Cancer Institute)
  • Noam Gaiger (Yale University)

Members:

  • Ashish Mahabal (Caltech)
  • Khoa Huynh (Virginia Commonwealth University)
  • Mingyu Yang (Yale University)

Background

In situ spatial transcriptomic (ST) technologies are a breakthrough class of methods that have redefined cancer biology research by providing insight into tumor microenvironment structure, organization, and heterogeneity. ST data have also allowed researchers to understand cell-cell signaling, such as in the context of the tumor microenvironment or the formation of developmental gradients.

What's the problem?

A growing number of spatial datasets are being generated and made publicly available, yet navigating them can be challenging. For example, the Human Tumor Atlas Network (HTAN) is a valuable resource for biomedical researchers, but making its data more easily accessible would broaden its effective use. The large-scale, complex data generated by ST require dedicated tools for analysis and interpretation, and such tools currently demand the computational expertise of a trained bioinformatician and close collaboration between wet-lab and dry-lab scientists.

Our solution: HTANalyzer

HTANalyzer is a multi-agent large language model (LLM) tool designed specifically to analyze spatial transcriptomic data from the Human Tumor Atlas Network (HTAN). It lets users interact with HTAN datasets through conversational, natural-language queries, making spatial transcriptomics analysis accessible to researchers with limited bioinformatics expertise.

Approach

Our approach uses three agents to allow for:

  • Data integration with HTAN resources
  • Bioinformatics analysis using standard Python packages
  • Interactive chat interface providing human-friendly output

A reasoning engine directs the user's queries to the correct agent.
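A minimal sketch of how such a routing engine could be wired up with Vertex AI is shown below. The agent names, prompt wording, and dispatch logic are illustrative assumptions, not the exact code in our notebooks.

# Illustrative routing sketch (assumed structure, not the notebook code):
# ask the LLM which agent should handle the query, then dispatch to it.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your_project_id_here", location="us-central1")
router = GenerativeModel("gemini-1.5-flash-002")

ROUTER_PROMPT = (
    "You are a router. Answer with exactly one word: 'agent1' for HTAN/Synapse "
    "data access, 'agent2' for bioinformatics analysis, 'agent3' for explaining "
    "results in plain language.\nQuery: {query}"
)

def route_query(query, handlers):
    # handlers is a dict such as {"agent1": run_agent1, ...} (hypothetical callables)
    choice = router.generate_content(ROUTER_PROMPT.format(query=query)).text.strip().lower()
    return handlers.get(choice, handlers["agent3"])(query)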


Examples of questions you can ask HTANalyzer:

Can you download the synapse data 'syn51133602' to /content/datasets/?


Show me a UMAP of the data.


Show me a spatial plot of genes SAMD11 and NOC2L and cell types.


Calculate the cell-cell interactions network of the ANGPTL pathway and show me the cell-cell interaction between ANGPTL4 and CDH11.

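For illustration, the code Agent 1 might generate for the first query above could resemble the sketch below. It uses the standard synapseclient API and is an assumption, not output captured from HTANalyzer.

# Hypothetical generated code for:
# "Can you download the synapse data 'syn51133602' to /content/datasets/?"
import os
import synapseclient

syn = synapseclient.login(authToken="your_token_here")
os.makedirs("/content/datasets/", exist_ok=True)
# syn.get() downloads the entity's file into the requested folder
entity = syn.get("syn51133602", downloadLocation="/content/datasets/")
print(entity.path)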

Installation

Platform Requirements

HTANalyzer is designed to run on Google Cloud Platform's Vertex AI. Users need a Google Cloud project ID with billing enabled.
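If you are running from a Colab notebook, as we did during the jamboree, one way to authenticate against your Google Cloud project is sketched below (the project ID is a placeholder).

# Authenticate the Colab runtime against Google Cloud (assumes a Colab environment)
from google.colab import auth
auth.authenticate_user()

# Point subsequent Vertex AI and BigQuery calls at your project
!gcloud config set project your_project_id_here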

Dependencies

Required Python packages:

synapseclient==4.6.0
gcsfs==2024.6.1
scanpy==1.10.3
squidpy==1.6.1
pandas==2.2.2
matplotlib==3.7.1
google==2.0.3
vertexai==1.70.0
anndata==0.11.0
numpy==1.26.4
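In a notebook cell, one way to install these pinned versions in a single step:

!pip install synapseclient==4.6.0 gcsfs==2024.6.1 scanpy==1.10.3 squidpy==1.6.1 \
    pandas==2.2.2 matplotlib==3.7.1 google==2.0.3 vertexai==1.70.0 \
    anndata==0.11.0 numpy==1.26.4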

Authentication Setup

  1. Create a Synapse account and generate a personal authentication token with full view, download, and modify permissions.
  2. Use the token in your code:
import synapseclient
syn = synapseclient.login(authToken="your_token_here")

Required Imports

import pandas_gbq
from google.cloud import bigquery
import pandas as pd
import base64
import vertexai
from vertexai.generative_models import GenerativeModel, Part, SafetySetting
import scanpy
import squidpy
import matplotlib.pyplot as plt

Vertex AI Setup

vertexai.init(project="your_project_id_here", location="us-central1")
model = GenerativeModel("gemini-1.5-flash-002")
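A minimal call to the model could then look like the following; the prompt is a placeholder, and the temperature of 0.3 reflects the range we found helpful (see "Issues we encountered" below).

from vertexai.generative_models import GenerationConfig

# Lower temperature (0.3-0.5) made the agents' output more reproducible
response = model.generate_content(
    "Summarize the spatial transcriptomics data available in HTAN.",
    generation_config=GenerationConfig(temperature=0.3),
)
print(response.text)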

Our workflow

To create each agent, we engineered prompts based on the data and metadata structure, examples of Google BigQuery and Synapse client usage, tutorials of single-cell RNA-seq and 10X Visium analysis from Python packages (e.g. Squidpy, Scanpy, and COMMOT), and iterative LLM-based optimization of the prompts. The code for each agent can be found in the following Jupyter notebooks (a rough illustration of the prompt pattern follows the list):

Super Agent: demo_app_with_super_agent.ipynb

Agent 1: agent1.ipynb

Agent 2: agent2.ipynb

Agent 3: agent3.ipynb
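As a rough illustration of the prompt pattern described above (the wording and package constraints are assumptions, not the exact prompts in the notebooks):

# Hypothetical system prompt for a bioinformatics agent in the style of Agent 2,
# combining data/metadata context with package-specific instructions.
AGENT2_SYSTEM_PROMPT = (
    "You are a bioinformatics assistant for 10X Visium data. The AnnData object "
    "`adata` has spatial coordinates in adata.obsm['spatial'] and cell type labels "
    "in adata.obs['cell_type']. Write Python code using scanpy and squidpy only; "
    "return code, no explanation."
)

from vertexai.generative_models import GenerativeModel

agent2 = GenerativeModel(
    "gemini-1.5-flash-002",
    system_instruction=AGENT2_SYSTEM_PROMPT,
)
code = agent2.generate_content("Show me a UMAP of the data.").text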

Issues we encountered

  • While using Google Colab notebooks, we occasionally ran into an error in the CuPy library that prevented the import of various Python packages. We solved this by running the following line:
    !sudo apt-get install libnvidia-compute-550
    
  • The output from HTANalyzer can be stochastic. We mitigated this by reducing the LLM temperature to 0.3-0.5 and adding more detail to the prompts.

Future Directions

  • Create a recursive error-handling agent to debug code generated by Agent 1 and Agent 2.
  • Give the agents memory of a user's previous queries.
  • Improve usability by developing a chatbot frontend and packaging HTANalyzer for easy installation.
  • Improve Agent 1 (Data Integration) by adding prompts related to the clinical metadata available in HTAN Google buckets.
  • Test different LLM APIs (e.g. OpenAI, Claude).
  • Expand beyond 10X Visium to additional ST technologies and bioinformatics tools.
