2024 Human Tumor Atlas Network Data Jamboree
November 6-8, 2024
Team Leader: Arun Das (University of Pittsburgh)
Tech Lead: Krithika Bhuvaneshwar (Advanced Biomedical Computational Science (ABCS), Frederick National Lab)
Writers:
- Jeanna Arbesfeld-Qiu (Harvard Medical School)
- Sanna Madan (National Cancer Institute)
- Noam Gaiger (Yale University)
Members:
- Ashish Mahabal (Caltech)
- Khoa Huynh (Virginia Commonwealth University)
- Mingyu Yang (Yale University)
In situ spatial transcriptomic (ST) technologies are a breakthrough class of methods that have redefined cancer biology research by providing insight into tumor microenvironment structure, organization, and heterogeneity. ST data have also allowed researchers to understand cell-cell signaling, such as in the context of the tumor microenvironment or the formation of developmental gradients.
A growing number of spatial databases are being generated and made publicly available, yet navigating them can be challenging. For example, the Human Tumor Atlas Network (HTAN) is a valuable resource for biomedical researchers; however, making its data more easily accessible would broaden its effective use. The large-scale, complex data generated by ST require purpose-built tools for analysis and interpretation. Currently, such tools require the computational expertise of a trained bioinformatician and close collaboration between wet-lab and dry-lab scientists.
HTANalyzer is a multiagent large language model specifically designed to analyze spatial transcriptomic data from the Human Tumor Atlas Network (HTAN). HTANalyzer allows users to interact with datasets on HTAN using conversational, natural language queries, making spatial transcriptomics analysis accessible to those with limited expertise in bioinformatics.
Our approach uses three agents to allow for:
- Data integration with HTAN resources
- Bioinformatics analysis using standard Python packages
- Interactive chat interface providing human-friendly output
A reasoning engine directs the user's queries to the correct agent.
Example user queries:
- Can you download the synapse data 'syn51133602' to /content/datasets/?
- Show me a UMAP of the data.
- Show me a spatial plot of genes SAMD11 and NOC2L and cell types.
- Calculate the cell-cell interaction network of the ANGPTL pathway and show me the cell-cell interaction between ANGPTL4 and CDH11.
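As a sketch of the routing step, a minimal keyword-based dispatcher is shown below. The agent names and keyword rules are illustrative only; in HTANalyzer the routing decision is made by the LLM reasoning engine itself, not by fixed keywords.

```python
# Minimal sketch of the reasoning engine's routing step.
# Agent names and keywords here are illustrative, not the actual prompts.

AGENT_KEYWORDS = {
    "agent1_data_integration": ["download", "synapse", "bigquery", "metadata"],
    "agent2_bioinformatics": ["umap", "spatial plot", "cell-cell", "pathway", "cluster"],
}

def route_query(query: str) -> str:
    """Return the agent that should handle a user query; default to the chat agent."""
    q = query.lower()
    for agent, keywords in AGENT_KEYWORDS.items():
        if any(k in q for k in keywords):
            return agent
    return "agent3_chat"
```

For instance, the first example query above would be routed to the data-integration agent because it mentions a download, while the UMAP request would go to the bioinformatics agent.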
HTANalyzer is designed to run on Google Cloud Platform's Vertex AI. Users need a GCP project ID with billing enabled.
Required Python packages:
synapseclient==4.6.0
gcsfs==2024.6.1
scanpy==1.10.3
squidpy==1.6.1
pandas==2.2.2
matplotlib==3.7.1
google==2.0.3
vertexai==1.70.0
anndata==0.11.0
numpy==1.26.4
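The pinned versions above can be installed in one step, e.g. in a Colab cell (adjust versions as needed for your environment):

```shell
pip install synapseclient==4.6.0 gcsfs==2024.6.1 scanpy==1.10.3 squidpy==1.6.1 \
    pandas==2.2.2 matplotlib==3.7.1 google==2.0.3 vertexai==1.70.0 \
    anndata==0.11.0 numpy==1.26.4
```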
- Create a Synapse account and generate a personal authentication token with full view, download, and modify permissions.
- Use the token in your code:
import synapseclient
syn = synapseclient.login(authToken="your_token_here")
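Building on the login above, Agent 1's download step can be wrapped in a small helper. This is a sketch: `fetch_synapse_entity` is a hypothetical name, while `syn.get` with `downloadLocation` is the standard synapseclient call for downloading an entity.

```python
def fetch_synapse_entity(syn, entity_id, dest_dir="/content/datasets/"):
    """Download a Synapse entity (e.g. 'syn51133602') into dest_dir.

    `syn` is a logged-in synapseclient.Synapse instance; returns the local file path.
    """
    entity = syn.get(entity_id, downloadLocation=dest_dir)
    return entity.path
```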
import pandas_gbq
from google.cloud import bigquery
import pandas as pd
import base64
import vertexai
from vertexai.generative_models import GenerativeModel, Part, SafetySetting
import scanpy
import squidpy
import matplotlib.pyplot as plt
vertexai.init(project="your_project_id_here", location="us-central1")
model = GenerativeModel("gemini-1.5-flash-002")
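Each agent then wraps a call to this model. A minimal sketch is shown below; the `ask_agent` helper and its prompts are illustrative, and `generation_config` is passed as a plain dict, which the Vertex AI SDK accepts in place of a `GenerationConfig` object.

```python
def ask_agent(model, system_prompt, user_query, temperature=0.3):
    """Send an agent's system prompt plus the user query to a Gemini model
    and return the generated text (typically Python code or an explanation)."""
    response = model.generate_content(
        [system_prompt, user_query],
        generation_config={"temperature": temperature},
    )
    return response.text
```

Keeping the temperature low (0.3-0.5) reduces run-to-run variability in the generated code.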
To create each agent, we engineered prompts based on the data and metadata structure, examples of Google BigQuery and Synapse client usage, tutorials for single-cell RNA-seq and 10X Visium analysis from Python packages (e.g., SquidPy, ScanPy, and COMMOT), and iterative LLM-based prompt optimization. The code for each agent can be found in the following Jupyter notebooks:
Super Agent: demo_app_with_super_agent.ipynb
Agent 1: agent1.ipynb
Agent 2: agent2.ipynb
Agent 3: agent3.ipynb
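The prompts for each agent follow a common pattern: a role description, dataset context, package usage examples, and the user's request. The template below is a simplified, hypothetical illustration; the actual prompts in the notebooks include full schema descriptions and worked BigQuery/ScanPy examples.

```python
# Hypothetical, simplified prompt template for the bioinformatics agent.
AGENT2_PROMPT_TEMPLATE = """You are a bioinformatics agent that writes Python code
for 10X Visium spatial transcriptomics analysis using ScanPy and SquidPy.

Data description:
{data_description}

Example usage of the packages:
{package_examples}

User request:
{user_query}

Return only runnable Python code."""

def build_agent2_prompt(data_description, package_examples, user_query):
    """Fill the template with dataset context and the user's request."""
    return AGENT2_PROMPT_TEMPLATE.format(
        data_description=data_description,
        package_examples=package_examples,
        user_query=user_query,
    )
```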
- While using Google Colab notebooks, we occasionally ran into an error in the CuPy library that prevented several Python packages from importing. We solved this by running the following line:
!sudo apt-get install libnvidia-compute-550
- Output from HTANalyzer can be stochastic. We mitigated this by lowering the LLM temperature to between 0.3 and 0.5 and adding more detail to the prompts.
Future directions:
- Create a recursive error-handling agent to debug code generated by Agent 1 and Agent 2.
- Give the agents memory of a user's previous queries.
- Improve usability by developing a "chatbot" frontend and packaging HTANalyzer for easy installation.
- Extend Agent 1 (Data Integration) with prompts covering the clinical metadata available in HTAN Google buckets.
- Test different LLM APIs (e.g., OpenAI, Claude).
- Extend support beyond 10X Visium to additional ST technologies and bioinformatics tools.