
Llama 3.2-Vision Implementation #1160

Open · wants to merge 9 commits into base: main
9 changes: 9 additions & 0 deletions .env.example
@@ -33,6 +33,15 @@ AZURE_GPT4O_MINI_API_KEY=""
AZURE_GPT4O_MINI_API_BASE=""
AZURE_GPT4O_MINI_API_VERSION=""

# ENABLE_LLAMA: Set to true to enable Llama as a language model provider
ENABLE_LLAMA=false
# LLAMA_API_BASE: The base URL for Llama API (default: http://localhost:11434)
LLAMA_API_BASE=""
# LLAMA_MODEL_NAME: The model name to use (e.g., llama3.2-vision)
LLAMA_MODEL_NAME=""
# LLAMA_API_ROUTE: The API route for Llama (default: /api/chat)
LLAMA_API_ROUTE=""

# LLM_KEY: The chosen language model to use. This should be one of the models
# provided by the enabled LLM providers (e.g., OPENAI_GPT4_TURBO, OPENAI_GPT4V, ANTHROPIC_CLAUDE3, AZURE_OPENAI_GPT4V).
LLM_KEY=""
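As a rough illustration of how these variables combine at runtime (the fallback defaults below come from the comments above; the actual consuming code lives in skyvern/config.py):

```python
import os

# Read the Llama settings, falling back to the defaults the comments document.
enable_llama = os.environ.get("ENABLE_LLAMA", "false").lower() == "true"
llama_api_base = os.environ.get("LLAMA_API_BASE") or "http://localhost:11434"
llama_api_route = os.environ.get("LLAMA_API_ROUTE") or "/api/chat"

# The chat endpoint is the base URL joined with the API route.
llama_chat_url = llama_api_base.rstrip("/") + llama_api_route
```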
12 changes: 9 additions & 3 deletions Dockerfile
@@ -14,15 +14,21 @@ RUN playwright install-deps
RUN playwright install
RUN apt-get install -y xauth x11-apps netpbm && apt-get clean

# Add these lines to install dos2unix and convert entrypoint scripts
RUN apt-get update && \
apt-get install -y dos2unix && \
apt-get clean

COPY . /app

# Convert line endings
RUN dos2unix /app/entrypoint-skyvern.sh && \
chmod +x /app/entrypoint-skyvern.sh

ENV PYTHONPATH="/app:$PYTHONPATH"
ENV VIDEO_PATH=/data/videos
ENV HAR_PATH=/data/har
ENV LOG_PATH=/data/log
ENV ARTIFACT_STORAGE_PATH=/data/artifacts

COPY ./entrypoint-skyvern.sh /app/entrypoint-skyvern.sh
RUN chmod +x /app/entrypoint-skyvern.sh

CMD [ "/bin/bash", "/app/entrypoint-skyvern.sh" ]
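The dos2unix step matters because a script saved with Windows (CRLF) line endings makes the kernel look for `/bin/sh\r` in the shebang. A sketch of the byte-level conversion it performs:

```python
# A script saved on Windows: every line ends with "\r\n", which breaks the shebang.
crlf_script = b"#!/bin/sh\r\necho ok\r\n"

# dos2unix rewrites CRLF to LF, leaving everything else intact.
lf_script = crlf_script.replace(b"\r\n", b"\n")

assert b"\r" not in lf_script
```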
35 changes: 21 additions & 14 deletions docker-compose.yml
@@ -21,9 +21,12 @@ services:
      retries: 5

  skyvern:
    image: public.ecr.aws/skyvern/skyvern:latest
    # Replace the public image with a local build
    build:
      context: .
      dockerfile: Dockerfile
    # Keep the rest of the configuration
    restart: on-failure
    # comment out if you want to externally call skyvern API
    ports:
      - 8000:8000
    volumes:
@@ -35,18 +35,20 @@
    environment:
      - DATABASE_STRING=postgresql+psycopg://skyvern:skyvern@postgres:5432/skyvern
      - BROWSER_TYPE=chromium-headful
      - ENABLE_OPENAI=true
      - OPENAI_API_KEY=<your_openai_key>
      # If you want to use another LLM provider, like Azure or Anthropic:
      # - ENABLE_ANTHROPIC=true
      # - LLM_KEY=ANTHROPIC_CLAUDE3_OPUS
      # - ANTHROPIC_API_KEY=<your_anthropic_key>
      # - ENABLE_AZURE=true
      # - LLM_KEY=AZURE_OPENAI
      # - AZURE_DEPLOYMENT=<your_azure_deployment>
      # - AZURE_API_KEY=<your_azure_api_key>
      # - AZURE_API_BASE=<your_azure_api_base>
      # - AZURE_API_VERSION=<your_azure_api_version>
      - ENABLE_LLAMA=true
      - LLM_KEY=LLAMA3
      - LLAMA_API_BASE=http://192.168.1.65:11434
      - LLAMA_MODEL_NAME=llama3.2-vision
      - LLAMA_API_ROUTE=/api/chat
      - ENABLE_OPENAI=false
      - ENABLE_ANTHROPIC=false
      - ENABLE_AZURE=false
      - ENABLE_BEDROCK=false
      - ENABLE_AZURE_GPT4O_MINI=false
      - LLAMA_BASE_URL=http://192.168.1.65:11434
      - LLAMA_MODEL=llama3.2-vision
      - ENV=local
      - SECONDARY_LLM_KEY=LLAMA3
    depends_on:
      postgres:
        condition: service_healthy
@@ -55,6 +60,8 @@
      interval: 5s
      timeout: 5s
      retries: 5
    extra_hosts:
      - "host.docker.internal:host-gateway"

  skyvern-ui:
    image: public.ecr.aws/skyvern/skyvern-ui:latest
40 changes: 25 additions & 15 deletions setup.sh
@@ -9,7 +9,7 @@ log_event() {

# Function to check if a command exists
command_exists() {
command -v "$1" &> /dev/null
command -v "$1" &>/dev/null
}

ensure_required_commands() {
@@ -31,7 +31,7 @@ update_or_add_env_var() {
sed -i.bak "s/^$key=.*/$key=$value/" .env && rm -f .env.bak
else
# Add new variable
echo "$key=$value" >> .env
echo "$key=$value" >>.env
fi
}
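update_or_add_env_var is the standard sed replace-or-append pattern; the same logic, sketched in Python for clarity (a stand-in, not code from this PR):

```python
import re

def update_or_add_env_var(text: str, key: str, value: str) -> str:
    """Replace KEY=... in place if present, otherwise append it."""
    pattern = re.compile(rf"^{re.escape(key)}=.*$", flags=re.MULTILINE)
    if pattern.search(text):
        return pattern.sub(f"{key}={value}", text)
    return text + f"{key}={value}\n"

env = "ENABLE_OPENAI=true\n"
env = update_or_add_env_var(env, "ENABLE_OPENAI", "false")  # updates in place
env = update_or_add_env_var(env, "ENABLE_LLAMA", "true")    # appends
```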

@@ -98,24 +98,32 @@ setup_llm_providers() {
update_or_add_env_var "ENABLE_AZURE" "false"
fi

echo "Do you want to enable Llama (y/n)?"
read enable_llama
if [[ "$enable_llama" == "y" ]]; then
read -p "Enter path to Llama model: " llama_model_path
update_or_add_env_var "ENABLE_LLAMA" "true"
update_or_add_env_var "LLAMA_MODEL_PATH" "$llama_model_path"
model_options+=("LLAMA_3_2_VISION")
fi

# Model Selection
if [ ${#model_options[@]} -eq 0 ]; then
echo "No LLM providers enabled. You won't be able to run Skyvern unless you enable at least one provider. You can re-run this script to enable providers or manually update the .env file."
else
echo "Available LLM models based on your selections:"
for i in "${!model_options[@]}"; do
echo "$((i+1)). ${model_options[$i]}"
echo "$((i + 1)). ${model_options[$i]}"
done
read -p "Choose a model by number (e.g., 1 for ${model_options[0]}): " model_choice
chosen_model=${model_options[$((model_choice-1))]}
chosen_model=${model_options[$((model_choice - 1))]}
echo "Chosen LLM Model: $chosen_model"
update_or_add_env_var "LLM_KEY" "$chosen_model"
fi

echo "LLM provider configurations updated in .env."
}


# Function to initialize .env file
initialize_env_file() {
if [ -f ".env" ]; then
@@ -165,14 +173,16 @@ remove_poetry_env() {

# Choose python version
choose_python_version_or_fail() {
# https://github.com/python-poetry/poetry/issues/2117
# Py --list-paths
# https://github.com/python-poetry/poetry/issues/2117
# Py --list-paths
# This will output which paths are being used for Python 3.11
# Windows users need to poetry env use {{ Py --list-paths with 3.11}}
poetry env use python3.11 || { echo "Error: Python 3.11 is not installed. If you're on Windows, check out https://github.com/python-poetry/poetry/issues/2117 to unblock yourself"; exit 1; }
# Windows users need to poetry env use {{ Py --list-paths with 3.11}}
poetry env use python3.11 || {
echo "Error: Python 3.11 is not installed. If you're on Windows, check out https://github.com/python-poetry/poetry/issues/2117 to unblock yourself"
exit 1
}
}


# Function to install dependencies
install_dependencies() {
poetry install
@@ -211,25 +221,25 @@ setup_postgresql() {
return 0
fi
fi

# Check if Docker is installed and running
if ! command_exists docker || ! docker info > /dev/null 2>&1; then
if ! command_exists docker || ! docker info >/dev/null 2>&1; then
echo "Docker is not running or not installed. Please install or start Docker and try again."
exit 1
fi

# Check if PostgreSQL is already running in a Docker container
if docker ps | grep -q postgresql-container; then
echo "PostgreSQL is already running in a Docker container."
else
else
# Attempt to install and start PostgreSQL using Docker
echo "Attempting to install PostgreSQL via Docker..."
docker run --name postgresql-container -e POSTGRES_HOST_AUTH_METHOD=trust -d -p 5432:5432 postgres:14
echo "PostgreSQL has been installed and started using Docker."

# Wait for PostgreSQL to start
echo "Waiting for PostgreSQL to start..."
sleep 20 # Adjust sleep time as necessary
sleep 20 # Adjust sleep time as necessary
fi
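The fixed `sleep 20` is fragile: too short on slow machines, wasted time on fast ones. Polling until the database is ready is more robust; the retry pattern is sketched here with a stand-in check (in the real script the check would be `pg_isready` via `docker exec`):

```python
import time

def wait_until_ready(check_ready, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll check_ready() until it returns True or attempts run out."""
    for _ in range(attempts):
        if check_ready():
            return True
        time.sleep(delay)
    return False

# Stand-in for pg_isready: succeeds on the third poll.
polls = {"count": 0}
def fake_pg_isready() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3

ready = wait_until_ready(fake_pg_isready, attempts=5, delay=0)
```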

# Assuming docker exec works directly since we've checked Docker's status before
@@ -272,7 +282,7 @@ create_organization() {
fi

# Update the secrets-open-source.toml file
echo -e "[skyvern]\nconfigs = [\n {\"env\" = \"local\", \"host\" = \"http://127.0.0.1:8000/api/v1\", \"orgs\" = [{name=\"Skyvern\", cred=\"$api_token\"}]}\n]" > .streamlit/secrets.toml
echo -e "[skyvern]\nconfigs = [\n {\"env\" = \"local\", \"host\" = \"http://127.0.0.1:8000/api/v1\", \"orgs\" = [{name=\"Skyvern\", cred=\"$api_token\"}]}\n]" >.streamlit/secrets.toml
echo ".streamlit/secrets.toml file updated with organization details."

# Check if skyvern-frontend/.env exists and back it up
11 changes: 11 additions & 0 deletions skyvern/__init__.py
@@ -2,6 +2,8 @@
from ddtrace.filters import FilterRequestsOnUrl

from skyvern.forge.sdk.forge_log import setup_logger
from typing import Any
from skyvern.forge.sdk.models import Step

tracer.configure(
    settings={
@@ -11,3 +13,12 @@
    },
)
setup_logger()

async def llama_handler(
    prompt: str,
    step: Step | None = None,
    screenshots: list[bytes] | None = None,
    parameters: dict[str, Any] | None = None,
) -> dict[str, Any]:
    # Implement Llama 3.2 vision API integration here
    ...

Reviewer comment (Contributor): The new llama_handler function appears to duplicate existing functionality in skyvern/forge/sdk/api/llm/llama_handler.py. Consider reusing or extending the existing function instead of adding a new one.
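A minimal sketch of what the stub's body might do, assuming Ollama's chat response shape ({"message": {"content": ...}}); the HTTP call is elided so only the parsing stands alone, and none of this is code from the PR:

```python
import json
from typing import Any

def parse_llama_response(raw: str) -> dict[str, Any]:
    """Extract the assistant's JSON payload from an Ollama chat response."""
    response = json.loads(raw)
    content = response["message"]["content"]  # Ollama chat shape (assumed)
    return json.loads(content)  # Skyvern's prompts demand pure-JSON replies

canned = json.dumps({"message": {"role": "assistant", "content": '{"actions": []}'}})
parsed = parse_llama_response(canned)
```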
36 changes: 33 additions & 3 deletions skyvern/config.py
@@ -5,7 +5,26 @@


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=(".env", ".env.staging", ".env.prod"), extra="ignore")
    # Use only model_config, not Config class
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore",
    )

    # Llama Configuration
    ENABLE_LLAMA: bool = True
    LLAMA_API_BASE: str = "http://192.168.1.65:11434"
    LLAMA_MODEL_NAME: str = "llama3.2-vision"
    LLAMA_API_ROUTE: str = "/api/chat"
    LLM_KEY: str = "LLAMA3"
    SECONDARY_LLM_KEY: str = "LLAMA3"

    # Disable other providers
    ENABLE_OPENAI: bool = False
    ENABLE_ANTHROPIC: bool = False
    ENABLE_AZURE: bool = False
    ENABLE_BEDROCK: bool = False

    ADDITIONAL_MODULES: list[str] = []

@@ -18,6 +37,14 @@ class Settings(BaseSettings):
    BROWSER_SCREENSHOT_TIMEOUT_MS: int = 20000
    BROWSER_LOADING_TIMEOUT_MS: int = 120000
    OPTION_LOADING_TIMEOUT_MS: int = 600000
    MAX_SCRAPING_RETRIES: int = 0
    VIDEO_PATH: str | None = None
    HAR_PATH: str | None = "./har"
    LOG_PATH: str = "./log"
    BROWSER_ACTION_TIMEOUT_MS: int = 5000
    MAX_STEPS_PER_RUN: int = 75
    MAX_NUM_SCREENSHOTS: int = 10
    # Ratio should be between 0 and 1.
@@ -91,8 +118,8 @@ class Settings(BaseSettings):
# LLM Configuration #
#####################
# ACTIVE LLM PROVIDER
LLM_KEY: str = "OPENAI_GPT4O"
SECONDARY_LLM_KEY: str | None = None
LLM_KEY: str = "LLAMA3" # Change default from OPENAI_GPT4O
SECONDARY_LLM_KEY: str = "LLAMA3" # Also set this to LLAMA3
# COMMON
LLM_CONFIG_TIMEOUT: int = 300
LLM_CONFIG_MAX_TOKENS: int = 4096
@@ -126,6 +153,9 @@ class Settings(BaseSettings):

    SVG_MAX_LENGTH: int = 100000

    # Add debug property
    DEBUG: bool = True

    def is_cloud_environment(self) -> bool:
        """
        :return: True if env is not local, else False
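pydantic-settings resolves each field from the process environment first, then the env_file, then the declared default; that precedence (which makes the hard-coded LLAMA defaults above mere fallbacks) can be mirrored with the stdlib:

```python
import os

# Declared defaults, as in the Settings class above.
DEFAULTS = {"LLAMA_API_BASE": "http://192.168.1.65:11434",
            "LLAMA_MODEL_NAME": "llama3.2-vision"}

def resolve(key: str, env_file: dict[str, str]) -> str:
    """Process environment beats .env file beats declared default."""
    return os.environ.get(key) or env_file.get(key) or DEFAULTS[key]

env_file = {"LLAMA_MODEL_NAME": "llama3.2"}    # pretend .env contents
base = resolve("LLAMA_API_BASE", env_file)     # falls through to the default
model = resolve("LLAMA_MODEL_NAME", env_file)  # .env wins over the default
```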
3 changes: 2 additions & 1 deletion skyvern/forge/prompts.py
@@ -1,4 +1,5 @@
from skyvern.forge.sdk.prompting import PromptEngine

# Initialize the prompt engine
prompt_engine = PromptEngine("skyvern")
prompt_engine = PromptEngine("ollama")
prompt_engine_llama = PromptEngine("ollama")
45 changes: 45 additions & 0 deletions skyvern/forge/prompts/ollama/answer-user-detail-questions.j2
@@ -0,0 +1,45 @@
You are a JSON API endpoint that answers questions based on user details and goals. API endpoints ONLY return data - no explanations allowed.

Purpose:
- Answer user questions based on provided information
- Use exact information from user details
- Keep answers direct and concise
- Fill in answers as JSON key-value pairs

Input data:
User's goal: {{ navigation_goal }}
User's details: {{ navigation_payload }}
User's questions: {{ queries_and_answers }}

Instructions for answering:
1. Read each question carefully
2. Find relevant information in user's goal and details
3. Provide only the exact information needed
4. Include answers in the JSON response
5. Keep answers direct - no explanations
6. Use precise values from provided details

CRITICAL FORMATTING RULES:
1. Start response with { and end with }
2. NO text before or after JSON
3. NO markdown formatting or code blocks
4. NO explanations, notes, or comments
5. NO additional formatting or whitespace
6. Response must be pure JSON only

Response format (replace with actual answers):
{
    "question_1": "",
    "question_2": "",
    "question_3": ""
}

AUTOMATIC FAILURE TRIGGERS:
- Text before the opening {
- Text after the closing }
- Explanations or markdown
- Notes or comments
- Code blocks or ```
- Any content outside JSON structure

These answers will be used to fill out information on a webpage automatically. Invalid format will cause system errors.
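Rules this strict exist because the caller typically feeds the reply straight to json.loads. A defensive extractor on the consuming side (a sketch, not part of this PR) can still salvage replies that break the rules:

```python
import json
import re

def extract_json(reply: str) -> dict:
    """Parse a model reply that should be pure JSON, tolerating stray text."""
    # Grab everything from the first "{" to the last "}" (handles code fences).
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in reply")
    return json.loads(match.group(0))

reply = 'Sure! ```json\n{"question_1": "blue"}\n```'
answers = extract_json(reply)
```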