This project is a real-time speech-to-speech voice assistant built using OpenAI's Realtime API. The assistant listens to your voice input, sends it to OpenAI's API for processing, and plays back the response as audio output. It is designed to handle both text and audio responses, leveraging WebSockets to communicate with the OpenAI Realtime API for seamless voice interaction.
## Table of Contents

- Features
- Requirements
- Installation
- Usage
- Environment Variables
- OpenAI Realtime API Qualities
- Project Structure
- Logging and Debugging
## Features

- Voice Input: Records your voice input in real time.
- Text and Audio Response: Processes responses in both text and audio formats.
- WebSocket Communication: Uses OpenAI's WebSocket API for efficient, interactive communication.
- Audio Playback: Plays the audio response received from the assistant in real time.
- Customizable Models: Allows the use of different OpenAI models via environment variable configuration.
## Requirements

- Python 3.10+
- Third-party libraries (managed with `poetry`):
  - `pyaudio`
  - `websockets`
  - `python-dotenv`
- Standard-library modules (no installation needed): `wave`, `base64`, `json`, `logging`

The third-party dependencies are installed automatically by `poetry` during installation (see below); a sketch of the resulting imports follows.
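For orientation, these are the imports the dependency list corresponds to (a sketch showing which modules ship with Python and which `poetry` installs):

```python
# Standard library -- no installation required.
import base64
import json
import logging
import wave

# Third-party -- installed by poetry.
import pyaudio
import websockets
from dotenv import load_dotenv
```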
## Installation

1. Clone the repository:

   ```sh
   git clone <repository_url>
   cd <repository_directory>
   ```

2. Install dependencies using `poetry`:

   ```sh
   poetry lock
   poetry install
   ```

3. Ensure the `.env` file is configured (see Environment Variables below).

## Usage

1. Run the application:

   ```sh
   python3 Speech2Speech.py
   ```

2. Interact: you will be prompted to start speaking. The assistant will record your voice input, process it, and respond in audio form.
## Environment Variables

Create a `.env` file in the project root with the following variables:

```sh
WS_URL="<OpenAI_WebSocket_URL>"
MODEL="<OpenAI_Model_Name>"
OPENAI_API_KEY="<Your_OpenAI_API_Key>"
```

- `WS_URL`: WebSocket endpoint for the OpenAI Realtime API.
- `MODEL`: Name of the OpenAI model you intend to use.
- `OPENAI_API_KEY`: Your OpenAI API key, used to authenticate requests.
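As a reference, here is a minimal sketch of how the script might load and validate these variables at startup using `python-dotenv`; the fail-fast check is illustrative, not part of the project's actual code:

```python
import os

from dotenv import load_dotenv

# Read WS_URL, MODEL, and OPENAI_API_KEY from .env into the
# process environment.
load_dotenv()

# Fail fast if any required variable is missing.
for name in ("WS_URL", "MODEL", "OPENAI_API_KEY"):
    if not os.getenv(name):
        raise RuntimeError(f"Missing required environment variable: {name}")
```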
## OpenAI Realtime API Qualities

The OpenAI Realtime API provides powerful capabilities for building interactive, real-time applications. Key qualities include:
- Low Latency: Optimized for low-latency interactions, allowing near-instantaneous feedback for a smooth, conversational experience.
- Streamed Responses: Delivers responses as they are generated, enabling real-time feedback and dynamic interactions in text, audio, or combined formats.
- Multi-Modal Responses: Supports a mix of text and audio modalities, allowing the assistant to respond in a format best suited to the user’s needs.
- Flexible Customization: The API allows for dynamic instruction and model behavior changes, enabling more tailored interactions and an enhanced user experience.
- Efficient WebSocket Protocol: Leveraging WebSocket technology, the Realtime API maintains a continuous, efficient connection for uninterrupted interaction.
- Error Handling and Event Management: The API includes structured error and event types to facilitate robust interaction management, making it easier to handle various response types and errors in real time.
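To make the streamed, event-driven protocol concrete, below is a minimal sketch of a single text-only exchange using the `websockets` library. The event names (`response.create`, `response.text.delta`, `response.done`, `error`) follow the Realtime API beta; note that older `websockets` releases use `extra_headers` instead of `additional_headers`:

```python
import asyncio
import json
import os

import websockets
from dotenv import load_dotenv

load_dotenv()

async def main():
    # Connect to the Realtime API; URL, model, and key come from .env.
    url = f"{os.getenv('WS_URL')}?model={os.getenv('MODEL')}"
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Ask for a text-only response to keep the example simple.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"], "instructions": "Say hello."},
        }))
        # Consume server events as they stream in.
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.text.delta":
                print(event["delta"], end="", flush=True)  # streamed text
            elif event["type"] == "response.done":
                break
            elif event["type"] == "error":
                print("Error event:", event)
                break

asyncio.run(main())
```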
## Project Structure

- Speech2Speech.py: Main script to run the assistant. Contains the `RealtimeClient` class, which manages the WebSocket connection, audio recording, and playback.
- Environment Configuration: The `.env` file holds sensitive keys and URLs.
- Audio Functions (see the sketch after this list):
  - `start_audio_stream`: Initializes the audio stream for recording.
  - `record_audio`: Captures audio input for a specified duration.
  - `play_audio`: Plays back the audio response.
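The sketch below illustrates what these helpers typically look like with `pyaudio`; it is not the project's actual implementation, and the format constants (16-bit mono PCM at 24 kHz, a common Realtime API configuration) are assumptions:

```python
import pyaudio

# Assumed audio format: 16-bit mono PCM at 24 kHz.
RATE = 24000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024

audio = pyaudio.PyAudio()

def start_audio_stream():
    """Open a microphone input stream for recording."""
    return audio.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                      input=True, frames_per_buffer=CHUNK)

def record_audio(stream, seconds: float) -> bytes:
    """Capture raw PCM audio from the stream for the given duration."""
    frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * seconds))]
    return b"".join(frames)

def play_audio(pcm_data: bytes):
    """Play raw PCM audio through the default output device."""
    out = audio.open(format=FORMAT, channels=CHANNELS, rate=RATE, output=True)
    out.write(pcm_data)
    out.stop_stream()
    out.close()
```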
## Logging and Debugging

The application uses Python's built-in `logging` module to capture detailed logs for debugging. Logging levels can be set in the script for more granular output:

- INFO: High-level events such as connection status and response types.
- DEBUG: Detailed information, including data encoding, event processing, and buffer handling.
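Switching the verbosity is a one-line change with `logging.basicConfig`; the format string here is illustrative:

```python
import logging

# Set level to logging.INFO for high-level events, or
# logging.DEBUG for encoding, event, and buffer details.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
```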