As part of Google for Developers' mission to build for the community, this repository shares the complete workflow of the web application presented at the AI booth during Google I/O Extended Hanoi 2024. The application, named Bông, is a real-time VLM (vision-language model) web app with both voice input and voice output.
Speak, See, and Interact with Bông
- Real-time VLM Web App: Supports both voice input and output for interactive experiences.
- Multimodal Model Integration: Utilizes Gemini 1.5 Flash for handling diverse inputs including audio, images, videos, and text.
- Google Ecosystem Utilization: Employs Google Cloud APIs and WaveNet TTS for natural spoken communication in Vietnamese.
- RAG Workflow: Incorporates Retrieval-Augmented Generation to keep the app updated with event information and GDG Hanoi news.
- Natural and Humorous Responses: Designed to engage attendees with real-time, context-aware interactions.
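Context-aware responses depend on carrying conversation history into each request. The sketch below is a minimal, plain-JavaScript illustration of that idea, not the app's actual code (the project uses LangChain for this); the persona text, `buildPrompt` helper, and turn format are all hypothetical.

```javascript
// Sketch: fold recent conversation history into a system prompt so the
// model stays context-aware across turns. Illustrative only — the app
// itself builds its chain with LangChain.

const SYSTEM_INSTRUCTIONS =
  "You are Bông, a friendly assistant at Google I/O Extended Hanoi 2024. " +
  "Answer naturally and with humor.";

function buildPrompt(history, userMessage, maxTurns = 4) {
  // Keep only the most recent turns to bound prompt size.
  const recent = history.slice(-maxTurns);
  const lines = recent.map((t) => `${t.role}: ${t.text}`);
  return [SYSTEM_INSTRUCTIONS, ...lines, `user: ${userMessage}`].join("\n");
}

const history = [
  { role: "user", text: "What is this booth about?" },
  { role: "assistant", text: "This is the AI booth — ask me anything!" },
];

const prompt = buildPrompt(history, "When does the next session start?");
console.log(prompt);
```

Capping the history at a few turns keeps the prompt small enough for real-time latency while still giving the model the immediate conversational context.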
- Gemini 1.5 Flash: A lightweight model optimized for speed and efficiency at scale, supporting a context window of up to 1 million tokens.
- Multimodal Input: Accepts inputs from webcam videos, microphone speech recognition, and other media types.
- Google Cloud's WaveNet TTS: Enhances the app's ability to communicate naturally in Vietnamese.
- Embedding Extraction: Uses the Google Text Embedding API to extract embeddings from text content at the source URLs.
- Chain Construction with LangChain: Builds a chain whose system prompt incorporates the conversational history, acting as a lightweight memory cache.
- Real-time Response: The web application responds in real time, even in noisy environments and with multiple people in the frame.
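The RAG workflow above can be sketched as follows. This is a minimal illustration rather than the app's actual code: it assumes the chunk embeddings have already been obtained from the Google Text Embedding API, and shows only the retrieval step (cosine similarity plus top-k selection) used to ground the prompt with event information. The chunk texts and vectors are toy values.

```javascript
// Minimal RAG retrieval sketch: rank pre-embedded text chunks by cosine
// similarity to a query embedding, then keep the top-k as prompt context.
// The embeddings here are toy 3-d vectors; in the real app they would come
// from the Google Text Embedding API.

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function retrieveTopK(queryEmbedding, chunks, k = 2) {
  return chunks
    .map((c) => ({ ...c, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hypothetical index of event-info chunks.
const chunks = [
  { text: "Google I/O Extended Hanoi 2024 takes place in October.", embedding: [0.9, 0.1, 0.0] },
  { text: "GDG Hanoi organizes monthly community meetups.", embedding: [0.2, 0.8, 0.1] },
  { text: "Bông answers questions about the event schedule.", embedding: [0.7, 0.3, 0.2] },
];

const queryEmbedding = [1.0, 0.0, 0.0]; // e.g. an embedded "When is the event?"
const context = retrieveTopK(queryEmbedding, chunks).map((c) => c.text);
console.log(context);
```

The retrieved texts are then injected into the system prompt so the model can answer with up-to-date event and GDG Hanoi information instead of relying on its training data alone.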
1. Clone the repository:

   ```bash
   git clone https://github.com/tuanlda78202/geminio.git
   ```

2. Navigate to the project directory:

   ```bash
   cd geminio
   ```

3. Install dependencies:

   ```bash
   npm install
   ```

4. Download `.google-cloud-credentials` from Google Cloud and set `VITE_GEMINI_KEY` and `GOOGLE_APPLICATION_CREDENTIALS` in `.env`.

5. Run the application:

   ```bash
   npm run dev
   ```

6. Open port `3001` for Google Cloud TTS.
7. Open your browser and navigate to `http://localhost:3000`.
8. Allow access to your microphone and webcam.
9. Interact with the application using voice commands and visual inputs.
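The setup steps above reference two environment variables in `.env`. A hedged example of what that file might look like (the values are placeholders, and the variable names are taken from the setup step):

```
# .env — placeholder values only
VITE_GEMINI_KEY=your-gemini-api-key
GOOGLE_APPLICATION_CREDENTIALS=./.google-cloud-credentials
```

Note that Vite only exposes variables prefixed with `VITE_` to client-side code, which is why the Gemini key carries that prefix while the server-side credentials path does not.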
We welcome contributions from the community. Please follow these steps to contribute:
1. Fork the repository.
2. Create a new branch for your feature or bug fix:

   ```bash
   git checkout -b feature-name
   ```

3. Commit your changes:

   ```bash
   git commit -m "Description of feature or fix"
   ```

4. Push to the branch:

   ```bash
   git push origin feature-name
   ```

5. Create a pull request on GitHub.
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or feedback, please contact me.