Welcome to AI-Powered Web Scraper and Chatbot, a Python application that scrapes web content and lets you query it through a chatbot. The project uses Gemini Vision Pro, Selenium, custom logic, LangChain, and FAISS to extract, process, and interact with data from websites.
The application scrapes content from websites and provides an interactive chat interface over the scraped data. Here's how each component contributes:
- Gemini Vision Pro: Utilizes advanced computer vision techniques to identify and extract relevant content from images and other media types on web pages.
- Selenium: Automates browser interactions to navigate through web pages, simulate user actions, and scrape dynamic content.
- Custom Logic: Employs custom algorithms and heuristics to process the scraped data, ensuring accurate and contextually relevant information extraction.
- LangChain: Connects the language model, the embeddings, and the vector store so the chatbot can answer questions over the scraped content.
- FAISS: Utilized for efficient vector storage and similarity search, enabling fast and accurate retrieval of relevant information.
Once the content is scraped, the application uses natural language processing and vector storage technologies to offer chatbot functionality, allowing users to interact with the extracted data through meaningful conversations.
The application follows these steps to provide responses to your questions:
- Data Loading: The app reads data from various sources, including web content, and processes the content using Gemini Vision Pro, Selenium, and custom logic.
- Text Chunking: The extracted text is divided into smaller chunks for efficient processing.
- Embedding: The application uses Google Generative AI to generate vector representations (embeddings) of the text chunks.
- Vector Storage: FAISS is used to store and manage these vector embeddings efficiently.
- Similarity Matching: When you ask a question, the app compares it with the text chunks and identifies the most semantically similar ones using FAISS.
- Response Generation: The selected chunks are passed to the language model, which generates a response based on the relevant content.
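The chunking and similarity-matching steps above can be sketched in plain Python. This is only an illustration under stated assumptions: word-count vectors and a linear scan stand in for the Google Generative AI embeddings and the FAISS index that the application actually uses, and the function names and chunk sizes are hypothetical, not taken from the project.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into overlapping character windows (sizes are assumptions)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last window is not fully inside the overlap.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(text):
    """Toy embedding: lowercase word counts. A real embedding model goes here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_chunks(question, chunks, k=2):
    """Return the k chunks most similar to the question (linear scan, not FAISS)."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


if __name__ == "__main__":
    scraped = [
        "The team won 50 games in 2015.",
        "Selenium drives a browser.",
        "FAISS stores vectors.",
    ]
    # The selected chunks would then be passed to the language model as context.
    print(top_chunks("How many games did the team win?", scraped, k=1))
```

In the real pipeline the retrieved chunks are handed to the language model together with the question; the sketch stops at retrieval.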
To install the AI-Powered Web Scraper and Chatbot, please follow these steps:
- Clone the repository to your local machine and change into the directory that git creates:
      git clone https://github.com/mxlik-ali/Ai-WebScraper.git
      cd Ai-WebScraper
- Create and activate a virtual environment. Use the commands for your operating system:
  - Windows:
        python -m venv venv
        .\venv\Scripts\activate
  - Linux:
        python3 -m venv venv
        source venv/bin/activate
- Install the required dependencies by running the following command:
      pip install -r requirements.txt
- Obtain API keys from Google Generative AI and Google Gemini Vision Pro:
  - Sign up at Google AI Studio and obtain your API key.
- Create a .env file in the project directory and add your API key and the URL to scrape:
      GOOGLE_API_KEY = "your-api-key"
      URL = 'https://www.scrapethissite.com/pages/ajax-javascript/#2015'
      # Paste the URL you want to scrape into the URL environment variable. Other example URLs:
      # https://www.scrapethissite.com/pages/forms/
      # https://www.scrapethissite.com/pages/advanced/
- Create two folders in the project directory, named image_saves and scrape.
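Before running anything, it can help to confirm that the setup is complete. The following is a minimal sketch of such a check, not part of the project: `parse_env_file` and `check_setup` are hypothetical helpers, and the parser handles only simple `KEY = "value"` lines rather than the full dotenv format.

```python
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY = "value" lines; not a full dotenv implementation."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if value[:1] in ("'", '"'):
            # Quoted value: keep everything up to the matching closing quote,
            # so a '#' inside the URL is not mistaken for a comment.
            end = value.find(value[0], 1)
            value = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted value: drop a trailing " # comment".
            value = value.split(" #", 1)[0].strip()
        values[key] = value
    return values


def check_setup(project_dir="."):
    """Return a list of problems; an empty list means the setup looks complete."""
    project_dir = Path(project_dir)
    problems = []
    env_path = project_dir / ".env"
    if env_path.is_file():
        env = parse_env_file(env_path)
    else:
        env = {}
        problems.append("missing .env file")
    for key in ("GOOGLE_API_KEY", "URL"):
        if not env.get(key):
            problems.append(f"{key} is not set")
    for folder in ("image_saves", "scrape"):
        if not (project_dir / folder).is_dir():
            problems.append(f"missing folder: {folder}")
    return problems


if __name__ == "__main__":
    for problem in check_setup("."):
        print(problem)
```

Run from the project directory, it prints one line per missing piece and nothing when the .env file and both folders are in place.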
To use the AI-Powered Web Scraper and Chatbot, follow these steps:
- Ensure that you have installed the required dependencies and added the API keys and URL to the .env file.
- Run the main.py file to scrape the website. Execute the following command:
      python main.py
- After the scraping is complete, run the app.py file using the Streamlit CLI to start the chatbot application:
      streamlit run app.py
- The application will launch in your default web browser, displaying the user interface.
- Ask questions in natural language using the chat interface.
- Scrape only one site at a time (scraping multiple pages of the same site is supported). To scrape a different website, first remove sitemap.json from the scrape folder and delete the folder named "faiss-index".
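The cleanup needed before scraping a different website can be scripted. This is a minimal sketch, assuming the paths named above (scrape/sitemap.json and a "faiss-index" folder in the project directory); `reset_scrape_state` is a hypothetical helper, not something the project ships.

```python
import shutil
from pathlib import Path


def reset_scrape_state(project_dir="."):
    """Remove scrape/sitemap.json and the faiss-index folder if they exist.

    Returns the project-relative paths that were actually removed.
    """
    project_dir = Path(project_dir)
    removed = []

    sitemap = project_dir / "scrape" / "sitemap.json"
    if sitemap.is_file():
        sitemap.unlink()
        removed.append(sitemap.relative_to(project_dir).as_posix())

    index_dir = project_dir / "faiss-index"
    if index_dir.is_dir():
        # The index is rebuilt from the new site's content, so delete it whole.
        shutil.rmtree(index_dir)
        removed.append(index_dir.relative_to(project_dir).as_posix())

    return removed


if __name__ == "__main__":
    print(reset_scrape_state("."))
```

The scrape folder itself is left in place, since the installation steps require it to exist. After the reset, update URL in the .env file and run main.py again.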
This repository is intended for educational purposes and does not accept further contributions. Feel free to utilize and enhance the app based on your own requirements.
The AI-Powered Web Scraper and Chatbot is released under the MIT License.