This repo contains scripts used to crawl websites and store documents to be indexed for RAG solutions.
How to run these scripts locally:
- Clone this repository
- Navigate to the project directory: `cd <your-project>`
- Create a Python virtual environment and activate it:
  `python -m venv assetEnv`
  `source assetEnv/bin/activate`
- Run the scripts: `python <script>.py`
The order to run the scripts:
- run `get-urls-from-sitemap.py` to get URLs from the sitemap (a rough sketch of this step is shown after this list)
- run `get-internal-links.py` to get internal URLs not in the sitemap
- run `get-all-link-data.py` to get the data from the links
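As a rough illustration of the first step, a minimal sketch of pulling URLs from a sitemap might look like the following. The sitemap URL and output filename are assumptions for the example, not the exact behavior of `get-urls-from-sitemap.py`:

```python
# Minimal sketch: fetch a sitemap and extract the <loc> URLs.
# The sitemap URL and output filename are placeholders, not the
# exact behavior of get-urls-from-sitemap.py.
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"  # assumed placeholder

def get_sitemap_urls(sitemap_url):
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    # Sitemap entries live in <loc> elements, usually under this namespace.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]

if __name__ == "__main__":
    urls = get_sitemap_urls(SITEMAP_URL)
    with open("sitemap-urls.txt", "w") as f:
        f.write("\n".join(urls))
```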
Use multithreading for faster crawling and saving of results. Some requests may be blocked by the host website, or the site may be unable to handle the load at that time. The approach is to collect the missing URLs and rerun the script with multithreading; it is a little manual, but still much faster. A rough sketch of the multithreaded approach is shown after the list below.
- run `get-multithreaded-link-data.py` to get data using a multithreaded approach
- run `get-all-link-data-multithreaded-info-into-db.py` to put the data into the database
- run `get-documentation-page-data.py` to parse HTML page data, including tables
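A minimal sketch of the multithreaded crawl-and-retry idea described above, using `concurrent.futures`. The worker count, timeouts, and file names are assumptions for the example, not the exact logic of `get-multithreaded-link-data.py`:

```python
# Minimal sketch of multithreaded crawling with failure tracking.
# Worker count, timeouts, and file names are assumptions, not the
# exact behavior of get-multithreaded-link-data.py.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def crawl(urls, max_workers=8):
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                # Blocked or overloaded requests end up here; rerun them later.
                failed.append(url)
    return results, failed

if __name__ == "__main__":
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    results, failed = crawl(urls)
    with open("missing-urls.txt", "w") as f:
        f.write("\n".join(failed))
```

The failed URLs written to the output file can then be fed back into the script on a second pass, which is the manual-but-fast rerun step described above.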
Helper scripts:
- `pdf-links.py`: get all PDF links from the URLs
- `get-article.py`: navigate to an article within an HTML page
- `convert-jsonl-to-json.py`: converts newline-delimited JSON (JSONL) to JSON
- `find-missing-links.py`: verify that links from the .csv are in fact missing
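For reference, converting newline-delimited JSON (JSONL) to a single JSON array can be done along these lines; the file names are placeholders and not necessarily what `convert-jsonl-to-json.py` uses:

```python
# Minimal sketch: convert JSONL (one JSON object per line) to a JSON array.
# Input/output file names are placeholders.
import json

def jsonl_to_json(jsonl_path, json_path):
    records = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    with open(json_path, "w") as f:
        json.dump(records, f, indent=2)

if __name__ == "__main__":
    jsonl_to_json("data.jsonl", "data.json")
```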