This repo contains scripts used to crawl websites and store documents to be indexed for RAG solutions.
How to run these scripts locally:
- Clone this repository
- Navigate to the project directory: `cd <your-project>`
- Create a Python virtual environment and activate it:
  `python -m venv assetEnv`
  `source assetEnv/bin/activate`
- Run the scripts: `python <script>.py`
The order to run the scripts:
- run `get-urls-from-sitemap.py` to get URLs from the sitemap (a rough sketch of this step is shown after this list)
- run `get-internal-links.py` to get internal URLs not in the sitemap
- run `get-all-link-data.py` to get the data from the links
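As a rough illustration of the first step, a minimal sketch of pulling URLs from a sitemap might look like the following. The sitemap URL and output filename are assumptions for the example, not the exact behavior of `get-urls-from-sitemap.py`:

```python
# Minimal sketch: fetch a sitemap and extract the <loc> URLs.
# The sitemap URL and output filename are placeholders, not the
# exact behavior of get-urls-from-sitemap.py.
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"  # assumed placeholder

def get_sitemap_urls(sitemap_url):
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    # Sitemap entries live in <loc> elements, usually under this namespace.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]

if __name__ == "__main__":
    urls = get_sitemap_urls(SITEMAP_URL)
    with open("sitemap-urls.txt", "w") as f:
        f.write("\n".join(urls))
```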
Use multithreading for faster crawling and saving of results. Some requests may be blocked by the host website, or the site may be unable to handle the load at that time. The approach is to collect the missing URLs and rerun the script with multithreading; it is a little manual, but still much faster. A rough sketch of the multithreaded approach is shown after the list below.
- run `get-multithreaded-link-data.py` to get data using a multithreaded approach
- run `get-all-link-data-multithreaded-info-into-db.py` to put the data into the database
- run `get-documentation-page-data.py` to parse HTML page data, including tables
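A minimal sketch of the multithreaded crawl-and-retry idea described above, using `concurrent.futures`. The worker count, timeouts, and file names are assumptions for the example, not the exact logic of `get-multithreaded-link-data.py`:

```python
# Minimal sketch of multithreaded crawling with failure tracking.
# Worker count, timeouts, and file names are assumptions, not the
# exact behavior of get-multithreaded-link-data.py.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def crawl(urls, max_workers=8):
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                # Blocked or overloaded requests end up here; rerun them later.
                failed.append(url)
    return results, failed

if __name__ == "__main__":
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    results, failed = crawl(urls)
    with open("missing-urls.txt", "w") as f:
        f.write("\n".join(failed))
```

The failed URLs written to the output file can then be fed back into the script on a second pass, which is the manual-but-fast rerun step described above.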
Helper scripts:
- `pdf-links.py`: get all PDF links from the URLs
- `get-article.py`: navigate to an article within an HTML page
- `convert-jsonl-to-json.py`: converts newline-delimited JSON (JSONL) to JSON
- `find-missing-links.py`: verify that links from the .csv are in fact missing
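For reference, converting newline-delimited JSON (JSONL) to a single JSON array can be done along these lines; the file names are placeholders and not necessarily what `convert-jsonl-to-json.py` uses:

```python
# Minimal sketch: convert JSONL (one JSON object per line) to a JSON array.
# Input/output file names are placeholders.
import json

def jsonl_to_json(jsonl_path, json_path):
    records = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    with open(json_path, "w") as f:
        json.dump(records, f, indent=2)

if __name__ == "__main__":
    jsonl_to_json("data.jsonl", "data.json")
```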