A Django app that stores scraped website data, intended to serve as a source for later imports.
It's a work in progress and not ready for use in a production environment.
Many parts of this project are based on previous work I have done. See the credits section below.
It's highly likely that this project will change significantly over time 💥
## Getting started

You'll need a running WordPress site to scrape data from. I used a local WordPress install with the default theme and sample content.
- Clone this repo
- Create a virtualenv and install the requirements: `poetry install`, then `poetry shell`
- Create the database tables and a superuser: `python manage.py migrate`, then `python manage.py createsuperuser`
- Obtain links to all the pages to scrape: run `scrapy crawl sitemap` from the `warehouse/sitemap/spiders` directory
- Collect the page content for each sitemap page: run `scrapy crawl pages` from the `warehouse/pages/spiders` directory
- "Build the blocks" from the scraped page content: run `python manage.py build_blocks` from the root directory of the project
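The `build_blocks` step isn't specified here, so as a rough illustration of the idea only: given scraped page HTML stored as text, a block builder might split the content into a flat list of typed blocks. This sketch uses only the standard library's `html.parser` (the project itself lists BeautifulSoup); the `Block` shape and the tag-to-type mapping are assumptions, not the project's actual schema.

```python
from html.parser import HTMLParser


class BlockBuilder(HTMLParser):
    """Collect a flat list of typed blocks from scraped page HTML.

    Hypothetical sketch: the real command parses with BeautifulSoup and
    its block model may differ.
    """

    # Assumed mapping of HTML tags to block types
    BLOCK_TAGS = {"h1": "heading", "h2": "heading", "p": "paragraph"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = None  # block type while inside a known tag

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._current = self.BLOCK_TAGS[tag]

    def handle_data(self, data):
        if self._current and data.strip():
            self.blocks.append({"type": self._current, "value": data.strip()})

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._current = None


def build_blocks(html: str) -> list[dict]:
    """Return the list of blocks found in a scraped HTML fragment."""
    parser = BlockBuilder()
    parser.feed(html)
    return parser.blocks
```

For example, `build_blocks("<h2>About</h2><p>Hello world</p>")` yields a heading block followed by a paragraph block; content in unmapped tags is ignored.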
## Roadmap

- Add tests
- Refine the Django admin interface
- Add a JSON API so a Wagtail site can access the data for import
- ...and more
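The planned JSON API isn't designed yet; purely as a sketch of the idea, a serializer could expose scraped pages and their blocks as one JSON document for a Wagtail importer to fetch. The field names and payload shape below are assumptions, not the project's actual schema.

```python
import json


def pages_payload(pages: list[dict]) -> str:
    """Serialize scraped pages into a JSON document an importer could fetch.

    Hypothetical shape: the page and block fields here are assumptions.
    """
    return json.dumps(
        {
            "pages": [
                {
                    "url": page["url"],
                    "title": page["title"],
                    "blocks": page.get("blocks", []),
                }
                for page in pages
            ]
        },
        indent=2,
    )
```

In a Django view this string could be returned directly (or the dict passed to `JsonResponse`), giving the importing Wagtail site a single endpoint to pull pages and blocks from.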
## Built with

- Poetry for dependency management
- Scrapy for scraping
- Django for the web app
- BeautifulSoup for parsing HTML
- Pre-commit for running checks as git hooks
- Black for code formatting
- Flake8 for code linting
- Isort for import sorting
## License

MIT
## Credits

- https://github.com/rkhleics/nhs-ei.scrapy-poc
- https://github.com/nickmoreton/wagtail-toolbox/tree/main/wagtail_toolbox/wordpress
- https://github.com/import-experiments/scrape-wordpress-html
- https://github.com/wagtail-packages/django-wordpress-import
- https://github.com/import-experiments/wordpress-docker