Scrapy Do

Scrapy Do is a daemon that provides a convenient way to run Scrapy spiders. It can run them once, immediately, or periodically at specified time intervals. It was inspired by scrapyd but written from scratch. It comes with a REST API, a command-line client, and an interactive web interface.

Quick Start

  • Install scrapy-do using pip:

    $ pip install scrapy-do
  • Start the daemon in the foreground:

    $ scrapy-do -n scrapy-do
  • Open another terminal window, download Scrapy's Quotesbot example, and push the code to the server:

    $ git clone https://github.com/scrapy/quotesbot.git
    $ cd quotesbot
    $ scrapy-do-cl push-project
    +----------------+
    | quotesbot      |
    |----------------|
    | toscrape-css   |
    | toscrape-xpath |
    +----------------+
  • Schedule some jobs:

    $ scrapy-do-cl schedule-job --project quotesbot \
        --spider toscrape-css --when 'every 5 to 15 minutes'
    +--------------------------------------+
    | identifier                           |
    |--------------------------------------|
    | 0a3db618-d8e1-48dc-a557-4e8d705d599c |
    +--------------------------------------+
    
    $ scrapy-do-cl schedule-job --project quotesbot --spider toscrape-css
    +--------------------------------------+
    | identifier                           |
    |--------------------------------------|
    | b3a61347-92ef-4095-bb68-0702270a52b8 |
    +--------------------------------------+
  • See what's going on:

    [Screenshot: Active Jobs]

    The web interface is available at http://localhost:7654 by default.
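The `'every 5 to 15 minutes'` spec used above describes a range rather than a fixed period. The sketch below illustrates one plausible reading of such a range, assuming it means that each run is followed by a delay drawn at random from the range (the behaviour of `every(5).to(15)` in the Python `schedule` library). The function `next_delay_seconds` is hypothetical and is not part of Scrapy Do.

```python
import random


def next_delay_seconds(low_minutes, high_minutes, rng=random):
    """Pick the delay before the next run, in seconds, for a range spec.

    Hypothetical helper for illustration only; it is not Scrapy Do's code.
    """
    if low_minutes > high_minutes:
        raise ValueError("lower bound exceeds upper bound")
    # randint is inclusive on both ends, so 5 and 15 are both possible.
    return rng.randint(low_minutes, high_minutes) * 60


# A seeded generator keeps the example reproducible between runs.
rng = random.Random(42)
delays = [next_delay_seconds(5, 15, rng) for _ in range(3)]
print(delays)
```

Every printed delay falls between 300 and 900 seconds, so consecutive runs of the spider would be spaced unevenly within the requested range.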

Building from source

Both steps below require Node.js to be installed.

  • Check that everything works:

    $ pip install -r requirements-dev.txt
    $ tox
  • Build the wheel:

    $ python setup.py bdist_wheel

ChangeLog

Version 0.5.0

  • Rewrite the log handling functionality to resolve duplication issues
  • Bump the JavaScript dependencies to resolve browser caching issues
  • Make the error message on failed spider listing more descriptive (Bug #28)
  • Make sure that the spider descriptions and payloads get handled properly on restart (Bug #24)
  • Clarify the documentation on passing arguments to spiders (Bugs #23 and #27)

Version 0.4.0

  • Migration to the Bootstrap 4 UI
  • Make it possible to add a short description to jobs
  • Make it possible to specify a user-defined payload for each job, which is passed on as a parameter to the Python crawler
  • UI updates to support the above
  • New log viewers in the web UI