Replies: 1 comment
Thanks! That's a great idea! I think the architecture of scrapy-redis is exactly the same. If you want something similar, you can just try it on Crawlab, which would simply mean running a couple of scrapy-redis spiders on different nodes and feeding the URLs into Redis. The current version of Crawlab is mainly focused on coordinating tasks rather than implementing distributed web crawlers. But I do believe some things could be leveraged in terms of usability, for example, visualizing the feeding of URLs to scrapy-redis on the web UI. There could also be spider frameworks specifically designed for Crawlab. Please let me know your thoughts and we may come up with something smart.
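If it helps, here is a minimal sketch of the scrapy-redis side. The spider name, Redis key, host, and URLs are just placeholders; the only real requirement is that every node points at the same Redis instance.

```python
# feed_spider.py -- deploy to every worker node; all instances share one Redis queue
from scrapy_redis.spiders import RedisSpider

class FeedSpider(RedisSpider):
    name = "feed_spider"
    # The spider idles and pops start URLs from this Redis list as they arrive.
    redis_key = "feed_spider:start_urls"

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

# settings.py -- share the scheduler queue and dupefilter across nodes
# SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# REDIS_URL = "redis://your-redis-host:6379"
```

Feeding URLs is then just a matter of pushing them onto that list from anywhere, e.g. the master node:

```python
# feeder.py -- push URLs into the shared queue
import redis

r = redis.Redis.from_url("redis://your-redis-host:6379")
r.lpush("feed_spider:start_urls",
        "https://example.com/page/1",
        "https://example.com/page/2")
```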
Hi,
So basically, I want to have some kind of URL feeder (it might be part of the master node) that distributes a list of URLs among the worker nodes/spiders. It may add/remove URLs from its list and create new tasks to have them crawled again. I read through the repo and the documentation but couldn't find a way to make this work.
As far as I understand, the master node passes CRAWLAB_TASK_ID to the worker nodes along with the command, but no URL information. The spiders have the URL information in their definitions and do not get it from the master node.
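To illustrate, this is roughly the setup I mean (the spider, the URLs, and the assumption that CRAWLAB_TASK_ID arrives as an environment variable are just my reading of it): the URLs live in the spider definition, and the only per-task input from the master is the task id.

```python
import os
import scrapy

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    # URLs are baked into the spider definition, not supplied by the master node.
    start_urls = [
        "https://example.com/category/1",
        "https://example.com/category/2",
    ]

    def parse(self, response):
        # As far as I can tell, the task id is the only task-specific value passed in.
        task_id = os.environ.get("CRAWLAB_TASK_ID")
        yield {"task_id": task_id, "url": response.url}
```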
Am I missing something here, or is it not possible to do this with Crawlab?
Thanks