Replies: 1 comment
Thanks! That's a great idea! I think the architecture of scrapy-redis is exactly the same. If you want something similar, you can just try it on Crawlab, which would simply mean running a couple of scrapy-redis spiders on different nodes and feeding the URLs into Redis. The current version of Crawlab is mainly focused on coordinating tasks rather than implementing distributed web crawlers. But I do believe some things could be leveraged in terms of usability, for example, visualizing the feeding of URLs to scrapy-redis on the web UI. There could also be spider frameworks specifically designed for Crawlab. Please let me know your thoughts and we may come up with something smart.
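If it helps, here is a minimal sketch of the scrapy-redis side. The spider name, Redis key, host, and URLs are just placeholders; the only real requirement is that every node points at the same Redis instance.

```python
# feed_spider.py -- deploy to every worker node; all instances share one Redis queue
from scrapy_redis.spiders import RedisSpider

class FeedSpider(RedisSpider):
    name = "feed_spider"
    # The spider idles and pops start URLs from this Redis list as they arrive.
    redis_key = "feed_spider:start_urls"

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

# settings.py -- share the scheduler queue and dupefilter across nodes
# SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# REDIS_URL = "redis://your-redis-host:6379"
```

Feeding URLs is then just a matter of pushing them onto that list from anywhere, e.g. the master node:

```python
# feeder.py -- push URLs into the shared queue
import redis

r = redis.Redis.from_url("redis://your-redis-host:6379")
r.lpush("feed_spider:start_urls",
        "https://example.com/page/1",
        "https://example.com/page/2")
```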
Hi,
So basically, I want to have some kind of URL feeder (it might be part of the master node) that distributes a list of URLs among the worker nodes/spiders. It may add/remove URLs from its list and create new tasks to have them crawled again. I read through the repo and the documentation but couldn't find a way to make this work.
As far as I understand, the master node passes CRAWLAB_TASK_ID to the worker nodes along with the command, but no URL information. The spiders have the URL information in their definitions and do not get it from the master node.
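To illustrate, this is roughly the setup I mean (the spider, the URLs, and the assumption that CRAWLAB_TASK_ID arrives as an environment variable are just my reading of it): the URLs live in the spider definition, and the only per-task input from the master is the task id.

```python
import os
import scrapy

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    # URLs are baked into the spider definition, not supplied by the master node.
    start_urls = [
        "https://example.com/category/1",
        "https://example.com/category/2",
    ]

    def parse(self, response):
        # As far as I can tell, the task id is the only task-specific value passed in.
        task_id = os.environ.get("CRAWLAB_TASK_ID")
        yield {"task_id": task_id, "url": response.url}
```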
Am I missing something here, or is it not possible to do this with Crawlab?
Thanks