autoproxy

About

This is a rewrite of my public proxy farm. It uses Redis to record and store reliability statistics for publicly available proxy servers.

Once enough data has been recorded, it builds a database of proxy servers and can choose the most reliable proxy for crawling a given website.

Prerequisites

  • docker
  • docker-compose

Overview

When web crawling, proxies are essential for maintaining anonymity and circumventing bot detection. There are many free public proxy servers spread across the Internet; however, their performance is inconsistent. This project uses a Redis store to temporarily cache proxy server information for use by a Scrapy middleware. The cache is then periodically synced to a Postgres database, which serves as a more permanent and practical storage medium for proxy statistics.
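
To give a sense of how a Redis cache can plug into Scrapy, here is a minimal sketch of a downloader middleware that reads a cached proxy address from Redis and attaches it to each outgoing request. The key name (proxy:best) and connection settings are placeholders for illustration, not the ones this project actually uses.

# Illustrative sketch only: a Scrapy downloader middleware that pulls a proxy
# address from Redis and routes each request through it. The Redis key
# "proxy:best" is hypothetical.
import redis

class RedisProxyMiddleware:
    def __init__(self, redis_host="localhost", redis_port=6379):
        self.client = redis.Redis(host=redis_host, port=redis_port, decode_responses=True)

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection details from the Scrapy settings, with defaults.
        return cls(
            redis_host=crawler.settings.get("REDIS_HOST", "localhost"),
            redis_port=crawler.settings.getint("REDIS_PORT", 6379),
        )

    def process_request(self, request, spider):
        # Look up a cached proxy (e.g. "http://1.2.3.4:8080") and let Scrapy's
        # built-in HttpProxyMiddleware route the request through it.
        proxy = self.client.get("proxy:best")
        if proxy:
            request.meta["proxy"] = proxy

A middleware like this would be enabled through the DOWNLOADER_MIDDLEWARES setting in the Scrapy project.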

Example Usage

I'm still working on this, but here's how to run it:

git clone https://github.com/dchrostowski/autoproxy.git
cd autoproxy
docker-compose build scrapyd
docker-compose build spider_scheduler
docker-compose up scrapyd spider_scheduler
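
Once the containers are up, you can check that scrapyd is responding through its standard JSON API. This sketch assumes scrapyd's default port 6800 is published to the host and that the scrapyd project is named autoproxy; adjust either if your compose setup differs.

# Sanity-check the scrapyd service via its JSON API (default port 6800 assumed).
import requests

status = requests.get("http://localhost:6800/daemonstatus.json", timeout=5)
print(status.json())  # e.g. {"status": "ok", "running": 0, "pending": 0, ...}

# List jobs for a project (the project name "autoproxy" is an assumption here).
jobs = requests.get("http://localhost:6800/listjobs.json", params={"project": "autoproxy"}, timeout=5)
print(jobs.json())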

Getting proxies

Several spiders (see autoproxy/autoproxy/spiders) are scheduled to crawl a handful of sites, continually pulling in new proxies and then testing those proxies against the sites they have scraped.
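
For illustration only (this is not one of the project's actual spiders), the general pattern looks roughly like this: scrape a proxy-listing page and yield ip/port pairs to be tested later. The URL and CSS selectors below are placeholders.

# Minimal sketch of a proxy-scraping spider. URL and selectors are placeholders.
import scrapy

class ProxyListSpider(scrapy.Spider):
    name = "proxy_list_example"
    start_urls = ["https://example.com/proxy-list"]  # placeholder URL

    def parse(self, response):
        # Assume each table row holds an IP address and a port in its first two cells.
        for row in response.css("table tr"):
            ip = row.css("td:nth-child(1)::text").get()
            port = row.css("td:nth-child(2)::text").get()
            if ip and port:
                yield {"address": ip.strip(), "port": port.strip()}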

To access the Postgres database, you can run the following:

docker exec -it autoproxy_db psql -U postgres proxies

The default password is somepassword.
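
If you prefer to query the database from Python rather than psql, a sketch along these lines should work. The connection values mirror the psql command above; the host/port and the table and column names in the query are assumptions, so check the actual schema (e.g. with \dt inside psql) first.

# Query the proxy database from Python. Table/column names below are guesses
# for illustration; the host assumes the Postgres port is published locally.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    dbname="proxies",
    user="postgres",
    password="somepassword",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM proxies LIMIT 10;")  # hypothetical table name
    for row in cur.fetchall():
        print(row)
conn.close()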

Future plans

I eventually plan to publish the contents of autoproxy_package/ as a standalone module/package.
