Basic setup with random user agents and proxy addresses for the Python Scrapy framework.
1. Install the Scrapy framework:
pip install Scrapy
2. Install Beautiful Soup 4 for the proxy middleware based on proxydocker lists:
pip install beautifulsoup4
3. Install Tor, stem and Privoxy:
apt-get install tor python-stem privoxy
Hash a password with Tor:
tor --hash-password secretPassword
Then copy the hashed password and add it, together with the control port, to /etc/tor/torrc:
ControlPort 9051
HashedControlPassword 16:72C8ADB0E34F8DA1606BB154886604F708236C0D0A54557A07B00CAB73
Restart Tor:
sudo /etc/init.d/tor restart
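With the control port configured, you can also ask Tor for a fresh circuit (and thus a new exit IP) programmatically through stem, which was installed above. A minimal sketch, assuming the default control port 9051 and the example password from the previous step:

```python
def renew_tor_identity(password="secretPassword", port=9051):
    """Ask the local Tor daemon for a new identity (a fresh circuit)."""
    # Imported lazily so the function can be defined even without Tor running.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=port) as controller:
        # The plain-text password must match the HashedControlPassword in torrc.
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)  # request a new circuit
```

Note that Tor rate-limits NEWNYM signals, so calling this in a tight loop will not produce a new exit IP on every call.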
Enable Privoxy forwarding by adding the following line to /etc/privoxy/config:
forward-socks5 / localhost:9050 .
Restart Privoxy:
sudo /etc/init.d/privoxy restart
Both Tor and Privoxy should now be up and running (check with netstat -l). If you used a different password or control port, update settings.py accordingly.
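As an alternative to netstat, a small standard-library helper can confirm that the expected ports are accepting connections. The port numbers below are the defaults (9050 for Tor SOCKS, 9051 for the Tor control port, 8118 for Privoxy); adjust them if you changed your configuration:

```python
import socket

def is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Default ports for this setup; change if your config differs.
    for name, port in [("Tor SOCKS", 9050), ("Tor control", 9051), ("Privoxy", 8118)]:
        print(f"{name} (port {port}): {'up' if is_listening(port) else 'down'}")
```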
If you get errors related to pyOpenSSL (check this issue), try downgrading the Twisted engine:
pip install Twisted==16.4.1
To see what it does, just run:
python run.py
The project contains three middleware classes in the middlewares directory. ProxyMiddleware downloads a list of proxy IP addresses and picks one at random before each request is processed. TorMiddleware serves a similar purpose, but routes requests through the Tor network. RandomUserAgentMiddleware downloads user agent strings, saves them into the 'USER_AGENT_LIST' setting, and likewise picks one at random before each request is processed. The middlewares are activated in the settings.py file.
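The random-selection pattern these middlewares follow can be sketched as below. This is an illustrative sketch, not the project's exact code; the class names and the dummy request shape are assumptions:

```python
import random

class RandomUserAgentMiddleware:
    """Set a randomly chosen User-Agent on every outgoing request (sketch)."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy instantiates downloader middlewares via from_crawler;
        # USER_AGENT_LIST is the settings entry mentioned above.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)

class ProxyMiddleware:
    """Route every request through a randomly chosen proxy (sketch)."""

    def __init__(self, proxies):
        self.proxies = proxies

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware honours the 'proxy' key in request.meta.
        request.meta["proxy"] = random.choice(self.proxies)
```

Middlewares like these are enabled by listing their import paths with a priority number under DOWNLOADER_MIDDLEWARES in settings.py.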
This project also contains two spiders just for testing purposes, spiders/iptester.py and spiders/uatester.py. You can run them individually:
scrapy crawl UAtester
scrapy crawl IPtester
The run.py file is also a good example of how to include and run your spiders sequentially from a single script.
If you have any questions or problems, feel free to create a new issue. Scrape responsibly!