**arackpy**
===========
**arackpy** is a simple but powerful web crawler and scraper. Although it is
good-natured and respectful at heart, it can be used to do evil. Remember: with
great power comes great responsibility.
Requirements
------------
**arackpy** currently supports Python 2.7 to 3.6+ out of the box. Depending on
how you want to extract data, one or more of the following dependencies must
be installed to support the various backends:
* lxml - for html parsing and url extraction
* requests - for downloading html pages
* pysocks - for making tor based connections
* fake_useragent - for browser spoofing
* selenium (coming soon!)
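
Because these backends are optional, it can help to check which of them are
importable before picking one. The sketch below is a standalone helper, not
part of **arackpy**; the names OPTIONAL_MODULES and
check_optional_dependencies are hypothetical. It assumes the usual import
names for the packages above (note that pysocks is imported as "socks").

```python
from __future__ import print_function  # python 2 support

import importlib

# Package name -> import name for the optional dependencies listed above.
# Assumption: pysocks is imported as "socks"; the others import under
# their own names.
OPTIONAL_MODULES = {
    "lxml": "lxml",
    "requests": "requests",
    "pysocks": "socks",
    "fake_useragent": "fake_useragent",
}


def check_optional_dependencies(modules=OPTIONAL_MODULES):
    """Map each package name to True if it can be imported, else False."""
    status = {}
    for package, module_name in modules.items():
        try:
            importlib.import_module(module_name)
            status[package] = True
        except ImportError:
            status[package] = False
    return status


if __name__ == "__main__":
    for package, available in sorted(check_optional_dependencies().items()):
        print("%-15s %s" % (package, "installed" if available else "missing"))
```

Any package reported as missing can be added with pip as described under
Installation.
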
Installation
------------
For a vanilla **arackpy** install with no other dependencies:
    pip install arackpy
For proxy and tor support:
    pip install lxml requests fake_useragent pysocks
Quickstart
----------
Open up your favorite python text editor and type the following:
    # hello_spider.py
    from __future__ import print_function  # python 2 support

    from arackpy.spider import Spider


    class HelloSpider(Spider):
        """A simple spider in just ten lines of working code"""

        start_urls = ["https://www.python.org"]

        def parse(self, url, html):
            """Extract data from the raw html"""
            print("Crawling url, %s" % url)


    if __name__ == "__main__":
        print("Press Ctrl-c to stop crawling")
        spider = HelloSpider()
        spider.crawl()
Run the program using:
    $ python hello_spider.py
Note
Press Ctrl-c to terminate crawling. To use proxies or tor, change the backend
accordingly.
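
As a rough illustration of what a parse method might do with its html
argument, the sketch below collects the absolute link targets on a page using
only the standard library. It is a standalone example that assumes nothing
about **arackpy** beyond the parse(self, url, html) signature shown in the
Quickstart; LinkCollector and extract_links are hypothetical names, not part
of the library.

```python
from __future__ import print_function  # python 2 support

try:
    from html.parser import HTMLParser  # python 3
    from urllib.parse import urljoin
except ImportError:
    from HTMLParser import HTMLParser  # python 2
    from urlparse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> tags."""

    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page url
                self.links.append(urljoin(self.base_url, value))


def extract_links(url, html):
    """Return the absolute urls linked from the given html string."""
    collector = LinkCollector(url)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    sample = '<a href="/about/">About</a> <a href="https://docs.python.org">Docs</a>'
    for link in extract_links("https://www.python.org", sample):
        print(link)
```

Inside a spider, the body of parse could call extract_links(url, html) and
print or store the result. With lxml installed, the same job could be done
with its parser instead.
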
Documentation
-------------
To learn more, go read the docs at https://arackpy.readthedocs.org. The
**arackpy** logo was taken from https://clipart-library.com.