Hello Spider 🕷

A distributed web crawler written in Go.

Crawled data can be searched with Hello Search.

Introduction

This project was started as a way to learn Go and Elasticsearch.

How it works

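1. Enqueue the seed URLs.
2. Take the next URL to crawl from the queue.
3. Check whether the URL is valid (including whether it has already been crawled).
4. Fetch the page and record the URL as crawled.
5. Store a summary of the page and enqueue the hyperlinks (URLs) found on it.
6. Go back to step 2.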
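The loop above can be sketched in a single process as follows. This is only an illustration under simplifying assumptions: the queue is an in-memory slice rather than a message queue, the "already crawled" check is a plain map, and `fetchLinks` is a hypothetical stand-in that reads links from a fixed in-memory graph instead of downloading pages.

```go
package main

import "fmt"

// fetchLinks is a hypothetical stand-in for downloading a page and
// extracting its hyperlinks; here it just looks the URL up in a fixed graph.
func fetchLinks(graph map[string][]string, url string) []string {
	return graph[url]
}

// crawl runs the enqueue/dequeue loop and returns URLs in the order visited.
func crawl(graph map[string][]string, seeds []string) []string {
	queue := append([]string{}, seeds...) // step 1: enqueue the seeds
	seen := map[string]bool{}
	var visited []string
	for len(queue) > 0 {
		url := queue[0] // step 2: take the next URL
		queue = queue[1:]
		if url == "" || seen[url] { // step 3: skip invalid or already crawled URLs
			continue
		}
		seen[url] = true // step 4: fetch the page and mark the URL as crawled
		visited = append(visited, url)
		queue = append(queue, fetchLinks(graph, url)...) // step 5: enqueue its links
	} // step 6: loop back to step 2
	return visited
}

func main() {
	graph := map[string][]string{
		"https://a.example/":  {"https://a.example/x", "https://b.example/"},
		"https://a.example/x": {"https://a.example/"},
		"https://b.example/":  {},
	}
	fmt.Println(crawl(graph, []string{"https://a.example/"}))
}
```

In a distributed setup the slice and the map would be replaced by shared services, so that several workers can run the same loop against one queue and one filter.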

Implementation overview

RabbitMQ message queues provide a distributed URL queue with persistence and priorities.
A RedisBloom Bloom filter deduplicates URLs across workers.
goquery parses the HTML.
Elasticsearch stores the data.
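To make the deduplication idea concrete, here is a minimal in-memory Bloom filter sketch. It is not the project's code and does not talk to RedisBloom; the type `bloom`, the method `addIfAbsent`, and the sizes used are all assumptions chosen for illustration. A Bloom filter can report a URL as "seen" when it was not (a false positive), but never the reverse, which is an acceptable trade-off for a crawler.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a fixed-size Bloom filter that probes k bit positions per item.
type bloom struct {
	bits []bool
	k    uint64
}

func newBloom(m int, k uint64) *bloom {
	return &bloom{bits: make([]bool, m), k: k}
}

// positions derives k bit indexes from two FNV hashes (double hashing).
func (b *bloom) positions(s string) []uint64 {
	h1 := fnv.New64a()
	h1.Write([]byte(s))
	h2 := fnv.New64()
	h2.Write([]byte(s))
	a, c := h1.Sum64(), h2.Sum64()|1 // force an odd step so probes differ
	out := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		out[i] = (a + i*c) % uint64(len(b.bits))
	}
	return out
}

// addIfAbsent records s and reports true if s was definitely not seen before.
// A false result means "probably seen"; false positives are possible.
func (b *bloom) addIfAbsent(s string) bool {
	isNew := false
	for _, p := range b.positions(s) {
		if !b.bits[p] {
			isNew = true
			b.bits[p] = true
		}
	}
	return isNew
}

func main() {
	f := newBloom(1024, 3)
	fmt.Println(f.addIfAbsent("https://a.example/")) // first insert: true
	fmt.Println(f.addIfAbsent("https://a.example/")) // repeat: false
}
```

Keeping the filter in a shared service rather than in each process is what lets several crawler instances agree on which URLs have already been handled.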

In addition, Kibana can be used for real-time visual analysis of the crawl status.

Kibana visualization

Usage

Command-line arguments

$ go run . -h
Usage of hellospider:
  -config string
        File path of configuration. (default "config.json")
  -namespace string
        Namespace of task.
  -priority string
        Priority policy: 0-9 means that the priority is constant, url-len means that the priority is calculated according to the length of the URL (the shorter the priority),
path-len means that the priority is calculated according to the length of the URL path (the shorter the priority).
  -reset
        Reset queue, storage and filter before begin task.
  -seed string
        The seeds URL is comma-separated. Such as: 'https://a.com/, https://b.com/'. And the seeds in the configuration file will be ignored.

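-config: path to the configuration file.
-namespace: namespace, used to tell different tasks apart.
-priority: priority policy. 0-9 sets a constant priority, url-len derives the priority from the URL length (shorter URLs first), and path-len derives it from the length of the URL path (shorter paths first).
-reset: reset the namespace (queue, storage, and filter) before the task starts.
-seed: seed URLs, comma-separated, that replace the seeds in the configuration file.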
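The three priority policies can be illustrated with a small function. This is a sketch, not the project's implementation: the names `priorityFor` and `fromLength` are hypothetical, and the length-to-priority formula (start at 9 and lose one point per 10 characters) is an assumption, since the actual formula is not described here. It also assumes that a larger number means a higher priority.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// fromLength turns a length into a priority in 0..9, shorter being higher.
// The divisor 10 is an arbitrary choice made for this sketch.
func fromLength(n int) int {
	p := 9 - n/10
	if p < 0 {
		return 0
	}
	return p
}

// priorityFor maps a policy ("0".."9", "url-len", "path-len") and a URL
// to a priority in 0..9.
func priorityFor(policy, rawURL string) (int, error) {
	switch policy {
	case "url-len":
		return fromLength(len(rawURL)), nil
	case "path-len":
		u, err := url.Parse(rawURL)
		if err != nil {
			return 0, err
		}
		return fromLength(len(u.Path)), nil
	default:
		n, err := strconv.Atoi(policy)
		if err != nil || n < 0 || n > 9 {
			return 0, fmt.Errorf("unknown priority policy %q", policy)
		}
		return n, nil
	}
}

func main() {
	for _, policy := range []string{"5", "url-len", "path-len"} {
		p, err := priorityFor(policy, "https://a.com/docs/guide/intro")
		fmt.Println(policy, p, err)
	}
}
```

With path-len, a deep page on a short domain and the same page on a long domain get the same priority, whereas url-len also penalizes long host names.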

Configuration file

Edit config.json to set the address, port, username, and password of each service.

{
  "namespace": "default",
  "workers": 8,
  "priority": "path-len",
  "seeds": [
    "https://bing.com/"
  ],
  "redis": {
    "host": "localhost:6379",
    "auth": null
  },
  "rabbitMq": {
    "url": "amqp://guest:guest@localhost:5672/",
    "exchange": "spider",
    "maxLength": 999999999
  },
  "elasticsearch": {
    "address": [
      "http://localhost:9200"
    ],
    "username": "elastic",
    "password": "123456"
  },
  "accepts": [
    "text/html",
    "text/plain"
  ],
  "rules": {
    "allows": [
      ".*"
    ],
    "forbid": null
  },
  "userAgents": [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "..."
  ],
  "responseHeaders": [
    "Content-Type",
    "Content-Length",
    "Content-Language",
    "Server",
    "X-Powered-By"
  ]
}
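For readers who want to load a file of this shape in their own tooling, the following sketch decodes a subset of the fields with `encoding/json`. The `Config` type and `parseConfig` function are hypothetical and are not taken from the project's config.go; the Go field types are assumptions inferred from the sample values (for example, `auth` is modelled as an optional string because the sample sets it to null).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config mirrors part of the sample config.json. Field types are guesses
// based on the sample values, not the project's real definitions.
type Config struct {
	Namespace string   `json:"namespace"`
	Workers   int      `json:"workers"`
	Priority  string   `json:"priority"`
	Seeds     []string `json:"seeds"`
	Redis     struct {
		Host string  `json:"host"`
		Auth *string `json:"auth"` // nil when the JSON value is null
	} `json:"redis"`
	RabbitMQ struct {
		URL       string `json:"url"`
		Exchange  string `json:"exchange"`
		MaxLength int64  `json:"maxLength"`
	} `json:"rabbitMq"`
	Rules struct {
		Allows []string `json:"allows"`
		Forbid []string `json:"forbid"`
	} `json:"rules"`
}

// parseConfig decodes a JSON document into a Config.
func parseConfig(data string) (Config, error) {
	var c Config
	err := json.Unmarshal([]byte(data), &c)
	return c, err
}

func main() {
	sample := `{
	  "namespace": "default",
	  "workers": 8,
	  "priority": "path-len",
	  "seeds": ["https://bing.com/"],
	  "redis": {"host": "localhost:6379", "auth": null},
	  "rabbitMq": {"url": "amqp://guest:guest@localhost:5672/", "exchange": "spider", "maxLength": 999999999},
	  "rules": {"allows": [".*"], "forbid": null}
	}`
	c, err := parseConfig(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(c.Namespace, c.Workers, c.Priority, c.Seeds, c.Redis.Auth == nil)
}
```

Keys that the struct does not declare, such as `userAgents` or `responseHeaders`, are silently ignored by `json.Unmarshal`, so the sketch also accepts the full sample file.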

Running

From source

go run .

From a binary executable

./hellospider

Optimization plan

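Filter out invalid URLs before enqueueing, to keep the message queue from bloating. √
Fetch pages concurrently with goroutines. √
Message queue QoS. √
HTTP connection reuse. ×
Maintain a local Bloom filter to avoid querying RedisBloom too often. ×
On HTTP redirects, also add the intermediate URLs of the redirect chain to the filter. ×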
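As an illustration of the HTTP connection reuse item in the list above, one common approach in Go is to share a single `http.Client` across all requests and to read each response body to the end before closing it. The sketch below is an assumption about how that could look, not the project's code; `sharedClient`, `fetch`, `countNewConns`, and the pool sizes are hypothetical. The helper uses a local test server to count how many TCP connections sequential requests open.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// sharedClient is created once; its Transport keeps idle connections pooled.
var sharedClient = &http.Client{
	Timeout: 10 * time.Second,
	Transport: &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 8,
		IdleConnTimeout:     90 * time.Second,
	},
}

// fetch downloads url and returns the body size. Draining and closing the
// body lets the connection go back to the idle pool for reuse.
func fetch(url string) (int64, error) {
	resp, err := sharedClient.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return io.Copy(io.Discard, resp.Body)
}

// countNewConns sends n sequential requests to a local test server and
// returns how many TCP connections the server accepted.
func countNewConns(n int) (int64, error) {
	var opened int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { fmt.Fprint(w, "hello") }))
	srv.Config.ConnState = func(_ net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&opened, 1)
		}
	}
	srv.Start()
	defer srv.Close()
	for i := 0; i < n; i++ {
		if _, err := fetch(srv.URL); err != nil {
			return 0, err
		}
	}
	return atomic.LoadInt64(&opened), nil
}

func main() {
	conns, err := countNewConns(3)
	fmt.Println("connections opened:", conns, "error:", err)
}
```

Creating a new client or transport per request, or closing a body without reading it fully, would typically force a fresh connection each time.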
