---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# parsel
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/parsel)](https://CRAN.R-project.org/package=parsel)
[![License: MIT](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/license/mit/)
![](https://cranlogs.r-pkg.org/badges/grand-total/parsel?color)
<!-- badges: end -->
`parsel` is a framework for parallelized dynamic web scraping with `RSelenium`. Leveraging parallel processing, it allows you to run any `RSelenium` scraping routine on multiple browser instances simultaneously, greatly increasing scraping throughput. `parsel` uses chunked input processing along with error catching and logging to keep your scraping routine running smoothly and to minimize data loss, even when unforeseen `RSelenium` errors occur.
`parsel` additionally provides convenient wrapper functions around `RSelenium` methods that allow you to quickly generate safe scraping code with minimal coding on your end.
## Installation
``` r
# Install parsel from CRAN
install.packages("parsel")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("till-tietz/parsel")
```
## Usage
### Parallel Scraping
The following example illustrates the functionality of `parsel` and the ideas behind how it operates.
We'll set up the following scraping job:
1. navigate to a random Wikipedia article
2. retrieve its title
3. navigate to the first linked page on the article
4. retrieve the linked page's title and first section
and parallelize it with `parsel`.
`parsel` requires two things:
1. a scraping function defining the actions to be executed in each `RSelenium` instance. These actions should be written in conventional `RSelenium` syntax, with `remDr$` specifying the remote driver.
2. some input `x` to those actions (e.g. search terms to be entered in search boxes, links to navigate to, etc.)
```{r, eval = FALSE}
library(RSelenium)
library(parsel)
#let's define our scraping function input
#we want to run our function 4 times and we want it to start on the wikipedia main page each time
input <- rep("https://de.wikipedia.org",4)
#let's define our scraping function
get_wiki_text <- function(x){
  input_i <- x

  #navigate to input page (i.e. Wikipedia)
  remDr$navigate(input_i)

  #find and click random article
  rand_art <- remDr$findElement(using = "id", "n-randompage")$clickElement()

  #get random article title
  title <- remDr$findElement(using = "id", "firstHeading")$getElementText()[[1]]

  #check if there is a linked page
  link_exists <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]"))

  #if there is no linked page fill output with NA
  if(is(link_exists,"try-error")){
    first_link_title <- NA
    first_link_text <- NA

  #if there is a linked page
  } else {
    #click on link
    link <- remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]")$clickElement()

    #get linked page title
    first_link_title <- try(remDr$findElement(using = "id", "firstHeading"))
    if(is(first_link_title,"try-error")){
      first_link_title <- NA
    }else{
      first_link_title <- first_link_title$getElementText()[[1]]
    }

    #get first section of linked page
    first_link_text <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]"))
    if(is(first_link_text,"try-error")){
      first_link_text <- NA
    }else{
      first_link_text <- first_link_text$getElementText()[[1]]
    }
  }

  out <- data.frame("random_article" = title,
                    "first_link_title" = first_link_title,
                    "first_link_text" = first_link_text)
  return(out)
}
```
Now that we have our scraping function and input, we can parallelize the execution of the function.
For speed and efficiency, it is advisable to specify the headless browser option in the `extraCapabilities` argument.
`parscrape` shows a progress bar as well as the elapsed and estimated remaining time, so you can keep track of scraping progress.
```{r, results = 'hide', warning = FALSE, eval = FALSE}
wiki_text <- parsel::parscrape(scrape_fun = get_wiki_text,
                               scrape_input = input,
                               cores = 2,
                               packages = c("RSelenium","XML"),
                               browser = "firefox",
                               scrape_tries = 1,
                               extraCapabilities = list(
                                 "moz:firefoxOptions" = list(args = list('--headless'))
                               ))
```
`parscrape` returns a list with two elements:
1. a list of your scrape function's outputs
2. a data.frame of the inputs it was unable to scrape and the associated error messages
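For instance, you can stack the successfully scraped results into a single data.frame. The sketch below accesses the two elements of the return value by position, since only their order is described here; see `?parscrape` for the exact element names.
```{r, eval = FALSE}
#first element: one get_wiki_text() data.frame per successfully scraped input
#second element: inputs that could not be scraped and the associated errors
scraped <- wiki_text[[1]]
failed <- wiki_text[[2]]

#combine the per-input data.frames into one table
all_articles <- do.call(rbind, scraped)
head(all_articles)
```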
### RSelenium Constructors
`parsel` allows you to generate safe scraping code with minimal hassle by composing `constructor` functions, which act as wrappers around `RSelenium` methods, in a pipe. You can return a scraper function defined by these `constructors` to the environment by starting your pipe with `start_scraper()` and ending it with `build_scraper()`. Alternatively, you can dump the code generated by your `constructor` pipe to the console via `show()`.
We'll reproduce a slightly stripped-down version of the `RSelenium` code from the Wikipedia scraping routine above using the `parsel` `constructor` functions.
```{r, warning = FALSE, message = FALSE}
library(parsel)
# returning a scraper function
start_scraper(args = "x", name = "get_wiki_text") %>>%
go(url = "x") %>>%
click(using = "id", value = "'n-randompage'", name = "rand_art") %>>%
get_element(using = "id", value = "'firstHeading'", name = "title") %>>%
click(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]'", name = "link") %>>%
get_element(using = "id", value = "'firstHeading'", name = "first_link_title") %>>%
get_element(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'", name = "first_link_text") %>>%
build_scraper()
ls()
# dumping generated code to console
go(url = "x") %>>%
click(using = "id", value = "'n-randompage'", name = "rand_art") %>>%
get_element(using = "id", value = "'firstHeading'", name = "title") %>>%
click(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]'", name = "link") %>>%
get_element(using = "id", value = "'firstHeading'", name = "first_link_title") %>>%
get_element(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'", name = "first_link_text") %>>%
show()
```
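Since `build_scraper()` returns the scraper function (here `get_wiki_text`) to the environment, you can hand the generated function to `parscrape` just like the handwritten routine above. A minimal sketch, assuming the generated scraper is a drop-in replacement:
```{r, eval = FALSE}
#run the constructor-generated scraper in parallel, as before
wiki_text <- parsel::parscrape(scrape_fun = get_wiki_text,
                               scrape_input = input,
                               cores = 2,
                               packages = c("RSelenium"),
                               browser = "firefox",
                               scrape_tries = 1,
                               extraCapabilities = list(
                                 "moz:firefoxOptions" = list(args = list('--headless'))
                               ))
```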