php-webminer -- Extract data using Selenium, QueryPath and PHP

DESCRIPTION

The goal of this project is to create an extensible system for extracting data from web pages. It currently uses Selenium WebDriver (via php-webdriver), QueryPath, and a configuration file that specifies which components to extract and how to output the results.

Job File

The "job" configuration file defines every aspect of a run: the system (database, infrastructure), the web site, and the data you wish to extract.

It is in XML and has the following options:

Child element "site" must be defined
Child element "steps" is recommended, as it drives the actions

Database

Currently a single MySQL database is accepted. If elements are defined, the XML will be imported into the database and table specified in the configuration file.

Actions

Click
Type
Captcha

Elements

Input - CSS Selectors used by QueryPath to pull data from a web page
Output - Element name of Output XML

Samples are included in the /examples folder.

Outputs XML

The element definitions in the configuration determine how the output is formatted, in particular the element names.

INSTALLING

GET THE CODE

Github

git clone git@github.com:kjenney/php-webminer.git

Packagist

Add the dependency to your composer.json. Package page: https://packagist.org/packages/kjenney/php-webminer

{
  "require": {
    "kjenney/php-webminer": "dev-master"
  }
}

BUILD WITH DEPENDENCIES

Download composer.phar.

curl -sS https://getcomposer.org/installer | php

Install the library.

php composer.phar install

Install PHP5 Extensions

Tidy (apt-get on Debian/Ubuntu, yum on RHEL/CentOS):

apt-get install php5-tidy
yum install php-tidy

MySQL native driver (Debian/Ubuntu):

apt-get install php5-mysqlnd
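
After installing, it can help to confirm that PHP actually loads the extensions. The following is a minimal sketch, not part of the project: the helper name check_php_ext is hypothetical, and it assumes the php binary on your PATH is the one that will run the miner and that the extensions are reported by `php -m` under the names tidy and mysqlnd.

```shell
# Hypothetical helper: report whether each named PHP extension is loaded.
# Relies on `php -m`, which prints one loaded module name per line.
check_php_ext() {
  for ext in "$@"; do
    # -q quiet, -i case-insensitive, -x whole-line match, -F literal string
    if php -m 2>/dev/null | grep -qixF "$ext"; then
      echo "$ext: loaded"
    else
      echo "$ext: missing"
    fi
  done
}

check_php_ext tidy mysqlnd
```

If an extension is reported as missing right after installation, the web server or CLI may be using a different php.ini than the one the package updated.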

Install Tesseract (optional)

apt-get install tesseract-ocr

GETTING STARTED

The only server this client needs is the selenium-server-standalone-#.jar file provided here: http://www.seleniumhq.org/download/
Download that file and run it, replacing # with the version you downloaded.
```
java -jar selenium-server-standalone-#.jar
```
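
Before starting a job, you may want to confirm that the server is reachable. The sketch below is an assumption-laden example, not part of the project: the function name selenium_status is hypothetical, and it assumes the server runs on localhost with the default port 4444 and exposes the /wd/hub/status endpoint, as the 2.x and 3.x standalone servers do. Pass a different URL as the first argument if your setup differs.

```shell
# Hypothetical helper: print "up" if the Selenium server answers, else "down".
# The default URL assumes a local standalone server on port 4444.
selenium_status() {
  url="${1:-http://localhost:4444/wd/hub/status}"
  # -f: treat HTTP errors as failure, -s: silent, --max-time: avoid hanging
  if curl -fs --max-time 5 "$url" > /dev/null 2>&1; then
    echo "up"
  else
    echo "down"
  fi
}

selenium_status
```

The function also prints "down" when curl itself is not installed, so check for curl first if the result is unexpected.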

Support

Wiki - https://github.com/kjenney/php-webminer/wiki

Contributing

There's still a lot of work that needs to be done, but I welcome any help and/or suggestions.
Feel free to create issues and recommend features.
