How to build and test a search engine parser
The Seeks websearch plugin comes with support for a few existing search engines, and it also has generic code for building new search engine parsers.
If you'd like to construct a parser for an additional search engine, this is very easy (hum).
Go to seeks/src/plugins/websearch/. There, copy an existing parser (e.g. se_parser_bing.cpp and se_parser_bing.h) to create the files for your parser.
Then edit Makefile.am and add your parser to it (same as for the other parsers).
Edit websearch_configuration.h and add your parser to the define list (the number on each new line should be double the number on the previous line).
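The doubling rule suggests that each engine gets its own bit, so that several engines can be combined in a single value. Here is a minimal sketch of that idea, assuming bit-flag semantics. The SE_EXAMPLE_* names and the engine_enabled helper are hypothetical and are not the actual contents of websearch_configuration.h, so check the existing defines before adding yours.

```cpp
#include <cstdint>

// Hypothetical engine identifiers (not the real Seeks defines): each value
// is double the previous one, so every engine occupies exactly one bit.
#define SE_EXAMPLE_FIRST   1U
#define SE_EXAMPLE_SECOND  2U
#define SE_EXAMPLE_THIRD   4U
#define SE_EXAMPLE_MINE    8U  // the new parser: double the previous value

// Hypothetical helper: returns true when the bit for `engine` is set in
// the combined value `enabled`.
inline bool engine_enabled(std::uint32_t enabled, std::uint32_t engine)
{
  return (enabled & engine) != 0;
}
```

With values chosen this way, a set of engines can be written as SE_EXAMPLE_FIRST | SE_EXAMPLE_MINE and tested one engine at a time. A value that is not a power of two would overlap with the bits of other engines.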
Not sure, but you may also need to hack se_handler.h to add your parser there too.
Go to seeks/proxy/src/proxy/tests/. There, use test_curl_mget to download a sample of the page you are going to parse.
Example: ./test_curl_mget 1 'http://search.blah.net/search?q=blabla&...'
The number before the URL (1 here) is the number of times you want to download the page. Quote the URL so that the shell does not interpret the & characters.
Save the output to a file, then move this file to seeks/src/plugins/websearch/tests/.
There, copy test-bing-parser.cpp to a new file for your parser. Edit this file and replace bing with your parser name.
Edit the Makefile.am and add the test for your parser to it.
Edit the files (.cpp and .h) of your parser in seeks/src/plugins/websearch/. Replace the bing keyword (or the equivalent for the parser you copied) with the keyword of your parser.
Comment out the code inside the methods.
Launch make. Pray. If this doesn't compile, find out why and edit this page (or come and talk with us on IRC). If everything goes fine, breathe: you are now ready to hack.
Okay, now the game begins. Open up se_parser_yourwebsite.cpp and the page you downloaded with test_curl_mget. The goal is to find the pattern of the result snippet in this HTML file and to hack se_parser_yourwebsite so that it finds it.
This is an event-based parser: each time it encounters an opening tag, a closing tag, or the content of a tag, it calls the corresponding method. You should use some boolean attributes to record where you are in the snippet (declare them in se_parser_yourwebsite.h and initialise them in the constructor). The important part is the creation of your snippet; you should already have the code for it from the copied parser (a big if with a lot of ||).
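The boolean-flag technique can be sketched as follows. This is a standalone illustration under stated assumptions, not the Seeks API: the class se_parser_example, the struct example_snippet, the simplified method signatures, and the result markup <div class="result"><a href="URL">TITLE</a></div> are all hypothetical. Take the real method signatures and snippet type from the parser you copied.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical result record; the real snippet class comes from the plugin.
struct example_snippet
{
  std::string url;
  std::string title;
};

// Hypothetical parser for results shaped like
//   <div class="result"><a href="URL">TITLE</a></div>
// It assumes no other <div> is nested inside a result.
class se_parser_example
{
public:
  typedef std::map<std::string, std::string> attributes;

  // Flags are declared below and initialised here, in the constructor.
  se_parser_example() : _in_result(false), _in_link(false) {}

  // Called for every opening tag.
  void start_element(const std::string &tag, const attributes &attrs)
  {
    if (tag == "div" && get(attrs, "class") == "result")
      {
        _in_result = true;            // entering a result block
        _current = example_snippet(); // start a fresh snippet
      }
    else if (_in_result && tag == "a")
      {
        _in_link = true;
        _current.url = get(attrs, "href");
      }
  }

  // Called for the text content found between tags.
  void characters(const std::string &text)
  {
    if (_in_link)
      _current.title += text; // text may arrive in several chunks
  }

  // Called for every closing tag.
  void end_element(const std::string &tag)
  {
    if (_in_link && tag == "a")
      _in_link = false;
    else if (_in_result && tag == "div")
      {
        _in_result = false;
        if (!_current.url.empty()) // keep only results that have a link
          _snippets.push_back(_current);
      }
  }

  const std::vector<example_snippet> &snippets() const { return _snippets; }

private:
  // Returns the value of attribute `key`, or an empty string if absent.
  static std::string get(const attributes &attrs, const std::string &key)
  {
    attributes::const_iterator it = attrs.find(key);
    return it == attrs.end() ? std::string() : it->second;
  }

  bool _in_result; // true between <div class="result"> and its </div>
  bool _in_link;   // true between <a> and </a> inside a result
  example_snippet _current;
  std::vector<example_snippet> _snippets;
};
```

Each flag answers one question about the current position in the page, and the snippet is only stored once the closing tag of the result block is seen. A real page usually needs more flags, one for each part of the result you want to capture (summary, cached link, and so on).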
To test your parser, run the test created in seeks/src/plugins/websearch/tests/ on the file downloaded with test_curl_mget. You should get the expected number of snippets created and no errors.
Happy hacking.
Here is a very simple example for Twitter: se_parser_twitter.cpp (not included in Seeks for the moment).
It looks for an <entry> tag, then for the <title> tag inside it, and grabs the text between <title> and </title>. It then looks for the <link> tag, grabs its href attribute, and pushes the snippet when it encounters </entry>.
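The flow described for the Twitter feed can be sketched roughly as below. This is not the content of se_parser_twitter.cpp: the class atom_entry_parser, the struct feed_snippet, and the simplified signatures (the href value is passed directly instead of an attribute list) are hypothetical, chosen only to keep the sketch short.

```cpp
#include <string>
#include <vector>

// Hypothetical record for one feed entry.
struct feed_snippet
{
  std::string title;
  std::string url;
};

// Hypothetical sketch of the <entry> / <title> / <link> flow.
class atom_entry_parser
{
public:
  atom_entry_parser() : _in_entry(false), _in_title(false) {}

  // `href` is the value of the tag's href attribute, or "" if it has none.
  void start_element(const std::string &tag, const std::string &href)
  {
    if (tag == "entry")
      {
        _in_entry = true;
        _current = feed_snippet();
      }
    else if (_in_entry && tag == "title")
      _in_title = true;
    else if (_in_entry && tag == "link" && _current.url.empty())
      _current.url = href; // keep the first link of the entry
  }

  // Text content: only kept while inside an entry's <title>.
  void characters(const std::string &text)
  {
    if (_in_title)
      _current.title += text;
  }

  void end_element(const std::string &tag)
  {
    if (tag == "title")
      _in_title = false;
    else if (tag == "entry" && _in_entry)
      {
        _in_entry = false;
        _snippets.push_back(_current); // entry is complete: push the snippet
      }
  }

  const std::vector<feed_snippet> &snippets() const { return _snippets; }

private:
  bool _in_entry; // true between <entry> and </entry>
  bool _in_title; // true between <title> and </title> inside an entry
  feed_snippet _current;
  std::vector<feed_snippet> _snippets;
};
```

The _in_entry flag is what keeps the feed-level <title> and <link>, which appear outside any <entry>, from being mistaken for a result.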