nneves/UrlSearchEngine

UrlSearchEngine

A modular services architecture to index and search URLs by their content, using CouchDB+Lucene as the search engine.

NOTE: This project is still under heavy development, please expect some WIP!

Quick start


Note: docker and docker-compose are required.

# Build docker containers (1st time)
docker-compose build

# Launch all required services
# [PRODUCTION]
docker-compose up

# [DEVELOPMENT]
docker-compose -f docker-compose.yml -f docker-compose.dev.yml up

# Open UI
open http://localhost:8080

Note: for a full CouchDB data reset, run the following (warning: all data will be erased):

docker-compose rm && rm DatabaseInit/dbinitstatus/.dbinitdone

Debug Services


Note: docker and docker-compose are required.

# Test GetImageFromURL
open 'http://localhost:3000/?url=www.botdream.com&width=1024&height=900'

# Test GetContentFromURL
curl http://localhost:6000/\?url\=www.botdream.com

# Open CouchDB UI
open http://localhost:5984/_utils/database.html\?searchengine/_all_docs

# Test CouchDB+Lucene search
curl -X GET --silent 'http://localhost:5984/_fti/local/searchengine/_design/search/by_content?q=brilliant&include_docs=true' | jq .

# DBProxy service
# [POST]
curl -d 'url=http://www.botdream.com' http://localhost:8000/url
# [DELETE]
curl -X DELETE http://localhost:8000/remove/http-botdream-com-botdream-com-botdream-com%2F2017-04-21T23%3A11%3A04.910Z | jq .
# [GET]
curl http://localhost:8000/search/botdream | jq .

# Shut down services
docker-compose stop
# or press CTRL+C in the original terminal

# Clean services data (reset containers)
docker-compose rm

# After changing source code, rebuild the docker images. This generic command rebuilds all images from scratch and ignores the cache, so it takes some time.
docker-compose build --no-cache
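
The document id in the DBProxy DELETE call above contains "/" and ":" characters, which show up percent-encoded as %2F and %3A in the URL. As a minimal sketch of how such an id could be encoded from the shell, the helper below percent-encodes everything except unreserved URL characters. The name `urlencode` is hypothetical (it is not part of this repository), and the helper assumes ASCII-only input.

```shell
# Hypothetical helper (not part of this repo): percent-encode a string so it
# can be used as a single URL path segment. Assumes ASCII-only input.
urlencode() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # the string without its first character
    c=${s%"$rest"}     # the first character itself
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;            # unreserved: keep as-is
      *) out=$out$(printf '%%%02X' "'$c") ;;    # anything else: %XX
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# Document id taken from the DELETE example above.
doc_id='http-botdream-com-botdream-com-botdream-com/2017-04-21T23:11:04.910Z'
echo "http://localhost:8000/remove/$(urlencode "$doc_id")"
```

With the services running, the printed URL can be passed to `curl -X DELETE`. The function body runs in a subshell, so its working variables do not leak into the calling shell.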

TODO:

  • Finish UI
  • Bundle UI into a Hapi.js webapp
  • Add logic to the previous webapp (insert URL content/image into CouchDB document, search content)
  • Add agent plugin for Email (read an email account to insert link content into CouchDB)
  • Add agent plugin for Slack/Telegraf (use Node-RED telegram/slack integration to insert link content into CouchDB)

Run services independently [use only for development]

Database Initializer


CouchDB database initializer.

For a quick service test, run these commands:

cd DatabaseInit

./docker_build.sh
./docker_run.sh

UI


React webapp UI.

For a quick service test, run these commands:

cd UI

./docker_build.sh
./docker_run.sh

open http://localhost:8080

GetImageFromURL


A simple screenshot web service powered by Express and PhantomJS. Forked from screenshot-app.

Original documentation available here

For a quick service test, run these commands:

cd GetImageFromURL

./docker_build.sh
./docker_run.sh

open 'http://localhost:3000/?url=www.botdream.com&width=1024&height=900'
# or with clipping params
open 'http://localhost:3000/?url=www.botdream.com&width=1024&height=900&clipRect=%7B%22top%22%3A0%2C%22left%22%3A0%2C%22width%22%3A1024%2C%22height%22%3A800%7D'

curl 'http://localhost:3000/?url=www.botdream.com&width=1024&height=900' > botdream.png
curl --silent http://localhost:3000/\?url\=www.botdream.com\&width\=1024\&height\=900 | imgcat

GetContentFromURL


A simple content scraping web service powered by Express and Cheerio.js

For a quick service test, run these commands:

cd GetContentFromURL

./docker_build.sh
./docker_run.sh

curl http://localhost:6000/\?url\=www.botdream.com

CouchDBLucene

Create a database and some docs and then you can start setting up and querying indexes as explained in the couchdb-lucene readme.
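
For orientation, here is a rough sketch of what a couchdb-lucene design document for this setup might look like. It is not taken from this repository (`./database_init.sh` may well create something different). The index names `by_content` and `by_title` match the query URLs used in this README, but the document fields `doc.content` and `doc.title` and the file name `design_search.json` are assumptions.

```shell
# Sketch only: write a couchdb-lucene design document to a local file.
# ASSUMPTION: indexed documents carry "content" and "title" fields.
cat > design_search.json <<'EOF'
{
  "_id": "_design/search",
  "fulltext": {
    "by_content": {
      "index": "function(doc) { var ret = new Document(); ret.add(doc.content); return ret; }"
    },
    "by_title": {
      "index": "function(doc) { var ret = new Document(); ret.add(doc.title); return ret; }"
    }
  }
}
EOF

echo "wrote design_search.json"

# To upload it (requires the CouchDB container to be running):
# curl -X PUT 'http://localhost:5984/searchengine/_design/search' \
#      -H 'Content-Type: application/json' -d @design_search.json
```

Each entry under `fulltext` defines one index, and its `index` function returns the Lucene `Document` to store for a given CouchDB document.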

For a quick service test, run these commands:

cd CouchDBLucene

docker-compose up
./database_init.sh

open http://localhost:5984/_utils/

curl -X GET --silent 'http://localhost:5984/_fti/local/searchengine/_design/search/by_content?q=nirvana&include_docs=true' | jq .

curl -X GET --silent 'http://localhost:5984/_fti/local/searchengine/_design/search/by_content?q=einstein&include_docs=true' | jq .

curl -X GET --silent 'http://localhost:5984/_fti/local/searchengine/_design/search/by_content?q=brilliant&include_docs=true' | jq .

# Clear CouchDB data
docker-compose rm
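
The search URLs above all follow the same layout: `_fti/local/<database>/_design/<design doc>/<index>`, followed by the query string. As a small convenience sketch, the hypothetical helper `fti_url` below (not part of this repository) builds such a URL for the `searchengine` database. It assumes a single-word search term that needs no URL encoding.

```shell
# Hypothetical helper: build a couchdb-lucene query URL for the
# searchengine database.
#   $1 = index name (by_content or by_title)
#   $2 = single-word search term (ASSUMPTION: no characters needing encoding)
fti_url() {
  printf 'http://localhost:5984/_fti/local/searchengine/_design/search/%s?q=%s&include_docs=true\n' "$1" "$2"
}

fti_url by_content brilliant

# Usage against a running stack; the double quotes keep the shell from
# treating "&" as a background operator:
# curl --silent "$(fti_url by_title botdream)" | jq .
```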

More info on CouchDB-Lucene fulltext search here: couchdblucene-fulltext-search

SendFavoritesToCouchDB

A simple tool written in Go that parses an HTML bookmarks file exported from Google Chrome and sends its links into CouchDB-Lucene.

For a quick service test, run these commands:

cd SendFavoritesToCouchDB

# export Chrome Bookmark file to ./bookmarks_sample.html

# ./build.sh # in case you need to change the sourcecode and compile the tool

# Launch CouchDB-Lucene service and initialize database

# Launch GetContentFromURL service

./SendFavoritesToCouchDB ./bookmarks_sample.html

UI

An experimental UI using Vue.js, served by a basic webpack-dev-server HTTP server. No backend is implemented yet, and no docker container is available.

For a quick service test, run these commands:

cd UI

npm install
npm start

open http://localhost:8080

DBProxy


A simple URL-to-index service. Send meat, get sausage.

This takes the given url, passes it through GetContentFromURL, then pushes it into the searchengine CouchDB database where couchdb-lucene is indexing documents.

curl -d 'url=http://www.botdream.com' http://localhost:8000/url

Then, when it's done:

curl -X GET --silent 'http://localhost:5984/_fti/local/searchengine/_design/search/by_content?q=botdream&include_docs=true' | jq .

NOTE: the search engine now also indexes by title:

curl -X GET --silent http://localhost:5984/_fti/local/searchengine/_design/search/by_title\?q\=botdream\&include_docs\=true | jq .

ALSO: DBProxy now implements a SEARCH endpoint. This saves the UI from querying CouchDB directly, and in the future it will make it possible to put another DB engine behind this service (ElasticSearch is being considered).

curl -X GET --silent http://localhost:8000/search/botdream | jq .