OCR and search in digitized soil investigation documents

j-schmied/TerraSearch

TerraSearch

OCR and search in digitized soil investigation documents.

Tech Stack

| Part       | Technology                                 |
|------------|--------------------------------------------|
| Frontend   | Flask (Web), Kibana (DB)                   |
| Backend    | Elasticsearch (Storage, Indexing)          |
| OCR        | Google Tesseract, jTessBoxEditor (Training) |
| NLP        | spaCy, ContextualSpellCheck, LocationTagger |
| Deployment | Docker/docker-compose (Containerisation)   |

Setup and Installation

  1. Install the dependencies (Docker, docker-compose).
  2. Create a .env file (see example below).
  3. Create the Docker network for the project: `docker network create terrasearch-net`
  4. Start Docker Compose: `docker-compose up -d`
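
Steps 3 and 4 can also be scripted. The following is a minimal sketch, not part of the repository: the helper names `setup_commands`, `missing_tools`, and `run_setup` are hypothetical, and only the two commands themselves come from the steps above. By default it only prints what it would run.

```python
import shutil
import subprocess

# Network name taken from step 3 of the setup instructions.
NETWORK_NAME = "terrasearch-net"


def setup_commands() -> list[list[str]]:
    """Return the commands from steps 3 and 4 as argument lists."""
    return [
        ["docker", "network", "create", NETWORK_NAME],
        ["docker-compose", "up", "-d"],
    ]


def missing_tools() -> list[str]:
    """Return the required executables that are not found on PATH (step 1)."""
    return [tool for tool in ("docker", "docker-compose") if shutil.which(tool) is None]


def run_setup(dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in setup_commands():
        print("$", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if a command fails,
            # for example when the network already exists.
            subprocess.run(command, check=True)
```

Calling `run_setup()` shows the commands without touching Docker; `run_setup(dry_run=False)` executes them from the directory that contains the compose file and the .env file.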

Environment File (.env) Example

VERSION="0.5.3"

# Username for elasticsearch
ELASTIC_USER="elastic"

# Password for the 'elastic' user (at least 6 characters)
ELASTIC_PASSWORD="elastic123"

# API Key for Google Maps
GOOGLE_API_KEY="GetMeFromGoogleCloudPlatform"

# Password for the 'kibana_system' user (at least 6 characters)
KIBANA_PASSWORD="kibana123"

# Version of Elastic products
STACK_VERSION=8.3.1

# Set to 'basic' or 'trial' to automatically start the 30-day trial
LICENSE=basic

# Port to expose Elasticsearch HTTP API to the host
ES_PORT=9200

# Port to expose Kibana to the host
KIBANA_PORT=5001

# Port to expose Frontend to the host
FLASK_PORT=8443

# Increase or decrease based on the available host memory (in bytes)
MEM_LIMIT=4294967296

# Set the cluster name
CLUSTER_NAME=terrasearch-cluster

# Sync directory inside container
SYNC_DIR="/data/sync"

# Local directory containing the PDF files
SYNC_DIR_LOCAL="C:\Downloads"
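
Before starting the stack, it can help to sanity-check the .env file against the constraints stated in its comments. The sketch below is an illustration under assumptions: `parse_env` and `validate_env` are hypothetical helpers, the parser handles only simple `KEY=VALUE` lines with optional surrounding quotes, and the port-range and integer checks are reasonable guesses rather than rules taken from the project.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def validate_env(values: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems: list[str] = []
    # The example file states a minimum of 6 characters for both passwords.
    for key in ("ELASTIC_PASSWORD", "KIBANA_PASSWORD"):
        if len(values.get(key, "")) < 6:
            problems.append(f"{key} must be at least 6 characters")
    # Assumption: exposed ports must be valid TCP port numbers.
    for key in ("ES_PORT", "KIBANA_PORT", "FLASK_PORT"):
        port = values.get(key, "")
        if not port.isdecimal() or not 1 <= int(port) <= 65535:
            problems.append(f"{key} must be a port number between 1 and 65535")
    # MEM_LIMIT is documented as a number of bytes.
    if not values.get("MEM_LIMIT", "").isdecimal():
        problems.append("MEM_LIMIT must be an integer number of bytes")
    return problems
```

A typical use would be `validate_env(parse_env(open(".env").read()))`, printing any returned problems before running docker-compose.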

Sources