It's a framework for small-scale, out-of-the-box BI infrastructure.
Each component can be replaced by a technology of your choice. Don't like Postgres? Replace it with Oracle; the principles will still be the same.
- Install Docker: https://docs.docker.com/install/
- From the root folder, run `docker-compose up`.
- Wait until the containers are running.
- Done :)
Container | Link | Credentials | Note |
---|---|---|---|
NiFi | http://127.0.0.1:8080/nifi | None | TODO: OpenLDAP in the future |
Postgres | 127.0.0.1:5432 | postgres:secret123 | to view the schema I added a https://dbschema.com/ project: /postgres/dwh/dbschema_dwh_project.dbs |
Metabase | http://127.0.0.1:3000 | You create your own | |
PgAdmin | http://127.0.0.1:8090 | admin:admin | |
openldap | http://127.0.0.1:38999 | None | |
jira-api | http://127.0.0.1:8000 | None | Mock of https://<your server>/rest/api/3/ API |
pipedrive-api | http://127.0.0.1:8100 | None | Mock of https://developers.pipedrive.com/docs/api/v1/ API |
restcountries-api | http://127.0.0.1:8200 | None | Mock of https://restcountries.eu/ API |
- Fetch data
- Store it in Postgres
- Execute Postgres stored procedures (see the sketch after this list)
- Single point of scheduling/dataflow
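As a minimal sketch of the last two points, a NiFi PutSQL/ExecuteSQL processor can invoke a stored procedure after a load finishes. The procedure name and body below are hypothetical, not taken from this project:

```sql
-- Hypothetical procedure a NiFi PutSQL processor could call after a load;
-- name and body are illustrative only.
CREATE OR REPLACE PROCEDURE stage.process_loaded_data()
LANGUAGE plpgsql
AS $$
BEGIN
    -- e.g. push freshly loaded rows onward and clear the input tables
    RAISE NOTICE 'processing staged data';
END;
$$;

-- What the NiFi processor would execute on a schedule or after ingest:
CALL stage.process_loaded_data();
```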
- Data store - stage
- Clean, integrated data - core
- Reports - mart
- Plus the transformations between these areas (sketched below)
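A minimal sketch of the three areas as Postgres schemas, with a hypothetical stage-to-core transformation. The table and column names are assumptions for illustration:

```sql
-- The three areas as Postgres schemas (names match the schema table below).
CREATE SCHEMA IF NOT EXISTS stage;
CREATE SCHEMA IF NOT EXISTS core;
CREATE SCHEMA IF NOT EXISTS mart;

-- Hypothetical tables; names and columns are illustrative only.
CREATE TABLE IF NOT EXISTS stage.customer_i (id text, name text);
CREATE TABLE IF NOT EXISTS core.customer (customer_id bigint, customer_name text);

-- Hypothetical stage -> core transformation: cast, trim, and deduplicate raw rows.
INSERT INTO core.customer (customer_id, customer_name)
SELECT DISTINCT s.id::bigint, trim(s.name)
FROM stage.customer_i AS s
WHERE s.id IS NOT NULL;
```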
- View reports/tables stored in mart
- Build aggregates/filters (an example of such a query is sketched below)
- Share them via URLs
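Mart tables are kept flat precisely so a BI tool like Metabase can aggregate and filter them directly. A hypothetical example of that kind of query; `mart.deals_report` and its columns are illustrative names, not part of this project:

```sql
-- Hypothetical flat report table and an aggregate of the kind Metabase builds on it.
CREATE TABLE IF NOT EXISTS mart.deals_report (country text, value numeric);

SELECT country,
       count(*)   AS deal_count,
       sum(value) AS total_value
FROM mart.deals_report
GROUP BY country
ORDER BY total_value DESC;
```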
- I'm using NiFi as a batch ingest tool and orchestrator for the Data Warehouse.
After two years of working on making the workflow as generic, modular, and simple to understand as possible, here are the principles I used to achieve that.
- Every logical step should output only one flow file, e.g. one Processor group that ingests data from a source system and outputs a single flow file into the success or error connection.
- Specify credentials only once, e.g. if you use the InvokeHTTP processor, design the workflow so that no two InvokeHTTP processors have the same credentials configured.
- Where you can, create a Processor group that acts as a black box: show the user what's going on without going into unnecessary detail.
Component | Naming Convention | Note |
---|---|---|
flow file attribute | lowerCamelCase | thisIsAnExampleForYou
connection | all lowercase, separated by spaces | this is an example for you
processor group | Initial Caps | This Is An Example For You |
input port | in | always use only this one name |
output port | success, error | always use only these two names
connection | success, error (whenever possible) | example of breaking the rule: parsing log files and routing each line by log level: Info, Warn, Error
Name | Usage |
---|---|
stage | data loaded as-is from source systems
core | cleaned, integrated data (with foreign keys)
mart | very flat tables, reports |
Postfix | Name | Usage |
---|---|---|
_i | input | contains temporary data used during ETL
_t | today | always holds the current data
_ih | input historized | SCD2 historization of _i tables
_d | snapshot | exact copy of _i tables, used as backup
without postfix | report | only in the mart schema (will be changed to _r)
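A hypothetical illustration of these postfix conventions for a single entity; every table and column name here is an assumption for the example, not taken from the project:

```sql
-- Hypothetical tables for one "employee" entity following the postfix convention.
CREATE TABLE stage.employee_i  (id text, name text);   -- _i: temporary input used during ETL
CREATE TABLE stage.employee_t  (id text, name text);   -- _t: always holds the current data

CREATE TABLE stage.employee_ih (                       -- _ih: SCD2 history of the _i table
    id         text,
    name       text,
    valid_from timestamptz NOT NULL,
    valid_to   timestamptz                             -- NULL while the row is current
);

-- _d: exact copy (snapshot) of the _i table, kept as a backup
CREATE TABLE stage.employee_d AS TABLE stage.employee_i;
```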