rtdl is a universal real-time ingestion and pre-processing layer for every
data lake – regardless of table format, OLAP layer, catalog, or cloud vendor. It is the easiest
way to build and maintain real-time data lakes. You send rtdl a real-time data stream – often
from a tool like Segment – and it builds you a real-time data lake on AWS S3, GCP Cloud
Storage, and Azure Blob Storage.
You provide the data, rtdl builds your lake.
Stay up-to-date on rtdl via our website and blog, and learn how to use rtdl via our documentation.
rtdl's initial feature set is built and working. You can use the API on port 80 to
configure streams that ingest JSON from an rtdl endpoint on port 8080, process it into Parquet,
and save the files to a destination configured in your stream. rtdl can write files locally, to
HDFS, to AWS S3, GCP Cloud Storage, and Azure Blob Storage, and you can query your data via Dremio's
web UI at http://localhost:9047 (log in with username `rtdl` and password `rtdl1234`). rtdl supports
writing in the Delta Lake table format as well as integration with the AWS Glue and Snowflake
External Tables metadata catalogs.
- Upgrading to v0.2.0 requires following the steps in our upgrade guide.
- Added Delta Lake support.
- Switched to file-based configuration storage (removed dependency on PostgreSQL).
- Community contribution: Stateful Function for PII detection and masking.
- Make AWS Glue, Snowflake External Tables, and Delta Lake support configurable on a per-stream basis.
- git integration for stream configurations.
- Research and implementation for Apache Hudi, Apache Iceberg, and Project Nessie.
- Graphical user interface.
- Dremio Cloud support.
For more detailed instructions, see our Initialize rtdl docs.
- Run `docker compose -f docker-compose.init.yml up -d`.
  - Note: This configuration should be fault-tolerant, but if any containers or
    processes fail when running this, run `docker compose -f docker-compose.init.yml down`
    and retry.
- After the containers `rtdl_rtdl-db-init`, `rtdl_dremio-init`, and `rtdl_redpanda-init`
  exit and complete with `EXITED (0)`, kill and delete the rtdl container set by running
  `docker compose -f docker-compose.init.yml down`.
- Run `docker compose up -d` every time after, and `docker compose down` to stop.

Note: Your memory setting in Docker must be at least 8GB. rtdl may become unstable if it is set lower.
Note #1: To start from scratch, run `rm -rf storage/` from the rtdl root folder.
Note #2: If you experience file write issues preventing Dremio and/or Redpanda services
from starting, please add `user: root` to the Dremio and Redpanda service definitions in the
`docker-compose.init.yml` and `docker-compose.yml` files. This issue has been encountered on Linux.
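The initialization steps above depend on knowing when the three init containers have finished. The following is a minimal sketch of that check, not part of rtdl itself. It assumes the container names match those listed above (your Docker Compose version may name them differently) and that `docker ps -a --format "{{.Names}}|{{.Status}}"` prints one `name|status` line per container, with finished containers reported as `Exited (0) ...`. The helper names are hypothetical.

```python
import subprocess

# Container names as listed in the initialization steps; adjust if your
# Docker Compose version names them differently (assumption).
INIT_CONTAINERS = ("rtdl_rtdl-db-init", "rtdl_dremio-init", "rtdl_redpanda-init")


def parse_statuses(ps_output):
    """Map container name -> status text from 'name|status' lines."""
    statuses = {}
    for line in ps_output.splitlines():
        if "|" not in line:
            continue  # skip blank or unexpected lines
        name, status = line.split("|", 1)
        statuses[name.strip()] = status.strip()
    return statuses


def init_finished(ps_output, names=INIT_CONTAINERS):
    """Return True only if every init container reports 'Exited (0)'."""
    statuses = parse_statuses(ps_output)
    return all(
        statuses.get(name, "").lower().startswith("exited (0)") for name in names
    )


def docker_ps_output():
    """Ask the local Docker CLI for all containers as 'name|status' lines."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}|{{.Status}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `init_finished(docker_ps_output())` on the machine running the init compose file tells you whether it is safe to run `docker compose -f docker-compose.init.yml down`.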
For more detailed setup instructions for your cloud provider, see our setup docs:
1. Create a new S3 bucket.
   - For more information, see Amazon's documentation.
2. Create a new IAM user.
   - For more information, see Amazon's documentation.
3. Create a new IAM policy.
   - Use the permissions below, and attach the policy to the IAM user created in step 2.
     Replace `<YOUR_BUCKET_NAME>` with the name of the S3 bucket you created in step 1.

         {
           "Version": "2012-10-17",
           "Statement": [
             {
               "Sid": "ListAllBuckets",
               "Effect": "Allow",
               "Action": [ "s3:GetBucketLocation", "s3:ListAllMyBuckets" ],
               "Resource": [ "arn:aws:s3:::*" ]
             },
             {
               "Sid": "ListBucket",
               "Effect": "Allow",
               "Action": [ "s3:ListBucket" ],
               "Resource": [ "arn:aws:s3:::<YOUR_BUCKET_NAME>" ]
             },
             {
               "Sid": "ManageBucket",
               "Effect": "Allow",
               "Action": [ "s3:GetObject", "s3:PutObject", "s3:PutObjectAcl", "s3:DeleteObject" ],
               "Resource": [ "arn:aws:s3:::<YOUR_BUCKET_NAME>/*" ]
             }
           ]
         }

4. Attach the policy created in step 3 to the IAM user created in step 2.
5. Create access keys for your IAM user.
   - For more information, see Amazon's documentation.
   - Save the `Access Key ID` and `Secret Access Key` for use in configuring your stream in rtdl.
6. Create a stream configuration record in rtdl. Send a call to the API at http://localhost:80/createStream.
   - Example `createStream` call body for creating a data lake on AWS S3.

         {
           "active": true,
           "message_type": "test-msg-aws",
           "file_store_type_id": 2,
           "region": "us-west-1",
           "bucket_name": "testBucketAWS",
           "folder_name": "testFolderAWS",
           "partition_time_id": 1,
           "compression_type_id": 1,
           "aws_access_key_id": "[aws_access_key_id]",
           "aws_secret_access_key": "[aws_secret_access_key]"
         }

   - Example `createStream` curl call for creating a data lake on AWS S3.

         curl --location --request POST 'http://localhost:80/createStream' \
           --header 'Content-Type: application/json' \
           --data-raw '{
             "active": true,
             "message_type": "test-msg-aws",
             "file_store_type_id": 2,
             "region": "us-west-1",
             "bucket_name": "testBucketAWS",
             "folder_name": "testFolderAWS",
             "partition_time_id": 1,
             "compression_type_id": 1,
             "aws_access_key_id": "[aws_access_key_id]",
             "aws_secret_access_key": "[aws_secret_access_key]"
           }'
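For scripted setups, the same `createStream` call can be made from Python. This is a minimal sketch using only the standard library. The field names and the values `file_store_type_id: 2`, `partition_time_id: 1`, and `compression_type_id: 1` are copied from the example body above; the function names are hypothetical, and because the response format is not described here, the raw response text is returned unparsed.

```python
import json
import urllib.request

CREATE_STREAM_URL = "http://localhost:80/createStream"


def build_aws_stream_config(message_type, region, bucket_name, folder_name,
                            access_key_id, secret_access_key):
    """Build a createStream body mirroring the AWS S3 example above."""
    return {
        "active": True,
        "message_type": message_type,
        "file_store_type_id": 2,   # value used in the AWS S3 example
        "region": region,
        "bucket_name": bucket_name,
        "folder_name": folder_name,
        "partition_time_id": 1,    # value used in the example
        "compression_type_id": 1,  # value used in the example
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }


def create_stream(config, url=CREATE_STREAM_URL):
    """POST the configuration as JSON and return the raw response text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With rtdl running locally, `create_stream(build_aws_stream_config("test-msg-aws", "us-west-1", "testBucketAWS", "testFolderAWS", key_id, secret))` sends the same request as the curl example, where `key_id` and `secret` are the access keys saved in step 5.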
For more detailed instructions, see our Send data to rtdl docs.
All data should be sent to the `ingest` endpoint of the ingest service on port 8080,
e.g. http://localhost:8080/ingest.
- You can send any JSON with just `stream_id` in the payload and rtdl will add it to your lake.

      {
        "stream_id": "837a8d07-cd06-4e17-bcd8-aef0b5e48d31",
        "name": "user1",
        "array": [1, 2, 3],
        "properties": { "age": 20 }
      }

  You can optionally add `message_type` should you choose to override the `message_type`
  specified while creating the stream. rtdl will default to a message type of `rtdl_default`
  if the message type is absent in both the stream definition and the actual message.
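As a rough illustration of the payload rules above, the sketch below builds an ingest message and models how the message type is chosen: the message's own `message_type` first, then the stream's, then `rtdl_default`. It illustrates the behavior described here and is not rtdl's actual code. The function names are hypothetical, and sending the message as an HTTP POST with a JSON content type is an assumption, since only the endpoint URL is given above.

```python
import json
import urllib.request

INGEST_URL = "http://localhost:8080/ingest"
DEFAULT_MESSAGE_TYPE = "rtdl_default"


def build_message(stream_id, payload, message_type=None):
    """Return a copy of payload with stream_id (and optional message_type) set."""
    message = dict(payload)
    message["stream_id"] = stream_id
    if message_type is not None:
        message["message_type"] = message_type  # overrides the stream's type
    return message


def effective_message_type(message, stream_message_type=None):
    """Model of the documented fallback: message, then stream, then default."""
    return message.get("message_type") or stream_message_type or DEFAULT_MESSAGE_TYPE


def send_message(message, url=INGEST_URL):
    """POST the message as JSON (assumed method) and return the raw response text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, `build_message("837a8d07-cd06-4e17-bcd8-aef0b5e48d31", {"name": "user1"})` produces a payload that relies on the stream's configured message type, while passing a third argument overrides it for that message only.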
rtdl has a multi-service architecture composed of a new generation of open source tools to process and access your data and custom-built services to interact with them more easily. To learn more about rtdl's services and architecture, visit our Architecture docs.
Contributions are always welcome!
See our CONTRIBUTING for ways to get started.
This project adheres to the rtdl code of conduct - a
direct adaptation of the Contributor Covenant,
version 2.1.