Make batches replayable (intermediate data lake) #1871
Replies: 8 comments
-
A simpler solve is here: we can close this item if we choose to implement #1246.
-
@aaronsteers I have been running the "intermediate data lake" use case in a personal project for about a month, using target-jsonl and tap-spreadsheets-anywhere. My pipeline doesn't capture raw Singer messages, though.
What I like:
Wishlist:
Most of these are already in your proposal 🙌 Here is a snippet of my jobs:
```yaml
- name: job-collect-auction-search
  tasks:
    - tap-auction-search-optimus target-jsonl
    - tap-auction-search-primus target-jsonl
    - awscli:s3_upload_auction_search
- name: job-collect-auction-categories
  tasks:
    - tap-auction-categories target-jsonl
    - awscli:s3_upload_auction_categories
- name: job-load-auction-search
  tasks:
    - awscli:s3_download_auction_search
    - tap-spreadsheets-anywhere-auction-search target-duckdb
- name: job-load-auction-categories
  tasks:
    - awscli:s3_download_auction_categories
    - tap-spreadsheets-anywhere-auction-categories target-duckdb
```
-
A higher than expected number of folks have asked me about this (I didn't think this mattered too much, but I was wrong), and they all tend to implement it themselves. They'd love this! I wanted to add that, from a technical outsider's view, "backing up" the content that flows from the tap to the target seems like the right architectural move. I've had this confirmed by 3-5 different companies asking this question in different ways.
Now, if we also did something like Kafka and added an id system of sorts to allow "full replayability" for folks, that would be pretty epic. I'm not sure the id piece is the right way to go here, but I wanted to describe what folks have asked for.
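As a rough illustration of what that id system could look like (a hypothetical sketch only; the `_sdc_offset` key is an invented name, not part of the Singer spec), a small wrapper could stamp every message with a monotonically increasing offset, Kafka-style, so a consumer could later replay from any checkpoint:

```python
import json
import sys


def stamp_offsets(lines, start=0):
    """Stamp each Singer message with a Kafka-style monotonic offset.

    The `_sdc_offset` key is hypothetical -- it is not part of the
    Singer spec -- but it would let a consumer resume replay from the
    last offset it successfully processed.
    """
    for offset, line in enumerate(lines, start=start):
        message = json.loads(line)
        message["_sdc_offset"] = offset
        yield json.dumps(message)


if __name__ == "__main__":
    # Example usage: tap-foo | python stamp_offsets.py > stream.singer.jsonl
    for stamped in stamp_offsets(sys.stdin):
        print(stamped)
```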
Quoting myself above: they all implement it slightly differently. Generally it's a backup somewhere with the tap data. The main concern is that they want to be able to debug things even without access to the source system at debug time. I'm not certain how many have actually built tooling to make the data easily useful. I think what actually happens is that this sounds really good, but you don't strictly need it in production, so folks drop it pretty low in the backlog and don't come back to it.
-
@kgpayne, @visch - Thanks both for your thoughtful feedback. Based on your comments, I refined the issue description to prefer the
-
@aaronsteers does that effectively merge this issue and "Support
-
@kgpayne - Good question. It doesn't fully merge them, because this item still focuses on a config-based way to echo STDOUT also to a set of files, whereas the proposal in #1246 takes a different approach. The other issue would be much easier after this one is implemented, but the use cases are different: in this case, we end up with replayable output inline with a normal sync operation, whereas that one generates replayable output instead of the normal sync operation. In theory, these could be done in any order, and either one makes the other easier.

It's probably worth also mentioning that turning this on by default when batch messages are enabled probably just "makes sense," because we're already uploading the bulk of the data anyway, and the remaining files are relatively small. We could get the same behavior for non-batch operations, but we can't turn that behavior on by default, both because of the massive storage and performance implications and because the user hasn't provided us a suitable default output path.
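To make the "echo STDOUT to a set of files" idea concrete, here is a minimal sketch (the function name and arguments are illustrative, not Meltano's actual API) of a tee that pipes a tap into a target while appending every Singer message to a local `.singer.jsonl` file:

```python
import subprocess


def sync_with_echo(tap_cmd, target_cmd, echo_path):
    """Run a tap piped into a target while also appending every Singer
    message to a .singer.jsonl file, so the sync produces a replayable
    copy inline with the normal operation. Illustrative sketch only.
    """
    tap = subprocess.Popen(tap_cmd, stdout=subprocess.PIPE, text=True)
    target = subprocess.Popen(target_cmd, stdin=subprocess.PIPE, text=True)
    with open(echo_path, "a", encoding="utf-8") as echo_file:
        for line in tap.stdout:
            target.stdin.write(line)  # the normal sync path
            echo_file.write(line)     # the replayable copy
    target.stdin.close()
    tap.wait()
    target.wait()


# e.g. sync_with_echo(["tap-foo"], ["target-jsonl"], "stream.singer.jsonl")
```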
-
@z3z1ma called this out in Office Hours today: we probably should ensure programmatically that each file has a `SCHEMA` message before any records. So an ideal file might (at least for some use cases) contain messages in this sequence: `SCHEMA`, then the `RECORD` and/or `BATCH` messages, then a closing `STATE` message.
One benefit to having more files, each self-sufficient, would be the ability for loaders to ingest data with multiple workers. A caveat to call out here is that
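Here is a rough sketch of how a stream could be split into self-sufficient files, assuming (as discussed above) that each chunk must open with the `SCHEMA` messages it needs and close at a `STATE` boundary. It handles `RECORD` messages only and is illustrative, not an actual implementation:

```python
import json


def split_self_sufficient(lines):
    """Split a Singer message stream into self-sufficient chunks.

    Each chunk re-emits the latest SCHEMA for any stream it touches and
    ends at a STATE boundary, so independent workers could each ingest
    one chunk in parallel.
    """
    schemas = {}  # latest SCHEMA line seen, keyed by stream name
    chunk, seen = [], set()
    for line in lines:
        message = json.loads(line)
        if message["type"] == "SCHEMA":
            schemas[message["stream"]] = line  # re-emitted per chunk below
            continue
        if message["type"] == "RECORD" and message["stream"] not in seen:
            chunk.append(schemas[message["stream"]])
            seen.add(message["stream"])
        chunk.append(line)
        if message["type"] == "STATE":
            yield chunk
            chunk, seen = [], set()
    if chunk:  # trailing messages after the last STATE
        yield chunk
```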
-
As discussed in Office Hours this week (Jan 18), @kgpayne has done some great work exploring a potential spec for cloud storage of the Singer event stream here:

Ideally we'd reconcile these two so that the 'replayable batches' and 'singerlake' concepts are really just a single spec, optionally with a toggleable setting as to whether the records would be stored inline in the `.singer.jsonl` files.

In order to achieve maximum performance from sources and targets, the capability of the source and target to write and ingest the files natively is going to be critical. Although we can of course leave the option of inline data storage for smaller workloads, and (more importantly) to retain compatibility with legacy targets that do not support batch.

cc @z3z1ma
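For illustration, that toggle might amount to the difference between these two message shapes (the field names follow the SDK's `BATCH` message as I understand it, but treat the exact spec here as an assumption):

```python
# Records stored inline in the .singer.jsonl stream itself:
inline_message = {
    "type": "RECORD",
    "stream": "users",
    "record": {"id": 1, "name": "Ada"},
}

# Records referenced via a BATCH manifest, so the source and target
# can write and ingest the files natively:
batch_message = {
    "type": "BATCH",
    "stream": "users",
    "encoding": {"format": "jsonl", "compression": "gzip"},
    "manifest": ["s3://my-bucket/users/batch-0001.jsonl.gz"],
}
```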
-
Since launching batch messages, we've had community members add S3 support and soon Parquet.
I've realized that we're only inches away from being able to create a data lake with rerunnable data ingestion.
What we have already:
If we just upload items 2 and 3 to cloud storage, we'd end up with a durable, robust, self-describing, and replayable data lake.
Proposed implementation spec
The implementation proposal for this spec is basically to upload the contents of the Singer message stream to a set of files on the remote storage provider.

- Emit `.singer.jsonl` files (i.e. the STDOUT stream contents) during sync to the same directory as provided by the `batch_config`'s storage spec.
- `.singer.jsonl` files are alpha-sorted in replay sequence.
- Optionally, allow the `.singer.jsonl` files to be directed to a different destination - for instance, to separate the raw data files from the Singer log stream. (If not specified, it defaults to the same directory as the batch files are saved to.)
- The `BATCH` message manifest will contain absolute references, so there will be no challenge in referencing those from different S3 paths or even from different backend providers altogether.
- Upload `.singer.jsonl` files based on the following logic (a rough sketch follows this list):
  - Capture each message as it is emitted to STDOUT.
  - Whenever `STATE` messages would be sent, upload a new file containing the Singer messages not yet uploaded.
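A minimal sketch of that flush-on-`STATE` logic (the `upload` callable stands in for whatever storage backend the `batch_config` points at; all naming is illustrative):

```python
import json


class SingerStreamUploader:
    """Buffer Singer messages and, whenever a STATE message passes
    through, flush everything not yet uploaded to a new .singer.jsonl
    file. Zero-padded part numbers keep the files alpha-sorted in
    replay sequence.
    """

    def __init__(self, upload, prefix="stream"):
        self.upload = upload  # e.g. a function that writes to S3
        self.prefix = prefix
        self.buffer = []
        self.part = 0

    def handle_line(self, line):
        self.buffer.append(line)
        if json.loads(line)["type"] == "STATE":
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        key = f"{self.prefix}-{self.part:08d}.singer.jsonl"
        self.upload(key, "".join(self.buffer))
        self.buffer = []
        self.part += 1
```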
Other options considered

These options were considered and not chosen.
To me, option 2 seems like overkill and makes this harder to achieve as an inline byproduct.