Make batches replayable (intermediate data lake) #1871

aaronsteers · 2022-12-06T01:35:09Z

aaronsteers
Dec 6, 2022

Since launching batch messages, we've had community members add S3 support and soon Parquet.

I've realized that we're only inches away from being able to create a data lake with rerunnable data ingestion.

What we have already:

Data files - land in cloud storage
JSON schema reference - sent to STDOUT.
File manifests - sent to STDOUT.

If we just upload items 2 and 3 to cloud storage, we'd end up with a durable, robust, self-describing, and replayable data lake.

Proposed implementation spec

The implementation proposal for this spec is basically to upload the contents of the Singer message stream to a set of files on the remote storage provider.

The tap will upload one or more .singer.jsonl files (i.e. the STDOUT steam contents) during sync to the same directory as provided by the batch_config's storage spec.
The tap will ensure all .singer.jsonl files are alpha-sorted in replay sequence.
- This can be implemented via timestamped filesnames or (perhaps better) by using something like ULID to generate unique file IDs which are also timestamp-sortable.
- By ensuring files have deterministic replay sequence in their filenames, the tap can ensure that files created by any of its runners can be consumed in timestamp order without the consumer needing to pre-parse each file.
The tap can accept an optional override config to send.singer.jsonl files to a different destination - for instance, to separate the raw data files from the Singer log stream. (If not specified, it defaults to the same directory as the batch files are saved to.)
- Regardless of whether files are stored in the same or different output directories, the BATCH message manifest will contain absolute references, so there will be no challenge in referencing those from different S3 paths or even from different backend providers altogether.
Taps might override this implementation, but, by applying learnings from similar BATCH and STATE conversations in the SDK, a reasonable default upload cadence would be for the tap to upload .singer.jsonl based on the following logic:
1. Continual write. If the storage backend supports continual writes, and if it can handle interruption, then the tap can simply echo all singer messages to the file buffer as well as sending to STDOUT.
2. Batched write. If continual write is not available or desireable, use a batched write operation.
  1. Whenever one or more STATE messages would be sent, upload a new file containing singer messages not yet uploaded.
  2. Defer or ignore upload trigger if <2 minutes have passed since the last upload. (Prevents excessive overhead when STATE messages are emitted very frequently, such as several times per second for low-cardinality parent-child relationships.)
  3. Always flush the message queue to the storage backend when the tap exits - whether or not job was successful or failed due to an error.

Other options considered

These options were considered and not chosen.

Details

Upload scheme and manifests to cloud storage as separate files during sync.
Use a new target (target-batch or target-data-lake) which add the additional files.

To me, option 2 seems like overkill and makes this harder to achieve as an inline byproduct.

aaronsteers · 2022-12-06T02:31:14Z

aaronsteers
Dec 6, 2022
Author

A simpler solve is here:

Support --output for taps, with support for cloud filesystems #1246

We can close this item if we choose to implement #1246

0 replies

kgpayne · 2022-12-07T19:06:15Z

kgpayne
Dec 7, 2022

@aaronsteers I have been running the "intermediate data lake" use case in a personal project for about a month, using target-jsonl and tap-spreadsheets-anywhere. My pipeline doesn't capture raw singer messages though - target-jsonl creates a file per stream (with name <stream name>-<timestamp>.jsonl) containing json dumped record values, and tap-spreadsheets-anywhere receives a glob pattern (e.g. customers-*.jsonl) and constructs schema on-the-fly.

What I like:

Output location and timestamping config of jsonl files from target-jsonl (I upload the files to S3 using the awscli util from squared to a prefix per stream, so create a matching folder hierarchy locally - ./output/<tap id>/<stream name>/<stram name>-<timestamp>.jsonl).
Glob pattern and input folder config in tap-spreadsheets-anywhere (rather than a single --input file), which allows me to download all extracts for a given s3 prefix locally before loading them all at once (to duckdb in my case 🐥).

Wishlist:

Persist raw messages to jsonl, especially SCHEMA so that schema changes can be correctly captured and handled by the eventual tap. I actually opened an issue and PR to target-jsonl to do just that.
Direct push to s3, with control over the prefix and filename template.
Compression with gzip.

Most of these are already in your proposal 🙌

Here is a snippet of the jobs from my meltano.yml. tap-auction is inherited a couple of times to search auction sites with different search terms 😅 I run the job-collect-* jobs on a schedule, collecting listings everyday, and run the job-load-* jobs as needed to update duckdb when I want to check out the listings data. I am using dbt too, but its still in development and so isn't yet in my jobs.

jobs:
- name: job-collect-auction-search
  tasks:
  - tap-auction-search-optimus target-jsonl
  - tap-auction-search-primus target-jsonl
  - awscli:s3_upload_auction_search

- name: job-collect-auction-categories
  tasks:
  - tap-auction-categories target-jsonl
  - awscli:s3_upload_auction_categories

- name: job-load-auction-search
  tasks:
  - awscli:s3_download_auction_search
  - tap-spreadsheets-anywhere-auction-search target-duckdb

- name: job-load-auction-categories
  tasks:
  - awscli:s3_download_auction_categories
  - tap-spreadsheets-anywhere-auction-categories target-duckdb

0 replies

visch · 2022-12-09T14:03:40Z

visch
Dec 9, 2022

A higher than expected number of folks (I didn't think this mattered to much, but I was wrong) have asked me about this, and they all tend to implement it themselves (They'd love this!)

I wanted to add that from a technical outsider view "backing up" the contents that goes from the tap to the target seems like the right architecture move. I've had this confirmed in 3-5 different companies asking this question in different ways.

If we just upload items 2 and 3 to cloud storage, we'd end up with a durable, robust, self-describing, and replayable data lake.

Now if we also did something like Kafka and added an id system of sorts to allow "full replayability" to folks that would be pretty epic. I'm not sure the id piece is the right way to go here, but just wanted to describe what folks have asked for.

"implement it themselves"

Quoting myself above they all implement it slightly different. Generally it's a backup somewhere with the tap data. The main concern being that they want to be able to debug things even without access to the source system at debug time.

I'm not certain how many have actually made tooling to make the data useful easily. I think what actually happens is this sounds really good, but you actually don't need it in production so folks drop it pretty low in the backlog and don't come back to it

0 replies

aaronsteers · 2022-12-13T00:38:46Z

aaronsteers
Dec 13, 2022
Author

@kgpayne, @visch - Thanks both for your thoughtful feedback. Based on your comments, I refined the issue description to prefer the .singer.jsonl file upload approach, which literally provides the exact sequence of Singer messages which would then allow data loads to be replayed. Those files would inherently contain the SCHEMA message, theBATCH message (with manifest), and the STATE message.

0 replies

kgpayne · 2022-12-14T14:07:33Z

kgpayne
Dec 14, 2022

@aaronsteers does that effectively merge this issue and "Support --output for taps, with support for cloud filesystems #1246"? It seems like --output could be made to serve both use cases, with the only difference being that BATCH messages trigger upload of each file in the manifest in addition to the raw message stream 🤔

0 replies

aaronsteers · 2022-12-14T14:41:59Z

aaronsteers
Dec 14, 2022
Author

@kgpayne - Good question. It doesn't fully merge them because this item still focuses on a config-based way to echo STDOUT also to a set of files, whereas the proposal in the #1246 --output is an implementation at the CLI which replaces STDOUT.

The other issue would be much easier after this one is implemented but the use cases are different. In this case, we end up with replayable output inline with a normal sync operation, whereas that one generates relatable output instead of the normal sync operation.

In theory, these could be done in any order, and either one makes the other easier.

It's probably worth also mentioning that turning this in by default for when batch messages are enabled probably just "makes sense" because we're already uploading the bulk of the data anyway, and the remaining files are relatively small.

We could get the same behavior for non-batch operations but we can't turn that behavior on by default, because of the massive storage and performance implications, and because the user hasn't provided us a suitable default output path.

0 replies

aaronsteers · 2022-12-14T18:38:34Z

aaronsteers
Dec 14, 2022
Author

@z3z1ma called this out in Office Hours today:

We probably should ensure programmatically that each file has a SCHEMA message for each stream mentioned.

So an ideal file might (at least for some use cases) contain messages in this sequence:

SCHEMA message
BATCH message
STATE message

One benefit to having more files, each self-sufficient, would be the ability to for loaders to run ingest data with multiple workers.

A caveat to call out here is that STATE won't be stream-specific and will be subject to still needing to be processed in sequence. Orchestrators could handle this, but it's just worth calling out.

0 replies

aaronsteers · 2023-01-20T18:33:55Z

aaronsteers
Jan 20, 2023
Author

As discussed in Office Hours this week (Jan 18), @kgpayne has done some great work in exploration on a potential spec for cloud storage of the Singer event stream here:

https://blog.kenpayne.co.uk/blog/2023-01-17.html

Ideally we'd reconcile these two so that the 'replayable batches' and 'singerlake' concepts are really just a single spec, optionally with toggleable setting as to whether the records would be stored inline in the .singer.jsonl stream (non-batch) or as their own directly consumable data files (batch mode).

In order to achieve maximum performance from source and targets, the capability of the source and target to write and ingest the files natively is going to be critical. Although we of course can leave the option to have inline data storage for smaller workloads, and (more importantly) to retain compatibility with legacy targets that do not support batch.

cc @z3z1ma

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make batches replayable (intermediate data lake) #1871

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 8 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Make batches replayable (intermediate data lake) #1871

aaronsteers Dec 6, 2022

Proposed implementation spec

Other options considered

Replies: 8 comments

aaronsteers Dec 6, 2022 Author

kgpayne Dec 7, 2022

visch Dec 9, 2022

aaronsteers Dec 13, 2022 Author

kgpayne Dec 14, 2022

aaronsteers Dec 14, 2022 Author

aaronsteers Dec 14, 2022 Author

aaronsteers Jan 20, 2023 Author

aaronsteers
Dec 6, 2022

aaronsteers
Dec 6, 2022
Author

kgpayne
Dec 7, 2022

visch
Dec 9, 2022

aaronsteers
Dec 13, 2022
Author

kgpayne
Dec 14, 2022

aaronsteers
Dec 14, 2022
Author

aaronsteers
Dec 14, 2022
Author

aaronsteers
Jan 20, 2023
Author