
Impl compaction & TTL system for Streams & Pipelines #99

Closed · 9 tasks done
thedodd opened this issue Sep 28, 2021 · 0 comments · Fixed by #109
Labels: A-crd (Hadron K8s CRDs), A-streams (Hadron server streams)

Comments

thedodd (Collaborator) commented Sep 28, 2021

Streams

  • Record the stream's last compaction timestamp and compare it against the current time on startup, basing the initial compaction delay on the delta (see the compaction sketch following the Observations note below). This ensures that a stream partition which restarts periodically (which shouldn't happen) never misses compaction.
  • Per partition compaction based on id+source.
    • Should we preserve only the latest event? Or should we preserve only the original, since a matching id+source marks the newer event as a duplicate according to CloudEvents 1.0?
    • Should we skip this entirely? There may be no value added, given the other compaction strategies and the fact that this does not actually guarantee that duplicate events will never appear across partitions.
    • Decision: not going to implement this, as it adds little benefit while sacrificing some performance on the write path.
  • Per-partition timestamp-based truncation: as events age past the TTL threshold, they are truncated.
    • Each event batch should write a timestamp into a secondary index, keyed by the offset of the batch's last event.
    • Stream CRDs should be updated to include a retention policy sub-structure (sketched below). Users should be able to specify the retention policy; only time-based retention is currently supported.
    • The Stream controller should check the earliest value in the timestamp secondary index, and when the elapsed time exceeds the configured retention policy, spawn a task to prune the old data.
  • Update the operator to pass retention policy data along to stream StatefulSets.
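As a rough illustration of the CRD sub-structure item above, here is a minimal Rust sketch of what a time-based retention policy config could look like. All type and field names here are assumptions for illustration, not Hadron's actual schema:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical sketch of a retention policy sub-structure for the Stream CRD;
/// per this issue, only time-based retention is supported to start.
#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct RetentionPolicy {
    /// The retention strategy in use.
    pub strategy: RetentionStrategy,
    /// Seconds an event is retained before it is eligible for pruning;
    /// only meaningful for the `Time` strategy.
    pub retention_seconds: Option<u64>,
}

#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub enum RetentionStrategy {
    /// Prune events older than `retention_seconds`.
    Time,
    /// Retain all data indefinitely; compaction never prunes.
    Retain,
}
```

The operator would then serialize a structure like this out of the Stream CRD and hand it to the stream StatefulSet, per the last checklist item.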

Observations: for any system that does not integrate transactionally with the full Hadron Stream, there is no way to guard against duplicate re-processing other than the transactional processing model.
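Tying the stream-side checklist items above together, here is a minimal sketch of the startup-delay calculation and the prune pass over the timestamp secondary index. The `PartitionIndex` trait and every name below are hypothetical stand-ins for the partition's real storage layer, not Hadron's actual API:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical storage interface standing in for the partition's index access.
trait PartitionIndex {
    /// Timestamp (secs since epoch) recorded at the end of the last compaction run.
    fn last_compaction_ts(&self) -> Option<u64>;
    /// Earliest (timestamp, last_offset_of_batch) entry in the secondary index.
    fn earliest_ts_entry(&self) -> Option<(u64, u64)>;
    /// Delete all batches, and their secondary index entries, up to and
    /// including the given offset.
    fn prune_through(&mut self, offset: u64);
    fn set_last_compaction_ts(&mut self, ts: u64);
}

fn now_secs() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}

/// Compute the initial delay before the first compaction run: if the last
/// recorded compaction is older than the interval, run immediately; otherwise
/// wait out the remainder. This keeps a periodically restarting partition
/// from perpetually missing its compaction window.
fn initial_delay(index: &impl PartitionIndex, interval: Duration) -> Duration {
    match index.last_compaction_ts() {
        None => Duration::ZERO,
        Some(last) => {
            let elapsed = now_secs().saturating_sub(last);
            Duration::from_secs(interval.as_secs().saturating_sub(elapsed))
        }
    }
}

/// One compaction pass: walk the timestamp secondary index from the earliest
/// entry, pruning every batch whose timestamp has aged past the TTL.
fn compact(index: &mut impl PartitionIndex, ttl_seconds: u64) {
    let cutoff = now_secs().saturating_sub(ttl_seconds);
    while let Some((ts, last_offset)) = index.earliest_ts_entry() {
        if ts > cutoff {
            break; // Everything remaining is still within the TTL.
        }
        index.prune_through(last_offset);
    }
    index.set_last_compaction_ts(now_secs());
}
```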


Pipelines

  • Pipeline stage data should be deleted once the entire Pipeline instance is complete.
  • Pipelines need to transactionally copy their root event to account for cases where the root event may be compacted away. Alternatively, we could check Pipeline offsets before compacting a range (a sketch of this check follows).
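For the offsets-check alternative, a minimal sketch, assuming a hypothetical `safe_compaction_upper_bound` helper and that each in-flight Pipeline reports the lowest stream offset it may still need as its root event:

```rust
/// Returns the highest stream offset that may safely be pruned: the TTL-derived
/// candidate, clamped strictly below the lowest offset any in-flight Pipeline
/// instance may still need as its root event. All names here are hypothetical.
fn safe_compaction_upper_bound(
    ttl_candidate: u64,
    pipeline_low_watermarks: impl IntoIterator<Item = u64>,
) -> Option<u64> {
    match pipeline_low_watermarks.into_iter().min() {
        Some(0) => None, // offset 0 is still needed; nothing may be pruned
        Some(lowest) => Some(ttl_candidate.min(lowest - 1)),
        None => Some(ttl_candidate), // no in-flight Pipelines; TTL alone governs
    }
}

fn main() {
    // In-flight Pipelines still hold root events at offsets 10 and 42, while
    // the TTL alone would allow pruning through offset 99: prune through 9.
    assert_eq!(safe_compaction_upper_bound(99, [10, 42]), Some(9));
}
```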
@thedodd added the A-crd (Hadron K8s CRDs) and A-streams (Hadron server streams) labels on Sep 28, 2021
@thedodd mentioned this issue (11 tasks) on Sep 28, 2021
@thedodd added the T-needs-design (Needs additional design work) label on Oct 8, 2021
@thedodd changed the title from "Impl compaction & TTL system for Streams" to "Impl compaction & TTL system for Streams & Pipelines" on Oct 11, 2021
thedodd added a commit that referenced this issue Nov 5, 2021
Compaction routine is now well-tested. Woot woot!

Operator has been updated to pass along retention policy config to
stream.

Updated deps across all components.

closes #99
@thedodd removed the T-needs-design (Needs additional design work) label on Nov 10, 2021
thedodd added a commit that referenced this issue Nov 10, 2021