This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

Data loss in Snowflake when pipeline crashes #286

Open
Tolsto opened this issue Jun 6, 2022 · 0 comments
Labels
bug Something isn't working

Comments

@Tolsto
Contributor

Tolsto commented Jun 6, 2022

We recently had a crash of Pipelinewise due to OOM that resulted in missing data in Snowflake. Going through the source code I found what I believe to be the likely cause of the problem:

elif t == 'STATE':
    LOGGER.debug('Setting state to %s', o['value'])
    state = o['value']
    # set flushed state if it's not defined or there are no records so far
    if not flushed_state or sum(row_count.values()) == 0:
        flushed_state = copy.deepcopy(state)

Line 322 assumes that the first state message is also the flushed state. With the Postgres tap and log-based replication, that assumption does not hold: by default, the first state message is only emitted after 10,000 changes have been processed. Consider the following scenario:

  • 2 tables synced via logical replication
  • batch size is 15,000
  • flush_all_streams is false
  • LSN in the persisted state for all tables is 10 when beginning the run
  1. Tap processes 9,000 inserts for table A and 1,000 for table B
  2. Tap emits its first state message with the LSN now at 20 for both tables. The target sets flushed_state to the content of that state message
  3. Tap emits 6,000 more inserts for table A
  4. The batch size for table A is reached and the target flushes that table to Snowflake. Afterwards, it emits the state using flushed_state as the baseline.
  5. Immediately after emitting the state the process crashes
  6. The state file will now contain LSN 20 for both tables although table B has never been flushed to Snowflake and is still at LSN 10. As a result, the next sync will disregard all changes before LSN 20 and the data is lost.
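
To make step 6 concrete, here is a minimal, self-contained sketch (not the actual target code; the stream names, handler functions and flush condition are simplified assumptions) that replays the scenario against the quoted flushed_state logic and prints the state that would be on disk at the moment of the crash:

import copy

BATCH_SIZE = 15_000  # batch size from the scenario above

# State the tap was started with: LSN 10 for both tables
persisted_state = {'bookmarks': {'table_a': {'lsn': 10}, 'table_b': {'lsn': 10}}}

state = None
flushed_state = None
row_count = {'table_a': 0, 'table_b': 0}

def handle_record(stream):
    row_count[stream] += 1

def handle_state(value):
    # Mirrors the quoted target logic: the first state message seen
    # becomes flushed_state, regardless of what has been flushed so far.
    global state, flushed_state
    state = value
    if not flushed_state or sum(row_count.values()) == 0:
        flushed_state = copy.deepcopy(state)

# Steps 1-2: 9,000 inserts for table A, 1,000 for table B, then the tap's
# first state message already carries LSN 20 for both tables.
for _ in range(9_000):
    handle_record('table_a')
for _ in range(1_000):
    handle_record('table_b')
handle_state({'bookmarks': {'table_a': {'lsn': 20}, 'table_b': {'lsn': 20}}})

# Steps 3-4: 6,000 more inserts for table A hit the batch size; only table A
# is flushed, but the state written to disk is derived from flushed_state.
for _ in range(6_000):
    handle_record('table_a')
if row_count['table_a'] >= BATCH_SIZE:
    row_count['table_a'] = 0  # table A flushed to Snowflake
    print('persisted before the run: ', persisted_state)
    print('written to the state file:', flushed_state)

# Steps 5-6: a crash here leaves LSN 20 on disk for table_b as well, even
# though its 1,000 records were never flushed and it is effectively at LSN 10.

Running this prints LSN 20 for both tables in the state file even though table B was never flushed.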

As far as I understand the issue so far, a crash of the pipeline when using CDC with Postgres will always lead to data loss. The larger the batch size, the more data is lost. The only remedy would be setting flush_all_streams to true.

I can think of two solutions:

  1. Before emitting any records, have the tap send a state message that contains the persisted state exactly as it was passed to the tap (see the sketch after this list). Of course, that would only fix the Postgres tap and not other taps that might have the same issue.
  2. Perform a merge when saving the state instead of simply overwriting the file. This probably has negative side effects, as the current implementation gives the target complete authority over the content of the persisted state.
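
For option 1, a rough sketch of what the tap side could look like (the function names below are illustrative, not actual tap-postgres code); the idea is simply that the target's initial flushed_state then matches the bookmarks already on disk:

import json
import sys

def emit_state(state):
    # Write a Singer STATE message to stdout and flush, so the target
    # receives it before any RECORD messages.
    sys.stdout.write(json.dumps({'type': 'STATE', 'value': state}) + '\n')
    sys.stdout.flush()

def do_sync(conn_config, persisted_state):
    # Emit the persisted state unchanged before the first record. A target
    # that treats the first STATE message as its flushed baseline then
    # starts from LSN 10 for both tables in the scenario above, instead of
    # from a bookmark that is ahead of what has actually been flushed.
    emit_state(persisted_state)
    # ... the existing log-based replication loop follows, emitting RECORD
    # messages and further STATE messages as changes are processed ...
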
Tolsto added the bug label on Jun 6, 2022