Transformer: Support old output file hierarchy #1315

Merged
merged 1 commit into from
Nov 3, 2023

Conversation

oguzhanunlu
Member

No description provided.

@oguzhanunlu oguzhanunlu self-assigned this Oct 31, 2023
@oguzhanunlu oguzhanunlu changed the title from "Streaming Transformer: Support old output file hierarchy" to "Transformer: Support old output file hierarchy" Oct 31, 2023
Contributor

@spenes spenes left a comment

LGTM!

@oguzhanunlu oguzhanunlu merged commit 6cc1956 into develop Nov 3, 2023
3 checks passed
@oguzhanunlu oguzhanunlu deleted the support-old-file-hierarchy branch November 3, 2023 14:51
@istreeter
Contributor

Consider a pair of schemas 1-0-0 and 1-0-1, but imagine schema evolution rules are broken. For example:

1-0-0

{
  "a": {"type": "string"}
}

1-0-1

{
  "a": {"type": "integer"},
  "b": {"type": "string"}
}

If we have legacyPartitioning = false (i.e. the new default behaviour) then everything is OK. The TSV files written to the 1-0-0 directory will have a column for field a, and the TSV files written to the 1-0-1 directory will have columns for fields a and b. The loader knows how to load the 1-0-0 directory into the main *_1 table, and how to load the 1-0-1 directory into the recovery table.

But if we have legacyPartitioning = true, then the output directory contains a mixture of TSV files with different numbers of columns. Some TSV files will have a column for field b and some will not.
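
To make the problem concrete, here is a minimal sketch in Python (not the transformer's actual code). The field lists come from the two schemas above; `to_tsv_row`, `output_dir`, and the non-legacy directory names are hypothetical and only stand in for the real layout.

```python
from collections import defaultdict

# Field lists taken from the 1-0-0 and 1-0-1 example schemas above.
FIELDS = {
    "1-0-0": ["a"],
    "1-0-1": ["a", "b"],
}

def to_tsv_row(version, data):
    """Render one entity as a TSV line, one column per field of its schema version."""
    return "\t".join(str(data.get(field, "")) for field in FIELDS[version])

def output_dir(version, legacy_partitioning):
    """Hypothetical directory naming: by model when legacy, by full version otherwise."""
    model = version.split("-")[0]
    return f"model={model}" if legacy_partitioning else version

def column_counts_per_dir(entities, legacy_partitioning):
    """Collect the distinct column counts that end up in each output directory."""
    counts = defaultdict(set)
    for version, data in entities:
        row = to_tsv_row(version, data)
        counts[output_dir(version, legacy_partitioning)].add(len(row.split("\t")))
    return dict(counts)

entities = [("1-0-0", {"a": "x"}), ("1-0-1", {"a": 1, "b": "y"})]
```

With `legacy_partitioning=False` each directory holds rows of a single width. With `legacy_partitioning=True` both versions land in the same directory, so it holds rows with one column and rows with two.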

@oguzhanunlu please could you follow through my reasoning and check that I have got that correct?

Bear in mind that a directory containing mixed TSV files is of very little use. It certainly cannot be loaded into a warehouse table, and I cannot think of any other purpose it would serve.

Possible solutions

We could say that the legacyPartitioning flag must only be used on pipelines that have no broken schemas. This is the very least we need to do.

But what if we enable it on a pipeline and then six months later there is a broken schema?

The best solution I can think of right now is this (when legacyPartitioning is true):

  1. If shredded output used the merged schema, then write to the legacy /model=1 directory.
  2. If shredded output used a recovery schema, then write to somewhere underneath /type=recovered instead of /type=good.
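
A rough sketch of that routing rule, in Python for illustration only. The function name `legacy_output_location` is hypothetical, and so is the choice of the full schema version as the location under `type=recovered`; the proposal only says "somewhere underneath" it.

```python
def legacy_output_location(version, used_recovery_schema):
    """Pick the output location for one shredded entity when legacyPartitioning is true.

    Returns a (top-level type, sub-directory) pair rather than a full path,
    because the rest of the path is not spelled out in the proposal.
    """
    model = version.split("-")[0]
    if used_recovery_schema:
        # Assumption: recovered output is keyed by full version so that
        # files with different columns never share a directory.
        return ("type=recovered", version)
    # Output shredded with the merged schema keeps the legacy layout.
    return ("type=good", f"model={model}")
```

For the example schemas, `legacy_output_location("1-0-0", False)` gives `("type=good", "model=1")`, and `legacy_output_location("1-0-1", True)` gives `("type=recovered", "1-0-1")`, so the legacy directory only ever contains TSV files with the merged schema's columns.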

3 participants