Transformer: Support old output file hierarchy #1315

oguzhanunlu · 2023-10-31T13:50:53Z

No description provided.

spenes

LGTM!

istreeter · 2023-11-06T20:42:33Z

Consider a pair of schemas 1-0-0 and 1-0-1, but imagine schema evolution rules are broken. For example:

1-0-0

{
  "a": {"type": "string"}
}

1-0-1

{
  "a": {"type": "integer"},
  "b": {"type": "string"}
}

If we have legacyPartitioning = false (i.e. the new default behaviour) then everything is OK. The TSV files written to the 1-0-0 directory will have a column for field a, and the TSV files written to the 1-0-1 directory will have columns for fields a and b. The loader knows how to load the 1-0-0 directory into the main *_1 table, and how to load the 1-0-1 directory into the recovery table.

But if we have legacyPartitioning = true, then the output directory contains a mixture of TSV files with different numbers of columns. Some TSV files will have a column for field b and some will not.

@oguzhanunlu please could you follow my reasoning through and check that I have got it correct?

Bear in mind that a directory containing mixed TSV files is of very little use. It certainly cannot be loaded into a warehouse table, and I cannot think of any other purpose it would serve either.
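To make the problem concrete, here is a minimal sketch of the reasoning above. It is not the transformer's code: the `COLUMNS` table, the function names, and the directory labels are assumptions chosen to mirror the 1-0-0 / 1-0-1 example, with the legacy layout grouping by model only.

```python
# Hypothetical column lists derived from the two example schemas above.
COLUMNS = {
    "1-0-0": ["a"],
    "1-0-1": ["a", "b"],
}


def output_dir(version: str, legacy_partitioning: bool) -> str:
    """Return an illustrative directory label for a schema version."""
    model = version.split("-")[0]
    # Assumption: the legacy layout keeps only the model number,
    # while the new layout keeps the full version.
    return f"model={model}" if legacy_partitioning else version


def column_counts_per_dir(legacy_partitioning: bool) -> dict:
    """Map each directory label to the set of TSV column counts written there."""
    result = {}
    for version, columns in COLUMNS.items():
        directory = output_dir(version, legacy_partitioning)
        result.setdefault(directory, set()).add(len(columns))
    return result
```

With `legacy_partitioning=False` every directory holds a single column count. With `legacy_partitioning=True` both versions land in `model=1`, so that directory holds files with one column and files with two.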

Possible solutions

We could say that the legacyPartitioning flag must only be used on pipelines that have no broken schemas. This is the very least we need to do.

But what if we enable it on a pipeline and then six months later there is a broken schema?

The best solution I can think of right now is this (when legacyPartitioning is true):

If shredded output used the merged schema, then write to the legacy /model=1 directory.
If shredded output used a recovery schema, then write to somewhere underneath /type=recovered, instead of /type=good.
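The rule could be sketched roughly as follows. This is only an illustration of the proposal, not an implementation: `legacy_output_prefix` is a hypothetical helper, and the exact sub-path under `type=recovered` is left open above, so the `model=` suffix used there is an assumption.

```python
def legacy_output_prefix(model: int, used_recovery_schema: bool) -> str:
    """Pick an output prefix when legacyPartitioning is true.

    Assumption: paths are built as 'type=<kind>/model=<n>'. The layout
    underneath type=recovered is not decided in this thread.
    """
    if used_recovery_schema:
        # Keep recovered rows away from the legacy directory so the
        # files under type=good all share the same columns.
        return f"type=recovered/model={model}"
    # Rows shredded with the merged schema go to the legacy directory.
    return f"type=good/model={model}"
```

For the example above, rows shredded with the merged schema would go under `type=good/model=1`, and rows that needed a recovery schema would go under `type=recovered/model=1`, so the two never share a directory.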

oguzhanunlu self-assigned this Oct 31, 2023

oguzhanunlu force-pushed the support-old-file-hierarchy branch from 5a44c52 to 12e0844 Compare October 31, 2023 15:17

oguzhanunlu changed the title from "Streaming Transformer: Support old output file hierarchy" to "Transformer: Support old output file hierarchy" Oct 31, 2023

Transformer: Support old output file hierarchy (close #1314)

b280640

oguzhanunlu force-pushed the support-old-file-hierarchy branch from 12e0844 to b280640 Compare October 31, 2023 15:37

spenes approved these changes Nov 3, 2023

oguzhanunlu merged commit 6cc1956 into develop Nov 3, 2023
3 checks passed

oguzhanunlu deleted the support-old-file-hierarchy branch November 3, 2023 14:51

istreeter mentioned this pull request Nov 6, 2023

Loader: Add legacy file hierarchy support #1320

Closed
