-
Ah ha, this is going to be fun. 😄
So, one point of clarification: are you talking about the concrete syntax (i.e. YAML, TOML, etc.) or the abstract syntax (`StringType`)? The reason I ask is that the CST already comes with string aliases that predefine a max length.

Ok, now on to the broader point. I deliberately forced the `bytes` attribute to be required so it's always explicit when data loss or truncation can occur.
The only example I can find of a truly unbounded string is JSON schema. But even that is fantasy, because programming languages represent their strings in memory and must have a max length. For example, Java's max string length is INT_MAX. Same deal for the other systems: Avro strings are length-prefixed with a long, so they cap out at LONG_MAX, and databases like Postgres and Snowflake put their own caps on VARCHAR.

The next question is whether we care about the max length at these boundaries. I think we actually do for all of the examples that I provided above except Avro (a string with a max of LONG_MAX is going to exceed any memory limit, most storage space, and filesystem size limits). But going from a Snowflake VARCHAR to a PG VARCHAR, it's actually important to know that your strings are going to get truncated.

So, to your proposals: Regarding (1), we already do that in the CST (see my note above about the string aliases). Between (2) and (3), if forced to choose, I would probably go with (3) over (2), just because it's a little simpler on the type hierarchy (but maybe a little less "pure"/"clean").

But to really answer you, I think I need to know who is feeling this pain. This is why I asked about the CST/AST above. If you're feeling the pain mostly in the AST when implementing converters and such, I'm inclined to leave things as is. If end users are griping to you about the CST syntax (i.e. when writing in YAML), I'm a bit more inclined to listen.
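To make the boundary point concrete, here's a rough sketch of the kind of max-length table I have in mind. The Java and Avro caps are the ones I mentioned above; the Snowflake and Postgres numbers are placeholders from memory, so verify them against vendor docs before relying on them:

```python
INT_MAX = 2**31 - 1
LONG_MAX = 2**63 - 1

# Per-system max string lengths. Java/Avro values follow their specs;
# the database caps are illustrative placeholders -- check vendor docs.
MAX_STRING_BYTES = {
    "java": INT_MAX,          # String length is bounded by a Java int
    "avro": LONG_MAX,         # string length is encoded as a long
    "snowflake": 16_777_216,  # placeholder VARCHAR cap
    "postgres": 10_485_760,   # placeholder VARCHAR(n) cap
}

def truncates(src: str, dst: str) -> bool:
    """True if a max-length string from src won't fit in dst."""
    return MAX_STRING_BYTES[src] > MAX_STRING_BYTES[dst]

# With these placeholders: truncates("snowflake", "postgres") -> True
```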
-
@cpard what's your take?
-
Huh, I knew about the DB limits, but TIL that Java caps string length at INT_MAX.
That's a good point... Maybe recap could throw an error when it detects this is going to happen (for any type), and make the user explicitly acknowledge it by passing in a flag to bypass the check?
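Something like this hypothetical check is what I had in mind (the names are made up, nothing like this exists in recap today):

```python
def check_string_fit(src_bytes: int, dst_bytes: int,
                     allow_truncation: bool = False) -> None:
    """Hypothetical converter-side check: fail loudly when a conversion
    would shrink the max string length, unless the caller opts in."""
    if src_bytes > dst_bytes and not allow_truncation:
        raise ValueError(
            f"strings may be truncated ({src_bytes} -> {dst_bytes} bytes); "
            "pass allow_truncation=True to convert anyway"
        )
```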
I'm definitely more concerned about end users. I was playing around with recap yesterday and started writing some recap schemas for existing proto/avro files we have. When I hit my first string, I realized I had no idea what to put for the byte value. I ended up digging into the converter code for each to see what value was being used. I imagine it will be confusing for users who aren't used to being explicitly forced to pick a max length.

I also realized while I was working on this that currently it isn't possible for me to write a recap schema that "works" for both avro and proto, because the converters use different values for the length. I was just going to update my comparison code to ignore the string length value for proto/avro, but then that seemed like it would be even more confusing for end users: "You have to specify a length, but we're going to ignore it in certain cases"... which is how we arrived at this post 😅
-
Ah, gotcha. So we're talking about CST here. I'm more sympathetic to that concern, since it's the end users.
I'm not 100% sure what you mean here. You can write this schema:

```yaml
# A struct with a required signed 32-bit integer field called "id"
type: struct
fields:
  - name: id
    type: int
    bits: 32
  - name: email
    type: string
    bytes: 255
```

This would be compatible with both Avro and Proto, since 255 < INT_MAX and 255 < LONG_MAX. It's a subset of both type systems. I'm guessing what you mean is that you can't write:

```yaml
# A struct with a required signed 32-bit integer field called "id"
type: struct
fields:
  - name: id
    type: int
    bits: 32
  - name: email
    type: string
    bytes: 9_223_372_036_854_775_807
```

Since that would be compatible with Avro but not Proto's type system, right? If so, yes, that's true. But that's valuable information, and I think it should be up to the tooling to decide how coercion works in such a case. So if you had a field with a `bytes` length larger than the target system allows, the converter could decide whether to fail, warn, or truncate.

I think this might be what you were saying with, "Maybe recap could throw an error when it detects this is going to happen (for any type), and make the user explicitly acknowledge it by passing in a flag to bypass the check?" If so, I agree, but Recap doesn't currently have a Recap -> (Avro|Proto) converter. It only has (Avro|Proto|JSONSchema) -> Recap. It sounds like you might be writing the Recap -> (Avro|Proto) converter. If so, want to contribute? 😅 This is where I think the alerts would live, right?
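To illustrate, the alert could live in a (currently nonexistent) Recap -> Proto converter, something like this sketch. `to_proto_field` and the `StringType` stand-in are made up for the example, and INT_MAX as Proto's cap is just the working assumption from above:

```python
from dataclasses import dataclass

INT_MAX = 2**31 - 1  # working assumption for Proto's string cap

@dataclass
class StringType:
    bytes_: int  # max length in bytes; stand-in for recap's AST type

def to_proto_field(string_type: StringType,
                   allow_truncation: bool = False) -> str:
    """Made-up Recap -> Proto converter hook showing where the alert lives."""
    if string_type.bytes_ > INT_MAX:
        if not allow_truncation:
            raise ValueError(
                f"bytes={string_type.bytes_} exceeds Proto's cap ({INT_MAX}); "
                "pass allow_truncation=True to coerce"
            )
        # opt-in: proceed anyway; a real converter might clamp here
    return "string"
```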
I take your point. So in this case, the question is how to handle the default when no `bytes` attribute is set.
I need to think on this a bit more. My instinct right now is (2), since it maintains the required bytes length, which I still feel is important to codify so we know when data loss/truncation is occurring. It also means we could leave the AST code untouched (StringType, BytesType).

(some time passes)

Now that I think about it, (1) and (2) aren't mutually exclusive. If we make `string` an alias for a `stringn` with some fixed default max length, end users get the ergonomics they're asking for while the AST still carries a required length.
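Roughly, the CST parser could desugar the alias like this (pure sketch; `DEFAULT_STRING_BYTES`, its value, and `resolve_type` are all hypothetical):

```python
DEFAULT_STRING_BYTES = 65_536  # arbitrary placeholder default

def resolve_type(node: dict) -> dict:
    """Desugar a hypothetical `string` alias into `stringn` + bytes,
    so the AST always sees an explicit length."""
    if node.get("type") == "string" and "bytes" not in node:
        return {**node, "type": "stringn", "bytes": DEFAULT_STRING_BYTES}
    return node

# resolve_type({"type": "string"})
# -> {"type": "stringn", "bytes": 65536}
```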
-
K, here's what I'm thinking:
I'll move forward with this. LMK if I've missed anything.
-
PR is here: LMK what you think. I left a comment on it with a question.
-
Marking this resolved. I've merged #293. I've also updated the spec (bumped to 0.1.1):
-
Wanted to run something by you... As I've been playing around more with recap, one thing that's felt kind of off is the requirement to always set the max length of a string. In practice I almost never do this, opting for `VARCHAR` (or `TEXT` in MySQL) in most cases, except when I know there needs to be a length restriction for some reason. In Postgres and Snowflake the behavior for `VARCHAR` without a length specifier is to default to the maximum allowed length, and even though it's not part of the ANSI standard, pretty much all databases have some form of `TEXT` or `VARCHAR(max)` these days.

Having the default `string` length be "infinite" also helps with compatibility when converting between other formats. Protobuf and Avro have no concept of maximum string length (sort of... Avro has `fixed`, but that's not variable length), and JSON schema does let you specify min/max lengths but defaults to "infinite" as well. When converting between Postgres, MySQL, and Snowflake, each of which has different max length values, being able to omit the length attribute in the recap schema would indicate that the length should be the max the database allows.

A few ideas that popped into my head while writing this (rough syntax sketch after the list):

1. Keep `bytes` as a required attribute, but have the base type be something like `stringn`, which requires a length, and have `string` be an alias that uses some fixed max value
2. `string` has no `bytes` attribute at all and is always varying, and `stringn` requires it and is optionally varying
3. Make the `bytes` attribute optional to indicate "infinite" length (or the max the specific platform can support), and `varying` only applies if the length is specified

Thoughts?
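For concreteness, here's roughly what each option might look like in a schema (made-up syntax, just to illustrate):

```yaml
# Option 1: `string` as sugar for `stringn` with some fixed max
- name: email
  type: string            # desugars to: type: stringn, bytes: <fixed max>
# Option 2: `string` never takes bytes; `stringn` always does
- name: email
  type: stringn
  bytes: 255
# Option 3: `bytes` optional; omitting it means "platform max"
- name: email
  type: string            # no bytes -> max the target system allows
```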