Change string type length behavior to mimic VARCHAR #284

adrianisk · 2023-06-29T02:41:23Z

adrianisk
Jun 29, 2023
Maintainer

Wanted to run something by you... As I've been playing around more with recap, one thing that's felt kind of off is the requirement to always set the max length of a string. In practice I almost never do this, opting for VARCHAR (or TEXT in MySQL) in most cases except when I know there needs to be a length restriction for some reason. In Postgres and Snowflake the behavior for VARCHAR without a length specifier is to default to the maximum allowed length, and even though it's not part of the ANSI standard pretty much all databases have some form of TEXT or VARCHAR(max) these days.

Having the default string length be "infinite" also helps with compatibility when converting between other formats. Protobuf, and Avro have no concept of maximum string length (sort of... Avro has fixed but that's not variable length), and JSON schema does let you specify min/max lengths, but defaults to "infinite" as well. When converting between Postgres, MySQL, and Snowflake, each of which had different max length values, being able to omit the length attribute in the recap schema would indicate that the length should be the max the database allows.

A few ideas that popped into my head while writing this:

Keep bytes as a required attribute, but have the base type be something like stringn which requires a length, and have string be an alias that uses some fixed max value
Have two separate string types - string has no bytes attribute at all and is always varying, and stringn requires it and is optionally varying
Just make the bytes attribute optional to indicate "infinite" length (or the max the specific platform can support), and varying only applies if the length is specified

Thoughts?

criccomini · 2023-06-29T15:21:46Z

criccomini
Jun 29, 2023

Ah ha, this is going to be fun. 😄

one thing that's felt kind of off is the requirement to always set the max length of a string

So one point of clarification: are you talking about the concrete syntax (i.e. YAML, TOML, etc) or the abstract syntax (StringType). Reason I ask, is the CST comes with string32 and string64 aliases out of the box for this purpose.

Ok, now on to the broader point. I deliberately forced the bytes_ length for StringType (and BytesType) because I discovered that there actually are hard caps for nearly every type; they're just frequently ignored or not well documented.

PG is 10485760 bytes
Avro is LONG_MAX
Protobuf is INT_MAX
Snowflake is 16,777,216 (and 8,388,608 for binary)

The only example I can find of a truly unbounded string is JSON schema. But even that is fantasy because programming languages represent their strings in memory and must have a max bit length. For example, Java max string length is INT_MAX. Same deal for byte[] arrays. So really, JSON schema strings aren't infinite, they're undefined, which I'd argue is worse.

The next question is whether we care about the max length at these boundaries. I think we actually do for all of the examples that I provided above except Avro (a string with a max of LONG_MAX is going to exceed any memory length, most storage space, and filesystem size limits). But going from a Snowflake VARCHAR to a PG VARCHAR, it's actually important to know that your strings are going to get truncated.

So, to your proposals:

Regarding (1), we already do that in the CST (see my note above about string32 and string64).

Between (2) and (3), if forced to choose, I would probably go with (3) over (2) just because it's a little simpler on the type hierarchy (but maybe a little less "pure"/"clean").

But to really answer you, I think I need to know who is feeling this pain. This is why I asked about the CST/AST above. If your feeling the pain mostly in the AST when implementing converters and such, I'm inclined to leave things as is. If end users are griping to you about the CST syntax (i.e. when writing in YAML), I'm a bit more inclined to listen.

0 replies

criccomini · 2023-06-29T15:22:47Z

criccomini
Jun 29, 2023

@cpard what's your take?

1 reply

cpard Jun 30, 2023

From a user perspective it's good to have a char array kind of type with optional fixed length. My feeling here is that most users consider CHAR(n) as the option to set a character column with fixed length and VARCHAR or TEXT etc the option for when they don't want to mess with fixed lengths. I assume that behind the scenes the type system just sets VARCHAR to a default (or even max) length but this is something that the user doesn't and shouldn't have to care about.

I agree that for the library developer being explicit is a feature but for the end-user it's too cumbersome to have to manage lengths for text explicitly.

adrianisk · 2023-06-29T17:31:11Z

adrianisk
Jun 29, 2023
Maintainer Author

Avro is LONG_MAX
Protobuf is INT_MAX

Huh, I knew about the DB limits but TIL

But going from a Snowflake VARCHAR to a PG VARCHAR, it's actually important to know that your strings are going to get truncated.

That's a good point... Maybe recap could throw an error when it detects this is going to happen (for any type), and make the user explicitly acknowledge it by passing in a flag to bypass the check?

But to really answer you, I think I need to know who is feeling this pain. This is why I asked about the CST/AST above. If your feeling the pain mostly in the AST when implementing converters and such, I'm inclined to leave things as is. If end users are griping to you about the CST syntax (i.e. when writing in YAML), I'm a bit more inclined to listen.

I'm definitely more concerned about end users - I was playing around with recap yesterday, started writing some recap schemas for existing proto/avro files we have, and when I hit my first string I realized I had no idea what to put for the byte value. I ended up digging into the converter code for each to see what value was being used

https://github.com/recap-build/recap/blob/6473a319ea59c9acdbbf3a57a659a8a6b3dc6e72/recap/converters/protobuf.py#L142-L143

https://github.com/recap-build/recap/blob/6473a319ea59c9acdbbf3a57a659a8a6b3dc6e72/recap/converters/avro.py#L67-L68

I imagine it will be confusing for users who aren't used to being explicitly forced to pick a max length. I also realized while I was working on this that currently it isn't possible for me to write a recap schema that "works" for both avro and proto because the converters use different values for the length. I was just going to update my comparison code to ignore the string length value for proto/avro, but then that seemed like it would be even more confusing for end users: "You have to specify a length, but we're going to ignore it in certain cases" which is how we arrived at this post 😅

0 replies

criccomini · 2023-06-29T18:32:40Z

criccomini
Jun 29, 2023

Ah, gotcha. So we're talking about CST here. I'm more sympathetic to that concern, since it's the end users.

I also realized while I was working on this that currently it isn't possible for me to write a recap schema that "works" for both avro and proto because the converters use different values for the length.

I'm not 100% sure what you mean here. You can write this schema:

# A struct with a required signed 32-bit integer field called "id"
type: struct
fields:
  - name: id
    type: int
    bits: 32
  - name: email
    type: string
    bytes: 255

This would be compatible with both Avro and Proto, since 255 < INT_MAX and 255 < LONG_MAX. It's a subset of both type systems.

I'm guessing what you mean is that you can't write:

# A struct with a required signed 32-bit integer field called "id"
type: struct
fields:
  - name: id
    type: int
    bits: 32
  - name: email
    type: string
    bytes: 9_223_372_036_854_775_807

Since that would be compatible with Avro but not Proto's type systems, right?

If so, yes, that's true. But that's valuable information, and I think it should be up to the tooling to decide how coercion works in such a case. So if you had a field with type string64, and you were converting that to Proto, you'd probably have a WARN or something that would alert the user (or perhaps throw an exception?).

I think this might be what you were saying with, "Maybe recap could throw an error when it detects this is going to happen (for any type), and make the user explicitly acknowledge it by passing in a flag to bypass the check?" If so, I agree, but Recap doesn't currently have a Recap -> (Avro|Proto) converter. It only has (Avro|Proto|JSONSchema) -> Recap. It sounds like you might be writing the Recap -> (Avro|Proto) converter. If so, want to contribute? 😅 This is where I think the alerts would live, right?

I imagine it will be confusing for users who aren't used to being explicitly forced to pick a max length.

I take your point. So in this case, the question is how to handle the default when no bytes is defined for string/bytes in the CST.

As you said, we could make bytes optional. If unset, the converter is left to decide.
We could default the length to a preset value (like INT_MAX or 10 MiBs).
We could do one of the other approaches you enumerated above.

I need to think on this a bit more. My instinct right now is (2) since it maintains the required bytes length, which I still feel is important to codify so we know when data loss/truncation is occurring. It also means we could leave the AST code untouched (StringType, BytesType).

(some time passes)

Now that I think about it (1) and (2) aren't mutually exclusive. If we make bytes optional in the CST, but required in the AST, then it's up to the converter (from_dict) to decide how to handle missing bytes fields. This is where we could expose a parameter to fail or use a default. Is this what you were thinking?

0 replies

criccomini · 2023-06-30T15:54:33Z

criccomini
Jun 30, 2023

K, here's what I'm thinking:

Make bytes optional for string and bytes in CST (YAML, TOML, JSON, etc).
Keep bytes required for string and bytes in AST (RecapTypes).
When converting to Recap types in converters/readers/from_dict, bytes must be set, since they're dealing with the AST.
from_dict should take a "defaults" param that allows devs to set default values for types when doing the conversion. So default = {"string": {"bytes": 9_223_372_036_854_775_807}} Would use 9_223_372_036_854_775_807 as the default value when constructing the AST if no bytes field is set for string types.
When converting from Recap types (to Avro, Proto, JSON--something we don't do yet), the converters should be configurable to either throw errors or use coercion rules (and log warnings). So a string with bytes: 9_223_372_036_854_775_807 could be converted to Protobuf as a string if the coercion rules allow it.

I'll move forward with this. LMK if I've missed anything.

3 replies

criccomini Jun 30, 2023

cc @adrianisk @cpard

adrianisk Jun 30, 2023
Maintainer Author

Sounds good to me! Also some replies to your earlier comment

I'm not 100% sure what you mean here. You can write this schema:

type: struct
fields:
  - name: id
    type: int
    bits: 32
  - name: email
    type: string
    bytes: 255

This would be compatible with both Avro and Proto, since 255 < INT_MAX and 255 < LONG_MAX. It's a subset of both type systems.

Sorry compatible probably wasn't the right word, I meant if you write the above recap schema, then parse an equivalent proto/avro file the two schemas would not be equal because the converter would set the bytes values to a different length. After reading some of your replies though, I agree that is the right behavior for recap's AST implementation.

I think this might be what you were saying with, "Maybe recap could throw an error when it detects this is going to happen (for any type), and make the user explicitly acknowledge it by passing in a flag to bypass the check?" If so, I agree, but Recap doesn't currently have a Recap -> (Avro|Proto) converter. It only has (Avro|Proto|JSONSchema) -> Recap. It sounds like you might be writing the Recap -> (Avro|Proto) converter. If so, want to contribute? 😅 This is where I think the alerts would live, right?

Yep that's what I meant, and I'm not working on anything that fancy yet, but if I do I will absolutely contribute

cpard Jun 30, 2023

sounds good to me too!

criccomini · 2023-06-30T17:01:31Z

criccomini
Jun 30, 2023

PR is here:

#292

LMK what you think. I left a comment on it with a question.

1 reply

criccomini Jun 30, 2023

Alternative PR that just defaults to 64KiB for string and bytes:

#293

I am leaning this way now.

criccomini · 2023-06-30T21:35:05Z

criccomini
Jun 30, 2023

Marking this resolved. I've merged #293. I've also updated the spec (bumped to 0.1.1):

gabledata/recap-website#2

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change string type length behavior to mimic VARCHAR #284

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 7 comments 5 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Change string type length behavior to mimic VARCHAR #284

adrianisk Jun 29, 2023 Maintainer

Replies: 7 comments · 5 replies

adrianisk Jun 29, 2023 Maintainer Author

adrianisk Jun 30, 2023 Maintainer Author

adrianisk
Jun 29, 2023
Maintainer

Replies: 7 comments 5 replies

adrianisk
Jun 29, 2023
Maintainer Author

adrianisk Jun 30, 2023
Maintainer Author