-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add UnknownType and use in converters #430
base: main
Are you sure you want to change the base?
Conversation
Great, thanks! I'll take a look tomorrow. One thing I noticed in the website PR was that it seems like you might be modifying the CST to add an "unknown" type. This was something I mentioned wanting to avoid in #429:
I'll take a look and see if we can get away without adding "unknown" as a concrete syntax. |
Ah yes, forgot/missed that. Thanks for taking a look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok! I left some comments. A few notes:
- I think we should not try to serialize
UnknownType
out to JSON/Proto/Avro. None of those formats support an unknown format. Whatever we do there will break parsers or be pretty hacky. - To that end, I also don't believe we should include
"unknown"
in the CST syntax. Do you need this? If so, what is the use case?
I view UnknownType as a way for developers to get a hint that there's a column that couldn't be handled by the converters. The devs are then left with a few options:
a. Strip the UnknownType
(s) from the recap type (ignore).
b. Raise an error that the converter wasn't able to support all types.
c. Update the converter to properly handle the type.
d. Manually replace the UnknownType with a proper RecapType based on some outside knowledge.
In this philosophy, we don't need an "unknown" CST, and including as much info as possible to the dev via UnknownType's constructor seems like the right path.
@@ -147,7 +148,7 @@ def _from_recap( | |||
# meaning from other types. | |||
avro_schema.pop("aliases", None) | |||
case _: | |||
raise ValueError(f"Unsupported Recap type: {recap_type}") | |||
avro_schema["type"] = "unknown" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should continue to raise value error here.. "unkonwn" is not a valid Avro schema type. Parsers would fail with this.
@@ -352,7 +353,7 @@ def _parse( | |||
# Short circuit so we don't re-register aliases for nested types | |||
return return_type | |||
case _: | |||
raise ValueError(f"Unsupported Avro schema: {avro_schema}") | |||
return UnknownType(**extra_attrs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do UnknownType(**avro_schema)
rather than extra_attrs. It'd be more useful for debugging to see everything in the Avro schema.
@@ -431,4 +432,4 @@ def _parse_logical( | |||
**extra_attrs, | |||
) | |||
case _: | |||
raise ValueError(f"Unsupported Avro logical type: {logical_type}") | |||
return UnknownType(**extra_attrs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ same here
@@ -69,7 +70,7 @@ def to_recap(self, fields: list[bigquery.SchemaField]) -> StructType: | |||
scale=field.scale or 0, | |||
) | |||
case _: | |||
raise ValueError(f"Unrecognized field type: {field.field_type}") | |||
field_type = UnknownType() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe UnknownType(field_type=field_type)
? I'm thinking we should try to give the dev as much info as possible about the thing we can't convert.
@@ -44,6 +45,9 @@ def to_recap( | |||
alias_strategy = alias_strategy or JSONSchemaConverter.urn_to_alias | |||
json_schema = json.loads(json_schema_str) | |||
recap_schema = self._parse(json_schema, alias_strategy) | |||
if isinstance(recap_schema, UnknownType): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels hacky. I think we should probably refactor json_schema to use a similar strategy to the avro converter; namely, creating a _parse that returns RecapType
and using that in to_recap.
@@ -75,6 +75,6 @@ def _parse_type(self, column_props: dict[str, Any]) -> RecapType: | |||
unit = self._get_time_unit(params) or "nanosecond" | |||
base_type = IntType(bits=32, logical="build.recap.Time", unit=unit) | |||
else: | |||
raise ValueError(f"Unknown data type: {data_type}") | |||
base_type = UnknownType() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include data_type
raise ValueError( | ||
f"Unsupported `{column_type}` type for `{column_name}`" | ||
) | ||
return UnknownType() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include column_* stuff
class UnknownType(RecapType): | ||
"""Represents an unknown or unsupported Recap type.""" | ||
|
||
def __init__(self, description: str = "Unknown or unsupported type", **extra_attrs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confused by this. RecapType has "doc". Any reason you're not using that?
super().__init__("unknown", **extra_attrs) | ||
self.description = description | ||
|
||
def __eq__(self, other): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question for eq and repr. RecapType has these already.
@@ -471,6 +486,9 @@ def from_dict( | |||
[from_dict(t, registry) for t in type_dict.pop("types")], | |||
**type_dict, | |||
) | |||
case "unknown": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the part I'd strongly prefer to avoid. I am really nervous about adding "unknown"
as a type to the CST.
I don't think I will be able to make another iteration on this for several more weeks |
No worries, there's nothing pressing on my end. |
Addresses #429
Adds a new type
UnknownType
for situations where converters can't figure out an appropriate existing recap type.This does not have comprehensive test cases and I'm not sure this is the right way to go for all these converters, but I wanted to get started on this so we can discuss and collaborate on it.
Made a PR for the website/json schema update: gabledata/recap-website#16
Validation
We will need to add a bunch of tests for the converters, parser, and for the JSON schema validation.