Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UnknownType and use in converters #430

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mjperrone
Copy link
Contributor

@mjperrone mjperrone commented Mar 28, 2024

Addresses #429

Adds a new type UnknownType for situations where converters can't figure out an appropriate existing recap type.

This does not have comprehensive test cases and I'm not sure this is the right way to go for all these converters, but I wanted to get started on this so we can discuss and collaborate on it.

Made a PR for the website/json schema update: gabledata/recap-website#16

Validation

We will need to add a bunch of tests for the converters, parser, and for the JSON schema validation.

@criccomini
Copy link
Contributor

Great, thanks! I'll take a look tomorrow.

One thing I noticed in the website PR was that it seems like you might be modifying the CST to add an "unknown" type. This was something I mentioned wanting to avoid in #429:

If we went this route, I wouldn't want to add an unknown to the CST, but rather keep UnknownType as only an AST class similar to ProxyType.

I'll take a look and see if we can get away without adding "unknown" as a concrete syntax.

@mjperrone
Copy link
Contributor Author

Ah yes, forgot/missed that. Thanks for taking a look

Copy link
Contributor

@criccomini criccomini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! I left some comments. A few notes:

  1. I think we should not try to serialize UnknownType out to JSON/Proto/Avro. None of those formats support an unknown format. Whatever we do there will break parsers or be pretty hacky.
  2. To that end, I also don't believe we should include "unknown" in the CST syntax. Do you need this? If so, what is the use case?

I view UnknownType as a way for developers to get a hint that there's a column that couldn't be handled by the converters. The devs are then left with a few options:

a. Strip the UnknownType(s) from the recap type (ignore).
b. Raise an error that the converter wasn't able to support all types.
c. Update the converter to properly handle the type.
d. Manually replace the UnknownType with a proper RecapType based on some outside knowledge.

In this philosophy, we don't need an "unknown" CST, and including as much info as possible to the dev via UnknownType's constructor seems like the right path.

@@ -147,7 +148,7 @@ def _from_recap(
# meaning from other types.
avro_schema.pop("aliases", None)
case _:
raise ValueError(f"Unsupported Recap type: {recap_type}")
avro_schema["type"] = "unknown"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should continue to raise value error here.. "unkonwn" is not a valid Avro schema type. Parsers would fail with this.

@@ -352,7 +353,7 @@ def _parse(
# Short circuit so we don't re-register aliases for nested types
return return_type
case _:
raise ValueError(f"Unsupported Avro schema: {avro_schema}")
return UnknownType(**extra_attrs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do UnknownType(**avro_schema) rather than extra_attrs. It'd be more useful for debugging to see everything in the Avro schema.

@@ -431,4 +432,4 @@ def _parse_logical(
**extra_attrs,
)
case _:
raise ValueError(f"Unsupported Avro logical type: {logical_type}")
return UnknownType(**extra_attrs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ same here

@@ -69,7 +70,7 @@ def to_recap(self, fields: list[bigquery.SchemaField]) -> StructType:
scale=field.scale or 0,
)
case _:
raise ValueError(f"Unrecognized field type: {field.field_type}")
field_type = UnknownType()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe UnknownType(field_type=field_type)? I'm thinking we should try to give the dev as much info as possible about the thing we can't convert.

@@ -44,6 +45,9 @@ def to_recap(
alias_strategy = alias_strategy or JSONSchemaConverter.urn_to_alias
json_schema = json.loads(json_schema_str)
recap_schema = self._parse(json_schema, alias_strategy)
if isinstance(recap_schema, UnknownType):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels hacky. I think we should probably refactor json_schema to use a similar strategy to the avro converter; namely, creating a _parse that returns RecapType and using that in to_recap.

@@ -75,6 +75,6 @@ def _parse_type(self, column_props: dict[str, Any]) -> RecapType:
unit = self._get_time_unit(params) or "nanosecond"
base_type = IntType(bits=32, logical="build.recap.Time", unit=unit)
else:
raise ValueError(f"Unknown data type: {data_type}")
base_type = UnknownType()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include data_type

raise ValueError(
f"Unsupported `{column_type}` type for `{column_name}`"
)
return UnknownType()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include column_* stuff

class UnknownType(RecapType):
"""Represents an unknown or unsupported Recap type."""

def __init__(self, description: str = "Unknown or unsupported type", **extra_attrs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confused by this. RecapType has "doc". Any reason you're not using that?

super().__init__("unknown", **extra_attrs)
self.description = description

def __eq__(self, other):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question for eq and repr. RecapType has these already.

@@ -471,6 +486,9 @@ def from_dict(
[from_dict(t, registry) for t in type_dict.pop("types")],
**type_dict,
)
case "unknown":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the part I'd strongly prefer to avoid. I am really nervous about adding "unknown" as a type to the CST.

@mjperrone mjperrone marked this pull request as draft April 17, 2024 05:23
@mjperrone
Copy link
Contributor Author

I don't think I will be able to make another iteration on this for several more weeks

@criccomini
Copy link
Contributor

No worries, there's nothing pressing on my end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants