Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wiz-cli] the wiz tool shouldn't use any duplicate dataclass names #58

Open
rnag opened this issue May 13, 2022 · 1 comment
Open

[wiz-cli] the wiz tool shouldn't use any duplicate dataclass names #58

rnag opened this issue May 13, 2022 · 1 comment
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed self-created Opened by me!

Comments

@rnag
Copy link
Owner

rnag commented May 13, 2022

  • Dataclass Wizard version: 0.22.1
  • Python version: 3.10
  • Operating System: Mac OS

Description

I found an interesting case where the wrong dataclass schema is generated with the wiz command-line tool. The schema is wrong insofar as that the same dataclass name - Ability in this case - is reused between generated dataclasses in a nested dataclass structure. This obviously won't allow us to actually load JSON data to the main dataclass Data, because the nested schema has a bug (duplicate name for dataclass) as indicated.

See output below for more details.

What I Did

I ran wiz gs from my terminal, with the following JSON data as input:

echo '{"id": 1,
"height": 7,
"name": "bulbasur",
"abilities": [
    {
      "ability": {
        "name": "overgrow",
        "url": "https://pokeapi.co/api/v2/ability/65/"
      },
      "is_hidden": false,
      "slot": 1
    },
    {
      "ability": {
        "name": "chlorophyll",
        "url": "https://pokeapi.co/api/v2/ability/34/"
      },
      "is_hidden": true,
      "slot": 3
    }
  ]
}' | wiz gs -x

The result generated the following nested dataclass schema:

from __future__ import annotations

from dataclasses import dataclass

from dataclass_wizard import JSONWizard


@dataclass
class Data(JSONWizard):
    """
    Data dataclass

    """
    id: int
    height: int
    name: str
    abilities: list[Ability]


@dataclass
class Ability:
    """
    Ability dataclass

    """
    ability: Ability
    is_hidden: bool
    slot: int


@dataclass
class Ability:
    """
    Ability dataclass

    """
    name: str
    url: str

As you can already see, the Ability dataclass name is duplicated. This results in the second class definition overwriting the previous one, which is what we would like to avoid in this case.

I've then tried the following Python code with the generated class schema above:

j = '{"id":1,"height":7,"name":"bulbasur","abilities":[{"ability":{"name":"overgrow","url":"https://pokeapi.co/api/v2/ability/65/"},"is_hidden":false,"slot":1},{"ability":{"name":"chlorophyll","url":"https://pokeapi.co/api/v2/ability/34/"},"is_hidden":true,"slot":3}]}'
instance = Data.from_json(j)

Which resulted in an error, as expected, since the second dataclass definition overwrites the first one:

  error: Ability.__init__() missing 2 required positional arguments: 'name' and 'url'

Renaming the second class to something else, for example Ability2, and then updating any references to it, then works as expected to load the JSON input data to a dataclass instance.

Resolution

One possible approach that comes to mind, is to keep a hashset of all generated class names - or better yet a dict mapping of class name to a count of its occurrences, or how many times we've seen it - maybe at the module or global scope, or perhaps recursively pass it in through each function in the generation process.

Then check if the class name we want to add is already in the mapping, and if so we add some random suffix (for example increment by the count of how many times we've already seen the class name) to the generated class name; if not, we use the generated class name as-is. In any case, we then increment the count of the class name in the mapping, to indicate that we've seen it previously.

@rnag rnag added bug Something isn't working help wanted Extra attention is needed good first issue Good for newcomers labels May 13, 2022
@rnag rnag changed the title the wiz tool shouldn't use any duplicate dataclass names [wiz-cli] the wiz tool shouldn't use any duplicate dataclass names May 13, 2022
@ialarmedalien
Copy link

I've run into the same issue, with the added complication that the same key is reused to represent several objects:

{
   "data": {
      "citations": {
        "data": [
          {
            "id": "10.1093/bioinformatics/btu153",
            "type": "dois"
          },
          {
            "id": "10.1038/nbt.4163",
            "type": "dois"
          }
        ]
      },
      "client": {
        "data": {
          "id": "doe.lbnl",
          "type": "clients"
        }
      },
      "media": {
        "data": {
          "id": "10.25982/26409.220/1820944",
          "type": "media"
        }
      },
      "partOf": {
        "data": []
      },
   }
}

The data key is being used here as a container for the whole API response, as an object with keys id and type, and as a container for anonymous objects with keys id and type. The data structure above is particularly tricky as it's not clear whether the value of the data object automatically gets turned into a list if there are several (see data["citations"]["data"] vs data["client"]["data"] vs data["partOf"]["data"]

Possible approach:

  • have two dataclass generation modes, conservative and DRY
  • in conservative mode, a separate dataclass is generated for each instance of the duplicated key
  • in DRY mode, the wizard will attempt to reuse class names
  • when parsing, collect instances of the generated class names and the data structures that they represent
  • if the data structures beneath the keys match, reuse the class name -- e.g. in the data structure above, data["client"]["data"] and data["attributes"]["media"]["data"]
  • otherwise, create a new class name

The definitive way to work out the class structures would be to have access to the schema (which, unfortunately, the API does not provide) but this would provide a reasonable estimation.

@rnag rnag added the self-created Opened by me! label Nov 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed self-created Opened by me!
Projects
None yet
Development

No branches or pull requests

2 participants