
The extra data in SUBJECT/SAMPLE FACTORS block needs to be tested for duplicate keys #5

Open
hunter-moseley opened this issue Oct 25, 2022 · 2 comments

Comments

@hunter-moseley
Member

The extra data in SUBJECT/SAMPLE FACTORS block needs to be tested for duplicate keys.
This will likely need to occur during the initial parsing of the mwTab-formatted files.

@ptth222
Collaborator

ptth222 commented Apr 12, 2024

There are two concerns:

  1. Duplicate keys in a .txt file.
  2. Duplicate keys in a .json file. Note that the JSON specification doesn't require unique keys.

For the .txt file, the code in the tokenizer has to be modified (line 69). The question is how exactly to handle the situation when duplicate keys actually occur. Do we want to warn the user on read-in? Or should it be a validation error, given that duplicate keys are technically allowed by JSON? To give a message during validation we would have to detect duplicates on read-in and record them in a way that validation can find later, such as by using two different container types.

If we want to be able to faithfully reproduce a file with duplicate keys, we also have to use a different data structure than a plain dict. From looking around online, a common approach to duplicate keys is to keep a single key but turn the value into a list of all the values.

Ex.

{
    'posting': {
        'content': 'stuff',
        'timestamp': '123456789'
    },
    'posting': {
        'content': 'weird stuff',
        'timestamp': '93828492'
    }
}
# becomes
{
    'posting': [
        {
            'content': 'stuff',
            'timestamp': '123456789'
        },
        {
            'content': 'weird stuff',
            'timestamp': '93828492'
        }
    ]
}

We could go this route and then modify the write-out code to look for this case and reproduce the duplicate keys. This list type can also be detected during validation, letting us warn about the duplicate keys there.
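A minimal sketch of what that merge could look like, assuming we mark duplicate-derived lists with a dedicated list subclass so validation and write-out can tell them apart from genuine lists (DuplicateKeyList and merge_pairs are hypothetical names, not existing mwtab code):

class DuplicateKeyList(list):
    """Marker type: a list built by merging the values of duplicate keys."""

def merge_pairs(pairs):
    """Build a dict from (key, value) pairs, merging duplicate keys.

    All values of a duplicated key are collected, in order of
    appearance, into a single DuplicateKeyList.
    """
    result = {}
    for key, value in pairs:
        if key not in result:
            result[key] = value
        elif isinstance(result[key], DuplicateKeyList):
            result[key].append(value)
        else:
            result[key] = DuplicateKeyList([result[key], value])
    return result

pairs = [
    ('posting', {'content': 'stuff', 'timestamp': '123456789'}),
    ('posting', {'content': 'weird stuff', 'timestamp': '93828492'}),
]
merged = merge_pairs(pairs)
# merged['posting'] is a DuplicateKeyList, so validation can warn on it
# and write-out can emit one entry per element to reproduce the file.

Using a subclass rather than a bare list means ordinary list values in the data can never be confused with merged duplicates.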

For JSON files we can do a similar thing on the read-in side; see this SO post. The write-out side is a bit different. Basically, there is no way to write out duplicate keys using the built-in json library, because a Python dict can't hold them in the first place. There is a PyPI package that looks like it can, though.
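For what it's worth, the built-in json module already exposes the needed hook on the read-in side: object_pairs_hook receives each JSON object's raw (key, value) pairs before a dict is built, so duplicates are still visible there. Reusing the hypothetical merge_pairs helper from the sketch above (presumably this is the approach the SO post describes):

import json

raw = '{"posting": {"content": "stuff"}, "posting": {"content": "weird stuff"}}'

# object_pairs_hook is called with the raw (key, value) pairs of every
# JSON object before a dict is built, so duplicate keys are still visible.
data = json.loads(raw, object_pairs_hook=merge_pairs)

# data['posting'] is now a DuplicateKeyList holding both objects.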

What do you think?

@hunter-moseley
Member Author

hunter-moseley commented Apr 13, 2024 via email

ptth222 added a commit that referenced this issue Apr 25, 2024
Added code to handle and validate having duplicate keys in the "Additional sample data". Also changed documentation where appropriate and added to changelog. Closes #5.