
The extra data in SUBJECT/SAMPLE FACTORS block needs to be tested for duplicate keys #5

Open
hunter-moseley opened this issue Oct 25, 2022 · 2 comments

Comments

@hunter-moseley
Member

The extra data in SUBJECT/SAMPLE FACTORS block needs to be tested for duplicate keys.
This will likely need to occur during the initial parsing of the mwTab-formatted files.

@ptth222
Collaborator

ptth222 commented Apr 12, 2024

There are two concerns:

  1. Duplicate keys in a .txt file.
  2. Duplicate keys in a .json file. Note that the JSON specification doesn't require unique keys.

For the .txt file, the code in the tokenizer has to be modified (line 69). The question is how exactly to handle the situation when duplicate keys actually occur. Do we want to warn the user on read-in? Or should it be a validation error, given that duplicate keys are technically allowed by JSON? To give a message during validation we would have to detect duplicates on read-in and record them in a way that validation can find later, such as by using two different container types.

If we want to be able to faithfully reproduce a file with duplicate keys, we also have to use a different data structure than a plain dict. From looking around online, a common approach to duplicate keys is to keep a single key but turn the value into a list of all the values.

Ex.

{
    'posting': {
        'content': 'stuff',
        'timestamp': '123456789'
    },
    'posting': {
        'content': 'weird stuff',
        'timestamp': '93828492'
    }
}
# becomes
{
    'posting': [
        {
            'content': 'stuff',
            'timestamp': '123456789'
        },
        {
            'content': 'weird stuff',
            'timestamp': '93828492'
        }
    ]
}

We could go this route and then modify the write-out code to look for this case and reproduce the duplicate keys. This list type can also be detected during validation, letting us warn about the duplicate keys there.
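A minimal sketch of what that merge could look like, assuming we mark duplicate-derived lists with a dedicated list subclass so validation and write-out can tell them apart from genuine lists (DuplicateKeyList and merge_pairs are hypothetical names, not existing mwtab code):

class DuplicateKeyList(list):
    """Marker type: a list built by merging the values of duplicate keys."""

def merge_pairs(pairs):
    """Build a dict from (key, value) pairs, merging duplicate keys.

    All values of a duplicated key are collected, in order of
    appearance, into a single DuplicateKeyList.
    """
    result = {}
    for key, value in pairs:
        if key not in result:
            result[key] = value
        elif isinstance(result[key], DuplicateKeyList):
            result[key].append(value)
        else:
            result[key] = DuplicateKeyList([result[key], value])
    return result

pairs = [
    ('posting', {'content': 'stuff', 'timestamp': '123456789'}),
    ('posting', {'content': 'weird stuff', 'timestamp': '93828492'}),
]
merged = merge_pairs(pairs)
# merged['posting'] is a DuplicateKeyList, so validation can warn on it
# and write-out can emit one entry per element to reproduce the file.

Using a subclass rather than a bare list means ordinary list values in the data can never be confused with merged duplicates.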

For JSON files we can do a similar thing on the read-in side; see this SO post. The write-out side is a bit different. Basically, there is no way to write out duplicate keys using the built-in json library, because a Python dict can't hold them in the first place. There is a PyPI package that looks like it can, though.
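For what it's worth, the built-in json module already exposes the needed hook on the read-in side: object_pairs_hook receives each JSON object's raw (key, value) pairs before a dict is built, so duplicates are still visible there. Reusing the hypothetical merge_pairs helper from the sketch above (presumably this is the approach the SO post describes):

import json

raw = '{"posting": {"content": "stuff"}, "posting": {"content": "weird stuff"}}'

# object_pairs_hook is called with the raw (key, value) pairs of every
# JSON object before a dict is built, so duplicate keys are still visible.
data = json.loads(raw, object_pairs_hook=merge_pairs)

# data['posting'] is now a DuplicateKeyList holding both objects.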

What do you think?

@hunter-moseley
Member Author

hunter-moseley commented Apr 13, 2024 via email

ptth222 added a commit that referenced this issue Apr 25, 2024
Added code to handle and validate having duplicate keys in the "Additional sample data". Also changed documentation where appropriate and added to changelog. Closes #5.