The extra data in SUBJECT/SAMPLE FACTORS block needs to be tested for duplicate keys #5
There are two concerns:

1. Duplicate keys in a .txt file.
2. Duplicate keys in a .json file. Note that the JSON specification does not require unique keys.

For the .txt file, the code in the tokenizer has to be modified (line 69). The question is how exactly to handle the situation when there actually are duplicate keys. Do we want to warn the user on read-in? Should it be a validation error, since technically it is allowed by JSON? To give a message during validation we would have to detect the duplicates on read-in and record them in a way that can be found during validation, such as by using two different container types. If we want to be able to faithfully reproduce a file that contains duplicate keys, we have to use a different data structure than a dict. From looking around online, a common approach when there are duplicate keys is to keep a single key but turn the value into a list of all the values. For example:

{
    'posting': {
        'content': 'stuff',
        'timestamp': '123456789'
    },
    'posting': {
        'content': 'weird stuff',
        'timestamp': '93828492'
    }
}

# becomes

{
    'posting': [
        {
            'content': 'stuff',
            'timestamp': '123456789'
        },
        {
            'content': 'weird stuff',
            'timestamp': '93828492'
        }
    ]
}
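A minimal sketch of that merge in Python, assuming the tokenizer hands us one key/value pair at a time (the helper name add_factor and the bare dict container are illustrative, not the actual tokenizer code):

def add_factor(container, key, value):
    # First occurrence of the key: store the value directly, as before.
    if key not in container:
        container[key] = value
    # Repeated key: promote the stored value to a list and collect every occurrence.
    elif isinstance(container[key], list):
        container[key].append(value)
    else:
        container[key] = [container[key], value]
    return container

factors = {}
add_factor(factors, 'posting', {'content': 'stuff', 'timestamp': '123456789'})
add_factor(factors, 'posting', {'content': 'weird stuff', 'timestamp': '93828492'})
# factors['posting'] is now a list holding both values.

One caveat is that a value which is legitimately a list becomes indistinguishable from a merged duplicate, which is part of why the two-container-type idea mentioned above might be the safer signal for validation.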
We could go this route and then modify the write-out code to look for this case and reproduce the duplicate keys. The list type can also be detected during validation so that we can give warnings about duplicate keys. For JSON files we can do a similar thing on the read-in side; see this SO post: https://stackoverflow.com/questions/14902299/json-loads-allows-duplicate-keys-in-a-dictionary-overwriting-the-first-value. The write-out side is a bit different: there is no way to write out duplicate keys using the built-in json library, although there is a PyPI package (https://pypi.org/project/json-duplicate-keys/) that looks like it can. What do you think?
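For the JSON read-in, the idea from that SO post amounts to passing an object_pairs_hook to json.loads, which receives every key/value pair of an object before they collapse into a dict. A rough sketch of one possible merge behaviour (not necessarily what we would ship):

import json

def merge_duplicate_keys(pairs):
    # Called by json.loads for each JSON object, with duplicate keys still present.
    merged = {}
    for key, value in pairs:
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], list):
            merged[key].append(value)
        else:
            merged[key] = [merged[key], value]
    return merged

text = '{"posting": {"content": "stuff"}, "posting": {"content": "weird stuff"}}'
data = json.loads(text, object_pairs_hook=merge_duplicate_keys)
# data['posting'] is a list of both objects instead of only the last one.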
I think we should treat the presence of duplicate keys as an error in validation, given that Metabolomics Workbench is putting them into a dictionary in the JSONized version of mwTab.
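If the read-in step merges duplicates into a list as sketched above, flagging them during validation only requires looking for that container; roughly (hypothetical function name and message text, not the package's actual validation code):

def check_duplicate_keys(additional_data, errors):
    # A list value here means the parser saw the same key more than once.
    for key, value in additional_data.items():
        if isinstance(value, list):
            errors.append(
                "Duplicate key '{}' in SUBJECT/SAMPLE FACTORS additional data "
                "({} occurrences).".format(key, len(value))
            )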
Added code to handle and validate duplicate keys in the "Additional sample data". Also updated the documentation where appropriate and added a changelog entry. Closes #5.
The extra data in SUBJECT/SAMPLE FACTORS block needs to be tested for duplicate keys.
This will likely need to occur during the initial parsing of the mwTab-formatted files.