-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convoluted conversion from JSON to mwTab directly #6
Comments
Look at fileio.py submodule to understand how to use MWTabFile objects.
https://github.com/MoseleyBioinformaticsLab/mwtab/blob/master/mwtab/fileio.py
The MWTabFile __init__ takes a filename, but the read method is used to
read from a filehandle.
It is a little messy, but it was designed to capture the source whether it
was a filename, a URL, or a filepath in a zip archive.
Could use the tempfile module to create a temporary JSON file to read in.
https://docs.python.org/3/library/tempfile.html
Then could use:
mwtabfile = mwtab.MWTabFile(filename-to-save-to)
mwtabfile.read(tempfile_handle)
…On Thu, Nov 3, 2022 at 9:22 AM ptth222 ***@***.***> wrote:
The whole package essentially assumes everything is starting from a file
on disk, so if you already have a JSON dictionary constructed in memory it
is not straightforward to create the mwTab format file. One method is to
save the JSON out and then read it back in.
Excerpt from old MESSES convert code:
mwtab_json_fpath = "{}.json".format(splitext(results_path)[0])
mwtab_txt_fpath = "{}.txt".format(splitext(results_path)[0])
# save JSON
with open(mwtab_json_fpath, 'w', encoding="utf-8") as outfile:
json.dump(mwtabfile, outfile, indent=4)
# save mwTab (.txt)
with open(mwtab_txt_fpath, 'w', encoding="utf-8") as outfile:
mwfile = next(mwtab.read_files(mwtab_json_fpath))
mwfile.write(outfile, file_format="mwtab")
Another way is a bit hacky:
mwtabfile = mwtab.mwtab.MWTabFile("")
mwtabfile.update(output_json)
mwtabfile.header = " ".join(
["#METABOLOMICS WORKBENCH"]
+ [item[0] + ":" + item[1] for item in mwtabfile["METABOLOMICS WORKBENCH"].items() if item[0] not in ["VERSION", "CREATED_ON"]]
)
with open(args["<output-name>"], 'w', encoding='utf-8') as outfile:
mwtabfile.write(outfile, file_format="mwtab")
The MWTabFile class has a strange init. It requires a "source" input that
is supposed to be the path to the file I think, but then self.source is
never used by any of its methods. Similarly self.study_id and
self.analysis_id are set in the read method and not used. self.header is
also set in the read method and then used in the print_file method rather
than just printing the header when it is needed like every other section. I
feel like this should be retooled so it can accept an initial JSON in the
init, and maybe clean up some of the unused stuff, but I don't know that
they aren't used in other parts of the package.
—
Reply to this email directly, view it on GitHub
<#6>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADEP7B765M47ND5X3ZE6WNDWGO4A5ANCNFSM6AAAAAARWEFYIA>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
--
Hunter Moseley, Ph.D. -- Univ. of Kentucky
Professor, Dept. of Molec. & Cell. Biochemistry / Markey Cancer Center
/ Institute for Biomedical Informatics / UK Superfund Research Center
Not just a scientist, but a fencer as well.
My foil is sharp, but my mind sharper still.
---------------------------------------------------------------
Email: ***@***.*** (work) ***@***.***
(personal)
Phone: 859-218-2964 (office) 859-218-2965 (lab) 859-257-7715 (fax)
Web: http://bioinformatics.cesb.uky.edu/
Address: CC434 Roach Building, 800 Rose Street, Lexington, KY 40536-0093
|
The package may have been designed to handle URL, zip, etc., but not the MWTabFile class. Interpreting URL, filepath, or zip appears to be done by the GenericFilePath class, and then the read method of the MWTabFile class is used to read the contents. The point is that the filename (source) in the init is pointless and is not used by the class at all regardless of what it was designed to do. The value the MWTabFile object is initialized with is irrelevant. It would make more sense to not require a source. Currently the init really doesn't do anything at all, but it could be expanded to accept an optional json input and initialize with that value to make this problem easier. Example of how this might look:
This is an example of code that will work just fine even though the source is nonsense:
I did get the tempfile suggestion working:
Notice how it works just fine even though MWTabFile is initialized with an empty string. I can't decide if the tempfile solution is more or less hacky than the one I already put in the first comment. |
The current design has some issues, but it works.
Let's discuss this further, because there are other changes that need to be
made to the package.
It may make more sense to have the __init__ take no parameters.
Then the read method could take an optional source parameter.
Now additional creation methods can be added as staticmethods or
classmethods.
Also, the read method is using _is_json and the OrderedDict.update
inherited method to populate the MWTabFile object.
(_is_json method has an incorrect :rtype and the json_str variable is
misnamed, since it is a json data structure.)
If OrderDicts are used to create the data structure, then a simple
mwtabfile.update call should absorb the data structure.
…On Thu, Nov 3, 2022 at 12:36 PM ptth222 ***@***.***> wrote:
The package may have been designed to handle URL, zip, etc., but not the
MWTabFile class. Interpreting URL, filepath, or zip appears to be done by
the GenericFilePath class, and then the read method of the MWTabFile class
is used to read the contents.
The point is that the filename (source) in the init is pointless and is
not used by the class at all regardless of what it was designed to do. The
value the MWTabFile object is initialized with is irrelevant. It would make
more sense to not require a source. Currently the init really doesn't do
anything at all, but it could be expanded to accept an optional json input
and initialize with that value to make this problem easier.
Example of how this might look:
mwfile = mwtab.mwtab.MWTabFile(mwtab_json)
with open(path_to_txt_open, 'w', encoding='utf-8') as outfile:
mwfile.write(outfile, file_format='mwtab')
This is an example of code that will work just fine even though the source
is nonsense:
mwtabfile = mwtab.mwtab.MWTabFile("asdf")
mwtabfile.read("path_to_a_real_file")
I did get the tempfile suggestion working:
with tempfile.TemporaryFile(mode="w+", encoding="utf-8") as tp:
tp.write(json.dumps(output_json))
tp.seek(0)
mwfile = mwtab.mwtab.MWTabFile("")
mwfile.read(tp)
with open(args["<output-name>"], 'w', encoding="utf-8") as outfile:
mwfile.write(outfile, file_format="mwtab")
Notice how it works just fine even though MWTabFile is initialized with an
empty string.
I can't decide if the tempfile solution is more or less hacky than the one
I already put in the first comment.
—
Reply to this email directly, view it on GitHub
<#6 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADEP7B2CN3UBMRM5KXMBGHLWGPSXVANCNFSM6AAAAAARWEFYIA>
.
You are receiving this because you commented.Message ID:
***@***.***>
--
Hunter Moseley, Ph.D. -- Univ. of Kentucky
Professor, Dept. of Molec. & Cell. Biochemistry / Markey Cancer Center
/ Institute for Biomedical Informatics / UK Superfund Research Center
Not just a scientist, but a fencer as well.
My foil is sharp, but my mind sharper still.
---------------------------------------------------------------
Email: ***@***.*** (work) ***@***.***
(personal)
Phone: 859-218-2964 (office) 859-218-2965 (lab) 859-257-7715 (fax)
Web: http://bioinformatics.cesb.uky.edu/
Address: CC434 Roach Building, 800 Rose Street, Lexington, KY 40536-0093
|
I only opened the issue to suggest improving the API of the package so that you can directly save an mwTab formatted JSON as its text version, rather than having to save the JSON to file first. I further looked at what might be the best way to do this and offered some comments on what I found. The actual particulars of how to make this improvement don't really matter to me. I wanted to open the issue and document the need for this because I need it for MESSES convert, but I have 2 ways around saving out and reading back in, so it isn't urgent. I only respond to the other details to try and clear up confusion.
The current source parameter is unused, so why would the read method need one? What would it do with it?
Look at the second code block I put in the comment that started this issue. I also found this, but an update alone isn't enough due to the header being handled differently from the other sections for some reason. This is why I commented some about it. I highly recommend reading these issues on GitHub itself and not in the email so it is easier to see quotes, code blocks, tables, etc. that I formatting to appear well when viewed on the issue page. |
I see what you are saying about the header data member. But, it does not matter that no code is accessing self.source. After looking at the init, it makes more sense to have a default value for source of "". Probably best to add a staticmethod/classmethod to absorb a data structure and return new MWTabFile object. |
study_id and analysis_id are also not used. I can understand that a class can have data members that are meant to just be convenience for access or something, but generally you take great care with these so their values are always accurate. For instance, with this source member you would have it be required when reading in a file so that it is always accurate with what the data source actually is, or set it to None if the data is read in without it. Members like study_id and analysis_id make less sense for a class like this since it is a glorified dictionary and those fields are already stored in it. Those class methods could work. They are just roundabout initializers. As long as easy creation from a dict is achieved, whether it's through the init or a class method isn't all that different, in the end I end up with the same object I can use to save the JSON to text. |
After meeting decided to:
|
I want to clarify something with making the data members properties. Do we want to make them read-only? To make them otherwise would require keeping up with whether they have been manually set or modifying the underlying dictionary the members pull from. Making them read-only would technically break backward compatibility, but none of them had any expectation of being modified or needing to be modified, so I think it would be acceptable for them to be read-only. |
They can be implemented read-only initially and then made writable later,
especially once the MW File Status Website code is tested.
However, I think it is best to just implement them fully.
…On Wed, May 1, 2024 at 5:02 PM ptth222 ***@***.***> wrote:
I want to clarify something with making the data members properties. Do we
want to make them read-only? To make them otherwise would require keeping
up with whether they have been manually set or modifying the underlying
dictionary the members pull from. Making them read-only would technically
break backward compatibility, but none of them had any expectation of being
modified or needing to be modified, so I think it would be acceptable for
them to be read-only.
—
Reply to this email directly, view it on GitHub
<#6 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADEP7B5KAOMNU3C2UHIXAC3ZAFJ45AVCNFSM6AAAAAARWEFYICVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBZGEZTINJYHE>
.
You are receiving this because you commented.Message ID:
***@***.***>
--
Hunter Moseley, Ph.D. -- Univ. of Kentucky
Professor, Dept. of Molec. & Cell. Biochemistry / Markey Cancer Center
/ Institute for Biomedical Informatics / UK Superfund Research Center
Not just a scientist, but a fencer as well.
My foil is sharp, but my mind sharper still.
---------------------------------------------------------------
Email: ***@***.*** (work) ***@***.***
(personal)
Phone: 859-218-2964 (office) 859-218-2965 (lab) 859-257-7715 (fax)
Web: http://bioinformatics.cesb.uky.edu/
Address: CC434 Roach Building, 800 Rose Street, Lexington, KY 40536-0093
|
I don't think there is a way to make these properties without breaking some behavior. Currently, the values are initialized to empty strings and then the read method tries to set them to something else based on the information in the file being read in. If you change them to properties with a getter that tries to give you the underlying value from the dictionary then they will no longer return the empty string value from the current initialization.
Is this acceptable or what we want? The change in behavior has to happen to support manually creating your own dictionary I think. I don't think the class was made with that in mind. It was only considering reading from an already existing file and is not expecting changes to occur to the underlying dictionary. |
Simply add an if statement to n the getter to test whether the key exists
and then return empty string if it does not exist.
…On Thu, May 2, 2024, 10:02 AM ptth222 ***@***.***> wrote:
I don't think there is a way to make these properties without breaking
some behavior. Currently, the values are initialized to empty strings and
then the read method tries to set them to something else based on the
information in the file being read in. If you change them to properties
with a getter that tries to give you the underlying value from the
dictionary then they will no longer return the empty string value from the
current initialization.
# Current
temp = MWTabFile("source")
print(temp.study_id)
# prints the empty string
# Change to properties
temp = MWTabFile("source")
print(temp.study_id)
# raises a keyerror because the underlying dictionary is empty
Is this acceptable or what we want? The change in behavior has to happen
to support manually creating your own dictionary I think. I don't think the
class was made with that in mind. It was only considering reading from an
already existing file and is not expecting changes to occur to the
underlying dictionary.
—
Reply to this email directly, view it on GitHub
<#6 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADEP7B5GPPGI7Y7ERZYAISDZAJBPRAVCNFSM6AAAAAARWEFYICVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJQGU3TONJRGQ>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
You have to raise an error to get the current same behavior for the read method. Unless you want to have the read method copy what the getter already does. You would also lose part of the point of having a managed property, so that there isn't mismatch between the property and the underlying dictionary. There are a few issues we have to balance here.
I think something has to give here. I'm not sure why the read method raises an error. It doesn't actually stop the dictionary from being populated, the error is raised after the underlying dictionary is updated from the file contents. You could still use the writing methods after getting the error and it will just print with an empty string or not at all. If there is going to be error checking shouldn't it be on the write out as well? Maybe we could get rid of errors and leave that to validation. The data members just try to create themselves from the underlying dict and if they can't then return the empty string, and it would be up to the user to validate before writing. I can't decide if it's better to throw an error, return the empty string, or return None if the data member can't be created. I don't like the empty string because a user could think that the underlying value is truly an empty string. I suppose they could get this impression with None as well though. From the stand point of trying to use this class to manually create an MWTab file I personally would rather have the errors, but I would also just use the validation. |
I think leaving errors to validation makes the most sense.
Returning None when the key-value pair is not there is reasonable getter
behavior.
…On Thu, May 2, 2024 at 6:11 PM ptth222 ***@***.***> wrote:
You have to raise an error to get the current same behavior for the read
method. Unless you want to have the read method copy what the getter
already does. You would also lose part of the point of having a managed
property, so that there isn't mismatch between the property and the
underlying dictionary.
There are a few issues we have to balance here.
1. The data members not matching the underlying dictionary. Currently,
this is possible if you don't use the read method to populate the
dictionary or if you change something after using the read method.
2. Trying to keep the functionality of the data members the same.
3. Raising an error in the read method if the data members can't be
created due to the file missing "METABOLOMICS WORKBENCH".
4. Letting the user know when the data members can't be created from
the underlying dictionary. This is similar to 3. It's about being
consistent between populating the dictionary manually or using the read
method. If you populate using the read method then an error is raised if a
data member can't be created, so if you populate manually shouldn't you get
a similar error?
I think something has to give here.
I'm not sure why the read method raises an error. It doesn't actually stop
the dictionary from being populated, the error is raised after the
underlying dictionary is updated from the file contents. You could still
use the writing methods after getting the error and it will just print with
an empty string or not at all. If there is going to be error checking
shouldn't it be on the write out as well? Maybe we could get rid of errors
and leave that to validation. The data members just try to create
themselves from the underlying dict and if they can't then return the empty
string, and it would be up to the user to validate before writing.
I can't decide if it's better to throw an error, return the empty string,
or return None if the data member can't be created. I don't like the empty
string because a user could think that the underlying value is truly an
empty string. I suppose they could get this impression with None as well
though. From the stand point of trying to use this class to manually create
an MWTab file I personally would rather have the errors, but I would also
just use the validation.
—
Reply to this email directly, view it on GitHub
<#6 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADEP7B6EQTD7EAGWF3N2RCDZAK2ZDAVCNFSM6AAAAAARWEFYICVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJRHAZDSNJYG4>
.
You are receiving this because you commented.Message ID:
***@***.***>
--
Hunter Moseley, Ph.D. -- Univ. of Kentucky
Professor, Dept. of Molec. & Cell. Biochemistry / Markey Cancer Center
/ Institute for Biomedical Informatics / UK Superfund Research Center
Not just a scientist, but a fencer as well.
My foil is sharp, but my mind sharper still.
---------------------------------------------------------------
Email: ***@***.*** (work) ***@***.***
(personal)
Phone: 859-218-2964 (office) 859-218-2965 (lab) 859-257-7715 (fax)
Web: http://bioinformatics.cesb.uky.edu/
Address: CC434 Roach Building, 800 Rose Street, Lexington, KY 40536-0093
|
The header property looks something like: The internal dictionary for the "METABOLOMICS WORKBENCH" key looks something like:
The header is created by just pulling out all of the key value pairs in this dict that aren't "VERSION" or "CREATED_ON". It's not too difficult to parse them back out of the header and modify or add the pairs to the dict, but I'm wondering how strict to be when trying to set a header. Currently, if the header trying to be set doesn't match r"#METABOLOMICS WORKBENCH( [A-Z_]+:\w+)*" then I raise an error. I'm not sure if this is too strict or maybe not strict enough. I'm also concerned about user's expectations regarding setting the header. If someone tries to set something like "#METABOLOMICS WORKBENCH KEY1:asdf KEY2:qwer", then this will add the keys to the dictionary, but if you access the header after that it would return '#METABOLOMICS WORKBENCH STUDY_ID:ST000122 ANALYSIS_ID:AN000204 KEY1:asdf KEY2:qwer', which is different from what was set. Should there be a check to see if the getter returns something different than what was set and either print a message or raise an error or something? |
I think a warning should be printed if the header does not match a strict
regular expression with STUDY_ID and ANALYSIS_ID keys.
But the parsing should handle any set of key-value pairs.
This allows a bit of future-proofing, since someone may use the header for
additional purposes or Metabolomics Workbench may change their header.
…On Mon, May 6, 2024 at 5:33 AM ptth222 ***@***.***> wrote:
The header property looks something like:
'#METABOLOMICS WORKBENCH STUDY_ID:ST000122 ANALYSIS_ID:AN000204'
The internal dictionary for the "METABOLOMICS WORKBENCH" key looks
something like:
{'STUDY_ID': 'ST000122',
'ANALYSIS_ID': 'AN000204',
'VERSION': '1',
'CREATED_ON': '2016-09-17'}
The header is created by just pulling out all of the key value pairs in
this dict that aren't "VERSION" or "CREATED_ON". It's not too difficult to
parse them back out of the header and modify or add the pairs to the dict,
but I'm wondering how strict to be when trying to set a header. Currently,
if the header trying to be set doesn't match r"#METABOLOMICS WORKBENCH(
[A-Z_]+:\w+)*" then I raise an error. I'm not sure if this is too strict or
maybe not strict enough.
I'm also concerned about user's expectations regarding setting the header.
If someone tries to set something like "#METABOLOMICS WORKBENCH KEY1:asdf
KEY2:qwer", then this will add the keys to the dictionary, but if you
access the header after that it would return '#METABOLOMICS WORKBENCH
STUDY_ID:ST000122 ANALYSIS_ID:AN000204 KEY1:asdf KEY2:qwer', which is
different from what was set. Should there be a check to see if the getter
returns something different than what was set and either print a message or
raise an error or something?
—
Reply to this email directly, view it on GitHub
<#6 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADEP7B62IXXWSE24BDYHFITZA5E67AVCNFSM6AAAAAARWEFYICVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJVGU2TSOJUGE>
.
You are receiving this because you commented.Message ID:
***@***.***>
--
Hunter Moseley, Ph.D. -- Univ. of Kentucky
Professor, Dept. of Molec. & Cell. Biochemistry / Markey Cancer Center
/ Institute for Biomedical Informatics / UK Superfund Research Center
Not just a scientist, but a fencer as well.
My foil is sharp, but my mind sharper still.
---------------------------------------------------------------
Email: ***@***.*** (work) ***@***.***
(personal)
Phone: 859-218-2964 (office) 859-218-2965 (lab) 859-257-7715 (fax)
Web: http://bioinformatics.cesb.uky.edu/
Address: CC434 Roach Building, 800 Rose Street, Lexington, KY 40536-0093
|
Closes #6. Added a validate method, a from_dict method, and changed some data members to properties.
The whole package essentially assumes everything is starting from a file on disk, so if you already have a JSON dictionary constructed in memory it is not straightforward to create the mwTab format file. One method is to save the JSON out and then read it back in.
Excerpt from old MESSES convert code:
Another way is a bit hacky:
The MWTabFile class has a strange init. It requires a "source" input that is supposed to be the path to the file I think, but then self.source is never used by any of its methods. Similarly self.study_id and self.analysis_id are set in the read method and not used. self.header is also set in the read method and then used in the print_file method rather than just printing the header when it is needed like every other section. I feel like this should be retooled so it can accept an initial JSON in the init, and maybe clean up some of the unused stuff, but I don't know that they aren't used in other parts of the package.
The text was updated successfully, but these errors were encountered: