Gee, thanks for the report here, @clwgg - #317 is looking more urgent all the time.
I can't look at this until next week, but will have a look then. But, it sounds like changing the schema isn't doing consistency checking, which it probably should, to avoid this sort of thing (unless we decide that "changing a schema" is a user-beware sort of operation, in which case pyslim should be checking for existing metadata).
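A minimal sketch of the kind of consistency check mentioned above, written with the standard library only. It assumes the new schema describes a fixed-size binary record; the layout `"<q?B"` and the helper names `incompatible_rows` and `set_schema_checked` are hypothetical and are not part of tskit or pyslim.

```python
import struct

# Hypothetical fixed-size binary layout, for illustration only:
# little-endian int64, bool, uint8 (10 bytes, no padding).
STRUCT_FORMAT = "<q?B"


def incompatible_rows(metadata_rows, fmt=STRUCT_FORMAT):
    """Return indices of non-empty rows whose length does not match fmt."""
    expected = struct.calcsize(fmt)
    return [i for i, raw in enumerate(metadata_rows) if raw and len(raw) != expected]


def set_schema_checked(metadata_rows, fmt=STRUCT_FORMAT):
    """Accept fmt only if every existing non-empty row could have been packed with it."""
    bad = incompatible_rows(metadata_rows, fmt)
    if bad:
        raise ValueError(f"existing metadata in rows {bad} does not fit format {fmt!r}")
    return fmt


rows = [
    b"",                                    # no metadata: nothing to check
    struct.pack(STRUCT_FORMAT, 5, False, 1),  # already in the binary layout
    b'{"ancestor_data_id": 1}',             # JSON bytes, as in the report
]
print(incompatible_rows(rows))  # [2]
```

A length check like this is only a necessary condition: bytes of the right length could still come from a different encoding, so a stricter variant might also require the previous schema to be empty.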
I think this might actually be a tskit problem, but I stumbled across it in a pyslim context so I thought I'd report here first, and we can escalate to tskit if needed.
The gist is that if a tree sequence comes with pre-existing metadata (of a certain format), it seems like this metadata can lead to corrupted metadata after annotation. I came across this in a context of a tree sequence output from tsinfer, in which some nodes have metadata of the format b'{"ancestor_data_id": 1}'.
For testing purposes we'll create such a tree sequence "artificially":
This results in the expected node metadata, which matches what some nodes look like after a tree sequence is inferred by tsinfer:
Once we annotate these tables, the metadata for sample nodes is replaced in the way SLiM wants it to be:
Non-sample nodes, however, carry erroneous metadata that doesn't seem to make much sense:
I've tracked this through the pyslim.annotate code, and ended up finding that it appears to happen once a metadata schema is set for nodes with metadata formatted in this way:
This is what makes me think this might actually be a tskit issue, since just setting the metadata schema leads to the corruption, which doesn't seem like something that pyslim has a ton to do with. However, since metadata for non-sample nodes are just passed through pyslim.annotate, these malformed metadata end up in the annotated tree sequence if the tree sequence happened to contain non-sample nodes with metadata of this format.
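One plausible reading of the report can be illustrated with the standard library alone: if bytes written as JSON are later read under a schema that describes a fixed-size binary record, decoding succeeds but the values are meaningless. The sketch below assumes such a binary layout; the format string and the field names are assumptions made for illustration, not taken from the tskit or pyslim source.

```python
import json
import struct

# Node metadata in the format described in the report.
raw = b'{"ancestor_data_id": 1}'

# Read as JSON, the bytes decode to the intended dictionary.
as_json = json.loads(raw)

# Assumed binary layout, for illustration only:
# little-endian int64, bool, uint8 (10 bytes in total).
ASSUMED_FORMAT = "<q?B"

# unpack_from reads the first 10 bytes and ignores the rest, so the
# JSON text is silently reinterpreted as three unrelated values.
int_field, bool_field, byte_field = struct.unpack_from(ASSUMED_FORMAT, raw)

print(as_json)
print(int_field, bool_field, byte_field)
```

Nothing in this sketch raises an error, which would be consistent with the corruption only becoming visible when the annotated metadata is inspected.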