Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Annotating tree sequence with pre-existing (node) metadata can lead to erroneous metadata #327

Open
clwgg opened this issue Aug 22, 2023 · 1 comment

Comments

@clwgg
Copy link

clwgg commented Aug 22, 2023

I think this might actually be a tskit problem, but I stumbled across it in a pyslim context so I thought I'd report here first, and we can escalate to tskit if needed.

The gist is that if a tree sequence comes with pre-existing metadata (of a certain format), it seems like this metadata can lead to corrupted metadata after annotation. I came across this in a context of a tree sequence output from tsinfer, in which some nodes have metadata of the format b'{"ancestor_data_id": 1}'.

For testing purposes we'll create such a tree sequence "artificially":

import numpy as np
import tskit
import msprime
import pyslim

ts = msprime.sim_ancestry(
    samples=10,
    population_size=10,
    sequence_length=1e6,
    recombination_rate=1e-6,
)

tables = ts.dump_tables()
nms = tables.nodes.metadata_schema
tables.nodes.packset_metadata([
    nms.validate_and_encode_row(b'{"ancestor_data_id": 1}')
    for _ in ts.nodes()
])
ts = tables.tree_sequence()

This results in the expected node metadata, which matches what some nodes look like after a tree sequence is inferred by tsinfer:

tables = ts.dump_tables()
tables.nodes[0]
NodeTableRow(flags=1, time=0.0, population=0, individual=0, metadata=b'{"ancestor_data_id": 1}')
tables.nodes[-1]
NodeTableRow(flags=0, time=80.59032096595365, population=0, individual=-1, metadata=b'{"ancestor_data_id": 1}')

Once we annotate these tables, the metadata for sample nodes is replaced in the way SLiM wants it to be:

pyslim.annotate_tables(tables, model_type="WF", tick=1)
tables.nodes[0]
NodeTableRow(flags=1, time=0.0, population=0, individual=0, metadata={'slim_id': 0, 'is_null': False, 'genome_type': 0})

Non-sample nodes, however, carry erroneous metadata that doesn't seem to make much sense:

tables.nodes[-1]
NodeTableRow(flags=0, time=80.59032096595365, population=0, individual=-1, metadata={'slim_id': 8391162008449393275, 'is_null': True, 'genome_type': 114})

I've tracked this through the pyslim.annotate code, and ended up finding that it appears to happen once a metadata schema is set for nodes with metadata formatted in this way:

tables = ts.dump_tables()
tables.nodes[-1]
NodeTableRow(flags=0, time=80.59032096595365, population=0, individual=-1, metadata=b'{"ancestor_data_id": 1}')
tables.nodes.metadata_schema = pyslim.slim_metadata_schemas['node']
tables.nodes[-1]
NodeTableRow(flags=0, time=80.59032096595365, population=0, individual=-1, metadata={'slim_id': 8391162008449393275, 'is_null': True, 'genome_type': 114})

This is what makes me think this might actually be a tskit issue, since just setting the metadata schema leads to the corruption, which doesn't seem like something that pyslim has a ton to do with. However, since metadata for non-sample nodes are just passed through pyslim.annotate, these malformed metadata end up in the annotated tree sequence if the tree sequence happened to contain non-sample nodes with metadata of this format.

@petrelharp
Copy link
Contributor

Gee, thanks for the report here, @clwgg - #317 is looking more urgent all the time.

I can't look at this until next week, but will have a look then. But, it sounds like changing the schema isn't doing consistency checking, which it probably should, to avoid this sort of thing (unless we decide that "changing a schema" is a user-beware sort of operation, in which case pyslim should be checking for existing metadata).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants