Annotating tree sequence with pre-existing (node) metadata can lead to erroneous metadata #327

clwgg · 2023-08-22T22:43:59Z

I think this might actually be a tskit problem, but I stumbled across it in a pyslim context so I thought I'd report here first, and we can escalate to tskit if needed.

The gist is that if a tree sequence comes with pre-existing metadata (of a certain format), it seems like this metadata can lead to corrupted metadata after annotation. I came across this in a context of a tree sequence output from tsinfer, in which some nodes have metadata of the format b'{"ancestor_data_id": 1}'.

For testing purposes we'll create such a tree sequence "artificially":

import numpy as np
import tskit
import msprime
import pyslim

ts = msprime.sim_ancestry(
    samples=10,
    population_size=10,
    sequence_length=1e6,
    recombination_rate=1e-6,
)

tables = ts.dump_tables()
nms = tables.nodes.metadata_schema
tables.nodes.packset_metadata([
    nms.validate_and_encode_row(b'{"ancestor_data_id": 1}')
    for _ in ts.nodes()
])
ts = tables.tree_sequence()

This results in the expected node metadata, which matches what some nodes look like after a tree sequence is inferred by tsinfer:

tables = ts.dump_tables()
tables.nodes[0]

NodeTableRow(flags=1, time=0.0, population=0, individual=0, metadata=b'{"ancestor_data_id": 1}')

tables.nodes[-1]

NodeTableRow(flags=0, time=80.59032096595365, population=0, individual=-1, metadata=b'{"ancestor_data_id": 1}')

Once we annotate these tables, the metadata for sample nodes is replaced in the way SLiM wants it to be:

pyslim.annotate_tables(tables, model_type="WF", tick=1)
tables.nodes[0]

NodeTableRow(flags=1, time=0.0, population=0, individual=0, metadata={'slim_id': 0, 'is_null': False, 'genome_type': 0})

Non-sample nodes, however, carry erroneous metadata that doesn't seem to make much sense:

tables.nodes[-1]

NodeTableRow(flags=0, time=80.59032096595365, population=0, individual=-1, metadata={'slim_id': 8391162008449393275, 'is_null': True, 'genome_type': 114})

I've tracked this through the pyslim.annotate code, and ended up finding that it appears to happen once a metadata schema is set for nodes with metadata formatted in this way:

tables = ts.dump_tables()
tables.nodes[-1]

NodeTableRow(flags=0, time=80.59032096595365, population=0, individual=-1, metadata=b'{"ancestor_data_id": 1}')

tables.nodes.metadata_schema = pyslim.slim_metadata_schemas['node']
tables.nodes[-1]

NodeTableRow(flags=0, time=80.59032096595365, population=0, individual=-1, metadata={'slim_id': 8391162008449393275, 'is_null': True, 'genome_type': 114})

This is what makes me think this might actually be a tskit issue, since just setting the metadata schema leads to the corruption, which doesn't seem like something that pyslim has a ton to do with. However, since metadata for non-sample nodes are just passed through pyslim.annotate, these malformed metadata end up in the annotated tree sequence if the tree sequence happened to contain non-sample nodes with metadata of this format.

The text was updated successfully, but these errors were encountered:

petrelharp · 2023-08-23T22:53:40Z

Gee, thanks for the report here, @clwgg - #317 is looking more urgent all the time.

I can't look at this until next week, but will have a look then. But, it sounds like changing the schema isn't doing consistency checking, which it probably should, to avoid this sort of thing (unless we decide that "changing a schema" is a user-beware sort of operation, in which case pyslim should be checking for existing metadata).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Annotating tree sequence with pre-existing (node) metadata can lead to erroneous metadata #327

Annotating tree sequence with pre-existing (node) metadata can lead to erroneous metadata #327

clwgg commented Aug 22, 2023

petrelharp commented Aug 23, 2023

Annotating tree sequence with pre-existing (node) metadata can lead to erroneous metadata #327

Annotating tree sequence with pre-existing (node) metadata can lead to erroneous metadata #327

Comments

clwgg commented Aug 22, 2023

petrelharp commented Aug 23, 2023