Links diff: links not shown as changed #19

Open
Mr0grog opened this issue Jul 1, 2020 · 2 comments

Comments

@Mr0grog
Member

Mr0grog commented Jul 1, 2020

In this diff: https://monitoring.envirodatagov.org/page/1de9a11d-330b-4a87-9926-6c6357b6f668/36442e96-71da-4a54-899b-e3c193e5d5fd..5ca0de1e-c4c9-472b-8f66-83cbf18069c8

The two links near the top appear to have been removed and added because their text changed enough that a different link moved from before them in the list to after them.

[Screenshot: 2020-06-30 at 5:41:54 PM]

Ideally, we should find a way for these links to correctly show that their text was changed, rather than that the links were added or removed.

I think this probably requires some deep rethinking of how we diff links. Right now, we create a list of links for each version, diff the two lists on “rough similarity,” and then diff the internals of similar links (these are the links that show up as changed, rather than added or removed). The problem is that we are diffing lists, where order matters, when the links diff is really about diffing sets.
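To illustrate the ordering problem, here is a minimal sketch using Python's standard `difflib` rather than our actual differ. It assumes the link lists are sorted by text (which the behavior above suggests) and uses the URL as a stand-in for the "rough similarity" key. The `ordered_diff` helper and the example URLs are made up for illustration.

```python
from difflib import SequenceMatcher


def ordered_diff(old_keys, new_keys):
    """Diff two ordered lists of keys; return (removed, added)."""
    matcher = SequenceMatcher(None, old_keys, new_keys, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(old_keys[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(new_keys[j1:j2])
    return removed, added


# Old page, sorted by text: "Climate" (/climate), "Water data" (/water).
# New page: "Water data" became "About water data", so it now sorts first.
old_keys = ["/climate", "/water"]
new_keys = ["/water", "/climate"]

removed, added = ordered_diff(old_keys, new_keys)
# An ordered diff can only keep one of the two links in place, so the
# other is reported as both removed and added, even though both links
# exist in both versions.
print("removed:", removed)
print("added:", added)
```

Both links are present in both versions, but because they swapped positions, a sequence diff has to report one of them as a removal plus an addition.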

We should probably start by grouping roughly similar links from the old and new lists together, tagging each link with sorting properties that come exclusively from either the old or the new version of the link (so both lists sort the same). Then do the diffing.
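A rough sketch of what that grouping could look like, under some assumptions: the `Link` type, the 50/50 text/URL weighting in `similarity`, the `0.5` threshold, and the greedy best-first pairing are all hypothetical choices for illustration, not what the differ currently does.

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class Link(NamedTuple):
    text: str
    href: str


def similarity(a: Link, b: Link) -> float:
    """Hypothetical rough similarity: half text likeness, half exact href match."""
    text_score = SequenceMatcher(None, a.text.lower(), b.text.lower()).ratio()
    href_score = 1.0 if a.href == b.href else 0.0
    return 0.5 * text_score + 0.5 * href_score


def pair_links(old, new, threshold=0.5):
    """Greedily pair each old link with its most similar new link, ignoring order."""
    scored = sorted(
        (-similarity(o, n), i, j)
        for i, o in enumerate(old)
        for j, n in enumerate(new)
    )
    used_old, used_new, pairs = set(), set(), []
    for neg_score, i, j in scored:
        if -neg_score < threshold:
            break  # remaining candidates are too dissimilar to pair
        if i not in used_old and j not in used_new:
            used_old.add(i)
            used_new.add(j)
            pairs.append((old[i], new[j]))
    # Anything left unpaired exists in only one version.
    pairs += [(o, None) for i, o in enumerate(old) if i not in used_old]
    pairs += [(None, n) for j, n in enumerate(new) if j not in used_new]
    return pairs


def diff_links(old, new):
    """Return (status, old_link, new_link) rows in one order shared by both versions."""
    def sort_key(pair):
        o, n = pair
        anchor = n if n is not None else o  # a single version per pair drives sorting
        return (anchor.text.lower(), anchor.href)

    rows = []
    for o, n in sorted(pair_links(old, new), key=sort_key):
        if o is None:
            status = "added"
        elif n is None:
            status = "removed"
        else:
            status = "unchanged" if o == n else "changed"
        rows.append((status, o, n))
    return rows


old = [Link("Climate", "/climate"), Link("Water data", "/water")]
new = [Link("About water data", "/water"), Link("Climate", "/climate")]
for status, o, n in diff_links(old, new):
    print(status, o, n)
```

Because pairing happens before sorting, a link whose text changed stays paired with its old version and shows up as changed, wherever it lands in the sorted order. Greedy pairing is only one option; an optimal assignment would be more robust but is more work.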

@Mr0grog
Member Author

Mr0grog commented Oct 16, 2020

Here’s another example, simpler (only one change) and much more obvious (because the text changed so much): https://monitoring.envirodatagov.org/page/f2e7706b-59ab-4261-b9e3-d50eef8562d3/c4fbc0d7-7b86-4e62-82b4-07d37433ded8..db1a8fb6-7a76-4a81-b962-c4f08c4807b7

[Screenshot: 2020-10-16 at 10:55:32 AM]

[Screenshot: 2020-10-16 at 10:55:43 AM]

@Mr0grog
Member Author

Mr0grog commented Oct 16, 2020

Additional thought: when grouping links, we should probably look at text + some canonicalized form of the URL (e.g. SURT) instead of text + URL.
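For a sense of what a canonicalized URL key might look like, here is a simplified, SURT-like sketch using only the standard library. It is an approximation I am assuming for illustration (the `rough_url_key` name and its exact rules are hypothetical); real SURT canonicalization has more rules, and a dedicated implementation would be the better choice in practice.

```python
from urllib.parse import urlsplit


def rough_url_key(url: str) -> str:
    """Build a loosely SURT-like grouping key (simplified, not the real SURT rules)."""
    parts = urlsplit(url.strip())
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    if host.startswith("www."):
        host = host[4:]
    # Reverse the host labels: "epa.gov" -> "gov,epa"
    reversed_host = ",".join(reversed(host.split("."))) if host else ""
    # Ignore trailing slashes and path case (lossy, but fine for rough grouping).
    path = (parts.path.rstrip("/") or "/").lower()
    key = f"{reversed_host}){path}"
    if parts.query:
        # Sort query parameters so their order does not affect the key.
        key += "?" + "&".join(sorted(parts.query.split("&")))
    return key


print(rough_url_key("http://www.EPA.gov/Climate/"))
print(rough_url_key("https://epa.gov/climate"))
```

With a key like this, links that differ only by scheme, `www.` prefix, host or path case, or a trailing slash would group together.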

@Mr0grog Mr0grog transferred this issue from edgi-govdata-archiving/web-monitoring-processing Oct 26, 2020
@stale stale bot added the stale label Jun 2, 2021
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jun 4, 2021
@stale stale bot removed the stale label Jun 4, 2021