In edgi-govdata-archiving/web-monitoring-processing#401, we added customized URL comparisons for the links and images in our diffs that help resolve issues around things like archive-relative URLs and session IDs or other transient information in URLs.
Unfortunately, we mixed two concepts together there:
1. URLs that need special comparisons, i.e. you need to know both sides of the comparison in order to do it. The Wayback comparison is like this: it only kicks in if both URLs are Wayback URLs.
2. URLs that need to be compared according to a canonical or normalized form, i.e. you only need to know one side to do the comparison. The servlet session IDs are like this: we just want to remove the session ID from the URL, regardless of what’s in the URL we are comparing to.
The difference is important because two rules of the first kind can’t be combined, but two rules of the second kind can, and a rule of the second kind can also be combined with one of the first. E.g. you wouldn’t compare the following two URLs as the same:
But you would want to combine the session ID rule with the Wayback rule so these are the same:
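As a rough illustration of the distinction, here is a minimal sketch of one rule of each kind and how they might combine. The function names, the regular expressions, and the `example.com` URLs are all assumptions made for this example; they are not taken from the existing code.

```python
import re
from typing import Optional

# Assumed shape of a Wayback URL: a numeric timestamp, an optional
# two-letter modifier such as "im_", then the archived URL.
WAYBACK_PATTERN = re.compile(
    r'^https?://web\.archive\.org/web/\d+(?:[a-z]{2}_)?/(.+)$')
# Servlet session IDs show up as a ";jsessionid=..." path parameter.
JSESSIONID_PATTERN = re.compile(r';jsessionid=[^?#]*', re.IGNORECASE)


def strip_session_id(url: str) -> str:
    """Normalization rule: only needs to see one URL."""
    return JSESSIONID_PATTERN.sub('', url)


def compare_wayback(a: str, b: str) -> Optional[bool]:
    """Pairwise rule: only applies when BOTH URLs are Wayback URLs.

    Returns None when the rule does not apply, so the caller can fall
    back to a plain comparison.
    """
    match_a = WAYBACK_PATTERN.match(a)
    match_b = WAYBACK_PATTERN.match(b)
    if match_a and match_b:
        return match_a.group(1) == match_b.group(1)
    return None


def urls_match(a: str, b: str) -> bool:
    # Normalize each side independently, then try the pairwise rule.
    a, b = strip_session_id(a), strip_session_id(b)
    result = compare_wayback(a, b)
    return a == b if result is None else result
```

With these hypothetical rules, two Wayback captures of the same page that differ only in timestamp and session ID compare as equal, while a Wayback URL and its live counterpart do not, because the pairwise rule needs both sides to be Wayback URLs.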
I don’t think this needs to change the public API (set list of rules to use via the `url_rules` parameter), but under the hood, we should be able to treat these differently. As a bonus, normalized URLs are something we can cache to speed up comparisons a bit.
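One possible shape for this, sketched under assumptions: a single list of rule names is split internally into normalizers and pairwise comparators, and the normalization step is memoized with `functools.lru_cache`. The registries, the rule names `'jsessionid'` and `'wayback'`, and `make_url_matcher` are hypothetical and only stand in for whatever `url_rules` accepts today.

```python
import re
from functools import lru_cache
from typing import Callable, Dict, Optional, Sequence, Tuple

_WAYBACK = re.compile(r'^https?://web\.archive\.org/web/\d+(?:[a-z]{2}_)?/(.+)$')


def _strip_session_id(url: str) -> str:
    return re.sub(r';jsessionid=[^?#]*', '', url, flags=re.IGNORECASE)


def _compare_wayback(a: str, b: str) -> Optional[bool]:
    match_a, match_b = _WAYBACK.match(a), _WAYBACK.match(b)
    if match_a and match_b:
        return match_a.group(1) == match_b.group(1)
    return None  # rule does not apply unless both sides are Wayback URLs


# Hypothetical registries: the caller still passes one flat list of names.
NORMALIZERS: Dict[str, Callable[[str], str]] = {
    'jsessionid': _strip_session_id,
}
COMPARATORS: Dict[str, Callable[[str, str], Optional[bool]]] = {
    'wayback': _compare_wayback,
}


def make_url_matcher(
    url_rules: Sequence[str],
) -> Tuple[Callable[[str, str], bool], Callable[[str], str]]:
    normalizers = [NORMALIZERS[name] for name in url_rules if name in NORMALIZERS]
    comparators = [COMPARATORS[name] for name in url_rules if name in COMPARATORS]

    @lru_cache(maxsize=4096)
    def normalize(url: str) -> str:
        # Depends on one URL only, so the result is safe to cache.
        for rule in normalizers:
            url = rule(url)
        return url

    def matches(a: str, b: str) -> bool:
        a, b = normalize(a), normalize(b)
        # The first pairwise rule that applies decides the outcome.
        for compare in comparators:
            result = compare(a, b)
            if result is not None:
                return result
        return a == b

    return matches, normalize
```

Calling `make_url_matcher(['jsessionid', 'wayback'])` returns the comparison function along with the cached normalizer; the latter is returned only so that `normalize.cache_info()` can be inspected in this sketch. Pairwise results are not cached here because they depend on both URLs.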