HTML diff should tokenize on some punctuation #6
Labels
enhancement
New feature or request
experiment
Experimental changes to a diff that need lots of testing and may or may not work out well
never-stale
This FTP diffing problem made me realize we should probably be splitting tokens in the HTML diff on periods (and maybe other punctuation?), not just on whitespace.
(Of course we don’t really want to use this differ on FTP listings, but that’s a different matter.)
This requires some care, though: unlike whitespace, we probably want to treat the periods as tokens themselves, in case they change. We’ve also talked about this before in terms of general punctuation handling. It would be really useful not only to split this way, but also to tag and count punctuation changes separately from other changes. We might not prioritize a punctuation change for analysts to look at the way we do a word change, and it would be nice to call out clearly when a change was merely in punctuation.
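To make the idea concrete, here is a minimal sketch of what this could look like. It is not the differ's actual tokenizer: the names `TOKEN_PATTERN`, `tokenize`, `is_punctuation`, and `count_changes` are hypothetical, and it assumes plain text input (no HTML tags) and uses the standard library's `difflib` as a stand-in for the real diff algorithm.

```python
import re
from difflib import SequenceMatcher

# Assumption: a token is either a run of word characters or ONE character
# that is neither a word character nor whitespace (i.e. punctuation).
# Whitespace separates tokens but is never emitted as a token itself.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")
PUNCTUATION_PATTERN = re.compile(r"[^\w\s]")


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def is_punctuation(token):
    """True if the token is a single punctuation character."""
    return PUNCTUATION_PATTERN.fullmatch(token) is not None


def count_changes(old_text, new_text):
    """Count changed tokens, tallying punctuation separately from words."""
    old_tokens, new_tokens = tokenize(old_text), tokenize(new_text)
    counts = {"punctuation": 0, "word": 0}
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Every removed or inserted token counts as one change.
        for token in old_tokens[i1:i2] + new_tokens[j1:j2]:
            kind = "punctuation" if is_punctuation(token) else "word"
            counts[kind] += 1
    return counts
```

With this sketch, a name like `file.txt` becomes three tokens (`file`, `.`, `txt`), so a change to the extension no longer marks the whole name as changed, and a diff whose `word` count is zero could be flagged as punctuation-only.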
There are also punctuation changes we might want to treat specially and even suppress in many cases. For example, changing ’ to ' (apostrophe to prime) is a change we’ve seen before, and not one we generally care about.
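One possible way to suppress such changes is to normalize interchangeable punctuation before comparing. The sketch below is only an illustration: the `EQUIVALENT_PUNCTUATION` table and both function names are hypothetical, and which characters should really count as equivalent is an open question for this issue.

```python
# Assumption: these quote-like characters are interchangeable for analysts.
# The table is illustrative, not an agreed-upon list.
EQUIVALENT_PUNCTUATION = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (typographic apostrophe)
    "\u2032": "'",  # prime
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_punctuation(text):
    """Map interchangeable punctuation characters to one canonical form."""
    return text.translate(EQUIVALENT_PUNCTUATION)


def is_suppressible_change(old_text, new_text):
    """True if the texts differ only in interchangeable punctuation."""
    if old_text == new_text:
        return False  # nothing changed, so there is nothing to suppress
    return normalize_punctuation(old_text) == normalize_punctuation(new_text)
```

For example, `is_suppressible_change("we’ve", "we've")` is true, while a change from `we’ve` to `we have` is not suppressed. Normalizing only for the comparison keeps the original characters available for display.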