UTF-8 replacement issues #16

ksandvik · 2021-03-26T18:51:23Z

Are the search regex strings and string variables configured for UTF-8? I can see cases where the right Markdown file is not picked up by the forward linker:

-rw-r--r--@ 1 ksandvik  CORP\Domain Users   56 Mar 21 14:56 KSDict/སྟོང་པ.md

(master) ~> python3 obs-linkr.py /Volumes/Work/Tibworkspace/ -r
Empty alias (will be ignored): aliases
----------------------
linked རྣམ་པར་ཤེས་པ
linked སྟོང
linked རྣམས
----------------------
linked text copied to clipboard

(master) ~> python3 --version
Python 3.9.2

Clipboard contents:

རྣམ་པར་ཤེས་པ་རྣམས་སྟོང་པའོ།།

Note that the word is part of the grammatical construct སྟོང་པའོ, where འོ is a grammatical ending (genitive).
Note that the regex didn't pick up སྟོང་པ.md but rather སྟོང. This is Tibetan, but it's standard UTF-8. Worst case, it's just a python3 regex UTF-8 bug that doesn't properly handle rune combinations.

perkinsben · 2021-03-26T19:11:26Z

I have UTF-8 handling for text content (ie. from the clipboard), but not explicitly for filenames, so it doesn't surprise me that a filename with extended characters might cause some problems. I'll do some testing this weekend to track down where UTF-8 needs to be handled for the filenames.

ksandvik · 2021-03-27T00:32:10Z

Ok thx, yes it would help with parsing .md files that are in UTF-8. Let me know if you need a simple Obsidian project with a few UTF-8 files.

ksandvik · 2021-03-28T21:50:12Z

UTF8-marking-test.zip
Enclosed is a small Obsidian vault folder that shows the problem. The UTF-8 test.md file has an example to copy to the clipboard to see whether the specific UTF-8 string (here Tibetan) is matched. I uncommented the print line that lists the files being processed, and all the needed files are parsed, so I suspect the problem is in the matching code and whether it handles UTF-8. I'm no Python expert, but hopefully this helps narrow down where UTF-8 is not handled correctly.

ksandvik · 2021-03-30T06:23:33Z

I looked at the code around
matches = re.finditer('(?<!([\[\w\|]))' + re.escape(title.lower()) + '(?!([\|\]\w]))', txt.lower())

and I think it does not match certain UTF-8 strings passed into link_title. See testdiff.txt and result.txt

With the previous Obsidian files, when the loop passes the page_title སྟོང་པ into def link_title(title, txt):
no match is found even though the txt passed in contains one. See the first section: གཟུགས་སྟོང་པར། where སྟོང་པར། should match སྟོང་པ

original updated_txt གཟུགས་སྟོང་པར། [[སྟོང་པ་ཉིད]]་ཀྱང་གཟུགས་སོ།
གཟུགས་ལས་ཀྱང་[[སྟོང་པ་ཉིད]]་གཞན་མ་ཡིན་ནོ། [[སྟོང་པ་ཉིད]]་ལས་ཀྱང་གཟུགས་གཞན་མ་ཡིན་ནོ།
དེ་བཞིན་དུ་ཚོར་བ་དང་། འདུ་ཤེས་དང་། འདུ་བྱེད་དང་། རྣམ་པར་ཤེས་པ་རྣམས་སྟོང་པའོ།།
ཤཱ་རིའི་བུ་དེ་ལྟ་བས་ན་ཆོས་ཐམས་ཅད་[[སྟོང་པ་ཉིད]]་དེ། མཚན་མ་མེད་པ། མ་སྐྱེས་པ། མ་འགགས་པ། དྲི་མ་དང་བྲལ་པ་མེད་པ། བྲི་བ་མེད་པ། གང་བ་མེད་པའོ།།

and then later སྟོང is matched.

result.txt
testdiff.txt

Not sure about the regex. I'd hope it makes no assumptions about whitespace or word separators, as those are missing in many Asian languages (i.e. white space is not used to separate words in Chinese, Hindi, Japanese and here Tibetan). The regex should also be able to handle UTF-8 one-byte (roman) as well as two- and three-byte characters/runes.

The lower() and upper() functions handle UTF-8 fine.
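A minimal sketch of why the title fails to match, assuming a simplified form of the lookarounds from the line quoted above (the helper name find_title is hypothetical and not part of obs-linkr). Python's \w matches Unicode letters, so a title that is immediately followed by another Tibetan letter is rejected, while one followed by the tsheg ་ is accepted:

```python
import re

def find_title(title, txt):
    """Return the spans where title matches under the \\w lookarounds."""
    # Simplified version of the quoted pattern (assumption: the extra
    # capture groups in the original do not change what is matched).
    pattern = r'(?<![\[\w|])' + re.escape(title.lower()) + r'(?![|\]\w])'
    return [m.span() for m in re.finditer(pattern, txt.lower())]

txt = 'གཟུགས་སྟོང་པར།'

# Rejected: the character after the title is the letter ར, which \w matches.
print(find_title('སྟོང་པ', txt))

# Accepted: the character after the title is the tsheg ་, which is punctuation.
print(find_title('སྟོང', txt))
```

So the strings themselves appear to be decoded correctly; it is the word-boundary rule that decides which title is linked.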

ksandvik · 2021-03-30T21:00:11Z

There's something going on with the matching even with English:

markdown files: proud.md, summer.md
text block:
summer's end
summers' end
proudly displayed

result:
[[summer]]'s end
summers' end
proudly displayed

Note that it misses one of the summer.md cases and can't match proud inside proudly. Was this designed around whitespace matching? I think in Unicode the Tibetan dot ་ is defined as a syllable/word separator, same as whitespace. It would be nice if the marking were done by UTF-8 character/byte codes rather than with assumptions about where a word ends. Maybe it's something the Python regex library assumes by default.
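One way to check how Python classifies these characters is to look at their Unicode category and test them against \w directly. This is only an inspection sketch (the helper describe is hypothetical):

```python
import re
import unicodedata

def describe(ch):
    """Return (Unicode general category, whether Python's \\w matches ch)."""
    return unicodedata.category(ch), re.fullmatch(r'\w', ch) is not None

# Tsheg, space, apostrophe, a Latin letter, and the Tibetan letter RA.
for ch in ['\u0f0b', ' ', "'", 's', '\u0f62']:
    print(repr(ch), unicodedata.name(ch), describe(ch))
```

The tsheg is reported as punctuation rather than whitespace, but like a space or an apostrophe it is not matched by \w, so it acts as a boundary. Letters, Latin or Tibetan, are matched by \w and therefore block a match when they directly touch the title.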

perkinsben · 2021-03-30T21:46:58Z

Yes, that was done intentionally using the regex \w that ensures a word boundary surrounds the match. That could be space, quote, etc., but not another letter (ie. it won't match a word that is part of another word).
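The English case can be reproduced with a small stand-in for the linker. This is a sketch only: link_title_sketch is a hypothetical helper using a simplified form of the pattern, not the actual obs-linkr function.

```python
import re

def link_title_sketch(title, txt):
    """Wrap whole-word occurrences of title in [[...]]."""
    # A match is rejected when it touches a word character, '[', '|' or ']'.
    pattern = r'(?<![\[\w|])' + re.escape(title) + r'(?![|\]\w])'
    return re.sub(pattern, lambda m: '[[' + m.group(0) + ']]', txt,
                  flags=re.IGNORECASE)

text = "summer's end\nsummers' end\nproudly displayed"
for title in ['proud', 'summer']:
    text = link_title_sketch(title, text)
print(text)
```

Only "summer" before the apostrophe is linked. In "summers'" the title is followed by the letter s, and in "proudly" by the letter l, so both are treated as parts of longer words.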

ksandvik · 2021-03-31T22:45:07Z

Any chance there could be another mode, enabled via an option such as -c, that matches character by character? It does not need to be the default. It would help with language matching for me, and there might be similar uses for Japanese, Chinese and so on.

ksandvik · 2021-04-05T18:58:13Z

I replaced the \w with * and it works well for my intended purposes.
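For anyone wanting the same behavior as an opt-in mode, here is a rough sketch of how the word-boundary check could be made switchable. The whole_word parameter and both helper names are hypothetical; nothing like this is confirmed to exist in obs-linkr.

```python
import re

def build_pattern(title, whole_word=True):
    """Build the match pattern; drop the \\w guard for character-level matching."""
    guard = r'\w' if whole_word else ''
    before = r'(?<![\[|' + guard + r'])'  # not directly after '[' or '|' (or a word char)
    after = r'(?![|\]' + guard + r'])'    # not directly before '|' or ']' (or a word char)
    return before + re.escape(title) + after

def link(title, txt, whole_word=True):
    """Wrap matches of title in [[...]]."""
    return re.sub(build_pattern(title, whole_word),
                  lambda m: '[[' + m.group(0) + ']]', txt, flags=re.IGNORECASE)

print(link('proud', 'proudly displayed'))                    # unchanged
print(link('proud', 'proudly displayed', whole_word=False))  # links inside the word
print(link('སྟོང་པ', 'གཟུགས་སྟོང་པར། [[སྟོང་པ་ཉིད]]', whole_word=False))
```

The bracket and pipe guards are kept in both modes so that a title sitting at the start of an existing [[link]] is not wrapped again. They only inspect the adjacent character, so a title that occurs in the middle of an existing link could still be wrapped.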

perkinsben · 2021-04-05T19:25:38Z

Good to hear. Apologies for being absent, I was starting a new job.