UTF-8 replacement issues #16

Open
ksandvik opened this issue Mar 26, 2021 · 9 comments

@ksandvik

ksandvik commented Mar 26, 2021

Are the search regex strings and string variables configured for UTF-8? I'm seeing cases where the right Markdown file is not picked up by the forward linker:

-rw-r--r--@ 1 ksandvik  CORP\Domain Users   56 Mar 21 14:56 KSDict/སྟོང་པ.md

(master) ~> python3 obs-linkr.py /Volumes/Work/Tibworkspace/ -r
Empty alias (will be ignored): aliases
----------------------
linked རྣམ་པར་ཤེས་པ
linked སྟོང
linked རྣམས
----------------------
linked text copied to clipboard

(master) ~> python3 --version
Python 3.9.2

Clipboard contents:

རྣམ་པར་ཤེས་པ་རྣམས་སྟོང་པའོ།།

Note that the word is part of the grammatical construct སྟོང་པའོ, where འོ is a grammatical ending (genitive).
Note that the regex didn't pick up སྟོང་པ.md but rather སྟོང. This is Tibetan, but it's standard UTF-8. Worst case, it's a python3 regex UTF-8 bug that doesn't properly handle rune combinations.

@perkinsben
Owner

I have UTF-8 handling for text content (i.e. from the clipboard), but not explicitly for filenames, so it doesn't surprise me that a filename with extended characters might cause some problems. I'll do some testing this weekend to track down where UTF-8 needs to be handled for the filenames.

@ksandvik
Author

Ok thx, yes it would help with parsing .md files that are in UTF-8. Let me know if you need a simple Obsidian project with a few UTF-8 files.

@ksandvik
Author

ksandvik commented Mar 28, 2021

UTF8-marking-test.zip
Enclosed is a small Obsidian vault folder that shows the problem. The file UTF-8 test.md contains an example to copy to the clipboard to check whether the specific UTF-8 string (here Tibetan) is matched. I uncommented the print line that reports the files being processed, and all the needed files are parsed, so I suspect the problem is in the matching code rather than in file discovery. I'm no Python expert, but hopefully this helps narrow down where UTF-8 is not handled correctly.

@ksandvik
Author

ksandvik commented Mar 30, 2021

I looked at the code around
matches = re.finditer('(?<!([[\w|]))' + re.escape(title.lower()) + '(?!([|]\w]))', txt.lower())

and I think it does not match certain UTF-8 strings passed into link_title. See testdiff.txt and result.txt

With the previous Obsidian files, when iterating over the page titles, the page_title སྟོང་པ is passed into def link_title(title, txt): but no match is found even though the txt passed in contains it. See the first section, གཟུགས་སྟོང་པར།, where སྟོང་པར། should match སྟོང་པ:

original updated_txt གཟུགས་སྟོང་པར། [[སྟོང་པ་ཉིད]]་ཀྱང་གཟུགས་སོ།
གཟུགས་ལས་ཀྱང་[[སྟོང་པ་ཉིད]]་གཞན་མ་ཡིན་ནོ། [[སྟོང་པ་ཉིད]]་ལས་ཀྱང་གཟུགས་གཞན་མ་ཡིན་ནོ།
དེ་བཞིན་དུ་ཚོར་བ་དང་། འདུ་ཤེས་དང་། འདུ་བྱེད་དང་། རྣམ་པར་ཤེས་པ་རྣམས་སྟོང་པའོ།།
ཤཱ་རིའི་བུ་དེ་ལྟ་བས་ན་ཆོས་ཐམས་ཅད་[[སྟོང་པ་ཉིད]]་དེ། མཚན་མ་མེད་པ། མ་སྐྱེས་པ། མ་འགགས་པ། དྲི་མ་དང་བྲལ་པ་མེད་པ། བྲི་བ་མེད་པ། གང་བ་མེད་པའོ།།

and then later སྟོང is matched.
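
A minimal sketch of why སྟོང་པ is skipped while སྟོང is linked. It assumes the pattern quoted above was meant to reject a match that touches `[`, `|`, `]` or a `\w` character on either side (the backslashes inside the character classes appear to have been lost in the paste, so the exact original is an assumption). `find_titles` is a hypothetical helper, not a function in obs-linkr.

```python
import re

def find_titles(title, txt):
    """Return the spans of txt that the guarded pattern accepts (hypothetical helper)."""
    # Assumed reconstruction: reject a match adjacent to '[', '|', ']' or a word character.
    pattern = r'(?<![\[\w|])' + re.escape(title.lower()) + r'(?![|\]\w])'
    return [m.group(0) for m in re.finditer(pattern, txt.lower())]

txt = 'གཟུགས་སྟོང་པར།'
# Here the title is followed by the letter ར, which \w matches, so the lookahead rejects it.
print(find_titles('སྟོང་པ', txt))
# Here the title is followed by the tsheg ་, which \w does not match, so it is accepted.
print(find_titles('སྟོང', txt))
```

Under this reading the encoding is handled correctly; the match is blocked because the character after the title counts as a word character.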

result.txt
testdiff.txt

I'm not sure about the regex. Ideally it would make no assumptions about whitespace or word separators, since many Asian languages do not separate words with whitespace (e.g. Chinese, Japanese and, here, Tibetan). It should also handle one-byte UTF-8 characters (Roman) as well as two- and three-byte characters/runes.

The lower() and upper() functions handle UTF-8 fine.

@ksandvik
Author

There's something going on with the matching even with English:

markdown files: proud.md, summer.md
text block:
summer's end
summers' end
proudly displayed

result:
[[summer]]'s end
summers' end
proudly displayed

Note that it misses one of the summer.md cases and can't match proud inside proudly. Was this designed around whitespace-delimited matching? I think Unicode defines the Tibetan dot ་ as a syllable/word separator, similar to whitespace. It would be nice if the matching were done on UTF-8 characters/byte codes rather than on assumptions about where a word ends. Maybe that's something the Python regex library assumes by default.
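
A quick way to check how Python classifies the characters involved, as a sketch. In the Unicode character database the tsheg (U+0F0B) is punctuation rather than whitespace, and Python's `\w` does not match it, while `\w` does match Tibetan letters such as ར and the Latin `l` in proudly. `describe` is a hypothetical helper for illustration only.

```python
import re
import unicodedata

TSHEG = '\u0f0b'  # the Tibetan dot ་

def describe(ch):
    """Report the properties of one character that matter for \\w-based guards."""
    return {
        'category': unicodedata.category(ch),           # Unicode general category
        'is_space': ch.isspace(),                        # treated as whitespace?
        'matches_w': re.fullmatch(r'\w', ch) is not None,  # matched by \w?
    }

print(describe(TSHEG))     # the separator between Tibetan syllables
print(describe('\u0f62'))  # the letter ར that follows སྟོང་པ in སྟོང་པར
print(describe('l'))       # the letter that follows proud in proudly
```

So a title is only linked when the next character is not a letter or digit, which explains both the Tibetan and the English results above.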

@perkinsben
Owner

perkinsben commented Mar 30, 2021 via email

@ksandvik
Author

Any chance there could be another mode, enabled via an option such as -c, that matches character by character? It does not need to be the default. It would help with language matching for me, and there might be similar uses for Japanese, Chinese and so on.
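
One way such a mode could look, as a rough sketch rather than obs-linkr's actual code. The `char_mode` parameter and the name `link_title_sketch` are hypothetical; wiring it to a `-c` command-line flag is left out. In character mode the `\w` guard is dropped, but the guards against `[`, `|` and `]` stay so that text directly inside an existing link delimiter is not wrapped again.

```python
import re

def link_title_sketch(title, txt, char_mode=False):
    """Wrap occurrences of title in [[...]] (hypothetical sketch).

    char_mode=False keeps the word-character guard, so 'proud' is not
    linked inside 'proudly'. char_mode=True matches character by character.
    """
    word = '' if char_mode else r'\w'
    pattern = (r'(?<![\[|' + word + r'])'
               + re.escape(title)
               + r'(?![|\]' + word + r'])')
    # IGNORECASE keeps the original casing of the matched text in the output.
    return re.sub(pattern, lambda m: '[[' + m.group(0) + ']]', txt,
                  flags=re.IGNORECASE)

print(link_title_sketch('སྟོང་པ', 'གཟུགས་སྟོང་པར།'))                  # default mode
print(link_title_sketch('སྟོང་པ', 'གཟུགས་སྟོང་པར།', char_mode=True))  # character mode
```

In character mode a short title such as སྟོང would also match inside longer titles, so processing longer titles first is probably safer. The guards only inspect the single adjacent character, so a title in the middle of an existing link could still be matched.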

@ksandvik
Author

ksandvik commented Apr 5, 2021

I replaced the \w with * and it works well for my intended purposes.

@perkinsben
Owner

Good to hear. Apologies for being absent, I was starting a new job.
