UTF-8 replacement issues #16
Comments
I have UTF-8 handling for text content (i.e. from the clipboard), but not explicitly for filenames, so it doesn't surprise me that a filename with extended characters might cause some problems. I'll do some testing this weekend to track down where UTF-8 needs to be handled for the filenames. |
Ok thx, yes it would help with parsing .md files that are in UTF-8. Let me know if you need a simple Obsidian project with a few UTF-8 files. |
UTF8-marking-test.zip |
I looked through the code, and I think the regex does not match certain UTF-8 strings passed into link_title. See testdiff.txt and result.txt. With the previous Obsidian files, when iterating, the page_title སྟོང་པ is passed into def link_title(title, txt): with the original updated_txt གཟུགས་སྟོང་པར། [[སྟོང་པ་ཉིད]]་ཀྱང་གཟུགས་སོ། and then later སྟོང is matched. I'm not sure about the regex, but I assume it should make no assumptions about whitespace or word separators, as those are missing in many Asian languages (e.g. white space is not used to separate words in Chinese, Hindi, Japanese and here Tibetan). The regex should also be able to handle one-byte UTF-8 characters (Roman) as well as two- and three-byte characters/runes. The lower() and upper() functions handle UTF-8 fine. |
There's something going on with the matching even with English:
markdown files: proud.md, summer.md
text block:
summer's end
summers' end
proudly displayed
result:
[[summer]]'s end
summers' end
proudly displayed
Note that it misses one of the summer.md cases and can't match proud inside proudly. Was this designed with spacing matching, as I think in Unicode the Tibetan dot ་ is defined as a syllable/word separator, same as whitespace. It would be nice if the marking is done by UTF-8 character/byte codes rather than with assumptions of where the word ends. It's maybe something the python regex library assumes by default. |
Yes, that was done intentionally, using the regex \w to ensure that a word
boundary surrounds the match. That could be a space, quote, etc., but not
another letter (i.e. it won't match a word that is part of another word).
…On Tue., Mar. 30, 2021, 6:00 p.m. Kent Sandvik, ***@***.***> wrote:
There's something going on with the matching even with English:
*markdown files:* proud.md, summer.md
*text block:*
summer's end
summers' end
proudly displayed
*result:*
[[summer]]'s end
summers' end
proudly displayed
Note that it misses one of the summer.md cases and can't match proud
inside proudly. Was this designed with spacing matching, as I think in
Unicode the Tibetan dot ་ is defined as a syllable/word separator, same as
whitespace. It would be nice if the marking is done by UTF-8 character/byte
codes rather than with assumptions of where the word ends. It's maybe
something the python regex library assumes by default.
|
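The whole-word behaviour described in this reply can be reproduced with a small sketch. It assumes the matcher wraps the escaped title in `\b` word-boundary assertions; the thread only mentions `\w`, so the exact pattern used by the project may differ, and `link_whole_word` is a hypothetical helper, not the project's `link_title`.

```python
import re

def link_whole_word(title: str, text: str) -> str:
    """Wrap whole-word occurrences of `title` in [[...]] (hypothetical helper)."""
    # \b only succeeds between a word character and a non-word character
    # (or at the start/end of the string), so a title that is immediately
    # followed by another letter is not linked.
    pattern = re.compile(r"\b" + re.escape(title) + r"\b")
    return pattern.sub(lambda m: f"[[{m.group(0)}]]", text)

text = "summer's end\nsummers' end\nproudly displayed"
for title in ("summer", "proud"):
    text = link_whole_word(title, text)
print(text)
```

Under that assumption, only `summer's` is linked: the apostrophe is a non-word character, while the `s` in `summers'` and the `l` in `proudly` are letters, so no boundary follows the title there.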
Any chance there could be another mode, enabled via an option such as -c, that matches character by character? It does not need to be the default. It would help with language matching for me, and there might be other similar uses (Japanese, Chinese and so on). |
I replaced the \w with * and it works well for my intended purposes. |
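One way the requested character-by-character mode could look is sketched below. Everything here is an assumption for illustration: `link_titles` and its `char_mode` parameter are hypothetical names, not part of the project, and the sketch uses `\b` for the default whole-word mode. It also tries longer titles first and leaves text that is already inside `[[...]]` untouched, so a short title is not linked inside an existing longer link.

```python
import re

def link_titles(text, titles, char_mode=False):
    """Wrap titles in [[...]]; hypothetical sketch, not the project's code."""
    # Longest titles first, so a long title wins over a shorter one it contains.
    ordered = sorted(titles, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    if not char_mode:
        # Default mode: require a word boundary on both sides of the title.
        alternatives = r"\b(?:" + alternatives + r")\b"
    # Existing [[...]] links are matched first and copied through unchanged;
    # only matches captured in group 1 are new links.
    pattern = re.compile(r"\[\[.*?\]\]|(" + alternatives + ")")

    def repl(m):
        return f"[[{m.group(1)}]]" if m.group(1) else m.group(0)

    return pattern.sub(repl, text)

print(link_titles("proudly displayed", ["proud"]))                  # whole-word mode
print(link_titles("proudly displayed", ["proud"], char_mode=True))  # character mode
```

With `char_mode=True` the title is matched wherever its characters occur, which is the behaviour asked for here for scripts that do not separate words with spaces. The trade-off is that short titles will also match inside unrelated longer words.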
Good to hear. Apologies for being absent, I was starting a new job. |
Are the search regex strings and string variables configured for UTF-8? I can see cases where the right Markdown file is not picked up by the forward linker:
Clipboard contents:
Note that the word is part of the grammatical construct སྟོང་པའོ, where འོ is an ending that the report describes as genitive.
Note that the regex didn't pick up སྟོང་པ.md, but rather སྟོང. This is Tibetan, but it's standard UTF-8. Worst case, it's just a Python 3 regex UTF-8 bug that does not properly handle rune combinations.
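The behaviour reported here is consistent with whole-word matching rather than a UTF-8 decoding problem. The sketch below assumes the matcher surrounds the title with `\b`, which the thread does not confirm, and `whole_word` is a hypothetical helper. The strings are written as escapes so the code points are unambiguous.

```python
import re
import unicodedata

TSHEG = "\u0f0b"                     # ་  TIBETAN MARK INTERSYLLABIC TSHEG
STONG = "\u0f66\u0f9f\u0f7c\u0f44"   # སྟོང
STONG_PA = STONG + TSHEG + "\u0f54"  # སྟོང་པ
TEXT = STONG_PA + "\u0f60\u0f7c"     # སྟོང་པའོ

def whole_word(title):
    """Compile a whole-word pattern for `title` (hypothetical helper)."""
    return re.compile(r"\b" + re.escape(title) + r"\b")

# The tsheg is punctuation, so a word boundary falls next to it.
print(unicodedata.category(TSHEG))

# སྟོང is followed by the tsheg, so the whole-word pattern matches.
print(bool(whole_word(STONG).search(TEXT)))

# སྟོང་པ is followed by the letter འ, so no boundary follows it and
# the whole-word pattern does not match.
print(bool(whole_word(STONG_PA).search(TEXT)))

# Without the boundary requirement the longer title is found.
print(bool(re.search(re.escape(STONG_PA), TEXT)))
```

In other words, the shorter title matches because a tsheg follows it, while the longer title is rejected because the attached ending starts with a letter.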