-
-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve murmur2 to behave as JS on special chars #522
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Skipped Deployment
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very cool! ❤️ My only concern is if uutf
lib is needed or this can be solved with OCaml stdlib.
packages/murmur2/dune
Outdated
@@ -1,3 +1,4 @@ | |||
(library | |||
(name murmur2) | |||
(libraries uutf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know if it's possible to not add a dependency? Related: https://discuss.ocaml.org/t/decoding-many-unicode-strings-with-uutf/8910/2 and ocaml/ocaml#10710.
Relevant sentence:
And once you can afford 4.14, I suggest you ditch uutf for the Stdlib UTF decoders which do not allocate at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just remove the uutf
adding a few lines of code. Thank you, @jchavarri
@@ -79,6 +79,34 @@ | |||
$ ./compare.sh "z-index: 10" | |||
Hashes match: 7au0g0 | |||
|
|||
Special chars |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked chatgpt for more examples of tests (I don't think the "long strings" one will compile 😓 but still useful):
Adding more targeted tests can ensure that edge cases and potential boundary issues are well-covered. Here are some specific examples of tests to add:
1. Boundary Tests for UTF-16 Code Units
- These tests would ensure the code correctly handles characters around the boundary between single UTF-16 code units and surrogate pairs.
$ ./compare.sh 'content: "\u{FFFF}"' (* Last code point within the BMP *)
Hashes match: expected_hash
$ ./compare.sh 'content: "\u{10000}"' (* First code point requiring surrogate pairs *)
Hashes match: expected_hash
2. Surrogate Pair Ranges
- Include specific tests for both high and low surrogates to verify correct handling of surrogate pairs, particularly ensuring they combine correctly.
$ ./compare.sh 'content: "\u{D800}"' (* High surrogate, standalone (should not occur normally) *)
$ ./compare.sh 'content: "\u{DC00}"' (* Low surrogate, standalone (should not occur normally) *)
3. Multi-Character Strings with Mixed BMP and Non-BMP Characters
- Testing strings that mix BMP characters (e.g., ASCII, accented letters) and characters that require surrogate pairs helps confirm consistent encoding and hashing across different character ranges.
$ ./compare.sh 'content: "a𐍈z"' (* ASCII, non-BMP character, ASCII *)
Hashes match: expected_hash
$ ./compare.sh 'content: "é😀中"' (* Accented character, emoji, CJK character *)
Hashes match: expected_hash
4. Edge Cases for Non-Printable ASCII and Control Characters
- Testing with control characters (e.g., newline, tab) ensures they’re handled appropriately in UTF-16 encoding.
$ ./compare.sh 'content: "Line1\nLine2"' (* Newline in string *)
Hashes match: expected_hash
$ ./compare.sh 'content: "Tab\tSpace"' (* Tab character *)
Hashes match: expected_hash
5. Long Strings
- Test long strings to validate that your code handles lengthy input without performance issues or memory problems, especially important for a hashing algorithm.
$ ./compare.sh 'content: "a" ^ 1000' (* Repeat a single character 1000 times *)
Hashes match: expected_hash
$ ./compare.sh 'content: "😀" ^ 500' (* Repeat a surrogate pair character 500 times *)
Hashes match: expected_hash
6. Combination of Diacritical Marks and Composite Characters
- Certain characters can be combined with diacritical marks to form composite characters. Testing these combinations ensures accurate encoding and hashing.
$ ./compare.sh 'content: "e\u0301"' (* 'e' + acute accent (é) *)
Hashes match: expected_hash
$ ./compare.sh 'content: "n\u0303o\u0301"' (* 'n' + tilde, 'o' + acute accent (ñó) *)
Hashes match: expected_hash
7. Emoji Sequences and Skin Tone Modifiers
- Emojis with skin tone or gender modifiers form composite characters in UTF-16, which could affect hashing.
$ ./compare.sh 'content: "👍🏽"' (* Thumbs up with medium skin tone *)
Hashes match: expected_hash
$ ./compare.sh 'content: "👩💻"' (* Woman technologist emoji, sequence with zero-width joiner *)
Hashes match: expected_hash
8. CJK and Other Multi-Byte Characters
- Characters from CJK languages and other multi-byte character sets are crucial to test as they are common in non-Latin scripts and can have varied byte-length representations in UTF-16.
$ ./compare.sh 'content: "漢字"' (* Chinese characters *)
Hashes match: expected_hash
$ ./compare.sh 'content: "हिन्दी"' (* Hindi script *)
Hashes match: expected_hash
These examples aim to cover a wide variety of characters, including edge cases for surrogate pairs, multi-byte characters, and sequences. They will help validate that the hashing algorithm performs consistently with JavaScript’s UTF-16 expectations across diverse inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jchavarri thank you for those test cases. Every test case works perfectly.
The only "issue" was that I had to hardcode the long text with emojis because it did not work with a script.
70dff65
to
870fe58
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woa
packages/murmur2/murmur2.ml
Outdated
let i = ref 0 in | ||
let len = ref (String.length str) in | ||
|
||
let utf8_values = get_utf16_char_codes str in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_utf16_char_codes could raise, can we do something to avoid the raise and default to a decent output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davesnx I was looking for an alternative is returning the char as size 1: loop (i + 1) (Char.code s.[i] :: ACC)
What do you think? Commit: a6b5178#diff-9c6d38be187d03d1ebccb32294a9f980f60bab975999e1b739ae58b5c4fe8d44R67
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just try/catch it? or more important what happens when this fails in murmurjs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @davesnx, thank you for this comment.
mumurjs
does not raise but it indeed handles malformed chars, and the handle with �
So I updated the get_utf16_char_codes
to not raise and to return the necessary amount.
I also added a test for it, I didn't like the way I made this test; too many files, but I needed ideas.
You can see the tests here: 56ddc8d#diff-637969fb2a4af70ce6473152b597fdc34dcce8a6f4e28beed3a6bd6ca150e784
8540c21
to
56ddc8d
Compare
56ddc8d
to
0fb659a
Compare
$ ./compare.sh "color:var(--alt-text--tertiary);:disabled{color:var(--alt-text--tertiary);}:hover{color:var(--alt-text--primary);}" | ||
Hashes match: 66bc4u | ||
$ ./compare.sh "display:flex;:before, :after{content:'';flex:0 0 16px;}" | ||
Hashes match: ab0yh7 | ||
|
||
Malformed UTF-8 | ||
$ ./compare-malformed.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, why malformed is another test and another sh?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't reproduce malformed utf8 strings on bash: $ ./compare.sh <some malformed string here>
, so I did it on its own environment, JS and OCaml (malformed.js/malformed.ml)
I could have added it into compare. Sh
, I did it before using a --malformed
flag but didn't like the usage:
$ ./compare.sh <input>
$ ./compare.sh --malformed
I commented here that I didn't like this way but didn't know any other way to do it.
Now, thinking we can avoid other sh with:
Malformed UTF-8
$ cat > compare-malformed.sh << 'EOF'
> #!/bin/sh
> native_hashes=$(./malformed.exe)
> js_hashes=$(node './malformed.js')
> if [ "$native_hashes" = "$js_hashes" ]; then
> echo "Hashes match: ${native_hashes}"
> exit 0
> else
> echo "Hashes do not match"
> echo "native: ${native_hashes}"
> echo "JavaScript: ${js_hashes}"
> exit 1
> fi
> EOF
$ chmod +x compare-malformed.sh
$ ./compare-malformed.sh
Hashes match: z68v9f (2 bytes) - c4jucd (3 bytes) - c4jucd (4 bytes)
Description
The murmur2 code doesn't fit well on special chars (#521)
How
There were some issues that without handling special char we didn't see before:
get_utf16_char_codes
, and instead of handling with string on the murmur, we hold with a list of utf16 char codes.It probably worth a blog post about Int and int32 on JS and utf-16