Improve murmur2 to behave as JS on special chars #522

pedrobslisboa · 2024-11-13T18:26:23Z

Description

The murmur2 code doesn't fit well on special chars (#521)

   $ ./compare.sh 'content: "é"'
+  Hashes do not match
+  native: corivl
+  JavaScript: nawgoz
+  [1]
   $ ./compare.sh 'content: "®"'
+  Hashes do not match
+  native: 3084ue
+  JavaScript: ufik94
+  [1]
   $ ./compare.sh 'content: "😀"'
+  Hashes do not match
+  native: cqlzcw
+  JavaScript: 1u8rdd
+  [1]

How

There were some issues that without handling special char we didn't see before:

JS strings are UTF-16 encoded, so for chars, the max length is 2, but ocaml strings are UTF-8 encoded, meaning they use a variable length (1–4 bytes);
- Then I create the get_utf16_char_codes, and instead of handling with string on the murmur, we hold with a list of utf16 char codes.
The emotion murmur has trick cases, where it handles Int in some instances and Int32 in others. So, I changed to primary Int and used Int32 for some specific cases.

It probably worth a blog post about Int and int32 on JS and utf-16

vercel · 2024-11-13T18:26:29Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment

Name	Status	Preview	Comments	Updated (UTC)
styled-ppx	⬜️ Ignored (Inspect)	Visit Preview		Nov 19, 2024 0:22am

jchavarri

This is very cool! ❤️ My only concern is if uutf lib is needed or this can be solved with OCaml stdlib.

jchavarri · 2024-11-14T07:19:38Z

packages/murmur2/dune

@@ -1,3 +1,4 @@
 (library
 (name murmur2)
+ (libraries uutf)


Do you know if it's possible to not add a dependency? Related: https://discuss.ocaml.org/t/decoding-many-unicode-strings-with-uutf/8910/2 and ocaml/ocaml#10710.

Relevant sentence:

And once you can afford 4.14, I suggest you ditch uutf for the Stdlib UTF decoders which do not allocate at all.

Just remove the uutf adding a few lines of code. Thank you, @jchavarri

jchavarri · 2024-11-14T07:26:19Z

packages/murmur2/test/hash.t

@@ -79,6 +79,34 @@
  $ ./compare.sh "z-index: 10"
  Hashes match: 7au0g0

+Special chars


I asked chatgpt for more examples of tests (I don't think the "long strings" one will compile 😓 but still useful):

Adding more targeted tests can ensure that edge cases and potential boundary issues are well-covered. Here are some specific examples of tests to add:

1. Boundary Tests for UTF-16 Code Units

These tests would ensure the code correctly handles characters around the boundary between single UTF-16 code units and surrogate pairs.

$ ./compare.sh 'content: "\u{FFFF}"' (* Last code point within the BMP *) Hashes match: expected_hash $ ./compare.sh 'content: "\u{10000}"' (* First code point requiring surrogate pairs *) Hashes match: expected_hash

2. Surrogate Pair Ranges

Include specific tests for both high and low surrogates to verify correct handling of surrogate pairs, particularly ensuring they combine correctly.

$ ./compare.sh 'content: "\u{D800}"' (* High surrogate, standalone (should not occur normally) *) $ ./compare.sh 'content: "\u{DC00}"' (* Low surrogate, standalone (should not occur normally) *)

3. Multi-Character Strings with Mixed BMP and Non-BMP Characters

Testing strings that mix BMP characters (e.g., ASCII, accented letters) and characters that require surrogate pairs helps confirm consistent encoding and hashing across different character ranges.

$ ./compare.sh 'content: "a𐍈z"' (* ASCII, non-BMP character, ASCII *) Hashes match: expected_hash $ ./compare.sh 'content: "é😀中"' (* Accented character, emoji, CJK character *) Hashes match: expected_hash

4. Edge Cases for Non-Printable ASCII and Control Characters

Testing with control characters (e.g., newline, tab) ensures they’re handled appropriately in UTF-16 encoding.

$ ./compare.sh 'content: "Line1\nLine2"' (* Newline in string *) Hashes match: expected_hash $ ./compare.sh 'content: "Tab\tSpace"' (* Tab character *) Hashes match: expected_hash

5. Long Strings

Test long strings to validate that your code handles lengthy input without performance issues or memory problems, especially important for a hashing algorithm.

$ ./compare.sh 'content: "a" ^ 1000' (* Repeat a single character 1000 times *) Hashes match: expected_hash $ ./compare.sh 'content: "😀" ^ 500' (* Repeat a surrogate pair character 500 times *) Hashes match: expected_hash

6. Combination of Diacritical Marks and Composite Characters

Certain characters can be combined with diacritical marks to form composite characters. Testing these combinations ensures accurate encoding and hashing.

$ ./compare.sh 'content: "e\u0301"' (* 'e' + acute accent (é) *) Hashes match: expected_hash $ ./compare.sh 'content: "n\u0303o\u0301"' (* 'n' + tilde, 'o' + acute accent (ñó) *) Hashes match: expected_hash

7. Emoji Sequences and Skin Tone Modifiers

Emojis with skin tone or gender modifiers form composite characters in UTF-16, which could affect hashing.

$ ./compare.sh 'content: "👍🏽"' (* Thumbs up with medium skin tone *) Hashes match: expected_hash $ ./compare.sh 'content: "👩‍💻"' (* Woman technologist emoji, sequence with zero-width joiner *) Hashes match: expected_hash

8. CJK and Other Multi-Byte Characters

Characters from CJK languages and other multi-byte character sets are crucial to test as they are common in non-Latin scripts and can have varied byte-length representations in UTF-16.

$ ./compare.sh 'content: "漢字"' (* Chinese characters *) Hashes match: expected_hash $ ./compare.sh 'content: "हिन्दी"' (* Hindi script *) Hashes match: expected_hash

These examples aim to cover a wide variety of characters, including edge cases for surrogate pairs, multi-byte characters, and sequences. They will help validate that the hashing algorithm performs consistently with JavaScript’s UTF-16 expectations across diverse inputs.

@jchavarri thank you for those test cases. Every test case works perfectly.
The only "issue" was that I had to hardcode the long text with emojis because it did not work with a script.

davesnx

Woa

davesnx · 2024-11-14T14:06:26Z

packages/murmur2/murmur2.ml

  let i = ref 0 in
-  let len = ref (String.length str) in
+
+  let utf8_values = get_utf16_char_codes str in


get_utf16_char_codes could raise, can we do something to avoid the raise and default to a decent output?

@davesnx I was looking for an alternative is returning the char as size 1: loop (i + 1) (Char.code s.[i] :: ACC)

What do you think? Commit: a6b5178#diff-9c6d38be187d03d1ebccb32294a9f980f60bab975999e1b739ae58b5c4fe8d44R67

Maybe just try/catch it? or more important what happens when this fails in murmurjs?

Hey @davesnx, thank you for this comment.

mumurjs does not raise but it indeed handles malformed chars, and the handle with �

So I updated the get_utf16_char_codes to not raise and to return the necessary amount.

I also added a test for it, I didn't like the way I made this test; too many files, but I needed ideas.

You can see the tests here: 56ddc8d#diff-637969fb2a4af70ce6473152b597fdc34dcce8a6f4e28beed3a6bd6ca150e784

path/to/file

packages/murmur2/test/cli_murmur2.ml

packages/murmur2/test/murmur2.js

davesnx · 2024-11-19T12:42:51Z

packages/murmur2/test/hash.t

  $ ./compare.sh "color:var(--alt-text--tertiary);:disabled{color:var(--alt-text--tertiary);}:hover{color:var(--alt-text--primary);}"
  Hashes match: 66bc4u
  $ ./compare.sh "display:flex;:before, :after{content:'';flex:0 0 16px;}"
  Hashes match: ab0yh7
+
+Malformed UTF-8
+  $ ./compare-malformed.sh


Just curious, why malformed is another test and another sh?

I couldn't reproduce malformed utf8 strings on bash: $ ./compare.sh <some malformed string here>, so I did it on its own environment, JS and OCaml (malformed.js/malformed.ml)
I could have added it into compare. Sh, I did it before using a --malformed flag but didn't like the usage:

$ ./compare.sh <input> $ ./compare.sh --malformed

I commented here that I didn't like this way but didn't know any other way to do it.
Now, thinking we can avoid other sh with:

Malformed UTF-8 $ cat > compare-malformed.sh << 'EOF' > #!/bin/sh > native_hashes=$(./malformed.exe) > js_hashes=$(node './malformed.js') > if [ "$native_hashes" = "$js_hashes" ]; then > echo "Hashes match: ${native_hashes}" > exit 0 > else > echo "Hashes do not match" > echo "native: ${native_hashes}" > echo "JavaScript: ${js_hashes}" > exit 1 > fi > EOF $ chmod +x compare-malformed.sh $ ./compare-malformed.sh Hashes match: z68v9f (2 bytes) - c4jucd (3 bytes) - c4jucd (4 bytes)

pedrobslisboa added the bug Something isn't working label Nov 13, 2024

pedrobslisboa requested review from jchavarri and davesnx November 13, 2024 18:26

pedrobslisboa self-assigned this Nov 13, 2024

jchavarri reviewed Nov 14, 2024

View reviewed changes

pedrobslisboa force-pushed the feat/mumur2-special-chars-improve branch 3 times, most recently from 70dff65 to 870fe58 Compare November 14, 2024 12:34

davesnx approved these changes Nov 14, 2024

View reviewed changes

pedrobslisboa force-pushed the feat/mumur2-special-chars-improve branch 4 times, most recently from 8540c21 to 56ddc8d Compare November 18, 2024 23:06

jchavarri reviewed Nov 19, 2024

View reviewed changes

path/to/file Outdated Show resolved Hide resolved

packages/murmur2/test/cli_murmur2.ml Outdated Show resolved Hide resolved

packages/murmur2/test/murmur2.js Outdated Show resolved Hide resolved

feat: improve murmur2 to behave as JS on special chars

0fb659a

pedrobslisboa force-pushed the feat/mumur2-special-chars-improve branch from 56ddc8d to 0fb659a Compare November 19, 2024 12:22

davesnx reviewed Nov 19, 2024

View reviewed changes

davesnx merged commit 6d64970 into main Nov 19, 2024
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve murmur2 to behave as JS on special chars #522

Improve murmur2 to behave as JS on special chars #522

pedrobslisboa commented Nov 13, 2024

vercel bot commented Nov 13, 2024 •

edited

Loading

jchavarri left a comment

jchavarri Nov 14, 2024

pedrobslisboa Nov 14, 2024

jchavarri Nov 14, 2024

pedrobslisboa Nov 14, 2024

davesnx left a comment

davesnx Nov 14, 2024

pedrobslisboa Nov 18, 2024 •

edited

Loading

davesnx Nov 18, 2024

pedrobslisboa Nov 18, 2024 •

edited

Loading

davesnx Nov 19, 2024

pedrobslisboa Nov 19, 2024

Improve murmur2 to behave as JS on special chars #522

Improve murmur2 to behave as JS on special chars #522

Conversation

pedrobslisboa commented Nov 13, 2024

Description

How

vercel bot commented Nov 13, 2024 • edited Loading

jchavarri left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

1. Boundary Tests for UTF-16 Code Units

2. Surrogate Pair Ranges

3. Multi-Character Strings with Mixed BMP and Non-BMP Characters

4. Edge Cases for Non-Printable ASCII and Control Characters

5. Long Strings

6. Combination of Diacritical Marks and Composite Characters

7. Emoji Sequences and Skin Tone Modifiers

8. CJK and Other Multi-Byte Characters

Choose a reason for hiding this comment

davesnx left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pedrobslisboa Nov 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pedrobslisboa Nov 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vercel bot commented Nov 13, 2024 •

edited

Loading

pedrobslisboa Nov 18, 2024 •

edited

Loading

pedrobslisboa Nov 18, 2024 •

edited

Loading