Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve murmur2 to behave as JS on special chars #522

Merged
merged 1 commit into from
Nov 19, 2024

Conversation

pedrobslisboa
Copy link
Collaborator

Description

The murmur2 code doesn't fit well on special chars (#521)

   $ ./compare.sh 'content: "é"'
+  Hashes do not match
+  native: corivl
+  JavaScript: nawgoz
+  [1]
   $ ./compare.sh 'content: "®"'
+  Hashes do not match
+  native: 3084ue
+  JavaScript: ufik94
+  [1]
   $ ./compare.sh 'content: "😀"'
+  Hashes do not match
+  native: cqlzcw
+  JavaScript: 1u8rdd
+  [1]

How

There were some issues that without handling special char we didn't see before:

  • JS strings are UTF-16 encoded, so for chars, the max length is 2, but ocaml strings are UTF-8 encoded, meaning they use a variable length (1–4 bytes);
    • Then I create the get_utf16_char_codes, and instead of handling with string on the murmur, we hold with a list of utf16 char codes.
  • The emotion murmur has trick cases, where it handles Int in some instances and Int32 in others. So, I changed to primary Int and used Int32 for some specific cases.

It probably worth a blog post about Int and int32 on JS and utf-16

@pedrobslisboa pedrobslisboa added the bug Something isn't working label Nov 13, 2024
@pedrobslisboa pedrobslisboa self-assigned this Nov 13, 2024
Copy link

vercel bot commented Nov 13, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
styled-ppx ⬜️ Ignored (Inspect) Visit Preview Nov 19, 2024 0:22am

Copy link
Collaborator

@jchavarri jchavarri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very cool! ❤️ My only concern is if uutf lib is needed or this can be solved with OCaml stdlib.

@@ -1,3 +1,4 @@
(library
(name murmur2)
(libraries uutf)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if it's possible to not add a dependency? Related: https://discuss.ocaml.org/t/decoding-many-unicode-strings-with-uutf/8910/2 and ocaml/ocaml#10710.

Relevant sentence:

And once you can afford 4.14, I suggest you ditch uutf for the Stdlib UTF decoders which do not allocate at all.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remove the uutf adding a few lines of code. Thank you, @jchavarri

@@ -79,6 +79,34 @@
$ ./compare.sh "z-index: 10"
Hashes match: 7au0g0

Special chars
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked chatgpt for more examples of tests (I don't think the "long strings" one will compile 😓 but still useful):

Adding more targeted tests can ensure that edge cases and potential boundary issues are well-covered. Here are some specific examples of tests to add:

1. Boundary Tests for UTF-16 Code Units

  • These tests would ensure the code correctly handles characters around the boundary between single UTF-16 code units and surrogate pairs.
$ ./compare.sh 'content: "\u{FFFF}"' (* Last code point within the BMP *)
Hashes match: expected_hash

$ ./compare.sh 'content: "\u{10000}"' (* First code point requiring surrogate pairs *)
Hashes match: expected_hash

2. Surrogate Pair Ranges

  • Include specific tests for both high and low surrogates to verify correct handling of surrogate pairs, particularly ensuring they combine correctly.
$ ./compare.sh 'content: "\u{D800}"' (* High surrogate, standalone (should not occur normally) *)
$ ./compare.sh 'content: "\u{DC00}"' (* Low surrogate, standalone (should not occur normally) *)

3. Multi-Character Strings with Mixed BMP and Non-BMP Characters

  • Testing strings that mix BMP characters (e.g., ASCII, accented letters) and characters that require surrogate pairs helps confirm consistent encoding and hashing across different character ranges.
$ ./compare.sh 'content: "a𐍈z"' (* ASCII, non-BMP character, ASCII *)
Hashes match: expected_hash

$ ./compare.sh 'content: "é😀中"' (* Accented character, emoji, CJK character *)
Hashes match: expected_hash

4. Edge Cases for Non-Printable ASCII and Control Characters

  • Testing with control characters (e.g., newline, tab) ensures they’re handled appropriately in UTF-16 encoding.
$ ./compare.sh 'content: "Line1\nLine2"' (* Newline in string *)
Hashes match: expected_hash

$ ./compare.sh 'content: "Tab\tSpace"' (* Tab character *)
Hashes match: expected_hash

5. Long Strings

  • Test long strings to validate that your code handles lengthy input without performance issues or memory problems, especially important for a hashing algorithm.
$ ./compare.sh 'content: "a" ^ 1000' (* Repeat a single character 1000 times *)
Hashes match: expected_hash

$ ./compare.sh 'content: "😀" ^ 500' (* Repeat a surrogate pair character 500 times *)
Hashes match: expected_hash

6. Combination of Diacritical Marks and Composite Characters

  • Certain characters can be combined with diacritical marks to form composite characters. Testing these combinations ensures accurate encoding and hashing.
$ ./compare.sh 'content: "e\u0301"' (* 'e' + acute accent (é) *)
Hashes match: expected_hash

$ ./compare.sh 'content: "n\u0303o\u0301"' (* 'n' + tilde, 'o' + acute accent (ñó) *)
Hashes match: expected_hash

7. Emoji Sequences and Skin Tone Modifiers

  • Emojis with skin tone or gender modifiers form composite characters in UTF-16, which could affect hashing.
$ ./compare.sh 'content: "👍🏽"' (* Thumbs up with medium skin tone *)
Hashes match: expected_hash

$ ./compare.sh 'content: "👩‍💻"' (* Woman technologist emoji, sequence with zero-width joiner *)
Hashes match: expected_hash

8. CJK and Other Multi-Byte Characters

  • Characters from CJK languages and other multi-byte character sets are crucial to test as they are common in non-Latin scripts and can have varied byte-length representations in UTF-16.
$ ./compare.sh 'content: "漢字"' (* Chinese characters *)
Hashes match: expected_hash

$ ./compare.sh 'content: "हिन्दी"' (* Hindi script *)
Hashes match: expected_hash

These examples aim to cover a wide variety of characters, including edge cases for surrogate pairs, multi-byte characters, and sequences. They will help validate that the hashing algorithm performs consistently with JavaScript’s UTF-16 expectations across diverse inputs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jchavarri thank you for those test cases. Every test case works perfectly.
The only "issue" was that I had to hardcode the long text with emojis because it did not work with a script.

@pedrobslisboa pedrobslisboa force-pushed the feat/mumur2-special-chars-improve branch 3 times, most recently from 70dff65 to 870fe58 Compare November 14, 2024 12:34
Copy link
Owner

@davesnx davesnx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woa

let i = ref 0 in
let len = ref (String.length str) in

let utf8_values = get_utf16_char_codes str in
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_utf16_char_codes could raise, can we do something to avoid the raise and default to a decent output?

Copy link
Collaborator Author

@pedrobslisboa pedrobslisboa Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davesnx I was looking for an alternative is returning the char as size 1: loop (i + 1) (Char.code s.[i] :: ACC)

What do you think? Commit: a6b5178#diff-9c6d38be187d03d1ebccb32294a9f980f60bab975999e1b739ae58b5c4fe8d44R67

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just try/catch it? or more important what happens when this fails in murmurjs?

Copy link
Collaborator Author

@pedrobslisboa pedrobslisboa Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @davesnx, thank you for this comment.

mumurjs does not raise but it indeed handles malformed chars, and the handle with �

So I updated the get_utf16_char_codes to not raise and to return the necessary amount.

I also added a test for it, I didn't like the way I made this test; too many files, but I needed ideas.

You can see the tests here: 56ddc8d#diff-637969fb2a4af70ce6473152b597fdc34dcce8a6f4e28beed3a6bd6ca150e784

@pedrobslisboa pedrobslisboa force-pushed the feat/mumur2-special-chars-improve branch 4 times, most recently from 8540c21 to 56ddc8d Compare November 18, 2024 23:06
path/to/file Outdated Show resolved Hide resolved
packages/murmur2/test/cli_murmur2.ml Outdated Show resolved Hide resolved
packages/murmur2/test/murmur2.js Outdated Show resolved Hide resolved
@pedrobslisboa pedrobslisboa force-pushed the feat/mumur2-special-chars-improve branch from 56ddc8d to 0fb659a Compare November 19, 2024 12:22
$ ./compare.sh "color:var(--alt-text--tertiary);:disabled{color:var(--alt-text--tertiary);}:hover{color:var(--alt-text--primary);}"
Hashes match: 66bc4u
$ ./compare.sh "display:flex;:before, :after{content:'';flex:0 0 16px;}"
Hashes match: ab0yh7

Malformed UTF-8
$ ./compare-malformed.sh
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, why malformed is another test and another sh?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't reproduce malformed utf8 strings on bash: $ ./compare.sh <some malformed string here>, so I did it on its own environment, JS and OCaml (malformed.js/malformed.ml)
I could have added it into compare. Sh, I did it before using a --malformed flag but didn't like the usage:

  $ ./compare.sh <input>
  $ ./compare.sh --malformed

I commented here that I didn't like this way but didn't know any other way to do it.
Now, thinking we can avoid other sh with:

Malformed UTF-8
  $ cat > compare-malformed.sh << 'EOF'
  >  #!/bin/sh
  >  native_hashes=$(./malformed.exe)
  >  js_hashes=$(node './malformed.js')
  >  if [ "$native_hashes" = "$js_hashes" ]; then
  >    echo "Hashes match: ${native_hashes}"
  >    exit 0
  >  else
  >    echo "Hashes do not match"
  >    echo "native: ${native_hashes}"
  >    echo "JavaScript: ${js_hashes}"
  >    exit 1
  >  fi
  >  EOF

  $ chmod +x compare-malformed.sh
  $ ./compare-malformed.sh
  Hashes match: z68v9f (2 bytes) - c4jucd (3 bytes) - c4jucd (4 bytes)

@davesnx davesnx merged commit 6d64970 into main Nov 19, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants