`uniprop` and friends are buggy, inconsistent, and potentially replaceable #437

ab5tract · 2024-08-31T11:30:03Z

The Problem

Raku currently very lightly wraps internal NQP and VM operations for:

- getting Unicode properties (`uniprop`)
- getting Unicode names (`uniname` / `uninames`)
- matching the first-or-only character of a string to a property or a property/property category pair (`unimatch`)
- producing Unicode characters (`uniparse`)
- producing numerical values from Unicode numbers (`unival`)

As currently implemented, these routines -- methods and subs alike -- do not perform their functions as well as they could.

Issues

`uniprop` returns 0 if the property provided is not found. A Failure or Exception would serve better here, or at the very least an empty string so that the return type is consistent.
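As a rough illustration of the stricter behaviour, here is a user-space sketch. `uniprop-strict` is a hypothetical name, not an existing routine, and the check assumes that `nqp::unipropcode` returns 0 for names it does not recognize, which is what the `Basic Latin` example later in this post suggests.

```raku
use nqp;

# Hypothetical wrapper, not part of Rakudo: fail instead of returning 0.
sub uniprop-strict(Str:D $char, Str:D $prop) {
    # Assumption: a property code of 0 means the name was not recognized.
    fail "Unknown Unicode property: $prop"
        if nqp::unipropcode($prop) == 0;
    $char.uniprop($prop)
}

say uniprop-strict("w", "Script");    # Latin

# Misspelled on purpose; if the assumption holds this is a Failure, not 0.
my $result = uniprop-strict("w", "Scirpt");
say "lookup failed" without $result;
```

Returning a Failure rather than throwing keeps the decision with the caller: the value can be tested with `.defined` or `without`, and it only throws if it is used as though it were a real result.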

'Properties' and 'Property Categories' are served via the same mechanism, even though they are only interchangeable from category to property and not the other way around.

```raku
"w".unimatch("Latin", "Script") ==> say() # True
"w".unimatch("Script", "Latin") ==> say() # False

"w".uniprop("Script") ==> say() # Latin
"w".uniprop("Latin")  ==> say() # Latin
"w".uniprop           ==> say() # Ll
```

All of the methods which take properties could fail at compile time when provided with known nonexistent properties (i.e., properties given as literals rather than stored in a variable).
- As of right now, there is zero feedback that the provided property is nonexistent. This was already reported 6 years ago: "Unrecognized unicode properties shouldn't fail silently" (rakudo/rakudo#1496).

Properties are returned from NQP with spaces separating words, but only `_`-separated property names are accepted going the other direction.

```raku
"e".uniprop("Block")        # "Basic Latin"
"e".uniprop("Basic Latin")  # 0
"e".uniprop("Basic_Latin")  # "Basic Latin"
```

```raku
use nqp;
my $a = nqp::unipropcode("Basic Latin");
my $b = nqp::unipropcode("Basic_Latin");
$a, $b, $a == $b ==> say()  # (0 6 False)
```
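Until the two directions agree, a caller can paper over the mismatch by normalizing names before passing them back in. The helper below is a minimal sketch; `normalize-uniprop-name` is a hypothetical name, and it only handles the space/underscore difference described above.

```raku
# Hypothetical helper: turn the space-separated form that uniprop returns
# into the underscore-separated form that it accepts as input.
sub normalize-uniprop-name(Str:D $name --> Str:D) {
    $name.subst(' ', '_', :g)
}

say normalize-uniprop-name("Basic Latin");                # Basic_Latin
say "e".uniprop(normalize-uniprop-name("Basic Latin"));   # Basic Latin
```

This only works around the round-trip problem in user code; it does nothing about the silent 0 for names that are still not recognized after normalization.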

The current implementation is not thread-safe: see "Unicode property methods/subs are not thread-safe" (rakudo/rakudo#4871). See also, at the MoarVM level, "Unicode ops are critically thread-dangerous" (MoarVM/MoarVM#1717).
There are separate methods for "first character" (uniprop/uniname) versus "full string" (uniprops, uninames). It feels far too Perl-ish at this point to accept a multi-character string as an argument that only operates on a single character.
- Wouldn't it be reasonable to expect uniprops to return a comprehensive list of applicable properties for a single character? Currently there is no way (that I am aware of) to get such a comprehensive list.
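The closest approximation available today seems to be querying a hand-picked list of property names one at a time. The sketch below shows that workaround; `some-uniprops` and its default list of names are hypothetical, and the result is by construction not comprehensive.

```raku
# Hypothetical helper: look up a caller-supplied list of properties for one character.
# The default list is an assumption chosen for illustration, not a complete set.
sub some-uniprops(Str:D $char, @names = <General_Category Script Block>) {
    my %result = @names.map(-> $name { $name => $char.uniprop($name) });
    %result
}

my %props = some-uniprops("w");
say %props<Script>;   # Latin
```

The caller still has to know every property name in advance, which is exactly the gap described above.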
(Nit-level deficiency) smashedtogetherlowercase is an ugly (at worst) or a non-conformant (at best) naming scheme.

The solution

The instructions for problem-solving state I'm not meant to spend any space on this in the initial post.

So for now, I'll just mention that -- at the HLL level -- I think we could achieve all current functionality as well as much more out of even a single method that utilizes adverbs.


lizmat · 2024-09-08T11:39:21Z

Feels to me an implementation of desired functionality in module space, would be a first step?

ab5tract · 2024-09-09T08:23:49Z

> Feels to me an implementation of desired functionality in module space, would be a first step?

Hmmm.. I'm not sure how successfully some of the issues can be addressed without NQP and/or VM changes.

But it's a good idea for ironing out the actual replacement Raku API nonetheless.

ab5tract added the `unicode` (Unicode and encoding/decoding) label Aug 31, 2024

ab5tract assigned samcv Aug 31, 2024

lizmat unassigned samcv Sep 8, 2024
