uniprop and friends are buggy, inconsistent, and potentially replaceable #437

Open
ab5tract opened this issue Aug 31, 2024 · 2 comments
Labels
unicode Unicode and encoding/decoding

ab5tract commented Aug 31, 2024

The Problem

Raku currently very lightly wraps internal NQP and VM operations for:

  • getting Unicode properties (uniprop)
  • getting Unicode names (uniname / uninames)
  • matching the first-or-only character of a string to a property or a property/property category pair (unimatch)
  • producing Unicode characters (uniparse)
  • producing numerical values from Unicode numbers (unival)

As currently implemented, these routines -- methods and subs alike -- do not perform their functions as well as they could.

Issues

  • uniprop will return 0 if the property provided is not found. This would be better served with a Failure or Exception, or at the very least an empty string so that the return type is consistent.

  • 'Properties' and 'Property Categories' are served via the same mechanism, even though they are only interchangeable from category to property and not the other way around.

    "w".unimatch("Latin", "Script") ==> say() # True
    "w".unimatch("Script", "Latin") ==> say() # False
    
    "w".uniprop("Script") ==> say() # Latin
    "w".uniprop("Latin")  ==> say() # Latin
    "w".uniprop           ==> say() # Ll      
    
  • All of the methods which take properties could fail at compile time when provided with known nonexistent properties (i.e., given as literals rather than stored in a variable).

  • Properties are returned from NQP with spaces separating words, but only _ separated property names are accepted going the other direction.

    "e".uniprop("Block")        # "Basic Latin"
    "e".uniprop("Basic Latin")  # 0
    "e".uniprop("Basic_Latin")  # "Basic Latin"
    
    use nqp;
    my $a = nqp::unipropcode("Basic Latin");
    my $b = nqp::unipropcode("Basic_Latin");
    $a, $b, $a == $b ==> say()  # (0 6 False)
    
  • The current implementation is not thread-safe (see rakudo/rakudo#4871, "Unicode property methods/subs are not thread-safe"). See also the MoarVM-level issue MoarVM/MoarVM#1717, "Unicode ops are critically thread-dangerous".

  • There are separate methods for "first character" (uniprop/uniname) versus "full string" (uniprops/uninames). It feels far too Perl-ish at this point to accept a multi-character string as an argument to a routine that only operates on a single character.

    • Wouldn't it be reasonable to expect uniprops to return a comprehensive list of applicable properties for a single character? Currently there is no way (that I am aware of) to get such a comprehensive list.
  • (Nit-level deficiency) smashedtogetherlowercase is an ugly (at worst) or a non-conformant (at best) naming scheme.
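
To make the first and fourth points concrete, here is a minimal sketch of how the space/underscore mismatch and the silent 0 might be worked around in user code. The helper name `strict-uniprop` is hypothetical, and the sketch assumes that `nqp::unipropcode` returns 0 for any property name it does not recognise, which the output above suggests but which has not been checked against MoarVM.

```raku
use nqp;

# Hypothetical helper, not part of Raku: looks up a property of the first
# character of $char, accepting names written with spaces or underscores.
sub strict-uniprop(Str:D $char, Str:D $property) {
    # uniprop returns "Basic Latin" but only accepts "Basic_Latin".
    my $name = $property.trans(' ' => '_');

    # ASSUMPTION: a property code of 0 means "name not recognised".
    fail "Unknown Unicode property '$property'"
        if nqp::unipropcode($name) == 0;

    $char.uniprop($name)
}

# Both spellings now take the same path.
say strict-uniprop("e", "Basic Latin");
say strict-uniprop("e", "Basic_Latin");

# An unrecognised name yields a Failure instead of 0.
my $result = strict-uniprop("e", "No Such Property");
say $result.defined
    ?? $result
    !! "lookup failed: { $result.exception.message }";
```

This only papers over the interface. It does nothing for thread safety or compile-time checking.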

The solution

The instructions for problem-solving state I'm not meant to spend any space on this in the initial post.

So for now, I'll just mention that -- at the HLL level -- I think we could achieve all current functionality as well as much more out of even a single method that utilizes adverbs.
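
Purely as an illustration of what "a single routine with adverbs" could look like, here is a rough sketch built on the existing methods. The sub name `uni` and the adverbs `:prop`, `:name` and `:all` are invented for this example; they are not an existing or proposed Raku API.

```raku
# Hypothetical front end: `uni` and its adverbs do not exist in Raku.
# Each candidate simply delegates to the current built-in methods.

# Property lookup: first character by default, every character with :all.
multi sub uni(Str:D $s, Str:D :$prop!, Bool :$all) {
    $all ?? $s.uniprops($prop) !! $s.uniprop($prop)
}

# Name lookup: first character by default, every character with :all.
multi sub uni(Str:D $s, Bool :$name!, Bool :$all) {
    $all ?? $s.uninames !! $s.uniname
}

say uni("w", :prop<Script>);   # Latin, per the uniprop output reported above
say uni("a", :name);           # LATIN SMALL LETTER A
say uni("ab", :name, :all);    # one name per character
```

Because this version only delegates, it inherits every problem listed under Issues. It is meant to show the calling convention, not an implementation.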

ab5tract added the unicode (Unicode and encoding/decoding) label on Aug 31, 2024
lizmat unassigned samcv on Sep 8, 2024

lizmat (Collaborator) commented Sep 8, 2024

Feels to me an implementation of desired functionality in module space, would be a first step?

ab5tract (Author) commented Sep 9, 2024

> Feels to me an implementation of desired functionality in module space, would be a first step?

Hmmm.. I'm not sure how successfully some of the issues can be addressed without NQP and/or VM changes.

But it's a good idea for ironing out the actual replacement Raku API nonetheless.
