Wide chars, UTF-8, terminal escapes and colors, etc. #69

Open
4 tasks
apjanke opened this issue Dec 1, 2024 · 5 comments

apjanke commented Dec 1, 2024

Cowsay doesn't handle variable character widths well. It more or less assumes every character is one column wide on display and (I think) one byte in the input encoding. This means that characters outside the basic English/Latin set, whether in the cows or in the message text, are not handled well.

This is an expansion of #65 "Use Debian's UTF-8 Patch".

Aspects:

  • Messages with multi-byte encoded characters don't look right. The speech balloon width and word wrapping are wrong.
  • Terminal control (escape) sequences, like color codes, are treated as visible characters.
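
To make the first aspect concrete, here is a minimal sketch of how measuring a message in bytes rather than characters throws off the balloon width. It is written in Python purely for illustration, not in cowsay's Perl, and `balloon_border` is a hypothetical helper rather than anything in the cowsay source.

```python
def balloon_border(width: int) -> str:
    """Top border of a speech balloon for a message `width` columns wide."""
    return " " + "_" * (width + 2)


# "MÖÖÖ", spelled with escapes so the code points are unambiguous.
message = "M\u00d6\u00d6\u00d6"
raw = message.encode("utf-8")  # what a byte-oriented program sees

byte_len = len(raw)      # each Ö takes two bytes in UTF-8
char_len = len(message)  # number of characters actually displayed

print(balloon_border(byte_len))  # border sized from the byte count
print(balloon_border(char_len))  # border sized from the character count
print("< " + message + " >")
```

The byte-based border comes out three columns wider than the text it is supposed to frame, which is the kind of mismatch the screenshot below shows.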

Badly wrapped multi-byte character example:

image

Considerations

The cow files distributed with cowsay are all UTF-8, regardless of what locale the user is running in or how their system is set up. (I think? Or are they actually ASCII/Latin-1, since they are Perl source code?)

Message input might be in the user's locale while the cow files are UTF-8. Custom cows (including in third-party cow herd packages) might be in other encodings, which may or may not be the same encoding as

I don't think Perl's standard library supports character width detection; we would need a CPAN module for that. We currently don't depend on any modules, so we would need to figure out how to do that. I think we'd vendor the module (ship a copy of it in cowsay itself) to avoid creating any external dependencies or a more complicated install process.
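
For a sense of how much logic a width helper needs, here is a rough sketch in Python, whose standard library happens to expose the Unicode East Asian Width property. It says nothing about which Perl module we would pick. `display_width` is a hypothetical name, and the rules are deliberately simplified: Wide and Fullwidth characters count as two columns, combining marks as zero, and everything else (including East Asian "ambiguous" characters such as Cyrillic) as one.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate number of terminal columns `text` occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks stack on the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # Wide / Fullwidth characters take two columns
        else:
            width += 1
    return width


print(display_width("moo"))        # ASCII only
print(display_width("\u725b"))     # a CJK ideograph
print(display_width("e\u0301"))    # "e" plus a combining acute accent
```

A real implementation would also have to decide what to do with zero-width joiners, emoji sequences, and control characters, none of which this sketch covers.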

Testing

Examples:

  • cowsay "MÖÖÖ"
  • cowsay 'Привет, мир!'
  • cowsay 'Ищу свое лицо. Особых примет нет.'

Wide chars:

ANSI terminal escapes:

  • echo 'Hello, World!' | toilet -w 100 --metal | cowsay -n
  • figlet "Hello World!" | toilet -f term --metal | /usr/games/cowsay -n
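
For the escape-sequence cases, one possible approach is to strip the sequences before measuring a line and leave them in place when printing it. The sketch below (Python for illustration, with hypothetical helper names) handles only CSI sequences such as the SGR color codes; other escape types, such as OSC sequences, are not covered.

```python
import re

# CSI sequence: ESC [ , parameter bytes, intermediate bytes, one final byte.
CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_csi(text: str) -> str:
    """Remove CSI escape sequences, keeping only the visible text."""
    return CSI_RE.sub("", text)


def visible_length(text: str) -> int:
    """Length of `text` as the user sees it, ignoring color codes."""
    return len(strip_csi(text))


colored = "\x1b[0;1;34;94mHello\x1b[0m"
print(len(colored))             # counts the escape bytes too
print(visible_length(colored))  # counts only "Hello"
```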

TODO

  • Determine and document which encoding(s) are supported for cowfiles.
    • Make our source code UTF-8 (with use utf8;)?
  • [-] Add support for detecting, and maybe explicitly setting, message input encoding.
  • [-] Bump required Perl to 5.8.1 (or 5.8.7, or later), which added Unicode fixes relevant to this UTF-8 stuff?
    • 5.8.0 added Unicode support and UTF8-ness of stdin/out/err/ARGV.
    • 5.8.1 restored behavior of stdin/out/err and ARGV not being interpreted as UTF8 by default.
    • ${^UTF8LOCALE} was added in 5.8.7.
  • Support wide chars and invisible chars (like terminal escapes).
    • ("Wide" in the sense that they are displayed two columns wide, not that they use a wchar type in their encoding.)
  • Add tests for multibyte and wide characters.

apjanke commented Dec 1, 2024

I looked into the licensing some more, and I don't think there's actually a licensing problem with the Debian patches or most of their cows. Per their copyright file, all their patches (except cowsay_random) are licensed under the original Cowsay license terms, so we could pull them in no problem.

Looking at the Debian patch...

--- a/cowsay
+++ b/cowsay
@@ -12,6 +12,13 @@ use File::Basename;
 use Getopt::Std;
 use Cwd;
 
+if (${^UTF8LOCALE}) {
+    binmode STDIN, ':utf8';
+    binmode STDOUT, ':utf8';
+    require Encode;
+    eval { $_ = Encode::decode_utf8($_,1) } for @ARGV;
+}
+
 $version = "3.03";
 $progname = basename($0);

...hmm. I'm not very familiar with Perl Unicode support.

Looks like this is programmatically doing the equivalent of perl -C, which enables Unicode-ness and UTF8-ness on file handles and arguments.

Might be able to handle this more gracefully by using #!/usr/bin/env perl -C as the shebang. But then it wouldn't work right under explicit perl cowsay invocations unless the user explicitly added -C or $PERL_UNICODE stuff, I think. So maybe this Debian approach is the right way to do it.
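
Going back to the argument-decoding line in the Debian patch: as I read it, each argument is decoded as strict UTF-8, and the eval leaves any argument that fails to decode untouched. Here is a small Python sketch of that fallback logic. It is an illustration of my reading only, not a port of the patch, and `decode_args` and its `utf8_locale` flag are hypothetical names standing in for @ARGV and ${^UTF8LOCALE}.

```python
def decode_args(raw_args, utf8_locale):
    """Decode byte-string arguments as UTF-8 when the locale is UTF-8.

    Arguments that are not valid UTF-8 are passed through unchanged,
    mirroring the "try to decode, keep the original on failure" pattern.
    """
    if not utf8_locale:
        return list(raw_args)
    decoded = []
    for arg in raw_args:
        try:
            decoded.append(arg.decode("utf-8", errors="strict"))
        except UnicodeDecodeError:
            decoded.append(arg)  # leave malformed input as raw bytes
    return decoded


print(decode_args([b"M\xc3\x96\xc3\x96", b"\xff\xfe"], utf8_locale=True))
print(decode_args([b"M\xc3\x96\xc3\x96"], utf8_locale=False))
```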

@apjanke apjanke self-assigned this Dec 1, 2024
@apjanke apjanke added the bug Somethin ain't right label Dec 1, 2024
@apjanke apjanke added this to Cowsay Dec 1, 2024
@apjanke apjanke added this to the 3.9.0 milestone Dec 1, 2024
@apjanke apjanke moved this to Todo in Cowsay Dec 1, 2024
@apjanke apjanke moved this from Todo to In Progress in Cowsay Dec 1, 2024
apjanke added a commit that referenced this issue Dec 1, 2024
…o 5.8.7

Addresses #69 and #65.

This UTF-8 handling approach is based on Debian's UTF-8 handling patch for cowsay 3.03 at https://sources.debian.org/patches/cowsay/3.03%2Bdfsg2-8/utf8_width, discussed at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=254557. It has been in place on Debian since 2010, so I think we can consider it reasonably well tested and supported.

apjanke commented Dec 1, 2024

Added support for multibyte UTF-8 chars in inputs in b91f3d2, targeted for Cowsay 3.9.0, on the dev/free-mime branch instead of main.

Seems to work fine for me:

image

apjanke commented Dec 1, 2024

Oh no, something went rather wrong here.

[cowsay] $ cowsay -f sus 'Hello world!'
 ______________
< Hello world! >
 --------------
   \
    \  .������.
      .���.    \
     (     )   +��\
      `���´    |  |
      |        |  |
      |   __   +��/
      \__/  \__/
[cowsay] $ git switch main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
[cowsay] $ cowsay -f sus 'Hello world!'
 ______________
< Hello world! >
 --------------
   \
    \  .——————.
      .———.    \
     (     )   +——\
      `———´    |  |
      |        |  |
      |   __   +——/
      \__/  \__/
[cowsay] $
image

apjanke commented Dec 1, 2024

Ah. It looks like some of our recent contributions are using non-ASCII UTF-8 characters.

[cows] $ pwd
/Users/janke/repos/cowsay-repos/cowsay/share/cowsay/cows
[cows] $ file *
actually.cow:          Unicode text, UTF-8 text
alpaca.cow:            Unicode text, UTF-8 text
beavis.zen.cow:        ASCII text
blowfish.cow:          ASCII text
bong.cow:              ASCII text
[...]
supermilker.cow:       ASCII text
surgery.cow:           ASCII text
sus.cow:               Unicode text, UTF-8 text
three-eyes.cow:        ASCII text
turkey.cow:            ASCII text

I don't know exactly how this new "UTF-8 on inputs" handling interacts with the cow source files, but it seems likely that the interaction is the cause.

Maybe the problem is that we're changing STDOUT to UTF-8 encoding while these UTF-8 cow files are not marked as UTF-8. Perl would then misread them as single-byte Latin-1 source, convert each byte to a character, and on output render the UTF-8 encoding of that bogus single-byte interpretation. That would explain the accented characters: high (8-bit, not 7-bit) bytes getting re-encoded.
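
That suspected round trip is easy to reproduce outside Perl. The sketch below (Python, for illustration) uses U+2014 as a stand-in for whichever non-ASCII character the cow file actually contains, reads its UTF-8 bytes as Latin-1, and then writes the result back out as UTF-8.

```python
original = "\u2014"  # stand-in for a non-ASCII character in a cow file

utf8_bytes = original.encode("utf-8")          # the bytes stored in the file
misread = utf8_bytes.decode("latin-1")         # each byte becomes one character
double_encoded = misread.encode("utf-8")       # what a UTF-8 STDOUT then emits

print(utf8_bytes)       # three bytes for one character
print(len(misread))     # three characters after the bogus decode
print(double_encoded)   # six bytes: one accented letter plus two C1 controls

# Decoding the file's bytes as UTF-8 instead restores the original character.
repaired = utf8_bytes.decode("utf-8")
print(repaired == original)
```

The first misread character is U+00E2 (an accented "a"), and the other two are non-printing C1 control characters, which is consistent with accented junk appearing where each original character used to be.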

Maybe use utf8; will help.

apjanke commented Dec 1, 2024

Trying use utf8; in the cows known to be UTF-8 encoded.

Before:

image

After:

image

Yeah, that looks better.

Did that in fc346f4.

apjanke added a commit that referenced this issue Dec 3, 2024