Wide chars, UTF-8, terminal escapes and colors, etc. #69

apjanke · 2024-12-01T01:02:49Z

Cowsay doesn't handle varying character widths well. It assumes every character occupies one display column and (I think) one byte in the input encoding. This means that characters outside plain ASCII, whether in the cows or in the message text, are not handled well.

This is an expansion of #65 "Use Debian's UTF-8 Patch".

Aspects:

Messages with multi-byte encoded characters don't look right. The speech balloon width and word wrapping are wrong.
Terminal control (escape) sequences, like color codes, are treated as visible characters.

Badly wrapped multi-byte char example:

Considerations

The cow files distributed with cowsay are all UTF-8, regardless of what locale the user is running in or how their system is set up. (I think? Or are they actually ASCII/Latin-1, since they are Perl source code?)

Message input might be in the user's locale while the cow files are UTF-8. Custom cows (including in third-party cow herd packages) might be in other encodings, which may or may not be the same encoding as

Perl's standard library doesn't support char width detection, I don't think. Would need a CPAN module for that. We currently don't take any deps on modules. Would need to figure out how to do that. I think we'd vendor the module (ship a copy of it in cowsay itself), to avoid creating any external dependencies or a more complicated install process.
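For a rough idea of what width detection involves, here is a minimal sketch that relies only on the Unicode properties built into Perl's regex engine, not on a CPAN module. `display_width` is a hypothetical helper, not part of cowsay. It assumes the string has already been decoded to characters and that the Perl in use knows the `East_Asian_Width` property (I have not checked the minimum version for that). It ignores ambiguous-width characters, emoji sequences, and tabs, so a vendored module may still be the better route.

```perl
use strict;
use warnings;

# Hypothetical helper: approximate number of terminal columns a decoded
# string occupies. Not a full wcwidth() replacement.
sub display_width {
    my ($str) = @_;
    my $width = 0;
    for my $ch (split //, $str) {
        if ($ch =~ /[\p{Mn}\p{Me}\p{Cf}]/) {
            next;                    # combining marks and format chars: 0 columns
        }
        elsif ($ch =~ /[\p{Ea=W}\p{Ea=F}]/) {
            $width += 2;             # East Asian Wide / Fullwidth: 2 columns
        }
        else {
            $width += 1;             # everything else: assume 1 column
        }
    }
    return $width;
}

# The five CJK characters from the wide-char test message, written as escapes.
print display_width("\x{6211}\x{611B}\x{4E2D}\x{570B}\x{4EBA}"), "\n";
# "M" followed by three U+00D6 characters, as in the first test message.
print display_width("M\x{D6}\x{D6}\x{D6}"), "\n";
```

The balloon-sizing and wrapping code would then call something like this where it currently calls `length`.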

Testing

Examples:

cowsay "MÖÖÖ"
cowsay 'Привет, мир!'
cowsay 'Ищу свое лицо. Особых примет нет.'

Wide chars:

cowsay "我愛中國人"
- from Debian 769565 "widechar not good":
  - cowsay 'でびあん/Debian'
  - cowsay 谢谢你

ANSI terminal escapes:

echo 'Hello, World!' | toilet -w 100 --metal | cowsay -n
figlet "Hello World!" | toilet -f term --metal | /usr/games/cowsay -n
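To make the escape-sequence problem concrete, here is a small sketch of measuring a string with and without its color codes. `strip_ansi` is a hypothetical helper, not existing cowsay code. It only recognizes CSI sequences (ESC `[` ... final byte), which I'm assuming is what `toilet --metal` emits; other escape types such as OSC are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: remove CSI sequences, i.e. ESC [ followed by
# parameter bytes (0x30-0x3F), intermediate bytes (0x20-0x2F), and one
# final byte (0x40-0x7E). This covers SGR color codes such as ESC[31m.
sub strip_ansi {
    my ($str) = @_;
    $str =~ s/\e\[[\x30-\x3F]*[\x20-\x2F]*[\x40-\x7E]//g;
    return $str;
}

my $colored = "\e[1;31mHello\e[0m, World!";
print length($colored), "\n";               # counts the escape characters too
print length(strip_ansi($colored)), "\n";   # counts only the visible characters
```

For width calculation the stripped copy would be measured, while the original string (escapes included) is what gets printed.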

TODO

Determine and document which encoding(s) are supported for cowfiles.
- Make our source code UTF-8 (with use utf8;)?
[-] Add support for detecting, and maybe explicitly setting, message input encoding.
[-] Bump required Perl to 5.8.1 (or 5.8.7, or later), which added Unicode fixes relevant to this UTF-8 stuff?
- 5.8.0 added Unicode support and UTF8-ness of stdin/out/err/ARGV.
- 5.8.1 restored behavior of stdin/out/err and ARGV not being interpreted as UTF8 by default.
- ${^UTF8LOCALE} was added in 5.8.7.
Support wide chars and invisible chars (like terminal escapes).
- ("Wide" in the sense that they are displayed 2-char width; not that they're a wchar type in encoding.)
Add tests for multibyte and wide characters.
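For the input-encoding item, one possible shape is sketched below. `decode_message` is a hypothetical helper, and taking the default from the locale's codeset via the core `I18N::Langinfo` module is my assumption about how detection could work, not something decided in this issue. An explicit encoding name overrides the locale.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: decode a raw message using an explicit encoding name
# if one is given, else the codeset reported by the current locale.
sub decode_message {
    my ($raw, $enc_name) = @_;
    $enc_name = langinfo(CODESET()) unless defined $enc_name;
    my $enc = find_encoding($enc_name);
    return $raw unless $enc;         # unknown encoding: pass the bytes through
    return $enc->decode($raw);
}

# Show what the locale reports on this system (varies by environment).
print langinfo(CODESET()), "\n";
```

A command-line option for the explicit case would just pass its value as the second argument; the option name is left open here.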

References

Use Debian's UTF-8 Patch For Calculating the Number of Columns #65
Third-party patches
- Debian patch for UTF-8 input and char sizes
  - Debian bug 254557
  - Ubuntu bug 393212
- ANSI escape codes
  - Ubuntu bug 1027033 "should ignore ANSI codes for width"
  - Ubuntu bug 1437804 "confused by input with ANSI colour codes"
Perl doco


apjanke · 2024-12-01T01:05:24Z

I looked into the licensing some more, and I don't think there's actually a licensing problem with the Debian patches or most of their cows. Per their copyright file, all their patches (except cowsay_random) are licensed under the original Cowsay license terms, so we could pull them in no problem.

Looking at the Debian patch...

--- a/cowsay
+++ b/cowsay
@@ -12,6 +12,13 @@ use File::Basename;
 use Getopt::Std;
 use Cwd;
 
+if (${^UTF8LOCALE}) {
+    binmode STDIN, ':utf8';
+    binmode STDOUT, ':utf8';
+    require Encode;
+    eval { $_ = Encode::decode_utf8($_,1) } for @ARGV;
+}
+
 $version = "3.03";
 $progname = basename($0);

...hmm. I'm not very familiar with Perl Unicode support.

Looks like this is programmatically doing the equivalent of perl -C which enables Unicode-ness and UTF8-ness on file handles and arguments.

Might be able to handle this more gracefully by using #!/usr/bin/env perl -C as the shebang. But then it wouldn't work right under explicit perl cowsay invocations unless the user explicitly added -C or $PERL_UNICODE stuff, I think. So maybe this Debian approach is the right way to do it.
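As a small illustration of why the decoding step matters for balloon width, this sketch compares `length` on the raw bytes of a Cyrillic word with `length` on the decoded string. The byte string is the UTF-8 encoding of the first word of the Cyrillic test message; the variable names are just for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for the first word of the Cyrillic test message, as they
# would arrive in @ARGV without -C or the Debian patch.
my $raw = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# Decoding turns the byte string into a character string.
my $chars = decode('UTF-8', $raw);

print length($raw), "\n";     # length in bytes
print length($chars), "\n";   # length in characters
```

Without decoding, every two-byte character is counted twice, which would explain a balloon that comes out too wide.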

See:

…o 5.8.7 Addresses #69 and #65. This UTF-8 handling approach is based on Debian's UTF-8 handling patch for cowsay 3.03 at https://sources.debian.org/patches/cowsay/3.03%2Bdfsg2-8/utf8_width, discussed at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=254557. It has been in place on Debian since 2010, so I think we can consider it reasonably well tested and supported.

apjanke · 2024-12-01T13:10:33Z

Added support for multibyte UTF-8 chars in inputs in b91f3d2, targeted for Cowsay 3.9.0, on the dev/free-mime branch instead of main.

Seems to work fine for me:


apjanke · 2024-12-01T16:14:30Z

Oh no, something went rather wrong here.

[cowsay] $ cowsay -f sus 'Hello world!'
 ______________
< Hello world! >
 --------------
   \
    \  .â��â��â��â��â��â��.
      .â��â��â��.    \
     (     )   +â��â��\
      `â��â��â��Â´    |  |
      |        |  |
      |   __   +â��â��/
      \__/  \__/
[cowsay] $ git switch main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
[cowsay] $ cowsay -f sus 'Hello world!'
 ______________
< Hello world! >
 --------------
   \
    \  .——————.
      .———.    \
     (     )   +——\
      `———´    |  |
      |        |  |
      |   __   +——/
      \__/  \__/
[cowsay] $

apjanke · 2024-12-01T16:19:00Z

Ah. It looks like some of our recent contributions are using non-ASCII UTF-8 characters.

[cows] $ pwd
/Users/janke/repos/cowsay-repos/cowsay/share/cowsay/cows
[cows] $ file *
actually.cow:          Unicode text, UTF-8 text
alpaca.cow:            Unicode text, UTF-8 text
beavis.zen.cow:        ASCII text
blowfish.cow:          ASCII text
bong.cow:              ASCII text
[...]
supermilker.cow:       ASCII text
surgery.cow:           ASCII text
sus.cow:               Unicode text, UTF-8 text
three-eyes.cow:        ASCII text
turkey.cow:            ASCII text

I don't know what interaction this new "UTF-8 on inputs" is having with cow source files, but seems likely it's something like that.

Maybe the problem is that we're changing STDOUT to UTF8 encoding, and these UTF8-using source files are not marked as UTF8, so they get misinterpreted as single-byte Latin-1 source, converted to chars, and then upon output, Perl renders them as the UTF-8 encoding of that bogus single-byte-encoding interpretation. That'd explain the accented characters: high (8-bit not 7-bit) bytes getting re-rendered.

Maybe use utf8; will help.
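The guess above can be reproduced in isolation. This is a sketch under assumptions: U+2014 stands in for whatever non-ASCII character the cow file actually uses, and `encode('UTF-8', ...)` stands in for what the `:utf8` layer on STDOUT does on output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed example character: U+2014, which is three bytes in UTF-8.
my $source_bytes = "\xE2\x80\x94";

# Without "use utf8", Perl sees three separate Latin-1 characters
# (U+00E2, U+0080, U+0094). A UTF-8 output layer then encodes each one.
my $misread = encode('UTF-8', $source_bytes);

# With the bytes decoded first (the effect "use utf8" has on literals),
# the single character is encoded back to the same three bytes.
my $correct = encode('UTF-8', decode('UTF-8', $source_bytes));

printf "misread: %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $misread;
printf "correct: %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $correct;
```

In the misread case the first two output bytes are the UTF-8 encoding of U+00E2, the accented "a" visible in the broken output, and the remaining bytes encode two C1 control characters that a terminal has no sensible glyph for.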

References

https://perldoc.perl.org/utf8

apjanke · 2024-12-01T16:27:05Z

Trying use utf8; in the known UTF-encoded cows.

Before:

After:

Yeah, that looks better.

Did that in fc346f4.


apjanke self-assigned this Dec 1, 2024

apjanke added the bug Somethin ain't right label Dec 1, 2024

apjanke added this to Cowsay Dec 1, 2024

apjanke added this to the 3.9.0 milestone Dec 1, 2024

apjanke moved this to Todo in Cowsay Dec 1, 2024

apjanke moved this from Todo to In Progress in Cowsay Dec 1, 2024
