Dusko Pijetlovic

My personal notes where I store things I find interesting or might need in the future.

Encodings, UTF-8 and Unicode Notes

17 Sep 2024 » unicode, utf8, x11, xorg, xterm, cli, terminal, shell, howto, sysadmin, unix, perl, python, vi, vim, ascii, plaintext, text, tex, latex, pdf, typography, font, html, design, webbrowser, webdevelopment, awk, regex, programming, coding, development, tool, reference, dotfiles, tip

OS: FreeBSD 14

% freebsd-version 
14.0-RELEASE-p6

Shell: tcsh [1]

Locale:

% locale
LANG=en_CA.UTF-8
LC_CTYPE="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_ALL=

NOTE: As with the note on Wikipedia's Emoji article, keep the following in mind:

This page contains Unicode emoticons or emojis. Without proper rendering support, you may see question marks, boxes, or other symbols instead of the intended characters. [2]

Programs from uniutils Package

% pkg query %Fp uniutils | wc -l
      31
 
% pkg query %Fp uniutils | grep bin
/usr/local/bin/ExplicateUTF8
/usr/local/bin/unidesc
/usr/local/bin/unifuzz
/usr/local/bin/unihist
/usr/local/bin/uniname
/usr/local/bin/unireverse
/usr/local/bin/unisurrogate
/usr/local/bin/utf8lookup
% printf '\302\240' | uniname
No LINES variable in environment so unable to determine lines per page.
Using default of 24.
character  byte       UTF-32   encoded as     glyph   name
        0          0  0000A0   C2 A0                  NO-BREAK SPACE
% printf '\302\240' | env LINES=0 uniname
character  byte       UTF-32   encoded as     glyph   name
        0          0  0000A0   C2 A0                  NO-BREAK SPACE

% printf '\302\240' | env LINES=1 uniname
        0          0  0000A0   C2 A0                  NO-BREAK SPACE
% printf '\302\240' | unidesc
       0               0        Latin-1 Supplement
% printf '\302\240' | ExplicateUTF8
The sequence 0xC2     0xA0     
             11000010 10100000 
is a valid UTF-8 character encoding equivalent to UTF32 0x000000A0.
The first byte tells us that there should be 1
continuation bytes since it begins with 2 contiguous 1s.
There are 1 following bytes and all are valid
continuation bytes since they all have high bits 10.
The first byte contributes its low 5 bits.
The remaining bytes each contribute their low 6 bits,
for a total of 11 bits: 00010 100000 
This is padded to 32 places with 21 zeros: 0000000000000000000000000000000000000000000000000000000010100000
                                           0   0   0   0   0   0   A   0
% env LINES=0 utf8lookup 0000A0
UTF-32   name
0000A0  NO-BREAK SPACE

% env LINES=1 utf8lookup 0000A0
0000A0  NO-BREAK SPACE

Converting Multi-Byte Characters

Example: Detect and Convert

Paste the character you want to analyze into a file.

NOTE: Depending on fonts you have on your system and your Web browser setup, you might not see the glyph representing this character (which is an emoji) in some of the outputs below.

% cat /tmp/convchar
🤔

Explain both UTF-8 (Hex UTF-8 Bytes) and Unicode (Unicode Hex Point)

With ExplicateUTF8(1)

% ExplicateUTF8 /tmp/convchar 
The sequence 0xF0     0x9F     0xA4     0x94     
             11110000 10011111 10100100 10010100 
is a valid UTF-8 character encoding equivalent to UTF32 0x0001F914.
The first byte tells us that there should be 3
continuation bytes since it begins with 4 contiguous 1s.
There are 3 following bytes and all are valid
continuation bytes since they all have high bits 10.
The first byte contributes its low 3 bits.
The remaining bytes each contribute their low 6 bits,
for a total of 21 bits: 000 011111 100100 010100 
This is padded to 32 places with 11 zeros: 0000000000000000000000000000000000000000000000011111100100010100
                                           0   0   0   1   F   9   1   4

This character in UTF-8: F0 9F A4 94

This character in Unicode: 1F914

From the above output of the ExplicateUTF8(1) tool:

"... for a total of 21 bits: 000 011111 100100 010100"

The 21 bits: 000 011111 100100 010100

Convert binary to hex.

% printf "obase=16; ibase=2; 000011111100100010100" | bc
1F914

Or, with padding:

% printf "obase=16; ibase=2; 0000000000000000000000000000000000000000000000011111100100010100" | bc
1F914
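
The bit arithmetic that ExplicateUTF8(1) describes can also be reproduced by hand. Below is a minimal sketch in Python; the function name decode_utf8_char is made up for this example, and the sketch only follows the steps from the tool's explanation (it does not reject overlong encodings or surrogate code points, as a full decoder would).

```python
def decode_utf8_char(data):
    """Decode one UTF-8 encoded character by hand; return its code point."""
    first = data[0]
    if first < 0x80:             # 0xxxxxxx: a single ASCII byte
        return first
    # Count the contiguous leading 1 bits of the first byte (2, 3 or 4).
    ones = 0
    while first & (0x80 >> ones):
        ones += 1
    if ones < 2 or ones > 4 or len(data) != ones:
        raise ValueError("not a single well-formed UTF-8 sequence")
    # The first byte contributes its low (7 - ones) bits.
    code_point = first & (0x7F >> ones)
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:  # continuation bytes have high bits 10
            raise ValueError("invalid continuation byte")
        # Each continuation byte contributes its low 6 bits.
        code_point = (code_point << 6) | (byte & 0x3F)
    return code_point


thinking_face = bytes([0xF0, 0x9F, 0xA4, 0x94])
print(f"{decode_utf8_char(thinking_face):X}")   # 1F914
no_break_space = bytes([0xC2, 0xA0])
print(f"{decode_utf8_char(no_break_space):X}")  # A0
```

The result can be cross-checked against Python's own decoder with ord(thinking_face.decode('utf-8')).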

With uniname(1)

% env LINES=0 uniname /tmp/convchar
character  byte       UTF-32   encoded as     glyph   name
        0          0  01F914   F0 9F A4 94    🤔      Character in undefined range
        1          4  00000A   0A                     LINE FEED (LF)

UTF-8 bytes as Latin-1 Characters Bytes

About Latin-1 characters bytes: From UTF-8 Conversion Tool by Richard Tobin:

UTF-8 bytes as Latin-1 characters is what you typically see when you display a UTF-8 file with a terminal or editor that only knows about 8-bit characters.

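WARNING: On my FreeBSD 14 system, iconv -l from the base install listed LATIN1 and ISO-8859-15 on the same line, as if they were the same encoding, while iconv(1) installed as a package listed them separately. They are not the same: ISO-8859-15 maps byte 0xA4 to € (EURO SIGN), where LATIN1 (ISO-8859-1) has ¤ (CURRENCY SIGN). This mattered because using iconv(1) with ISO-8859-15 encoding resulted in an incorrect output.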

For more details, see Footnote 3. [3]

% where where
where is a shell built-in

% which which
which: shell built-in command.

% where whereis
/usr/bin/whereis

% where which
which is a shell built-in
/usr/bin/which
% command -V iconv
iconv is /usr/bin/iconv

% type iconv
iconv is /usr/bin/iconv

% which iconv
/usr/bin/iconv
 
% whereis -a iconv
iconv: /usr/bin/iconv /usr/local/bin/iconv /usr/share/man/man1/iconv.1.gz /usr/local/share/man/man1/iconv.1.gz /usr/share/man/man3/iconv.3.gz /usr/local/share/man/man3/iconv.3.gz
 
% where iconv
/usr/bin/iconv
/usr/local/bin/iconv

With iconv(1) from base install:

% iconv -l | grep -w -i LATIN1 | grep -i ISO-8859-15
ISO-8859-1 CP819 CSISOLATIN1 IBM819 ISO-IR-100 ISO8859-1 ISO_8859-1 ISO_8859-1:1987 L1 LATIN1 CSISOLATIN6 ISO-8859-10 ISO-IR-157 ISO8859-10 ISO_8859-10 ISO_8859-10:1992 L6 LATIN6 ISO-8859-11 ISO-IR-166 ISO8859-11 ISO_8859-11 TIS-620 TIS.2533-1 TIS620 TIS620-0 TIS620.2529-1 TIS620.2533-0 ISO-8859-13 ISO-IR-179 ISO8859-13 ISO_8859-13 ISO_8859-13:1998 L7 LATIN7 ISO-8859-14 ISO-CELTIC ISO-IR-199 ISO8859-14 ISO_8859-14 ISO_8859-14:1998 L8 LATIN8 CP923 IBM923 ISO-8859-15 ISO-IR-203 ISO8859-15 ISO_8859-15 ISO_8859-15:1998 L9 LATIN9 ISO-8859-16 ISO-IR-226 ISO8859-16 ISO_8859-16 ISO_8859-16:2001 L10 LATIN10

With iconv(1) from packages:

% /usr/local/bin/iconv -l | grep -w -i LATIN1 | grep -i ISO-8859-15 
% /usr/local/bin/iconv -l | grep -w -i LATIN1
CP819 IBM819 ISO-8859-1 ISO-IR-100 ISO8859-1 ISO_8859-1 ISO_8859-1:1987 L1 LATIN1 CSISOLATIN1
RISCOS-LATIN1

Incorrect:

% iconv -f iso-8859-15 -t UTF-8 /tmp/convchar | od -ac
0000000   c3  b0  c2  9f  e2  82  ac  c2  94  nl                        
           ð  ** 302 237   €  **  ** 302 224  \n                        
0000012
% iconv -f iso-8859-15 -t UTF-8 /tmp/convchar | od -ab
0000000   c3  b0  c2  9f  e2  82  ac  c2  94  nl                        
          303 260 302 237 342 202 254 302 224 012
0000012
% printf '\303\260'
ð% 
% printf '\342\202\254'
€% 

Correct:

% iconv -f LATIN1 /tmp/convchar | od -ac
0000000   c3  b0  c2  9f  c2  a4  c2  94  nl                            
           ð  ** 302 237   ¤  ** 302 224  \n                            
0000011

Or, with iconv(1) from packages:

% /usr/local/bin/iconv -f LATIN1 /tmp/convchar | od -ac
0000000   c3  b0  c2  9f  c2  a4  c2  94  nl                            
           ð  ** 302 237   ¤  ** 302 224  \n                            
0000011
% /usr/local/bin/iconv -f LATIN1 /tmp/convchar | od -ab
0000000   c3  b0  c2  9f  c2  a4  c2  94  nl                            
          303 260 302 237 302 244 302 224 012                            
0000011
% printf '\303\260'
ð% 
 
% printf '\303\260\302'
ð�% 
 
% printf '\302\244'
¤% 
 
% printf '\302\244\302'
¤�% 

UTF-8 bytes as Latin-1 characters bytes: ð <9F> ¤ <94>
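
The difference between the two iconv(1) runs can be reproduced in Python. This is only a sketch; it assumes that Python's latin-1 and iso8859_15 codecs correspond to the LATIN1 and ISO-8859-15 encodings used above. The helper name show is made up for this example.

```python
# THINKING FACE (U+1F914) as UTF-8 bytes: f0 9f a4 94
utf8_bytes = "\U0001F914".encode("utf-8")

# Misread the same four bytes as two different 8-bit encodings.
as_latin1 = utf8_bytes.decode("latin-1")
as_latin9 = utf8_bytes.decode("iso8859_15")


def show(label, text):
    """Print the code points of text, then its UTF-8 re-encoding in hex."""
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(label, points, "|", text.encode("utf-8").hex(" "))


show("LATIN1     :", as_latin1)
show("ISO-8859-15:", as_latin9)
# LATIN1     : U+00F0 U+009F U+00A4 U+0094 | c3 b0 c2 9f c2 a4 c2 94
# ISO-8859-15: U+00F0 U+009F U+20AC U+0094 | c3 b0 c2 9f e2 82 ac c2 94
```

Only the third byte differs: 0xA4 becomes ¤ (U+00A4) under Latin-1 but € (U+20AC) under ISO-8859-15, which matches the two od(1) dumps above.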

Hex UTF-16 Surrogates

% unisurrogate 1F914
The surrogate representation of U+1F914 is U+D83E U+DD14

Also see:

The Surrogate Pair Calculator etc. by Russell W. Cottrell

A surrogate pair is defined by the Unicode Standard as “a representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit.” Since Unicode is a 21-bit standard, surrogate pairs are needed by applications that use UTF-16, such as JavaScript, to display characters whose code points are greater than 16-bit. (UTF-8, the most popular HTML encoding, uses a more flexible method of representing high-bit characters and does not use surrogate pairs.)

The algorithm for converting to and from surrogate pairs is not widely published on the internet. (But the code here has been “borrowed” a time or two!) The official source is The Unicode Standard, Version 3.0 | Unicode 3.0.0 (not later versions), Section 3.7, Surrogates.
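
The surrogate arithmetic itself is short. A minimal sketch in Python follows; the function names to_surrogates and from_surrogates are made up for this example and are not part of unisurrogate(1) or of the calculator quoted above.

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F914)
print(f"U+{high:04X} U+{low:04X}")              # U+D83E U+DD14
print(f"U+{from_surrogates(0xD83E, 0xDD14):X}")  # U+1F914
```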

Conversion to UTF-8 (Hex UTF-8 Bytes)

With od(1)

% od -ac /tmp/convchar
0000000   f0  9f  a4  94  nl
          🤔  **  **  **  \n
0000005

NOTE: This is a multi-byte character, with a byte count of 4, starting with f0.

From the man page for od(1):

Multi-byte characters are displayed in the area
corresponding to the first byte of the character.
The remaining bytes are shown as ‘**’.

In UTF-8 (Hex UTF-8 Bytes), this character is F0 9F A4 94.

With xxd(1)

% xxd < /tmp/convchar 
00000000: f09f a494 0a                             .....

The xxd(1) output’s screen capture so the colours are visible:

Unicode character name Thinking Face - Conversion to UTF-8 with xxd(1)

Conversion to Unicode (Unicode Hex Point)

OS and shell: FreeBSD 14, tcsh

With iconv(1) and xxd(1)

% iconv -t utf-32 /tmp/convchar | xxd
00000000: 0000 feff 0001 f914 0000 000a            ............

In Unicode (Unicode Hex Point), this character is 01f914; that is, 1f914.

With od(1), printf(1) and uniname(1)

% od -ab /tmp/convchar
0000000   f0  9f  a4  94  nl                                            
          360 237 244 224 012                                            
0000005
% printf '\360\237\244\224' | env LINES=0 uniname
character  byte     UTF-32   encoded as     glyph   name
        0        0  01F914   F0 9F A4 94    🤔      Character in undefined range

The xxd(1) output’s screen capture so the colours are visible:

Unicode character name Thinking Face - Conversion to UTF-8 with iconv(1) and xxd(1)


With All-In-One Tool: UTF-8 Conversion Tool by Richard Tobin

UTF-8 Conversion Tool – Interpreting a Unicode character as Hex UTF-8 bytes for a character represented with: F0 9F A4 94

Unicode character name Thinking Face - Conversions with UTF-8 Conversion Tool by Richard Tobin


With All-In-One Tool: uni Tool by Martin Tournoij

uni - Query the Unicode database from the commandline, with good support for emojis

Available as a package for FreeBSD:

uni on FreshPorts

uni on pkgs.org

In addition, available as a WASM (WebAssembly) demo: https://arp242.github.io/uni-wasm/

uni WASM (WebAssembly) demo details: Running Go CLI programs in the browser with WASM

Project home page: https://github.com/arp242/uni

For uni help, see Footnote 4. [4]

% uni identify < /tmp/convchar
             Dec    UTF8        HTML       Name
'🤔' U+1F914 129300 f0 9f a4 94 &#x1f914;  THINKING FACE
% uni identify --format '%unicode %name' < /tmp/convchar
Unicode Name
8.0     THINKING FACE

Include all columns:

NOTE: Here, the json field (\ud83e\udd14) holds the UTF-16 surrogate pair, which UTF-8 Conversion Tool by Richard Tobin calls Hex UTF-16 Surrogates.

% uni identify --format all < /tmp/convchar
             Width Cells Dec    Hex   Oct    Bin               UTF8        UTF16LE     UTF16BE     HTML      XML       JSON         Keysym Digraph Name          Plane                            Cat          Block                                Script Props Unicode Aliases Refs
'🤔' U+1F914 wide  2     129300 1f914 374424 11111100100010100 f0 9f a4 94 3e d8 14 dd d8 3e dd 14 &#x1f914; &#x1f914; \ud83e\udd14                THINKING FACE Supplementary Multilingual Plane Other_Symbol Supplemental Symbols and Pictographs Common       8.0

Output data as JSON:

% uni identify --as json --format all < /tmp/convchar
[{
        "aliases": "",
        "bin":     "11111100100010100",
        "block":   "Supplemental Symbols and Pictographs",
        "cat":     "Other_Symbol",
        "cells":   "2",
        "char":    "🤔",
        "cpoint":  "U+1F914",
        "dec":     "129300",
        "digraph": "",
        "hex":     "1f914",
        "html":    "&#x1f914;",
        "json":    "\\ud83e\\udd14",
        "keysym":  "",
        "name":    "THINKING FACE",
        "oct":     "374424",
        "plane":   "Supplementary Multilingual Plane",
        "props":   "",
        "refs":    "",
        "script":  "Common",
        "unicode": "8.0",
        "utf16be": "d8 3e dd 14",
        "utf16le": "3e d8 14 dd",
        "utf8":    "f0 9f a4 94",
        "width":   "wide",
        "xml":     "&#x1f914;"
}]

With All-In-One Tool: unicode Tool by Radovan Garabík

Project home page: https://github.com/garabik/unicode

unicode, simple command line utility that displays properties for a given unicode character, or searches unicode database for a given name.

% git clone https://github.com/garabik/unicode.git
% cd unicode/
% ls
changelog       MANIFEST.in     README          setup.py
COPYING         paracode        README-paracode unicode
debian          paracode.1      setup.cfg       unicode.1
% python3 setup.py --help
[ . . . ]
  setup.py build      will build the package underneath 'build/'
[ . . . ]
% python3 setup.py build
% ls -Alhrt | tail -1
drwxr-xr-x  3 dusko wheel    3B Aug 12 19:50 build
% ls -Alhrt build/
total 1
drwxr-xr-x  2 dusko wheel    4B Aug 12 19:50 scripts-3.9
 
% ls -Alhrt build/scripts-3.9/
total 25
-rwxr-xr-x  1 dusko wheel   40K Aug 12 19:50 unicode
-rwxr-xr-x  1 dusko wheel  7.0K Aug 12 19:50 paracode
% build/scripts-3.9/unicode --help
[ . . . ]
  --download            Try to dowload UnicodeData.txt
[ . . . ]
% build/scripts-3.9/unicode --download
Downloading UnicodeData.txt from http://www.unicode.org/Public/15.1.0/ucd/UnicodeData.txt
downloading.../home/dusko/.unicode/UnicodeData.txt.gz downloaded
% build/scripts-3.9/unicode 1F914
U+1F914 THINKING FACE
UTF-8: f0 9f a4 94 UTF-16BE: d83edd14 Decimal: &#129300; Octal: \0374424
🤔
Category: So (Symbol, Other); East Asian width: W (wide)
Bidi: ON (Other Neutrals)

NOTE: This tool also shows some additional information, like Decomposition:

% build/scripts-3.9/unicode 00c0
U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
UTF-8: c3 80 UTF-16BE: 00c0 Decimal: &#192; Octal: \0300
À (à)
Lowercase: 00E0
Category: Lu (Letter, Uppercase); East Asian width: N (neutral)
Bidi: L (Left-to-Right)
Decomposition: 0041 0300
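
The same decomposition can be checked with Python's standard unicodedata module. This is a small sketch; the data comes from whatever Unicode version the installed Python ships with, which is not necessarily the 15.1.0 file downloaded above.

```python
import unicodedata

ch = "\u00C0"  # LATIN CAPITAL LETTER A WITH GRAVE

# Canonical decomposition mapping, as stored in the Unicode database.
print(unicodedata.decomposition(ch))                    # 0041 0300

# NFD normalization applies the decomposition: one code point becomes two.
decomposed = unicodedata.normalize("NFD", ch)
print([f"U+{ord(c):04X}" for c in decomposed])          # ['U+0041', 'U+0300']

# NFC normalization composes them back into the single code point.
print(unicodedata.normalize("NFC", decomposed) == ch)   # True
```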

What’s the Name of the Unicode Character?

Straight from Unicode.org

% fetch http://unicode.org/Public/UNIDATA/UnicodeData.txt
UnicodeData.txt                                       1869 kB  690 kBps    02s
% grep -i 1F914 UnicodeData.txt
1F914;THINKING FACE;So;0;ON;;;;;N;;;;;

The name of this character is Thinking Face.
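
Each line of UnicodeData.txt is a list of semicolon-separated fields, so the grep(1) result is easy to take apart. A minimal sketch in Python, where the function name parse_unicode_data_line and the dictionary keys are made up for this example; only a few of the fields are extracted.

```python
def parse_unicode_data_line(line):
    """Extract a few fields from one line of UnicodeData.txt."""
    fields = line.rstrip("\n").split(";")
    return {
        "code_point": int(fields[0], 16),  # field 0: code point in hex
        "name": fields[1],                 # field 1: character name
        "category": fields[2],             # field 2: general category
        "bidi_class": fields[4],           # field 4: bidirectional class
    }


entry = parse_unicode_data_line("1F914;THINKING FACE;So;0;ON;;;;;N;;;;;")
print(entry["name"], entry["category"], entry["bidi_class"])
# THINKING FACE So ON
print(chr(entry["code_point"]))
```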

From UCD (Unicode Character Database) Package in FreeBSD

Available as a package on FreeBSD 14.

% sudo pkg install UCD
% pkg query %Fp UCD | wc -l
      81
% pkg query %Fp UCD
[ . . . ]
/usr/local/share/unicode/ucd/ReadMe.txt
[ . . . ]
/usr/local/share/unicode/ucd/UnicodeData.txt
[ . . . ]
% grep 1F914 /usr/local/share/unicode/ucd/UnicodeData.txt
1F914;THINKING FACE;So;0;ON;;;;;N;;;;;
% grep -r -n -i 1F914 /usr/local/share/unicode/ucd/
/usr/local/share/unicode/ucd/UnicodeData.txt:33380:1F914;THINKING FACE;So;0;ON;;;;;N;;;;;
/usr/local/share/unicode/ucd/extracted/DerivedName.txt:43179:1F914         ; THINKING FACE
/usr/local/share/unicode/ucd/NamesList.txt:53248:1F914  THINKING FACE

With the uni Tool

On FreeBSD 14, I installed it with sudo pkg install uni.

% uni print 1F914
             Dec    UTF8        HTML       Name
'🤔' U+1F914 129300 f0 9f a4 94 &#x1f914;  THINKING FACE

Confirm:

% uni search thinking face
             Dec    UTF8        HTML       Name
'🤔' U+1F914 129300 f0 9f a4 94 &#x1f914;  THINKING FACE

Converting from Unicode/UTF to ISO

From utf8 on grml - Converting Files - GrmlWiki:

Converting files from Unicode / UTF to ISO:

% iconv -c -f utf8 -t iso-8859-15 < utffile > isofile

and vice versa:

% iconv -f iso-8859-15 -t utf8 < isofile > utffile
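
For comparison, here is a rough Python equivalent of the two iconv(1) commands. It is a sketch that assumes errors="ignore" is an acceptable stand-in for iconv's -c flag (both drop characters that have no representation in the target encoding); the sample text is made up for this example.

```python
text = "h\u00e9llo \u20ac \U0001F914"  # e-acute, EURO SIGN, THINKING FACE

# UTF-8 text -> ISO-8859-15, silently dropping what cannot be converted.
iso_bytes = text.encode("iso8859_15", errors="ignore")
print(iso_bytes)  # b'h\xe9llo \xa4 '

# Without "ignore", the emoji makes the conversion fail.
try:
    text.encode("iso8859_15")
except UnicodeEncodeError as err:
    print("cannot convert:", err.reason)

# ... and vice versa: ISO-8859-15 bytes -> UTF-8 bytes.
utf8_bytes = iso_bytes.decode("iso8859_15").encode("utf-8")
print(utf8_bytes.hex(" "))  # 68 c3 a9 6c 6c 6f 20 e2 82 ac 20
```

The emoji is lost in the round trip because ISO-8859-15 has no code for it, which is also why the first iconv(1) command needs -c.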

Unicode Escape Formats

From Unicode Escape Formats:

The following are ASCII representations of Unicode characters known to be used in various contexts. In a few cases we also include unusual representations of integers since integers are sometimes converted to characters.
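
A few of these ASCII representations can be generated with a short Python sketch. The function name escape_formats and the selection of formats are my own choice for this example, not a list taken from the quoted page.

```python
def escape_formats(ch):
    """Return several ASCII escape representations of one character."""
    cp = ord(ch)
    # Split the character into 16-bit UTF-16 code units (1 or 2 of them).
    utf16 = ch.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    return {
        "U+ notation": f"U+{cp:04X}",
        "HTML hex": f"&#x{cp:x};",
        "HTML decimal": f"&#{cp};",
        "JSON / JavaScript": "".join(f"\\u{unit:04x}" for unit in units),
        "Python": f"\\U{cp:08x}" if cp > 0xFFFF else f"\\u{cp:04x}",
    }


for label, value in escape_formats("\U0001F914").items():
    print(f"{label:18} {value}")
# U+ notation        U+1F914
# HTML hex           &#x1f914;
# HTML decimal       &#129300;
# JSON / JavaScript  \ud83e\udd14
# Python             \U0001f914
```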


My Selection of Tools and References

UNUM: Unicode/HTML/Numeric Character Code Converter

Interconvert numbers, Unicode, and HTML/XHTML entities

Perl tool

Author: John Walker (www.fourmilab.ch), founder of Autodesk, Inc. and co-author of AutoCAD

Project home page: https://www.fourmilab.ch/webtools/unum/

About UNUM - From author’s Unix Utilities page:

Web authors who use characters from other languages, mathematical symbols, fancy punctuation, and other typographic embellishment in their documents often find themselves juggling the Unicode book, an HTML entity reference, and a programmer’s calculator to convert back and forth between the various representations. This stand-alone command line Perl program contains complete databases of Unicode characters and character blocks and HTML/XHTML named character references, and permits easy lookup and interconversion among all the formats, including octal, decimal, and hexadecimal numbers. The program works best on a recent version of Perl, such as v5.8.5 or later, but requires no Perl library modules. New version 3.4-14.0.0 (September 2021) updates to the Unicode 14.0.0 standard and the new scripts, characters, and emoji it adds.


uni.pl: List Unicode symbols matching pattern

Project home page: uni.pl - Perl script from leahneukirchen (Leah Neukirchen) - List Unicode symbols matching pattern


Perl, uni.pl, xxd, iconv, hexdump (hd), od, printf

% ./uni.pl crossbones
☠       2620    SKULL AND CROSSBONES
🕱       1F571   BLACK SKULL AND CROSSBONES

% ./uni.pl 2620
☠       2620    SKULL AND CROSSBONES

% ./uni.pl 2620 | cut -w -f1
☠

% ./uni.pl 2620 | cut -w -f1 > skandcr.txt

% cat skandcr.txt
☠

% xxd < skandcr.txt 
00000000: e298 a00a                                ....

NOTE: xxd(1) displayed the first three dots in red and the fourth dot in yellow. Accordingly, it also displayed e298 a0 in red and 0a in yellow.

% uchardet < skandcr.txt
UTF-8
% iconv -t utf8 < skandcr.txt
☠
 
% iconv -t utf8 skandcr.txt | od -ac 
0000000   e2  98  a0  nl
           ☠  **  **  \n
0000004

NOTE: In the od(1) output, the symbol (skull and crossbones) appears below e2, followed by ** below 98 and below a0. So, in UTF-8, this symbol consists of 3 bytes (6 hex digits).

In octal:

% iconv -t utf8 skandcr.txt | od -ab
0000000   e2  98  a0  nl
          342 230 240 012
0000004

Pick up the first three bytes:

% printf '\342\230\240'
☠% 
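
The octal escapes for printf(1) can be produced, and read back, with a few lines of Python. This is a sketch; the function names to_printf_octal and from_printf_octal are made up for this example.

```python
import re


def to_printf_octal(text):
    """Return the UTF-8 bytes of text as printf-style octal escapes."""
    return "".join(f"\\{byte:03o}" for byte in text.encode("utf-8"))


def from_printf_octal(escaped):
    """Turn a string of three-digit octal escapes back into text."""
    values = [int(digits, 8) for digits in re.findall(r"\\([0-7]{3})", escaped)]
    return bytes(values).decode("utf-8")


print(to_printf_octal("\u2620"))                 # \342\230\240
print(from_printf_octal(r"\360\237\244\224"))    # the THINKING FACE emoji
```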

Reference:

How do you echo a 4-digit Unicode character in Bash?

% iconv -t utf16 skandcr.txt | xxd
00000000: feff 2620 000a                           ..& ..

NOTE: xxd(1) displayed the first dot in red, the second dot in dark green, the ampersand (&) in light green, the next dot in white, and the last dot in yellow. Accordingly, it also displayed fe in red, ff in dark green, 2620 in light green, 00 in white, and 0a in yellow.

NOTE:

In UTF-16, a BOM (U+FEFF) may be placed as the first bytes of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a “noncharacter” that should never appear in the text.

  • If the 16-bit units are represented in big-endian byte order (“UTF-16BE”), the BOM is the (hexadecimal) byte sequence FE FF
  • If the 16-bit units use little-endian order (“UTF-16LE”), the BOM is the (hexadecimal) byte sequence FF FE

For the IANA registered charsets UTF-16BE and UTF-16LE, a byte-order mark should not be used because the names of these character sets already determine the byte order.

. . .

UTF-32

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.

The BOM for little-endian UTF-32 is the same pattern as a little-endian UTF-16 BOM followed by a UTF-16 NUL character, an unusual example of the BOM being the same pattern in two different encodings. Programmers using the BOM to identify the encoding will have to decide whether UTF-32 or UTF-16 with a NUL first character is more likely.

Source: Wikipedia - Byte order mark (BOM)
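
The ambiguity described in the last quoted paragraph shows up as soon as you try to detect a BOM in code. Below is a minimal sketch in Python using the BOM constants from the standard codecs module; the function name sniff_bom is made up for this example, and checking UTF-32 before UTF-16 is one possible design choice, not the only one.

```python
import codecs

# UTF-32 is tested first: the UTF-32LE BOM (ff fe 00 00) starts with the
# same two bytes as the UTF-16LE BOM (ff fe).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data):
    """Return the encoding suggested by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


# Bytes taken from the xxd(1) dumps in this post.
print(sniff_bom(bytes.fromhex("feff2620000a")))              # UTF-16BE
print(sniff_bom(bytes.fromhex("0000feff0001f9140000000a")))  # UTF-32BE
print(sniff_bom(bytes.fromhex("e298a00a")))                  # None
```

With this ordering, a UTF-16LE stream that begins with a BOM followed by a NUL character is reported as UTF-32LE, which is exactly the decision the quoted text says programmers have to make.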

% perl -CS -E 'say "\x{2620}"'
☠

NOTE: In my tests, Unicode table for you didn’t work when I accessed it from ftrain.com; however, it worked when I saved that page locally as .html and then opened it with my Web browser, Mozilla Firefox.

  • Wakamai Fondue - What can my font do?

    Wakamai Fondue is a tool that answers the question “What can my font do?”

    Drop a font on it, or click the circle to upload one, and Wakamai Fondue will tell you about the features in the font. It will also give you all the CSS needed to actually use these features in your web projects!

    Everything is processed inside the browser - your font will not be sent to a server!

  • Perl Unicode Cookbook: The Standard Preamble

    Apr 2, 2012 by Tom Christiansen

    Editor’s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. This is the first recipe in the series.


% perl -E 'my $x = "\N{SKULL AND CROSSBONES}"; say $x' | xxd
Wide character in say at -e line 1.
00000000: e298 a00a                                ....
% perl -E 'my $x = "\N{SKULL AND CROSSBONES}"; say $x' | hexdump -C
Wide character in say at -e line 1.
00000000  e2 98 a0 0a                                       |....|
00000004

To avoid Wide character in say warning:

% perl -E 'use open qw(:std :encoding(UTF-8)); my $x = "\N{SKULL AND CROSSBONES}"; say $x'
☠

Or:

% perl -E 'binmode(STDOUT, ":encoding(UTF-8)"); my $x = "\N{SKULL AND CROSSBONES}"; say $x'
☠

Or:

% perl -CS -E 'my $x = "\N{SKULL AND CROSSBONES}"; say $x'
☠

References:

How to get rid of Wide character in print at?

The use utf8 means Perl expects your source code to be UTF-8.

The open pragma can change the encoding of the standard filehandles:

use open qw( :std :encoding(UTF-8) );

And, whatever is going to deal with your output needs to expect UTF-8 too. If you want to see it correctly in your terminal, then you need to set up that correctly (but that’s nothing to do with Perl).

Use of ‘use utf8;’ gives me ‘Wide character in print’

You can use this

perl -CS filename

It will also terminates that error.

Reference (abridged):

The -C flag controls some of the Perl Unicode features.

As of 5.8.1, the -C can be followed either by a number or a list of option letters.
The letters, their numeric values, and effects are as follows; listing the letters is equal to summing the numbers.

    I     1   STDIN is assumed to be in UTF-8
    O     2   STDOUT will be in UTF-8
    E     4   STDERR will be in UTF-8
    S     7   I + O + E

[ . . . ]

If you’re not just running a one-liner, see here:

perlunicook - Cookbookish examples of handling Unicode in Perl - ℞ 15: Declare STD{IN,OUT,ERR} to be utf8


Python

Paste the character you want to analyze between single quotes of Python’s built-in ord() function.

% python3
>>> 
>>> import unicodedata

>>> ord('🤔')
129300

>>> chr(129300)
'🤔'

>>> unicodedata.name('🤔')
'THINKING FACE'

>>> "\N{Thinking Face}"
'🤔'

>>> u=chr(129300)

>>> u.encode('utf-8')
b'\xf0\x9f\xa4\x94'

>>> hex(129300)
'0x1f914'
% python3 -c 'print(0x1f914)'
129300
 
% python3 -c 'print(chr(129300))'
🤔

Unicode HOWTO - Python Documentation


vim

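In vim, with the cursor on a character, press ga in normal mode (or run :ascii, which can be shortened to :as) to show the character's value in decimal, hexadecimal and octal.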

libgrapheme

Project home page: https://libs.suckless.org/libgrapheme/

$ git clone https://git.suckless.org/libgrapheme
$ cd libgrapheme
$ ./configure
$ sudo make install
$ vi example.c
$ cat example.c
#include <grapheme.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        /* UTF-8 encoded input */
        char *s = "T\xC3\xABst \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0"
                  "\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA6 \xF0"
                  "\x9F\x87\xBA\xF0\x9F\x87\xB8 \xE0\xA4\xA8\xE0"
                  "\xA5\x80 \xE0\xAE\xA8\xE0\xAE\xBF!";
        size_t ret, len, off;

        printf("Input: \"%s\"\n", s);

        /* print each grapheme cluster with byte-length */
        printf("grapheme clusters in NUL-delimited input:\n");
        for (off = 0; s[off] != '\0'; off += ret) {
                ret = grapheme_next_character_break_utf8(s + off, SIZE_MAX);
                printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
        }
        printf("\n");

        /* do the same, but this time string is length-delimited */
        len = 17;
        printf("grapheme clusters in input delimited to %zu bytes:\n", len);
        for (off = 0; off < len; off += ret) {
                ret = grapheme_next_character_break_utf8(s + off, len - off);
                printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
        }

        return 0;
}
$ cc -o example example.c -lgrapheme
$ ./example
Input: "Tëst 👨👩👦 🇺🇸 नी நி!"
grapheme clusters in NUL-delimited input:
 1 bytes | T
 2 bytes | ë
 1 bytes | s
 1 bytes | t
 1 bytes |
18 bytes | 👨👩👦
 1 bytes |
 8 bytes | 🇺🇸
 1 bytes |
 6 bytes | नी
 1 bytes |
 6 bytes | நி
 1 bytes | !

grapheme clusters in input delimited to 17 bytes:
 1 bytes | T
 2 bytes | ë
 1 bytes | s
 1 bytes | t
 1 bytes |
11 bytes | 👨👩

References

(Retrieved on Sep 17, 2024)

NOTE: Three pictures below show that pasting the farmer emoji in xterm terminal emulator with csh (C shell), bash (GNU Bourne-again shell) and sh (Bourne shell) shells on my FreeBSD 14 system moved cursor four cells forward.

Unicode character name Farmer (emoji) - In FreeBSD, pasting it in xterm with csh shell moved cursor four cells forward

Unicode character name Farmer (emoji) - In FreeBSD, pasting it in xterm with bash shell moved cursor four cells forward

Unicode character name Farmer (emoji) - In FreeBSD, pasting it in xterm with sh (Bourne shell) moved cursor four cells forward

This blog post describes why this happens and how terminal emulator and program authors can achieve consistent spacing for all characters.

. . .

Traditionally, terminals simply read an input byte stream and mapped each individual byte to a cell in the grid. For example, the stream “1234” is 4 bytes, and programmers can very easily read a byte at a time and place it into the next cell, move the cursor right one, repeat.

Eventually, “wide characters” came along. Common wide characters are Asian characters such as 橋 or Emoji such as 😃. A function wcwidth was added to libc to return the width of a wide character in cells. Wide characters were given a width of “2” (usually). Therefore, if you type 橋 in a terminal emulator, the character will take up two grid cells and your cursor should jump forward by two cells.

And this is how most terminal emulators and terminal programs (shells, TUIs, etc.) are implemented today: they process input characters via wcwidth and move the cursor accordingly. And for a short period of time, this worked completely fine. But today, this is no longer adequate and results in many errors.
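
The per-character approach described above can be approximated in Python to see where the four cells for the farmer emoji come from. This is only a sketch: naive_cell_width is a made-up stand-in built on the standard unicodedata module, not the libc wcwidth function, and real implementations differ in their details.

```python
import unicodedata


def naive_cell_width(ch):
    """Approximate a per-character width: 0, 1 or 2 terminal cells."""
    # Combining marks and format characters (such as U+200D) take no cell.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def naive_string_width(text):
    """Sum the per-character widths, ignoring grapheme clusters."""
    return sum(naive_cell_width(ch) for ch in text)


print(naive_string_width("1234"))        # 4
print(naive_string_width("\u6a4b"))      # 2 (the wide character from the text)
farmer = "\U0001F9D1\u200D\U0001F33E"    # U+1F9D1, U+200D, U+1F33E
print(naive_string_width(farmer))        # 4 = 2 + 0 + 2
```

Summing widths code point by code point gives 4 for the farmer emoji, which is consistent with the cursor moving four cells forward in the screenshots described above.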

Grapheme Clustering

It turns out that a single 32-bit value is not adequate to represent every user-perceived character in the world. A “user-perceived character” is how the Unicode Standard defines a grapheme.

Let’s consider the emoji “🧑‍🌾”. The emoji should look something like this in case your computer doesn’t support it. I think every human would agree this is a single “user-perceived character” or grapheme. The Unicode Standard itself defines this as a single grapheme so regardless of your personal opinion, international standards say this is one grapheme.

For computers, its not so obvious. “🧑‍🌾” is three codepoints (U+1F9D1 🧑, U+200D, and U+1F33E 🌾), three 32-bit values when UTF-32 encoded, or 11 bytes when UTF-8 encoded (assuming 8-bits is a byte, which is a fairly safe assumption nowadays).

. . .

What’s with the zero-width character? The codepoint U+200D is known as a Zero-Width Joiner (ZWJ) and has a standards-defined width of zero. The ZWJ tells text processing systems to treat the codepoints around it as joined into a single character. That’s why you can type both “🧑‍🌾” and “🧑🌾”; the only difference between these two quoted values is the farmer on the left has a zero-width joiner between the two emoji.

. . .

Grapheme clustering is the process that lets a program see three 32-bit values as a single user-perceived character. The algorithm for grapheme clustering is defined in UAX #29, “Unicode Text Segmentation”.
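
The counts quoted above are easy to verify in Python, where a str is a sequence of code points. A small sketch follows; as far as I know, the Python standard library has no UAX #29 grapheme segmentation, so it can count code points and bytes but not user-perceived characters.

```python
import unicodedata

farmer = "\U0001F9D1\u200D\U0001F33E"

for ch in farmer:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+1F9D1 ADULT
# U+200D ZERO WIDTH JOINER
# U+1F33E EAR OF RICE

print(len(farmer))                           # 3 code points
print(len(farmer.encode("utf-32-be")) // 4)  # 3 32-bit values
print(len(farmer.encode("utf-8")))           # 11 bytes (4 + 3 + 4)
print(len(farmer.encode("utf-16-be")) // 2)  # 5 16-bit code units (2 + 1 + 2)
```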


Footnotes

[1] command -V and type are POSIX-compatible, while in tcsh, you can use which.

% ps $$
  PID TT  STAT    TIME COMMAND
27010 17  Ss   0:00.71 -csh (csh)

% printf %s\\n "$SHELL"
/bin/csh

% command -V csh; type csh; which csh; whereis -a csh; where csh
csh is /bin/csh
csh is /bin/csh
/bin/csh
csh: /bin/csh /usr/share/man/man1/csh.1.gz
/bin/csh

% command -V tcsh; type tcsh; which tcsh; whereis -a tcsh; where tcsh
tcsh is /bin/tcsh
tcsh is /bin/tcsh
/bin/tcsh
tcsh: /bin/tcsh /usr/share/man/man1/tcsh.1.gz
/bin/tcsh

% ls -lh /usr/share/man/man1/csh.1.gz /usr/share/man/man1/tcsh.1.gz
-r--r--r--  2 root wheel   65K Jul 27  2022 /usr/share/man/man1/csh.1.gz
-r--r--r--  2 root wheel   65K Jul 27  2022 /usr/share/man/man1/tcsh.1.gz
 
% diff /usr/share/man/man1/csh.1.gz /usr/share/man/man1/tcsh.1.gz

% ls -lh /bin/csh /bin/tcsh
-r-xr-xr-x  2 root wheel  432K Apr  8 13:31 /bin/csh
-r-xr-xr-x  2 root wheel  432K Apr  8 13:31 /bin/tcsh
 
% diff /bin/csh /bin/tcsh

% csh --version
tcsh 6.22.04 (Astron) 2021-04-26 (x86_64-amd-FreeBSD) options wide,nls,dl,al,kan,sm,rh,color,filec
 
% tcsh --version
tcsh 6.22.04 (Astron) 2021-04-26 (x86_64-amd-FreeBSD) options wide,nls,dl,al,kan,sm,rh,color,filec

% /bin/csh --version
tcsh 6.22.04 (Astron) 2021-04-26 (x86_64-amd-FreeBSD) options wide,nls,dl,al,kan,sm,rh,color,filec
 
% /bin/tcsh --version
tcsh 6.22.04 (Astron) 2021-04-26 (x86_64-amd-FreeBSD) options wide,nls,dl,al,kan,sm,rh,color,filec
% builtins 
:          @          alias      alloc      bg         bindkey    break
breaksw    builtins   case       cd         chdir      complete   continue
default    dirs       echo       echotc     else       end        endif
endsw      eval       exec       exit       fg         filetest   foreach
glob       goto       hashstat   history    hup        if         jobs
kill       limit      log        login      logout     ls-F       nice
nohup      notify     onintr     popd       printenv   pushd      rehash
repeat     sched      set        setenv     settc      setty      shift
source     stop       suspend    switch     telltc     termname   time
umask      unalias    uncomplete unhash     unlimit    unset      unsetenv
wait       where      which      while      
 
% command -V builtins; type builtins; which builtins; whereis -a builtins; where builtins
builtins: not found
builtins: not found
builtins: shell built-in command.
builtins: /usr/share/man/man1/builtins.1.gz
builtins is a shell built-in

References

  • What is the unix command to find out what executable file corresponds to a given command?

    For the T C Shell, tcsh, the built-in is the which command - not to be confused with any external command by that name:

    % which ls
    ls: aliased to ls-F
    % which \ls
    /bin/ls
    
  • Why not use “which”? What to use then?

    Comment by Stéphane Chazelas May 21, 2014: yes, csh (and which is still a csh script on most commercial Unices) does read ~/.cshrc when non-interactive. That’s why you’ll notice csh scripts usually start with #!/bin/csh -f. which does not because it aims to give you the aliases because it’s meant as a tool for (interactive) users of csh. POSIX shells users have command -v.

    . . .

    History

    . . .

    The early Unix shells until the late 70s had no functions or aliases. Only the traditional looking up of executables in $PATH. csh introduced aliases around 1978 (though csh was first released in 2BSD, in May 1979), and also the processing of a .cshrc for users to customize the shell (every shell, as csh, reads .cshrc even when not interactive like in scripts).

    . . .

    csh got a lot more popular than the Bourne shell as (though it had an awfully worse syntax than the Bourne shell) it was adding a lot of more convenient and nice features for interactive use.

    In 3BSD (1980), a which csh script was added for the csh users to help identify an executable, and it’s a hardly different script you can find as which on many commercial Unices nowadays (like Solaris, HP/UX, AIX or Tru64).

    . . .

    Here you go: which came first for the most popular shell at the time (and csh was still popular until the mid-90s), which is the main reason why it got documented in books and is still widely used.

    . . .

    A similar functionality was not added to the Bourne shell until 1984 in SVR2 with the type builtin command. The fact that it is builtin (as opposed to an external script) means that it can give you the right information (to some extent) as it has access to the internals of the shell.

    . . .

    The which csh script meanwhile was removed from NetBSD (as it was builtin in tcsh and of not much use in other shells), and the functionality added to whereis (when invoked as which, whereis behaves like which except that it only looks up executables in $PATH). In OpenBSD and FreeBSD, which was also changed to one written in C that looks up commands in $PATH only.

  • How can I check if a program exists from a Bash script?

    Answered by user lhunath on Mar 24, 2009 - Edited by user Daniel Kaplan on Sep 12, 2023:

    . . . POSIX compatible:

    command -v <the_command> . . .
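
The command -v check quoted above can be wrapped in a small helper for sh(1) scripts. This is a minimal sketch, not taken from the cited answer: the function names have_cmd and require_cmd are hypothetical, and the code is for POSIX sh, not for tcsh.

```shell
#!/bin/sh
# have_cmd NAME: succeed if the shell can resolve NAME (an executable in
# $PATH, a builtin, a function or an alias). Hypothetical helper name.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# require_cmd NAME...: report every missing command on stderr and
# return non-zero if at least one is missing. Hypothetical helper name.
require_cmd() {
    missing=0
    for cmd in "$@"; do
        if ! have_cmd "$cmd"; then
            printf 'missing: %s\n' "$cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A script that depends on the tools used in these notes could start with something like require_cmd iconv uniname || exit 1.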


[2] For example, on my FreeBSD 14 system, certain emojis were missing or not displaying correctly in the Mozilla Firefox web browser. On some sites, emojis, especially emoji ZWJ (ZERO WIDTH JOINER) sequences, were not displayed; blank spaces appeared in their place.

I fixed it by uninstalling (deinstalling) the Noto emoji font family (package name on FreeBSD: noto-emoji).

$ sudo pkg remove noto-emoji

Others reported that they fixed it by:

  • copying the TwemojiMozilla.ttf font file to the ~/.fonts directory (cp -i /usr/local/lib/firefox/fonts/TwemojiMozilla.ttf ~/.fonts)
  • preventing the Mozilla font from interfering with their system emoji font, either by going to about:config in Firefox and changing gfx.font_rendering.opentype_svg.enabled to false, or by doing the opposite of the previous step; that is, removing /usr/local/lib/firefox/fonts/TwemojiMozilla.ttf in FreeBSD (or /usr/lib/firefox/fonts/TwemojiMozilla.ttf in Linux)
  • ignoring Firefox’s Twemoji font for emojis and using the system emoji font instead, by going to about:config in Firefox and changing font.name-list.emoji from Twemoji Mozilla to emoji

References for this:

  • Emoji - openSUSE Wiki

    Install colored emoji fonts

    openSUSE provides the following colored emoji fonts:

    • Noto Color Emoji (noto-coloremoji-fonts), the default emoji font of most Android smart phones.
    • Emoji One (emojione-color-font), an open source emoji project with best Unicode coverage. Note: Emoji One is deprecated and was replaced by JoyPixels.
    • Twitter Emoji (twemoji-color-font), used by Twitter website and mobile applications.

    You can install one of them. There is no need to install multiple emoji fonts and it may cause problems.

    . . .

    Firefox emoji font configuration

    Firefox comes with built-in emoji font: Twemoji Mozilla. And it is used by default. To change it to your system emoji font, go to about:config page, search font.name-list.emoji and change it to the emoji font name you would like to use.

  • Solved - Emoji not displayed or overlapping text - The FreeBSD Forums
  • Emoji ZWJ Sequence

  • Firefox Font Troubleshooting - ArchWiki (Arch Linux Wiki)

    Firefox has a setting which determines how many replacements it will allow from Fontconfig. To allow it to use all your replacement rules, change gfx.font_rendering.fontconfig.max_generic_substitutions to 127 (the highest possible value).

    Firefox ships with the Twemoji Mozilla font. To use the system emoji font, set font.name-list.emoji to emoji in about:config. Additionally, to prevent the Mozilla font interfering with your system emoji font, change gfx.font_rendering.opentype_svg.enabled to false or remove /usr/lib/firefox/fonts/TwemojiMozilla.ttf.

  • Twemoji Confs - Reddit (self.linux)

    Comment by es20490446e[S]:

    Probably fontconfig was never designed with emoji in mind, where a specific set of glyphs should take over any other used font in the system. So figuring out how to do that got too messy for humans.

    . . .

    Comment by xtifr: Do need the Noto fonts, but those are included, and don’t need any special configuration. Just run apt install fonts-noto.

    . . .

    Comment by WhyNotHugo: I did end up taking a different approach though; ignore non-emojis glyphs for Twemoji, and make it the first font to be used: Configure Twemoji to be globally used for emoji - Archived from the original on Jul 2, 2020


[3] On my FreeBSD 14 system, the iconv -l command produced different output depending on whether iconv(1) came from the base system or from packages (the libiconv package).

% iconv -l | wc -l
     216
 
% iconv -l | grep -i latin | wc -l
      12
 
% iconv -l | grep -i latin1 | wc -l
       2

From the man page for iconv(1) on FreeBSD 14:

-l    Lists available codeset names.  Note that not all combinations of
      from_name and to_name are valid.
% locate iconv | grep man | wc -l
      22

% locate iconv | grep man | grep man1 | wc -l
       3
 
% locate iconv | grep man | grep -v jail | grep -v external | grep man1 
/usr/local/lib/perl5/5.36/perl/man/man1/piconv.1.gz
/usr/local/share/man/man1/iconv.1.gz
/usr/share/man/man1/iconv.1.gz
% man iconv | grep -i list
     -l    Lists available codeset names.  Note that not all combinations of
 
% man /usr/share/man/man1/iconv.1.gz | head -1
ICONV(1)                FreeBSD General Commands Manual               ICONV(1)

% man /usr/share/man/man1/iconv.1.gz | grep -i list
     -l    Lists available codeset names.  Note that not all combinations of
 
% man /usr/local/share/man/man1/iconv.1.gz | head -1
ICONV(1)                   Linux Programmer's Manual                  ICONV(1)
% man /usr/local/share/man/man1/iconv.1.gz | grep -n -i list
19:       implementation, they are listed in the iconv_open(3) manual page.
69:       The iconv -l or iconv --list command lists the names of the supported
72:       whitespace, and alias names of an encoding are listed on the same line
88:              lists the supported encodings.
% man /usr/local/share/man/man1/iconv.1.gz | sed -n 69,72p
       The iconv -l or iconv --list command lists the names of the supported
       encodings, in a system dependent format. For the libiconv
       implementation, the names are printed in upper case, separated by
       whitespace, and alias names of an encoding are listed on the same line
% command -V iconv; type iconv; which iconv; whereis -a iconv
iconv is /usr/bin/iconv
iconv is /usr/bin/iconv
/usr/bin/iconv
iconv: /usr/bin/iconv /usr/local/bin/iconv /usr/share/man/man1/iconv.1.gz /usr/local/share/man/man1/iconv.1.gz /usr/share/man/man3/iconv.3.gz /usr/local/share/man/man3/iconv.3.gz
% locate iconv | grep bin 
/usr/bin/iconv
/usr/local/bin/iconv
/usr/local/bin/piconv
% diff /usr/bin/iconv /usr/local/bin/iconv
Binary files /usr/bin/iconv and /usr/local/bin/iconv differ
% /usr/bin/iconv -l | wc -l
     216
 
% /usr/local/bin/iconv -l | wc -l
     196
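The line counts above only say that the two lists differ, because each output line can carry several alias names. To see which names differ, the two lists can be normalized and compared. This is a rough sketch for sh(1), assuming both tools print whitespace-separated names; the function names normalize_codesets and only_in_first and the file names base.txt and pkg.txt are made up for illustration.

```shell
#!/bin/sh
# normalize_codesets: read "iconv -l" style output on stdin and print one
# upper-cased codeset name per line, sorted and de-duplicated.
# (Hypothetical helper; assumes names are separated by spaces, tabs or
# newlines.)
normalize_codesets() {
    tr -s ' \t' '\n\n' |
        tr '[:lower:]' '[:upper:]' |
        sed '/^$/d' |
        LC_ALL=C sort -u
}

# only_in_first A B: print the names found in file A but not in file B.
# Both files must already be normalized, that is, sorted in the C locale.
only_in_first() {
    LC_ALL=C comm -23 "$1" "$2"
}

# Intended use (not run here):
#   /usr/bin/iconv -l       | normalize_codesets > base.txt
#   /usr/local/bin/iconv -l | normalize_codesets > pkg.txt
#   only_in_first base.txt pkg.txt    # names only in the base iconv
#   only_in_first pkg.txt base.txt    # names only in the libiconv iconv
```

Sorting and comparing in the C locale keeps the ordering the same for sort(1) and comm(1), whatever LANG is set to.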
% pkg which /usr/bin/iconv
/usr/bin/iconv was not found in the database
 
% pkg which /usr/local/bin/iconv
/usr/local/bin/iconv was installed by package libiconv-1.17_1
% pkg query %Fp libiconv-1.17_1
/usr/local/bin/iconv
/usr/local/include/iconv.h
/usr/local/include/libcharset.h
/usr/local/include/localcharset.h
/usr/local/lib/libcharset.a
/usr/local/lib/libcharset.so
/usr/local/lib/libcharset.so.1
/usr/local/lib/libcharset.so.1.0.0
/usr/local/lib/libiconv.a
/usr/local/lib/libiconv.so
/usr/local/lib/libiconv.so.2
/usr/local/lib/libiconv.so.2.6.1
/usr/local/share/doc/libiconv/iconv.1.html
/usr/local/share/doc/libiconv/iconv.3.html
/usr/local/share/doc/libiconv/iconv_close.3.html
/usr/local/share/doc/libiconv/iconv_open.3.html
/usr/local/share/doc/libiconv/iconv_open_into.3.html
/usr/local/share/doc/libiconv/iconvctl.3.html
/usr/local/share/licenses/libiconv-1.17_1/GPLv3
/usr/local/share/licenses/libiconv-1.17_1/LICENSE
/usr/local/share/licenses/libiconv-1.17_1/catalog.mk
/usr/local/share/man/man1/iconv.1.gz
/usr/local/share/man/man3/iconv.3.gz
/usr/local/share/man/man3/iconv_close.3.gz
/usr/local/share/man/man3/iconv_open.3.gz
/usr/local/share/man/man3/iconv_open_into.3.gz
/usr/local/share/man/man3/iconvctl.3.gz

From the man page for pkg-shlib on FreeBSD 14:

pkg shlib – display which installed package provides a specific shared
library, and the installed packages which require it
  
library is the filename of the library without any leading path, but
including the ABI version number.  Only exact matches are handled.
% pkg shlib libiconv
No packages provide libiconv.
No packages require libiconv.
 
% pkg shlib libiconv.so.2
libiconv.so.2 is provided by the following packages:
libiconv-1.17_1
libiconv.so.2 is linked to by the following packages:
p5-Locale-libintl-1.33
chromium-123.0.6312.58
glib-2.80.0,2
libdatrie-0.2.13_2
vlc-3.0.20_5,4
mutt-2.2.13
fontforge-20230101
groff-1.23.0_3
fvwm-2.6.9_3
rsync-3.2.7_1
 
% pkg shlib libiconv.so.2.6.1
No packages provide libiconv.so.2.6.1.
No packages require libiconv.so.2.6.1.
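
In the runs above only libiconv.so.2 matched, which looks like the library's SONAME rather than its full file name. One way to find that name for a given file is to read it from readelf -d output. This is a sketch under assumptions: that pkg shlib matches on the SONAME, and that readelf(1) prints the SONAME entry with the name in square brackets. The function name soname_from_readelf is hypothetical.

```shell
#!/bin/sh
# soname_from_readelf: read "readelf -d" output on stdin and print the
# name inside the square brackets of the SONAME entry, if there is one.
# (Hypothetical helper; assumes the "(SONAME) ... [name]" line format.)
soname_from_readelf() {
    sed -n 's/.*(SONAME).*\[\(.*\)].*/\1/p'
}

# Intended use (not run here):
#   readelf -d /usr/local/lib/libiconv.so.2.6.1 | soname_from_readelf
#   pkg shlib <name printed by the previous command>
```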

[4] For brief help on the uni tool:

% uni 
Usage: uni [command] [flags]

uni queries the unicode database. https://github.com/arp242/uni

Flags:
    -f, -format    Output format.
    -a, -as        How to print the results: list (default), json, or table.
    -c, -compact   More compact output.
    -r, -raw       Don't use graphical variants or add combining characters.
    -p, -pager     Output to $PAGER.
    -o, -or        Use "or" when searching instead of "and".

Commands:
    list           List Unicode data such as blocks, categories, etc.
    identify       Identify all the characters in the given strings.
    search         Search description for any of the words.
    print          Print characters by codepoint, category, or block.
    emoji          Search emojis.

Use "uni help" or "uni -h" for a more detailed help.