Ansi, Utf8 & Utf16 encoding problems

General discussion for topics related to the FreeBASIC project or its community.
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Re: Ansi, Utf8 & Utf16 encoding problems

Post by caseih »

Okay, yes, you're right. If you manually enable code page 65001, it appears that UTF-8 can work on the Windows console, provided you have a font that can render it. I think the blog was talking about the internals of the console subsystem, which is based on a grid of UTF-16 cells, so somewhere in the system there is a code-page translation to UTF-16, at least pre-1809. With code pages the whole thing is certainly a mess. And when you combine that with byte-oriented string literals in source files that use Unicode encodings, there's ample room for all kinds of mojibake and other disasters. If an editor changes the encoding of a file without you knowing it, out comes mojibake.
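For illustration, here's roughly what enabling it from code looks like (a minimal C sketch; the same call works from any language that can reach the Win32 API):

Code: Select all

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Switch the console output code page to UTF-8 (same effect as
       running "chcp 65001" in the console before starting the program). */
    SetConsoleOutputCP(CP_UTF8);

    /* Raw UTF-8 bytes for "café €". Displays correctly only if the
       console font actually has the glyphs. */
    printf("caf\xC3\xA9 \xE2\x82\xAC\n");
    return 0;
}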

MS was an early adopter of Unicode, but betting the farm on UCS-2, and later UTF-16, turned out to be a mistake, even though it seemed like a future-proof idea at the time. Ah well, what can you do.

Probably it is now time for all Windows 10 users to install the new Windows Terminal app and use it for everything command-line. That solves the display problems once and for all. Well, apart from the remaining problem that UTF-8 support in the narrow Win32 API calls isn't perfect.
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands
Contact:

Re: Ansi, Utf8 & Utf16 encoding problems

Post by marcov »

Note that you can also manifest a Windows EXE to initialize with UTF-8 defaults; this works since Windows 10 1903 (the May 2019 update).

https://docs.microsoft.com/en-us/window ... -code-page
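The relevant manifest fragment looks like this (quoted from memory, so check the linked page for the exact schema):

Code: Select all

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>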
robert
Posts: 169
Joined: Aug 06, 2019 18:45

Re: Ansi, Utf8 & Utf16 encoding problems

Post by robert »

caseih wrote:Okay, yes, you're right. If you manually enable code page 65001, it appears that UTF-8 can work on the Windows console, provided you have a font that can render it. I think the blog was talking about the internals of the console subsystem, which is based on a grid of UTF-16 cells, so somewhere in the system there is a code-page translation to UTF-16, at least pre-1809. With code pages the whole thing is certainly a mess. And when you combine that with byte-oriented string literals in source files that use Unicode encodings, there's ample room for all kinds of mojibake and other disasters. If an editor changes the encoding of a file without you knowing it, out comes mojibake.
Hi Caseih:

Relevant to your comments are these statements.

Quoted from
https://en.wikipedia.org/wiki/Unicode_i ... ft_Windows
Microsoft's compilers often fail at producing UTF-8 string constants from UTF-8 source files. The most reliable method is to turn off UNICODE, not mark the input file as being UTF-8 (i.e. do not use a BOM), and arrange the string constants to have the UTF-8 bytes. If a BOM was added, a Microsoft compiler will interpret the strings as UTF-8, convert them to UTF-16, then convert them back into the current locale, thus destroying the UTF-8. Without a BOM and using a single-byte locale, Microsoft compilers will leave the bytes in a quoted string unchanged.
Quoted from
http://utf8everywhere.org/#faq.literal
Unfortunately, MSVC converts it to some ANSI codepage, corrupting the string. To work around this, save the file in UTF-8 without BOM. MSVC will assume that it is in the correct codepage and will not touch your strings. However, it renders it impossible to use Unicode identifiers and wide string literals....
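To make the workaround concrete, a small C sketch: the escaped form survives any compiler charset guessing, while the plain literal depends on how the source file is decoded. (Newer MSVC versions also have a /utf-8 switch that sets both the source and execution charsets to UTF-8, which avoids the whole dance.)

Code: Select all

/* Sketch: two ways to put the UTF-8 encoding of "café" into a narrow
   string constant with MSVC. */
const char *safe  = "caf\xC3\xA9"; /* explicit UTF-8 bytes: immune to charset guessing */
const char *risky = "café";        /* depends on file encoding, BOM, and compiler locale */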
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands
Contact:

Re: Ansi, Utf8 & Utf16 encoding problems

Post by marcov »

Robert: that post is a horrible mix of confusion about UTF-8 as a document versus API encoding.

Basically UTF8 is favoured because the *nixers were too lazy to rework their old junk codebases. Open Source tends to favor weak compromises rather than revolution.
robert
Posts: 169
Joined: Aug 06, 2019 18:45

Re: Ansi, Utf8 & Utf16 encoding problems

Post by robert »

marcov wrote:Robert: that post is a horrible mix of confusion about UTF-8 as a document versus API encoding.

Basically UTF8 is favoured because the *nixers were too lazy to rework their old junk codebases. Open Source tends to favor weak compromises rather than revolution.
Hi marcov:

I guess who is lazy depends on perspective.

Quoted from
https://en.wikipedia.org/wiki/UTF-8
In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.
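Those restrictions are easy to see in encoder code; a minimal C sketch of the post-RFC 3629 rules:

Code: Select all

/* Sketch of a UTF-8 encoder under the post-RFC 3629 rules: surrogates
   and anything above U+10FFFF are rejected rather than encoded. */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogate range: forbidden */
    if (cp > 0x10FFFF)                return 0;  /* beyond the UTF-16 limit: forbidden */
    if (cp < 0x80)    { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800)   { out[0] = (unsigned char)(0xC0 | (cp >> 6));
                        out[1] = (unsigned char)(0x80 | (cp & 0x3F)); return 2; }
    if (cp < 0x10000) { out[0] = (unsigned char)(0xE0 | (cp >> 12));
                        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                        out[2] = (unsigned char)(0x80 | (cp & 0x3F)); return 3; }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}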
Eventually, it was realized that UTF-16 wasn't adequate. But wait, as you have seen, there's more ...

Quoted from your reference

https://docs.microsoft.com/en-us/window ... -code-page
Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes.
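In practice that model means a plain -A call can take UTF-8 directly once the process ANSI code page is UTF-8 (via the manifest marcov linked, or the system-wide "Beta: Use Unicode UTF-8" setting); a minimal sketch:

Code: Select all

#include <windows.h>

int main(void)
{
    /* With the process ANSI code page manifested to UTF-8, the -A entry
       point receives these bytes as UTF-8 and converts them to UTF-16
       internally; no -W call or explicit conversion needed. */
    MessageBoxA(NULL, "caf\xC3\xA9 \xE2\x82\xAC", "UTF-8 through an -A API", MB_OK);
    return 0;
}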
Revolution? Go Golang!
robert
Posts: 169
Joined: Aug 06, 2019 18:45

Re: Ansi, Utf8 & Utf16 encoding problems

Post by robert »

IsTextUnicode

Bush hid the facts

https://en.wikipedia.org/wiki/Bush_hid_the_facts
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands
Contact:

Re: Ansi, Utf8 & Utf16 encoding problems

Post by marcov »

robert wrote:
Revolution? Go Golang!
Mindless peer-pressure isn't necessarily progress.

Fact is that the meagre gains of UTF-8 as a document encoding already disappear if the filesystem has compression (as Windows NT has had since the mid-nineties).

It is the legacy 1-byte *nix cruft for which UTF16 was an insurmountable mountain.
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Ansi, Utf8 & Utf16 encoding problems

Post by jj2007 »

robert wrote:If a BOM was added, a Microsoft compiler will interpret the strings as UTF-8, convert them to UTF-16, then convert them back into the current locale, thus destroying the UTF-8
Fantastic... thanks for finding this jewel, Robert. I've always tried to avoid M$ C compilers and the CRT functions, now I know why, lol
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Re: Ansi, Utf8 & Utf16 encoding problems

Post by caseih »

marcov wrote:Basically UTF8 is favoured because the *nixers were too lazy to rework their old junk codebases. Open Source tends to favor weak compromises rather than revolution.
That's very funny. And complete nonsense of course. UTF-8 is favored because it is simply the most logical encoding to use (at least once it was developed after Windows first adopted Unicode), and completely backwards-compatible with ASCII. MS wasted a lot of years on wide API calls duplicating all the narrow ones, only to now redo the narrow API calls to support UTF-8 in the latest versions of Windows 10. But as they say, hindsight is 20/20. UTF-8 didn't exist back when Windows started doing Unicode.
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Re: Ansi, Utf8 & Utf16 encoding problems

Post by caseih »

marcov wrote:Fact is that the meagre gains of UTF-8 as a document encoding already disappear if the filesystem has compression (as Windows NT has had since the mid-nineties).

It is the legacy 1-byte *nix cruft for which UTF16 was an insurmountable mountain.
You've said many insightful things over the years, but this certainly isn't one of them! UTF-8 is king for at least two reasons. One, it's compatible with ASCII (every ASCII file is a valid UTF-8 file), and two, you can't ever escape bytes anyway: files are streams of bytes. And UTF-16's advantages over UTF-8 disappear as soon as you deal with characters above U+FFFF, which is now very common in the world. I forget the technical terms, but basically indexing a string in UTF-16 is no faster than UTF-8 for the same reasons.
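The point is easy to demonstrate; a quick C sketch (assuming a Windows toolchain where wchar_t is 16 bits):

Code: Select all

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    /* U+1F600 (an emoji): 4 bytes in UTF-8, and a surrogate pair
       (2 code units) in UTF-16. Neither count equals "1 character",
       so neither encoding gives O(1) indexing by character. */
    const char    *u8  = "\xF0\x9F\x98\x80";
    const wchar_t *u16 = L"\xD83D\xDE00";
    printf("UTF-8 bytes: %u, UTF-16 units: %u\n",
           (unsigned)strlen(u8), (unsigned)wcslen(u16)); /* prints 4 and 2 */
    return 0;
}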

Really, there's no reason not to use UTF-8 as *the* text encoding standard. Internal to a language or OS the encoding doesn't matter, but any time you cross an API boundary, UTF-8 isn't the worst choice by a long shot. I do take issue with Golang using UTF-8 for the in-memory representation of strings, but there are so many subtle nuances to indexing Unicode strings that not having an O(1) equivalent to MID$ probably isn't the end of the world.
badidea
Posts: 2586
Joined: May 24, 2007 22:10
Location: The Netherlands

Re: Ansi, Utf8 & Utf16 encoding problems

Post by badidea »

caseih wrote:I do take issue with Golang using UTF-8 for the in-memory representation of strings, but there are so many subtle nuances to indexing Unicode strings that not having an O(1) equivalent to MID$ probably isn't the end of the world.
Not only Go; Rust also uses UTF-8 as its in-memory representation for strings:
https://doc.rust-lang.org/std/string/struct.String.html
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands
Contact:

Re: Ansi, Utf8 & Utf16 encoding problems

Post by marcov »

caseih wrote:
marcov wrote:Basically UTF8 is favoured because the *nixers were too lazy to rework their old junk codebases. Open Source tends to favor weak compromises rather than revolution.
That's very funny. And complete nonsense of course. UTF-8 is favored because it is simply the most logical encoding to use (at least once it was developed after Windows first adopted Unicode),
Most of the pro-UTF-8 arguments are bogus or dramatised. UTF-8 was designed as a tight document encoding for storage and transmission(*), and was never meant for general use. The reason that changed was activism by Spolsky and other people from the web-development world, undoing existing UTF-16 investments in favour of an easy ride for *nix, as long as backwards compatibility and multi-encoding usage within the same system were not that important. The dirty shortcut, so to say.
and completely backwards-compatible with ASCII.
Yes, but only for the low-ASCII part, since multi-byte encoding already starts at 128. UTF-16, by contrast, is linear for the whole BMP, including the 128..255 values that largely coincide with much-used ASCII extensions like ISO-8859-1. Yes, the unit is bigger than one byte, but the handling of individual units is actually more ASCII-like in UTF-16: many Western European languages can be processed in UTF-16 with only a simple byte-to-word widening.
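To illustrate: going from ISO-8859-1 to UTF-16 is a plain widening of every byte, while going to UTF-8 already needs a branch and variable-length output (a sketch):

Code: Select all

#include <stddef.h>

/* ISO-8859-1 -> UTF-16: every byte widens 1:1, no branches. */
void latin1_to_utf16(const unsigned char *in, size_t n, unsigned short *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i];
}

/* ISO-8859-1 -> UTF-8: bytes 0x80..0xFF become two-byte sequences. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[j++] = in[i];
        } else {
            out[j++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[j++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return j;
}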
MS wasted a lot of years on wide API calls duplicating all the narrow ones,
NT was wide nearly from the start in the early nineties, in the version that is the model for the current architecture (NT 3.x?). All the narrow APIs convert to wide. Win9x was a dodo when it arrived, a stopgap before memory became cheap enough for real systems (be it *nix or Windows); the long game was always NT.
only to now redo the narrow API calls to support UTF-8 in the latest versions of Windows 10. But as they say, hindsight is 20/20. UTF-8 didn't exist back when Windows started doing Unicode.
No redoing. They basically added UTF-8 as a system-level 1-byte encoding to the existing "A"-to-"W" conversion layer. But Windows is careful about backwards compatibility, so you have to opt in via a manifest, so that old programs that assume a legacy 8-bit encoding keep working.

Don't get me wrong, I don't care much either way, and I'm fine with UTF-8 in general. UCS-2/UTF-16's goals have not been invariant either, since the emoji cruft put characters beyond the BMP into use by people who are NOT scholars of ancient scripts.

But there is so much horribly biased spin in this discussion that I wanted to set the record straight, even though I know it is not a popular opinion.

(*) and even that is dubious since most transmission is nowadays compressed, and storage has ballooned AND has compression options.
Last edited by marcov on Feb 26, 2021 10:34, edited 2 times in total.
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands
Contact:

Re: Ansi, Utf8 & Utf16 encoding problems

Post by marcov »

caseih wrote: I forget the technical terms, but basically indexing a string in UTF-16 is no faster than UTF-8 for the same reasons.
Rule of thumb is that processing UTF-16 is slower for Western scripts, because it means touching more bytes. It is about equal for Cyrillic and Middle Eastern scripts, and the comparison flips for Far East scripts, where UTF-8 is the larger encoding.

Aside from memory-speed effects, processing UTF-16 is faster because the exceptions (surrogates) are much rarer, even though handling an exception can be more expensive. For Western scripts these two effects compete; for other languages UTF-8 is usually a net loss. But not all operations need to process surrogates fully: for plain copying it doesn't matter at all, and others (e.g. splitting, searching) only need some minor extra code paths.
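What I mean by the exceptions being rare, as a sketch (assuming well-formed input): decoding one UTF-16 code point is a single compare on the fast path, and only the surrogate case costs more.

Code: Select all

/* Sketch: decode one code point from well-formed UTF-16.
   *advance receives the number of 16-bit units consumed. */
unsigned long utf16_decode_one(const unsigned short *s, int *advance)
{
    unsigned short u = s[0];
    if (u < 0xD800 || u > 0xDFFF) { /* fast path: the entire BMP */
        *advance = 1;
        return u;
    }
    /* slow path: combine high + low surrogate into U+10000..U+10FFFF */
    *advance = 2;
    return 0x10000UL + (((unsigned long)(u - 0xD800) << 10)
                        | (unsigned long)(s[1] - 0xDC00));
}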

Nowadays it probably matters less, since the CPU-to-memory speed ratio has become higher, so a few wasted cycles are less important. That said, strings are a much smaller share of modern applications' memory, so going for UTF-8 for memory-performance reasons is also a bit overdramatised: better image compression probably frees far more memory.

But those are all reasons that apply now, not reasons that applied when the push for UTF-8 was made. In the long term any performance effects would have been negligible, so that was not a reason to push so hard for UTF-8. Which brings us back to the migratory aspects, and the fact that in the Unix world software is in constant flux and hardcore backwards compatibility is not that important, even on a relatively short five-year scale.

Parts of *nix (the kernel, which just shovels bytes) were already fairly encoding-agnostic, so UTF-8 slotted in relatively easily, and legacy concerns were dismissed. Which is how we arrived at the current situation. F*ck everyone else: those with existing UTF-16 investments, those handling legacy encodings (and the applications that rely on them), and non-Western locales. It might be a done deal now, but that doesn't make it right.
Last edited by marcov on Feb 26, 2021 10:37, edited 1 time in total.
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Ansi, Utf8 & Utf16 encoding problems

Post by jj2007 »

marcov wrote:But there is so much horribly biased spin in this discussion that I wanted to set the record straight
That's right, the Internet is full of emotional debates about the pros and cons of Utf-x, but you are tilting at windmills ;-)

As regards processing time, it rarely matters. Converting between Utf-8 and Utf-16 is obviously much faster than sending a character to the console or to a MessageBox: to display it, the system must translate the character's byte or word into a hundred or so pixels (using a font) and poke all those pixels into video memory, which is far slower.
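For scale: the conversion itself is a single API call each way, something like this sketch (the reverse direction is WideCharToMultiByte):

Code: Select all

#include <windows.h>

/* Sketch: UTF-8 -> UTF-16 in one Win32 call. Returns the number of
   wide characters written, including the terminating null. */
int Utf8ToUtf16(const char *src, wchar_t *dst, int dstUnits)
{
    return MultiByteToWideChar(CP_UTF8, 0, src, -1, dst, dstUnits);
}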

I've made another test with compound characters, i.e. those that require more than one code point. They work just fine with both Utf-8 and Utf-16. I have yet to find buggy behaviour in GUI applications like MessageBox or the RichEdit control. Given that M$ has hundreds of millions of clients in Utf-x countries like China, I would be surprised if it didn't work.

Of course, the console is another story; it works fine for Russian with the Lucida Console or Consolas fonts on my Italian OS, but to display Chinese in the console, you need some acrobatics - unless you have a Chinese Windows version, I suppose.