Can of worms: Proper unicode support in FB

For other topics related to the FreeBASIC project or its community.
caseih
Posts: 1504
Joined: Feb 26, 2007 5:32

Can of worms: Proper unicode support in FB

Postby caseih » Jul 04, 2015 2:34

Here's a can of worms: Unicode. FreeBASIC already supports wstring, which holds wide characters. However, unless I got it wrong, they aren't dynamic strings. Furthermore, they are just thin wrappers around OS-level wide character support (mainly to support the Win32 wchar-based APIs), which on Windows is UCS-2, which leads to interesting problems where some Unicode characters have to be represented by two UCS-2 code units (a surrogate pair). So chopping a string with mid() would not always do the right thing in such environments. Also, since wstring is a very thin abstraction, it contains no logic for encoding the wide string to bytes, or decoding it from bytes.
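
For example, a Windows-only sketch of the surrogate problem (assuming FB's \u escape sequences and the UCS-2/UTF-16 wstring there; this is just an illustration, not runtime behaviour I'm proposing):
[code=freebasic]
' U+1D11E (a musical symbol outside the BMP) has to be stored as a
' surrogate pair on Windows, so Len() counts 2 units for one character
' and Mid() can split the pair.
Dim w As WString * 8
w = !"\uD834\uDD1E"          ' the surrogate pair for U+1D11E
Print Len(w)                 ' should print 2, although it is one character
Dim half As WString * 8
half = Mid(w, 1, 1)          ' a lone high surrogate: not valid text
[/code]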

One strength of BASIC is dynamic strings, and it would be nice to have them just work with Unicode. It seems to me that to properly support Unicode throughout FB, dynamic strings need to accommodate one or more full Unicode encodings somehow, all string functions in the FB runtime library need to work with Unicode (not just wstring), and there needs to be runtime library support for encoding Unicode to bytes and decoding bytes into Unicode. Things get complicated, though, as the runtime needs to be able to implicitly encode Unicode to byte strings when you do things like print a string to the screen, which means the runtime has to determine what the output character encoding is (there is no end to the problems with this in Python on the Windows command console, which has no real concept of Unicode). There's also the issue of memory consumption. Unicode is just an abstraction; it requires a byte encoding to store it in memory. wstring uses UCS-2 on Windows and UTF-32 on Linux. Four bytes per character is a bit wasteful if used for all strings. Many languages like Go use UTF-8 as the internal representation, which is the most compact. However, it is not O(1) for indexing characters, so a function like mid() would have to start at the beginning of the string and count in to find the i-th character.
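
For illustration, here is a rough sketch of what that counting-in looks like for UTF-8 kept in an ordinary byte STRING (a hypothetical helper, not anything in the FB runtime):
[code=freebasic]
' Return the 1-based byte position where the n-th codepoint of a UTF-8
' encoded STRING starts, or 0 if there are fewer than n codepoints.
' It has to walk the bytes from the start: a byte begins a codepoint
' unless it is a continuation byte of the form 10xxxxxx.
Function Utf8CharOffset(ByRef s As String, ByVal n As Integer) As Integer
    Dim As Integer i = 0, count = 0
    While i < Len(s)
        If (s[i] And &hC0) <> &h80 Then   ' not a continuation byte
            count += 1
            If count = n Then Return i + 1
        End If
        i += 1
    Wend
    Return 0
End Function
[/code]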

The only language I've seen so far that seems to get Unicode right is Python 3.3+. A string is Unicode in Python 3 from the very beginning, and indexing is always O(1). But it requires some significant effort to work with at first, since you have to be careful to decode input and encode output streams. And when you want to work with bytes you have to be explicit about it. Suddenly file open modes become important again (text vs. binary).

Will Unicode ever be something that FB deals with easily? For now it seems like the best bet is to manually store Unicode strings in a UTF-8 byte encoding and manipulate them with a library like ICU, libunistring, or GLib. Seems like all the major languages are baking in support for Unicode, some better than others (JavaScript has issues!).
grindstone
Posts: 726
Joined: May 05, 2015 5:35
Location: Germany

Re: Can of worms: Proper unicode support in FB

Postby grindstone » Jul 04, 2015 8:47

Hi caseih,

Maybe this link can help you.

Regards
grindstone
marcov
Posts: 2969
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL

Re: Can of worms: Proper unicode support in FB

Postby marcov » Jul 04, 2015 9:24

caseih wrote:Here's a can of worms: Unicode. FreeBASIC already supports wstring, which holds wide characters. However, unless I got it wrong, they aren't dynamic strings.


That depends on which you implement: manual wchar_t* strings or COM BSTR strings. The latter should be refcounted inside COM.

Furthermore, they are just thin wrappers around OS-level wide character support (mainly to support the Win32 wchar-based APIs), which on Windows is UCS-2, which leads to interesting problems where some Unicode characters have to be represented by two UCS-2 code units (a surrogate pair).


Lugging around the Linux-specific OS API iconv is also no solution, because of its LGPL license (YET another DLL to ship).

So chopping a string with mid() would not always do the right thing in such environments. Also, since wstring is a very thin abstraction, it contains no logic for encoding the wide string to bytes, or decoding it from bytes.


That's a normal solution, and there is nothing wrong with it. Most operations only shuffle characters around, and if you only call MID with a character position derived from a valid substring expression, it won't be a problem.

Worse, just resolving surrogates to codepoints still doesn't solve the issue, since a codepoint is still not equal to a printable glyph (even UTF-32 isn't!). So your solution only works for Western languages, and in UTF-16 you won't quickly encounter surrogates there anyway.

In general, if you avoid string chopping with hardcoded indexes, you only need to deal with surrogates when you are actually rendering them.
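
And for that rare case, recognizing them is trivial anyway; a minimal sketch (a hypothetical helper, not part of the FB runtime):
[code=freebasic]
' A UTF-16 code unit is the first (high) half of a surrogate pair
' exactly when it lies in the range D800-DBFF.
Function IsHighSurrogate(ByVal u As UShort) As Integer
    Return (u >= &hD800) And (u <= &hDBFF)
End Function
[/code]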

And that avoids a slow, at-arm's-length API conversion wall on the outside of the system, letting it interact more freely with the world outside. Interpreters like Python that keep the world at arm's length don't care, but IMHO FB is not in that category.

One strength of BASIC is dynamic strings, and it would be nice to have them just work with Unicode.


Yes (well, of any language with a proper string type, as opposed to library-based manual pointer wrangling),

but I don't know why you would want to base such dramatic system decisions on keeping bad (and unmaintainable) code working, code that won't stand the test of time anyway.

It seems to me that to properly support Unicode throughout FB, dynamic strings need to accommodate one or more full Unicode encodings somehow,


Define "full" unicode encoding. There is *NO* Unicode encoding where a working unit (codepoint or whatever) guaranteedly corresponds with a printable glyph. This was deliberately done because the properties of the rendering medium would make choices different. (e.g. afaik in Thai characters are constructed out of parts and due to the high number of combinations there is no glyph for each one, and in other East Asian languages text might be rendered differently depending on the size of the font (more room allows for more complex glyphs)

all string functions in the FB runtime library need to work with Unicode (not just wstring), and there needs to be runtime library support for encoding Unicode to bytes and decoding bytes into Unicode.


Most systems that really abstract strings work with UTF-16, like Java and C#, to minimize these problems. Python says "ordinal", so I assume that is UTF-32, with a codepoint per 32-bit int (or dynamically parsed from UTF-16), but that is still only a codepoint, and chopping in the wrong place will still mutilate the text.

Using some index based on codepoints has two major problems:
- getting to element n means parsing elements 1..n-1, making character access O(n) and a loop over all characters O(n^2)
- codepoints are still not the hard, fundamental unit of the past that most people who are new to Unicode hope they are.

An illustration of the latter issue are Apple systems. They use a so-called decomposed (denormalized) representation, in that they store an accented letter (even a trema or a French accent) as a base letter plus a separate combining codepoint. So the string holds TWO codepoints, e followed by a combining acute accent, instead of the single precomposed é. Apple systems might throw an exception if strings are not in decomposed form, so it is safer to keep the whole system decomposed. Note this is AFAIK mostly for the ObjC frameworks, less so for the Unix subsystem.
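
A small FreeBASIC illustration of the difference (assuming \u escapes in string literals; just a sketch):
[code=freebasic]
' Decomposed form: base letter plus combining accent = two codepoints.
' Precomposed form: a single codepoint. Both should display as an accented e.
Dim decomposed As WString * 8, precomposed As WString * 8
decomposed  = !"e\u0301"     ' e + U+0301 COMBINING ACUTE ACCENT
precomposed = !"\u00E9"      ' U+00E9, the precomposed letter
Print Len(decomposed)        ' should print 2
Print Len(precomposed)       ' should print 1
[/code]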

Things get complicated, though, as the runtime needs to be able to implicitly encode Unicode to byte strings when you do things like print a string to the screen, which means the runtime has to determine what the output character encoding is (there is no end to the problems with this in Python on the Windows command console, which has no real concept of Unicode).


Run "cmd /u" and it will be unicode. IF the programs use the correct APIs.

There's also the issue of memory consumption. Unicode is just an abstraction; it requires a byte encoding to store it in memory. wstring uses UCS-2 on Windows and UTF-32 on Linux. Four bytes per character is a bit wasteful if used for all strings. Many languages like Go use UTF-8 as the internal representation, which is the most compact. However, it is not O(1) for indexing characters, so a function like mid() would have to start at the beginning of the string and count in to find the i-th character.


Neither are UTF-16 or even UTF-32, though the problem cases get increasingly rare in that order. See below for my preference.

The only language I've seen so far that seems to get Unicode right is Python 3.3+. A string is Unicode in Python 3 from the very beginning, and indexing is always O(1). But it requires some significant effort to work with at first, since you have to be careful to decode input and encode output streams. And when you want to work with bytes you have to be explicit about it.
Suddenly file open modes become important again (text vs. binary).


It is a hack that relies on the issues being rarer with UTF-32, at the cost of horrible performance problems. Probably they only did this because they had too much code lying around with sucky indexing and they faced an anti-upgrade revolt, but presumably they plan to clean it up in later versions once the legacy has been whittled away.

Will Unicode ever be something that FB deals with easily? For now it seems like the best bet is to manually store Unicode strings in a UTF-8 byte encoding and manipulate them with a library like ICU, libunistring, or GLib. Seems like all the major languages are baking in support for Unicode, some better than others (JavaScript has issues!).


Since FB is not really native in philosophy anyway, but emulates Unix on all systems, it is probably better to keep it UTF-8. But maybe it can be done in a way that allows Windows users to easily communicate with the APIs (which are nearly all UTF-16 based; the 1-byte ones are not Unicode). Manpower is also a reason for that.

Personally, long term, I would steer towards the native encoding of the system: UTF-16 on Windows as the in-application type (but of course UTF-8 for storage and textfile I/O), and UTF-8 on most Unices, with OS X being an exception. On OS X it depends on what you do: if you are more of an Apple person using the Apple APIs, you'll want UTF-16; if you live entirely within the Unix subsystem without a native GUI, then UTF-8 is maybe better.
marcov
Posts: 2969
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL

Re: Can of worms: Proper unicode support in FB

Postby marcov » Jul 04, 2015 9:47

I thought I'd better post what I would do in a different post than the previous message. The suggestion is meant as an illustration of what could be done, and as usual I take my inspiration from FPC. Not entirely in a good way, though: this same discussion has been raging there since 2009 and is still stalled. Also, while the solution is potentially good, it is incomplete and bogged down by legacy and compatibility concerns. The idea is mostly from Delphi, but there are some extensions, most of them from FPC's Jonas.

One of the key problems is the three or four encodings of Windows (OEM, ANSI, wide (UTF-16), and UTF-8 for most textfile output).

Anyway:

- Have a one-byte string type which has an additional field, the codepage. We use the Windows 16-bit codepage numbers; the Unix codepage names are not suitable as an internal format.
- Have a two-byte (UTF-16) string type, native, as in wholly FB memory managed.
- (Optional) On Windows, have a two-byte BSTR COM-managed UTF-16 type (depends on whether you want COM). You need the native one because the COM type is a tad slow.
- (Optional) A UTF-32 type for rendering, but if you have an easy-to-use (memory-managed) array type (array of int32), it is easier to simply use that. It is mainly for use during rendering, and you won't do many string operations on it.
- The default string type is opaque and can be either the one-byte or the two-byte type, depending on OS, preference, etc., or just as a long-term contingency.
- This means an effort must be made to make normal string code as abstract as possible. De-emphasize the type "char", e.g. make it assignment-incompatible with the result of the [] operator on a string, so <string>:=<string>[] rather than <char>:=<string>[].

This (assuming that one unit of the string type holds a complete character) is another common trap in the Unicode world, besides character-based indexing.

Now, the one-byte string type is special: semantically it is 16 bits' worth of string types (65536 string types), but at runtime it is one single type (so one set of helpers for all, which can be parameterized by the codepage field). The compiler can insert automatic conversions.

Some of the codepage numbers are resolved at runtime, as in their real encoding is managed by the RTS. E.g. codepage 0 is equal to the codepage stored in some variable, which on startup is set to the ANSI codepage (ACP) but can be changed afterwards. Similarly for the OEM type.

One codepage (in our case 0xFFFF) is "verbatim", and is mostly used as the type of parameters in RTS string routines. It signals the compiler not to do any conversion if the codepage of the string being passed doesn't match the codepage of the declaration. But since the codepage is a field, the helper can check the incoming type and act appropriately. This allows passing any string to general routines without forced conversions.
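
As a rough sketch of how such a tagged one-byte string could be represented (the field names and constants here are made up for illustration; this is not the actual FPC/Delphi or FB layout):
[code=freebasic]
' Illustrative only: a codepage-tagged descriptor for the one-byte string type.
Const CP_ANSI     = 0         ' resolved at runtime to the current ANSI codepage
Const CP_UTF8     = 65001     ' Windows codepage number for UTF-8
Const CP_VERBATIM = &hFFFF    ' "accept any codepage, don't auto-convert"

Type CpStringDesc
    chars    As UByte Ptr     ' the bytes of the string
    length   As Integer       ' length in bytes
    capacity As Integer       ' allocated size in bytes
    codepage As UShort        ' one of the Windows 16-bit codepage numbers
End Type
[/code]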

You might want to emulate some additional runtime-determined codepage aliases, so that if you want to set the base type to something other than the ANSI codepage, you can still use type 0 to declare the "A" Windows API functions. This is one of the things we got wrong.

Internally, you maintain some variables that indicate the codepages for various tasks (e.g. file I/O output, crt output, API calls). The RTL calls are declared with the 0xFFFF verbatim type and check the incoming codepage against the task they perform (say, in this case, console output on Windows): if they get 65001 (UTF-8) but the console output codepage is OEM, they do a manual conversion and write OEM to the console.

It is the combination of a very low-level runtime layer (which avoids having too many string types to implement) while still being able to declaratively say "this code expects this codepage", without forcing everything into one type to rule them all. The 1-byte vs 2-byte separation is maintained, though, to keep s[] cheap and to avoid requiring every routine that does any form of character access to use some directive or helper to force a particular width.

Going one step further and also putting the encoding's base unit size (1 or 2) in the type was discussed, but Delphi went a different way, so that was not explored further.

The runtime-determined codepage types are the key here. They allow changing at runtime (though in practice only at startup) what the internal codepage types mean, which allows for dramatic changes without loss of performance. E.g. you can set the console output type to ANSI when started in a normal cmd.exe and to UTF-8 for "cmd /U", in which case the conversion to the UTF-16 API is lossless.
caseih
Posts: 1504
Joined: Feb 26, 2007 5:32

Re: Can of worms: Proper unicode support in FB

Postby caseih » Jul 04, 2015 15:43

marcov wrote:That depends on which you implement: manual wchar_t* strings or COM BSTR strings. The latter should be refcounted inside COM.


COM is Windows-only.
And that avoids a slow, at-arm's-length API conversion wall on the outside of the system, letting it interact more freely with the world outside. Interpreters like Python that keep the world at arm's length don't care, but IMHO FB is not in that category.

Well, any time you are working with Unicode you have to be explicit in your communication with the outside world. Byte encodings matter when talking to another entity, since bytes are all that computers really know about.

Yes (well, of any language with a proper string type, as opposed to library-based manual pointer wrangling),
but I don't know why you would want to base such dramatic system decisions on keeping bad (and unmaintainable) code working, code that won't stand the test of time anyway.

What bad code are you talking about?

Define "full" unicode encoding. There is *NO* Unicode encoding where a working unit (codepoint or whatever) guaranteedly corresponds with a printable glyph. This was deliberately done because the properties of the rendering medium would make choices different. (e.g. afaik in Thai characters are constructed out of parts and due to the high number of combinations there is no glyph for each one, and in other East Asian languages text might be rendered differently depending on the size of the font (more room allows for more complex glyphs)

By supporting Unicode, I mean, say, Unicode 6. In other words, a dynamic string should be able to store any character (or whatever you want to call them) that has a Unicode representation, BMP or not. The programmer shouldn't have to worry about it. Just stick it in a string and be done.
Most systems that really abstract strings work with UTF-16, like Java and C#, to minimize these problems. Python says "ordinal", so I assume that is UTF-32, with a codepoint per 32-bit int (or dynamically parsed from UTF-16), but that is still only a codepoint, and chopping in the wrong place will still mutilate the text.

I believe you're correct about UTF-32, though Python compresses it: each string uses the narrowest fixed width that fits all of its characters, so a string could use one, two, or four bytes per Unicode character (or whatever we want to call them).

EDIT: Here's an example of what you are talking about:
[code=python file=unicodetest.py]>>> a="o\u0308o\u0327"
>>> print (a)
öo̧
>>> print (a[1:])

>>> print (a[0])
o
>>> print (a[1])

>>> print (a[2])
o
>>> print (a[3])

>>> print (a[1:])

>>> print (a[0:1])
o
>>> print (a[0:2])
ö
[/code]
Confusing, but working with decomposed graphemes is problematic. I don't see a way around it other than to normalize the string, converting the decomposed graphemes into their precomposed codepoints where those exist.
Run "cmd /u" and it will be unicode. IF the programs use the correct APIs.

Yes, I'm aware of this. Part of me is highly bothered that Windows, instead of choosing UTF-8 for stream I/O, chose a completely separate set of calls which use UTF-16 (and used to use UCS-2). The Win32 API is full of that sort of thing.
It is a hack that relies on the issues being rarer with UTF-32, at the cost of horrible performance problems. Probably they only did this because they had too much code lying around with sucky indexing and they faced an anti-upgrade revolt, but presumably they plan to clean it up in later versions once the legacy has been whittled away.

Having been passively involved in the Python community since before Python's flexible string representation was introduced, I can say I have no idea what you are referring to here. What is a "hack"? The FSR is clearly a good thing for Python and it is the standard going forward. Please explain more of what you're talking about. From what I've read, the FSR is a good solution, and it frees programmers from worrying about corrupting encodings by chopping strings in the wrong place.
Personally, long term, I would steer towards the native encoding of the system: UTF-16 on Windows as the in-application type (but of course UTF-8 for storage and textfile I/O), and UTF-8 on most Unices, with OS X being an exception. On OS X it depends on what you do: if you are more of an Apple person using the Apple APIs, you'll want UTF-16; if you live entirely within the Unix subsystem without a native GUI, then UTF-8 is maybe better.

Yes, this is true, especially because when calling native APIs you need encodings that work easily and quickly with them.
Last edited by caseih on Jul 04, 2015 16:19, edited 6 times in total.
caseih
Posts: 1504
Joined: Feb 26, 2007 5:32

Re: Can of worms: Proper unicode support in FB

Postby caseih » Jul 04, 2015 15:52

Although indexing a UTF-16 string is not O(1), from what I've read, by storing some additional information about the string out of band (the locations of the non-BMP surrogate pairs), it is possible to make indexing O(log k), where k is the number of non-BMP characters in the string, which is totally acceptable.
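
Something like this sketch is what I have in mind (illustrative only; the helper and the side array are hypothetical, not an existing FB facility):
[code=freebasic]
' Given a sorted array holding the character index of every non-BMP character
' in a UTF-16 string, the code-unit index of character c is c plus the number
' of non-BMP characters that occur before it, found by binary search: O(log k).
Function CharToUnitIndex(nonBmpChars() As Integer, ByVal c As Integer) As Integer
    Dim As Integer lo = LBound(nonBmpChars)
    Dim As Integer hi = UBound(nonBmpChars) + 1
    Dim As Integer m
    While lo < hi                      ' lower-bound binary search for c
        m = (lo + hi) \ 2
        If nonBmpChars(m) < c Then
            lo = m + 1
        Else
            hi = m
        End If
    Wend
    ' lo = number of surrogate pairs that start before character c;
    ' each of them shifts the code-unit position right by one.
    Return c + lo
End Function
[/code]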
marcov
Posts: 2969
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL

Re: Can of worms: Proper unicode support in FB

Postby marcov » Jul 04, 2015 18:22

caseih wrote:Although indexing a UTF-16 string is not O(1), from what I've read, by storing some additional information about the string out of band (the locations of the non-BMP surrogate pairs), it is possible to make indexing O(log k), where k is the number of non-BMP characters in the string, which is totally acceptable.


Only if you exclusively use very long strings: any more complex access mechanism, while better for large N, is usually worse for short strings, especially since s[] is normally a single assembler instruction. A few more instructions and you are already orders of magnitude slower.

A lot of strings are only up to 30 bytes.
