Basic programmer will die?

caseih · Post by **caseih** » Feb 10, 2025 0:14

Okay. I looked at hello_UTF8.bas. It's kind of a misnamed file. Although it does show that you can store UTF-8 bytes in a string and print them out to a UTF-8-supporting terminal, it has nothing to do with using and manipulating Unicode or UTF-8 from within FB, so my comments were accurate.

Berkeley · Post by **Berkeley** » Feb 10, 2025 21:07

caseih wrote: ↑Feb 10, 2025 0:14 Okay. I looked at hello_UTF8.bas. It's kind of a misnamed file. Although it does show that you can store UTF-8 bytes in a string and print them out to a UTF-8-supporting terminal, it has nothing to do with using and manipulating Unicode or UTF-8 from within FB, so my comments were accurate.

See https://www.freebasic.net/forum/viewtopic.php?t=32807 if you need it. Most string functions work with UTF-8, but of course there are special cases, like getting a specific character (which isn't 1 byte in size) or converting lower case to upper case, etc.

caseih · Post by **caseih** » Feb 11, 2025 0:53

Special cases? All string manipulation requires being able to slice a string on the codepoint boundaries. Most string functions will run without runtime errors on UTF-8-encoded strings, but they certainly won't actually work right, since they know nothing about the UTF-8 encoding. MID$("台灣係一個獨立、民主嘅國家",5,3) will just get you mojibake. FB has zero support for UTF-8 in the built-in runtime library at the present time. And I would not recommend using any FB string manipulation statements on UTF-8 bytes. Use a third-party library.

Berkeley · Post by **Berkeley** » Feb 11, 2025 22:22

caseih wrote: ↑Feb 11, 2025 0:53 Special cases? All string manipulation requires being able to slice a string on the codepoint boundaries.

And this will normally happen. A "Chinese word" is a complete UTF-8 code, so you usually won't have to worry about a code getting cut in the middle. One issue is that a character might count as 3 "letters": the string length in characters is wrong, because you are dealing with the size in bytes. But of course it's also like juggling with china if you use the FreeBASIC functions, e.g. when replacing a word of 5 letters with another one: one word might be longer than the other in bytes, not 5 bytes each...
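The gap between "letters" and bytes can be made concrete. This is a minimal sketch in Python rather than FB, chosen only because Python keeps text and bytes as separate types; the sample words and the helper name `sizes` are made up for illustration:

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (code points, bytes when encoded as UTF-8) for the text."""
    return len(text), len(text.encode("utf-8"))

# Two 5-letter words: same length in characters, different length in bytes.
print(sizes("B\u00e4ren"))   # (5, 6)  "Bären": the umlaut takes 2 bytes
print(sizes("bears"))        # (5, 5)
print(sizes("台灣"))          # (2, 6)  each ideograph takes 3 bytes
```

A byte-oriented LEN would report the second number of each pair, which is why a fixed "replace 5 bytes" only lines up for pure ASCII words.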

caseih · Post by **caseih** » Feb 15, 2025 17:32

When you encode a Unicode string into UTF-8, each code point takes up between 1 and 4 bytes. In other words, it's a variable-length encoding, and MID is not aware of this:

Code: Select all

$ cat test.bas                                        
print mid$("台灣係一個獨立、民主嘅國家",5,3)

$ file test.bas                                       
test.bas: Unicode text, UTF-8 text
$ /opt/FreeBASIC/bin/fbc test.bas
$ ./test                                              
���

Mojibake.
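The three broken characters can be reproduced by slicing the raw bytes. This is a sketch in Python, not FB, and it assumes (as the output suggests) that MID$ simply took bytes 5 to 7 of the stored string; `byte_mid` is a made-up helper, not an FB function:

```python
text = "台灣係一個獨立、民主嘅國家"
raw = text.encode("utf-8")   # the same bytes that sit between the quotes in test.bas

def byte_mid(data: bytes, start: int, length: int) -> bytes:
    """Byte-wise equivalent of MID$ (1-based start) that ignores the encoding."""
    return data[start - 1:start - 1 + length]

piece = byte_mid(raw, 5, 3)
print(piece.hex(" "))                           # 81 a3 e4
print(piece.decode("utf-8", errors="replace"))  # three U+FFFD replacement characters
```

The slice holds the last two bytes of 灣 (e7 81 a3) and the first byte of 係 (e4 bf 82), so none of the three bytes forms a valid sequence on its own.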

Code: Select all

$ hexdump -C test.bas
00000000  70 72 69 6e 74 20 6d 69  64 24 28 22 e5 8f b0 e7  |print mid$("....|
00000010  81 a3 e4 bf 82 e4 b8 80  e5 80 8b e7 8d a8 e7 ab  |................|
00000020  8b e3 80 81 e6 b0 91 e4  b8 bb e5 98 85 e5 9c 8b  |................|
00000030  e5 ae b6 22 2c 35 2c 33  29 0a 0a                 |...",5,3)..|
0000003b

MID$ simply does not work on UTF-8 because FB doesn't have any UTF-8 support. This is not a criticism of FB. C's standard string functions are not UTF-8-aware either. If you work with UTF-8 you will need to use third-party libraries to decode from and encode to UTF-8 byte streams, and to slice up UTF-8-encoded byte streams properly.
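The decode, slice, encode round trip that such a library has to perform looks roughly like this. It is a sketch in Python (where the decoder is built in), not an FB solution, and `utf8_mid` is a hypothetical name:

```python
def utf8_mid(data: bytes, start: int, length: int) -> bytes:
    """MID$-like slice counted in code points (1-based start) on UTF-8 bytes."""
    text = data.decode("utf-8")                 # bytes -> code points
    piece = text[start - 1:start - 1 + length]  # slice on code point boundaries
    return piece.encode("utf-8")                # code points -> bytes

raw = "台灣係一個獨立、民主嘅國家".encode("utf-8")
print(utf8_mid(raw, 5, 3).decode("utf-8"))      # 個獨立
```

The slice is still counted in code points only, so it does nothing about combining sequences.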

Unicode is very complex and difficult to deal with properly. In the MID$ example, one might naively think that slicing out three characters starting at position 5 is easy with wide strings. But Unicode has other concepts that make that hard, such as non-printing code points, and code points that change the output depending on what they are paired with (accents and such).
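A small illustration of the pairing problem, again as a Python sketch: the sample word is arbitrary, and normalizing to NFC is just one possible mitigation that only helps where a precomposed character exists.

```python
import unicodedata

decomposed = "e\u0301tude"   # 'e' + COMBINING ACUTE ACCENT, then "tude"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # 6 5
print(decomposed[:1] == "e")            # True: the slice lost the accent
print(composed[:1] == "\u00e9")         # True: 'é' is a single code point here
```

Both strings display as the same five visible characters, yet taking "the first character" by code point gives different results.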

There is some support for UTF-8 in the FB compiler, but I'm not clear on exactly where the decoding happens. Here's an example using WSTRING, where MID$ works (keeping in mind the previous caveat about code point pairing):

Code: Select all

$ cat test.bas
dim as wstring*30 mystring

mystring = "台灣係一個獨立、民主嘅國家" 'fbc is decoding this at compile time?

print mid$(mystring,5,3)

$ /opt/FreeBASIC/bin/fbc test.bas
$ ./test
個獨立

Löwenherz · Post by **Löwenherz** » Feb 15, 2025 19:37

Code: Select all

' 1 example ----------------------- //
#ifdef __FB_WIN32__
# define unicode
# include once "windows.bi"
#endif

const LANG = "Chinese"
dim helloworld as wstring * 20 => " 你好世界"
	messagebox( 0, helloworld, """Hello World!"" in " & LANG & ":", MB_OK )

' 2 example ----------------------- //
#define UNICODE
#INCLUDE ONCE "Afx/CWindow.inc"
USING Afx
DIM cws AS CWSTR = "  你好世界 " ' Chinese text, translates to "hello world"
AfxMsg MID(cws, 2)  ' whole translation
'AfxMsg MID(cws, 5,3) ' only one word

caseih · Post by **caseih** » Feb 15, 2025 19:48

It's probably worth knowing that on Windows, wide characters are all just 2 bytes long. Years ago, when Windows first supported Unicode in win32, that was enough (Unicode was only 16 bits back then). The 2-byte encoding was called UCS-2. Unicode has since grown to a 21-bit code space (the extension beyond 16 bits came with Unicode 2.0 in 1996), so 16 bits is no longer enough. Along the way, Microsoft changed from using UCS-2 to using UTF-16 in their wide string structures. Unfortunately this means that naive implementations of string routines won't always work. Thus even with WSTRING, MID will still fail under some circumstances on win32, namely when a character needs two 16-bit units (but not on Linux, where WSTRING is 32-bit).
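To show which characters trip up 16-bit wide strings, here is a Python sketch that lists the UTF-16 code units of two sample characters. `utf16_units` is a made-up helper, and the code only models the encoding, not WSTRING itself:

```python
import struct

def utf16_units(text: str) -> list[int]:
    """16-bit code units of the text in UTF-16, little-endian."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

bmp = "台"              # U+53F0, fits in 16 bits
astral = "\U00020000"   # U+20000, a CJK Extension B ideograph beyond 16 bits

print([hex(u) for u in utf16_units(bmp)])      # ['0x53f0']
print([hex(u) for u in utf16_units(astral)])   # ['0xd840', '0xdc00']
print(len(astral.encode("utf-32-le")) // 4)    # 1
```

A routine that counts or slices by 16-bit unit sees the second character as two "characters" and can cut between the two halves; with 32-bit units it stays whole.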

The future of all operating systems is UTF-8 now. Even MS is migrating their API to that encoding. Makes a lot of sense. It's compact, and the scheme leaves room for future expansion: the original design allowed sequences of up to 6 bytes, although the current standard caps them at 4. The downside to UTF-8 is that any indexing operation is now O(n), whereas with fixed-width representations it would be O(1).
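The O(n) cost comes from having to walk the bytes to find where the n-th code point starts. A minimal Python sketch, assuming valid UTF-8 input; `byte_offset` is a hypothetical helper:

```python
def byte_offset(data: bytes, index: int) -> int:
    """Byte offset of the code point at 0-based `index` in valid UTF-8 data.

    Walks from the start and counts lead bytes, so the cost grows with index.
    """
    seen = 0
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:   # not a continuation byte: a code point starts here
            if seen == index:
                return pos
            seen += 1
    raise IndexError("code point index out of range")

raw = "a台b灣".encode("utf-8")
print([byte_offset(raw, i) for i in range(4)])   # [0, 1, 4, 5]
```

With a fixed-width encoding the same lookup is just `index * unit_size`.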

Josep Roca · Post by **Josep Roca** » Feb 16, 2025 5:34

> Even MS is migrating their API to that encoding.

No, it is not. It is just using a trick. The "A" API functions are simple wrappers that convert ANSI strings to UTF-16 and then call the "W" functions. The trick makes them use CP_UTF8 instead of CP_ACP in the ANSI to UTF-16 conversions done with MultiByteToWideChar and WideCharToMultiByte.

See: https://learn.microsoft.com/en-us/windo ... -code-page
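The effect of swapping the code page in that conversion step can be modelled in a few lines. This is only a Python sketch of the idea, not the Windows implementation: `narrow_to_wide` is a hypothetical stand-in for an "A" wrapper, and CP_ACP is assumed to be Windows-1252, as on a typical Western European setup.

```python
def narrow_to_wide(narrow: bytes, code_page: str) -> bytes:
    """Model of an "A" wrapper: decode with a code page, pass UTF-16 on."""
    return narrow.decode(code_page).encode("utf-16-le")

narrow = "\u00e9".encode("utf-8")   # 'é' as UTF-8: b'\xc3\xa9'

as_acp = narrow_to_wide(narrow, "cp1252")   # bytes read as Windows-1252
as_utf8 = narrow_to_wide(narrow, "utf-8")   # bytes read as UTF-8

print(as_acp.decode("utf-16-le"))    # Ã©
print(as_utf8.decode("utf-16-le"))   # é
```

Either way the "W" side receives UTF-16; the setting only changes how the narrow bytes are interpreted first.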

caseih · Post by **caseih** » Feb 16, 2025 16:24

Sorry for the nerding out about Unicode. Sure. You can call it a trick if you want. But it doesn't change the fact that UTF-8 is the universal text data interchange standard and Unicode encoding scheme now, and win32 is moving to support it across its narrow API. How it does that behind the scenes is less important.

Please note I am not arguing that FB should necessarily adopt UTF-8 internally for Unicode strings, although that is a possibility some day. My main point was that one cannot directly manipulate UTF-8-encoded strings with FB's standard runtime library without third-party libraries to help.

Berkeley · Post by **Berkeley** » Feb 16, 2025 20:46

caseih wrote: ↑Feb 15, 2025 17:32 MID is not aware of this:

If you want 3 Unicode letters of a string, you'll get crap. If you want to cut a word, e.g. up to the next space character found (by INSTR), it works as intended. So you might possibly never hit an error with UTF-8 except in the output functions. But that's very unlikely: code very often assumes that +1 or -1 is the character after or before the current position. A UTF-8-safe function has to check whether the addressed byte is in the middle of a UTF-8 sequence, or rather ensure that it's ASCII (0-127) or the starting byte of a UTF-8 sequence.
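Both halves of this can be sketched briefly. The following is Python operating on raw bytes, not FB; `is_boundary` and `next_boundary` are made-up names, and valid UTF-8 input is assumed:

```python
def is_boundary(data: bytes, pos: int) -> bool:
    """True if pos is the end of data, an ASCII byte, or the lead byte of a sequence."""
    return pos == len(data) or (data[pos] & 0xC0) != 0x80

def next_boundary(data: bytes, pos: int) -> int:
    """Start of the code point that follows the one starting at pos."""
    pos += 1
    while not is_boundary(data, pos):
        pos += 1
    return pos

raw = "台灣 Taiwan".encode("utf-8")
space = raw.find(b" ")                # plain byte search, like INSTR
print(raw[:space].decode("utf-8"))    # 台灣
print(is_boundary(raw, 1), next_boundary(raw, 0))   # False 3
```

The byte search for a space is safe because bytes 0 to 127 never occur inside a multi-byte UTF-8 sequence; the "+1" step is the part that needs the boundary check.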
