Basic programmer will die?

caseih
Posts: 2197
Joined: Feb 26, 2007 5:32

Re: Basic programmer will die?

Post by caseih »

Okay. I looked at hello_UTF8.bas. It's kind of a misnamed file. Although it does show that you can store UTF-8 bytes in a string and print them out to a UTF-8-supporting terminal, it has nothing to do with using and manipulating Unicode or UTF-8 from within FB, so my comments were accurate.
Berkeley
Posts: 115
Joined: Jun 08, 2024 15:03

Re: Basic programmer will die?

Post by Berkeley »

caseih wrote: Feb 10, 2025 0:14 Okay. I looked at hello_UTF8.bas. It's kind of a misnamed file. Although it does show that you can store UTF-8 bytes in a string and print them out to a UTF-8-supporting terminal, it has nothing to do with using and manipulating Unicode or UTF-8 from within FB, so my comments were accurate.
See https://www.freebasic.net/forum/viewtopic.php?t=32807 if you need it. Most string functions work with UTF-8, but there are special cases of course, like getting a specific character (which isn't 1 byte in size) or converting lower case to upper case, etc.
caseih
Posts: 2197
Joined: Feb 26, 2007 5:32

Re: Basic programmer will die?

Post by caseih »

Special cases? All string manipulation requires being able to slice a string on the codepoint boundaries. Most string functions will run without runtime errors on UTF-8-encoded strings, but they certainly won't work correctly, since they know nothing about the UTF-8 encoding. MID$("台灣係一個獨立、民主嘅國家",5,3) will just get you mojibake. FB has zero support for UTF-8 in the built-in runtime library at the present time. And I would not recommend using any FB string manipulation statements on UTF-8 bytes. Use a third-party library.
Berkeley
Posts: 115
Joined: Jun 08, 2024 15:03

Re: Basic programmer will die?

Post by Berkeley »

caseih wrote: Feb 11, 2025 0:53 Special cases? All string manipulation requires being able to slice a string on the codepoint boundaries.
And this will normally happen. If you have a "Chinese word", it's a complete UTF-8 sequence. You won't have to worry that a sequence gets cut in the middle. One issue is that a character might count as 3 "letters": the string length in characters is wrong, because you are dealing with the size in bytes. But it's also a delicate juggling act if you use the FreeBASIC functions, of course, e.g. when replacing a word with 5 letters: one word might be longer than the other, and they are not both 5 bytes...
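
The bytes-versus-characters mismatch described here can be made concrete with a small sketch. `utf8_len` is a hypothetical helper, not part of the FB runtime; it assumes the string holds valid UTF-8 and that the source file is saved as UTF-8 without a BOM, so the literal is stored as raw bytes.

```freebasic
' utf8_len is a hypothetical helper, not an FB built-in.
' It counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
' Assumption: s holds valid UTF-8.
function utf8_len(byref s as string) as integer
    dim as integer n = 0
    for i as integer = 0 to len(s) - 1
        if (s[i] and &hC0) <> &h80 then n += 1
    next
    return n
end function

' Assumption: this file is saved as UTF-8 without a BOM.
dim as string s = "台灣係一個獨立"
print len(s)       ' size in bytes: 21 (7 characters, 3 bytes each)
print utf8_len(s)  ' number of code points: 7
```

LEN reports storage size, so any arithmetic that treats its result as a character count will be off for non-ASCII text.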
caseih
Posts: 2197
Joined: Feb 26, 2007 5:32

Re: Basic programmer will die?

Post by caseih »

When you encode a Unicode string into UTF-8, each code point takes up between 1 and 4 bytes. In other words, it's a variable-length encoding, and MID is not aware of this:

Code:

$ cat test.bas                                        
print mid$("台灣係一個獨立、民主嘅國家",5,3)

$ file test.bas                                       
test.bas: Unicode text, UTF-8 text
$ /opt/FreeBASIC/bin/fbc test.bas
$ ./test                                              
���
Mojibake.

Code:

$ hexdump -C test.bas
00000000  70 72 69 6e 74 20 6d 69  64 24 28 22 e5 8f b0 e7  |print mid$("....|
00000010  81 a3 e4 bf 82 e4 b8 80  e5 80 8b e7 8d a8 e7 ab  |................|
00000020  8b e3 80 81 e6 b0 91 e4  b8 bb e5 98 85 e5 9c 8b  |................|
00000030  e5 ae b6 22 2c 35 2c 33  29 0a 0a                 |...",5,3)..|
0000003b
MID$ simply does not work on UTF-8 because FB doesn't have any UTF-8 support. This is not a criticism of FB. The C standard library's string functions are just as unaware of UTF-8. If you work with UTF-8 you will need to use third-party libraries to decode from and encode to UTF-8 byte streams, and to slice up UTF-8-encoded byte streams properly.
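
To illustrate what "slicing properly" involves, here is a minimal sketch of a code-point-aware MID. `utf8_mid` is a hypothetical helper, not an FB built-in and not a substitute for a real library: it assumes valid UTF-8 input and ignores combining sequences entirely.

```freebasic
' utf8_mid is a hypothetical helper, not part of the FB runtime.
' start is 1-based and counts code points, not bytes.
' Assumption: s holds valid UTF-8.
function utf8_mid(byref s as string, byval start as integer, byval count as integer) as string
    if start < 1 orelse count < 1 then return ""
    dim as integer cp = 0          ' code points seen so far
    dim as integer first = -1      ' byte index where the slice begins
    dim as integer last = len(s)   ' byte index just past the slice
    for i as integer = 0 to len(s) - 1
        if (s[i] and &hC0) <> &h80 then   ' ASCII or lead byte: a new code point starts
            cp += 1
            if cp = start then first = i
            if cp = start + count then
                last = i
                exit for
            end if
        end if
    next
    if first < 0 then return ""    ' start lies past the end of the string
    return mid(s, first + 1, last - first)
end function

' Assumption: this file is saved as UTF-8 without a BOM, like test.bas above.
dim as string text = "台灣係一個獨立、民主嘅國家"
print utf8_mid(text, 5, 3)
```

The function has to scan from the start of the string to find the fifth code point, which is the price of a variable-length encoding.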

Unicode is very complex and difficult to deal with properly. In the MID$ example, one might naively think that slicing out three characters starting at position 5 is easy with wide strings. But Unicode has other concepts that make that hard, such as non-printing code points, and code points that change the output depending on their pairing (accents and such).

There is some support for UTF-8 in the fb compiler, but I'm not clear on exactly where the decoding happens. Here's an example where using WSTRING with MID$ works (keeping in mind the previous caveat about code point pairing):

Code:

$ cat test.bas
dim as wstring*30 mystring

mystring = "台灣係一個獨立、民主嘅國家" 'fbc is decoding this at compile time?

print mid$(mystring,5,3)

$ /opt/FreeBASIC/bin/fbc test.bas
$ ./test
個獨立
Last edited by caseih on Feb 15, 2025 19:39, edited 1 time in total.
Löwenherz
Posts: 278
Joined: Aug 27, 2008 6:26
Location: Bad Sooden-Allendorf, Germany

Re: Basic programmer will die?

Post by Löwenherz »

Code:

' 1 example ----------------------- //
#ifdef __FB_WIN32__
# define unicode
# include once "windows.bi"
#endif

const LANG = "Chinese"
dim helloworld as wstring * 20 => " 你好世界"
	messagebox( 0, helloworld, """Hello World!"" in " & LANG & ":", MB_OK )

' 2 example ----------------------- //
#define UNICODE
#INCLUDE ONCE "Afx/CWindow.inc"
USING Afx
DIM cws AS CWSTR = "  你好世界 " ' Chinese text, translates to "hello world"
AfxMsg MID(cws, 2)  ' whole translation
'AfxMsg MID(cws, 5,3) ' only one word
caseih
Posts: 2197
Joined: Feb 26, 2007 5:32

Re: Basic programmer will die?

Post by caseih »

It's probably worth knowing that on Windows, wide characters are just 2 bytes long. Years ago, when Windows first supported Unicode in win32, that was enough (Unicode was only 16 bits back then). The 2-byte encoding was called UCS-2. Unicode has since grown to a 21-bit code space (the expansion began with Unicode 2.0 in 1996), so 16 bits is no longer enough. Along the way, Microsoft changed from UCS-2 to UTF-16 in their wide string structures, which encodes code points above U+FFFF as a pair of 16-bit surrogates. Unfortunately this means that naive implementations of string routines won't always work. Thus even with WSTRING, MID can still fail on win32 when it splits a surrogate pair (but not on Linux, where WSTRING is 32-bit).
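
A short sketch of the surrogate arithmetic may help show why one code point can occupy two 16-bit units. `to_surrogates` is a hypothetical helper written for illustration; it assumes the input is above U+FFFF and does no range checking.

```freebasic
' to_surrogates is a hypothetical helper, not part of the FB runtime.
' Assumption: cp is in the range U+10000..U+10FFFF.
sub to_surrogates(byval cp as ulong, byref lead as ushort, byref trail as ushort)
    dim as ulong v = cp - &h10000                 ' 20-bit offset past the 16-bit range
    lead  = cushort(&hD800 + (v shr 10))          ' high surrogate: top 10 bits
    trail = cushort(&hDC00 + (v and &h3FF))       ' low surrogate: bottom 10 bits
end sub

dim as ushort lead, trail
to_surrogates(&h1F600, lead, trail)               ' U+1F600, an emoji outside the 16-bit range
print hex(lead); " "; hex(trail)                  ' prints D83D DE00
```

A routine that counts or slices by 16-bit unit treats those two values as two characters, and cutting between them leaves an unpaired surrogate.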

The future of all operating systems is UTF-8 now. Even MS is migrating their API to that encoding. Makes a lot of sense. It's compact, and the original design left room for sequences of up to 6 bytes, although the current standard (RFC 3629) limits a code point to 4 bytes. The downside to UTF-8 is that indexing by code point is now O(n), whereas with a fixed-width representation it would be O(1).
Josep Roca
Posts: 615
Joined: Sep 27, 2016 18:20
Location: Valencia, Spain

Re: Basic programmer will die?

Post by Josep Roca »

> Even MS is migrating their API to that encoding.

No, it is not. It is just using a trick: the "A" API functions, which are simple wrappers that convert ANSI strings to UTF-16 and then call the "W" functions, are made to use CP_UTF8 instead of CP_ACP in the ANSI to UTF-16 conversions done with MultiByteToWideChar and WideCharToMultiByte.

See: https://learn.microsoft.com/en-us/windo ... -code-page
caseih
Posts: 2197
Joined: Feb 26, 2007 5:32

Re: Basic programmer will die?

Post by caseih »

Sorry for nerding out about Unicode. Sure. You can call it a trick if you want. But it doesn't change the fact that UTF-8 is the universal text data interchange standard and Unicode encoding scheme now, and win32 is moving to support it across its narrow API. How it does that behind the scenes is less important.

Please note I am not arguing that FB should necessarily adopt UTF-8 internally for Unicode strings, although that is a possibility some day. My main point was that one cannot directly manipulate UTF-8-encoded strings with FB's standard runtime library without third-party libraries to help.
Berkeley
Posts: 115
Joined: Jun 08, 2024 15:03

Re: Basic programmer will die?

Post by Berkeley »

caseih wrote: Feb 15, 2025 17:32 MID is not aware of this:
If you want 3 Unicode letters of a string, you'll get garbage. If you want to cut a word, e.g. up to the next space character found by INSTR, it works as intended. So you might not hit any error with UTF-8 except in the output functions, but that's very unlikely: code very often assumes that +1 or -1 is the character after or before the current position. A UTF-8-safe function has to check whether the addressed byte is inside a UTF-8 sequence, that is, ensure that it is either ASCII (0-127) or the starting byte of a UTF-8 sequence.
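
The byte check described above could look roughly like this. Both functions are hypothetical helpers, not part of the FB runtime, and they assume the string holds valid UTF-8.

```freebasic
' Hypothetical helpers, not FB built-ins. Assumption: s holds valid UTF-8.

' True if byte index i (0-based) is ASCII or the first byte of a multi-byte
' sequence, i.e. not a continuation byte (bit pattern 10xxxxxx).
function utf8_is_boundary(byref s as string, byval i as integer) as boolean
    return cbool((s[i] and &hC0) <> &h80)
end function

' Advance from byte index i to the byte index where the next code point starts.
function utf8_next(byref s as string, byval i as integer) as integer
    i += 1
    while i < len(s) andalso (s[i] and &hC0) = &h80
        i += 1
    wend
    return i
end function

' Assumption: this file is saved as UTF-8 without a BOM.
dim as string sample = "a台b"          ' bytes: 61 E5 8F B0 62
print utf8_next(sample, 0)             ' 1: "a" is a single byte
print utf8_next(sample, 1)             ' 4: skips the two continuation bytes of "台"
print utf8_is_boundary(sample, 2)      ' false: index 2 is inside "台"
```

Searching for an ASCII delimiter with INSTR stays safe because every byte of a multi-byte UTF-8 sequence has its high bit set, so it can never be mistaken for a space.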