[SOLVED] Pass UTF-8 strings

3oheicrw · Post by **3oheicrw** » Apr 02, 2022 9:38

The C library I'm using expects strings to be UTF-8. In C that is simply const char*, with no extra effort. I tried using it from FreeBASIC and it doesn't work: the displayed text is blank. When I remove the Unicode characters, it displays properly.

I passed it directly like this: test_C_function("test 中国語"), which doesn't work. Removing the Unicode part, so that it reads test_C_function("test"), makes it work.

Vortex · Post by **Vortex** » Apr 02, 2022 10:30

Hello,

You might like to check this project, the UTF-8 Variable Length String Library:

viewtopic.php?t=26170

3oheicrw · Post by **3oheicrw** » Apr 02, 2022 11:45

Vortex wrote: ↑Apr 02, 2022 10:30 Hello,

You might like to check this project, the UTF-8 Variable Length String Library:

viewtopic.php?t=26170

I could dim it as ustring, but the problem is how to pass it to the C library. The C library knows nothing about ustring. It only speaks const char*.

TJF · Post by **TJF** » Apr 02, 2022 12:20

const char* translates to CONST ZSTRING PTR

Example

Code: Select all

test_C_function(@"test 中国語")

VAR c_str = @"test 中国語" '' creates a ZSTRING PTR
test_C_function(c_str)
?LEFT(*c_str, 4)
?MID(*c_str, 6)

caseih · Post by **caseih** » Apr 02, 2022 13:56

3oheicrw wrote: ↑Apr 02, 2022 9:38 The C library I'm using expects strings to be UTF-8. In C that is simply const char*, with no extra effort. I tried using it from FreeBASIC and it doesn't work: the displayed text is blank. When I remove the Unicode characters, it displays properly.

I passed it directly like this: test_C_function("test 中国語"), which doesn't work. Removing the Unicode part, so that it reads test_C_function("test"), makes it work.

The reason your quoted string literal didn't work is that on Windows that string (and your source code file itself) is UCS-2 encoded, which your C function does not understand. You must *encode* your FB unicode string into bytes (zstring). FB ships with a number of functions to assist in this. There's an include file called "utf_conv.bi". You can use WCharToUTF() to do the conversion for you.

Of course it gets a bit confusing because if you make sure your source code file is stored in UTF-8 (which you can do even on Windows), then the string literals will already be UTF-8 encoded.
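To make the *encode* step concrete, here is a minimal sketch in C of what turning a UTF-16 string into UTF-8 bytes involves. It is not the implementation behind WCharToUTF(), whose signature is not shown in this thread; the names utf8_encode and utf16_to_utf8 are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if cp is invalid. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not code points on their own */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: convert a NUL-terminated UTF-16 string into a
   NUL-terminated UTF-8 string. Returns the number of bytes written, not
   counting the terminator, or (size_t)-1 on malformed input or when dst
   is too small. */
size_t utf16_to_utf8(const uint16_t *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    assert(src != NULL);
    while (*src != 0) {
        uint32_t cp = *src++;
        unsigned char buf[4];
        size_t len, i;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: it must be followed by a low surrogate. */
            uint32_t lo = *src;
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            src++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1; /* unpaired low surrogate */
        }

        len = utf8_encode(cp, buf);
        if (len == 0 || n + len >= dst_size)
            return (size_t)-1; /* keep one byte free for the terminator */
        for (i = 0; i < len; i++)
            dst[n++] = (char)buf[i];
    }
    if (n >= dst_size)
        return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

For "test 中国語" (U+4E2D, U+56FD, U+8A9E after the ASCII part) this produces 14 bytes ending in e4 b8 ad e5 9b bd e8 aa 9e, which is the byte sequence a const char* API expecting UTF-8 needs to receive.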

TJF · Post by **TJF** » Apr 02, 2022 14:06

caseih wrote: ↑Apr 02, 2022 13:56 Of course it gets a bit confusing because if you make sure your source code file is stored in UTF-8 (which you can do even on Windows), then the string literals will already be UTF-8 encoded.

Encoding the FB source in UTF-8 is mandatory for using that C-lib. Every serious IDE should do this by default, e.g. Geany.

3oheicrw · Post by **3oheicrw** » Apr 02, 2022 16:02

caseih wrote: ↑Apr 02, 2022 13:56 The reason your quoted string literal didn't work is that on Windows that string (and your source code file itself) is UCS-2 encoded, which your C function does not understand. You must *encode* your FB unicode string into bytes (zstring). FB ships with a number of functions to assist in this. There's an include file called "utf_conv.bi". You can use WCharToUTF() to do the conversion for you.

Of course it gets a bit confusing because if you make sure your source code file is stored in UTF-8 (which you can do even on Windows), then the string literals will already be UTF-8 encoded.

My source file is already UTF-8. I don't know how to *encode* the string as you said; please give me an example.

@TJF I followed your instructions but it doesn't work, whether I pass the string directly with @ or pass the var holding it.

3oheicrw · Post by **3oheicrw** » Apr 02, 2022 16:29

Vortex wrote: ↑Apr 02, 2022 10:30 Hello,

You might like to check this project, the UTF-8 Variable Length String Library:

viewtopic.php?t=26170

I tried passing a var holding a ustring to the function. It compiled, but now instead of blank text the text is full of ?; only the ASCII part is displayed correctly.

TJF · Post by **TJF** » Apr 02, 2022 17:18

3oheicrw wrote: ↑Apr 02, 2022 16:02 My source file is already UTF-8.

How do you prove that? When you save my code in UTF-8 encoding, it will work.

Here's the HEX output:

Code: Select all

0x00000000: 74 65 73 74 5f 43 5f 66 75 6e 63 74 69 6f 6e 28 40 22 74 65 73 74 20 e4 b8 ad e5 9b bd e8 aa 9e  test_C_function(@"test .........
0x00000020: 22 29 0a 0a 56 41 52 20 63 5f 73 74 72 20 3d 20 40 22 74 65 73 74 20 e4 b8 ad e5 9b bd e8 aa 9e  ")..VAR c_str = @"test .........
0x00000040: 22 20 27 27 20 63 72 65 61 74 65 73 20 61 20 5a 53 54 52 49 4e 47 20 50 54 52 0a 74 65 73 74 5f  " '' creates a ZSTRING PTR.test_
0x00000060: 43 5f 66 75 6e 63 74 69 6f 6e 28 63 5f 73 74 72 29 0a 3f 4c 45 46 54 28 2a 63 5f 73 74 72 2c 20  C_function(c_str).?LEFT(*c_str, 
0x00000080: 34 29 0a 3f 4d 49 44 28 2a 63 5f 73 74 72 2c 20 36 29 0a                                         4).?MID(*c_str, 6).
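The same byte-level check can be made on the C side, at the point where the string is actually received. The following is a minimal sketch, assuming you can add a debug call inside or around the receiving function; hex_bytes is a hypothetical helper name, not part of any library mentioned in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical debug helper: write the bytes of the NUL-terminated string s
   into out as space-separated lowercase hex, e.g. "e4 b8 ad".
   Returns 0 on success, -1 if out is too small. */
int hex_bytes(const char *s, char *out, size_t out_size)
{
    size_t len, need, i;

    assert(s != NULL && out != NULL);
    len = strlen(s);
    /* Each byte takes "xx " (3 chars); the last space becomes the NUL. */
    need = (len == 0) ? 1 : len * 3;
    if (out_size < need)
        return -1;

    out[0] = '\0';
    for (i = 0; i < len; i++) {
        const char *fmt = (i + 1 < len) ? "%02x " : "%02x";
        snprintf(out + i * 3, out_size - i * 3, fmt, (unsigned char)s[i]);
    }
    return 0;
}
```

If the argument really is UTF-8, the three characters 中国語 show up as e4 b8 ad e5 9b bd e8 aa 9e, matching the dump above. A run of 3f bytes would instead mean the characters had already been replaced by ? before the call.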

3oheicrw · Post by **3oheicrw** » Apr 02, 2022 17:32

@TJF My source file is UTF-8. Yours seems to be UTF-8 without BOM. I have EF BB BF at the beginning of my source file in HEX. Could you try saving your source as UTF-8 with BOM, like mine, and test whether it still works? My text editor doesn't support UTF-8 without BOM; it always adds EF BB BF at the beginning of a text file.

Edit: your code doesn't work, it only prints squares.
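For reference, the BOM discussed here is just the three bytes EF BB BF at the very start of the file. A minimal sketch in C of detecting and skipping it in a byte buffer follows; has_utf8_bom and skip_utf8_bom are hypothetical names, and nothing in this thread establishes how fbc itself treats a BOM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if the buffer starts with the UTF-8
   byte order mark EF BB BF, otherwise 0. */
int has_utf8_bom(const void *data, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    assert(data != NULL || len == 0);
    return len >= 3 && memcmp(data, bom, 3) == 0;
}

/* Hypothetical helper: return a pointer just past the BOM if one is
   present, or the original pointer if not. */
const char *skip_utf8_bom(const char *data, size_t len)
{
    return has_utf8_bom(data, len) ? data + 3 : data;
}
```

Reading the first three bytes of a source file into a buffer and passing them to has_utf8_bom shows whether an editor has added the mark.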

TJF · Post by **TJF** » Apr 02, 2022 17:53

I never tried fbc with BOM source code. But I'm sure it's working without BOM -> switch to another editor.

3oheicrw wrote: ↑Apr 02, 2022 17:32 Edit: your code doesn't work, it only prints squares.

Where do you see those squares? In your terminal? Is it prepared to output in UTF-8 encoding?

As a first step, comment out the lines with test_C_function and prepare your system to get the matching output in your terminal from FB code only.

caseih · Post by **caseih** » Apr 02, 2022 22:36

Sounds like your Windows console is not using UTF-8. A quick search reveals that you can set it with the command "chcp 65001". If you do that I suspect your original code will work.

If you are working with WSTRING, you need to convert explicitly at the boundaries: decode incoming bytes in some encoding into unicode (WString), and encode a WString into a byte encoding when writing to a file, printing to a console, or passing data to a function. See the include file I mentioned for functions in the FB runtime library that ease this. It's called utf_conv.bi and ships with FB.

3oheicrw · Post by **3oheicrw** » Apr 03, 2022 2:40

It's fun, everyone. After the command "chcp 65001", instead of squares it outputs squares with ? inside. Unfortunately I think I have to admit defeat. Thanks for your help anyway.

Edit: Don't waste time arguing about the terminal, everyone. The C library I'm using is a GUI toolkit like IUP, and the thing I said was not displayed correctly (sometimes blank text, sometimes with the non-ASCII part shown as ?) is the window title. test_C_function is the function that sets the window title. It has nothing to do with the terminal. I used TJF's CLI code only as an example showing that something is wrong. Whether it is the GUI powered by that library or the terminal, the UTF-8 text is displayed incorrectly, so it's not the fault of the C library; the fault is in my code or in the compiler itself.

TJF · Post by **TJF** » Apr 03, 2022 5:50

In order to test an UTF-8 application, it's clever to prepare the system for correct debug output in the console.

I've developed more than 100 UTF-8 applications, from minimal examples to real projects, and never faced any issue in the compiler.

I'm sure my code is working on a well configured system.

Perhaps you want to start with checking the examples in folder examples/unicode?

3oheicrw · Post by **3oheicrw** » Apr 03, 2022 7:32

TJF wrote: ↑Apr 03, 2022 5:50 In order to test an UTF-8 application, it's clever to prepare the system for correct debug output in the console.

I've developed more than 100 UTF-8 applications, from minimal examples to real projects, and never faced any issue in the compiler.

I'm sure my code is working on a well configured system.

Perhaps you want to start with checking the examples in folder examples/unicode?

From the screenshot it seems your application is running under Linux. Linux speaks UTF-8 natively, unlike Windows. I tried compiling and running hello_UTF8.bas: the text printed on the terminal is full of squares, but the text in the messagebox is displayed correctly. That's nothing strange, as the text itself is wstring, which is UTF-16, which is what Windows speaks natively, so the messagebox from the Windows API has no trouble dealing with it. What is strange is that this example is the showcase for UTF-8 in FreeBASIC, yet everything in it is about UTF-16! The contents of all the hello_UTF* files are the same; I guess the only difference is the source file's encoding. But after all, it's still wstring being used, and wstring is UTF-16!

Edit: most people don't know that the Windows command prompt can redirect a program's stdio to a text file just like the shell on Linux. I used test > text.txt to do so, and what I found was not squares but a lot of meaningless unicode characters that have nothing to do with my original text.
