[SOLVED] Pass UTF-8 strings
The C library I'm using expects strings to be UTF-8. In C that is simply a const char* with no extra effort. I tried calling it from FreeBASIC and it doesn't work: the displayed text is blank. If I remove the Unicode characters, it displays properly.
I passed the string directly, like this: test_C_function("test 中国語"). That doesn't work, but with the Unicode part removed, test_C_function("test"), it works.
Last edited by 3oheicrw on Apr 03, 2022 12:53, edited 1 time in total.
Re: Pass UTF-8 strings
Hello,
You might like to check this project, the UTF-8 Variable Length String Library:
viewtopic.php?t=26170
Re: Pass UTF-8 strings
Vortex wrote: ↑Apr 02, 2022 10:30 Hello,
You might like to check this project, the UTF-8 Variable Length String Library:
viewtopic.php?t=26170
I could Dim it as ustring, but the problem is how to pass it to the C library. The C library knows nothing about ustring; it only speaks const char*.
Re: Pass UTF-8 strings
const char* translates to CONST ZSTRING PTR
Example
Code: Select all
test_C_function(@"test 中国語")
VAR c_str = @"test 中国語" '' creates a ZSTRING PTR
test_C_function(c_str)
?LEFT(*c_str, 4)
?MID(*c_str, 6)
Re: Pass UTF-8 strings
3oheicrw wrote: ↑Apr 02, 2022 9:38 The C library I'm using expects strings to be UTF-8. In C that is simply a const char* with no extra effort. I tried calling it from FreeBASIC and it doesn't work: the displayed text is blank. If I remove the Unicode characters, it displays properly.
I passed the string directly, like this: test_C_function("test 中国語"). That doesn't work, but with the Unicode part removed, test_C_function("test"), it works.
The reason your quoted string literal didn't work is that on Windows that string (and your source code file itself) is UCS-2 encoded, which your C function does not understand. You must *encode* your FB Unicode string into bytes (a zstring). FB ships with a number of functions to assist with this. There's an include file called "utf_conv.bi". You can use WCharToUTF() to do the conversion for you.
Of course it gets a bit confusing, because if you make sure your source code file is stored in UTF-8 (which you can do even on Windows), then the string literals will already be UTF-8 encoded.
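To make "encode into bytes" concrete, here is a minimal C sketch of the UTF-8 encoding step for a single code point. It is not the FB runtime's WCharToUTF() (its signature is not shown in this thread); the name utf8_encode is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * out must have room for 4 bytes. Returns the number of bytes written,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                       /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes, covers most CJK */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+4E2D (中) comes out as the three bytes e4 b8 ad, which is what a const char* consumer expecting UTF-8 needs to receive.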
Re: Pass UTF-8 strings
Encoding the FB source in UTF-8 is mandatory for using that C lib. Every serious IDE should do this by default, e.g. Geany.
Re: Pass UTF-8 strings
caseih wrote: ↑Apr 02, 2022 13:56 The reason your quoted string literal didn't work is that on Windows that string (and your source code file itself) is UCS-2 encoded, which your C function does not understand. You must *encode* your FB Unicode string into bytes (a zstring). FB ships with a number of functions to assist with this. There's an include file called "utf_conv.bi". You can use WCharToUTF() to do the conversion for you.
Of course it gets a bit confusing, because if you make sure your source code file is stored in UTF-8 (which you can do even on Windows), then the string literals will already be UTF-8 encoded.
My source file is already UTF-8. I don't know how to *encode* the string as you said; please give me an example.
@TJF I followed your instructions, but it doesn't work, whether I pass the string directly with @ or pass the variable holding it.
Re: Pass UTF-8 strings
Vortex wrote: ↑Apr 02, 2022 10:30 Hello,
You might like to check this project, the UTF-8 Variable Length String Library:
viewtopic.php?t=26170
I tried passing a variable holding a ustring to the function and it compiled, but now, instead of blank text, the text is full of ?; only the ASCII part of it is displayed correctly.
Re: Pass UTF-8 strings
How do you prove that? When you save my code in UTF-8 encoding, it will work.
Here's the HEX output:
Code: Select all
0x00000000: 74 65 73 74 5f 43 5f 66 75 6e 63 74 69 6f 6e 28 40 22 74 65 73 74 20 e4 b8 ad e5 9b bd e8 aa 9e test_C_function(@"test .........
0x00000020: 22 29 0a 0a 56 41 52 20 63 5f 73 74 72 20 3d 20 40 22 74 65 73 74 20 e4 b8 ad e5 9b bd e8 aa 9e ")..VAR c_str = @"test .........
0x00000040: 22 20 27 27 20 63 72 65 61 74 65 73 20 61 20 5a 53 54 52 49 4e 47 20 50 54 52 0a 74 65 73 74 5f " '' creates a ZSTRING PTR.test_
0x00000060: 43 5f 66 75 6e 63 74 69 6f 6e 28 63 5f 73 74 72 29 0a 3f 4c 45 46 54 28 2a 63 5f 73 74 72 2c 20 C_function(c_str).?LEFT(*c_str,
0x00000080: 34 29 0a 3f 4d 49 44 28 2a 63 5f 73 74 72 2c 20 36 29 0a 4).?MID(*c_str, 6).
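The dump above shows what is in the source file. To see what actually arrives on the C side, a small helper like the following could be called from a C test stub in place of the real function. This is only a sketch; dump_bytes_hex is a made-up name, not part of any library mentioned in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of a NUL-terminated string into
 * out as space-separated lowercase hex. Stops early if out is too small.
 * Returns the number of input bytes that were dumped. */
size_t dump_bytes_hex(const char *s, char *out, size_t out_size)
{
    size_t n = 0, used = 0;

    assert(s != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    /* Each byte needs at most 3 characters (" xx") plus the NUL. */
    while (s[n] != '\0' && used + 4 <= out_size) {
        used += (size_t)snprintf(out + used, out_size - used,
                                 n ? " %02x" : "%02x", (unsigned char)s[n]);
        n++;
    }
    return n;
}
```

If the argument really is UTF-8, "test 中国語" should show up as 74 65 73 74 20 e4 b8 ad e5 9b bd e8 aa 9e, the same bytes as in the file dump. Anything else means the bytes were changed somewhere between the source file and the call.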
Re: Pass UTF-8 strings
@TJF My source file is UTF-8. Yours seems to be UTF-8 without a BOM. I have EF BB BF at the beginning of my source file in hex. Could you try saving your source as UTF-8 with a BOM, like mine, and test whether it still works? My text editor doesn't support UTF-8 without a BOM; it always adds EF BB BF at the beginning of a text file.
Edit: your code doesn't work, it only prints squares.
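For anyone who wants to check a file for the EF BB BF signature without a hex viewer, a C sketch of the test looks like this. The function names are made up for illustration, and reading the file into the buffer is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the buffer starts with the
 * UTF-8 byte order mark EF BB BF, otherwise 0. */
int has_utf8_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    return len >= 3 && memcmp(buf, bom, sizeof bom) == 0;
}

/* Hypothetical helper: returns a pointer just past the BOM if one is
 * present (and shortens *len by 3); otherwise returns buf unchanged. */
const unsigned char *skip_utf8_bom(const unsigned char *buf, size_t *len)
{
    if (has_utf8_bom(buf, *len)) {
        *len -= 3;
        return buf + 3;
    }
    return buf;
}
```

Only the first three bytes of the file differ between the two variants; the bytes of the string literal itself are the same either way.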
Re: Pass UTF-8 strings
I have never tried fbc with source code that has a BOM. But I'm sure it works without a BOM, so switch to another editor.
As a first step, comment out the lines with test_C_function and prepare your system so that you get the matching output in your terminal from FB code only.
Where do you see those squares? In your terminal? Is it set up to output in UTF-8 encoding?
Re: Pass UTF-8 strings
Sounds like your Windows console is not using UTF-8. A quick search reveals that you can set it with the command "chcp 65001". If you do that I suspect your original code will work.
If you are working with WSTRING and if you need to programmatically decode bytes in some format into unicode (WString), or output unicode from a WString to a particular encoding when writing to a file or printing to a console (or passing data to a function), you need to encode that unicode string into a byte encoding. See that include file I mentioned for functions in the FB runtime library to ease that. It's called utf_conv.bi and ships with FB.
Re: Pass UTF-8 strings
That's funny, everyone. After the command "chcp 65001", instead of squares it outputs squares with a ? inside. Unfortunately, I think I have to admit defeat. Thanks for your help anyway.
Edit: Please don't waste time arguing about the terminal. The C library I'm using is a GUI toolkit like IUP, and what is not displayed correctly (sometimes blank text, sometimes with the non-ASCII part shown as ?) is the window title. test_C_function is the function that sets the window title. It has nothing to do with the terminal. TJF's CLI code only served as an example showing that something is wrong: whether in the GUI powered by that library or in the terminal, the UTF-8 text is displayed incorrectly, so it's not the fault of the C library. The fault is in my code or in the compiler itself.
Re: Pass UTF-8 strings
In order to test an UTF-8 application, it's clever to prepare the system for correct debug output in the console.
I've developed more than 100 UTF-8 applications, from minimal examples to real projects, and never faced any issue in the compiler.
I'm sure my code is working on a well configured system.
Perhaps you want to start with checking the examples in folder examples/unicode?
Re: Pass UTF-8 strings
TJF wrote: ↑Apr 03, 2022 5:50 In order to test an UTF-8 application, it's clever to prepare the system for correct debug output in the console.
I've developed more than 100 UTF-8 applications, from minimal examples to real projects, and never faced any issue in the compiler.
I'm sure my code is working on a well configured system.
Perhaps you want to start with checking the examples in folder examples/unicode?
From the screenshot it seems your application is running under Linux. Linux speaks UTF-8 natively, unlike Windows. I tried compiling and running hello_UTF8.bas: the text printed on the terminal is full of squares, but the text in the message box is displayed correctly. That is not surprising, as the text itself is a wstring, which is UTF-16, and that is what Windows speaks natively, so the message box from the Windows API has no trouble dealing with it. What is strange is that this example is the showcase for UTF-8 in FreeBASIC, yet everything in it is about UTF-16! The contents of all the hello_UTF* files are the same; I guess the only difference is the source file's encoding. But after all, it's still wstring being used, and wstring is UTF-16!
Edit: most people don't know that the Windows command prompt can redirect a program's standard output to a text file, just like a shell on Linux. I used test > text.txt to do so, and what I found was not squares but meaningless Unicode characters that have nothing to do with my original text.