properly print UTF string contents in an xml file with expat/libxml2
May I ask whether anybody has ever properly printed UTF strings (Chinese, Japanese, etc.) from an XML file with expat/libxml2, using the examples in fbc/examples/xml (https://github.com/freebasic/fbc/tree/m ... amples/xml)?
With fbc64 1.12.0 under Windows, I can compile expat.bas and libxml.bas and properly print test.xml to the console. But when I changed a couple of strings to Chinese, the content no longer printed properly. After that, whatever I changed in the source (const char* -> String -> WString -> ZString Ptr -> WString Ptr), in the source file encoding, or in the Windows console code page, it still failed to print properly.
Re: properly print UTF string contents in an xml file with expat/libxml2
The easiest way to print UTF-8-encoded bytes is to make sure your console natively uses UTF-8 as its code page. Microsoft calls it code page 65001; run chcp 65001 at the command prompt. I also recommend defaulting to the new Windows Terminal instead of the old console system. Windows Terminal fully supports UTF-8 once you set the code page, whereas I'm not totally sure about the Win32 console.
You could also find a library to decode UTF-8 into wide strings and print those using the Win32 wide API calls. But I don't think FB has the capability of printing wide strings directly; you would have to use Win32 API calls, or libc calls like wprintf.
Eventually Windows will default to UTF-8 everywhere. For now you can enable it globally as follows:
Win-R, type "intl.cpl" and press enter.
Click on the Administrative tab and check the box "use Unicode UTF-8 for worldwide language support."
Re: properly print UTF string contents in an xml file with expat/libxml2
Thanks for the help.
Yes, my Win10 console has been set to default to 65001, exactly as you guided here:
Code: Select all
Win-R, type "intl.cpl" and press enter.
Click on Administrative tab and check the box "use Unicode UTF-8 for worldwide language support."
I don't have much difficulty printing Chinese characters from other sources in FB to the console. But this time, Chinese characters in an XML file read by the two libraries mentioned above (expat/libxml) just fail to print to the console. I checked libxml.bi and found nothing special related to conversions between char*, String and ZString. It just fails to print properly, whatever char*/String/ZString/WString conversions I do before the print statement.
Re: properly print UTF string contents in an xml file with expat/libxml2
Are you sure the data you're getting from the XML file is still UTF-8? Or has it been converted to some other encoding or code page? Since your console is UTF-8 and can display UTF-8 bytes correctly from other sources, this suggests the data you're feeding to PRINT is not UTF-8.
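One way to check this, as a minimal sketch: dump the raw bytes of the string instead of printing it as text. HexBytes below is a hypothetical helper written for illustration, not part of libxml2 or the fbc examples. If the value for 打开 comes out as E6 89 93 E5 BC 80, the data is UTF-8; two bytes per Chinese character would point to a legacy code page such as 936 instead.

```freebasic
' Hypothetical helper (not part of libxml2 or the fbc examples):
' returns the bytes of a NUL-terminated string as space-separated hex pairs.
Function HexBytes( ByVal s As Const ZString Ptr ) As String
    Dim As String hexText
    If s = 0 Then Return hexText    ' NULL pointer: nothing to dump
    Dim As Integer i = 0
    Do While s[i] <> 0
        If i > 0 Then hexText &= " "
        hexText &= Hex( s[i], 2 )   ' two hex digits per byte
        i += 1
    Loop
    Return hexText
End Function

' "打" (U+6253) encoded as UTF-8 is the three bytes E6 89 93.
Print HexBytes( Chr( &hE6, &h89, &h93 ) )   ' prints: E6 89 93
```

In an xmlreader loop you could call Print HexBytes( value ) on the pointer returned by xmlTextReaderConstValue and compare the output with the bytes in the XML file itself.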
Re: properly print UTF string contents in an xml file with expat/libxml2
caseih wrote: ↑Jan 19, 2025 1:25 Are you sure the data you're getting from the XML file is still UTF-8? Or has it been converted to some other encoding or code page? Since your console is UTF-8 and can display UTF-8 bytes correctly from other sources, this suggests the data you're feeding to PRINT is not UTF-8.
Yes, I am very sure. I looked very closely at the encoding after modifying the XML file. I kept the XML file in UTF-8 while trying various string conversions and source file encodings to see whether any combination would work, but unfortunately had no luck.
The only lucky yet very weird thing is that when I open libxml.bas in Notepad++ and compile and run it from there (with a well-configured NppExec; the console inside Notepad++ has both input and output set to UTF-8), the output is perfect. But I can't work out what this means, as the Windows console/VS Code console has already been set to 65001. I can't see the difference between the Windows console and the Notepad++ console. How come the former can't print properly but the latter can?
Re: properly print UTF string contents in an xml file with expat/libxml2
@PeterHu
you need to save your program source with a Unicode BOM, in the Geany IDE: Document -> Write Unicode BOM
Re: properly print UTF string contents in an xml file with expat/libxml2
Thanks, I just downloaded Geany and tried it.
For the source file libxml.bas, Geany shows that Document -> Write Unicode BOM is already checked.
Compilation succeeds after a very quick and easy setup for FreeBASIC, but the output is just the same as from the Windows console command line (fbc64 libxml.bas and libxml test1.xml).
Re: properly print UTF string contents in an xml file with expat/libxml2
@PeterHu
please post your example that won't work
Re: properly print UTF string contents in an xml file with expat/libxml2
For example, with this test1.xml: compile fbc/examples/xml/libxml.bas and run libxml test1.xml. Under the Windows console I have had no luck getting the content (Chinese characters) to print properly.
libxml.bas
Code: Select all
#include once "libxml/xmlreader.bi"
#define NULL 0
Dim As String filename = Command(1)
If( Len( filename ) = 0 ) Then
    Print "Usage: libxml filename"
    'filename="test1.xml"
    End 1
End If
Dim As xmlTextReaderPtr reader = xmlReaderForFile( filename, NULL, 0 )
If (reader = NULL) Then
    Print "Unable to open "; filename
    End 1
End If
Dim As Integer ret = xmlTextReaderRead( reader )
Do While( ret = 1 )
    Dim As Const ZString Ptr constname = xmlTextReaderConstName( reader )
    Dim As Const ZString Ptr value = xmlTextReaderConstValue( reader )
    Print xmlTextReaderDepth( reader ); _
          xmlTextReaderNodeType( reader ); _
          " "; *constname; _
          xmlTextReaderIsEmptyElement(reader); _
          xmlTextReaderHasValue( reader ); _
          *value
    ret = xmlTextReaderRead( reader )
Loop
xmlFreeTextReader( reader )
If( ret <> 0 ) Then
    Print "failed to parse: "; filename
End If
xmlCleanupParser( )
xmlMemoryDump()
test1.xml
Code: Select all
<?xml version="1.0"?>
<gjob:Helping xmlns:gjob="http://www.gnome.org/some-location">
<gjob:Jobs>
<gjob:Job>
<gjob:Project ID="3"/>
<gjob:Application>GBackup</gjob:Application>
<gjob:Category>Development</gjob:Category>
<gjob:Update>
<gjob:Status>打开Open</gjob:Status>
<gjob:Modified>Mon, 07 Jun 1999 20:27:45 -0400 MET DST</gjob:Modified>
<gjob:Salary>USD 0.00</gjob:Salary>
</gjob:Update>
<gjob:Developers>
<gjob:Developer>
</gjob:Developer>
</gjob:Developers>
<gjob:Contact>
<gjob:Person>Nathan Clemons南森</gjob:Person>
<gjob:Email>nathan@windsofstorm.net</gjob:Email>
<gjob:Company>
</gjob:Company>
<gjob:Organisation>
</gjob:Organisation>
<gjob:Webpage>
</gjob:Webpage>
<gjob:Snailmail>
</gjob:Snailmail>
<gjob:Phone>
</gjob:Phone>
</gjob:Contact>
<gjob:Requirements>
The program should be released as free software, under the GPL.
</gjob:Requirements>
<gjob:Skills>
</gjob:Skills>
<gjob:Details>
开源世界A GNOME based system that will allow a superuser to configure
compressed and uncompressed files and/or file systems to be backed
up with a supported media in the system. This should be able to
perform via find commands generating a list of files that are passed
to tar, dd, cpio, cp, gzip, etc., to be directed to the tape machine
or via operations performed on the filesystem itself. Email
notification and GUI status display very important.
</gjob:Details>
</gjob:Job>
</gjob:Jobs>
</gjob:Helping>
Re: properly print UTF string contents in an xml file with expat/libxml2
For example, I had no problems with this code (Chinese words are displayed correctly in the console):
Code: Select all
#include once "libxml/xmlreader.bi"
'#define NULL 0
Shell "chcp 936"
Dim As String filename = Command(1)
If( Len( filename ) = 0 ) Then
    Print "Usage: libxml filename"
    filename= "test1.xml"
    'End 1
End If
Dim As xmlTextReaderPtr reader = xmlReaderForFile( filename, NULL, 0 )
If (reader = NULL) Then
    Print "Unable to open "; filename
    End 1
End If
Dim As Integer ret = xmlTextReaderRead( reader )
Do While( ret = 1 )
    Dim As Const ZString Ptr constname = xmlTextReaderConstName( reader )
    Dim As Const ZString Ptr value = xmlTextReaderConstValue( reader )
    Print xmlTextReaderDepth( reader ); _
          xmlTextReaderNodeType( reader ); _
          " "; *constname; _
          xmlTextReaderIsEmptyElement(reader); _
          xmlTextReaderHasValue( reader ); _
          *value
    ret = xmlTextReaderRead( reader )
Loop
xmlFreeTextReader( reader )
If( ret <> 0 ) Then
    Print "failed to parse: "; filename
End If
xmlCleanupParser( )
xmlMemoryDump()
Re: properly print UTF string contents in an xml file with expat/libxml2
Thanks, Xusinboy Bekchanov.
Code page 936 was one of the things that was needed. I would venture to guess that most Windows users would be missing libraries, namely xml2 and its dependencies.
I have msys2 installed, so I can borrow the necessary libs from there: libxml2.a, liblzma.a, libz.a, libiconv.a and libws2_32.a.
I had to add the following to the test.bas file:
Code: Select all
#inclib "ws2_32"
#inclib "iconv"
#inclib "lzma"
#inclib "z"
Re: properly print UTF string contents in an xml file with expat/libxml2
Thank you all for the help!
I will try again sometime later.
At this moment I can't figure out what I was missing, after having tried switching the console code page between 65001 and 936, saving the .bas source file as UTF-8 with and without BOM and even as GB2312, string conversions, etc. But as mentioned, except with Notepad++, it failed to print properly in the Windows console.
[For the 64-bit libxml2.a and the other 64-bit DLLs and *.a files: I got them from other sources, and they work properly with C/FreeBASIC on my computer.]
Re: properly print UTF string contents in an xml file with expat/libxml2
The fact that the console is in UTF-8 mode and it's not printing out your data properly says that the bytes you are trying to print are not in UTF-8.
PeterHu wrote: The only lucky yet very weird thing is that when I open libxml.bas in Notepad++ and compile and run it from there (with a well-configured NppExec; the console inside Notepad++ has both input and output set to UTF-8), the output is perfect. But I can't work out what this means, as the Windows console/VS Code console has already been set to 65001. I can't see the difference between the Windows console and the Notepad++ console. How come the former can't print properly but the latter can?
I wouldn't think that the encoding of libxml.bas should have anything to do with it (Notepad++). Truly strange.
You're definitely having encoding issues and codepage mismatches. Maybe libxml is encoding the unicode to something other than UTF-8.