properly print UTF string contents in an xml file with expat/libxml2

Post by PeterHu »

Has anybody managed to properly print UTF strings (Chinese, Japanese, etc.) read from an XML file with the expat/libxml2 examples in fbc/examples/xml https://github.com/freebasic/fbc/tree/m ... amples/xml?

With fbc64 1.12.0 under Windows, I can compile expat.bas and libxml.bas and properly print test.xml to the console. But when I changed a couple of strings to Chinese, the content no longer printed properly. Whatever I changed in the source (const char* -> String -> WString -> ZString Ptr -> WString Ptr), in the source encoding, or in the Windows console code page, it still failed to print properly.

Post by caseih »

The easiest way to print UTF-8-encoded bytes is to make sure your console is natively using UTF-8 as its code page. Microsoft calls it code page 65001; run chcp 65001 at the command prompt. I also recommend making the new Windows Terminal your default instead of the old console system. Windows Terminal has full support for UTF-8 once you set the code page, whereas I'm not totally sure about the old win32 console.
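
The same code page switch can also be made from inside the program. This is a minimal sketch, assuming a Windows build with FreeBASIC's windows.bi headers; the variable names and the sample bytes are illustrative only. It writes the bytes with WriteFile rather than PRINT so that only the code page setting is being tested, and whether the glyphs render still depends on the terminal and its font.

```freebasic
#include once "windows.bi"

' Remember the code page the console was using so it can be restored.
Dim As UINT oldCP = GetConsoleOutputCP( )
Print "Console output code page at startup:"; oldCP

' Equivalent to running "chcp 65001" before starting the program.
If SetConsoleOutputCP( CP_UTF8 ) = 0 Then
    Print "SetConsoleOutputCP failed, error"; GetLastError( )
    End 1
End If

' Three UTF-8 bytes for U+6253 (first character of the Status text in
' test1.xml), then ASCII. Built from byte values so the result does not
' depend on the encoding of this source file.
Dim As String sample = Chr( &hE6, &h89, &h93 ) & "Open" & Chr( 13, 10 )

' Hand the raw bytes to the console; it decodes them using the active
' output code page.
Dim As DWORD written
WriteFile( GetStdHandle( STD_OUTPUT_HANDLE ), StrPtr( sample ), Len( sample ), @written, 0 )

' Put the original code page back before exiting.
SetConsoleOutputCP( oldCP )
```

If the first line already reports 65001, the system-wide setting described below is in effect and the SetConsoleOutputCP call changes nothing.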

You could also find a library to decode UTF-8 into wide strings and print those using the win32 wide API calls. But I don't think FB has the capability of printing wide strings directly. You would have to use win32 API calls, or libc calls like wprintf.
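
A rough sketch of that wide-API route, assuming windows.bi is available: the Win32 call MultiByteToWideChar does the UTF-8 decoding, so no extra library is needed, and WriteConsoleW prints the result. PrintUtf8 is a hypothetical helper name, not something from the FreeBASIC examples. WriteConsoleW only works when standard output is a real console, not when it is redirected to a file or pipe.

```freebasic
#include once "windows.bi"

' Hypothetical helper: decode a NUL-terminated UTF-8 string to UTF-16
' and write it to the console with the wide API.
Sub PrintUtf8( ByVal utf8 As Const ZString Ptr )
    If utf8 = 0 Then Exit Sub

    ' First call: ask how many wide characters are needed. Passing -1 as
    ' the input length makes the count include the terminating NUL.
    Dim As Long n = MultiByteToWideChar( CP_UTF8, 0, utf8, -1, 0, 0 )
    If n <= 0 Then Exit Sub

    Dim As WString Ptr w = Allocate( n * SizeOf( WString ) )
    If w = 0 Then Exit Sub

    ' Second call: do the actual conversion into the buffer.
    MultiByteToWideChar( CP_UTF8, 0, utf8, -1, w, n )

    ' Write n - 1 characters so the terminating NUL is not printed.
    Dim As DWORD written
    WriteConsoleW( GetStdHandle( STD_OUTPUT_HANDLE ), w, n - 1, @written, 0 )

    Deallocate( w )
End Sub

' UTF-8 bytes for U+6253 followed by "Open", built from byte values so
' the source file encoding does not matter.
Dim As String sample = Chr( &hE6, &h89, &h93 ) & "Open"
PrintUtf8( StrPtr( sample ) )
Print
```

Because the text reaches the console as UTF-16, this path does not depend on the console's output code page.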

Eventually Windows will be defaulting to UTF-8 everywhere. For now you can enable globally by doing:
Win-R, type "intl.cpl" and press Enter.
Click on the Administrative tab and check the box "use Unicode UTF-8 for worldwide language support."

Post by PeterHu »

Thanks for the help.
Yes, my Win10 console has already been set to default to 65001, exactly as you described here:

Code: Select all

Win-R, type "intl.cpl" and press enter.
Click on Administrative tab and check the box "use Unicode UTF-8 for worldwide language support."

I have little difficulty printing Chinese characters from other FB sources to the console. But this time, Chinese characters read from an XML file by the two libraries mentioned above (expat/libxml) fail to print to the console. I checked libxml.bi and found nothing special about conversions between char*/String/ZString. The output is wrong whatever char*/String/ZString/WString conversions I do before the print statement.

Post by caseih »

Are you sure the data you're getting from the XML file is still UTF-8? Or has it been converted to some other encoding or code page? Since your console is UTF-8 and can display UTF-8 bytes correctly from other sources, this suggests the data you're feeding to PRINT is not UTF-8.
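
One way to check this is to dump the raw bytes instead of relying on how the console renders them. A minimal sketch; DumpBytes is a hypothetical helper name, and the sample string stands in for a value returned by libxml.

```freebasic
' Hypothetical helper: print every byte of a NUL-terminated string as
' two hex digits, so the encoding can be read off directly.
Sub DumpBytes( ByVal s As Const ZString Ptr )
    If s = 0 Then
        Print "(null)"
        Exit Sub
    End If

    Dim As Const UByte Ptr p = CPtr( Const UByte Ptr, s )
    Dim As Integer i = 0
    Do While p[i] <> 0
        Print Hex( p[i], 2 ); " ";
        i += 1
    Loop
    Print
End Sub

' Stand-in for a value read from the XML file: the UTF-8 bytes of
' U+6253 followed by "Open".
Dim As String sample = Chr( &hE6, &h89, &h93 ) & "Open"
DumpBytes( StrPtr( sample ) )   ' prints: E6 89 93 4F 70 65 6E
```

Called on the value pointer inside the reader loop, a dump that starts with E6 89 93 for the Status element would mean the data is still UTF-8 at that point, which would shift suspicion from the XML library to the output path.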

Post by PeterHu »

caseih wrote: Jan 19, 2025 1:25 Are you sure the data you're getting from the XML file is still UTF-8? Or has it been converted to some other encoding or code page? Since your console is UTF-8 and can display UTF-8 bytes correctly from other sources, this suggests the data you're feeding to PRINT is not UTF-8.
Yes, I am very sure. I looked very closely at the encoding after modifying the XML file. I kept the XML file in UTF-8 while trying various string conversions and source file encodings to see whether any combination would work, but unfortunately had no luck.

The only lucky yet very weird thing is that when I open libxml.bas in Notepad++ and compile and run it from there (with a well-configured NppExec, whose console has both input and output set to UTF-8), the output is perfect. But I can't work out what this means, since the Windows console/VS Code console has already been set to 65001. I can't see the difference between the Windows console and the Notepad++ console. How come the former can't print properly but the latter can?

Post by srvaldez »

@PeterHu
you need to save your program source with a Unicode BOM, in the Geany IDE: Document -> Write Unicode BOM

Post by PeterHu »

srvaldez wrote: Jan 19, 2025 10:26 @PeterHu
you need to save your program source with a Unicode BOM, in the Geany IDE: Document -> Write Unicode BOM
Thanks, I just downloaded Geany and tried it.

For the source file libxml.bas, Geany shows Document -> Write Unicode BOM as already checked.
Compiling succeeds after a quick and easy FreeBASIC setup, but the result at run time is the same as from the Windows console command line (fbc64 libxml.bas, then libxml test1.xml).

Post by srvaldez »

@PeterHu
please post your example that won't work

Post by PeterHu »

For example, take this test1.xml, compile fbc/examples/xml/libxml.bas and run libxml test1.xml. Under the Windows console, the content (Chinese characters) never prints properly.
libxml.bas

Code: Select all

#include once "libxml/xmlreader.bi"
#define NULL 0

Dim As String filename = Command(1)
If( Len( filename ) = 0 ) Then
    Print "Usage: libxml filename"
    'filename="test1.xml"
    End 1
End If

Dim As xmlTextReaderPtr reader = xmlReaderForFile( filename, NULL, 0 )
If (reader = NULL) Then
    Print "Unable to open "; filename
    End 1
End If

Dim As Integer ret = xmlTextReaderRead( reader )
Do While( ret = 1 )
    Dim As Const ZString Ptr constname = xmlTextReaderConstName( reader )
    Dim As Const ZString Ptr value = xmlTextReaderConstValue( reader )

    Print xmlTextReaderDepth( reader ); _
        xmlTextReaderNodeType( reader ); _
        " "; *constname; _
        xmlTextReaderIsEmptyElement(reader); _
        xmlTextReaderHasValue( reader ); _
        *value

    ret = xmlTextReaderRead( reader )
Loop

xmlFreeTextReader( reader )

If( ret <> 0 ) Then
    Print "failed to parse: "; filename
End If

xmlCleanupParser( )
xmlMemoryDump()

test1.xml

Code: Select all

<?xml version="1.0"?>
<gjob:Helping xmlns:gjob="http://www.gnome.org/some-location">
  <gjob:Jobs>

    <gjob:Job>
      <gjob:Project ID="3"/>
      <gjob:Application>GBackup</gjob:Application>
      <gjob:Category>Development</gjob:Category>

      <gjob:Update>
        <gjob:Status>打开Open</gjob:Status>
        <gjob:Modified>Mon, 07 Jun 1999 20:27:45 -0400 MET DST</gjob:Modified>
        <gjob:Salary>USD 0.00</gjob:Salary>
      </gjob:Update>

      <gjob:Developers>
        <gjob:Developer>
        </gjob:Developer>
      </gjob:Developers>

      <gjob:Contact>
        <gjob:Person>Nathan Clemons南森</gjob:Person>
        <gjob:Email>nathan@windsofstorm.net</gjob:Email>
        <gjob:Company>
        </gjob:Company>
        <gjob:Organisation>
        </gjob:Organisation>
        <gjob:Webpage>
        </gjob:Webpage>
        <gjob:Snailmail>
        </gjob:Snailmail>
        <gjob:Phone>
        </gjob:Phone>
      </gjob:Contact>

      <gjob:Requirements>
      The program should be released as free software, under the GPL.
      </gjob:Requirements>

      <gjob:Skills>
      </gjob:Skills>

      <gjob:Details>
      开源世界A GNOME based system that will allow a superuser to configure 
      compressed and uncompressed files and/or file systems to be backed 
      up with a supported media in the system.  This should be able to 
      perform via find commands generating a list of files that are passed 
      to tar, dd, cpio, cp, gzip, etc., to be directed to the tape machine 
      or via operations performed on the filesystem itself. Email 
      notification and GUI status display very important.
      </gjob:Details>

    </gjob:Job>

  </gjob:Jobs>
</gjob:Helping>



Post by Xusinboy Bekchanov »

For example, I had no problems with this code (Chinese words are displayed correctly in the console):

Code: Select all

#include once "libxml/xmlreader.bi"
'#define NULL 0

Shell "chcp 936"

Dim As String filename = Command(1)
If( Len( filename ) = 0 ) Then
    Print "Usage: libxml filename"
    filename= "test1.xml"
    'End 1
End If

Dim As xmlTextReaderPtr reader = xmlReaderForFile( filename, NULL, 0 )
If (reader = NULL) Then
    Print "Unable to open "; filename
    End 1
End If

Dim As Integer ret = xmlTextReaderRead( reader )
Do While( ret = 1 )
    Dim As Const ZString Ptr constname = xmlTextReaderConstName( reader )
    Dim As Const ZString Ptr value = xmlTextReaderConstValue( reader )

    Print xmlTextReaderDepth( reader ); _
        xmlTextReaderNodeType( reader ); _
        " "; *constname; _
        xmlTextReaderIsEmptyElement(reader); _
        xmlTextReaderHasValue( reader ); _
        *value

    ret = xmlTextReaderRead( reader )
Loop

xmlFreeTextReader( reader )

If( ret <> 0 ) Then
    Print "failed to parse: "; filename
End If

xmlCleanupParser( )
xmlMemoryDump()

Post by srvaldez »

thanks Xusinboy Bekchanov :)
Code page 936 was one of the things that was needed. I would venture to guess that most Windows users would be missing libraries, namely xml2 and its dependencies.
I have msys2 installed, so I can borrow the necessary libs from there: libxml2.a, liblzma.a, libz.a, libiconv.a and libws2_32.a.
I had to add the following to the test.bas file:

Code: Select all

#inclib "ws2_32"
#inclib "iconv"
#inclib "lzma"
#inclib "z"

Post by PeterHu »

Thank you all for the help!

I will try later sometime.

At this moment I can't figure out what I was missing, after having tried switching the console code page between 65001 and 936, saving the .bas source as UTF-8 with and without BOM and even as GB2312, string conversions, etc. As mentioned, except within Notepad++, it fails to print properly in the Windows console.

[The 64-bit libxml2.a and the other 64-bit DLL and *.a files came from other sources, and they work properly with C/FreeBASIC on my computer.]

Post by caseih »

PeterHu wrote: Jan 19, 2025 10:25 Yes, I am very sure. I looked very closely at the encoding after modifying the XML file. I kept the XML file in UTF-8 while trying various string conversions and source file encodings to see whether any combination would work, but unfortunately had no luck.
The fact that the console is in UTF-8 mode and is not printing your data properly says that the bytes you are trying to print are not in UTF-8.
PeterHu wrote: The only lucky yet very weird thing is that when I open libxml.bas in Notepad++ and compile and run it from there (with a well-configured NppExec, whose console has both input and output set to UTF-8), the output is perfect. But I can't work out what this means, since the Windows console/VS Code console has already been set to 65001. I can't see the difference between the Windows console and the Notepad++ console. How come the former can't print properly but the latter can?
I wouldn't think that the encoding of libxml.bas should have anything to do with it (Notepad++). Truly strange.

You're definitely having encoding issues and codepage mismatches. Maybe libxml is encoding the unicode to something other than UTF-8.