WIN to UTF8 conversion

Windows specific questions.
diakin
Posts: 102
Joined: May 28, 2005 6:06
Location: Russia, St-Petersburg

Post by diakin »

Code:

Declare Function WideCharToMultiByte Lib "kernel32" Alias "WideCharToMultiByte" _
(ByVal codepage As Long, ByVal dwFlags As Long, _
ByVal lpWideCharStr As String, ByVal cchWideChar As Long,_
ByVal lpMultiByteStr As String, ByVal cchMultiByte As Long, _
ByVal lpDefaultChar As String, ByVal lpUsedDefaultChar As Long) As Long

Declare Function MultiByteToWideChar Lib "kernel32" Alias "MultiByteToWideChar" _
(ByVal codepage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As String, _
ByVal cchMultiByte As Long, ByVal lpWideCharStr As String, ByVal cchWideChar As Long) As Long

Const MB_PRECOMPOSED = &H1
Const WIN = 1251
Const KOI = 20866
Const DOS = 866
Const Iso = 28595
Const UTF8=65001 


'*****************************************
Function Convert( strSrc As String, nFromCP As Long, nToCP As Long) As String
print"strSrc=";strSrc
Dim nLen As Long
Dim strDst As String
Dim strRet As String
Dim nRet As Long

nLen = Len(strSrc)
print"nLen=";nLen
strDst = String$(nLen * 2, Chr(44))
strRet = String$(nLen * 2, Chr(45))
nRet = MultiByteToWideChar(nFromCP, MB_PRECOMPOSED, strSrc, nLen, strDst, nLen)
nRet = WideCharToMultiByte(nToCP, 0, strDst, nRet, strRet, nLen * 2, 0, 0)
Convert = Left$(strRet, nRet)
End Function

dosStr$="qwerty"

winStr$= Convert(dosStr$, dos, WIN)
print "winStr$="; winStr$;"<<<"

UTF8Str$=Convert(winStr$, WIN, UTF8)
print "UTF8Str$="; UTF8Str$;"<<<"

dosStr$=Convert(UTF8Str$, UTF8, dos)
print "dosStr$="; dosStr$;"<<<"

This code works well, except for this part:

Code:

dosStr$=Convert(UTF8Str$, UTF8, dos)
print "dosStr$="; dosStr$;"<<<"

This returns an empty dosStr$.

How can I correct this?

With best regards, Andrew Shelkovenko.
http://www.wildgardenseed.com/RQDP/ - Rapid-Q Basic documentation Project
RQ Search and Replace - http://mira.home.line1.ru/rqsr.html
v1ctor
Site Admin
Posts: 3804
Joined: May 27, 2005 8:08
Location: SP / Bra[s]il

Post by v1ctor »

Try allocating at least n * 4 bytes. UTF-8 encoding can take up to 4 bytes per character, because UTF-32 code points should never go above roughly 1 million (U+10FFFF), even though 32 bits could hold about 4 billion values.
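
To make the sizing advice concrete, here is a minimal sketch of how many bytes UTF-8 needs per character. It is written in Python rather than FreeBASIC so it can be run anywhere without the Windows API; the sample characters and the helper name `utf8_len` are illustrative choices, not something from the thread.

```python
# Sample characters chosen for illustration; they are not from the thread.
samples = {
    "q": "ASCII letter",
    "\u0410": "Cyrillic capital A (U+0410)",
    "\u20ac": "euro sign (U+20AC)",
    "\U0001f600": "emoji outside the BMP (U+1F600)",
}


def utf8_len(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))


for ch, label in samples.items():
    print(f"{label}: {utf8_len(ch)} byte(s)")

# The highest valid code point, U+10FFFF, still fits in 4 bytes.
print("U+10FFFF:", utf8_len(chr(0x10FFFF)), "byte(s)")
```

So a buffer of 4 bytes per input character is always enough for the UTF-8 side, while 2 bytes per character only covers ASCII and scripts such as Cyrillic.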

Post by diakin »

Like this?

Code:

strDst = String$(nLen * 4, Chr(44)) 
strRet = String$(nLen * 4, Chr(45)) 
nRet = MultiByteToWideChar(nFromCP, MB_PRECOMPOSED, strSrc, nLen, strDst, nLen) 
nRet = WideCharToMultiByte(nToCP, 0, strDst, nRet, strRet, nLen * 4, 0, 0) 

The same result ;((

Post by v1ctor »

I'm not sure where you found the DOS and WIN code-pages, this works fine though:

Code:

option explicit

#include "windows.bi"

Function Convert( strSrc As String, nFromCP As Long, nToCP As Long) As zstring ptr
	Dim nLen As Long 
	Dim strDst as wstring ptr
	Dim strRet as zstring ptr
	Dim nRet As Long 
	
	nLen = Len(strSrc) 
	
	strDst = allocate( (nLen+1) * 2 ) 
	strRet = allocate( (nLen+1) * 4 )
	
	MultiByteToWideChar( nFromCP, MB_PRECOMPOSED, strSrc, nLen+1, strDst, nLen+1 ) 
	WideCharToMultiByte( nToCP, 0, strDst, nlen+1, strRet, nLen+1, NULL, NULL ) 

	deallocate( strDst )
	
	Convert = strRet
	
End Function 

print *Convert("qwerty", CP_ACP, CP_UTF8)

Post by diakin »

v1ctor wrote:I'm not sure where you found the DOS and WIN code-pages, this works fine though
Thank you, Vic.
Yes, your code works, but try this:

Code:

UTF8str$=*Convert("q"+chr$(192)+chr$(192)+chr$(192), CP_ACP, CP_UTF8) 
print"UTF8str$=";UTF8str$
for i=1 to len (UTF8str$)
print (hex(ASC(UTF8str$, i)));" ";
next i
print

CP_ACPstr$=*Convert(*Convert("q"+chr$(192)+chr$(192)+chr$(192), CP_ACP, CP_UTF8), CP_UTF8, CP_ACP) 
print"CP_ACPstr$=";CP_ACPstr$

1. In UTF8str$, only two of the three chr$(192) characters are returned.
2. CP_ACPstr$ is empty.

Maybe I don't correctly understand how MultiByteToWideChar works?
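
A possible explanation for the first symptom, sketched in Python for portability: the posted code offers only nLen+1 bytes of output to WideCharToMultiByte, but the UTF-8 form of this string is longer than that. The sketch assumes CP_ACP is Windows-1251 (so chr$(192) is the Cyrillic letter "А"), and it only imitates the truncation by slicing; the real API reports an error when the buffer is too small, and what it leaves in the buffer is not something to rely on.

```python
# Assumption: CP_ACP is Windows-1251, so byte 192 is the Cyrillic letter "А".
src = b"q" + bytes([192, 192, 192])   # 4 bytes in the ANSI code page
text = src.decode("cp1251")           # step 1: code page -> Unicode
utf8 = text.encode("utf-8")           # step 2: Unicode -> UTF-8

print(len(src), "bytes in,", len(utf8), "bytes out")

# The posted code passes nLen+1 as the size of the output buffer.
offered = len(src) + 1
truncated = utf8[:offered]            # imitation of an undersized buffer
print(len(truncated.decode("utf-8")) - 1, "Cyrillic letters fit")
```

With 5 bytes offered, there is room for "q" and two 2-byte letters, which matches the output described above.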

Post by diakin »

Code:

Print "CP_UTF8=";*Convert("qwerty", CP_ACP, CP_UTF8);"<<<"
Print "CP_ACP=";*Convert("qwerty", CP_UTF8,CP_ACP );"<<<"

CP_ACP to CP_UTF8 works fine.

CP_UTF8 to CP_ACP returns an empty string.

How can I convert a file from UTF-8 to ANSI?
gel
Posts: 28
Joined: May 27, 2005 22:50
Location: WA, USA

Post by gel »

.

Oops, I didn't look at v1ctor's code close enough. Sorry.
Last edited by gel on Aug 02, 2006 20:31, edited 3 times in total.

Post by v1ctor »

Win API quirks, from MSDN: "Note: For the code page 65001 (UTF-8), dwFlags must be set to either 0 or MB_ERR_INVALID_CHARS. Otherwise, the function fails with ERROR_INVALID_FLAGS."

Code:

option explicit

#include once "windows.bi"

'':::::
function convert _
	( _
		byval src as zstring ptr, _
		byval fromCP as integer, _
		byval toCP as integer _
	) as string

	dim as integer lgt
	dim as wstring ptr buf
	dim as zstring ptr res

	lgt = len( src ) 

	buf = allocate( (lgt+1) * 2 ) 
	res = allocate( (lgt+1) * 4 )

	MultiByteToWideChar( fromCP, iif( fromCP = CP_UTF8, 0, MB_PRECOMPOSED ), src, lgt+1, buf, lgt+1 )
	WideCharToMultiByte( toCP, 0, buf, lgt+1, res, lgt+1, NULL, NULL )

	function = *res
        
	deallocate( res )
	deallocate( buf )
        
end function 

Print "'"; convert( "qwerty", CP_ACP, CP_UTF8 ); "'"
Print "'"; convert( "qwerty", CP_UTF8, CP_ACP ); "'"

Post by diakin »

Yeah, thanks for the link.
This code works fine.
The problem was this:
"Note: For the code page 65001 (UTF-8), dwFlags must be set to either 0 or MB_ERR_INVALID_CHARS. Otherwise, the function fails with ERROR_INVALID_FLAGS."

The fix is to pass 0 as dwFlags:
nRet = MultiByteToWideChar(nFromCP, 0, strSrc, nLen, strDst, nLen)

Code:

Declare Function WideCharToMultiByte Lib "kernel32" Alias "WideCharToMultiByte" _
(Byval codepage As Long, Byval dwFlags As Long, _
Byval lpWideCharStr As String, Byval cchWideChar As Long,_
Byval lpMultiByteStr As String, Byval cchMultiByte As Long, _
Byval lpDefaultChar As String, Byval lpUsedDefaultChar As Long) As Long

Declare Function MultiByteToWideChar Lib "kernel32" Alias "MultiByteToWideChar" _
(Byval codepage As Long, Byval dwFlags As Long, Byval lpMultiByteStr As String, _
Byval cchMultiByte As Long, Byval lpWideCharStr As String, Byval cchWideChar As Long) As Long

Const MB_PRECOMPOSED = &H1
Const WIN = 1251
Const KOI = 20866
Const DOS = 866
Const Iso = 28595
Const UTF8=65001 


'*****************************************
Function Convert( strSrc As String, nFromCP As Long, nToCP As Long) As String
Print"strSrc=";strSrc
Dim nLen As Long
Dim strDst As String
Dim strRet As String
Dim nRet As Long

nLen = Len(strSrc)
Print"nLen=";nLen
strDst = String$(nLen * 2, Chr(44))
strRet = String$(nLen * 2, Chr(45))
nRet = MultiByteToWideChar(nFromCP, 0, strSrc, nLen, strDst, nLen)
print"nRet=";nRet
nRet = WideCharToMultiByte(nToCP, 0, strDst, nRet, strRet, nLen * 2, 0, 0)
print"nRet=";nRet
Convert = Left$(strRet, nRet)
End Function



dosStr$="фывапрол"

winStr$= Convert(dosStr$, dos, WIN)
Print "winStr$="; winStr$;"<<<"

UTF8Str$=Convert(winStr$, WIN, UTF8)
Print "UTF8Str$="; UTF8Str$;"<<<"

dosStr$=Convert(UTF8Str$, UTF8, dos)
Print "dosStr$="; dosStr$;"<<<"

WBR, Andrew
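
For comparison, the same two-step idea (source code page to Unicode, then Unicode to target code page) can be sketched portably in Python. This is an illustration, not a replacement for the API calls: the `CODECS` table is a hypothetical mapping from the thread's code page numbers to Python codec names, `convert` merely mirrors the name used in the thread, and the sample text is chosen for the example.

```python
# Hypothetical mapping from Windows code page numbers to Python codec names.
CODECS = {
    1251: "cp1251",      # WIN
    20866: "koi8_r",     # KOI
    866: "cp866",        # DOS
    28595: "iso8859_5",  # Iso
    65001: "utf_8",      # UTF8
}


def convert(data: bytes, from_cp: int, to_cp: int) -> bytes:
    """Re-encode bytes from one code page to another via Unicode."""
    text = data.decode(CODECS[from_cp])   # role of MultiByteToWideChar
    return text.encode(CODECS[to_cp])     # role of WideCharToMultiByte


dos_bytes = "фывапрол".encode("cp866")
win_bytes = convert(dos_bytes, 866, 1251)
utf8_bytes = convert(win_bytes, 1251, 65001)
back = convert(utf8_bytes, 65001, 866)

print(len(dos_bytes), len(win_bytes), len(utf8_bytes))
print("round trip ok:", back == dos_bytes)
```

Note that the single-byte code pages keep the length at one byte per letter, while the UTF-8 form of the same Cyrillic text is twice as long, so the output buffer cannot be sized from the input length alone.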

Post by diakin »

v1ctor wrote:Win API quirks, from MSDN: "Note: For the code page 65001 (UTF-8), dwFlags must be set to either 0 or MB_ERR_INVALID_CHARS. Otherwise, the function fails with ERROR_INVALID_FLAGS."
Thanks V1c, I found it too ;-))

I tried your code and the result is
'qwert♦2'
'qwert♦2'

Second point:
For the high part of the ASCII table (codes above &H7F) it doesn't work at all!

Print "'"; convert( "фывапр", CP_ACP, CP_UTF8 ); "'"
Print "'"; convert( "фывапр", CP_UTF8, CP_ACP ); "'"

'╨┤╨╗Ё♦2'
'?'

Where is the problem?


WBR, Andrew

A.
This line is wrong (stupidly):
Print "'"; convert( "фывапр", CP_UTF8, CP_ACP ); "'"

The right way is

ss$=convert( "фывапр", CP_ACP, CP_UTF8 )
Print "'";ss$ ; "'"
' convert back
Print "'"; convert( ss$, CP_UTF8, CP_ACP ); "'"

Sorry.
Last edited by diakin on Aug 05, 2006 23:06, edited 1 time in total.
DrV
Site Admin
Posts: 2116
Joined: May 27, 2005 18:39
Location: Midwestern USA

Post by DrV »

The console uses the OEM code page (CP_OEMCP), not the ANSI code page (CP_ACP).

See this page for more information.
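
The effect of mixing up the two code pages can be reproduced without a Windows console. The sketch below is in Python and assumes a Russian setup where the ANSI code page is Windows-1251 and the OEM code page is 866; the sample text is an arbitrary choice.

```python
# Assumption: ANSI code page is cp1251 and OEM (console) code page is cp866.
text = "фывапр"

ansi_bytes = text.encode("cp1251")
# What a cp866 console would show if handed the cp1251 bytes unchanged:
misread = ansi_bytes.decode("cp866")
print("ANSI bytes read as OEM:", misread)

# Converting to the OEM code page first gives readable console output:
oem_bytes = text.encode("cp866")
print("OEM bytes read as OEM: ", oem_bytes.decode("cp866"))
```

In terms of the thread's function, this suggests converting to CP_OEMCP rather than CP_ACP when the result is only going to be printed to the console.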

Post by diakin »

DrV wrote:The console uses the OEM code page (CP_OEMCP), not the ANSI code page (CP_ACP).

See this page for more information.
Thanks! It's a very interesting link.
But the problem is that the function returns a wrong result for the input string "qwerty" too.

The input string is "qwerty".
The result should be "qwerty" too, but it returns "qwert♦2".

Hmm... I'm testing it under Win98; possibly that's the reason.
But the code that I posted a few posts above works fine for strings from the low part of the ASCII table and for the high part too.

Sorry for my bad English.

WBR, Andrew

Post by gel »

diakin,

Change lgt = Len( src ) to lgt = Len( *src ).
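
A likely reason this matters, offered as an assumption rather than a confirmed diagnosis: Len applied to a pointer variable gives the size of the pointer itself, not the length of the string it points to. On a 32-bit build that is 4, so the calls process only lgt+1 = 5 bytes and never copy the terminating zero, which would fit the "qwert" plus garbage output. The Python sketch below only simulates this; `POINTER_SIZE` and `processed` are hypothetical names for illustration.

```python
# Simulation only: POINTER_SIZE assumes a 32-bit build, where the size of a
# pointer variable is 4 regardless of the string it points to.
POINTER_SIZE = 4

src = b"qwerty\x00"   # zero-terminated input, as the API sees it


def processed(data: bytes, lgt: int) -> bytes:
    """Return the lgt+1 bytes that the conversion calls would process."""
    return data[:lgt + 1]


wrong = processed(src, POINTER_SIZE)   # Len( src ): size of the pointer
right = processed(src, len(b"qwerty")) # Len( *src ): length of the text

print(wrong, "terminator copied:", b"\x00" in wrong)
print(right, "terminator copied:", b"\x00" in right)
```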

Post by diakin »

gel wrote:diakin,

Change lgt = Len( src ) to lgt = Len( *src ).
Yeah...
And

WideCharToMultiByte( toCP, 0, buf, (lgt+1), res, (lgt+1)*2, NULL, NULL )

The allocations should also be
buf = allocate( (lgt+1) * 2 )
res = allocate( (lgt+1) * 2 ) ' edited !!! was res = allocate( (lgt+1) )


And it all works fine!
Thank you.

WBR, Andrew
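
One caveat on the (lgt+1) * 2 sizing: it covers ASCII and Cyrillic letters, but some characters in a single-byte code page need 3 bytes in UTF-8. The Python sketch below checks this for Windows-1251, which is assumed here to be the ANSI code page; `max_utf8_bytes` is a hypothetical helper name.

```python
# Assumption: the source text is Windows-1251; other code pages may differ.
def max_utf8_bytes(codec: str) -> int:
    """Largest UTF-8 size of any single byte decodable in a one-byte codec."""
    worst = 0
    for value in range(256):
        try:
            ch = bytes([value]).decode(codec)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the code page
        worst = max(worst, len(ch.encode("utf-8")))
    return worst


print("worst case for cp1251:", max_utf8_bytes("cp1251"), "bytes")

euro = b"\x88".decode("cp1251")   # the euro sign in Windows-1251
print("euro sign:", len(euro.encode("utf-8")), "bytes")
```

A more robust alternative to guessing a multiplier is to let the API report the size: both MultiByteToWideChar and WideCharToMultiByte return the required buffer size when the output size argument is 0.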