How to determine if string is UTF-8 encoded

PaulSquires · Post by **PaulSquires** » Jul 19, 2018 20:50

Maybe this is simple?

I have a STRING and need to know if that string is encoded as UTF-8 or is just a regular ANSI string.

Does anyone have code to perform this check? Is this something easy that I am not seeing?
(Windows platform)

PaulSquires · Post by **PaulSquires** » Jul 20, 2018 0:16

A Stack Overflow answer suggests that the following should work (on Windows):

Code:

' On Windows, you can use MultiByteToWideChar() with the CP_UTF8 codepage 
' and the MB_ERR_INVALID_CHARS flag. If the function fails, the string is not valid UTF-8.
if MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, STRPTR(sText), LEN(sText), NULL, 0) = 0 then
   ' Not a valid UTF-8 string
end if
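For readers outside the Win32 API, the same idea (attempt a strict decode and treat failure as "not valid UTF-8") can be sketched in Python. This is only an illustration of the technique, not the Stack Overflow code; `is_valid_utf8` is a made-up helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without any error."""
    try:
        # errors="strict" makes the decoder raise on malformed sequences,
        # which plays the role of MB_ERR_INVALID_CHARS in the call above.
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(is_valid_utf8(b"Paul"))           # plain ASCII
    print(is_valid_utf8(b"Jos\xc3\xa9"))    # "José" encoded as UTF-8
    print(is_valid_utf8(b"Jos\x82"))        # lone continuation byte
```

One difference worth keeping in mind: this sketch accepts an empty input, whereas MultiByteToWideChar is documented to fail when the byte count passed in is 0, so the Windows test above would report an empty string as "not valid".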

jj2007 · Post by **jj2007** » Jul 20, 2018 2:54

Interesting, and yes, it works indeed, see here.

For example, by adding a chr(169) to an English string, it becomes invalid Utf8 (but it prints fine to the console provided codepage is set to 1252):

Let esi="Click on this button"+Chr$(169) ; Ansi with one extra character
[Click on this button©] is not a valid Utf8 string

But note that this does not answer the question "is this Utf8?". Ordinary Ansi strings like "Hello World" are (valid) Utf8 strings, too. In fact, the only way to distinguish plain Ansi from "real" Utf8 might be to check whether the byte length of the string equals its Utf8 character count.
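To make the three possible outcomes concrete, here is a minimal Python sketch (an illustration only; `classify` and its return labels are hypothetical names). It separates pure 7-bit text, which is valid UTF-8 but tells you nothing, from text that contains valid multibyte sequences and from text that cannot be UTF-8 at all.

```python
def classify(data: bytes) -> str:
    """Classify a byte string as 'ascii', 'utf8' or 'not-utf8'."""
    if data.isascii():
        # Only bytes 0..127: valid UTF-8, but identical in ANSI codepages,
        # so the encoding question cannot be answered (and does not matter).
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"
    return "utf8"


if __name__ == "__main__":
    print(classify(b"Hello World"))
    print(classify(b"Click on this button\xa9"))      # chr(169) appended
    print(classify(b"Click on this button\xc2\xa9"))  # same sign as UTF-8
```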

marcov · Post by **marcov** » Jul 20, 2018 11:49

- If there is a BOM, it is UTF8 (more for document text, less for separate strings). This is the only non-heuristic test.
- Walk the string and try to decode it. Make sure that if a utf8 multibyte sequence starts, the high bits of the following bytes are as expected (e.g. if a byte starts with high bits 110, the next byte's highest two bits must be 10, etc.). If not, it is not utf8.
- You can also test if the decoded unicode code points are valid (this might require extensive tables, so it is more rarely done; sometimes simple heuristic checks can be used for a rough determination of script/language type).
- If it has well-formed utf-8 multibyte sequences, count them. Then it is utf-8 if count>0.
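The "walk the string and count" steps in the list above could look roughly like this in Python. It is a sketch under stated assumptions: it checks only the bit patterns (110xxxxx, 1110xxxx, 11110xxx lead bytes followed by 10xxxxxx bytes) and does not reject overlong forms or surrogates; `count_multibyte_sequences` is a hypothetical name.

```python
def count_multibyte_sequences(data: bytes) -> int:
    """Count well-formed multibyte sequences; return -1 if one is malformed."""
    i, count, n = 0, 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: single-byte (ASCII)
            i += 1
            continue
        if lead >> 5 == 0b110:       # 110xxxxx: one continuation byte
            extra = 1
        elif lead >> 4 == 0b1110:    # 1110xxxx: two continuation bytes
            extra = 2
        elif lead >> 3 == 0b11110:   # 11110xxx: three continuation bytes
            extra = 3
        else:
            return -1                # stray continuation or invalid lead
        if i + extra >= n:
            return -1                # sequence is cut off at the end
        for k in range(1, extra + 1):
            if data[i + k] >> 6 != 0b10:
                return -1            # continuation must look like 10xxxxxx
        count += 1
        i += extra + 1
    return count
```

A result greater than 0 suggests UTF-8, 0 means the input was plain 7-bit text, and -1 means a malformed sequence was found, so the input is most likely ANSI.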

munair had some utf8 libraries ported from Lazarus iirc. It might be worthwhile to check them out.

https://en.wikipedia.org/wiki/UTF-8

Pierre Bellisle · Post by **Pierre Bellisle** » Jul 20, 2018 18:28

IsTextUnicode

PaulSquires · Post by **PaulSquires** » Jul 20, 2018 18:57

Pierre Bellisle wrote:IsTextUnicode

Thanks Pierre, that function only indicates that a string is unicode, but not what type of unicode it is. I was looking to test if a string is explicitly utf-8 and from my research I don't think it is possible.

jj2007 · Post by **jj2007** » Jul 20, 2018 19:49

PaulSquires wrote:I was looking to test if a string is explicitly utf-8 and from my research I don't think it is possible.

As marcov wrote: If there is a Utf8 BOM, it's Utf8. If there is no BOM, and you can exclude that it's UTF16, then one possible test is to compare the Ansi length to the Utf8 length. If they are equal, it's not Utf8 (or rather: it's an Ansi subset of Utf8). If, however, the byte count is higher than the Utf8 string length, then it is certainly Utf8. Another question is the relevance of all this. Do you have a concrete case where it matters? Or, in particular, where it would be useful to know if it's valid Utf8?

WQ1980 · Post by **WQ1980** » Jul 20, 2018 20:18

Fl_utf8test
http://www.fltk.org/doc-1.4/group__fl__ ... eddb3fc17a
viewtopic.php?f=14&t=24547&hilit=fltk

+

check_utf8
viewtopic.php?t=17636

dodicat · Post by **dodicat** » Jul 20, 2018 20:45

Maybe look into the world of C code.
There are plenty of discussions out there.
After about five minutes I got hold of:

Code:

 

 function isutf8( byval _string as  zstring ptr ) as boolean
     if _string = 0 then return false
     ' BOMs are byte sequences, so compare against chr(), not wchr()
     if left(*_string,2) = chr(&hFF,&hFE) orelse left(*_string,2) = chr(&hFE,&hFF) then print "utf16" : return false 'utf-16 boms
     if left(*_string,3) = chr(&hEF,&hBB,&hBF) then return true  'utf-8 bom
	dim bytes as const ubyte ptr = cptr(const ubyte ptr, _string)
	while *bytes
		if (((bytes[0] = &h09) orelse (bytes[0] = &h0A)) orelse (bytes[0] = &h0D)) orelse ((&h20 <= bytes[0]) andalso (bytes[0] <= &h7E)) then
			bytes += 1
			continue while
		end if
		if ((&hC2 <= bytes[0]) andalso (bytes[0] <= &hDF)) andalso ((&h80 <= bytes[1]) andalso (bytes[1] <= &hBF)) then
			bytes += 2
			continue while
		end if
		if ((((bytes[0] = &hE0) andalso ((&hA0 <= bytes[1]) andalso (bytes[1] <= &hBF))) andalso ((&h80 <= bytes[2]) andalso (bytes[2] <= &hBF))) orelse ((((((&hE1 <= bytes[0]) andalso (bytes[0] <= &hEC)) orelse (bytes[0] = &hEE)) orelse (bytes[0] = &hEF)) andalso ((&h80 <= bytes[1]) andalso (bytes[1] <= &hBF))) andalso ((&h80 <= bytes[2]) andalso (bytes[2] <= &hBF)))) orelse (((bytes[0] = &hED) andalso ((&h80 <= bytes[1]) andalso (bytes[1] <= &h9F))) andalso ((&h80 <= bytes[2]) andalso (bytes[2] <= &hBF))) then
			bytes += 3
			continue while
		end if
		if (((((bytes[0] = &hF0) andalso ((&h90 <= bytes[1]) andalso (bytes[1] <= &hBF))) andalso ((&h80 <= bytes[2]) andalso (bytes[2] <= &hBF))) andalso ((&h80 <= bytes[3]) andalso (bytes[3] <= &hBF))) orelse (((((&hF1 <= bytes[0]) andalso (bytes[0] <= &hF3)) andalso ((&h80 <= bytes[1]) andalso (bytes[1] <= &hBF))) andalso ((&h80 <= bytes[2]) andalso (bytes[2] <= &hBF))) andalso ((&h80 <= bytes[3]) andalso (bytes[3] <= &hBF)))) orelse ((((bytes[0] = &hF4) andalso ((&h80 <= bytes[1]) andalso (bytes[1] <= &h8F))) andalso ((&h80 <= bytes[2]) andalso (bytes[2] <= &hBF))) andalso ((&h80 <= bytes[3]) andalso (bytes[3] <= &hBF))) then
			bytes += 4
			continue while
		end if
		return 0
	wend
	return 1
end function


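' The check works on bytes, so the test data is held in a byte string, not a wstring
dim as string test = chr(&hEF,&hBB,&hBF) + "  The bom is on"
print test
print isutf8(strptr(test))
print
test = chr(&hFE,&hFF) + "  utf-16 bom is attached"
print test
print isutf8(strptr(test))
print
test = "Σὲ γνωρίζω ἀπὸ τὴν κόψη" ' assumes this source file is saved as UTF-8 without a BOM
print test
print isutf8(strptr(test))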


sleep

After all, the win api is written in C, so it can achieve nothing which C can't.
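The odd-looking range checks in the function above (A0..BF required after E0, only 80..9F allowed after ED) become clearer when the code points are worked out by hand. The Python lines below are just that arithmetic, with `decode3` as a hypothetical helper: a three-byte sequence 1110xxxx 10xxxxxx 10xxxxxx carries 4 + 6 + 6 payload bits.

```python
def decode3(b0: int, b1: int, b2: int) -> int:
    """Combine the payload bits of a three-byte UTF-8 sequence."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


# E0 9F BF would encode U+07FF, which already fits in two bytes (overlong),
# so after E0 the second byte has to start at A0.
print(hex(decode3(0xE0, 0x9F, 0xBF)))  # 0x7ff
print(hex(decode3(0xE0, 0xA0, 0x80)))  # 0x800

# ED A0 80 would encode U+D800, the first UTF-16 surrogate, which is not a
# legal code point in UTF-8, so after ED the second byte has to stop at 9F.
print(hex(decode3(0xED, 0x9F, 0xBF)))  # 0xd7ff
print(hex(decode3(0xED, 0xA0, 0x80)))  # 0xd800
```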

Pierre Bellisle · Post by **Pierre Bellisle** » Jul 20, 2018 20:52

>> "that function only indicates that a string is unicode"

Hey Paul,
Yep, in fact, "tries to indicate" would be even more appropriate.
I posted it because several ideas may be extracted from it to help test a string for the unicode side.
And sometimes doing some eliminations adds confidence to a decision process.
If a string has everything balanced to be unicode, then it is less likely to be UTF-8.
This reminds me of the old "Bush hid the facts" string pitfall...

Added: How quickly can you check that a string is valid unicode (UTF-8)? also looks interesting...

PaulSquires · Post by **PaulSquires** » Jul 21, 2018 0:58

I have been able to work around my issue by changing some of the program design so I no longer need code for this. It was an interesting question and I want to thank everyone here who took the time and effort to provide explanations and code offerings.

Pierre Bellisle · Post by **Pierre Bellisle** » Jul 21, 2018 2:54

Just for the fun of it...
What if we use MultiByteToWideChar in a way that it will return an error when translating an invalid UTF-8 string to unicode?
The unicode result is not of importance, but if the function fails then the UTF-8 string was invalid.

Code:

#define JumpCompiler "<D:\Free\64\fbc.exe>"
#define JumpCompilerCmd "<-s console -w pedantic>"
#Lang "fb"

#Include Once "windows.bi"
#Include Once "win\shellapi.bi"'
'_____________________________________________________________________________

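' Returns 0 (FALSE) when sText is valid UTF-8, otherwise the Win32 error code.
' Note: despite the name, a non-zero result means "not valid UTF-8".
FUNCTION isUTF8(BYVAL sText AS STRING) AS LONG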
 Dim RetVal    AS LONG
 Dim LastError AS LONG

 RetVal = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, STRPTR(sText), LEN(sText), BYVAL 0, 0)
 LastError = GetLastError()
 IF RetVal = 0 THEN
   PRINT "Invalid UTF-8 string!"
   PRINT " Error  " & STR$(LastError)
   PRINT " RetVal " & STR$(RetVal)
   'Possible errors
   'ERROR_INSUFFICIENT_BUFFER    0122 0x007A A supplied buffer size was not large enough, or it was incorrectly set to NULL.
   'ERROR_INVALID_FLAGS          1004 0x03EC The values supplied for flags were not valid.
   'ERROR_INVALID_PARAMETER      0087 0x0057 Any of the parameter values was invalid.
   'ERROR_NO_UNICODE_TRANSLATION 1113 0x0459 Invalid Unicode was found in a string.
   FUNCTION = LastError
 ELSE
   PRINT "Valid UTF-8 string!" '& STR$(RetVal)
   FUNCTION = FALSE
 END IF

END FUNCTION
'_____________________________________________________________________________

 Color 14
 PRINT "-----------------------------"
 isUTF8("Invalid characters" & CHR$(&hC0, &h80))       'Invalid
 PRINT "-----------------------------"
 isUTF8("Invalid characters" & CHR$(&hED, &hB2, &h80)) 'Invalid
 PRINT "-----------------------------"
 isUTF8("Jos" & CHR$(130)) 'José                       'Invalid
 PRINT "-----------------------------"
 isUTF8("Jos" & CHR$(&hC3, &hA9)) 'José                'Valid
 PRINT "-----------------------------"
 isUTF8("Paul")                                        'Valid
 PRINT "-----------------------------"

 Color 7 : Print "Press a key or click to end" : Dim buttons As Long
 Do : GetMouse(0, 0, 0, Buttons) : IF buttons Or Len(InKey) Then Exit Do : End If : Sleep 100 : Loop
'_____________________________________________________________________________
'

jj2007 · Post by **jj2007** » Jul 21, 2018 7:11

Pierre Bellisle wrote:What if we use MultiByteToWideChar in a way that it will return an error when translating an invalid UTF-8 string to unicode.

Good idea, but Paul had it a bit earlier, see his post of Jul 20, 2018 0:16

marcov · Post by **marcov** » Jul 21, 2018 10:00

PaulSquires wrote:
Pierre Bellisle wrote:IsTextUnicode
Thanks Pierre, that function only indicates that a string is unicode, but not what type of unicode it is. I was looking to test if a string is explicitly utf-8 and from my research I don't think it is possible.

Please read again, it is only certain if there is a BOM, the rest is heuristics. If you find a malformed utf8 sequence, you are pretty sure it is not utf8.
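As a small illustration of the one certain case, a BOM check only needs to look at the first two or three bytes. The Python sketch below uses the BOM constants from the standard `codecs` module; `sniff_bom` is a hypothetical name, and UTF-32 BOMs are deliberately ignored here (a UTF-32 LE BOM also begins with FF FE, so this sketch would report it as UTF-16 LE).

```python
import codecs


def sniff_bom(data: bytes):
    """Return the encoding announced by a leading BOM, or None if absent."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None  # no BOM: everything from here on is heuristics


if __name__ == "__main__":
    print(sniff_bom(b"\xef\xbb\xbf  The bom is on"))
    print(sniff_bom(b"no bom here"))
```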

utf8 vs utf16 is even easier: at the byte level, utf16 for western text has large numbers of zeros, regardless of whether it is LE or BE.
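A rough Python sketch of that zero-byte heuristic follows. It is an assumption-laden illustration: `looks_like_utf16` and the 0.3 threshold are invented for the example, and the test only makes sense for mostly western text, as stated above.

```python
def looks_like_utf16(data: bytes, threshold: float = 0.3) -> bool:
    """Guess UTF-16 from the share of zero bytes at even or odd offsets."""
    if len(data) < 2:
        return False
    even = data[0::2]   # high bytes of each unit in UTF-16 BE
    odd = data[1::2]    # high bytes of each unit in UTF-16 LE
    zero_even = even.count(0) / len(even)
    zero_odd = odd.count(0) / len(odd)
    # Western text in UTF-16 has a zero in nearly every second byte;
    # UTF-8 and ANSI text normally contain no zero bytes at all.
    return max(zero_even, zero_odd) >= threshold


if __name__ == "__main__":
    print(looks_like_utf16("Hello World".encode("utf-16-le")))
    print(looks_like_utf16("Hello World".encode("utf-8")))
```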

The utf8 heuristics more likely go wrong on short strings or languages with few accents. (like e.g. Dutch). The longer the text, the more reliable it is.

Of course, if no utf8 sequences are found, it is either plain, unextended ASCII (0..127), or UTF-8 without anything that needs UTF-8. However, in that case the ambiguity doesn't affect how to treat the string. You can process it as plain ascii or as UTF-8.

jj2007 · Post by **jj2007** » Jul 21, 2018 13:39

marcov wrote:The utf8 heuristics more likely go wrong on short strings or languages with few accents. (like e.g. Dutch). The longer the text, the more reliable it is.

Right. One of the best tests is for string lengths (pseudocode; uLen returns Utf8 chars):

Code:

  SetGlobals v1$="La vita è bella"	; è is Ansi (cp1252) 232
  SetGlobals v1n$="La vita e bella"	; same but no accent
  SetGlobals v2$="Das Leben ist schön"	; ö is Ansi (cp1252) 246
  SetGlobals v2n$="Das Leben ist schon"	; no Umlaut
  SetGlobals v3$="Жизнь прекрасна"	; Russian
  SetGlobals v4$="生活是美好的"	; Chinese
  Init
  Print v1$, Str$("\tLen=%i", Len(v1$)), Str$(", uLen=%i\n", uLen(v1$))
  Print v1n$, Str$("\tLen=%i", Len(v1n$)), Str$(", uLen=%i\n", uLen(v1n$))
  Print v2$, Str$("\tLen=%i", Len(v2$)), Str$(", uLen=%i\n", uLen(v2$))
  Print v2n$, Str$("\tLen=%i", Len(v2n$)), Str$(", uLen=%i\n", uLen(v2n$))
  Print v3$, Str$("\tLen=%i", Len(v3$)), Str$(", uLen=%i\n", uLen(v3$))
  Print v4$, Str$("\tLen=%i", Len(v4$)), Str$(", uLen=%i\n", uLen(v4$))

Output:

Code:

La vita è bella Len=16, uLen=15
La vita e bella Len=15, uLen=15
Das Leben ist schön     Len=20, uLen=19
Das Leben ist schon     Len=19, uLen=19
Жизнь прекрасна Len=29, uLen=15
生活是美好的  Len=18, uLen=6

If it's Ansi with occasional accents, then byte len and char count are identical or close. If it's "true" Utf8, they will differ widely: by a factor of 2 in the case of Russian (only the space character has the same length), by a factor of 3 for Chinese. But it remains a heuristic test, and you might produce a false positive, although that's very unlikely.
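The Len/uLen comparison can be reproduced in a few lines of Python, shown here only as a sketch: `byte_and_char_len` is a hypothetical stand-in for the Len/uLen pair above, and it assumes the input is already known to decode as UTF-8 (otherwise `decode` raises an error).

```python
def byte_and_char_len(data: bytes):
    """Return (byte length, number of code points) for UTF-8 encoded data."""
    return len(data), len(data.decode("utf-8"))


if __name__ == "__main__":
    samples = [
        "La vita è bella",
        "La vita e bella",
        "Жизнь прекрасна",
        "生活是美好的",
    ]
    for text in samples:
        n_bytes, n_chars = byte_and_char_len(text.encode("utf-8"))
        # A large ratio of bytes to characters points to "true" UTF-8 text.
        print(f"{text}\tLen={n_bytes}, uLen={n_chars}")
```

Equal values mean the text is plain 7-bit; the further the byte count exceeds the character count, the more confident the guess.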
