How to determine if string is UTF-8 encoded
Maybe this is simple?
I have a STRING and need to know if that string is encoded as UTF-8 or is just a regular ANSI string.
Does anyone have code to perform this check? Is this something easy that I am not seeing?
(Windows platform)
Re: How to determine if string is UTF-8 encoded
According to a Stack Overflow answer, the following should work on Windows:
Code: Select all
' On Windows, you can use MultiByteToWideChar() with the CP_UTF8 codepage
' and the MB_ERR_INVALID_CHARS flag. If the function fails, the string is not valid UTF-8.
' Note: the call also returns 0 for an empty string (LEN(sText) = 0).
if MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, STRPTR(sText), LEN(sText), NULL, 0) = 0 then
    ' Not a valid UTF-8 string
end if
Re: How to determine if string is UTF-8 encoded
Interesting, and yes, it does work.
For example, appending Chr$(169) to an English string makes it invalid Utf8 (although it prints fine to the console, provided the codepage is set to 1252):
Let esi="Click on this button"+Chr$(169) ; Ansi with one extra character
[Click on this button©] is not a valid Utf8 string
But attention: this does not answer the question "is this Utf8?". Ordinary Ansi strings like "Hello World", which contain only 7-bit ASCII characters, are valid Utf8 strings, too. In fact, the only method to distinguish plain Ansi from "real" Utf8 might be to check whether the byte length of the string is the same as its length counted in Utf8 characters.
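The length comparison suggested above could look roughly like this in FreeBASIC. This is only a minimal sketch: Utf8CharCount is a hypothetical helper name, and it assumes the string has already passed a validity check, because it merely skips continuation bytes and does not validate anything.

```freebasic
' Hypothetical helper: counts UTF-8 characters by skipping continuation
' bytes (bit pattern 10xxxxxx). Assumes the input is well-formed UTF-8.
function Utf8CharCount(byref s as const string) as integer
    dim count as integer = 0
    for i as integer = 0 to len(s) - 1
        if (s[i] and &hC0) <> &h80 then count += 1
    next
    return count
end function

dim plain as string = "Hello World"
dim accented as string = "Jos" & chr(&hC3, &hA9)   ' "José" encoded as UTF-8

print len(plain), Utf8CharCount(plain)         ' 11 bytes, 11 characters
print len(accented), Utf8CharCount(accented)   ' 5 bytes, 4 characters
```

When both numbers are equal, the string contains nothing beyond 7-bit ASCII; when the byte count is higher, at least one multibyte sequence is present.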
Re: How to determine if string is UTF-8 encoded
- If there is a UTF-8 BOM, it is UTF-8 (this applies more to document text than to separate strings). This is the only non-heuristic test.
- Walk the string and try to decode it. Make sure that when a UTF-8 multibyte sequence starts, the high bits of the following bytes are as expected (e.g. if the lead byte starts with high bits 110, the next byte's highest two bits must be 10, etc.). If not, it is not UTF-8.
- You can also test whether the decoded Unicode code points are valid (this might require extensive tables, so it is done more rarely; sometimes simple heuristic checks can be used for a rough determination of script/language type).
- If it has only well-formed UTF-8 sequences, count the multibyte ones. Then utf-8 = (count > 0).
munair had some utf8 libraries ported from Lazarus iirc. It might be worthwhile to check them out.
https://en.wikipedia.org/wiki/UTF-8
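The "walk the string" and "count the multibyte sequences" steps from the list above might be sketched like this in FreeBASIC. CountUtf8Multibyte is a hypothetical name, and the check is purely structural: it only verifies lead and continuation bit patterns, so overlong forms such as C0 80 and encoded surrogates are not rejected.

```freebasic
' Returns -1 if a malformed sequence is found, otherwise the number of
' multibyte sequences (0 means the string is pure 7-bit ASCII).
' Structural check only: overlong forms and surrogates are NOT rejected.
function CountUtf8Multibyte(byref s as const string) as integer
    dim i as integer = 0
    dim multibyte as integer = 0
    while i < len(s)
        dim b as ubyte = s[i]
        dim extra as integer
        if b < &h80 then
            extra = 0                           ' 0xxxxxxx: ASCII
        elseif (b and &hE0) = &hC0 then
            extra = 1                           ' 110xxxxx: one continuation byte
        elseif (b and &hF0) = &hE0 then
            extra = 2                           ' 1110xxxx: two continuation bytes
        elseif (b and &hF8) = &hF0 then
            extra = 3                           ' 11110xxx: three continuation bytes
        else
            return -1                           ' stray continuation or invalid lead byte
        end if
        if i + extra >= len(s) then return -1   ' sequence is cut off at the end
        for k as integer = 1 to extra
            if (s[i + k] and &hC0) <> &h80 then return -1
        next
        if extra > 0 then multibyte += 1
        i += extra + 1
    wend
    return multibyte
end function

print CountUtf8Multibyte("Paul")                      ' 0: plain ASCII
print CountUtf8Multibyte("Jos" & chr(&hC3, &hA9))     ' 1: one multibyte sequence
print CountUtf8Multibyte("Jos" & chr(130))            ' -1: stray continuation byte
```

A result above zero corresponds to the "count > 0" case in the last bullet; zero means the ASCII/UTF-8 question is moot for that string.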
Re: How to determine if string is UTF-8 encoded
Pierre Bellisle wrote: IsTextUnicode
Thanks Pierre, that function only indicates that a string is Unicode, but not what type of Unicode it is. I was looking to test if a string is explicitly UTF-8, and from my research I don't think it is possible.
Re: How to determine if string is UTF-8 encoded
PaulSquires wrote: I was looking to test if a string is explicitly utf-8 and from my research I don't think it is possible.
As marcov wrote: If there is a Utf8 BOM, it's Utf8. If there is no BOM, and you can exclude that it's UTF16, then one possible test is to compare the Ansi length (the byte count) to the Utf8 length (the character count). If they are equal, it's not Utf8 (or rather: it's the plain ASCII subset of Utf8). If, however, the string is valid Utf8 and the byte count is higher than the Utf8 string length, then it is almost certainly Utf8.
Another question is the relevance of all this. Do you have a concrete case where it matters? Or, in particular, where it would be useful to know if it's valid Utf8?
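The BOM test mentioned above is simple to sketch in FreeBASIC. DetectBom is a hypothetical helper name, and the returned labels are illustrative only. Note that the bytes FF FE also begin the UTF-32LE BOM (FF FE 00 00), which this sketch does not distinguish.

```freebasic
' Hypothetical helper: reports which byte order mark, if any, starts the buffer.
' Returns "" when no BOM is present, which says nothing about the encoding.
function DetectBom(byref s as const string) as string
    if len(s) >= 3 then
        if s[0] = &hEF andalso s[1] = &hBB andalso s[2] = &hBF then return "UTF-8"
    end if
    if len(s) >= 2 then
        if s[0] = &hFF andalso s[1] = &hFE then return "UTF-16LE"
        if s[0] = &hFE andalso s[1] = &hFF then return "UTF-16BE"
    end if
    return ""
end function

print DetectBom(chr(&hEF, &hBB, &hBF) & "text")   ' UTF-8
print DetectBom("text")                           ' empty line: no BOM found
```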
Re: How to determine if string is UTF-8 encoded
Maybe look into the world of C code.
After all, the Win API is written in C, so it can achieve nothing that C can't.
There are plenty of discussions out there.
After about five minutes I got hold of:
Code: Select all
function isutf8( byval _string as zstring ptr ) as boolean
if _string = 0 then return 0
' BOMs are compared as raw bytes, hence chr() rather than wchr()
if left(*_string,2) = chr(&hFF,&hFE) or left(*_string,2) = chr(&hFE,&hFF) then print "utf16" : exit function 'utf-16 boms
if left(*_string,3) = chr(&hEF,&hBB,&hBF) then return 1 'utf-8 bom
dim bytes as const ubyte ptr = cptr(const ubyte ptr, _string)
while *bytes
if (((bytes[0] = &h09) orelse (bytes[0] = &h0A)) orelse (bytes[0] = &h0D)) orelse ((&h20 <= bytes[0]) andalso (bytes[0] <= &h7E)) then
bytes += 1
continue while
end if
if ((&hC2 <= bytes[0]) andalso (bytes[0] <= &hDF)) andalso ((&h80 <= bytes[1]) andalso (bytes[1] <= &hBF)) then
bytes += 2
continue while
end if
if ((((bytes[0] = &hE0) andalso ((&hA0 <= bytes[1]) andalso (bytes[1] <= &hBF))) andalso ((&h80 <= bytes[2]) andalso (bytes[2] <= &hBF))) orelse ((((((&hE1 <= bytes[0]) andalso (bytes[0] <= &hEC)) orelse (bytes[0] = &hEE)) orelse (bytes[0] = &hEF)) andalso ((&h80 <= bytes[1]) andalso (bytes[1] <= &hBF))) andalso ((&h80 <= bytes[2]) andalso (bytes[2] <= &hBF)))) orelse (((bytes[0] = &hED) andalso ((&h80 <= bytes[1]) andalso (bytes[1] <= &h9F))) andalso ((&h80 <= bytes[2]) andalso (bytes[2] <= &hBF))) then
bytes += 3
continue while
end if
if (((((bytes[0] = &hF0) andalso ((&h90 <= bytes[1]) andalso (bytes[1] <= &hBF))) andalso ((&h80 <= bytes[2]) andalso (bytes[2] <= &hBF))) andalso ((&h80 <= bytes[3]) andalso (bytes[3] <= &hBF))) orelse (((((&hF1 <= bytes[0]) andalso (bytes[0] <= &hF3)) andalso ((&h80 <= bytes[1]) andalso (bytes[1] <= &hBF))) andalso ((&h80 <= bytes[2]) andalso (bytes[2] <= &hBF))) andalso ((&h80 <= bytes[3]) andalso (bytes[3] <= &hBF)))) orelse ((((bytes[0] = &hF4) andalso ((&h80 <= bytes[1]) andalso (bytes[1] <= &h8F))) andalso ((&h80 <= bytes[2]) andalso (bytes[2] <= &hBF))) andalso ((&h80 <= bytes[3]) andalso (bytes[3] <= &hBF))) then
bytes += 4
continue while
end if
return 0
wend
return 1
end function
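' The test data must be raw bytes, so use a byte string and chr() instead of a wstring
dim as string test = chr(&hEF,&hBB,&hBF) + " The bom is on"
print test
print isutf8(strptr(test))
print
test = chr(&hFE,&hFF) + " utf-16 bom is attached"
print test
print isutf8(strptr(test))
print
' Assumes this source file is saved as UTF-8 without a BOM, so the literal's bytes are UTF-8
test = "Σὲ γνωρίζω ἀπὸ τὴν κόψη"
print test
print isutf8(strptr(test))
sleep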
Re: How to determine if string is UTF-8 encoded
>> "that function only indicates that a string is unicode"
Hey Paul,
Yep, in fact, "tries to indicate" would be even more appropriate.
I posted it because several ideas may be extracted from it to help test a string for the Unicode side.
And sometimes doing some eliminations adds confidence to a decision process.
If a string has all the hallmarks of Unicode (UTF-16), then it is less likely to be UTF-8.
This reminds me of the old "Bush hid the facts" string pitfall...
Added: "How quickly can you check that a string is valid unicode (UTF-8)?" also looks interesting...
Re: How to determine if string is UTF-8 encoded
I have been able to work around my issue by changing some of the program design, so I no longer need code for this. It was an interesting question and I want to thank everyone here who took the time and effort to provide explanations and code offerings.
Re: How to determine if string is UTF-8 encoded
Just for the fun of it...
What if we use MultiByteToWideChar in such a way that it returns an error when translating an invalid UTF-8 string to Unicode?
The Unicode result is not important, but if the function fails, then the UTF-8 string was invalid.
Code: Select all
#define JumpCompiler "<D:\Free\64\fbc.exe>"
#define JumpCompilerCmd "<-s console -w pedantic>"
#Lang "fb"
#Include Once "windows.bi"
#Include Once "win\shellapi.bi"'
'_____________________________________________________________________________
FUNCTION isUTF8(BYVAL sText AS STRING) AS LONG
Dim RetVal AS LONG
Dim LastError AS LONG
RetVal = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, STRPTR(sText), LEN(sText), BYVAL 0, 0)
LastError = GetLastError()
IF RetVal = 0 THEN
PRINT "Invalid UTF-8 string!"
PRINT " Error " & STR$(LastError)
PRINT " RetVal " & STR$(RetVal)
'Possible errors
'ERROR_INSUFFICIENT_BUFFER 0122 0x007A A supplied buffer size was not large enough, or it was incorrectly set to NULL.
'ERROR_INVALID_FLAGS 1004 0x03EC The values supplied for flags were not valid.
'ERROR_INVALID_PARAMETER 0087 0x0057 Any of the parameter values was invalid.
'ERROR_NO_UNICODE_TRANSLATION 1113 0x0459 Invalid Unicode was found in a string.
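FUNCTION = LastError 'Non-zero Windows error code: the string is not valid UTF-8
ELSE
PRINT "Valid UTF-8 string!" '& STR$(RetVal)
FUNCTION = FALSE 'Zero (no error): the string is valid UTF-8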
END IF
END FUNCTION
'_____________________________________________________________________________
Color 14
PRINT "-----------------------------"
isUTF8("Invalid characters" & CHR$(&hC0, &h80)) 'Invalid
PRINT "-----------------------------"
isUTF8("Invalid characters" & CHR$(&hED, &hB2, &h80)) 'Invalid
PRINT "-----------------------------"
isUTF8("Jos" & CHR$(130)) 'José 'Invalid
PRINT "-----------------------------"
isUTF8("Jos" & CHR$(&hC3, &hA9)) 'José 'Valid
PRINT "-----------------------------"
isUTF8("Paul") 'Valid
PRINT "-----------------------------"
Color 7 : Print "Press a key or click to end" : Dim buttons As Long
Do : GetMouse(0, 0, 0, Buttons) : IF buttons Or Len(InKey) Then Exit Do : End If : Sleep 100 : Loop
'_____________________________________________________________________________
'
Re: How to determine if string is UTF-8 encoded
Pierre Bellisle wrote: What if we use MultiByteToWideChar in a way that it will return an error when translating an invalid UTF-8 string to unicode.
Good idea, but Paul had it a bit earlier; see his post of Jul 20, 2018 2:16.
Re: How to determine if string is UTF-8 encoded
PaulSquires wrote: Thanks Pierre, that function only indicates that a string is unicode, but not what type of unicode it is. I was looking to test if a string is explicitly utf-8 and from my research I don't think it is possible.
Please read again: it is only certain if there is a BOM, the rest is heuristics. If you find a malformed UTF-8 sequence, you can be pretty sure it is not UTF-8.
UTF-8 vs UTF-16 is even easier: at the byte level, Western text in UTF-16 has large numbers of zero bytes, regardless of whether it is LE or BE.
The UTF-8 heuristics are more likely to go wrong on short strings or languages with few accents (e.g. Dutch). The longer the text, the more reliable they are.
Of course, if no UTF-8 multibyte sequences are found, it is either plain, unextended ASCII (0..127), or UTF-8 without anything that needs UTF-8. In that case, however, the ambiguity doesn't affect how to treat the string: you can process it as plain ASCII or as UTF-8.
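The zero-byte observation for UTF-16 can be turned into a rough check such as the following FreeBASIC sketch. LooksLikeUtf16 is a hypothetical name and the 40% threshold is an arbitrary assumption, not a documented value; for pure Western text in UTF-16 about half of the bytes are zero.

```freebasic
' Rough heuristic, not a proof: Western text stored as UTF-16 has a zero
' in almost every second byte, whether it is LE or BE.
' The 40% threshold is an arbitrary assumption.
function LooksLikeUtf16(byref buf as const string) as boolean
    dim n as integer = len(buf)
    if n < 2 then return false
    dim zeros as integer = 0
    for i as integer = 0 to n - 1
        if buf[i] = 0 then zeros += 1
    next
    if zeros * 10 >= n * 4 then return true
    return false
end function

dim sample as string = "H i "    ' 4 bytes; the spaces are placeholders
sample[1] = 0 : sample[3] = 0    ' bytes are now 48 00 69 00 ("Hi" as UTF-16LE)

print LooksLikeUtf16(sample)     ' true: half of the bytes are zero
print LooksLikeUtf16("Hi")       ' false: no zero bytes
```

As with the UTF-8 heuristics, the result becomes more trustworthy the longer the buffer is.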
Re: How to determine if string is UTF-8 encoded
marcov wrote: The utf8 heuristics more likely go wrong on short strings or languages with few accents. (like e.g. Dutch). The longer the text, the more reliable it is.
Right. One of the best tests is for string lengths (pseudocode; uLen returns Utf8 chars):
Code: Select all
SetGlobals v1$="La vita è bella" ; è is Ansi 232 (codepage 1252)
SetGlobals v1n$="La vita e bella" ; same but no accent
SetGlobals v2$="Das Leben ist schön" ; ö is Ansi 246 (codepage 1252)
SetGlobals v2n$="Das Leben ist schon" ; no Umlaut
SetGlobals v3$="Жизнь прекрасна" ; Russian
SetGlobals v4$="生活是美好的" ; Chinese
Init
Print v1$, Str$("\tLen=%i", Len(v1$)), Str$(", uLen=%i\n", uLen(v1$))
Print v1n$, Str$("\tLen=%i", Len(v1n$)), Str$(", uLen=%i\n", uLen(v1n$))
Print v2$, Str$("\tLen=%i", Len(v2$)), Str$(", uLen=%i\n", uLen(v2$))
Print v2n$, Str$("\tLen=%i", Len(v2n$)), Str$(", uLen=%i\n", uLen(v2n$))
Print v3$, Str$("\tLen=%i", Len(v3$)), Str$(", uLen=%i\n", uLen(v3$))
Print v4$, Str$("\tLen=%i", Len(v4$)), Str$(", uLen=%i\n", uLen(v4$))
Code: Select all
La vita è bella Len=16, uLen=15
La vita e bella Len=15, uLen=15
Das Leben ist schön Len=20, uLen=19
Das Leben ist schon Len=19, uLen=19
Жизнь прекрасна Len=29, uLen=15
生活是美好的 Len=18, uLen=6