
[BugFixed] utf8


Author Message
Written on: 15. 04. 2025 [23:01]
arccis
Arkadii Kysil
Topic creator
registered since: 18.11.2023
Posts: 8
I tested my library and decided to add decoding and encoding of surrogate pairs such as \uD83D\uDE00. It turned out that the SYS.strFromCharUTF function does not work correctly for 3-byte and 4-byte characters. I found the TMess::setUTF8 function in the source code and ported it to JavaLikeCalc, then verified over the entire range (0x0...0x10FFFF) that the port produces the same output as the C++ original. After that I implemented another function, setUTF8_new, and compared the results.

JAVASCRIPT
function setUTF8(symb){ // port of TMess::setUTF8; works like SYS.strFromCharUTF over 0...0x10FFFF
	rez = "";
	if(symb < 0x80)
		rez = SYS.strFromCharCode(symb);
	else for(iCh = 5, iSt = -1; iCh >= 0; iCh--) {
		// iSt is the position of the lead byte, i.e. the number of continuation bytes
		if(iSt < iCh && (symb>>(iCh*6))) iSt = iCh;
		if(iCh == iSt) rez += SYS.strFromCharCode((0xFF<<(7-iCh))|(symb>>(iCh*6)));
		else if(iCh < iSt) rez += SYS.strFromCharCode(0x80|(0x3F&(symb>>(iCh*6))));
	}
 
	return rez;
}
 
 
function setUTF8_new(symb){
	if (symb <= 0x7F) {
		return SYS.strFromCharCode(symb);
	} else if (symb <= 0x7FF) {
		return SYS.strFromCharCode(
			0xC0 | (symb >> 6),
			0x80 | (symb & 0x3F)
		);
	} else if (symb <= 0xFFFF) {
		return SYS.strFromCharCode(
			0xE0 | (symb >> 12),
			0x80 | ((symb >> 6) & 0x3F),
			0x80 | (symb & 0x3F)
		);
	} else if (symb <= 0x10FFFF) {
		return SYS.strFromCharCode(
			0xF0 | (symb >> 18),          // first 3 bits
			0x80 | ((symb >> 12) & 0x3F), // next 6 bits
			0x80 | ((symb >> 6) & 0x3F),  // next 6 bits
			0x80 | (symb & 0x3F)          // last 6 bits
		);
	}
	return ""; // outside the Unicode range
}
 
// test for surrogate pairs
/*
code1 = 0xD83D;
code2 = 0xDE00;
codePoint = 0x10000 + ((code1 - 0xD800) << 10) + (code2 - 0xDC00);
output1 = setUTF8(codePoint);
//output2 = SYS.strFromCharUTF(codePoint);
output2 = setUTF8_new(codePoint);
*/
 
 
// test in loop
for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) {
 
	if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
		continue; // surrogate pairs
 
	//output1 = setUTF8(codePoint);
	output1 = SYS.strFromCharUTF(codePoint);
	output2 = setUTF8_new(codePoint);
 
	if (output1 != output2) break;
 
}
 
 
// test  passed charCodeAt(0,"UTF-8");
/*for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) {
 
 if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
            continue; // surrogate pairs
 
	output1 = setUTF8_new(codePoint);
	codePoint1 = output1.charCodeAt(0,"UTF-8"); 
 
	if (codePoint != codePoint1) break;	
}
*/
ArInt1s = "dec:[";
for(i = 0; i < output1.length; i++){
	ArInt1s += output1.charCodeAt(i).toString(10) + ",";
}
ArInt1s += "]; hex:[";
for(i = 0; i < output1.length; i++){
	ArInt1s += "0x" + output1.charCodeAt(i).toString(16) + ",";
}
ArInt1s += "]";
 
 
ArInt2s = "dec:[";
for(i = 0; i < output2.length; i++){
	ArInt2s += output2.charCodeAt(i).toString(10) + ",";
}
ArInt2s += "]; hex:[";
for(i = 0; i < output2.length; i++){
	ArInt2s += "0x" + output2.charCodeAt(i).toString(16) + ",";
}
ArInt2s += "]";
 
 
if (output1 == output2){
	ArInt2s = "equal";
}


P.S. As I understand it, a UTF-8 sequence can be no longer than 4 bytes. The 6-byte form is obsolete (pre-2003) and is not recommended.
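To see where a loop-based encoder of this shape can diverge from the branch-based one, it helps to dump both byte sequences for a boundary code point. The sketch below is a C++ transliteration of the JavaLikeCalc setUTF8 port from this post, not the actual OpenSCADA source; the names loopEncode, branchEncode and toHex are hypothetical, and it assumes every emitted value is truncated to 8 bits, as a C++ char cast does.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Transliteration of the loop-based port (hypothetical name).
// The lead byte is placed at the highest iCh for which symb >> (iCh*6) is non-zero.
std::string loopEncode(uint32_t symb) {
    std::string rez;
    if (symb < 0x80) {
        rez += static_cast<char>(symb);
        return rez;
    }
    for (int iCh = 5, iSt = -1; iCh >= 0; iCh--) {
        if (iSt < iCh && (symb >> (iCh * 6))) iSt = iCh;
        if (iCh == iSt)
            rez += static_cast<char>((0xFF << (7 - iCh)) | (symb >> (iCh * 6)));
        else if (iCh < iSt)
            rez += static_cast<char>(0x80 | (0x3F & (symb >> (iCh * 6))));
    }
    return rez;
}

// Branch-based encoder limited to 0...0x10FFFF, same shape as setUTF8_new.
std::string branchEncode(uint32_t symb) {
    std::string rez;
    if (symb <= 0x7F) {
        rez += static_cast<char>(symb);
    } else if (symb <= 0x7FF) {
        rez += static_cast<char>(0xC0 | (symb >> 6));
        rez += static_cast<char>(0x80 | (symb & 0x3F));
    } else if (symb <= 0xFFFF) {
        rez += static_cast<char>(0xE0 | (symb >> 12));
        rez += static_cast<char>(0x80 | ((symb >> 6) & 0x3F));
        rez += static_cast<char>(0x80 | (symb & 0x3F));
    } else if (symb <= 0x10FFFF) {
        rez += static_cast<char>(0xF0 | (symb >> 18));
        rez += static_cast<char>(0x80 | ((symb >> 12) & 0x3F));
        rez += static_cast<char>(0x80 | ((symb >> 6) & 0x3F));
        rez += static_cast<char>(0x80 | (symb & 0x3F));
    }
    return rez; // empty for values above 0x10FFFF
}

// Upper-case hex dump of a byte string, two digits per byte.
std::string toHex(const std::string& bytes) {
    std::string out;
    char buf[3];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}
```

Under these assumptions, toHex(loopEncode(0x800)) yields E080 while toHex(branchEncode(0x800)) yields E0A080; for 0x7FF and 0x1000 the two functions agree.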
Written on: 16. 04. 2025 [11:27]
roman
Roman Savochenko
Moderator
Contributor
Developer
registered since: 12.12.2007
Posts: 3788
"arccis" wrote:

I tested my library and decided to add decoding and encoding of surrogate pairs such as \uD83D\uDE00. It turned out that the SYS.strFromCharUTF function does not work correctly for 3-byte and 4-byte characters. I found the TMess::setUTF8 function in the source code and ported it to JavaLikeCalc, then verified over the entire range (0x0...0x10FFFF) that the port produces the same output as the C++ original. After that I implemented another function, setUTF8_new, and compared the results.

So what is the problem with three-byte encoding, say for 0xD83D, where TMess::setUTF8() returns 0xEDA0BD? That corresponds to https://en.wikipedia.org/wiki/UTF-8 and is written out as:
JAVASCRIPT
0xD83D:	       1101    100000    111101
0xEDA0BD: 1110 1101 10 100000 10 111101
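The bit split in the diagram above can be reproduced mechanically. The following is a minimal sketch with a hypothetical helper name, threeByteLayout, which assumes its argument lies in 0x800...0xFFFF. It only shows how the 16 payload bits are distributed over the 1110xxxx 10xxxxxx 10xxxxxx pattern and does not check whether the value is a valid Unicode scalar (RFC 3629 prohibits encoding U+D800...U+DFFF).

```cpp
#include <bitset>
#include <cstdint>
#include <string>

// Hypothetical helper: renders the 3-byte UTF-8 bit layout of a value
// assumed to be in 0x800...0xFFFF, grouped as prefix and payload bits.
std::string threeByteLayout(uint32_t cp) {
    std::string top = std::bitset<4>(cp >> 12).to_string();         // 4 bits for the lead byte
    std::string mid = std::bitset<6>((cp >> 6) & 0x3F).to_string(); // 6 bits, first continuation
    std::string low = std::bitset<6>(cp & 0x3F).to_string();        // 6 bits, second continuation
    return "1110 " + top + " 10 " + mid + " 10 " + low;
}
```

For 0xD83D this returns "1110 1101 10 100000 10 111101", the same grouping as in the diagram.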



"arccis" wrote:

P.S. As I understand it, a UTF-8 sequence can be no longer than 4 bytes. The 6-byte form is obsolete (pre-2003) and is not recommended.

If you don't use such code points, you will not get 6 bytes.
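The relation between the value and the sequence length can be written down directly. This is a sketch under the historical, pre-RFC 3629 scheme that allowed up to 6 bytes; utf8Length is a hypothetical name, not an OpenSCADA function.

```cpp
#include <cstdint>

// Hypothetical helper: number of bytes needed for a value under the
// historical UTF-8 scheme (up to 6 bytes). Returns 0 if the value
// does not fit into 31 bits.
int utf8Length(uint32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    if (cp < 0x200000) return 4;
    if (cp < 0x4000000) return 5;
    if (cp <= 0x7FFFFFFF) return 6;
    return 0;
}
```

Since Unicode ends at U+10FFFF, which is below 0x200000, no valid code point reaches the 5-byte or 6-byte branches.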

Learn, learn and learn better than work, work and work.
Written on: 17. 04. 2025 [01:24]
arccis
Arkadii Kysil
Topic creator
registered since: 18.11.2023
Posts: 8
1) For the 0xD83D test, both functions (TMess::setUTF8() and setUTF8_new(symb)) indeed give the same result.
2) But 0xD83D cannot actually be converted on its own: it is one half of a surrogate pair (U+D800 through U+DFFF, picture 1), so the two halves must first be combined with the following formula:
JAVASCRIPT
code1 = 0xD83D;
code2 = 0xDE00;
codePoint = 0x10000 + ((code1 - 0xD800) << 10) + (code2 - 0xDC00);

After that code, codePoint == U+1F600, which is a smiley (pic2). Here setUTF8_new is correct and SYS.strFromCharUTF is wrong.
3) I compared the functions, and they differ for U+800...U+FFF and from U+10000 upward. Another example is U+1FA00 (pic3).
4) charCodeAt(0,"UTF-8") is correct over 0...0x10FFFF:
JAVASCRIPT
output1 = setUTF8_new(codePoint);
codePoint1 = output1.charCodeAt(0,"UTF-8");
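The surrogate-pair formula from item 2 can be wrapped with range checks so that a lone or swapped half is rejected rather than silently combined. This is a minimal sketch; combineSurrogates, splitToSurrogates and the kInvalid sentinel are hypothetical names introduced for illustration.

```cpp
#include <cstdint>

// Sentinel returned for an invalid pair (assumption of this sketch).
const uint32_t kInvalid = 0xFFFFFFFFu;

// Combines a UTF-16 high/low surrogate pair into one code point.
// Returns kInvalid if either half is outside its range.
uint32_t combineSurrogates(uint32_t high, uint32_t low) {
    if (high < 0xD800 || high > 0xDBFF) return kInvalid;
    if (low < 0xDC00 || low > 0xDFFF) return kInvalid;
    return 0x10000u + ((high - 0xD800u) << 10) + (low - 0xDC00u);
}

// Inverse operation: splits a code point in 0x10000...0x10FFFF into
// its two halves. Returns false if the value is outside that range.
bool splitToSurrogates(uint32_t cp, uint32_t& high, uint32_t& low) {
    if (cp < 0x10000 || cp > 0x10FFFF) return false;
    uint32_t v = cp - 0x10000u;
    high = 0xD800u + (v >> 10);   // upper 10 bits
    low = 0xDC00u + (v & 0x3FFu); // lower 10 bits
    return true;
}
```

With the values from this post, combineSurrogates(0xD83D, 0xDE00) gives 0x1F600, and splitting 0x1F600 returns the same two halves.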


I used this code for the tests:
JAVASCRIPT
function setUTF8_new(symb){
	if (symb <= 0x7F) {
		return SYS.strFromCharCode(symb);
	} else if (symb <= 0x7FF) {
		return SYS.strFromCharCode(
			0xC0 | (symb >> 6),
			0x80 | (symb & 0x3F)
		);
	} else if (symb <= 0xFFFF) {
		return SYS.strFromCharCode(
			0xE0 | (symb >> 12),
			0x80 | ((symb >> 6) & 0x3F),
			0x80 | (symb & 0x3F)
		);
	} else if (symb <= 0x10FFFF) {
		return SYS.strFromCharCode(
			0xF0 | (symb >> 18),          // first 3 bits
			0x80 | ((symb >> 12) & 0x3F), // next 6 bits
			0x80 | ((symb >> 6) & 0x3F),  // next 6 bits
			0x80 | (symb & 0x3F)          // last 6 bits
		);
	}
	return ""; // outside the Unicode range
}
 
// test for surrogate pairs
 
code1 = 0xD83D;
code2 = 0xDE00;
codePoint = 0x10000 + ((code1 - 0xD800) << 10) + (code2 - 0xDC00);
codePoint = 0x1FA00;
codePoint = 0x800;
 
 
//output1 = setUTF8(codePoint);
output1 = SYS.strFromCharUTF(codePoint);
output2 = setUTF8_new(codePoint);
 
 
 
 
// test in loop
/*
for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) {
 
 if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
            continue; // surrogate pairs
 
	//output1 = setUTF8(codePoint);
	output1 = SYS.strFromCharUTF(codePoint);
	output2 = setUTF8_new(codePoint);
 
	if (output1 != output2) //break;
		console +=  "U+" + codePoint.toString(16) + "\n";	
}
*/
// test  passed charCodeAt(0,"UTF-8");
/*for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) {
 
 if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
            continue; // surrogate pairs
 
	output1 = setUTF8_new(codePoint);
	codePoint1 = output1.charCodeAt(0,"UTF-8"); 
 
	if (codePoint != codePoint1) break;	
}
*/
 
codePointHex = "U+" + codePoint.toString(16) + " bin:[" + codePoint.toString(2) + "]";
 
 
 
ArInt1s = "len=" + output1.length + "; ";
ArInt1s += " hex:[";
for(i = 0; i < output1.length; i++){ 
	ArInt1s += "0x" + output1.charCodeAt(i).toString(16) + ",";
}	
 
ArInt1s += "]; dec:[";
for(i = 0; i < output1.length; i++){ 
	ArInt1s += output1.charCodeAt(i).toString(10) + ",";
}					
ArInt1s += "]; bin:[";
for(i = 0; i < output1.length; i++){ 
	ArInt1s +=   output1.charCodeAt(i).toString(2,8) + ",";
}	
 
ArInt1s += "]";
 
ArInt2s = "len=" + output2.length + "; ";
ArInt2s += "hex:[";
 
for(i = 0; i < output2.length; i++){ 
	ArInt2s += "0x" + output2.charCodeAt(i).toString(16) + ",";
}	
 
ArInt2s += "]; dec:[";
for(i = 0; i < output2.length; i++){ 
	ArInt2s += output2.charCodeAt(i).toString(10) + ",";
}	
ArInt2s += "]; bin:[";
for(i = 0; i < output2.length; i++){ 
	ArInt2s +=   output2.charCodeAt(i).toString(2,8) + ",";
}	
 
ArInt2s += "]";
 
 
if (output1 == output2){
	ArInt2s = "equal";
}
Attachment

pic 1.png (File type: image/png, Size: 18.32 kilobytes) — 302 downloads
pic2.png (File type: image/png, Size: 30.73 kilobytes) — 307 downloads
pic3.png (File type: image/png, Size: 29.93 kilobytes) — 289 downloads
U 800.png (File type: image/png, Size: 27.38 kilobytes) — 310 downloads
pic5.png (File type: image/png, Size: 23.62 kilobytes) — 296 downloads
Written on: 17. 04. 2025 [09:09]
roman
Roman Savochenko
Moderator
Contributor
Developer
registered since: 12.12.2007
Posts: 3788
"arccis" wrote:

2) But 0xD83D cannot actually be converted on its own: it is one half of a surrogate pair (U+D800 through U+DFFF, picture 1), so the two halves must first be combined with the following formula:

Yes, I see, that is a real problem; I had not run into it because I have not used such codes.
But it is easily fixed within the same code:
C++
string TMess::setUTF8( uint32_t symb )
{
    string rez;
    if(symb < 0x80) rez += (char)symb;
    else for(int iCh = 5, iSt = -1; iCh >= 0; iCh--) {
        if(iSt < iCh && (symb>>((iCh-1)*6+(7-iCh)))) iSt = iCh;
        if(iCh == iSt) rez += (char)((0xFF<<(7-iCh))|(symb>>(iCh*6)));
        else if(iCh < iSt) rez += (char)(0x80|(0x3F&(symb>>(iCh*6))));
    }
 
    return rez;
}


So, now I have:
U+1F600 = F09F9880
U+1FA00 = F09FA880
U+D83D = EDA0BD

Written on: 17. 04. 2025 [11:20]
arccis
Arkadii Kysil
Topic creator
registered since: 18.11.2023
Posts: 8
Both the old and the new functions fail the U+800 and U+10000 tests. The length is now determined correctly, but the first byte differs. ChatGPT refers to the RFC 3629 limits and offers code without a loop.


C++
string setUTF8(uint32_t symb) {
    string rez;
    if (symb < 0x80)
        rez += (char)symb;
    else if (symb < 0x800) {
        rez += (char)(0xC0 | (symb >> 6));
        rez += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x10000) {
        rez += (char)(0xE0 | (symb >> 12));
        rez += (char)(0x80 | ((symb >> 6) & 0x3F));
        rez += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x110000) {
        rez += (char)(0xF0 | (symb >> 18));
        rez += (char)(0x80 | ((symb >> 12) & 0x3F));
        rez += (char)(0x80 | ((symb >> 6) & 0x3F));
        rez += (char)(0x80 | (symb & 0x3F));
    }
    return rez;
}



The chat also said that the 6-byte form is not allowed by RFC 3629 and is considered unsafe, but here is the code for it anyway:
C++
std::string encodeUTF8_6Byte(uint32_t symb) {
    std::string result;
 
    if (symb < 0x80) {
        result += (char)symb;
    } else if (symb < 0x800) {
        result += (char)(0xC0 | (symb >> 6));
        result += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x10000) {
        result += (char)(0xE0 | (symb >> 12));
        result += (char)(0x80 | ((symb >> 6) & 0x3F));
        result += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x200000) {
        result += (char)(0xF0 | (symb >> 18));
        result += (char)(0x80 | ((symb >> 12) & 0x3F));
        result += (char)(0x80 | ((symb >> 6) & 0x3F));
        result += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x4000000) {
        result += (char)(0xF8 | (symb >> 24));
        result += (char)(0x80 | ((symb >> 18) & 0x3F));
        result += (char)(0x80 | ((symb >> 12) & 0x3F));
        result += (char)(0x80 | ((symb >> 6) & 0x3F));
        result += (char)(0x80 | (symb & 0x3F));
    } else if (symb <= 0x7FFFFFFF) {
        result += (char)(0xFC | (symb >> 30));
        result += (char)(0x80 | ((symb >> 24) & 0x3F));
        result += (char)(0x80 | ((symb >> 18) & 0x3F));
        result += (char)(0x80 | ((symb >> 12) & 0x3F));
        result += (char)(0x80 | ((symb >> 6) & 0x3F));
        result += (char)(0x80 | (symb & 0x3F));
    } else {
        // Invalid range
        result = "?";
    }
 
    return result;
}


Attachment

U 10000.png (File type: image/png, Size: 28.58 kilobytes) — 295 downloads
U 800.png (File type: image/png, Size: 27.38 kilobytes) — 303 downloads
Screenshot from 2025-04-17 09-55-17.png (File type: image/png, Size: 32.95 kilobytes) — 304 downloads
Screenshot from 2025-04-17 10-05-36.png (File type: image/png, Size: 18.08 kilobytes) — 303 downloads
Written on: 17. 04. 2025 [12:37]
roman
Roman Savochenko
Moderator
Contributor
Developer
registered since: 12.12.2007
Posts: 3788
"arccis" wrote:

Both the old and the new functions fail the U+800 and U+10000 tests. The length is now determined correctly, but the first byte differs. ChatGPT refers to the RFC 3629 limits.

And that corresponds to https://en.wikipedia.org/wiki/UTF-8, which refers to RFC 3629, so go and fix the article! :)

Because in two and three bytes that would be:
JAVASCRIPT
U+800:                100000    000000    — Last code for two bytes is U+07FF
                  11? 100000 10 000000
U+10000:     10000    000000    000000    — Last code for three bytes is U+FFFF
        111? 10000 10 000000 10 000000
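The boundaries in the diagram follow from how many payload bits each sequence length offers. Here is a small sketch with hypothetical names payloadBits and maxCodePoint, covering the historical lengths 1 to 6. For a sequence of n bytes with n >= 2, the lead byte contributes 7 - n bits and each continuation byte contributes 6, which has the same form as the shift expression (iCh-1)*6+(7-iCh) in the patched loop.

```cpp
#include <cstdint>

// Hypothetical helper: payload bits of an n-byte UTF-8 sequence (n in 1...6).
int payloadBits(int n) {
    return n == 1 ? 7 : (7 - n) + 6 * (n - 1);
}

// Hypothetical helper: largest value that fits into an n-byte sequence.
uint32_t maxCodePoint(int n) {
    return (uint32_t{1} << payloadBits(n)) - 1;
}
```

This gives 0x7FF for two bytes and 0xFFFF for three, so U+800 and U+10000 are the first values that need one more byte.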


"arccis" wrote:

... and offers code without a loop.

And does it also offer the biomass to be stupid!? :)

Written on: 17. 04. 2025 [13:08]
arccis
Arkadii Kysil
Topic creator
registered since: 18.11.2023
Posts: 8
It was an elegant function until Arkadii decided to draw smileys in SCADA. :)


