| 
		
			
	
	
		
    
	
	
	[BugFixed] 
	utf8
	
		| Author | Message |  
		| Written on: 15. 04. 2025 [23:01] |  
		| arccis Arkadii Kysil Topic creator registered since: 18.11.2023 Posts: 8 | While testing my library I decided to add decoding and encoding of surrogate pairs such as \uD83D\uDE00. It turned out that the SYS.strFromCharUTF function does not work correctly on 3-byte and 4-byte characters. I found the TMess::setUTF8 function in the source code and ported it to JavaLikeCalc, then checked over the entire range (0x0...0x10FFFF) that the port behaves the same as the C++ original. After that I implemented another function and compared the results. 
 
 function setUTF8(symb){ // behaves like SYS.strFromCharUTF over 0...0x10FFFF
	rez = "";
	if(symb < 0x80)
		rez = SYS.strFromCharCode(symb);
	else for(iCh = 5, iSt = -1; iCh >= 0; iCh--) {
		if(iSt < iCh && (symb>>(iCh*6))) iSt = iCh;
		if(iCh == iSt) rez += SYS.strFromCharCode((0xFF<<(7-iCh))|(symb>>(iCh*6)));
		else if(iCh < iSt) rez += SYS.strFromCharCode(0x80|(0x3F&(symb>>(iCh*6))));
	}
 
	return rez;
}
 
 
function setUTF8_new(symb){ 
    if (symb <= 0x7F) {
        return SYS.strFromCharCode(symb);
    } else if (symb <= 0x7FF) {
        return SYS.strFromCharCode(
            0xC0 | (symb >> 6),
            0x80 | (symb & 0x3F)
        );
    } else if (symb <= 0xFFFF) {
        return SYS.strFromCharCode(
            0xE0 | (symb >> 12),
            0x80 | ((symb >> 6) & 0x3F),
            0x80 | (symb & 0x3F)
        );
    } else if (symb <= 0x10FFFF) {
        return SYS.strFromCharCode(
            0xF0 | (symb >> 18),              	// first 3 bits
            0x80 | ((symb >> 12) & 0x3F),			// next 6 bits
            0x80 | ((symb >> 6) & 0x3F),			// next 6 bits
            0x80 | (symb & 0x3F)							// last 6 bits
        );
    } 
}
 
// test for surrogate pairs
/*
code1 = 0xD83D;
code2 = 0xDE00;
codePoint = 0x10000 + ((code1 - 0xD800) << 10) + (code2 - 0xDC00);
output1 = setUTF8(codePoint);
//output2 = SYS.strFromCharUTF(codePoint);
output2 = setUTF8_new(codePoint);
*/
 
 
// test in loop
for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) {
 
	if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
		continue; // skip the surrogate range
 
	//output1 = setUTF8(codePoint);
	output1 = SYS.strFromCharUTF(codePoint);
	output2 = setUTF8_new(codePoint);
 
	if (output1 != output2) break;
 
}
 
 
// round-trip test through charCodeAt(0,"UTF-8") (passed)
/*for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) {
 
	if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
		continue; // skip the surrogate range
 
	output1 = setUTF8_new(codePoint);
	codePoint1 = output1.charCodeAt(0,"UTF-8"); 
 
	if (codePoint != codePoint1) break;	
}
*/
ArInt1s = "dec:[";
for(i = 0; i < output1.length; i++){
	ArInt1s += output1.charCodeAt(i).toString(10) + ",";
}
ArInt1s += "]; hex:[";
for(i = 0; i < output1.length; i++){
	ArInt1s += "0x" + output1.charCodeAt(i).toString(16) + ",";
}
ArInt1s += "]";


ArInt2s = "dec:[";
for(i = 0; i < output2.length; i++){
	ArInt2s += output2.charCodeAt(i).toString(10) + ",";
}
ArInt2s += "]; hex:[";
for(i = 0; i < output2.length; i++){
	ArInt2s += "0x" + output2.charCodeAt(i).toString(16) + ",";
}
ArInt2s += "]";


if (output1 == output2){
	ArInt2s = "equal";
}
 P.S. As I understand it, a UTF-8 sequence can be at most 4 bytes long. The 6-byte form is obsolete (it dates from before 2003) and is not recommended.
 |  
		
	 
		| Written on: 16. 04. 2025 [11:27] |  
		| roman Roman Savochenko Moderator Contributor Developer   registered since: 12.12.2007 Posts: 3788 | "arccis" wrote:
 I tested my library and decided to add decoding and encoding of surrogate pairs like \uD83D\uDE00. It turned out that the SYS.strFromCharUTF function does not work correctly on 3-byte and 4-byte characters. I found the TMess::setUTF8 function in the code, ported it to JavaLikeCalc. I checked it on the entire range (0x0... 0x10FFFF) that it works the same as c++. Then I implemented another function and compared the results.
 
 And what is the problem with three-byte encoding, say for 0xD83D, where TMess::setUTF8() returns 0xEDA0BD? That corresponds to https://en.wikipedia.org/wiki/UTF-8 and is laid out as:
 
0xD83D:        1101    100000    111101
0xEDA0BD: 1110 1101 10 100000 10 111101
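
The bit layout above can be reproduced with a few shifts and masks. This is only a minimal sketch: `ThreeBytes` and `packThreeBytes` are hypothetical names introduced for illustration and are not part of OpenSCADA, and the function assumes its argument lies in 0x0800...0xFFFF. It packs 0xD83D like any other 16-bit number; note that RFC 3629 prohibits encoding the surrogate values U+D800...U+DFFF on their own, so a strict encoder would reject this input.

```cpp
#include <cstdint>

// The three bytes of a 1110xxxx 10xxxxxx 10xxxxxx sequence, in order.
// (Hypothetical helper type, not taken from the thread.)
struct ThreeBytes {
    std::uint8_t lead, mid, tail;
};

// Pack a value into the three-byte layout.
// Assumption: the caller passes 0x0800..0xFFFF; no validation is done here,
// so a lone surrogate such as 0xD83D is packed like any other value.
constexpr ThreeBytes packThreeBytes(std::uint32_t symb) {
    return ThreeBytes{
        static_cast<std::uint8_t>(0xE0 | (symb >> 12)),          // top 4 bits
        static_cast<std::uint8_t>(0x80 | ((symb >> 6) & 0x3F)),  // middle 6 bits
        static_cast<std::uint8_t>(0x80 | (symb & 0x3F))          // low 6 bits
    };
}
```

For 0xD83D this gives the bytes ED, A0 and BD, matching the table.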
 
 
 "arccis" wrote:
 PS As I understand it, utf8 can be no more than 4 bytes. 6 bytes is an outdated option (before 2003) and is not recommended.
 
 As long as you don't use such code points, you will never get 6 bytes.
 
 
       Learn, learn and learn better than work, work and work.       |  
		
	 
		| Written on: 17. 04. 2025 [01:24] |  
		| arccis Arkadii Kysil Topic creator registered since: 18.11.2023 Posts: 8 | 1) For the 0xD83D test, both functions (TMess::setUTF8() and setUTF8_new(symb)) indeed give the same result.
 2) But 0xD83D cannot actually be converted on its own: it is one half of a surrogate pair (U+D800 through U+DFFF, see pic 1), so the two halves have to be combined into one code point with the following formula:
 
 code1 = 0xD83D;
code2 = 0xDE00;
codePoint = 0x10000 + ((code1 - 0xD800) << 10) + (code2 - 0xDC00);
 After this code, codePoint == U+1F600, which is the smiley (pic2). Here setUTF8_new is correct and SYS.strFromCharUTF is wrong.
 3) I compared the functions and they differ for U+800...U+FFF and from U+10000 upwards. Another example is U+1FA00 (pic3).
 4) The function charCodeAt(0,"UTF-8") is correct over 0...0x10FFFF:
 
 output1 = setUTF8_new(codePoint);
codePoint1 = output1.charCodeAt(0,"UTF-8");
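
The surrogate-pair formula from point 2 can be sketched in C++ as follows, together with its inverse. The function names are hypothetical and chosen for illustration only; returning U+FFFD for a malformed pair is an assumption of this sketch, not something the thread prescribes.

```cpp
#include <cstdint>

// UTF-16 surrogate halves: high (lead) D800..DBFF, low (trail) DC00..DFFF.
constexpr bool isHighSurrogate(std::uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool isLowSurrogate(std::uint32_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a surrogate pair into one code point using the formula from the post.
// Assumption of this sketch: a malformed pair yields U+FFFD (replacement character).
constexpr std::uint32_t combineSurrogates(std::uint32_t hi, std::uint32_t lo) {
    return (isHighSurrogate(hi) && isLowSurrogate(lo))
               ? 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
               : 0xFFFD;
}

// Inverse direction, valid for code points in 0x10000..0x10FFFF.
constexpr std::uint32_t highSurrogate(std::uint32_t cp) {
    return 0xD800 + ((cp - 0x10000) >> 10);
}
constexpr std::uint32_t lowSurrogate(std::uint32_t cp) {
    return 0xDC00 + ((cp - 0x10000) & 0x3FF);
}
```

With code1 = 0xD83D and code2 = 0xDE00 the combined value is 0x1F600, and splitting 0x1F600 gives back the same two halves.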
 I used this code for the tests:
 
 function setUTF8_new(symb){ 
    if (symb <= 0x7F) {
        return SYS.strFromCharCode(symb);
    } else if (symb <= 0x7FF) {
        return SYS.strFromCharCode(
            0xC0 | (symb >> 6),
            0x80 | (symb & 0x3F)
        );
    } else if (symb <= 0xFFFF) {
        return SYS.strFromCharCode(
            0xE0 | (symb >> 12),
            0x80 | ((symb >> 6) & 0x3F),
            0x80 | (symb & 0x3F)
        );
    } else if (symb <= 0x10FFFF) {
        return SYS.strFromCharCode(
            0xF0 | (symb >> 18),              	// first 3 bits
            0x80 | ((symb >> 12) & 0x3F),			// next 6 bits
            0x80 | ((symb >> 6) & 0x3F),			// next 6 bits
            0x80 | (symb & 0x3F)							// last 6 bits
        );
    } 
}
 
// test for surrogate pairs
 
code1 = 0xD83D;
code2 = 0xDE00;
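codePoint = 0x10000 + ((code1 - 0xD800) << 10) + (code2 - 0xDC00);	// U+1F600
codePoint = 0x1FA00;	// alternative test value
codePoint = 0x800;	// alternative test value; the last assignment is the one used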
 
 
//output1 = setUTF8(codePoint);
output1 = SYS.strFromCharUTF(codePoint);
output2 = setUTF8_new(codePoint);
 
 
 
 
// test in loop
/*
for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) {
 
	if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
		continue; // skip the surrogate range
 
	//output1 = setUTF8(codePoint);
	output1 = SYS.strFromCharUTF(codePoint);
	output2 = setUTF8_new(codePoint);
 
	if (output1 != output2) //break;
		console +=  "U+" + codePoint.toString(16) + "\n";	
}
*/
// round-trip test through charCodeAt(0,"UTF-8") (passed)
/*for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) {
 
	if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
		continue; // skip the surrogate range
 
	output1 = setUTF8_new(codePoint);
	codePoint1 = output1.charCodeAt(0,"UTF-8"); 
 
	if (codePoint != codePoint1) break;	
}
*/
 
codePointHex = "U+" + codePoint.toString(16) + " bin:[" + codePoint.toString(2) + "]";
 
 
 
ArInt1s = "len=" + output1.length + "; ";
ArInt1s += " hex:[";
for(i = 0; i < output1.length; i++){ 
	ArInt1s += "0x" + output1.charCodeAt(i).toString(16) + ",";
}	
 
ArInt1s += "]; dec:[";
for(i = 0; i < output1.length; i++){ 
	ArInt1s += output1.charCodeAt(i).toString(10) + ",";
}					
ArInt1s += "]; bin:[";
for(i = 0; i < output1.length; i++){ 
	ArInt1s +=   output1.charCodeAt(i).toString(2,8) + ",";
}	
 
ArInt1s += "]";
 
ArInt2s = "len=" + output2.length + "; ";
ArInt2s += "hex:[";
 
for(i = 0; i < output2.length; i++){ 
	ArInt2s += "0x" + output2.charCodeAt(i).toString(16) + ",";
}	
 
ArInt2s += "]; dec:[";
for(i = 0; i < output2.length; i++){ 
	ArInt2s += output2.charCodeAt(i).toString(10) + ",";
}	
ArInt2s += "]; bin:[";
for(i = 0; i < output2.length; i++){ 
	ArInt2s +=   output2.charCodeAt(i).toString(2,8) + ",";
}	
 
ArInt2s += "]";
 
 
if (output1 == output2){
	ArInt2s = "equal";
}
 Attachment 
 
 pic 1.png (File type: image/png, Size: 18.32 kilobytes) — 308 downloads
 
 pic2.png (File type: image/png, Size: 30.73 kilobytes) — 314 downloads
 
 pic3.png (File type: image/png, Size: 29.93 kilobytes) — 296 downloads
 
 U 800.png (File type: image/png, Size: 27.38 kilobytes) — 316 downloads
 
 pic5.png (File type: image/png, Size: 23.62 kilobytes) — 303 downloads
 |  
		
	 
		| Written on: 17. 04. 2025 [09:09] |  
		| roman Roman Savochenko Moderator Contributor Developer   registered since: 12.12.2007 Posts: 3788 | "arccis" wrote:
 2) but 0xD83D actually can't be converted directly. it is part of a surrogate pair  (U+D800 through U+DFFF picture1)  and you need to use the following code and apply the formula to them
 
 Yes, I see, that is a real problem; I had not run into it because I have not used such codes.
 But it is easily fixed within the same code:
 
 string TMess::setUTF8( uint32_t symb )
{
    string rez;
    if(symb < 0x80) rez += (char)symb;
    else for(int iCh = 5, iSt = -1; iCh >= 0; iCh--) {
        if(iSt < iCh && (symb>>((iCh-1)*6+(7-iCh)))) iSt = iCh;
        if(iCh == iSt) rez += (char)((0xFF<<(7-iCh))|(symb>>(iCh*6)));
        else if(iCh < iSt) rez += (char)(0x80|(0x3F&(symb>>(iCh*6))));
    }
 
    return rez;
}
 So, now I have:
 U+1F600 = F09F9880
 U+1FA00 = F09FA880
 U+D83D = EDA0BD
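
The three values above can be cross-checked with a small harness. This is a sketch under stated assumptions: `encodeLoop` is a port of the fixed loop as a free function writing into a fixed buffer instead of `std::string`, `encodeBranch` is a plain four-branch encoder used as the reference, and `Utf8Seq`, `packed` and `firstMismatch` are hypothetical helper names introduced only for this comparison.

```cpp
#include <cstdint>

// Fixed-size result so the encoders can be constexpr; six bytes is the most
// the loop can emit (the long forms from before RFC 3629).
struct Utf8Seq {
    unsigned char bytes[6];
    int len;
};

// Port of the fixed loop from the post above.
constexpr Utf8Seq encodeLoop(std::uint32_t symb) {
    Utf8Seq out{};
    if (symb < 0x80) {
        out.bytes[out.len++] = static_cast<unsigned char>(symb);
        return out;
    }
    int iSt = -1;
    for (int iCh = 5; iCh >= 0; iCh--) {
        // The highest position whose threshold symb exceeds becomes the lead byte.
        if (iSt < iCh && (symb >> ((iCh - 1) * 6 + (7 - iCh)))) iSt = iCh;
        if (iCh == iSt)
            out.bytes[out.len++] =
                static_cast<unsigned char>((0xFFu << (7 - iCh)) | (symb >> (iCh * 6)));
        else if (iCh < iSt)
            out.bytes[out.len++] =
                static_cast<unsigned char>(0x80 | (0x3F & (symb >> (iCh * 6))));
    }
    return out;
}

// Reference encoder with one branch per sequence length (up to 4 bytes).
constexpr Utf8Seq encodeBranch(std::uint32_t symb) {
    Utf8Seq out{};
    if (symb < 0x80) {
        out.bytes[out.len++] = static_cast<unsigned char>(symb);
    } else if (symb < 0x800) {
        out.bytes[out.len++] = static_cast<unsigned char>(0xC0 | (symb >> 6));
        out.bytes[out.len++] = static_cast<unsigned char>(0x80 | (symb & 0x3F));
    } else if (symb < 0x10000) {
        out.bytes[out.len++] = static_cast<unsigned char>(0xE0 | (symb >> 12));
        out.bytes[out.len++] = static_cast<unsigned char>(0x80 | ((symb >> 6) & 0x3F));
        out.bytes[out.len++] = static_cast<unsigned char>(0x80 | (symb & 0x3F));
    } else if (symb < 0x110000) {
        out.bytes[out.len++] = static_cast<unsigned char>(0xF0 | (symb >> 18));
        out.bytes[out.len++] = static_cast<unsigned char>(0x80 | ((symb >> 12) & 0x3F));
        out.bytes[out.len++] = static_cast<unsigned char>(0x80 | ((symb >> 6) & 0x3F));
        out.bytes[out.len++] = static_cast<unsigned char>(0x80 | (symb & 0x3F));
    }
    return out;  // len stays 0 above 0x10FFFF
}

// Pack up to four bytes into one integer, e.g. F0 9F 98 80 becomes 0xF09F9880.
constexpr std::uint32_t packed(const Utf8Seq& s) {
    std::uint32_t v = 0;
    for (int i = 0; i < s.len && i < 4; i++) v = (v << 8) | s.bytes[i];
    return v;
}

// First non-surrogate code point where the two encoders disagree, or -1 if none.
inline long firstMismatch() {
    for (std::uint32_t cp = 0; cp <= 0x10FFFF; cp++) {
        if (cp >= 0xD800 && cp <= 0xDFFF) continue;  // skip the surrogate range
        const Utf8Seq a = encodeLoop(cp), b = encodeBranch(cp);
        if (a.len != b.len || packed(a) != packed(b)) return static_cast<long>(cp);
    }
    return -1;
}
```

In this C++ port, `packed(encodeLoop(0x1F600))` evaluates to 0xF09F9880, and `firstMismatch()` returns -1 when the two encoders agree over the whole range. The port casts each byte to `unsigned char`, so it says nothing about how SYS.strFromCharCode treats values above 0xFF in JavaLikeCalc.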
 
 
 |  
		
	 
		| Written on: 17. 04. 2025 [11:20] |  
		| arccis Arkadii Kysil Topic creator registered since: 18.11.2023 Posts: 8 | Both the old and the new function fail the U+800 and U+10000 tests: the length is now determined correctly, but the first byte is different. ChatGPT refers to the RFC 3629 limits and offers code without a loop. 
 
 
 string setUTF8(int32_t symb) {
    string rez;
    if (symb < 0x80)
        rez += (char)symb;
    else if (symb < 0x800) {
        rez += (char)(0xC0 | (symb >> 6));
        rez += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x10000) {
        rez += (char)(0xE0 | (symb >> 12));
        rez += (char)(0x80 | ((symb >> 6) & 0x3F));
        rez += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x110000) {
        rez += (char)(0xF0 | (symb >> 18));
        rez += (char)(0x80 | ((symb >> 12) & 0x3F));
        rez += (char)(0x80 | ((symb >> 6) & 0x3F));
        rez += (char)(0x80 | (symb & 0x3F));
    }
    return rez;
}
 
 ChatGPT also said that 6-byte sequences are not allowed by RFC 3629 and are considered unsafe, but here is the code for them anyway:
 
 std::string encodeUTF8_6Byte(uint32_t symb) {
    std::string result;
 
    if (symb < 0x80) {
        result += (char)symb;
    } else if (symb < 0x800) {
        result += (char)(0xC0 | (symb >> 6));
        result += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x10000) {
        result += (char)(0xE0 | (symb >> 12));
        result += (char)(0x80 | ((symb >> 6) & 0x3F));
        result += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x200000) {
        result += (char)(0xF0 | (symb >> 18));
        result += (char)(0x80 | ((symb >> 12) & 0x3F));
        result += (char)(0x80 | ((symb >> 6) & 0x3F));
        result += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x4000000) {
        result += (char)(0xF8 | (symb >> 24));
        result += (char)(0x80 | ((symb >> 18) & 0x3F));
        result += (char)(0x80 | ((symb >> 12) & 0x3F));
        result += (char)(0x80 | ((symb >> 6) & 0x3F));
        result += (char)(0x80 | (symb & 0x3F));
    } else if (symb <= 0x7FFFFFFF) {
        result += (char)(0xFC | (symb >> 30));
        result += (char)(0x80 | ((symb >> 24) & 0x3F));
        result += (char)(0x80 | ((symb >> 18) & 0x3F));
        result += (char)(0x80 | ((symb >> 12) & 0x3F));
        result += (char)(0x80 | ((symb >> 6) & 0x3F));
        result += (char)(0x80 | (symb & 0x3F));
    } else {
        // Invalid range
        result = "?";
    }
 
    return result;
}
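
For comparison with the 6-byte encoder above, here is a minimal sketch of the range check that RFC 3629 implies: only U+0000...U+10FFFF is allowed, and the surrogate values U+D800...U+DFFF are excluded. The helper names are hypothetical, and substituting U+FFFD for a rejected value is an assumption of this sketch rather than anything proposed in the thread.

```cpp
#include <cstdint>

// Surrogate values are reserved for UTF-16 and must not be encoded in UTF-8.
constexpr bool isSurrogate(std::uint32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

// True if cp may be encoded under RFC 3629 (at most four bytes).
constexpr bool isRfc3629Scalar(std::uint32_t cp) {
    return cp <= 0x10FFFF && !isSurrogate(cp);
}

// Assumption of this sketch: anything not encodable is replaced by U+FFFD
// before it reaches the encoder, so the 5- and 6-byte branches are never taken.
constexpr std::uint32_t sanitize(std::uint32_t cp) {
    return isRfc3629Scalar(cp) ? cp : 0xFFFD;
}
```

Passing `sanitize(symb)` to any of the encoders in this thread keeps the output within four bytes.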
 
 Attachment 
 
 U 10000.png (File type: image/png, Size: 28.58 kilobytes) — 301 downloads
 
 U 800.png (File type: image/png, Size: 27.38 kilobytes) — 309 downloads
 |  
		
	 
		| Written on: 17. 04. 2025 [12:37] |  
		| roman Roman Savochenko Moderator Contributor Developer   registered since: 12.12.2007 Posts: 3788 | "arccis" wrote:
 old and new functions will fail U+800 and U+10000 tests. length is now determined correctly, but the first byte is different. gpt chat refers to RFC 3629 limits.
 
 And that corresponds to https://en.wikipedia.org/wiki/UTF-8, which refers to RFC 3629, so go and fix the article! :)
 
 Because in two and three bytes that would be:
 
 U+800:                100000    000000    — Last code for two bytes is U+07FF
                  11? 100000 10 000000
U+10000:     10000    000000    000000    — Last code for three bytes is U+FFFF
        111? 10000 10 000000 10 000000
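
The boundaries in the table above follow from the number of payload bits each sequence length carries: the lead byte of an n-byte sequence holds 7-n bits and every continuation byte holds 6. A minimal sketch, with hypothetical function names, that derives the limits instead of hard-coding them:

```cpp
#include <cstdint>

// Largest value that fits in an n-byte sequence (n = 1..6, counting the
// legacy 5- and 6-byte forms). One byte carries 7 bits; otherwise the lead
// byte carries 7-n bits and each of the n-1 continuation bytes carries 6.
constexpr std::uint32_t maxCodeForLength(int n) {
    return n == 1 ? 0x7Fu
                  : (std::uint32_t{1} << ((7 - n) + 6 * (n - 1))) - 1;
}

// Shortest sequence length that can hold symb, or 0 if it needs more than 31 bits.
constexpr int utf8Length(std::uint32_t symb) {
    for (int n = 1; n <= 6; n++)
        if (symb <= maxCodeForLength(n)) return n;
    return 0;
}
```

The shift in the fixed loop, `(iCh-1)*6+(7-iCh)`, is the same expression with n = iCh, so a lead byte is placed at position iCh exactly when the value no longer fits in iCh bytes. U+800 and U+10000 are the first values that need three and four bytes respectively.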
 
 "arccis" wrote:
 ... and offers code without a loop.
 
 And does it also offer the biomass to be stupid!? :)
 
 
 |  
		
	 
		| Written on: 17. 04. 2025 [13:08] |  
		| arccis Arkadii Kysil Topic creator registered since: 18.11.2023 Posts: 8 | It was an elegant function until Arkadii decided to draw smileys in SCADA. :) |  
		
	 |  |