Автор | Повідомлення |
---|---|
Повідомлення створено: 15. 04. 2025 [23:01]
|
|
arccis
Arkadii Kysil
Автор теми
Зареєстрован(а) с: 18.11.2023
Повідомлення: 8
|
I tested my library and decided to add decoding and encoding of surrogate pairs like \uD83D\uDE00. It turned out that the SYS.strFromCharUTF function does not work correctly on 3-byte and 4-byte characters. I found the TMess::setUTF8 function in the code, ported it to JavaLikeCalc. I checked it on the entire range (0x0... 0x10FFFF) that it works the same as c++. Then I implemented another function and compared the results. JAVASCRIPT function setUTF8(symb){ //this function work like strFromCharUTF in 0...10FFF rez = ""; if(symb < 0x80) rez = SYS.strFromCharCode(symb); else for( iCh = 5, iSt = -1; iCh >= 0; iCh--) { if(iSt < iCh && (symb>>(iCh*6))) iSt = iCh; if(iCh == iSt) rez += SYS.strFromCharCode((0xFF<<(7-iCh))|(symb>>(iCh*6))); else if(iCh < iSt) rez += SYS.strFromCharCode(0x80|(0x3F&(symb>>(iCh*6)))); } return rez; } function setUTF8_new(symb){ if (symb <= 0x7F) { return SYS.strFromCharCode(symb); } else if (symb <= 0x7FF) { return SYS.strFromCharCode( 0xC0 | (symb >> 6), 0x80 | (symb & 0x3F) ); } else if (symb <= 0xFFFF) { return SYS.strFromCharCode( 0xE0 | (symb >> 12), 0x80 | ((symb >> 6) & 0x3F), 0x80 | (symb & 0x3F) ); } else if (symb <= 0x10FFFF) { return SYS.strFromCharCode( 0xF0 | (symb >> 18), // first 3 bits 0x80 | ((symb >> 12) & 0x3F), // next 6 bits 0x80 | ((symb >> 6) & 0x3F), // next 6 bits 0x80 | (symb & 0x3F) // last 6 bits ); } } // test for surrogate pairs /* code1 = 0xD83D; code2 = 0xDE00; codePoint = 0x10000 + ((code1 - 0xD800) << 10) + (code2 - 0xDC00); output1 = setUTF8(codePoint); //output2 = SYS.strFromCharUTF(codePoint); output2 = setUTF8_new(codePoint); */ // test in loop for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) { if (codePoint >= 0xD800 && codePoint <= 0xDFFF) continue; // surrogate pairs //output1 = setUTF8(codePoint); output1 = SYS.strFromCharUTF(codePoint); output2 = setUTF8_new(codePoint); if (output1 != output2) break; } // test passed charCodeAt(0,"UTF-8"); /*for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) { if (codePoint >= 0xD800 && codePoint <= 0xDFFF) continue; // surrogate pairs output1 = setUTF8_new(codePoint); codePoint1 = output1.charCodeAt(0,"UTF-8"); if (codePoint != codePoint1) break; } */ ArInt1s = "dec:["; for(i = 0; i < output1.length; i++){ ArInt1s += output1.charCodeAt(i).toString(10) + ","; } ArInt1s += "]; hex:["; for(i = 0; i < output1.length; i++){ ArInt1s += "0x" + output1.charCodeAt(i).toString(16) + ","; } ArInt1s += "]"; ArInt2s = "dec:["; for(i = 0; i < output2.length; i++){ ArInt2s += output2.charCodeAt(i).toString(10) + ","; } ArInt2s += "]; hex:["; for(i = 0; i < output2.length; i++){ ArInt2s += "0x" + output2.charCodeAt(i).toString(16) + ","; } ArInt2s += "]"; if (output1 == output2){ ArInt2s = "equal"; } PS As I understand it, utf8 can be no more than 4 bytes. 6 bytes is an outdated option (before 2003) and is not recommended. |
Повідомлення створено: 16. 04. 2025 [11:27]
|
|
roman
Roman Savochenko
![]() Moderator Contributor Developer
![]() Зареєстрован(а) с: 12.12.2007
Повідомлення: 3788
|
"arccis" wrote: I tested my library and decided to add decoding and encoding of surrogate pairs like \uD83D\uDE00. It turned out that the SYS.strFromCharUTF function does not work correctly on 3-byte and 4-byte characters. I found the TMess::setUTF8 function in the code, ported it to JavaLikeCalc. I checked it on the entire range (0x0... 0x10FFFF) that it works the same as c++. Then I implemented another function and compared the results. And what a problem with encoding in three bytes say for 0xD83D, where TMess::setUTF8() returns — 0xEDA0BD. And that is corresponded to https://en.wikipedia.org/wiki/UTF-8 and is written as: JAVASCRIPT 0xD83D: 1101 100000 111101 0xEDA0BD: 1110 1101 10 100000 10 111101 "arccis" wrote: PS As I understand it, utf8 can be no more than 4 bytes. 6 bytes is an outdated option (before 2003) and is not recommended. When you don't use such U-Codes, you will not get 6 bytes. Learn, learn and learn better than work, work and work.
|
Повідомлення створено: 17. 04. 2025 [01:24]
|
|
arccis
Arkadii Kysil
Автор теми
Зареєстрован(а) с: 18.11.2023
Повідомлення: 8
|
1) on the test 0xD83D indeed both functions (TMess::setUTF8() and setUTF8_new(symb) ) work the same. 2) but 0xD83D actually can't be converted directly. it is part of a surrogate pair (U+D800 through U+DFFF picture1) and you need to use the following code and apply the formula to them JAVASCRIPT code1 = 0xD83D; code2 = 0xDE00; codePoint = 0x10000 + ((code1 - 0xD800) << 10) + (code2 - 0xDC00); after that code codePoint == U+1F600. it is smile. pic2. setUTF8_new - correct, SYS.strFromCharUTF - wrong. 3) I compared the functions and they differ U+800...U+FFF, U+10000...> . Another example is U+1FA00 pic3. 4) function charCodeAt(0,"UTF-8"); is correct in 0...10FFF JAVASCRIPT output1 = setUTF8_new(codePoint); codePoint1 = output1.charCodeAt(0,"UTF-8"); I used this code for tests JAVASCRIPT function setUTF8_new(symb){ if (symb <= 0x7F) { return SYS.strFromCharCode(symb); } else if (symb <= 0x7FF) { return SYS.strFromCharCode( 0xC0 | (symb >> 6), 0x80 | (symb & 0x3F) ); } else if (symb <= 0xFFFF) { return SYS.strFromCharCode( 0xE0 | (symb >> 12), 0x80 | ((symb >> 6) & 0x3F), 0x80 | (symb & 0x3F) ); } else if (symb <= 0x10FFFF) { return SYS.strFromCharCode( 0xF0 | (symb >> 18), // first 3 bits 0x80 | ((symb >> 12) & 0x3F), // next 6 bits 0x80 | ((symb >> 6) & 0x3F), // next 6 bits 0x80 | (symb & 0x3F) // last 6 bits ); } } // test for surrogate pairs code1 = 0xD83D; code2 = 0xDE00; codePoint = 0x10000 + ((code1 - 0xD800) << 10) + (code2 - 0xDC00); codePoint = 0x1FA00; codePoint = 0x800; //output1 = setUTF8(codePoint); output1 = SYS.strFromCharUTF(codePoint); output2 = setUTF8_new(codePoint); // test in loop /* for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) { if (codePoint >= 0xD800 && codePoint <= 0xDFFF) continue; // surrogate pairs //output1 = setUTF8(codePoint); output1 = SYS.strFromCharUTF(codePoint); output2 = setUTF8_new(codePoint); if (output1 != output2) //break; console += "U+" + codePoint.toString(16) + "\n"; } */ // test passed charCodeAt(0,"UTF-8"); /*for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) { if (codePoint >= 0xD800 && codePoint <= 0xDFFF) continue; // surrogate pairs output1 = setUTF8_new(codePoint); codePoint1 = output1.charCodeAt(0,"UTF-8"); if (codePoint != codePoint1) break; } */ codePointHex = "U+" + codePoint.toString(16) + " bin:[" + codePoint.toString(2) + "]"; ArInt1s = "len=" + output1.length + "; "; ArInt1s += " hex:["; for(i = 0; i < output1.length; i++){ ArInt1s += "0x" + output1.charCodeAt(i).toString(16) + ","; } ArInt1s += "]; dec:["; for(i = 0; i < output1.length; i++){ ArInt1s += output1.charCodeAt(i).toString(10) + ","; } ArInt1s += "]; bin:["; for(i = 0; i < output1.length; i++){ ArInt1s += output1.charCodeAt(i).toString(2,8) + ","; } ArInt1s += "]"; ArInt2s = "len=" + output2.length + "; "; ArInt2s += "hex:["; for(i = 0; i < output2.length; i++){ ArInt2s += "0x" + output2.charCodeAt(i).toString(16) + ","; } ArInt2s += "]; dec:["; for(i = 0; i < output2.length; i++){ ArInt2s += output2.charCodeAt(i).toString(10) + ","; } ArInt2s += "]; bin:["; for(i = 0; i < output2.length; i++){ ArInt2s += output2.charCodeAt(i).toString(2,8) + ","; } ArInt2s += "]"; if (output1 == output2){ ArInt2s = "equal"; } |
Повідомлення створено: 17. 04. 2025 [09:09]
|
|
roman
Roman Savochenko
![]() Moderator Contributor Developer
![]() Зареєстрован(а) с: 12.12.2007
Повідомлення: 3788
|
"arccis" wrote: 2) but 0xD83D actually can't be converted directly. it is part of a surrogate pair (U+D800 through U+DFFF picture1) and you need to use the following code and apply the formula to them Yes, I see, that is a real problem and I have not struck in that since have not used such codes. But that is easily fixed for the same code: JAVASCRIPT string TMess::setUTF8( uint32_t symb ) { string rez; if(symb < 0x80) rez += (char)symb; else for(int iCh = 5, iSt = -1; iCh >= 0; iCh--) { if(iSt < iCh && (symb>>((iCh-1)*6+(7-iCh)))) iSt = iCh; if(iCh == iSt) rez += (char)((0xFF<<(7-iCh))|(symb>>(iCh*6))); else if(iCh < iSt) rez += (char)(0x80|(0x3F&(symb>>(iCh*6)))); } return rez; } So, now I have: U+1F600 = F09F9880 U+1FA00 = F09FA880 U+D83D = EDA0BD Learn, learn and learn better than work, work and work.
|
Повідомлення створено: 17. 04. 2025 [11:20]
|
|
arccis
Arkadii Kysil
Автор теми
Зареєстрован(а) с: 18.11.2023
Повідомлення: 8
|
old and new functions will fail U+800 and U+10000 tests. length is now determined correctly, but the first byte is different. gpt chat refers to RFC 3629 limits. and offers code without a loop. JAVASCRIPT string setUTF8(int32_t symb) { string rez; if (symb < 0x80) rez += (char)symb; else if (symb < 0x800) { rez += (char)(0xC0 | (symb >> 6)); rez += (char)(0x80 | (symb & 0x3F)); } else if (symb < 0x10000) { rez += (char)(0xE0 | (symb >> 12)); rez += (char)(0x80 | ((symb >> 6) & 0x3F)); rez += (char)(0x80 | (symb & 0x3F)); } else if (symb < 0x110000) { rez += (char)(0xF0 | (symb >> 18)); rez += (char)(0x80 | ((symb >> 12) & 0x3F)); rez += (char)(0x80 | ((symb >> 6) & 0x3F)); rez += (char)(0x80 | (symb & 0x3F)); } return rez; } also the chat said that 6 bytes is not allowed by RFC 3629. and is considered unsafe, but code here: JAVASCRIPT std::string encodeUTF8_6Byte(uint32_t symb) { std::string result; if (symb < 0x80) { result += (char)symb; } else if (symb < 0x800) { result += (char)(0xC0 | (symb >> 6)); result += (char)(0x80 | (symb & 0x3F)); } else if (symb < 0x10000) { result += (char)(0xE0 | (symb >> 12)); result += (char)(0x80 | ((symb >> 6) & 0x3F)); result += (char)(0x80 | (symb & 0x3F)); } else if (symb < 0x200000) { result += (char)(0xF0 | (symb >> 18)); result += (char)(0x80 | ((symb >> 12) & 0x3F)); result += (char)(0x80 | ((symb >> 6) & 0x3F)); result += (char)(0x80 | (symb & 0x3F)); } else if (symb < 0x4000000) { result += (char)(0xF8 | (symb >> 24)); result += (char)(0x80 | ((symb >> 18) & 0x3F)); result += (char)(0x80 | ((symb >> 12) & 0x3F)); result += (char)(0x80 | ((symb >> 6) & 0x3F)); result += (char)(0x80 | (symb & 0x3F)); } else if (symb <= 0x7FFFFFFF) { result += (char)(0xFC | (symb >> 30)); result += (char)(0x80 | ((symb >> 24) & 0x3F)); result += (char)(0x80 | ((symb >> 18) & 0x3F)); result += (char)(0x80 | ((symb >> 12) & 0x3F)); result += (char)(0x80 | ((symb >> 6) & 0x3F)); result += (char)(0x80 | (symb & 0x3F)); } else { // Invalid range result = "?"; } return result; } |
Повідомлення створено: 17. 04. 2025 [12:37]
|
|
roman
Roman Savochenko
![]() Moderator Contributor Developer
![]() Зареєстрован(а) с: 12.12.2007
Повідомлення: 3788
|
"arccis" wrote: old and new functions will fail U+800 and U+10000 tests. length is now determined correctly, but the first byte is different. gpt chat refers to RFC 3629 limits. And that is corresponding with https://en.wikipedia.org/wiki/UTF-8 , which is referring to RFC 3629, so you fix the article! :) Due to in two and three bytes that well be: JAVASCRIPT U+800: 100000 000000 — Last code for two bytes is U+07FF 11? 100000 10 000000 U+10000: 10000 000000 000000 — Last code for three bytes is U+FFFF 111? 10000 10 000000 10 000000 "arccis" wrote: ... and offers code without a loop. And it offers to the biomass be stupid!? :) Learn, learn and learn better than work, work and work.
|
Повідомлення створено: 17. 04. 2025 [13:08]
|
|
arccis
Arkadii Kysil
Автор теми
Зареєстрован(а) с: 18.11.2023
Повідомлення: 8
|
it was an elegant function until Arkadii decided to draw smileys in scada. :) |