[BugFixed]
utf8
Author |
Message |
Message created: 15. 04. 2025 [23:01]
|
arccis
Arkadii Kysil
Topic author
Registered since: 18.11.2023
Messages: 8
|
While testing my library I decided to add decoding and encoding of surrogate pairs such as \uD83D\uDE00. It turned out that the SYS.strFromCharUTF function does not work correctly for 3-byte and 4-byte characters. I found the TMess::setUTF8 function in the source code and ported it to JavaLikeCalc, then checked over the entire range (0x0...0x10FFFF) that the port behaves the same as the C++ original. After that I implemented another function and compared the results.
function setUTF8(symb){ // this function works like strFromCharUTF over 0...0x10FFFF
    rez = "";
    if(symb < 0x80)
        rez = SYS.strFromCharCode(symb);
    else for(iCh = 5, iSt = -1; iCh >= 0; iCh--) {
        // iSt: index of the lead byte, i.e. the number of continuation bytes
        if(iSt < iCh && (symb>>(iCh*6))) iSt = iCh;
        if(iCh == iSt) rez += SYS.strFromCharCode((0xFF<<(7-iCh))|(symb>>(iCh*6)));
        else if(iCh < iSt) rez += SYS.strFromCharCode(0x80|(0x3F&(symb>>(iCh*6))));
    }
    return rez;
}
function setUTF8_new(symb){
    if (symb <= 0x7F) {
        return SYS.strFromCharCode(symb);
    } else if (symb <= 0x7FF) {
        return SYS.strFromCharCode(
            0xC0 | (symb >> 6),
            0x80 | (symb & 0x3F)
        );
    } else if (symb <= 0xFFFF) {
        return SYS.strFromCharCode(
            0xE0 | (symb >> 12),
            0x80 | ((symb >> 6) & 0x3F),
            0x80 | (symb & 0x3F)
        );
    } else if (symb <= 0x10FFFF) {
        return SYS.strFromCharCode(
            0xF0 | (symb >> 18),          // first 3 bits
            0x80 | ((symb >> 12) & 0x3F), // next 6 bits
            0x80 | ((symb >> 6) & 0x3F),  // next 6 bits
            0x80 | (symb & 0x3F)          // last 6 bits
        );
    }
}
// test for surrogate pairs
/*
code1 = 0xD83D;
code2 = 0xDE00;
codePoint = 0x10000 + ((code1 - 0xD800) << 10) + (code2 - 0xDC00);
output1 = setUTF8(codePoint);
//output2 = SYS.strFromCharUTF(codePoint);
output2 = setUTF8_new(codePoint);
*/
// test in loop
for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) {
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
        continue; // surrogate pairs
    //output1 = setUTF8(codePoint);
    output1 = SYS.strFromCharUTF(codePoint);
    output2 = setUTF8_new(codePoint);
    if (output1 != output2) break;
}
// test passed charCodeAt(0,"UTF-8");
/*for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) {
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
        continue; // surrogate pairs
    output1 = setUTF8_new(codePoint);
    codePoint1 = output1.charCodeAt(0,"UTF-8");
    if (codePoint != codePoint1) break;
}
*/
ArInt1s = "dec:[";
for(i = 0; i < output1.length; i++){
    ArInt1s += output1.charCodeAt(i).toString(10) + ",";
}
ArInt1s += "]; hex:[";
for(i = 0; i < output1.length; i++){
    ArInt1s += "0x" + output1.charCodeAt(i).toString(16) + ",";
}
ArInt1s += "]";
ArInt2s = "dec:[";
for(i = 0; i < output2.length; i++){
    ArInt2s += output2.charCodeAt(i).toString(10) + ",";
}
ArInt2s += "]; hex:[";
for(i = 0; i < output2.length; i++){
    ArInt2s += "0x" + output2.charCodeAt(i).toString(16) + ",";
}
ArInt2s += "]";
if (output1 == output2){
    ArInt2s = "equal";
}
P.S. As I understand it, a UTF-8 sequence can be no longer than 4 bytes. The 6-byte form is an outdated option (pre-2003) and is not recommended.
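For readers who want to reproduce the failure outside OpenSCADA, here is a minimal C++ sketch of the loop exactly as ported above. It assumes that the C++ TMess::setUTF8 of that time used the same length test, `symb>>(iCh*6)`, as the port; the names `setUTF8Old` and `toHex` are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Loop transcribed from the JavaLikeCalc port in this post (assumed to mirror
// the C++ original). iSt ends up as the number of continuation bytes.
std::string setUTF8Old(std::uint32_t symb) {
    std::string rez;
    if (symb < 0x80) {
        rez += static_cast<char>(symb);
        return rez;
    }
    for (int iCh = 5, iSt = -1; iCh >= 0; iCh--) {
        // Length test: treats the lead byte as if it could carry 6 payload bits.
        if (iSt < iCh && (symb >> (iCh * 6))) iSt = iCh;
        if (iCh == iSt)
            rez += static_cast<char>(((0xFFu << (7 - iCh)) | (symb >> (iCh * 6))) & 0xFFu);
        else if (iCh < iSt)
            rez += static_cast<char>(0x80u | (0x3Fu & (symb >> (iCh * 6))));
    }
    return rez;
}

// Hypothetical helper: renders a byte string as upper-case hex.
std::string toHex(const std::string &bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

Under these assumptions, `toHex(setUTF8Old(0x7FF))` gives `DFBF`, which is correct, while `toHex(setUTF8Old(0x800))` gives `E080` where RFC 3629 requires `E0A080`: a 2-byte sequence is chosen although its lead byte has room for only 5 payload bits.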
|
Message created: 16. 04. 2025 [11:27]
|
roman
Roman Savochenko
Moderator Contributor Developer
Registered since: 12.12.2007
Messages: 3769
|
"arccis" wrote:
While testing my library I decided to add decoding and encoding of surrogate pairs such as \uD83D\uDE00. It turned out that the SYS.strFromCharUTF function does not work correctly for 3-byte and 4-byte characters. I found the TMess::setUTF8 function in the source code and ported it to JavaLikeCalc, then checked over the entire range (0x0...0x10FFFF) that the port behaves the same as the C++ original. After that I implemented another function and compared the results.
And what is the problem with three-byte encoding, say for 0xD83D, where TMess::setUTF8() returns 0xEDA0BD? That corresponds to https://en.wikipedia.org/wiki/UTF-8 and is written out as:
0xD83D: 1101 100000 111101
0xEDA0BD: 1110 1101 10 100000 10 111101
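The bit layout above can be reproduced with two small helpers. This is a C++ sketch for illustration only: `encode3` and `decode3` are hypothetical names, not OpenSCADA functions, and they handle only the 3-byte pattern 1110xxxx 10xxxxxx 10xxxxxx without any validation.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Splits a 16-bit value into 4 + 6 + 6 payload bits and adds the UTF-8 markers.
std::array<std::uint8_t, 3> encode3(std::uint32_t symb) {
    return {{
        static_cast<std::uint8_t>(0xE0 | (symb >> 12)),          // 1110xxxx
        static_cast<std::uint8_t>(0x80 | ((symb >> 6) & 0x3F)),  // 10xxxxxx
        static_cast<std::uint8_t>(0x80 | (symb & 0x3F))          // 10xxxxxx
    }};
}

// Strips the markers again and reassembles the 16-bit value.
std::uint32_t decode3(const std::array<std::uint8_t, 3> &b) {
    return (static_cast<std::uint32_t>(b[0] & 0x0F) << 12) |
           (static_cast<std::uint32_t>(b[1] & 0x3F) << 6) |
            static_cast<std::uint32_t>(b[2] & 0x3F);
}
```

`encode3(0xD83D)` yields the bytes ED A0 BD, and `decode3` maps them back to 0xD83D. Note that this only confirms the bit arithmetic: RFC 3629 prohibits encoding the surrogate range U+D800...U+DFFF in UTF-8, so a strict decoder would reject this sequence.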
"arccis" wrote:
P.S. As I understand it, a UTF-8 sequence can be no longer than 4 bytes. The 6-byte form is an outdated option (pre-2003) and is not recommended.
As long as you don't use such code points, you will not get 6 bytes.
Learn, learn and learn better than work, work and work.
|
Message created: 17. 04. 2025 [01:24]
|
arccis
|
1) For the 0xD83D test, both functions (TMess::setUTF8() and setUTF8_new(symb)) indeed produce the same result.
2) However, 0xD83D cannot actually be converted directly: it is one half of a surrogate pair (the range U+D800 through U+DFFF, see pic 1). The two halves must first be combined with the following formula:
code1 = 0xD83D;
code2 = 0xDE00;
codePoint = 0x10000 + ((code1 - 0xD800) << 10) + (code2 - 0xDC00);
After this code, codePoint == U+1F600, which is a smiley (pic2). setUTF8_new encodes it correctly; SYS.strFromCharUTF does not.
3) I compared the functions and they differ in the ranges U+800...U+FFF and from U+10000 upward. Another example is U+1FA00 (pic3).
4) The function charCodeAt(0,"UTF-8") is correct over 0...0x10FFFF:
output1 = setUTF8_new(codePoint);
codePoint1 = output1.charCodeAt(0,"UTF-8");
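The surrogate formula from point 2 and its inverse can be sketched in C++ as follows. The function names are hypothetical, and the sketch assumes the caller has already checked that the inputs are a valid high/low pair (or, for the inverse, a code point in 0x10000...0x10FFFF).

```cpp
#include <cassert>
#include <cstdint>

// High (lead) surrogates occupy U+D800..U+DBFF, low (trail) U+DC00..U+DFFF.
bool isHighSurrogate(std::uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool isLowSurrogate(std::uint32_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Same formula as in the post: 10 bits from each half, offset by 0x10000.
std::uint32_t combineSurrogates(std::uint32_t high, std::uint32_t low) {
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// Inverse direction: upper 10 bits go to the high half, lower 10 to the low half.
std::uint32_t highSurrogate(std::uint32_t cp) { return 0xD800 + ((cp - 0x10000) >> 10); }
std::uint32_t lowSurrogate(std::uint32_t cp)  { return 0xDC00 + ((cp - 0x10000) & 0x3FF); }
```

For the pair from this thread, `combineSurrogates(0xD83D, 0xDE00)` gives 0x1F600, and `highSurrogate(0x1F600)` / `lowSurrogate(0x1F600)` give 0xD83D and 0xDE00 back.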
I used this code for the tests:
function setUTF8_new(symb){
    if (symb <= 0x7F) {
        return SYS.strFromCharCode(symb);
    } else if (symb <= 0x7FF) {
        return SYS.strFromCharCode(
            0xC0 | (symb >> 6),
            0x80 | (symb & 0x3F)
        );
    } else if (symb <= 0xFFFF) {
        return SYS.strFromCharCode(
            0xE0 | (symb >> 12),
            0x80 | ((symb >> 6) & 0x3F),
            0x80 | (symb & 0x3F)
        );
    } else if (symb <= 0x10FFFF) {
        return SYS.strFromCharCode(
            0xF0 | (symb >> 18),          // first 3 bits
            0x80 | ((symb >> 12) & 0x3F), // next 6 bits
            0x80 | ((symb >> 6) & 0x3F),  // next 6 bits
            0x80 | (symb & 0x3F)          // last 6 bits
        );
    }
}
// test for surrogate pairs
code1 = 0xD83D;
code2 = 0xDE00;
codePoint = 0x10000 + ((code1 - 0xD800) << 10) + (code2 - 0xDC00);
codePoint = 0x1FA00;
codePoint = 0x800;
//output1 = setUTF8(codePoint);
output1 = SYS.strFromCharUTF(codePoint);
output2 = setUTF8_new(codePoint);
// test in loop
/*
for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) {
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
        continue; // surrogate pairs
    //output1 = setUTF8(codePoint);
    output1 = SYS.strFromCharUTF(codePoint);
    output2 = setUTF8_new(codePoint);
    if (output1 != output2) //break;
        console += "U+" + codePoint.toString(16) + "\n";
}
*/
// test passed charCodeAt(0,"UTF-8");
/*for(codePoint = 0x0; codePoint <= 0x10FFFF; codePoint++) {
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
        continue; // surrogate pairs
    output1 = setUTF8_new(codePoint);
    codePoint1 = output1.charCodeAt(0,"UTF-8");
    if (codePoint != codePoint1) break;
}
*/
codePointHex = "U+" + codePoint.toString(16) + " bin:[" + codePoint.toString(2) + "]";
ArInt1s = "len=" + output1.length + "; ";
ArInt1s += " hex:[";
for(i = 0; i < output1.length; i++){
    ArInt1s += "0x" + output1.charCodeAt(i).toString(16) + ",";
}
ArInt1s += "]; dec:[";
for(i = 0; i < output1.length; i++){
    ArInt1s += output1.charCodeAt(i).toString(10) + ",";
}
ArInt1s += "]; bin:[";
for(i = 0; i < output1.length; i++){
    ArInt1s += output1.charCodeAt(i).toString(2,8) + ",";
}
ArInt1s += "]";
ArInt2s = "len=" + output2.length + "; ";
ArInt2s += "hex:[";
for(i = 0; i < output2.length; i++){
    ArInt2s += "0x" + output2.charCodeAt(i).toString(16) + ",";
}
ArInt2s += "]; dec:[";
for(i = 0; i < output2.length; i++){
    ArInt2s += output2.charCodeAt(i).toString(10) + ",";
}
ArInt2s += "]; bin:[";
for(i = 0; i < output2.length; i++){
    ArInt2s += output2.charCodeAt(i).toString(2,8) + ",";
}
ArInt2s += "]";
if (output1 == output2){
    ArInt2s = "equal";
}
Attached files
pic 1.png (File type: image/png, Size: 18.32 kilobytes), 34 downloads
pic2.png (File type: image/png, Size: 30.73 kilobytes), 35 downloads
pic3.png (File type: image/png, Size: 29.93 kilobytes), 36 downloads
U 800.png (File type: image/png, Size: 27.38 kilobytes), 35 downloads
pic5.png (File type: image/png, Size: 23.62 kilobytes), 37 downloads
|
Message created: 17. 04. 2025 [09:09]
|
roman
|
"arccis" wrote:
2) However, 0xD83D cannot actually be converted directly: it is one half of a surrogate pair (the range U+D800 through U+DFFF, see pic 1). The two halves must first be combined with the following formula:
Yes, I see, that is a real problem; I had not run into it since I have not used such codes.
But it is easily fixed within the same code:
string TMess::setUTF8( uint32_t symb )
{
    string rez;
    if(symb < 0x80) rez += (char)symb;
    else for(int iCh = 5, iSt = -1; iCh >= 0; iCh--) {
        if(iSt < iCh && (symb>>((iCh-1)*6+(7-iCh)))) iSt = iCh;
        if(iCh == iSt) rez += (char)((0xFF<<(7-iCh))|(symb>>(iCh*6)));
        else if(iCh < iSt) rez += (char)(0x80|(0x3F&(symb>>(iCh*6))));
    }
    return rez;
}
So, now I have:
U+1F600 = F09F9880
U+1FA00 = F09FA880
U+D83D = EDA0BD
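The three results above can be re-checked with a standalone transcription of the corrected loop. This is only a sketch: `setUTF8Fixed` and `toHex` are hypothetical free functions written for this check, not part of TMess, and the shift `(iCh-1)*6+(7-iCh)` is copied from the code in this post.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Corrected loop as a free function. The length test asks whether symb still
// has bits above what a sequence with (iCh-1) continuation bytes can hold:
// (iCh-1)*6 continuation bits plus (7-iCh) lead-byte bits.
std::string setUTF8Fixed(std::uint32_t symb) {
    std::string rez;
    if (symb < 0x80) {
        rez += static_cast<char>(symb);
        return rez;
    }
    for (int iCh = 5, iSt = -1; iCh >= 0; iCh--) {
        if (iSt < iCh && (symb >> ((iCh - 1) * 6 + (7 - iCh)))) iSt = iCh;
        if (iCh == iSt)        // lead byte: length marker plus the top payload bits
            rez += static_cast<char>(((0xFFu << (7 - iCh)) | (symb >> (iCh * 6))) & 0xFFu);
        else if (iCh < iSt)    // continuation byte: 10xxxxxx
            rez += static_cast<char>(0x80u | (0x3Fu & (symb >> (iCh * 6))));
    }
    return rez;
}

// Hypothetical helper: renders a byte string as upper-case hex.
std::string toHex(const std::string &bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

With this transcription, `toHex(setUTF8Fixed(0x1F600))`, `toHex(setUTF8Fixed(0x1FA00))` and `toHex(setUTF8Fixed(0xD83D))` produce `F09F9880`, `F09FA880` and `EDA0BD`, matching the values listed in the post.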
|
Message created: 17. 04. 2025 [11:20]
|
arccis
|
Both the old and the new function fail the U+800 and U+10000 tests: the length is now determined correctly, but the first byte differs. ChatGPT refers to the RFC 3629 limits and offers code without a loop:
#include <cstdint>
#include <string>
using std::string;

// symb is unsigned so that a negative value cannot fall into the 1-byte branch
string setUTF8(uint32_t symb) {
    string rez;
    if (symb < 0x80)
        rez += (char)symb;
    else if (symb < 0x800) {
        rez += (char)(0xC0 | (symb >> 6));
        rez += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x10000) {
        rez += (char)(0xE0 | (symb >> 12));
        rez += (char)(0x80 | ((symb >> 6) & 0x3F));
        rez += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x110000) {
        rez += (char)(0xF0 | (symb >> 18));
        rez += (char)(0x80 | ((symb >> 12) & 0x3F));
        rez += (char)(0x80 | ((symb >> 6) & 0x3F));
        rez += (char)(0x80 | (symb & 0x3F));
    }
    return rez;  // empty for values above 0x10FFFF
}
ChatGPT also said that 6-byte sequences are not allowed by RFC 3629 and are considered unsafe, but here is the code anyway:
#include <cstdint>
#include <string>

std::string encodeUTF8_6Byte(uint32_t symb) {
    std::string result;
    if (symb < 0x80) {
        result += (char)symb;
    } else if (symb < 0x800) {
        result += (char)(0xC0 | (symb >> 6));
        result += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x10000) {
        result += (char)(0xE0 | (symb >> 12));
        result += (char)(0x80 | ((symb >> 6) & 0x3F));
        result += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x200000) {
        result += (char)(0xF0 | (symb >> 18));
        result += (char)(0x80 | ((symb >> 12) & 0x3F));
        result += (char)(0x80 | ((symb >> 6) & 0x3F));
        result += (char)(0x80 | (symb & 0x3F));
    } else if (symb < 0x4000000) {
        result += (char)(0xF8 | (symb >> 24));
        result += (char)(0x80 | ((symb >> 18) & 0x3F));
        result += (char)(0x80 | ((symb >> 12) & 0x3F));
        result += (char)(0x80 | ((symb >> 6) & 0x3F));
        result += (char)(0x80 | (symb & 0x3F));
    } else if (symb <= 0x7FFFFFFF) {
        result += (char)(0xFC | (symb >> 30));
        result += (char)(0x80 | ((symb >> 24) & 0x3F));
        result += (char)(0x80 | ((symb >> 18) & 0x3F));
        result += (char)(0x80 | ((symb >> 12) & 0x3F));
        result += (char)(0x80 | ((symb >> 6) & 0x3F));
        result += (char)(0x80 | (symb & 0x3F));
    } else {
        // Invalid range
        result = "?";
    }
    return result;
}
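For contrast with the 6-byte variant, the RFC 3629 limits mentioned here can be written as a small range check. This is a sketch with hypothetical names (`isSurrogate`, `utf8Length`); returning 0 for "not encodable" is a choice made for this example, not something taken from the thread.

```cpp
#include <cassert>
#include <cstdint>

// U+D800..U+DFFF are reserved for UTF-16 surrogates and are not valid in UTF-8.
bool isSurrogate(std::uint32_t cp) { return cp >= 0xD800 && cp <= 0xDFFF; }

// Number of bytes RFC 3629 UTF-8 uses for cp, or 0 if cp must not be encoded
// (surrogate range or above U+10FFFF).
int utf8Length(std::uint32_t cp) {
    if (cp > 0x10FFFF || isSurrogate(cp)) return 0;
    if (cp < 0x80) return 1;     // 7 payload bits
    if (cp < 0x800) return 2;    // 5 + 6 = 11 payload bits
    if (cp < 0x10000) return 3;  // 4 + 6 + 6 = 16 payload bits
    return 4;                    // 3 + 6 + 6 + 6 = 21 payload bits
}
```

An encoder that wants to stay within RFC 3629 could call `utf8Length` first and refuse or substitute the input when it returns 0, so the 5-byte and 6-byte branches are never reached.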
Attached files
U 10000.png (File type: image/png, Size: 28.58 kilobytes), 32 downloads
U 800.png (File type: image/png, Size: 27.38 kilobytes), 34 downloads
|
Message created: 17. 04. 2025 [12:37]
|
roman
|
"arccis" wrote:
Both the old and the new function fail the U+800 and U+10000 tests: the length is now determined correctly, but the first byte differs. ChatGPT refers to the RFC 3629 limits.
And that corresponds to https://en.wikipedia.org/wiki/UTF-8, which refers to RFC 3629, so go fix the article! :)
Because in two and three bytes that would be:
U+800:   100000 000000 (the last code that fits in two bytes is U+07FF)
         11? 100000 10 000000
U+10000: 10000 000000 000000 (the last code that fits in three bytes is U+FFFF)
         111? 10000 10 000000 10 000000
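The two boundary cases can also be checked mechanically. The sketch below assumes the loop is transcribed with the corrected length test from the earlier post, and compares it with a branch-per-length encoder over the whole range; `encodeLoop`, `encodeBranches` and `firstMismatch` are hypothetical names introduced for this comparison.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Loop-based encoder with the corrected length test (5*iCh+1 bits).
std::string encodeLoop(std::uint32_t symb) {
    std::string rez;
    if (symb < 0x80) { rez += static_cast<char>(symb); return rez; }
    for (int iCh = 5, iSt = -1; iCh >= 0; iCh--) {
        if (iSt < iCh && (symb >> ((iCh - 1) * 6 + (7 - iCh)))) iSt = iCh;
        if (iCh == iSt)
            rez += static_cast<char>(((0xFFu << (7 - iCh)) | (symb >> (iCh * 6))) & 0xFFu);
        else if (iCh < iSt)
            rez += static_cast<char>(0x80u | (0x3Fu & (symb >> (iCh * 6))));
    }
    return rez;
}

// Branch-per-length encoder limited to U+10FFFF (empty string above that).
std::string encodeBranches(std::uint32_t symb) {
    std::string rez;
    if (symb < 0x80) {
        rez += static_cast<char>(symb);
    } else if (symb < 0x800) {
        rez += static_cast<char>(0xC0 | (symb >> 6));
        rez += static_cast<char>(0x80 | (symb & 0x3F));
    } else if (symb < 0x10000) {
        rez += static_cast<char>(0xE0 | (symb >> 12));
        rez += static_cast<char>(0x80 | ((symb >> 6) & 0x3F));
        rez += static_cast<char>(0x80 | (symb & 0x3F));
    } else if (symb < 0x110000) {
        rez += static_cast<char>(0xF0 | (symb >> 18));
        rez += static_cast<char>(0x80 | ((symb >> 12) & 0x3F));
        rez += static_cast<char>(0x80 | ((symb >> 6) & 0x3F));
        rez += static_cast<char>(0x80 | (symb & 0x3F));
    }
    return rez;
}

// Returns the first value in 0..0x10FFFF where the two encoders disagree, or -1.
std::int64_t firstMismatch() {
    for (std::uint32_t cp = 0; cp <= 0x10FFFF; cp++)
        if (encodeLoop(cp) != encodeBranches(cp)) return cp;
    return -1;
}
```

In this transcription `firstMismatch()` returns -1, and U+800 and U+10000 come out as E0 A0 80 and F0 90 80 80 from both encoders. A different first byte in a live test would therefore point at something outside this arithmetic, for example a build that does not yet contain the fix; that explanation is a guess, not something the thread confirms.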
"arccis" wrote:
... and offers code without a loop.
And it invites the biomass to be stupid!? :)
|
Message created: 17. 04. 2025 [13:08]
|
arccis
|
It was an elegant function until Arkadii decided to draw smileys in SCADA. :)
|