Length of a string
What you see is not always what you get!. The length of "๐ฉโ๐ฉโ๐ฆโ๐ฆ๐ฆ๏ธ๐ง๐ปโโ๏ธ" is 21. Let us explore why is it 21 and how to get 3.
TL;DR
'๐ฉโ๐ฉโ๐ฆโ๐ฆ๐ฆ๏ธ๐ง๐ปโโ๏ธ'.length
is 21 instead of 3 because JS gives length UTF-16 code units and icons are a combination of more than one of such code units. Use Intl.Segmenter
to get the length of rendered graphemes.
console.log("๐ฉโ๐ฉโ๐ฆโ๐ฆ๐ฆ๏ธ๐ง๐ปโโ๏ธ".length); // 21 - W
console.log(getVisibleLength("๐ฉโ๐ฉโ๐ฆโ๐ฆ๐ฆ๏ธ๐ง๐ปโโ๏ธ")); // 3 - How can we get this?
What is the .length
?
The
length
data property of a string contains the length of the string in UTF-16 code units. - MDN
I always thought we used utf-8
encoding, mostly because we use to set <meta charset="UTF-8">
in our HTML file.
๐กDid you know, JS engines use UTF-16 encoding and not UTF-8?
const logItemsWithlength = (...items) =>
console.log(items.map((item) => `${item}:${item.length}`));
logItemsWithlength("A", "a", "ร", "โ", "โ");
// ['A:1', 'a:1', 'ร:1', 'โ:1', 'โ:1']
In the above example. A
, a
, and ร
can be represented using utf-8
encoding and hence in length is 1, irrespective if you check utf-8 or utf-16 encoding.
โ
and โ
needs utf-16
(if it was utf-8, its length would be 2)
But since all the characters could be represented using utf-16, the length for each character is 1.
Length of Icons
logItemsWithlength("๐ง", "๐ฆ", "๐", "๐", "๐ฅ", "๐");
// ['๐ง:2', '๐ฆ:2', '๐:2', '๐:2', '๐ฅ:2', '๐:2']
The above icon needs two code points of UTF-16 to be represented, and hence the length of all the icons is 2.
Encoding values for the icon - ๐ง
UTF-8 Encoding: 0xF0 0x9F 0xA7 0x98
UTF-16 Encoding: 0xD83E 0xDDD8
UTF-32 Encoding: 0x0001F9D8
Icons With different colors
While using reactions in multiple apps, we have seen the same icons with different colors, are they different icons or the same icons with some CSS magic?
Irrespective of the approach, the length should be now 2, right? After all, two codepoints of utf-16 encoding (basically utf-32 encoding) have a lot of possible spaces to accommodate different colors.
logItemsWithlength("๐ง", "๐ง๐ปโโ๏ธ");
// ['๐ง:2', '๐ง๐ปโโ๏ธ:7']
Why is the icon in blue have a length of 7?
Icons are like words!
console.log("๐ฉโ๐ฉโ๐ฆโ๐ฆ".length); // 11
console.log([..."๐ฉโ๐ฉโ๐ฆโ๐ฆ"]);
// ['๐ฉ', 'โ', '๐ฉ', 'โ', '๐ฆ', 'โ', '๐ฆ']
Icons, like words in English, are composed of multiple icons. And this can make the icons of variable length.
How do you split these?
console.log("๐ฉโ๐ฉโ๐ฆโ๐ฆ๐ฆ๏ธ๐ง๐ปโโ๏ธ".length); // 21
console.log("๐ฉโ๐ฉโ๐ฆโ๐ฆ๐ฆ๏ธ๐ง๐ปโโ๏ธ".split(""));
// ['\uD83D', '\uDC69', 'โ', '\uD83D', '\uDC69', 'โ', '\uD83D', '\uDC66', 'โ', '\uD83D', '\uDC66', '\uD83C', '\uDF26', '๏ธ', '\uD83E', '\uDDD8', '\uD83C', '\uDFFB', 'โ', 'โ', '๏ธ']
Since JS uses utf-16 encoding, splitting would give you those codepoints and is not useful.
Introducing Intl.Segmenter
The
Intl.Segmenter
object enables locale-sensitive text segmentation, enabling you to get meaningful items (graphemes, words or sentences) from a string. - MDN
const segmenterEn = new Intl.Segmenter("en");
[...segmenterEn.segment("๐ฉโ๐ฉโ๐ฆโ๐ฆ๐ฆ๏ธ๐ง๐ปโโ๏ธ")].forEach((seg) => {
console.log(`'${seg.segment}' starting at index ${seg.index}`);
});
// '๐ฉโ๐ฉโ๐ฆโ๐ฆ' starting at index 0
// '๐ฆ๏ธ' starting at index 11
// '๐ง๐ปโโ๏ธ' starting at index 14
Getting the visible length of a string
Using the segmenter API, we could split the text based on the graphemes and get the visible length of the string.
Since the output of .segment()
is iterable, we will collect that in an array and return its length.
function getVisibleLength(str, locale = "en") {
return [...new Intl.Segmenter(locale).segment(str)].length;
}
console.log("๐ฉโ๐ฉโ๐ฆโ๐ฆ๐ฆ๏ธ๐ง๐ปโโ๏ธ".length); // 21
console.log(getVisibleLength("๐ฉโ๐ฉโ๐ฆโ๐ฆ๐ฆ๏ธ๐ง๐ปโโ๏ธ")); // 3