Length of a string

What you see is not always what you get! The length of "👩‍👩‍👦‍👦🌦️🧘🏻‍♂️" is 21. Let us explore why it is 21 and how to get 3.

TL;DR

'👩‍👩‍👦‍👦🌦️🧘🏻‍♂️'.length is 21 instead of 3 because JS reports a string's length in UTF-16 code units, and each of these icons is made up of more than one such code unit. Use Intl.Segmenter to count the rendered graphemes.

console.log("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️".length); // 21 - Why?
console.log(getVisibleLength("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️")); // 3 - How can we get this?

What is the .length?

The length data property of a string contains the length of the string in UTF-16 code units. - MDN

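I always thought we used UTF-8 encoding, mostly because we usually set <meta charset="UTF-8"> in our HTML files. That tag only tells the browser how to decode the bytes of the document; it does not change how JS represents strings.

💡 Did you know that JS strings are sequences of UTF-16 code units, not UTF-8 bytes?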

const logItemsWithlength = (...items) =>
  console.log(items.map((item) => `${item}:${item.length}`));
logItemsWithlength("A", "a", "À", "⇝", "⇟");
// ['A:1', 'a:1', 'À:1', '⇝:1', '⇟:1']

In the above example, A and a take a single code unit in both UTF-8 and UTF-16. À needs two bytes in UTF-8 but still fits in one UTF-16 code unit.

The two arrows need three bytes each in UTF-8 (so their length would be 3 if JS counted UTF-8 bytes), yet each fits in a single 16-bit UTF-16 code unit.

Since every one of these characters fits in a single UTF-16 code unit, the length of each is 1.
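To see the difference between the two encodings, here is a minimal sketch that puts the UTF-8 byte count next to the UTF-16 code unit count. It assumes an environment where the standard TextEncoder is available (modern browsers and Node.js); the variable names are only illustrative.

```javascript
// TextEncoder always encodes to UTF-8, so the resulting byte array
// tells us how many UTF-8 bytes a character needs.
const encoder = new TextEncoder();

const unitCounts = ["A", "\u00C0", "\u21DF"].map((ch) => ({
  char: ch, // "A", "À", "⇟"
  utf8Bytes: encoder.encode(ch).length,
  utf16Units: ch.length, // .length counts UTF-16 code units
}));

console.table(unitCounts);
// A: 1 byte,  1 code unit
// À: 2 bytes, 1 code unit
// ⇟: 3 bytes, 1 code unit
```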

Length of Icons

logItemsWithlength("🧘", "🌦", "😂", "😃", "🥖", "🚗");
// ['🧘:2', '🌦:2', '😂:2', '😃:2', '🥖:2', '🚗:2']

Each of the above icons needs two UTF-16 code units (a surrogate pair) to be represented, and hence the length of every icon is 2.

Encoding values for the icon - 🧘

  • UTF-8 Encoding: 0xF0 0x9F 0xA7 0x98

  • UTF-16 Encoding: 0xD83E 0xDDD8

  • UTF-32 Encoding: 0x0001F9D8
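The values listed above can be reproduced in code. The following is a small sketch, assuming TextEncoder is available; the helper `hex` and the variable names are made up for illustration. It also derives the surrogate pair by hand using the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two halves).

```javascript
const lotus = "\u{1F9D8}"; // 🧘

// Illustrative helper: format a number as upper-case hex.
const hex = (n) => "0x" + n.toString(16).toUpperCase().padStart(2, "0");

const codePoint = lotus.codePointAt(0);
const utf8 = [...new TextEncoder().encode(lotus)].map(hex);
const utf16 = [lotus.charCodeAt(0), lotus.charCodeAt(1)].map(hex);

// Manual surrogate pair derivation for code points above U+FFFF.
const offset = codePoint - 0x10000;
const high = 0xd800 + (offset >> 10); // top 10 bits
const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits

console.log(hex(codePoint)); // 0x1F9D8
console.log(utf8); // ['0xF0', '0x9F', '0xA7', '0x98']
console.log(utf16); // ['0xD83E', '0xDDD8']
console.log([high, low].map(hex)); // ['0xD83E', '0xDDD8']
```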

Icons with different colors

While using reactions in multiple apps, we have seen the same icons with different colors. Are they different icons, or the same icons with some CSS magic?

Irrespective of the approach, the length should still be 2, right? After all, two UTF-16 code units (32 bits, as many as UTF-32 uses) leave plenty of room to accommodate different colors.

logItemsWithlength("🧘", "🧘🏻‍♂️");
//  ['🧘:2', '🧘🏻‍♂️:7']

Why does the icon in blue have a length of 7?

Icons are like words!

console.log("👩‍👩‍👦‍👦".length); // 11
console.log([..."👩‍👩‍👦‍👦"]);
// ['👩', '‍', '👩', '‍', '👦', '‍', '👦']

Icons, like words in English, can be composed of multiple smaller icons, glued together by the invisible zero-width joiner (U+200D). This is what makes icons variable in length.
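The same breakdown explains the length of 7 seen earlier. Here is a rough sketch that lists the code points of the light-skinned man in the lotus position and how many UTF-16 code units each contributes. The string is written with escapes so the invisible parts are easy to spot; the variable names are illustrative.

```javascript
// 🧘 + skin tone modifier + zero-width joiner + ♂ + variation selector
const yogi = "\u{1F9D8}\u{1F3FB}\u200D\u2642\uFE0F";

// Spreading a string iterates by code point, not by code unit.
const parts = [...yogi].map((ch) => ({
  codePoint:
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  utf16Units: ch.length,
}));

console.table(parts);
// U+1F9D8: 2, U+1F3FB: 2, U+200D: 1, U+2642: 1, U+FE0F: 1

const total = parts.reduce((sum, part) => sum + part.utf16Units, 0);
console.log(total, yogi.length); // 7 7
```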

How do you split these?

console.log("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️".length); // 21
console.log("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️".split(""));
// ['\uD83D', '\uDC69', '‍', '\uD83D', '\uDC69', '‍', '\uD83D', '\uDC66', '‍', '\uD83D', '\uDC66', '\uD83C', '\uDF26', '️', '\uD83E', '\uDDD8', '\uD83C', '\uDFFB', '‍', '♂', '️']

Since JS strings are made of UTF-16 code units, splitting on the empty string gives you those individual code units (including lone halves of surrogate pairs), which is not useful.
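Iterating by code point instead of by code unit gets closer, but still does not match what is rendered. A short sketch comparing the two counts for the same string (built from escapes here so the joiners and variation selectors are visible; the variable names are illustrative):

```javascript
const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}"; // 👩‍👩‍👦‍👦
const cloud = "\u{1F326}\uFE0F"; // 🌦️
const yogi = "\u{1F9D8}\u{1F3FB}\u200D\u2642\uFE0F"; // 🧘🏻‍♂️
const text = family + cloud + yogi;

const codeUnits = text.split(""); // one entry per UTF-16 code unit
const codePoints = [...text]; // one entry per Unicode code point

console.log(codeUnits.length); // 21
console.log(codePoints.length); // 14, still not the 3 icons we see
```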

Introducing Intl.Segmenter

The Intl.Segmenter object enables locale-sensitive text segmentation, enabling you to get meaningful items (graphemes, words or sentences) from a string. - MDN

const segmenterEn = new Intl.Segmenter("en");
[...segmenterEn.segment("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️")].forEach((seg) => {
  console.log(`'${seg.segment}' starting at index ${seg.index}`);
});
// '👩‍👩‍👦‍👦' starting at index 0
// '🌦️' starting at index 11
// '🧘🏻‍♂️' starting at index 14
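The segmenter above uses the default granularity, which is "grapheme". As the MDN description mentions, it can also split by words or sentences. Here is a minimal sketch of word segmentation, assuming a runtime that ships Intl.Segmenter; the sample sentence and variable names are made up for illustration.

```javascript
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const sentence = "Hello, world!";

// With word granularity every segment carries an isWordLike flag,
// which is false for punctuation and whitespace.
const words = [...wordSegmenter.segment(sentence)]
  .filter((seg) => seg.isWordLike)
  .map((seg) => seg.segment);

console.log(words); // ['Hello', 'world']
```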

Getting the visible length of a string

Using the segmenter API, we could split the text based on the graphemes and get the visible length of the string.

Since the output of .segment() is iterable, we will collect that in an array and return its length.

function getVisibleLength(str, locale = "en") {
  return [...new Intl.Segmenter(locale).segment(str)].length;
}
console.log("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️".length); // 21
console.log(getVisibleLength("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️")); // 3
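If this gets called often or on long strings, one possible variation is to reuse a single segmenter and count segments without building an intermediate array. The sketch below is only an illustration: `countGraphemes` and `graphemeSegmenter` are hypothetical names, and the fallback to counting code points is an assumption about what is acceptable when Intl.Segmenter is missing (it over-counts joined icons).

```javascript
// Create the segmenter once; fall back to null on runtimes without it.
const graphemeSegmenter =
  typeof Intl !== "undefined" && typeof Intl.Segmenter === "function"
    ? new Intl.Segmenter("en", { granularity: "grapheme" })
    : null;

function countGraphemes(str) {
  if (!graphemeSegmenter) {
    // Assumed fallback: count code points instead of graphemes.
    return [...str].length;
  }
  let count = 0;
  // segment() returns an iterable, so we can count without an array.
  for (const _segment of graphemeSegmenter.segment(str)) {
    count += 1;
  }
  return count;
}

console.log(countGraphemes("e\u0301")); // 1 ("e" plus a combining accent)
console.log(countGraphemes("\u{1F9D8}\u{1F3FB}\u200D\u2642\uFE0F")); // 1
```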
