In Elements, the native String type on all target platforms uses UTF-16 encoding. UTF-16 is a great middle ground, because 16 bits are enough to represent most common Unicode code points, including not just Latin letters, symbols and accents, but also most other commonly used character sets such as Greek, Cyrillic, or most Asian languages.
But it is important to keep in mind that Unicode code points go up to 10FFFF, which takes 21 bits, so there is still a wide range of characters that cannot be expressed in a single 16-bit char. Where its sibling encoding UTF-8 uses multi-byte sequences, UTF-16 uses Surrogate Pairs to encode these.
Surrogate Pairs
What this means is that two ranges of 16-bit values, D800-DBFF and DC00-DFFF, are reserved, and whenever a Unicode code point does not fit into 16 bits, it is encoded as two 16-bit values: a "high" surrogate from the first range, followed by a "low" surrogate from the second. The reserved values never represent a character on their own.
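The arithmetic behind a surrogate pair is simple enough to sketch. The snippet below is plain Python, not Elements code, and the function name to_surrogate_pair is made up for this illustration. It subtracts 10000 (hex) from the code point, which leaves a 20-bit value, and then puts the top 10 bits into the high surrogate and the bottom 10 bits into the low surrogate.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair.

    Illustrative helper only; this is not an Elements RTL API.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits land in D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits land in DC00..DFFF
    return high, low


high, low = to_surrogate_pair(0x1F603)
print(f"{high:X},{low:X}")  # D83D,DE03
```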
Consider this string: "Fun 😃 Times!". If we look at the individual 16-bit Chars, we see this:
46,75,6E,20,D83D,DE03,20,54,69,6D,65,73,21
Note the two values D83D and DE03: these are a surrogate pair, and they decode to the Unicode code point 1F603, which was too large to fit into 16 bits, of course.
So while this string reports a length() of 13, it really only contains 12 Unicode code points, which matches what we see visually.
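If you want to see the 13-versus-12 mismatch for yourself outside of Elements, here is a small Python sketch. Python strings count code points rather than 16-bit chars, so the UTF-16 values are obtained by encoding the string first; the helper name utf16_units is an assumption for this example.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


text = "Fun \U0001F603 Times!"
units = utf16_units(text)

print(",".join(f"{u:X}" for u in units))
# 46,75,6E,20,D83D,DE03,20,54,69,6D,65,73,21
print(len(units))  # 13 UTF-16 chars
print(len(text))   # 12 code points
```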
Luckily, Elements RTL adds a few really helpful APIs on its String type that help you deal with this.
First, there's ToUnicodeCodePoints, which returns an array of merged, 32-bit Unicode code points. In this case, predictably, it will return this:
46,75,6E,20,1F603,20,54,69,6D,65,73,21
Note how all the "regular" characters are untouched, but the surrogate pair has been merged. There are a couple more handy helper functions.
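Before getting to those, the merging step itself can be sketched in a few lines. This is Python written purely for illustration, not the code behind ToUnicodeCodePoints, and the name merge_surrogates is made up. It also assumes that an unpaired surrogate should simply be passed through unchanged, which may not be what Elements RTL does.

```python
def merge_surrogates(units):
    """Merge surrogate pairs in a list of 16-bit values into full code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Reverse of the encoding: 10 bits from each half, plus 0x10000.
            merged = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            code_points.append(merged)
            i += 2
        else:
            # Regular char, or an unpaired surrogate (kept as-is: an assumption).
            code_points.append(unit)
            i += 1
    return code_points


units = [0x46, 0x75, 0x6E, 0x20, 0xD83D, 0xDE03, 0x20, 0x54, 0x69, 0x6D, 0x65, 0x73, 0x21]
print(",".join(f"{cp:X}" for cp in merge_surrogates(units)))
# 46,75,6E,20,1F603,20,54,69,6D,65,73,21
```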
IsIndexInsideOfASurrogatePair returns whether an index into the string is in the middle of a surrogate pair. For the above example, it would return true for 5 only. This method can be useful for string manipulations. For example, imagine you were about to insert a character, or split the string, at index 5. That would be a bad idea, as it would break the surrogate pair and result in an invalid UTF-16 string.
I use this, for example, in the Fire code editor. Say the cursor is left of the emoji, and you press Right. It used to be that you'd end up in the middle of the emoji, and could type a space to break it apart. Not anymore: if IsIndexInsideOfASurrogatePair is true, the cursor moves by two indices, so that you're now to the right of the emoji (same goes for pressing Delete, etc).
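Here is roughly what that check and the cursor logic could look like. This is a Python sketch over a list of 16-bit values, not the actual Fire or Elements RTL code; both function names are hypothetical, and the treatment of out-of-range indices is my own assumption.

```python
def is_index_inside_surrogate_pair(units, index):
    """True if index sits between a high surrogate and the low surrogate after it."""
    if index <= 0 or index >= len(units):
        return False  # the very start and end can never split a pair
    return 0xD800 <= units[index - 1] <= 0xDBFF and 0xDC00 <= units[index] <= 0xDFFF


def move_cursor_right(units, index):
    """Move one position right, skipping over the second half of a surrogate pair."""
    new_index = min(index + 1, len(units))
    if is_index_inside_surrogate_pair(units, new_index):
        new_index += 1
    return new_index


units = [0x46, 0x75, 0x6E, 0x20, 0xD83D, 0xDE03, 0x20, 0x54, 0x69, 0x6D, 0x65, 0x73, 0x21]
inside = [i for i in range(len(units) + 1) if is_index_inside_surrogate_pair(units, i)]
print(inside)                       # [5]
print(move_cursor_right(units, 4))  # 6, the cursor jumps over the whole emoji
```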
With that, you might think you're all set and equipped to deal with Unicode, but that's really only half of the story. Enter Joined Characters.
Joined Characters
Consider this string: "WTF🤷🏼‍♀️?".
Based on the above, you'd probably know what to expect: four regular characters, and maybe a surrogate pair for the Emoji, right? Wrong. Let's look at the character values:
57,54,46,D83E,DD37,D83C,DFFC,200D,2640,FE0F,3F
WTF, indeed, right? This time, we have not two but seven UTF-16 chars occupied by the single Emoji. What's going on here? First, let's unpack the surrogate pairs by calling ToUnicodeCodePoints. That, predictably, gives us:
57,54,46,1F937,1F3FC,200D,2640,FE0F,3F
Of the seven chars, two surrogate pairs got merged, and three remained as they are. What's going on here?
It turns out that in Unicode, a single code point (note how I have avoided saying characters, until now) does not necessarily represent a distinct character. Certain code points can combine to form specific characters. This is used in a number of cases (for example combining accents with a regular letter), but is very common in Emoji, in this example to affect sex and skin tone. What looks like a "medium-light skinned woman shrugging" actually is the code point for "person shrugging", with a modifier for skin tone, and a modifier for sex:
1F937 (Person shrugging)
1F3FC (Skin Color)
200D (Zero Width Joiner)
2640 (Female Sign)
FE0F (Variation Selector-16, an invisible code point which specifies that the preceding character should be displayed with emoji presentation)
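If you'd like to check a breakdown like this yourself, Python's standard unicodedata module can look up the official name of each code point. The sketch below is only an illustration (the describe helper is made up), and the official names are more formal than the friendly labels above. Which names are available depends on the Unicode version bundled with your Python.

```python
import unicodedata


def describe(code_points):
    """Pair each code point (as hex) with its official Unicode name."""
    return [(f"{cp:X}", unicodedata.name(chr(cp), "(no name)")) for cp in code_points]


for hex_value, name in describe([0x1F937, 0x1F3FC, 0x200D, 0x2640, 0xFE0F]):
    print(hex_value, name)
```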
Luckily, once again Elements RTL exposes helper functions to deal with this. Similar to ToUnicodeCodePoints, there's a ToUnicodeCharacters method that processes all the joining, and returns this:
W,T,F,🤷🏼‍♀️,?
Since Unicode characters can't be expressed by a single hex value, this method returns a list of strings, each one containing all the code points that make up an individual character.
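To give a feel for what "processing all the joining" involves, here is a deliberately simplified Python sketch. It is not how ToUnicodeCharacters is implemented, and it is not the full Unicode segmentation algorithm either; it only knows the handful of joining rules that come up in this post (zero width joiners, variation selectors, skin tone modifiers, combining marks, and pairs of regional indicators). All function names are made up for the example.

```python
import unicodedata

ZWJ = 0x200D


def is_extender(cp):
    """Code points that attach to whatever precedes them (simplified list)."""
    return (
        cp == ZWJ
        or 0xFE00 <= cp <= 0xFE0F                         # variation selectors
        or 0x1F3FB <= cp <= 0x1F3FF                       # emoji skin tone modifiers
        or unicodedata.category(chr(cp)).startswith("M")  # combining marks
    )


def is_regional_indicator(cp):
    return 0x1F1E6 <= cp <= 0x1F1FF


def to_characters(text):
    """Group the code points of text into user-visible characters (simplified)."""
    characters = []
    for ch in text:
        cp = ord(ch)
        joins = False
        if characters:
            current = characters[-1]
            prev = ord(current[-1])
            if is_extender(cp) or prev == ZWJ:
                joins = True
            elif is_regional_indicator(cp) and is_regional_indicator(prev):
                # Flags are pairs: join only the second indicator of each pair.
                joins = sum(is_regional_indicator(ord(c)) for c in current) % 2 == 1
        if joins:
            characters[-1] += ch
        else:
            characters.append(ch)
    return characters


shrug = "WTF\U0001F937\U0001F3FC\u200D\u2640\uFE0F?"
print(len(shrug))                 # 9 code points
print(len(to_characters(shrug)))  # 5 characters
```

Each entry in the result is a string holding every code point of one character, which mirrors the list-of-strings shape described above.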
There is also IsIndexInsideOfAJoinedUnicodeCharacter which, again, lets you know if a given string index falls within a joined character. And because, unlike surrogate pairs, joined characters don't have a well-known length of just two, we have StartIndexOfJoinedUnicodeCharacterAtIndex and IndexAfterJoinedUnicodeCharacterCoveringIndex that allow you to find the beginning or the end of a character (including, of course, accounting for surrogate pairs).
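One way to think about these index helpers is as a lookup over the UTF-16 start and end position of every character. The Python sketch below takes an already-grouped list of characters (such as the one ToUnicodeCharacters returns) and derives those positions. The function names are hypothetical, and how the real Elements RTL methods treat an index that sits exactly on a boundary is not something I'm asserting here; this sketch treats boundaries as "not inside".

```python
def utf16_length(text):
    """Number of 16-bit chars that text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def character_spans(characters):
    """Return (start, end) UTF-16 index ranges, one per grouped character."""
    spans = []
    start = 0
    for ch in characters:
        end = start + utf16_length(ch)
        spans.append((start, end))
        start = end
    return spans


def span_covering(spans, index):
    """Return the (start, end) span that index falls strictly inside, else None."""
    for start, end in spans:
        if start < index < end:
            return (start, end)
    return None


characters = ["W", "T", "F", "\U0001F937\U0001F3FC\u200D\u2640\uFE0F", "?"]
spans = character_spans(characters)
print(spans)                    # [(0, 1), (1, 2), (2, 3), (3, 10), (10, 11)]
print(span_covering(spans, 5))  # (3, 10): start index 3, index after is 10
print(span_covering(spans, 3))  # None: index 3 is a safe boundary
```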
Another common example of joined characters are the flag Emoji. Consider "Bon bini na 🇨🇼", which after merging surrogate pairs becomes:
42,6F,6E,20,62,69,6E,69,20,6E,61,20,1F1E8,1F1FC
Unicode actually reserves 26 code points, 1F1E6-1F1FF, as "regional indicators". Each of the 26 code points represents a letter A through Z, and any flag can be represented by combining the two letters of the country code, CW in this case.
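Since the 26 regional indicators are laid out in alphabetical order starting at 1F1E6, mapping a country code to its flag is plain offset arithmetic. A quick Python sketch (the function name flag_code_points is made up for this example):

```python
REGIONAL_INDICATOR_A = 0x1F1E6


def flag_code_points(country_code):
    """Map a two-letter country code (A-Z) to its two regional indicator code points."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two letters A-Z")
    return [REGIONAL_INDICATOR_A + ord(c) - ord("A") for c in code]


print(",".join(f"{cp:X}" for cp in flag_code_points("CW")))  # 1F1E8,1F1FC
```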
ToUnicodeCharacters, of course, handles this fine:
B,o,n, ,b,i,n,i, ,n,a, ,🇨🇼
Unicode: It's not as Simple as it Seems ;)
So that's a peek behind UTF-16 and Unicode, and some of the APIs that Elements RTL provides to help you work with Unicode data more safely.
If you want to look at what's going on behind the scenes, I recommend checking out String.Unicode.pas in the Elements RTL source code, as well as the accompanying test file, which explores a lot of corner cases.