Bytes and code points

Anne van Kesteren

@annevk

Unicode

Unicode assigns unique numbers to characters used all over the world; these numbers are known as code points.

Room for OVER 9000 code points: U+0000 to U+10FFFF.

To be clear, that is over a million code points.

9 U+0039 DIGIT NINE

€ U+20AC EURO SIGN

💩 U+1F4A9 PILE OF POO
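As a rough illustration, JavaScript can map between a character and its code point with `codePointAt()` and `String.fromCodePoint()`. The helper name `describe` is made up for this sketch.

```javascript
// Format the first code point of a string in U+XXXX notation.
// (The function name is hypothetical, chosen for this sketch.)
function describe(str) {
  const cp = str.codePointAt(0);
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

console.log(describe("9"));  // U+0039
console.log(describe("€"));  // U+20AC
console.log(describe("💩")); // U+1F4A9

// And back again, from number to character.
console.log(String.fromCodePoint(0x20AC)); // €
```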

Lots of unassigned code points. (If Klingon becomes widespread enough at some point it can easily be added.)

Lots of gotchas.

A U+0041 LATIN CAPITAL LETTER A

Α U+0391 GREEK CAPITAL LETTER ALPHA

Å U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

Å U+0041 LATIN CAPITAL LETTER A
U+030A COMBINING RING ABOVE
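A minimal sketch of why this is a gotcha: the two spellings of Å look the same but compare as different strings unless they are normalized first. The variable names are only for illustration.

```javascript
const precomposed = "\u00C5"; // Å as a single code point
const combining = "A\u030A";  // A followed by COMBINING RING ABOVE

console.log(precomposed === combining);                  // false
console.log(precomposed === combining.normalize("NFC")); // true
console.log(precomposed.normalize("NFD") === combining); // true
```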

But let’s not get into that.

Glyphs: A, A, A.

A glyph represents one or more characters as a distinct graphical unit.

Compared to HTML and CSS: characters are like HTML (semantics), glyphs are like CSS (presentation).

Watch out for icon fonts.

Encodings

There are many ways to represent a code point.

Usually a code point is represented in bytes per the rules of an encoding.

0x80 U+20AC EURO SIGN in windows-1252

0xAC 0x20 U+20AC EURO SIGN in utf-16le

0x20 0xAC U+20AC EURO SIGN in utf-16be

0xE2 0x82 0xAC U+20AC EURO SIGN in utf-8
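The utf-8 and utf-16 byte sequences above can be reproduced with a short sketch. `TextEncoder` only produces utf-8, so the utf-16 variants are computed by hand from the code units; the helper names are assumptions for this example.

```javascript
// Format a sequence of byte values as "0xE2 0x82 0xAC".
function hex(bytes) {
  return Array.from(bytes, b =>
    "0x" + b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

// utf-8: TextEncoder always encodes to utf-8.
function utf8Bytes(str) {
  return new TextEncoder().encode(str);
}

// utf-16: split each 16-bit code unit into two bytes, in either byte order.
function utf16Bytes(str, littleEndian) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const hi = unit >> 8, lo = unit & 0xFF;
    if (littleEndian) bytes.push(lo, hi);
    else bytes.push(hi, lo);
  }
  return bytes;
}

console.log(hex(utf8Bytes("€")));         // 0xE2 0x82 0xAC
console.log(hex(utf16Bytes("€", true)));  // 0xAC 0x20
console.log(hex(utf16Bytes("€", false))); // 0x20 0xAC
```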

Only utf-8, utf-16 (both variants), and gb18030 can represent all code points. Other encodings are more limited.

Let’s take a quick look at the Encoding Standard.

Within a language a code point can typically also be escaped.

&#x20AC; U+20AC EURO SIGN in HTML

"\u20AC" U+20AC EURO SIGN in JavaScript

"\0020AC" U+20AC EURO SIGN in CSS

Note that the code point here is represented using code points in the U+0000 to U+007F (ASCII) range.
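To make the ASCII-only point concrete, here is a small sketch that builds such escapes from a code point. The function names are hypothetical, and the JavaScript form shown only covers code points up to U+FFFF.

```javascript
// Uppercase hexadecimal digits of a code point.
function hexOf(cp) {
  return cp.toString(16).toUpperCase();
}

// HTML numeric character reference, e.g. &#x20AC;
function htmlEscape(cp) {
  return "&#x" + hexOf(cp) + ";";
}

// CSS escape padded to six hex digits, e.g. \0020AC
function cssEscape(cp) {
  return "\\" + hexOf(cp).padStart(6, "0");
}

// JavaScript \uXXXX escape (this form is only valid up to U+FFFF).
function jsEscape(cp) {
  return "\\u" + hexOf(cp).padStart(4, "0");
}

const escapes = [htmlEscape(0x20AC), cssEscape(0x20AC), jsEscape(0x20AC)];
console.log(escapes.join("  ")); // &#x20AC;  \0020AC  \u20AC

// Every escape consists of ASCII code points only.
console.log(escapes.every(e => /^[\x00-\x7F]*$/.test(e))); // true
```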

Biting bytes

Your chosen encoding leaks in two places: HTML forms and the URL query string.

Source: /€?€
Encoding: windows-1252
URL: /%E2%82%AC?%80

What happens to code points that cannot be represented in the encoding?

☺

For form submission the consensus is that ☺ turns into &#9786;.

The sequence &#, code point as decimal number, ; is required by the HTML standard.

In such a scenario you cannot distinguish a user entering ☺ from a user entering &#9786;.
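The ambiguity can be shown with a minimal sketch of the fallback. The `canEncode` predicate is a stand-in for a real encoding's lookup, and "only U+0000 to U+00FF are representable" is an assumption made purely for illustration.

```javascript
// Replace code points the encoding cannot represent with
// "&#" + decimal code point + ";".
// `canEncode` is a hypothetical predicate; a real encoder would
// consult the encoding's index instead.
function formFallback(str, canEncode) {
  let out = "";
  for (const ch of str) { // for...of iterates by code point
    const cp = ch.codePointAt(0);
    out += canEncode(cp) ? ch : "&#" + cp + ";";
  }
  return out;
}

// Assumption for illustration only: pretend U+0000 to U+00FF are representable.
const latinOnly = cp => cp <= 0xFF;

console.log(formFallback("☺", latinOnly));       // &#9786;
console.log(formFallback("&#9786;", latinOnly)); // &#9786;
```

Both inputs produce the same output, so the server has no way to tell them apart.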

For the URL query component the standard currently requires a literal "?" to be used as the replacement. Browsers are all over the map.

URLs are actually harder. They are context-dependent. In CSS or XMLHttpRequest using U+20AC in the query component will always use utf-8, i.e. ?%E2%82%AC.
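The `URL` API available to scripts is another context with no document encoding to leak, so it behaves the same way. A short sketch, with `example.com` as a placeholder host:

```javascript
const url = new URL("https://example.com/€?€");

console.log(url.pathname); // /%E2%82%AC
console.log(url.search);   // ?%E2%82%AC

// encodeURIComponent() likewise always percent-encodes utf-8 bytes.
console.log(encodeURIComponent("€")); // %E2%82%AC
```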

What I am trying to say is this:
Always use utf-8!

Not convinced?

The utf-8 encoding is ingrained in the platform.

XMLHttpRequest .send(…) & .responseType = "json"

WebSocket protocol

Web Workers

Application cache manifests

URLs as mentioned and everything new.

What I am trying to say is this:
Always use utf-8!

JavaScript’s code units

JavaScript and the DOM use the utf-16 encoding, but they do not handle surrogates for you: strings are exposed as sequences of 16-bit code units.

Surrogates are annoying and exist because a group agreed that 16 bits must be enough for all code points.

IPv4 anyone?

Every code point in U+10000 to U+10FFFF requires surrogates when expressing it in utf-16.

There are leading surrogates (U+D800 to U+DBFF) and trailing surrogates (U+DC00 to U+DFFF) and they appear as a pair.

A code point is 0x10000 + (LS − 0xD800) × 0x400 + (TS − 0xDC00).
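The formula translates directly into code. This is a sketch with a made-up function name; the pair 0xD83D, 0xDCA9 is the utf-16 form of U+1F4A9.

```javascript
// Combine a leading surrogate (ls) and trailing surrogate (ts)
// into the code point they represent.
function fromSurrogates(ls, ts) {
  return 0x10000 + (ls - 0xD800) * 0x400 + (ts - 0xDC00);
}

const combined = fromSurrogates(0xD83D, 0xDCA9);
console.log(combined.toString(16).toUpperCase()); // 1F4A9
```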

This is why "💩".length is 2.
(U+1F4A9 PILE OF POO again.)
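A quick way to see the difference between code units and code points, sketched with standard string methods (the variable name is arbitrary):

```javascript
const poo = "💩";

console.log(poo.length);      // 2, counted in code units
console.log([...poo].length); // 1, the string iterator walks code points

console.log(poo.charCodeAt(0).toString(16));  // d83d (leading surrogate)
console.log(poo.charCodeAt(1).toString(16));  // dca9 (trailing surrogate)
console.log(poo.codePointAt(0).toString(16)); // 1f4a9
```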

Similarly String.fromCharCode() and str.charCodeAt() work with code units, not code points.

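function cp_str(cp) {
  // Code points below U+10000 fit in a single code unit.
  if (cp < 0x10000)
    return String.fromCharCode(cp);
  // Otherwise split the remaining 20 bits into a surrogate pair.
  cp -= 0x10000;
  return String.fromCharCode(
    0xD800 + (cp >> 10),    // leading surrogate: upper 10 bits
    0xDC00 + (cp & 0x3FF)); // trailing surrogate: lower 10 bits
}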
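The opposite direction, from a string back to code points, can be sketched the same way. The name `str_cps` is hypothetical, and passing lone surrogates through unchanged is a choice made for this example rather than the only option.

```javascript
// Split a string into an array of code points using only code unit APIs.
function str_cps(str) {
  const cps = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const next = str.charCodeAt(i + 1); // NaN when past the end
    if (unit >= 0xD800 && unit <= 0xDBFF &&
        next >= 0xDC00 && next <= 0xDFFF) {
      // Valid surrogate pair: combine and skip the trailing surrogate.
      cps.push(0x10000 + (unit - 0xD800) * 0x400 + (next - 0xDC00));
      i++;
    } else {
      // BMP code point, or a lone surrogate passed through as-is.
      cps.push(unit);
    }
  }
  return cps;
}

console.log(str_cps("9€💩").map(cp => cp.toString(16))); // [ '39', '20ac', '1f4a9' ]
```

In current engines, `Array.from(str, c => c.codePointAt(0))` gives the same result.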

Be wary of surrogates in JavaScript, use utf-8 anywhere you can, and thanks for listening!

Questions?
(Or contact me later via annevankesteren.nl or @annevk.)