My Favorite Bugs: Invalid Surrogate Pairs

5/16/2026 at 4:24:48 PM

A CRDT library working at the code unit level? Ouch. Of course that’s going to go wrong, it was inevitable.

As for using extended grapheme clusters, it sounds a little bit iffy—maybe possible to use correctly, maybe not, because they’re not stable over time. That style of thing has created some fascinating bugs, like (a few years ago) index corruption in PostgreSQL due to collation changes.

Unicode scalar values are technically-safe: you can’t introduce invalid Unicode. But you can definitely still end up with nonsense.
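
A minimal JavaScript sketch of that distinction, assuming well-formed input; the helper names are made up for illustration and are not from the article. Slicing by code unit can leave a lone surrogate, while slicing by scalar value stays well-formed but can still cut a flag in half:

```javascript
// Hypothetical helpers for illustration; not from the article or any library.

// True if the string contains no lone surrogates. With the `u` flag the
// regex walks code points, so a properly paired surrogate never matches.
function isWellFormed(s) {
  return !/[\uD800-\uDFFF]/u.test(s);
}

// Slice by Unicode scalar value instead of by UTF-16 code unit.
function sliceByCodePoint(s, start, end) {
  return Array.from(s).slice(start, end).join("");
}

// Regional indicators D + E, normally rendered as a single flag.
const flag = "\u{1F1E9}\u{1F1EA}";

const byUnit = flag.slice(0, 1);               // half of a surrogate pair
const byScalar = sliceByCodePoint(flag, 0, 1); // one whole regional indicator

console.log(isWellFormed(byUnit));   // false
console.log(isWellFormed(byScalar)); // true, yet it is no longer a flag
```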

> We made emoji an atomic node type.

That avoids problems for emoji, but leaves the underlying hazard untouched. I imagine it could still theoretically occur with other text, probably CJK. But probably only theoretically.

> This splits by grapheme clusters rather than code units. No orphaned surrogates, no split emoji. It's what .slice() should have been doing all along, but of course UTF-16 predates emoji by decades.

I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split.

UTF-16 was an unforced error (and I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough). But the concept of multiple scalars contributing to the logical unit was always inevitable.

by chrismorgan

5/16/2026 at 5:10:17 PM

> I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough

Surely certain people did know, but those people weren't in a position to do anything about it.

Specifically, there were surely people who knew that because historical Chinese place names, Japanese nicknames, and so on, were not included in the original "Unicode" (it wasn't called UCS-2 yet) it was insufficient for complete expression of Asian languages.

There were also many people who objected to Han unification, which is a different problem.

But all of these objections were discarded because of the overwhelming mandate for a fixed-width encoding. The original "Unicode" was conceived as a "16-bit" initiative. Its 16-bit-ness was an essential aspect of the design and the Unicode Consortium did what they had to do to fit all scripts and characters "in modern use" into 16 bits.

From the Wikipedia article on Han Unification[1]:

> Some of the controversy stems from the fact that the very decision of performing Han unification was made by the initial Unicode Consortium, which at the time was a consortium of North American companies and organizations (most of them in California), but included no East Asian government representatives. The initial design goal was to create a 16-bit standard, and Han unification was therefore a critical step for avoiding tens of thousands of character duplications.

[1] https://en.wikipedia.org/wiki/Han_unification

by rectang

5/16/2026 at 6:36:54 PM

Han unification predates Unicode by about a decade; most of the early work in Unicode largely consists of copy-pasting the Japanese and Chinese governments' standards for unified CJK ideographs. Indeed, read some of the early histories of Han unification (e.g., https://www.unicode.org/versions/Unicode16.0.0/core-spec/app...), and you'll notice that there's a lot of liaising with East Asian technology groups in East Asian cities going on. I don't think any East Asian government representatives would have actually objected to Han unification!

It's also worth noting that the original goal of Unicode wasn't to be able to faithfully represent all text, but rather to faithfully represent existing character sets. Only later do you get the impetus to actually include everything, as people become a lot less tolerant of "computer can't actually represent <X>" scenarios. Note too that a lot of the Han unification criticisms basically fall into the same bucket as, say, Medievalists, who want to preserve certain details of their source texts more faithfully than was the norm for computer systems in the 1980s.

by jcranmer

5/17/2026 at 11:25:53 AM

There was never an adequate safety margin for anything but immediate (less than five year horizon) use—even at Unicode 1.1 it was more than half full, and they knew they weren’t done. And yet all kinds of major companies put all their eggs in that basket, and then doubled down with the monstrosity that is UTF-16, rather than backing out and going with UTF-8 instead, even though I strongly suspect it would have been easier for everyone involved in most cases, compared to the whole wchar shemozzle. Instead it took Windows twenty-five years to bridge the gap with a UTF-8 codepage (65001) that actually worked.

by chrismorgan

5/16/2026 at 5:02:32 PM

>I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split.

Yeah, I think that's fair. I didn't really think this through as I was writing it.

I'm not even so sure "ending up with nonsense" here is the worst outcome. It might be unavoidable with this approach and if that had been the only problem this bug might have been less memorable.

The real problem, which I mentioned but didn't articulate/emphasize particularly well, was that these invalid surrogate pairs were getting passed into `encodeURIComponent` somewhere deep in the stack, which choked catastrophically on them. That was the "real" bug at the end of the day, but the invalid surrogate pairs and the way they were getting created on the way were a fun journey to untangle.

by georgemandis

5/16/2026 at 6:41:43 PM

> UTF-16 was an unforced error (and I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough).

ISO 10646 (“Universal Coded Character Set”) planned for 31-bit code points from the start (128 groups of 256 planes of 256 rows of 256 cells, with UCS-4 as a four-byte encoding), around 1989. Unicode, on the other hand, was a parallel effort initiated by Xerox and Apple a few years earlier, with more pragmatic aims, defining a 16-bit character set (but no encoding) that would allow round-tripping of existing character sets. For Unicode 1.1, it was decided to align it with ISO 10646 and make it coincide with the latter’s first plane (the BMP) and UCS-2. In Unicode 2.0, surrogate pairs and the UTF-16 encoding were introduced to allow future expansion to additional planes, in a way that would be compatible with existing implementations. Only with Unicode 3.1 in 2001, five years after Unicode 2.0 and ten years after Unicode 1.0, were actual characters assigned beyond the BMP.

History is complicated; aims, incentives, and constraints change over time.

by layer8

5/16/2026 at 8:10:01 PM

> I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split.

Maybe a simpler argument against this idea is that the definition of an extended grapheme cluster changes between versions of Unicode. The relevant standard is on its 47th revision (not all of which change extended grapheme clusters, but many do): https://www.unicode.org/reports/tr29/

by ucarion

5/16/2026 at 2:56:45 PM

Just noticed this is getting some traffic! It's a little buried in the post, but I made an interactive tool for exploring surrogate pairs as part of this:

- https://george.mand.is/invalid-surrogate-pairs/

I thought it was something that's easier to play with and feel than necessarily just read about.

by georgemandis

5/16/2026 at 2:41:14 PM

Once I ran into this it became hard to treat strings “normally” in any situation or, alternatively, I’d force hard encoding requirements in the domain. Regardless, handling grapheme clusters properly is hard and easy to get wrong.

I recently ported a program from Python to Rust and the original author used string regexes. Input and output document encoding mattered but the characters that needed to be matched were always lower ASCII. The Python program could have used binary regexes, but instead forced an input encoding (UTF-8) and made the user choose an output encoding. When the input comes from an unknown process or legacy data, however, you don’t always get the luxury of assuming the encoding. Switching to binary regexes and ignoring encoding altogether simplified logic, eliminated classes of errors, and made the program work in scenarios it couldn’t earlier. Getting rid of the last decoding/encoding code gave me so much relief, especially when all of the wacky encoding tests I had already written continued to work.

by jonhohle

5/16/2026 at 3:46:30 PM

You are reminding me we also circled an issue at one point where a backend system in Python needed to agree on the same character count length of a piece of content as the client (JavaScript). Another place Intl.Segmenter would've helped.

If I'm remembering correctly, we briefly explored a solution where we told Python "This is a UTF-16LE encoded string" so the count would match, but I think we learned/realized the endianness is actually dictated by the client's machine (Going from memory here). Ultimately we just changed the solution so the client was the source of truth about lengths and counts.
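
A rough sketch of where the counts diverge. It assumes the Python side calls `len()` on a `str`, which counts code points; the function names are invented for illustration. Note that the number of UTF-16 code units is the same for either byte order, since endianness only changes how each 16-bit unit is laid out in bytes.

```javascript
// Illustrative helpers; the names are not from any library.

// What JavaScript's String.prototype.length reports: UTF-16 code units.
function countCodeUnits(s) {
  return s.length;
}

// What Python 3's len(str) reports: code points.
// The string iterator walks code points, so spreading gives one entry each.
function countCodePoints(s) {
  return [...s].length;
}

// What a user would probably call "characters": extended grapheme clusters.
function countGraphemes(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(s)].length;
}

const text = "hi \u{1F920}"; // "hi " followed by the cowboy emoji

console.log(countCodeUnits(text));  // 5
console.log(countCodePoints(text)); // 4
console.log(countGraphemes(text));  // 4
```

Whichever count is chosen, both sides have to agree on it explicitly, which is presumably why making the client the source of truth was the simpler fix.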

These threads are surfacing all kinds of things I forgot about and didn't add in that blog post. Maybe I need to write another, haha.

by georgemandis

5/16/2026 at 5:59:32 PM

I have a similar, Unicode-related “favorite bug”.

We were expanding our product to a new language that used non-ASCII code points. Part of the system involved invoking binaries using text as input.

Locally, everything worked great. Once deployed, we got corrupted text output. As soon as we SSH’d on to the server to inspect, everything started working again.

It turns out that SSH servers can modify the LANG environment variable. The default value on our servers didn’t support Unicode, but LANG was updated as soon as we connected via ssh. It was a head scratcher for sure.

by dimes

5/16/2026 at 8:18:26 PM

I hit a similar problem in an application that was performing non-unicode-aware string operations on user-submitted text in a typescript codebase. The data couldn't be processed by an external service that was expecting valid Unicode. My fix was using toWellFormed: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
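
A minimal sketch of that kind of fix, not the actual code from the application. `sanitizeForTransport` is a made-up name, and the regex fallback is an assumption for runtimes that predate `toWellFormed`:

```javascript
// Sketch: replace lone surrogates with U+FFFD before handing text to an API
// that requires valid Unicode. `sanitizeForTransport` is a made-up name.
function sanitizeForTransport(s) {
  if (typeof s.toWellFormed === "function") {
    return s.toWellFormed(); // ES2024
  }
  // Fallback for older runtimes: with the `u` flag this class only matches
  // surrogates that are not part of a valid pair.
  return s.replace(/[\uD800-\uDFFF]/gu, "\uFFFD");
}

const broken = "ok \uDD20\uD83E"; // low surrogate first: not a valid pair

let threw = false;
try {
  encodeURIComponent(broken);
} catch (err) {
  threw = err instanceof URIError;
}
console.log(threw); // true

console.log(encodeURIComponent(sanitizeForTransport(broken)));
// "ok%20%EF%BF%BD%EF%BF%BD"
```

The trade-off is that the replacement is lossy: the original code units are gone, so this belongs at the boundary with the external service rather than in the stored data.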

by denzen

5/16/2026 at 4:30:07 PM

Windows allows unpaired surrogates in filenames, which are invalid in UTF-16. Likewise, Linux allows invalid UTF-8 byte sequences in filenames.

Because invalid UTF-16 strings could show up in places within Windows, someone made a UTF-8 variant called "WTF-8", which allows unpaired surrogates to survive a round trip.

by Dwedit

5/16/2026 at 7:51:24 PM

Indeed, Linux allows anything but "/" and "\0" in filenames. These days it's reasonable to refuse filenames that aren't valid UTF-8. But one must always validate first!
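
One way to do that validation in JavaScript, as a sketch: decode the raw bytes with a `TextDecoder` in fatal mode. `isValidUtf8` is an illustrative name, and how the raw filename bytes are obtained is left out here.

```javascript
// Sketch: check whether raw filename bytes are valid UTF-8 before treating
// them as text. `isValidUtf8` is an illustrative name, not a built-in.
function isValidUtf8(bytes) {
  try {
    // fatal: true makes decode() throw a TypeError instead of silently
    // substituting U+FFFD for malformed sequences.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

console.log(isValidUtf8(Uint8Array.of(0x66, 0xc3, 0xbc))); // true: "f" + "ü"
console.log(isValidUtf8(Uint8Array.of(0x66, 0xff)));       // false: 0xFF never appears in UTF-8
```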

by bombela

5/17/2026 at 10:08:18 AM

> Indeed, Linux allows anything but "/" and "\0" in filenames.

For what it’s worth, NT allows any 16-bit quantity but L'\\' (0x005C) in filenames (even nulls); it’s the Win32 layer on top of it that imposes all the other weird restrictions and mappings.

by mananaysiempre

5/17/2026 at 1:24:56 PM

The NT Object Namespace itself indeed has no restrictions on filename characters except for "\", but once you reach a real filesystem like NTFS or FAT, the forbidden characters continue to be blocked.

https://projectzero.google/2016/02/the-definitive-guide-on-w...

by Dwedit

5/16/2026 at 3:09:17 PM

Writing property tests on functions that work with strings is a good way to find lots of Unicode issues.
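
A hand-rolled sketch of what such a property test could look like, without any library (real property-testing libraries also shrink counterexamples, which this does not). The property checked is "truncating a well-formed string gives a well-formed string"; every name below is invented for illustration.

```javascript
// A hand-rolled property test, with no library, as a sketch only.
// Property: truncating a well-formed string must give a well-formed string.

// With the `u` flag this only matches surrogates that are not validly paired.
const hasLoneSurrogate = (s) => /[\uD800-\uDFFF]/u.test(s);

// Two candidate implementations to test.
const truncateNaive = (s, n) => s.slice(0, n); // counts code units
const truncateByCodePoint = (s, n) => Array.from(s).slice(0, n).join("");

// Small deterministic generator (a linear congruential generator) so that
// failures are reproducible.
function makeRandom(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 2 ** 32;
  };
}

// A mix of ASCII, Latin-1, CJK, and two astral code points.
const ALPHABET = ["a", "\u00E9", "\u4E2D", "\u{1F920}", "\u{1F1E9}"];

function randomString(rand) {
  const length = 1 + Math.floor(rand() * 8);
  let out = "";
  for (let i = 0; i < length; i++) {
    out += ALPHABET[Math.floor(rand() * ALPHABET.length)];
  }
  return out;
}

// Returns the first counterexample found, or null if the property held.
function findCounterexample(truncate, runs = 200) {
  const rand = makeRandom(42);
  // Start with a known edge case, then add random inputs.
  const inputs = ["a\u{1F920}"];
  for (let i = 0; i < runs; i++) inputs.push(randomString(rand));

  for (const input of inputs) {
    for (let n = 0; n <= input.length; n++) {
      if (hasLoneSurrogate(truncate(input, n))) return { input, n };
    }
  }
  return null;
}

console.log(findCounterexample(truncateNaive));       // fails on "a" + cowboy emoji at n = 2
console.log(findCounterexample(truncateByCodePoint)); // null
```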

by skybrian

5/16/2026 at 3:10:31 PM

Damn, I’ve never really had to deal with Unicode all that much.

Was already bad enough that instead of bytes, we have to worry about code points. Now even that isn’t enough?

It would have been expensive, but all characters should have been fixed size 64bit values.

by BobbyTables2

5/16/2026 at 3:53:44 PM

> It would have been expensive, but all characters should have been fixed size 64bit values

You're making the same mistake that numerous people made before you: thinking that it's as simple as using arrays of large enough numbers. First they thought that two bytes per symbol would be enough, then four. Spoiler alert: it wasn't. And eight won't work either.

by usrnm

5/16/2026 at 4:59:43 PM

Why wouldn't 8 be enough? Surely 18,446,744,073,709,551,616 characters is enough for every writing system in the world.

by 201984

5/16/2026 at 5:20:17 PM

Because that's not how Unicode works. It's not simply a table mapping numbers to all possible symbols. Unicode is full of special codepoints that have no meaning on their own; they serve as modifiers to other symbols, and a single visible symbol can be formed by an arbitrarily (in theory) long combination of such codepoints. It doesn't matter how you encode it, it simply doesn't work as "codepoint -> symbol", and indexing in a Unicode string is never O(1) and cannot be made O(1). Could we use a simple table approach? Maybe. But it wouldn't be Unicode.
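
A small sketch of the point, using the standard `Intl.Segmenter` API; `graphemes` and `graphemeAt` are made-up helpers. One visible symbol is built from 51 code points, and finding the i-th visible symbol means scanning from the start:

```javascript
// Sketch: one user-perceived character built from many code points.
// `graphemes` and `graphemeAt` are illustrative helpers.
function graphemes(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(s), (part) => part.segment);
}

// "e" followed by 50 copies of U+0301 COMBINING ACUTE ACCENT.
const stacked = "e" + "\u0301".repeat(50);

console.log(stacked.length);            // 51 code units (all in the BMP)
console.log(graphemes(stacked).length); // 1 grapheme cluster

// Finding the i-th grapheme requires segmenting from the start: O(n), not O(1).
function graphemeAt(s, index) {
  return graphemes(s)[index];
}

console.log(graphemeAt("a" + stacked + "z", 2)); // "z"
```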

by usrnm

5/16/2026 at 11:24:40 PM

I actually wonder if the combinatorial explosion of attempting to enumerate every possible character combination would exceed 2^64 values. My intuition is that it might, and also such a system would be unworkably unwieldy. The size of the spec document would also suffer from the combinatorial explosion. Imagine a system that tries to encode a unique entry for every possible Zalgo character.

Also, literally nobody wants to use 64 bit values to encode ASCII values. Even in our world of insanely large storage that would be breathtakingly wasteful.

by jandrese

5/16/2026 at 4:20:42 PM

UnicodeV6 - 128 bits per character!

by bombcar

5/16/2026 at 3:50:27 PM

> It would have been expensive, but all characters should have been fixed size 64bit values.

It would have been a non-starter, and then we'd all be dealing with Shift-JIS, BIG5, and FSM knows how many different codepages to this day. UTF-8 is about as elegant as it gets, though Java and JS still managed to fuck that up too (they both encode every codepoint outside the BMP as surrogate pairs in UTF-8)

by chuckadams

5/16/2026 at 4:24:58 PM

That's called CESU-8. https://www.unicode.org/reports/tr26/tr26-4.html

by dasyatidprime

5/16/2026 at 4:34:25 PM

> Java and JS […] both encode every codepoint outside the BMP as surrogate pairs in UTF-8

I can’t comment on Java, but JS I know reasonably well and I can’t think of any place it uses CESU-8.

by chrismorgan

5/16/2026 at 5:52:22 PM

Java doesn’t either.

by layer8

5/16/2026 at 3:42:34 PM

I had an emoji cut in half problem in Dart. I was a bit surprised because I thought substring operations worked on characters. It only caused an invalid Unicode symbol though so not too bad.

by impure

5/16/2026 at 5:55:46 PM

> I thought substring operations worked on characters

"character" turns out to be too vague an idea to correspond to some specific fact about the software. If your co-worker says his uncle is "conservative", does he mean like "Believes Right To Work laws are a good idea" conservative or "Believes Joe Biden is a Communist" conservative?

https://en.wikipedia.org/wiki/Character_(symbol) gives you some idea about this rabbit hole. Suffice to say, no, you can't have operations on "characters" until you've nailed down exactly what it was you meant by that.

by tialaramex

5/16/2026 at 6:11:10 PM

And here's a UTF-8 Playground: https://utf8-playground.netlify.app

by vishnuharidas

5/16/2026 at 2:47:58 PM

it's good to know about surrogate pairs in unicode. It was new to me too when being part of tracking down incomplete unicode flags in the (excellent) phanpy mastodon client.

Author went for Intl.Segmenter too: https://github.com/cheeaun/phanpy/issues/1491

by wupatz

5/16/2026 at 3:20:01 PM

My recollection (that I didn't add to the story): I don't think Intl.Segmenter had great browser support then (2022). Even if it had it still wasn't a quick/obvious fix for our problem with where it was occurring in our stack. But I do remember looking at it then.

by georgemandis

5/16/2026 at 10:09:01 PM

One of the reasons why I love Swift. It has proper, unescapable Unicode String support.

by frizlab

5/16/2026 at 3:26:23 PM

Great write-up. Do most modern languages handle invalid surrogates gracefully, or is it still a "good luck" situation depending on the runtime?

by agus4nas

5/16/2026 at 3:38:54 PM

Modern string libraries largely use UTF-8 [0], and surrogates, regardless of whether they’re paired, are invalid in UTF-8. So, in a modern string library, as built in to most modern languages, you will not encounter surrogates except when translating between encodings.

[0] But everyone disagrees as to what indexing a string means, so you need to make an actual choice if you want anything involving indexing to match across languages.

by amluto

5/16/2026 at 4:05:16 PM

> surrogates, regardless of whether they’re paired, are invalid in UTF-8

Java did not get the memo. Since the char type is fixed at 16 bits, it uses surrogates to encode everything outside the BMP, regardless of the encoding.

by chuckadams

5/17/2026 at 4:22:05 AM

If you use the string methods that work with code points instead of chars, you rarely if ever have to deal with surrogate pairs in Java.

by shawn_w

5/16/2026 at 9:15:01 PM

It depends on the language and/or the libraries used. E.g. in Go, the surrogate problem does not exist, because strings are UTF-8 by convention and individual code points are handled as 32-bit runes; Rust uses UTF-8 too, and it makes sure that you can't cut a string between bytes that belong to the same character.

Fun Java/macOS quirk: macOS normalizes file names, so you can't have two files called ü in the same directory by writing ü once as a single character and once as composing characters. But unfortunately, this only happens on write, not on read, so if you type a ü on a German keyboard (which produces a single character) into a Java source file as part of a file name, the file will be saved with the decomposed name instead, and opening it by the single-character name will then fail because the file is not found.
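
The two spellings of ü can be sketched like this, in JavaScript rather than Java. `sameName` is an illustrative helper that compares strings only; it is not a file system API and says nothing about what a given file system actually does.

```javascript
// Sketch of the two spellings of "ü" described above.
const composed = "\u00FC";    // U+00FC LATIN SMALL LETTER U WITH DIAERESIS
const decomposed = "u\u0308"; // "u" + U+0308 COMBINING DIAERESIS

console.log(composed === decomposed);                  // false
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true

// Comparing names after normalizing both sides to the same form avoids the
// mismatch. `sameName` is an illustrative helper.
function sameName(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(sameName(composed, decomposed)); // true
```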

by RedNifre

5/16/2026 at 3:39:14 PM

The language handled it fine. It will generally just show replacement characters (�) for combos that don't map to anything.

It was really `encodeURIComponent` that didn't handle it gracefully.

If you just type this into the console (surrogate pair for cowboy smiley face emoji), you see it encodes it ("%F0%9F%A4%A0"):

encodeURIComponent("\uD83E\uDD20")

If you give it an invalid surrogate pair, it will throw an actual error:

encodeURIComponent("\uDD20\uD83E")

by georgemandis

5/16/2026 at 4:41:24 PM

No, the language did not handle it fine. It allowed an invalid Unicode string to exist. This is basically a UTF-16 affliction—nothing that does UTF-16 validates, whereas almost everything that does UTF-8 does validate. encodeURIComponent deals with UTF-8, so of course it throws.

by chrismorgan

5/16/2026 at 5:26:24 PM

I'm realizing `encodeURIComponent` is actually part of the ECMA spec! I thought it was something provided by the browser like `window` or `document`. I withdraw my "the language handled it fine" comment, haha.

Before I'd looked that up I was going to say: I feel like "don't allow an invalid Unicode string to exist at all" feels like a separate/bigger problem to me from "handling it fine" when they do get created. To the extent I can hand JavaScript an invalid combination of code units in a variety of other scenarios, returning a � felt fine.

e.g.

// valid
String.fromCodePoint(0xd83e, 0xdd20)

// invalid, but "�" is ... fine?
String.fromCodePoint(0xdd20, 0xd83e)

by georgemandis

5/17/2026 at 11:51:43 AM

In Rust, an invalid Unicode string simply cannot exist (* unless you use unsafe, but all bets are off then). An important part of this is that the code unit, the scalar value and the string are three different types (u8, char, str). Iteration must decide if it wants to go by code unit or by scalar value (… or by extended grapheme cluster, but that’s not provided in std).

JavaScript’s problems start with not having separate code unit or scalar value types. Sequences of UTF-16 code units, individual UTF-16 code units and scalar values all use the type string. (Code unit and scalar value also both use number in some contexts.)

The first step to fixing JavaScript’s bad semantics would be separating the code unit and scalar value types. If you did that… the changes required to support strict strings are perhaps surprisingly small. Even migrating to UTF-8 semantics is not very hard then.

Unfortunately, JavaScript seems very determined to do stupid things and allow stupid things and then do more stupid things with the stupid things it foolishly allowed.

by chrismorgan

5/16/2026 at 3:57:25 PM

In summary, Unicode code points (characters) are 32 bit. JavaScript manipulates Unicode in UTF-16 for historical reasons, because at some point before Unicode, 16 bit was deemed enough (UCS-2). UTF-16 run length encodes Unicode 32 codepoints into one or two code units. Splitting in the middle of a codepoint produces one invalid half string, and one semantically different half string.

Emoji are a sequence of Unicode codepoints producing a single grapheme. Splitting in the middle of a grapheme will produce two valid strings, but with some funky half-baked emoji. So for a text editor it makes sense to split between grapheme boundaries.

by bombela

5/16/2026 at 4:25:26 PM

> Unicode code points are 32 bit

21-bit, actually. It was supposed to be 32-bit, but UTF-16 caps out at 21-bit, so they lopped eleven bits of potential from Unicode (and UTF-8, so no more six-byte encoding).

> at some point before Unicode

No, in the early days of Unicode.

> run length encodes

Um… what? RLE is a data compression thing, UTF-16 has nothing to do with it.

by chrismorgan

5/16/2026 at 5:37:41 PM

> 21-bit, actually. It was supposed to be 32-bit, but UTF-16 caps out at 21-bit, so they lopped eleven bits of potential from Unicode (and UTF-8, so no more six-byte encoding).

Although, conveniently this means that UTF-8 bytes 0xF8 through 0xFF are always nonsense so the third party Rust type `ColdString` uses leading bytes 0xF8 through 0xFF in its 8 bytes of representation to indicate "I am an inline UTF-8 string, but, the UTF-8 starts in the next byte with a total length of N bytes" where N = byte - 0xF8

This leaves the continuation marker bits alone so ColdString can use those in that front byte to indicate "I am not actually inline data, I'm a pointer, rotate me so these indicator bits are my LSB and zero them out to make me a 4 byte aligned pointer".

Which leaves all other 8-byte values for the valid UTF-8 strings, which all begin with either ASCII or a byte between 0xC2 and 0xF4 inclusive.
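
The byte ranges involved can be sketched as a classifier. This is only the general UTF-8 byte layout, written in JavaScript with made-up names; it is not ColdString's code or its API.

```javascript
// Sketch: classify a single byte by the role it can play in UTF-8.
// `classifyUtf8Byte` and `canStartUtf8String` are illustrative names.
function classifyUtf8Byte(byte) {
  if (byte <= 0x7f) return "ascii";
  if (byte <= 0xbf) return "continuation"; // 0x80-0xBF, never a first byte
  if (byte <= 0xc1) return "invalid";      // 0xC0-0xC1 would be overlong
  if (byte <= 0xdf) return "lead-2";       // starts a 2-byte sequence
  if (byte <= 0xef) return "lead-3";       // starts a 3-byte sequence
  if (byte <= 0xf4) return "lead-4";       // starts a 4-byte sequence
  return "invalid";                        // 0xF5-0xFF, which covers 0xF8-0xFF
}

// A valid, non-empty UTF-8 string can only start with these kinds of byte.
function canStartUtf8String(byte) {
  const kind = classifyUtf8Byte(byte);
  return kind === "ascii" || kind.startsWith("lead-");
}

console.log(canStartUtf8String(0x41)); // true
console.log(canStartUtf8String(0xc2)); // true
console.log(canStartUtf8String(0xf8)); // false, so it is free to use as a tag
```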

by tialaramex

5/16/2026 at 5:12:45 PM

>> Unicode code points are 32 bit

> 21-bit, actually

Less than that. https://en.wikipedia.org/wiki/Code_point#In_character_encodi...:

“The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 2¹⁶) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112”

That makes it log(1,114,112)/log(2) bits. That’s about 20.09.

(https://www.unicode.org/versions/Unicode17.0.0/ assigns 159,801 of them to characters)
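
The arithmetic can be checked with a short sketch; the function names are invented for illustration. It also shows where the 0x10FFFF ceiling comes from: a surrogate pair carries 20 bits, which adds 2^20 code points on top of the 2^16 in the BMP.

```javascript
// Size of the Unicode code space: 17 planes of 65,536 code points.
const CODE_SPACE = 17 * 65536;
console.log(CODE_SPACE);            // 1114112
console.log(Math.log2(CODE_SPACE)); // a little under 20.09

// UTF-16 surrogate-pair arithmetic for a code point above U+FFFF.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // at most 20 bits remain
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

// The cowboy emoji U+1F920 from elsewhere in the thread.
console.log(toSurrogatePair(0x1f920).map((u) => u.toString(16))); // ["d83e", "dd20"]

// The largest possible pair gives the last code point.
console.log(fromSurrogatePair(0xdbff, 0xdfff).toString(16)); // "10ffff"
```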

by Someone

5/17/2026 at 9:03:15 AM

Sorry, I was thinking of 0x1FFFFF as the end, but it’s 0x10FFFF. Forgetful.

by chrismorgan

5/16/2026 at 7:21:27 PM

Don't know why you are being downvoted (or my grandparent comment for that matter). You are very correct.

by bombela

5/16/2026 at 7:18:33 PM

If you are going to be pedantic, go all the way. 2^21 is 0 to 2_097_151. Unicode codepoint range is 0 to 1_114_111, slightly more than 2^20 (0 to 1_048_575).

I would argue that Unicode v2 onward, circa 1996 (the Unicode Consortium and ISO/IEC working together), is what anybody knows as Unicode, with the 0 to 1_114_111 codepoints easily manipulated as a 32 bit value.

I meant variable length encoding; RLE encodes a number of successive repetitions indeed.

by bombela