Unicode

Some years have passed since ASCII was invented, so let’s move a few decades forward. By then, ASCII had already become a widely used standard for electronic communication, and most devices understood the mapping. However, as technology spread internationally, new challenges emerged. There were far more languages and alphabets to cover, and the 128 positions in the ASCII table were simply not enough. Worse, the table was already full and fixed, so nothing could be added to it. To make matters more complicated, some older systems expected 7-bit text and treated bytes above 0x7F as suspicious or vendor-specific.

The Problem: Internationalization

As technology expanded globally, the need to represent more languages and symbols became crucial. The solution to extend the character set had to be both efficient and compatible with older systems. Devices couldn’t easily be updated, so any change had to be seamless and largely invisible to users. The last thing anyone wanted was a massive update that would require replacing hardware or forcing businesses to change everything at once.

Big Changes, Invisible to Users

To solve this problem, a smart solution was needed that would extend the character mapping while maintaining backward compatibility with existing devices. The challenge was to ensure that new machines could handle a broader range of symbols, while older machines could still function properly.

The goal was simple: make sure everything “just works.” This required solving the issue without interrupting normal operations. No one wanted to think about how or why this change took place; it simply had to be seamless for users.

Neat Solution and Perfect Hack

The answer came in two layers. Unicode defined a large, shared character set. UTF-8, designed in 1992, gave us a practical way to encode those characters while keeping plain ASCII text unchanged.
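To make the two layers concrete, here is a minimal Python sketch. It assumes nothing beyond the standard library: `ord` reports the Unicode code point (the character-set layer) and `str.encode` produces the UTF-8 bytes (the encoding layer). The helper name `describe` and the sample characters are arbitrary choices for illustration.

```python
def describe(ch: str) -> tuple:
    """Return (code point label, UTF-8 bytes as hex) for one character."""
    code_point = f"U+{ord(ch):04X}"                 # layer 1: the Unicode number
    utf8_hex = ch.encode("utf-8").hex(" ").upper()  # layer 2: the bytes on the wire
    return code_point, utf8_hex

# "A", "é" (U+00E9) and "€" (U+20AC), written with escapes to avoid ambiguity.
for ch in ["A", "\u00e9", "\u20ac"]:
    print(ch, *describe(ch))
```

Note that the plain ASCII letter comes out as a single byte with the same value as its code point, while the other two characters need more bytes than their code point alone would suggest.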

UTF-8 uses one to four bytes per character, depending on the value of its code point. It remains backward-compatible with ASCII: the first 128 code points are encoded as the same single bytes that ASCII uses. Modern Unicode contains far more than the old 128 ASCII positions and keeps growing, but UTF-8 still uses the same one-to-four-byte shape.
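A quick way to see the variable length and the ASCII compatibility is to ask Python's built-in encoder. This is only a sketch; `utf8_length` is a hypothetical helper name, and the four sample characters are arbitrary picks, one from each length class.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes Python's built-in UTF-8 encoder uses for one character."""
    return len(ch.encode("utf-8"))

# "A", "é" (U+00E9), "€" (U+20AC) and an emoji (U+1F600).
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

# Plain ASCII text produces exactly the same bytes under both encodings.
text = "Hello, world!"
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)
```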

If you’re curious about specific Unicode values, you can always check unicode.org for the full character set. But for now, let’s explore the different byte lengths used in UTF-8.

1 Byte

The 1-byte encoding is fully backward-compatible with ASCII. The leading bit is 0, and the remaining 7 bits carry the ASCII value. This allows for 128 values, exactly as defined in ASCII.

Bytes	Binary (Prefix)	Bits	Maximum Value
1	`0` _ _ _ _ _ _ _	7	127
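The leading-bit rule is easy to test in code. The sketch below checks the top bit of each byte; the function names `is_single_byte` and `is_ascii_only` are made up for this example.

```python
def is_single_byte(byte: int) -> bool:
    """True when the leading bit is 0, so the byte is a complete 1-byte character."""
    return (byte & 0b1000_0000) == 0

def is_ascii_only(data: bytes) -> bool:
    """True when every byte in the data uses the 1-byte form."""
    return all(is_single_byte(b) for b in data)

print(is_ascii_only(b"Hello"))                   # every byte is below 0x80
print(is_ascii_only("\u00e9".encode("utf-8")))   # "é" needs bytes above 0x7F
```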

2 Bytes

The 2-byte encoding offers significant expansion. The first byte starts with 110, and the continuation byte starts with 10. Together, they carry 11 useful bits, covering Unicode values from U+0080 to U+07FF.

Bytes	Binary (Prefix)	Bits	Maximum Value
2	`110` _ _ _ _ _  `10` _ _ _ _ _ _	11	2047
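The bit layout of the 2-byte form can be reproduced by hand with shifts and masks. This is an illustrative sketch rather than how any particular library does it; `encode_two_bytes` is a hypothetical name, and real code should simply call `str.encode("utf-8")`.

```python
def encode_two_bytes(code_point: int) -> bytes:
    """Hand-encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the 2-byte range")
    first = 0b1100_0000 | (code_point >> 6)           # 110xxxxx: top 5 bits
    second = 0b1000_0000 | (code_point & 0b11_1111)   # 10xxxxxx: low 6 bits
    return bytes([first, second])

# "é" is U+00E9; the result matches Python's built-in encoder.
print(encode_two_bytes(0xE9).hex(" "))
print(encode_two_bytes(0xE9) == "\u00e9".encode("utf-8"))
```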

3 Bytes

With 3 bytes, UTF-8 covers the rest of the Basic Multilingual Plane, from U+0800 to U+FFFF, except the surrogate range U+D800 to U+DFFF, which is reserved for UTF-16 and is not valid in UTF-8. This is where many common scripts and symbols live.

Bytes	Binary (Prefix)	Bits	Maximum Value
3	`1110` _ _ _ _  `10` _ _ _ _ _ _  `10` _ _ _ _ _ _	16	65535
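The 3-byte form follows the same pattern with one more continuation byte. The sketch below is for illustration only (`encode_three_bytes` is a hypothetical name); it also rejects the surrogate range U+D800 to U+DFFF, which is not valid in UTF-8.

```python
def encode_three_bytes(code_point: int) -> bytes:
    """Hand-encode a code point in U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x800 <= code_point <= 0xFFFF:
        raise ValueError("code point is outside the 3-byte range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    first = 0b1110_0000 | (code_point >> 12)                  # 1110xxxx: top 4 bits
    second = 0b1000_0000 | ((code_point >> 6) & 0b11_1111)    # 10xxxxxx: middle 6 bits
    third = 0b1000_0000 | (code_point & 0b11_1111)            # 10xxxxxx: low 6 bits
    return bytes([first, second, third])

# The euro sign is U+20AC; the result matches Python's built-in encoder.
print(encode_three_bytes(0x20AC).hex(" "))
print(encode_three_bytes(0x20AC) == "\u20ac".encode("utf-8"))
```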

4 Bytes

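The 4-byte encoding covers the supplementary planes, from U+10000 to U+10FFFF. This range includes many emoji, historic scripts, rare CJK characters, and specialized symbols. Although 21 payload bits could hold larger numbers, U+10FFFF is the highest code point Unicode defines, so it is also the upper bound of valid UTF-8 today.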

Bytes	Binary (Prefix)	Bits	Maximum Value
4	`11110` _ _ _  `10` _ _ _ _ _ _  `10` _ _ _ _ _ _  `10` _ _ _ _ _ _	21	1,114,111 (U+10FFFF)
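For completeness, the 4-byte form can be sketched the same way. Again, `encode_four_bytes` is a hypothetical helper written for illustration, and the range check enforces the U+10FFFF ceiling rather than the full 21-bit capacity.

```python
def encode_four_bytes(code_point: int) -> bytes:
    """Hand-encode a code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the 4-byte range")
    first = 0b1111_0000 | (code_point >> 18)                   # 11110xxx: top 3 bits
    second = 0b1000_0000 | ((code_point >> 12) & 0b11_1111)    # 10xxxxxx
    third = 0b1000_0000 | ((code_point >> 6) & 0b11_1111)      # 10xxxxxx
    fourth = 0b1000_0000 | (code_point & 0b11_1111)            # 10xxxxxx: low 6 bits
    return bytes([first, second, third, fourth])

# U+1F600 is an emoji; the result matches Python's built-in encoder.
print(encode_four_bytes(0x1F600).hex(" "))
print(encode_four_bytes(0x1F600) == "\U0001F600".encode("utf-8"))
```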

Summary

In conclusion, UTF-8 is one of the most elegant solutions in the history of computer science. It provides backward compatibility with ASCII while enabling support for a massive number of characters across multiple languages. Today, UTF-8 is the default choice for text interchange on the web and in many modern systems.

Although we live in the 21st century, the spirit of the ASCII era still influences the way we interact with technology. UTF-8 has stood the test of time because it solved a hard migration problem without asking the whole world to reboot at the same time.