Emoji Dissector - Unicode debugger

References/tools:
Wikipedia - input, character list, planes
Unicode consortium - Emoji list (has CLDR names/keywords), Emoji data, UnicodeData.txt
qaz.wtf - converter, other tools
eeemo.net - Zalgo generator

A quick introduction to Unicode

Initially, computers only supported a very basic character set, consisting of 128 or 256 distinct character codes, sufficient for the English language. Later, code pages were introduced, which allowed adding language-specific characters (using codes 128-255) in addition to the characters common to all systems (0-127). However, only one code page could be active at any time, and the same numeric value was mapped to different characters depending on the code page. When a document was interpreted in the wrong code page, language-specific characters would be garbled, and the number of supported characters was still limited. Similar concepts under different names and with custom character maps were introduced on different systems, making it hard to exchange data.

Unicode addresses this problem by allowing a much larger range of codes, providing one universal way to express all languages and characters (sort of). Each of these numbers is called a codepoint (or code point), which roughly (but not exactly) corresponds to a "character" (see below). These code points are usually identified with a hexadecimal number. When you see a hex number above, it is one of these codepoints.

The codepoints are commonly separated into the following planes (ranges of code points):

Basic Multilingual Plane (BMP), ranging from 0x0000 - 0xFFFF and including Latin characters with diacritics like the German ä, CJK (Chinese, Japanese, Korean) symbols, some basic symbols, etc. (but not the Emoji). It also contains a reserved range of surrogates, which don't encode actual characters but are used for UTF-16 encoding, see below.
Supplementary Multilingual Plane (SMP), ranging from 0x10000 - 0x1FFFF. Among other things, this contains many historical scripts, and the Emoji.
Two planes for additional CJK ideographs (0x20000 - 0x3FFFF), one plane for special use, and additional planes for "private use" where organizations can define internally used symbols.
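
Since every plane spans exactly 0x10000 code points, the plane number can be read off directly from the codepoint value. A minimal sketch (the function name plane_of is made up for illustration, and the upper limit 0x10FFFF is the end of the Unicode code space, which is not stated above):

```python
def plane_of(codepoint: int) -> int:
    """Return the plane number (0-16) of a Unicode code point."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane spans 0x10000 code points, so the plane number is
    # whatever remains after dropping the low 16 bits.
    return codepoint >> 16


print(plane_of(ord("\u00e4")))  # ä lives in the BMP (plane 0)
print(plane_of(0x1F4A9))        # Pile of Poo lives in the SMP (plane 1)
```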

Encodings

The large range of Unicode codepoints cannot be expressed as a single byte anymore. Multiple encodings (ways to express those numbers) exist:

UTF-8 is by far the most common encoding for stored and transmitted data. It encodes code points up to 0x7F (decimal: 127) as one byte, code points from 0x80 - 0x7FF as two bytes, 0x800 - 0xFFFF (i.e. up to the end of the BMP) as three bytes, and beyond that as 4 bytes.
UCS-2 (deprecated) encodes each codepoint as two bytes, but as a result, it can only express the first 65536 codepoints - the Basic Multilingual Plane.
UTF-16 is very similar to UCS-2, but can express higher code points as a pair of surrogates, resulting in 2 bytes for codepoints from the BMP, but 4 bytes for higher code points.
UTF-32 encodes each code point as 4 bytes. This makes it the only currently valid fixed-width encoding and simple to interpret, but it also consumes the most space.
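
To see the size differences between these encodings, the following sketch encodes a few sample code points with Python's built-in codecs and counts the resulting bytes. The sample selection and the helper name encoded_sizes are only illustrative.

```python
samples = {
    "A": "A",
    "a with diaeresis": "\u00e4",
    "euro sign": "\u20ac",
    "pile of poo": "\U0001F4A9",
}


def encoded_sizes(ch: str) -> dict:
    """Number of bytes needed for one code point in each encoding."""
    # The "-le" codec variants are used so that no byte order mark is prepended.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for label, ch in samples.items():
    print(f"U+{ord(ch):04X} {label}: {encoded_sizes(ch)}")
```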

The differences and limitations of these encodings are still sometimes visible even in major software products (e.g. bugs that are triggered if you use emoji or other characters beyond the BMP). While the encoded bytes sometimes match the codepoint value (e.g. in UCS-2/UTF-16 for codepoints up to 0xFFFF, codepoint 0x1234 will be encoded as 0x1234), this is not always the case. This website always talks about the codepoint values, not the encoded form! For example, if you look at a text file containing ä (0x00E4) with a hex editor, you will most likely find the UTF-8 encoded form, the two bytes 0xC3 0xA4. To complicate matters more, the byte order (endianness) is not specified for UCS-2, UTF-16 and UTF-32, i.e. there are two variants of each of these (in UTF-16, one variant would write this codepoint as the bytes 00 E4, the other as E4 00). As you can see, correctly encoding and decoding these can be tricky. Leave this to existing libraries - there are many, often security-critical mistakes that can be made.
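
The difference between the codepoint value and its encoded forms can be checked with a few lines of Python. This is just a sketch using the standard codecs; the variable names are arbitrary.

```python
ch = "\u00e4"  # ä, code point 0x00E4

utf8 = ch.encode("utf-8")
utf16_be = ch.encode("utf-16-be")  # big endian: most significant byte first
utf16_le = ch.encode("utf-16-le")  # little endian: least significant byte first

print(hex(ord(ch)))    # the code point value
print(utf8.hex())      # what a hex editor shows for a UTF-8 file
print(utf16_be.hex())
print(utf16_le.hex())
```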

Variable-length encodings like UTF-8 or UTF-16 save space, but can be more expensive to decode and process. For example, it is impossible to determine the length of a string (in codepoints) from the length of the encoded form. It's also no longer possible to skip directly to a certain index (i.e. the n-th codepoint) in a string.
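
For UTF-8, counting codepoints means looking at every byte. The sketch below assumes the input is valid UTF-8 and relies on the fact that continuation bytes always have the bit pattern 10xxxxxx; the function name is made up for this example.

```python
def count_codepoints_utf8(data: bytes) -> int:
    """Count code points in valid UTF-8 by scanning every byte."""
    # Continuation bytes look like 0b10xxxxxx; any other byte starts a new code point.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


text = "3 \u20ac \U0001F4A9"
data = text.encode("utf-8")
print(len(data), count_codepoints_utf8(data))
```

The byte length alone does not tell you the answer: the same number of bytes could just as well hold only ASCII characters.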

Unicode terms

Character can be confusing, and should probably not be used when dealing with Unicode on a technical level. As you will see below, what you can see as one "character" (grapheme) on the screen can be one or multiple codepoints, and some characters can even be encoded either as one or as multiple codepoints!
Codepoint is what comes closest to a "character" from a programmer's perspective in most cases, and is the closest equivalent to "one byte" in an ASCII string. It's one number representing an entry in the Unicode "character table" (but it may be e.g. a combining character, or something invisible). Note that some code points are reserved for usage for special purposes, and don't actually represent a character.
Code unit is the actual encoded unit that is being written. For UTF-8 this is a byte, for UTF-16 it's two bytes together. Multiple code units may be needed to encode one code point.
Grapheme cluster, also known simply as Grapheme, is what comes closest to a "character" from a human perspective. For example, ä would be one grapheme, regardless of whether it is encoded as one codepoint, or as a combination of a with a combining umlaut 0x0308 (compare the two using the buttons above!)
Glyph is the graphical representation of a character, which Unicode neither defines nor cares about.
~~Rune~~ is not a Unicode term, but the name Go (Golang the programming language) uses for a codepoint, because following common conventions seems to be against the principles of Golang.

Understanding the differences between these terms is key to understanding Unicode!

Programming language Unicode handling

Understand the terms above before you continue. When truncating strings, ensure you do not split a code point (generating an invalid encoding) in languages that work on code units! In general, support for graphemes is lacking and you will need additional libraries to avoid splitting them.
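
As an illustration of truncating at the code unit level, here is a minimal sketch that shortens a UTF-8 byte string without cutting a code point in half. It assumes the input is valid UTF-8, the name truncate_utf8 is hypothetical, and it does nothing to keep graphemes intact.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 to at most max_bytes bytes without splitting a code point."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # If the first dropped byte is a continuation byte (0b10xxxxxx), the cut
    # falls inside a multi-byte sequence: back up to its lead byte and drop
    # that code point entirely.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


raw = "a\u20acb".encode("utf-8")  # 1 + 3 + 1 bytes
print(truncate_utf8(raw, 3).decode("utf-8"))
```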

In Python 3, regular strings are Unicode strings, internally stored using 1, 2 or 4 bytes per "character" (codepoint) based on the highest code point (you don't have to care too much about that). String length, iteration and indexing are based on codepoints. Byte strings exist and you need to explicitly encode/decode to get from a regular to a byte string or vice versa. Howto
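
A short sketch of the Python behavior described above, using the word from one of the examples. Slicing the byte string in the middle of a code point is done on purpose here to show what goes wrong.

```python
text = "Ger\u00e4t"            # str: a sequence of code points
raw = text.encode("utf-8")     # bytes: the encoded form

print(len(text), len(raw))     # code points vs. bytes
print(text[3])                 # indexing a str yields a code point

# Cutting the bytes in the middle of the two-byte ä produces invalid UTF-8.
try:
    raw[:4].decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# With errors="replace", the broken sequence becomes the replacement character.
broken = raw[:4].decode("utf-8", errors="replace")
print(broken)
```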

Golang strings are mostly treated as byte strings, which are expected (but not required) to contain valid UTF-8. Length and indexing are based on bytes (in the UTF-8 encoded form, aka code units). Iteration is based on Unicode codepoints (called "runes") with the returned index indicating the start byte (i.e. it is non-contiguous during iteration) and invalid UTF-8 sequences replaced with the 0xFFFD replacement character. Use the unicode/utf8 package for e.g. string length in codepoints. Howto

JavaScript mostly works on two-byte (UTF-16) code units, e.g. for string length, indexing, charAt, charCodeAt. This is equivalent to code points for the BMP, but gives you the individual surrogates (the two halves of a surrogate pair) for codepoints beyond the BMP. The string iterator for (const c of somestring) works on codepoints (but unpaired surrogates are returned individually). codePointAt(n) indexes using UTF-16 code units, and returns Unicode code points (but unpaired surrogates, and the second surrogate of a pair if you point it there, are returned individually).
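
To make the surrogate mechanism concrete, here is a sketch in Python (not JavaScript) that splits a codepoint beyond the BMP into its two UTF-16 surrogates and counts UTF-16 code units, which is the number a JavaScript string length would report. Both function names are made up for this example.

```python
def to_surrogates(codepoint: int) -> tuple:
    """Split a code point beyond the BMP into its UTF-16 high and low surrogate."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP use surrogates")
    offset = codepoint - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def utf16_length(s: str) -> int:
    """Number of UTF-16 code units needed for s."""
    return len(s.encode("utf-16-le")) // 2


high, low = to_surrogates(0x1F4A9)
print(f"{high:04X} {low:04X}", utf16_length("\U0001F4A9"))
```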

The examples, explained

The example buttons you can find above show some peculiarities that make Unicode so complicated. This section explains them.

Composition - the difference between ä and ä

The two buttons that look like ä look exactly the same and represent the same grapheme, but one of them is a single code point while the other consists of two code points, encoding a letter and a combining character. Both are valid forms (composed and decomposed). Unicode defines normalization forms for both (NFC for the composed form, NFD for the decomposed form), and it is possible to automatically convert between them. Software that ignores this runs the risk of considering two strings different when they really are the same, which can lead to security issues or other bugs: For example, if a filesystem or database normalizes a filename or value, it may no longer match what is recorded in some other place.
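
The comparison problem and its fix can be sketched with Python's standard unicodedata module. Normalizing both sides to the same form (NFC here, but NFD works as well) before comparing is one common approach, not the only one.

```python
import unicodedata

composed = "\u00e4"      # ä as a single code point
decomposed = "a\u0308"   # a followed by COMBINING DIAERESIS

print(composed == decomposed)  # a naive comparison sees two different strings


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both into the composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(composed, decomposed))
```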

The pair of buttons with Korean text (감기) are another example of (de)composition.

Controlling line breaks and hyphenation

The 3 € pro Gerät example shows off two invisible characters (as well as an Umlaut and Euro sign). The space between 3 and € is a no-break space, i.e. an automatic line break will never separate the number from the currency symbol. A soft hyphen invisibly separates the syllables of the word Gerät, indicating that software displaying the text can insert a line break while showing a hyphen. Many other such invisible characters exist, e.g. the Zero-width Space (which will allow a line break without a hyphen).
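
Invisible characters are easy to overlook, so it can help to make them visible when debugging. The following sketch replaces space-like and format characters (other than the plain space) with their Unicode names. The helper name reveal and the position of the soft hyphen inside the sample word are assumptions made for this example.

```python
import unicodedata


def reveal(text: str) -> str:
    """Replace invisible space/format characters with their names in angle brackets."""
    out = []
    for ch in text:
        # Zs = space separators, Cf = format characters (soft hyphen, zero width space, ...)
        if ch != " " and unicodedata.category(ch) in ("Zs", "Cf"):
            out.append("<" + unicodedata.name(ch, "UNNAMED") + ">")
        else:
            out.append(ch)
    return "".join(out)


sample = "3\u00a0\u20ac pro Ge\u00adr\u00e4t"
print(reveal(sample))
```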

Codepoint naming

Most codepoints have names, but some just have numbers or identifiers. In the case of the CJK (Chinese, Japanese, Korean) ideographs (symbols) this is likely both due to the large number of them, and because some of them are reused across multiple languages. Take a look at 丣 and the hieroglyphs!
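
Names can be looked up with Python's unicodedata module, which ships its own copy of the Unicode character database. In this sketch, the helper name describe and the fallback text for unnamed codepoints are made up.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single code point."""
    # Some code points (e.g. control characters) have no name; use a placeholder.
    name = unicodedata.name(ch, "<no name>")
    return f"U+{ord(ch):04X} {name}"


for ch in ("\u00e4", "\u4e00", "\U00013000", "\n"):
    print(describe(ch))
```

Note how the CJK ideograph and the hieroglyph get names that are little more than an identifier.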

Pseudofonts

There are sets of characters that are just a variation of Latin alphabet letters. In particular, mathematicians love variables so much that they often run out of letters. When they're tired of (or have exhausted) the Greek alphabet (both uppercase and lowercase), they sometimes turn to letters with additional marks added, bold letters, Fraktur, etc. These characters can be used to create "fonts", bold text etc. in environments that don't support custom formatting but do support Unicode. A converter showing more examples is linked in the tools list. The 𝕱𝖗𝖆𝖐𝖙𝖚𝖗 and pǝʇɹǝʌuı examples show the different ways in which these pseudo-fonts work.
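
One way to see the difference between the two kinds of pseudo-font is compatibility normalization (NFKC): the mathematical letters are defined as font variants of regular letters and fold back to them, while the upside-down text is built from unrelated letters that merely look right. A sketch using escape sequences for the two examples:

```python
import unicodedata

# The Fraktur example, written as MATHEMATICAL BOLD FRAKTUR code points.
fraktur = "\U0001D571\U0001D597\U0001D586\U0001D590\U0001D599\U0001D59A\U0001D597"
# The upside-down example, built from ordinary (unrelated) letters.
upside_down = "p\u01dd\u0287\u0279\u01dd\u028cu\u0131"

print(unicodedata.normalize("NFKC", fraktur))
print(unicodedata.normalize("NFKC", upside_down) == upside_down)
```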

Unusual scripts

Unicode does not only cover modern languages, and the Egyptian Hieroglyphs contain depictions that can be understood (though perhaps not with the same meaning as they had in Ancient Egypt) even without a degree in Egyptology. Since this makes them prone to creative use, some operating systems may censor either the names or the glyphs themselves. If you are using an operating system that displays Unicode correctly, you can find a pictorial representation of a German idiom.

As is shown next, Unicode also supports other ancient scripts. A particularly hilarious example cited in the Unicode standard is listed here. Look up the full story!

Arabic numerals, at least in the Western world, usually refers to Western Arabic numerals, 1234567890. However, Eastern Arabic numerals are still widely used in some parts of the world (and supported by Unicode, of course), with some glyphs looking confusingly similar to Western Arabic ones with a different meaning! So if you want to be pedantic, and a form asks for "arabic numerals" but doesn't do proper input validation, Unicode allows you to use either!
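
Whether this matters for your form depends on how the input is validated. In Python, for instance, both int() and str.isdigit() accept digits from other scripts. The sketch below shows this, together with a stricter check (the function name is hypothetical) for code that really only wants 0-9:

```python
eastern = "\u0661\u0662\u0663"   # ARABIC-INDIC DIGITS ONE, TWO, THREE

print(eastern.isdigit())   # accepted as digits
print(int(eastern))        # and converted like Western digits


def is_western_digits(s: str) -> bool:
    """True only for non-empty strings made of the ASCII digits 0-9."""
    return s.isascii() and s.isdigit()


print(is_western_digits(eastern), is_western_digits("123"))
```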

Emoji

The Emoji come from various sources (dating back to Wingdings, old feature phones, and Japanese mobile carrier standards, which explains some Emoji that are very Japan-specific), and are strewn across multiple blocks and planes of the Unicode code space. Over time, they developed from simple, single codepoint emoji like the famous Pile of Poo into an incredibly complex standard, and some noteworthy examples are shown here.

Skin colors are represented by adding one of 5 skin color modifiers in specific places, which works both for simple emoji like the waving hand and composited emoji like the latter examples. Since these are modifiers, they do not require a Zero Width Joiner, unlike the next example. Not all emoji support skin color: Look at emoji-sequences.txt or emoji-test.txt in the latest release of the Emoji Data (see Unicode Consortium links above) to see which of them do.
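
On the codepoint level, a skin color is simply a second codepoint placed directly after the base emoji. A minimal sketch, with made-up constant and function names, that builds such a sequence and removes the modifiers again:

```python
import unicodedata

WAVING_HAND = "\U0001F44B"
MEDIUM_SKIN_TONE = "\U0001F3FD"   # one of the five modifiers U+1F3FB..U+1F3FF

toned = WAVING_HAND + MEDIUM_SKIN_TONE   # no Zero Width Joiner in between


def strip_skin_tones(text: str) -> str:
    """Remove all skin color modifier code points from text."""
    return "".join(ch for ch in text if not 0x1F3FB <= ord(ch) <= 0x1F3FF)


print(len(toned), [unicodedata.name(ch) for ch in toned])
print(strip_skin_tones(toned) == WAVING_HAND)
```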

As the set of Emoji has grown over time with newer versions of the Unicode standard, some logically similar Emoji are expressed in different ways. For example, some jobs like doctors are expressed with a MAN or WOMAN emoji, combined with a STAFF OF AESCULAPIUS symbol using a Zero Width Joiner. However, a dedicated emoji codepoint already existed for some jobs, so for those, the existing codepoint for the job is instead combined with a gender symbol to generate a gendered emoji. In general, these representations are somewhat backwards compatible: Software that isn't aware of the combination will render the glyphs for the individual characters, which doesn't look as nice, but is understandable.
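
Taking such a sequence apart at the Zero Width Joiners shows what software without support for the combination would fall back to. In this sketch, the doctor sequence is written out by hand and the function name components is made up.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER

# WOMAN + ZWJ + STAFF OF AESCULAPIUS + "render as emoji" variation selector
woman_health_worker = "\U0001F469" + ZWJ + "\u2695\ufe0f"


def components(sequence: str) -> list:
    """Split a ZWJ sequence into its parts and name each code point."""
    return [[unicodedata.name(ch, "?") for ch in part] for part in sequence.split(ZWJ)]


for part in components(woman_health_worker):
    print(part)
```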

Family emojis are similarly combined with Zero Width Joiners, and while skin colors are supported, some devices or programs will render these as four faces next to each other. Take a look at how these render on different devices! Mobile devices tend to be better at handling the more advanced features of Unicode, especially when it comes to emojis.

To trigger this legacy behavior, a ridiculous overuse of various modifiers/components starting with the WOMAN WITH BUNNY EARS emoji (which has earned its own codepoint in Unicode!) exists as a separate example. It's technically correct, but... it's unlikely that a device maker will implement each specific combination of hair, skin and gender for this unique emoji. The Unicode consortium has released standards indicating which combinations should be commonly supported - look at emoji-zwj-sequences.txt or emoji-test.txt in the latest release of the Emoji Data (see Unicode Consortium links above) to see which ones are listed.

Flags are interesting, and again show how similar-looking concepts may be expressed very differently on a technical level in Unicode. For example, nation flags are expressed using regional indicator codepoints spelling out the country code, while the pirate and trans flag are combinations of a flag with a symbol (again maintaining backwards compatibility by causing old clients to render a human-understandable equivalent).
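
The two flag mechanisms can be sketched in a few lines. The function name flag is made up, and the code assumes a two-letter country code; whether a given combination is actually rendered as a flag is up to the displaying software.

```python
REGIONAL_INDICATOR_A = 0x1F1E6   # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code: str) -> str:
    """Spell a two-letter country code with regional indicator code points."""
    code = country_code.upper()
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError("expected a two-letter country code")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


# The pirate flag instead joins a black flag and a skull with a Zero Width Joiner.
pirate_flag = "\U0001F3F4" + "\u200d" + "\u2620\ufe0f"

print(flag("DE"), pirate_flag)
```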

You may have noted that some emoji contain the Variation Selector: Some codepoints can be rendered either as a colored graphical emoji or a black and white, more text-like outline. This can (sometimes) be controlled with the variation selector. The default rendering depends on the codepoint and sometimes also on the software used. The example shows a smiley and a flag with a "render as text" variation selector, without a variation selector, and with a "render as emoji" variation selector. This is not the only use of the variation selectors, but explaining all of them is beyond this site - after all, there is a reason the Unicode standard is over 1000 pages.

In addition to regular emoji, there are various ASCII art style "drawings" like the shruggie or the Lenny face, now empowered with all the other characters made easily accessible through Unicode. Now you can see how those are built!

Bigger than expected

Zalgo is a particular form of abuse of combining diacritics. You can add an astonishing number of them to a single character, often causing it to overflow outside the arranged space, covering other text, creating an impression of dangerous corruption, and disrupting e.g. chat rooms. It's a lot less impressive once you know how it works, but if you work with user generated content, you should design your HTML and CSS so that content from one user cannot cover other users' content.
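
Besides defensive layout, some applications also cap the number of stacked combining marks. The following is only a sketch of that idea: the function name and the limit of two marks are arbitrary assumptions, and some legitimate text may need more marks than a tight limit allows.

```python
import unicodedata


def limit_combining(text: str, max_marks: int = 2) -> str:
    """Drop combining marks beyond max_marks in a row."""
    out = []
    run = 0
    for ch in text:
        # Mn = nonspacing marks, Me = enclosing marks
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue   # skip the excess mark
        else:
            run = 0        # a non-mark character starts a new run
        out.append(ch)
    return "".join(out)


zalgo = "a" + "\u0308" * 5 + "b"
print(len(zalgo), len(limit_combining(zalgo)))
```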

Ligatures are used to make text look nicer: For specific combinations of letters (e.g. fi), simply placing the standard letter forms next to each other doesn't look good, so special decorative combinations have been developed (if you wonder why you sometimes can't copy the letter f from a PDF file... that's why). Computer systems normally use these automatically, but it's also possible to specify them explicitly in Unicode. Ligatures can also get more complex and especially in Arabic, may represent important words or phrases, written in a calligraphic manner.
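
Explicitly encoded ligatures carry a compatibility decomposition that names the letters they stand for, which is useful e.g. when searching text copied from a PDF. A sketch using the LATIN SMALL LIGATURE FI codepoint (the search example is made up):

```python
import unicodedata

ligature = "\ufb01"   # LATIN SMALL LIGATURE FI

print(unicodedata.decomposition(ligature))        # the letters it stands for
print(unicodedata.normalize("NFKC", ligature))    # folded back to plain letters

# A plain substring search misses the word unless the text is folded first.
copied = "pro\ufb01le"
print("profile" in copied, "profile" in unicodedata.normalize("NFKC", copied))
```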

While we're talking about large glyphs: Certain glyphs are very... imposing, either by design (e.g. the Full Block "Block Element", actually meant for text-based graphics, which can be used to make it appear as if text was censored) or simply because that's what they look like in an ancient writing system, e.g. the cuneiform numerals. This shows that it is dangerous to make assumptions about character widths or line heights.

A project by Jan Schejbal