What is Unicode?
In a nutshell
Unicode (a.k.a. the Unicode Standard) is an encoding standard maintained by the Unicode Consortium that defines a unique code point for ~149,000 characters. It covers 161 modern and historic scripts, symbols, emoji, and non-printing control and formatting codes, such as those produced by the Enter and Delete keys.
A Unicode code point is a number like U+0041 or U+1F600 that represents a character, such as the Latin letter A or the emoji 😀.
Here are some examples of character to code point mappings:
Character | Code point |
A | U+0041 |
B | U+0042 |
a | U+0061 |
b | U+0062 |
Ā | U+0100 |
[ | U+005B |
ช | U+0E0A |
DEL | U+007F |
😊 | U+1F60A |
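The mappings in the table can be reproduced in JavaScript. The sketch below is illustrative only: `toCodePointLabel` is a hypothetical helper name, not something from a library, and it assumes the input string holds a single character.

```javascript
// Hypothetical helper: format the first character of a string in U+XXXX notation.
function toCodePointLabel(char) {
  const codePoint = char.codePointAt(0);            // numeric code point
  const hex = codePoint.toString(16).toUpperCase(); // hexadecimal digits
  return 'U+' + hex.padStart(4, '0');               // pad to at least 4 digits
}

// Rebuild a few rows of the table above.
const rows = ['A', 'a', 'Ā', 'ช', '😊'].map(
  char => `${char} | ${toCodePointLabel(char)} |`
);
console.log(rows.join('\n'));
```

Padding to four digits matches the usual way code points are written: short values get leading zeros, while longer ones such as 1F60A are left as they are.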
The U+ prefix just identifies the number as a Unicode code point. The number after the prefix is in hexadecimal.
For example, the 😊 character’s code point is U+1F60A. The hexadecimal portion of the code point, 1F60A, is 128522 in decimal.
This means the 😊 emoji can be stored digitally as the number 128522. Thanks to the standardisation of Unicode, a computer will know that the number 128522 should be displayed as the 😊 emoji. This kind of mapping between characters and numbers makes it easier to encode all characters as numbers (and ultimately binary) and then store or transmit them digitally.
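The conversion between the hexadecimal code point, its decimal value, and the character itself can be checked in JavaScript. This is a minimal sketch using only built-in functions; the variable names are just for illustration.

```javascript
const hex = '1F60A';                                  // the part after "U+"
const decimal = parseInt(hex, 16);                    // hexadecimal -> decimal
const char = String.fromCodePoint(decimal);           // decimal -> character
const backToHex = decimal.toString(16).toUpperCase(); // decimal -> hexadecimal

console.log(decimal);   // 128522
console.log(char);      // 😊
console.log(backToHex); // 1F60A
```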
You can use this website to browse through all the Unicode characters. For example, here’s the Unicode info for the Latin character “a”. You can use this website to convert characters to their decimal representations.
Get Unicode code point for a character using JavaScript
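Most programming languages let you convert between characters and Unicode code points. For example, JavaScript’s String.prototype.charCodeAt() method returns the UTF-16 code unit at a given index, which equals the code point for any character up to U+FFFF:
'a'.charCodeAt(0) // 97
'b'.charCodeAt(0) // 98
For characters above U+FFFF, such as emoji, use String.prototype.codePointAt() instead. Iterating with Array.from keeps such characters whole:
function strToUnicode(str) {
  // Array.from walks the string by code point, not by UTF-16 code unit
  return Array.from(str, char => char.codePointAt(0))
}
strToUnicode('hello') // [104, 101, 108, 108, 111]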
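The difference between a UTF-16 code unit and a code point only shows up for characters above U+FFFF. The sketch below compares the two for the 😊 emoji, which JavaScript stores as two code units (a surrogate pair); the variable names are illustrative.

```javascript
const emoji = '😊'; // U+1F60A, stored as two UTF-16 code units

// split('') cuts the string into individual code units.
const codeUnits = emoji.split('').map(unit => unit.charCodeAt(0));

// Array.from iterates by code point, so the emoji stays in one piece.
const codePoints = Array.from(emoji, char => char.codePointAt(0));

console.log(emoji.length); // 2
console.log(codeUnits);    // [55357, 56842]
console.log(codePoints);   // [128522]
```

Neither 55357 nor 56842 is the emoji's code point on its own, which is why code that needs real code points is usually better served by codePointAt().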
Other notes
- Unicode is implemented in most modern operating systems and programming languages.
- TODO: What’s the difference between UTF-8, UTF-16, UTF-32?