sajad torkamani

In a nutshell

Unicode (a.k.a. The Unicode Standard) is a character encoding standard maintained by the Unicode Consortium that defines a unique code point for roughly 149,000 characters (as of version 15.0). It covers 161 modern and historic scripts, as well as symbols, emoji, and non-visual control and formatting codes such as line feed and delete.

A Unicode code point is a number like U+0041 or U+1F60A that represents a character, such as the Latin letter A or the emoji 😊.

Here are some examples of character to code point mappings:

Character    Code point
A            U+0041
B            U+0042
a            U+0061
b            U+0062
Ā            U+0100
[            U+005B
ช            U+0E0A
DEL          U+007F
😊           U+1F60A

The U+ prefix just identifies the number as a Unicode code point. The number after the prefix is written in hexadecimal.
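As a rough illustration of the notation, the sketch below builds the U+ form of a character's code point. The helper name toUnicodeNotation is made up for this example, and padding to at least four hex digits is assumed to match the convention used in the table above.

```javascript
// Hypothetical helper: formats a character's code point in U+ notation.
function toUnicodeNotation(char) {
  // codePointAt(0) returns the code point as a plain number.
  const hex = char.codePointAt(0).toString(16).toUpperCase()
  // Pad to at least four hex digits, e.g. 41 -> 0041.
  return 'U+' + hex.padStart(4, '0')
}

console.log(toUnicodeNotation('A'))  // U+0041
console.log(toUnicodeNotation('😊')) // U+1F60A
```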

For example, the 😊 character’s code point is U+1F60A. The 1F60A hexadecimal portion of the code point maps to 128522 in decimal.

This means the 😊 emoji can be stored digitally as the number 128522. Thanks to the standardisation that Unicode provides, a computer will know that the number 128522 should be displayed as the 😊 emoji. This kind of mapping between characters and numbers makes it easier to encode all characters as numbers (and ultimately binary) and then store or transmit them digitally.
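The hexadecimal-to-decimal conversion and the number-to-character mapping can be checked with a few lines of JavaScript. This is only a minimal sketch using the built-in parseInt and String.fromCodePoint functions; the variable names are illustrative.

```javascript
// Parse the hexadecimal part of U+1F60A into a decimal number.
const decimal = parseInt('1F60A', 16)
console.log(decimal) // 128522

// Map the number back to the character it stands for.
const emoji = String.fromCodePoint(decimal)
console.log(emoji) // 😊

// Convert the decimal number back to hexadecimal.
const hex = decimal.toString(16).toUpperCase()
console.log(hex) // 1F60A
```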

You can use this website to browse through all the Unicode characters. For example, here’s the Unicode info for the Latin character “a”. You can use this website to convert characters to their decimal representations.

Get Unicode code point for a character using JavaScript

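Most programming languages let you convert between characters and Unicode code points. For example, JavaScript’s String.prototype.charCodeAt() method returns the UTF-16 code unit at a given index of a string. For characters in the range U+0000 to U+FFFF, that number is the same as the character’s code point. Characters beyond that range, such as 😊, take up two code units, so String.prototype.codePointAt() is needed to get their full code point. Using charCodeAt():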

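'a'.charCodeAt(0) // 97
'b'.charCodeAt(0) // 98

// Returns the UTF-16 code unit of each character in the string.
// For characters up to U+FFFF, this equals the Unicode code point.
function strToUnicode(str) {
  return str.split('').map(char => char.charCodeAt(0))
}

strToUnicode('hello') // [104, 101, 108, 108, 111]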
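For characters outside the U+0000 to U+FFFF range, such as emoji, charCodeAt() only returns one half of a surrogate pair. A possible variant, sketched here with the hypothetical name strToCodePoints, uses Array.from and codePointAt() so that each character yields its full code point:

```javascript
// Hypothetical helper: returns the code point of each character in a string.
// Array.from iterates by code point, so surrogate pairs are kept together.
function strToCodePoints(str) {
  return Array.from(str, char => char.codePointAt(0))
}

console.log('😊'.charCodeAt(0))      // 55357 (high surrogate only)
console.log('😊'.codePointAt(0))     // 128522
console.log(strToCodePoints('hi😊')) // [ 104, 105, 128522 ]
```

String.fromCodePoint(128522) goes the other way and returns '😊'.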

Tagged: Computing
