If you are new to Go, you have probably seen the word rune being used. But would you be able to say precisely what it is?
Runes in Go
A rune in Go is an alias for the type int32 which, by convention, holds a Unicode code point. A code point is a numerical value that usually represents a single character but can also have other meanings, such as formatting. With UTF-8 encoding, each code point is encoded as a sequence of one to four bytes.
For example, the rune literal 'a' has the value 97, which is both its ASCII code and the Unicode code point U+0061. In summary, a rune in Go is:
- A type: the predeclared identifier rune is an alias for int32
- By convention, a value holding a Unicode code point
- Often, but not always, what we think of as a character
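The summary above can be checked with a minimal sketch. The helper name codePoint is made up for this illustration; it returns a rune as an int32 without any conversion, which only compiles because the two names refer to the same type.

```go
package main

import "fmt"

// codePoint is a hypothetical helper for this example. It returns r
// unchanged: no conversion is needed because rune is an alias for int32.
func codePoint(r rune) int32 {
	return r
}

func main() {
	r := 'a' // a rune literal; r gets the default type rune

	fmt.Printf("%T\n", r)     // int32
	fmt.Println(codePoint(r)) // 97
	fmt.Printf("%U\n", r)     // U+0061
}
```

Note that %T reports int32 rather than rune, since the alias leaves no separate type behind.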
Rune Literals
Another important point to remember is that Go source code is encoded as UTF-8, meaning that string literals are UTF-8 encoded by default. A rune literal is written as a single character within single quotes.
A rune literal can contain any ASCII character or any other Unicode code point, written either directly or with a numeric escape. For example, a rune whose value is less than 256 can be written with a single hexadecimal escape (e.g., '\x41' for 'A'), but when escaping higher values a \u or \U escape must be used.
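As a short sketch of these spellings, the hypothetical helper spellingsOfA below writes the letter A five different ways, and main prints two code points that are too large for a \x escape.

```go
package main

import "fmt"

// spellingsOfA is a hypothetical helper for this example. Every element
// is the same rune, U+0041, written with a different literal syntax.
func spellingsOfA() []rune {
	return []rune{
		'A',          // the character itself
		'\x41',       // hexadecimal escape, two hex digits
		'\101',       // octal escape, three octal digits
		'\u0041',     // \u with exactly four hex digits
		'\U00000041', // \U with exactly eight hex digits
	}
}

func main() {
	for _, r := range spellingsOfA() {
		fmt.Printf("%c %d\n", r, r) // prints "A 65" five times
	}

	// Values of 256 and above cannot use \x; use \u or \U instead,
	// or type the character directly.
	euro := '\u20AC'
	grin := '\U0001F600'
	fmt.Printf("%c %c\n", euro, grin)
}
```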
ASCII, Unicode and UTF
And since we're talking about ASCII and Unicode, let's look at how they differ and why it matters.
ASCII
ASCII (abbreviated from American Standard Code for Information Interchange) is a character encoding standard for electronic communication. Its development started in the 1960s and it is still widely used today.
But ASCII is limited to only 128 characters (7 bits, with code points ranging from 0 to 127), which means that it only has room for unaccented letters, numbers and a few other characters, leaving out accents and most of the characters used by Eastern languages.
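To make the 0 to 127 range concrete, here is a minimal sketch. The function isASCII is a name chosen for this example; it relies on the constant unicode.MaxASCII from the standard library, which is the largest ASCII value.

```go
package main

import (
	"fmt"
	"unicode"
)

// isASCII is a hypothetical helper for this example. It reports whether
// r falls in the 7-bit ASCII range, 0 to 127.
func isASCII(r rune) bool {
	return r >= 0 && r <= unicode.MaxASCII
}

func main() {
	for _, r := range []rune{'a', '~', 'é', '日'} {
		fmt.Printf("%q ASCII? %t\n", r, isASCII(r))
	}
}
```

The first two runes are reported as ASCII and the last two are not, which is exactly the gap Unicode was created to fill.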
Unicode
For that reason, a new standard called Unicode was created as a superset of ASCII. It defines over 140,000 characters (with room for more than a million code points), more than enough to cover the characters of the world's languages, plus new ones if necessary.
Unicode can be implemented by different character encodings. The Unicode standard defines the Unicode Transformation Formats (UTF) UTF-8, UTF-16, and UTF-32, and several other encodings exist. The most commonly used encodings are UTF-8, UTF-16, UCS-2 and GB18030, which is standardized in China and implements Unicode fully even though it is not an official Unicode standard.
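To illustrate that one code point can have several encodings, the sketch below encodes the same rune as UTF-8 and as UTF-16 using the standard packages unicode/utf8 and unicode/utf16. The helper name encodeBoth is an assumption made for this example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodeBoth is a hypothetical helper for this example. It returns the
// UTF-8 bytes and the UTF-16 code units that encode the code point r.
func encodeBoth(r rune) ([]byte, []uint16) {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest UTF-8 sequence
	n := utf8.EncodeRune(buf, r)
	return buf[:n], utf16.Encode([]rune{r})
}

func main() {
	b, u := encodeBoth('😀') // U+1F600

	fmt.Printf("UTF-8:  % x\n", b)  // f0 9f 98 80
	fmt.Printf("UTF-16: %04x\n", u) // [d83d de00]
}
```

The code point is the same in both cases; only the bytes used to store it differ.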
UTF-8
Today, UTF-8 is the most common encoding on the internet. UTF-8 was invented by Ken Thompson and Rob Pike, two of the creators of Go, and is now a Unicode standard. It uses between one and four bytes to represent each rune: a single byte for ASCII characters, and two to four bytes for everything else. The first 128 Unicode code points are the ASCII characters, which means that any ASCII text is also valid UTF-8 text.
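The variable width is easy to observe with the standard unicode/utf8 package. In this minimal sketch, byteLen is a hypothetical wrapper around utf8.RuneLen, which returns the number of bytes needed to encode a rune.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLen is a hypothetical helper for this example. It returns how many
// bytes UTF-8 needs to encode r.
func byteLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	for _, r := range []rune{'a', 'é', '€', '😀'} {
		fmt.Printf("%c (%U): %d byte(s)\n", r, r, byteLen(r))
	}

	// len counts bytes, not runes, so the two numbers differ as soon
	// as a string contains a non-ASCII character.
	s := "h\u00e9llo" // "héllo"
	fmt.Println(len(s), utf8.RuneCountInString(s)) // 6 5
}
```

The four runes in the loop take 1, 2, 3 and 4 bytes respectively.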
Unicode Standard Notation
Unicode has a standard notation for code points: U+ followed by the code point value in hexadecimal. For example, U+1F600 represents the Unicode character 😀. To print a rune in this notation in Go, use the %U format verb.
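A minimal sketch of the %U verb follows. The helper name notation is made up for this example; the %#U variant shown in main is a documented fmt flag that appends the quoted character.

```go
package main

import "fmt"

// notation is a hypothetical helper for this example. It formats r in
// the standard U+XXXX notation.
func notation(r rune) string {
	return fmt.Sprintf("%U", r)
}

func main() {
	fmt.Println(notation('😀')) // U+1F600
	fmt.Println(notation('a'))  // U+0061

	// The # flag adds the character itself after the code point.
	fmt.Printf("%#U\n", '😀') // U+1F600 '😀'
}
```

Values below U+1000 are padded with zeros to at least four hex digits, which is why 'a' prints as U+0061.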
Printing Runes
Runes are usually printed with the following formats:
- %c: to print the character
- %q: to print the character within quotes
- %U: to print the value of the character in Unicode notation (U+<value>)
For example:
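Here is a small sketch of the three verbs listed above applied to the same rune. The function describe is a name invented for this illustration.

```go
package main

import "fmt"

// describe is a hypothetical helper for this example. It formats r with
// the three verbs most often used for runes: %c, %q and %U.
func describe(r rune) string {
	return fmt.Sprintf("%c %q %U", r, r, r)
}

func main() {
	fmt.Println(describe('a')) // a 'a' U+0061
	fmt.Println(describe('é')) // é 'é' U+00E9
}
```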
But other formats can also be used, including:
- %b: base 2
- %o: base 8
- %d: base 10
- %x: base 16, with lower-case letters for a-f
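Because a rune is just an integer, the numeric verbs above work on it directly. The sketch below uses a hypothetical helper named bases to print one rune in all four bases.

```go
package main

import "fmt"

// bases is a hypothetical helper for this example. It formats the
// numeric value of r in base 2, 8, 10 and 16, in that order.
func bases(r rune) string {
	return fmt.Sprintf("%b %o %d %x", r, r, r, r)
}

func main() {
	fmt.Println(bases('a')) // 1100001 141 97 61
	fmt.Println(bases('J')) // 1001010 112 74 4a
}
```

The second line shows the lower-case hex digit produced by %x.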
Conclusion
In this post we learned about runes in Go. A rune is an alias for int32 and is equivalent to int32 in all ways. It is used, by convention, to distinguish character values from integer values.