Strings in Go

Learn the most important aspects of Strings in Go

In previous posts we discussed Runes and Variables. Today, let's continue our study of Go's basic types by learning more about Strings in Go. Since Strings in Go are not as obvious as in your favorite programming language, we recommend exploring this article at your own pace.

Declaring Strings in Go

You probably know how to declare variables in Go. Declaring strings is as simple as:

s := "Hello Gopher" // or...
var s2 string

A string value can be written as a string literal, a sequence of bytes enclosed in double quotes. Strings in Go can also contain Unicode characters:

s2 = "Hello 😀" // s2 was already declared above, so we assign with = instead of :=

We can also index and slice Strings like arrays to access parts of them. For example:

fmt.Println(s[:5]) // "Hello"
fmt.Println(s[6:]) // "Gopher"
fmt.Println(s[:]) // "Hello Gopher"

Concatenating Strings

Concatenating Strings in Go is similar to Java, JavaScript and Python as Go also utilizes the + operator:

s3 := "Goodbye, " + s[6:] // "Goodbye, Gopher"

String concatenation is an expensive operation: each + allocates a new string and copies both operands into it. Avoid using it in loops, as it can hurt the performance of your application.
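One common way to avoid repeated concatenation in a loop is strings.Builder from the standard library. The sketch below is only an illustration: joinWithBuilder is a hypothetical helper name, not something from this post.

```go
package main

import (
	"fmt"
	"strings"
)

// joinWithBuilder (hypothetical helper) joins words with a single space,
// appending to one strings.Builder instead of creating a new string
// on every loop iteration.
func joinWithBuilder(words []string) string {
	var sb strings.Builder
	for i, w := range words {
		if i > 0 {
			sb.WriteByte(' ')
		}
		sb.WriteString(w)
	}
	return sb.String()
}

func main() {
	fmt.Println(joinWithBuilder([]string{"Hello", "Gopher"})) // Hello Gopher
}
```

For a simple case like this one, strings.Join(words, " ") would give the same result; the builder is more useful when the pieces are produced step by step.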

Comparing Strings

Strings may be compared with operators like == and <. Since the comparison is done byte by byte, the result is the natural lexicographic ordering:

name1 := "john"
name2 := "smith"
fmt.Println(name1 == name2) // false
fmt.Println(name1 > name2) // false
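Because the comparison works on bytes, uppercase ASCII letters sort before lowercase ones, and == is case-sensitive. The following sketch shows both effects; equalIgnoreCase is a hypothetical helper that simply wraps strings.EqualFold.

```go
package main

import (
	"fmt"
	"strings"
)

// equalIgnoreCase (hypothetical helper) compares two strings while
// ignoring letter case, using strings.EqualFold.
func equalIgnoreCase(a, b string) bool {
	return strings.EqualFold(a, b)
}

func main() {
	a, b := "Zebra", "apple"
	fmt.Println(a < b) // true: the byte 'Z' (90) is smaller than 'a' (97)

	x, y := "Go", "go"
	fmt.Println(x == y)                // false
	fmt.Println(equalIgnoreCase(x, y)) // true
}
```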

Substrings

Go also allows easy access to parts of your string. Indexes start at 0 and count bytes, not characters. For example:

s[3] - returns the byte at index 3 (the 4th byte of the string)

s[5:] - returns a substring from index 5 until the end

s[:5] - returns a substring from index 0 up to, but not including, index 5

s[2:5] - returns a substring from index 2 up to, but not including, index 5
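Since slice bounds are byte offsets, slicing in the middle of a multi-byte character produces invalid UTF-8. A minimal sketch of the problem and one possible workaround follows; firstRunes is a hypothetical helper, not part of the standard library.

```go
package main

import "fmt"

// firstRunes (hypothetical helper) returns the first n runes of s.
// If s has fewer than n runes, it returns s unchanged.
func firstRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	count := 0
	// range yields the byte offset at which each rune starts.
	for i := range s {
		if count == n {
			return s[:i]
		}
		count++
	}
	return s
}

func main() {
	s := "😀 Gopher"
	fmt.Printf("%q\n", s[:2])            // "\xf0\x9f" (half of the emoji)
	fmt.Printf("%q\n", firstRunes(s, 1)) // "😀"
}
```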

String Length

If you thought that Go's built-in len function returns the number of characters in a string, you're incorrect. As per the official documentation, len over strings returns the number of bytes in the string (not the number of characters). So if your string contains any character that needs more than one byte in UTF-8, len will not match the character count. For example:

s := "Hello, 😊"

fmt.Println(len(s)) // returns 11. Did you expect 8?

To solve the above problem, we should resort to the package unicode/utf8:

s := "Hello, 😊"
fmt.Println(utf8.RuneCountInString(s)) // yes! now we have an 8!

Loops over Strings

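As per the above, loops that process characters should use range instead of indexing up to len. Example:

// incorrect for non-ASCII text: indexes byte by byte, splitting multi-byte characters
for i := 0; i < len(s); i++ {
	fmt.Printf("%d %q\n", i, s[i])
}

// correct: range decodes one rune per iteration; i is the byte offset of that rune
for i, r := range s {
	fmt.Printf("%d\t%q\t%d\n", i, r, r)
}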

Immutability

Another important concept of Strings in Go is that they are immutable. This means that once a string value is created, the byte sequence it contains cannot be changed:

s[7] = 'a' // compiler error: cannot assign to s[7]

But, as expected, a string can be reassigned another value:

s = "Hello Again"
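When a modified version of a string is needed, a common approach is to convert it to a []rune (or []byte), change the copy, and convert back. The sketch below assumes that approach; replaceRuneAt is a hypothetical helper.

```go
package main

import "fmt"

// replaceRuneAt (hypothetical helper) returns a copy of s in which the
// rune at rune index i is replaced by r. The original string is untouched.
// If i is out of range, s is returned unchanged.
func replaceRuneAt(s string, i int, r rune) string {
	runes := []rune(s) // the conversion copies the string's contents
	if i < 0 || i >= len(runes) {
		return s
	}
	runes[i] = r
	return string(runes)
}

func main() {
	s := "Hello, 😊"
	t := replaceRuneAt(s, 7, '😀')
	fmt.Println(s) // Hello, 😊
	fmt.Println(t) // Hello, 😀
}
```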

Escape Sequences

Within a double-quoted string literal, escape sequences that begin with a backslash (\) can be used to insert arbitrary byte values into the string. The most common are:

\a - “alert” or bell
\b - backspace
\f - form feed
\n - newline
\r - carriage return
\t - tab
\v - vertical tab
\' - single quote (only in the rune literal '\'')
\" - double quote (only within "..." literals)
\\ - backslash
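A few of these escapes in action. This is only a small sketch; row is a hypothetical helper used to make the example testable.

```go
package main

import "fmt"

// row (hypothetical helper) joins a key and a value with a tab
// and terminates the line with a newline.
func row(key, value string) string {
	return key + "\t" + value + "\n"
}

func main() {
	fmt.Print(row("name", "Gopher"))
	fmt.Println("She said \"hi\"") // She said "hi"
	fmt.Println("C:\\temp")        // C:\temp
	fmt.Println(len("\t\n\\"))     // 3: each escape sequence here is a single byte
}
```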

Runes, ASCII, Unicode and UTF

And since we're talking Go Strings, Runes, ASCII and Unicode, let's review a little about these topics.

ASCII

ASCII (American Standard Code for Information Interchange) is a character encoding standard created in the 1960s and still widely used. ASCII supports only 128 characters, such as unaccented letters, digits and a few other characters.

Unicode

Due to ASCII's limitations, Unicode was created as a superset of it. Today it defines over 140k characters (and has room for more than a million code points), more than sufficient to handle most of the characters and symbols present in the world. The Unicode standard defines the Unicode Transformation Formats (UTF) UTF-8, UTF-16, and UTF-32, along with several other encodings.

UTF-8

Today, UTF-8 is the most common encoding on the internet. UTF-8 was invented by Ken Thompson and Rob Pike, two of the creators of Go. It uses between 1 and 4 bytes to represent each rune: only 1 byte for ASCII characters, and 2 or 3 bytes for most other runes in common use. The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also a UTF-8 text.
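The variable-length encoding can be observed with utf8.RuneLen from the unicode/utf8 package. In this sketch, byteLengths is a hypothetical helper and the input is assumed to be valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLengths (hypothetical helper) returns, for each rune in s, the number
// of bytes UTF-8 uses to encode it. It assumes s is valid UTF-8.
func byteLengths(s string) []int {
	var lengths []int
	for _, r := range s {
		lengths = append(lengths, utf8.RuneLen(r))
	}
	return lengths
}

func main() {
	// \u00e9 is the letter é written as an escape.
	for _, r := range "a\u00e9世😀" {
		fmt.Printf("%c %U %d byte(s): % x\n", r, r, utf8.RuneLen(r), string(r))
	}
	// a U+0061 1 byte(s): 61
	// é U+00E9 2 byte(s): c3 a9
	// 世 U+4E16 3 byte(s): e4 b8 96
	// 😀 U+1F600 4 byte(s): f0 9f 98 80
}
```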

Unicode Standard Notation

Unicode has a standard notation for code points: U+ followed by the code point in hexadecimal. For example, U+1F600 represents the Unicode character 😀. To print a rune in this notation in Go, use the %U verb.

Printing Runes

Runes are usually printed with the following verbs:

%c: to print the character
%q: to print the character within quotes
%U: to print the value of the character in Unicode notation (U+<value>)

For example:

ascii := 'a'
unicode := '😀'
newline := '\n'
fmt.Printf("%d %[1]c %[1]q\n", ascii) // 97 a 'a'
fmt.Printf("%d %[1]c %[1]q\n", unicode) // 128512 😀 '😀'
fmt.Printf("%U\n", unicode) // U+1F600
fmt.Printf("%d %[1]q\n", newline) // 10 '\n'

Other formats can also be used, including:

%b: base 2
%o: base 8
%d: base 10
%x: base 16, with lower-case letters for a-f
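As a quick illustration of the numeric verbs applied to a rune, consider the sketch below. codePoint is a hypothetical helper that wraps fmt.Sprintf with the %U verb.

```go
package main

import "fmt"

// codePoint (hypothetical helper) formats r in Unicode notation (U+XXXX).
func codePoint(r rune) string {
	return fmt.Sprintf("%U", r)
}

func main() {
	r := 'a'
	fmt.Printf("%b %o %d %x\n", r, r, r, r) // 1100001 141 97 61
	fmt.Println(codePoint('😀'))             // U+1F600
}
```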

Raw String Literals

A raw string literal is written using backticks (`). Within raw string literals, no escape sequences are processed; the contents are taken literally. For example:

s := `{
"name": "john"
}`
fmt.Println(s)

Prints:

{
"name": "john"
}
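Raw string literals are also handy when the text contains many backslashes. The sketch below writes the same value both ways; the path is purely illustrative.

```go
package main

import "fmt"

// The same Windows-style path written as a raw literal and as an
// interpreted literal (the path itself is only an illustration).
const (
	rawPath         = `C:\temp\new`
	interpretedPath = "C:\\temp\\new"
)

func main() {
	fmt.Println(rawPath == interpretedPath) // true
	fmt.Println(len(`\n`), len("\n"))       // 2 1
}
```

In the raw literal, \n is two bytes (a backslash and the letter n); in the interpreted literal it is a single newline byte.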

Standard Library Support

Strings are also widely supported by Go's standard library. The most important packages for manipulating strings are: bytes, strings, strconv, and unicode. We'll study them in future posts but feel free to explore and learn more about them at your own pace.
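As a small preview, here is a sketch that touches three of these packages. countUpper is a hypothetical helper; the other calls are standard library functions. The bytes package offers similar functions for []byte values.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// countUpper (hypothetical helper) counts the uppercase letters in s.
func countUpper(s string) int {
	n := 0
	for _, r := range s {
		if unicode.IsUpper(r) {
			n++
		}
	}
	return n
}

func main() {
	s := "Hello Gopher"
	fmt.Println(strings.ToUpper(s))        // HELLO GOPHER
	fmt.Println(strings.Contains(s, "Go")) // true
	fmt.Println(strings.Fields(s))         // [Hello Gopher]

	n, err := strconv.Atoi("42") // string to int
	fmt.Println(n+1, err)        // 43 <nil>

	fmt.Println(countUpper(s)) // 2
}
```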

Conclusion

In this post we learned a little more about Strings in Go. Since manipulating Strings is an essential part of a programmer's life, understanding their particularities is important to master the Go programming language.

To summarize, here are some important particularities that you should know:

strings in Go are immutable sequences of bytes
strings in Go can contain human-readable text or any other data, including arbitrary bytes
text strings in Go are conventionally interpreted as UTF-8-encoded sequences of Unicode code points (runes)
because Go source files are always encoded in UTF-8, string literals can include Unicode code points directly
strings in Go accept ASCII characters as well as any other Unicode code points
a rune whose value is less than 256 can be written with a single hexadecimal escape (e.g., '\x41' for 'A'), but a \u or \U escape must be used for higher values
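The last point can be checked with a short sketch. The constant names are chosen only for this example.

```go
package main

import "fmt"

// Rune constants written with numeric escapes (names are illustrative).
const (
	letterA  = '\x41'       // hexadecimal escape, value below 256
	world    = '\u4e16'     // 16-bit escape for 世
	grinning = '\U0001F600' // 32-bit escape for 😀
)

func main() {
	fmt.Printf("%c %c %c\n", letterA, world, grinning)          // A 世 😀
	fmt.Println(letterA == 'A', world == '世', grinning == '😀') // true true true
}
```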

Tuesday, January 26, 2021