In previous posts we discussed Runes and Variables. Today, let's continue
our study of Go's basic types by learning more about Strings in Go. Since
Strings in Go are not as obvious as in your favorite programming language, we
recommend exploring this article at your own pace.
Declaring Strings in Go
You probably know how to declare variables in Go. Declaring strings is as simple as:
s := "Hello Gopher" // or...
var s2 string
A string value can be written as a string literal, a sequence of bytes
enclosed in double quotes. Strings in Go can also contain Unicode characters:
s2 = "Hello 😀" // plain assignment, since s2 was already declared above
We can also slice Strings, much like arrays, to access parts of them. For example:
fmt.Println(s[:5]) // "Hello"
fmt.Println(s[6:]) // "Gopher"
fmt.Println(s[:]) // "Hello Gopher"
Concatenating Strings
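Concatenating Strings in Go is similar to Java, JavaScript and Python, as Go
also uses the + operator:
s3 := "Goodbye, " + s[6:] // "Goodbye, Gopher"
String concatenation can be an expensive operation: each + allocates a new
string and copies both operands into it. Avoid using it in loops, where the
repeated copying can hurt the performance of your application, and prefer
strings.Builder from the standard library in that case.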
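As a minimal sketch of the loop case, the example below builds a string with strings.Builder instead of repeated +. The helper name joinWords is hypothetical and used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// joinWords is a hypothetical helper: it joins words with a single space
// using strings.Builder, which appends to an internal buffer instead of
// allocating a new string on every iteration.
func joinWords(words []string) string {
	var b strings.Builder
	for i, w := range words {
		if i > 0 {
			b.WriteByte(' ')
		}
		b.WriteString(w)
	}
	return b.String()
}

func main() {
	fmt.Println(joinWords([]string{"Hello", "Gopher"})) // Hello Gopher
}
```

For this particular task strings.Join would also do the job; the Builder version is shown because it generalizes to any loop that appends pieces.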
Comparing Strings
Strings may be compared with operators like == and <. And since the
comparison is done byte by byte, the result is the natural lexicographic
ordering:
name1 := "john"
name2 := "smith"
fmt.Println(name1 == name2) // false
fmt.Println(name1 > name2) // false
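One consequence of byte-by-byte comparison is that every uppercase ASCII letter sorts before every lowercase one. The sketch below illustrates this; lessIgnoreCase is a hypothetical helper, and lowercasing both sides is only one simple way to get a case-insensitive order.

```go
package main

import (
	"fmt"
	"strings"
)

// lessIgnoreCase is a hypothetical helper that compares two strings
// after lowercasing both, so that case does not affect the ordering.
func lessIgnoreCase(a, b string) bool {
	return strings.ToLower(a) < strings.ToLower(b)
}

func main() {
	fmt.Println("Zebra" < "apple")                // true: 'Z' (0x5A) sorts before 'a' (0x61)
	fmt.Println(lessIgnoreCase("Zebra", "apple")) // false
	fmt.Println("go" < "gopher")                  // true: a prefix sorts before the longer string
}
```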
Substrings
Go also allows easy access to parts of your string. Note that indexes count bytes (starting at 0), not characters. For example:
s[3] - returns the byte at index 3 (the fourth byte of the string)
s[5:] - returns a substring from index 5 until the end
s[:5] - returns a substring from index 0 up to, but not including, index 5
s[2:5] - returns a substring from index 2 up to, but not including, index 5
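The expressions above can be tried with a short program. Because indexes are byte offsets, slicing in the middle of a multi-byte character produces invalid text, so the sketch also includes firstRunes, a hypothetical helper that cuts on rune boundaries instead.

```go
package main

import "fmt"

// firstRunes is a hypothetical helper that returns the first n runes of s.
// It relies on range, which yields the byte offset where each rune starts.
func firstRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	count := 0
	for i := range s {
		if count == n {
			return s[:i]
		}
		count++
	}
	return s // s has n runes or fewer
}

func main() {
	s := "Hello Gopher"
	fmt.Println(s[3])   // 108, the byte value of 'l'
	fmt.Println(s[6:])  // Gopher
	fmt.Println(s[:5])  // Hello
	fmt.Println(s[2:5]) // llo

	fmt.Println(firstRunes("😀 Go", 1)) // 😀
}
```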
String Length
You might expect Go's built-in len function to return the number of
characters in a string, but it does not. As per the official documentation,
len over strings returns the number of bytes in the string (not the number of
characters). So if your string contains any non-ASCII character, the result
will be larger than the character count. For example:
s := "Hello, 😊"
fmt.Println(len(s)) // returns 11. Did you expect 8?
To solve the above problem, we should resort to the package
unicode/utf8:
s := "Hello, 😊"
fmt.Println(utf8.RuneCountInString(s)) // yes! now we have an 8!
Loops over Strings
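As per the above, loops over strings should use range, which decodes one rune per iteration, instead of indexing up to len. Example:
// iterates byte by byte, so a multi-byte character is split into separate bytes
for i := 0; i < len(s); i++ {
	fmt.Printf("%d %q\n", i, s[i])
}
// iterates rune by rune; i is the byte offset at which each rune starts
for i, r := range s {
	fmt.Printf("%d\t%q\t%d\n", i, r, r)
}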
Immutability
Another important property of Strings in Go is that they are immutable:
once a string value is created, the byte sequence it contains cannot be
changed:
s[7] = 'a' // compiler error: cannot assign to s[7]
But, as expected, a string can be reassigned another value:
s = "Hello Again"
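When a modified copy is needed, one common approach is to convert the string to a slice, change the slice, and convert back. The sketch below assumes that approach; replaceRuneAt is a hypothetical helper, and the original string is left untouched.

```go
package main

import "fmt"

// replaceRuneAt is a hypothetical helper that returns a copy of s with the
// rune at position pos (counted in runes, not bytes) replaced by r.
// It returns s unchanged when pos is out of range.
func replaceRuneAt(s string, pos int, r rune) string {
	runes := []rune(s) // copies the contents; s itself is never modified
	if pos < 0 || pos >= len(runes) {
		return s
	}
	runes[pos] = r
	return string(runes)
}

func main() {
	s := "Hello Gopher"
	t := replaceRuneAt(s, 7, 'a')
	fmt.Println(s) // Hello Gopher
	fmt.Println(t) // Hello Gapher
}
```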
Escape Sequences
Within a double-quoted string literal, escape sequences that begin with a
backslash (\) can be used to insert arbitrary byte values into the string. The
most common are:
- \a - “alert” or bell
- \b - backspace
- \f - form feed
- \n - newline
- \r - carriage return
- \t - tab
- \v - vertical tab
- \' - single quote (only in the rune literal '\'')
- \" - double quote (only within "..." literals)
- \\ - backslash
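A few of these escapes in action, as a small illustrative sketch. The helper name quoted is hypothetical.

```go
package main

import "fmt"

// quoted is a hypothetical helper that wraps name in double quotes
// and terminates the line, using the \" and \n escapes.
func quoted(name string) string {
	return "\"" + name + "\"\n"
}

func main() {
	fmt.Print(quoted("Gopher"))       // "Gopher"
	fmt.Println("C:\\go\\bin")        // C:\go\bin
	fmt.Println(len("\n"), len("\\")) // 1 1: each escape is a single byte
}
```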
Runes, ASCII, Unicode and UTF
And since we're talking Go Strings, Runes, ASCII and Unicode, let's review a little about these topics.
ASCII
ASCII (American Standard Code for Information Interchange) is a character
encoding standard created in the 1960s and still widely used. ASCII only
supports 128 characters, such as unaccented letters, numbers and a few other
characters.
Unicode
Due to ASCII's limitations, Unicode was created as a superset of it. Today it
defines over 140k characters (with room for more than a million code points),
more than sufficient to handle most of the characters and symbols in use
around the world. The Unicode standard defines the Unicode Transformation
Formats (UTF) UTF-8, UTF-16, and UTF-32, and several other encodings.
UTF-8
Today, UTF-8 is the most common encoding on the internet. UTF-8 was invented
by Ken Thompson and Rob Pike, two of the creators of Go. It uses between 1 and
4 bytes to represent each rune, but only one byte for ASCII characters, and
only 2 or 3 bytes for most runes in common use. The first 128 Unicode code
points represent the ASCII characters, which means that any ASCII text is also
valid UTF-8 text.
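The variable width of UTF-8 can be observed with utf8.RuneLen from the unicode/utf8 package. This is a small sketch; encodedWidths is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedWidths is a hypothetical helper that returns, for each rune in s,
// the number of bytes its UTF-8 encoding occupies.
func encodedWidths(s string) []int {
	widths := make([]int, 0, utf8.RuneCountInString(s))
	for _, r := range s {
		widths = append(widths, utf8.RuneLen(r))
	}
	return widths
}

func main() {
	// 'a' is ASCII, '\u00e9' is é, followed by a CJK character and an emoji.
	fmt.Println(encodedWidths("a\u00e9世😀")) // [1 2 3 4]
}
```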
Unicode Standard Notation
Unicode has a standard notation for code points: U+ followed by the code
point in hexadecimal. For example, U+1F600 represents the Unicode
character 😀. To print a rune in this notation in Go, use the %U verb.
Printing Runes
Runes are usually printed with the following verbs:
- %c: to print the character
- %q: to print the character within quotes
- %U: to print the value of the character in Unicode notation (U+<value>)
For example:
ascii := 'a'
unicode := '😀'
newline := '\n'
fmt.Printf("%d %[1]c %[1]q\n", ascii) // 97 a 'a'
fmt.Printf("%d %[1]c %[1]q\n", unicode) // 128512 😀 '😀'
fmt.Printf("%U\n", unicode) // U+1F600
fmt.Printf("%d %[1]q\n", newline) // 10 '\n'
Other formats can also be used, including:
- %b: base 2
- %o: base 8
- %d: base 10
- %x: base 16, with lower-case letters for a-f
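As a quick illustration of these verbs applied to a rune, the sketch below formats the same value in all four bases. The helper name bases is hypothetical.

```go
package main

import "fmt"

// bases is a hypothetical helper that formats r in base 2, 8, 10 and 16.
// The [1] index reuses the first argument for every verb.
func bases(r rune) string {
	return fmt.Sprintf("%b %[1]o %[1]d %[1]x", r)
}

func main() {
	fmt.Println(bases('a')) // 1100001 141 97 61
}
```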
Raw String Literals
A raw string literal is written using backticks (`). Within raw string
literals, no escape sequences are processed; the contents are taken literally.
For example:
s := `{
"name": "john"
}`
fmt.Println(s)
Prints:
{
"name": "john"
}
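To make the difference with double-quoted literals concrete, the sketch below writes the same characters in both forms. The constant names are chosen only for illustration.

```go
package main

import "fmt"

// In the interpreted literal, \n is a single newline byte.
const interpreted = "a\nb"

// In the raw literal, the backslash and the n are two separate characters.
const raw = `a\nb`

func main() {
	fmt.Println(len(interpreted)) // 3
	fmt.Println(len(raw))         // 4
	fmt.Printf("%q\n", raw)       // "a\\nb"
}
```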
Standard Library Support
Strings are also widely supported by Go's standard library. The most
important packages for manipulating strings are: bytes, strings, strconv, and
unicode. We'll study them in future posts but feel free to explore and learn
more about them at your own pace.
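As a small taste of these packages, the sketch below calls one or two functions from each. It is only an illustrative sample, and countUpper is a hypothetical helper.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// countUpper is a hypothetical helper that counts the uppercase runes in s.
func countUpper(s string) int {
	n := 0
	for _, r := range s {
		if unicode.IsUpper(r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(strings.Contains("Hello Gopher", "Gopher")) // true
	fmt.Println(strings.Split("a,b,c", ","))                // [a b c]

	n, err := strconv.Atoi("42") // string to int
	fmt.Println(n, err)          // 42 <nil>

	fmt.Println(bytes.Equal([]byte("go"), []byte("go"))) // true
	fmt.Println(countUpper("Hello Gopher"))              // 2
}
```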
Conclusion
In this post we learned a little more about Strings in Go. Since manipulating
Strings is an essential part of a programmer's life, understanding their
particularities is important to master the Go programming language.
To summarize, here are some important particularities that you should know:
- strings in Go are immutable sequences of bytes
- strings in Go can contain human-readable text or any other data, including arbitrary bytes
- text strings in Go are conventionally interpreted as UTF-8-encoded sequences
of Unicode code points (runes)
- Go source files are always encoded in UTF-8, so string literals can include
Unicode code points directly
- strings in Go accept both ASCII characters and Unicode code points
- a rune whose value is less than 256 can be written with a single hexadecimal
escape (e.g., '\x41' for 'A'), but a \u or \U escape must be used for higher
values
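The last point can be checked with a short sketch that compares escaped rune literals with their plain counterparts. The helper name codePoint is hypothetical.

```go
package main

import "fmt"

// codePoint is a hypothetical helper that formats r in U+ notation.
func codePoint(r rune) string {
	return fmt.Sprintf("%U", r)
}

func main() {
	fmt.Println('\x41' == 'A')        // true: hexadecimal escape, values below 256
	fmt.Println('\u4e16' == '世')      // true: \u takes exactly 4 hex digits
	fmt.Println('\U0001F600' == '😀') // true: \U takes exactly 8 hex digits

	fmt.Println(codePoint('\x41'))       // U+0041
	fmt.Println(codePoint('\U0001F600')) // U+1F600
}
```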
See Also