
Tuesday, February 2, 2021

Raw String Literals in Go

Learn what raw string literals are in Go and how to use them

In previous posts we discussed Runes, Variables and Strings in Go. Today, let's continue our study of Go Strings by learning about raw string literals.

Raw String Literals

Raw string literals are character sequences between back quotes, as in `foo`. Within raw string literals, no escape sequences are processed; the contents are taken literally. For example:

s := `{
"name": "john"
}`
fmt.Println(s)

Prints:

{
"name": "john"
}

Particularities of raw string literals

Here are a few particularities of raw string literals:

within the quotes, any character may appear except a back quote (`)
the value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the quotes
the string may contain newlines
carriage return characters ('\r') inside the literal are discarded from the string value
no escape sequences are processed; the contents are taken literally, so a backslash has no special meaning
a raw string literal may spread over several lines in the source code
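
To make the "no escape sequences" point concrete, here is a minimal sketch comparing a double-quoted literal with a raw one. The constant names are illustrative only and do not come from the original post.

```go
package main

import "fmt"

// Both literals have the same characters between their quotes,
// but only the double-quoted one processes the \n escape.
const (
	interpreted = "a\nb" // 3 bytes: 'a', newline, 'b'
	raw         = `a\nb` // 4 bytes: 'a', backslash, 'n', 'b'
)

func main() {
	fmt.Println(len(interpreted)) // 3
	fmt.Println(len(raw))         // 4
	fmt.Printf("%q\n", raw)       // "a\\nb"
}
```

The raw literal keeps the backslash and the letter n as two separate characters, which is why it is one byte longer.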

When to use raw string literals

Because they preserve formatting and disable escaping, raw string literals are convenient in multiple contexts, including:

writing regular expressions
HTML templates
JSON literals
command usage messages
XML
YAML
and other formats that require preservation of the formatting
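
As a small taste of the first case, the sketch below compiles a regular expression from a raw string literal. The pattern and the isDate helper are hypothetical examples, not part of the original post; the point is only that the backslashes reach the regexp package without having to be doubled.

```go
package main

import (
	"fmt"
	"regexp"
)

// datePattern is a hypothetical pattern for dates such as 2021-02-02.
// Inside the raw literal, \d is passed to regexp exactly as written.
var datePattern = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}$`)

// isDate reports whether s looks like a YYYY-MM-DD date.
func isDate(s string) bool {
	return datePattern.MatchString(s)
}

func main() {
	fmt.Println(isDate("2021-02-02"))  // true
	fmt.Println(isDate("Feb 2, 2021")) // false

	// The same pattern in a double-quoted literal needs doubled backslashes.
	fmt.Println(`^\d{4}$` == "^\\d{4}$") // true
}
```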

We'll see uses of raw string literals for each of the above cases. Stay tuned!

Conclusion

In this post we learned a little more about Strings in Go by reviewing raw string literals. Since manipulating Strings is an essential part of a programmer's life, understanding their particularities is important to master the Go programming language.

Source

Go Language Specification

Tuesday, January 26, 2021

Strings in Go

Learn the most important aspects of Strings in Go

In previous posts we discussed Runes and Variables. Today, let's continue our study of Go's basic types by learning more about Strings in Go. Since Strings in Go are not as obvious as in your favorite programming language, we recommend exploring this article at your own pace.

Declaring Strings in Go

You probably know how to declare variables in Go. Declaring strings is as simple as:

s := "Hello Gopher" // or...
var s2 string

A string value can be written as a string literal, a sequence of bytes enclosed in double quotes. Strings in Go can also contain Unicode characters:

s2 = "Hello 😀" // s2 was already declared above, so we assign with = instead of :=

We can also slice a string like an array of bytes to access parts of it. For example:

fmt.Println(s[:5]) // "Hello"
fmt.Println(s[6:]) // "Gopher"
fmt.Println(s[:]) // "Hello Gopher"

Concatenating Strings

Concatenating Strings in Go is similar to Java, JavaScript and Python as Go also utilizes the + operator:

s3 := "Goodbye, " + s[6:] // "Goodbye, Gopher"

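String concatenation creates a new string each time, copying both operands. Avoid using it repeatedly in loops, as it can hurt the performance of your application; the standard library's strings.Builder is the usual alternative.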
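
One common way to avoid repeated concatenation in a loop is strings.Builder from the standard library. The sketch below is only an illustration; joinWords is a hypothetical helper that is not part of the original post.

```go
package main

import (
	"fmt"
	"strings"
)

// joinWords is a hypothetical helper that joins words with single spaces.
// It appends to one growing buffer instead of building a new string
// on every iteration with the + operator.
func joinWords(words []string) string {
	var b strings.Builder
	for i, w := range words {
		if i > 0 {
			b.WriteByte(' ')
		}
		b.WriteString(w)
	}
	return b.String()
}

func main() {
	fmt.Println(joinWords([]string{"Hello", "Gopher"})) // Hello Gopher
}
```

For this particular task strings.Join would do the same job; the builder is shown because it generalizes to any loop that assembles a string piece by piece.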

Comparing Strings

Strings may be compared with operators like == and <. And since the comparison is done byte by byte, the result is a sweet natural lexicographic ordering:

name1 := "john"
name2 := "smith"
fmt.Println(name1 == name2) // false
fmt.Println(name1 > name2) // false
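
Because the comparison works on bytes, a few results may look surprising at first. The following sketch uses a hypothetical less helper to show that upper-case ASCII letters sort before lower-case ones and that a prefix sorts before the longer string.

```go
package main

import "fmt"

// less reports whether a sorts before b under Go's byte-wise comparison.
func less(a, b string) bool {
	return a < b
}

func main() {
	fmt.Println(less("john", "smith"))  // true
	fmt.Println(less("Zebra", "apple")) // true: 'Z' is byte 90, 'a' is byte 97
	fmt.Println(less("john", "johnny")) // true: a prefix sorts first
}
```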

Substrings

Go also gives easy access to parts of a string. Indexes count bytes and start at 0. For example:

s[3] - returns the byte at index 3 (the fourth byte of the string)

s[5:] - returns a substring from index 5 until the end

s[:5] - returns a substring from index 0 up to, but not including, index 5

s[2:5] - returns a substring from index 2 up to, but not including, index 5
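
The expressions above can be tried with a short program. This sketch assumes the same "Hello Gopher" value used earlier in the post.

```go
package main

import "fmt"

// s is the sample string used throughout this post.
const s = "Hello Gopher"

func main() {
	fmt.Println(s[3])   // 108, the byte value of 'l'
	fmt.Println(s[5:])  // " Gopher" (begins with the space at index 5)
	fmt.Println(s[:5])  // "Hello"
	fmt.Println(s[2:5]) // "llo"
}
```

Note that indexing returns a byte, which prints as a number, while slicing returns a string.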

String Length

If you thought that Go's built-in len function returns the number of characters in a string, you'd be mistaken. As per the official documentation, len over strings returns the number of bytes in the string (not the number of characters). So if your string contains any non-ASCII character, which takes more than one byte in UTF-8, the result will not match the character count. For example:

s := "Hello, 😊"

fmt.Println(len(s)) // returns 11. Did you expect 8?

To solve the above problem, we should resort to the package unicode/utf8:

s := "Hello, 😊"
fmt.Println(utf8.RuneCountInString(s)) // yes! now we have an 8!

Loops over Strings

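As shown above, len counts bytes, so loops over strings should use range, which decodes one rune per iteration, rather than indexing up to len. Example:

// iterates byte by byte, so multi-byte characters are split into separate bytes
for i := 0; i < len(s); i++ {
	fmt.Printf("%d %q\n", i, s[i])
}

// iterates rune by rune; i is the byte offset at which each rune starts
for i, r := range s {
	fmt.Printf("%d\t%q\t%d\n", i, r, r)
}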

Immutability

Another important concept of Strings in Go is that they are immutable. This means that, once created, the byte sequence contained in a string value cannot be changed:

s[7] = 'a' // compiler error: cannot assign to s[7]

But, as expected, a string can be reassigned another value:

s = "Hello Again"
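
When a modified copy is needed, one common approach is to convert the string to a byte slice, change the slice, and convert back. The sketch below illustrates this; replaceByte is a hypothetical helper, and it is only safe for single-byte (ASCII) positions.

```go
package main

import "fmt"

// replaceByte returns a copy of s with the byte at index i set to c.
// The conversion to []byte copies the data, so s itself is never modified.
func replaceByte(s string, i int, c byte) string {
	b := []byte(s)
	b[i] = c
	return string(b)
}

func main() {
	s := "Hello Gopher"
	t := replaceByte(s, 0, 'J')
	fmt.Println(t) // Jello Gopher
	fmt.Println(s) // Hello Gopher (unchanged)
}
```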

Escape Sequences

Within a double-quoted string literal, escape sequences that begin with a backslash (\) can be used to insert arbitrary byte values into the string. The most common are:

\a - “alert” or bell
\b - backspace
\f - form feed
\n - newline
\r - carriage return
\t - tab
\v - vertical tab
\' - single quote (only in the rune literal '\'')
\" - double quote (only within "..." literals)
\\ - backslash

Runes, ASCII, Unicode and UTF

Since strings in Go are closely tied to runes, ASCII and Unicode, let's briefly review these topics.

ASCII

ASCII (American Standard Code for Information Interchange) is a character encoding standard created in the 1960s and still widely used. ASCII only supports 128 characters, such as unaccented letters, digits and a few other characters.

Unicode

Due to ASCII's limitations, Unicode was created as a superset of it. Today it defines over 140k characters (with room for more than a million code points), more than sufficient to handle most of the characters and symbols present in the world. The Unicode standard defines the Unicode Transformation Formats (UTF) UTF-8, UTF-16, and UTF-32, and several other encodings.

UTF-8

Today, UTF-8 is the most common encoding on the internet. UTF-8 was invented by Ken Thompson and Rob Pike, two of the creators of Go. It uses between 1 and 4 bytes to represent each rune: only 1 byte for ASCII characters, and 2 or 3 bytes for most other runes in common use. The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also valid UTF-8 text.
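
The variable width of UTF-8 can be observed with utf8.RuneLen from the standard library. In this minimal sketch, encodedLen is a hypothetical wrapper and the sample runes are arbitrary picks, one for each possible length.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen returns how many bytes UTF-8 needs to encode r.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	for _, r := range []rune{'a', '©', '世', '😀'} {
		fmt.Printf("%c %U uses %d byte(s)\n", r, r, encodedLen(r))
	}
	// a U+0061 uses 1 byte(s)
	// © U+00A9 uses 2 byte(s)
	// 世 U+4E16 uses 3 byte(s)
	// 😀 U+1F600 uses 4 byte(s)
}
```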

Unicode Standard Notation

Unicode has a standard notation for code points: U+ followed by the code point in hexadecimal. For example, U+1F600 represents the Unicode character 😀. To print a rune in this notation in Go, use the %U verb.

Printing Runes

Runes are usually printed with the following verbs:

%c: to print the character
%q: to print the character within quotes
%U: to print the value of the character in Unicode notation (U+<value>)

For example:

ascii := 'a'
unicode := '😀'
newline := '\n'

fmt.Printf("%d %[1]c %[1]q\n", ascii) // 97 a 'a'
fmt.Printf("%d %[1]c %[1]q\n", unicode) // 128512 😀 '😀'
fmt.Printf("%U\n", unicode) // U+1F600
fmt.Printf("%d %[1]q\n", newline) // 10 '\n'

Other formats can also be used, including:

%b: base 2
%o: base 8
%d: base 10
%x: base 16, with lower-case letters for a-f

Raw String Literals

A raw string literal is written using backticks (`). Within raw string literals, no escape sequences are processed; the contents are taken literally. For example:

s := `{
"name": "john"
}`
fmt.Println(s)

Prints:

{
"name": "john"
}

Standard Library Support

Strings are also widely supported by Go's standard library. The most important packages for manipulating strings are: bytes, strings, strconv, and unicode. We'll study them in future posts but feel free to explore and learn more about them at your own pace.
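
As a quick preview, the sketch below touches three of those packages (strings, strconv and unicode). countUpper is a hypothetical helper written for this example; the library calls themselves are standard.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// countUpper is a hypothetical helper that counts the upper-case runes in s.
func countUpper(s string) int {
	n := 0
	for _, r := range s {
		if unicode.IsUpper(r) {
			n++
		}
	}
	return n
}

func main() {
	s := "Hello Gopher"
	fmt.Println(strings.ToUpper(s))            // HELLO GOPHER
	fmt.Println(strings.Contains(s, "Gopher")) // true
	fmt.Println(strings.Fields(s))             // [Hello Gopher]
	fmt.Println(countUpper(s))                 // 2

	n, err := strconv.Atoi("42") // string to int
	fmt.Println(n+1, err)        // 43 <nil>
	fmt.Println(strconv.Itoa(7) + "!") // 7!
}
```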

Conclusion

In this post we learned a little more about Strings in Go. Since manipulating Strings is an essential part of a programmer's life, understanding their particularities is important to master the Go programming language.

To summarize, here are some important particularities that you should know:

strings in Go are immutable sequences of bytes
strings in Go can contain human-readable text or any other data, including arbitrary bytes
text strings in Go are conventionally interpreted as UTF-8-encoded sequences of Unicode code points (runes)
because Go source files are always encoded in UTF-8, string literals can include Unicode code points directly
strings in Go accept ASCII characters as well as any other Unicode code points
a rune whose value is less than 256 can be written with a single hexadecimal escape (e.g., '\x41' for 'A') but a \u or \U escape must be used for higher values

Tuesday, January 19, 2021

Runes in Go

Understand what a rune is in Go and when to use it

If you are new to Go, you have probably seen the word rune being used. But would you be able to say precisely what it is?

Runes in Go

A rune in Go is an alias for the type int32 which, by convention, holds a Unicode code point. A code point is a numerical value that can represent a single character but can also have other meanings, such as formatting. With UTF-8 encoding, different code points are encoded as sequences from one to four bytes long.

For example, the rune literal 'a' is the ASCII code 97 or Unicode U+0061. In summary, a rune in Go is:

an alias for the type int32
a predeclared type name (rune is an identifier, not a keyword)
a Unicode code point
usually, a single character
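
Since rune and int32 are the same type, values of one can be used as the other without any conversion. The following sketch illustrates this; next is a hypothetical helper introduced only for the example.

```go
package main

import "fmt"

// next returns the code point that follows r.
// It is declared with rune, but accepts an int32 as-is.
func next(r rune) rune {
	return r + 1
}

func main() {
	var i int32 = 97
	var r rune = i // no conversion needed: rune and int32 are the same type

	fmt.Println(r)              // 97
	fmt.Printf("%c\n", next(i)) // b
	fmt.Printf("%T\n", r)       // int32
}
```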

Rune Literals

Another important point to remember is that Go source code is encoded as UTF-8, meaning that string and rune literals can contain Unicode characters directly. A rune literal is written as a character within single quotes.

And as we'll see later, Go also accepts any ASCII character as well as Unicode code points either directly or with numeric escapes. For example, a rune whose value is less than 256 can be written with a single hexadecimal escape (e.g., '\x41' for 'A') but \u or \U escape must be used for higher values.
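
The three escape forms mentioned above can be checked against their directly typed equivalents. The constant names in this sketch are illustrative only.

```go
package main

import "fmt"

const (
	hexA   rune = '\x41'       // two hex digits, values below 256
	world  rune = '\u4e16'     // four hex digits
	smiley rune = '\U0001F600' // eight hex digits
)

func main() {
	fmt.Println(hexA == 'A', world == '世', smiley == '😀') // true true true
	fmt.Printf("%c %c %c\n", hexA, world, smiley)          // A 世 😀
}
```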

ASCII, Unicode and UTF

And since we're talking about ASCII and Unicode, let's see how they differ and why that matters.

ASCII

ASCII (abbreviated from American Standard Code for Information Interchange) is a character encoding standard for electronic communication. Its development started in the 1960s and it is still widely used today.

But ASCII is limited to only 128 characters (7 bits, with code points ranging from 0 to 127), which means that it only has room for unaccented letters, digits and a few other characters, leaving out accented letters and most of the characters used by Eastern languages.

Unicode

For that reason, a new standard called Unicode was created as a superset of ASCII. It defines over 140k characters (with room for more than a million code points), more than sufficient to handle most of the characters in all languages present in the world, with space for new ones if necessary.

Unicode can be implemented by different character encodings. The Unicode standard defines the Unicode Transformation Formats (UTF) UTF-8, UTF-16, and UTF-32, and several other encodings. The most commonly used encodings are UTF-8, UTF-16, the obsolete UCS-2, and GB18030, which is standardized in China and implements Unicode fully although it is not an official Unicode standard.

UTF-8

Today, UTF-8 is the most common encoding on the internet. UTF-8 was invented by Ken Thompson and Rob Pike, two of the creators of Go, and is now a Unicode standard. It uses between 1 and 4 bytes to represent each rune: only 1 byte for ASCII characters, and 2 or 3 bytes for most other runes in common use. The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also valid UTF-8 text.

Unicode Standard Notation

Unicode has a standard notation for code points: U+ followed by the code point in hexadecimal. For example, U+1F600 represents the Unicode character 😀.

Printing Runes

Runes are usually printed with the following formats:

%c: to print the character
%q: to print the character within quotes
%U: to print the value of the character in Unicode notation (U+<value>)

For example:

ascii := 'a'
unicode := '😀'
newline := '\n'

fmt.Printf("%d %[1]c %[1]q\n", ascii) // 97 a 'a'
fmt.Printf("%d %[1]c %[1]q\n", unicode) // 128512 😀 '😀'
fmt.Printf("%U\n", unicode) // U+1F600
fmt.Printf("%d %[1]q\n", newline) // 10 '\n'

But other formats can also be used, including:

%b: base 2
%o: base 8
%d: base 10
%x: base 16, with lower-case letters for a-f
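
To see the numeric verbs side by side, the sketch below formats the same rune in all four bases. bases is a hypothetical helper introduced for this example.

```go
package main

import "fmt"

// bases formats r in base 2, 8, 10 and 16, separated by spaces.
func bases(r rune) string {
	return fmt.Sprintf("%b %o %d %x", r, r, r, r)
}

func main() {
	fmt.Println(bases('a')) // 1100001 141 97 61
}
```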

Conclusion

In this post we learned about runes in Go. The rune type is an alias for int32 and is equivalent to int32 in all ways. It is used, by convention, to distinguish character values from integer values.

Tuesday, December 15, 2020

Escape Sequences in Go

Want to know more about escape sequences in Go? Read to understand.

Go, like other programming languages, has the concept of escaping. An escape sequence is a combination of characters that has a meaning other than the literal characters contained therein. An escape sequence commonly starts with an escape character, which in Go is the \ character. Let's understand more about escaping in Go's context.

Why escape?

Escaping is a common technique that developers resort to in order to:

encode commands or special data which cannot be directly represented by the alphabet
represent characters which cannot be typed in the current context, or would have an undesired interpretation

Escaping in Go

Like other programming languages, Go uses the backslash (\) character to escape. What comes next depends on what you need. Here are some useful values to get you started:

Escape Sequence	Value
\\	the \ character
\'	the ' character (rune literals only)
\"	the " character (string literals only)
\a	an alert or bell
\b	backspace
\f	form feed
\n	a new line
\r	carriage return
\t	a horizontal tab
\v	a vertical tab
\xFF	the byte with hexadecimal value FF (255)
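
The hexadecimal escape deserves a closer look, since it inserts a raw byte rather than a character. The sketch below contrasts it with the \u escape, which inserts the UTF-8 encoding of a code point; the constant names are illustrative only.

```go
package main

import "fmt"

const (
	hexByte = "\xFF"   // exactly one byte with value 255
	accentE = "\u00e9" // the code point U+00E9, encoded in UTF-8 as two bytes
)

func main() {
	fmt.Println(len(hexByte), hexByte[0]) // 1 255
	fmt.Println(len(accentE))             // 2
	fmt.Println("caf" + accentE)          // café
	fmt.Println("\x41" == "A")            // true
}
```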

Examples

So let's check some examples:

package main

import (
	"fmt"
)

func main() {
	dec := 22
	octal := 033
	hex := 0xFF

	fmt.Printf("Decimal %v, Hex: %v, Octal: %v\n", dec, hex, octal)
	fmt.Println("Some\ttab")
	fmt.Println("A quote: \"")
	fmt.Println("What\nabout\nline\nbreaks")
}

Prints:

Decimal 22, Hex: 255, Octal: 27
Some	tab
A quote: "
What
about
line
breaks

Conclusion

In this article we learned about escaping. Escaping in Go is very similar to other programming languages and is extensively used to encode commands or special data which cannot be directly represented by the alphabet, or to represent characters which cannot be typed in the current context or would have an undesired interpretation.

Reference

Go Spec
