Showing posts with label Strings. Show all posts

Tuesday, February 2, 2021

Raw String Literals in Go

Learn what raw string literals are in Go and how to use them

In previous posts we discussed Runes, Variables and Strings in Go. Today, let's continue our study of Go Strings by learning about raw string literals.

    Raw String Literals

    Raw string literals are character sequences between back quotes, as in `foo`. Within raw string literals, no escape sequences are processed; the contents are taken literally. For example:

    s := `{
    "name": "john"
    }`
    fmt.Println(s)

    Prints:

    {
    "name": "john"
    }

    Particularities of raw string literals

    Here are a few particularities of raw string literals:

    • within the quotes any character can appear except a back quote (`)
    • the value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the quotes
    • the string may contain newlines
    • carriage return characters ('\r') are discarded
    • no escape sequences are processed; the contents are taken literally
    • a raw string literal may spread over several lines in the source code
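To make the "no escape sequences are processed" point concrete, here is a minimal sketch that writes the same characters once as a raw string literal and once as a double-quoted literal. The helper name rawAndInterpreted is made up for this illustration.

```go
package main

import "fmt"

// rawAndInterpreted (hypothetical helper) returns the same characters written
// as a raw string literal and as an interpreted, double-quoted literal.
func rawAndInterpreted() (raw, interpreted string) {
	raw = `a\tb\n`         // backslashes are kept as-is: 6 bytes
	interpreted = "a\tb\n" // \t and \n become a tab and a newline: 4 bytes
	return raw, interpreted
}

func main() {
	raw, interpreted := rawAndInterpreted()
	fmt.Println(len(raw), len(interpreted)) // 6 4
	fmt.Printf("%q\n", raw)                 // "a\\tb\\n"
	fmt.Printf("%q\n", interpreted)         // "a\tb\n"
}
```

The %q verb prints each value as a double-quoted Go literal, which makes the kept backslashes in the raw string visible.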

    When to use raw string literals

    Because they preserve formatting and disable escaping, raw string literals are convenient in multiple contexts, including:

    • writing regular expressions
    • HTML templates
    • JSON literals
    • command usage messages
    • XML
    • YAML
    • and other formats that require preservation of the formatting

    We'll see uses of raw string literals for each of the above cases. Stay tuned!
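As a first taste of the regular-expression case, the sketch below compiles the same pattern twice, once from a raw string literal and once from a double-quoted literal where every backslash has to be doubled. The pattern and the names rawPattern, escapedPattern and matchesBoth are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both variables hold the same regular expression:
// three digits, a dash, then four digits.
var (
	rawPattern     = regexp.MustCompile(`^\d{3}-\d{4}$`)
	escapedPattern = regexp.MustCompile("^\\d{3}-\\d{4}$")
)

// matchesBoth (hypothetical helper) reports whether s matches both patterns.
func matchesBoth(s string) bool {
	return rawPattern.MatchString(s) && escapedPattern.MatchString(s)
}

func main() {
	fmt.Println(rawPattern.String() == escapedPattern.String()) // true
	fmt.Println(matchesBoth("555-1234"))                        // true
	fmt.Println(matchesBoth("5551234"))                         // false
}
```

The raw form reads exactly like the pattern you would type into a regex tester, which is why it is usually preferred.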

    Conclusion

    In this post we learned a little more about Strings in Go by reviewing raw string literals. Since manipulating Strings is an essential part of a programmer's life, understanding their particularities is important to master the Go programming language.

      Source

      See Also

      Tuesday, January 26, 2021

      Strings in Go

      Learn the most important aspects of Strings in Go

      In previous posts we discussed Runes and Variables. Today, let's continue our study of Go's basic types by learning more about Strings in Go. Since Strings in Go are not as obvious as in your favorite programming language, we recommend exploring this article at your own pace.

      Declaring Strings in Go

      You probably know how to declare variables in Go. Declaring strings is as simple as:

      s := "Hello Gopher" // or...
      var s2 string

      A string value can be written as a string literal, a sequence of bytes enclosed in double quotes. Strings in Go can also contain non-ASCII Unicode characters:

      s2 = "Hello 😀"

      We can also slice a string, much like an array, to access parts of it. For example:

      fmt.Println(s[:5]) // "Hello"
      fmt.Println(s[6:]) // "Gopher"
      fmt.Println(s[:])  // "Hello Gopher"

      Concatenating Strings

      Concatenating Strings in Go is similar to Java, JavaScript and Python as Go also utilizes the + operator:
      s3 := "Goodbye, " + s[6:] // "Goodbye, Gopher"
      Because strings are immutable, each concatenation allocates a new string and copies both operands into it. Avoid repeated concatenation in loops as it will impact the performance of your application; the standard library's strings.Builder is the usual alternative.
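One common way to avoid repeated concatenation in a loop is strings.Builder from the standard library. The sketch below is only an illustration; the function name joinWithBuilder is made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// joinWithBuilder (hypothetical helper) concatenates parts using
// strings.Builder, which appends into a growing buffer instead of
// allocating a brand new string on every step.
func joinWithBuilder(parts []string) string {
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	fmt.Println(joinWithBuilder([]string{"Goodbye, ", "Gopher"})) // Goodbye, Gopher
}
```

For a handful of short strings the + operator is perfectly fine; the builder pays off when the number of pieces grows.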

      Comparing Strings

      Strings may be compared with operators like == and <. And since the comparison is done byte by byte, the result is a sweet natural lexicographic ordering:  

      name1 := "john"
      name2 := "smith"
      fmt.Println(name1 == name2) // false
      fmt.Println(name1 > name2)  // false
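Byte-by-byte comparison has a few consequences that are easy to miss, sketched below: upper-case ASCII letters sort before lower-case ones, and digits are compared as characters rather than as numbers. The helper name less is made up for this example.

```go
package main

import "fmt"

// less (hypothetical helper) reports whether a sorts before b
// using Go's built-in byte-wise string comparison.
func less(a, b string) bool {
	return a < b
}

func main() {
	fmt.Println(less("apple", "banana")) // true
	fmt.Println(less("Zebra", "apple"))  // true: 'Z' (90) comes before 'a' (97)
	fmt.Println(less("10", "9"))         // true: '1' (49) comes before '9' (57)
	fmt.Println(less("john", "john"))    // false: the strings are equal
}
```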

      Substrings

      Go also allows easy access to parts of your string. For example:

      s[3] - returns the byte at index 3 (the fourth byte, since indexes start at 0)
      s[5:] - returns a substring from index 5 until the end
      s[:5] - returns a substring from index 0 up to, but not including, index 5
      s[2:5] - returns a substring from index 2 up to, but not including, index 5
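The expressions above can be tried with the string "Hello Gopher" used earlier in this post. This is a small sketch; the helper name sub is made up, and note that all indexes count bytes, not characters.

```go
package main

import "fmt"

// sub (hypothetical helper) returns the bytes of s from index i
// up to, but not including, index j.
func sub(s string, i, j int) string {
	return s[i:j]
}

func main() {
	s := "Hello Gopher"
	fmt.Printf("%c\n", s[3])         // l
	fmt.Printf("%q\n", s[5:])        // " Gopher"
	fmt.Printf("%q\n", s[:5])        // "Hello"
	fmt.Printf("%q\n", sub(s, 2, 5)) // "llo"
}
```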

      String Length

      If you thought that Go's built-in len function returns the number of characters in a string, you're incorrect. As per the official documentation, len over strings returns the number of bytes in the string (not the number of characters). So if your string contains any character that takes more than one byte in UTF-8, the result will be larger than the character count. For example:

      s := "Hello, 😊"
      fmt.Println(len(s)) // returns 11. Did you expect 8?

      To solve the above problem, we should resort to the package unicode/utf8:

      s := "Hello, 😊"
      fmt.Println(utf8.RuneCountInString(s)) // yes! now we have an 8!

      Loops over Strings

      As per the above, loops over strings should use range instead of len. Example:

      // incorrect for non-ASCII text: iterates byte by byte, splitting multi-byte characters
      for i := 0; i < len(s); i++ {
          fmt.Printf("%d %q\n", i, s[i])
      }

      // correct
      for i, r := range s {
          fmt.Printf("%d\t%q\t%d\n", i, r, r)
      }

      Immutability

      Another important concept of Strings in Go is that they are immutable. That means that, once assigned, the byte sequence contained in a string value cannot be changed:

      s[7] = 'a'   // compiler error: cannot assign to s[7]

      But, as expected, a string can be reassigned another value:

      s = "Hello Again"

      Escape Sequences

      Within a double-quoted string literal, escape sequences that begin with a backslash (\) can be used to insert arbitrary byte values into the string. The most common are:

      • \a - “alert” or bell
      • \b - backspace
      • \f - form feed
      • \n - newline
      • \r - carriage return
      • \t - tab
      • \v - vertical tab
      • \' - single quote (only in the rune literal '\'')
      • \" - double quote (only within "..." literals)
      • \\ - backslash

      Runes, ASCII, Unicode and UTF

      And since we're talking about Go Strings, Runes, ASCII and Unicode, let's review a little about these topics.

      ASCII

      ASCII (American Standard Code for Information Interchange) is a character encoding standard created in the 1960s and still widely used. ASCII only supports 128 characters such as unaccented letters, numbers and a few other characters.

      Unicode

      Due to ASCII's limitations, Unicode was created as a superset of it. Today it defines over 140k characters (and has room for more than a million code points), more than sufficient to handle most of the characters and symbols present in the world. The Unicode standard defines the Unicode Transformation Formats (UTF) UTF-8, UTF-16, and UTF-32, and several other encodings.

      UTF-8

      Today, UTF-8 is the most common encoding on the internet. UTF-8 was invented by Ken Thompson and Rob Pike, two of the creators of Go. It uses between 1 and 4 bytes to represent each rune, but only 1 byte for ASCII characters and only 2 or 3 bytes for most runes in common use. The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also a UTF-8 text.
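To see the variable-length encoding in practice, the following sketch asks the standard library how many bytes UTF-8 needs for a few sample runes. The sample characters and the helper name byteLen are arbitrary choices for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLen (hypothetical helper) returns the number of bytes UTF-8 needs
// to encode r.
func byteLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// '\u00e9' is the letter é, written with an escape.
	for _, r := range []rune{'a', '\u00e9', '世', '😀'} {
		fmt.Printf("%c %U %d\n", r, r, byteLen(r))
	}
	// a U+0061 1
	// é U+00E9 2
	// 世 U+4E16 3
	// 😀 U+1F600 4
}
```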

      Unicode Standard Notation

      Unicode has a standard notation for code points: U+ followed by the code point in hexadecimal. For example, U+1F600 represents the Unicode character 😀. To get the Unicode value in Go, use the %U verb.

      Printing Runes

      Runes are usually printed with the following verbs:

      • %c: to print the character
      • %q: to print the character within quotes
      • %U: to print the value of the character in Unicode notation (U+<value>) 

      For example:

      ascii := 'a'
      unicode := '😀'
      newline := '\n'
      fmt.Printf("%d %[1]c %[1]q\n", ascii)   // 97 a 'a'
      fmt.Printf("%d %[1]c %[1]q\n", unicode) // 128512 😀 '😀'
      fmt.Printf("%U\n", unicode)             // U+1F600
      fmt.Printf("%d %[1]q\n", newline)       // 10 '\n'

      Other formats can also be used, including:

      • %b: base 2
      • %o: base 8
      • %d: base 10
      • %x: base 16, with lower-case letters for a-f

      Raw String Literals

      A raw string literal is written using backticks (`). Within raw string literals, no escape sequences are processed; the contents are taken literally. For example:

      s := `{
      "name": "john"
      }`
      fmt.Println(s)

      Prints:

      {
      "name": "john"
      }

      Standard Library Support

      Strings are also widely supported by Go's standard library. The most important packages for manipulating strings are: bytes, strings, strconv, and unicode. We'll study them in future posts but feel free to explore and learn more about them at your own pace.
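As a quick preview, here is a small sketch touching three of these packages (strings, strconv and unicode). It only scratches the surface and leaves out bytes; the helper name countUpper is made up for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// countUpper (hypothetical helper) counts the upper-case letters in s.
func countUpper(s string) int {
	n := 0
	for _, r := range s {
		if unicode.IsUpper(r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(strings.ToUpper("gopher"))              // GOPHER
	fmt.Println(strings.Contains("Hello Gopher", "Go")) // true
	fmt.Println(strings.Fields("a b  c"))               // [a b c]

	n, err := strconv.Atoi("42") // string to int
	fmt.Println(n, err)          // 42 <nil>

	fmt.Println(countUpper("Hello Gopher")) // 2
}
```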

      Conclusion

      In this post we learned a little more about Strings in Go. Since manipulating Strings is an essential part of a programmer's life, understanding their particularities is important to master the Go programming language.

      To summarize, here are some important particularities that you should know:

      • strings in Go are immutable sequences of bytes
      • strings in Go can contain arbitrary bytes, not only human-readable text
      • text strings in Go are conventionally interpreted as UTF-8-encoded sequences of Unicode code points (runes)
      • since Go source files are always encoded in UTF-8, string literals can include Unicode code points directly
      • strings in Go accept both ASCII characters and Unicode code points
      • a rune whose value is less than 256 can be written with a single hexadecimal escape (e.g., '\x41' for 'A') but \u or \U escape must be used for higher values

      See Also

      Tuesday, January 19, 2021

      Runes in Go

      Understand what's a rune in Go and when to use it

      If you are new to Go, you probably saw the word rune being used. But would you be able to say precisely what it is?

      Runes in Go

      A rune in Go is essentially a synonym for the type int32 which, by convention, holds a Unicode code point. A code point is a numerical value that can represent single characters but can also have other meanings, such as formatting. With UTF-8 encoding, different code points are encoded as sequences from one to four bytes long.


      For example, the rune literal 'a' is the ASCII code 97 or Unicode U+0061. In summary, a rune in Go is:

      • a synonym for the type int32
      • a predeclared type name (not a keyword) that is an alias for int32
      • by convention, a Unicode code point
      • loosely speaking, a character

      Rune Literals

      Another important point to remember is that Go source code is encoded as UTF-8, meaning that string literals are UTF-8-encoded by default. A rune literal is written as a character within single quotes, as in 'a'.

      And as we'll see later, Go also accepts any ASCII character as well as Unicode code points either directly or with numeric escapes. For example, a rune whose value is less than 256 can be written with a single hexadecimal escape (e.g., '\x41' for 'A') but \u or \U escape must be used for higher values.

      ASCII, Unicode and UTF

      And since we're talking about ASCII and Unicode, let's look at how they differ and why it matters.

      ASCII

      ASCII (abbreviated from American Standard Code for Information Interchange) is a character encoding standard for electronic communication. Its development started in the 1960s and it is still widely used today.

      But ASCII is limited to only 128 characters (or 7 bits with code points ranging from 0 to 127), which means that it only contains enough to hold unaccented letters, numbers and a few other characters, leaving out accents and most of the characters used by Eastern languages.

      Unicode

      For that reason, a new standard called Unicode was created as a superset of ASCII. It defines over 140k characters (and has room for more than a million code points), more than sufficient to handle most of the characters in all the world's languages, with room for new ones if necessary.

      Unicode can be implemented by different character encodings. The Unicode standard defines the Unicode Transformation Formats (UTF) UTF-8, UTF-16, and UTF-32, and several other encodings. The most commonly used encodings are UTF-8, UTF-16, UCS-2 and GB18030, which is standardized in China and implements Unicode fully, although it is not an official Unicode standard.

      UTF-8

      Today, UTF-8 is the most common encoding on the internet. UTF-8 was invented by Ken Thompson and Rob Pike, two of the creators of Go, and is now a Unicode standard. It uses between 1 and 4 bytes to represent each rune, but only 1 byte for ASCII characters and only 2 or 3 bytes for most runes in common use. The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also a UTF-8 text.
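One way to look at the encoded bytes themselves is to print a string with the "% x" verb, which shows each byte in hexadecimal. The sketch below does that for a few sample characters; the helper name hexBytes is made up for this example.

```go
package main

import "fmt"

// hexBytes (hypothetical helper) returns the UTF-8 bytes of s as
// space-separated hexadecimal values.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	fmt.Println(hexBytes("a"))      // 61
	fmt.Println(hexBytes("\u00e9")) // c3 a9 (the letter é)
	fmt.Println(hexBytes("世"))      // e4 b8 96
	fmt.Println(hexBytes("😀"))      // f0 9f 98 80
}
```

The ASCII letter keeps its single ASCII byte (0x61), which is why any ASCII text is also valid UTF-8.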

      Unicode Standard Notation

      Unicode has a standard notation for code points: U+ followed by the code point in hexadecimal. For example, U+1F600 represents the Unicode character 😀. To get the Unicode value in Go, use the %U format.

      Printing Runes

      Runes are usually printed with the following formats:

      • %c: to print the character
      • %q: to print the character within quotes
      • %U: to print the value of the character in Unicode notation (U+<value>) 

      For example:

      ascii := 'a'
      unicode := '😀'
      newline := '\n'
      fmt.Printf("%d %[1]c %[1]q\n", ascii)   // 97 a 'a'
      fmt.Printf("%d %[1]c %[1]q\n", unicode) // 128512 😀 '😀'
      fmt.Printf("%U\n", unicode)             // U+1F600
      fmt.Printf("%d %[1]q\n", newline)       // 10 '\n'

      But other formats can also be used, including:

      • %b: base 2
      • %o: base 8
      • %d: base 10
      • %x: base 16, with lower-case letters for a-f

      Conclusion

      In this post we learned about runes in Go. The type rune is an alias for int32 and is equivalent to int32 in all ways. It is used, by convention, to distinguish character values from integer values.

      See Also

      Tuesday, December 15, 2020

      Escape Sequences in Go

      Want to know more about escape sequences in Go? Read to understand.

      Go, like other programming languages, has the concept of escaping. An escape sequence is a combination of characters that has a meaning other than the literal characters contained therein. An escape sequence commonly starts with an escape character, which in Go is the \ character. Let's understand more about escaping in Go's context.

      Why escape?

      Escaping is a commonly used technique. Developers resort to it to:

      • encode commands or special data which cannot be directly represented by the alphabet.
      • represent characters which cannot be typed in the current context, or would have an undesired interpretation

      Escaping in Go

      Like other programming languages, Go utilizes the backslash (\) character to escape. What to use next? Depends on what you need. Here are some useful values to get you started:

      Escape Sequence    Value
      \\                 the \ character
      \'                 the ' character (rune literals only)
      \"                 the " character (string literals only)
      \a                 an alert
      \b                 backspace
      \f                 form feed
      \n                 a new line
      \r                 carriage return
      \t                 a horizontal tab
      \v                 a vertical tab
      \xFF               the byte with hexadecimal value FF

      Examples

      So let's check some examples:

      package main

      import (
          "fmt"
      )

      func main() {
          dec := 22
          octal := 033
          hex := 0xFF
          fmt.Printf("Decimal %v, Hex: %v, Octal: %v\n", dec, hex, octal)
          fmt.Println("Some\ttab")
          fmt.Println("A quote: \"")
          fmt.Println("What\nabout\nline\nbreaks")
      }

      Decimal 22, Hex: 255, Octal: 27
      Some tab
      A quote: "
      What
      about
      line
      breaks
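The numeric escapes work inside string literals too. The sketch below is an illustrative addition that spells characters by their byte or code point value and contrasts this with a raw string literal; the helper name escaped is made up for this example.

```go
package main

import "fmt"

// escaped (hypothetical helper) returns a string spelled entirely
// with \x byte escapes.
func escaped() string {
	return "\x41\x42\x43"
}

func main() {
	fmt.Println(escaped())    // ABC
	fmt.Println("caf\u00e9")  // café
	fmt.Println("\U0001F600") // 😀
	fmt.Println(len("\xFF"))  // 1: \xFF inserts a single byte
	fmt.Println(`Some\ttab`)  // Some\ttab: raw string literals do no escaping
}
```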

      Conclusion

      In this article we learned about escaping. Escaping in Go is very similar to other programming languages and is extensively used to encode commands or special data which cannot be directly represented by the alphabet or to represent characters which cannot be typed in the current context, or would have an undesired interpretation.

      Reference

      See Also
