Golang and Unicode

First, what's a string in Go?

In Go, a string is an immutable slice of bytes.

In Go, every string literal you write in source code is UTF-8 encoded (but a string you get from a user may use a different encoding; you must know the encoding of any string you receive).

Some definitions

Unicode defines every character in the world's writing systems as a code point. UTF-8 is one way to encode these code points (there are several: UTF-16, UTF-32, UTF-7). UTF-8 is usually the most space-efficient: a character takes 1 byte (with ASCII table compatibility) or more (up to 4), whereas in UTF-16 the minimum size is 2 bytes. UTF-32 is best for speed in string manipulation: because every code point is always 4 bytes, random access is possible (reaching the n-th code point is O(1) in UTF-32 and O(n) in UTF-8). UTF-7 is a bizarre encoding that is not part of the Unicode standard and is used mostly by IMAP (in a modified form), so let's not worry about it.
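To make the size differences concrete, here is a small sketch that measures how many bytes a few code points take in each encoding. The helper name encodedSizes is made up for this example; the standard library calls are utf8.RuneLen and utf16.Encode.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodedSizes returns how many bytes r takes in UTF-8, UTF-16 and UTF-32.
// (Hypothetical helper, for illustration only.)
func encodedSizes(r rune) (u8, u16, u32 int) {
	u8 = utf8.RuneLen(r)
	u16 = 2 * len(utf16.Encode([]rune{r})) // each UTF-16 code unit is 2 bytes
	u32 = 4                                 // UTF-32 is fixed width
	return u8, u16, u32
}

func main() {
	for _, r := range []rune{'a', 'é', '你', '😀'} {
		u8, u16, u32 := encodedSizes(r)
		fmt.Printf("%c  UTF-8: %d  UTF-16: %d  UTF-32: %d\n", r, u8, u16, u32)
	}
}
```

For ASCII text UTF-8 is the clear winner; for text made mostly of characters like 你, UTF-16 can actually be smaller (2 bytes against 3).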

Go uses UTF-8 internally, and a Unicode “code point” is called a “rune” in Go.
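A quick sketch of the difference between bytes and runes, using only the standard library (the helper name byteAndRuneCount is invented for this example):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount returns the number of bytes and the number of runes
// (code points) in s. Hypothetical helper, for illustration only.
func byteAndRuneCount(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "你们好"
	b, r := byteAndRuneCount(s)
	fmt.Println(b, r)      // 9 3
	fmt.Printf("% x\n", s) // e4 bd a0 e4 bb ac e5 a5 bd
}
```

Note that len on a string counts bytes, not characters.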

Usage

You can convert a string to its rune equivalent:

// Convert str to a slice of runes (= slice of Unicode code points)
runes := []rune(str)

// Substring keeping the first five chars
// (this panics if str has fewer than 5 runes, so check len(runes) first)
firstFiveRunes := runes[:5]

Runes can be cut without worry: for example, 你们好 can be cut without any risk of splitting a character that is more than 1 byte long in half.

You should always use runes to manipulate strings safely!

If we had done str[:5] instead, it would have failed badly on the string “你们好”: each of these Chinese characters is 3 bytes long, so cutting at byte 5 does not give 5 characters, and the resulting string is invalid because the character 们 is cut in half.

Cutting a multi-byte character in half results in a �, which means that the character is unknown or the encoding is wrong.

UTF-8 and the for range loop

The for range loop decodes UTF-8, so we can use it to take a substring without allocating additional memory for a slice of runes:

// Use a for range loop to do a simple substring on a UTF-8 string
import "fmt"

func UTF8substring(str string, start int, stop int) string {
  if start < 0 {
    start = 0
  }
  if stop <= start {
    return ""
  }
  startByteIndex := len(str) // byte offset of the first rune to keep
  z := 0 // z is the index of the current rune
  for j := range str { // j is the byte offset of the current rune
    if z == start {
      startByteIndex = j // Mark the offset where we want to start
    }
    if z == stop {
      return str[startByteIndex:j] // Stop here: we have reached the stop index
    }

    // j grows depending on the content of str:
    // → +1 if the char is 1 byte long, +2 if it is 2 bytes long, etc.
    // so we use z to count chars
    z++
  }

  return str[startByteIndex:] // stop is bigger than the number of runes in str
}

// usage:
func main() {
  s := "你们好" // a string literal in Go source is always UTF-8 encoded
  fmt.Println(UTF8substring(s, 0, 2)) // Will print 你们
  fmt.Println(UTF8substring(s, 1, 3)) // Will print 们好
}

This is a specific behavior of for range. It does not apply to a classic for loop with an index, which just loops over the byte slice.
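To see the difference, this small sketch runs both kinds of loop over the same string. The counting helpers rangeSteps and indexSteps are made up for this example.

```go
package main

import "fmt"

// rangeSteps counts the iterations of a for range loop: one per rune.
func rangeSteps(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

// indexSteps counts the iterations of a classic for loop: one per byte.
func indexSteps(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		n++
	}
	return n
}

func main() {
	s := "你们好"
	for i, r := range s {
		fmt.Printf("byte offset %d: %c\n", i, r) // offsets 0, 3 and 6
	}
	for i := 0; i < len(s); i++ {
		fmt.Printf("byte %d: %x\n", i, s[i]) // 9 raw bytes
	}
	fmt.Println(rangeSteps(s), indexSteps(s)) // 3 9
}
```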

And what if the user input is in another encoding (i.e. not UTF-8)?

Good question.

The best thing to do is to use x/text to convert everything to UTF-8.

Example converting from Windows-1250:


// Use https://godoc.org/golang.org/x/text/encoding +
// https://godoc.org/golang.org/x/text/encoding/charmap
// to decode Windows-1250 to UTF-8 … and encode if you really want to do this
import "golang.org/x/text/encoding/charmap"

// DecodeWindows1250 converts Windows-1250 bytes to a UTF-8 string
func DecodeWindows1250(toDecode []byte) (string, error) {
  decoder := charmap.Windows1250.NewDecoder()
  out, err := decoder.Bytes(toDecode)
  if err != nil {
    return "", err
  }
  return string(out), nil
}

// EncodeWindows1250 converts a UTF-8 string to Windows-1250 bytes
func EncodeWindows1250(inp string) ([]byte, error) {
  encoder := charmap.Windows1250.NewEncoder()
  // returns an error if inp has a character Windows-1250 cannot represent
  return encoder.Bytes([]byte(inp))
}
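Before decoding, it can help to check whether the input is already valid UTF-8. The sketch below uses utf8.Valid from the standard library; the helper name needsDecoding is invented for this example. Keep in mind that this check can only tell you that the bytes are not UTF-8. It cannot tell you which encoding they are in, and pure ASCII input passes the check whatever the original encoding was.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// needsDecoding reports whether input is not valid UTF-8, meaning it has to
// be decoded from some other encoding first. Hypothetical helper.
func needsDecoding(input []byte) bool {
	return !utf8.Valid(input)
}

func main() {
	win1250 := []byte{'c', 'a', 'f', 0xE9} // "café" in Windows-1250 (0xE9 is é)
	utf8Bytes := []byte("café")            // same word, UTF-8 encoded

	fmt.Println(needsDecoding(win1250))   // true
	fmt.Println(needsDecoding(utf8Bytes)) // false
}
```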

Use tools to manage encodings: see x/text

Note that there is also go-charset (I didn't use it in this example) Ref

The last trap with Unicode: normalization

The problem is that Unicode has 2 ways to represent some characters. For example, “French” in French is “français”; notice the character ç, which is not in the ASCII table. This character can be represented in NFC (Normalization Form Canonical Composition) or in NFD (Normalization Form Canonical Decomposition). The difference is:

  • In NFC ç is simply one code point: U+00E7

  • In NFD ç is composed of 2 code points: U+0063 (the letter c) and U+0327 (the combining cedilla)
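The trap shows up as soon as you compare strings. The sketch below writes both forms by hand with escape sequences and uses only the standard library; the helper name countRunesAndBytes is invented for this example. It does not normalize anything, it only shows that the two forms look identical but are different strings.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRunesAndBytes returns the number of runes and bytes in s.
// Hypothetical helper, for illustration only.
func countRunesAndBytes(s string) (int, int) {
	return utf8.RuneCountInString(s), len(s)
}

func main() {
	nfc := "fran\u00e7ais"  // ç as a single code point
	nfd := "franc\u0327ais" // c followed by a combining cedilla

	fmt.Println(nfc == nfd) // false, although both display as français

	r1, b1 := countRunesAndBytes(nfc)
	r2, b2 := countRunesAndBytes(nfd)
	fmt.Println(r1, b1) // 8 9
	fmt.Println(r2, b2) // 9 10
}
```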

This problem is well described in the Unicode standard

See the Go blog for a solution: text normalization

More reading on charset