How to Remove Invalid UTF-8 Characters from a String in Go

Using the ToValidUTF8 Function

The strings.ToValidUTF8() function (available since Go 1.13) returns a copy of the string s with each run of invalid UTF-8 byte sequences replaced by the replacement string, which may be empty. Passing an empty replacement removes the invalid bytes entirely.

The following example removes the invalid byte \xc5 from a string:

package main

import (
  "fmt"
  "strings"
)

func main() {
  s := "a\xc5bd"

  s = strings.ToValidUTF8(s, "")
  fmt.Printf("%q\n", s)
}

Output:

"abd"

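Using the Map Function

In Go 1.11+, you can get a similar result with strings.Map and utf8.RuneError. The mapping function receives utf8.RuneError for each invalid byte, and returning a negative value drops that rune. Note that a validly encoded U+FFFD replacement character also equals utf8.RuneError, so this approach removes it as well: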

package main

import (
  "fmt"
  "strings"
  "unicode/utf8"
)

func main() {
  s := "a\xc5bd"

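  // valid drops utf8.RuneError; a negative return value tells
  // strings.Map to omit the rune from the result.
  valid := func(r rune) rune {
    if r == utf8.RuneError {
      return -1
    }
    return r
  }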
  s = strings.Map(valid, s)
  fmt.Printf("%q\n", s)
}

Output:

"abd"

Using Range

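You can also range over the string yourself. A range loop yields utf8.RuneError for each invalid byte, and an invalid byte always decodes with a size of 1, whereas a validly encoded U+FFFD has a size of 3. Checking the size therefore removes only the invalid bytes. For example,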

package main

import (
  "fmt"
  "unicode/utf8"
)

func ToValid(s string) string {

  if utf8.ValidString(s) {
    return s
  }

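  v := make([]rune, 0, len(s))
  for i, r := range s {
    if r == utf8.RuneError {
      // Decode the rune that starts at position i. A size of 1 means an
      // invalid byte; a validly encoded U+FFFD has a size of 3 and is kept.
      _, size := utf8.DecodeRuneInString(s[i:])
      if size == 1 {
        continue
      }
    }

    v = append(v, r)
  }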

  return string(v)
}

func main() {
  s := "a\xc5b\x8ad"

  s = ToValid(s)
  fmt.Printf("%q\n", s)
}

Output:

"abd"
