How to Remove Invalid UTF-8 Characters from a String in Go
Using ToValidUTF8 Function
The strings.ToValidUTF8() function, available since Go 1.13, returns a copy of the string s with each run of invalid UTF-8 byte sequences replaced by the replacement string, which may be empty.
Passing an empty replacement string removes the invalid bytes entirely:
package main

import (
	"fmt"
	"strings"
)

func main() {
	s := "a\xc5bd"
	s = strings.ToValidUTF8(s, "")
	fmt.Printf("%q\n", s)
}

Output:
"abd"
Using Map Function
In Go 1.11+, you can do the same with strings.Map: the mapping function returns a negative value for utf8.RuneError, which tells Map to drop that character from the result:
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

func main() {
	s := "a\xc5bd"
	valid := func(r rune) rune {
		if r == utf8.RuneError {
			return -1
		}
		return r
	}
	s = strings.Map(valid, s)
	fmt.Printf("%q\n", s)
}

Output:
"abd"
Using Range
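A range loop over a string yields utf8.RuneError with a width of one byte for each invalid byte. Checking the decoded size at that position separates invalid bytes, which are skipped, from a validly encoded U+FFFD, which is kept: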
package main

import (
	"fmt"
	"unicode/utf8"
)

// ToValid returns s with every invalid UTF-8 byte removed.
func ToValid(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	v := make([]rune, 0, len(s))
	for i, r := range s {
		if r == utf8.RuneError {
			// Decode the rune that starts at i: size 1 means an invalid
			// byte, while a validly encoded U+FFFD has size 3.
			_, size := utf8.DecodeRuneInString(s[i:])
			if size == 1 {
				continue
			}
		}
		v = append(v, r)
	}
	return string(v)
}

func main() {
	s := "a\xc5b\x8ad"
	s = ToValid(s)
	fmt.Printf("%q\n", s)
}

Output:
"abd"