Normalize Latin-1 and UTF-8 encoding name variants like the builtin tokenizer does
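
For reviewers, a minimal sketch of the kind of normalization the summary describes. It assumes "the builtin tokenizer" means CPython's, and that the rules match the ones mirrored in the standard library's `tokenize` module. The helper name `normalize_encoding_name` is hypothetical and is not taken from the changed files.

```python
def normalize_encoding_name(orig_enc: str) -> str:
    """Map spelling variants of UTF-8 and Latin-1 to one canonical name.

    Hypothetical helper for illustration; the rules are an assumption
    about what the builtin tokenizer does, not code from this change.
    """
    # Compare only the first 12 characters, case-insensitively,
    # treating "_" and "-" as the same separator.
    enc = orig_enc[:12].lower().replace("_", "-")

    # "utf-8" itself, or "utf-8-<suffix>" such as "utf-8-unix".
    if enc == "utf-8" or enc.startswith("utf-8-"):
        return "utf-8"

    # Common Latin-1 spellings, with or without a trailing "-<suffix>".
    latin_names = ("latin-1", "iso-8859-1", "iso-latin-1")
    if enc in latin_names or enc.startswith(tuple(n + "-" for n in latin_names)):
        return "iso-8859-1"

    # Anything else is returned unchanged.
    return orig_enc
```

Under these assumptions, `UTF_8` and `utf-8-unix` both map to `utf-8`, `Latin-1` maps to `iso-8859-1`, and unrelated names such as `ascii` pass through untouched.
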
3 files changed