
Stripping Invalid UTF-8

In Ruby 1.9 strings aren’t just dumb collections of bytes - there is an associated encoding which tells ruby how to interpret those bytes. With some encodings anything goes: any byte sequence is legal (although I suppose statistical analysis could show it to be unlikely, given the language), but with encodings such as UTF-8 some sequences are invalid. If you’ve got a string containing such a sequence then at some point you’ll get the dreaded “invalid byte sequence in UTF-8” error message (you can use valid_encoding? to test the validity of a string’s encoding rather than waiting for it to blow up).
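For example, \xa9 is a UTF-8 continuation byte that can’t start a character, so a string containing it on its own is invalid:

"Caf\xc3\xa9".force_encoding('UTF-8').valid_encoding? #=> true
"Caf\xa9".force_encoding('UTF-8').valid_encoding?     #=> false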

Most of the time this is because the string isn’t UTF-8 in the first place, so you need to tell ruby the correct encoding using force_encoding (either because you know the encoding or by using a gem such as charguess). A stack overflow question got me thinking about another case: you have a string that is mostly UTF-8 but which has been mangled in some way. The best case scenario is obviously working out the mangling and reversing it, but sometimes you might just want to cut your losses and salvage what is there.
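If you do know (or can guess) the real encoding, the fix is a one-liner - a sketch, assuming the bytes turn out to be Latin-1:

bytes = "Caf\xe9"   # é encoded as ISO-8859-1
bytes.force_encoding('ISO-8859-1').encode('UTF-8')
#=> "Café"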

String’s encode method already takes a hash of options that controls what happens when a transcode can’t be done perfectly, for example:

"Caf\xc3\xa9".force_encoding('UTF-8').encode('ascii', :undef => :replace, :replace => '?')
#=> "Caf?"

Plain ascii can’t represent é, so it is replaced by a question mark. We can also deal with invalid UTF-8 sequences in this way:

"Caf\xa9".force_encoding('UTF-8').encode('ascii', :invalid => :replace, :replace => '?')
#=> "Caf?"

But what if we aren’t interested in changing the encoding? One might naïvely try

"Caf\xa9".force_encoding('UTF-8').encode('UTF-8', :invalid => :replace, :replace => '?')

but this doesn’t actually do anything on MRI. If you look at the str_transcode0 function in transcode.c you can see that when the source and destination encodings are the same, ruby skips the conversion entirely and the replacement options never come into play.
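You can check this easily enough (on MRI 1.9 the invalid byte survives untouched):

s = "Caf\xa9".force_encoding('UTF-8')
s.encode('UTF-8', :invalid => :replace, :replace => '?').valid_encoding?
#=> false

Going via an encoding that does require a real conversion works, so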

"Caf\xa9".force_encoding('UTF-8').
          encode('ascii', :invalid => :replace, :replace => '?', :undef => :replace).
          encode('UTF-8')

produces the right output in this particular case, but it would also mangle any valid non-ascii sequences.
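For example, given a string where the é is perfectly valid UTF-8, the ascii round trip flattens it along with the junk:

"Le Caf\xc3\xa9 \xa9".force_encoding('UTF-8').
      encode('ascii', :invalid => :replace, :replace => '?', :undef => :replace).
      encode('UTF-8')
#=> "Le Caf? ?"

You can however do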

"Le Caf\xc3\xa9 \xa9".force_encoding('UTF-8').
      encode('UTF-16', :invalid => :replace, :replace => '?').
      encode('UTF-8')
#=> "Le Café ?"

We first encode into UTF-16, which requires an actual conversion, so the :invalid handling actually kicks in. We then convert back to UTF-8. Since UTF-8 and UTF-16 are just different representations of the same Unicode character set we don’t lose any information other than the invalid sequences.
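If you need this in more than one place it wraps up neatly in a small helper - just a sketch, and sanitize_utf8 is a name I’ve picked rather than anything standard:

def sanitize_utf8(string, replacement = '?')
  # dup first, since force_encoding mutates its receiver
  string.dup.force_encoding('UTF-8').
         encode('UTF-16', :invalid => :replace, :replace => replacement).
         encode('UTF-8')
end

sanitize_utf8("Le Caf\xc3\xa9 \xa9") #=> "Le Café ?"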

This does feel a little hacky. It would be nicer to just make one call to encode! rather than having to go a slightly circuitous route. RubySpec doesn’t seem to define what should happen in this case, although JRuby 1.7 has what I would consider the more helpful behaviour: it honours the replacement options even when the source and destination encodings are the same.