Update: I think I figured this out. See bottom.

Quick semi-technical explanation: a character encoding is the way that letters and glyphs are represented in a computer. For example, you might decide that in order to convert a chunk of binary data into letters, you'd first split it into byte-long chunks. Since a byte has 256 possible values, you'd then map each of these values to a different glyph, such as capital letter A or an em dash.
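To make that concrete, here is a small sketch of a one-byte-per-character mapping. Python 3 is used purely for illustration, and `table` is a name made up for this example. ISO-8859-1 serves as the example table because it assigns a character to every one of the 256 byte values.

```python
# Build the full byte -> character table for ISO-8859-1 (one byte per glyph).
table = {b: bytes([b]).decode("iso-8859-1") for b in range(256)}

print(table[0x41])  # A
print(table[0xE9])  # é

# A different table can assign a different glyph to the same byte value.
# Windows-1252 puts the Euro sign at 0x80, where ISO-8859-1 has an
# invisible control character.
print(bytes([0x80]).decode("cp1252"))  # €
print(repr(table[0x80]))               # '\x80'
```

The point is only that a byte has no meaning until you pick a table, and two programs that pick different tables will disagree about what the text says.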

I've been served pretty well by the belief that character encodings were things I'd never have to actually worry about myself; they'd always be so low-level that the operating system, or web server, or database server would just take care of it.

Well, not today.

It turns out that frassle pages filled with weird characters (like spaces and a Euro symbol where a dash should be) should show up correctly now, because I've configured my web server to serve them as UTF-8. Why this works, however, makes no sense.
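The symptom itself is easy to reproduce. The sketch below assumes Python 3 and assumes the browser reads a page labelled ISO-8859-1 as Windows-1252, which browsers commonly do; the variable names are made up for the example.

```python
dash = "\u2014"                     # EM DASH
wire_bytes = dash.encode("utf-8")   # the bytes a UTF-8 page puts on the wire
print(wire_bytes)                   # b'\xe2\x80\x94'

# Read with the wrong table: one character turns into three,
# and the middle one is the Euro sign.
misread = wire_bytes.decode("cp1252")
print(len(misread), misread[1] == "\u20ac")

# Read with the table the bytes were written in: a single dash again.
correct = wire_bytes.decode("utf-8")
print(correct == dash)
```

So a UTF-8 dash served under the wrong charset header is one plausible source of the stray Euro symbols.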

Here is the experimental data. The first two rows are the new and old configurations on the server you're using right now. The only difference is that HTTP responses are now sent with a UTF-8 charset header. The last four rows are from tests on my own development server.

The "from" and "to" columns describe how the web server is re-coding each page. Where they say none, I tested with the recoding disabled.

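Machine     | DB Coding       | from   | to         | served     | Result
------------|-----------------|--------|------------|------------|-------
production  | ISO-8859-1      | UTF-8? | ISO-8859-1 | UTF-8      | Good
"           | "               | "      | "          | ISO-8859-1 | BAD
development | UTF-8 (Unicode) | UTF-8  | ISO-8859-1 | UTF-8      | Good
"           | "               | "      | "          | ISO-8859-1 | BAD
"           | "               | none   | none       | UTF-8      | BAD
"           | "               | none   | none       | ISO-8859-1 | BAD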

Why, oh why does the page get converted to ISO-8859-1 and then get properly processed only as UTF-8? The conversion isn't just faking it; the length of the string actually changes as you'd expect in a conversion.
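For reference, this is the kind of length change a real conversion produces. It is a minimal sketch assuming Python 3; `recode` is a hypothetical helper written for this example, not the web server's actual recoding filter.

```python
def recode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(src).encode(dst)

utf8_page = "caf\u00e9".encode("utf-8")                  # é takes two bytes in UTF-8
latin1_page = recode(utf8_page, "utf-8", "iso-8859-1")   # é takes one byte in ISO-8859-1

print(len(utf8_page), len(latin1_page))  # 5 4
```

Any character outside plain ASCII shrinks or grows like this, which is why a changing length is a fair sign that the filter is doing real work.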

I'm guessing that something is wrong with the way I've been loading data into my DB. But what possible format could I actually be storing the data in such that treating it as UTF-8 and converting to ISO-8859-1 would produce correct UTF-8 encoded data?

Based on this experience, I suggest UTF-8 be renamed WTF-8. The new acronym has a structure that reminds experienced Unicode users of the original, yet includes a helpful hint to newbies that they are about to enter a world of pain and frustration.


Update later that evening: Joel on Software has an essay on this that has helped to clear my mind. Check out The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Still later: I think I know why the conversion didn't work. The text I was testing with was picked up by my aggregator. Since I was in the "what you don't know can't hurt you" camp when I wrote the aggregator, it simply took the text from RSS (which was UTF-8) and dumped it into my database (pretending it was ISO-8859-1). Something like this must have happened:
  1. Start with UTF-8 text.
  2. Put it through a totally inappropriate filter, like ISO-8859-1 to UTF-8.
  3. Now you have some incoherent garbage.
  4. But if you run that garbage through a UTF-8 to ISO-8859-1 filter, you end up with your original text, in UTF-8.
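
The four steps above can be reproduced in a few lines. This is a minimal sketch assuming Python 3; the sample string is made up, and Python's `iso-8859-1` codec stands in for whatever filters were actually involved.

```python
# Step 1: start with UTF-8 text.
original_utf8 = "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9'

# Step 2: the inappropriate filter. Treat the UTF-8 bytes as ISO-8859-1
# and "convert" them to UTF-8.
garbage = original_utf8.decode("iso-8859-1").encode("utf-8")

# Step 3: incoherent garbage. Each byte of the é became its own character.
print(garbage.decode("utf-8"))                # cafÃ©
print(len(original_utf8), len(garbage))       # 5 7

# Step 4: a UTF-8 to ISO-8859-1 filter undoes the damage exactly.
recovered = garbage.decode("utf-8").encode("iso-8859-1")
print(recovered == original_utf8)             # True
```

The output of step 4 is labelled ISO-8859-1 by the filter, but the bytes are the original UTF-8, so the page only looks right when it is served as UTF-8.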

This would explain the mystery of valid UTF-8 pouring out of a UTF-8 to ISO-8859-1 filter. I still need to find out where the first, invalid filter is, and I need to make the aggregator encoding-aware. But at least there is now a plausible explanation.
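
As a rough idea of what "encoding-aware" could mean for the aggregator, here is a sketch assuming Python 3. The function `decode_feed` and its `declared` parameter are hypothetical, and the real aggregator may be structured quite differently. The idea is to decode the feed's bytes once, using the charset the feed declares, and to carry real text from then on instead of unlabelled bytes.

```python
from typing import Optional

def decode_feed(raw: bytes, declared: Optional[str] = None) -> str:
    """Turn feed bytes into text, using the declared charset if there is one."""
    charset = declared or "utf-8"   # assumption: default to UTF-8 when unlabelled
    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Unknown charset name or bytes that don't fit it: fall back to UTF-8
        # and mark undecodable bytes visibly rather than guessing silently.
        return raw.decode("utf-8", errors="replace")

text = decode_feed("caf\u00e9".encode("utf-8"), "utf-8")
stored = text.encode("utf-8")   # encode explicitly, once, for whatever the DB expects
```

The encoding to use on the way into the database depends on how the database is configured; UTF-8 here is only a placeholder.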

Jeez, I feel so dirty having written all that software without even thinking about character encodings.