Still trying to untangle my Quixote site that's having problems with foreign characters. It's a scientific environment, so people paste text with the degree symbol from Word documents, and the curly quotes come along too, sigh. Worse, things come from unknown character sets because Word on Windows is different from Word on Mac; other stuff comes from FileMaker, which uses different characters, etc. We've decided a wrong character is acceptable but exceptions are not. Then I also had a problem with MySQL truncating input at the first non-ASCII character, but I've got that fixed.

Now the problem is htmltext + Cheetah + str(). I made a Cheetah filter that smartly escapes non-htmltext values, and it's used throughout my application, some thirty templates.

    from Cheetah.Filters import Filter
    from quixote.html import htmlescape, htmltext

    class HtmltextFilter(Filter):
        """Safer than WebSafe: escapes values that aren't htmltext instances."""
        def filter(self, val, **kw):
            val = htmlescape(val)
            if isinstance(val, htmltext):
                return str(val)   # Cheetah > 1.0rc1 compatibility.
            else:
                return val

In this case it's trying to filter U"A\xa0B" retrieved from the database. That's "AB" with a non-ASCII character in between. (Strictly, "\xa0" is NO-BREAK SPACE and the degree sign is "\xb0", but both are outside ASCII and fail the same way.) Voila:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 1: ordinal not in range(128)

OK, let's try returning Unicode instead.

    return unicode(val, 'latin1')   # Cheetah > 1.0rc1 compatibility.

    TypeError: coercing to Unicode: need string or buffer, htmltext found

Darn it, why didn't htmltext subclass str!!! Peeking into the htmltext implementation, it stores the actual value in an attribute .s:

    return unicode(val.s, 'latin1')   # Cheetah > 1.0rc1 compatibility.

    TypeError: decoding Unicode is not supported

How about this?

    return unicode(val.s)   # Cheetah > 1.0rc1 compatibility.

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 412: ordinal not in range(128)

F**k!
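
An aside on the tracebacks so far, as a rough sketch rather than a fix: in Python 2, str() on a unicode object is an implicit encode with the default 'ascii' codec, and unicode(x, 'latin1') is a decode, which is only defined for byte strings. That would explain both the UnicodeEncodeError and the "decoding Unicode is not supported" TypeError when the underlying value is already unicode. The snippet below uses plain strings only (no Quixote or Cheetah), and the variable names are made up for illustration.

```python
# Sketch only: plain text/bytes objects, no htmltext involved.
text = u"A\xa0B"

# Python 2's str(text) behaves like this explicit ASCII encode, which fails.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    failed_at = exc.start  # index of the offending character

# Naming a codec that covers the character succeeds and yields bytes.
latin1_bytes = text.encode("latin-1")

# Decoding only applies to bytes; this round-trips to the original text.
round_trip = latin1_bytes.decode("latin-1")
```

So the choice is between keeping unicode all the way to the output layer and encoding there, or encoding explicitly with a named codec; relying on the implicit conversion is what blows up.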
OK, the trick I used in TurboGears:

    return unicode(val.s, 'latin1').encode('latin1')   # Cheetah > 1.0rc1 compatibility.

    TypeError: decoding Unicode is not supported

So how *do* you convert an htmltext object containing a non-ASCII character to either str or unicode? And how do you output it?

    >>> print htmltext(U"A\xa0B")
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 1: ordinal not in range(128)
    >>> print htmltext("A\xa0B")
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 1: ordinal not in range(128)
    >>> print htmltext("A\xa0B")

I tried sys.setdefaultencoding("latin1") but that has to be done in the 'site' module; it's not available within a program.

Another problem is that all my controller methods instantiate a template and return str(t). I'll have to change that to "return unicode(t)" or "return t.respond()" or something; I'm not sure what. Plus who knows how many htmltext objects are used as placeholder values; e.g., Quixote forms. So it looks like I'll have to make changes all over my program.

If you've been wondering why I've been making such a big deal the past few months about smart escaping in Cheetah and making Cheetah deal with Unicode, and whether/how the WebSafe filter needs to be made Unicode-friendly, this is why. It came to a head these past couple of weeks as people started posting reports with the degree symbol. On top of that, a set of notifications that come in as email started using "MASCULINE ORDINAL INDICATOR" ("\xba") instead of the proper "DEGREE SIGN" ("\xb0"), because they both look like a circle on some Windows screens. That made another program choke, because I was converting one to ASCII ("degrees") and didn't know about the other. Sigh.

-- Mike Orr
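
P.S. A minimal sketch of the "wrong character is acceptable but exceptions are not" policy, independent of Quixote and Cheetah. Everything here is an assumption for illustration: the helper names (to_text, normalize, to_ascii_html), the order of encodings tried, and the idea that a masculine ordinal indicator in this data always means a degree sign. It decodes bytes of unknown origin with a fallback that cannot fail, folds the lookalike into DEGREE SIGN, and encodes for output as ASCII with numeric character references.

```python
# Hypothetical helpers; names and the encoding order are assumptions.
DEGREE_SIGN = u"\xb0"
LOOKALIKES = {u"\xba": DEGREE_SIGN}  # MASCULINE ORDINAL INDICATOR


def to_text(value, encodings=("utf-8", "cp1252")):
    """Decode bytes of unknown origin to text without ever raising."""
    if not isinstance(value, bytes):
        return value  # assumed to be text already
    for encoding in encodings:
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this cannot fail.
    return value.decode("latin-1")


def normalize(text):
    """Replace characters assumed to be typed in place of the degree sign."""
    for wrong, right in LOOKALIKES.items():
        text = text.replace(wrong, right)
    return text


def to_ascii_html(text):
    """Encode as ASCII bytes, using numeric character references."""
    return text.encode("ascii", "xmlcharrefreplace")


sample = to_ascii_html(normalize(to_text(b"25\xbaC")))
```

A byte that is really from another character set (Mac Roman, say) will still decode to the wrong character, which is the accepted trade-off. The lookalike table is only safe where "\xba" never legitimately appears as an ordinal marker.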