For the first time, I have a Quixote application that requires
internationalization/localization. I wanted to share some notes about my
experiences with Unicode and PTL, and the patches I'm using to make it work.

In the application code, the interface is expressed in English. Our first
translation is from English to French, but I need to allow for possible
translations into other, non-European languages.

(As an aside: I've also written a simple gettext() substitute to facilitate
my translations. Its translation function returns Unicode instances. For
example:

    >>> import xlate
    >>> xlate.set_locale('fr')
    >>> u = _('Next Answer')
    >>> u
    u'Prochaine R\xe9ponse'

I use the translation function throughout my app, primarily in PTL modules.
The key point is that it returns Unicode instances.)

I don't want to juggle multiple encodings, so I'm settling on UTF-8 as a
de-facto standard encoding for my application.


The problem
===========

Perhaps unsurprisingly, the 'htmltext' implementations have not played very
nicely with Unicode strings. Examples of expressions that will raise
exceptions:

    from quixote.html import htmltext
    phrase = u'Prochaine r\xe9ponse'
    htmltext(phrase)           # will fail
    htmltext('%s') % phrase    # will fail

Both expressions will fail with "UnicodeEncodeError: 'ascii' codec can't
encode character u'\xe9' in position 11: ordinal not in range(128)". The
failure occurs in both the C and Python implementations of htmltext.

Needless to say, this is a show-stopper of a problem. I needed to get
Unicode expressions into my PTL functions in a sensible, straightforward
manner.


My workaround
=============

I've managed to work around the problem, though not very elegantly. What I
did:

- assume UTF-8 encoding as a de facto standard. All request handlers that
  return text must ensure that the response Content-Type is set to
  "text/*; charset=UTF-8" where "*" is the appropriate subtype (e.g.
  text/html).
- patch quixote/html.py to ensure that the C implementation of htmltext is
  not used. (This will be necessary until I get around to fixing the C
  implementation.) Raising an ImportError at the right spot does the job.

- patch quixote/_py_htmltext.py so that the classes 'htmltext' and
  '_QuoteWrapper' both anticipate Unicode input, and handle it by encoding
  it into UTF-8 text. (_QuoteWrapper was already Unicode-aware, but it chose
  to encode into Latin-1, which seems a short-sighted if expedient
  solution.)

The two patches appear below.

Together, these three changes allow Unicode strings to work as expected
within PTL functions (and other places where htmltext() is used). To be
fair, my tests have been rather limited (only my English-to-French
translations), but they should be representative.


Problems with my approach
=========================

- although assuming UTF-8 is a reasonable standard (since all Unicode
  characters can be expressed in UTF-8), it's not optimal for all possible
  uses. Someone would likely prefer or require a different encoding.
  Nonetheless, assuming UTF-8 is better than assuming ASCII.

- I need to explicitly set the encoding on each textual response. Not too
  big a deal, since most responses call some kind of header() function,
  where the content-type can be set. But in cases where non-UTF-8 text is to
  be returned, the behaviour I've introduced may be surprising. (For
  example, returning a text file encoded in Latin-1 will result in some
  garbage characters if the UTF-8 content-type header isn't overridden.) But
  at least it's not magical: with an underlying assumption of UTF-8, such
  problems become obvious and can be fixed easily.

- other problems? Nothing I've encountered yet, but I'm sure it can't be
  this simple.

* * *

I would be interested in feedback from my fellow Quixote users! Have you
encountered the same problem, and solved it in a better fashion?
Ideally, I would like to see Quixote "fixed" so that it handles Unicode more
gracefully, but I'm too focussed on my current app to know what would
qualify as an appropriate general solution.

-- Graham


The patches
===========

--- F:\quixote\html.py Wed Dec 03 18:30:30 2003
+++ html.py Sat Jul 31 02:04:45 2004
@@ -59,8 +59,9 @@
 import urllib
 from types import UnicodeType
 
 try:
+    raise ImportError
     # faster C implementation
     from quixote._c_htmltext import htmltext, htmlescape, _escape_string, \
         TemplateIO
 except ImportError:

--- F:\quixote\_py_htmltext.py Wed Dec 03 18:30:30 2003
+++ _py_htmltext.py Sat Jul 31 02:08:21 2004
@@ -42,12 +42,14 @@
     using entities.
     """
 
     __slots__ = ['s']
 
     def __init__(self, s):
+        if isinstance(s, unicode):
+            s = s.encode('utf-8')
         self.s = str(s)
 
     # XXX make read-only
     #def __setattr__(self, name, value):
     #    raise AttributeError, 'immutable object'
 
@@ -158,12 +160,14 @@
     # helper for htmltext class __mod__
 
     __slots__ = ['value', 'escape']
 
     def __init__(self, value, escape):
         self.value = value
+        if isinstance(self.value, unicode):
+            self.value = self.value.encode('utf-8')
         self.escape = escape
 
     def __str__(self):
         return self.escape(str(self.value))
 
     def __repr__(self):
@@ -187,13 +191,13 @@
     already a 'htmltext' object then the HTML markup characters \", <, >,
     and & are first escaped.
     """
     if classof(s) is htmltext:
         return s
     elif isinstance(s, UnicodeType):
-        s = s.encode('iso-8859-1')
+        s = s.encode('utf-8')
     else:
         s = str(s)
     # inline _escape_string for speed
     s = s.replace("&", "&amp;") # must be done first
     s = s.replace("<", "&lt;")
     s = s.replace(">", "&gt;")
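
Appendix: trying the idea outside Quixote
=========================================

The workaround boils down to encoding Unicode to UTF-8 at the point where
text enters the HTML layer, and declaring that encoding in the Content-Type
header. The following is a minimal standalone sketch of that idea, not
Quixote code and not a replacement for the patches above. The names to_utf8,
escape_utf8 and content_type are hypothetical, invented for this
illustration, and the sketch assumes every byte string it is given is
already UTF-8.

```python
# Standalone illustration only; these helpers are hypothetical and are
# not part of Quixote.

text_type = type(u'')  # 'unicode' on Python 2, 'str' on Python 3


def to_utf8(value):
    """Return value as a UTF-8 byte string, encoding Unicode input."""
    if isinstance(value, text_type):
        return value.encode('utf-8')
    if isinstance(value, bytes):
        return value  # assumption: byte strings are already UTF-8
    return text_type(value).encode('utf-8')


def escape_utf8(value):
    """Escape HTML markup characters and return UTF-8 bytes."""
    if isinstance(value, bytes):
        text = value.decode('utf-8')  # assumption: input bytes are UTF-8
    else:
        text = text_type(value)
    text = text.replace(u'&', u'&amp;')  # must be done first
    text = text.replace(u'<', u'&lt;')
    text = text.replace(u'>', u'&gt;')
    text = text.replace(u'"', u'&quot;')
    return text.encode('utf-8')


def content_type(subtype):
    """Build a Content-Type value that declares UTF-8, e.g. for text/html."""
    return 'text/%s; charset=UTF-8' % subtype
```

With these helpers, to_utf8(u'Prochaine r\xe9ponse') gives the bytes
'Prochaine r\xc3\xa9ponse', and content_type('html') gives
"text/html; charset=UTF-8". Escaping before encoding is safe here because
the markup characters are all ASCII and so are unchanged by UTF-8.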