Well, if you assume UTF-8, then there is a simpler workaround. The idea
is that if you assume UTF-8 encoding, you can also immediately encode the
unicode returned from gettext. Here is a wrapper to make that easy.

import gettext

class EncodingTranslationsProxy(gettext.NullTranslations):
    # Wraps a translations object and returns encoded byte strings.
    def __init__(self, trans, encoding='utf-8'):
        self.trans = trans
        self.__encoding = encoding

    def __getattr__(self, name):
        # Delegate everything not defined here to the wrapped object.
        return getattr(self.trans, name)

    def gettext(self, message):
        # ugettext() returns unicode (Python 2); encode it right away.
        return self.trans.ugettext(message).encode(self.__encoding)

def translation(domain, locale, language):
    # 'locale' is passed to gettext.translation() as the locale directory.
    return EncodingTranslationsProxy(
        gettext.translation(domain, locale, languages=[language]))

Regards,
Maas

> For the first time, I have a Quixote application that requires
> internationalization/localization. I wanted to share some notes about
> my experiences with Unicode and PTL, and the patches I'm using to make
> it work.
>
> In the application code, the interface is expressed in English. Our
> first translation is from English to French, but I need to allow for
> possible translations into other, non-European languages.
>
> (As an aside: I've also written a simple gettext() substitute to
> facilitate my translations. Its translation function returns Unicode
> instances. For example:
>
>     >>> import xlate
>     >>> xlate.set_locale('fr')
>     >>> u = _('Next Answer')
>     >>> u
>     u'Prochaine R\xe9ponse'
>
> I use the translation function throughout my app, primarily in PTL
> modules. The key point is that it returns Unicode instances.)
>
> I don't want to juggle multiple encodings, so I'm settling on UTF-8 as
> a de facto standard encoding for my application.
>
>
> The problem
> ===========
>
> Perhaps unsurprisingly, the 'htmltext' implementations have not played
> very nicely with Unicode strings.
> Examples of expressions that will raise exceptions:
>
>     from quixote.html import htmltext
>     phrase = u'Prochaine r\xe9ponse'
>     htmltext(phrase)          # will fail
>     htmltext('%s') % phrase   # will fail
>
> Both expressions will fail with "UnicodeEncodeError: 'ascii' codec
> can't encode character u'\xe9' in position 11: ordinal not in
> range(128)". The failure occurs in both the C and Python
> implementations of htmltext.
>
> Needless to say, this is a show-stopper of a problem. I needed to get
> Unicode expressions into my PTL functions in a sensible,
> straightforward manner.
>
>
> My workaround
> =============
>
> I've managed to work around the problem, though not very
> elegantly. What I did:
>
> - assume UTF-8 encoding as a de facto standard. All request handlers
>   that return text must ensure that the response Content-Type is set to
>   "text/*; charset=UTF-8" where "*" is the appropriate subtype
>   (e.g. text/html).
>
> - patch quixote/html.py to ensure that the C implementation of
>   htmltext is not used. (This will be necessary until I get around to
>   fixing the C implementation.) Raising an ImportError at the right spot
>   does the job.
>
> - patch quixote/_py_htmltext.py so that the classes 'htmltext' and
>   '_QuoteWrapper' both anticipate Unicode input, and handle it by
>   encoding it into UTF-8 text. (_QuoteWrapper was already Unicode-aware,
>   but it chose to encode into Latin-1, which seems a short-sighted
>   if expedient solution.)
>
> The two patches appear below.
>
> Together, these three changes allow Unicode strings to work as
> expected within PTL functions (and other places where htmltext() is
> used). To be fair, my tests have been rather limited (only my
> English-to-French translations), but should be representative.
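
To make the encode-on-input idea from the workaround list concrete, here
is a minimal sketch. It is not the Quixote code and not part of either
patch: it is written in Python 3 syntax, where str plays the role that
unicode plays in the Python 2 code in this thread, and the helper name
encode_text is hypothetical. It only shows that the ASCII codec rejects
the sample phrase at index 11 while UTF-8 accepts it.

```python
# Sketch only, in Python 3 syntax: `str` here plays the role that
# `unicode` plays in the Python 2 code quoted in this message.

def encode_text(value, encoding='utf-8'):
    """Hypothetical helper: encode text to bytes, leave bytes alone."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value

phrase = 'Prochaine r\xe9ponse'

# Encoding with the ASCII codec fails on the e-acute at index 11.
try:
    encode_text(phrase, 'ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding as UTF-8 succeeds; the e-acute becomes two bytes.
utf8_bytes = encode_text(phrase)

# Header value for an HTML response, as described in the first bullet.
content_type = 'text/html; charset=UTF-8'
```

The bytes produced this way only display correctly if the response also
declares the matching charset, which is why the first bullet insists on
setting the Content-Type in every handler that returns text.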
>
>
> Problems with my approach
> =========================
>
> - although assuming UTF-8 is a reasonable standard (since all Unicode
>   characters can be expressed in UTF-8), it's not optimal for all
>   possible uses. Someone would likely prefer or require a different
>   encoding. Nonetheless, assuming UTF-8 is better than assuming
>   ASCII.
>
> - I need to explicitly set the encoding on each textual response. Not
>   too big a deal, since most responses call some kind of header()
>   function, where the content-type can be set. But in cases where
>   non-UTF-8 text is to be returned, the behaviour I've introduced may be
>   surprising. (For example, returning a text file encoded in Latin-1
>   will result in some garbage characters if the UTF-8 content-type
>   header isn't overridden.) But at least it's not magical: with an
>   underlying assumption of UTF-8, such problems become obvious and
>   can be fixed easily.
>
> - other problems? Nothing I've encountered yet, but I'm sure it can't
>   be this simple.
>
>
> * * *
>
> I would be interested in feedback from my fellow Quixote users! Have
> you encountered the same problem, and solved it in a better fashion?
> Ideally, I would like to see Quixote "fixed" so that it handles
> Unicode more gracefully, but I'm too focussed on my current app to
> know what would qualify as an appropriate general solution.
>
> -- Graham
>
>
>
> The patches
> ===========
>
>
> --- F:\quixote\html.py  Wed Dec 03 18:30:30 2003
> +++ html.py  Sat Jul 31 02:04:45 2004
> @@ -59,8 +59,9 @@
>  import urllib
>  from types import UnicodeType
>  try:
> +    raise ImportError
>      # faster C implementation
>      from quixote._c_htmltext import htmltext, htmlescape, _escape_string, \
>           TemplateIO
>  except ImportError:
>
>
>
> --- F:\quixote\_py_htmltext.py  Wed Dec 03 18:30:30 2003
> +++ _py_htmltext.py  Sat Jul 31 02:08:21 2004
> @@ -42,12 +42,14 @@
>      using entities.
>      """
>
>      __slots__ = ['s']
>
>      def __init__(self, s):
> +        if isinstance(s, unicode):
> +            s = s.encode('utf-8')
>          self.s = str(s)
>
>      # XXX make read-only
>      #def __setattr__(self, name, value):
>      #    raise AttributeError, 'immutable object'
>
> @@ -158,12 +160,14 @@
>      # helper for htmltext class __mod__
>
>      __slots__ = ['value', 'escape']
>
>      def __init__(self, value, escape):
>          self.value = value
> +        if isinstance(self.value, unicode):
> +            self.value = self.value.encode('utf-8')
>          self.escape = escape
>
>      def __str__(self):
>          return self.escape(str(self.value))
>
>      def __repr__(self):
> @@ -187,13 +191,13 @@
>      already a 'htmltext' object then the HTML markup characters \", <, >,
>      and & are first escaped.
>      """
>      if classof(s) is htmltext:
>          return s
>      elif isinstance(s, UnicodeType):
> -        s = s.encode('iso-8859-1')
> +        s = s.encode('utf-8')
>      else:
>          s = str(s)
>      # inline _escape_string for speed
>      s = s.replace("&", "&amp;") # must be done first
>      s = s.replace("<", "&lt;")
>      s = s.replace(">", "&gt;")
>
> _______________________________________________
> Quixote-users mailing list
> Quixote-users@mems-exchange.org
> http://mail.mems-exchange.org/mailman/listinfo/quixote-users
>