Re: [Quixote-users] Work-around for Unicode in PTL

Work-around for Unicode in PTL
Maas-Maarten Zeeman
2004-08-01
Well, if you assume UTF-8, there is a simpler workaround. The idea is
that if you assume UTF-8 encoding, you can also encode the unicode
returned from gettext immediately, at the point of translation. Here is
a wrapper to make that easy.

import gettext

class EncodingTranslationsProxy(gettext.NullTranslations):
    def __init__(self, trans, encoding='utf-8'):
        self.trans = trans
        self.__encoding = encoding

    def __getattr__(self, name):
        return getattr(self.trans, name)

    def gettext(self, message):
        return self.trans.ugettext(message).encode(self.__encoding)


def translation(domain, locale, language):
    return EncodingTranslationsProxy(gettext.translation(domain,
                                                         locale,
                                                         languages=[language]))

Regards,

Maas

> For the first time, I have a Quixote application that requires
> internationalization/localization. I wanted to share some notes about
> my experiences with Unicode and PTL, and the patches I'm using to make
> it work.
>
> In the application code, the interface is expressed in English. Our
> first translation is from English to French, but I need to allow for
> possible translations into other, non-European languages.
>
> (As an aside: I've also written a simple gettext() substitute to
> facilitate my translations. Its translation function returns Unicode
> instances. For example:
>
>     >>> import xlate
>     >>> xlate.set_locale('fr')
>     >>> u = _('Next Answer')
>     >>> u
>     u'Prochaine R\xe9ponse'
>
> I use the translation function throughout my app, primarily in PTL
> modules. The key point is that it returns Unicode instances.)
>
> I don't want to juggle multiple encodings, so I'm settling on UTF-8 as
> a de facto standard encoding for my application.
>
>
> The problem
> ===========
>
> Perhaps unsurprisingly, the 'htmltext' implementations have not played
> very nicely with Unicode strings. Examples of expressions that will
> raise exceptions:
>
>     from quixote.html import htmltext
>     phrase = u'Prochaine r\xe9ponse'
>     htmltext(phrase)                                 # will fail
>     htmltext('%s') % phrase                          # will fail
>
> Both expressions will fail with "UnicodeEncodeError: 'ascii' codec
> can't encode character u'\xe9' in position 11: ordinal not in
> range(128)". The failure occurs in both the C and Python
> implementations of htmltext.
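
The failure can be reproduced without Quixote. This is a minimal
sketch, assuming the cause is the implicit ASCII encoding that str()
applies to a unicode object in Python 2; the explicit .encode('ascii')
below stands in for that implicit step.

```python
phrase = u'Prochaine r\xe9ponse'

try:
    # Stands in for the implicit str(phrase) inside htmltext (assumption).
    phrase.encode('ascii')
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start       # index of the offending character

# Encoding explicitly to UTF-8 succeeds and yields a plain byte string.
encoded = phrase.encode('utf-8')
```

The recorded index matches the "position 11" in the error message
quoted above.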
>
> Needless to say, this is a show-stopper of a problem. I needed to get
> Unicode expressions into my PTL functions in a sensible,
> straightforward manner.
>
>
> My workaround
> =============
>
> I've managed to work around the problem, though not very
> elegantly. What I did:
>
> - assume UTF-8 encoding as a de facto standard. All request handlers
> that return text must ensure that the response Content-Type is set to
> "text/*; charset=UTF-8" where "*" is the appropriate subtype
> (e.g. text/html).
>
> - patch quixote/html.py to ensure that the C implementation of
> htmltext is not used. (This will be necessary until I get around to
> fixing the C implementation.) Raising an ImportError at the right spot
> does the job.
>
> - patch quixote/_py_htmltext.py so that the classes 'htmltext' and
> '_QuoteWrapper' both anticipate Unicode input, and handle it by
> encoding it into UTF-8 text. (_QuoteWrapper was already Unicode-aware,
> but it chose to encode into Latin-1, which seems a short-sighted
> if expedient solution.)
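
The first and third steps can be sketched in isolation. Both function
names here, utf8_content_type and to_utf8, are hypothetical helpers
and not part of Quixote; they only illustrate the rule "declare UTF-8,
and encode unicode to UTF-8 before it is stored".

```python
text_type = type(u'')   # unicode on Python 2, str on Python 3


def utf8_content_type(subtype):
    """Hypothetical helper: build the Content-Type value for text/<subtype>."""
    return 'text/%s; charset=UTF-8' % subtype


def to_utf8(value):
    """Hypothetical helper: return value as a UTF-8 encoded byte string."""
    if isinstance(value, text_type):
        return value.encode('utf-8')
    if isinstance(value, bytes):
        # Already a byte string; assumed to be UTF-8 (or ASCII), left as-is.
        return value
    # Numbers and other objects: stringify first, then encode.
    return text_type(value).encode('utf-8')


header = utf8_content_type('html')
body = to_utf8(u'Prochaine r\xe9ponse')
```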
>
> The two patches appear below.
>
> These three changes allow Unicode strings to work as expected within
> PTL functions (and other places where htmltext() is used). To be
> fair, my tests have been rather limited (only my English-to-French
> translations), but they should be representative.
>
>
> Problems with my approach
> =========================
>
> - although assuming UTF-8 is a reasonable standard (since all Unicode
> characters can be expressed in UTF-8), it's not optimal for all
> possible uses. Someone would likely prefer or require a different
> encoding. Nonetheless, assuming UTF-8 is better than assuming
> ASCII.
>
> - I need to explicitly set the encoding on each textual response. Not
> too big a deal, since most responses call some kind of header()
> function, where the content-type can be set. But in cases where
> non-UTF-8 text is to be returned, the behaviour I've introduced may be
> surprising. (For example, returning a text file encoded in Latin-1
> will result in some garbage characters if the UTF-8 content-type
> header isn't overridden.) But at least it's not magical: with an
> underlying assumption of UTF-8, such problems become obvious and
> can be fixed easily.
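
The Latin-1 case can be illustrated with a short sketch. It assumes
the client decodes strictly according to the declared charset and
substitutes a replacement character for invalid bytes, which is a
simplification of what real browsers do.

```python
text = u'r\xe9ponse'
latin1_bytes = text.encode('iso-8859-1')    # what a Latin-1 file contains

# A client told "charset=UTF-8" decodes those bytes as UTF-8. The byte
# 0xE9 announces a multi-byte sequence but no continuation bytes follow,
# so it is replaced with U+FFFD (the "garbage character").
seen_by_client = latin1_bytes.decode('utf-8', 'replace')

# Declaring the real encoding instead round-trips cleanly.
seen_with_right_header = latin1_bytes.decode('iso-8859-1')
```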
>
> - other problems? Nothing I've encountered yet, but I'm sure it can't
> be this simple.
>
>
> * * *
>
> I would be interested in feedback from my fellow Quixote users! Have
> you encountered the same problem, and solved it in a better fashion?
> Ideally, I would like to see Quixote "fixed" so that it handles
> Unicode more gracefully, but I'm too focussed on my current app to
> know what would qualify as an appropriate general solution.
>
> -- Graham
>
>
>
> The patches
> ===========
>
>
>   --- F:\quixote\html.py    Wed Dec 03 18:30:30 2003
>   +++ html.py    Sat Jul 31 02:04:45 2004
>   @@ -59,8 +59,9 @@
>    import urllib
>    from types import UnicodeType
>    try:
>   +    raise ImportError
>        # faster C implementation
>        from quixote._c_htmltext import htmltext, htmlescape, _escape_string, \
>            TemplateIO
>    except ImportError:
>
>
>
>   --- F:\quixote\_py_htmltext.py    Wed Dec 03 18:30:30 2003
>   +++ _py_htmltext.py    Sat Jul 31 02:08:21 2004
>   @@ -42,12 +42,14 @@
>        using entities.
>        """
>
>        __slots__ = ['s']
>
>        def __init__(self, s):
>   +        if isinstance(s, unicode):
>   +            s = s.encode('utf-8')
>            self.s = str(s)
>
>        # XXX make read-only
>        #def __setattr__(self, name, value):
>        #    raise AttributeError, 'immutable object'
>
>   @@ -158,12 +160,14 @@
>        # helper for htmltext class __mod__
>
>        __slots__ = ['value', 'escape']
>
>        def __init__(self, value, escape):
>            self.value = value
>   +        if isinstance(self.value, unicode):
>   +            self.value = self.value.encode('utf-8')
>            self.escape = escape
>
>        def __str__(self):
>            return self.escape(str(self.value))
>
>        def __repr__(self):
>   @@ -187,13 +191,13 @@
>        already a 'htmltext' object then the HTML markup characters \", <, >,
>        and & are first escaped.
>        """
>        if classof(s) is htmltext:
>            return s
>        elif isinstance(s,  UnicodeType):
>   -        s = s.encode('iso-8859-1')
>   +        s = s.encode('utf-8')
>        else:
>            s = str(s)
>        # inline _escape_string for speed
>        s = s.replace("&", "&amp;") # must be done first
>        s = s.replace("<", "&lt;")
>        s = s.replace(">", "&gt;")
>
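
On the "must be done first" comment in the escaping code: a small
sketch of why the ampersand replacement has to come before the others.
Both functions are hypothetical illustrations written for this
message, not Quixote's own _escape_string.

```python
def escape_markup(s):
    """Escape HTML markup characters, ampersand first."""
    s = s.replace("&", "&amp;")     # must be done first
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    s = s.replace('"', "&quot;")
    return s


def escape_wrong_order(s):
    """Same idea with the order swapped, to show the double escaping."""
    s = s.replace("<", "&lt;")
    s = s.replace("&", "&amp;")     # also hits the '&' in '&lt;'
    return s


good = escape_markup('a < b & c')
bad = escape_wrong_order('a < b & c')
```

With the order swapped, the ampersand introduced by "&lt;" is escaped
a second time, so the browser would display the entity text instead
of "<".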