[Quixote-users] Work-around for Unicode in PTL

Work-around for Unicode in PTL
Graham Fawcett
2004-07-31
For the first time, I have a Quixote application that requires
internationalization/localization. I wanted to share some notes about
my experiences with Unicode and PTL, and the patches I'm using to make
it work.

In the application code, the interface is expressed in English. Our
first translation is from English to French, but I need to allow for
possible translations into other, non-European languages.

(As an aside: I've also written a simple gettext() substitute to
facilitate my translations. Its translation function returns Unicode
instances. For example:

     >>> import xlate
     >>> xlate.set_locale('fr')
     >>> u = _('Next Answer')
     >>> u
     u'Prochaine R\xe9ponse'

I use the translation function throughout my app, primarily in PTL
modules. The key point is that it returns Unicode instances.)
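To illustrate the shape of such a translation function, here is a minimal sketch. It is not the actual xlate module, whose implementation is not shown in this message; the names CATALOGS, set_locale and translate, and the dictionary-backed catalogue, are assumptions made for illustration only. The point it demonstrates is simply that lookups return Unicode text.

```python
# Hypothetical stand-in for a gettext()-style translator. This is NOT the
# real 'xlate' module; every name here is assumed for illustration.
text_type = type(u'')  # 'unicode' on Python 2, 'str' on Python 3

# One catalogue per locale, mapping source phrases to translated text.
CATALOGS = {
    'fr': {u'Next Answer': u'Prochaine R\xe9ponse'},
}

_state = {'locale': None}


def set_locale(code):
    """Select the catalogue used by later translate() calls."""
    _state['locale'] = code


def translate(message):
    """Return the translation as Unicode text, or the message itself
    (converted to Unicode text) when no translation is known."""
    catalog = CATALOGS.get(_state['locale'], {})
    return catalog.get(message, text_type(message))
```

Falling back to the original phrase keeps untranslated strings visible in the interface instead of raising an error.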

I don't want to juggle multiple encodings, so I'm settling on UTF-8 as
a de facto standard encoding for my application.


The problem
===========

Perhaps unsurprisingly, the 'htmltext' implementations have not played
very nicely with Unicode strings. Examples of expressions that will
raise exceptions:

     from quixote.html import htmltext
     phrase = u'Prochaine r\xe9ponse'
     htmltext(phrase)                                 # will fail
     htmltext('%s') % phrase                          # will fail

Both expressions will fail with "UnicodeEncodeError: 'ascii' codec
can't encode character u'\xe9' in position 11: ordinal not in
range(128)". The failure occurs in both the C and Python
implementations of htmltext.
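The error can be reproduced without Quixote. The sketch below assumes that the failure comes from an implicit conversion with the ASCII codec, which is what str() does to a Unicode instance on Python 2 with the default encoding; it makes that conversion explicit and contrasts it with a UTF-8 encode. The variable names are chosen for illustration.

```python
phrase = u'Prochaine r\xe9ponse'

# Encoding with the ASCII codec fails on the non-ASCII character.
# (Assumption: this mirrors the implicit str() conversion on Python 2.)
try:
    phrase.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8_bytes = phrase.encode('utf-8')
```

The 'error' object records the offending position in its 'start' attribute, which is a quick way to locate the character that triggered the failure.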

Needless to say, this is a show-stopper of a problem. I needed to get
Unicode expressions into my PTL functions in a sensible,
straightforward manner.


My workaround
=============

I've managed to work around the problem, though not very
elegantly. What I did:

- assume UTF-8 encoding as a de facto standard. All request handlers
that return text must ensure that the response Content-Type is set to
"text/*; charset=UTF-8" where "*" is the appropriate subtype
(e.g. text/html).

- patch quixote/html.py to ensure that the C implementation of
htmltext is not used. (This will be necessary until I get around to
fixing the C implementation.) Raising an ImportError at the right spot
does the job.

- patch quixote/_py_htmltext.py so that the classes 'htmltext' and
'_QuoteWrapper' both anticipate Unicode input, and handle it by
encoding it into UTF-8 text. (The module's escaping helper, changed in
the last hunk of the patch, was already Unicode-aware, but it chose to
encode into Latin-1, which seems a short-sighted if expedient
solution.)

The two patches appear below.

Together, these three changes let Unicode strings work within PTL
functions (and other places where htmltext() is used). To be fair, my
tests have been rather limited (only my English-to-French
translations), but they should be representative.
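The encode-then-escape behaviour described above can be sketched as a standalone function. This is not Quixote code: the name escape_utf8 is an assumption made for illustration, and the function returns a byte string rather than an htmltext instance. It only shows the order of operations, that is, Unicode input is encoded into UTF-8 first and the markup characters are escaped afterwards.

```python
text_type = type(u'')  # 'unicode' on Python 2, 'str' on Python 3


def escape_utf8(value):
    """Hypothetical helper: return UTF-8 bytes with HTML markup escaped."""
    if isinstance(value, text_type):
        data = value.encode('utf-8')             # Unicode text -> UTF-8 bytes
    elif isinstance(value, bytes):
        data = value                             # assumed to be UTF-8 already
    else:
        data = text_type(value).encode('utf-8')  # numbers, other objects
    data = data.replace(b'&', b'&amp;')          # must be done first
    data = data.replace(b'<', b'&lt;')
    data = data.replace(b'>', b'&gt;')
    data = data.replace(b'"', b'&quot;')
    return data
```

Escaping after encoding is safe with UTF-8 because the bytes of a multi-byte sequence never collide with the ASCII characters being replaced.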


Problems with my approach
=========================

- although assuming UTF-8 is a reasonable standard (since all Unicode
characters can be expressed in UTF-8), it's not optimal for all
possible uses. Someone would likely prefer or require a different
encoding. Nonetheless, assuming UTF-8 is better than assuming
ASCII.

- I need to explicitly set the encoding on each textual response. Not
too big a deal, since most responses call some kind of header()
function, where the content-type can be set. But in cases where
non-UTF-8 text is to be returned, the behaviour I've introduced may be
surprising. (For example, returning a text file encoded in Latin-1
will result in some garbage characters if the UTF-8 content-type
header isn't overridden.) But at least it's not magical: with an
underlying assumption of UTF-8, such problems become obvious and
can be fixed easily.

- other problems? Nothing I've encountered yet, but I'm sure it can't
be this simple.
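The Latin-1 case from the second point can be reproduced in a few lines. The sketch below assumes a client that decodes the body according to the declared charset; the helper text_content_type is a hypothetical name, not part of Quixote.

```python
def text_content_type(subtype, charset='UTF-8'):
    """Hypothetical helper: build a Content-Type value for a text response."""
    return 'text/%s; charset=%s' % (subtype, charset)


# A body that is really Latin-1, such as a text file served as-is.
latin1_bytes = u'r\xe9ponse'.encode('iso-8859-1')

# What a client sees if the response still claims charset=UTF-8:
# the invalid byte is replaced by the Unicode replacement character.
garbled = latin1_bytes.decode('utf-8', 'replace')

# What it sees once the header names the real encoding.
intended = latin1_bytes.decode('iso-8859-1')
```

Overriding the header for such a response, for example with text_content_type('plain', 'ISO-8859-1'), is enough to avoid the garbage characters.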


* * *

I would be interested in feedback from my fellow Quixote users! Have
you encountered the same problem, and solved it in a better fashion?
Ideally, I would like to see Quixote "fixed" so that it handles
Unicode more gracefully, but I'm too focussed on my current app to
know what would qualify as an appropriate general solution.

-- Graham



The patches
===========


   --- F:\quixote\html.py       Wed Dec 03 18:30:30 2003
   +++ html.py  Sat Jul 31 02:04:45 2004
   @@ -59,8 +59,9 @@
    import urllib
    from types import UnicodeType
    try:
   +    raise ImportError
        # faster C implementation
        from quixote._c_htmltext import htmltext, htmlescape, _escape_string, \
            TemplateIO
    except ImportError:



   --- F:\quixote\_py_htmltext.py       Wed Dec 03 18:30:30 2003
   +++ _py_htmltext.py  Sat Jul 31 02:08:21 2004
   @@ -42,12 +42,14 @@
        using entities.
        """

        __slots__ = ['s']

        def __init__(self, s):
   +        if isinstance(s, unicode):
   +            s = s.encode('utf-8')
            self.s = str(s)

        # XXX make read-only
        #def __setattr__(self, name, value):
        #    raise AttributeError, 'immutable object'

   @@ -158,12 +160,14 @@
        # helper for htmltext class __mod__

        __slots__ = ['value', 'escape']

        def __init__(self, value, escape):
            self.value = value
   +        if isinstance(self.value, unicode):
   +            self.value = self.value.encode('utf-8')
            self.escape = escape

        def __str__(self):
            return self.escape(str(self.value))

        def __repr__(self):
   @@ -187,13 +191,13 @@
        already a 'htmltext' object then the HTML markup characters \", <, >,
        and & are first escaped.
        """
        if classof(s) is htmltext:
            return s
        elif isinstance(s,  UnicodeType):
   -        s = s.encode('iso-8859-1')
   +        s = s.encode('utf-8')
        else:
            s = str(s)
        # inline _escape_string for speed
        s = s.replace("&", "&amp;") # must be done first
        s = s.replace("<", "&lt;")
        s = s.replace(">", "&gt;")