[Quixote-users] Re: Htmltext and latin-1 characters

Mike Orr wrote:
>         def filter(self, val, **kw):
>             val = htmlescape(val)
>             if isinstance(val, htmltext):
>                 return str(val)  # Cheetah > 1.0rc1 compatibility.
>             else:
>                 return val
>
> In this case it's trying to filter U"A\xa0B" retrieved from the
> database.  That's "AB" with the degree symbol in between.  Voila:
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
> position 1: ordinal not in range(128)

Right, that's the same as trying str(U"A\xa0B").
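
To make the failure concrete, here is a minimal sketch that spells out
the encode step str() performs implicitly on a unicode object in
Python 2. The variable names are mine, and the explicit .encode()
calls are used so the snippet behaves the same on Python 2 and 3:

```python
text = u"A\xa0B"

# What str(text) does behind the scenes on Python 2: encode with the
# default ASCII codec, which cannot represent U+00A0.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Naming a codec that can represent the character succeeds.
latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")
```

So the error is about the codec chosen for the conversion, not about
the data itself.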

> OK, let's try returning Unicode instead.
>
>     return unicode(val, 'latin1')    # Cheetah > 1.0rc1 compatibility.
>
> TypeError: coercing to Unicode: need string or buffer, htmltext found

You already have a unicode string, so decoding from 'latin1' makes no
sense.  Even if you had a plain unicode object instead of an htmltext
object, you would still get an error:

    >>> unicode(U"A\xa0B", "latin1")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: decoding Unicode is not supported
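
The distinction is between bytes, which can be decoded, and text,
which cannot. A rough sketch, assuming nothing beyond the built-in
types (text_type is my own alias so the same lines run on Python 2
and 3):

```python
text_type = type(u"")  # unicode on Python 2, str on Python 3

raw = b"A\xa0B"                      # Latin-1 encoded bytes
decoded = text_type(raw, "latin-1")  # decoding bytes is fine

try:
    text_type(decoded, "latin-1")    # decoding text a second time is not
    redecode_failed = False
except TypeError:
    redecode_failed = True
```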

> Darn it, why didn't htmltext subclass str!!!

Because then people who wanted to represent Unicode characters would
be out of luck.  Qpy makes htmltext a subclass of unicode.  That
forces everyone who uses it to correctly handle unicode strings.

> Peeking into the htmltext implementation, it stores the actual
> value in an attribute .s:
>
>     return unicode(val.s, 'latin1')  # Cheetah > 1.0rc1 compatibility.
>
> TypeError: decoding Unicode is not supported

Same error as my code snippet above.  You already have a unicode
string.  Decoding it makes no sense.

> How about this?
>
>     return unicode(val.s)  # Cheetah > 1.0rc1 compatibility.
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
> position 412: ordinal not in range(128)

That should work, although the following would do the same thing,
since s is already a unicode string:

    return val.s

The UnicodeEncodeError is being raised by some other code, I expect.
See below.

>>>> print htmltext(U"A\xa0B")
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
> position 1: ordinal not in range(128)

Does this work for you:

    >>> print U"A\xa0B"

It works for me because:

    >>> import sys
    >>> sys.stdout.encoding
    'UTF-8'

Sometimes the stdout encoding is 'ascii', and then you have to set the
encoding manually, e.g.:

    >>> import sys, codecs
    >>> sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

The other problem you are running into is a bug in Python, IMHO.
You can't print an object that has a __str__ (or __unicode__) method
that returns a unicode string:

    >>> class A:
    ...    def __str__(self):
    ...        return u"\u1234"
    ...
    >>> print A()
    UnicodeEncodeError

Another try:

    >>> class A:
    ...    def __unicode__(self):
    ...        return u"\u1234"
    ...
    >>> print A()
    <__main__.A instance at 0xb7dd1f4c>

I'm going to post a patch for the Python bug.  Hopefully it will get
applied for Python 2.5.  The answer you are looking for, I think,
is:

    def filter(self, val, **kw):
        val = htmlescape(val)
        if isinstance(val, htmltext):
            return val.s # Cheetah > 1.0rc1 compatibility.
        else:
            return val

Alternatively:

    def filter(self, val, **kw):
        val = htmlescape(val)
        return stringify(val) # from quixote.html
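
If you want to try the idea outside Cheetah and Quixote, here is a
self-contained sketch. FakeHtmltext, fake_htmlescape and filter_value
are hypothetical stand-ins I made up for illustration; they only mimic
the one property that matters here, namely that the wrapper keeps its
unicode value in an attribute s. They are not the Quixote
implementation:

```python
text_type = type(u"")  # unicode on Python 2, str on Python 3

class FakeHtmltext(object):
    """Hypothetical stand-in: holds already-escaped text in .s"""
    def __init__(self, s):
        self.s = s

def fake_htmlescape(val):
    """Hypothetical stand-in for htmlescape: escape, then wrap."""
    if isinstance(val, FakeHtmltext):
        return val
    escaped = (text_type(val).replace(u"&", u"&amp;")
               .replace(u"<", u"&lt;").replace(u">", u"&gt;"))
    return FakeHtmltext(escaped)

def filter_value(val):
    val = fake_htmlescape(val)
    if isinstance(val, FakeHtmltext):
        return val.s  # hand back the wrapped text, no str() round trip
    return val

result = filter_value(u"A\xa0B <b>")
```

The point is only that the filter returns the unicode value as is and
never pushes it through an implicit ASCII encode.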

>
>>>> print htmltext("A\xa0B")
> UnicodeEncodeError

Try:

    print stringify(htmltext("A\xa0B"))

Again, PyFile_WriteObject cannot print an object that has a unicode
representation.  You need to give PyFile_WriteObject a unicode
string.  It surprises me that no one else is complaining about these
Python bugs.

  Neil