[Quixote-users] Htmltext and latin-1 characters

Still trying to untangle my Quixote site that's having problems with
foreign characters.  It's a scientific environment so people paste
text with the degree symbol from Word documents, and the curly quotes
come along too, sigh.  Worse, things come from unknown character sets
because Word on Windows is different from Word on Mac; other stuff
comes from FileMaker which uses different characters, etc.  We've
decided a wrong character is acceptable but exceptions are not.  Then
I also had a problem with MySQL truncating input at the first
non-ASCII character, but I've got that fixed.

Now the problem is htmltext + Cheetah + str().  I made a Cheetah
filter that smartly escapes non-htmltext values, and it's used
throughout my application, some thirty templates.

    from Cheetah.Filters import Filter
    from quixote.html import htmlescape, htmltext

    class HtmltextFilter(Filter):
        """Safer than WebSafe: escapes values that aren't htmltext instances."""
        def filter(self, val, **kw):
            val = htmlescape(val)
            if isinstance(val, htmltext):
                return str(val)  # Cheetah > 1.0rc1 compatibility.
            else:
                return val

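In this case it's trying to filter u"A\xa0B" retrieved from the
database.  That's "A" and "B" with a non-ASCII character in between
(strictly speaking \xa0 is NO-BREAK SPACE and the degree sign is \xb0,
but any character above \x7f fails the same way).  Voila: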

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
position 1: ordinal not in range(128)
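
The failure can be reproduced without Quixote or Cheetah, since it
appears to come from the implicit ASCII encode that happens when a
Unicode string is turned into a byte string.  A minimal sketch using
plain strings only; the helper names ascii_encode_fails and lossy_encode
are made up for illustration:

```python
# Plain strings only; no htmltext or Cheetah involved.
TEXT = u"A\xa0B"

def ascii_encode_fails(text):
    """Return True if encoding text as ASCII raises UnicodeEncodeError."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False

def lossy_encode(text, encoding="ascii"):
    """Hypothetical helper: encode without raising.

    Characters the codec cannot represent become '?', which fits the
    "wrong character is acceptable, exceptions are not" rule.
    """
    return text.encode(encoding, "replace")
```

With an explicit encoding such as latin-1 the character survives; with
ASCII it is replaced rather than raising.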

OK, let's try returning Unicode instead.

    return unicode(val, 'latin1')    # Cheetah > 1.0rc1 compatibility.

TypeError: coercing to Unicode: need string or buffer, htmltext found

Darn it, why didn't htmltext subclass str!  Peeking into the
htmltext implementation, it stores the actual value in an attribute
named .s:

    return unicode(val.s, 'latin1')  # Cheetah > 1.0rc1 compatibility.

TypeError: decoding Unicode is not supported
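
That TypeError is what Python raises when the value handed to the
decoder is already Unicode, so .s evidently held a unicode object here
and there was nothing to decode.  A sketch of a helper that decodes only
when it receives bytes; to_text is a made-up name, and Latin-1 as the
default is an assumption about where the bytes came from:

```python
def to_text(value, encoding="latin-1"):
    """Hypothetical helper: decode byte strings, pass text through untouched."""
    if isinstance(value, bytes):
        # errors="replace" turns an undecodable byte into U+FFFD
        # instead of raising an exception.
        return value.decode(encoding, "replace")
    return value
```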

How about this?

    return unicode(val.s)  # Cheetah > 1.0rc1 compatibility.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
position 412: ordinal not in range(128)

F**k!  OK, the trick I used in TurboGears:

    return unicode(val.s, 'latin1').encode('latin1')  # Cheetah > 1.0rc1 compatibility.

TypeError: decoding Unicode is not supported


So how *do* you convert an htmltext object containing a non-ASCII
character to either str or unicode?  And how do you output it?

>>> print htmltext(U"A\xa0B")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
position 1: ordinal not in range(128)

>>> print htmltext("A\xa0B")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
position 1: ordinal not in range(128)

I tried sys.setdefaultencoding("latin1") but that has to be done in
the 'site' module; it's not available within a program.
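
Since the default encoding can't be changed from inside the program, the
alternative is presumably to encode explicitly at the point of output.
A sketch under stated assumptions: FakeHtmltext is a stand-in that only
mimics the .s attribute and is not Quixote's class, to_output_bytes is a
hypothetical name, and byte strings are assumed to be Latin-1.  Numeric
character references keep the result pure ASCII, so the page's declared
charset should not matter:

```python
class FakeHtmltext(object):
    """Stand-in for illustration only; it just mimics the .s attribute."""
    def __init__(self, s):
        self.s = s

def to_output_bytes(val):
    """Hypothetical helper: return ASCII-only bytes suitable for an HTML page."""
    raw = getattr(val, "s", val)  # unwrap .s if present, else use val as is
    if isinstance(raw, bytes):
        raw = raw.decode("latin-1")  # assumption: byte strings are Latin-1
    # Non-ASCII characters become numeric references such as &#176;
    return raw.encode("ascii", "xmlcharrefreplace")
```

This only encodes; it does no HTML escaping, so it would sit after
htmlescape() rather than replace it.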

Another problem is that all my controller methods instantiate a
template and finish with "return str(t)".  I'll have to change that to
"return unicode(t)" or "return t.respond()" or something; I'm not sure
what.
Plus who knows how many htmltext objects are used as placeholder
values; e.g., Quixote forms.  So it looks like I'll have to make
changes all over my program.

If you've been wondering why I've been making such a big deal over the
past few months about smart escaping in Cheetah, making Cheetah deal
with Unicode, and whether/how the WebSafe filter needs to be made
Unicode-friendly, this is why.  It came to a head in the past couple
of weeks as people started posting reports with the degree symbol.
Then a set of notifications that come in as email started using
"MASCULINE ORDINAL INDICATOR" ("\xba") instead of the proper "DEGREE
SIGN" ("\xb0"), because they both look like a circle on some Windows
screens.  That made another program choke, because I was converting
one to ASCII ("degrees") and didn't know about the other.  Sigh.
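
The look-alike problem could be handled with a single table covering
both characters, so that adding a third one later is a one-line change.
A sketch; the LOOKALIKES table and the asciify_degrees name are made up
for illustration, and the replacement word is the "degrees" conversion
mentioned above:

```python
# Characters that render as a small raised circle, mapped to one ASCII word.
LOOKALIKES = {
    u"\xb0": u"degrees",  # DEGREE SIGN
    u"\xba": u"degrees",  # MASCULINE ORDINAL INDICATOR, used by mistake
}

def asciify_degrees(text):
    """Hypothetical helper: replace every known look-alike with its ASCII word."""
    for char, replacement in LOOKALIKES.items():
        text = text.replace(char, replacement)
    return text
```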

--
Mike Orr