Re: [Quixote-users] Re: Htmltext and latin-1 characters

On 6/4/06, Neil Schemenauer wrote:
> Mike Orr wrote:
> >         def filter(self, val, **kw):
> >             val = htmlescape(val)
> >             if isinstance(val, htmltext):
> >                 return str(val)  # Cheetah > 1.0rc1 compatibility.
> >             else:
> >                 return val
> >
> > In this case it's trying to filter U"A\xa0B" retrieved from the
> > database.  That's "AB" with the degree symbol in between.  Voila:
> >
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
> > position 1: ordinal not in range(128)

I finally got it to work with:

return val.__str__().encode('latin1', 'xmlcharrefreplace')

Cheetah's .__str__() method returns Unicode here, which may be wrong,
but Guido has said str() will be allowed to return Unicode in a future
version to get around some of these problems.  In Cheetah, .__str__()
has the side effect of calling the template's main method, so you
don't have to hardcode its name.  That's why I was using str() in the
first place; I then switched to unicode() because I thought maybe that
would work.
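
As a rough illustration of what the 'xmlcharrefreplace' error handler
does, here is a minimal sketch.  It assumes Python 3 (the code in this
thread was Python 2) and uses a plain string in place of the Cheetah
template output; the variable names are made up for the example.

```python
# Python 3 sketch; a plain str stands in for the template output.
val = "A\xa0B"  # the value from the quoted traceback

# Encoding to ASCII fails the same way as the quoted error.
try:
    val.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Latin-1 can represent U+00A0 directly, so it passes through as one byte.
latin1_bytes = val.encode("latin1", "xmlcharrefreplace")

# A character outside Latin-1 is replaced by a numeric character
# reference instead of raising an exception.
mixed = "x \u2264 y".encode("latin1", "xmlcharrefreplace")

print(ascii_failed)
print(latin1_bytes)
print(mixed)
```

Only characters that Latin-1 cannot represent are turned into
references; everything else is encoded as usual.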

I was surprised that XML entities can be higher than 255, but Python
thinks they can.  I don't really care what the browser displays for it
because we don't really know what the original character was supposed
to be anyway (it was pasted from another application using who knows
what charset that may have been different from the browser's charset
of the person who uploaded it).  I just want the page to be readable
because it contains scientific data people need.
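
On the point about values above 255: numeric character references name
Unicode code points rather than bytes, so they are not limited to one
byte.  A small Python 3 sketch, assuming the standard html module is
available (it is not something used in this thread):

```python
import html

# U+2264 is outside Latin-1, so it is emitted as a decimal reference.
ref = "\u2264".encode("latin1", "xmlcharrefreplace").decode("latin1")

# Resolving the reference gives back the original code point.
roundtrip = html.unescape(ref)

print(ref)
print(ord(roundtrip))
```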

I still don't know why MySQLdb or MySQL or some Python or C library is
truncating the values on insert at the first non-ASCII or non-Latin-1
character, so for now I'm just running a conversion function to
asciify new values and sidestep the issue.  I'm also trying SQLAlchemy
for my new application, so maybe it'll do a better job.
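
The conversion function itself is not shown in this message.  Purely as
an illustration, one possible shape for such a function is sketched
below in Python 3; the name asciify(), the placeholder argument, and the
NFKD-based approach are all assumptions, not the code actually in use.

```python
import unicodedata

def asciify(text, placeholder="?"):
    """Hypothetical helper: reduce text to plain ASCII.

    Characters are decomposed with NFKD so that accented letters lose
    their accents; anything still non-ASCII becomes the placeholder.
    """
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        if ord(ch) < 128:
            out.append(ch)           # already ASCII, keep it
        elif unicodedata.combining(ch):
            continue                 # drop combining accents
        else:
            out.append(placeholder)  # no ASCII equivalent
    return "".join(out)

print(asciify("A\xa0B"))
print(asciify("caf\xe9"))
print(asciify("25\xb0C"))
```

A function like this loses information, so it only makes sense where
readability matters more than preserving the original characters.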

--
Mike Orr