durusmail: quixote-users: Foreign characters in form input
Foreign characters in form input
2006-11-16
Mike Orr
I'm trying to figure out how Quixote handles non-ASCII characters in
form input.  Our users tend to paste text from Word documents and
FileMaker databases etc., which often contain:

  - degree symbols
  - "Word-enhanced" characters (curly quotes, long dashes, bullets)
  - Spanish/Portuguese letters (less common)

The source charset is windows-1252 or mac_roman depending on which
platform the document was created on.  I want to use unicode in memory
and utf-8 for display and MySQL.  I thought I would have to guess the
charset or ask the user and then convert, but my conversion attempts
raised errors.  When I inspected the form input, to my amazement it was
already unicode.  Is this happening at some lower layer?  My form is
embedded in a utf-8 webpage, so if the data comes back as utf-8 and
something autoconverts it to unicode, that's (almost) OK.  Is this how
HTTP and Quixote work?

In this case, the only remaining problems would be:
  - Will it safely interpret any 8-bit character string the user
    pastes in without raising an exception?
  - What if the document's character set is different from that of the
    platform the user is running on?  Of course, both will be different
    from utf-8 in any case.  Will the browser convert the characters,
    reinterpret them as-is, or what?
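
One way to probe the first question is to feed a decoder bytes it
cannot interpret and see what happens.  The sketch below is only an
illustration under stated assumptions, not a description of what
Quixote does internally: it is written for present-day Python 3 (where
`bytes` is the 8-bit string and `str` is unicode), and
`decode_form_bytes` is a hypothetical helper name.

```python
def decode_form_bytes(raw, charset="utf-8"):
    """Decode raw form bytes without ever raising.

    Bytes that are invalid in `charset` become U+FFFD (the replacement
    character) instead of triggering UnicodeDecodeError.
    """
    return raw.decode(charset, "replace")


# Curly quotes pasted from Word, encoded the way a windows-1252 source
# would store them.
cp1252_bytes = "\u201cquoted\u201d".encode("cp1252")

# Strict decoding as utf-8 fails on these bytes...
try:
    cp1252_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...while the lenient helper returns text with replacement characters.
lenient = decode_form_bytes(cp1252_bytes)
```

The lenient form never raises, but it loses the original characters,
so it answers "will it blow up?" rather than "will it be right?".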

If I start getting 8-bit strings as form input, I'll have to convert
them using an algorithm to guess the charset, or a companion pulldown
for the user to tell me.
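
A guessing algorithm of the kind mentioned above could be as simple as
trying candidate charsets in order.  This is a minimal sketch, assuming
Python 3 and an ordering I chose for illustration; `guess_decode` and
`CANDIDATES` are hypothetical names.

```python
# Candidate charsets, tried in order (an assumed ordering).  utf-8 and
# cp1252 reject some byte sequences; mac_roman maps every byte to a
# character, so it goes last as the catch-all.
CANDIDATES = ("utf-8", "cp1252", "mac_roman")


def guess_decode(raw, candidates=CANDIDATES):
    """Return (text, charset) for the first candidate that decodes cleanly."""
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            continue
    # Not reached with the default list; kept for custom candidate lists.
    return raw.decode("ascii", "replace"), "ascii"
```

The guess can be wrong without raising.  The mac_roman degree sign is
the byte 0xA1, which is also a valid cp1252 character (an inverted
exclamation mark), so this ordering misreads it.  That is the case
where a companion pulldown would still be needed.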
I'll hold off on the details of this for now since I'm still
navigating through the minefield of UnicodeDecodeError and
UnicodeEncodeError in various circumstances.

* * *
By the way, Quixote's error handler can't display an error message
that contains non-ASCII characters.

  File "/usr/local/lib/python2.4/Quixote-2.4-py2.4-linux-i686.egg/quixote/publish.py", line 195, in finish_failed_request
    tb)
  File "/usr/local/lib/python2.4/Quixote-2.4-py2.4-linux-i686.egg/quixote/publish.py", line 236, in _generate_cgitb_error
    error_file.write(str(util.dump_request(request)))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 41-42: ordinal not in range(128)

The problem is that str() call.  However, if you take it out you get
the same error, because the called method also calls str().  Ayayay!
This happens no matter whether display_exceptions is set to "html",
"plain" or "none".  To see the real error I added "raise" before the
call to publish.finish_failed_request (publish.py line 284).  The real
error was:

  File "./char_conversion_site.py", line 84, in _q_index
    text = text.decode("mac_roman", "replace")
  File "/usr/lib/python2.4/encodings/mac_roman.py", line 22, in decode
    return codecs.charmap_decode(input,errors,decoding_map)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

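(The reason for the error was that 'text' is already unicode, so it
should not be decoded.  Calling .decode() on a unicode object makes
Python first encode it implicitly with the 'ascii' codec, which is why
the failure is a UnicodeEncodeError rather than a UnicodeDecodeError.)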
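
A guard against this mistake is to decode only when the value is still
an 8-bit string.  The sketch below assumes Python 3 names (`str` for
unicode, `bytes` for 8-bit data); under the Python 2.4 in this thread
the check would be against `unicode` instead.  `to_text` is a
hypothetical helper, not part of Quixote.

```python
def to_text(value, charset="mac_roman"):
    """Return unicode text, decoding only when `value` is still raw bytes."""
    if isinstance(value, str):
        # Already unicode: decoding again is the bug described above.
        return value
    return value.decode(charset, "replace")
```

With this in place the same call works whether the form layer has
already converted the input or not.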

There don't seem to be non-ASCII characters in that traceback so I'm
not 100% sure why Quixote blew up on it, but it did.
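
For what it's worth, a str()-style failure like the one in the error
handler can be avoided by escaping non-ASCII characters before writing.
This is only a sketch of the general technique in Python 3, not a patch
for publish.py; `ascii_safe` is a hypothetical helper and the
`error_file` here is just an in-memory stand-in.

```python
import io


def ascii_safe(text):
    """Render unicode text using only ASCII characters.

    Characters outside ASCII are written as backslash escapes instead
    of raising UnicodeEncodeError.
    """
    return text.encode("ascii", "backslashreplace").decode("ascii")


# Stand-in for the error log; the real handler writes to its own file.
error_file = io.StringIO()
error_file.write(ascii_safe("form value: \u201cquoted\u201d"))
```

The output is less readable than the original characters, but the
error report still gets written.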

--
Mike Orr 