durusmail: quixote-users: Foreign characters in form input
Foreign characters in form input
2006-11-16
David Binger
On Nov 15, 2006, at 8:51 PM, Mike Orr wrote:

> I'm trying to figure out how Quixote handles non-ASCII characters in
> form input.  Our users tend to paste text from Word documents and
> FileMaker databases etc, which often contain:
>
>  - degree symbols
>  - "Word-enhanced" characters (curly quotes, long dashes, bullets)
>  - Spanish/Portuguese letters (less common)
>
> The source charset is windows-1252 or mac_roman depending on which
> platform the document was created on.  I want to use unicode in memory
> and utf-8 for display and MySQL.  I thought I would have to guess the
> charset or ask the user and then try to convert, but I'm getting
> errors.  But when I inspected the form input, to my amazement it was
> already unicode.  Is this happening at some lower layer?  My form is
> embedded in a utf-8 webpage, so if the data comes back as utf-8 and
> something autoconverts it to unicode, that's (almost) OK.  Is this how
> HTTP and Quixote work?

When the page is utf-8, I think browsers normally send the form
submission encoded as utf-8, and if Quixote's charset is also utf-8,
the form values are decoded to unicode.
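
As a rough illustration of that round trip (plain Python 3 standard
library, not Quixote's actual code), the sketch below encodes a form
field the way a browser on a utf-8 page would and then decodes it with
the same charset. The field name "note" and the sample text are made up
for the example.

```python
from urllib.parse import urlencode, parse_qs

# Sample text with a degree sign, curly quotes, an en dash and an n-tilde.
text = "25\u00b0C \u201cquoted\u201d \u2013 a\u00f1o"

# Browser side: percent-encode the utf-8 bytes of the (hypothetical) field.
body = urlencode({"note": text}, encoding="utf-8")

# Server side: decode the percent-encoded bytes with the same charset.
fields = parse_qs(body, encoding="utf-8", errors="strict")
decoded = fields["note"][0]
```

If both ends agree on utf-8, `decoded` is the same text the user pasted.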

>
> In this case, the only remaining problems would be:
>  - Will it safely interpret any 8-bit character string the user
> pastes in without raising an exception?

The browser must figure out what the intended characters are,
so that it can transmit them encoded using utf-8.
Our sites use this pattern and I don't remember ever seeing
an exception in the form value decoding.
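
For what it's worth, a strict utf-8 decode can still fail if a client
sends bytes that are not valid utf-8. Here is a minimal sketch of a
defensive decode, assuming you would rather get replacement characters
than an exception. The helper name decode_form_value is hypothetical and
this is not something Quixote does for you, as far as I know.

```python
def decode_form_value(raw: bytes, charset: str = "utf-8") -> str:
    """Decode raw form bytes; undecodable bytes become U+FFFD."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        # Fall back to a lossy decode instead of raising.
        return raw.decode(charset, errors="replace")

# Well-formed utf-8 decodes unchanged.
good = decode_form_value("caf\u00e9".encode("utf-8"))

# Raw windows-1252 curly quotes (0x93, 0x94) are not valid utf-8.
bad = decode_form_value(b"\x93hi\x94")
```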

>  - What if the document's character set is different from the
> platform the user is running on?  Of course, both will be different
> than utf-8 in any case. Will the browser convert the characters,
> reinterpret them as-is, or what?

If some mechanism on the client side causes the encoded document
to be decoded using the wrong character set and then re-encoded
as utf-8 to send the request, then you've got a mess.
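
A small Python 3 sketch of that failure, assuming the text was saved as
mac_roman and then misread as windows-1252 before being sent. The
request is perfectly valid utf-8, so nothing raises on the server; the
characters are simply wrong.

```python
original = "a\u00f1o 25\u00b0"            # n-tilde and degree sign

stored = original.encode("mac_roman")      # bytes as saved on a Mac
misread = stored.decode("windows-1252")    # client assumes the wrong charset
sent = misread.encode("utf-8")             # valid utf-8, wrong characters

# The server decodes without error but gets different text.
received = sent.decode("utf-8")
```

Once this has happened, the server cannot tell from the bytes alone that
anything went wrong.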


