On Nov 15, 2006, at 8:51 PM, Mike Orr wrote:

> I'm trying to figure out how Quixote handles non-ASCII characters in
> form input.  Our users tend to paste text from Word documents and
> FileMaker databases etc, which often contain:
>
> - degree symbols
> - "Word-enhanced" characters (curly quotes, long dashes, bullets)
> - Spanish/Portuguese letters (less common)
>
> The source charset is windows-1252 or mac_roman depending on which
> platform the document was created on.  I want to use unicode in memory
> and utf-8 for display and MySQL.  I thought I would have to guess the
> charset or ask the user and then try to convert, but I'm getting
> errors.  But when I inspected the form input, to my amazement it was
> already unicode.  Is this happening at some lower layer?  My form is
> embedded in a utf-8 webpage, so if the data comes back as utf-8 and
> something autoconverts it to unicode, that's (almost) OK.  Is this how
> HTTP and Quixote work?

When the page is utf-8, I think normal browsers send the request
encoded as utf-8, and if Quixote's charset is also utf-8, the form
values are decoded to unicode.

> In this case, the only remaining problems would be:
> - Will it safely interpret any 8-bit character string the user
>   pastes in without raising an exception?

The browser must figure out what the intended characters are, so that
it can transmit them encoded using utf-8.  Our sites use this pattern
and I don't remember ever seeing an exception in the form value
decoding.

> - What if the document's character set is different from the
>   platform the user is running on?  Of course, both will be different
>   than utf-8 in any case.  Will the browser convert the characters,
>   reinterpret them as-is, or what?

If some mechanism on the client side causes the encoded document to be
decoded using the wrong character set and then re-encoded as utf-8 to
send the request, then you've got a mess.
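
To make the "mess" case concrete, here is a rough sketch in plain
Python (standard codecs only, nothing Quixote-specific).  The sample
string and the variable names are made up for illustration, and it
assumes the text was really windows-1252 but got decoded as mac_roman
somewhere on the client before the browser sent it as utf-8.

```python
# Hypothetical pasted text: curly quotes around "25" plus a degree sign.
original = "\u201c25\u00b0\u201d"

# Normal case: the browser sends utf-8 bytes and the server decodes
# them as utf-8, so the text survives the round trip.
sent = original.encode("utf-8")
round_tripped = sent.decode("utf-8")

# Failure case: the bytes are really windows-1252, but something on
# the client decodes them as mac_roman before re-encoding as utf-8.
raw_cp1252 = original.encode("cp1252")
misread = raw_cp1252.decode("mac_roman")
received = misread.encode("utf-8").decode("utf-8")

# The server sees valid utf-8 and raises no exception, yet the
# characters are wrong.
print(repr(round_tripped))
print(repr(received))

# If you happen to know which wrong charset was used, the damage can
# be undone by reversing the two steps.
repaired = received.encode("mac_roman").decode("cp1252")
print(repr(repaired))
```

The point is that the server side cannot detect this on its own: the
request is well-formed utf-8 either way, so the decoding succeeds and
only the content is wrong.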