[opencms-dev] Re: OT: JSP & UTF-8
Christoph Schönfeld
cschoenfeld at sylphen.com
Fri Oct 20 10:43:25 CEST 2006
Hi Joe and fellow list readers,
in the Tomcat implementation of the Java Servlet Specification, the
first call to HttpServletRequest.getParameter() (or one of its siblings
such as getParameterMap()) has the side effect that the GET and POST
data are parsed with the encoding in effect for the HttpServletRequest
at that time. The Java Servlet Specification defines ISO-8859-1 as the
default encoding. Tomcat caches the result of that operation after the
first call and does not reparse the data when
HttpServletRequest.setCharacterEncoding() is called later.
Joe, I could imagine that your log settings cause a call to
HttpServletRequest.getParameter() or a related method. If that's the
case, your call to HttpServletRequest.setCharacterEncoding() has
absolutely no effect because it is made too late.
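The ordering problem can be sketched with a toy model. This is not
Tomcat's code: the class LazyParameterModel and its methods are made up
for illustration and only mimic the "parse once, then cache" behaviour
described above, using plain JDK classes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a request object that parses lazily and caches.
final class LazyParameterModel {
    private final byte[] rawValue;                           // bytes as sent by the browser
    private Charset encoding = StandardCharsets.ISO_8859_1;  // servlet spec default
    private String cached;                                   // result of the first parse

    LazyParameterModel(byte[] rawValue) {
        this.rawValue = rawValue;
    }

    void setCharacterEncoding(Charset charset) {
        this.encoding = charset;   // only affects a parse that has not happened yet
    }

    String getParameter() {
        if (cached == null) {
            cached = new String(rawValue, encoding);  // first call decides the encoding
        }
        return cached;
    }

    public static void main(String[] args) {
        byte[] sent = "Sch\u00f6nfeld".getBytes(StandardCharsets.UTF_8);

        // Encoding set before the first read: value decodes correctly.
        LazyParameterModel early = new LazyParameterModel(sent);
        early.setCharacterEncoding(StandardCharsets.UTF_8);
        System.out.println(early.getParameter().equals("Sch\u00f6nfeld"));           // true

        // Something (e.g. logging) reads the parameter first: too late afterwards.
        LazyParameterModel late = new LazyParameterModel(sent);
        late.getParameter();
        late.setCharacterEncoding(StandardCharsets.UTF_8);
        System.out.println(late.getParameter().equals("Sch\u00c3\u00b6nfeld"));      // true
    }
}
```

In a real web application the equivalent rule is that
setCharacterEncoding() has to run before anything touches the request
parameters, for example in the first filter of the chain.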
There are two aspects that make Unicode handling difficult with HTTP.
First, HTTP itself gives the server no way to tell clients the expected
input charset. This is a logical consequence of the stateless nature of
HTTP. Secondly, and quite unfortunately, browsers in practice do not
tell the server the charset of the request data, so servers always have
to guess or rely on convention. IMO this is where the specification
fails.
To fully support UTF-8, you have to take care to get data output and
input right. Getting output right is easier because it's fully supported
by the HTTP Content-Type header.
I use the following measures successfully with Tomcat:
Output: Make sure that the content sent to the browser actually is what
it declares to be.
1. Use UTF-8 as the charset in the JSP contentType. Version 1.2 of the
JSP Specification is not as precise on the effect of this setting as
version 2.0 is: contentType defines the charset of the HTTP response.
(See JSP.2.10.2 The taglib Directive on p. 52 in the JSP Specification
version 1.2, and JSP.1.10.2 on p. 48 in the JSP Specification version
2.0)
2. Make sure the Content-Type header in your HttpServletResponse is
'text/html; charset="UTF-8"'. (See
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.17)
3. If you create URLs in your HTML pages, use java.net.URLEncoder to
make sure that non-ASCII values cannot get into the parameter values.
4. Optional: consider storing the JSP file itself as UTF-8 and using the
pageEncoding directive to tell the JSP Processor about that.
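For item 3, a minimal sketch of building a link with
java.net.URLEncoder. The page name "search.jsp", the parameter name "q"
and the class LinkBuilder are assumptions made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that keeps non-ASCII characters out of generated URLs.
final class LinkBuilder {

    // Percent-encodes the value as UTF-8 bytes so the URL itself stays pure ASCII.
    static String searchLink(String query) {
        try {
            return "search.jsp?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints search.jsp?q=Sch%C3%B6nfeld+%26+Co
        System.out.println(searchLink("Sch\u00f6nfeld & Co"));
    }
}
```

Note that URLEncoder produces form encoding (a space becomes "+"),
which is what is wanted for query parameter values but not for path
segments.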
If you have a servlet sending HTML with <meta http-equiv="Content-Type"
content="text/html; charset=UTF-8">, make sure you actually send UTF-8
data. If you use the ServletOutputStream directly, make sure you wrap it
in an OutputStreamWriter initialized with the correct encoding. If you
use HttpServletResponse.getWriter(), make sure you call
HttpServletResponse.setCharacterEncoding() before you call getWriter()!
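The wrapping step can be sketched with plain JDK classes. Here a
ByteArrayOutputStream stands in for the ServletOutputStream, and the
class name Utf8Output and its methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: writes text to a byte stream as UTF-8, never the platform default.
final class Utf8Output {

    static void writeHtml(OutputStream out, String html) throws IOException {
        // Name the charset explicitly; OutputStreamWriter(out) alone would
        // use the platform default encoding.
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(html);
        writer.flush();   // push buffered characters through to the byte stream
    }

    // Convenience for the demo: returns the bytes that would go over the wire.
    static byte[] render(String html) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writeHtml(buffer, html);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = render("\u00f6");
        // U+00F6 is two bytes in UTF-8 (0xC3 0xB6), so this prints 2
        System.out.println(bytes.length);
    }
}
```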
However, this is only the output aspect: If this is right, the browser
is able to correctly display the UTF-8 data.
Input: The part most people forget is to take care that the input data
sent back by the browser is correct.
1. Use the attribute "accept-charset" in your HTML form tag: <form ...
accept-charset="UTF-8">. This tells the browser to send UTF-8 data. (See
http://www.w3.org/TR/html4/interact/forms.html#adef-accept-charset).
This relies on a compliant browser, so it is not guaranteed to work,
but it does work with current browsers.
2. If you use Tomcat, use the attribute URIEncoding="UTF-8" in the
Connector element in your server.xml. This makes UTF-8 input also work
for GET parameters (see item 3 under Output above).
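What the choice of URI encoding changes for a GET parameter can be
illustrated with java.net.URLDecoder. This is only an illustration, not
Tomcat's own decoder; the class name QueryDecoding is made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo: the same percent-encoded value decoded with two charsets.
final class QueryDecoding {

    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "Sch%C3%B6nfeld";   // value as it appears in the query string

        // Decoded as UTF-8, the two bytes C3 B6 become one character, U+00F6.
        System.out.println(decode(raw, "UTF-8").equals("Sch\u00f6nfeld"));             // true

        // Decoded as ISO-8859-1, each byte becomes its own character.
        System.out.println(decode(raw, "ISO-8859-1").equals("Sch\u00c3\u00b6nfeld"));  // true
    }
}
```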
Please correct me if I am wrong somewhere.
Christoph