[opencms-dev] Re: OT: JSP & UTF-8

Christoph Schönfeld cschoenfeld at sylphen.com
Fri Oct 20 10:43:25 CEST 2006


Hi Joe and fellow list readers,

in the Tomcat implementation of the Java Servlet Specification, the 
first call to HttpServletRequest.getParameter() (or one of the related 
parameter accessors) has the side effect that the GET and POST data is 
parsed with the encoding in effect for the HttpServletRequest at that 
time. The Java Servlet Specification defines ISO-8859-1 to be the 
default encoding. The Tomcat implementation caches the result of that 
operation after the first call and does not reparse it when 
HttpServletRequest.setCharacterEncoding() is called.

Joe, I could imagine that your log settings cause a call to 
HttpServletRequest.getParameter() or one of its sibling methods. If 
that's the case, your call to HttpServletRequest.setCharacterEncoding() 
has absolutely no effect because it is made too late.
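
A toy model may make the ordering problem easier to see. The sketch 
below is not Tomcat code: ToyRequest and its fields are hypothetical 
names, and the only assumption is the behavior described above (parse 
on first access with the encoding in effect, then cache). It prints 
string lengths rather than the strings themselves so that the console 
encoding does not get in the way.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class ParseOnceDemo {
    public static void main(String[] args) {
        // "Sch<o-umlaut>nfeld", percent-encoded as UTF-8 by the browser
        String raw = "Sch%C3%B6nfeld";

        // Encoding set BEFORE the first parameter access: decoded as UTF-8.
        ToyRequest early = new ToyRequest(raw);
        early.setCharacterEncoding("UTF-8");
        System.out.println(early.getParameter().length()); // 9

        // Encoding set AFTER the first access (e.g. triggered by logging):
        // the cached ISO-8859-1 result is returned, two chars for the umlaut.
        ToyRequest late = new ToyRequest(raw);
        late.getParameter();
        late.setCharacterEncoding("UTF-8");
        System.out.println(late.getParameter().length());  // 10
    }
}

// Hypothetical stand-in for a request object; models parse-once caching only.
class ToyRequest {
    private final String rawValue;          // percent-encoded value as received
    private String encoding = "ISO-8859-1"; // default per the Servlet Specification
    private String cachedValue;             // filled on first access, never reparsed

    ToyRequest(String rawValue) {
        this.rawValue = rawValue;
    }

    void setCharacterEncoding(String encoding) {
        this.encoding = encoding;           // does not invalidate the cache
    }

    String getParameter() {
        if (cachedValue == null) {
            try {
                cachedValue = URLDecoder.decode(rawValue, encoding);
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException(e);
            }
        }
        return cachedValue;
    }
}
```

In a real application the practical consequence would be to call 
setCharacterEncoding() before anything else touches the parameters.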


There are two aspects which make Unicode handling difficult with HTTP. 
First, there is no way the server can tell clients the expected input 
charset. This is a logical consequence of the stateless nature of HTTP. 
Second, there is no way the client can tell a server the charset of the 
request data, which is quite unfortunate. Servers always have to guess 
or rely on the client using the charset they expect. IMO this is where 
the specification fails.

To fully support UTF-8, you have to take care to get data output and 
input right. Getting output right is easier because it's fully supported 
by the HTTP Content-Type header.

I use the following measures successfully with Tomcat:

Output: Make sure that the content sent to the browser actually is what 
it declares to be.

1. Use UTF-8 as the charset in the JSP contentType attribute. The JSP 
Specification version 1.2 is not as precise on the effect of this 
setting as version 2.0 is: contentType defines the charset of the HTTP 
Response. (See JSP.2.10.2 The taglib Directive on p. 52 in the JSP 
Specification version 1.2, and JSP.1.10.2 on p. 48 in the JSP 
Specification version 2.0)
2. Make sure the Content-Type header in your HttpServletResponse is 
'text/html; charset="UTF-8"'. (See 
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.17)
3. If you create URLs in your HTML pages, use java.net.URLEncoder to 
make sure that non-ASCII values cannot get into the parameter values.
4. Optional: consider storing the JSP file itself as UTF-8 and using the 
pageEncoding directive to tell the JSP Processor about that.
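
As a minimal sketch of point 3, the following shows 
java.net.URLEncoder producing a pure-ASCII parameter value. The helper 
name encodeParam and the page name search.jsp are made up for 
illustration; only URLEncoder.encode(String, String) is the real API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlParamDemo {
    // Hypothetical helper: percent-encodes one parameter value as UTF-8.
    static String encodeParam(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u00f6 is the o-umlaut; the escape keeps this source file ASCII-only.
        String name = "Sch\u00f6nfeld & Co";
        String url = "search.jsp?name=" + encodeParam(name);
        System.out.println(url); // search.jsp?name=Sch%C3%B6nfeld+%26+Co
    }
}
```

Only the parameter value is encoded, not the whole URL, so the "?" and 
"=" separators stay intact.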

If you have a servlet sending HTML with <meta http-equiv="Content-Type" 
content="text/html; charset=UTF-8">, make sure you actually send UTF-8 
data. If you use the ServletOutputStream directly, make sure you wrap it 
in an OutputStreamWriter initialized with the correct encoding. If you 
use HttpServletResponse.getWriter(), make sure you call 
setCharacterEncoding() before you call getWriter()!
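
To illustrate why the writer's encoding matters, here is a small 
sketch that uses a ByteArrayOutputStream as a stand-in for the 
ServletOutputStream (an assumption made so that it runs without a 
servlet container). The helper name encode is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

class WriterEncodingDemo {
    // Writes text through an OutputStreamWriter using the given charset
    // and returns the bytes that would go out on the wire.
    static byte[] encode(String text, String charsetName) {
        // Stand-in for the ServletOutputStream of a real response.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charsetName)) {
            writer.write(text);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00f6"; // o-umlaut
        System.out.println(encode(text, "UTF-8").length);      // 2
        System.out.println(encode(text, "ISO-8859-1").length); // 1
    }
}
```

If the page declares UTF-8 but the single ISO-8859-1 byte is sent, the 
browser receives an invalid UTF-8 sequence.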

However, this is only the output aspect: If this is right, the browser 
is able to correctly display the UTF-8 data.

Input: The part most people forget is to make sure that the input data 
sent back by the browser is correct.

1. Use the attribute "accept-charset" in your HTML form tag: <form ... 
accept-charset="UTF-8">. This tells the browser to send UTF-8 data. (See 
http://www.w3.org/TR/html4/interact/forms.html#adef-accept-charset). 
This requires a compliant browser, so there is no guarantee that it 
will work, but it does with current browsers.
2. If you use Tomcat, use the attribute URIEncoding="UTF-8" in the 
Connector element in your server.xml. This makes UTF-8 input also work 
for GET parameters (see item 3 under Output above).


Please correct me if I am wrong somewhere.

Christoph

