[opencms-dev] Re: OT: JSP & UTF-8

Stephan Hartmann hartmann at metamesh.de
Fri Oct 20 12:07:14 CEST 2006


Hi Joe,

as the URIEncoding attribute in the Connector element only works for GET
requests, I suggest you place a simple servlet filter in front of your
servlet that does nothing but set the character encoding of the request
body to UTF-8 by calling
ServletRequest.setCharacterEncoding("UTF-8")

Regards,
Stephan

Joe Desbonnet schrieb:
> Christoph,
> 
> Thanks very much for the informative posting. I'm going to keep it as
> a checklist in future.
> 
> I'm going to post to the log4j list about my problem. The odd thing is
> that I don't use/touch/reference these libraries in any way, yet they
> influence the behaviour of the JSPs. I've been able to reduce it
> down to a simple test case:
> 
> http://galway.net/tmp/UTF8Test.war   (test form + submit script, with
> commons-logging and log4j libraries in WEB-INF/lib)
> 
> http://galway.net/tmp/UTF8Test-nologlib.war (test form + submit
> script, with no libraries).
> 
> On my setup (Tomcat 5.5.16 + JDK 1.5.0_07 Linux) they behave differently.
> 
> My temporary solution is to remove commons-logging as it's not
> required right now (but some libraries I intend to use may need it) :(
> 
> Thanks again for your help,
> 
> Joe.
> 
> 
> 
> On 10/20/06, Christoph Schönfeld <cschoenfeld at sylphen.com> wrote:
> 
>>
>>  Hi Joe and fellow list readers,
>>
>>  in the Tomcat implementation of the Java Servlet Specification, the first
>> call to HttpServletRequest.getParameters() has the side effect that the
>> GET and POST data is parsed with the encoding in effect for the
>> HttpServletRequest at that time. The Java Servlet Specification defines
>> ISO-8859-1 to be the default encoding. The Tomcat implementation caches
>> the result of that operation after the first call and does not reparse
>> the data when HttpServletRequest.setCharacterEncoding() is called.
>>
>>  Joe, I could imagine that your log settings cause a call to
>> HttpServletRequest.getParameters() or getParameter(). If
>> that's the case your call to
>> HttpServletRequest.setCharacterEncoding() has absolutely no
>> effect because it is made too late.
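
The parse-once behaviour described above can be modelled in plain Java. This is only a sketch under stated assumptions: FakeRequest is a hypothetical stand-in written for this illustration, not Tomcat's actual code, and the form field "name" and its value are made up. It shows why the encoding has to be set before anything reads a parameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

public class ParseOnceDemo {

    /** Hypothetical, simplified model of a request; NOT Tomcat's implementation. */
    static class FakeRequest {
        private final String rawBody;            // percent-encoded form data as sent by the browser
        private String encoding = "ISO-8859-1";  // default encoding named in the post
        private Map<String, String> cache;       // filled on the first parameter access

        FakeRequest(String rawBody) {
            this.rawBody = rawBody;
        }

        void setCharacterEncoding(String enc) {
            this.encoding = enc;                  // has no effect once the cache exists
        }

        String getParameter(String name) {
            if (cache == null) {                  // parse once, with the encoding in effect now
                cache = new HashMap<>();
                for (String pair : rawBody.split("&")) {
                    String[] kv = pair.split("=", 2);
                    cache.put(decode(kv[0]), kv.length > 1 ? decode(kv[1]) : "");
                }
            }
            return cache.get(name);
        }

        private String decode(String s) {
            try {
                return URLDecoder.decode(s, encoding);
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    /** Something (for example logging code) reads a parameter before the encoding is set. */
    static String tooLate() {
        FakeRequest req = new FakeRequest("name=%C3%A9"); // U+00E9 encoded as UTF-8
        req.getParameter("name");                          // triggers parsing as ISO-8859-1
        req.setCharacterEncoding("UTF-8");                 // too late, the cache is already filled
        return req.getParameter("name");
    }

    /** The encoding is set before the first parameter access. */
    static String inTime() {
        FakeRequest req = new FakeRequest("name=%C3%A9");
        req.setCharacterEncoding("UTF-8");
        return req.getParameter("name");
    }

    public static void main(String[] args) {
        System.out.println(tooLate().equals("\u00C3\u00A9")); // true: two garbled characters
        System.out.println(inTime().equals("\u00E9"));        // true: the intended character
    }
}
```

In this model the only difference between the two methods is the order of the calls, which is the point of the paragraph above.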
>>
>>
>>  There are two aspects which make Unicode handling difficult with HTTP.
>> First, there is no way the server can tell clients the expected input
>> charset. This is a logical consequence of the stateless nature of HTTP.
>> But secondly, there is no way the client can tell a server the charset
>> of the request data, which is quite unfortunate. Servers always have to
>> guess or rely on conventions. IMO this is where the specification fails.
>>
>>  To fully support UTF-8, you have to take care to get data output and
>> input right. Getting output right is easier because it's fully supported
>> by the HTTP Content-Type header.
>>
>>  I use the following measures successfully with Tomcat:
>>
>>  Output: Make sure that the content sent to the browser actually is what
>> it declares to be.
>>
>>  1. Use UTF-8 as the charset in the JSP contentType. The JSP Specification
>> version 1.2 is not as precise on the effect of this setting as version
>> 2.0 is: contentType defines the charset of the HTTP Response. (See
>> JSP.2.10.2 The taglib Directive on p. 52 in the JSP Specification
>> version 1.2, and JSP.1.10.2 on p. 48 in the JSP Specification version
>> 2.0)
>>  2. Make sure the Content-Type header in your HttpServletResponse is
>> 'text/html; charset="UTF-8"'. (See
>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.17)
>>  3. If you create URLs in your HTML pages, use java.net.URLEncoder to
>> make sure that non-ASCII values cannot get into the parameter values.
>>  4. Optional: consider storing the JSP file itself as UTF-8 and using the
>> pageEncoding directive to tell the JSP Processor about that.
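
As a small illustration of point 3 in the list above, here is a sketch using java.net.URLEncoder. The page name search.jsp, the parameter q and the sample value are assumptions made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class LinkBuilder {

    /** Builds a link whose parameter value is percent-encoded as UTF-8. */
    static String searchLink(String term) {
        try {
            // search.jsp and q are hypothetical names used only for this example
            return "search.jsp?q=" + URLEncoder.encode(term, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // every Java platform is required to support UTF-8
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u00F6 is the o-umlaut; written as an escape so the source file encoding does not matter
        System.out.println(searchLink("Sch\u00F6nfeld & Co"));
        // prints: search.jsp?q=Sch%C3%B6nfeld+%26+Co
    }
}
```

The resulting URL contains only ASCII characters, so it survives whatever charset the page itself is served in.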
>>
>>  If you have a servlet sending HTML with <meta http-equiv="Content-Type"
>> content="text/html; charset=UTF-8">, make sure you actually send UTF-8
>> data. If you use the ServletOutputStream directly, make sure you wrap it
>> in an OutputStreamWriter initialized with the correct encoding. If you
>> use HttpServletResponse.getWriter(), make sure you call
>> setCharacterEncoding() before you call getWriter()!
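
The advice about wrapping the raw output stream can be sketched as follows. This is a plain-JDK illustration, assuming a ByteArrayOutputStream as a stand-in for the ServletOutputStream so that it runs without a servlet container; the class and method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class Utf8WriterDemo {

    /** Writes text to a byte stream with an explicit encoding instead of the platform default. */
    static void writeText(OutputStream out, String text) throws IOException {
        Writer writer = new OutputStreamWriter(out, "UTF-8");
        writer.write(text);
        writer.flush(); // push the encoded bytes through to the underlying stream
    }

    /** Returns the bytes that would go over the wire for the given text. */
    static byte[] encode(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream(); // stand-in for the servlet stream
        try {
            writeText(buffer, text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // \u00E4 (a-umlaut) is one character but two bytes in UTF-8
        System.out.println(encode("\u00E4").length); // prints: 2
    }
}
```

Without the explicit "UTF-8" argument the writer would use the platform default charset, and the bytes sent could contradict the declared Content-Type.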
>>
>>  However, this is only the output aspect: If this is right, the browser
>> is able to correctly display the UTF-8 data.
>>
>>  Input: The part most people forget is to take care that input data sent
>> back by the browser is correct.
>>
>>  1. Use the attribute "accept-charset" in your HTML form tag: <form ...
>> accept-charset="UTF-8">. This tells the browser to send UTF-8 data. (See
>> http://www.w3.org/TR/html4/interact/forms.html#adef-accept-charset).
>> This requires a compliant browser and is no guarantee that it will work,
>> but it does with current browsers.
>>  2. If you use Tomcat, use the attribute URIEncoding="UTF-8" in the
>> Connector element in your server.xml. This makes UTF-8 input also work
>> for GET parameters (see 3. above).
>>
>>
>>  Please correct me if I am wrong somewhere.
>>
>>  Christoph
>>
>>
>>
>>
>>
>> _______________________________________________
>> This mail is sent to you from the opencms-dev mailing list
>> To change your list options, or to unsubscribe from the list, please 
>> visit
>> http://lists.opencms.org/mailman/listinfo/opencms-dev
>>
>>
> 


-- 
Stephan Hartmann

metamesh

Lippstädter Str. 22
44143 Dortmund

http://www.metamesh.de/


