[opencms-dev] Problem with German umlauts in the search (Lucene) with GET

Claus Priisholm cpr at codedroids.com
Wed Sep 7 08:41:40 CEST 2005



Xavier Ottolini wrote:
> Hi,
> 
> 
>> And indeed that seems to be what happens while logged into the 
>> workspace. But when accessing the site from the outside, the problem 
>> surfaces again.
>> Setting the character encoding explicitly in, say, the template JSP does 
>> not help. It seems that the encoding is either set wrongly or reset to 
>> the default by Tomcat (apparently ISO-8859-1) when a parameter is accessed 
>> before the template JSP gets to set it explicitly.
> 
> When OpenCms exports the files as HTML, the end user downloads static 
> HTML files.
> If Apache httpd is set up as a front-end server, a parameter has to be set in 
> httpd.conf:
> AddDefaultCharset UTF-8


That could of course be another player in the game, but in my case the 
pages were served dynamically. And the issue with multi-byte encoding is 
present even when running a simple (non-OpenCms) JSP directly in Tomcat: 
IE seemingly sends the POST content as ISO-8859-1 regardless, while the 
others send it according to the encoding set in the HTML meta tags. IE's 
approach seems to be wrong (once again), and who knows when that is 
going to change. So that's why I've decided to add a hidden field with a 
multi-byte character to my form processing. If it is returned in a 
mangled state, then I assume that somewhere along the line there was an 
encoding problem that needs to be fixed:

value = new String(value.getBytes("ISO-8859-1"), "UTF-8");

This works with IE, Firefox (on Mac and Windows) and Safari, outside 
OpenCms or inside it, whether in the Offline project or not.
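
A minimal sketch of how the hidden-field check and the conversion above might fit together. The class name, method names and the choice of sentinel character are hypothetical and only serve as illustration; only the getBytes/new String round trip is taken from the line above. It assumes the only failure mode is UTF-8 bytes having been decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical helper illustrating the hidden-field check.
 * Names are made up for this sketch; they are not part of OpenCms or Tomcat.
 */
final class EncodingRepair {

    /** Known multi-byte character sent in the hidden form field (u-umlaut, an arbitrary choice). */
    static final String SENTINEL = "\u00FC";

    /** True if the sentinel came back changed, i.e. the request was decoded with the wrong charset. */
    static boolean isMangled(String returnedSentinel) {
        return !SENTINEL.equals(returnedSentinel);
    }

    /** Re-interprets a value that was decoded as ISO-8859-1 although the bytes were UTF-8. */
    static String repair(String value) {
        return new String(value.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    /** Repairs the value only when the sentinel shows that decoding went wrong. */
    static String repairIfNeeded(String value, String returnedSentinel) {
        return isMangled(returnedSentinel) ? repair(value) : value;
    }

    public static void main(String[] args) {
        // Simulate a container that decoded UTF-8 bytes as ISO-8859-1.
        String mangledSentinel = new String(
                SENTINEL.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        String mangledValue = new String(
                "\u00D6ffnungszeiten".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

        String fixed = repairIfNeeded(mangledValue, mangledSentinel);
        System.out.println(fixed.equals("\u00D6ffnungszeiten"));
    }
}
```

In a form handler, repairIfNeeded would be called once per text parameter, with the hidden field's value passed as the second argument. A sentinel that was damaged in some other way (for example replaced by "?") would also trigger the repair, so the sketch is only as reliable as the assumption stated above.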

Note that, according to some postings on the net, Tomcat always defaults 
to ISO-8859-1, but what Tomcat (or IE, for that matter) does when run 
under another default locale I don't know.
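
To illustrate what such a default would mean for the query from the start of this thread, here is a small sketch that uses only java.net.URLDecoder and java.net.URLEncoder. The class and method names are hypothetical, and the code only simulates the decoding step; it is not Tomcat's or OpenCms's actual code. The second half walks through the double encoding that Achim describes in the quoted message further down, under the same assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Hypothetical demo class; it simulates the decoding step and is not Tomcat code. */
final class QueryDecodingDemo {

    /** Decodes a percent-encoded value using the named charset. */
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    /** Percent-encodes a value using the named charset. */
    static String encode(String raw, String charset) {
        try {
            return URLEncoder.encode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String query = "%C3%96ffnungszeiten"; // value from the original report

        // Decoded as UTF-8, the bytes C3 96 form a single character (O-umlaut).
        // Decoded as ISO-8859-1, they become two separate characters.
        String asUtf8 = decode(query, "UTF-8");
        String asLatin1 = decode(query, "ISO-8859-1");
        System.out.println(asUtf8.length() + " vs " + asLatin1.length());

        // Double encoding: after the second pass only ASCII remains,
        // so the container's charset no longer matters for the first decode.
        String once = encode("\u00D6ffnungszeiten", "UTF-8");
        String twice = encode(once, "UTF-8");
        String afterContainer = decode(twice, "ISO-8859-1");      // simulated container decode
        String afterApplication = decode(afterContainer, "UTF-8"); // explicit second decode
        System.out.println(afterApplication.equals("\u00D6ffnungszeiten"));
    }
}
```

The point of the second half is that the intermediate string contains nothing but ASCII, so it survives whatever charset the first decode uses, and the application is then free to pick UTF-8 for the second decode.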

> 
> 
> For Apache 1.3, the settings are as follows.
> 
> For instance:
> <VirtualHost *:80>
>         ServerAdmin webmaster@myhost.com
>         DocumentRoot /home/apache/myhost/www
>         ServerName www.myhost.com
> 
>         ErrorDocument 404 /errordocs/404.html
>         <Directory "/home/apache/myhost/www/errordocs">
>                 AllowOverride None
>                 Order allow,deny
>                 Allow from all
>                 Options MultiViews
>         </Directory>
> 
>         AddDefaultCharset UTF-8
> 
>         JkMount /formmail wrkr
>         JkMount /*.jsp wrkr
>         JkMount /EcardServlet wrkr
>         <Location "/opencms/WEB-INF/">
>                 AllowOverride None
>                 deny from all
>         </Location>
> </VirtualHost>
> 
> I think that with Apache 2 + Tomcat 5 + OpenCms 6, the settings are 
> different (according to the how-tos). But there is probably a similar 
> parameter in the Apache 2 settings.
> 
> I hope that helps!
> 
> Xavier Ottolini
> Multimedia developer
> 
> Adelis
> 37, rue d'Engwiller
> 67350 La Walck
> France
> Phone: +33 (0) 3 88 72 29 10
> Fax: +33 (0) 3 88 72 29 19
> http://www.adelis.com
> 
>>
>> /Claus
>>
>> Corsin Camichel wrote:
>>
>>> Hi Achim
>>>
>>> Thank you for your response.
>>>
>>>
>>>> I expect your <defaultcontentencoding> (in opencms-system.xml) is 
>>>> not set to UTF-8 but to a
>>>> different one like ISO-8859-1. I recommend setting the default encoding 
>>>> to UTF-8. This could already work.
>>>
>>>
>>> I checked my config file and in there, the defaultcontentencoding is
>>> set to UTF-8.
>>>
>>> But I made a little workaround/hack for my problem.
>>> First I send my query to a dummy JSP which replaces all the umlauts
>>> and other special characters with the proper HTML codes and then does a
>>> sendRedirect to the results page with a new query string. This works
>>> fine for me and it is not too much work.
>>>
>>> Regards
>>> Corsin
>>>
>>> On 9/5/05, Achim Westermann <a.westermann at alkacon.com> wrote:
>>>
>>>> Hi Corsin,
>>>>
>>>> If you insist on supporting e.g. both charsets, the query string has 
>>>> to be encoded an additional time using
>>>> JavaScript form validation on the client side. The trouble comes from 
>>>> the automatic encoding done by
>>>> browsers and the automatic decoding done by Tomcat. While browsers use 
>>>> UTF-8 to encode before submitting forms,
>>>> Tomcat decodes the query (at request parameter access time) using 
>>>> the request encoding, which (here)
>>>> is e.g. ISO-8859-1 because OpenCms served the search page in this 
>>>> default encoding with the meta
>>>> charset content attribute and corresponding HTTP headers.
>>>> The first encoding on the client side will make all special characters 
>>>> (umlauts) disappear: only the '%'
>>>> character will remain as a "character to encode". The second encoding 
>>>> will then encode the '%' a second
>>>> time. Tomcat's automatic decoding will turn these "%25" back into 
>>>> plain '%', which works regardless of
>>>> any charset because it is in the ASCII range, which works for all 
>>>> exotic code pages (it is not
>>>> harmful if a different encoding was used on the client side). The 
>>>> second decode operation, done by OpenCms
>>>> specifically for the query, now uses UTF-8 (just as the browser did) 
>>>> and works correctly.
>>>>
>>>> happy coding,
>>>>
>>>> Achim
>>>>
>>>> -- 
>>>> Achim Westermann
>>>> -------------------
>>>>
>>>> Alkacon Software
>>>> Alexander Kandzior
>>>> An der Wachsfabrik 13
>>>> 50996 Koeln, DE
>>>>
>>>> Tel: +49 (0)2236 3826-0
>>>> Fax: +49 (0)2236 3826-20
>>>> Email: a.westermann at alkacon.com
>>>>
>>>> http://www.alkacon.com
>>>>
>>>>
>>>> Corsin Camichel wrote:
>>>>
>>>>> Hi everybody
>>>>>
>>>>> I am having a problem with German umlauts (ä => ä ü => ü ö
>>>>> => ö) in the Lucene search engine if I work with method="GET".
>>>>> The parameters change to something strange like
>>>>> ?query=%C3%96ffnungszeiten
>>>>> and OpenCms decodes it back to
>>>>> Öffnungszeiten
>>>>> All documents are UTF-8 and have a property locale=de
>>>>>
>>>>> I hope somebody has an idea or can point me to a reference for this 
>>>>> problem.
>>>>>
>>>>> Thank you very much
>>>>>
>>>>> Corsin
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> This mail is sent to you from the opencms-dev mailing list
>>>> To change your list options, or to unsubscribe from the list, please 
>>>> visit
>>>> http://mail.opencms.org/mailman/listinfo/opencms-dev
>>>>
>>>
>>>
>>>
>>>
>>
>>
> 

-- 
Claus Priisholm, CodeDroids ApS
cpr (you know what) codedroids.com - http://www.codedroids.com

Javadocs and other OpenCms stuff: 
http://www.codedroids.com/community/opencms



