[opencms-dev] SOLVED: proper configuration solr configuration for chinese

Schliemann, Kai K.Schliemann at comundus.com
Fri Jul 4 12:40:01 CEST 2014


Hi Patric, hi list,

We finally used the StandardTokenizer to get the results we wanted.

             <analyzer type="index">
                 <tokenizer class="solr.StandardTokenizerFactory"/>
                 <filter class="solr.LowerCaseFilterFactory"/>
             </analyzer>
             <analyzer type="query">
                 <tokenizer class="solr.StandardTokenizerFactory"/>
                 <filter class="solr.LowerCaseFilterFactory"/>
                 <filter class="solr.PositionFilterFactory" />
             </analyzer>
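
For context, the two analyzers above might sit inside the text_zh field type from the original question (quoted further down) roughly as follows. This is only a sketch: the name and positionIncrementGap values are carried over from that question, and it is an assumption that they stay unchanged in the final schema.xml.

```xml
<!-- Sketch only: name and positionIncrementGap are assumed to stay as in the original question -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
    <!-- Index time: tokenize the text, then lowercase the tokens -->
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <!-- Query time: same chain, with PositionFilterFactory added at query time only -->
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PositionFilterFactory"/>
    </analyzer>
</fieldType>
```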

Your suggestion didn't work for us: with that config, indexing from the Admin area of OpenCms didn't even start. Thanks for helping anyway.
By the way, the Solr wiki says that the PositionFilterFactory should only be used at query time.

Our main problem, however, was not finding the correct indexer; it was the proper configuration of Tomcat.
As the Solr search sends GET requests, we had to set the correct URI encoding on the Connector in server.xml:

<Connector connectionTimeout="20000" port="8080" protocol="HTTP/1.1" redirectPort="8443" URIEncoding="UTF-8"/>

On our production server we use the AJP connector, so we had to add it there as well:

<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" URIEncoding="UTF-8"/>

This setting also solved the problem of searching for words with German umlauts (e.g. müller). Because we still got search results with the wrong settings, it was hard to spot that these were the wrong results.

Best regards
Kai Schliemann
comundus GmbH

From: opencms-dev-bounces at opencms.org [mailto:opencms-dev-bounces at opencms.org] On Behalf Of Patric Dosch
Sent: Wednesday, 25 June 2014 12:01
To: The OpenCms mailing list
Subject: Re: [opencms-dev] proper configuration solr configuration for chinese

Hey Kai,

I have added Chinese in a similar way. My configuration uses the SmartChineseSentenceTokenizerFactory. So far, the customer has not reported any problems.

<analyzer>
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory" />
</analyzer>


In addition, I have a copyField in my schema.xml. Perhaps this helps?
<copyField source="*_zh" dest="text_zh"/>

Regards,
Patric



From: opencms-dev-bounces at opencms.org [mailto:opencms-dev-bounces at opencms.org] On Behalf Of Schliemann, Kai
Sent: Friday, 20 June 2014 18:59
To: 'The OpenCms mailing list (opencms-dev at opencms.org)'
Subject: [opencms-dev] proper configuration solr configuration for chinese

Hi list,
Can somebody please give me a hint on a proper configuration of the Solr search for Chinese, or check whether our configuration is correct?

I defined the following in \WEB-INF\solr\conf\schema.xml:
...
<types>
...
    <!-- found this on the net, but not sure if it is the right tokenizer and if I need some filters -->
    <fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.CJKTokenizerFactory"/>
      </analyzer>
    </fieldType>
...
</types>
...
<fields>
...
<!-- just copied "text_de" fields and replaced "de" by "zh" -->
<field name="text_zh"             type="text_zh"      indexed="true"  stored="false" multiValued="true"/><!-- Catchall for Chinese text fields -->
...
<!-- just copied "text_de" fields and replaced "de" by "zh" -->
<dynamicField name="*_zh"         type="text_zh"      indexed="true"  stored="true"/>
</fields>
...

I get search results, but some search phrases return nothing even though the word is in the document (checked with Luke).


Thanks a lot in advance.

Best regards
Kai