[opencms-dev] Search Engine (Lucene)

Jozef Chocholáček jozef.chocholacek at qbizm.cz
Thu Aug 4 11:25:01 CEST 2005


    Hi, folks!

   Some more experience with the Lucene Search module for OpenCms v.5, 
useful on huge sites (we use it on a site with more than 7000 docs, more 
than 50% of them of a type other than Page: doc, xls, pdf):

1) If used on MS Office documents and/or PDFs, the indexing can easily 
crash. If the DocumentFactory cannot read a document (e.g. it is 
password protected), it throws an exception. The 
MAX_DOCUMENT_READING_EXCEPTION in the IndexManager is set to 10 by 
default, so indexing crashes on the 11th bad document.

2) By default, Lucene puts only the first 10000 terms ("words") of a 
document into the index. This is caused by DEFAULT_MAX_FIELD_LENGTH in 
org.apache.lucene.index.IndexWriter. That is enough for common sites 
with short pages, but if you have many PDF docs with more than 100 pages 
and cannot find them even though the log shows that they were indexed, 
try adding "-Dorg.apache.lucene.maxFieldLength=SOME_BIGGER_VALUE" when 
starting your Tomcat (or whatever).

3) As you see, the site is really huge and the documents really big, so 
you can easily get a TooManyClauses exception even on simple queries. To 
avoid it, increase maxClauseCount in 
org.apache.lucene.search.BooleanQuery by calling 
BooleanQuery.setMaxClauseCount(SOME_REASONABLE_VALUE), or by adding 
"-Dorg.apache.lucene.maxClauseCount=SOME_REASONABLE_VALUE" to your 
Tomcat startup script. The default is 1024; our reasonable value is 8192.
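
A rough sketch of why points 2 and 3 bite on big documents, in plain 
Java with no Lucene on the classpath. It only models the behaviour: the 
class and method names are made up for illustration, and the idea that a 
prefix query is expanded into one clause per matching term is my 
understanding of Lucene, not something taken from the module. The two 
property names and the defaults 10000 and 1024 are the ones mentioned 
above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Stand-alone model of the two limits; it does not call Lucene itself.
// All class and method names here are hypothetical.
class LuceneLimitsSketch {

    // Point 2: only the first maxFieldLength tokens of a field reach the index.
    static Set<String> indexedTerms(List<String> tokens, int maxFieldLength) {
        int kept = Math.min(tokens.size(), maxFieldLength);
        return new HashSet<String>(tokens.subList(0, kept));
    }

    // Point 3 (assumed model): a prefix query is expanded into one clause per
    // matching term, and the expansion fails once it exceeds maxClauseCount.
    static List<String> expandPrefix(String prefix, Set<String> dictionary, int maxClauseCount) {
        List<String> clauses = new ArrayList<String>();
        for (String term : dictionary) {
            if (!term.startsWith(prefix)) continue;
            if (clauses.size() >= maxClauseCount) {
                throw new IllegalStateException("too many clauses: limit is " + maxClauseCount);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Same property names and defaults as in the post; -D options override them.
        int maxFieldLength = Integer.getInteger("org.apache.lucene.maxFieldLength", 10000);
        int maxClauseCount = Integer.getInteger("org.apache.lucene.maxClauseCount", 1024);

        // A long document: 12000 filler tokens, then one interesting word at the end.
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < 12000; i++) tokens.add("filler" + i);
        tokens.add("appendix");

        Set<String> index = indexedTerms(tokens, maxFieldLength);
        System.out.println("'appendix' searchable: " + index.contains("appendix"));

        try {
            int n = expandPrefix("filler", new TreeSet<String>(index), maxClauseCount).size();
            System.out.println("prefix query expanded to " + n + " clauses");
        } catch (IllegalStateException e) {
            System.out.println("query failed: " + e.getMessage());
        }
    }
}
```

Run without any -D options, the word after the 10000th token is not 
searchable and the prefix expansion runs past 1024 clauses and fails. 
Passing a bigger -Dorg.apache.lucene.maxFieldLength brings the last word 
back into the model index.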


   I hope this helps someone save time.


J.Ch.
-- 
Ing. Jozef Chocholacek                      Qbizm technologies, a.s.
chief analyst                               ... the art of software.
____________________________________________________________________
www.qbizm-technologies.cz    www.qbizm.cz      www.qbizm-services.cz


