[opencms-dev] Search Engine (Lucene)
Jozef Chocholáček
jozef.chocholacek at qbizm.cz
Thu Aug 4 11:25:01 CEST 2005
Hi, folks!
Here is some more experience with the Lucene Search module for OpenCMS
v.5, useful on huge sites (we use it on a site with more than 7000
documents, more than 50% of them of a type other than Page: doc, xls,
pdf):
1) If used on MS Office documents and/or PDFs, the indexing can easily
crash. If the DocumentFactory cannot read a document (e.g. because it
is password protected), it throws an exception, but
MAX_DOCUMENT_READING_EXCEPTION in the IndexManager is set to 10 by
default, so indexing crashes on the 11th bad document.
2) By default, Lucene puts only the first 10000 terms ("words") of a
document into the index. This limit comes from DEFAULT_MAX_FIELD_LENGTH
in org.apache.lucene.index.IndexWriter. It is enough for common sites
with short pages, but if you have many PDF docs with more than 100 pages
and you cannot find them by searching even though the log shows that
they were indexed, try adding
"-Dorg.apache.lucene.maxFieldLength=SOME_BIGGER_VALUE" to the JVM
options when starting your Tomcat (or whatever you use).
3) As you can see, the site is really huge and the documents are really
big, so you can easily get a TooManyClauses exception even on simple
queries. To avoid it, increase maxClauseCount in
org.apache.lucene.search.BooleanQuery, either by calling
BooleanQuery.setMaxClauseCount(SOME_REASONABLE_VALUE) or by adding
"-Dorg.apache.lucene.maxClauseCount=SOME_REASONABLE_VALUE" to your
Tomcat startup script. The default is 1024; our reasonable value is 8192.
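
To check which values of the two -D options from points 2 and 3 your
JVM actually sees, something like the small class below may help. It is
only a sketch: the class name LuceneLimits and its effective() helper
are made up for this mail and are not part of Lucene or OpenCMS. It uses
nothing but the JDK, and it assumes that your Lucene version reads these
system properties as described above, with the defaults 10000 and 1024.

```java
// Hypothetical helper, not part of Lucene or OpenCMS.
class LuceneLimits {
    // Property names and default values as given in points 2 and 3.
    static final String MAX_FIELD_LENGTH = "org.apache.lucene.maxFieldLength";
    static final String MAX_CLAUSE_COUNT = "org.apache.lucene.maxClauseCount";

    // Returns the -D override if it is set to a parsable integer,
    // otherwise the given default.
    static int effective(String property, int defaultValue) {
        return Integer.getInteger(property, defaultValue);
    }

    public static void main(String[] args) {
        System.out.println(MAX_FIELD_LENGTH + " = "
                + effective(MAX_FIELD_LENGTH, 10000));
        System.out.println(MAX_CLAUSE_COUNT + " = "
                + effective(MAX_CLAUSE_COUNT, 1024));
    }
}
```

Run it with the same -D options you pass to Tomcat. If it prints the
defaults, the options are not reaching the JVM (a typo in the property
name or a non-numeric value also falls back to the default).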
I hope this helps someone save some time.
J.Ch.
--
Ing. Jozef Chocholacek Qbizm technologies, a.s.
chief analyst ... the art of software.
____________________________________________________________________
www.qbizm-technologies.cz www.qbizm.cz www.qbizm-services.cz