[opencms-dev] Lucene Search Integration

Marian Kasala marian.kasala at apsoft.sk
Fri Apr 26 13:39:43 CEST 2002


Hi Simon,

I have extended your first version of the Lucene-OpenCms integration.
It now has a reindexing feature and supports additional formats (PDF, RTF).

Reindexing updates only the differences.
PDF indexing is done either with the Etymon PJ library
(http://www.etymon.com/pj/)
or with any suitable external converter (batch conversion).
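
As an illustration of what "updates only differences" can mean, here is a minimal sketch, not the code from the attached zip. It assumes the indexer keeps a map of resource path to last-modified timestamp from the previous run and compares it with the current state. The class name ReindexPlan and the timestamp-based comparison are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: decides which resources an incremental reindex must touch. */
final class ReindexPlan {
    final List<String> toAdd = new ArrayList<>();
    final List<String> toUpdate = new ArrayList<>();
    final List<String> toDelete = new ArrayList<>();

    /**
     * @param indexed path -> last-modified timestamp recorded at the previous indexing run
     * @param current path -> last-modified timestamp of the resources as they exist now
     */
    static ReindexPlan diff(Map<String, Long> indexed, Map<String, Long> current) {
        ReindexPlan plan = new ReindexPlan();
        // TreeMap copies give a deterministic (sorted) order of the results.
        for (Map.Entry<String, Long> e : new TreeMap<>(current).entrySet()) {
            Long previous = indexed.get(e.getKey());
            if (previous == null) {
                plan.toAdd.add(e.getKey());        // new resource, never indexed
            } else if (!previous.equals(e.getValue())) {
                plan.toUpdate.add(e.getKey());     // changed since the last run
            }
            // unchanged resources are skipped entirely
        }
        for (String path : new TreeMap<>(indexed).keySet()) {
            if (!current.containsKey(path)) {
                plan.toDelete.add(path);           // resource no longer exists
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, Long> indexed = new HashMap<>();
        indexed.put("/a.html", 1L);
        indexed.put("/b.pdf", 2L);
        Map<String, Long> current = new HashMap<>();
        current.put("/b.pdf", 5L);
        current.put("/c.rtf", 7L);
        ReindexPlan plan = diff(indexed, current);
        System.out.println("add=" + plan.toAdd + " update=" + plan.toUpdate + " delete=" + plan.toDelete);
    }
}
```

The three lists would then drive the actual index operations (add, replace, remove), so untouched documents are never re-parsed.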

Because Etymon PJ is limited in text extraction (it doesn't work with
encrypted PDFs, and
in some cases the extracted text is collapsed into a single word),
and I didn't find any other Java library, I added support for an external batch
extractor.
For this purpose I use Advanced Pdf to HTML converter v. 1.4,
but it is licensed software (http://www.intrapdf.com/index.html).
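
One way to combine the two extraction paths is to try the library first and fall back to the external converter when the result looks unusable. The sketch below is only an assumption about how that could be wired up; it does not call Etymon PJ or the converter itself. Both extractors are passed in as plain suppliers, and the class name PdfTextFallback, the 40-character threshold, and the "no whitespace" heuristic for collapsed text are all hypothetical.

```java
import java.util.function.Supplier;

/** Hypothetical helper: prefer the library extractor, fall back to an external one. */
final class PdfTextFallback {
    // Assumed threshold: a whitespace-free result longer than this is treated as collapsed.
    static final int MAX_WORD_LENGTH = 40;

    /** Returns true for missing, empty, or "collapsed into a single word" text. */
    static boolean looksUnusable(String text) {
        if (text == null) {
            return true;
        }
        String t = text.trim();
        if (t.isEmpty()) {
            return true;
        }
        boolean hasWhitespace = t.chars().anyMatch(Character::isWhitespace);
        return t.length() > MAX_WORD_LENGTH && !hasWhitespace;
    }

    /**
     * @param libraryExtractor  stands in for the in-process library (may throw, e.g. on encrypted files)
     * @param externalExtractor stands in for the external batch converter
     */
    static String extract(Supplier<String> libraryExtractor, Supplier<String> externalExtractor) {
        String text;
        try {
            text = libraryExtractor.get();
        } catch (RuntimeException e) {
            text = null; // treat a failed extraction like an empty result
        }
        return looksUnusable(text) ? externalExtractor.get() : text;
    }

    public static void main(String[] args) {
        String result = extract(
                () -> { throw new IllegalStateException("encrypted PDF"); },
                () -> "text from the external converter");
        System.out.println(result);
    }
}
```

Keeping the extractors behind a common interface means the external tool can be swapped without touching the indexing code.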


I enclose no documentation, but the outer interface is almost the same as in
Simon's first version,
except that you can specify the index directory in templates:
<indexDirectory>webapps/opencms/index</indexDirectory>
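
For readers without the zip at hand, this is a rough sketch of how such a setting could be read with the standard Java DOM parser. It is not the attached implementation: the surrounding <template> root element, the class name IndexDirectoryReader, and the fallback value are assumptions for the example. Only the <indexDirectory> element itself comes from the message above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads the index directory from a template's XML content. */
final class IndexDirectoryReader {

    /** Returns the text of the first indexDirectory element, or the fallback if absent or empty. */
    static String readIndexDirectory(String templateXml, String fallback) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(templateXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("indexDirectory");
            if (nodes.getLength() == 0) {
                return fallback;
            }
            String value = nodes.item(0).getTextContent().trim();
            return value.isEmpty() ? fallback : value;
        } catch (Exception e) {
            // Malformed XML or parser setup problems are reported as a bad argument.
            throw new IllegalArgumentException("Cannot read template XML", e);
        }
    }

    public static void main(String[] args) {
        // The <template> wrapper is an assumed root element for this example.
        String xml = "<template><indexDirectory>webapps/opencms/index</indexDirectory></template>";
        System.out.println(readIndexDirectory(xml, "index"));
    }
}
```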

Maybe you or anyone else will find these files useful, so I'm posting them.

Best Regards,
Marian Kasala

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lucene-opencms.zip
Type: application/x-zip-compressed
Size: 15026 bytes
Desc: not available
URL: <https://webmail.opencms.org/pipermail/opencms-dev/attachments/20020426/4568f2e2/attachment.bin>