[opencms-dev] Lucene and Binary Documents

M Butcher mbutcher at grcomputing.net
Wed Oct 15 05:07:01 CEST 2003


Hi Ben,

On Tue, 2003-10-14 at 18:57, Ben Rometsch wrote:
> I have the Lucene module working fine, indexing HTML documents on my site. I
> know you can plug in extra components to have Lucene index PDF and Microsoft
> Word documents; has anyone managed to do this within OpenCMS? Are there any
> steps that need to be taken differently to an out-the-box Lucene
> installation?

I've not done this with the Lucene module yet. Someone on the list last
month said they had done something like this by running Lucene over an
_exported_ version of the files. The Lucene module, however, operates
inside OpenCMS, directly on the VFS.

> As an interim measure, how easy would it be to just have Lucene index the
> filenames of any Word or PDF documents within a certain area of the VFS? Can
> anyone provide any information on how to go about this?

The Lucene module indexes certain properties of binary files. If you
give each document a title, for instance, then it can be indexed by
that title pretty easily. If you have the source code, you can look at
net.grcomputing.opencms.search.BodylessDocument for an idea of how
binary indexing works.
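
To make the idea concrete, here is a rough plain-Java sketch of indexing
a binary file by its metadata alone. It is not the code from
BodylessDocument: the class name, the method name, the field names
("path", "filename", "title") and the "Title" property key are all
assumptions made up for illustration. In a real setup each entry in the
returned map would become one field on a Lucene document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the module's actual implementation.
public class BodylessFieldsSketch {

    /**
     * Collects the searchable fields for a binary file. No body text is
     * extracted; only the VFS path, the file name and (if set) the title
     * property are used. All field names here are assumed.
     */
    static Map<String, String> fieldsFor(String vfsPath, Map<String, String> properties) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("path", vfsPath);

        // The file name is whatever follows the last '/' in the VFS path.
        String filename = vfsPath.substring(vfsPath.lastIndexOf('/') + 1);
        fields.put("filename", filename);

        // Only add a title field when the property is present and non-blank.
        String title = properties.get("Title");
        if (title != null && !title.trim().isEmpty()) {
            fields.put("title", title.trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("Title", "Annual Report 2003");
        System.out.println(fieldsFor("/docs/reports/annual.pdf", props));
    }
}
```

A file with no title set still gets "path" and "filename" fields, which
covers the interim case of searching Word or PDF documents by file name
only.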

Matt

-- 
M Butcher <mbutcher at grcomputing.net>


