[opencms-dev] Lucene and Binary Documents

Ben Rometsch ben at solidstategroup.com
Fri Oct 17 07:17:00 CEST 2003


Hi Matt,

Thanks for the reply. Suppose I just want the document title to be included
in the Lucene index. Looking at the code in the
net.grcomputing.opencms.search.BodylessDocument class, it appears to ignore
what the CMSObject is and attempt to index it regardless. Is this correct?

If this is the case, is it simply a matter of instructing Lucene to index
objects other than HTML files in the VFS (i.e. documents)? Or would I have
to create another class, something like
net.grcomputing.opencms.search.FileDocument, and add a new hook into that
class via the registry.xml fragment? Or does the BodylessDocument class
already provide this functionality, so that it's just a matter of adding a
new XML fragment to the registry.xml file?

Sorry for all the questions!

Ben

-----Original Message-----
From: opencms-dev-admin at opencms.org [mailto:opencms-dev-admin at opencms.org]
On Behalf Of M Butcher
Sent: 15 October 2003 12:44
To: opencms-dev at opencms.org
Subject: Re: [opencms-dev] Lucene and Binary Documents

Hi Ben,

On Tue, 2003-10-14 at 18:57, Ben Rometsch wrote:
> I have the Lucene module working fine, indexing HTML documents on my 
> site. I know you can plug in extra components to have Lucene index PDF 
> and Microsoft Word documents; has anyone managed to do this within 
> OpenCMS? Are there any steps that need to be taken differently to an 
> out-the-box Lucene installation?

I've not done this with the Lucene module yet. Someone on the list last
month said they had done something like this by running Lucene over an
_exported_ version of the files. The Lucene module, however, operates inside
OpenCMS directly on the VFS.

> As an interim measure, how easy would it be to just have Lucene index 
> the filenames of any Word or PDF documents within a certain area of 
> the VFS? Can anyone provide any information on how to go about this?

The Lucene module indexes certain properties for binary files. If you give
each document a title, for instance, then it can be indexed pretty easily.
If you have the source code, you can look at
net.grcomputing.opencms.search.BodylessDocument for an idea as to how binary
indexing works.
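
To illustrate the idea, here is a minimal sketch only. It is not the
module's actual BodylessDocument code and it uses no Lucene or OpenCMS API.
The class name, the method, the field names ("path", "filename", "title")
and the "Title" property key are all assumptions made up for the example.
It shows how a binary file could be described for indexing from its VFS path
and properties alone, without ever reading the file body:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: this is not the module's BodylessDocument
// class, and it deliberately avoids any Lucene or OpenCMS API.
public class BodylessFieldsSketch {

    // Builds the name/value pairs an indexer could store for a binary file,
    // using only its VFS path and its properties. The body is never read.
    static Map<String, String> fieldsFor(String vfsPath,
                                         Map<String, String> properties) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("path", vfsPath);

        // The file name is whatever follows the last '/' in the path.
        String fileName = vfsPath.substring(vfsPath.lastIndexOf('/') + 1);
        fields.put("filename", fileName);

        // Only add a title field when the property is actually set
        // ("Title" as the property key is an assumption).
        String title = properties.get("Title");
        if (title != null && !title.isEmpty()) {
            fields.put("title", title);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("Title", "Annual Report 2003");
        System.out.println(fieldsFor("/docs/report.pdf", props));
    }
}
```

In a real indexer each of these pairs would become a field on the indexed
document, so a Word or PDF file with a title set would be findable by title
or file name even though its contents are not parsed.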

Matt

--
M Butcher <mbutcher at grcomputing.net>