[opencms-dev] Indexing Binary Documents with the Lucene Module

Stephan Hartmann beffe at beffe.de
Wed Oct 22 19:17:01 CEST 2003


Hi Ben,

You don't have to use the news module.
You can just replace the module and then change the registry.xml as described,
but leave out the docFactory for news and the contentDefinitions section.
Just add this:
 <docFactory enabled="true" type="binary">
  <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
 </docFactory>
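
Putting the pieces of this thread together, the docFactories section could then look roughly like the sketch below. It is assembled only from the fragments quoted further down and assumes the updated module is installed. The news docFactory is left out, and factories for other resource types are omitted because their type names are not given in this thread.

```xml
<docFactories>
    <!-- Plain text files, mapped by extension (taken from the quoted example) -->
    <docFactory enabled="true" type="plain">
        <fileType name="plaintext">
            <extension>.txt</extension>
            <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
        </fileType>
    </docFactory>
    <!-- Binary files: the title is indexed, the body is not; no extension mapping -->
    <docFactory enabled="true" type="binary">
        <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
    </docFactory>
</docFactories>
```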

In my earlier mail I was just explaining what else can be done with the module.

Bye,
Stephan

----- Original Message -----
From: "Ben Rometsch" <ben at solidstategroup.com>
To: <opencms-dev at opencms.org>
Sent: Wednesday, October 22, 2003 4:41 PM
Subject: [opencms-dev] Indexing Binary Documents with the Lucene Module


> Hi Stephan,
>
> Thanks for the reply. I've managed to get the lucene source recompiled so
> that I could add some logging to the BodylessDocument class to see what was
> going on, but all I discovered was what you told me, that it wasn't indexing
> Binary documents!
>
> Is there no simple way of having the module index just the filename of the
> document? The additions you have made to index the news module are maybe too
> complex for me... All I want to do is index the filename in the VFS...
>
> Thanks,
> Ben
>
> -----Original Message-----
> From: opencms-dev-admin at opencms.org [mailto:opencms-dev-admin at opencms.org]
> On Behalf Of Hartmann, Waehrisch & Feykes GmbH
> Sent: 22 October 2003 16:51
> To: opencms-dev at opencms.org
> Subject: Re: [opencms-dev] (no subject)
>
> Hi Ben,
>
> I think this won't work, since the plainDocFactory will only be used for
> files of type "plain" but not for files of type "binary".
> Recently we have made some additions to the module, commissioned by Lenord,
> Bauer & Co. GmbH, that could meet your needs. They introduce a more flexible
> way of defining docFactories, so that you can add new factories without
> having to recompile the whole module. Other modules (like the news) can
> bring their own docFactory, and all you have to do is edit the registry.xml.
> Here is an example:
>
>             <docFactories>
>                 <docFactory enabled="true" type="plain">
>                     <fileType name="plaintext">
>                         <extension>.txt</extension>
>                         <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
>                     </fileType>
>                 </docFactory>
>                 <docFactory enabled="true" type="news">
>                     <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
>                 </docFactory>
>             </docFactories>
>
> To index binary files all you need to add is this:
>
>            <docFactory enabled="true" type="binary">
>                <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
>            </docFactory>
>
> There should be no need for an extension mapping.
>
> For those interested:
> For ContentDefinitions (like news) I introduced the following:
>             <contentDefinitions>
>                 <contentDefinition type="news"> <!-- must match docFactory type -->
>                     <class>com.opencms.modules.homepage.news.NewsContentDefinition</class>
>                     <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initClass>
>                     <listMethod name="getNewsList">
>                         <param type="java.lang.Integer">1</param>
>                         <param type="java.lang.String">-1</param>
>                     </listMethod>
>                     <page uri="/news.html?__element=entry">
>                         <param method="getIntId" name="newsid"/>
>                     </page>
>                 </contentDefinition>
>             </contentDefinitions>
>
> In short:
> initClass is optional: for the news, the news classes have to be loaded to
> initialize the db pool.
> listMethod: a method of the content definition class that returns a List of
> elements.
> page: the page that can display an entry. Here it is a JSP that has a
> template element "entry". It also needs the id of the news item.
> getIntId is a method of the content definition class and newsid is the URL
> parameter the page needs. A link like news.html?__element=entry&newsid=xy
> will be generated.
>
> Best regards,
> Stephan
>
>
> ----- Original Message -----
> From: "Ben Rometsch" <ben at solidstategroup.com>
> To: <opencms-dev at opencms.org>
> Sent: Wednesday, October 22, 2003 6:15 AM
> Subject: [opencms-dev] (no subject)
>
>
> > Hi Matt,
> >
> > I am not having any joy! I've updated my registry.xml file, with the
> > appropriate section reading:
> >
> > <luceneSearch>
> >     <mergeFactor>100000</mergeFactor>
> >     <permCheck>true</permCheck>
> >     <indexDir>c:\search</indexDir>
> >     <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
> >     <subsearch>true</subsearch>
> >     <project>online</project>
> >     <docFactories>
> >         <pageDocFactory enabled="true">
> >             <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
> >         </pageDocFactory>
> >         <plainDocFactory enabled="true">
> >             <fileType name="plaintext">
> >                 <extension>.txt</extension>
> >                 <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> >             </fileType>
> >             <fileType name="taggedtext">
> >                 <extension>.html</extension>
> >                 <extension>.htm</extension>
> >                 <extension>.xml</extension>
> >                 <!-- This will strip tags before processing -->
> >                 <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
> >             </fileType>
> >
> >             <!-- Index binary documents -->
> >             <fileType name="plaindocument">
> >                 <extension>.doc</extension>
> >                 <extension>.xls</extension>
> >                 <extension>.pdf</extension>
> >                 <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> >             </fileType>
> >         </plainDocFactory>
> >         <jspDocFactory enabled="true">
> >             <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
> >         </jspDocFactory>
> >         <xmlTemplateDocFactory enabled="false"/>
> >     </docFactories>
> >     <directories>
> >         <directory location="/release/">
> >             <section>Test</section>
> >             <subsearch>true</subsearch>
> >         </directory>
> >         <directory location="/RGLIntranet/">
> >             <section>Test2</section>
> >             <subsearch>true</subsearch>
> >         </directory>
> >     </directories>
> > </luceneSearch>
> >
> > Notice the section beginning after the remark "Index binary documents".
> >
> > But I cannot get any hits when searching for document names that are in the
> > VFS. The other (HTML) searches are working OK. Is the "name" property of the
> > fileType tag important? I wasn't sure what to add here... I'm not quite sure
> > how to move forward. Maybe it would be an idea to add some debugging trace
> > to the BodylessDocument class to see what is going on inside it? I want to
> > make sure my XML is correct first, though!
> >
> > Thanks for the help,
> > Ben
> >
> >
> > On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote:
> > > Hi Matt,
> > >
> > > Thanks for the reply. If I just want to get the document title to be
> > > included in the Lucene index, looking at the code in the
> > > net.grcomputing.opencms.search.BodylessDocument class it appears to ignore
> > > what the CMSObject is, and attempt to index it regardless. Is this correct?
> > >
> >
> > Correct. It will already index the title, but it will not attempt to
> > index the body.
> >
> > > If this is the case, is it simply a matter of instructing Lucene to index
> > > objects other than HTML files in the VFS (i.e. documents)? Or would I have
> > > to create another class, something like
> > > net.grcomputing.opencms.search.FileDocument, and add a new hook into that
> > > class via the registry.xml fragment? Or does the BodylessDocument class
> > > provide this functionality, and it's just a matter of adding a new XML
> > > fragment to the registry.xml file?
> >
> > Again, you are right -- simply adding the appropriate configuration to
> > the registry.xml file will suffice. I believe that you will just need to
> > extend the plainDocument tag set to include extensions and processors...
> > I _think_ that binary files get handled by the plain handler.
> >
> > Matt
> >
> > _______________________________________________
> > This mail is sent to you from the opencms-dev mailing list
> > To change your list options, or to unsubscribe from the list, please visit
> > http://mail.opencms.org/mailman/listinfo/opencms-dev
>
> Stephan Hartmann
> Unternehmensberatung Währisch & Feykes GmbH
> Gustav-Adolf-Str. 5
> 47057 Duisburg
>
> Tel.: 0203-373070
> Fax: 0203-376766
> E-Mail: hartmann at wfnetz.de
> Internet: www.wfnetz.de
>
> E-mails sent over the Internet can be created under someone else's name or
> manipulated. For this reason, the messages we send by e-mail never contain
> legally binding declarations of intent.
>



