[opencms-dev] Indexing Binary Documents with the Lucene Module

Stephan Hartmann hartmann at waehrisch-feykes.de
Tue Oct 28 09:31:01 CET 2003


Hi Ben,

I assume that it is still using the old version 1.3 and is therefore not
finding any document factories. Did you restart Tomcat?

I found another problem: the BodylessDocument indexes the title, description
and keywords of a file in the corresponding fields of the Lucene index ("title",
"description" and "keywords"). However, the SearchHelper used in the search JSP
only searches the "body" field of the index. That means that your PDF files
will never be found (and other files won't be found by words in their title
etc. either).
So we first have to find out how to tell the SearchHelper to search in all
indexed fields. Maybe Matt knows?
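
Until that is clear, one possible workaround could be to expand the user's
input into a query that names every indexed field explicitly. The following is
only a minimal sketch: it assumes that the query string ends up in Lucene's
standard query parser (which understands the field:term syntax), which I have
not verified for the SearchHelper. The class MultiFieldQuery and its method
expand are hypothetical names and not part of the module. It is written for a
current Java version and does not escape special query characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of the Lucene module or of OpenCms. */
final class MultiFieldQuery {

    private MultiFieldQuery() {
    }

    /**
     * Expands each whitespace-separated term so that it may match in any of
     * the given fields. All terms are required (joined with AND).
     * Special query characters in the input are NOT escaped here.
     */
    static String expand(String userInput, String... fields) {
        List<String> clauses = new ArrayList<>();
        for (String term : userInput.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields no clauses
            }
            List<String> alternatives = new ArrayList<>();
            for (String field : fields) {
                alternatives.add(field + ":" + term);
            }
            clauses.add("(" + String.join(" OR ", alternatives) + ")");
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        // Field names as written by the BodylessDocument, plus "body".
        System.out.println(
            expand("annual report", "body", "title", "description", "keywords"));
    }
}
```

The expanded string would then be passed to the search in place of the raw
user input, provided the assumption about the query parser holds.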

Bye,
Stephan


On Tuesday, 28 October 2003 04:07, you wrote:
> Hi Stephan,
>
> I'm still having no luck. After updating the search module to 1.4 and
> updating my registry.xml fragment to read:
>
>
>         <luceneSearch>
>             <mergeFactor>100000</mergeFactor>
>             <permCheck>true</permCheck>
>             <indexDir>c:\search</indexDir>
>             <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
>             <subsearch>true</subsearch>
>             <project>online</project>
>             <docFactories>
>                 <docFactory enabled="true" type="page">
>                     <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
>                 </docFactory>
>                 <docFactory enabled="true" type="plain">
>                     <fileType name="plaintext">
>                         <extension>.txt</extension>
>                         <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
>                     </fileType>
>                     <fileType name="taggedtext">
>                         <extension>.html</extension>
>                         <extension>.htm</extension>
>                         <extension>.xml</extension>
>                         <!-- This will strip tags before processing -->
>                         <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
>                     </fileType>
>                 </docFactory>
>                 <docFactory enabled="true" type="jsp">
>                     <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
>                 </docFactory>
>                 <docFactory enabled="false" type="XML Template"/>
>                 <docFactory enabled="true" type="binary">
>                     <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
>                 </docFactory>
>             </docFactories>
>             <directories>
>                 <directory location="/RGLIntranet/">
>                     <section>Test2</section>
>                     <subsearch>true</subsearch>
>                 </directory>
>             </directories>
>         </luceneSearch>
>
>
> Now no documents are being indexed at all. From the logs:
>
>
> [28.10.2003 14:05:10] <opencms_cronscheduler> Starting job for com.opencms.core.CmsCronEntry{* * * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true}
> [28.10.2003 14:05:10] <opencms_info> =====IndexManager=============================================================
> [28.10.2003 14:05:10] <opencms_info> Analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
> [28.10.2003 14:05:10] <opencms_info> IndexManager: indexing /RGLIntranet/
> [28.10.2003 14:05:10] <opencms_info> IndexManager: indexing /RGLIntranet/protected/
> [28.10.2003 14:05:10] <opencms_info> IndexManager: 0 documents are being processed
> [28.10.2003 14:05:10] <opencms_info> Done
>
>
> Any ideas?
> Ben
>
> -----Original Message-----
> From: opencms-dev-admin at opencms.org [mailto:opencms-dev-admin at opencms.org]
> On Behalf Of Stephan Hartmann
> Sent: 23 October 2003 02:46
> To: opencms-dev at opencms.org
> Subject: Re: [opencms-dev] Indexing Binary Documents with the Lucene Module
>
> Hi Ben,
>
> You don't have to use the news module.
> You can just replace the module and then change the registry.xml, but
> leave out the docFactory for news and the contentDefinitions section.
> Just add this:
>  <docFactory enabled="true" type="binary">
>   <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
>  </docFactory>
>
> I just explained what else can be done with the module.
>
> Bye,
> Stephan
>
> ----- Original Message -----
> From: "Ben Rometsch" <ben at solidstategroup.com>
> To: <opencms-dev at opencms.org>
> Sent: Wednesday, October 22, 2003 4:41 PM
> Subject: [opencms-dev] Indexing Binary Documents with the Lucene Module
>
> > Hi Stephan,
> >
> > Thanks for the reply. I've managed to get the lucene source recompiled
> > so that I could add some logging to the BodylessDocument class to see
> > what was going on, but all I discovered was what you told me, that it
> > wasn't indexing binary documents!
> >
> > Is there no simple way of having the module index just the filename of
> > the document? The additions you have made to index the news module are
> > maybe too complex for me... All I want to do is index the filename in
> > the VFS...
> >
> > Thanks,
> > Ben
> >
> > -----Original Message-----
> > From: opencms-dev-admin at opencms.org
> > [mailto:opencms-dev-admin at opencms.org]
> > On Behalf Of Hartmann, Waehrisch & Feykes GmbH
> > Sent: 22 October 2003 16:51
> > To: opencms-dev at opencms.org
> > Subject: Re: [opencms-dev] (no subject)
> >
> > Hi Ben,
> >
> > I think this won't work, since the plainDocFactory will only be used
> > for files of type "plain" but not for files of type "binary".
> > Recently we have made some additions to the module, commissioned by
> > Lenord, Bauer & Co. GmbH, that could meet your needs. They introduce a
> > more flexible way of defining docFactories, so that you can add new
> > factories without having to recompile the whole module. Other modules
> > (like the news) can then bring their own docFactory, and all you have
> > to do is edit the registry.xml.
> >
> > Here is an example:
> >
> >             <docFactories>
> >                 <docFactory enabled="true" type="plain">
> >                     <fileType name="plaintext">
> >                         <extension>.txt</extension>
> >                         <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> >                     </fileType>
> >                 </docFactory>
> >                 <docFactory enabled="true" type="news">
> >                     <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
> >                 </docFactory>
> >             </docFactories>
> >
> > To index binary files all you need to add is this:
> >
> >            <docFactory enabled="true" type="binary">
> >                <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> >            </docFactory>
> >
> > There should be no need for an extension mapping.
> >
> > For the interested people:
> > For content definitions (like news) I introduced the following:
> >             <contentDefinitions>
> >                 <contentDefinition type="news"> <!-- must match docFactory type -->
> >                     <class>com.opencms.modules.homepage.news.NewsContentDefinition</class>
> >                     <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initClass>
> >                     <listMethod name="getNewsList">
> >                         <param type="java.lang.Integer">1</param>
> >                         <param type="java.lang.String">-1</param>
> >                     </listMethod>
> >                     <page uri="/news.html?__element=entry">
> >                         <param method="getIntId" name="newsid"/>
> >                     </page>
> >                 </contentDefinition>
> >
> > In short:
> > initClass is optional: for the news, the news classes have to be loaded
> > to initialize the db pool.
> > listMethod: a method of the content definition class that returns a
> > List of elements.
> > page: the page that can display an entry, here a JSP that has a
> > template element "entry". It also needs the id of the news item.
> > getIntId is a method of the content definition class and newsid is the
> > URL parameter the page needs. A link like
> > news.html?__element=entry&newsid=xy
> > will be generated.
> >
> > Best regards,
> > Stephan
> >
> >
> > ----- Original Message -----
> > From: "Ben Rometsch" <ben at solidstategroup.com>
> > To: <opencms-dev at opencms.org>
> > Sent: Wednesday, October 22, 2003 6:15 AM
> > Subject: [opencms-dev] (no subject)
> >
> > > Hi Matt,
> > >
> > > I am not having any joy! I've updated my registry.xml file, with the
> > > appropriate section reading:
> > >
> > > <luceneSearch>
> > >     <mergeFactor>100000</mergeFactor>
> > >     <permCheck>true</permCheck>
> > >     <indexDir>c:\search</indexDir>
> > >     <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
> > >     <subsearch>true</subsearch>
> > >     <project>online</project>
> > >     <docFactories>
> > >         <pageDocFactory enabled="true">
> > >             <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
> > >         </pageDocFactory>
> > >         <plainDocFactory enabled="true">
> > >             <fileType name="plaintext">
> > >                 <extension>.txt</extension>
> > >                 <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> > >             </fileType>
> > >             <fileType name="taggedtext">
> > >                 <extension>.html</extension>
> > >                 <extension>.htm</extension>
> > >                 <extension>.xml</extension>
> > >                 <!-- This will strip tags before processing -->
> > >                 <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
> > >             </fileType>
> > >
> > >             <!-- Index binary documents -->
> > >             <fileType name="plaindocument">
> > >                 <extension>.doc</extension>
> > >                 <extension>.xls</extension>
> > >                 <extension>.pdf</extension>
> > >                 <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> > >             </fileType>
> > >         </plainDocFactory>
> > >         <jspDocFactory enabled="true">
> > >             <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
> > >         </jspDocFactory>
> > >         <xmlTemplateDocFactory enabled="false"/>
> > >     </docFactories>
> > >     <directories>
> > >         <directory location="/release/">
> > >             <section>Test</section>
> > >             <subsearch>true</subsearch>
> > >         </directory>
> > >         <directory location="/RGLIntranet/">
> > >             <section>Test2</section>
> > >             <subsearch>true</subsearch>
> > >         </directory>
> > >     </directories>
> > > </luceneSearch>
> > >
> > > Notice the section beginning after the remark "Index binary documents".
> > >
> > > But I cannot get any hits when searching for document names that are
> > > in the VFS. The other (HTML) searches are working OK. Is the "name"
> > > attribute of the fileType tag important? I wasn't sure what to add
> > > here... I'm not quite sure how to move forward. Maybe it would be an
> > > idea to add some debugging trace to the BodylessDocument class to see
> > > what is going on inside it? I want to make sure my XML is correct
> > > first, though!
> > >
> > > Thanks for the help,
> > > Ben
> > >
> > > On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote:
> > > > Hi Matt,
> > > >
> > > > Thanks for the reply. If I just want to get the document title to
> > > > be included in the Lucene index, looking at the code in the
> > > > net.grcomputing.opencms.search.BodylessDocument class it appears
> > > > to ignore what the CMSObject is, and attempt to index it
> > > > regardless. Is this correct?
> > >
> > >
> > > Correct. It will already index the title, but it will not attempt to
> > > index the body.
> > >
> > > > If this is the case, is it simply a matter of instructing Lucene
> > > > to index objects other than HTML files in the VFS (i.e. documents)?
> > > > Or would I have to create another class, something like
> > > > net.grcomputing.opencms.search.FileDocument, and add a new hook
> > > > into that class via the registry.xml fragment? Or does the
> > > > BodylessDocument provide this functionality, and it's just a matter
> > > > of adding a new XML fragment to the registry.xml area?
> > >
> > > Again, you are right -- simply adding the appropriate configuration
> > > to the registry.xml file will suffice. I believe that you will just
> > > need to extend the plainDocument tag set to include extensions and
> > > processors...
> > > I _think_ that binary files get handled by the plain handler.
> > >
> > > Matt
> > >
> > > _______________________________________________
> > > This mail is send to you from the opencms-dev mailing list To change
> > > your list options, or to unsubscribe from the list, please
>
> visit
>
> > > http://mail.opencms.org/mailman/listinfo/opencms-dev
> >
> > Stephan Hartmann
> > Unternehmensberatung Währisch & Feykes GmbH Gustav-Adolf-Str. 5
> > 47057 Duisburg
> >
> > Tel.: 0203-373070
> > Fax: 0203-376766
> > E-Mail: hartmann at wfnetz.de
> > Internet: www.wfnetz.de
> >
> > Emails sent over the Internet can be created or manipulated under
> > someone else's name. For this reason, messages we send by email
> > never contain legally binding declarations of intent.
> >

-- 
Stephan Hartmann

Währisch & Feykes GmbH
Gustav-Adolf-Str. 5
47057 Duisburg
Tel. 0203 / 373 070
Fax 0203 / 376 766
hartmann at wfnetz.de

------------------------------------------------------
Disclaimer:
Emails sent over the Internet can be created or manipulated under someone
else's name. For this reason, messages we send by email never contain legally
binding declarations of intent.


