[opencms-dev] Indexing Binary Documents with the Lucene Module

Tue Oct 28 04:39:02 CET 2003

Hi Stephan,

I'm still having no luck. After updating the search module to 1.4, I changed
my registry.xml fragment to read:

        <luceneSearch>
            <mergeFactor>100000</mergeFactor>
            <permCheck>true</permCheck>
            <indexDir>c:\search</indexDir>
            <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
            <subsearch>true</subsearch>
            <project>online</project>
            <docFactories>
                <docFactory enabled="true" type="page">
                    <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
                </docFactory>
                <docFactory enabled="true" type="plain">
                    <fileType name="plaintext">
                        <extension>.txt</extension>
                        <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
                    </fileType>
                    <fileType name="taggedtext">
                        <extension>.html</extension>
                        <extension>.htm</extension>
                        <extension>.xml</extension>
                        <!-- This will strip tags before processing -->
                        <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
                    </fileType>
                </docFactory>
                <docFactory enabled="true" type="jsp">
                    <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
                </docFactory>
                <docFactory enabled="false" type="XML Template"/>
                <docFactory enabled="true" type="binary">
                    <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
                </docFactory>
            </docFactories>
            <directories>
                <directory location="/RGLIntranet/">
                    <section>Test2</section>
                    <subsearch>true</subsearch>
                </directory>
            </directories>
        </luceneSearch>
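
One way to sanity-check a fragment like this independently of OpenCms is to
parse it with the JDK's DOM parser and list what each docFactory element
declares. The following is only a sketch: the class RegistryFragmentCheck and
its method are hypothetical helpers and not part of OpenCms or the Lucene
module, and the check says nothing about how the module itself reads the
registry.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of OpenCms or the Lucene module.
class RegistryFragmentCheck {

    // Returns one summary line per <docFactory> element in the given XML.
    // "class" is the first <class> element found anywhere below the factory
    // (so for a factory with <fileType> children it is the first mapping).
    static List<String> describeDocFactories(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("fragment is not well-formed XML", e);
        }
        List<String> lines = new ArrayList<>();
        NodeList factories = doc.getElementsByTagName("docFactory");
        for (int i = 0; i < factories.getLength(); i++) {
            Element factory = (Element) factories.item(i);
            NodeList classes = factory.getElementsByTagName("class");
            String cls = classes.getLength() > 0
                    ? classes.item(0).getTextContent().trim()
                    : "(none)";
            lines.add("type=" + factory.getAttribute("type")
                    + " enabled=" + factory.getAttribute("enabled")
                    + " class=" + cls);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Shortened copy of the fragment above, used only for illustration.
        String xml = "<luceneSearch><docFactories>"
                + "<docFactory enabled=\"true\" type=\"binary\">"
                + "<class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>"
                + "</docFactory>"
                + "<docFactory enabled=\"false\" type=\"XML Template\"/>"
                + "</docFactories></luceneSearch>";
        for (String line : describeDocFactories(xml)) {
            System.out.println(line);
        }
    }
}
```

A fragment that is not well-formed fails in the parse step, which at least
separates XML mistakes from problems inside the module.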

Now no documents are being indexed at all. From the logs:

[28.10.2003 14:05:10] <opencms_cronscheduler> Starting job for com.opencms.core.CmsCronEntry{* * * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true}
[28.10.2003 14:05:10] <opencms_info> =====IndexManager=============================================================
[28.10.2003 14:05:10] <opencms_info> Analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
[28.10.2003 14:05:10] <opencms_info> IndexManager: indexing /RGLIntranet/
[28.10.2003 14:05:10] <opencms_info> IndexManager: indexing /RGLIntranet/protected/
[28.10.2003 14:05:10] <opencms_info> IndexManager: 0 documents are being processed
[28.10.2003 14:05:10] <opencms_info> Done

Any ideas?
Ben

-----Original Message-----
From: opencms-dev-admin at opencms.org [mailto:opencms-dev-admin at opencms.org]
On Behalf Of Stephan Hartmann
Sent: 23 October 2003 02:46
To: opencms-dev at opencms.org
Subject: Re: [opencms-dev] Indexing Binary Documents with the Lucene Module

Hi Ben,

You don't have to use the news module.
You can simply replace the module and then update the registry.xml, leaving
out the docFactory for news and the contentDefinitions section.
Just add this:
 <docFactory enabled="true" type="binary">
  <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
 </docFactory>
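
The separate entry matters because, as described in the quoted message below,
the plain factory is only used for resources of type "plain", so extension
mappings placed under it never apply to "binary" resources. A rough sketch of
that kind of lookup follows. The names (FactoryLookup, classFor and so on) are
hypothetical and the logic is an assumption based on this thread, not code
taken from the module source.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustration only: hypothetical names, not taken from the module source.
class FactoryLookup {
    // resource type -> document class, for factories without extension mappings
    private final Map<String, String> byType = new HashMap<>();
    // extension -> document class, consulted only for resources of type "plain"
    private final Map<String, String> plainByExtension = new HashMap<>();

    void registerType(String type, String className) {
        byType.put(type, className);
    }

    void registerPlainExtension(String extension, String className) {
        plainByExtension.put(extension.toLowerCase(Locale.ROOT), className);
    }

    // Returns the document class for a resource, or null if nothing matches.
    String classFor(String resourceType, String fileName) {
        if ("plain".equals(resourceType)) {
            int dot = fileName.lastIndexOf('.');
            String ext = dot >= 0
                    ? fileName.substring(dot).toLowerCase(Locale.ROOT)
                    : "";
            return plainByExtension.get(ext);
        }
        // Every other type is resolved by type alone; extensions are ignored.
        return byType.get(resourceType);
    }

    public static void main(String[] args) {
        FactoryLookup lookup = new FactoryLookup();
        lookup.registerPlainExtension(".txt", "PlainDocument");
        // Mapping .pdf under "plain" does not help a resource of type "binary".
        lookup.registerPlainExtension(".pdf", "BodylessDocument");
        System.out.println(lookup.classFor("binary", "report.pdf"));

        // Registering the "binary" type itself is what makes the lookup succeed.
        lookup.registerType("binary", "BodylessDocument");
        System.out.println(lookup.classFor("binary", "report.pdf"));
    }
}
```

Under these assumptions a .pdf mapping inside the plain factory is never
consulted for an uploaded binary file, which would explain why the earlier
configuration produced no hits.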

I was only explaining what else can be done with the module.
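
To illustrate the page element from the quoted message below: the link for an
entry is the page uri plus one URL parameter whose value comes from calling
the named method on the entry. This sketch uses plain Java reflection.
EntryLinkSketch, NewsItem and buildLink are hypothetical names chosen for the
example, and the module may well build its links differently.

```java
import java.lang.reflect.Method;

// Illustration only: hypothetical names, not taken from the module source.
class EntryLinkSketch {

    // Stand-in for one entry returned by a content definition's list method.
    public static class NewsItem {
        private final int id;

        public NewsItem(int id) {
            this.id = id;
        }

        public int getIntId() {
            return id;
        }
    }

    // Appends paramName=<result of calling methodName on entry> to pageUri.
    static String buildLink(String pageUri, String paramName,
                            String methodName, Object entry) {
        try {
            Method getter = entry.getClass().getMethod(methodName);
            Object value = getter.invoke(entry);
            // The uri may already carry a query string such as ?__element=entry.
            String separator = pageUri.indexOf('?') >= 0 ? "&" : "?";
            return pageUri + separator + paramName + "=" + value;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "cannot call " + methodName + " on " + entry.getClass().getName(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildLink("/news.html?__element=entry",
                "newsid", "getIntId", new NewsItem(42)));
    }
}
```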

Bye,
Stephan

----- Original Message -----
From: "Ben Rometsch" <ben at solidstategroup.com>
To: <opencms-dev at opencms.org>
Sent: Wednesday, October 22, 2003 4:41 PM
Subject: [opencms-dev] Indexing Binary Documents with the Lucene Module

> Hi Stephan,
>
> Thanks for the reply. I've managed to get the lucene source recompiled
> so that I could add some logging to the BodylessDocument class to see
> what was going on, but all I discovered was what you told me: that it
> wasn't indexing binary documents!
>
> Is there no simple way of having the module index just the filename of
> the document? The additions you have made to index the news module are
> maybe too complex for me... All I want to do is index the filename in
> the VFS...
>
> Thanks,
> Ben
>
> -----Original Message-----
> From: opencms-dev-admin at opencms.org 
> [mailto:opencms-dev-admin at opencms.org]
> On Behalf Of Hartmann, Waehrisch & Feykes GmbH
> Sent: 22 October 2003 16:51
> To: opencms-dev at opencms.org
> Subject: Re: [opencms-dev] (no subject)
>
> Hi Ben,
>
> I think this won't work, since the plainDocFactory is only used for
> files of type "plain", not for files of type "binary".
> Recently we made some additions to the module (commissioned by
> Lenord, Bauer & Co. GmbH) that could meet your needs. They introduce a
> more flexible way of defining docFactories, so that you can add new
> factories without having to recompile the whole module. Other modules
> (like the news) can then bring their own docFactory, and all you have
> to do is edit the registry.xml.
> Here is an example:
>
>             <docFactories>
>                 <docFactory enabled="true" type="plain">
>                     <fileType name="plaintext">
>                         <extension>.txt</extension>
>                         <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
>                     </fileType>
>                 </docFactory>
>                 <docFactory enabled="true" type="news">
>                     <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
>                 </docFactory>
>             </docFactories>
>
> To index binary files, all you need to add is this:
>
>            <docFactory enabled="true" type="binary">
>                <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
>            </docFactory>
>
> There should be no need for an extension mapping.
>
> For the interested people:
> For ContentDefinitions (like news) I introduced the following:
>             <contentDefinitions>
>                 <contentDefinition type="news"> <!-- must match docFactory type -->
>                     <class>com.opencms.modules.homepage.news.NewsContentDefinition</class>
>                     <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initClass>
>                     <listMethod name="getNewsList">
>                         <param type="java.lang.Integer">1</param>
>                         <param type="java.lang.String">-1</param>
>                     </listMethod>
>                     <page uri="/news.html?__element=entry">
>                         <param method="getIntId" name="newsid"/>
>                     </page>
>                 </contentDefinition>
>             </contentDefinitions>
>
> In short:
> initClass is optional. For the news module, the news classes have to be
> loaded to initialize the db pool.
> listMethod: a method of the content definition class that returns a
> List of elements.
> page: the page that can display an entry, here a JSP that has a
> template element "entry". It also needs the id of the news item:
> getIntId is a method of the content definition class and newsid is the
> URL parameter the page needs. A link like
> news.html?__element=entry&newsid=xy will be generated.
>
> Best regards,
> Stephan
>
>
> ----- Original Message -----
> From: "Ben Rometsch" <ben at solidstategroup.com>
> To: <opencms-dev at opencms.org>
> Sent: Wednesday, October 22, 2003 6:15 AM
> Subject: [opencms-dev] (no subject)
>
>
> > Hi Matt,
> >
> > I am not having any joy! I've updated my registry.xml file, with the
> > appropriate section reading:
> >
> > <luceneSearch>
> >     <mergeFactor>100000</mergeFactor>
> >     <permCheck>true</permCheck>
> >     <indexDir>c:\search</indexDir>
> >     <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
> >     <subsearch>true</subsearch>
> >     <project>online</project>
> >     <docFactories>
> >         <pageDocFactory enabled="true">
> >             <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
> >         </pageDocFactory>
> >         <plainDocFactory enabled="true">
> >             <fileType name="plaintext">
> >                 <extension>.txt</extension>
> >                 <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> >             </fileType>
> >             <fileType name="taggedtext">
> >                 <extension>.html</extension>
> >                 <extension>.htm</extension>
> >                 <extension>.xml</extension>
> >                 <!-- This will strip tags before processing -->
> >                 <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
> >             </fileType>
> >
> >             <!-- Index binary documents -->
> >             <fileType name="plaindocument">
> >                 <extension>.doc</extension>
> >                 <extension>.xls</extension>
> >                 <extension>.pdf</extension>
> >                 <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> >             </fileType>
> >
> >         </plainDocFactory>
> >         <jspDocFactory enabled="true">
> >             <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
> >         </jspDocFactory>
> >         <xmlTemplateDocFactory enabled="false"/>
> >     </docFactories>
> >     <directories>
> >         <directory location="/release/">
> >             <section>Test</section>
> >             <subsearch>true</subsearch>
> >         </directory>
> >         <directory location="/RGLIntranet/">
> >             <section>Test2</section>
> >             <subsearch>true</subsearch>
> >         </directory>
> >     </directories>
> > </luceneSearch>
> >
> > Notice the section beginning after the remark "Index binary documents".
> >
> > But I cannot get any hits when searching for document names that are
> > in the VFS. The other (HTML) searches are working OK. Is the "name"
> > attribute of the fileType tag important? I wasn't sure what to add
> > here... I'm not quite sure how to move forward. Maybe it would be an
> > idea to add some debugging trace to the BodylessDocument class to see
> > what is going on inside it? I want to make sure my XML is correct
> > first, though!
> >
> > Thanks for the help,
> > Ben
> >
> >
> > On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote:
> > > Hi Matt,
> > >
> > > Thanks for the reply. If I just want to get the document title to
> > > be included in the Lucene index, looking at the code in the
> > > net.grcomputing.opencms.search.BodylessDocument class it appears to
> > > ignore what the CMSObject is, and attempt to index it regardless.
> > > Is this correct?
> > >
> >
> > Correct. It will already index the title, but it will not attempt to
> > index the body.
> >
> > > If this is the case, is it simply a matter of instructing Lucene to
> > > index objects other than HTML files in the VFS (i.e. documents)? Or
> > > would I have to create another class, something like
> > > net.grcomputing.opencms.search.FileDocument, and add a new hook into
> > > that class via the registry.xml fragment? Or does the
> > > BodylessDocument class provide this functionality, and it's just a
> > > matter of adding a new XML fragment to the registry.xml?
> >
> > Again, you are right -- simply adding the appropriate configuration
> > to the registry.xml file will suffice. I believe that you will just
> > need to extend the plainDocument tag set to include extensions and
> > processors... I _think_ that binary files get handled by the plain
> > handler.
> >
> > Matt
> >
> > _______________________________________________
> > This mail is sent to you from the opencms-dev mailing list.
> > To change your list options, or to unsubscribe from the list, please visit
> > http://mail.opencms.org/mailman/listinfo/opencms-dev
>
> Stephan Hartmann
> Unternehmensberatung Währisch & Feykes GmbH
> Gustav-Adolf-Str. 5
> 47057 Duisburg
>
> Tel.: 0203-373070
> Fax: 0203-376766
> E-Mail: hartmann at wfnetz.de
> Internet: www.wfnetz.de
>
> E-mails sent over the Internet can be created under someone else's
> name or manipulated. For this reason, the messages we send by e-mail
> contain, as a matter of principle, no legally binding declarations of
> intent.