[opencms-dev] Indexing Binary Documents with the Lucene Module
Ben Rometsch
ben at solidstategroup.com
Tue Oct 28 09:37:01 CET 2003
Yep, I restarted tomcat a couple of times. Is there any trace logging I can
enable? I'm not sure if that XML file is even correct and being parsed (I
think it's maybe invalid; I was not getting any index files in my indexing
directory other than a 1k "segments" file)...
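One way to rule out a broken fragment without touching the search module is to run it through a standard XML parser and count the docFactory elements it contains. What follows is only a minimal sketch using the JDK's built-in parser. The class and method names are invented for illustration and are not part of OpenCms or the Lucene module. It checks well-formedness only, not whether the module actually understands the tags.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of OpenCms or the Lucene search module.
class RegistryFragmentCheck {

    // Returns the number of <docFactory> elements in the fragment,
    // or -1 if the fragment is not well-formed XML.
    static int countDocFactories(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("docFactory").getLength();
        } catch (Exception e) {
            // Any parse failure is reported as "not well-formed".
            return -1;
        }
    }

    public static void main(String[] args) {
        // Shortened fragment in the style of the registry.xml section.
        String fragment =
                "<luceneSearch><docFactories>"
              + "<docFactory enabled=\"true\" type=\"binary\">"
              + "<class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>"
              + "</docFactory>"
              + "</docFactories></luceneSearch>";
        System.out.println(countDocFactories(fragment));
    }
}
```

A result of -1 would point at the XML itself; a positive count would suggest the problem lies elsewhere, for example in which module version is actually loaded.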
-----Original Message-----
From: opencms-dev-admin at opencms.org [mailto:opencms-dev-admin at opencms.org]
On Behalf Of Stephan Hartmann
Sent: 28 October 2003 18:59
To: opencms-dev at opencms.org
Subject: Re: [opencms-dev] Indexing Binary Documents with the Lucene Module
Hi Ben,
I assume that it is still using the old version 1.3 and not finding any
document factories. Did you restart Tomcat?
I found another problem: the BodylessDocument indexes the title, description
and keywords of a file in the corresponding fields of the Lucene index ("title",
"description" and "keywords"). The problem is that the SearchHelper used in
the search JSP only searches the "body" field of the index. That means that
your PDF files will never be found (and other files won't be found by
words in their title etc. either). So we first have to find out how to tell the
SearchHelper to search all indexed fields. Maybe Matt knows?
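Until that is clear, one possible workaround might be to expand the user's search terms into a fielded query string before it reaches the index. This is only a sketch under two assumptions that are not confirmed here: that the query string ends up in Lucene's standard query parser, and that the parser version in use accepts the field:(terms) grouping syntax. The class and method names are hypothetical; the code only builds a string and does not call Lucene or the SearchHelper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; it only builds a query string and does not call Lucene.
class MultiFieldQuery {

    // "body" is the field the search JSP uses today; the other three are
    // the fields written by the BodylessDocument.
    static final List<String> FIELDS =
            Arrays.asList("body", "title", "description", "keywords");

    // Repeats the user's terms once per field and joins the clauses with OR.
    // Special query characters in the terms are NOT escaped in this sketch.
    static String expand(String userTerms, List<String> fields) {
        String terms = userTerms.trim();
        return fields.stream()
                .map(field -> field + ":(" + terms + ")")
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        System.out.println(expand("annual report", FIELDS));
    }
}
```

Building the string by hand keeps the sketch independent of any particular Lucene version; if the module exposes the parser directly, a multi-field parser would probably be the cleaner route.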
Bye,
Stephan
On Tuesday, 28 October 2003 04:07, you wrote:
> Hi Stephan,
>
> I'm still having no luck. After updating the search module to 1.4 and
> updating my registry.xml fragment to read:
>
>
> <luceneSearch>
>   <mergeFactor>100000</mergeFactor>
>   <permCheck>true</permCheck>
>   <indexDir>c:\search</indexDir>
>   <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
>   <subsearch>true</subsearch>
>   <project>online</project>
>   <docFactories>
>     <docFactory enabled="true" type="page">
>       <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
>     </docFactory>
>     <docFactory enabled="true" type="plain">
>       <fileType name="plaintext">
>         <extension>.txt</extension>
>         <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
>       </fileType>
>       <fileType name="taggedtext">
>         <extension>.html</extension>
>         <extension>.htm</extension>
>         <extension>.xml</extension>
>         <!-- This will strip tags before processing -->
>         <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
>       </fileType>
>     </docFactory>
>     <docFactory enabled="true" type="jsp">
>       <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
>     </docFactory>
>     <docFactory enabled="false" type="XML Template"/>
>     <docFactory enabled="true" type="binary">
>       <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
>     </docFactory>
>   </docFactories>
>   <directories>
>     <directory location="/RGLIntranet/">
>       <section>Test2</section>
>       <subsearch>true</subsearch>
>     </directory>
>   </directories>
> </luceneSearch>
>
>
> I am now not having any documents indexed. From the logs:
>
>
> [28.10.2003 14:05:10] <opencms_cronscheduler> Starting job for com.opencms.core.CmsCronEntry{* * * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true}
> [28.10.2003 14:05:10] <opencms_info> =====IndexManager=============================================================
> [28.10.2003 14:05:10] <opencms_info> Analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
> [28.10.2003 14:05:10] <opencms_info> IndexManager: indexing /RGLIntranet/
> [28.10.2003 14:05:10] <opencms_info> IndexManager: indexing /RGLIntranet/protected/
> [28.10.2003 14:05:10] <opencms_info> IndexManager: 0 documents are being processed
> [28.10.2003 14:05:10] <opencms_info> Done
>
>
> Any ideas?
> Ben
>
> -----Original Message-----
> From: opencms-dev-admin at opencms.org
> [mailto:opencms-dev-admin at opencms.org]
> On Behalf Of Stephan Hartmann
> Sent: 23 October 2003 02:46
> To: opencms-dev at opencms.org
> Subject: Re: [opencms-dev] Indexing Binary Documents with the Lucene
> Module
>
> Hi Ben,
>
> You don't have to use the news module.
> You can just replace the module and then change the registry.xml, but
> without the docFactory for news and the ContentDefinition section.
> Just add this:
> <docFactory enabled="true" type="binary">
>   <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> </docFactory>
>
> I just explained what else can be done with the module.
>
> Bye,
> Stephan
>
> ----- Original Message -----
> From: "Ben Rometsch" <ben at solidstategroup.com>
> To: <opencms-dev at opencms.org>
> Sent: Wednesday, October 22, 2003 4:41 PM
> Subject: [opencms-dev] Indexing Binary Documents with the Lucene
> Module
>
> > Hi Stephan,
> >
> > Thanks for the reply. I've managed to get the lucene source
> > recompiled so that I could add some logging to the BodylessDocument
> > class to see what was going on, but all I discovered was what you told
> > me, that it wasn't indexing Binary documents!
> >
> > Is there no simple way of having the module index just the filename
> > of the document? The additions you have made to index the news
> > module are maybe too complex for me...All I want to do is index the
> > filename in the VFS...
> >
> > Thanks,
> > Ben
> >
> > -----Original Message-----
> > From: opencms-dev-admin at opencms.org
> > [mailto:opencms-dev-admin at opencms.org]
> > On Behalf Of Hartmann, Waehrisch & Feykes GmbH
> > Sent: 22 October 2003 16:51
> > To: opencms-dev at opencms.org
> > Subject: Re: [opencms-dev] (no subject)
> >
> > Hi Ben,
> >
> > I think this won't work, since the plainDocFactory will only be used
> > for files of type "plain" but not for files of type "binary".
> > Recently we have made some additions to the module - by order of
> > Lenord, Bauer & Co. GmbH - that could meet your needs. They introduce
> > a more flexible way of defining docFactories, so that you can add new
> > factories without having to recompile the whole module. So other
> > modules (like the news) can bring their own docFactory and all you
> > have to do is to edit the registry.xml.
> >
> > Here is an example:
> >
> > <docFactories>
> >   <docFactory enabled="true" type="plain">
> >     <fileType name="plaintext">
> >       <extension>.txt</extension>
> >       <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> >     </fileType>
> >   </docFactory>
> >   <docFactory enabled="true" type="news">
> >     <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
> >   </docFactory>
> > </docFactories>
> >
> > To index binary files all you need to add is this:
> >
> > <docFactory enabled="true" type="binary">
> >   <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> > </docFactory>
> >
> > There should be no need for an extension mapping.
> >
> > For the interested people:
> > For ContentDefinitions (like news) I introduced the following:
> > <contentDefinitions>
> >   <contentDefinition type="news"> <!-- must match docFactory type -->
> >     <class>com.opencms.modules.homepage.news.NewsContentDefinition</class>
> >     <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initClass>
> >     <listMethod name="getNewsList">
> >       <param type="java.lang.Integer">1</param>
> >       <param type="java.lang.String">-1</param>
> >     </listMethod>
> >     <page uri="/news.html?__element=entry">
> >       <param method="getIntId" name="newsid"/>
> >     </page>
> >   </contentDefinition>
> >
> > In short:
> > initClass is optional: for the news, the news classes have to be
> > loaded to initialize the db pool.
> > listMethod: a method of the content definition class that returns a
> > List of elements.
> > page: the page that can display an entry, here a JSP that has a
> > template element "entry". It also needs the id of the news item.
> > getIntId is a method of the content definition class and newsid is
> > the URL parameter the page needs. A link like
> > news.html?__element=entry&newsid=xy
> > will be generated.
> >
> > Best regards,
> > Stephan
> >
> >
> > ----- Original Message -----
> > From: "Ben Rometsch" <ben at solidstategroup.com>
> > To: <opencms-dev at opencms.org>
> > Sent: Wednesday, October 22, 2003 6:15 AM
> > Subject: [opencms-dev] (no subject)
> >
> > > Hi Matt,
> > >
> > > I am not having any joy! I've updated my registry.xml file, with
> > > the appropriate section reading:
> > >
> > > <luceneSearch>
> > >   <mergeFactor>100000</mergeFactor>
> > >   <permCheck>true</permCheck>
> > >   <indexDir>c:\search</indexDir>
> > >   <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
> > >   <subsearch>true</subsearch>
> > >   <project>online</project>
> > >   <docFactories>
> > >     <pageDocFactory enabled="true">
> > >       <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
> > >     </pageDocFactory>
> > >     <plainDocFactory enabled="true">
> > >       <fileType name="plaintext">
> > >         <extension>.txt</extension>
> > >         <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> > >       </fileType>
> > >       <fileType name="taggedtext">
> > >         <extension>.html</extension>
> > >         <extension>.htm</extension>
> > >         <extension>.xml</extension>
> > >         <!-- This will strip tags before processing -->
> > >         <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
> > >       </fileType>
> > >
> > >       <!-- Index binary documents -->
> > >       <fileType name="plaindocument">
> > >         <extension>.doc</extension>
> > >         <extension>.xls</extension>
> > >         <extension>.pdf</extension>
> > >         <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> > >       </fileType>
> > >
> > >     </plainDocFactory>
> > >     <jspDocFactory enabled="true">
> > >       <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
> > >     </jspDocFactory>
> > >     <xmlTemplateDocFactory enabled="false"/>
> > >   </docFactories>
> > >   <directories>
> > >     <directory location="/release/">
> > >       <section>Test</section>
> > >       <subsearch>true</subsearch>
> > >     </directory>
> > >     <directory location="/RGLIntranet/">
> > >       <section>Test2</section>
> > >       <subsearch>true</subsearch>
> > >     </directory>
> > >   </directories>
> > > </luceneSearch>
> > >
> > > Notice the section beginning after the remark "Index binary documents".
> > >
> > > But I cannot get any hits when searching for document names that are
> > > in the VFS. The other (HTML) searches are working ok. Is the "name"
> > > property of the fileType tag important? I wasn't sure what to add
> > > here...I'm not quite sure how to move forward. Maybe it would be an
> > > idea to add some debugging trace to the BodylessDocument class to see
> > > what is going on inside it? I want to make sure my XML is correct
> > > first tho!
> > >
> > > Thanks for the help,
> > > Ben
> > >
> > > On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote:
> > > > Hi Matt,
> > > >
> > > > Thanks for the reply. If I just want to get the document title
> > > > to be included in the Lucene index, looking at the code in the
> > > > net.grcomputing.opencms.search.BodylessDocument class it appears
> > > > to ignore what the CMSObject is, and attempt to index it
> > > > regardless. Is this correct?
> > >
> > > Correct. It will already index the title, but it will not attempt
> > > to index the body.
> > >
> > > > If this is the case, is it simply a matter of instructing Lucene
> > > > to index objects other than HTML files in the VFS (i.e.
> > > > Documents)? Or would I have to create another class, something
> > > > like net.grcomputing.opencms.search.FileDocument and add a new
> > > > hook into that class via the registry.xml fragment? Or does the
> > > > BodyLess document provide this functionality, and it's just a
> > > > matter of adding a new XML fragment to the registry.xml area?
> > >
> > > Again, you are right -- simply adding the appropriate
> > > configuration to the registry.xml file will suffice. I believe
> > > that you will just need to extend the plainDocument tag set to
> > > include extensions and processors...
> > >
> > > I _think_ that binary files get handled by the plain handler.
> > >
> > > Matt
> > >
> > > _______________________________________________
> > > This mail is sent to you from the opencms-dev mailing list. To
> > > change your list options, or to unsubscribe from the list, please
> > > visit http://mail.opencms.org/mailman/listinfo/opencms-dev
> >
> > Stephan Hartmann
> > Unternehmensberatung Währisch & Feykes GmbH Gustav-Adolf-Str. 5
> > 47057 Duisburg
> >
> > Tel.: 0203-373070
> > Fax: 0203-376766
> > E-Mail: hartmann at wfnetz.de
> > Internet: www.wfnetz.de
> >
> > E-mails sent over the Internet can be created under a false name or
> > manipulated. For this reason, the messages we send by e-mail never
> > contain legally binding declarations of intent.
> >
> > _______________________________________________
> > This mail is send to you from the opencms-dev mailing list To change
> > your list options, or to unsubscribe from the list, please visit
> > http://mail.opencms.org/mailman/listinfo/opencms-dev
>
> _______________________________________________
> This mail is send to you from the opencms-dev mailing list To change
> your list options, or to unsubscribe from the list, please visit
> http://mail.opencms.org/mailman/listinfo/opencms-dev
>
> _______________________________________________
> This mail is send to you from the opencms-dev mailing list To change
> your list options, or to unsubscribe from the list, please visit
> http://mail.opencms.org/mailman/listinfo/opencms-dev
--
Stephan Hartmann
Währisch & Feykes GmbH
Gustav-Adolf-Str. 5
47057 Duisburg
Tel. 0203 / 373 070
Fax 0203 / 376 766
hartmann at wfnetz.de
------------------------------------------------------
Disclaimer:
E-mails sent over the Internet can be created under a false name or
manipulated. For this reason, the messages we send by e-mail never
contain legally binding declarations of intent.
_______________________________________________
This mail is sent to you from the opencms-dev mailing list. To change your
list options, or to unsubscribe from the list, please visit
http://mail.opencms.org/mailman/listinfo/opencms-dev