[opencms-dev] Indexing News for Lucene Search - Please help..

Trevor Lee Trevor.Lee at 4Loop.com.au
Wed Nov 5 07:41:00 CET 2003


Hi

I have news2.1 and Lucene Search 1.4 installed on OpenCms 5.0.

I'm trying to index news items and need this functionality working very
soon, so if anyone can help, I'd be grateful.

This is the Lucene section of my registry.xml:
        <luceneSearch>
            <mergeFactor>100000</mergeFactor>
            <permCheck>true</permCheck>
            <indexDir>C:\Jakarta-Tomcat-4.1.12\webapps\opencms\lucene\index\</indexDir>
            <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
            <subsearch>true</subsearch>
            <project>online</project>
            <docFactories>
                <docFactory enabled="true" type="page">
                    <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
                </docFactory>
                <docFactory enabled="true" type="plain">
                    <fileType name="plaintext">
                        <extension>.txt</extension>
                        <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
                    </fileType>
                    <fileType name="taggedtext">
                        <extension>.html</extension>
                        <extension>.htm</extension>
                        <extension>.xml</extension>
                        <!-- This will strip tags before processing -->
                        <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
                    </fileType>
                </docFactory>
                <docFactory enabled="true" type="binary">
                    <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
                </docFactory>
                <docFactory enabled="true" type="jsp">
                    <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
                </docFactory>
                <docFactory enabled="true" type="news">
                    <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
                </docFactory>
                <docFactory enabled="false" type="XML Template"/>
            </docFactories>
            <directories>
                <directory location="/swm/">
                    <section>Test</section>
                    <subsearch>true</subsearch>
                </directory>
            </directories>
            <contentDefinitions>
                <contentDefinition type="news">
                    <class>com.opencms.modules.homepage.news.NewsContentDefinition</class>
                    <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initClass>
                    <listMethod name="getNewsList">
                        <param type="java.lang.Integer">1</param>
                        <param type="java.lang.String">-1</param>
                    </listMethod>
                    <page uri="/news/news.jsp?__element=entry">
                        <param method="getIntId" name="newsid"/>
                    </page>
                </contentDefinition>
            </contentDefinitions>
        </luceneSearch>

The news.jsp file is the one provided in the news2.1 zip file, with one
modification:
<jsp:useBean id="newsbean" class="com.opencms.modules.homepage.news.NewsContentDefinition" scope="page" />
<%@page session="false" import="java.util.*, java.text.*, com.opencms.modules.homepage.news.*" %>
<%@ taglib prefix="cms" uri="http://www.opencms.org/taglib/cms" %>
<cms:template element="entry"> <!-- added this line -->
<%
	String sID = request.getParameter("id");
:
:
:
%>
</cms:template>
I've added the element "entry" as per the instructions in the message below.

When the lucene cron job runs I get the following error messages:
[05.11.2003 05:55:10] <opencms_cronscheduler> Starting job for com.opencms.core.CmsCronEntry{55 5 * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true}
[05.11.2003 05:55:10] <opencms_info> =====IndexManager=============================================================
[05.11.2003 05:55:10] <opencms_info> Analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
[05.11.2003 05:55:10] <opencms_info> Extension map exists to handle plaintext
[05.11.2003 05:55:10] <opencms_info> Extension map exists to handle taggedtext
[05.11.2003 05:55:10] <opencms_info> JSP DocumentFactory loaded
[05.11.2003 05:55:10] <opencms_info> Bodyless DocumentFactory loaded
[05.11.2003 05:55:10] <opencms_info> Page DocumentFactory loaded
[05.11.2003 05:55:10] <opencms_info> IndexManager: indexing /swm/
:
:
[05.11.2003 05:55:12] <opencms_cronscheduler> Error running job for com.opencms.core.CmsCronEntry{55 5 * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true}
Error: java.lang.IllegalArgumentException: value cannot be null
	at org.apache.lucene.document.Field.<init>(Unknown Source)
	at org.apache.lucene.document.Field.UnStored(Unknown Source)
	at net.grcomputing.opencms.search.lucene.NewsDocument.Document(NewsDocument.java:140)
	at net.grcomputing.opencms.search.lucene.IndexManager.processContentDefinitions(IndexManager.java:437)
	at net.grcomputing.opencms.search.lucene.IndexManager.doIndex(IndexManager.java:240)
	at net.grcomputing.opencms.search.lucene.CronIndexManager.launch(CronIndexManager.java:107)
	at com.opencms.core.CmsCronScheduleJob.run(CmsCronScheduleJob.java:68)
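
The top frames of the trace show the Lucene Field constructor rejecting a
null value handed to Field.UnStored from NewsDocument.java:140. As a rough
sketch only, assuming the null comes from a news item value that was never
set, something like the following could be used to find out which values are
null and to substitute an empty string before a field is built. The field
names (headline, text, author) and the class NullFieldCheck are hypothetical,
NewsDocument itself is not shown here, and the sketch deliberately does not
use the Lucene classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Lucene module or the news module.
class NullFieldCheck {

    // Returns the names of all entries whose value is null, in insertion order.
    static List<String> findNullFields(Map<String, String> fields) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            if (entry.getValue() == null) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }

    // Replaces null with an empty string so that the value can be passed to
    // code that rejects null.
    static String nullToEmpty(String value) {
        return value == null ? "" : value;
    }

    public static void main(String[] args) {
        // Assumed example values; a real news item would supply these.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("headline", "Site relaunch");
        fields.put("text", null);
        fields.put("author", "admin");

        System.out.println("Null values: " + findNullFields(fields));
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            System.out.println(entry.getKey() + " -> '" + nullToEmpty(entry.getValue()) + "'");
        }
    }
}
```

Whether an empty string is an acceptable index value for a missing news
attribute is a separate decision; logging the null names first at least shows
which attribute is involved.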


Is the error due to the <page> element in <contentDefinition type="news">?

Thank you in advance.

Cheers

Trevor
-----Original Message-----
From: opencms-dev-admin at opencms.org
[mailto:opencms-dev-admin at opencms.org]On Behalf Of Hartmann, Waehrisch &
Feykes GmbH
Sent: Wednesday, October 22, 2003 4:51 PM
To: opencms-dev at opencms.org
Subject: Re: [opencms-dev] (no subject)


Hi Ben,

I think this won't work, since the plainDocFactory is only used for files
of type "plain", not for files of type "binary".
Recently we made some additions to the module, commissioned by Lenord,
Bauer & Co. GmbH, that could meet your needs. They introduce a more flexible
way of defining docFactories, so that you can add new factories without
having to recompile the whole module. Other modules (like the news) can then
bring their own docFactory, and all you have to do is edit the registry.xml.
Here is an example:

            <docFactories>
                <docFactory enabled="true" type="plain">
                    <fileType name="plaintext">
                        <extension>.txt</extension>
                        <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
                    </fileType>
                </docFactory>
                <docFactory enabled="true" type="news">
                    <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
                </docFactory>
            </docFactories>

To index binary files all you need to add is this:

           <docFactory enabled="true" type="binary">
               <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
           </docFactory>

There should be no need for an extension mapping.

For those interested:
For ContentDefinitions (like news) I introduced the following:
            <contentDefinitions>
                <contentDefinition type="news"> <!-- must match docFactory type -->
                    <class>com.opencms.modules.homepage.news.NewsContentDefinition</class>
                    <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initClass>
                    <listMethod name="getNewsList">
                        <param type="java.lang.Integer">1</param>
                        <param type="java.lang.String">-1</param>
                    </listMethod>
                    <page uri="/news.html?__element=entry">
                        <param method="getIntId" name="newsid"/>
                    </page>
                </contentDefinition>
            </contentDefinitions>

In short:
initClass: optional. For the news, the news classes have to be loaded to
initialize the db pool.
listMethod: a method of the content definition class that returns a List of
elements.
page: the page that can display an entry, here a JSP that has a template
element "entry". It also needs the id of the news item.
getIntId is a method of the content definition class, and newsid is the URL
parameter the page needs. A link like
news.html?__element=entry&newsid=xy
will be generated.
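
To illustrate the description above, here is a minimal Java sketch of how a
list method and a page link could be resolved by name from such a
configuration. It is not the module's actual code: LinkSketch, DemoNewsItem
and buildLink are hypothetical names, the stand-in class only mimics the two
methods named in the example (getNewsList and getIntId), and treating the
list method as static is an assumption.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not taken from the Lucene module.
class LinkSketch {

    // Stand-in for a content definition class such as NewsContentDefinition.
    public static class DemoNewsItem {
        private final int id;

        public DemoNewsItem(int id) {
            this.id = id;
        }

        // Plays the role of the method named in <param method="getIntId" .../>.
        public int getIntId() {
            return id;
        }

        // Plays the role of <listMethod name="getNewsList"> with its two params.
        public static List<DemoNewsItem> getNewsList(Integer first, String second) {
            List<DemoNewsItem> items = new ArrayList<>();
            items.add(new DemoNewsItem(7));
            items.add(new DemoNewsItem(8));
            return items;
        }
    }

    // Calls the named no-argument method on the item and appends the result to
    // the page URI as paramName=value, using '&' if the URI already has a query.
    static String buildLink(String pageUri, String paramName, Object item, String methodName)
            throws Exception {
        Method getter = item.getClass().getMethod(methodName);
        Object value = getter.invoke(item);
        char separator = pageUri.indexOf('?') >= 0 ? '&' : '?';
        return pageUri + separator + paramName + "=" + value;
    }

    public static void main(String[] args) throws Exception {
        // Look up the list method by name and by the parameter types from the config.
        Method listMethod =
                DemoNewsItem.class.getMethod("getNewsList", Integer.class, String.class);
        List<?> items = (List<?>) listMethod.invoke(null, Integer.valueOf(1), "-1");

        // Build one link per element returned by the list method.
        for (Object item : items) {
            System.out.println(buildLink("/news.html?__element=entry", "newsid", item, "getIntId"));
        }
    }
}
```

Resolving both methods by name is what would allow another module to plug in
its own content definition through registry.xml alone, without recompiling.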

Best regards,
Stephan





