[opencms-dev] Problem with lucene 1.5 module in linux chroot environment

"Stephan Löffler" stephanloeffler at gmx.de
Wed Jul 21 17:57:02 CEST 2004


Hi all,

I have a problem with the lucene search module version 1.5.
It was installed and tested successfully in a regular opencms environment.
OpenCMS version is 5.0 and tomcat is 4.1.27.

However, on the live system some things are different. First, all requests
go to an Apache 2.0 web server, which forwards them to tomcat/opencms;
responses go back the other way. Second, tomcat/opencms run in a chroot
environment (i.e. in the /opt directory), and so does the Apache web server.

The search module was also modified slightly to support searching in two
languages. The registry.xml was renamed to search_de.xml (and search_en.xml).
It is located at /opt/lucene/opencms/search_de.xml in the chroot environment.
The path lucene writes its index files to is /opt/lucene/opencms/de/

Now the problem is that the CronManager kicks off the task successfully and
the xml file is found. According to opencms.log, the indexing task walks
through the specified root folder in the vfs and all its subfolders, but no
files are processed, and only one file, called segments (16 kB), is created
in the index folder (/de/ or /en/).
The cron job ends with "0 documents are being processed" before the Done
message.

Whenever a search term is entered, the "no results for this term" message is
always displayed.
I haven't found anything strange in the catalina logs, though.

I tried moving the search_??.xml files and index directories out of the
chroot environment into the actual root folder, but then cron does not kick
off the task because it can't find the search_??.xml files.
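
One way to narrow this down might be to check how a JVM running inside the
chroot sees the two paths involved. The following is only a minimal sketch,
assuming it is run as the same user and in the same environment as tomcat.
The class name ChrootPathCheck is made up for illustration; the default paths
are the ones that appear in the log and in search_de.xml further down.

```java
import java.io.File;
import java.io.IOException;

public class ChrootPathCheck {

    /** Describes how the running JVM sees a path on the real file system. */
    static String describe(File f) throws IOException {
        return f.getCanonicalPath()
            + " exists=" + f.exists()
            + " dir=" + f.isDirectory()
            + " readable=" + f.canRead()
            + " writable=" + f.canWrite();
    }

    public static void main(String[] args) throws IOException {
        // Defaults are the registry file and index directory named in the
        // cron entry and in <indexDir/>; pass other paths as arguments.
        String[] paths = args.length > 0 ? args : new String[] {
            "/opt/lucene/index/opencms/search_de.xml",
            "/opt/lucene/index/opencms/de/"
        };
        for (String p : paths) {
            System.out.println(describe(new File(p)));
        }
    }
}
```

If the index directory shows up as not writable, or the canonical path is
not the expected one, that would point to the chroot setup rather than to
the module itself.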

Oh yes, one last thing: before this we ran the regular opencms lucene module
version 1.2, which worked fine in this environment.

Does anyone have any ideas? Comments/hints are greatly appreciated!

Cheers and thanks a lot 
Stephan!

Find the opencms.log output and the search_de.xml file output here:

[21.07.2004 15:02:10] <opencms_cronscheduler> Starting job for com.opencms.core.CmsCronEntry{2 * * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true,registry=/opt/lucene/index/opencms/search_de.xml}
[21.07.2004 15:02:10] <opencms_info>

=====IndexManager=============================================================
[21.07.2004 15:02:10] <opencms_info> Analyzer: org.apache.lucene.analysis.de.GermanAnalyzer
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Aktuelles/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Aktuelles/Download/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Aktuelles/Newsletter/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Aktuelles/Newsletter/archiv/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Aktuelles/Pressemeldung/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Aktuelles/Veranstaltungen/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Archiv/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Archiv/Downloads/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Archiv/Jobs/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Archiv/Referenzkunden/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Archiv/Teaser_Items/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Archiv/Veranstaltungen/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/common_content/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/company/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/company/Fakten/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/company/Group/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/company/Historie/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Exklusives/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Exklusives/Effizienz_verbessern/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Exklusives/Erloese_steigern/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Exklusives/IT-Kosten_senken/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Karriere/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Karriere/Jobs/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Karriere/Moeglichkeiten/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Kompetenzen/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Kompetenzen/Branchen/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Kompetenzen/Loesungen/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Kompetenzen/Methoden/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Kompetenzen/Technologien/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Kunden/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Kunden/Login/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Kunden/Referenzkunden/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Leistungen/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Leistungen/Consulting/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Leistungen/IT-Services/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Leistungen/Security-Services/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Leistungen/Software_Engineering/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/news_items/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Partner/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Partner/Engagement/
[21.07.2004 15:02:10] <opencms_info> IndexManager: indexing /site/de/Partner/Partnerschaften/
[21.07.2004 15:02:10] <opencms_info> IndexManager: 0 documents are being processed
[21.07.2004 15:02:10] <opencms_info> IndexManager:  Index has been optimized.
[21.07.2004 15:02:10] <opencms_info>
Done
=====IndexManager=============================================================
[21.07.2004 15:02:10] <opencms_cronscheduler> Successful launch of job com.opencms.core.CmsCronEntry{2 * * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true,registry=/opt/lucene/index/opencms/search_de.xml} Message: CronIndexManager rebuilt the Lucene index on Wed Jul 21 15:02:10 GMT 2004


--------------------------------------------------------------------------

<?xml version="1.0" ?>
<registry>
<system>
<!--
  - <luceneSearch/> and all of its contents should go within the <system/>
  - element in $CATALINA_HOME/webapp/opencms/WEB-INF/config/registry.xml.
  -
  - For info on specifying an alternate config file, see the README.txt
  - included in this module.
  -->
<luceneSearch>
	<!--
	  - mergeFactor and permCheck are currently ignored.
	  -->
	<mergeFactor>100000</mergeFactor>
	<permCheck>true</permCheck>
	
	<!--
	  - directory in which lucene will store its indexes. Note: this is real
	  - fs, not VFS.
	  -->	
	<indexDir>/opt/lucene/index/opencms/de/</indexDir>
	
	<!--
	  - The analyzer is used for parsing documents. Choose one for your 
	  - language. If language is English, use the StandardAnalyzer.
	  - There are additional analyzers at http://jakarta.apache.org/lucene
	  -->	
	<analyzer>org.apache.lucene.analysis.de.GermanAnalyzer</analyzer>

	<!--
	  - If subsearch is true, subfolders will be searched by default.
	  - This can be turned on/off per directory.
	  -->	
	<subsearch>true</subsearch>
	
	<!--
	  - Name of the project to index. Online is recommended.
	  -->
	<project>online</project>
	
	<!--
	  - docFactories determine how documents are processed. Generally, one
	  - docFactory exists for each type of content (viz. JSP, Page, Plain) 
	  - that you want to index.
	  -->	
	<docFactories>
	
		<!--
		 - This docFactory indexes documents with type page (e.g. HTML 
		 - files edited with the WYSIWYG editor).
		 -
		 - Note that the 'type' attribute specifies which content definition
		 - to use. Built in content types include page, plain, binary, and jsp
		 - (there are others, too). Custom content types can be used as well
		 - (see the contentDefinitions section below).
		-->
		<docFactory enabled="true" type="page">
			<class>net.grcomputing.opencms.search.lucene.PageDocument</class>
		</docFactory>
		
		<!--
		 - This docFactory is a little more complex. It takes documents of
		 - type "plain" and determines, by extension, what class should be
		 - used to index each particular file. In this example, we want to
		 - index plain text files exactly as they are, but any files that 
		 - contain tags need the tags stripped out before they are indexed.
		 -
		 - Note that the name="" attribute is simply for pretty output, and 
		 - can contain any allowable PCDATA text.
		 -->		
		<docFactory enabled="true" type="plain">
			<fileType name="plaintext">
				<extension>.txt</extension>
				<class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
			</fileType>
			<fileType name="taggedtext">
				<extension>.html</extension>
				<extension>.htm</extension>
				<extension>.xml</extension>
				<!-- This will strip tags before processing -->
				<class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
			</fileType>
		</docFactory>

		<!-- This is for binary files. PDF and DOC files are binary, as are
		  - CLASS and JAR files.
		  -->
		<docFactory enabled="true" type="binary">
		      <!-- This is for indexing PDF files -->
			  <fileType name="PDF">
			    <extension>.pdf</extension>
				<class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
			  </fileType>
			  <!-- This is for indexing MS Word documents -->
			  <fileType name="Word">
			    <extension>.doc</extension>
			    <extension>.dot</extension>
				<class>net.grcomputing.opencms.search.lucene.WordDocument</class>
			  </fileType>
		</docFactory>

		<!--
		 - This will strip JSP tags and all scriptlets. IT WILL NOT RENDER THE
		 - JSP FIRST, as JSPs are, by nature, dynamic.
		 -
		 - Usually, this is off by default.
		 -->		
		<docFactory enabled="false" type="jsp">
			<class>net.grcomputing.opencms.search.lucene.JspDocument</class>
		</docFactory>
		
		<!-- For the forum module. Enable if you use forums. -->
		<docFactory enabled="false" type="forum">
			<class>de.wfnetz.opencms.modules.forum.ContributionDocument</class>
		</docFactory>

		<!-- If you need to index XML Template files (bad idea) use this: -->
		<docFactory enabled="false" type="XML Template"/>
	</docFactories>
	
	<!--
	  - <directories/> determines which directories are indexed. By default,
	  - the /system directory is never indexed, so it is safe to index root.
	  -
	  - If you want to specify only certain directories for indexing, create
	  - one <directory/> entry per directory. Again, you may use subsearch to
	  - override the default subsearch setting discussed above.
	  -->	
	<directories>
		<directory location="/site/de/">
			<section>Test</section>
			<subsearch>true</subsearch>
		</directory>
	</directories>

	<!--
	  - <exclude/> determines which directories are excluded from indexing.
	  - By default, the /system directory is never indexed, so it is safe to
	  - index root.
	  -->	
	<exclude>
		<directory location="/site/de/common_content/"/>
	</exclude>
		
	<!--
	 - Use this section to define specific contentDefinitions. Provided below
	 - are entries for the news and forum modules.
	 - (Uncomment these only after you have installed the corresponding 
	 - modules)
	 -->
   	<contentDefinitions>
       		<!--
       		<contentDefinition type="news">
        	-->
          	<!-- 
            	- <class /> determines the class of the content definition.
            	- Should be a subclass of com.opencms.defaults.A_CmsContentDefinition.
            	-->
	         <!--
	         <class>com.opencms.modules.homepage.news.NewsContentDefinition</class>
	          -->
	          <!--
	            - <initClass /> is optional and has to implement
	            - net.grcomputing.opencms.search.lucene.I_ContentDefinitionInitialization.
	            - It provides you with the ability to perform some
	            - initialization before the content definition class can be used.
	            - In case of the news module the NewsChannelContentDefinition class
	            - has to be loaded.
	            -->
	         <!--
	         <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initClass>
	          -->
	           <!--
	             - <listMethod /> defines the method of the content definition class
	             - which should be used to retrieve all content definition objects
	             - (or any subset).
	             - Usually you use this method also in the backoffice or any other
	             - list view.
	             -->
	         <!--
	         <listMethod name="getNewsList">
	           <param type="java.lang.Integer">1</param>
	           <param type="java.lang.String">-1</param>
	         </listMethod>
	          -->
	           <!--
	             - <page /> determines a page in the virtual file system that can
	             - display a single entry of a content definition. You must also
	             - provide a method of the content definition class that retrieves an
				 - id (or something else that has to be appended to your page uri 
				 - to determine which entry has to be displayed). The result will
				 - look like:
	             - /news.html?__element=entry&newsid=<result of getIntId>
	             - for each content definition instance object.
	             -->
	         <!--
	         <page uri="/news.html?__element=entry">
	           <param method="getIntId" name="newsid"/>
	         </page>
	          -->
	         <!--
	           <page uri="/singleNews.jsp">
	             <param method="getIntId" name="id"/>
	           </page>
	           -->
	       <!--
	       </contentDefinition>
	        -->
			<!-- for Forums modules
	       <contentDefinition type="forum">
	         <class>de.wfnetz.opencms.modules.forum.ContributionContentDefinition</class>
	         <listMethod name="getSortedList">
	           <param type="java.lang.String"/>
	         </listMethod>
	         <page uri="/forum.html?forumtemplate=viewcontributionentry">
	           <param method="getId" name="conid"/>
	         </page>
	       </contentDefinition>
		   -->
	</contentDefinitions>	
</luceneSearch>
</system>
</registry>
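
To double-check what such a file actually declares, independently of the
search module, the config can be parsed with the JDK's own XML parser. The
following is only a rough sketch and not part of the module: the class name
RegistrySummary and its summarize method are hypothetical, and it merely lists
the enabled docFactory types and the directories under <directories/> and
<exclude/>.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RegistrySummary {

    /** Returns one line per enabled docFactory and per listed directory. */
    static List<String> summarize(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(xml);
        List<String> out = new ArrayList<>();

        // Only docFactory elements with enabled="true" take part in indexing.
        NodeList factories = doc.getElementsByTagName("docFactory");
        for (int i = 0; i < factories.getLength(); i++) {
            Element e = (Element) factories.item(i);
            if ("true".equals(e.getAttribute("enabled"))) {
                out.add("docFactory type=" + e.getAttribute("type"));
            }
        }

        // <directory/> appears under both <directories/> and <exclude/>;
        // the parent element name tells the two apart.
        NodeList dirs = doc.getElementsByTagName("directory");
        for (int i = 0; i < dirs.getLength(); i++) {
            Element e = (Element) dirs.item(i);
            out.add(e.getParentNode().getNodeName()
                + " location=" + e.getAttribute("location"));
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java RegistrySummary /path/to/search_de.xml
        try (InputStream in = new FileInputStream(args[0])) {
            for (String line : summarize(in)) {
                System.out.println(line);
            }
        }
    }
}
```

A parse exception here would mean the file is not well-formed XML (for
example a damaged <?xml ... ?> declaration), which would be worth ruling out
before looking at the chroot setup.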



-- 
--------------------------------
Stephan Loeffler
Heinrich v. Kleist Str. 38
95447 Bayreuth 
Tel.: 0921-5072665
Cell.:0179-6994085
-------------------------------
