[opencms-dev] (no subject)

Ben Rometsch ben at solidstategroup.com
Wed Oct 22 06:46:01 CEST 2003


Hi Matt,

I am not having any joy! I've updated my registry.xml file, with the
appropriate section reading:

<luceneSearch>
	<mergeFactor>100000</mergeFactor>
	<permCheck>true</permCheck>
	<indexDir>c:\search</indexDir>
	<analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
	<subsearch>true</subsearch>
	<project>online</project>
	<docFactories>
		<pageDocFactory enabled="true">
			<class>net.grcomputing.opencms.search.lucene.PageDocument</class>
		</pageDocFactory>
		<plainDocFactory enabled="true">
			<fileType name="plaintext">
				<extension>.txt</extension>
				<class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
			</fileType>
			<fileType name="taggedtext">
				<extension>.html</extension>
				<extension>.htm</extension>
				<extension>.xml</extension>
				<!-- This will strip tags before processing -->
				<class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
			</fileType>

			<!-- Index binary documents -->
			<fileType name="plaindocument">
				<extension>.doc</extension>
				<extension>.xls</extension>
				<extension>.pdf</extension>
				<class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
			</fileType>

		</plainDocFactory>
		<jspDocFactory enabled="true">
			<class>net.grcomputing.opencms.search.lucene.JspDocument</class>
		</jspDocFactory>
		<xmlTemplateDocFactory enabled="false"/>
	</docFactories>
	<directories>
		<directory location="/release/">
			<section>Test</section>
			<subsearch>true</subsearch>
		</directory>
		<directory location="/RGLIntranet/">
			<section>Test2</section>
			<subsearch>true</subsearch>
		</directory>
	</directories>
</luceneSearch>

Notice the section beginning after the remark "Index binary documents".

But I cannot get any hits when searching for the names of documents that are
in the VFS. The other (HTML) searches are working OK. Is the "name" attribute
of the fileType tag important? I wasn't sure what to put there. I'm not quite
sure how to move forward. Maybe it would be an idea to add some debugging
trace to the BodylessDocument class to see what is going on inside it? I want
to make sure my XML is correct first, though!
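
One way to confirm that the XML is at least well-formed, and to see which extensions end up mapped to which class, would be to parse the fragment with the JDK's built-in DOM parser. This is only a sketch: RegistryFragmentCheck and its fileTypes method are hypothetical names of my own, not part of OpenCms or the Lucene search module, and it checks well-formedness only, not whether the module actually accepts these element names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of OpenCms or the search module.
class RegistryFragmentCheck {

    // Parses the fragment and returns, for each <fileType name="...">,
    // its extensions followed by "class=<handler class>".
    // Throws IllegalArgumentException if the XML is not well-formed.
    static Map<String, List<String>> fileTypes(String xml) {
        Document doc;
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }

        Map<String, List<String>> result = new LinkedHashMap<>();
        NodeList types = doc.getElementsByTagName("fileType");
        for (int i = 0; i < types.getLength(); i++) {
            Element type = (Element) types.item(i);
            List<String> entries = new ArrayList<>();

            // Collect every <extension> declared under this fileType.
            NodeList exts = type.getElementsByTagName("extension");
            for (int j = 0; j < exts.getLength(); j++) {
                entries.add(exts.item(j).getTextContent().trim());
            }

            // Then the handler class (normally exactly one per fileType).
            NodeList classes = type.getElementsByTagName("class");
            for (int j = 0; j < classes.getLength(); j++) {
                entries.add("class=" + classes.item(j).getTextContent().trim());
            }

            result.put(type.getAttribute("name"), entries);
        }
        return result;
    }
}
```

Passing the <luceneSearch> section above as a string to RegistryFragmentCheck.fileTypes(...) should give one entry per fileType, so a missing or misplaced <class> element under "plaindocument" would show up straight away. It says nothing about what BodylessDocument does with those files at index time.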

Thanks for the help,
Ben


On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote:
> Hi Matt,
> 
> Thanks for the reply. If I just want to get the document title to be
> included in the Lucene index, looking at the code in the
> net.grcomputing.opencms.search.BodylessDocument class it appears to ignore
> what the CMSObject is, and attempt to index it regardless. Is this
> correct?
> 

Correct. It will already index the title, but it will not attempt to
index the body.

> If this is the case, is it simply a matter of instructing Lucene to index
> objects other than HTML files in the VFS (i.e. documents)? Or would I have
> to create another class, something like
> net.grcomputing.opencms.search.FileDocument, and add a new hook into that
> class via the registry.xml fragment? Or does the BodylessDocument class
> provide this functionality, and it's just a matter of adding a new XML
> fragment to the registry.xml area?

Again, you are right -- simply adding the appropriate configuration to
the registry.xml file will suffice. I believe that you will just need to
extend the plainDocument tag set to include extensions and processors...
I _think_ that binary files get handled by the plain handler.

Matt



