[opencms-dev] Registry xml for PDF and WORD document search

M Butcher mbutcher at grcomputing.net
Tue Nov 25 06:27:01 CET 2003


Again, in case I wasn't clear, the important part isn't the name=""
attribute of <fileType/>, but the value of <extension/>. The class
net.grcomputing.opencms.search.lucene.ExtensionMapping will try to
determine the file type based on the file's extension.
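
To illustrate why only <extension/> matters, here is a minimal sketch of
an extension-based lookup. It is not the source of ExtensionMapping: the
class name ExtensionLookupSketch and its methods are hypothetical, and
the case-insensitive matching is an assumption rather than documented
module behaviour.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; NOT the module's ExtensionMapping class.
final class ExtensionLookupSketch {
    // Maps a lower-cased extension (with leading dot) to a handler class name.
    private final Map<String, String> handlers = new HashMap<>();

    // Registers a handler class for one extension, e.g. ".pdf".
    void register(String extension, String className) {
        handlers.put(extension.toLowerCase(Locale.ROOT), className);
    }

    // Returns the handler class name for a file name, or null when the
    // name has no extension or the extension is not registered.
    // Assumption: matching ignores case, so "Report.PDF" matches ".pdf".
    String handlerFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        String extension = fileName.substring(dot).toLowerCase(Locale.ROOT);
        return handlers.get(extension);
    }
}
```

In a lookup like this the fileType name never takes part, which would
explain why arbitrary text in name="" works.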

Your XML looks fine.

However, you will need the JAR file from textmining.org. It contains the
interpreter for PDF and DOC files. If you have not done so already,
put that in the lib/ directory and restart Tomcat.
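
One way to confirm that the JAR was picked up after the restart is to
check whether one of its classes can be loaded. The sketch below is a
hypothetical helper, not part of the module. The class name to probe is
an assumption: org.textmining.text.extraction.WordExtractor is the name
I would expect from the textmining.org extractors, but verify it against
the contents of your JAR.

```java
// Hypothetical helper; not part of the OpenCms Lucene module.
final class ClasspathCheckSketch {
    // Returns true if the named class is visible to the class loader that
    // loaded this helper, false if it cannot be found or linked.
    static boolean isLoadable(String className) {
        try {
            // 'false' avoids running the class's static initializers.
            Class.forName(className, false,
                    ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }
}
```

Calling ClasspathCheckSketch.isLoadable("org.textmining.text.extraction.WordExtractor")
from code running inside the web application would then report whether
that (assumed) class is available. The result depends on which class
loader runs the check, so a test from outside Tomcat proves nothing
about the webapp.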

Future releases of the module will include the textmining utilities.

Matt


Trevor Lee wrote:
> Hi Ernesto,
> 
> Would you be able to cut and paste what you have in your registry.xml file
> here?
> 
> Cheers
> Trevor
> 
> -----Original Message-----
> From: opencms-dev-admin at opencms.org
> [mailto:opencms-dev-admin at opencms.org]On Behalf Of Ernesto De Santis
> Sent: Tuesday, November 25, 2003 3:21 PM
> To: OpenCms List
> Subject: [opencms-dev] Registry xml for PDF and WORD document search
> 
> 
>  Hi
> 
>  I didn't respond because I don't know how the name of a fileType is used.
> In my registry I wrote arbitrary text, and it works fine. :-)
> 
>  Ernesto.
> 
> 
> 
> 
>>>Trevor,
>>>
>>>I'm not sure. I think you need a Content Definition. I'm copying Stephen
>>>on this -- he did most of the work on this part of the module. I'll also
>>>copy Ernesto, who contributed the two classes.
>>>
>>>Stephen, Ernesto -- if you can answer, I'll incorporate your answer into
>>>the README/INSTALL files for the module.
>>>
>>>Matt
>>>
>>>Trevor Lee wrote:
>>>
>>>>Hi all,
>>>>
>>>>I was wondering what the registry.xml file should have in order to get
>>>>lucene to index word and pdf files using Ernesto De Santis's PDFDocument
>>>>and WordDocument classes?
>>>>
>>>>I've got the following in my registry.xml file:
>>>>
>>>>                <docFactory enabled="true" type="binary">
>>>>                    <fileType name="pdftext">
>>>>                        <extension>.pdf</extension>
>>>>                        <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
>>>>                    </fileType>
>>>>                    <fileType name="doctext">
>>>>                        <extension>.doc</extension>
>>>>                        <class>net.grcomputing.opencms.search.lucene.WordDocument</class>
>>>>                    </fileType>
>>>>                </docFactory>
>>>>
>>>>Where do I define the "pdftext" and "doctext" types?
>>>>
>>>>What else needs to be changed or included?
>>>>
>>>>Thanks in advance for your help.
>>>>
>>>>Cheers
>>>>Trevor
>>>>
>>>>_______________________________________________
>>>>This mail is send to you from the opencms-dev mailing list
>>>>To change your list options, or to unsubscribe from the list, please visit
>>>>http://mail.opencms.org/mailman/listinfo/opencms-dev
>>>
>>>
>>>
>>>
> 




