[opencms-dev] Registry xml for PDF and WORD document search

Trevor Lee Trevor.Lee at 4Loop.com.au
Tue Nov 25 06:43:01 CET 2003


Hi Matt/Ernesto,

I've placed the tm-extractors-0.2.jar file in the opencms/WEB-INF/lib
directory of Tomcat, restarted Tomcat, and scheduled an index. I have
two PDF files in the /swm/auth/news/articles/ directory in OpenCms.

I received the following log:

[25.11.2003 05:28:26] <opencms_info> [CmsLogin] Login user Admin
[25.11.2003 05:30:10] <opencms_cronscheduler> Starting job for com.opencms.core.CmsCronEntry{30 5 * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true}
[25.11.2003 05:30:10] <opencms_info> =====IndexManager=============================================================
[25.11.2003 05:30:10] <opencms_info> Analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
[25.11.2003 05:30:10] <opencms_info> Extension map exists to handle plaintext
[25.11.2003 05:30:10] <opencms_info> Extension map exists to handle taggedtext
[25.11.2003 05:30:10] <opencms_info> JSP DocumentFactory loaded
[25.11.2003 05:30:10] <opencms_info> Extension map exists to handle pdftext
[25.11.2003 05:30:10] <opencms_info> Extension map exists to handle wordtext
[25.11.2003 05:30:10] <opencms_info> Page DocumentFactory loaded
[25.11.2003 05:30:10] <opencms_info> IndexManager: indexing /swm/
[25.11.2003 05:30:10] <opencms_info> IndexManager: indexing /swm/auth/
[25.11.2003 05:30:11] <opencms_info> IndexManager: indexing /swm/auth/advert/
[25.11.2003 05:30:11] <opencms_info> IndexManager: indexing /swm/auth/enterprise/
[25.11.2003 05:30:11] <opencms_info> IndexManager: indexing /swm/auth/enterprise/articles/
[25.11.2003 05:30:12] <opencms_info> IndexManager: indexing /swm/auth/images/
[25.11.2003 05:30:12] <opencms_info> IndexManager: indexing /swm/auth/news/
[25.11.2003 05:30:12] <opencms_info> IndexManager: indexing /swm/auth/news/articles/

The log doesn't report how many files were indexed.
Any thoughts?

Cheers
Trevor

-----Original Message-----
From: opencms-dev-admin at opencms.org
[mailto:opencms-dev-admin at opencms.org]On Behalf Of M Butcher
Sent: Tuesday, November 25, 2003 4:34 PM
To: opencms-dev at opencms.org
Subject: Re: [opencms-dev] Registry xml for PDF and WORD document search


Again, in case I wasn't clear, the important part isn't the name=""
attribute of <fileType/>, but the value of <extension/>. The class
net.grcomputing.opencms.search.lucene.ExtensionMapping will try to
determine the file type based on the file's extension.
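
To illustrate the idea of extension-based lookup, here is a minimal sketch.
It is not the module's actual ExtensionMapping code: the class name
ExtensionLookupSketch, its methods, and the case-insensitive matching are
all assumptions made for the example. The point is only that the lookup key
is the file's extension, and the name="" attribute plays no part.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for an extension-to-class lookup; not the module's real code. */
public class ExtensionLookupSketch {
    // Maps a lower-cased extension (with leading dot) to a document class name.
    private final Map<String, String> classByExtension = new LinkedHashMap<>();

    /** Registers one fileType entry. The name attribute is deliberately not a parameter. */
    public void register(String extension, String className) {
        classByExtension.put(extension.toLowerCase(Locale.ROOT), className);
    }

    /** Returns the class registered for the resource's extension, if any. */
    public Optional<String> classFor(String resourcePath) {
        int slash = resourcePath.lastIndexOf('/');
        int dot = resourcePath.lastIndexOf('.');
        if (dot <= slash) {
            // No dot in the file name itself, so there is no extension to match.
            return Optional.empty();
        }
        String ext = resourcePath.substring(dot).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(classByExtension.get(ext));
    }

    public static void main(String[] args) {
        ExtensionLookupSketch mapping = new ExtensionLookupSketch();
        mapping.register(".pdf", "net.grcomputing.opencms.search.lucene.PDFDocument");
        mapping.register(".doc", "net.grcomputing.opencms.search.lucene.WordDocument");
        // The file name below is made up for the example.
        System.out.println(mapping.classFor("/swm/auth/news/articles/example.pdf"));
    }
}
```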

Your XML looks fine.

However, you will need the JAR file from textmining.org. It contains the
interpreter for PDF and DOC files. If you have not done so already,
put that in the lib/ directory and restart Tomcat.
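
One rough way to confirm that a JAR was picked up is to ask the class loader
for one of its classes. The sketch below is only an illustration: the class
ClasspathCheckSketch is hypothetical, and the default class name
org.textmining.text.extraction.WordExtractor is my assumption about what the
textmining.org JAR contains, so substitute a class you know is in the JAR.
To see WEB-INF/lib, the check has to run inside the web application (for
example from a JSP), not as a separate command-line program.

```java
/** Hypothetical helper that reports whether a class can be found by this class's loader. */
public class ClasspathCheckSketch {

    /** Returns true if the named class can be located, without initializing it. */
    public static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not on the class path, or present but missing one of its own dependencies.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class name; pass a different one as the first argument if needed.
        String name = args.length > 0
                ? args[0]
                : "org.textmining.text.extraction.WordExtractor";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```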

Future releases of the module will include the textmining utilities.

Matt


Trevor Lee wrote:
> Hi Ernesto,
>
> Would you be able to cut and paste what you have in your registry.xml file
> here?
>
> Cheers
> Trevor
>
> -----Original Message-----
> From: opencms-dev-admin at opencms.org
> [mailto:opencms-dev-admin at opencms.org]On Behalf Of Ernesto De Santis
> Sent: Tuesday, November 25, 2003 3:21 PM
> To: OpenCms List
> Subject: [opencms-dev] Registry xml for PDF and WORD document search
>
>
>  Hi
>
>  I didn't respond because I don't know what the name of a fileType is
> used for. In my registry I wrote arbitrary text there, and it works fine. :-)
>
>  Ernesto.
>
>
>
>
>>>Trevor,
>>>
>>>I'm not sure. I think you need a Content Definition. I'm copying Stephen
>>>on this -- he did most of the work on this part of the module. I'll also
>>>copy Ernesto, who contributed the two classes.
>>>
>>>Stephen, Ernesto -- if you can answer, I'll incorporate your answer into
>>>the README/INSTALL files for the module.
>>>
>>>Matt
>>>
>>>Trevor Lee wrote:
>>>
>>>>Hi all,
>>>>
>>>>I was wondering what the registry.xml file should contain in order to
>>>>get Lucene to index Word and PDF files using Ernesto De Santis's
>>>>PDFDocument and WordDocument classes?
>>>>
>>>>I've got the following in my registry.xml file:
>>>>
>>>>                <docFactory enabled="true" type="binary">
>>>>                    <fileType name="pdftext">
>>>>                        <extension>.pdf</extension>
>>>>                        <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
>>>>                    </fileType>
>>>>                    <fileType name="doctext">
>>>>                        <extension>.doc</extension>
>>>>                        <class>net.grcomputing.opencms.search.lucene.WordDocument</class>
>>>>                    </fileType>
>>>>                </docFactory>
>>>>
>>>>Where do I define the "pdftext" and "doctext" types?
>>>>
>>>>What else needs to be changed or included?
>>>>
>>>>Thanks in advance for your help.
>>>>
>>>>Cheers
>>>>Trevor
>>>>
>>>>_______________________________________________
>>>>This mail is sent to you from the opencms-dev mailing list
>>>>To change your list options, or to unsubscribe from the list, please visit
>>>>http://mail.opencms.org/mailman/listinfo/opencms-dev
>>>
>>>
>>>
>>>

