[opencms-dev] Registry xml for PDF and WORD document search

Tue Nov 25 07:11:01 CET 2003

Hi Matt/Ernesto,

Thanks for all that help. I've copied the other two jar files to the lib dir
and the indexing is working now.

Thank you.

Trevor

-----Original Message-----
From: opencms-dev-admin at opencms.org
[mailto:opencms-dev-admin at opencms.org]On Behalf Of M Butcher
Sent: Tuesday, November 25, 2003 5:02 PM
To: opencms-dev at opencms.org
Subject: Re: [opencms-dev] Registry xml for PDF and WORD document search

Trevor,

One thing I did with the IndexManager is suppress exceptions during
indexing (up until 10 exceptions were reached, then the IndexManager
gives up and throws exceptions). I did this because I didn't want the
index build to fail every time some editor did something stupid with a
document.

If you are really frustrated and want to start seeing all of the
exceptions, edit IndexManager and change this line:
   private static int MAX_DOCUMENT_READING_EXCEPTION = 10;
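
The tolerance behaviour described above can be sketched roughly as follows. This is not the module's actual IndexManager code. Only the constant name comes from the line quoted above. The class TolerantIndexer, the DocumentReader interface and the indexAll method are hypothetical names made up for illustration, and the sketch uses modern Java syntax. It also assumes the build aborts on the failure that reaches the limit; the real class may count differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TolerantIndexer {
    // Constant name taken from the thread; everything else here is hypothetical.
    private static int MAX_DOCUMENT_READING_EXCEPTION = 10;

    /** Hypothetical stand-in for whatever extracts text from one resource. */
    public interface DocumentReader {
        String read(String path) throws Exception;
    }

    /**
     * Reads every path, skipping unreadable documents until the number of
     * failures reaches maxExceptions, then gives up and rethrows.
     * Returns the paths that were read successfully.
     */
    public static List<String> indexAll(List<String> paths, DocumentReader reader,
                                        int maxExceptions) throws Exception {
        List<String> indexed = new ArrayList<>();
        int failures = 0;
        for (String path : paths) {
            try {
                reader.read(path);
                indexed.add(path);
            } catch (Exception e) {
                failures++;
                if (failures >= maxExceptions) {
                    // Limit reached: stop suppressing and abort the build.
                    throw new Exception(
                        "Giving up after " + failures + " unreadable documents", e);
                }
                // Below the limit: report the problem and keep going.
                System.err.println("Skipping " + path + ": " + e.getMessage());
            }
        }
        return indexed;
    }

    public static void main(String[] args) throws Exception {
        List<String> paths = Arrays.asList("a.pdf", "broken.doc", "c.pdf");
        DocumentReader reader = path -> {
            if (path.startsWith("broken")) {
                throw new Exception("cannot parse " + path);
            }
            return "extracted text of " + path;
        };
        System.out.println(indexAll(paths, reader, MAX_DOCUMENT_READING_EXCEPTION));
    }
}
```

In this sketch, passing a limit of 1 makes the very first unreadable document abort the run, which is the "see every exception" behaviour; how the real IndexManager reacts to a lower value should be checked against its source.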

Matt

Trevor Lee wrote:
> Hi Matt/Ernesto,
>
> I've placed the tm-extractors-0.2.jar file in the opencms/WEB-INF/lib
> directory of Tomcat. I've restarted Tomcat and scheduled an index. I have
> two pdf files in /swm/auth/news/articles/ directory on opencms.
>
> I received the following log:
>
> [25.11.2003 05:28:26] <opencms_info> [CmsLogin] Login user Admin
> [25.11.2003 05:30:10] <opencms_cronscheduler> Starting job for
> com.opencms.core.CmsCronEntry{30 5 * * * admin Administrators
> net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true}
> [25.11.2003 05:30:10] <opencms_info>
> =====IndexManager=============================================================
> [25.11.2003 05:30:10] <opencms_info> Analyzer:
> org.apache.lucene.analysis.standard.StandardAnalyzer
> [25.11.2003 05:30:10] <opencms_info> Extension map exists to handle
> plaintext
> [25.11.2003 05:30:10] <opencms_info> Extension map exists to handle
> taggedtext
> [25.11.2003 05:30:10] <opencms_info> JSP DocumentFactory loaded
> [25.11.2003 05:30:10] <opencms_info> Extension map exists to handle
> pdftext
> [25.11.2003 05:30:10] <opencms_info> Extension map exists to handle
> wordtext
> [25.11.2003 05:30:10] <opencms_info> Page DocumentFactory loaded
> [25.11.2003 05:30:10] <opencms_info> IndexManager: indexing /swm/
> [25.11.2003 05:30:10] <opencms_info> IndexManager: indexing /swm/auth/
> [25.11.2003 05:30:11] <opencms_info> IndexManager: indexing
> /swm/auth/advert/
> [25.11.2003 05:30:11] <opencms_info> IndexManager: indexing
> /swm/auth/enterprise/
> [25.11.2003 05:30:11] <opencms_info> IndexManager: indexing
> /swm/auth/enterprise/articles/
> [25.11.2003 05:30:12] <opencms_info> IndexManager: indexing
> /swm/auth/images/
> [25.11.2003 05:30:12] <opencms_info> IndexManager: indexing
> /swm/auth/news/
> [25.11.2003 05:30:12] <opencms_info> IndexManager: indexing
> /swm/auth/news/articles/
>
> It doesn't return how many files are indexed....
> Any thoughts?
>
> Cheers
> Trevor
>
> -----Original Message-----
> From: opencms-dev-admin at opencms.org
> [mailto:opencms-dev-admin at opencms.org]On Behalf Of M Butcher
> Sent: Tuesday, November 25, 2003 4:34 PM
> To: opencms-dev at opencms.org
> Subject: Re: [opencms-dev] Registry xml for PDF and WORD document search
>
>
> Again, in case I wasn't clear, the important part isn't the name=""
> attribute of <fileType/>, but the value of <extension/>. The class
> net.grcomputing.opencms.search.lucene.ExtensionMapping will try and
> determine the file type based on the file's extension.
>
> Your XML looks fine.
>
> However, you will need the JAR file from textmining.org. It contains the
> interpreter for PDF and DOC files. If you have not done so already,
> put that in the lib/ directory and restart Tomcat.
>
> Future releases of the module will include the textmining utilities.
>
> Matt
>
>
> Trevor Lee wrote:
>
>>Hi Ernesto,
>>
>>Would you be able to cut and paste what you have in your registry.xml file
>>here?
>>
>>Cheers
>>Trevor
>>
>>-----Original Message-----
>>From: opencms-dev-admin at opencms.org
>>[mailto:opencms-dev-admin at opencms.org]On Behalf Of Ernesto De Santis
>>Sent: Tuesday, November 25, 2003 3:21 PM
>>To: OpenCms List
>>Subject: [opencms-dev] Registry xml for PDF and WORD document search
>>
>>
>> Hi
>>
>> I didn't reply because I don't know what the name attribute of <fileType/>
>>is used for. In my registry I wrote arbitrary text there, and it works fine. :-)
>>
>> Ernesto.
>>
>>
>>
>>
>>
>>>>Trevor,
>>>>
>>>>I'm not sure. I think you need a Content Definition. I'm copying Stephen
>>>>on this -- he did most of the work on this part of the module. I'll also
>>>>copy Ernesto, who contributed the two classes.
>>>>
>>>>Stephen, Ernesto -- if you can answer, I'll incorporate your answer into
>>>>the README/INSTALL files for the module.
>>>>
>>>>Matt
>>>>
>>>>Trevor Lee wrote:
>>>>
>>>>
>>>>>Hi all,
>>>>>
>>>>>I was wondering what the registry.xml file should contain in order to get
>>>>>Lucene to index Word and PDF files using Ernesto De Santis's PDFDocument and
>>>>>WordDocument classes?
>>>>>
>>>>>I've got the following in my registry.xml file:
>>>>>
>>>>>               <docFactory enabled="true" type="binary">
>>>>>                   <fileType name="pdftext">
>>>>>                       <extension>.pdf</extension>
>>>>>                       <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
>>>>>                   </fileType>
>>>>>                   <fileType name="doctext">
>>>>>                       <extension>.doc</extension>
>>>>>                       <class>net.grcomputing.opencms.search.lucene.WordDocument</class>
>>>>>                   </fileType>
>>>>>               </docFactory>
>>>>>
>>>>>Where do I define the "pdftext" and "doctext" types?
>>>>>
>>>>>What else needs to be changed or included?
>>>>>
>>>>>Thanks in advance for your help.
>>>>>
>>>>>Cheers
>>>>>Trevor
>>>>>
>>>>>_______________________________________________
>>>>>This mail is sent to you from the opencms-dev mailing list
>>>>>To change your list options, or to unsubscribe from the list, please visit
>>>>>http://mail.opencms.org/mailman/listinfo/opencms-dev
>>>>
>>>>
>>>>
>>>>
>
>
>
