[opencms-dev] OpenCMSLucene 1.4 search: Word doc indexing is done but not for html/txt

Ritwik Datta dattaritwik at yahoo.com
Thu Jan 22 06:47:00 CET 2004


Dear All,
 
 
I have compiled the opencmslucene 1.4 source from the sourceforge.net CVS repository, and I am now able to index Word documents. However, files with other extensions, such as .html and .txt, are not being indexed. They were indexed with version 1.3 of the Lucene module for OpenCms. My registry.xml contains entries for PlainDocument, TaggedPlainDocument and, of course, WordDocument, but the Index manager does not take any files other than Word documents into consideration.
Previously I had opencmslucene 1.3. To upgrade, I downloaded all the Java files from the latest CVS, compiled them, and uploaded the classes under $TOMCAT_HOME/webapps/opencms/WEB-INF/classes/net/grcomputing/opencms/search/lucene. I also placed jakarta-poi-1.9.0-dev-20030109.jar and tm-extractors-0.2.jar under the $TOMCAT_HOME/webapps/opencms/WEB-INF/lib folder.
I am pasting the relevant contents of my registry.xml and the Index manager log entries below. Word indexing works, but I need html/txt indexing as well. Please help me. This is urgent.
 
 
<luceneSearch>
    <mergeFactor>100000</mergeFactor>
    <permCheck>true</permCheck>
    <indexDir>/opt/lucene/index/opencms/</indexDir>
    <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
    <subsearch>true</subsearch>
    <project>online</project>
    <docFactories>
        <pageDocFactory enabled="true">
            <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
        </pageDocFactory>
        <plainDocFactory enabled="true">
            <fileType name="plaintext">
                <extension>.txt</extension>
                <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
            </fileType>
            <fileType name="taggedtext">
                <extension>.html</extension>
                <extension>.htm</extension>
                <extension>.xml</extension>
                <!-- This will strip tags before processing -->
                <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
            </fileType>
        </plainDocFactory>
        <docFactory type="binary" enabled="true">
            <fileType name="doctext">
                <extension>.doc</extension>
                <extension>.dot</extension>
                <class>net.grcomputing.opencms.search.lucene.WordDocument</class>
            </fileType>
        </docFactory>
        <jspDocFactory enabled="true">
            <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
        </jspDocFactory>
        <xmlTemplateDocFactory enabled="false"/>
    </docFactories>
    <directories>
        <directory location="/release/">
            <section>Test</section>
            <subsearch>true</subsearch>
        </directory>
    </directories>
</luceneSearch>
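
To double-check which extensions the registry fragment above actually declares, and under which factory element each one sits, a small stand-alone Java sketch like the following could be used. It is not part of OpenCMSLucene: the class name RegistryFileTypes and the method extensionMap are hypothetical, and it only reads the XML with the standard JDK parser. How IndexManager itself interprets plainDocFactory versus docFactory is not something this sketch can show.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not an OpenCMSLucene class.
public class RegistryFileTypes {

    /** Maps each extension to "fileTypeName (enclosingElementName)". */
    static Map<String, String> extensionMap(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Map<String, String> result = new LinkedHashMap<>();
        NodeList fileTypes = doc.getElementsByTagName("fileType");
        for (int i = 0; i < fileTypes.getLength(); i++) {
            Element fileType = (Element) fileTypes.item(i);
            // Name of the element that encloses this fileType, e.g. plainDocFactory
            String parent = fileType.getParentNode().getNodeName();
            NodeList extensions = fileType.getElementsByTagName("extension");
            for (int j = 0; j < extensions.getLength(); j++) {
                String ext = extensions.item(j).getTextContent().trim();
                result.put(ext, fileType.getAttribute("name") + " (" + parent + ")");
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Shortened copy of the docFactories section from the registry.xml above
        String xml =
            "<docFactories>"
          + "<plainDocFactory enabled=\"true\">"
          + "<fileType name=\"plaintext\"><extension>.txt</extension></fileType>"
          + "<fileType name=\"taggedtext\"><extension>.html</extension>"
          + "<extension>.htm</extension><extension>.xml</extension></fileType>"
          + "</plainDocFactory>"
          + "<docFactory type=\"binary\" enabled=\"true\">"
          + "<fileType name=\"doctext\"><extension>.doc</extension>"
          + "<extension>.dot</extension></fileType>"
          + "</docFactory>"
          + "</docFactories>";
        extensionMap(xml).forEach((ext, type) -> System.out.println(ext + " -> " + type));
    }
}
```

Run against the full registry.xml, the listing makes it easy to see that the text file types sit under plainDocFactory while the Word file type sits under docFactory type="binary", which may be worth comparing against what the 1.4 sources expect.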
 
=====IndexManager=============================================================
[22.01.2004 09:46:10] <opencms_info> Analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
[22.01.2004 09:46:10] <opencms_info> Extension map exists to handle doctext
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Assessment_Findings/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Best_Practices/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Business_Goals/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/CMC_Product_Information/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/CMM_Action_Plans/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Coding_Standard/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Dashboard/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Defect_Prevention/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/ER_SI_Organisation_Structure/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Estimation/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Expert_List/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/FAQ/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/IGC_OSSP_Role_Mapping/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Metrics_and_Measurements/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/OQPM/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/OSSP/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Presentation_Library/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Process_Change_Management/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Projectwise_Plans/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/PROMPT/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Readables/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/Sample_CMM_Documents/
[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/spdb/SCM/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/SEPG/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/SPDB_Notes/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/SPDB_Search/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/SQA/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Notes/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Others/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Data/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/Bilingual_2-tier_Application_to_3-tier_Conversion/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/Citrix/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/Compilation_Problem/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/Driver_Installation/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/FTP_Service_on_Linux/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/Hindi_Email/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/Hindi_Integration_Development_Guidelines/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/HW_Requirement_for_Oracle9i_9iDS_9iASR2/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/Oracle_9i_Application_Server_Release2_Installation/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/Oracle_Forms9i_to_Forms6i_Conversion/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/Oracle_Froms6i_Deployment_on_9iAS/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/ORARRP_Reusable_Components/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/OS_Problem/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Asset_Details/Red_Hat_Advance_Server_Installation/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Project_Info/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Register/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Reusable_Assets/Training_Materials/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/TCM_Plans/
[22.01.2004 09:46:11] <opencms_info> IndexManager: indexing /release/spdb/TCM/Templates/
[22.01.2004 09:46:12] <opencms_info> IndexManager: indexing /release/spdb/Timesheet/
[22.01.2004 09:46:12] <opencms_info> IndexManager: indexing /release/spdb/Training/
[22.01.2004 09:46:12] <opencms_info> IndexManager: 4 documents are being processed
[22.01.2004 09:46:13] <opencms_info> IndexManager:  Index has been optimized.
[22.01.2004 09:46:13] <opencms_info> Done
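
The only "Extension map exists to handle" line in the log above names doctext. Assuming IndexManager writes one such line per file type it has registered (an assumption, not confirmed anywhere in this post), the following stand-alone Java sketch pulls those names out of a log so they can be compared with the fileType names in registry.xml. The class name HandledFileTypes and the method handledTypes are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an OpenCMSLucene class.
public class HandledFileTypes {

    // Matches the message text seen in the log above and captures the file type name
    private static final Pattern HANDLED =
        Pattern.compile("Extension map exists to handle (\\S+)");

    /** Returns the file type names mentioned in "Extension map exists" lines, in order. */
    static List<String> handledTypes(String log) {
        List<String> names = new ArrayList<>();
        Matcher matcher = HANDLED.matcher(log);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Two lines copied from the log above
        String log =
            "[22.01.2004 09:46:10] <opencms_info> Extension map exists to handle doctext\n"
          + "[22.01.2004 09:46:10] <opencms_info> IndexManager: indexing /release/\n";
        System.out.println(handledTypes(log)); // prints [doctext]
    }
}
```

If plaintext and taggedtext never appear in that list, the file types under plainDocFactory are probably not being registered at all, rather than being registered and then skipped during indexing.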




