[opencms-dev] Developed an XML Indexer for Lucene but getting error - EOF

M Butcher mbutcher at grcomputing.net
Tue Mar 16 18:56:00 CET 2004


Alex ! wrote:
> OK, Matt. So I had some input from my colleague, changed the XMLDocument 
> class (it seems it wasn't done in the best way!) and have now tried calling 
> XMLDocument(cmso,f) directly from a JSP - and it works: it returns a 
> Lucene document, which I test by outputting it to the screen using the 
> Document.toString() method as before.
> 
> But... the cron job still returns the same "Premature end of file" exception.

On the same document? Do you know what is throwing the exception? Is it 
the XMLDocument constructor or the IndexManager?
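
One way to narrow that down, as a rough sketch rather than anything taken from the module: feed the same bytes that f.getContents() returns to a bare SAX parser, outside the IndexManager, and see what comes back. The class and method names below (ParseProbe, probe) are made up for illustration, and the sketch uses the JDK's default SAX parser through SAXParserFactory instead of naming the Xerces class directly. Note also that the catch block in the quoted XMLDocument code passes e.getCause() rather than e to CmsException, so the log only ever shows the message and not where the exception was thrown.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic helper; not part of OpenCms or the Lucene module.
final class ParseProbe {

    /**
     * Parses the given bytes with a fresh reader and reports the outcome:
     * "ok", an "empty: ..." note when no bytes were supplied, or a
     * description of the parser error including its line and column.
     */
    static String probe(byte[] contents) {
        // An empty buffer is reported separately from malformed XML,
        // because it points at the caller rather than at the document.
        if (contents == null || contents.length == 0) {
            return "empty: no bytes reached the parser";
        }
        try {
            // Everything is local to this call, so no state is shared
            // between files or between threads.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DefaultHandler handler = new DefaultHandler();
            reader.setContentHandler(handler);
            // DefaultHandler rethrows fatal errors, so they surface below.
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new ByteArrayInputStream(contents)));
            return "ok";
        } catch (SAXParseException e) {
            return "parser failed at line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            return "other failure: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(probe("<article><title>t</title></article>".getBytes(StandardCharsets.UTF_8)));
        System.out.println(probe("<article><title>".getBytes(StandardCharsets.UTF_8)));
        System.out.println(probe(new byte[0]));
    }
}
```

If the JSP call reports "ok" and the cron run reports "empty" for the same file, the content is probably not reaching the parser in the cron context, and the place to look is whatever hands the CmsFile to the factory rather than the handler itself.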

> 
> 
> Alex
> 
> 
>> From: "Alex !" <kingofkingston at hotmail.com>
>> Reply-To: opencms-dev at opencms.org
>> To: opencms-dev at opencms.org
>> Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but 
>> getting error - EOF
>> Date: Mon, 15 Mar 2004 22:28:34 +0000
>>
>> It seems to be the indexer. I have a class XMLDocument (implements 
>> I_FileDocumentFactory), which is based on bodyless document. Here I 
>> set up the XMLReader and instantiate a XMLDocumentHandlerSAX class 
>> (extends DefaultHandler).
>>
>> After some thorough debugging and testing, it seems to be the indexer, as I 
>> can call the XMLDocumentHandlerSAX from within a JSP and it works, 
>> returning a Lucene Document that I then print to screen using 
>> Document.toString(). It all looks OK, although I haven't tried indexing 
>> it myself (I was counting on the module doing this).
>>
>> Could it be the XMLDocument class? Here is what it looks like:
>>
>> public class XMLDocument implements I_FileDocumentFactory
>> {
>>     public static String FACTORY_NAME = "XML DocumentFactory";
>>     private XMLDocumentHandlerSAX saxhdlr = null;
>>     private XMLReader xr = null;
>>     private InputStream in = null;
>>     private InputSource is = null;
>>
>>     public XMLDocument() { }
>>
>>     public String getFactoryName() {
>>         return FACTORY_NAME;
>>     }
>>
>>     public Document Document(CmsObject cmso, CmsFile f) throws CmsException
>>     {
>>         try
>>         {
>>             XMLDocumentHandlerSAX saxhdlr = new XMLDocumentHandlerSAX(cmso, f);
>>
>>             in = new ByteArrayInputStream(f.getContents());
>>             is = new InputSource(in);
>>
>>             //in = (InputStream)(new ByteArrayInputStream(f.getContents()));
>>             //is = new InputSource(in);
>>
>>             //is = new InputSource (new StringReader (xmlText));
>>
>>             xr = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
>>             xr.setContentHandler(saxhdlr);
>>             xr.setFeature("http://xml.org/sax/features/validation", false);
>>             xr.setFeature("http://apache.org/xml/features/continue-after-fatal-error", true);
>>             xr.parse(is);
>>         }
>>         catch (Exception e)
>>         {
>>             throw new CmsException(e.getMessage(), e.getCause());
>>         }
>>         return saxhdlr.getDocument();
>>     }
>>
>>     public Document Document(CmsObject cmso, CmsFile f, HashMap h) throws CmsException
>>     {
>>         return Document(cmso, f);
>>     }
>> }
>>
>>
>> It seems the handler class returns what it should, so it is either the 
>> XMLDocument class or the indexer which is complaining. Should I send 
>> you the two source files? They're about as complete as they are going to get...
>>
>> Cheers
>>
>> Alex
>>
>>
>>> From: M Butcher <mbutcher at grcomputing.net>
>>> Reply-To: opencms-dev at opencms.org
>>> To: opencms-dev at opencms.org
>>> Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but 
>>> getting error - EOF
>>> Date: Mon, 15 Mar 2004 13:53:17 -0700
>>>
>>> What is throwing the exception, the XML parser or the indexer? Last 
>>> week, I was working on my XSLT code and created some code that looks 
>>> almost exactly like yours (except I created a Transformer instead of 
>>> an XMLReader) and it worked fine -- perhaps the problem is in 
>>> whatever gets handed to the IndexManager.
>>>
>>> Matt
>>>
>>> Alex ! wrote:
>>>
>>>> OK, so I think I'm almost done, but now when the cron job runs (yes, it 
>>>> has mysteriously begun working!), I get the following error about a 
>>>> premature end of file. Any ideas? The way I am retrieving the file 
>>>> contents is as follows:
>>>>
>>>>             in = new ByteArrayInputStream(f.getContents());
>>>>             is = new InputSource(in);
>>>>             xr.parse(is);
>>>>
>>>> where:
>>>>     private XMLReader xr
>>>>     private InputStream in
>>>>     private InputSource is
>>>>
>>>>
>>>> Error output from the OpenCms log:
>>>>
>>>> [13.03.2004 06:58:10] <opencms_cronscheduler> Starting job for 
>>>> com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators 
>>>> net.grcomputing.opencms.search.lucene.CronIndexManager 
>>>> createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml} 
>>>>
>>>>
>>>> [13.03.2004 06:58:10] <opencms_info>
>>>> =====IndexManager============================================================= 
>>>>
>>>>
>>>> [13.03.2004 06:58:10] <opencms_info> Analyzer: 
>>>> org.apache.lucene.analysis.standard.StandardAnalyzer
>>>> [13.03.2004 06:58:10] <opencms_info> Extension map exists to handle XML
>>>> [13.03.2004 06:58:10] <opencms_info> Page DocumentFactory loaded
>>>> [13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/
>>>> [13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error 
>>>> processing file test_xml.xml: com.opencms.core.CmsException: 0 
>>>> Unknown exception. Detailed error: Premature end of file..
>>>> [13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/xml/
>>>> [13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error 
>>>> processing file article5.xml: com.opencms.core.CmsException: 0 
>>>> Unknown exception. Detailed error: Premature end of file..
>>>> [13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error 
>>>> processing file article7.xml: com.opencms.core.CmsException: 0 
>>>> Unknown exception. Detailed error: Premature end of file..
>>>> [13.03.2004 06:58:10] <opencms_info> IndexManager: 4 documents are 
>>>> being processed
>>>> [13.03.2004 06:58:10] <opencms_info> IndexManager:  Index has been 
>>>> optimized.
>>>> [13.03.2004 06:58:10] <opencms_info> Done
>>>> =====IndexManager============================================================= 
>>>>
>>>>
>>>> [13.03.2004 06:58:10] <opencms_cronscheduler> Successful launch of 
>>>> job com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators 
>>>> net.grcomputing.opencms.search.lucene.CronIndexManager 
>>>> createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml} 
>>>> Message: CronIndexManager rebuilt the Lucene index on Sat Mar 13 
>>>> 06:58:10 GMT 2004
>>>>
>>>>
>>>> Thanks alex
>>>>
>>>>
>>>>> From: M Butcher <mbutcher at grcomputing.net>
>>>>> Reply-To: opencms-dev at opencms.org
>>>>> To: opencms-dev at opencms.org
>>>>> Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but 
>>>>> getting error
>>>>> Date: Mon, 08 Mar 2004 10:03:43 -0700
>>>>>
>>>>>
>>>>> Alex,
>>>>>
>>>>> I can't tell, from the stack trace, what is going on. Judging from 
>>>>> where the exception is located, it looks like a problem with 
>>>>> content defs... but that doesn't make sense....
>>>>>
>>>>> When you finish it, please do send it to Stephan and me. It sounds 
>>>>> like a very useful addition to the existing indexing tools.
>>>>>
>>>>> Matt
>>>>>
>>>>> Alex ! wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This one's probably for Matt/Stefan.
>>>>>>
>>>>>> I have written an XML indexer for the Lucene module (almost 
>>>>>> finished), which will basically take an XML file, parse it, and 
>>>>>> then add its elements and their contents to the Lucene index, 
>>>>>> instead of stripping the element tags and then including the 
>>>>>> remaining content as a single searchable body (as is currently 
>>>>>> available).
>>>>>>
>>>>>> Everything is now compiled (into a separate jar, just 2 class 
>>>>>> files), and the cron job runs but gives the following error:
>>>>>>
>>>>>> [07.03.2004 14:20:10] <opencms_cronscheduler> Starting job for 
>>>>>> com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators 
>>>>>> net.grcomputing.opencms.search.lucene.CronIndexManager 
>>>>>> createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/uk_lucene_registry.xml} 
>>>>>>
>>>>>>
>>>>>>
>>>>>> [07.03.2004 14:20:10] <opencms_info>
>>>>>> =====IndexManager============================================================= 
>>>>>>
>>>>>>
>>>>>>
>>>>>> [07.03.2004 14:20:10] <opencms_info> Analyzer: 
>>>>>> org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>>> [07.03.2004 14:20:10] <opencms_info> Extension map exists to 
>>>>>> handle XML
>>>>>> [07.03.2004 14:20:10] <opencms_info> Page DocumentFactory loaded
>>>>>> [07.03.2004 14:20:10] <opencms_info> IndexManager: indexing /test/
>>>>>> [07.03.2004 14:20:11] <opencms_info> Created XMLDocumentHandlerSAX
>>>>>> [07.03.2004 14:20:11] <opencms_info> Return Document
>>>>>> [07.03.2004 14:20:11] <opencms_cronscheduler> Error running job 
>>>>>> for com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators 
>>>>>> net.grcomputing.opencms.search.lucene.CronIndexManager 
>>>>>> createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml} 
>>>>>> Error: java.lang.NullPointerException
>>>>>>     at org.apache.lucene.index.FieldInfos.add(FieldInfos.java:90)
>>>>>>     at 
>>>>>> org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:92) 
>>>>>>
>>>>>>
>>>>>>     at 
>>>>>> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
>>>>>>     at 
>>>>>> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
>>>>>>     at 
>>>>>> net.grcomputing.opencms.search.lucene.IndexManager.processFile(Unknown 
>>>>>> Source)
>>>>>>     at 
>>>>>> net.grcomputing.opencms.search.lucene.IndexManager.processDir(Unknown 
>>>>>> Source)
>>>>>>     at 
>>>>>> net.grcomputing.opencms.search.lucene.IndexManager.doIndex(Unknown 
>>>>>> Source)
>>>>>>     at 
>>>>>> net.grcomputing.opencms.search.lucene.CronIndexManager.launch(Unknown 
>>>>>> Source)
>>>>>>     at 
>>>>>> com.opencms.core.CmsCronScheduleJob.run(CmsCronScheduleJob.java:68)
>>>>>>
>>>>>>
>>>>>> My registry entry for the XML files looks like this (contained in an 
>>>>>> external registry file):
>>>>>>
>>>>>>       <!-- For XML Files :) -->
>>>>>>       <docFactory enabled="true" type="plain">
>>>>>>          <fileType name="XML">
>>>>>>            <extension>.xml</extension>
>>>>>>            <class>com.mydomain.opencms.lucene.xmlindexing.XMLDocument</class>
>>>>>>          </fileType>
>>>>>>       </docFactory>
>>>>>>
>>>>>> Your help would be much appreciated.
>>>>>>
>>>>>> (should I send you the source to correct and include in your next 
>>>>>> patch/update?)
>>>>>>
>>>>>> Many Thanks
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> This mail is send to you from the opencms-dev mailing list
>>>>>> To change your list options, or to unsubscribe from the list, 
>>>>>> please visit
>>>>>> http://mail.opencms.org/mailman/listinfo/opencms-dev



