[opencms-dev] Developed an XML Indexer for Lucene but getting error - EOF

M Butcher mbutcher at grcomputing.net
Wed Mar 17 00:08:01 CET 2004


Hmm... it is possible that Lucene throws an EOF error for Document 
objects... but the more I think about it, the less likely I think that 
is. More likely, I would expect a SAX parser to throw such an exception 
if either the XML document wasn't well formed or the root element was 
missing a closing >.
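
A quick way to test that theory outside OpenCms is to feed the same bytes to a bare SAX parser. The following is only a minimal sketch: it assumes the JAXP parser bundled with the JDK rather than the Xerces class named in the quoted code below, and SaxEofDemo, parseError and report are made-up names. One case worth ruling out (a guess, not something the logs confirm) is that f.getContents() is empty when the cron job runs. Xerces-derived parsers typically report an empty stream as a premature end of file, but the exact wording is parser-specific, so the sketch only checks whether a SAXParseException is raised.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of OpenCms or the Lucene module.
class SaxEofDemo {

    /** Returns the parser's fatal error for the given bytes, or null if they parse cleanly. */
    static SAXParseException parseError(byte[] content) {
        try {
            XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DefaultHandler handler = new DefaultHandler();
            xr.setContentHandler(handler);
            // DefaultHandler.fatalError rethrows, so fatal errors surface as exceptions.
            xr.setErrorHandler(handler);
            xr.parse(new InputSource(new ByteArrayInputStream(content)));
            return null;
        } catch (SAXParseException e) {
            return e;
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    /** Formats the outcome for one input, including the line/column the parser reported. */
    static String report(String label, byte[] content) {
        SAXParseException e = parseError(content);
        if (e == null) {
            return label + ": parsed OK";
        }
        return label + ": " + e.getMessage()
                + " (line " + e.getLineNumber() + ", column " + e.getColumnNumber() + ")";
    }

    public static void main(String[] args) {
        System.out.println(report("empty content", new byte[0]));
        System.out.println(report("unclosed root", "<root>".getBytes()));
        System.out.println(report("well-formed", "<root/>".getBytes()));
    }
}
```

Logging f.getContents().length just before xr.parse(is) in the cron run would show whether the empty-content case applies here.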

If it's not that, you'll probably have to modify the exception code in 
IndexManager and see if you can find the exact location of the error.
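
As a rough illustration of what "modify the exception code" could mean: the quoted code wraps with e.getCause(), which is often null, so the original exception and its stack trace are lost. Below is a minimal sketch of wrapping with the exception itself and then reporting where the root cause was thrown. IndexingException is a hypothetical stand-in for CmsException, and ErrorLocation, rootCause, describe and parse are made-up names, so treat this as an assumption-laden example rather than the module's actual code.

```java
import java.io.EOFException;

// Hypothetical stand-in for CmsException, used so the sketch compiles on its own.
class IndexingException extends Exception {
    IndexingException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical helper; not part of OpenCms or the Lucene module.
class ErrorLocation {

    /** Walks the cause chain down to the innermost exception. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /** Describes the root cause and the top stack frame where it was thrown. */
    static String describe(Throwable t) {
        Throwable root = rootCause(t);
        StackTraceElement[] frames = root.getStackTrace();
        String where = frames.length > 0 ? frames[0].toString() : "unknown location";
        return root.getClass().getName() + ": " + root.getMessage() + " at " + where;
    }

    /** Simulates a parse step that fails on empty content and wraps the failure. */
    static void parse(byte[] content) throws IndexingException {
        try {
            if (content.length == 0) {
                throw new EOFException("Premature end of file.");
            }
        } catch (Exception e) {
            // Pass e itself (not e.getCause()) so the original stack trace survives.
            throw new IndexingException(e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        try {
            parse(new byte[0]);
        } catch (IndexingException e) {
            System.out.println(describe(e));
        }
    }
}
```

Printing describe(e), or simply calling e.printStackTrace(), at the point where IndexManager logs "CMS Error processing file" should show which class actually raised the EOF.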

Matt

Alex ! wrote:
> I only have 3 xml files in the test dir I'm trying to index. One of those 
> files I am using in my jsp, and it works fine. See the code snippet below. 
> The XMLDocument constructor throws no exceptions, nor does 
> XMLDocument.Document(cmso,f).
> 
> It's gotta be IndexManager. Maybe the document I am producing is not what 
> it is expecting? But then why the EOF?
> 
> Inside my jsp:
> 
> 
> <%
>     CmsJspActionElement cmsJspAE = new CmsJspActionElement(pageContext, request, response);
>     CmsObject cmso = cmsJspAE.getCmsObject();
>     CmsFile f = cmso.readFile("/test/test_xml.xml");
> 
>     String thepath = f.getAbsolutePath();
>     out.println("<br>"+thepath+"<br><br>");
> 
>     XMLDocument xmldoc = null;
>     Document thisdoc = null;
> 
>     try
>     {
>         xmldoc = new XMLDocument();
>         out.println("<br>"+xmldoc.getFactoryName()+"<br>");
>         thisdoc = xmldoc.Document(cmso, f);
>     }
>     catch (Exception e)
>     {
>         throw new CmsException(e.getMessage(), e.getCause());
>     }
>     String outdoc = thisdoc.toString();
>     out.println("Lucene Document: <br><br>" + outdoc);
> %>
> 
> 
>> From: M Butcher <mbutcher at grcomputing.net>
>> Reply-To: opencms-dev at opencms.org
>> To: opencms-dev at opencms.org
>> Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but 
>> getting error - EOF
>> Date: Tue, 16 Mar 2004 11:09:57 -0700
>>
>> Alex ! wrote:
>>
>>> OK, Matt. So I had some input from my colleague, changed the 
>>> XMLDocument class (seems it wasn't done in the best way!) and now 
>>> tried calling XMLDocument.Document(cmso,f) directly from a jsp - and 
>>> it works, returns a Lucene document, which I test by outputting to 
>>> screen using the Document.toString() method as before.
>>>
>>> But... the cron still returns the same premature end of file exception.
>>
>>
>> On the same document? Do you know what is throwing the exception? Is 
>> it the XMLDocument constructor or the IndexManager?
>>
>>>
>>>
>>> Alex
>>>
>>>
>>>> From: "Alex !" <kingofkingston at hotmail.com>
>>>> Reply-To: opencms-dev at opencms.org
>>>> To: opencms-dev at opencms.org
>>>> Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but 
>>>> getting error - EOF
>>>> Date: Mon, 15 Mar 2004 22:28:34 +0000
>>>>
>>>> It seems to be the indexer. I have a class XMLDocument (implements 
>>>> I_FileDocumentFactory), which is based on bodyless document. Here I 
>>>> set up the XMLReader and instantiate an XMLDocumentHandlerSAX class 
>>>> (extends DefaultHandler).
>>>>
>>>> After some thorough debugging and testing, it seems to be the indexer, 
>>>> as I can call the XMLDocumentHandlerSAX from within a jsp and it works, 
>>>> returning a Lucene Document that I then print to screen using 
>>>> Document.toString(). It all looks OK, although I haven't tried 
>>>> indexing it myself (I was counting on the module doing this).
>>>>
>>>> Could it be the XMLDocument class? Here is what it looks like:
>>>>
>>>> public class XMLDocument implements I_FileDocumentFactory
>>>> {
>>>>     public static String FACTORY_NAME = "XML DocumentFactory";
>>>>     private XMLDocumentHandlerSAX saxhdlr = null;
>>>>     private XMLReader xr = null;
>>>>     private InputStream in = null;
>>>>     private InputSource is = null;
>>>>
>>>>     public XMLDocument() { }
>>>>
>>>>     public String getFactoryName() {
>>>>        return FACTORY_NAME;
>>>>     }
>>>>
>>>>     public Document Document(CmsObject cmso, CmsFile f) throws CmsException
>>>>     {
>>>>         try
>>>>         {
>>>>             // Assign the field; declaring a local here would shadow it
>>>>             // and leave the field null at the return below.
>>>>             saxhdlr = new XMLDocumentHandlerSAX(cmso, f);
>>>>
>>>>             in = new ByteArrayInputStream(f.getContents());
>>>>             is = new InputSource(in);
>>>>
>>>>             //in = (InputStream)(new ByteArrayInputStream(f.getContents()));
>>>>             //is = new InputSource(in);
>>>>
>>>>             //is = new InputSource (new StringReader (xmlText));
>>>>
>>>>             xr = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
>>>>             xr.setContentHandler(saxhdlr);
>>>>             xr.setFeature("http://xml.org/sax/features/validation", false);
>>>>             xr.setFeature("http://apache.org/xml/features/continue-after-fatal-error", true);
>>>>             xr.parse(is);
>>>>         }
>>>>         catch (Exception e)
>>>>         {
>>>>             throw new CmsException(e.getMessage(), e.getCause());
>>>>         }
>>>>         return saxhdlr.getDocument();
>>>>     }
>>>>
>>>>     public Document Document(CmsObject cmso, CmsFile f, HashMap h) 
>>>> throws CmsException
>>>>     {
>>>>         return Document(cmso,f);
>>>>     }
>>>> }
>>>>
>>>>
>>>> It seems the handler class returns what it should, so it is either 
>>>> the XMLDocument class or the indexer which is complaining. Should I 
>>>> send you the two src files? They're about as complete as they are 
>>>> gonna get...
>>>>
>>>> Cheers
>>>>
>>>> Alex
>>>>
>>>>
>>>>> From: M Butcher <mbutcher at grcomputing.net>
>>>>> Reply-To: opencms-dev at opencms.org
>>>>> To: opencms-dev at opencms.org
>>>>> Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but 
>>>>> getting error - EOF
>>>>> Date: Mon, 15 Mar 2004 13:53:17 -0700
>>>>>
>>>>> What is throwing the exception, the XML parser or the indexer? Last 
>>>>> week, I was working on my XSLT code and created some code that 
>>>>> looks almost exactly like yours (except I created a Transformer 
>>>>> instead of an XMLReader) and it worked fine -- perhaps the problem 
>>>>> is in whatever gets handed to the IndexManager.
>>>>>
>>>>> Matt
>>>>>
>>>>> Alex ! wrote:
>>>>>
>>>>>> OK, so I think I'm almost done, but now when the cron runs (yes, it 
>>>>>> has mysteriously begun working!), I get the following error, for a 
>>>>>> premature end of file. Any ideas? The way I am retrieving the file 
>>>>>> contents is as follows:
>>>>>>
>>>>>>             in = new ByteArrayInputStream(f.getContents());
>>>>>>             is = new InputSource(in);
>>>>>>             xr.parse(is);
>>>>>>
>>>>>> where:
>>>>>>     private XMLReader xr
>>>>>>     private InputStream in
>>>>>>     private InputSource is
>>>>>>
>>>>>>
>>>>>> Error output from OCMS log:
>>>>>>
>>>>>> [13.03.2004 06:58:10] <opencms_cronscheduler> Starting job for com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>>>> [13.03.2004 06:58:10] <opencms_info> =====IndexManager=============================================================
>>>>>> [13.03.2004 06:58:10] <opencms_info> Analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>>> [13.03.2004 06:58:10] <opencms_info> Extension map exists to handle XML
>>>>>> [13.03.2004 06:58:10] <opencms_info> Page DocumentFactory loaded
>>>>>> [13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/
>>>>>> [13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error processing file test_xml.xml: com.opencms.core.CmsException: 0 Unknown exception. Detailed error: Premature end of file..
>>>>>> [13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/xml/
>>>>>> [13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error processing file article5.xml: com.opencms.core.CmsException: 0 Unknown exception. Detailed error: Premature end of file..
>>>>>> [13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error processing file article7.xml: com.opencms.core.CmsException: 0 Unknown exception. Detailed error: Premature end of file..
>>>>>> [13.03.2004 06:58:10] <opencms_info> IndexManager: 4 documents are being processed
>>>>>> [13.03.2004 06:58:10] <opencms_info> IndexManager:  Index has been optimized.
>>>>>> [13.03.2004 06:58:10] <opencms_info> Done
>>>>>> =====IndexManager=============================================================
>>>>>> [13.03.2004 06:58:10] <opencms_cronscheduler> Successful launch of job com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>>>> Message: CronIndexManager rebuilt the Lucene index on Sat Mar 13 06:58:10 GMT 2004
>>>>>>
>>>>>>
>>>>>> Thanks alex
>>>>>>
>>>>>>
>>>>>>> From: M Butcher <mbutcher at grcomputing.net>
>>>>>>> Reply-To: opencms-dev at opencms.org
>>>>>>> To: opencms-dev at opencms.org
>>>>>>> Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene 
>>>>>>> but getting error
>>>>>>> Date: Mon, 08 Mar 2004 10:03:43 -0700
>>>>>>>
>>>>>>>
>>>>>>> Alex,
>>>>>>>
>>>>>>> I can't tell, from the stack trace, what is going on. Judging 
>>>>>>> from where the exception is located, it looks like a problem with 
>>>>>>> content defs... but that doesn't make sense....
>>>>>>>
>>>>>>> When you finish it, please do send it to Stephan and me. It sounds 
>>>>>>> like a very useful addition to the existing indexing tools.
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>> Alex ! wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This one's probably for Matt/Stefan.
>>>>>>>>
>>>>>>>> I have written an XML Indexer for the Lucene module (almost 
>>>>>>>> finished), which will basically take an XML file, parse it, and 
>>>>>>>> then add its elements and their contents to the Lucene index, 
>>>>>>>> instead of stripping the element tags and then including the 
>>>>>>>> remaining content as a single searchable body (as is currently 
>>>>>>>> available).
>>>>>>>>
>>>>>>>> Everything is now compiled (into a separate jar, just 2 class 
>>>>>>>> files), the cron job runs but gives the following error:
>>>>>>>>
>>>>>>>> [07.03.2004 14:20:10] <opencms_cronscheduler> Starting job for com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/uk_lucene_registry.xml}
>>>>>>>> [07.03.2004 14:20:10] <opencms_info> =====IndexManager=============================================================
>>>>>>>> [07.03.2004 14:20:10] <opencms_info> Analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>>>>> [07.03.2004 14:20:10] <opencms_info> Extension map exists to handle XML
>>>>>>>> [07.03.2004 14:20:10] <opencms_info> Page DocumentFactory loaded
>>>>>>>> [07.03.2004 14:20:10] <opencms_info> IndexManager: indexing /test/
>>>>>>>> [07.03.2004 14:20:11] <opencms_info> Created XMLDocumentHandlerSAX
>>>>>>>> [07.03.2004 14:20:11] <opencms_info> Return Document
>>>>>>>> [07.03.2004 14:20:11] <opencms_cronscheduler> Error running job for com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>>>>>> Error: java.lang.NullPointerException
>>>>>>>>     at org.apache.lucene.index.FieldInfos.add(FieldInfos.java:90)
>>>>>>>>     at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:92)
>>>>>>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
>>>>>>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
>>>>>>>>     at net.grcomputing.opencms.search.lucene.IndexManager.processFile(Unknown Source)
>>>>>>>>     at net.grcomputing.opencms.search.lucene.IndexManager.processDir(Unknown Source)
>>>>>>>>     at net.grcomputing.opencms.search.lucene.IndexManager.doIndex(Unknown Source)
>>>>>>>>     at net.grcomputing.opencms.search.lucene.CronIndexManager.launch(Unknown Source)
>>>>>>>>     at com.opencms.core.CmsCronScheduleJob.run(CmsCronScheduleJob.java:68)
>>>>>>>>
>>>>>>>>
>>>>>>>> My registry entry for the XML files looks like this (contained in 
>>>>>>>> an external registry file):
>>>>>>>>
>>>>>>>>       <!-- For XML Files :) -->
>>>>>>>>       <docFactory enabled="true" type="plain">
>>>>>>>>          <fileType name="XML">
>>>>>>>>            <extension>.xml</extension>
>>>>>>>>            <class>com.mydomain.opencms.lucene.xmlindexing.XMLDocument</class>
>>>>>>>>          </fileType>
>>>>>>>>       </docFactory>
>>>>>>>>
>>>>>>>> Your help would be much appreciated.
>>>>>>>>
>>>>>>>> (should I send you the source to correct and include in your 
>>>>>>>> next patch/update?)
>>>>>>>>
>>>>>>>> Many Thanks
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> This mail is send to you from the opencms-dev mailing list
>>>>>>>> To change your list options, or to unsubscribe from the list, 
>>>>>>>> please visit
>>>>>>>> http://mail.opencms.org/mailman/listinfo/opencms-dev



