[opencms-dev] Developed an XML Indexer for Lucene but getting error - EOF
M Butcher
mbutcher at grcomputing.net
Wed Mar 17 00:08:01 CET 2004
Hmm... it is possible that Lucene throws an EOF error for Document
objects... but the more I think about it, the less likely I think that
is. More likely, I would expect a SAX parser to throw such an exception
if either the XML document wasn't well formed or the root element was
missing a closing >.
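
For what it's worth, here is a minimal sketch of how that guess could be checked outside OpenCms. It assumes the JDK's built-in JAXP parser rather than the standalone Xerces class named elsewhere in this thread, and the class SaxEofProbe with its probe and bytes helpers is made up for illustration. Empty input is one case I would expect to fail this way: current JDK parsers typically report it as "Premature end of file.", though the exact wording depends on the parser version.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of OpenCms or the Lucene module.
class SaxEofProbe {

    // Returns null if the bytes parse as well-formed XML, otherwise a
    // description of the first fatal parse error with its position.
    static String probe(byte[] xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new ByteArrayInputStream(xml)),
                new DefaultHandler()); // DefaultHandler rethrows fatal errors
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column "
                + e.getColumnNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems are not what we are probing for.
            throw new IllegalStateException("parser could not be set up or run", e);
        }
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("well-formed:   " + probe(bytes("<a><b>x</b></a>")));
        System.out.println("empty input:   " + probe(new byte[0]));
        System.out.println("unclosed root: " + probe(bytes("<a><b>x</b>")));
    }
}
```

If the same file passes a check like this from a JSP but fails under the cron job, comparing the length of f.getContents() in the two situations would be the next thing I would look at.
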
If it's not that, you'll probably have to modify the exception code in
IndexManager and see if you can find the exact location of the error.
Matt
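
The kind of change to the exception code suggested above might look like the sketch below. IndexingException is a hypothetical stand-in for CmsException so that the example compiles on its own, and ParseDiagnostics and its parse method are invented names, not part of the Lucene module. The idea is to pass the caught exception itself as the cause (rather than e.getCause(), which is often null) and to record the resource name, the byte count, and the parser's line and column.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; names are illustrative only.
class ParseDiagnostics {

    // Stand-in for CmsException so the example is self-contained.
    static class IndexingException extends Exception {
        IndexingException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Parses one resource and, on failure, keeps the original exception
    // as the cause so its type and stack trace are not lost.
    static void parse(String resourceName, byte[] contents) throws IndexingException {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new ByteArrayInputStream(contents)),
                new DefaultHandler());
        } catch (SAXParseException e) {
            String where = resourceName + " (" + contents.length + " bytes), line "
                + e.getLineNumber() + ", column " + e.getColumnNumber();
            throw new IndexingException(
                "XML parse failed in " + where + ": " + e.getMessage(), e);
        } catch (Exception e) {
            throw new IndexingException("Could not parse " + resourceName, e);
        }
    }

    public static void main(String[] args) {
        try {
            parse("test_xml.xml", new byte[0]);
        } catch (IndexingException e) {
            // Prints the wrapper and the parser's own frames via "Caused by".
            e.printStackTrace();
        }
    }
}
```

A byte count of 0 in the message would point at whatever hands the file to the factory rather than at the XML itself.
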
Alex ! wrote:
> I only have 3 XML files in the test dir I'm trying to index. One of those
> files I am using in my JSP, and it works fine. See the code snippet below.
> The XMLDocument constructor throws no exceptions, nor does
> XMLDocument.Document(cmso,f).
>
> It's gotta be IndexManager. Maybe the document I am producing is not what
> it is expecting? But then why the EOF?
>
> Inside my jsp:
>
>
> <%
>     CmsJspActionElement cmsJspAE =
>         new CmsJspActionElement(pageContext, request, response);
>     CmsObject cmso = cmsJspAE.getCmsObject();
>     CmsFile f = cmso.readFile("/test/test_xml.xml");
>
>     String thepath = f.getAbsolutePath();
>     out.println("<br>" + thepath + "<br><br>");
>
>     XMLDocument xmldoc = null;
>     Document thisdoc = null;
>
>     try
>     {
>         xmldoc = new XMLDocument();
>         out.println("<br>" + xmldoc.getFactoryName() + "<br>");
>         thisdoc = xmldoc.Document(cmso, f);
>     }
>     catch (Exception e)
>     {
>         // Wrap e itself (not e.getCause()) so the original stack trace is kept
>         throw new CmsException(e.getMessage(), e);
>     }
>     String outdoc = thisdoc.toString();
>     out.println("Lucene Document: <br><br>" + outdoc);
> %>
>
>
>> From: M Butcher <mbutcher at grcomputing.net>
>> Reply-To: opencms-dev at opencms.org
>> To: opencms-dev at opencms.org
>> Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but
>> getting error - EOF
>> Date: Tue, 16 Mar 2004 11:09:57 -0700
>>
>> Alex ! wrote:
>>
>>> OK, Matt. So I had some input from my colleague, changed the
>>> XMLDocument class (seems it wasn't done in the best way!) and have now
>>> tried calling XMLDocument.Document(cmso,f) directly from a JSP - and
>>> it works: it returns a Lucene Document, which I test by outputting it
>>> to screen using the Document.toString() method as before.
>>>
>>> But... the cron still returns the same premature end of file exception.
>>
>>
>> On the same document? Do you know what is throwing the exception? Is
>> it the XMLDocument constructor or the IndexManager?
>>
>>>
>>>
>>> Alex
>>>
>>>
>>>> From: "Alex !" <kingofkingston at hotmail.com>
>>>> Reply-To: opencms-dev at opencms.org
>>>> To: opencms-dev at opencms.org
>>>> Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but
>>>> getting error - EOF
>>>> Date: Mon, 15 Mar 2004 22:28:34 +0000
>>>>
>>>> It seems to be the indexer. I have a class XMLDocument (implements
>>>> I_FileDocumentFactory), which is based on bodyless document. Here I
>>>> set up the XMLReader and instantiate a XMLDocumentHandlerSAX class
>>>> (extends DefaultHandler).
>>>>
>>>> After some thorough debugging and testing, it seems to be the indexer, as I
>>>> can call the XMLDocumentHandlerSAX from within a JSP and it works,
>>>> returning a Lucene Document that I then print to screen using
>>>> Document.toString(). It all looks OK, although I haven't tried
>>>> indexing it myself (I was counting on the module doing this).
>>>>
>>>> Could it be the XMLDocument class? Here is what it looks like:
>>>>
>>>> // OpenCms and Lucene-module imports (CmsObject, CmsFile, CmsException,
>>>> // I_FileDocumentFactory) omitted.
>>>> import java.io.ByteArrayInputStream;
>>>> import java.io.InputStream;
>>>> import java.util.HashMap;
>>>> import org.apache.lucene.document.Document;
>>>> import org.xml.sax.InputSource;
>>>> import org.xml.sax.XMLReader;
>>>> import org.xml.sax.helpers.XMLReaderFactory;
>>>>
>>>> public class XMLDocument implements I_FileDocumentFactory
>>>> {
>>>>     public static String FACTORY_NAME = "XML DocumentFactory";
>>>>     private XMLDocumentHandlerSAX saxhdlr = null;
>>>>     private XMLReader xr = null;
>>>>     private InputStream in = null;
>>>>     private InputSource is = null;
>>>>
>>>>     public XMLDocument() { }
>>>>
>>>>     public String getFactoryName() {
>>>>         return FACTORY_NAME;
>>>>     }
>>>>
>>>>     public Document Document(CmsObject cmso, CmsFile f) throws CmsException
>>>>     {
>>>>         try
>>>>         {
>>>>             // Assign the field. Declaring a new local here would shadow
>>>>             // it and leave the field null for the return below.
>>>>             saxhdlr = new XMLDocumentHandlerSAX(cmso, f);
>>>>
>>>>             in = new ByteArrayInputStream(f.getContents());
>>>>             is = new InputSource(in);
>>>>
>>>>             //is = new InputSource (new StringReader (xmlText));
>>>>
>>>>             xr = XMLReaderFactory.createXMLReader(
>>>>                 "org.apache.xerces.parsers.SAXParser");
>>>>             xr.setContentHandler(saxhdlr);
>>>>             xr.setFeature("http://xml.org/sax/features/validation", false);
>>>>             xr.setFeature(
>>>>                 "http://apache.org/xml/features/continue-after-fatal-error", true);
>>>>             xr.parse(is);
>>>>         }
>>>>         catch (Exception e)
>>>>         {
>>>>             // Pass e itself as the cause; e.getCause() is often null
>>>>             throw new CmsException(e.getMessage(), e);
>>>>         }
>>>>         return saxhdlr.getDocument();
>>>>     }
>>>>
>>>>     public Document Document(CmsObject cmso, CmsFile f, HashMap h)
>>>>         throws CmsException
>>>>     {
>>>>         return Document(cmso, f);
>>>>     }
>>>> }
>>>>
>>>>
>>>> It seems the handler class returns what it should, so it is either
>>>> the XMLDocument class or the indexer which is complaining. Should I
>>>> send you the two src files? They're about as complete as they are
>>>> gonna get...
>>>>
>>>> Cheers
>>>>
>>>> Alex
>>>>
>>>>
>>>>> From: M Butcher <mbutcher at grcomputing.net>
>>>>> Reply-To: opencms-dev at opencms.org
>>>>> To: opencms-dev at opencms.org
>>>>> Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but
>>>>> getting error - EOF
>>>>> Date: Mon, 15 Mar 2004 13:53:17 -0700
>>>>>
>>>>> What is throwing the exception, the XML parser or the indexer? Last
>>>>> week, I was working on my XSLT code and created some code that
>>>>> looks almost exactly like yours (except I created a Transformer
>>>>> instead of an XMLReader) and it worked fine -- perhaps the problem
>>>>> is in whatever gets handed to the IndexManager.
>>>>>
>>>>> Matt
>>>>>
>>>>> Alex ! wrote:
>>>>>
>>>>>> OK, so I think I'm almost done, but now when the cron runs (yes, it
>>>>>> has mysteriously begun working!), I get the following error for a
>>>>>> premature end of file. Any ideas? The way I am retrieving the file
>>>>>> contents is as follows:
>>>>>>
>>>>>> in = new ByteArrayInputStream(f.getContents());
>>>>>> is = new InputSource(in);
>>>>>> xr.parse(is);
>>>>>>
>>>>>> where: private XMLReader xr
>>>>>> private InputStream in
>>>>>> private InputSource is
>>>>>>
>>>>>>
>>>>>> Error output from the OpenCms log:
>>>>>>
>>>>>> [13.03.2004 06:58:10] <opencms_cronscheduler> Starting job for
>>>>>> com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators
>>>>>> net.grcomputing.opencms.search.lucene.CronIndexManager
>>>>>> createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>>>>
>>>>>>
>>>>>>
>>>>>> [13.03.2004 06:58:10] <opencms_info>
>>>>>> =====IndexManager=============================================================
>>>>>>
>>>>>>
>>>>>>
>>>>>> [13.03.2004 06:58:10] <opencms_info> Analyzer:
>>>>>> org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>>> [13.03.2004 06:58:10] <opencms_info> Extension map exists to
>>>>>> handle XML
>>>>>> [13.03.2004 06:58:10] <opencms_info> Page DocumentFactory loaded
>>>>>> [13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/
>>>>>> [13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
>>>>>> processing file test_xml.xml: com.opencms.core.CmsException: 0
>>>>>> Unknown exception. Detailed error: Premature end of file..
>>>>>> [13.03.2004 06:58:10] <opencms_info> IndexManager: indexing
>>>>>> /test/xml/
>>>>>> [13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
>>>>>> processing file article5.xml: com.opencms.core.CmsException: 0
>>>>>> Unknown exception. Detailed error: Premature end of file..
>>>>>> [13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
>>>>>> processing file article7.xml: com.opencms.core.CmsException: 0
>>>>>> Unknown exception. Detailed error: Premature end of file..
>>>>>> [13.03.2004 06:58:10] <opencms_info> IndexManager: 4 documents are
>>>>>> being processed
>>>>>> [13.03.2004 06:58:10] <opencms_info> IndexManager: Index has been
>>>>>> optimized.
>>>>>> [13.03.2004 06:58:10] <opencms_info> Done
>>>>>> =====IndexManager=============================================================
>>>>>>
>>>>>>
>>>>>>
>>>>>> [13.03.2004 06:58:10] <opencms_cronscheduler> Successful launch of
>>>>>> job com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators
>>>>>> net.grcomputing.opencms.search.lucene.CronIndexManager
>>>>>> createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>>>> Message: CronIndexManager rebuilt the Lucene index on Sat Mar 13
>>>>>> 06:58:10 GMT 2004
>>>>>>
>>>>>>
>>>>>> Thanks alex
>>>>>>
>>>>>>
>>>>>>> From: M Butcher <mbutcher at grcomputing.net>
>>>>>>> Reply-To: opencms-dev at opencms.org
>>>>>>> To: opencms-dev at opencms.org
>>>>>>> Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene
>>>>>>> but getting error
>>>>>>> Date: Mon, 08 Mar 2004 10:03:43 -0700
>>>>>>>
>>>>>>>
>>>>>>> Alex,
>>>>>>>
>>>>>>> I can't tell, from the stack trace, what is going on. Judging
>>>>>>> from where the exception is located, it looks like a problem with
>>>>>>> content defs... but that doesn't make sense....
>>>>>>>
>>>>>>> When you finish it, please do send it to Stephan and me. It sounds
>>>>>>> like a very useful addition to the existing indexing tools.
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>> Alex ! wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This one's probably for Matt/Stefan.
>>>>>>>>
>>>>>>>> I have written an XML indexer for the Lucene module (almost
>>>>>>>> finished), which will basically take an XML file, parse it, and
>>>>>>>> then add its elements and their contents to the Lucene index,
>>>>>>>> instead of stripping the element tags and then including the
>>>>>>>> remaining content as a single searchable body (as is currently
>>>>>>>> available).
>>>>>>>>
>>>>>>>> Everything is now compiled (into a separate jar, just 2 class
>>>>>>>> files), and the cron job runs but gives the following error:
>>>>>>>>
>>>>>>>> [07.03.2004 14:20:10] <opencms_cronscheduler> Starting job for
>>>>>>>> com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators
>>>>>>>> net.grcomputing.opencms.search.lucene.CronIndexManager
>>>>>>>> createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/uk_lucene_registry.xml}
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> [07.03.2004 14:20:10] <opencms_info>
>>>>>>>> =====IndexManager=============================================================
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> [07.03.2004 14:20:10] <opencms_info> Analyzer:
>>>>>>>> org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>>>>> [07.03.2004 14:20:10] <opencms_info> Extension map exists to
>>>>>>>> handle XML
>>>>>>>> [07.03.2004 14:20:10] <opencms_info> Page DocumentFactory loaded
>>>>>>>> [07.03.2004 14:20:10] <opencms_info> IndexManager: indexing /test/
>>>>>>>> [07.03.2004 14:20:11] <opencms_info> Created XMLDocumentHandlerSAX
>>>>>>>> [07.03.2004 14:20:11] <opencms_info> Return Document
>>>>>>>> [07.03.2004 14:20:11] <opencms_cronscheduler> Error running job
>>>>>>>> for com.opencms.core.CmsCronEntry{20 14 * * * admin
>>>>>>>> Administrators
>>>>>>>> net.grcomputing.opencms.search.lucene.CronIndexManager
>>>>>>>> createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>>>>>> Error: java.lang.NullPointerException
>>>>>>>>     at org.apache.lucene.index.FieldInfos.add(FieldInfos.java:90)
>>>>>>>>     at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:92)
>>>>>>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
>>>>>>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
>>>>>>>>     at net.grcomputing.opencms.search.lucene.IndexManager.processFile(Unknown Source)
>>>>>>>>     at net.grcomputing.opencms.search.lucene.IndexManager.processDir(Unknown Source)
>>>>>>>>     at net.grcomputing.opencms.search.lucene.IndexManager.doIndex(Unknown Source)
>>>>>>>>     at net.grcomputing.opencms.search.lucene.CronIndexManager.launch(Unknown Source)
>>>>>>>>     at com.opencms.core.CmsCronScheduleJob.run(CmsCronScheduleJob.java:68)
>>>>>>>>
>>>>>>>>
>>>>>>>> My registry entry for the XML files looks like this (contained in
>>>>>>>> an external registry file):
>>>>>>>>
>>>>>>>> <!-- For XML Files :) -->
>>>>>>>> <docFactory enabled="true" type="plain">
>>>>>>>>     <fileType name="XML">
>>>>>>>>         <extension>.xml</extension>
>>>>>>>>         <class>com.mydomain.opencms.lucene.xmlindexing.XMLDocument</class>
>>>>>>>>     </fileType>
>>>>>>>> </docFactory>
>>>>>>>>
>>>>>>>> Your help would be much appreciated.
>>>>>>>>
>>>>>>>> (should I send you the source to correct and include in your
>>>>>>>> next patch/update?)
>>>>>>>>
>>>>>>>> Many Thanks
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> This mail is send to you from the opencms-dev mailing list
>>>>>>>> To change your list options, or to unsubscribe from the list,
>>>>>>>> please visit
>>>>>>>> http://mail.opencms.org/mailman/listinfo/opencms-dev