[opencms-dev] Developed an XML Indexer for Lucene but getting error - EOF

Hartmann, Waehrisch & Feykes GmbH hartmann at waehrisch-feykes.de
Wed Mar 17 08:24:02 CET 2004


Alex,

It seems that f.getContents() returns no content when the indexer runs, so
the parser reaches the end of the input immediately.
You can try rereading the file to get its content:
cmso.readFile(f.getAbsolutePath()).getContents()
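
A minimal, self-contained sketch of why empty content leads to this error. It
uses only the JDK's built-in SAX parser, not OpenCms. The class name
EmptyContentDemo and the helpers isBlankContent and parses are hypothetical,
introduced only for illustration. In the real document factory, the reread
would be the cmso.readFile(...) call shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EmptyContentDemo {

    /** True when there is nothing to parse (null or zero-length content). */
    public static boolean isBlankContent(byte[] content) {
        return content == null || content.length == 0;
    }

    /**
     * Returns true if the bytes parse as well-formed XML and false if the
     * SAX parser rejects them. Content must not be null.
     */
    public static boolean parses(byte[] content) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new ByteArrayInputStream(content)),
                           new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // An empty input stream ends up here: the parser hits the end of
            // the input before it has seen a root element.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        byte[] xml = "<article><title>Test</title></article>"
                .getBytes(StandardCharsets.UTF_8);

        System.out.println("empty is blank: " + isBlankContent(empty));
        System.out.println("empty parses:   " + parses(empty));
        System.out.println("xml parses:     " + parses(xml));
    }
}
```

In the factory, a check like isBlankContent(f.getContents()) could decide
whether to fall back to the reread before the bytes are handed to the parser.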

Bye,

Stephan



----- Original Message ----- 
From: "Alex !" <kingofkingston at hotmail.com>
To: <opencms-dev at opencms.org>
Sent: Tuesday, March 16, 2004 9:39 PM
Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but getting error - EOF


> I only have 3 XML files in the test dir I'm trying to index. One of those
> files I am using in my JSP, and it works fine. See the code snippet below.
> The XMLDocument constructor throws no exceptions, nor does
> XMLDocument.Document(cmso,f).
>
> It's gotta be IndexManager. Maybe the document I am producing is not what it
> is expecting? But then why the EOF?
>
> Inside my jsp:
>
>
> <%
>     CmsJspActionElement cmsJspAE = new CmsJspActionElement(pageContext, request, response);
>     CmsObject cmso = cmsJspAE.getCmsObject();
>     CmsFile f = cmso.readFile("/test/test_xml.xml");
>
>     String thepath = f.getAbsolutePath();
>     out.println("<br>"+thepath+"<br><br>");
>
>     XMLDocument xmldoc = null;
>     Document thisdoc = null;
>
>     try
>     {
>         xmldoc = new XMLDocument();
>         out.println("<br>"+xmldoc.getFactoryName()+"<br>");
>         thisdoc = xmldoc.Document(cmso, f);
>     }
>     catch (Exception e)
>     {
>         throw new CmsException(e.getMessage(), e.getCause());
>     }
>     String outdoc = thisdoc.toString();
>     out.println("Lucene Document: <br><br>" + outdoc);
> %>
>
>
> >From: M Butcher <mbutcher at grcomputing.net>
> >Reply-To: opencms-dev at opencms.org
> >To: opencms-dev at opencms.org
> >Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but getting error - EOF
> >Date: Tue, 16 Mar 2004 11:09:57 -0700
> >
> >Alex ! wrote:
> >>OK, Matt. So I had some input from my colleague, changed the XMLDocument
> >>class (seems it wasn't done in the best way!) and now tried calling
> >>XMLDocument.Document(cmso,f) directly from a JSP - and it works: it returns a
> >>Lucene document, which I test by outputting to screen using the
> >>Document.toString() method as before.
> >>
> >>But... the cron still returns the same premature end of file exception.
> >
> >On the same document? Do you know what is throwing the exception? Is it the
> >XMLDocument constructor or the IndexManager?
> >
> >>
> >>
> >>Alex
> >>
> >>
> >>>From: "Alex !" <kingofkingston at hotmail.com>
> >>>Reply-To: opencms-dev at opencms.org
> >>>To: opencms-dev at opencms.org
> >>>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but
> >>>getting error - EOF
> >>>Date: Mon, 15 Mar 2004 22:28:34 +0000
> >>>
> >>>It seems to be the indexer. I have a class XMLDocument (implements
> >>>I_FileDocumentFactory), which is based on bodyless document. Here I set
> >>>up the XMLReader and instantiate an XMLDocumentHandlerSAX class (extends
> >>>DefaultHandler).
> >>>
> >>>After some thorough debugging and testing, it seems to be the indexer, as I can
> >>>call the XMLDocumentHandlerSAX from within a JSP and it works, returning
> >>>a Lucene Document that I then print to screen using Document.toString().
> >>>It all looks OK, although I haven't tried indexing it myself (I was
> >>>counting on the module doing this).
> >>>
> >>>Could it be the XMLDocument class? Here is what it looks like:
> >>>
> >>>public class XMLDocument implements I_FileDocumentFactory
> >>>{
> >>>     public static String FACTORY_NAME = "XML DocumentFactory";
> >>>     private XMLDocumentHandlerSAX saxhdlr = null;
> >>>     private XMLReader xr = null;
> >>>     private InputStream in = null;
> >>>     private InputSource is = null;
> >>>
> >>>     public XMLDocument() { }
> >>>
> >>>     public String getFactoryName() {
> >>>        return FACTORY_NAME;
> >>>     }
> >>>
> >>>     public Document Document(CmsObject cmso, CmsFile f) throws CmsException
> >>>     {
> >>>         try
> >>>         {
> >>>             XMLDocumentHandlerSAX saxhdlr = new XMLDocumentHandlerSAX(cmso, f);
> >>>
> >>>             in = new ByteArrayInputStream(f.getContents());
> >>>             is = new InputSource(in);
> >>>
> >>>             //in = (InputStream)(new ByteArrayInputStream(f.getContents()));
> >>>             //is = new InputSource(in);
> >>>
> >>>             //is = new InputSource (new StringReader (xmlText));
> >>>
> >>>             xr = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
> >>>             xr.setContentHandler(saxhdlr);
> >>>             xr.setFeature("http://xml.org/sax/features/validation", false);
> >>>             xr.setFeature("http://apache.org/xml/features/continue-after-fatal-error", true);
> >>>             xr.parse(is);
> >>>
> >>>         }
> >>>         catch (Exception e)
> >>>         {
> >>>             throw new CmsException(e.getMessage(), e.getCause());
> >>>         }
> >>>         return saxhdlr.getDocument();
> >>>     }
> >>>
> >>>     public Document Document(CmsObject cmso, CmsFile f, HashMap h) throws CmsException
> >>>     {
> >>>         return Document(cmso,f);
> >>>     }
> >>>}
> >>>
> >>>
> >>>It seems the handler class returns what it should, so it is either the
> >>>XMLDocument class or the indexer which is complaining. Should I send you
> >>>the two src files? They're about as complete as they are gonna get...
> >>>
> >>>Cheers
> >>>
> >>>Alex
> >>>
> >>>
> >>>>From: M Butcher <mbutcher at grcomputing.net>
> >>>>Reply-To: opencms-dev at opencms.org
> >>>>To: opencms-dev at opencms.org
> >>>>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but
> >>>>getting error - EOF
> >>>>Date: Mon, 15 Mar 2004 13:53:17 -0700
> >>>>
> >>>>What is throwing the exception, the XML parser or the indexer? Last
> >>>>week, I was working on my XSLT code and created some code that looks
> >>>>almost exactly like yours (except I created a Transformer instead of an
> >>>>XMLReader) and it worked fine -- perhaps the problem is in whatever gets
> >>>>handed to the IndexManager.
> >>>>
> >>>>Matt
> >>>>
> >>>>Alex ! wrote:
> >>>>
> >>>>>OK, so I think I'm almost done, but now when the cron runs (yes, it has
> >>>>>mysteriously begun working!), I get the following error, for a
> >>>>>premature end of file. Any ideas? The way I am retrieving the file
> >>>>>contents is as follows:
> >>>>>
> >>>>>             in = new ByteArrayInputStream(f.getContents());
> >>>>>             is = new InputSource(in);
> >>>>>             xr.parse(is);
> >>>>>
> >>>>>where:
> >>>>>     private XMLReader xr
> >>>>>     private InputStream in
> >>>>>     private InputSource is
> >>>>>
> >>>>>
> >>>>>Error output from OCMS log:
> >>>>>
> >>>>>[13.03.2004 06:58:10] <opencms_cronscheduler> Starting job for
> >>>>>com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators
> >>>>>net.grcomputing.opencms.search.lucene.CronIndexManager
> >>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
> >>>>>
> >>>>>
> >>>>>[13.03.2004 06:58:10] <opencms_info>
> >>>>>=====IndexManager=============================================================
> >>>>>
> >>>>>
> >>>>>[13.03.2004 06:58:10] <opencms_info> Analyzer:
> >>>>>org.apache.lucene.analysis.standard.StandardAnalyzer
> >>>>>[13.03.2004 06:58:10] <opencms_info> Extension map exists to handle XML
> >>>>>[13.03.2004 06:58:10] <opencms_info> Page DocumentFactory loaded
> >>>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/
> >>>>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
> >>>>>processing file test_xml.xml: com.opencms.core.CmsException: 0 Unknown
> >>>>>exception. Detailed error: Premature end of file..
> >>>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/xml/
> >>>>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
> >>>>>processing file article5.xml: com.opencms.core.CmsException: 0 Unknown
> >>>>>exception. Detailed error: Premature end of file..
> >>>>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
> >>>>>processing file article7.xml: com.opencms.core.CmsException: 0 Unknown
> >>>>>exception. Detailed error: Premature end of file..
> >>>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: 4 documents are
> >>>>>being processed
> >>>>>[13.03.2004 06:58:10] <opencms_info> IndexManager:  Index has been
> >>>>>optimized.
> >>>>>[13.03.2004 06:58:10] <opencms_info> Done
> >>>>>=====IndexManager=============================================================
> >>>>>
> >>>>>
> >>>>>[13.03.2004 06:58:10] <opencms_cronscheduler> Successful launch of job
> >>>>>com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators
> >>>>>net.grcomputing.opencms.search.lucene.CronIndexManager
> >>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
> >>>>>Message: CronIndexManager rebuilt the Lucene index on Sat Mar 13
> >>>>>06:58:10 GMT 2004
> >>>>>
> >>>>>
> >>>>>Thanks alex
> >>>>>
> >>>>>
> >>>>>>From: M Butcher <mbutcher at grcomputing.net>
> >>>>>>Reply-To: opencms-dev at opencms.org
> >>>>>>To: opencms-dev at opencms.org
> >>>>>>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but
> >>>>>>getting error
> >>>>>>Date: Mon, 08 Mar 2004 10:03:43 -0700
> >>>>>>
> >>>>>>
> >>>>>>Alex,
> >>>>>>
> >>>>>>I can't tell, from the stack trace, what is going on. Judging from
> >>>>>>where the exception is located, it looks like a problem with content
> >>>>>>defs... but that doesn't make sense....
> >>>>>>
> >>>>>>When you finish it, please do send it to Stephan and me. It sounds like
> >>>>>>a very useful addition to the existing indexing tools.
> >>>>>>
> >>>>>>Matt
> >>>>>>
> >>>>>>Alex ! wrote:
> >>>>>>
> >>>>>>>Hi,
> >>>>>>>
> >>>>>>>This one's probably for Matt/Stephan.
> >>>>>>>
> >>>>>>>I have written an XML Indexer for the Lucene module (almost
> >>>>>>>finished), which will basically take an XML file, parse it, and then
> >>>>>>>add its elements and their contents to the Lucene index, instead of
> >>>>>>>stripping the element tags and then including the remaining content as
> >>>>>>>a single searchable body (as is currently available).
> >>>>>>>
> >>>>>>>Everything is now compiled (into a separate jar, just 2 class files),
> >>>>>>>the cron job runs but gives the following error:
> >>>>>>>
> >>>>>>>[07.03.2004 14:20:10] <opencms_cronscheduler> Starting job for
> >>>>>>>com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators
> >>>>>>>net.grcomputing.opencms.search.lucene.CronIndexManager
> >>>>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/uk_lucene_registry.xml}
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>[07.03.2004 14:20:10] <opencms_info>
> >>>>>>>=====IndexManager=============================================================
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>[07.03.2004 14:20:10] <opencms_info> Analyzer:
> >>>>>>>org.apache.lucene.analysis.standard.StandardAnalyzer
> >>>>>>>[07.03.2004 14:20:10] <opencms_info> Extension map exists to handle
> >>>>>>>XML
> >>>>>>>[07.03.2004 14:20:10] <opencms_info> Page DocumentFactory loaded
> >>>>>>>[07.03.2004 14:20:10] <opencms_info> IndexManager: indexing /test/
> >>>>>>>[07.03.2004 14:20:11] <opencms_info> Created XMLDocumentHandlerSAX
> >>>>>>>[07.03.2004 14:20:11] <opencms_info> Return Document
> >>>>>>>[07.03.2004 14:20:11] <opencms_cronscheduler> Error running job for
> >>>>>>>com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators
> >>>>>>>net.grcomputing.opencms.search.lucene.CronIndexManager
> >>>>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
> >>>>>>>Error: java.lang.NullPointerException
> >>>>>>>     at org.apache.lucene.index.FieldInfos.add(FieldInfos.java:90)
> >>>>>>>     at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:92)
> >>>>>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
> >>>>>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
> >>>>>>>     at net.grcomputing.opencms.search.lucene.IndexManager.processFile(Unknown Source)
> >>>>>>>     at net.grcomputing.opencms.search.lucene.IndexManager.processDir(Unknown Source)
> >>>>>>>     at net.grcomputing.opencms.search.lucene.IndexManager.doIndex(Unknown Source)
> >>>>>>>     at net.grcomputing.opencms.search.lucene.CronIndexManager.launch(Unknown Source)
> >>>>>>>     at com.opencms.core.CmsCronScheduleJob.run(CmsCronScheduleJob.java:68)
> >>>>>>>
> >>>>>>>
> >>>>>>>My registry entry for the XML files looks like this (contained in
> >>>>>>>an external registry file):
> >>>>>>>
> >>>>>>>       <!-- For XML Files :) -->
> >>>>>>>       <docFactory enabled="true" type="plain">
> >>>>>>>          <fileType name="XML">
> >>>>>>>            <extension>.xml</extension>
> >>>>>>>            <class>com.mydomain.opencms.lucene.xmlindexing.XMLDocument</class>
> >>>>>>>          </fileType>
> >>>>>>>       </docFactory>
> >>>>>>>
> >>>>>>>Your help would be much appreciated.
> >>>>>>>
> >>>>>>>(should I send you the source to correct and include in your next
> >>>>>>>patch/update?)
> >>>>>>>
> >>>>>>>Many Thanks
> >>>>>>>
> >>>>>>>Alex
> >>>>>>>
> >>>>>>>_________________________________________________________________
> >>>>>>>Find a cheaper internet access deal - choose one to suit you.
> >>>>>>>http://www.msn.co.uk/internetaccess
> >>>>>>>
> >>>>>>>_______________________________________________
> >>>>>>>This mail is sent to you from the opencms-dev mailing list
> >>>>>>>To change your list options, or to unsubscribe from the list, please
> >>>>>>>visit
> >>>>>>>http://mail.opencms.org/mailman/listinfo/opencms-dev



