[opencms-dev] Developed an XML Indexer for Lucene but getting error - EOF
Alex !
kingofkingston at hotmail.com
Tue Mar 16 00:24:02 CET 2004
OK, Matt. I had some input from my colleague and changed the XMLDocument
class (it seems it wasn't done in the best way!). I have now tried calling
XMLDocument(cmso,f) directly from a JSP and it works: it returns a
Lucene document, which I test by outputting it to the screen using the
Document.toString() method, as before.
But... the cron job still throws the same "Premature end of file" exception.
Alex
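
One thing that may be worth ruling out: a SAX parser reports a premature end of file when the input it is handed contains no document at all, for example a zero-length byte array. The following is a minimal sketch, assuming (and it is only an assumption) that f.getContents() comes back empty when the cron job runs but not when the JSP runs. The class ContentCheckSketch and its methods are hypothetical, and the JDK's built-in JAXP parser stands in for the Xerces XMLReader used in the thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; not part of OpenCms or the Lucene module.
class ContentCheckSketch {

    // Returns "empty" when there is nothing to parse, "ok" when the bytes
    // parse cleanly, otherwise the parser's own error message.
    static String tryParse(byte[] contents) {
        if (contents == null || contents.length == 0) {
            return "empty"; // guard: never hand an empty stream to the parser
        }
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new ByteArrayInputStream(contents)),
                    new DefaultHandler());
            return "ok";
        } catch (Exception e) {
            return String.valueOf(e.getMessage());
        }
    }

    // Parses a zero-length input without the guard, to show that the
    // parser rejects it with an exception.
    static boolean emptyInputFails() {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new ByteArrayInputStream(new byte[0])),
                    new DefaultHandler());
            return false;
        } catch (Exception e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse(new byte[0]));
        System.out.println(tryParse("<a/>".getBytes(StandardCharsets.UTF_8)));
        System.out.println(emptyInputFails());
    }
}
```

Logging the length of f.getContents() just before xr.parse(is) in both the JSP and the cron run would show whether the two contexts really receive the same bytes.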
>From: "Alex !" <kingofkingston at hotmail.com>
>Reply-To: opencms-dev at opencms.org
>To: opencms-dev at opencms.org
>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but getting
>error - EOF
>Date: Mon, 15 Mar 2004 22:28:34 +0000
>
>It seems to be the indexer. I have a class XMLDocument (implements
>I_FileDocumentFactory), which is based on bodyless document. Here I set up
>the XMLReader and instantiate a XMLDocumentHandlerSAX class (extends
>DefaultHandler).
>
>After some thorough debugging and testing, it seems to be the indexer, as I
>can call the XMLDocumentHandlerSAX from within a JSP and it works, returning
>a Lucene Document that I then print to screen using Document.toString(). It
>all looks OK, although I haven't tried indexing it myself (I was counting on
>the module doing this).
>
>Could it be the XMLDocument class? Here is what it looks like:
>
>public class XMLDocument implements I_FileDocumentFactory
>{
>    public static String FACTORY_NAME = "XML DocumentFactory";
>    private XMLDocumentHandlerSAX saxhdlr = null;
>    private XMLReader xr = null;
>    private InputStream in = null;
>    private InputSource is = null;
>
>    public XMLDocument() { }
>
>    public String getFactoryName() {
>        return FACTORY_NAME;
>    }
>
>    public Document Document(CmsObject cmso, CmsFile f) throws CmsException
>    {
>        try
>        {
>            XMLDocumentHandlerSAX saxhdlr = new XMLDocumentHandlerSAX(cmso, f);
>
>            in = new ByteArrayInputStream(f.getContents());
>            is = new InputSource(in);
>
>            //in = (InputStream)(new ByteArrayInputStream(f.getContents()));
>            //is = new InputSource(in);
>
>            //is = new InputSource (new StringReader (xmlText));
>
>            xr = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
>            xr.setContentHandler(saxhdlr);
>            xr.setFeature("http://xml.org/sax/features/validation", false);
>            xr.setFeature("http://apache.org/xml/features/continue-after-fatal-error", true);
>            xr.parse(is);
>        }
>        catch (Exception e)
>        {
>            throw new CmsException(e.getMessage(), e.getCause());
>        }
>        return saxhdlr.getDocument();
>    }
>
>    public Document Document(CmsObject cmso, CmsFile f, HashMap h) throws CmsException
>    {
>        return Document(cmso, f);
>    }
>}
>
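
A side note on the class above, offered as an observation rather than a diagnosis of the EOF error: the saxhdlr declared inside the try block is a new local variable that hides the field of the same name, so the field is still null when the return statement calls saxhdlr.getDocument(). Below is a minimal sketch of the same parse step with the handler held in a local variable that the return statement can see. It uses only the JDK; the class FieldCollectorSketch and its Map result are hypothetical stand-ins for XMLDocumentHandlerSAX and the Lucene Document, and the IllegalStateException stands in for CmsException.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for XMLDocumentHandlerSAX: collects the text of
// each element that directly contains text, keyed by element name.
class FieldCollectorSketch extends DefaultHandler {
    private final Map<String, String> fields = new LinkedHashMap<String, String>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (value.length() > 0) {
            fields.put(qName, value); // skip elements with no text of their own
        }
        text.setLength(0);
    }

    Map<String, String> getFields() {
        return fields;
    }

    // The handler is created outside the try block, so the same object is
    // used for parsing and for building the return value.
    static Map<String, String> parse(byte[] contents) {
        FieldCollectorSketch handler = new FieldCollectorSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new ByteArrayInputStream(contents)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed: " + e.getMessage(), e);
        }
        return handler.getFields();
    }

    public static void main(String[] args) {
        byte[] xml = "<article><title>Hello</title><body>Some text</body></article>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(parse(xml));
    }
}
```

Keeping the handler, stream and input source as locals rather than fields also means one factory instance can be reused across files without carrying state from the previous file.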
>
>It seems the handler class returns what it should, so it is either the
>XMLDocument class or the indexer which is complaining. Should I send you
>the two src files? They're about as complete as they are going to get...
>
>Cheers
>
>Alex
>
>
>>From: M Butcher <mbutcher at grcomputing.net>
>>Reply-To: opencms-dev at opencms.org
>>To: opencms-dev at opencms.org
>>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but getting
>>error - EOF
>>Date: Mon, 15 Mar 2004 13:53:17 -0700
>>
>>What is throwing the exception, the XML parser or the indexer? Last week,
>>I was working on my XSLT code and created some code that looks almost
>>exactly like yours (except I created a Transformer instead of an
>>XMLReader) and it worked fine -- perhaps the problem is in whatever gets
>>handed to the IndexManager.
>>
>>Matt
>>
>>Alex ! wrote:
>>>OK, so I think I'm almost done, but now when the cron runs (yes, it has
>>>mysteriously begun working!), I get the following error about a premature
>>>end of file. Any ideas? The way I am retrieving the file contents is as
>>>follows:
>>>
>>> in = new ByteArrayInputStream(f.getContents());
>>> is = new InputSource(in);
>>> xr.parse(is);
>>>
>>>where: private XMLReader xr
>>> private InputStream in
>>> private InputSource is
>>>
>>>
>>>Error output from OCMS log:
>>>
>>>[13.03.2004 06:58:10] <opencms_cronscheduler> Starting job for
>>>com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators
>>>net.grcomputing.opencms.search.lucene.CronIndexManager
>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>
>>>[13.03.2004 06:58:10] <opencms_info>
>>>=====IndexManager=============================================================
>>>
>>>[13.03.2004 06:58:10] <opencms_info> Analyzer:
>>>org.apache.lucene.analysis.standard.StandardAnalyzer
>>>[13.03.2004 06:58:10] <opencms_info> Extension map exists to handle XML
>>>[13.03.2004 06:58:10] <opencms_info> Page DocumentFactory loaded
>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/
>>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
>>>processing file test_xml.xml: com.opencms.core.CmsException: 0 Unknown
>>>exception. Detailed error: Premature end of file..
>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/xml/
>>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
>>>processing file article5.xml: com.opencms.core.CmsException: 0 Unknown
>>>exception. Detailed error: Premature end of file..
>>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
>>>processing file article7.xml: com.opencms.core.CmsException: 0 Unknown
>>>exception. Detailed error: Premature end of file..
>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: 4 documents are being
>>>processed
>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: Index has been
>>>optimized.
>>>[13.03.2004 06:58:10] <opencms_info> Done
>>>=====IndexManager=============================================================
>>>
>>>[13.03.2004 06:58:10] <opencms_cronscheduler> Successful launch of job
>>>com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators
>>>net.grcomputing.opencms.search.lucene.CronIndexManager
>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>Message: CronIndexManager rebuilt the Lucene index on Sat Mar 13 06:58:10
>>>GMT 2004
>>>
>>>
>>>Thanks alex
>>>
>>>
>>>>From: M Butcher <mbutcher at grcomputing.net>
>>>>Reply-To: opencms-dev at opencms.org
>>>>To: opencms-dev at opencms.org
>>>>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but
>>>>getting error
>>>>Date: Mon, 08 Mar 2004 10:03:43 -0700
>>>>
>>>>
>>>>Alex,
>>>>
>>>>I can't tell, from the stack trace, what is going on. Judging from where
>>>>the exception is located, it looks like a problem with content defs...
>>>>but that doesn't make sense....
>>>>
>>>>When you finish it, please do send it to Stephan and me. It sounds like a
>>>>very useful addition to the existing indexing tools.
>>>>
>>>>Matt
>>>>
>>>>Alex ! wrote:
>>>>
>>>>>Hi,
>>>>>
>>>>>This one's probably for Matt/Stefan.
>>>>>
>>>>>I have written an XML indexer for the Lucene module (almost finished),
>>>>>which will basically take an XML file, parse it, and then add its
>>>>>elements and their contents to the Lucene index, instead of stripping
>>>>>the element tags and then including the remaining content as a single
>>>>>searchable body (as is currently available).
>>>>>
>>>>>Everything is now compiled (into a separate jar, just 2 class files) and
>>>>>the cron job runs, but it gives the following error:
>>>>>
>>>>>[07.03.2004 14:20:10] <opencms_cronscheduler> Starting job for
>>>>>com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators
>>>>>net.grcomputing.opencms.search.lucene.CronIndexManager
>>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/uk_lucene_registry.xml}
>>>>>
>>>>>
>>>>>[07.03.2004 14:20:10] <opencms_info>
>>>>>=====IndexManager=============================================================
>>>>>
>>>>>
>>>>>[07.03.2004 14:20:10] <opencms_info> Analyzer:
>>>>>org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>>[07.03.2004 14:20:10] <opencms_info> Extension map exists to handle XML
>>>>>[07.03.2004 14:20:10] <opencms_info> Page DocumentFactory loaded
>>>>>[07.03.2004 14:20:10] <opencms_info> IndexManager: indexing /test/
>>>>>[07.03.2004 14:20:11] <opencms_info> Created XMLDocumentHandlerSAX
>>>>>[07.03.2004 14:20:11] <opencms_info> Return Document
>>>>>[07.03.2004 14:20:11] <opencms_cronscheduler> Error running job for
>>>>>com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators
>>>>>net.grcomputing.opencms.search.lucene.CronIndexManager
>>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>>>Error: java.lang.NullPointerException
>>>>>    at org.apache.lucene.index.FieldInfos.add(FieldInfos.java:90)
>>>>>    at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:92)
>>>>>    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
>>>>>    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
>>>>>    at net.grcomputing.opencms.search.lucene.IndexManager.processFile(Unknown Source)
>>>>>    at net.grcomputing.opencms.search.lucene.IndexManager.processDir(Unknown Source)
>>>>>    at net.grcomputing.opencms.search.lucene.IndexManager.doIndex(Unknown Source)
>>>>>    at net.grcomputing.opencms.search.lucene.CronIndexManager.launch(Unknown Source)
>>>>>    at com.opencms.core.CmsCronScheduleJob.run(CmsCronScheduleJob.java:68)
>>>>>
>>>>>
>>>>>My registry entry for the XML files looks like this (contained in an
>>>>>external registry file):
>>>>>
>>>>>   <!-- For XML Files :) -->
>>>>>   <docFactory enabled="true" type="plain">
>>>>>      <fileType name="XML">
>>>>>         <extension>.xml</extension>
>>>>>         <class>com.mydomain.opencms.lucene.xmlindexing.XMLDocument</class>
>>>>>      </fileType>
>>>>>   </docFactory>
>>>>>
>>>>>Your help would be much appreciated.
>>>>>
>>>>>(should I send you the source to correct and include in your next
>>>>>patch/update?)
>>>>>
>>>>>Many Thanks
>>>>>
>>>>>Alex
>>>>>
>>>>>
>>>>>_______________________________________________
>>>>>This mail is send to you from the opencms-dev mailing list
>>>>>To change your list options, or to unsubscribe from the list, please
>>>>>visit
>>>>>http://mail.opencms.org/mailman/listinfo/opencms-dev