[opencms-dev] Developed an XML Indexer for Lucene but getting error - EOF

Alex ! kingofkingston at hotmail.com
Tue Mar 16 00:24:02 CET 2004


OK, Matt. So I had some input from my colleague, changed the XMLDocument 
class (seems it wasn't done in the best way!) and now tried calling 
XMLDocument(cmso,f) directly from a JSP - and it works: it returns a 
Lucene document, which I test by outputting to screen using the 
Document.toString() method as before.
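
For readers who want to reproduce that kind of standalone check without the OpenCms or Lucene jars, here is a minimal sketch. It is not the actual XMLDocumentHandlerSAX: the class name ElementTextHandler and the map it returns are hypothetical stand-ins for the handler and the Lucene Document, and it uses the JDK's bundled SAX parser rather than a named Xerces class. It only shows the general shape: parse a byte array and record each element's text under the element name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for XMLDocumentHandlerSAX: collects element text
// into a map instead of building a Lucene Document.
class ElementTextHandler extends DefaultHandler {
    private final Map<String, String> fields = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the newly opened element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            // repeated elements are joined with a space (a simplification)
            fields.merge(qName, value, (a, b) -> a + " " + b);
        }
        text.setLength(0);
    }

    Map<String, String> getFields() {
        return fields;
    }

    // Parser, handler and stream are all locals, so nothing is shared between calls.
    static Map<String, String> parse(byte[] xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ElementTextHandler handler = new ElementTextHandler();
        parser.parse(new ByteArrayInputStream(xml), handler);
        return handler.getFields();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<article><title>Hello</title><body>World</body></article>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(parse(xml));
    }
}
```

Text that follows a nested child element is dropped by this sketch, which is good enough for a smoke test but not for real mixed content.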

But... the cron still returns the same premature end of file exception.
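
One thing that may be worth ruling out (an assumption only, nothing in the log confirms it) is that the cron path hands the factory a file whose contents are empty. Xerces-based parsers typically report zero bytes of input as "Premature end of file", so the same class can work from a JSP and fail from the cron if the byte array differs. The sketch below uses only the JDK parser; the class name EmptyInputCheck and its methods are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports how a SAX parser reacts to a given byte array.
class EmptyInputCheck {

    // Returns "ok" when the bytes parse, otherwise "failed: <parser message>".
    static String describe(byte[] contents) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(contents), new DefaultHandler());
            return "ok";
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, so malformed or empty input lands here
            return "failed: " + e.getMessage();
        }
    }

    // Cheap guard that could be logged before parsing, to see what the cron really passes in.
    static boolean isEmpty(byte[] contents) {
        return contents == null || contents.length == 0;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe("<a/>".getBytes(StandardCharsets.UTF_8)));
        System.out.println(describe(new byte[0]));
    }
}
```

Logging the length of f.getContents() inside the factory during a cron run would confirm or rule this out quickly.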


Alex


>From: "Alex !" <kingofkingston at hotmail.com>
>Reply-To: opencms-dev at opencms.org
>To: opencms-dev at opencms.org
>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but getting 
>error - EOF
>Date: Mon, 15 Mar 2004 22:28:34 +0000
>
>It seems to be the indexer. I have a class XMLDocument (implements 
>I_FileDocumentFactory), which is based on bodyless document. Here I set up 
>the XMLReader and instantiate an XMLDocumentHandlerSAX class (extends 
>DefaultHandler).
>
>After some thorough debugging and testing, it seems to be the indexer, as I 
>can call the XMLDocumentHandlerSAX from within a JSP and it works, returning 
>a Lucene Document that I then print to screen using Document.toString(). It 
>all looks OK, although I haven't tried indexing it myself (I was counting on 
>the module doing this).
>
>Could it be the XMLDocument class? Here is what it looks like:
>
>public class XMLDocument implements I_FileDocumentFactory
>{
>	public static String FACTORY_NAME = "XML DocumentFactory";
>	private XMLDocumentHandlerSAX saxhdlr = null;
>	private XMLReader xr = null;
>	private InputStream in = null;
>	private InputSource is = null;
>
>	public XMLDocument() { }
>
>	public String getFactoryName() {
>		return FACTORY_NAME;
>	}
>
>	public Document Document(CmsObject cmso, CmsFile f) throws CmsException
>	{
>		try
>		{
>			// Assign the field; a local declaration here would shadow it
>			// and leave the field null for the return statement below.
>			saxhdlr = new XMLDocumentHandlerSAX(cmso, f);
>
>			in = new ByteArrayInputStream(f.getContents());
>			is = new InputSource(in);
>
>			//is = new InputSource(new StringReader(xmlText));
>
>			xr = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
>			xr.setContentHandler(saxhdlr);
>			xr.setFeature("http://xml.org/sax/features/validation", false);
>			xr.setFeature("http://apache.org/xml/features/continue-after-fatal-error", true);
>			xr.parse(is);
>		}
>		catch (Exception e)
>		{
>			// Pass e itself as the cause so the original stack trace is kept.
>			throw new CmsException(e.getMessage(), e);
>		}
>		return saxhdlr.getDocument();
>	}
>
>	public Document Document(CmsObject cmso, CmsFile f, HashMap h) throws CmsException
>	{
>		return Document(cmso, f);
>	}
>}
>
>
>It seems the handler class returns what it should, so it is either the 
>XMLDocument class or the indexer which is complaining. Should I send you 
>the two src files? They're about as complete as they are going to get...
>
>Cheers
>
>Alex
>
>
>>From: M Butcher <mbutcher at grcomputing.net>
>>Reply-To: opencms-dev at opencms.org
>>To: opencms-dev at opencms.org
>>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but getting 
>>error - EOF
>>Date: Mon, 15 Mar 2004 13:53:17 -0700
>>
>>What is throwing the exception, the XML parser or the indexer? Last week, 
>>I was working on my XSLT code and created some code that looks almost 
>>exactly like yours (except I created a Transformer instead of an 
>>XMLReader) and it worked fine -- perhaps the problem is in whatever gets 
>>handed to the IndexManager.
>>
>>Matt
>>
>>Alex ! wrote:
>>>Ok so I think I'm almost done, but now when the cron runs (yes, it has 
>>>mysteriously begun working!), I get the following error, a premature 
>>>end of file. Any ideas? The way I am retrieving the file contents is as 
>>>follows:
>>>
>>>             in = new ByteArrayInputStream(f.getContents());
>>>             is = new InputSource(in);
>>>             xr.parse(is);
>>>
>>>where:     private XMLReader xr
>>>     private InputStream in
>>>     private InputSource is
>>>
>>>
>>>Error output from the OCMS log:
>>>
>>>[13.03.2004 06:58:10] <opencms_cronscheduler> Starting job for 
>>>com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators 
>>>net.grcomputing.opencms.search.lucene.CronIndexManager 
>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>
>>>[13.03.2004 06:58:10] <opencms_info>
>>>=====IndexManager=============================================================
>>>
>>>[13.03.2004 06:58:10] <opencms_info> Analyzer: 
>>>org.apache.lucene.analysis.standard.StandardAnalyzer
>>>[13.03.2004 06:58:10] <opencms_info> Extension map exists to handle XML
>>>[13.03.2004 06:58:10] <opencms_info> Page DocumentFactory loaded
>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/
>>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error 
>>>processing file test_xml.xml: com.opencms.core.CmsException: 0 Unknown 
>>>exception. Detailed error: Premature end of file..
>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/xml/
>>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error 
>>>processing file article5.xml: com.opencms.core.CmsException: 0 Unknown 
>>>exception. Detailed error: Premature end of file..
>>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error 
>>>processing file article7.xml: com.opencms.core.CmsException: 0 Unknown 
>>>exception. Detailed error: Premature end of file..
>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: 4 documents are being 
>>>processed
>>>[13.03.2004 06:58:10] <opencms_info> IndexManager:  Index has been 
>>>optimized.
>>>[13.03.2004 06:58:10] <opencms_info> Done
>>>=====IndexManager=============================================================
>>>
>>>[13.03.2004 06:58:10] <opencms_cronscheduler> Successful launch of job 
>>>com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators 
>>>net.grcomputing.opencms.search.lucene.CronIndexManager 
>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml} 
>>>Message: CronIndexManager rebuilt the Lucene index on Sat Mar 13 06:58:10 
>>>GMT 2004
>>>
>>>
>>>Thanks alex
>>>
>>>
>>>>From: M Butcher <mbutcher at grcomputing.net>
>>>>Reply-To: opencms-dev at opencms.org
>>>>To: opencms-dev at opencms.org
>>>>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but 
>>>>getting error
>>>>Date: Mon, 08 Mar 2004 10:03:43 -0700
>>>>
>>>>
>>>>Alex,
>>>>
>>>>I can't tell, from the stack trace, what is going on. Judging from where 
>>>>the exception is located, it looks like a problem with content defs... 
>>>>but that doesn't make sense....
>>>>
>>>>When you finish it, please do send it to Stephan and me. It sounds like a 
>>>>very useful addition to the existing indexing tools.
>>>>
>>>>Matt
>>>>
>>>>Alex ! wrote:
>>>>
>>>>>Hi,
>>>>>
>>>>>this one's probably for Matt/Stefan.
>>>>>
>>>>>I have written an XML Indexer for the lucene module (almost finished), 
>>>>>which will basically take an xml file, parse it, and then add its 
>>>>>elements and their contents to the lucene index, instead of stripping 
>>>>>the element tags and then including the remaining content as a single 
>>>>>searchable body (as is currently available).
>>>>>
>>>>>Everything is now compiled (into a separate jar, just 2 class files), 
>>>>>the cron job runs but gives the following error:
>>>>>
>>>>>[07.03.2004 14:20:10] <opencms_cronscheduler> Starting job for 
>>>>>com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators 
>>>>>net.grcomputing.opencms.search.lucene.CronIndexManager 
>>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/uk_lucene_registry.xml}
>>>>>
>>>>>
>>>>>[07.03.2004 14:20:10] <opencms_info>
>>>>>=====IndexManager=============================================================
>>>>>
>>>>>
>>>>>[07.03.2004 14:20:10] <opencms_info> Analyzer: 
>>>>>org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>>[07.03.2004 14:20:10] <opencms_info> Extension map exists to handle XML
>>>>>[07.03.2004 14:20:10] <opencms_info> Page DocumentFactory loaded
>>>>>[07.03.2004 14:20:10] <opencms_info> IndexManager: indexing /test/
>>>>>[07.03.2004 14:20:11] <opencms_info> Created XMLDocumentHandlerSAX
>>>>>[07.03.2004 14:20:11] <opencms_info> Return Document
>>>>>[07.03.2004 14:20:11] <opencms_cronscheduler> Error running job for 
>>>>>com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators 
>>>>>net.grcomputing.opencms.search.lucene.CronIndexManager 
>>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml} 
>>>>>Error: java.lang.NullPointerException
>>>>>     at org.apache.lucene.index.FieldInfos.add(FieldInfos.java:90)
>>>>>     at 
>>>>>org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:92)
>>>>>
>>>>>     at 
>>>>>org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
>>>>>     at 
>>>>>org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
>>>>>     at 
>>>>>net.grcomputing.opencms.search.lucene.IndexManager.processFile(Unknown 
>>>>>Source)
>>>>>     at 
>>>>>net.grcomputing.opencms.search.lucene.IndexManager.processDir(Unknown 
>>>>>Source)
>>>>>     at 
>>>>>net.grcomputing.opencms.search.lucene.IndexManager.doIndex(Unknown 
>>>>>Source)
>>>>>     at 
>>>>>net.grcomputing.opencms.search.lucene.CronIndexManager.launch(Unknown 
>>>>>Source)
>>>>>     at 
>>>>>com.opencms.core.CmsCronScheduleJob.run(CmsCronScheduleJob.java:68)
>>>>>
>>>>>
>>>>>my registry entry for the xml files looks like this (contained in an 
>>>>>external registry file):
>>>>>
>>>>>       <!-- For XML Files :) -->
>>>>>       <docFactory enabled="true" type="plain">
>>>>>          <fileType name="XML">
>>>>>            <extension>.xml</extension>
>>>>>            <class>com.mydomain.opencms.lucene.xmlindexing.XMLDocument</class>
>>>>>          </fileType>
>>>>>       </docFactory>
>>>>>
>>>>>Your help would be much appreciated.
>>>>>
>>>>>(should I send you the source to correct and include in your next 
>>>>>patch/update?)
>>>>>
>>>>>Many Thanks
>>>>>
>>>>>Alex
>>>>>
>>>>>_________________________________________________________________
>>>>>Find a cheaper internet access deal - choose one to suit you. 
>>>>>http://www.msn.co.uk/internetaccess
>>>>>
>>>>>_______________________________________________
>>>>>This mail is send to you from the opencms-dev mailing list
>>>>>To change your list options, or to unsubscribe from the list, please 
>>>>>visit
>>>>>http://mail.opencms.org/mailman/listinfo/opencms-dev
>>>>
>>>>

