[opencms-dev] Developed an XML Indexer for Lucene but getting error - EOF

Alex ! kingofkingston at hotmail.com
Mon Mar 15 23:29:00 CET 2004


It seems to be the indexer. I have a class XMLDocument (implements 
I_FileDocumentFactory), which is based on the bodyless document. Here I set 
up the XMLReader and instantiate an XMLDocumentHandlerSAX class (extends 
DefaultHandler).
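
For readers following along, a handler of this kind might look roughly like the sketch below. It is not the XMLDocumentHandlerSAX from this thread (that source is not shown). The class name ElementTextHandler and the map of element name to text are stand-ins: a real handler would presumably add one Lucene field per element instead of a map entry.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a SAX handler that turns elements into fields.
class ElementTextHandler extends DefaultHandler {
    // Element name -> text content; stands in for Lucene fields.
    private final Map<String, String> fields = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start collecting afresh; text that precedes a child element is dropped.
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append rather than assign.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            // Repeated elements are joined into one space-separated value.
            fields.merge(qName, value, (a, b) -> a + " " + b);
        }
        text.setLength(0);
    }

    Map<String, String> getFields() {
        return fields;
    }

    // Parses the given bytes and returns the collected element texts.
    static Map<String, String> parse(byte[] xml) throws Exception {
        ElementTextHandler handler = new ElementTextHandler();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new ByteArrayInputStream(xml)));
        return handler.getFields();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<article><title>Hello</title><body>Lucene test</body></article>";
        System.out.println(parse(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Keeping the parse step in one static method makes it easy to run the handler outside the CMS, which is essentially what the JSP test described in this message does.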

After some thorough debugging and testing, it seems to be the indexer: I can 
call the XMLDocumentHandlerSAX from within a JSP and it works, returning a 
Lucene Document that I then print to the screen using Document.toString(). It 
all looks OK, although I haven't tried indexing it myself (I was counting on 
the module doing this).

Could it be the XMLDocument class? Here is what it looks like:

public class XMLDocument implements I_FileDocumentFactory
{
	public static String FACTORY_NAME = "XML DocumentFactory";
	private XMLDocumentHandlerSAX saxhdlr = null;
	private XMLReader xr = null;
	private InputStream in = null;
	private InputSource is = null;

	public XMLDocument() { }

	public String getFactoryName() {
		return FACTORY_NAME;
	}

	public Document Document(CmsObject cmso, CmsFile f) throws CmsException
	{
		try
		{
			XMLDocumentHandlerSAX saxhdlr = new XMLDocumentHandlerSAX(cmso, f);

			in = new ByteArrayInputStream(f.getContents());
			is = new InputSource(in);

			//in = (InputStream)(new ByteArrayInputStream(f.getContents()));
			//is = new InputSource(in);

			//is = new InputSource (new StringReader (xmlText));

			xr = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
			xr.setContentHandler(saxhdlr);
			xr.setFeature("http://xml.org/sax/features/validation", false);
			xr.setFeature("http://apache.org/xml/features/continue-after-fatal-error", true);
			xr.parse(is);
		}
		catch (Exception e)
		{
			throw new CmsException(e.getMessage(), e.getCause());
		}
		return saxhdlr.getDocument();
	}

	public Document Document(CmsObject cmso, CmsFile f, HashMap h) throws CmsException
	{
		return Document(cmso, f);
	}
}
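
A side note on the listing above, offered as an observation rather than a confirmed diagnosis of the EOF error: the handler is created as a new local variable inside the try block, so the private field saxhdlr is never assigned, and the final return would dereference null once parsing succeeds. The sketch below reproduces that pattern with hypothetical stand-in classes (Handler, ShadowedFactory, LocalFactory). None of them are OpenCms or Lucene types.

```java
// Hypothetical stand-ins that reproduce the variable-shadowing pattern.
class ShadowingDemo {
    static class Handler {
        String getDocument() {
            return "doc";
        }
    }

    // Mirrors the listing: the field is declared but never assigned.
    static class ShadowedFactory {
        private Handler handler = null;

        String build() {
            try {
                // Declares a NEW local variable; the field above stays null.
                Handler handler = new Handler();
                handler.getDocument();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
            // The local is out of scope here, so this reads the null field.
            return handler.getDocument();
        }
    }

    // One possible fix: keep the handler in a local declared before the try.
    static class LocalFactory {
        String build() {
            Handler handler = new Handler();
            try {
                handler.getDocument(); // parsing would happen here
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
            return handler.getDocument();
        }
    }

    public static void main(String[] args) {
        System.out.println(new LocalFactory().build());
        try {
            new ShadowedFactory().build();
        } catch (NullPointerException e) {
            System.out.println("shadowed field was null");
        }
    }
}
```

Separately, the catch block passes e.getCause(), which is often null for a parser exception. Passing e itself would probably preserve the parser's stack trace, assuming the (String, Throwable) constructor of CmsException treats its second argument as the root cause.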


It seems the handler class returns what it should, so it is either the 
XMLDocument class or the indexer that is complaining. Should I send you the 
two source files? They're about as complete as they are going to get...
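
One way to narrow down which side is at fault is to check what the parser receives. A zero-length input is a common way to get a "Premature end of file" message from a SAX parser, so it may be worth logging f.getContents().length inside the factory before parsing. Whether the CmsFile handed over by the IndexManager really carries its contents is an assumption to verify; the log does not confirm it. The sketch below uses only the JDK's built-in parser, and EmptyInputCheck and tryParse are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: shows how a SAX parser reacts to empty input.
class EmptyInputCheck {
    // Returns null if the bytes parse as XML, otherwise a short error description.
    static String tryParse(byte[] contents) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler());
            reader.parse(new InputSource(new ByteArrayInputStream(contents)));
            return null;
        } catch (Exception e) {
            // Covers SAXException, IOException and ParserConfigurationException.
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        byte[] valid = "<article/>".getBytes(StandardCharsets.UTF_8);
        System.out.println("length=" + empty.length + " -> " + tryParse(empty));
        System.out.println("length=" + valid.length + " -> " + tryParse(valid));
    }
}
```

If the logged length turns out to be 0 for the files named in the error output, the parser and handler are behaving correctly and the question becomes how the file contents are loaded before the factory is called.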

Cheers

Alex


>From: M Butcher <mbutcher at grcomputing.net>
>Reply-To: opencms-dev at opencms.org
>To: opencms-dev at opencms.org
>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but getting 
>error - EOF
>Date: Mon, 15 Mar 2004 13:53:17 -0700
>
>What is throwing the exception, the XML parser or the indexer? Last week, I 
>was working on my XSLT code and created some code that looks almost exactly 
>like yours (except I created a Transformer instead of an XMLReader) and it 
>worked fine -- perhaps the problem is in whatever gets handed to the 
>IndexManager.
>
>Matt
>
>Alex ! wrote:
>>OK, so I think I'm almost done, but now when the cron job runs (yes, it has 
>>mysteriously begun working!), I get the following error about a premature 
>>end of file. Any ideas? The way I am retrieving the file contents is as 
>>follows:
>>
>>             in = new ByteArrayInputStream(f.getContents());
>>             is = new InputSource(in);
>>             xr.parse(is);
>>
>>where:     private XMLReader xr
>>     private InputStream in
>>     private InputSource is
>>
>>
>>Error output from the OpenCms log:
>>
>>[13.03.2004 06:58:10] <opencms_cronscheduler> Starting job for com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>
>>[13.03.2004 06:58:10] <opencms_info>
>>=====IndexManager=============================================================
>>
>>[13.03.2004 06:58:10] <opencms_info> Analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
>>[13.03.2004 06:58:10] <opencms_info> Extension map exists to handle XML
>>[13.03.2004 06:58:10] <opencms_info> Page DocumentFactory loaded
>>[13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/
>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error processing file test_xml.xml: com.opencms.core.CmsException: 0 Unknown exception. Detailed error: Premature end of file..
>>[13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/xml/
>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error processing file article5.xml: com.opencms.core.CmsException: 0 Unknown exception. Detailed error: Premature end of file..
>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error processing file article7.xml: com.opencms.core.CmsException: 0 Unknown exception. Detailed error: Premature end of file..
>>[13.03.2004 06:58:10] <opencms_info> IndexManager: 4 documents are being processed
>>[13.03.2004 06:58:10] <opencms_info> IndexManager:  Index has been optimized.
>>[13.03.2004 06:58:10] <opencms_info> Done
>>=====IndexManager=============================================================
>>
>>[13.03.2004 06:58:10] <opencms_cronscheduler> Successful launch of job com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml} Message: CronIndexManager rebuilt the Lucene index on Sat Mar 13 06:58:10 GMT 2004
>>
>>
>>Thanks alex
>>
>>
>>>From: M Butcher <mbutcher at grcomputing.net>
>>>Reply-To: opencms-dev at opencms.org
>>>To: opencms-dev at opencms.org
>>>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but 
>>>getting error
>>>Date: Mon, 08 Mar 2004 10:03:43 -0700
>>>
>>>
>>>Alex,
>>>
>>>I can't tell, from the stack trace, what is going on. Judging from where 
>>>the exception is located, it looks like a problem with content defs... 
>>>but that doesn't make sense....
>>>
>>>When you finish it, please do send it to Stephan and me. It sounds like a 
>>>very useful addition to the existing indexing tools.
>>>
>>>Matt
>>>
>>>Alex ! wrote:
>>>
>>>>Hi,
>>>>
>>>>This one's probably for Matt/Stefan.
>>>>
>>>>I have written an XML indexer for the Lucene module (almost finished), 
>>>>which will basically take an XML file, parse it, and then add its 
>>>>elements and their contents to the Lucene index, instead of stripping 
>>>>the element tags and then including the remaining content as a single 
>>>>searchable body (as is currently available).
>>>>
>>>>Everything is now compiled (into a separate jar, just 2 class files). The 
>>>>cron job runs but gives the following error:
>>>>
>>>>[07.03.2004 14:20:10] <opencms_cronscheduler> Starting job for com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/uk_lucene_registry.xml}
>>>>
>>>>[07.03.2004 14:20:10] <opencms_info>
>>>>=====IndexManager=============================================================
>>>>
>>>>[07.03.2004 14:20:10] <opencms_info> Analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>[07.03.2004 14:20:10] <opencms_info> Extension map exists to handle XML
>>>>[07.03.2004 14:20:10] <opencms_info> Page DocumentFactory loaded
>>>>[07.03.2004 14:20:10] <opencms_info> IndexManager: indexing /test/
>>>>[07.03.2004 14:20:11] <opencms_info> Created XMLDocumentHandlerSAX
>>>>[07.03.2004 14:20:11] <opencms_info> Return Document
>>>>[07.03.2004 14:20:11] <opencms_cronscheduler> Error running job for com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators net.grcomputing.opencms.search.lucene.CronIndexManager createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml} 
>>>>Error: java.lang.NullPointerException
>>>>     at org.apache.lucene.index.FieldInfos.add(FieldInfos.java:90)
>>>>     at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:92)
>>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
>>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
>>>>     at net.grcomputing.opencms.search.lucene.IndexManager.processFile(Unknown Source)
>>>>     at net.grcomputing.opencms.search.lucene.IndexManager.processDir(Unknown Source)
>>>>     at net.grcomputing.opencms.search.lucene.IndexManager.doIndex(Unknown Source)
>>>>     at net.grcomputing.opencms.search.lucene.CronIndexManager.launch(Unknown Source)
>>>>     at com.opencms.core.CmsCronScheduleJob.run(CmsCronScheduleJob.java:68)
>>>>
>>>>
>>>>My registry entry for the XML files looks like this (contained in an 
>>>>external registry file):
>>>>
>>>>       <!-- For XML Files :) -->
>>>>       <docFactory enabled="true" type="plain">
>>>>          <fileType name="XML">
>>>>            <extension>.xml</extension>
>>>>            <class>com.mydomain.opencms.lucene.xmlindexing.XMLDocument</class>
>>>>          </fileType>
>>>>       </docFactory>
>>>>
>>>>Your help would be much appreciated.
>>>>
>>>>(should I send you the source to correct and include in your next 
>>>>patch/update?)
>>>>
>>>>Many Thanks
>>>>
>>>>Alex
>>>>
>>>>_______________________________________________
>>>>This mail is send to you from the opencms-dev mailing list
>>>>To change your list options, or to unsubscribe from the list, please 
>>>>visit
>>>>http://mail.opencms.org/mailman/listinfo/opencms-dev




