[opencms-dev] Developed an XML Indexer for Lucene but getting error - EOF
Alex !
kingofkingston at hotmail.com
Mon Mar 15 23:29:00 CET 2004
It seems to be the indexer. I have a class XMLDocument (implements
I_FileDocumentFactory), which is based on bodyless document. Here I set up
the XMLReader and instantiate an XMLDocumentHandlerSAX class (extends
DefaultHandler).
After some thorough debugging and testing, it seems to be the indexer, as I
can call the XMLDocumentHandlerSAX from within a JSP and it works, returning
a Lucene Document that I then print to screen using Document.toString(). It
all looks OK, although I haven't tried indexing it myself (I was counting on
the module doing this).
Could it be the XMLDocument class? Here is what it looks like:
public class XMLDocument implements I_FileDocumentFactory
{
    public static String FACTORY_NAME = "XML DocumentFactory";
    private XMLDocumentHandlerSAX saxhdlr = null;
    private XMLReader xr = null;
    private InputStream in = null;
    private InputSource is = null;

    public XMLDocument() { }

    public String getFactoryName() {
        return FACTORY_NAME;
    }

    public Document Document(CmsObject cmso, CmsFile f) throws CmsException
    {
        try
        {
            // Assign to the field: declaring a local variable of the same name
            // here would shadow it and leave the field null at the return below.
            saxhdlr = new XMLDocumentHandlerSAX(cmso, f);
            in = new ByteArrayInputStream(f.getContents());
            is = new InputSource(in);
            //in = (InputStream)(new ByteArrayInputStream(f.getContents()));
            //is = new InputSource(in);
            //is = new InputSource (new StringReader (xmlText));
            xr = XMLReaderFactory.createXMLReader(
                "org.apache.xerces.parsers.SAXParser" );
            xr.setContentHandler(saxhdlr);
            xr.setFeature( "http://xml.org/sax/features/validation",false );
            xr.setFeature(
                "http://apache.org/xml/features/continue-after-fatal-error",true );
            xr.parse(is);
        }
        catch (Exception e)
        {
            // Pass the exception itself as the cause; e.getCause() may be null.
            throw new CmsException(e.getMessage(), e);
        }
        return saxhdlr.getDocument();
    }

    public Document Document(CmsObject cmso, CmsFile f, HashMap h) throws
        CmsException
    {
        return Document(cmso,f);
    }
}
It seems the handler class returns what it should, so it is either the
XMLDocument class or the indexer which is complaining. Should I send you the
two src files? They're about as complete as they are going to get...
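
One thing that may be worth ruling out, purely as a guess that nothing in this thread confirms: Xerces typically reports "Premature end of file." when the input it is handed is empty, so the byte array returned by f.getContents() could be empty at the time the indexer calls the factory, even though the same file parses fine from a JSP. The sketch below is standalone and uses only the JDK's built-in SAX parser; the class and method names (EmptyContentCheck, isBlank, tryParse) are made up for illustration and are not part of OpenCms or the Lucene module.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of OpenCms or the Lucene module.
class EmptyContentCheck {

    /** Returns true when the byte array is null, empty, or only whitespace. */
    static boolean isBlank(byte[] contents) {
        if (contents == null) {
            return true;
        }
        for (byte b : contents) {
            if (!Character.isWhitespace((char) b)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Parses the bytes with a no-op handler.
     * Returns null when parsing succeeds, otherwise the parser's error message.
     */
    static String tryParse(byte[] contents) {
        try {
            XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DefaultHandler handler = new DefaultHandler();
            xr.setContentHandler(handler);
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            xr.setErrorHandler(handler);
            xr.parse(new InputSource(new ByteArrayInputStream(contents)));
            return null;
        } catch (Exception e) {
            return String.valueOf(e.getMessage());
        }
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        byte[] xml = "<article><title>Test</title></article>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("empty: blank=" + isBlank(empty)
                + ", parse error=" + tryParse(empty));
        System.out.println("xml:   blank=" + isBlank(xml)
                + ", parse error=" + tryParse(xml));
    }
}
```

In the factory, a check along the lines of isBlank(f.getContents()) before xr.parse(is), together with a log line giving the array length, would show whether the indexer is passing in a file without its contents loaded.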
Cheers
Alex
>From: M Butcher <mbutcher at grcomputing.net>
>Reply-To: opencms-dev at opencms.org
>To: opencms-dev at opencms.org
>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but getting
>error - EOF
>Date: Mon, 15 Mar 2004 13:53:17 -0700
>
>What is throwing the exception, the XML parser or the indexer? Last week, I
>was working on my XSLT code and created some code that looks almost exactly
>like yours (except I created a Transformer instead of an XMLReader) and it
>worked fine -- perhaps the problem is in whatever gets handed to the
>IndexManager.
>
>Matt
>
>Alex ! wrote:
>>Ok so I think I'm almost done, but now when the cron runs (yes, it has
>>mysteriously begun working!), I get the following error about a premature
>>end of file. Any ideas? The way I am retrieving the file contents is as
>>follows:
>>
>> in = new ByteArrayInputStream(f.getContents());
>> is = new InputSource(in);
>> xr.parse(is);
>>
>>where: private XMLReader xr
>> private InputStream in
>> private InputSource is
>>
>>
>>Error output from OCMS log:
>>
>>[13.03.2004 06:58:10] <opencms_cronscheduler> Starting job for
>>com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators
>>net.grcomputing.opencms.search.lucene.CronIndexManager
>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>
>>[13.03.2004 06:58:10] <opencms_info>
>>=====IndexManager=============================================================
>>
>>[13.03.2004 06:58:10] <opencms_info> Analyzer:
>>org.apache.lucene.analysis.standard.StandardAnalyzer
>>[13.03.2004 06:58:10] <opencms_info> Extension map exists to handle XML
>>[13.03.2004 06:58:10] <opencms_info> Page DocumentFactory loaded
>>[13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/
>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
>>processing file test_xml.xml: com.opencms.core.CmsException: 0 Unknown
>>exception. Detailed error: Premature end of file..
>>[13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/xml/
>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
>>processing file article5.xml: com.opencms.core.CmsException: 0 Unknown
>>exception. Detailed error: Premature end of file..
>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
>>processing file article7.xml: com.opencms.core.CmsException: 0 Unknown
>>exception. Detailed error: Premature end of file..
>>[13.03.2004 06:58:10] <opencms_info> IndexManager: 4 documents are being
>>processed
>>[13.03.2004 06:58:10] <opencms_info> IndexManager: Index has been
>>optimized.
>>[13.03.2004 06:58:10] <opencms_info> Done
>>=====IndexManager=============================================================
>>
>>[13.03.2004 06:58:10] <opencms_cronscheduler> Successful launch of job
>>com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators
>>net.grcomputing.opencms.search.lucene.CronIndexManager
>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>Message: CronIndexManager rebuilt the Lucene index on Sat Mar 13 06:58:10
>>GMT 2004
>>
>>
>>Thanks alex
>>
>>
>>>From: M Butcher <mbutcher at grcomputing.net>
>>>Reply-To: opencms-dev at opencms.org
>>>To: opencms-dev at opencms.org
>>>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but
>>>getting error
>>>Date: Mon, 08 Mar 2004 10:03:43 -0700
>>>
>>>
>>>Alex,
>>>
>>>I can't tell, from the stack trace, what is going on. Judging from where
>>>the exception is located, it looks like a problem with content defs...
>>>but that doesn't make sense....
>>>
>>>When you finish it, please do send it to Stephan and me. It sounds like a
>>>very useful addition to the existing indexing tools.
>>>
>>>Matt
>>>
>>>Alex ! wrote:
>>>
>>>>Hi,
>>>>
>>>>This one's probably for Matt/Stefan.
>>>>
>>>>I have written an XML Indexer for the Lucene module (almost finished),
>>>>which will basically take an XML file, parse it, and then add its
>>>>elements and their contents to the Lucene index, instead of stripping
>>>>the element tags and then including the remaining content as a single
>>>>searchable body (as is currently available).
>>>>
>>>>Everything is now compiled (into a separate jar, just 2 class files), the
>>>>cron job runs but gives the following error:
>>>>
>>>>[07.03.2004 14:20:10] <opencms_cronscheduler> Starting job for
>>>>com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators
>>>>net.grcomputing.opencms.search.lucene.CronIndexManager
>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/uk_lucene_registry.xml}
>>>>
>>>>
>>>>[07.03.2004 14:20:10] <opencms_info>
>>>>=====IndexManager=============================================================
>>>>
>>>>
>>>>[07.03.2004 14:20:10] <opencms_info> Analyzer:
>>>>org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>[07.03.2004 14:20:10] <opencms_info> Extension map exists to handle XML
>>>>[07.03.2004 14:20:10] <opencms_info> Page DocumentFactory loaded
>>>>[07.03.2004 14:20:10] <opencms_info> IndexManager: indexing /test/
>>>>[07.03.2004 14:20:11] <opencms_info> Created XMLDocumentHandlerSAX
>>>>[07.03.2004 14:20:11] <opencms_info> Return Document
>>>>[07.03.2004 14:20:11] <opencms_cronscheduler> Error running job for
>>>>com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators
>>>>net.grcomputing.opencms.search.lucene.CronIndexManager
>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>>Error: java.lang.NullPointerException
>>>> at org.apache.lucene.index.FieldInfos.add(FieldInfos.java:90)
>>>> at
>>>>org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:92)
>>>>
>>>> at
>>>>org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
>>>> at
>>>>org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
>>>> at
>>>>net.grcomputing.opencms.search.lucene.IndexManager.processFile(Unknown
>>>>Source)
>>>> at
>>>>net.grcomputing.opencms.search.lucene.IndexManager.processDir(Unknown
>>>>Source)
>>>> at
>>>>net.grcomputing.opencms.search.lucene.IndexManager.doIndex(Unknown
>>>>Source)
>>>> at
>>>>net.grcomputing.opencms.search.lucene.CronIndexManager.launch(Unknown
>>>>Source)
>>>> at
>>>>com.opencms.core.CmsCronScheduleJob.run(CmsCronScheduleJob.java:68)
>>>>
>>>>
>>>>My registry entry for the XML files looks like this (contained in
>>>>external registry file):
>>>>
>>>> <!-- For XML Files :) -->
>>>> <docFactory enabled="true" type="plain">
>>>> <fileType name="XML">
>>>> <extension>.xml</extension>
>>>>
>>>><class>com.mydomain.opencms.lucene.xmlindexing.XMLDocument</class>
>>>> </fileType>
>>>> </docFactory>
>>>>
>>>>Your help would be much appreciated.
>>>>
>>>>(should I send you the source to correct and include in your next
>>>>patch/update?)
>>>>
>>>>Many Thanks
>>>>
>>>>Alex
>>>>
>>>>
>>>>_______________________________________________
>>>>This mail is send to you from the opencms-dev mailing list
>>>>To change your list options, or to unsubscribe from the list, please
>>>>visit
>>>>http://mail.opencms.org/mailman/listinfo/opencms-dev
>>>
>>>
>>
>>
>