[opencms-dev] Developed an XML Indexer for Lucene but getting error - EOF
Alex !
kingofkingston at hotmail.com
Tue Mar 16 21:40:02 CET 2004
I only have 3 XML files in the test dir I'm trying to index. One of those
files is the one I am using in my JSP, and it works fine. See the code
snippet below. The XMLDocument constructor throws no exceptions, nor does
XMLDocument.Document(cmso,f).
It's got to be IndexManager. Maybe the document I am producing is not what it
is expecting? But then why the EOF?
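A note on the EOF question: an XML parser reports "Premature end of file" when the input it is handed is empty. One possible explanation, which is an assumption and not something confirmed in this thread, is that the CmsFile objects IndexManager passes to the factory carry no content, so f.getContents() yields a zero-length array. The sketch below is plain Java with no OpenCms classes. It uses the JDK's built-in SAX parser as a stand-in for org.apache.xerces.parsers.SAXParser, and parseError and hasContent are hypothetical helper names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class EmptyContentCheck {

    /** Returns null if the bytes parse cleanly, otherwise the parser's error message. */
    public static String parseError(byte[] contents) {
        try {
            // Assumption: the JDK parser stands in for the Xerces SAXParser used in the thread.
            XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DefaultHandler handler = new DefaultHandler();
            xr.setContentHandler(handler);
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            xr.setErrorHandler(handler);
            xr.parse(new InputSource(new ByteArrayInputStream(contents)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical guard a document factory could run before handing bytes to the parser. */
    public static boolean hasContent(byte[] contents) {
        return contents != null && contents.length > 0;
    }

    public static void main(String[] args) {
        byte[] full = "<article><title>t</title></article>".getBytes(StandardCharsets.UTF_8);
        byte[] empty = new byte[0];
        System.out.println("full:  " + parseError(full));
        System.out.println("empty: " + parseError(empty));
        System.out.println("guard rejects empty: " + !hasContent(empty));
    }
}
```

Logging f.getContents().length at the top of Document(cmso, f) during the cron run would confirm or rule this out.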
Inside my jsp:
<%
    CmsJspActionElement cmsJspAE = new CmsJspActionElement(pageContext, request, response);
    CmsObject cmso = cmsJspAE.getCmsObject();

    // Read the test file from the VFS and show its path
    CmsFile f = cmso.readFile("/test/test_xml.xml");
    String thepath = f.getAbsolutePath();
    out.println("<br>" + thepath + "<br><br>");

    XMLDocument xmldoc = null;
    Document thisdoc = null;
    try
    {
        // Build the Lucene Document the same way the indexer would
        xmldoc = new XMLDocument();
        out.println("<br>" + xmldoc.getFactoryName() + "<br>");
        thisdoc = xmldoc.Document(cmso, f);
    }
    catch (Exception e)
    {
        // Pass the caught exception itself so its stack trace is kept
        throw new CmsException(e.getMessage(), e);
    }

    String outdoc = thisdoc.toString();
    out.println("Lucene Document: <br><br>" + outdoc);
%>
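When a parser exception is wrapped in a CmsException, what gets passed as the cause decides whether the original stack trace survives into the log. The following is a minimal sketch in plain Java. WrappedException is a hypothetical stand-in for CmsException(String, Throwable), and the two helper methods are invented for illustration.

```java
public class CauseDemo {

    /** Hypothetical stand-in for CmsException(String, Throwable). */
    public static class WrappedException extends Exception {
        public WrappedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Forwards e.getCause(), which is null unless e itself wrapped something. */
    public static WrappedException wrapLosingTrace(Exception e) {
        return new WrappedException(e.getMessage(), e.getCause());
    }

    /** Forwards the caught exception itself, so its stack trace is kept. */
    public static WrappedException wrapKeepingTrace(Exception e) {
        return new WrappedException(e.getMessage(), e);
    }

    public static void main(String[] args) {
        Exception original = new IllegalStateException("Premature end of file.");
        // Prints "null": the origin of the error is no longer reachable.
        System.out.println(wrapLosingTrace(original).getCause());
        // Prints the original exception, which still carries its stack trace.
        System.out.println(wrapKeepingTrace(original).getCause());
    }
}
```

With the second form, a log that prints the cause chain can show which class actually raised the error, which helps answer whether the parser or the indexer is at fault.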
>From: M Butcher <mbutcher at grcomputing.net>
>Reply-To: opencms-dev at opencms.org
>To: opencms-dev at opencms.org
>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but getting
>error - EOF
>Date: Tue, 16 Mar 2004 11:09:57 -0700
>
>Alex ! wrote:
>>OK, Matt. So I had some input from my colleague, changed the XMLDocument
>>class (seems it wasn't done in the best way!) and have now tried calling
>>XMLDocument.Document(cmso,f) directly from a JSP - and it works: it returns
>>a Lucene Document, which I test by outputting it to the screen using the
>>Document.toString() method as before.
>>
>>But... the cron still returns the same premature end of file exception.
>
>On the same document? Do you know what is throwing the exception? Is it the
>XMLDocument constructor or the IndexManager?
>
>>
>>
>>Alex
>>
>>
>>>From: "Alex !" <kingofkingston at hotmail.com>
>>>Reply-To: opencms-dev at opencms.org
>>>To: opencms-dev at opencms.org
>>>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but
>>>getting error - EOF
>>>Date: Mon, 15 Mar 2004 22:28:34 +0000
>>>
>>>It seems to be the indexer. I have a class XMLDocument (implements
>>>I_FileDocumentFactory), which is based on bodyless document. Here I set
>>>up the XMLReader and instantiate an XMLDocumentHandlerSAX class (extends
>>>DefaultHandler).
>>>
>>>After some thorough debugging and testing, it seems to be the indexer, as I
>>>can call the XMLDocumentHandlerSAX from within a JSP and it works, returning
>>>a Lucene Document that I then print to screen using Document.toString().
>>>It all looks OK, although I haven't tried indexing it myself (I was
>>>counting on the module doing this).
>>>
>>>Could it be the XMLDocument class? Here is what it looks like:
>>>
>>>public class XMLDocument implements I_FileDocumentFactory
>>>{
>>>    public static final String FACTORY_NAME = "XML DocumentFactory";
>>>
>>>    public XMLDocument() { }
>>>
>>>    public String getFactoryName() {
>>>        return FACTORY_NAME;
>>>    }
>>>
>>>    public Document Document(CmsObject cmso, CmsFile f) throws CmsException
>>>    {
>>>        try
>>>        {
>>>            // Local variables only: a handler declared inside the try block
>>>            // would otherwise shadow a field of the same name, and the
>>>            // return statement would read the field, which is still null.
>>>            XMLDocumentHandlerSAX saxhdlr = new XMLDocumentHandlerSAX(cmso, f);
>>>
>>>            InputStream in = new ByteArrayInputStream(f.getContents());
>>>            InputSource is = new InputSource(in);
>>>
>>>            //in = (InputStream)(new ByteArrayInputStream(f.getContents()));
>>>            //is = new InputSource(in);
>>>
>>>            //is = new InputSource (new StringReader (xmlText));
>>>
>>>            XMLReader xr = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
>>>            xr.setContentHandler(saxhdlr);
>>>            xr.setFeature("http://xml.org/sax/features/validation", false);
>>>            xr.setFeature("http://apache.org/xml/features/continue-after-fatal-error", true);
>>>            xr.parse(is);
>>>
>>>            // Return the document built by the handler that did the parsing
>>>            return saxhdlr.getDocument();
>>>        }
>>>        catch (Exception e)
>>>        {
>>>            // Pass the caught exception itself so its stack trace is kept
>>>            throw new CmsException(e.getMessage(), e);
>>>        }
>>>    }
>>>
>>>    public Document Document(CmsObject cmso, CmsFile f, HashMap h) throws CmsException
>>>    {
>>>        return Document(cmso, f);
>>>    }
>>>}
>>>
>>>
>>>It seems the handler class returns what it should, so it is either the
>>>XMLDocument class or the indexer which is complaining. Should I send you
>>>the two src files? They're about as complete as they are gonna get...
>>>
>>>Cheers
>>>
>>>Alex
>>>
>>>
>>>>From: M Butcher <mbutcher at grcomputing.net>
>>>>Reply-To: opencms-dev at opencms.org
>>>>To: opencms-dev at opencms.org
>>>>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but
>>>>getting error - EOF
>>>>Date: Mon, 15 Mar 2004 13:53:17 -0700
>>>>
>>>>What is throwing the exception, the XML parser or the indexer? Last
>>>>week, I was working on my XSLT code and created some code that looks
>>>>almost exactly like yours (except I created a Transformer instead of an
>>>>XMLReader) and it worked fine -- perhaps the problem is in whatever gets
>>>>handed to the IndexManager.
>>>>
>>>>Matt
>>>>
>>>>Alex ! wrote:
>>>>
>>>>>OK, so I think I'm almost done, but now when the cron runs (yes, it has
>>>>>mysteriously begun working!), I get the following error, for a
>>>>>premature end of file. Any ideas? The way I am retrieving the file
>>>>>contents is as follows:
>>>>>
>>>>> in = new ByteArrayInputStream(f.getContents());
>>>>> is = new InputSource(in);
>>>>> xr.parse(is);
>>>>>
>>>>>where: private XMLReader xr
>>>>> private InputStream in
>>>>> private InputSource is
>>>>>
>>>>>
>>>>>Error output from OCMS log:
>>>>>
>>>>>[13.03.2004 06:58:10] <opencms_cronscheduler> Starting job for
>>>>>com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators
>>>>>net.grcomputing.opencms.search.lucene.CronIndexManager
>>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>>>
>>>>>
>>>>>[13.03.2004 06:58:10] <opencms_info>
>>>>>=====IndexManager=============================================================
>>>>>
>>>>>
>>>>>[13.03.2004 06:58:10] <opencms_info> Analyzer:
>>>>>org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>>[13.03.2004 06:58:10] <opencms_info> Extension map exists to handle XML
>>>>>[13.03.2004 06:58:10] <opencms_info> Page DocumentFactory loaded
>>>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/
>>>>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
>>>>>processing file test_xml.xml: com.opencms.core.CmsException: 0 Unknown
>>>>>exception. Detailed error: Premature end of file..
>>>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: indexing /test/xml/
>>>>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
>>>>>processing file article5.xml: com.opencms.core.CmsException: 0 Unknown
>>>>>exception. Detailed error: Premature end of file..
>>>>>[13.03.2004 06:58:10] <opencms_critical> IndexManager: CMS Error
>>>>>processing file article7.xml: com.opencms.core.CmsException: 0 Unknown
>>>>>exception. Detailed error: Premature end of file..
>>>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: 4 documents are
>>>>>being processed
>>>>>[13.03.2004 06:58:10] <opencms_info> IndexManager: Index has been
>>>>>optimized.
>>>>>[13.03.2004 06:58:10] <opencms_info> Done
>>>>>=====IndexManager=============================================================
>>>>>
>>>>>
>>>>>[13.03.2004 06:58:10] <opencms_cronscheduler> Successful launch of job
>>>>>com.opencms.core.CmsCronEntry{58 6 * * * admin Administrators
>>>>>net.grcomputing.opencms.search.lucene.CronIndexManager
>>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>>>Message: CronIndexManager rebuilt the Lucene index on Sat Mar 13
>>>>>06:58:10 GMT 2004
>>>>>
>>>>>
>>>>>Thanks alex
>>>>>
>>>>>
>>>>>>From: M Butcher <mbutcher at grcomputing.net>
>>>>>>Reply-To: opencms-dev at opencms.org
>>>>>>To: opencms-dev at opencms.org
>>>>>>Subject: Re: [opencms-dev] Developed an XML Indexer for Lucene but
>>>>>>getting error
>>>>>>Date: Mon, 08 Mar 2004 10:03:43 -0700
>>>>>>
>>>>>>
>>>>>>Alex,
>>>>>>
>>>>>>I can't tell, from the stack trace, what is going on. Judging from
>>>>>>where the exception is located, it looks like a problem with content
>>>>>>defs... but that doesn't make sense....
>>>>>>
>>>>>>When you finish it, please do send it to Stephan and me. It sounds like
>>>>>>a very useful addition to the existing indexing tools.
>>>>>>
>>>>>>Matt
>>>>>>
>>>>>>Alex ! wrote:
>>>>>>
>>>>>>>Hi,
>>>>>>>
>>>>>>>this one's probably for Matt/Stefan.
>>>>>>>
>>>>>>>I have written an XML Indexer for the Lucene module (almost
>>>>>>>finished), which will basically take an XML file, parse it, and then
>>>>>>>add its elements and their contents to the Lucene index, instead of
>>>>>>>stripping the element tags and then including the remaining content as
>>>>>>>a single searchable body (as is currently available).
>>>>>>>
>>>>>>>Everything is now compiled (into a separate jar, just 2 class files),
>>>>>>>the cron job runs but gives the following error:
>>>>>>>
>>>>>>>[07.03.2004 14:20:10] <opencms_cronscheduler> Starting job for
>>>>>>>com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators
>>>>>>>net.grcomputing.opencms.search.lucene.CronIndexManager
>>>>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/uk_lucene_registry.xml}
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>[07.03.2004 14:20:10] <opencms_info>
>>>>>>>=====IndexManager=============================================================
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>[07.03.2004 14:20:10] <opencms_info> Analyzer:
>>>>>>>org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>>>>[07.03.2004 14:20:10] <opencms_info> Extension map exists to handle
>>>>>>>XML
>>>>>>>[07.03.2004 14:20:10] <opencms_info> Page DocumentFactory loaded
>>>>>>>[07.03.2004 14:20:10] <opencms_info> IndexManager: indexing /test/
>>>>>>>[07.03.2004 14:20:11] <opencms_info> Created XMLDocumentHandlerSAX
>>>>>>>[07.03.2004 14:20:11] <opencms_info> Return Document
>>>>>>>[07.03.2004 14:20:11] <opencms_cronscheduler> Error running job for
>>>>>>>com.opencms.core.CmsCronEntry{20 14 * * * admin Administrators
>>>>>>>net.grcomputing.opencms.search.lucene.CronIndexManager
>>>>>>>createIndex=true,registry=C:/dev/java/tomcat-4.1.27/webapps/opencms/WEB-INF/config/epfolio_uk_lucene_registry.xml}
>>>>>>>Error: java.lang.NullPointerException
>>>>>>> at org.apache.lucene.index.FieldInfos.add(FieldInfos.java:90)
>>>>>>> at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:92)
>>>>>>> at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
>>>>>>> at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
>>>>>>> at net.grcomputing.opencms.search.lucene.IndexManager.processFile(Unknown Source)
>>>>>>> at net.grcomputing.opencms.search.lucene.IndexManager.processDir(Unknown Source)
>>>>>>> at net.grcomputing.opencms.search.lucene.IndexManager.doIndex(Unknown Source)
>>>>>>> at net.grcomputing.opencms.search.lucene.CronIndexManager.launch(Unknown Source)
>>>>>>> at com.opencms.core.CmsCronScheduleJob.run(CmsCronScheduleJob.java:68)
>>>>>>>
>>>>>>>
>>>>>>>My registry entry for the XML files looks like this (contained in an
>>>>>>>external registry file):
>>>>>>>
>>>>>>> <!-- For XML Files :) -->
>>>>>>> <docFactory enabled="true" type="plain">
>>>>>>>   <fileType name="XML">
>>>>>>>     <extension>.xml</extension>
>>>>>>>     <class>com.mydomain.opencms.lucene.xmlindexing.XMLDocument</class>
>>>>>>>   </fileType>
>>>>>>> </docFactory>
>>>>>>>
>>>>>>>Your help would be much appreciated.
>>>>>>>
>>>>>>>(should I send you the source to correct and include in your next
>>>>>>>patch/update?)
>>>>>>>
>>>>>>>Many Thanks
>>>>>>>
>>>>>>>Alex
>>>>>>>
>>>>>>>_______________________________________________
>>>>>>>This mail is send to you from the opencms-dev mailing list
>>>>>>>To change your list options, or to unsubscribe from the list, please
>>>>>>>visit
>>>>>>>http://mail.opencms.org/mailman/listinfo/opencms-dev