<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2716.2200" name=GENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=#ffffff>
<DIV><FONT face=Arial size=2>Hello </FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Thanks for the previous reply.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Now I use:</FONT></DIV>
<DIV><FONT face=Arial size=2>- version 1.4 of the Lucene search module (the
version attached to this list)</FONT></DIV>
<DIV><FONT face=Arial size=2>- the new registry.xml format for the module
(as you described)</FONT></DIV>
<DIV><FONT face=Arial size=2>- the PDF files are stored with the binary
type.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>But I have the following problem:</FONT></DIV>
<DIV><FONT face=Arial size=2>I can't create an InputStream for the CmsFile
content.</FONT></DIV>
<DIV><FONT face=Arial size=2>To do this, I wrote the following code in the Document
method of my class PDFDocument:</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>-----------------</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>InputStream in = </FONT><B><FONT color=#7f0055
size=2>new</B></FONT><FONT size=2> ByteArrayInputStream(f.getContents()); // f is
the CmsFile parameter of the Document method</DIV>
<DIV>
<P></P>
<P>PDFExtractor extractor = </FONT><B><FONT color=#7f0055
size=2>new</B></FONT><FONT size=2> PDFExtractor(); // PDFExtractor is the library I use;
it works fine on the file system. </P>
<P></P></FONT><FONT size=2>
<P>bodyText = extractor.extractText(in);</P>
<P>----------------</P></FONT></DIV>
<DIV><FONT face=Arial size=2>Is it correct to use a ByteArrayInputStream to create an
InputStream for a CmsFile?</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>The error occurs in the third line,</FONT></DIV>
<DIV><FONT face=Arial size=2>inside PDFParcer.</FONT></DIV>
<DIV><FONT face=Arial size=2>The error message in Tomcat is:</FONT></DIV><FONT
face=Arial size=2>
<DIV><BR>java.io.IOException: Error: Header is corrupt ''</DIV>
<DIV>at PDFParcer.parse</DIV>
<DIV>at PDFExtractor.extractText</DIV>
<DIV>at PDFDocument.Document (my class)</DIV>
<DIV>at.....</DIV>
<DIV></FONT> </DIV>
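<DIV><FONT face=Arial size=2>Note: the empty quotes in "Header is corrupt ''" may indicate that the parser received no bytes at all. Below is a minimal Java sketch, assuming only that getContents() returns a byte[]. It checks for the "%PDF-" marker that every PDF file starts with before wrapping the raw bytes in a ByteArrayInputStream. The names PdfHeaderCheck, looksLikePdf and open are hypothetical and are not part of OpenCms or of the PDFExtractor library.</FONT></DIV>
<PRE>
```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of OpenCms or PDFExtractor.
final class PdfHeaderCheck {

    // Every PDF file begins with the ASCII marker "%PDF-".
    private static final String PDF_MAGIC = "%PDF-";

    // Returns true when the bytes are present and start with the PDF marker.
    static boolean looksLikePdf(byte[] content) {
        if (content == null) {
            return false;
        }
        // Only decode the first few bytes; ISO-8859-1 maps each byte to one char.
        int n = Math.min(content.length, PDF_MAGIC.length());
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1);
        return head.equals(PDF_MAGIC);
    }

    // Wraps the raw bytes directly, with no String conversion in between.
    static ByteArrayInputStream open(byte[] content) {
        if (!looksLikePdf(content)) {
            throw new IllegalArgumentException("content is empty or not a PDF");
        }
        return new ByteArrayInputStream(content);
    }
}
```
</PRE>
<DIV><FONT face=Arial size=2>If looksLikePdf returns false for f.getContents(), the problem would more likely lie in how the file content was loaded than in the choice of stream class.</FONT></DIV>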
<DIV><FONT face=Arial size=2>Bye, and thanks.</FONT></DIV>
<DIV><FONT face=Arial size=2>Ernesto.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV>----- Original Message ----- </DIV>
<BLOCKQUOTE dir=ltr
style="PADDING-RIGHT: 0px; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #000000 2px solid; MARGIN-RIGHT: 0px">
<DIV
style="BACKGROUND: #e4e4e4; FONT: 10pt arial; font-color: black"><B>From:</B>
<A title=hartmann@waehrisch-feykes.de
href="mailto:hartmann@waehrisch-feykes.de">Hartmann, Waehrisch & Feykes
GmbH</A> </DIV>
<DIV style="FONT: 10pt arial"><B>To:</B> <A title=opencms-dev@opencms.org
href="mailto:opencms-dev@opencms.org">opencms-dev@opencms.org</A> </DIV>
<DIV style="FONT: 10pt arial"><B>Sent:</B> Friday, October 24, 2003 4:45
AM</DIV>
<DIV style="FONT: 10pt arial"><B>Subject:</B> Re: [opencms-dev] Index pdf
files with your content in lucene.</DIV>
<DIV><FONT face=Arial size=2></FONT><BR></DIV>
<DIV><FONT face=Arial size=2>Hello Ernesto,</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>I assume you are using the unpatched version 1.3
of the search module.</FONT></DIV>
<DIV><FONT face=Arial size=2>As I mentioned yesterday, the plainDocFactory
only indexes cmsFiles of type "plain", not of type "binary". PDF files
are stored as binary.</FONT></DIV>
<DIV><FONT face=Arial size=2>I suggest using the version I posted yesterday.
Your registry.xml would then have to look like this:</FONT></DIV>
<DIV><FONT face=Arial size=2>...</FONT></DIV>
<DIV><FONT face=Arial size=2>&lt;docFactories&gt;</FONT></DIV>
<DIV><FONT face=Arial size=2>...</FONT></DIV>
<DIV><FONT face=Arial size=2>&nbsp;&nbsp;&nbsp; &lt;docFactory type="plain"
enabled="true"&gt;</FONT></DIV>
<DIV><FONT face=Arial size=2>...</FONT></DIV>
<DIV><FONT face=Arial size=2>&nbsp;&nbsp;&nbsp; &lt;/docFactory&gt;</FONT></DIV>
<DIV><FONT face=Arial size=2>&nbsp;&nbsp;&nbsp; &lt;docFactory type="binary"
enabled="true"&gt;</FONT></DIV>
<DIV><FONT face=Arial size=2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;fileType
name="pdftext"&gt;<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;extension&gt;.pdf&lt;/extension&gt;<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;class&gt;net.grcomputing.opencms.search.lucene.PDFDocument&lt;/class&gt;<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;/fileType&gt;</FONT></DIV>
<DIV><FONT face=Arial size=2>&nbsp;&nbsp;&nbsp; &lt;/docFactory&gt;</FONT></DIV>
<DIV><FONT face=Arial size=2>...</FONT></DIV>
<DIV><FONT face=Arial size=2>&lt;/docFactories&gt;</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Important: The type attribute must match the file
types of OpenCms (also defined in the registry.xml).</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>Bye,</FONT></DIV>
<DIV><FONT face=Arial size=2>Stephan</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<BLOCKQUOTE
style="PADDING-RIGHT: 0px; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #000000 2px solid; MARGIN-RIGHT: 0px">
<DIV style="FONT: 10pt arial">----- Original Message ----- </DIV>
<DIV
style="BACKGROUND: #e4e4e4; FONT: 10pt arial; font-color: black"><B>From:</B>
<A title=edesantis@fibertel.com.ar
href="mailto:edesantis@fibertel.com.ar">Ernesto De Santis</A> </DIV>
<DIV style="FONT: 10pt arial"><B>To:</B> <A
title=lucene-user@jakarta.apache.org
href="mailto:lucene-user@jakarta.apache.org">Lucene Users List</A> </DIV>
<DIV style="FONT: 10pt arial"><B>Cc:</B> <A title=opencms-dev@opencms.org
href="mailto:opencms-dev@opencms.org">opencms-dev@opencms.org</A> </DIV>
<DIV style="FONT: 10pt arial"><B>Sent:</B> Thursday, October 23, 2003 4:16
PM</DIV>
<DIV style="FONT: 10pt arial"><B>Subject:</B> [opencms-dev] Index pdf files
with your content in lucene.</DIV>
<DIV><BR></DIV>
<DIV><FONT face=Arial size=2>Hello</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>I am new to OpenCms and Lucene technology.
</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>I want to index PDF files, including the content of
these files.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>This is how I am working:</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>I created a PDFDocument class modeled on the
JspDocument class. </FONT></DIV>
<DIV><FONT face=Arial size=2>It uses the org.textmining.text.extraction.PDFExtractor
class, which works fine outside the VFS.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>I also added an entry for PDF documents to my registry.xml,
inside the plainDocFactory tag.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
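<DIV><FONT face=Arial
size=2>&nbsp;&nbsp;&nbsp;&nbsp; &lt;fileType name="pdftext"&gt;<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;extension&gt;.pdf&lt;/extension&gt;<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;!-- This will strip tags before processing --&gt;<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;class&gt;net.grcomputing.opencms.search.lucene.PDFDocument&lt;/class&gt;<BR>
&nbsp;&nbsp;&nbsp;&nbsp; &lt;/fileType&gt;<BR></FONT></DIV>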
<DIV><FONT face=Arial size=2>My PDFDocument class contains the code below.</FONT></DIV>
<DIV><FONT face=Arial size=2>I think the problem is how to get the
content from the CmsFile. Which InputStream should I use?</FONT></DIV>
<DIV>
<DIV><FONT face=Arial size=2>PDFExtractor works through its extractText(InputStream)
method.</FONT></DIV></DIV>
<DIV><FONT face=Arial size=2><B><FONT color=#7f0055
size=2></FONT></B></FONT> </DIV>
<DIV><FONT face=Arial size=2><B><FONT color=#7f0055
size=2>public</B></FONT><FONT size=2> </FONT><B><FONT color=#7f0055
size=2>class</B></FONT><FONT size=2> PDFDocument </FONT><B><FONT
color=#7f0055 size=2>implements</B></FONT><FONT size=2> I_DocumentConstants,
I_DocumentFactory {</DIV>
<DIV>
<P></P>
<P></FONT><B><FONT color=#7f0055 size=2>public</B></FONT><FONT size=2>
PDFDocument(){</P>
<P>}</P>
<P></P>
<P></FONT><B><FONT color=#7f0055 size=2>public</B></FONT><FONT size=2>
Document Document(CmsObject cmsobject, CmsFile cmsfile)</P>
<P></FONT><B><FONT color=#7f0055 size=2>throws</B></FONT><FONT size=2>
CmsException </P>
<P>{</P>
<P></FONT><B><FONT color=#7f0055 size=2>return</B></FONT><FONT size=2>
Document(cmsobject, cmsfile, </FONT><B><FONT color=#7f0055
size=2>null</B></FONT><FONT size=2>);</P>
<P>}</P>
<P></FONT><B><FONT color=#7f0055 size=2>public</B></FONT><FONT size=2>
Document Document(CmsObject cmsobject, CmsFile cmsfile, HashMap hashmap)</P>
<P></FONT><B><FONT color=#7f0055 size=2>throws</B></FONT><FONT size=2>
CmsException</P>
<P>{</P>
<P>Document document=(</FONT><B><FONT color=#7f0055
size=2>new</B></FONT><FONT size=2> BodylessDocument()).Document(cmsobject,
cmsfile);</P>
<P></P>
<P></FONT><FONT color=#3f7f5f size=2>// add the content of the PDF
file to the document</P></FONT><FONT size=2>
<P>String contenido = </FONT><B><FONT color=#7f0055
size=2>new</B></FONT><FONT size=2> String(cmsfile.getContents());</P>
<P>StringBufferInputStream in = </FONT><B><FONT color=#7f0055
size=2>new</B></FONT><FONT size=2>
StringBufferInputStream(contenido);</P></FONT><FONT color=#3f7f5f size=2>
<P>// ByteArrayInputStream in = new
ByteArrayInputStream(contenido.getBytes());</P></FONT><FONT size=2>
<P></P></FONT><FONT color=#3f7f5f size=2>
<P>/* try{</P>
<P>FileInputStream in = new FileInputStream (cmsfile.getPath() +
cmsfile.getName());</P>
<P>*/</P></FONT><FONT size=2>
<P>PDFExtractor extractor = </FONT><B><FONT color=#7f0055
size=2>new</B></FONT><FONT size=2> PDFExtractor();</P>
<P>String body = extractor.extractText(in);</P>
<P></P>
<P>document.add(Field.Text(</FONT><FONT color=#2a00ff
size=2>"body"</FONT><FONT size=2>, body));</P></FONT><FONT color=#3f7f5f
size=2>
<P>/* }catch(FileNotFoundException e){</P>
<P>e.toString();</P>
<P>throw new CmsException();</P>
<P>}</P>
<P></P>
<P>*/</FONT><FONT size=2> </P>
<P></FONT><B><FONT color=#7f0055 size=2>return</B></FONT><FONT size=2>
(document);</P>
<P>}</P></FONT></FONT></DIV>
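<DIV><FONT face=Arial size=2>A small Java sketch of why the String detour in the code above can damage binary data: decoding arbitrary bytes into a String and encoding them back is not guaranteed to return the same bytes. UTF-8 is used here as an assumed platform charset purely for illustration, and StringRoundTripDemo is a hypothetical name. The actual default charset on a given server may differ, and with some charsets the round trip happens to be lossless.</FONT></DIV>
<PRE>
```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of OpenCms or Lucene.
final class StringRoundTripDemo {

    // Decodes the bytes into a String and encodes them back again,
    // using UTF-8 as an assumed stand-in for the platform default charset.
    static byte[] roundTrip(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // True only when the round trip returns exactly the original bytes.
    static boolean survivesRoundTrip(byte[] raw) {
        return Arrays.equals(raw, roundTrip(raw));
    }

    public static void main(String[] args) {
        byte[] ascii = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        // Byte values of 0x80 and above are common in binary PDF data.
        byte[] binary = { '%', 'P', 'D', 'F', (byte) 0xE2, (byte) 0xE3 };
        System.out.println(survivesRoundTrip(ascii));  // true
        System.out.println(survivesRoundTrip(binary)); // false
    }
}
```
</PRE>
<DIV><FONT face=Arial size=2>Passing cmsfile.getContents() straight to a ByteArrayInputStream avoids the conversion altogether.</FONT></DIV>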
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>thanks<BR>Ernesto</FONT></DIV>
<DIV><FONT face=Arial size=2>PS: Sorry for my poor English.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>----- Original Message ----- </FONT>
<DIV><FONT face=Arial size=2>From: "Hartmann, Waehrisch & Feykes GmbH"
<</FONT><A href="mailto:hartmann@waehrisch-feykes.de"><FONT face=Arial
size=2>hartmann@waehrisch-feykes.de</FONT></A><FONT face=Arial
size=2>></FONT></DIV>
<DIV><FONT face=Arial size=2>To: <</FONT><A
href="mailto:opencms-dev@opencms.org"><FONT face=Arial
size=2>opencms-dev@opencms.org</FONT></A><FONT face=Arial
size=2>></FONT></DIV>
<DIV><FONT face=Arial size=2>Sent: Wednesday, October 22, 2003 3:50
AM</FONT></DIV>
<DIV><FONT face=Arial size=2>Subject: Re: [opencms-dev] (no
subject)</FONT></DIV></DIV>
<DIV><FONT face=Arial><BR><FONT size=2></FONT></FONT></DIV><FONT face=Arial
size=2>> Hi Ben,<BR>> <BR>> i think this won't work since the
plainDocFactory will only be used for<BR>> files of type "plain" but not
for files of type "binary".<BR>> Recently we have done some additions to
the module - by order of Lenord,<BR>> Bauer & Co. GmbH - that could
meet your needs. It introduces a more flexible<BR>> way of defining
docFactories that you can add new factories without having<BR>> to
recompile the whole module. So other modules (like the news) can
bring<BR>> their own docFactory and all you have to do is to edit the
registry.xml.<BR>> Here is an example:<BR>> <BR>>
<docFactories><BR>>
<docFactory enabled="true" type="plain"><BR>>
<fileType name="plaintext"><BR>>
<extension>.txt</extension><BR>> <BR>>
<class>net.grcomputing.opencms.search.lucene.PlainDocument</class><BR>>
</fileType><BR>>
</docFactory><BR>>
<docFactory enabled="true" type="news"><BR>> <BR>>
<class>net.grcomputing.opencms.search.lucene.NewsDocument</class><BR>>
</docFactory><BR>>
</docFactories><BR>> <BR>> To index binary files all you need to
add is this:<BR>> <BR>>
<docFactory
enabled="true" type="binary"><BR>> <BR>>
<class>net.grcomputing.opencms.search.lucene.BodylessDocument</class><BR>>
</docFactory><BR>> <BR>> There should be no need for an
extension mapping.<BR>> <BR>> For the interested people:<BR>> For
ContentDefinitions (like news) i introduced the following:<BR>>
<contentDefinitions><BR>>
<contentDefinition type="news"> <!-- must match docFactory<BR>>
type --><BR>> <BR>>
<class>com.opencms.modules.homepage.news.NewsContentDefinition</class><BR>>
<BR>>
<initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initCla<BR>>
ss><BR>>
<listMethod name="getNewsList"><BR>>
<param type="java.lang.Integer">1</param><BR>>
<param type="java.lang.String">-1</param><BR>>
</listMethod><BR>>
<page uri="/news.html?__element=entry"><BR>>
<param method="getIntId" name="newsid"/><BR>>
</page><BR>>
</contentDefinition><BR>> <BR>> In short:<BR>> initClass is
optional: For the news the news classes have to be loaded to<BR>>
initialize the db pool.<BR>> listMethod: a method of the content
definition class that returns a List of<BR>> elements<BR>> page: the
page that can display an entry. Here a jsp that has a template<BR>>
element "entry". It also needs the id of the news item.<BR>> getIntId is
a method of the content definition class and newsid is the url<BR>>
parameter the page needs. A link like<BR>>
news.html?__element=entry&newsid=xy<BR>> will be generated.<BR>>
<BR>> Best regards,<BR>> Stephan<BR>> <BR>> <BR>> -----
Original Message ----- <BR>> From: "Ben Rometsch" <</FONT><A
href="mailto:ben@solidstategroup.com"><FONT face=Arial
size=2>ben@solidstategroup.com</FONT></A><FONT face=Arial
size=2>><BR>> To: <</FONT><A
href="mailto:opencms-dev@opencms.org"><FONT face=Arial
size=2>opencms-dev@opencms.org</FONT></A><FONT face=Arial
size=2>><BR>> Sent: Wednesday, October 22, 2003 6:15 AM<BR>>
Subject: [opencms-dev] (no subject)<BR>> <BR>> <BR>> > Hi
Matt,<BR>> ><BR>> > I am not having any joy! I've updated my
registry.xml file, with the<BR>> > appropriate section
reading:<BR>> ><BR>> > <luceneSearch><BR>> >
<mergeFactor>100000</mergeFactor><BR>> >
<permCheck>true</permCheck><BR>> >
<indexDir>c:\search</indexDir><BR>> ><BR>> >
<analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer><BR>>
> <subsearch>true</subsearch><BR>> >
<project>online</project><BR>> >
<docFactories><BR>> > <pageDocFactory
enabled="true"><BR>> ><BR>> >
<class>net.grcomputing.opencms.search.lucene.PageDocument</class><BR>>
> </pageDocFactory><BR>> > <plainDocFactory
enabled="true"><BR>> > <fileType name="plaintext"><BR>>
> <extension>.txt</extension><BR>> ><BR>> >
<class>net.grcomputing.opencms.search.lucene.PlainDocument</class><BR>>
> </fileType><BR>> > <fileType
name="taggedtext"><BR>> >
<extension>.html</extension><BR>> >
<extension>.htm</extension><BR>> >
<extension>.xml</extension><BR>> > <!-- This will strip
tags before processing<BR>> > --><BR>> ><BR>> >
<class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class><BR>>
> </fileType><BR>> ><BR>> > <!-- Index binary
documents --><BR>> > <fileType name="plaindocument"><BR>>
> <extension>.doc</extension><BR>> >
<extension>.xls</extension><BR>> >
<extension>.pdf</extension><BR>> ><BR>> >
<class>net.grcomputing.opencms.search.lucene.BodylessDocument</class><BR>>
> </fileType><BR>> ><BR>> >
</plainDocFactory><BR>> > <jspDocFactory
enabled="true"><BR>> ><BR>> >
<class>net.grcomputing.opencms.search.lucene.JspDocument</class><BR>>
> </jspDocFactory><BR>> > <xmlTemplateDocFactory
enabled="false"/><BR>> > </docFactories><BR>> >
<directories><BR>> > <directory
location="/release/"><BR>> >
<section>Test</section><BR>> >
<subsearch>true</subsearch><BR>> >
</directory><BR>> > <directory
location="/RGLIntranet/"><BR>> >
<section>Test2</section><BR>> >
<subsearch>true</subsearch><BR>> >
</directory><BR>> > </directories><BR>> >
</luceneSearch><BR>> ><BR>> > Notice the section beginning
after the remark "Index binary documents".<BR>> ><BR>> > But I
cannot get any hits when searching for document names that are in<BR>>
the<BR>> > VFS. The other (HTML) searches are working ok. Is the
"name" property of<BR>> the<BR>> > fileType tag important? I wasn't
sure what to add here...I'm not quite<BR>> sure<BR>> > how to move
forward. Maybe it would be an idea to add some debugging trace<BR>> >
to the BodylessDocument class to see what is going on inside it? I want
to<BR>> > make sure my XML is correct first tho!<BR>> ><BR>>
> Thanks for the help,<BR>> > Ben<BR>> ><BR>> ><BR>>
> On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote:<BR>> > > Hi
Matt,<BR>> > ><BR>> > > Thanks for the reply. If I just
want to get the document title to be<BR>> > > included in the
Lucene index, looking at the code in the<BR>> > >
net.grcomputing.opencms.search.BodylessDocument class it appears to<BR>>
ignore<BR>> > > what the CMSObject is, and attempt to index it
regardless. Is this<BR>> > correct?<BR>> > ><BR>>
><BR>> > Correct. It will already index the title, but it will not
attempt to<BR>> > index the body.<BR>> ><BR>> > > If
this is the case, is it simply a matter of instructing Lucene to<BR>>
index<BR>> > > objects other than HTML files in the VFS (i.e.
Documents) ? Or would I<BR>> > have<BR>> > > to create
another class, something like<BR>> > >
net.grcomputing.opencms.search.FileDocument and add a new hook into
that<BR>> > > class via the registry.xml fragment? Or does
the BodyLess document<BR>> > provide<BR>> > > this
functionality, and it's just a matter of adding a new XML fragment<BR>>
to<BR>> > > the registry.xml area?<BR>> ><BR>> > Again,
you are right -- simply adding the appropriate configuration to<BR>> >
the registry.xml file will suffice. I believe that you will just need
to<BR>> > extend the plainDocument tag set to include extensions and
processors...<BR>> > I _think_ that binary files get handled by the
plain handler.<BR>> ><BR>> > Matt<BR>> ><BR>> >
<BR>> <BR>> Stephan Hartmann<BR>>
Unternehmensberatung Währisch & Feykes GmbH<BR>> Gustav-Adolf-Str.
5<BR>> 47057 Duisburg<BR>> <BR>> Tel.: 0203-373070<BR>> Fax:
0203-376766<BR>> E-Mail: </FONT><A href="mailto:hartmann@wfnetz.de"><FONT
face=Arial size=2>hartmann@wfnetz.de</FONT></A><BR><FONT face=Arial
size=2>> Internet: </FONT><A href="http://www.wfnetz.de"><FONT face=Arial
size=2>www.wfnetz.de</FONT></A><BR><FONT face=Arial size=2>> <BR>>
E-mails sent over the Internet can be created under a false name
or<BR>> manipulated. For this reason, the messages we send by
e-mail<BR>> never contain legally binding declarations of
intent.<BR>>
</FONT></BLOCKQUOTE></BLOCKQUOTE></BODY></HTML>