[opencms-dev] Index pdf files with your content in lucene.

Wed Oct 29 16:32:01 CET 2003

Hello all,

Thans very much Stephan for your valuable help.
Attached you will find the PDFDocument, and WordDocument class source code

Ernesto.

----- Original Message -----
From: "Hartmann, Waehrisch & Feykes GmbH" <hartmann at waehrisch-feykes.de>
To: <opencms-dev at opencms.org>
Sent: Tuesday, October 28, 2003 11:10 AM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.

> Hi Ernesto,
>
> the IndexManager retrieves a list of files of a folder by calling the
method
> getFilesInFolder of CmsObject. This method returns only empty files, i.e.
> with empty content. To get the content of a pdf file you have to reread
the
> file:
> f = cms.readFile(f.getAbsolutePath());
>
> Bye,
> Stephan
>
> Am Montag, 27. Oktober 2003 19:18 schrieben Sie:
>
> > > Hello
> >
> > Thanks for the previous reply.
> >
> > Now, i use
> > - version 1.4 of lucene searche module. (the version attached in this
list)
> > - new version of registry.xml format for module. (like you write me)
> > - the pdf files are stored with the binary type.
> >
> > But i have the next problem:
> > i can´t make a InputStream for the cmsfile content.
> > For this i write this code in de Document method of my class
PDFDocument:
> >
> > -----------------
> >
> > InputStream in = new ByteArrayInputStream(f.getContents()); //f is the
> > parameter CmsFile of the Document method
> >
> > PDFExtractor extractor = new PDFExtractor(); //PDFExtractor is lib i
use.
> > in file system work fine.
> >
> >
> > bodyText = extractor.extractText(in);
> >
> > ----------------
> >
> > Is correct use ByteArrayInputStream for make a InputStream for a
CmsFile?
> >
> > The error ocurr in the third line.
> > In the PDFParcer.
> > the error menssage in tomcat is:
> >
> > java.io.IOException: Error: Header is corrupt ''
> > at PDFParcer.parse
> > at PDFExtractor.extractText
> > at PDFDocument.Document (my class)
> > at.....
> >
> > By, and thanks.
> > Ernesto.
> >
> >
> > ----- Original Message -----
> >   From: Hartmann, Waehrisch & Feykes GmbH
> >   To: opencms-dev at opencms.org
> >   Sent: Friday, October 24, 2003 4:45 AM
> >   Subject: Re: [opencms-dev] Index pdf files with your content in
lucene.
> >
> >
> >   Hello Ernesto,
> >
> >   i assume you are using the unpatched version 1.3 of the search module.
> >   As i mentioned yesterday, the plainDocFactory does only index cmsFiles
of
> > type "plain" but not of type "binary". PDF files are stored as binary. I
> > suggest to use the version i posted yesterday. Then your registry.xml
would
> > have to look like this: ...
> >   <docFactories>
> >   ...
> >      <docFactory type="plain" enabled="true">
> >   ...
> >      </docFactory>
> >      <docFactory type="binary" enabled="true">
> >         <fileType name="pdftext">
> >            <extension>.pdf</extension>
> >
<class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
> >         </fileType>
> >      </docFactory>
> >   ...
> >   </docFactories>
> >
> >   Important: The type attribute must match the file types of OpenCms
(also
> > defined in the registry.xml).
> >
> >   Bye,
> >   Stephan
> >
> >     ----- Original Message -----
> >     From: Ernesto De Santis
> >     To: Lucene Users List
> >     Cc: opencms-dev at opencms.org
> >     Sent: Thursday, October 23, 2003 4:16 PM
> >     Subject: [opencms-dev] Index pdf files with your content in lucene.
> >
> >
> >     Hello
> >
> >     I am new in opencms and lucene tecnology.
> >
> >     I won index pdf files, and index de content of this files.
> >
> >     I work in this way:
> >
> >     Make a PDFDocument class like JspDocument class.
> >     use org.textmining.text.extraction.PDFExtractor class, this class
work
> > fine out of vfs.
> >
> >     and write my registry.xml for pdf document, in plainDocFactory tag.
> >
> >                         <fileType name="pdftext">
> >                             <extension>.pdf</extension>
> >                             <!-- This will strip tags before
processing -->
> >
> > <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
> > </fileType>
> >
> >     my PDFDocument content this code:
> >     I think that the probrem is how take the content from CmsFile?, what
> > InputStream use? PDFExtractor work with extractText(InputStream) method.
> >
> >     public class PDFDocument implements I_DocumentConstants,
> > I_DocumentFactory {
> >
> >     public PDFDocument(){
> >
> >     }
> >
> >
> >     public Document Document(CmsObject cmsobject, CmsFile cmsfile)
> >
> >     throws CmsException
> >
> >     {
> >
> >     return Document(cmsobject, cmsfile, null);
> >
> >     }
> >
> >     public Document Document(CmsObject cmsobject, CmsFile cmsfile,
HashMap
> > hashmap)
> >
> >     throws CmsException
> >
> >     {
> >
> >     Document document=(new BodylessDocument()).Document(cmsobject,
> > cmsfile);
> >
> >
> >     //put de content in the pdf file.
> >
> >     String contenido = new String(cmsfile.getContents());
> >
> >     StringBufferInputStream in = new StringBufferInputStream(contenido);
> >
> >     // ByteArrayInputStream in = new
> > ByteArrayInputStream(contenido.getBytes());
> >
> >
> >     /* try{
> >
> >     FileInputStream in = new FileInputStream (cmsfile.getPath() +
> > cmsfile.getName());
> >
> >     */
> >
> >     PDFExtractor extractor = new PDFExtractor();
> >
> >     String body = extractor.extractText(in);
> >
> >
> >     document.add(Field.Text("body", body));
> >
> >     /* }catch(FileNotFoundException e){
> >
> >     e.toString();
> >
> >     throw new CmsException();
> >
> >     }
> >
> >
> >     */
> >
> >     return (document);
> >
> >     }
> >
> >
> >     thanks
> >     Ernesto
> >     PD: Sorry for my poor english.
> >
> >
> >
> >
> >     ----- Original Message -----
> >     From: "Hartmann, Waehrisch & Feykes GmbH"
> > <hartmann at waehrisch-feykes.de> To: <opencms-dev at opencms.org>
> >     Sent: Wednesday, October 22, 2003 3:50 AM
> >     Subject: Re: [opencms-dev] (no subject)
> >
> >     > Hi Ben,
> >     >
> >     > i think this won't work since the plainDocFactory will only be
used
> >     > for files of type "plain" but not for files of type "binary".
> >     > Recently we have done some additions to the module - by order of
> >     > Lenord, Bauer & Co. GmbH - that could meet your needs. It
introduces
> >     > a more flexible way of defining docFactories that you can add new
> >     > factories without having to recompile the whole module. So other
> >     > modules (like the news) can bring their own docFactory and all you
> >     > have to do is to edit the registry.xml. Here is an example:
> >     >
> >     >             <docFactories>
> >     >                 <docFactory enabled="true" type="plain">
> >     >                     <fileType name="plaintext">
> >     >                         <extension>.txt</extension>
> >     >
> >     > <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> >     >                     </fileType>
> >     >                 </docFactory>
> >     >                 <docFactory enabled="true" type="news">
> >     >
> >     > <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
> >     >                 </docFactory>
> >     >             </docFactories>
> >     >
> >     > To index binary files all you need to add is this:
> >     >
> >     >            <docFactory enabled="true" type="binary">
> >     >
> >     >
<class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> >     >            </docFactory>
> >     >
> >     > There should be no need for an extension mapping.
> >     >
> >     > For the interested people:
> >     > For ContentDefinitions (like news) i introduced the following:
> >     >             <contentDefinitions>
> >     >                 <contentDefinition type="news"> <!-- must match
> >     > docFactory type -->
> >     >
> >     >
<class>com.opencms.modules.homepage.news.NewsContentDefinition</class
> >     >>
> >     >
> >     >
<initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</
> >     >initCla ss>
> >     >                     <listMethod name="getNewsList">
> >     >                         <param type="java.lang.Integer">1</param>
> >     >                         <param type="java.lang.String">-1</param>
> >     >                     </listMethod>
> >     >                     <page uri="/news.html?__element=entry">
> >     >                         <param method="getIntId" name="newsid"/>
> >     >                     </page>
> >     >                 </contentDefinition>
> >     >
> >     > In short:
> >     > initClass is optional: For the news the news classes have to be
> >     > loaded to initialize the db pool.
> >     > listMethod: a method of the content definition class that returns
a
> >     > List of elements
> >     > page: the page that can display an entry. Here a jsp that has a
> >     > template element "entry". It also needs the id of the news item.
> >     > getIntId is a method of the content definition class and newsid is
> >     > the url parameter the page needs. A link like
> >     > news.html?__element=entry&newsid=xy
> >     > will be generated.
> >     >
> >     > Best regards,
> >     > Stephan
> >     >
> >     >
> >     > ----- Original Message -----
> >     > From: "Ben Rometsch" <ben at solidstategroup.com>
> >     > To: <opencms-dev at opencms.org>
> >     > Sent: Wednesday, October 22, 2003 6:15 AM
> >     > Subject: [opencms-dev] (no subject)
> >     >
> >     > > Hi Matt,
> >     > >
> >     > > I am not having any joy! I've updated my registry.xml file, with
> >     > > the appropriate section reading:
> >     > >
> >     > > <luceneSearch>
> >     > > <mergeFactor>100000</mergeFactor>
> >     > > <permCheck>true</permCheck>
> >     > > <indexDir>c:\search</indexDir>
> >     > >
> >     > >
<analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</ana
> >     > >lyzer> <subsearch>true</subsearch>
> >     > > <project>online</project>
> >     > > <docFactories>
> >     > > <pageDocFactory enabled="true">
> >     > >
> >     > >
<class>net.grcomputing.opencms.search.lucene.PageDocument</class>
> >     > > </pageDocFactory>
> >     > > <plainDocFactory enabled="true">
> >     > > <fileType name="plaintext">
> >     > > <extension>.txt</extension>
> >     > >
> >     > >
<class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> >     > > </fileType>
> >     > > <fileType name="taggedtext">
> >     > > <extension>.html</extension>
> >     > > <extension>.htm</extension>
> >     > > <extension>.xml</extension>
> >     > > <!-- This will strip tags before processing
> >     > > -->
> >     > >
> >     > >
<class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</c
> >     > >lass> </fileType>
> >     > >
> >     > > <!-- Index binary documents -->
> >     > > <fileType name="plaindocument">
> >     > > <extension>.doc</extension>
> >     > > <extension>.xls</extension>
> >     > > <extension>.pdf</extension>
> >     > >
> >     > >
<class>net.grcomputing.opencms.search.lucene.BodylessDocument</clas
> >     > >s> </fileType>
> >     > >
> >     > > </plainDocFactory>
> >     > > <jspDocFactory enabled="true">
> >     > >
> >     > > <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
> >     > > </jspDocFactory>
> >     > > <xmlTemplateDocFactory enabled="false"/>
> >     > > </docFactories>
> >     > > <directories>
> >     > > <directory location="/release/">
> >     > > <section>Test</section>
> >     > > <subsearch>true</subsearch>
> >     > > </directory>
> >     > > <directory location="/RGLIntranet/">
> >     > > <section>Test2</section>
> >     > > <subsearch>true</subsearch>
> >     > > </directory>
> >     > > </directories>
> >     > > </luceneSearch>
> >     > >
> >     > > Notice the section beginning after the remark "Index binary
> >     > > documents".
> >     > >
> >     > > But I cannot get any hits when searching for document names that
> >     > > are in
> >     >
> >     > the
> >     >
> >     > > VFS. The other (HTML) searches are working ok. Is the "name"
> >     > > property of
> >     >
> >     > the
> >     >
> >     > > fileType tag important? I wasn't sure what to add here...I'm not
> >     > > quite
> >     >
> >     > sure
> >     >
> >     > > how to move forward. Maybe it would be an idea to add some
> >     > > debugging trace to the BodylessDocument class to see what is
going
> >     > > on inside it? I want to make sure my XML is correct first tho!
> >     > >
> >     > > Thanks for the help,
> >     > > Ben
> >     > >
> >     > > On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote:
> >     > > > Hi Matt,
> >     > > >
> >     > > > Thanks for the reply. If I just want to get the document title
to
> >     > > > be included in the Lucene index, looking at the code in the
> >     > > > net.grcomputing.opencms.search.BodylessDocument class it
appears
> >     > > > to
> >     >
> >     > ignore
> >     >
> >     > > > what the CMSObject is, and attempt to index it regardless. Is
> >     > > > this
> >     > >
> >     > > correct?
> >     > >
> >     > >
> >     > > Correct. It will already index the title, but it will not
attempt
> >     > > to index the body.
> >     > >
> >     > > > If this is the case, is it simply a matter of instructing
Lucene
> >     > > > to
> >     >
> >     > index
> >     >
> >     > > > obects other than HTML files in the VFS  (i.e. Documents) ? Or
> >     > > > would I
> >     > >
> >     > > have
> >     > >
> >     > > > to create another class, something like
> >     > > > net.grcomputing.opencms.search.FileDocument and add a new hook
> >     > > > into that class via the registry.xml fragment?  Or does the
> >     > > > BodyLess document
> >     > >
> >     > > provide
> >     > >
> >     > > > this functionality, and it's just a matter of adding a new XML
> >     > > > fragment
> >     >
> >     > to
> >     >
> >     > > > the registry.xml are?
> >     > >
> >     > > Again, you are right -- simply adding the appropriate
configuration
> >     > > to the registry.xml file will suffice. I believe that you will
just
> >     > > need to extend the plainDocument tag set to include extensions
and
> >     > > processors... I _think_ that binary files get handled by the
plain
> >     > > handler.
> >     > >
> >     > > Matt
> >     > >
> >     > > _______________________________________________
> >     > > This mail is send to you from the opencms-dev mailing list
> >     > > To change your list options, or to unsubscribe from the list,
> >     > > please visit
http://mail.opencms.org/mailman/listinfo/opencms-dev
> >     >
> >     > Stephan Hartmann
> >     > Unternehmensberatung Währisch & Feykes GmbH
> >     > Gustav-Adolf-Str. 5
> >     > 47057 Duisburg
> >     >
> >     > Tel.: 0203-373070
> >     > Fax: 0203-376766
> >     > E-Mail: hartmann at wfnetz.de
> >     > Internet: www.wfnetz.de
> >     >
> >     > Über das Internet versandte E-Mails können unter fremden Namen
> >     > erstellt oder manipuliert werden. Aus diesem Grund enthalten
unsere
> >     > mit E-Mail verschickten Nachrichten grundsätzlich keine
> >     > rechtsverbindlichen Willenserklärungen.
>
> ----------------------------------------
> Content-Type: text/html; charset="iso-8859-1"; name="Anhang: 1"
> Content-Transfer-Encoding: quoted-printable
> Content-Description:
> ----------------------------------------
>
> --
> Stephan Hartmann
>
> Währisch & Feykes GmbH
> Gustav-Adolf-Str. 5
> 47057 Duisburg
> Tel. 0203 / 373 070
> Fax 0203 / 376 766
> hartmann at wfnetz.de
>
> ------------------------------------------------------
> Ausschlusserklärung (Disclaimer):
> Über das Internet versandte E-mails können unter fremden Namen erstellt
oder
> manipuliert werden. Aus diesem Grund enthalten unsere mit E-mail
verschickten
> Nachrichten grundsätzlich keine rechtsverbindlichen Willenserklärungen.
> _______________________________________________
> This mail is send to you from the opencms-dev mailing list
> To change your list options, or to unsubscribe from the list, please visit
> http://mail.opencms.org/mailman/listinfo/opencms-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: WordDocument.java
Type: application/octet-stream
Size: 1529 bytes
Desc: not available
URL: <https://webmail.opencms.org/pipermail/opencms-dev/attachments/20031029/383fb828/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PDFDocument.java
Type: application/octet-stream
Size: 1449 bytes
Desc: not available
URL: <https://webmail.opencms.org/pipermail/opencms-dev/attachments/20031029/383fb828/attachment-0001.obj>