[opencms-dev] Index pdf files with your content in lucene.

Fri Oct 24 10:16:01 CEST 2003

Hello Ernesto,

I assume you are using the unpatched version 1.3 of the search module.
As I mentioned yesterday, the plainDocFactory only indexes cmsFiles of type "plain", not of type "binary". PDF files are stored as binary.
I suggest using the version I posted yesterday. Your registry.xml would then have to look like this:
...
<docFactories>
...
   <docFactory type="plain" enabled="true">
...
   </docFactory>
   <docFactory type="binary" enabled="true">
      <fileType name="pdftext">
         <extension>.pdf</extension>
         <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
      </fileType>
   </docFactory>
...
</docFactories>

Important: The type attribute must match the file types of OpenCms (also defined in the registry.xml).
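
To illustrate why the pdf entry has to sit under the "binary" docFactory, here is a minimal Java sketch of a lookup that is keyed first on the OpenCms resource type and then on the file extension. It is a hypothetical model for illustration only, not the search module's actual code; the class DocFactoryLookupSketch and its methods register and resolve are invented names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of a two-level docFactory lookup; NOT the module's real code.
final class DocFactoryLookupSketch {

    // resource type (e.g. "plain", "binary") -> (extension -> document class name)
    private final Map<String, Map<String, String>> factories = new HashMap<>();

    /** Mirrors one <fileType> entry inside a <docFactory type="..."> element. */
    void register(String type, String extension, String className) {
        factories.computeIfAbsent(type, t -> new HashMap<>())
                 .put(extension.toLowerCase(Locale.ROOT), className);
    }

    /** Returns the document class name for a resource, or null if nothing matches. */
    String resolve(String resourceType, String fileName) {
        Map<String, String> byExtension = factories.get(resourceType);
        if (byExtension == null) {
            return null; // no docFactory configured for this resource type
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null; // no extension to map
        }
        return byExtension.get(fileName.substring(dot).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String pdfClass = "net.grcomputing.opencms.search.lucene.PDFDocument";

        DocFactoryLookupSketch lookup = new DocFactoryLookupSketch();
        lookup.register("plain", ".pdf", pdfClass);
        // The PDF is stored as "binary", so the "plain" entry is never consulted.
        System.out.println(lookup.resolve("binary", "manual.pdf")); // prints "null"

        lookup.register("binary", ".pdf", pdfClass);
        System.out.println(lookup.resolve("binary", "manual.pdf")); // prints the class name
    }
}
```

In this model an extension mapping registered under the wrong type is simply never reached, which matches the symptom of PDF files not being indexed.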

Bye,
Stephan

  ----- Original Message ----- 
  From: Ernesto De Santis 
  To: Lucene Users List 
  Cc: opencms-dev at opencms.org 
  Sent: Thursday, October 23, 2003 4:16 PM
  Subject: [opencms-dev] Index pdf files with your content in lucene.

  Hello

  I am new to OpenCms and Lucene technology.

  I want to index PDF files, including the content of these files.

  This is what I have done so far:

  I made a PDFDocument class modeled on the JspDocument class.
  It uses the org.textmining.text.extraction.PDFExtractor class, which works fine outside the VFS.

  I then added an entry for PDF documents to my registry.xml, inside the plainDocFactory tag.

                      <fileType name="pdftext">
                          <extension>.pdf</extension>
                          <!-- This will strip tags before processing -->
                          <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
                      </fileType>

  My PDFDocument class contains the code below.
  I think the problem is how to get the content from the CmsFile: which InputStream should I use?
  PDFExtractor works through its extractText(InputStream) method.

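  package net.grcomputing.opencms.search.lucene;

  import java.io.ByteArrayInputStream;
  import java.io.InputStream;
  import java.util.HashMap;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.textmining.text.extraction.PDFExtractor;

  // Assumption: OpenCms 5.x package names; adjust them to your installation.
  import com.opencms.core.CmsException;
  import com.opencms.file.CmsFile;
  import com.opencms.file.CmsObject;

  // Assumption: I_DocumentConstants, I_DocumentFactory and BodylessDocument
  // live in this same package, so they need no import.
  public class PDFDocument implements I_DocumentConstants, I_DocumentFactory {

      public PDFDocument() {
      }

      public Document Document(CmsObject cmsobject, CmsFile cmsfile)
              throws CmsException {
          return Document(cmsobject, cmsfile, null);
      }

      public Document Document(CmsObject cmsobject, CmsFile cmsfile, HashMap hashmap)
              throws CmsException {
          // Start from the bodyless document, which indexes the title but not the body.
          Document document = (new BodylessDocument()).Document(cmsobject, cmsfile);

          // Hand the raw bytes to the extractor. Going through new String(...) and
          // StringBufferInputStream can corrupt the binary PDF data.
          InputStream in = new ByteArrayInputStream(cmsfile.getContents());

          String body;
          try {
              body = new PDFExtractor().extractText(in);
          } catch (Exception e) {
              throw new CmsException("Could not extract text from PDF: " + e);
          }

          // Put the content of the PDF file into the "body" field.
          document.add(Field.Text("body", body));
          return document;
      }
  }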
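
On the InputStream question: a minimal, JDK-only sketch of why binary content should not go through a String. Decoding arbitrary bytes with a charset and re-encoding them can change the data, whereas a ByteArrayInputStream passes the bytes through unchanged. The class PdfBytesDemo, its method names, and the sample bytes are made up for illustration. UTF-8 is chosen explicitly so that the effect is reproducible; new String(byte[]) without a charset uses the platform default and may behave differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: compares two ways of turning file bytes into an InputStream.
final class PdfBytesDemo {

    /** Drains a stream into a byte array, i.e. what a consumer of the stream would read. */
    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lossy route: bytes -> String -> bytes -> stream. */
    static byte[] viaString(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return readAll(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
    }

    /** Safe route: wrap the original bytes directly. */
    static byte[] viaByteStream(byte[] raw) {
        return readAll(new ByteArrayInputStream(raw));
    }

    public static void main(String[] args) {
        // "%PDF" followed by four high bytes that are not valid UTF-8.
        byte[] raw = {'%', 'P', 'D', 'F', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};

        System.out.println("via String intact:      " + Arrays.equals(raw, viaString(raw)));
        System.out.println("via byte stream intact: " + Arrays.equals(raw, viaByteStream(raw)));
    }
}
```

Applied to the class above, this corresponds to passing new ByteArrayInputStream(cmsfile.getContents()) to extractText.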

  Thanks,
  Ernesto
  PS: Sorry for my poor English.

  ----- Original Message ----- 
  From: "Hartmann, Waehrisch & Feykes GmbH" <hartmann at waehrisch-feykes.de>
  To: <opencms-dev at opencms.org>
  Sent: Wednesday, October 22, 2003 3:50 AM
  Subject: Re: [opencms-dev] (no subject)

  > Hi Ben,
  > 
  > I think this won't work, since the plainDocFactory will only be used for
  > files of type "plain" but not for files of type "binary".
  > Recently we have made some additions to the module - by order of Lenord,
  > Bauer & Co. GmbH - that could meet your needs. They introduce a more flexible
  > way of defining docFactories, so that you can add new factories without having
  > to recompile the whole module. Other modules (like the news) can then bring
  > their own docFactory, and all you have to do is edit the registry.xml.
  > Here is an example:
  > 
  >             <docFactories>
  >                 <docFactory enabled="true" type="plain">
  >                     <fileType name="plaintext">
  >                         <extension>.txt</extension>
  >                         <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
  >                     </fileType>
  >                 </docFactory>
  >                 <docFactory enabled="true" type="news">
  >                     <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
  >                 </docFactory>
  >             </docFactories>
  > 
  > To index binary files all you need to add is this:
  > 
  >            <docFactory enabled="true" type="binary">
  >                <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
  >            </docFactory>
  > 
  > There should be no need for an extension mapping.
  > 
  > For those interested:
  > For ContentDefinitions (like news) I introduced the following:
  >             <contentDefinitions>
  >                 <contentDefinition type="news"> <!-- must match docFactory type -->
  >                     <class>com.opencms.modules.homepage.news.NewsContentDefinition</class>
  >                     <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initClass>
  >                     <listMethod name="getNewsList">
  >                         <param type="java.lang.Integer">1</param>
  >                         <param type="java.lang.String">-1</param>
  >                     </listMethod>
  >                     <page uri="/news.html?__element=entry">
  >                         <param method="getIntId" name="newsid"/>
  >                     </page>
  >                 </contentDefinition>
  >             </contentDefinitions>
  > 
  > In short:
  > initClass (optional): for the news, the news classes have to be loaded to
  > initialize the db pool.
  > listMethod: a method of the content definition class that returns a List of
  > elements.
  > page: the page that can display an entry, here a JSP that has a template
  > element "entry". It also needs the id of the news item:
  > getIntId is a method of the content definition class and newsid is the URL
  > parameter the page needs. A link like
  > news.html?__element=entry&newsid=xy
  > will be generated.
  > 
  > Best regards,
  > Stephan
  > 
  > 
  > ----- Original Message ----- 
  > From: "Ben Rometsch" <ben at solidstategroup.com>
  > To: <opencms-dev at opencms.org>
  > Sent: Wednesday, October 22, 2003 6:15 AM
  > Subject: [opencms-dev] (no subject)
  > 
  > 
  > > Hi Matt,
  > >
  > > I am not having any joy! I've updated my registry.xml file, with the
  > > appropriate section reading:
  > >
  > > <luceneSearch>
  > >     <mergeFactor>100000</mergeFactor>
  > >     <permCheck>true</permCheck>
  > >     <indexDir>c:\search</indexDir>
  > >     <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
  > >     <subsearch>true</subsearch>
  > >     <project>online</project>
  > >     <docFactories>
  > >         <pageDocFactory enabled="true">
  > >             <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
  > >         </pageDocFactory>
  > >         <plainDocFactory enabled="true">
  > >             <fileType name="plaintext">
  > >                 <extension>.txt</extension>
  > >                 <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
  > >             </fileType>
  > >             <fileType name="taggedtext">
  > >                 <extension>.html</extension>
  > >                 <extension>.htm</extension>
  > >                 <extension>.xml</extension>
  > >                 <!-- This will strip tags before processing -->
  > >                 <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
  > >             </fileType>
  > >
  > >             <!-- Index binary documents -->
  > >             <fileType name="plaindocument">
  > >                 <extension>.doc</extension>
  > >                 <extension>.xls</extension>
  > >                 <extension>.pdf</extension>
  > >                 <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
  > >             </fileType>
  > >
  > >         </plainDocFactory>
  > >         <jspDocFactory enabled="true">
  > >             <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
  > >         </jspDocFactory>
  > >         <xmlTemplateDocFactory enabled="false"/>
  > >     </docFactories>
  > >     <directories>
  > >         <directory location="/release/">
  > >             <section>Test</section>
  > >             <subsearch>true</subsearch>
  > >         </directory>
  > >         <directory location="/RGLIntranet/">
  > >             <section>Test2</section>
  > >             <subsearch>true</subsearch>
  > >         </directory>
  > >     </directories>
  > > </luceneSearch>
  > >
  > > Notice the section beginning after the remark "Index binary documents".
  > >
  > > But I cannot get any hits when searching for document names that are in
  > > the VFS. The other (HTML) searches are working ok. Is the "name" property
  > > of the fileType tag important? I wasn't sure what to add here...I'm not
  > > quite sure how to move forward. Maybe it would be an idea to add some
  > > debugging trace to the BodylessDocument class to see what is going on
  > > inside it? I want to make sure my XML is correct first tho!
  > >
  > > Thanks for the help,
  > > Ben
  > >
  > >
  > > On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote:
  > > > Hi Matt,
  > > >
  > > > Thanks for the reply. If I just want to get the document title to be
  > > > included in the Lucene index, looking at the code in the
  > > > net.grcomputing.opencms.search.BodylessDocument class it appears to
  > > > ignore what the CMSObject is, and attempt to index it regardless. Is
  > > > this correct?
  > > >
  > >
  > > Correct. It will already index the title, but it will not attempt to
  > > index the body.
  > >
  > > > If this is the case, is it simply a matter of instructing Lucene to
  > > > index objects other than HTML files in the VFS (i.e. Documents)? Or
  > > > would I have to create another class, something like
  > > > net.grcomputing.opencms.search.FileDocument and add a new hook into that
  > > > class via the registry.xml fragment? Or does the BodyLess document
  > > > provide this functionality, and it's just a matter of adding a new XML
  > > > fragment to the registry.xml area?
  > >
  > > Again, you are right -- simply adding the appropriate configuration to
  > > the registry.xml file will suffice. I believe that you will just need to
  > > extend the plainDocument tag set to include extensions and processors...
  > > I _think_ that binary files get handled by the plain handler.
  > >
  > > Matt
  > >
  > 
  > Stephan Hartmann
  > Unternehmensberatung Währisch & Feykes GmbH
  > Gustav-Adolf-Str. 5
  > 47057 Duisburg
  > 
  > Tel.: 0203-373070
  > Fax: 0203-376766
  > E-Mail: hartmann at wfnetz.de
  > Internet: www.wfnetz.de
  > 
  > E-mails sent over the Internet can be created under a false name or
  > manipulated. For this reason, the messages we send by e-mail never
  > contain legally binding declarations of intent.
  > 