<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">

<META content="MSHTML 6.00.2800.1264" name=GENERATOR>

<STYLE></STYLE>

</HEAD>

<BODY bgColor=#ffffff>

<DIV><FONT face=Arial size=2>Hello Ernesto,</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>I assume you are using the unpatched version 1.3 of the search module.</FONT></DIV>

<DIV><FONT face=Arial size=2>As I mentioned yesterday, the plainDocFactory only indexes CmsFiles of type "plain", not of type "binary". PDF files are stored as binary.</FONT></DIV>
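<DIV><FONT face=Arial size=2>Because the contents are binary, they should reach a text extractor as raw bytes rather than by way of a String. Here is a minimal sketch of the difference, using only the JDK. BinaryContentSketch and its helper methods are hypothetical names, and the charset is fixed to US-ASCII only to make the loss reproducible; with the platform default charset the damage may differ.</FONT></DIV>

<PRE>
```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; only JDK calls are used.
class BinaryContentSketch {

    // Wraps the raw bytes directly, so nothing is decoded and nothing is lost.
    static InputStream rawStream(byte[] contents) {
        return new ByteArrayInputStream(contents);
    }

    // Decodes the bytes to a String and encodes them again.
    // US-ASCII is an assumption chosen to make the loss reproducible.
    static InputStream viaString(byte[] contents) {
        String text = new String(contents, StandardCharsets.US_ASCII);
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII));
    }

    // Reads a stream to the end and returns everything it delivered.
    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A few PDF-like bytes, including values outside the ASCII range.
        byte[] contents = { '%', 'P', 'D', 'F', (byte) 0x81, (byte) 0xFF, 0x00 };
        System.out.println("raw bytes preserved: "
                + Arrays.equals(contents, readAll(rawStream(contents))));
        System.out.println("String round trip preserved: "
                + Arrays.equals(contents, readAll(viaString(contents))));
    }
}
```
</PRE>

<DIV><FONT face=Arial size=2>In a document class, the raw variant would presumably correspond to wrapping cmsfile.getContents() in a ByteArrayInputStream before passing it to extractText(InputStream).</FONT></DIV>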

<DIV><FONT face=Arial size=2>I suggest using the version I posted yesterday. Your registry.xml would then have to look like this:</FONT></DIV>

<DIV><FONT face=Arial size=2>...</FONT></DIV>

<DIV><FONT face=Arial size=2><docFactories></FONT></DIV>

<DIV><FONT face=Arial size=2>...</FONT></DIV>

<DIV><FONT face=Arial size=2>   <docFactory type="plain" 

enabled="true"></FONT></DIV>

<DIV><FONT face=Arial size=2>...</FONT></DIV>

<DIV><FONT face=Arial size=2>   </docFactory></FONT></DIV>

<DIV><FONT face=Arial size=2>   <docFactory type="binary" 

enabled="true"></FONT></DIV>

<DIV><FONT face=Arial size=2>      <fileType 

name="pdftext"><BR>         

<extension>.pdf</extension><BR>         

<class>net.grcomputing.opencms.search.lucene.PDFDocument</class><BR>      

</fileType></FONT></DIV>

<DIV><FONT face=Arial size=2>   </docFactory></FONT></DIV>

<DIV><FONT face=Arial size=2>...</FONT></DIV>

<DIV><FONT face=Arial size=2></docFactories></FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>Important: The type attribute must match the file 

types of OpenCms (also defined in the registry.xml).</FONT></DIV>
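<DIV><FONT face=Arial size=2>To illustrate why the type attribute matters, here is a minimal sketch. It assumes the lookup goes by the file's resource type first and by extension second. DocFactoryLookupSketch and its methods are hypothetical and are not the search module's actual code.</FONT></DIV>

<PRE>
```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a type-keyed docFactory lookup; not the module's code.
class DocFactoryLookupSketch {

    // resource type -> (file extension -> document class name)
    private final Map<String, Map<String, String>> factories =
            new HashMap<String, Map<String, String>>();

    // Mirrors one docFactory/fileType/extension entry of the configuration.
    void register(String type, String extension, String documentClass) {
        Map<String, String> byExtension = factories.get(type);
        if (byExtension == null) {
            byExtension = new HashMap<String, String>();
            factories.put(type, byExtension);
        }
        byExtension.put(extension, documentClass);
    }

    // Returns the document class, or null if the type or extension is unknown.
    String lookup(String type, String extension) {
        Map<String, String> byExtension = factories.get(type);
        return byExtension == null ? null : byExtension.get(extension);
    }

    public static void main(String[] args) {
        String pdfClass = "net.grcomputing.opencms.search.lucene.PDFDocument";
        DocFactoryLookupSketch registry = new DocFactoryLookupSketch();

        // A .pdf mapping under type "plain" is never consulted for a binary file.
        registry.register("plain", ".pdf", pdfClass);
        System.out.println(registry.lookup("binary", ".pdf"));

        // Registering it under the matching type makes the lookup succeed.
        registry.register("binary", ".pdf", pdfClass);
        System.out.println(registry.lookup("binary", ".pdf"));
    }
}
```
</PRE>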

<DIV> </DIV>

<DIV><FONT face=Arial size=2>Bye,</FONT></DIV>

<DIV><FONT face=Arial size=2>Stephan</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<BLOCKQUOTE 

style="PADDING-RIGHT: 0px; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #000000 2px solid; MARGIN-RIGHT: 0px">

  <DIV style="FONT: 10pt arial">----- Original Message ----- </DIV>

  <DIV 

  style="BACKGROUND: #e4e4e4; FONT: 10pt arial; font-color: black"><B>From:</B> 

  <A title=edesantis@fibertel.com.ar 

  href="mailto:edesantis@fibertel.com.ar">Ernesto De Santis</A> </DIV>

  <DIV style="FONT: 10pt arial"><B>To:</B> <A 

  title=lucene-user@jakarta.apache.org 

  href="mailto:lucene-user@jakarta.apache.org">Lucene Users List</A> </DIV>

  <DIV style="FONT: 10pt arial"><B>Cc:</B> <A title=opencms-dev@opencms.org 

  href="mailto:opencms-dev@opencms.org">opencms-dev@opencms.org</A> </DIV>

  <DIV style="FONT: 10pt arial"><B>Sent:</B> Thursday, October 23, 2003 4:16 

  PM</DIV>

  <DIV style="FONT: 10pt arial"><B>Subject:</B> [opencms-dev] Index pdf files 

  with your content in lucene.</DIV>

  <DIV><BR></DIV>

  <DIV><FONT face=Arial size=2>Hello</FONT></DIV>

  <DIV><FONT face=Arial size=2></FONT> </DIV>

  <DIV><FONT face=Arial size=2>I am new to OpenCms and Lucene technology.</FONT></DIV>

  <DIV><FONT face=Arial size=2></FONT> </DIV>

  <DIV><FONT face=Arial size=2>I want to index PDF files, including the content of these files.</FONT></DIV>

  <DIV><FONT face=Arial size=2></FONT> </DIV>

  <DIV><FONT face=Arial size=2>This is what I did:</FONT></DIV>

  <DIV><FONT face=Arial size=2></FONT> </DIV>

  <DIV><FONT face=Arial size=2>I wrote a PDFDocument class modeled on the JspDocument class.</FONT></DIV>

  <DIV><FONT face=Arial size=2>It uses the org.textmining.text.extraction.PDFExtractor class, which works fine outside the VFS.</FONT></DIV>

  <DIV><FONT face=Arial size=2></FONT> </DIV>

  <DIV><FONT face=Arial size=2>Then I added an entry for PDF documents to my registry.xml, inside the plainDocFactory tag:</FONT></DIV>

  <DIV><FONT face=Arial size=2></FONT> </DIV>

  <DIV><FONT face=Arial 

  size=2>                    

  <fileType 

  name="pdftext"><BR>                        

  <extension>.pdf</extension><BR>                        

  <!-- This will strip tags before processing 

  --><BR>                        

  <class>net.grcomputing.opencms.search.lucene.PDFDocument</class><BR>                    

  </fileType><BR></DIV></FONT>

  <DIV><FONT face=Arial size=2>My PDFDocument class contains the code below.</FONT></DIV>

  <DIV><FONT face=Arial size=2>I think the problem is how to get the content from the CmsFile. Which InputStream should I use?</FONT></DIV>

  <DIV>

  <DIV><FONT face=Arial size=2>PDFExtractor works through its extractText(InputStream) method.</FONT></DIV></DIV>

  <DIV><FONT face=Arial size=2><B><FONT color=#7f0055 

  size=2></FONT></B></FONT> </DIV>

  <DIV><FONT face=Arial size=2><B><FONT color=#7f0055 

  size=2>public</B></FONT><FONT size=2> </FONT><B><FONT color=#7f0055 

  size=2>class</B></FONT><FONT size=2> PDFDocument </FONT><B><FONT color=#7f0055 

  size=2>implements</B></FONT><FONT size=2> I_DocumentConstants, 

  I_DocumentFactory {</DIV>

  <DIV>

  <P></P>

  <P></FONT><B><FONT color=#7f0055 size=2>public</B></FONT><FONT size=2> 

  PDFDocument(){</P>

  <P>}</P>

  <P></P>

  <P></FONT><B><FONT color=#7f0055 size=2>public</B></FONT><FONT size=2> 

  Document Document(CmsObject cmsobject, CmsFile cmsfile)</P>

  <P></FONT><B><FONT color=#7f0055 size=2>throws</B></FONT><FONT size=2> 

  CmsException </P>

  <P>{</P>

  <P></FONT><B><FONT color=#7f0055 size=2>return</B></FONT><FONT size=2> 

  Document(cmsobject, cmsfile, </FONT><B><FONT color=#7f0055 

  size=2>null</B></FONT><FONT size=2>);</P>

  <P>}</P>

  <P></FONT><B><FONT color=#7f0055 size=2>public</B></FONT><FONT size=2> 

  Document Document(CmsObject cmsobject, CmsFile cmsfile, HashMap hashmap)</P>

  <P></FONT><B><FONT color=#7f0055 size=2>throws</B></FONT><FONT size=2> 

  CmsException</P>

  <P>{</P>

  <P>Document document=(</FONT><B><FONT color=#7f0055 size=2>new</B></FONT><FONT 

  size=2> BodylessDocument()).Document(cmsobject, cmsfile);</P>

  <P></P>

  <P></FONT><FONT color=#3f7f5f size=2>// add the content of the PDF file</P></FONT><FONT size=2>

  <P>String contenido = </FONT><B><FONT color=#7f0055 size=2>new</B></FONT><FONT 

  size=2> String(cmsfile.getContents());</P>

  <P>StringBufferInputStream in = </FONT><B><FONT color=#7f0055 

  size=2>new</B></FONT><FONT size=2> 

  StringBufferInputStream(contenido);</P></FONT><FONT color=#3f7f5f size=2>

  <P>// ByteArrayInputStream in = new 

  ByteArrayInputStream(contenido.getBytes());</P></FONT><FONT size=2>

  <P></P></FONT><FONT color=#3f7f5f size=2>

  <P>/* try{</P>

  <P>FileInputStream in = new FileInputStream (cmsfile.getPath() + 

  cmsfile.getName());</P>

  <P>*/</P></FONT><FONT size=2>

  <P>PDFExtractor extractor = </FONT><B><FONT color=#7f0055 

  size=2>new</B></FONT><FONT size=2> PDFExtractor();</P>

  <P>String body = extractor.extractText(in);</P>

  <P></P>

  <P>document.add(Field.Text(</FONT><FONT color=#2a00ff 

  size=2>"body"</FONT><FONT size=2>, body));</P></FONT><FONT color=#3f7f5f 

  size=2>

  <P>/* }catch(FileNotFoundException e){</P>

  <P>e.toString();</P>

  <P>throw new CmsException();</P>

  <P>}</P>

  <P></P>

  <P>*/</FONT><FONT size=2> </P>

  <P></FONT><B><FONT color=#7f0055 size=2>return</B></FONT><FONT size=2> 

  (document);</P>

  <P>}</P></FONT></FONT></DIV>

  <DIV><FONT face=Arial size=2></FONT> </DIV>

  <DIV><FONT face=Arial size=2>Thanks,<BR>Ernesto</FONT></DIV>

  <DIV><FONT face=Arial size=2>P.S.: Sorry for my poor English.</FONT></DIV>

  <DIV><FONT face=Arial size=2></FONT> </DIV>

  <DIV><FONT face=Arial size=2></FONT> </DIV>

  <DIV><FONT face=Arial size=2></FONT> </DIV>

  <DIV><FONT face=Arial size=2></FONT> </DIV>

  <DIV><FONT face=Arial size=2>----- Original Message ----- </FONT>

  <DIV><FONT face=Arial size=2>From: "Hartmann, Waehrisch & Feykes GmbH" 

  <</FONT><A href="mailto:hartmann@waehrisch-feykes.de"><FONT face=Arial 

  size=2>hartmann@waehrisch-feykes.de</FONT></A><FONT face=Arial 

  size=2>></FONT></DIV>

  <DIV><FONT face=Arial size=2>To: <</FONT><A 

  href="mailto:opencms-dev@opencms.org"><FONT face=Arial 

  size=2>opencms-dev@opencms.org</FONT></A><FONT face=Arial 

  size=2>></FONT></DIV>

  <DIV><FONT face=Arial size=2>Sent: Wednesday, October 22, 2003 3:50 

  AM</FONT></DIV>

  <DIV><FONT face=Arial size=2>Subject: Re: [opencms-dev] (no 

  subject)</FONT></DIV></DIV>

  <DIV><FONT face=Arial><BR><FONT size=2></FONT></FONT></DIV><FONT face=Arial 

  size=2>> Hi Ben,<BR>> <BR>> I think this won't work, since the 
  plainDocFactory is only used for<BR>> files of type "plain", not 
  for files of type "binary".<BR>> We recently made some additions to 
  the module, commissioned by Lenord,<BR>> Bauer & Co. GmbH, that could 
  meet your needs. They introduce a more flexible<BR>> way of defining 
  docFactories, so that you can add new factories without having<BR>> to 
  recompile the whole module. Other modules (like the news module) can then bring<BR>> 
  their own docFactory, and all you have to do is edit the 
  registry.xml.<BR>> Here is an example:<BR>> <BR>> 

              

  <docFactories><BR>> 

                  

  <docFactory enabled="true" type="plain"><BR>> 

                      

  <fileType name="plaintext"><BR>> 

                          

  <extension>.txt</extension><BR>> <BR>> 

  <class>net.grcomputing.opencms.search.lucene.PlainDocument</class><BR>> 

                      

  </fileType><BR>> 

                  

  </docFactory><BR>> 

                  

  <docFactory enabled="true" type="news"><BR>> <BR>> 

  <class>net.grcomputing.opencms.search.lucene.NewsDocument</class><BR>> 

                  

  </docFactory><BR>> 

              

  </docFactories><BR>> <BR>> To index binary files all you need to 

  add is this:<BR>> <BR>> 

             <docFactory 

  enabled="true" type="binary"><BR>> <BR>> 

  <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class><BR>> 

             

  </docFactory><BR>> <BR>> There should be no need for an extension 
  mapping.<BR>> <BR>> For those interested:<BR>> For 
  ContentDefinitions (like news) I introduced the following:<BR>> 

              

  <contentDefinitions><BR>> 

                  

  <contentDefinition type="news"> <!-- must match docFactory<BR>> 

  type --><BR>> <BR>> 

  <class>com.opencms.modules.homepage.news.NewsContentDefinition</class><BR>> 

  <BR>> 

  <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initCla<BR>> 

  ss><BR>> 

                      

  <listMethod name="getNewsList"><BR>> 

                          

  <param type="java.lang.Integer">1</param><BR>> 

                          

  <param type="java.lang.String">-1</param><BR>> 

                      

  </listMethod><BR>> 

                      

  <page uri="/news.html?__element=entry"><BR>> 

                          

  <param method="getIntId" name="newsid"/><BR>> 

                      

  </page><BR>> 

                  

  </contentDefinition><BR>> <BR>> In short:<BR>> initClass is 
  optional. For the news module, the news classes have to be loaded to<BR>> 
  initialize the db pool.<BR>> listMethod: a method of the content definition 
  class that returns a List of<BR>> elements.<BR>> page: the page that can 
  display an entry, here a JSP that has a template<BR>> element "entry". It 
  also needs the id of the news item.<BR>> getIntId is a method of the 
  content definition class, and newsid is the URL<BR>> parameter the page 
  needs. A link like<BR>> news.html?__element=entry&newsid=xy<BR>> 
  will be generated.<BR>> <BR>> Best regards,<BR>> Stephan<BR>> 

  <BR>> <BR>> ----- Original Message ----- <BR>> From: "Ben Rometsch" 

  <</FONT><A href="mailto:ben@solidstategroup.com"><FONT face=Arial 

  size=2>ben@solidstategroup.com</FONT></A><FONT face=Arial size=2>><BR>> 

  To: <</FONT><A href="mailto:opencms-dev@opencms.org"><FONT face=Arial 

  size=2>opencms-dev@opencms.org</FONT></A><FONT face=Arial size=2>><BR>> 

  Sent: Wednesday, October 22, 2003 6:15 AM<BR>> Subject: [opencms-dev] (no 

  subject)<BR>> <BR>> <BR>> > Hi Matt,<BR>> ><BR>> > I 

  am not having any joy! I've updated my registry.xml file, with the<BR>> 

  > appropriate section reading:<BR>> ><BR>> > 

  <luceneSearch><BR>> > 

  <mergeFactor>100000</mergeFactor><BR>> > 

  <permCheck>true</permCheck><BR>> > 

  <indexDir>c:\search</indexDir><BR>> ><BR>> > 

  <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer><BR>> 

  > <subsearch>true</subsearch><BR>> > 

  <project>online</project><BR>> > 

  <docFactories><BR>> > <pageDocFactory 

  enabled="true"><BR>> ><BR>> > 

  <class>net.grcomputing.opencms.search.lucene.PageDocument</class><BR>> 

  > </pageDocFactory><BR>> > <plainDocFactory 

  enabled="true"><BR>> > <fileType name="plaintext"><BR>> > 

  <extension>.txt</extension><BR>> ><BR>> > 

  <class>net.grcomputing.opencms.search.lucene.PlainDocument</class><BR>> 

  > </fileType><BR>> > <fileType name="taggedtext"><BR>> 

  > <extension>.html</extension><BR>> > 

  <extension>.htm</extension><BR>> > 

  <extension>.xml</extension><BR>> > <!-- This will strip 

  tags before processing<BR>> > --><BR>> ><BR>> > 

  <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class><BR>> 

  > </fileType><BR>> ><BR>> > <!-- Index binary 

  documents --><BR>> > <fileType name="plaindocument"><BR>> 

  > <extension>.doc</extension><BR>> > 

  <extension>.xls</extension><BR>> > 

  <extension>.pdf</extension><BR>> ><BR>> > 

  <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class><BR>> 

  > </fileType><BR>> ><BR>> > 

  </plainDocFactory><BR>> > <jspDocFactory 

  enabled="true"><BR>> ><BR>> > 

  <class>net.grcomputing.opencms.search.lucene.JspDocument</class><BR>> 

  > </jspDocFactory><BR>> > <xmlTemplateDocFactory 

  enabled="false"/><BR>> > </docFactories><BR>> > 

  <directories><BR>> > <directory 

  location="/release/"><BR>> > 

  <section>Test</section><BR>> > 

  <subsearch>true</subsearch><BR>> > 

  </directory><BR>> > <directory 

  location="/RGLIntranet/"><BR>> > 

  <section>Test2</section><BR>> > 

  <subsearch>true</subsearch><BR>> > 

  </directory><BR>> > </directories><BR>> > 

  </luceneSearch><BR>> ><BR>> > Notice the section beginning 

  after the remark "Index binary documents".<BR>> ><BR>> > But I 

  cannot get any hits when searching for document names that are in<BR>> 

  the<BR>> > VFS. The other (HTML) searches are working ok. Is the "name" 

  property of<BR>> the<BR>> > fileType tag important? I wasn't sure 

  what to add here...I'm not quite<BR>> sure<BR>> > how to move 

  forward. Maybe it would be an idea to add some debugging trace<BR>> > to 

  the BodylessDocument class to see what is going on inside it? I want 

  to<BR>> > make sure my XML is correct first tho!<BR>> ><BR>> 

  > Thanks for the help,<BR>> > Ben<BR>> ><BR>> ><BR>> 

  > On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote:<BR>> > > Hi 

  Matt,<BR>> > ><BR>> > > Thanks for the reply. If I just want 

  to get the document title to be<BR>> > > included in the Lucene 

  index, looking at the code in the<BR>> > > 

  net.grcomputing.opencms.search.BodylessDocument class it appears to<BR>> 

  ignore<BR>> > > what the CMSObject is, and attempt to index it 

  regardless. Is this<BR>> > correct?<BR>> > ><BR>> 

  ><BR>> > Correct. It will already index the title, but it will not 

  attempt to<BR>> > index the body.<BR>> ><BR>> > > If this 

  is the case, is it simply a matter of instructing Lucene to<BR>> 

  index<BR>> > > objects other than HTML files in the VFS  (i.e. 

  Documents) ? Or would I<BR>> > have<BR>> > > to create another 

  class, something like<BR>> > > 

  net.grcomputing.opencms.search.FileDocument and add a new hook into 

  that<BR>> > > class via the registry.xml fragment?  Or does the 

  BodyLess document<BR>> > provide<BR>> > > this functionality, 

  and it's just a matter of adding a new XML fragment<BR>> to<BR>> > 

  > the registry.xml area?<BR>> ><BR>> > Again, you are right -- 

  simply adding the appropriate configuration to<BR>> > the registry.xml 

  file will suffice. I believe that you will just need to<BR>> > extend 

  the plainDocument tag set to include extensions and processors...<BR>> > 

  I _think_ that binary files get handled by the plain handler.<BR>> 

  ><BR>> > Matt<BR>> ><BR>> > 

  _______________________________________________<BR>> > This mail is send 

  to you from the opencms-dev mailing list<BR>> > To change your list 

  options, or to unsubscribe from the list, please visit<BR>> > </FONT><A 

  href="http://mail.opencms.org/mailman/listinfo/opencms-dev"><FONT face=Arial 

  size=2>http://mail.opencms.org/mailman/listinfo/opencms-dev</FONT></A><BR><FONT 

  face=Arial size=2>> <BR>> Stephan Hartmann<BR>> Unternehmensberatung 

  Währisch & Feykes GmbH<BR>> Gustav-Adolf-Str. 5<BR>> 47057 

  Duisburg<BR>> <BR>> Tel.: 0203-373070<BR>> Fax: 0203-376766<BR>> 

  E-Mail: </FONT><A href="mailto:hartmann@wfnetz.de"><FONT face=Arial 

  size=2>hartmann@wfnetz.de</FONT></A><BR><FONT face=Arial size=2>> Internet: 

  </FONT><A href="http://www.wfnetz.de"><FONT face=Arial 

  size=2>www.wfnetz.de</FONT></A><BR><FONT face=Arial size=2>> <BR>> E-mails 
  sent over the Internet can be created under someone else's name 
  or<BR>> manipulated. For this reason, the messages we send by 
  e-mail<BR>> never contain legally binding 
  declarations of intent.<BR>> 

</FONT></BLOCKQUOTE></BODY></HTML>