[opencms-dev] Lucene problem indexing pdf content

Thomas Fabbricante tom_fabbricante at wunderman.com
Mon Mar 8 16:00:02 CET 2004


I successfully imported net.grcomputing.opencms.search.lucene_1.5.zip,
configured the registry.xml file, scheduled a task to index content, and ran
the simple_search page.

All document types (html, xml, word, plain) return hits on the simple search
except PDFs. For PDFs, the only hits I get are on the titles.

Content inside the pdf seems to be missed by the indexing process.

I've seen the PDF section of the registry.xml file written two ways:
<fileType name="PDF">
  <extension>.pdf</extension>
  <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
</fileType>

or

<fileType name="pdftext">
  <extension>.pdf</extension>
  <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
</fileType>

I tried both and got the same result: no PDF content was indexed.

Question 1: Which form of the name attribute is correct, PDF or pdftext?

Question 2: How do I get my PDF content indexed?

Thanks
-tom


