[opencms-dev] Lucene - PDF exception during index creation

Thu Mar 18 18:24:02 CET 2004

Ralf,

The PDFBox classes are rejecting the PDF file as corrupt: the parser
expected an integer and found 'endobj' instead. Those classes are
outside of the development work that we do, but it is possible that a
newer version of PDFBox will fix the issue.
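
If the exception stops the whole index run, one possible workaround until
the parser is updated is to catch the IOException per file and skip the
offending document. The following is only a minimal sketch of that idea.
TextExtractor, indexAll, and the file names are hypothetical stand-ins
invented for illustration; they are not part of PDFBox or of the OpenCms
Lucene module, whose real extraction call would take their place.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SkipCorruptPdfs {

    /** Hypothetical stand-in for whatever call extracts text from a PDF. */
    interface TextExtractor {
        String extractText(String path) throws IOException;
    }

    /**
     * Extracts text from every path. A file that raises IOException is
     * recorded in 'failed' and skipped, so the remaining files still get
     * processed.
     */
    static List<String> indexAll(List<String> paths,
                                 TextExtractor extractor,
                                 List<String> failed) {
        List<String> texts = new ArrayList<>();
        for (String path : paths) {
            try {
                texts.add(extractor.extractText(path));
            } catch (IOException e) {
                failed.add(path);
                System.err.println("Skipping unreadable PDF " + path
                        + ": " + e.getMessage());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Fake extractor that simulates one corrupt file (assumption for the demo).
        TextExtractor fake = path -> {
            if (path.endsWith("broken.pdf")) {
                throw new IOException(
                        "Error: Expected an integer type, actual='endobj'");
            }
            return "text of " + path;
        };

        List<String> failed = new ArrayList<>();
        List<String> texts = indexAll(
                Arrays.asList("a.pdf", "broken.pdf", "b.pdf"), fake, failed);

        System.out.println(texts.size() + " indexed, failed: " + failed);
    }
}
```

The list of failed paths also tells you which PDF files to re-save or
replace, since the parser itself cannot be fixed from our side.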

Matt

Ralf Emanuel wrote:
> Dear opencms list,
> 
> we use Lucene 1.5 and OpenCms 5 in a current project on Windows Server
> 2003. Each time the index runs, the exception below appears.
> 
> Can anybody help me?
> 
> --snip--
> java.io.IOException: Error: Expected an integer type, actual='endobj'
>         at org.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:943)
>         at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:253)
>         at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:93)
>         at org.textmining.text.extraction.PDFExtractor.extractText(PDFExtractor.java:37)
>         at net.grcomputing.opencms.search.lucene.PDFDocument.Document(Unknown Source)
>         at net.grcomputing.opencms.search.lucene.IndexManager.processFile(Unknown Source)
>         at net.grcomputing.opencms.search.lucene.IndexManager.processDir(Unknown Source)
>         at net.grcomputing.opencms.search.lucene.IndexManager.processDir(Unknown Source)
>         at net.grcomputing.opencms.search.lucene.IndexManager.doIndex(Unknown Source)
>         at net.grcomputing.opencms.search.lucene.CronIndexManager.launch(Unknown Source)
>         at com.opencms.core.CmsCronScheduleJob.run(CmsCronScheduleJob.java:68)
> --snip--
> 
> Thanks in advance.
> 
> 
> Ralf Emanuel