[opencms-dev] Adobe 9 and pdfBox

Tony Thul TTHUL at regina.ca
Tue Mar 22 20:07:08 CET 2011


I replaced the pdfbox 0.7.2 jar with 1.5.0 and added fontbox-1.5.0.jar. It did not work until I changed the import statements in CmsExtractorPdf.java from

org.pdfbox.pdfparser.PDFParser

to

org.apache.pdfbox.pdfparser.PDFParser

and rebuilt OpenCms from source. This seems to work: the content was indexed.
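
For anyone repeating this upgrade, the package rename described above can be sketched as a small source rewrite. This is only an illustration: the class name PdfBoxImportMigrator and its migrate method are hypothetical and not part of OpenCms or PDFBox. It assumes the only change needed is the import prefix (org.pdfbox to org.apache.pdfbox), so fully qualified references in method bodies and any real API differences between 0.7.2 and 1.5.0 would still need to be fixed by hand.

```java
import java.util.regex.Pattern;

// Hypothetical helper: rewrites old PDFBox import prefixes in Java source text.
final class PdfBoxImportMigrator {

    // Matches "import org.pdfbox." or "import static org.pdfbox." at the start of a line.
    // Group 1 captures everything before the package name so it can be kept unchanged.
    private static final Pattern OLD_IMPORT =
            Pattern.compile("(?m)^(\\s*import\\s+(?:static\\s+)?)org\\.pdfbox\\.");

    // Returns the source with old-style imports pointed at org.apache.pdfbox.
    // Imports that already use the new prefix are left untouched.
    static String migrate(String source) {
        return OLD_IMPORT.matcher(source).replaceAll("$1org.apache.pdfbox.");
    }

    public static void main(String[] args) {
        String before =
                "import org.pdfbox.pdfparser.PDFParser;\n"
              + "import org.pdfbox.util.PDFTextStripper;\n";
        System.out.print(migrate(before));
    }
}
```

The rewrite is restricted to import lines so that unrelated text containing "org.pdfbox" (comments, log strings) is not changed by accident.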
 
Are there any plans to upgrade to a newer version of pdfbox in the future?
 
Thanks!
Tony

>>> Graeme Kidd <coolkidd3 at hotmail.com> 22/Mar/2011 12:19 pm >>>


Hi,

This appears to be a known issue in PDFBox versions before 0.8 (OpenCms uses 0.7.2):
https://issues.apache.org/jira/browse/PDFBOX-361 

You could try downloading the latest version of PDFBox (1.5.0) from here:
http://pdfbox.apache.org/download.html 

However, I am not sure how much the PDFBox API has changed, so this version may not be supported by OpenCms 7.

Graeme

________________________________
> Date: Tue, 22 Mar 2011 11:14:27 -0600
> From: TTHUL at regina.ca 
> To: opencms-dev at opencms.org 
> Subject: [opencms-dev] Adobe 9 and pdfBox
>
> Are there any fixes available for 7.x that will allow the content of
> PDF files created with Adobe 9 to be indexed?
>
> This is an example of the errors we are getting:
>
> 22 Mar 2011 09:07:07,050 ERROR [rch.documents.A_CmsVfsDocument: 166] Extracting text from resource "/sites/Insite/hr/job_descriptions/Public_Works_Division/Water_and_Sewer_Services_Department/Water_Operations/Tradesperson_II.pdf" failed.
> org.opencms.search.CmsIndexException: Extracting text from resource "/sites/Insite/hr/job_descriptions/Public_Works_Division/Water_and_Sewer_Services_Department/Water_Operations/Tradesperson_II_x_Plumber_Cross_Connection.pdf" failed.
> at org.opencms.search.documents.CmsDocumentPdf.extractContent(CmsDocumentPdf.java:91)
> at org.opencms.search.documents.A_CmsVfsDocument.createDocument(A_CmsVfsDocument.java:159)
> at org.opencms.search.CmsIndexingThread.run(CmsIndexingThread.java:129)
> Caused by: java.lang.NullPointerException
> at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
> at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
> at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:162)
> at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:220)
> at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140)
> at org.opencms.search.extractors.CmsExtractorPdf.extractText(CmsExtractorPdf.java:104)
> at org.opencms.search.extractors.A_CmsTextExtractor.extractText(A_CmsTextExtractor.java:72)
> at org.opencms.search.extractors.A_CmsTextExtractor.extractText(A_CmsTextExtractor.java:62)
> at org.opencms.search.documents.CmsDocumentPdf.extractContent(CmsDocumentPdf.java:78)
> ... 2 more
>
>
> DISCLAIMER: The information transmitted is intended only for the
> addressee and may contain confidential, proprietary and/or privileged
> material. Any unauthorized review, distribution or other use of or the
> taking of any action in reliance upon this information is prohibited.
> If you received this in error, please contact the sender and delete or
> destroy this message and any copies.
>
> _______________________________________________ This mail is sent to
> you from the opencms-dev mailing list To change your list options, or
> to unsubscribe from the list, please visit
> http://lists.opencms.org/mailman/listinfo/opencms-dev 
     
