[opencms-dev] use PDFBox in net.grcomputing.opencms.search.lucene module

Wed Aug 25 17:55:26 CEST 2004

Hi there,

Since textmining cannot parse PDF documents in Simplified Chinese correctly, I tried PDFBox 0.6.6 and it worked (for PDF files produced by Acrobat 5).

I changed PDFDocument.java as follows:

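 public Document Document(CmsObject cmso, CmsFile f) throws CmsException {
  Document doc = super.Document(cmso, f);
  f = cmso.readFile(f.getAbsolutePath());
  InputStream in = new ByteArrayInputStream(f.getContents());
  PDDocument pdfDocument = null;

  try {
      // parse the PDF first, otherwise pdfDocument is still null below
      // (PDFParser is org.pdfbox.pdfparser.PDFParser in PDFBox 0.6.x)
      PDFParser parser = new PDFParser( in );
      parser.parse();
      pdfDocument = parser.getPDDocument();

      // extract the text into a temporary in-memory buffer
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      OutputStreamWriter writer = new OutputStreamWriter( out );
      PDFTextStripper stripper = new PDFTextStripper();
      stripper.writeText( pdfDocument.getDocument(), writer );
      writer.close();

      // hand the extracted text to Lucene as the "contents" field
      byte[] contents = out.toByteArray();
      InputStreamReader input = new InputStreamReader( new ByteArrayInputStream( contents ) );
      doc.add(Field.Text("contents", input ));
  } catch(IOException e) {
      // extraction failed: the file is indexed without a "contents" field
      e.printStackTrace();
  } finally {
      // release the resources held by the parsed document
      if (pdfDocument != null) {
          try { pdfDocument.close(); } catch(IOException e) { /* ignore */ }
      }
  }

  return doc;
 }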
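
A side note on encoding: the OutputStreamWriter / InputStreamReader pair in the method uses the platform default charset on both sides. That round trip is only lossless if the default charset can represent the extracted Chinese text. Below is a minimal stand-alone sketch (not part of the module; the class name RoundTripSketch and the method roundTrip are made up for illustration) of the same buffer technique with the charset named explicitly. The writer.write() call stands in for stripper.writeText(), so no PDFBox is needed to run it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

public class RoundTripSketch {

    // Writes text into an in-memory buffer and reads it back, naming the
    // charset on both sides so the result does not depend on the platform default.
    static String roundTrip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, "UTF-8");
        writer.write(text); // stands in for stripper.writeText(...)
        writer.close();

        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(out.toByteArray()), "UTF-8");
        StringBuffer result = new StringBuffer();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            result.append(buf, 0, n);
        }
        reader.close();
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        String sample = "\u7b80\u4f53\u4e2d\u6587"; // "Simplified Chinese" written in Chinese
        System.out.println(sample.equals(roundTrip(sample)));
    }
}
```

In the method above, the same idea would mean passing "UTF-8" as the second argument to both the OutputStreamWriter and the InputStreamReader. I have not tested that variant inside the module.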

That's all.

Shi Yusen
Beijing Langhua Ltd.