[opencms-dev] use PDFBox in net.grcomputing.opencms.search.lucene module
Shi Yusen
shiys at langhua.cn
Wed Aug 25 17:55:26 CEST 2004
Hi there,
As textmining cannot parse PDF documents in Simplified Chinese correctly, I tried PDFBox 0.6.6, and it worked (for PDF files produced by Acrobat 5).
I changed PDFDocument.java as follows:
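// Needs java.io.*, org.apache.lucene.document.*, org.pdfbox.pdfparser.PDFParser,
// org.pdfbox.pdmodel.PDDocument and org.pdfbox.util.PDFTextStripper (PDFBox 0.6.x names).
public Document Document(CmsObject cmso, CmsFile f) throws CmsException {
    Document doc = super.Document(cmso, f);
    f = cmso.readFile(f.getAbsolutePath());
    InputStream in = new ByteArrayInputStream(f.getContents());
    PDDocument pdfDocument = null;
    // In-memory buffer for the extracted text.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
        // Parse the PDF first; without this step pdfDocument is still null below.
        PDFParser parser = new PDFParser(in);
        parser.parse();
        pdfDocument = parser.getPDDocument();
        // Write and read back with the same explicit charset, so Chinese text
        // does not depend on the platform default encoding.
        OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.writeText(pdfDocument.getDocument(), writer);
        writer.close();
        byte[] contents = out.toByteArray();
        InputStreamReader input = new InputStreamReader(new ByteArrayInputStream(contents), "UTF-8");
        doc.add(Field.Text("contents", input));
    } catch (IOException e) {
        // Extraction failed: the document is indexed without its body text.
    } finally {
        if (pdfDocument != null) {
            try { pdfDocument.close(); } catch (IOException e) { /* nothing more to do */ }
        }
    }
    return doc;
}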
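A note on the in-memory buffer: the method above writes the extracted text to a byte array and reads it back through an InputStreamReader, so the charset used on both sides has to match for Chinese text to survive. The following is only a minimal sketch of that round trip using the plain JDK. The class TextBufferSketch and the method writeExtractedText are made-up names for illustration; writeExtractedText merely stands in for the PDFBox text stripper and does not call PDFBox. StandardCharsets needs Java 7 or later; on older JDKs the charset name "UTF-8" would be passed as a string instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class TextBufferSketch {

    // Hypothetical stand-in for the text stripper: it only writes the
    // given text to the writer, no PDF parsing is involved.
    public static void writeExtractedText(String text, Writer writer) throws IOException {
        writer.write(text);
    }

    // Buffers the text as UTF-8 bytes and returns a Reader that decodes
    // the buffer with the same charset.
    public static Reader bufferAsReader(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writeExtractedText(text, writer);
        writer.close(); // flushes the encoder into the byte buffer
        return new InputStreamReader(
                new ByteArrayInputStream(out.toByteArray()), StandardCharsets.UTF_8);
    }

    // Drains a Reader into a String.
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "Simplified Chinese" written in Chinese, followed by " PDF".
        String sample = "\u7b80\u4f53\u4e2d\u6587 PDF";
        String roundTripped = readAll(bufferAsReader(sample));
        System.out.println(sample.equals(roundTripped)); // prints "true"
    }
}
```

If the writer and the reader were created without a charset argument, the result would depend on the platform default encoding, which may not be able to represent Chinese characters.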
That's all.
Shi Yusen
Beijing Langhua Ltd.