[opencms-dev] pdf export

Christian Steinert christian_steinert at web.de
Sun Feb 18 13:08:05 CET 2007


Michael Leitner wrote:
> hello.
>
> is there a way to export all articles in an opencms to pdf in a batch
> process ?
> i would like to create a cronjob, which daily exports all articles to
> pdf. is this possible ?
>
I am currently playing around with htmldoc (www.htmldoc.org). Htmldoc is
a C program that converts HTML pages into PDF documents. Its
documentation also has code samples showing how the program can be
called from Java. It can even download a page via HTTP, but I don't know
whether it also fetches the images that a page references; you would
have to try that. (I am currently suppressing all images, because the
information that I am rendering into PDF is mainly textual.)

Htmldoc does not support CSS in the current stable version (the current
unstable version seems to support some CSS, but I have not tried to find
out how much). If your content is XHTML, you can pre-transform it with a
small XSLT transformation that rewrites it using the kind of HTML
formatting that htmldoc understands (if your content is generated only
for conversion into PDF, there is not even a reason to be ashamed of
writing <font> tags ;-) ). If you want the XSLT that I use to filter my
pages so that htmldoc can process them, contact me and I will email it
to you.
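
A minimal sketch of what such a pre-transformation could look like when
driven from Java with the JDK's built-in XSLT support. The stylesheet is
a made-up illustration, not the one mentioned above: it assumes
well-formed input without an XML namespace, drops <img> elements, and
rewrites a hypothetical <span class="note"> into a <font> tag.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SimplifyForHtmldoc {
    // Hypothetical stylesheet (an assumption for illustration only):
    // an identity copy, plus two overrides that drop images and turn
    // <span class="note"> into old-style <font> markup.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='img'/>"
      + "<xsl:template match=\"span[@class='note']\">"
      + "<font color='red'><xsl:apply-templates/></font>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed, namespace-free fragment
    // and returns the simplified markup as a string.
    static String simplify(String xhtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String in = "<p>Hi <span class=\"note\">x</span>"
                  + "<img src=\"a.png\"/></p>";
        System.out.println(simplify(in));
    }
}
```

Real XHTML normally carries the http://www.w3.org/1999/xhtml namespace,
so the match patterns would need a namespace prefix in that case.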

Personally, I am very happy with htmldoc and it's easy to just try it
out on one or two test pages, but if your content is very visually
oriented, then you'll have to see if htmldoc suffices for you.

After the XSLT step, I push the resulting simplified HTML through
htmldoc to generate the PDF content. This is the approach that I am
taking (although I must admit that I am doing it with PHP, because that
way I could re-use some existing code I had recently written for a
different purpose; triggering the conversion from Java will give you
better integration into OpenCms, and it will work with things like
static export).

There are probably other command-line programs that can convert HTML to
PDF and that can be called by issuing shell commands and grabbing their
output from Java. Of course it's nicer to use a Java library, but if you
look at the samples on the htmldoc homepage, it's also not very
difficult to call a command-line tool from Java to do the conversion.
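
A rough sketch of calling htmldoc from Java and capturing its output,
assuming htmldoc is installed and on the PATH. The class and method
names are invented for this example. The options used (-t pdf, --quiet,
--webpage) are htmldoc's documented ones; without -f, htmldoc writes the
PDF to standard output, which is what gets copied into the target file
here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

class HtmldocRunner {
    // Builds the command line; "source" is a local HTML file or a URL.
    static List<String> buildCommand(String source) {
        return Arrays.asList(
            "htmldoc", "-t", "pdf", "--quiet", "--webpage", source);
    }

    // Runs htmldoc and streams its standard output (the PDF) to "pdf".
    // Returns the exit code of the htmldoc process.
    static int convert(String source, Path pdf)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(source));
        // Pass htmldoc's error messages through to our own stderr.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        try (InputStream out = p.getInputStream()) {
            Files.copy(out, pdf, StandardCopyOption.REPLACE_EXISTING);
        }
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            // No arguments: only show the command that would be run.
            System.out.println(String.join(" ", buildCommand("page.html")));
            return;
        }
        int exit = convert(args[0], Paths.get(args[1]));
        System.out.println("htmldoc exit code: " + exit);
    }
}
```

Passing the arguments as a list instead of one shell string avoids
quoting problems when the file name or URL contains spaces.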

If you want to do everything "by hand" and you have XML or XHTML
content, you could of course transform your content into XSL-FO with a
self-written XSLT (you will find some basic versions of such
HTML-to-XSL-FO converter stylesheets on the net, too) and then render
the resulting XSL-FO with Apache FOP to generate PDF. This may be a
*lot* more work, depending on how complex the formatting of your input
data and the desired formatting of your PDF documents are, but you will
have lots of control over the resulting document.
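
To give an idea of the first half of that route, here is a small sketch
that turns a tiny, namespace-free XHTML-like document into XSL-FO using
only the JDK. The stylesheet is a toy written for this example (it only
handles <h1> and <p>, and the page size and font values are arbitrary
assumptions). The resulting XSL-FO would then have to be handed to
Apache FOP, which is not shown here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class HtmlToFo {
    // Toy stylesheet (illustrative assumption): one A4 page master,
    // <h1> becomes a bold block, <p> becomes a plain block.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<fo:root><fo:layout-master-set>"
      + "<fo:simple-page-master master-name='page'"
      + " page-height='29.7cm' page-width='21cm' margin='2cm'>"
      + "<fo:region-body/></fo:simple-page-master>"
      + "</fo:layout-master-set>"
      + "<fo:page-sequence master-reference='page'>"
      + "<fo:flow flow-name='xsl-region-body'>"
      + "<xsl:apply-templates select='html/body/*'/>"
      + "</fo:flow></fo:page-sequence></fo:root>"
      + "</xsl:template>"
      + "<xsl:template match='h1'>"
      + "<fo:block font-size='18pt' font-weight='bold'>"
      + "<xsl:apply-templates/></fo:block>"
      + "</xsl:template>"
      + "<xsl:template match='p'>"
      + "<fo:block space-after='6pt'><xsl:apply-templates/></fo:block>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Returns the XSL-FO document for the given namespace-free input.
    static String toFo(String html) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(html)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toFo(
            "<html><body><h1>Title</h1><p>Text</p></body></html>"));
    }
}
```

Every additional HTML element (lists, tables, links) needs its own
template, which is where most of the extra work in this approach goes.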

There seem to be some commercial HTML-to-PDF converters, too, some of
which will evaluate some CSS formatting. Adequately rendering modern
HTML that contains a lot of complex CSS formatting into any other format
will probably not be possible without a full-blown browser rendering
engine. (Yesterday I stumbled across a page where somebody had done an
html2pdf conversion with some hacks on top of Mozilla's XULRunner, but I
did not find the page again, and the process still seemed to require a
windowing system to be present.)



hth
christian