[opencms-dev] newer version of JTidy for Opencms 6.21 [ fix for: "invalid XML character (Unicode: 0x0)" ]

Christian Steinert christian_steinert at web.de
Fri Jun 30 15:14:56 CEST 2006


Hi,


Various people (including myself) have faced the following issue after 
upgrading to OpenCms 6.2.1: when saving some content in FCKeditor, the 
following message occurred:
===============================
Error Unmarshalling xml document failed.
Reason: Error on line 31 of document : An invalid XML character 
(Unicode: 0x0) was found in the CDATA section. Nested exception: An 
invalid XML character (Unicode: 0x0) was found in the CDATA section.
==============================

I use OpenCms with UTF-8 (on Tomcat 4.1/MySQL 4.0/Java 1.5 and Tomcat 
5.0/MySQL 4.0/Java 1.5); maybe UTF-8 has something to do with this problem.

The error report is correct: under certain conditions that I have not 
precisely pinpointed, the JTidy library adds 0x00 characters to the 
HTML code. The characters are inserted by JTidy (this was very clear in 
the debugger), yet I *still* did not have this problem when using the 
HTMLArea editor instead.

For me the error had something to do with HTML entities that represent 
special characters (for example fancy quotes like &ldquo; &rdquo; or 
&bdquo;, that is “ ” or „). Directly after such entities, a NULL (0x00) 
character was inserted by JTidy, but I was not able to pinpoint the exact 
code location where this happened.
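To check whether your own content is affected in the same way, it can help to look at what comes directly before each 0x00 character in the tidied output. The following is only a minimal sketch: the class NulScanner and its methods are hypothetical names of mine and are not part of JTidy or OpenCms, and the sample string in main is made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper; not part of JTidy or OpenCms. */
final class NulScanner {

    /** Returns the index of every U+0000 character in the given text. */
    static List<Integer> findNulOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\u0000') {
                offsets.add(i);
            }
        }
        return offsets;
    }

    /** Returns up to radius characters that precede the given offset. */
    static String contextBefore(String text, int offset, int radius) {
        int start = Math.max(0, offset - radius);
        return text.substring(start, offset);
    }

    public static void main(String[] args) {
        // Made-up sample: a NUL directly after an entity, as described above.
        String tidied = "He said &ldquo;Hi&rdquo;\u0000 and left";
        for (int offset : findNulOffsets(tidied)) {
            System.out.println("NUL at index " + offset
                    + ", preceded by: " + contextBefore(tidied, offset, 12));
        }
    }
}
```

If the text before each reported offset always ends in an entity or a special character, you are probably seeing the same behaviour as I did.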

Also, the problem did not occur *every* time one of these HTML 
entities appeared. It seems to have something to do with the *exact 
position* of *some* HTML entities in the file. Maybe there is a problem 
when certain HTML entities hit the end of some internal buffer within 
JTidy, but this is just a guess. (Because JTidy is ported from C, it 
handles character encodings by itself and does not just use Java-based 
string processing.)

I have temporarily uploaded a Jar file which is working for me to
    http://www.berzinarchives.com/temp/jtidy

The Jar contains both the original source and the compiled classes.

I downloaded the JTidy source from the HEAD of their Subversion 
repository. I have not changed *anything substantial* in the code. (I 
compared the code again with the original that I had downloaded 
from SVN, just to make sure.) The only change I have made at all is an 
additional safety check in class org.w3c.tidy.OutJavaImpl

My code insertion is trivial and enclosed by the comment
    //test berzinarchives
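The inserted lines themselves are not reproduced in this mail. Just to illustrate the general idea of such a safety check, here is a minimal standalone sketch that drops 0x00 characters on their way to the output and counts how often that happens. It is NOT the code from the jar: the class NulDroppingWriter, its filter helper and the counter are hypothetical names chosen for this example, and it uses only java.io, not any JTidy class.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical example; not the check that was added to JTidy. */
final class NulDroppingWriter extends FilterWriter {

    private int dropped; // how often the check "fired"

    NulDroppingWriter(Writer out) {
        super(out);
    }

    @Override
    public void write(int c) throws IOException {
        if (c == 0) {
            dropped++; // swallow the NUL instead of passing it on
            return;
        }
        out.write(c);
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(cbuf[i]);
        }
    }

    @Override
    public void write(String str, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(str.charAt(i));
        }
    }

    int getDropped() {
        return dropped;
    }

    /** Convenience helper: returns the input with all NUL characters removed. */
    static String filter(String input) {
        StringWriter sink = new StringWriter();
        try (NulDroppingWriter writer = new NulDroppingWriter(sink)) {
            writer.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("&ldquo;\u0000quoted&rdquo;\u0000 text"));
    }
}
```

A filter like this only hides the symptom. It does not explain why the 0x00 characters are produced in the first place.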

For quite a while now the library has worked for me in my development 
environment.  My additional check did not fire again after I had added 
the new library and restarted tomcat completely.


So in short:
- If you run into this problem, you might want to download this newer 
jar and put it into your OpenCms WEB-INF/lib folder.
- Don't forget to move the original jar OUT of this folder. There should 
be only one version of JTidy in your WEB-INF/lib folder.


If it *doesn't* work for you, then I'm sorry.
But: I did not invent JTidy, I have written down here all that I know 
about this problem, and after having wasted a lot of time on it 
I still do not understand how JTidy's nasty character handling works.

Hope that helps
Regards
Christian




