[opencms-dev] We add unescapeNonAscii(String) in com.opencms.util.Encoder

Thu Dec 9 21:07:15 CET 2004

Hi there,

Because we use GB2312 rather than UTF-8, we had to add unescapeNonAscii(String) to com.opencms.util.Encoder so that we can retrieve the text of HTML files in OpenCMS and build the Lucene index. Here is the code, which has been tested on OpenCMS 5.0.1 with OpenCMS-Lucene 1.5:

	/**
	 * Unescapes non-ASCII characters in an HTML String from their number-based
	 * entity representation, for example &#38; becomes &.<p>
	 * 
	 * <code>&#num;</code> is replaced by the character with that decimal code<p>
	 * 
	 * @param source the String to unescape
	 * @return the unescaped String
	 * 
	 * @see #escapeNonAscii(String)
	 */
	public static String unescapeNonAscii(String source) {
		if (source == null) return null;
		int len = source.length();
		StringBuffer result = new StringBuffer(len);
		for (int i = 0; i < len; i++) {
			char c = source.charAt(i);
			// an entity is "&#", then 1 to 6 decimal digits, then ';'
			if (c == '&' && i + 1 < len && source.charAt(i + 1) == '#') {
				int end = i + 2;
				// scan the digits without running past the end of the String
				while (end < len && end < i + 8
						&& source.charAt(end) >= '0' && source.charAt(end) <= '9') {
					end++;
				}
				if (end > i + 2 && end < len && source.charAt(end) == ';') {
					int ch = Integer.parseInt(source.substring(i + 2, end));
					// values that do not fit into a char are left escaped
					if (ch <= Character.MAX_VALUE) {
						result.append((char) ch);
						i = end; // skip the whole entity including ';'
						continue;
					}
				}
			}
			// not a well-formed numeric entity: copy the character as it is
			result.append(c);
		}
		return result.toString();
	}
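
For anyone who wants to try the idea outside of OpenCMS, here is a small stand-alone sketch. It is not the code we patched into com.opencms.util.Encoder; it is a regex-based variant that I assume behaves the same way for well-formed entities. The class name UnescapeDemo and the sample text are made up for illustration, and Matcher.quoteReplacement needs Java 5 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of OpenCMS.
final class UnescapeDemo {

    // "&#", then 1 to 6 decimal digits, then ";"
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#([0-9]{1,6});");

    static String unescapeNonAscii(String source) {
        if (source == null) {
            return null;
        }
        Matcher m = NUMERIC_ENTITY.matcher(source);
        StringBuffer result = new StringBuffer(source.length());
        while (m.find()) {
            int code = Integer.parseInt(m.group(1));
            // Values that do not fit into a char are left escaped.
            String replacement = (code <= Character.MAX_VALUE)
                    ? String.valueOf((char) code)
                    : m.group();
            // Quote the replacement so that '$' and '\' are taken literally.
            m.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // Two escaped Chinese characters followed by an escaped ampersand.
        System.out.println(unescapeNonAscii("&#20013;&#25991; &#38; OpenCms"));
    }
}
```

Text that only looks like the start of an entity, such as a lone "&#" or an "&" at the very end of the String, is copied through unchanged.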

Regards,

Shi Yusen/Beijing Langhua Ltd.