net.grcomputing.opencms.search.lucene
Class FastTagStripper

java.lang.Object
  |
  +--net.grcomputing.opencms.search.lucene.FastTagStripper

public class FastTagStripper
extends java.lang.Object

This class implements a very fast tag stripper. It is not exactly careful about stripping tags... it just rips them all out.

Author:
Matt Butcher mbutcher@grcomputing.net
See Also:
http://grcomputing.net

Constructor Summary
FastTagStripper()
           
 
Method Summary
static boolean isWhitespace(char c)
           
static void main(java.lang.String[] argv)
           
static java.lang.String strip(char[] doc)
          Ruthlessly strips all tags out of a string.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

FastTagStripper

public FastTagStripper()
Method Detail

main

public static void main(java.lang.String[] argv)

strip

public static java.lang.String strip(char[] doc)
Ruthlessly strips all tags out of a string. Not good if you are trying to capture data inside of the tags. A tag/element is considered to be anything that begins with a < and ends with a >. Yes, this is pretty shallow, but it works, and it's fast. If you don't care about speed, look at the HTMLParser that comes with Lucene.

Note that this will include contents of script or style tags that do not use comment tags to hide their contents.

This will work for XML, too, but it will strip out the contents of CDATA sections as well.

Returns:
String stripped of all tags/elements.
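
The behavior described above can be illustrated with a minimal sketch. This is not the actual source of FastTagStripper; the class name TagStripSketch is hypothetical, and the sketch assumes the simplest reading of the description: every character from a < through the next > is dropped, and everything else is kept.

```java
// Hypothetical illustration only; not the FastTagStripper implementation.
final class TagStripSketch {

    /** Drops every character from a '<' through the next '>' inclusive. */
    static String strip(char[] doc) {
        StringBuilder out = new StringBuilder(doc.length);
        boolean inTag = false;
        for (char c : doc) {
            if (c == '<') {
                inTag = true;          // start of a tag: stop copying
            } else if (c == '>' && inTag) {
                inTag = false;         // end of the tag: resume copying
            } else if (!inTag) {
                out.append(c);         // ordinary text outside any tag
            }
        }
        return out.toString();
    }

    public static void main(String[] argv) {
        // Plain markup: only the text survives.
        System.out.println(strip("<p>Hello <b>world</b></p>".toCharArray()));
        // Script body not hidden in a comment: it is kept as text.
        System.out.println(strip("<script>var x = 1;</script>text".toCharArray()));
        // CDATA section: treated as one big tag, so its contents are lost.
        System.out.println(strip("<a><![CDATA[secret]]></a>".toCharArray()));
    }
}
```

Under these assumptions the three calls show the trade-offs noted above: ordinary tags vanish, an unhidden script body leaks into the output, and a CDATA section disappears along with its contents.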

isWhitespace

public static boolean isWhitespace(char c)


Copyright © 2003 Matt Butcher of Global Resources for Computing. Reproduction and modification of this document are allowed in accordance with the GPL v2. Refer to COPYING.txt for information on acceptable use.