[opencms-dev] lucene search module and multiple content body

Fri Aug 8 10:28:01 CEST 2003

Hi Matt,

OK, got it. Here's the source code: I just retrieve all body names and
traverse the list to get each body's text for indexing.
Works for me. Thanks for the great module, Matt :)

Jens

  /*
   * Now we move on to the contents of the Page. What we need to do is
   * get the body file (from CmsXmlControlFile), then get the
   * <TEMPLATE><![CDATA[]]></TEMPLATE> sections from it. That info is in
   * HTML, so we need to parse that out and get just the text.
   */
  CmsXmlControlFile xCntrl = new CmsXmlControlFile(cmso, f);
  String contentsName = xCntrl.getElementTemplate("body");

  CmsFile contents = cmso.readFile(contentsName);
  CmsXmlTemplateFile xcContents = new CmsXmlTemplateFile(cmso, contents);

  // We want all existing bodies, so get all body selector (section) names.
  Iterator itSelect = xcContents.getAllSections().iterator();
  String cdata, cleanCdata;

  // For each body, add its content to the document.
  while (itSelect.hasNext()) {
    cdata = xcContents.getTemplateContent(null, null,
        (String) itSelect.next());

    // Strip the HTML tags so that only the text gets indexed.
    cleanCdata = FastTagStripper.strip(cdata.toCharArray());

    // Need either a reader or string. CmsFile supports byteArray.
    doc.add(Field.UnStored(FIELD_BODY, cleanCdata));
  }

  return doc;
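
For anyone who wants to try the idea without an OpenCms installation, the
loop above can be sketched in plain Java. This is only an illustration
under stated assumptions: the Map stands in for the sections of a
CmsXmlTemplateFile, stripTags is a naive hypothetical replacement for
FastTagStripper (whose real behavior may differ), and the class and method
names are made up for this example. Nothing here touches Lucene or OpenCms.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MultiBodySketch {

    // Naive tag stripper: drops everything between '<' and '>'.
    // Hypothetical stand-in for FastTagStripper; it does not handle
    // comments, entities, or '>' inside attribute values.
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (char c : html.toCharArray()) {
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Returns one cleaned text per section, in section order.
    // The map (section name -> HTML) is an assumed stand-in for the
    // body file; each returned string would become one body field.
    static List<String> bodyTexts(Map<String, String> sections) {
        List<String> texts = new ArrayList<>();
        for (String html : sections.values()) {
            texts.add(stripTags(html));
        }
        return texts;
    }

    public static void main(String[] args) {
        Map<String, String> sections = new LinkedHashMap<>();
        sections.put("body", "<p>Default text</p>");
        sections.put("body2", "<h1>Second</h1> body");
        for (String text : bodyTexts(sections)) {
            System.out.println(text);
        }
    }
}
```

The point is the same as in the patch: every section contributes its own
text, instead of only the default one.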

----- Original Message -----
From: "M Butcher" <mbutcher at grcomputing.net>
To: <opencms-dev at opencms.org>
Sent: Thursday, August 07, 2003 7:24 PM
Subject: Re: [opencms-dev] lucene search module and multiple content body

> Jens,
>
> The class net.grcomputing.opencms.search.lucene.PageDocument is what
> handles grabbing page content. It only indexes what gets returned from
> CmsXmlTemplateFile.getTemplateContent().
>
> I'm not sure how it should deal with the multiple content sections,
> since I've never used them. Take a look at the code and send me any
> fixes if you have them. The relevant code starts on line 116.
>
> Thanks,
>
> Matt
>
> On Thu, 2003-08-07 at 08:46, Jens Rickhoff wrote:
> > Hello list,
> >
> > I successfully installed the lucene module and it is indexing the pages
> > correctly.
> > Well, not all of them. Only those which do not depend on a multiple
> > content body:
> > http://mail.opencms.org/pipermail/opencms-dev/2003q2/005734.html
> >
> > If I completely leave out the default body element:
> >     <TEMPLATE><![CDATA[ ]]></TEMPLATE>
> >     <edittemplate><![CDATA[ ]]></edittemplate>
> > (which makes it look nice in the WYSIWYG editor), the search module
> > returns the exception:
> >
> > [CmsXmlTemplateFile] Template definition file /system/bodies/foo.html is
> > corrupt. cannot find default section.
> > IndexManager: CMS error processing file foo.html:
> > com.opencms.core.CmsException: 25 XML tag
> > missing. Detailed error: Corrupt template file /system/bodies/foo.html.
> > Cannot find default section.
> >
> > That exception is clear to me. I got rid of it by just adding the
> > default body element.
> > BUT: the search module only indexes the content of the default body
> > element :-(
> > The text in body2 through body9 (see link above) is completely ignored
> > for indexing.
> >
> > ->> How can I tell the module to scan these parts as well?
> > ->> And: is there any way to get rid of the above exception w/o adding
> > the default body element?
> >
> > Thanks a lot,
> >
> > Jens
> >
> > _______________________________________________
> > This mail is send to you from the opencms-dev mailing list
> > To change your list options, or to unsubscribe from the list, please
> > visit http://mail.opencms.org/mailman/listinfo/opencms-dev
> --
> M Butcher <mbutcher at grcomputing.net>