[opencms-dev] RE: Lucene-search: stop words

Thu Jun 1 08:35:44 CEST 2006

Christian -

I've realised there was one question I managed not to answer in my prolix
response!

When a Field is added to a Lucene Document using Index.TOKENIZED and
Store.YES hints, its contents are split into tokens according to the
prevailing Analyzer, _but_ the value stored in the Field is the original
content (stop words and all).  You can also use Store.COMPRESS to store
contents in a compressed form - handy for large text content which you
nevertheless want to keep hold of for search retrieval.
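
To make the tokenized-versus-stored distinction concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API: ToyStoredField, analyze() and the stop-word list are hypothetical stand-ins, and the analysis step is only a rough imitation of what a stop-word-aware Analyzer might do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for a tokenized, stored field. This is NOT Lucene's Field class.
class ToyStoredField {
    // Hypothetical stop-word list; a real analyzer supplies its own.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    final String storedValue;        // original content, handed back at retrieval time
    final List<String> indexedTerms; // what a search can actually match

    ToyStoredField(String content) {
        this.storedValue = content;
        this.indexedTerms = analyze(content);
    }

    // Lower-case the text, split on non-letters, and drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

In this toy model, new ToyStoredField("The Lord of the Rings") is searchable only by "lord" and "rings", while its storedValue still holds the full title, stop words included.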

Jon

-----Original Message-----
From: opencms-dev-bounces at opencms.org
[mailto:opencms-dev-bounces at opencms.org] On Behalf Of Jonathan Woods
Sent: 30 May 2006 16:27
To: 'The OpenCms mailing list'
Subject: RE: [opencms-dev] Lucene-search: stop words aren't displayed in
search result list

Christian -

I'm a little in the dark about what OpenCms does in retrieval code which
comes 'out of the box'.  I confess I don't plan to use it directly, so I
haven't looked into it yet.

The unit of storage in Lucene is the Document, and that's the unit which the
OpenCms machinery also deals with.  'Publish' events are trapped, and the
VFS resources concerned are passed individually to
I_CmsDocumentFactory.newInstance(), which is required to return a new Lucene
Document instance which OpenCms then adds to the Lucene index.  [Lucene]
Documents have Fields which are added to them at indexing time, and Fields
may be tokenized or not, stored or not (so e.g. tokenized AND stored is
possible).  You can add what you like.  For VFS resource types of interest
(in my case, a couple of custom XML content types) you're right - I am
storing two flavours of Fields for certain pieces of data in the original
VFS resource - so <one VFS resource> maps to <one Lucene Document> which
maps to <multiple Fields, some slightly redundant>.
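
The mapping described above can be sketched roughly as follows. This is plain Java rather than the Lucene or OpenCms API: ToyField, ToyDocumentFactory and every field name ("path", "title_search", "title_display", "content") are made up for illustration, and the two booleans only loosely mirror the stored/tokenized choices.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of "one VFS resource -> one document -> several fields".
// Plain Java only; none of these types belong to Lucene or OpenCms.
class ToyField {
    final String value;
    final boolean stored;     // can the value be read back from a search hit?
    final boolean searchable; // is the value analyzed into searchable terms?

    ToyField(String value, boolean stored, boolean searchable) {
        this.value = value;
        this.stored = stored;
        this.searchable = searchable;
    }
}

class ToyDocumentFactory {
    // Builds the fields for one resource. All field names are hypothetical.
    static Map<String, ToyField> newDocument(String vfsPath, String title, String body) {
        Map<String, ToyField> fields = new LinkedHashMap<>();
        // Stored only: lets retrieval code find its way back to the VFS resource.
        fields.put("path", new ToyField(vfsPath, true, false));
        // Two flavours of the same data: one to search, one to display.
        fields.put("title_search", new ToyField(title, false, true));
        fields.put("title_display", new ToyField(title, true, false));
        // Searchable only: the body is matched against but not kept.
        fields.put("content", new ToyField(body, false, true));
        return fields;
    }
}
```

The redundancy is the point of the design: the display flavour spares a trip back to the VFS at retrieval time, at the cost of a somewhat larger index.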

And yes, the retrieval code does need to know which Fields to deal with:

(i) to create Lucene Query objects which query against the right Fields.
(ii) to decide which Fields, if any, are involved in ordering results.
Results come out as Hits, in Lucene language; and a Hit has a Document. By
default Hits are sorted by Lucene's assessment of relevance, which itself
can be customised.
(iii) to decide what information to use and/or display from each Hit.
OpenCms (and my code) stores a Field which contains the VFS path of the
associated VFS resource, so you can get back to the VFS resource from the
Hit's Document.  You could also store e.g. the document's title or its
description, and retrieve these as necessary.
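
The three points above can be sketched in plain Java as follows. Nothing here is the Lucene API: ToyDoc, ToyHit and ToySearcher are hypothetical, and the "score" is just a count of term occurrences standing in for Lucene's far more refined relevance measure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy retrieval model in plain Java; not the Lucene API.
class ToyDoc {
    final Map<String, String> searchable; // field name -> text to match against
    final Map<String, String> stored;     // field name -> value kept for display

    ToyDoc(Map<String, String> searchable, Map<String, String> stored) {
        this.searchable = searchable;
        this.stored = stored;
    }
}

class ToyHit {
    final ToyDoc doc;
    final int score; // crude relevance: number of term occurrences

    ToyHit(ToyDoc doc, int score) {
        this.doc = doc;
        this.score = score;
    }
}

class ToySearcher {
    static List<ToyHit> search(List<ToyDoc> docs, String field, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        List<ToyHit> hits = new ArrayList<>();
        for (ToyDoc doc : docs) {
            // (i) query against one named field only
            String text = doc.searchable.getOrDefault(field, "");
            int score = 0;
            for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (token.equals(wanted)) {
                    score++;
                }
            }
            if (score > 0) {
                hits.add(new ToyHit(doc, score));
            }
        }
        // (ii) order results, highest score first
        hits.sort((a, b) -> Integer.compare(b.score, a.score));
        return hits;
    }
}
```

For point (iii), the caller would read whatever it needs from each hit's stored map, for instance hit.doc.stored.get("path") to get back to the VFS resource.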

I bought "Lucene in Action" (Gospodnetic & Hatcher, pub Manning) which is a
great read if you like that kind of thing.  There are sample chapters at
http://www.manning.com/hatcher2/ which give a great introduction to the
innards of Lucene, though the book is written for v1.3 and we're now at
1.9.x

The trouble with OpenCms's usage of Lucene is that it's very hidden, and you
have to do some work to get at the starting point, which for me was the
IndexSearcher.  After that, you'll find the Lucene API wonderfully simple,
well documented and surprisingly powerful:

// needs java.util.Iterator and org.apache.lucene.search.{BooleanQuery, Hit, Hits}
BooleanQuery query = new BooleanQuery();
... // add query clauses with query.add(...)
Hits hits = indexSearcher.search(query);
// Hits.iterator() is a raw Iterator (no generics in the Lucene API), so cast
for (Iterator it = hits.iterator(); it.hasNext();) {
    Hit hit = (Hit) it.next();
    String hitDescription =
        hit.getDocument().getField("description").stringValue(); // for example
    float hitScore = hit.getScore();
    ... // display what you like
}

Lucene can be persuaded to make a really good job of dealing with the
typical CMS mix of relatively unstructured documents and structured XML
content - it's good at marrying the two together.

I've just detected that I'm avoiding work.  Back to the grindstone!

Jon

-----Original Message-----
From: opencms-dev-bounces at opencms.org
[mailto:opencms-dev-bounces at opencms.org] On Behalf Of Christian Steinert
Sent: 30 May 2006 15:41
To: The OpenCms mailing list
Subject: RE: [opencms-dev] Lucene-search: stop words aren't displayed in
search result list

> 
> I'm adopting this approach at the moment.  I've written a Lucene 
> Document factory, implementing I_CmsDocumentFactory, which indexes all 
> the usual OpenCms Fields then adds a few of its own for good measure.
> The ones used for specialised searching are indexed with 
> Field.Store.NO and Field.Index.TOKENIZED/Field.Index.UN_TOKENIZED; and 
> the others which save me having to hit the VFS unnecessarily are 
> indexed with Field.Store.YES, Field.Index.NO.
> 

Dear Jonathan

This sounds interesting. If OpenCms really tries to display the information
as it was indexed, then this would certainly be strange.

Jason: could you post your patched code? Where is it that opencms uses an
analyzer, where it shouldn't? During index time or while retrieving result
information?

=====

Jonathan: Does that mean that you want to store each page in Lucene in two
flavors when indexing it: once in its original form and once again in an
analyzer-filtered form? That sounds like a good idea. I just did not know
that it is possible at all with Lucene to store non-tokenized information
along with the tokenized information that actually gets searched. As this
seems to be possible, it'd of course be easier and probably a little faster
to store the pages redundantly in Lucene and not read them from the VFS for
this purpose.

But: would the retrieval code not also have to know about this so that it
could take the result from the right - unfiltered - fields before extracting
the excerpts?

C.

_______________________________________________
This mail is sent to you from the opencms-dev mailing list To change your
list options, or to unsubscribe from the list, please visit
http://lists.opencms.org/mailman/listinfo/opencms-dev
