[opencms-dev] Lucene-search: stop words aren't displayed in search result list

Tue May 30 15:14:28 CEST 2006

I haven't looked at OpenCms's own search retrieval code - but from looking
at the indexing code, I reckon the 'problem' is at indexing time.

I think the best way in general is to pull summary content from Lucene,
having stored it in the right way without running it through any kind of
analyser.  The set of Lucene Hits has to be retrieved anyway, as does the
Lucene Document associated with each Hit that you're going to render; and
displaying the contents of a stored Field which (I believe) is already in
memory is going to be faster than pulling it out of the VFS.  I know
this sounds like a lot of redundancy, but as you know that's what indexes
are for - runtime speed time after time, for the price of some cheap storage
space and some initial effort.

I'm adopting this approach at the moment.  I've written a Lucene Document
factory, implementing I_CmsDocumentFactory, which indexes all the usual
OpenCms Fields then adds a few of its own for good measure.  The ones used
for specialised searching are indexed with Field.Store.NO and
Field.Index.TOKENIZED/Field.Index.UN_TOKENIZED; and the others, which save
me having to hit the VFS unnecessarily, are added with Field.Store.YES,
Field.Index.NO.
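
To make the split between "searchable" and "display-only" fields concrete,
here is a toy model in plain Java.  It is only a sketch and deliberately does
not use the Lucene or OpenCms APIs: the class ToyIndex, its methods and the
tiny stop-word set are all made up for illustration.  The posting map plays
the role of a field that is indexed but not stored, and the summary list
plays the role of a field that is stored but not indexed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model only: NOT the Lucene API. All names here are hypothetical.
class ToyIndex {
    // Assumed stop-word list for this sketch; real analysers ship their own.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "on", "in", "is"));

    // term -> ids of documents containing it ("indexed, not stored")
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> verbatim summary ("stored, not indexed")
    private final List<String> storedSummaries = new ArrayList<>();

    // Crude stand-in for an analyser: lower-case, split on non-letters,
    // drop stop words.
    static List<String> analyse(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Adds a document and returns its id.
    int addDocument(String content, String summary) {
        int id = storedSummaries.size();
        storedSummaries.add(summary);            // kept exactly as given
        for (String term : analyse(content)) {   // only analysed terms are searchable
            Set<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new LinkedHashSet<>();
                postings.put(term, ids);
            }
            ids.add(id);
        }
        return id;
    }

    // Returns the stored summaries of documents matching any analysed query term.
    List<String> search(String query) {
        Set<Integer> matches = new LinkedHashSet<>();
        for (String term : analyse(query)) {
            Set<Integer> ids = postings.get(term);
            if (ids != null) {
                matches.addAll(ids);
            }
        }
        List<String> results = new ArrayList<>();
        for (int id : matches) {
            results.add(storedSummaries.get(id));
        }
        return results;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("The cat sat on the mat", "The cat sat on the mat.");
        System.out.println(index.search("cat")); // prints [The cat sat on the mat.]
        System.out.println(index.search("the")); // prints []
    }
}
```

The hit is found through the analysed terms, but what gets displayed is the
untouched summary, stop words and all.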

Jon

-----Original Message-----
From: opencms-dev-bounces at opencms.org
[mailto:opencms-dev-bounces at opencms.org] On Behalf Of Christian Steinert
Sent: 30 May 2006 11:22
To: The OpenCms mailing list
Subject: Re: [opencms-dev] Lucene-search: stop words aren't displayed in
search result list

Jonathan Woods wrote:
> Jason -
>
> I can tell this problem is in my near future too.  Is it really 
> necessary to create a patch?  I was hoping to specify an analyser 
> class in opencms-search.xml and get round the problem that way.
>
> Jon
>   
Dear Jason, dear Jonathan,

I found this overview somewhere on the web. It shows that each analyzer
uses a fixed tokenizer/filter configuration, so it seems that an analyzer
may contain both stop filters and stemmers.

Class:   Tokenizer and TokenFilter

* GermanAnalyzer:      StandardTokenizer, StandardFilter, StopFilter
  (German by default, alternative word list possible), GermanStemFilter
  (configurable exclusion list), LowerCaseFilter
* SimpleAnalyzer:      LowerCaseTokenizer
* StandardAnalyzer:    StandardTokenizer, StandardFilter,
  LowerCaseFilter, StopFilter (English by default, alternative word list
  possible)
* StopAnalyzer:        LowerCaseTokenizer, StopFilter (English by
  default, alternative word list possible)
* WhitespaceAnalyzer:  WhitespaceTokenizer
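
To illustrate how a fixed chain like this changes the text, here is a small
plain-Java sketch.  It only imitates the idea of a tokenizer followed by
filters; none of it is Lucene code, and the class AnalyserChainSketch, its
method names and the short stop-word set are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-ins for analyser chains; NOT Lucene classes. Names are hypothetical.
class AnalyserChainSketch {
    // Assumed English stop words, for this sketch only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "on", "in", "is"));

    // Like a whitespace tokenizer: split on whitespace, keep case and punctuation.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Like a lower-casing letter tokenizer: runs of letters, lower-cased.
    static List<String> lowerCaseTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Like a stop filter: drop every token found in the stop-word set.
    static List<String> stopFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // Stop-analyser-style chain: lower-case tokenizer, then stop filter.
    static List<String> stopChain(String text) {
        return stopFilter(lowerCaseTokens(text));
    }

    public static void main(String[] args) {
        String text = "The cat sat on the mat.";
        // prints: The cat sat on the mat.
        System.out.println(String.join(" ", whitespaceTokens(text)));
        // prints: cat sat mat
        System.out.println(String.join(" ", stopChain(text)));
    }
}
```

A preview rebuilt from the second chain's tokens has lost its stop words,
case and punctuation, whereas the whitespace-only chain leaves the wording
intact.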

The clean way would be to pull the preview content not from Lucene, but from
OpenCms.
Is this the way it's done? Is the mistake just that OpenCms filters the
content badly before displaying it?

Jason - I would be *very* interested in taking a look at your patched code.
Maybe the whitespace removal inside OpenCms could just be done with a more
primitive analyzer, perhaps the WhitespaceAnalyzer. I think it's quite
mistaken to filter the preview through the same analyzer that is used for
indexing.

Christian
