[opencms-dev] Lucene 1.4, Spanish Analyzer

M Butcher mbutcher at grcomputing.net
Thu Mar 4 18:22:02 CET 2004


Ernesto De Santis wrote:
> Hi Stephan
> 
> 
>>seems that your analyzer reduces the words to their stem, i.e. it removes
>>the inflections and declensions. This is just what a stemmer should do.
> 
> 
> Well, how do other language analyzers work?
> With stemmers?

The one I use, the StandardAnalyzer, also uses a stemmer, and can mangle 
the English language almost as effectively as the SpanishAnalyzer 
mangles Spanish. :)

I guess the idea is that the stemmer helps more than it hurts.

Sometimes stemming rules can be a little on the arbitrary side. For 
instance, the English analyzer may remove the trailing 's' on any word 
(since that makes plurals in English) unless the word is short. That 
way, words like 'his', 'is', and 'as' don't get truncated, but words 
like 'buckets' get correctly truncated to 'bucket'.
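
To make the rule above concrete, here is a minimal sketch in plain Java. It
is not Lucene's code: the class name NaiveSStemmer, the stem() method, and
the three-character cutoff are all assumptions made up for illustration.

```java
import java.util.Locale;

// Hypothetical illustration of the "strip a trailing 's' unless the word
// is short" rule. This is NOT how any Lucene analyzer is implemented.
public class NaiveSStemmer {
    // Assumed cutoff: words of this length or shorter are left untouched.
    private static final int SHORT_WORD_LENGTH = 3;

    public static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        // Only strip the final 's' from words longer than the cutoff.
        if (w.length() > SHORT_WORD_LENGTH && w.endsWith("s")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        String[] samples = {"his", "is", "as", "buckets", "glass"};
        for (String s : samples) {
            System.out.println(s + " -> " + stem(s));
        }
    }
}
```

With this sketch, 'his', 'is', and 'as' are left alone and 'buckets' becomes
'bucket', but 'glass' becomes 'glas', which shows how arbitrary such a rule
can be.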

>>To get the right results when you are searching for your words you should
>>use your analyzer also to parse the search query
> 
> 
> Yes, I search with my analyzer and find the results. But if I search with
> the stripped word, it finds the document too! And this word doesn't exist in Spanish.

Yes -- an odd artifact of the stemming mechanism. It would be 
interesting to try the same test on other search engines and see what 
happens. I wouldn't be surprised if you got similar results on them, as 
well.
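
A rough sketch of why this happens, assuming the index stores stemmed terms
and the query goes through the same stemmer. Everything here
(StemmedIndexDemo, its toy stem() rule, index(), matches()) is hypothetical
and far simpler than what Lucene actually does.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy index: it stores only the stemmed form of each word.
public class StemmedIndexDemo {
    private final Set<String> terms = new HashSet<>();

    // Toy stemmer (an assumption): strip a trailing 's' from longer words.
    public static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 3 && w.endsWith("s")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Index time: every whitespace-separated token is stemmed before storing.
    public void index(String text) {
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(stem(token));
            }
        }
    }

    // Query time: the query word is stemmed the same way, then looked up.
    public boolean matches(String queryWord) {
        return terms.contains(stem(queryWord));
    }

    public static void main(String[] args) {
        StemmedIndexDemo idx = new StemmedIndexDemo();
        idx.index("three ponies");
        System.out.println("ponies: " + idx.matches("ponies"));
        System.out.println("ponie:  " + idx.matches("ponie"));
        System.out.println("pony:   " + idx.matches("pony"));
    }
}
```

Here the index holds 'ponie' rather than 'ponies', so a query for the
non-word 'ponie' matches just as well as 'ponies' does, while 'pony' does
not match at all.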

> 
> 
>>(SearchHelper.doSimpleSearch only uses a StopAnalyzer).
> 
> 
> I don't use SearchHelper. I think that it doesn't work for my application. I
> parse the string query with

Good idea. Stephan is right about analyzers, and when I initially wrote 
SearchHelper, I was trying to provide an example more than a 
one-size-fits-all solution. I guess I should have written it better. :)

Matt

On an interesting side note, when I write about parsing other languages, 
I realize that I should write English that is easier to read/translate.


