[opencms-dev] Lucene 1.4, Spanish Analyzer
M Butcher
mbutcher at grcomputing.net
Thu Mar 4 18:22:02 CET 2004
Ernesto De Santis wrote:
> Hi Stephan
>
>
>>seems that your analyzer reduces the words to their stem, i.e. it removes
>>the inflections and declensions. This is just what a stemmer should do.
>
>
> Well, how do other language analyzers work?
> With stemmers?
The one I use, the StandardAnalyzer, also uses a stemmer, and can mangle
the English language almost as effectively as the SpanishAnalyzer
mangles Spanish. :)
I guess the idea is that the stemmer helps more than it hurts.
Sometimes stemming rules can be a little on the arbitrary side. For
instance, the English analyzer may remove the trailing 's' on any word
(since that makes plurals in English) unless the word is short. That
way, words like 'his', 'is', and 'as' don't get truncated, but words
like 'buckets' get correctly truncated to 'bucket'.
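
To make the short-word rule concrete, here is a minimal sketch in Python. It is a toy rule written for illustration, not the rule that any Lucene analyzer actually applies; the function name strip_plural and the min_length cutoff of 4 are assumptions.

```python
def strip_plural(word: str, min_length: int = 4) -> str:
    """Remove a trailing 's' unless the word is shorter than min_length."""
    if len(word) >= min_length and word.endswith("s"):
        return word[:-1]
    return word


# Short words are left alone; longer words lose their trailing 's'.
for w in ["his", "is", "as", "buckets", "glass"]:
    print(w, "->", strip_plural(w))
```

Note that 'glass' comes out as 'glas' under this rule, which is the kind of mangling that makes such rules feel arbitrary.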
>>To get the right results when you are searching for your words you should
>>use your analyzer also to parse the search query
>
>
> Yes, I search with my analyzer and find the results. But if I search with
> the stripped word, I also find the document, and this word doesn't exist in Spanish.
Yes -- an odd artifact of the stemming mechanism. It would be
interesting to try the same test on other search engines and see what
happens. I wouldn't be surprised if you got similar results on them, as
well.
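
A rough sketch of why this happens, assuming a toy stemmer and a tiny in-memory index (none of this is Lucene code, and the names toy_stem, build_index, and search are made up for the example): the index stores the stem rather than the original word, so a query for the stem matches even when the stem is not a real word.

```python
from collections import defaultdict


def toy_stem(word):
    """Toy stemmer: lowercase, then drop a trailing 's' from words longer than 3 letters."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def build_index(docs):
    """Map each stemmed token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query word the same way and look it up in the index."""
    return index.get(toy_stem(query), set())


docs = {1: "red buckets", 2: "camera lens"}
index = build_index(docs)

print(search(index, "lens"))  # the real word finds document 2
print(search(index, "len"))   # the stripped non-word finds document 2 as well
```

Here 'lens' is stored as 'len', so searching for 'len' returns the same document even though 'len' is not an English word.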
>
>
>>(SearchHelper.doSimpleSearch only uses a StopAnalyzer).
>
>
> I don't use SearchHelper. I don't think it works for my application. I
> parse the string query with
Good idea. Stephan is right about analyzers, and when I initially wrote
SearchHelper, I was trying to provide an example more than a
one-size-fits-all solution. I guess I should have written it better. :)
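
To illustrate Stephan's point, here is a small sketch of what goes wrong when the query is parsed with a different analyzer than the one used for indexing. Both analyzers below are simplified stand-ins written for this example (they are not Lucene's StopAnalyzer or any real stemming analyzer), and the stop-word list is an assumption.

```python
STOP_WORDS = {"the", "a", "of"}  # assumed stop list, for illustration only


def stemming_analyzer(text):
    """Lowercase, drop stop words, and strip a trailing 's' from longer words."""
    tokens = []
    for tok in text.lower().split():
        if tok in STOP_WORDS:
            continue
        if len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]
        tokens.append(tok)
    return tokens


def stop_only_analyzer(text):
    """Lowercase and drop stop words, with no stemming at all."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def build_index(docs, analyzer):
    """Map each analyzed term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyzer(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, analyzer):
    """Analyze the query and return every document matching any query term."""
    hits = set()
    for term in analyzer(query):
        hits |= index.get(term, set())
    return hits


docs = {1: "the red buckets"}
index = build_index(docs, stemming_analyzer)

print(search(index, "buckets", stemming_analyzer))   # same analyzer: finds document 1
print(search(index, "buckets", stop_only_analyzer))  # mismatched analyzer: finds nothing
```

The index only contains 'bucket', so an unstemmed query for 'buckets' misses the document.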
Matt
On an interesting side note, when I write about parsing other languages,
I realize that I should write English that is easier to read/translate.