When searching for content based on search terms entered by the user, Confluence splits the text of the content into tokens, and then filters and modifies those tokens according to the following rules.
Tokenisation
Confluence uses Lucene's Standard Tokenizer. This splits the text into tokens as follows:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by white space is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognises email addresses and internet host names as one token.
For example, the string 'foo-bar5' contains a number, so it is not split into 'foo' and 'bar5'. As a result, a search for 'bar5' or 'bar*' will not find any results.
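The rules above can be sketched in a few lines of Python. This is only a rough approximation for illustration, not Lucene's actual tokenizer grammar: the function name `tokenize`, the `EMAIL` pattern and the set of edge punctuation characters are all assumptions made for this example.

```python
import re

# Hypothetical, simplified email pattern (an assumption, not Lucene's grammar).
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def tokenize(text):
    """Approximate the tokenisation rules described above."""
    tokens = []
    for chunk in text.split():
        # Drop punctuation at the edges, e.g. a dot that is followed by white space.
        chunk = chunk.strip(".,;:!?\"'()[]{}")
        if not chunk:
            continue
        if EMAIL.match(chunk):
            # Email addresses stay in one piece.
            tokens.append(chunk)
        elif "-" in chunk and any(c.isdigit() for c in chunk):
            # A hyphenated chunk containing a digit is kept whole,
            # like a product number.
            tokens.append(chunk)
        else:
            # Split at remaining punctuation (including hyphens), but keep
            # inner dots and apostrophes so host names and "'s" survive.
            tokens.extend(p for p in re.split(r"[^\w.']+", chunk) if p)
    return tokens


print(tokenize("foo-bar5"))  # kept as a single token
print(tokenize("foo-bar"))   # split at the hyphen
```

Under these assumptions, 'foo-bar5' stays a single token, which is why a query for 'bar5' has nothing to match against, while 'foo-bar' is indexed as the two tokens 'foo' and 'bar'.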
Filtering
Confluence then:
- Removes "'s" from the ends of words.
- Removes the dots from acronyms, e.g. I.B.M. becomes IBM.
- Converts everything to lower case.
- Removes common words like 'the' and 'or'.
- Converts words to their stems. For example, 'fishing' and 'fishes' both become 'fish'.
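The filtering steps can be illustrated with a small Python sketch that applies them in the order listed. It is a simplified stand-in, not Confluence's or Lucene's code: `filter_tokens`, `toy_stem`, the `ACRONYM` pattern and the contents of `STOP_WORDS` are assumptions for this example, and the stemmer is a deliberately naive suffix stripper rather than a real stemming algorithm.

```python
import re

# Illustrative stop-word set; only 'the' and 'or' come from the text above.
STOP_WORDS = {"the", "or", "a", "and"}

# Single letters each followed by a dot, e.g. "I.B.M." (assumed pattern).
ACRONYM = re.compile(r"^(?:[A-Za-z]\.)+$")


def toy_stem(word):
    """Naive suffix stripping; a real stemmer handles many more cases."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def filter_tokens(tokens):
    """Apply the filtering steps in the order listed above."""
    result = []
    for token in tokens:
        if token.endswith("'s"):
            token = token[:-2]               # remove "'s"
        if ACRONYM.match(token):
            token = token.replace(".", "")   # I.B.M. -> IBM
        token = token.lower()                # lower case
        if token in STOP_WORDS:
            continue                         # drop common words
        result.append(toy_stem(token))       # reduce to a stem
    return result


print(filter_tokens(["The", "I.B.M.", "fishing", "fishes", "Confluence's"]))
```

Because the same filters are applied to both the indexed text and the search terms, a search for 'fishes' in this sketch would match content containing 'fishing', since both reduce to 'fish'.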
RELATED TOPICS
Searching Confluence
Comments (1)
May 18, 2007
Grigorio V. Moshkin says:
We are probing Confluence to maintain Russian data. So almost all works fine with Russian language. Almost, but not all.
I go to the Confluence admin menu, and set Administration->General Configuration->Indexing Language to be Russian. Yes! Morphology of Russian language starts to work!
But one could not find digits by Confluence standard context search: e.g. queries like 2 16 2000 won't work - just an empty result is returned.
Then, I switch Administration->General Configuration->Indexing Language to be English - wow, digits found in documents! But Russian morphology won't work.
Then I switch back to Russian - Russian morphology works, digits won't work. Why? What's the matter?