Stemming for better matching

Pair this blog with:

Hot Toddy

Recipe

Stemming reduces words to their root form so that different forms of the same word match each other in search. Without it, some of your documents contain “frogs” and others contain “frog”, and a search for either term returns only the documents with that exact form, not the full list. This is where stemming filters come in.

There are a few options here, depending on what you want. None of them is perfect; if one were, we would all use it, right? So the things to consider are which languages you need to support and just how aggressive you want your stemming algorithm to be.
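
To make the choice concrete, here are some of the stemming filters that ship with Solr. The three English ones are listed roughly from least to most aggressive. Treat this as a sketch, not a recommendation: you would use only one of them in a given analyzer, and the `language` value on the Snowball filter is just an example.

    <!-- Pick ONE of these per analyzer chain -->
    <!-- Minimal: mostly just strips English plurals -->
    <filter class="solr.EnglishMinimalStemFilterFactory" />
    <!-- KStem: less aggressive than Porter -->
    <filter class="solr.KStemFilterFactory" />
    <!-- Porter: the classic, more aggressive English stemmer -->
    <filter class="solr.PorterStemFilterFactory" />
    <!-- Snowball: covers many languages through the language attribute -->
    <filter class="solr.SnowballPorterFilterFactory" language="German" />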

All you have to do is add the appropriate filter to the fieldType in your schema.xml that you are searching against.

For instance:

    
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.KStemFilterFactory" /> <!-- Stemming -->
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.KStemFilterFactory" /> <!-- Stemming -->
      </analyzer>
    </fieldType>

You will need to reindex your documents so that the stemmed terms are what get stored in the index. You will also need to add the same filter to your query analyzer, as in the example above. Otherwise you could search for “frogs” and not get ANY matches, because “frog” is now what is stored.
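
As a rough sketch of the reindexing step, the update message below re-adds two documents using Solr's XML update format. The `id` and `title` field names and their values are made up for illustration, and `title` is assumed to be a field that uses the `text_general` type shown above.

    <add>
      <doc>
        <!-- Hypothetical documents; the field names are assumptions -->
        <field name="id">doc-1</field>
        <field name="title">The frog in the pond</field>
      </doc>
      <doc>
        <field name="id">doc-2</field>
        <field name="title">Frogs of North America</field>
      </doc>
    </add>

Post this to your update handler and commit. With the stemming filter in both analyzers, a search on `title` for either “frog” or “frogs” should then return both documents.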

Check out the official wiki to help you pick the stemmer that is most appropriate for your content.