Introduction

Elasticsearch is an awesome tool for indexing and searching JSON documents. Early on, we knew that it was going to be a key part of the technology that we’re developing at Citrine. Like many developers, I realized that we needed to include our own customizations, but found little help when getting started. It took me a while to figure things out, so I wrote this post in hopes that everyone reading it can get started a little faster.

All of the code below, as well as instructions for installation and use, is available on GitHub at https://github.com/CitrineInformatics/plugin-plussign.

Getting started

Elasticsearch is built on top of Lucene. In this post, I’m going to focus on creating a custom tokenizer, custom token filter, and custom analyzer, each implemented in Lucene. By the end, we will have built an analyzer that breaks a string into tokens wherever + symbols are found, removes empty tokens, and then converts all remaining tokens to lowercase. For example, this analyzer will take the input string “This+is++some+text” and generate the tokens “this”, “is”, “some”, and “text”.
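
Before getting into Lucene, it may help to see the target behavior in plain Java. This is only a sketch with no Lucene types in it; the class name PlusSignSketch and the method analyze() are made up for illustration and are not part of the plugin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration of the intended behavior; not the Lucene code.
class PlusSignSketch {

    // Split on '+', drop empty pieces, and lowercase what remains.
    static List<String> analyze(String input) {
        List<String> tokens = new ArrayList<>();
        // The -1 limit keeps trailing empty strings, so every empty piece
        // is removed explicitly by the check below.
        for (String piece : input.split("\\+", -1)) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [this, is, some, text]
        System.out.println(analyze("This+is++some+text"));
    }
}
```

The Lucene version built in the rest of this post does the same three steps, but as separate, reusable stages.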

In a later post, I’ll go over methods to integrate this code as a plugin to Elasticsearch.

Building a custom Lucene tokenizer

Tokenizers perform the task of breaking a string into separate tokens. Our custom tokenizer will split a string wherever a + sign is found. Its purpose is to take a string such as “This+is++some+text” and break it into “This”, “is”, “”, “some”, and “text”. Notice that this leaves an empty string where successive + symbols are found. We’ll deal with empty strings using the token filter that we make further on in this post.


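import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PlusSignTokenizer extends Tokenizer {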

    /* Lucene uses attributes to store information about a single token. For
     * this tokenizer, the only attribute that we are going to use is the
     * CharTermAttribute, which can store the text for the token that is
     * generated. Other types of attributes exist (see interfaces and classes
     * derived from org.apache.lucene.util.Attribute); we will use some of
     * these other attributes when we build our custom token filter. It is
     * important that you register attributes, whatever their type, using the
     * addAttribute() function.
     */
    protected CharTermAttribute charTermAttribute =
        addAttribute(CharTermAttribute.class);

    /* This is the important function to override from the Tokenizer class. At
     * each call, it should set the value of this.charTermAttribute to the text
     * of the next token. It returns true if a new token is generated and false
     * if there are no more tokens remaining.
     */
    @Override
    public boolean incrementToken() throws IOException {

        // Clear every attribute left over from the previous token,
        // including the text in this.charTermAttribute
        clearAttributes();

        // Get the position of the next + symbol
        int nextIndex = this.stringToTokenize.indexOf('+', this.position);

        // Execute this block if a plus symbol was found. Save the token
        // and the position to start at when incrementToken() is next
        // called.
        if (nextIndex != -1) {
            String nextToken = this.stringToTokenize.substring(
                this.position, nextIndex);
            this.charTermAttribute.append(nextToken);
            this.position = nextIndex + 1;
            return true;
        }

        // Execute this block if no more + signs are found, but there is
        // still some text remaining in the string. For example, this saves
        // "text" in "This+is++some+text".
        else if (this.position < this.stringToTokenize.length()) {
            String nextToken =
                this.stringToTokenize.substring(this.position);
            this.charTermAttribute.append(nextToken);
            this.position = this.stringToTokenize.length();
            return true;
        }

        // Execute this block if no more tokens exist in the string.
        else {
            return false;
        }
    }

    /* This is the constructor for our custom tokenizer class. It only hands
     * the java.io.Reader object to the Tokenizer constructor. The text is
     * not read here because Lucene reuses tokenizer instances: setReader()
     * supplies new input and reset() is called before tokens are consumed,
     * so anything read in the constructor would go stale after the first
     * use.
     */
    public PlusSignTokenizer(Reader reader) {
        super(reader);
    }

    /* reset() is called before the first call to incrementToken() and again
     * each time the tokenizer is reused. Here we take all information from
     * this.input (the current reader) and store it in a string, then reset
     * the stored position. If you are expecting very large blocks of text,
     * you might want to think about using a buffer and saving chunks from
     * the reader whenever incrementToken() is called.
     */
    @Override
    public void reset() throws IOException {
        super.reset();
        int numChars;
        char[] buffer = new char[1024];
        StringBuilder stringBuilder = new StringBuilder();
        while ((numChars =
            this.input.read(buffer, 0, buffer.length)) != -1) {
            stringBuilder.append(buffer, 0, numChars);
        }
        this.stringToTokenize = stringBuilder.toString();
        this.position = 0;
    }

    /* This object stores the string that we are turning into tokens. We will
     * process its content as we call the incrementToken() function.
     */
    protected String stringToTokenize;

    /* This stores the current position in this.stringToTokenize. We will
     * increment its value as we call the incrementToken() function.
     */
    protected int position = 0;
}
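
To make the scanning logic in incrementToken() easier to follow, here is a plain-Java sketch of the same position-based loop with the Lucene types stripped out. PlusSignScanner, next(), and scanAll() are hypothetical names used only for this illustration; returning null stands in for incrementToken() returning false.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tokenizer's scanning logic; no Lucene types.
class PlusSignScanner {
    private final String text;
    private int position = 0;

    PlusSignScanner(String text) {
        this.text = text;
    }

    // Returns the next token, or null when no tokens remain.
    String next() {
        int nextIndex = text.indexOf('+', position);
        if (nextIndex != -1) {
            // A '+' was found: emit the text before it (possibly empty)
            // and resume just past the '+' on the next call.
            String token = text.substring(position, nextIndex);
            position = nextIndex + 1;
            return token;
        }
        if (position < text.length()) {
            // No more '+' signs, but some text is left over.
            String token = text.substring(position);
            position = text.length();
            return token;
        }
        return null;
    }

    // Convenience helper that drains the scanner into a list.
    static List<String> scanAll(String text) {
        PlusSignScanner scanner = new PlusSignScanner(text);
        List<String> tokens = new ArrayList<>();
        for (String t = scanner.next(); t != null; t = scanner.next()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scanAll("This+is++some+text"));
    }
}
```

For "This+is++some+text" this yields "This", "is", "", "some", and "text". One detail worth noticing: a leading + produces a leading empty token, but a trailing + does not produce a trailing one, because the second branch only fires when text remains.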

Building a custom Lucene token filter

Token filters act on each token that is generated by a tokenizer and apply some set of operations that alter or normalize them. In our example, the custom token filter simply removes empty strings from a token stream. For example, if a stream contains the tokens "This", "is", "", "some", and "text" it would remove the third token while letting all other tokens pass through.


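import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class EmptyStringTokenFilter extends TokenFilter {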

    /* The constructor for our custom token filter just calls the TokenFilter
     * constructor; that constructor saves the token stream in a variable named
     * this.input.
     */
    public EmptyStringTokenFilter(TokenStream tokenStream) {
        super(tokenStream);
    }

    /* Like the PlusSignTokenizer class, we are going to save the text of the
     * current token in a CharTermAttribute object. In addition, we are going
     * to use a PositionIncrementAttribute object to store the position
     * increment of the token. Lucene uses this latter attribute to determine
     * the position of a token. Given a token stream with "This", "is", "",
     * "some", and "text", we are going to ensure that "This" is saved at
     * position 1, "is" at position 2, "some" at position 3, and "text" at
     * position 4. Note that we have completely ignored the empty string at
     * what was position 3 in the original stream.
     */
    protected CharTermAttribute charTermAttribute =
        addAttribute(CharTermAttribute.class);
    protected PositionIncrementAttribute positionIncrementAttribute =
        addAttribute(PositionIncrementAttribute.class);

    /* Like we did in the PlusSignTokenizer class, we need to override the
     * incrementToken() function to save the attributes of the current token.
     * We are going to pass over any tokens that are empty (or contain only
     * whitespace) and save all others after trimming leading and trailing
     * whitespace. This function should return true if a new token was
     * generated and false if the last token was passed.
     */
    @Override
    public boolean incrementToken() throws IOException {

        // Loop over tokens in the token stream to find the next one
        // that is not empty
        String nextToken = null;
        while (nextToken == null) {

            // Reached the end of the token stream being processed
            if ( ! this.input.incrementToken()) {
                return false;
            }

            // Get text of the current token and remove any
            // leading/trailing whitespace.
            String currentTokenInStream =
                this.input.getAttribute(CharTermAttribute.class)
                    .toString().trim();

            // Save the token if it is not an empty string
            if (currentTokenInStream.length() > 0) {
                nextToken = currentTokenInStream;
            }
        }

        // Save the current token
        this.charTermAttribute.setEmpty().append(nextToken);
        this.positionIncrementAttribute.setPositionIncrement(1);
        return true;
    }
}
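
The skipping and position bookkeeping can also be sketched without Lucene. In this illustration, EmptyTokenSkipper and filter() are hypothetical names, and each result is written as "position:token" so the positions described in the comments above are visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java stand-in for the filter's skipping logic.
class EmptyTokenSkipper {

    // Returns the surviving tokens as "position:token" strings.
    static List<String> filter(List<String> upstream) {
        List<String> kept = new ArrayList<>();
        int position = 0;
        for (String token : upstream) {
            String trimmed = token.trim();
            if (trimmed.isEmpty()) {
                // Skipped tokens do not advance the position.
                continue;
            }
            // Every kept token gets a position increment of 1.
            position += 1;
            kept.add(position + ":" + trimmed);
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [1:This, 2:is, 3:some, 4:text]
        System.out.println(
            filter(Arrays.asList("This", "is", "", "some", "text")));
    }
}
```

Because the increment is always 1, the gap left by the empty token disappears, which is the choice this filter makes.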

Building a custom Lucene analyzer

Now that we've built our custom tokenizer and custom token filter, creating a new analyzer is simple. In this class, we're going to use the PlusSignTokenizer tokenizer that we built, then filter those tokens with our EmptyStringTokenFilter, and finally convert all strings to lowercase using the org.apache.lucene.analysis.core.LowerCaseFilter filter that comes with Lucene. Overall, this has the effect that the input string "This+is++some+text" will be converted to the tokens "this", "is", "some", and "text". 


import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;

public class PlusSignAnalyzer extends Analyzer {

    /* This is the only function that we need to override for our analyzer.
     * It takes in a java.io.Reader object and saves the tokenizer and list
     * of token filters that operate on it.
     */
    @Override
    protected TokenStreamComponents createComponents(String field,
            Reader reader) {
        Tokenizer tokenizer = new PlusSignTokenizer(reader);
        TokenStream filter = new EmptyStringTokenFilter(tokenizer);
        filter = new LowerCaseFilter(filter);
        return new TokenStreamComponents(tokenizer, filter);
    }
}

Next steps

If you're using Lucene directly, and not Elasticsearch, the PlusSignAnalyzer class is all that you need. If you are using Elasticsearch, you'll need to register it as a plugin; we'll go over that process in our next post.
