Class Token

  • All Implemented Interfaces:
    java.lang.Appendable, java.lang.CharSequence, java.lang.Cloneable, CharTermAttribute, FlagsAttribute, OffsetAttribute, PayloadAttribute, PositionIncrementAttribute, PositionLengthAttribute, TermToBytesRefAttribute, TypeAttribute, Attribute

    public class Token
    extends CharTermAttributeImpl
    implements TypeAttribute, PositionIncrementAttribute, FlagsAttribute, OffsetAttribute, PayloadAttribute, PositionLengthAttribute
    A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.

    The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser or to show matching text fragments in a KWIC display.
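
    As a rough illustration of re-associating a token with its source text, the sketch below marks a span of the source using only a pair of offsets. It does not use the Lucene API: the class OffsetHighlightSketch and its highlight method are hypothetical, and the sketch assumes that the end offset is exclusive (one past the last character of the term).

```java
// Hypothetical helper, not part of Lucene: shows how a (startOffset, endOffset)
// pair can be mapped back onto the source text of a field.
final class OffsetHighlightSketch {

    /**
     * Wraps source[startOffset, endOffset) in square brackets.
     * Assumption: endOffset is exclusive.
     */
    static String highlight(String source, int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset > source.length() || startOffset > endOffset) {
            throw new IllegalArgumentException("offsets out of range");
        }
        return source.substring(0, startOffset)
            + "[" + source.substring(startOffset, endOffset) + "]"
            + source.substring(endOffset);
    }

    public static void main(String[] args) {
        String source = "The quick brown fox";
        // Offsets 4..9 cover the term "quick" in the source text.
        System.out.println(highlight(source, 4, 9));
    }
}
```

    With a real Token, the two integers would presumably come from startOffset() and endOffset().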

    The type is a string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be implemented with type "eos". The default token type is "word".

    A Token can optionally have metadata (a.k.a. payload) in the form of a variable length byte array. Use DocsAndPositionsEnum.getPayload() to retrieve the payloads from the index.

    NOTE: As of 2.9, Token implements all Attribute interfaces that are part of core Lucene and can be found in the tokenattributes subpackage. Although it is no longer necessary to use Token with the new TokenStream API, it can be used as a convenience class that implements all Attributes, which is especially useful for switching from the old to the new TokenStream API.

    Tokenizers and TokenFilters should try to re-use a Token instance when possible for best performance, by implementing the TokenStream.incrementToken() API. Failing that, to create a new Token you should first use one of the constructors that start with null text. To load the token from a char[], use CharTermAttributeImpl.copyBuffer(char[], int, int). To load from a String, use CharTermAttributeImpl.setEmpty() followed by CharTermAttributeImpl.append(CharSequence) or CharTermAttributeImpl.append(CharSequence, int, int). Alternatively, you can get the Token's termBuffer by calling either CharTermAttributeImpl.buffer(), if you know that your text is shorter than the capacity of the termBuffer, or CharTermAttributeImpl.resizeBuffer(int), if there is any possibility that you may need to grow the buffer. Copy the characters of your term into this buffer, with String.getChars(int, int, char[], int) if loading from a String or with System.arraycopy(Object, int, Object, int, int) if loading from a char[], and finally call CharTermAttributeImpl.setLength(int) to set the length of the term text. See LUCENE-969 for details.
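
    The resize, fill, and set-length sequence described above can be sketched without Lucene on the classpath. The class TermBufferSketch below is a hypothetical stand-in for the term buffer, not the CharTermAttributeImpl implementation; only String.getChars, System.arraycopy, and Arrays.copyOf are real library calls, and the initial capacity of 4 is an arbitrary assumption.

```java
import java.util.Arrays;

// Hypothetical stand-in for a reusable term buffer; the method names mirror
// the ones mentioned in the text, but this is not the Lucene implementation.
final class TermBufferSketch {
    private char[] buffer = new char[4]; // arbitrary small initial capacity
    private int length;

    /** Grows the buffer if it is smaller than newSize and returns it. */
    char[] resizeBuffer(int newSize) {
        if (buffer.length < newSize) {
            buffer = Arrays.copyOf(buffer, newSize);
        }
        return buffer;
    }

    /** Records how many leading characters of the buffer are valid. */
    void setLength(int newLength) {
        if (newLength < 0 || newLength > buffer.length) {
            throw new IllegalArgumentException("length exceeds buffer capacity");
        }
        length = newLength;
    }

    /** Loads the term from a String using String.getChars. */
    void loadFromString(String text) {
        char[] buf = resizeBuffer(text.length());
        text.getChars(0, text.length(), buf, 0);
        setLength(text.length());
    }

    /** Loads the term from part of a char[] using System.arraycopy. */
    void loadFromChars(char[] src, int offset, int len) {
        char[] buf = resizeBuffer(len);
        System.arraycopy(src, offset, buf, 0, len);
        setLength(len);
    }

    /** Returns the valid part of the buffer as a String. */
    String term() {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        TermBufferSketch t = new TermBufferSketch();
        t.loadFromString("tokenizer");                   // forces the buffer to grow
        System.out.println(t.term());
        t.loadFromChars("lucene".toCharArray(), 2, 3);   // reuses the grown buffer
        System.out.println(t.term());
    }
}
```

    The point of the pattern is that the same char[] is reused across terms, so no new String or array is allocated per token once the buffer is large enough.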

    Typical Token reuse patterns:

    • Copying text from a string (type is reset to TypeAttribute.DEFAULT_TYPE if not specified):
          return reusableToken.reinit(string, startOffset, endOffset[, type]);
        
    • Copying some text from a string (type is reset to TypeAttribute.DEFAULT_TYPE if not specified):
          return reusableToken.reinit(string, 0, string.length(), startOffset, endOffset[, type]);
        
    • Copying text from char[] buffer (type is reset to TypeAttribute.DEFAULT_TYPE if not specified):
          return reusableToken.reinit(buffer, 0, buffer.length, startOffset, endOffset[, type]);
        
    • Copying some text from a char[] buffer (type is reset to TypeAttribute.DEFAULT_TYPE if not specified):
          return reusableToken.reinit(buffer, start, end - start, startOffset, endOffset[, type]);
        
    • Copying from one Token to another (type is reset to TypeAttribute.DEFAULT_TYPE if not specified):
          return reusableToken.reinit(source.buffer(), 0, source.length(), source.startOffset(), source.endOffset()[, source.type()]);
        
    A few things to note:
    • clear() initializes all of the fields to default values. This behavior differs from Lucene 2.4, but the change should affect no one.
    • Because TokenStreams can be chained, one cannot assume that the Token's current type is correct.
    • The startOffset and endOffset represent the start and end offsets in the source text, so be careful in adjusting them.
    • When caching a reusable token, clone it. When injecting a cached token into a stream that can be reset, clone it again.
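
    The advice on cloning cached tokens follows from the reuse pattern: a reused instance is overwritten on every step, so a cache that stores the instance itself ends up holding the last value many times. The sketch below demonstrates this with a hypothetical mutable ReusableTerm class whose copy() method stands in for Token's clone(); none of these names come from Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of why a reusable, mutable token must be copied
// before it is cached. ReusableTerm is a stand-in, not the Lucene Token class.
final class CachedTokenSketch {

    static final class ReusableTerm {
        private final StringBuilder text = new StringBuilder();

        /** Overwrites the current term text in place. */
        ReusableTerm set(String s) {
            text.setLength(0);
            text.append(s);
            return this;
        }

        String text() {
            return text.toString();
        }

        /** Independent snapshot of the current state (plays the role of clone()). */
        ReusableTerm copy() {
            return new ReusableTerm().set(text.toString());
        }
    }

    /** Caches each term, either as a snapshot or as the shared reusable instance. */
    static List<String> cacheAll(String[] terms, boolean copyBeforeCaching) {
        ReusableTerm reusable = new ReusableTerm();
        List<ReusableTerm> cache = new ArrayList<>();
        for (String t : terms) {
            reusable.set(t); // the same instance is overwritten for every term
            cache.add(copyBeforeCaching ? reusable.copy() : reusable);
        }
        List<String> seen = new ArrayList<>();
        for (ReusableTerm r : cache) {
            seen.add(r.text());
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "b"};
        System.out.println(cacheAll(terms, true));  // snapshots keep their own text
        System.out.println(cacheAll(terms, false)); // every entry aliases one instance
    }
}
```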

    Please note: With Lucene 3.1, the toString() method had to be changed to match the CharSequence interface introduced by the interface CharTermAttribute. This method now prints only the term text, with no additional information.

    • Field Detail

      • TOKEN_ATTRIBUTE_FACTORY

        public static final AttributeSource.AttributeFactory TOKEN_ATTRIBUTE_FACTORY
        Convenience factory that returns Token as the implementation for the basic attributes and returns the default impl (with "Impl" appended) for all other attributes.
        Since:
        3.0
    • Constructor Detail

      • Token

        public Token()
        Constructs a Token with null text.
      • Token

        public Token​(int start,
                     int end)
        Constructs a Token with null text and start & end offsets.
        Parameters:
        start - start offset in the source text
        end - end offset in the source text
      • Token

        public Token​(int start,
                     int end,
                     java.lang.String typ)
        Constructs a Token with null text and start & end offsets plus the Token type.
        Parameters:
        start - start offset in the source text
        end - end offset in the source text
        typ - the lexical type of this Token
      • Token

        public Token​(int start,
                     int end,
                     int flags)
        Constructs a Token with null text and start & end offsets plus flags. NOTE: flags is EXPERIMENTAL.
        Parameters:
        start - start offset in the source text
        end - end offset in the source text
        flags - The bits to set for this token
      • Token

        public Token​(java.lang.String text,
                     int start,
                     int end)
        Constructs a Token with the given term text, and start & end offsets. The type defaults to "word". NOTE: for better indexing speed you should instead use the char[] termBuffer methods to set the term text.
        Parameters:
        text - term text
        start - start offset in the source text
        end - end offset in the source text
      • Token

        public Token​(java.lang.String text,
                     int start,
                     int end,
                     java.lang.String typ)
        Constructs a Token with the given text, start and end offsets, & type. NOTE: for better indexing speed you should instead use the char[] termBuffer methods to set the term text.
        Parameters:
        text - term text
        start - start offset in the source text
        end - end offset in the source text
        typ - token type
      • Token

        public Token​(java.lang.String text,
                     int start,
                     int end,
                     int flags)
        Constructs a Token with the given text, start and end offsets, & flags. NOTE: for better indexing speed you should instead use the char[] termBuffer methods to set the term text.
        Parameters:
        text - term text
        start - start offset in the source text
        end - end offset in the source text
        flags - token type bits
      • Token

        public Token​(char[] startTermBuffer,
                     int termBufferOffset,
                     int termBufferLength,
                     int start,
                     int end)
        Constructs a Token with the given term buffer (offset & length), and start & end offsets.
        Parameters:
        startTermBuffer - buffer containing term text
        termBufferOffset - the index in the buffer of the first character
        termBufferLength - number of valid characters in the buffer
        start - start offset in the source text
        end - end offset in the source text