Class QueryAutoStopWordAnalyzer

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable

    public final class QueryAutoStopWordAnalyzer
    extends AnalyzerWrapper
    An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.

    For very large indexes, the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38-million-document index in which one term occurred in around 50% of documents, causing a TermQuery for that term to take 2 seconds.
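
    As a rough illustration of the wrapping idea described above, the following plain-Java sketch wraps a delegate tokenizer and drops any token found in a precomputed set of very common words. It does not use Lucene: StopWordWrapperSketch, wrap, and whitespaceTokens are hypothetical names chosen for this example, and the whitespace tokenizer is only an assumed stand-in for a real delegate Analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical stand-in for the wrapping idea; this is not Lucene code.
class StopWordWrapperSketch {

    // Wraps a delegate tokenizer so that tokens contained in stopWords are dropped.
    static Function<String, List<String>> wrap(
            Function<String, List<String>> delegate, Set<String> stopWords) {
        return text -> {
            List<String> kept = new ArrayList<>();
            for (String token : delegate.apply(text)) {
                if (!stopWords.contains(token)) {
                    kept.add(token);
                }
            }
            return kept;
        };
    }

    // Assumed delegate: lower-cases the text and splits it on whitespace.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of"));
        Function<String, List<String>> analyzer =
                wrap(StopWordWrapperSketch::whitespaceTokens, stopWords);
        System.out.println(analyzer.apply("The cost of TermDocs")); // [cost, termdocs]
    }
}
```

    In this sketch the stop set is supplied by hand; the constructors documented below derive it from an IndexReader instead.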

    • Field Detail

      • defaultMaxDocFreqPercent

        public static final float defaultMaxDocFreqPercent
        See Also:
        Constant Field Values
    • Constructor Detail

      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer​(Version matchVersion,
                                         Analyzer delegate,
                                         IndexReader indexReader)
                                  throws java.io.IOException
        Creates a new QueryAutoStopWordAnalyzer whose stopwords are calculated, for all indexed fields, from terms with a document frequency percentage greater than defaultMaxDocFreqPercent.
        Parameters:
        matchVersion - Version to be used in StopFilter
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        Throws:
        java.io.IOException - Can be thrown while reading from the IndexReader
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer​(Version matchVersion,
                                         Analyzer delegate,
                                         IndexReader indexReader,
                                         int maxDocFreq)
                                  throws java.io.IOException
        Creates a new QueryAutoStopWordAnalyzer whose stopwords are calculated, for all indexed fields, from terms with a document frequency greater than the given maxDocFreq.
        Parameters:
        matchVersion - Version to be used in StopFilter
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        maxDocFreq - Document frequency that a term must exceed in order to be treated as a stopword
        Throws:
        java.io.IOException - Can be thrown while reading from the IndexReader
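
        The absolute threshold described above can be sketched in plain Java as follows. This is only an illustration of the "document frequency greater than maxDocFreq" rule, not the Lucene implementation: MaxDocFreqSketch and its stopWords method are hypothetical, and the map of term to document frequency is an assumed stand-in for statistics that would be read from the IndexReader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration of an absolute document-frequency cutoff; not Lucene code.
class MaxDocFreqSketch {

    // Returns the terms whose document frequency is strictly greater than maxDocFreq.
    static SortedSet<String> stopWords(Map<String, Integer> docFreqs, int maxDocFreq) {
        SortedSet<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed statistics: term -> number of documents containing it.
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("the", 900);
        docFreqs.put("index", 400);
        docFreqs.put("lucene", 120);

        // "index" sits exactly on the threshold, so it is not a stopword.
        System.out.println(stopWords(docFreqs, 400)); // [the]
    }
}
```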
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer​(Version matchVersion,
                                         Analyzer delegate,
                                         IndexReader indexReader,
                                         float maxPercentDocs)
                                  throws java.io.IOException
        Creates a new QueryAutoStopWordAnalyzer whose stopwords are calculated, for all indexed fields, from terms with a document frequency percentage greater than the given maxPercentDocs.
        Parameters:
        matchVersion - Version to be used in StopFilter
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        maxPercentDocs - The maximum percentage, expressed as a fraction between 0.0 and 1.0, of index documents that may contain a term; a term found in more documents than this is considered to be a stop word
        Throws:
        java.io.IOException - Can be thrown while reading from the IndexReader
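
        To make the relationship between maxPercentDocs and an absolute document frequency concrete, here is a minimal plain-Java sketch. It assumes the fraction is multiplied by the number of documents and truncated to an int before the "greater than" comparison; that conversion, and the names MaxPercentDocsSketch, toMaxDocFreq, and isStopWord, are assumptions made for this example rather than something this page states.

```java
// Hypothetical illustration of a fractional cutoff; not Lucene code.
class MaxPercentDocsSketch {

    // Assumption: scale the fraction by the document count and truncate toward zero.
    static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        return (int) (numDocs * maxPercentDocs);
    }

    // A term is a stop word when its document frequency exceeds the derived cutoff.
    static boolean isStopWord(int docFreq, int numDocs, float maxPercentDocs) {
        return docFreq > toMaxDocFreq(numDocs, maxPercentDocs);
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        float maxPercentDocs = 0.25f; // 25% of documents

        System.out.println(toMaxDocFreq(numDocs, maxPercentDocs));    // 250
        System.out.println(isStopWord(500, numDocs, maxPercentDocs)); // true
        System.out.println(isStopWord(250, numDocs, maxPercentDocs)); // false
    }
}
```

        Under this assumption, a term that appears in half of the documents, like the one in the class description, would be filtered by any maxPercentDocs below 0.5.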
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer​(Version matchVersion,
                                         Analyzer delegate,
                                         IndexReader indexReader,
                                         java.util.Collection<java.lang.String> fields,
                                         float maxPercentDocs)
                                  throws java.io.IOException
        Creates a new QueryAutoStopWordAnalyzer whose stopwords are calculated, for the given selection of fields, from terms with a document frequency percentage greater than the given maxPercentDocs.
        Parameters:
        matchVersion - Version to be used in StopFilter
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        fields - Selection of fields to calculate stopwords for
        maxPercentDocs - The maximum percentage, expressed as a fraction between 0.0 and 1.0, of index documents that may contain a term; a term found in more documents than this is considered to be a stop word
        Throws:
        java.io.IOException - Can be thrown while reading from the IndexReader
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer​(Version matchVersion,
                                         Analyzer delegate,
                                         IndexReader indexReader,
                                         java.util.Collection<java.lang.String> fields,
                                         int maxDocFreq)
                                  throws java.io.IOException
        Creates a new QueryAutoStopWordAnalyzer whose stopwords are calculated, for the given selection of fields, from terms with a document frequency greater than the given maxDocFreq.
        Parameters:
        matchVersion - Version to be used in StopFilter
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        fields - Selection of fields to calculate stopwords for
        maxDocFreq - Document frequency that a term must exceed in order to be treated as a stopword
        Throws:
        java.io.IOException - Can be thrown while reading from the IndexReader
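
        The field-restricted constructors compute stopwords separately for each selected field. A minimal plain-Java sketch of that idea follows; PerFieldStopWordsSketch, stopWordsPerField, and the nested map of field to term to document frequency are hypothetical stand-ins for index statistics, and skipping fields that are absent from the statistics is a choice made for this sketch only.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration of per-field stopword selection; not Lucene code.
class PerFieldStopWordsSketch {

    // fieldDocFreqs maps field name -> (term text -> document frequency).
    static Map<String, SortedSet<String>> stopWordsPerField(
            Map<String, Map<String, Integer>> fieldDocFreqs,
            Collection<String> fields,
            int maxDocFreq) {
        Map<String, SortedSet<String>> result = new HashMap<>();
        for (String field : fields) {
            Map<String, Integer> docFreqs = fieldDocFreqs.get(field);
            if (docFreqs == null) {
                continue; // sketch-only choice: ignore fields with no statistics
            }
            SortedSet<String> stopWords = new TreeSet<>();
            for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
                if (entry.getValue() > maxDocFreq) {
                    stopWords.add(entry.getKey());
                }
            }
            result.put(field, stopWords);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> body = new HashMap<>();
        body.put("the", 900);
        body.put("and", 850);
        body.put("lucene", 100);
        Map<String, Integer> title = new HashMap<>();
        title.put("the", 300);
        title.put("guide", 20);

        Map<String, Map<String, Integer>> stats = new HashMap<>();
        stats.put("body", body);
        stats.put("title", title);

        // Only "body" is selected, so "title" is never examined.
        System.out.println(stopWordsPerField(stats, Arrays.asList("body"), 500)); // {body=[and, the]}
    }
}
```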
    • Method Detail

      • getStopWords

        public java.lang.String[] getStopWords​(java.lang.String fieldName)
        Provides information on which stop words have been identified for a field.
        Parameters:
        fieldName - The field for which the identified stop words will be returned
        Returns:
        the stop words identified for a field
      • getStopWords

        public Term[] getStopWords()
        Provides information on which stop words have been identified for all fields.
        Returns:
        the stop words (as terms)
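
        The two getStopWords views can be pictured with the plain-Java sketch below: one returns the words for a single field, the other flattens every field into field-qualified entries. StopWordViewsSketch is hypothetical; "field:text" strings stand in for Term objects, and returning an empty array for an unknown field is an assumption of this sketch, not something this page specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of the two accessor views; not Lucene code.
class StopWordViewsSketch {

    // Field name -> stop words identified for that field (sorted for stable output).
    private final Map<String, Set<String>> stopWordsPerField = new TreeMap<>();

    void add(String field, String word) {
        stopWordsPerField.computeIfAbsent(field, f -> new TreeSet<>()).add(word);
    }

    // Per-field view. Assumption: an unknown field yields an empty array, not null.
    String[] getStopWords(String fieldName) {
        Set<String> words = stopWordsPerField.get(fieldName);
        return words == null ? new String[0] : words.toArray(new String[0]);
    }

    // All-fields view: each entry is "field:text", standing in for a Term.
    List<String> getStopWords() {
        List<String> all = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : stopWordsPerField.entrySet()) {
            for (String word : entry.getValue()) {
                all.add(entry.getKey() + ":" + word);
            }
        }
        return all;
    }

    public static void main(String[] args) {
        StopWordViewsSketch sketch = new StopWordViewsSketch();
        sketch.add("body", "the");
        sketch.add("body", "and");
        sketch.add("title", "the");

        System.out.println(String.join(",", sketch.getStopWords("body"))); // and,the
        System.out.println(sketch.getStopWords()); // [body:and, body:the, title:the]
    }
}
```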