Class WordlistLoader


  • public class WordlistLoader
    extends java.lang.Object
    Loader for text files that represent a list of stopwords.
    See Also:
    IOUtils to obtain Reader instances
    • Method Summary

      Modifier and Type Method Description
      static java.util.List<java.lang.String> getLines(java.io.InputStream stream, java.nio.charset.Charset charset)
      Reads the given stream using the given character encoding and returns its non-comment lines containing data.
      static CharArraySet getSnowballWordSet(java.io.Reader reader, CharArraySet result)
      Reads stopwords from a stopword list in Snowball format.
      static CharArraySet getSnowballWordSet(java.io.Reader reader, Version matchVersion)
      Reads stopwords from a stopword list in Snowball format.
      static CharArrayMap<java.lang.String> getStemDict(java.io.Reader reader, CharArrayMap<java.lang.String> result)
      Reads a stem dictionary.
      static CharArraySet getWordSet(java.io.Reader reader, java.lang.String comment, CharArraySet result)
      Reads lines from a Reader and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace).
      static CharArraySet getWordSet(java.io.Reader reader, java.lang.String comment, Version matchVersion)
      Reads lines from a Reader and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace).
      static CharArraySet getWordSet(java.io.Reader reader, CharArraySet result)
      Reads lines from a Reader and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace).
      static CharArraySet getWordSet(java.io.Reader reader, Version matchVersion)
      Reads lines from a Reader and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace).
      • Methods inherited from class java.lang.Object

        equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • getWordSet

        public static CharArraySet getWordSet(java.io.Reader reader,
                                              CharArraySet result)
                                       throws java.io.IOException
        Reads lines from a Reader and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
        Parameters:
        reader - Reader containing the wordlist
        result - the CharArraySet to fill with the reader's words
        Returns:
        the given CharArraySet with the reader's words
        Throws:
        java.io.IOException
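
To make the "one word per line, trimmed" behaviour concrete, here is a minimal JDK-only sketch. It is not Lucene's implementation: `WordlistSketch` and `readWords` are hypothetical names, and a `java.util.Set<String>` stands in for `CharArraySet`. Skipping blank lines and lowercasing each entry are assumptions of this sketch; the description above asks for input that is already lowercase rather than promising to normalize it.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for illustration only; not part of Lucene.
final class WordlistSketch {

    // Adds one trimmed, lowercased entry per line of the reader to 'result'
    // and returns that same set, mirroring the "fill and return" contract.
    static Set<String> readWords(Reader reader, Set<String> result) {
        new BufferedReader(reader).lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())             // assumption: blank lines are skipped
                .map(line -> line.toLowerCase(Locale.ROOT))  // assumption: normalize case up front
                .forEach(result::add);
        return result;
    }

    public static void main(String[] args) {
        Set<String> words = readWords(new StringReader("  The \nand\n\nOF\n"), new LinkedHashSet<>());
        System.out.println(words);
    }
}
```

Lowercasing with `Locale.ROOT` keeps the result independent of the JVM's default locale, which matters for languages such as Turkish where default case mapping differs.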
      • getWordSet

        public static CharArraySet getWordSet(java.io.Reader reader,
                                              Version matchVersion)
                                       throws java.io.IOException
        Reads lines from a Reader and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
        Parameters:
        reader - Reader containing the wordlist
        matchVersion - the Lucene Version
        Returns:
        A CharArraySet with the reader's words
        Throws:
        java.io.IOException
      • getWordSet

        public static CharArraySet getWordSet(java.io.Reader reader,
                                              java.lang.String comment,
                                              Version matchVersion)
                                       throws java.io.IOException
        Reads lines from a Reader and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
        Parameters:
        reader - Reader containing the wordlist
        comment - The string representing a comment.
        matchVersion - the Lucene Version
        Returns:
        A CharArraySet with the reader's words
        Throws:
        java.io.IOException
      • getWordSet

        public static CharArraySet getWordSet(java.io.Reader reader,
                                              java.lang.String comment,
                                              CharArraySet result)
                                       throws java.io.IOException
        Reads lines from a Reader and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
        Parameters:
        reader - Reader containing the wordlist
        comment - The string representing a comment.
        result - the CharArraySet to fill with the reader's words
        Returns:
        the given CharArraySet with the reader's words
        Throws:
        java.io.IOException
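
The comment-aware variants can be pictured with the following JDK-only sketch. It is an illustration under stated assumptions, not Lucene's code: `CommentedWordlistSketch` and `readWords` are hypothetical names, a `java.util.Set<String>` stands in for `CharArraySet`, and a line is treated as a comment only when it begins with the comment string at its very first character.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for illustration only; not part of Lucene.
final class CommentedWordlistSketch {

    // Skips lines that start with 'comment', trims the rest, and adds them to 'result'.
    static Set<String> readWords(Reader reader, String comment, Set<String> result) {
        new BufferedReader(reader).lines()
                // assumption: the marker must be the first thing on the line;
                // a marker preceded by whitespace is not treated as a comment here
                .filter(line -> !line.startsWith(comment))
                .map(String::trim)
                .filter(line -> !line.isEmpty())   // assumption: blank lines are skipped
                .forEach(result::add);
        return result;
    }

    public static void main(String[] args) {
        String list = "# English stopwords\nfoo\n  bar  \n#baz\n";
        System.out.println(readWords(new StringReader(list), "#", new LinkedHashSet<>()));
    }
}
```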
      • getSnowballWordSet

        public static CharArraySet getSnowballWordSet(java.io.Reader reader,
                                                      CharArraySet result)
                                               throws java.io.IOException
        Reads stopwords from a stopword list in Snowball format.

        The Snowball format is as follows:

        • Lines may contain multiple words separated by whitespace.
        • The comment character is the vertical line (|).
        • Lines may contain trailing comments.

        Parameters:
        reader - Reader containing a Snowball stopword list
        result - the CharArraySet to fill with the reader's words
        Returns:
        the given CharArraySet with the reader's words
        Throws:
        java.io.IOException
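
The three format rules listed above can be sketched in plain Java as follows. This is a JDK-only illustration, not Lucene's implementation: `SnowballListSketch` and `readWords` are hypothetical names, and a `java.util.Set<String>` stands in for `CharArraySet`.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for illustration only; not part of Lucene.
final class SnowballListSketch {

    // Parses Snowball-style stopword lines into 'result' and returns it.
    static Set<String> readWords(Reader reader, Set<String> result) {
        new BufferedReader(reader).lines().forEach(line -> {
            // Everything from the first '|' onward is a comment (whole-line or trailing).
            int bar = line.indexOf('|');
            String data = (bar >= 0) ? line.substring(0, bar) : line;
            // A line may carry several words separated by whitespace.
            for (String word : data.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    result.add(word);
                }
            }
        });
        return result;
    }

    public static void main(String[] args) {
        String list = " | A Snowball list\ni me my | pronouns\n\nthe\n";
        System.out.println(readWords(new StringReader(list), new LinkedHashSet<>()));
    }
}
```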
      • getSnowballWordSet

        public static CharArraySet getSnowballWordSet(java.io.Reader reader,
                                                      Version matchVersion)
                                               throws java.io.IOException
        Reads stopwords from a stopword list in Snowball format.

        The Snowball format is as follows:

        • Lines may contain multiple words separated by whitespace.
        • The comment character is the vertical line (|).
        • Lines may contain trailing comments.

        Parameters:
        reader - Reader containing a Snowball stopword list
        matchVersion - the Lucene Version
        Returns:
        A CharArraySet with the reader's words
        Throws:
        java.io.IOException
      • getStemDict

        public static CharArrayMap<java.lang.String> getStemDict(java.io.Reader reader,
                                                                 CharArrayMap<java.lang.String> result)
                                                          throws java.io.IOException
        Reads a stem dictionary. Each line contains:
        word\tstem
        (i.e. two tab separated words)
        Returns:
        stem dictionary that overrules the stemming algorithm
        Throws:
        java.io.IOException - If there is a low-level I/O error.
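
The `word\tstem` line format can be illustrated with a small JDK-only sketch. It is not Lucene's implementation: `StemDictSketch` and `readStemDict` are hypothetical names, a `java.util.Map<String, String>` stands in for `CharArrayMap<String>`, and ignoring lines that contain no tab is an assumption of this sketch.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for illustration only; not part of Lucene.
final class StemDictSketch {

    // Reads "word<TAB>stem" lines into 'result' and returns it.
    static Map<String, String> readStemDict(Reader reader, Map<String, String> result) {
        new BufferedReader(reader).lines().forEach(line -> {
            // Split on the first tab only: left side is the word, right side the stem.
            String[] wordAndStem = line.split("\t", 2);
            if (wordAndStem.length == 2) {   // assumption: lines without a tab are ignored
                result.put(wordAndStem[0], wordAndStem[1]);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        String dict = "running\trun\nmice\tmouse\n";
        System.out.println(readStemDict(new StringReader(dict), new LinkedHashMap<>()));
    }
}
```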
      • getLines

        public static java.util.List<java.lang.String> getLines(java.io.InputStream stream,
                                                                java.nio.charset.Charset charset)
                                                         throws java.io.IOException
        Reads the given stream using the given character encoding and returns its non-comment lines containing data.

        A comment line is any line that starts with the character "#".

        Returns:
        a list of non-blank non-comment lines with whitespace trimmed
        Throws:
        java.io.IOException - If there is a low-level I/O error.
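
The filtering described for `getLines` (decode with a charset, drop `#` comment lines, trim, drop blank lines) can be sketched with the JDK alone. This is an illustration, not Lucene's implementation; `LinesSketch` and `readLines` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for illustration only; not part of Lucene.
final class LinesSketch {

    // Decodes the stream with 'charset' and returns trimmed, non-blank, non-comment lines.
    static List<String> readLines(InputStream stream, Charset charset) {
        return new BufferedReader(new InputStreamReader(stream, charset)).lines()
                .filter(line -> !line.startsWith("#"))   // comment lines start with '#'
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        byte[] data = "# comment\n  caf\u00e9  \n\nth\u00e9\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readLines(new ByteArrayInputStream(data), StandardCharsets.UTF_8));
    }
}
```

Passing the charset explicitly, as the documented signature does, avoids depending on the platform default encoding when the list contains non-ASCII words.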