Class StringUtil


  • public class StringUtil
    extends java.lang.Object
    • Constructor Summary

      Constructors 
      Constructor Description
      StringUtil()  
    • Method Summary

      Modifier and Type Method Description
      static void computeShortestEditScript​(java.lang.String wordForm, java.lang.String lemma, int[][] distance, java.lang.StringBuffer permutations)
      Computes the Shortest Edit Script (SES) to convert a word into its lemma.
      static java.lang.String decodeShortestEditScript​(java.lang.String wordForm, java.lang.String permutations)
      Reads the SES predicted by the lemmatizer model and applies its permutations to the wordForm to obtain the lemma.
      static java.lang.String getShortestEditScript​(java.lang.String wordForm, java.lang.String lemma)
      Gets the SES required to transform a word into its lemma.
      static boolean isEmpty​(java.lang.CharSequence theString)
      Returns true if the CharSequence is null or its length() is 0.
      static boolean isWhitespace​(char charCode)
      Determines if the specified character is a whitespace.
      static boolean isWhitespace​(int charCode)
      Determines if the specified character is a whitespace.
      static int[][] levenshteinDistance​(java.lang.String wordForm, java.lang.String lemma)
      Computes the Levenshtein distance of two strings in a matrix.
      static java.lang.String toLowerCase​(java.lang.CharSequence string)
      Converts the given sequence to lower case independently of the current locale, using Character.toLowerCase(int), which relies on mapping information from the UnicodeData file.
      static java.lang.String toUpperCase​(java.lang.CharSequence string)
      Converts the given sequence to upper case independently of the current locale, using Character.toUpperCase(char), which relies on mapping information from the UnicodeData file.
      • Methods inherited from class java.lang.Object

        equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • StringUtil

        public StringUtil()
    • Method Detail

      • isWhitespace

        public static boolean isWhitespace​(char charCode)
        Determines if the specified character is a whitespace. A character is considered whitespace when one of the following conditions is met:
        • It is whitespace according to Character.isWhitespace(int).
        • It is part of the Unicode Zs category (Character.SPACE_SEPARATOR).
        Character.isWhitespace(int) does not include no-break spaces. In OpenNLP, no-break spaces are also considered whitespace.
        Parameters:
        charCode - the character to check
        Returns:
        true if the character is whitespace, otherwise false
      • isWhitespace

        public static boolean isWhitespace​(int charCode)
        Determines if the specified character is a whitespace. A character is considered whitespace when one of the following conditions is met:
        • It is whitespace according to Character.isWhitespace(int).
        • It is part of the Unicode Zs category (Character.SPACE_SEPARATOR).
        Character.isWhitespace(int) does not include no-break spaces. In OpenNLP, no-break spaces are also considered whitespace.
        Parameters:
        charCode - the Unicode code point to check
        Returns:
        true if the code point is whitespace, otherwise false
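        The two conditions listed for isWhitespace can be illustrated with a minimal sketch. The class name WhitespaceSketch is hypothetical, and the method is a re-implementation written from the description above, not the OpenNLP source.

```java
// Hypothetical sketch of the rule described above; not the OpenNLP implementation.
class WhitespaceSketch {

    // True if the code point is Java whitespace OR belongs to Unicode category Zs.
    static boolean isWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint)
                || Character.getType(codePoint) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        int noBreakSpace = 0x00A0; // U+00A0 NO-BREAK SPACE, category Zs
        System.out.println(Character.isWhitespace(noBreakSpace)); // prints false
        System.out.println(isWhitespace(noBreakSpace));           // prints true
    }
}
```

        The second condition is what makes the no-break space count as whitespace, since Character.isWhitespace(int) alone rejects it.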
      • toLowerCase

        public static java.lang.String toLowerCase​(java.lang.CharSequence string)
        Converts the given sequence to lower case independently of the current locale, using Character.toLowerCase(int), which relies on mapping information from the UnicodeData file.
        Parameters:
        string - the character sequence to convert
        Returns:
        the lower-cased String
      • toUpperCase

        public static java.lang.String toUpperCase​(java.lang.CharSequence string)
        Converts the given sequence to upper case independently of the current locale, using Character.toUpperCase(char), which relies on mapping information from the UnicodeData file.
        Parameters:
        string - the character sequence to convert
        Returns:
        the upper-cased String
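        As a rough illustration of locale-independent case conversion, the sketch below maps each char with the Character methods instead of calling String.toLowerCase(Locale). The class CaseSketch is hypothetical, and the per-char loop is an assumption: it ignores supplementary code points and may differ from what OpenNLP does internally.

```java
import java.util.Locale;

// Hypothetical sketch of per-character, locale-independent case mapping.
class CaseSketch {

    // Lower-cases each char via Character.toLowerCase(char); no locale is consulted.
    static String toLowerCase(CharSequence s) {
        char[] out = new char[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = Character.toLowerCase(s.charAt(i));
        }
        return new String(out);
    }

    // Upper-cases each char via Character.toUpperCase(char); no locale is consulted.
    static String toUpperCase(CharSequence s) {
        char[] out = new char[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = Character.toUpperCase(s.charAt(i));
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(toLowerCase("TITLE")); // prints title
        // Locale-sensitive conversion, shown for comparison only.
        System.out.println("TITLE".toLowerCase(Locale.forLanguageTag("tr")));
    }
}
```

        Under a Turkish locale, the locale-sensitive String method maps "I" to a dotless "ı", which is the kind of variation a locale-independent conversion avoids.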
      • isEmpty

        public static boolean isEmpty​(java.lang.CharSequence theString)
        Returns true if the CharSequence is null or its length() is 0.
        Returns:
        true if CharSequence.length() is 0, otherwise false
        Since:
        1.5.1
      • levenshteinDistance

        public static int[][] levenshteinDistance​(java.lang.String wordForm,
                                                  java.lang.String lemma)
        Computes the Levenshtein distance of two strings in a matrix. The implementation is based on the pseudo-code at https://en.wikipedia.org/wiki/Levenshtein_distance#Computing_Levenshtein_distance, which in turn is based on Wagner, Robert A.; Fischer, Michael J. (1974), "The String-to-String Correction Problem", Journal of the ACM 21 (1): 168-173.
        Parameters:
        wordForm - the form
        lemma - the lemma
        Returns:
        the Levenshtein distance matrix
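        The matrix computation referenced above (the Wagner-Fischer dynamic program) can be sketched as follows. The class LevenshteinSketch and the method distanceMatrix are hypothetical names. This is a textbook version, and the layout of the matrix that OpenNLP returns is not confirmed here.

```java
// Hypothetical textbook sketch of the Wagner-Fischer algorithm.
class LevenshteinSketch {

    // d[i][j] holds the edit distance between the first i chars of source
    // and the first j chars of target.
    static int[][] distanceMatrix(String source, String target) {
        int[][] d = new int[source.length() + 1][target.length() + 1];
        for (int i = 0; i <= source.length(); i++) {
            d[i][0] = i; // delete all i characters
        }
        for (int j = 0; j <= target.length(); j++) {
            d[0][j] = j; // insert all j characters
        }
        for (int i = 1; i <= source.length(); i++) {
            for (int j = 1; j <= target.length(); j++) {
                int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int substitution = d[i - 1][j - 1] + cost;
                d[i][j] = Math.min(Math.min(deletion, insertion), substitution);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        int[][] d = distanceMatrix("kitten", "sitting");
        System.out.println(d[6][7]); // prints 3
    }
}
```

        In this sketch the bottom-right cell holds the overall distance, and the rest of the matrix is what a later backtracking step would consume.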
      • computeShortestEditScript

        public static void computeShortestEditScript​(java.lang.String wordForm,
                                                     java.lang.String lemma,
                                                     int[][] distance,
                                                     java.lang.StringBuffer permutations)
        Computes the Shortest Edit Script (SES) to convert a word into its lemma. This is based on Chrupala's PhD thesis (2008).
        Parameters:
        wordForm - the token
        lemma - the target lemma
        distance - the Levenshtein distance matrix
        permutations - the buffer that receives the computed edit script
      • decodeShortestEditScript

        public static java.lang.String decodeShortestEditScript​(java.lang.String wordForm,
                                                                java.lang.String permutations)
        Reads the SES predicted by the lemmatizer model and applies its permutations to the wordForm to obtain the lemma.
        Parameters:
        wordForm - the wordForm
        permutations - the permutations predicted by the lemmatizer model
        Returns:
        the lemma
      • getShortestEditScript

        public static java.lang.String getShortestEditScript​(java.lang.String wordForm,
                                                             java.lang.String lemma)
        Gets the SES required to transform a word into its lemma.
        Parameters:
        wordForm - the word
        lemma - the lemma
        Returns:
        the shortest edit script
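        The general idea behind the three SES methods (derive an edit script from the distance matrix, then replay it on the word form) can be illustrated with a generic sketch. Everything here is an assumption made for illustration: the class EditScriptSketch, its methods, and the operation codes K (keep), S (substitute), D (delete) and I (insert) are hypothetical and do not reproduce the script encoding that OpenNLP actually emits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: generic edit script from a Levenshtein matrix.
// The op codes are invented for this example, not OpenNLP's SES format.
class EditScriptSketch {

    static int[][] distanceMatrix(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d;
    }

    // Backtracks from the bottom-right cell and returns ops in left-to-right order.
    static List<String> editScript(String source, String target) {
        int[][] d = distanceMatrix(source, target);
        List<String> ops = new ArrayList<>();
        int i = source.length();
        int j = target.length();
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && source.charAt(i - 1) == target.charAt(j - 1)
                    && d[i][j] == d[i - 1][j - 1]) {
                ops.add("K"); // keep the matching character
                i--;
                j--;
            } else if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + 1) {
                ops.add("S" + target.charAt(j - 1)); // substitute
                i--;
                j--;
            } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
                ops.add("D"); // delete a source character
                i--;
            } else {
                ops.add("I" + target.charAt(j - 1)); // insert a target character
                j--;
            }
        }
        Collections.reverse(ops);
        return ops;
    }

    // Replays the ops on the source string to rebuild the target.
    static String apply(String source, List<String> ops) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (String op : ops) {
            switch (op.charAt(0)) {
                case 'K': out.append(source.charAt(pos++)); break;
                case 'S': out.append(op.charAt(1)); pos++; break;
                case 'D': pos++; break;
                default: out.append(op.charAt(1)); break; // 'I'
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> ops = editScript("walked", "walk");
        System.out.println(ops);                  // prints [K, K, K, K, D, D]
        System.out.println(apply("walked", ops)); // prints walk
    }
}
```

        In this sketch the script depends on the source word only through positions, which is roughly why a lemmatizer can treat an edit script as a class label and predict it for unseen word forms.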