Class ExternalParser

  • All Implemented Interfaces:
    java.io.Serializable, Parser
    Direct Known Subclasses:
    TensorflowImageRecParser

    public class ExternalParser
    extends AbstractParser
    Parser that uses an external program (like catdoc or pdf2txt) to extract text content and metadata from a given document.
    See Also:
    Serialized Form
    • Field Detail

      • INPUT_FILE_TOKEN

        public static final java.lang.String INPUT_FILE_TOKEN
        Token that, if present in the command string, is replaced with the input filename. Alternatively, the input data can be streamed over STDIN.
        See Also:
        Constant Field Values
      • OUTPUT_FILE_TOKEN

        public static final java.lang.String OUTPUT_FILE_TOKEN
        Token that, if present in the command string, is replaced with the output filename. Alternatively, the output data can be collected from STDOUT.
        See Also:
        Constant Field Values
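        The token mechanism described for the two fields above can be pictured with a small, self-contained sketch. It is not Tika's code: the class CommandTokenSketch, its methods, and the literal token values "${INPUT}" and "${OUTPUT}" are assumptions made for illustration (the actual values are listed under Constant Field Values).

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Tika: illustrates token substitution in a command array. */
final class CommandTokenSketch {
    // Assumed token values for this sketch; check "Constant Field Values" for the real ones.
    static final String INPUT_FILE_TOKEN = "${INPUT}";
    static final String OUTPUT_FILE_TOKEN = "${OUTPUT}";

    /** Returns a copy of the command with each token replaced by the matching file path. */
    static String[] substitute(String[] command, String inputPath, String outputPath) {
        String[] resolved = new String[command.length];
        for (int i = 0; i < command.length; i++) {
            // String.replace performs literal (non-regex) replacement.
            resolved[i] = command[i]
                    .replace(INPUT_FILE_TOKEN, inputPath)
                    .replace(OUTPUT_FILE_TOKEN, outputPath);
        }
        return resolved;
    }

    /** True if any argument contains the token, i.e. data would travel via a file, not a pipe. */
    static boolean usesToken(String[] command, String token) {
        return Arrays.stream(command).anyMatch(arg -> arg.contains(token));
    }

    public static void main(String[] args) {
        String[] cmd = {"pdf2txt", INPUT_FILE_TOKEN, "-o", OUTPUT_FILE_TOKEN};
        System.out.println(Arrays.toString(substitute(cmd, "/tmp/in.pdf", "/tmp/out.txt")));
    }
}
```

        In this sketch, a command without the input token would receive the document on STDIN, and a command without the output token would have its STDOUT collected instead.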
    • Constructor Detail

      • ExternalParser

        public ExternalParser()
    • Method Detail

      • getSupportedTypes

        public java.util.Set<MediaType> getSupportedTypes(ParseContext context)
        Description copied from interface: Parser
        Returns the set of media types supported by this parser when used with the given parse context.
        Parameters:
        context - parse context
        Returns:
        immutable set of media types
      • getSupportedTypes

        public java.util.Set<MediaType> getSupportedTypes()
      • setSupportedTypes

        public void setSupportedTypes(java.util.Set<MediaType> supportedTypes)
      • getCommand

        public java.lang.String[] getCommand()
      • setCommand

        public void setCommand(java.lang.String... command)
        Sets the command to be run. The command may include INPUT_FILE_TOKEN and/or OUTPUT_FILE_TOKEN if the external program needs filenames.
        See Also:
        Runtime.exec(String[])
      • setIgnoredLineConsumer

        public void setIgnoredLineConsumer(ExternalParser.LineConsumer ignoredLineConsumer)
        Sets a consumer for the lines ignored by the parse functions.
        Parameters:
        ignoredLineConsumer - consumer instance
      • getMetadataExtractionPatterns

        public java.util.Map<java.util.regex.Pattern,java.lang.String> getMetadataExtractionPatterns()
      • setMetadataExtractionPatterns

        public void setMetadataExtractionPatterns(java.util.Map<java.util.regex.Pattern,java.lang.String> patterns)
        Sets the map from regular expression patterns to Metadata keys. When a pattern matches, the metadata entry for its key is set. Set this to null to disable metadata extraction.
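        A minimal sketch of how such a pattern map could drive extraction, using only the JDK. It is not Tika's implementation: MetadataPatternSketch is a hypothetical class, a plain Map<String,String> stands in for Metadata, java.util.function.Consumer stands in for ExternalParser.LineConsumer, and taking capture group 1 as the value is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in, not Tika code: maps regex matches on output lines to metadata keys. */
final class MetadataPatternSketch {
    /**
     * Tests each line against every pattern. On a match, stores capture group 1 under the
     * pattern's key (an assumption about which part of the match becomes the value).
     * Lines that match no pattern are handed to the ignored-line consumer.
     */
    static Map<String, String> extract(List<String> lines,
                                       Map<Pattern, String> patterns,
                                       Consumer<String> ignoredLineConsumer) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (patterns == null) {
            return metadata; // a null map disables extraction in this sketch
        }
        for (String line : lines) {
            boolean consumed = false;
            for (Map.Entry<Pattern, String> entry : patterns.entrySet()) {
                Matcher m = entry.getKey().matcher(line);
                if (m.find() && m.groupCount() >= 1) {
                    metadata.put(entry.getValue(), m.group(1));
                    consumed = true;
                }
            }
            if (!consumed) {
                ignoredLineConsumer.accept(line);
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<Pattern, String> patterns = new LinkedHashMap<>();
        patterns.put(Pattern.compile("^Title:\\s*(.+)$"), "title");
        List<String> ignored = new ArrayList<>();
        Map<String, String> md =
                extract(List.of("Title: Annual report", "progress 100%"), patterns, ignored::add);
        System.out.println(md + " ignored=" + ignored);
    }
}
```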
      • parse

        public void parse(java.io.InputStream stream,
                          org.xml.sax.ContentHandler handler,
                          Metadata metadata,
                          ParseContext context)
                   throws java.io.IOException,
                          org.xml.sax.SAXException,
                          TikaException
        Executes the configured external command on the given document stream and passes the extracted text as a simple XHTML document to the given SAX content handler. Metadata is extracted only if patterns have been set with setMetadataExtractionPatterns(Map).
        Parameters:
        stream - the document stream (input)
        handler - handler for the XHTML SAX events (output)
        metadata - document metadata (input and output)
        context - parse context
        Throws:
        java.io.IOException - if the document stream could not be read
        org.xml.sax.SAXException - if the SAX events could not be processed
        TikaException - if the document could not be parsed
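        To make "a simple XHTML document" concrete, the following sketch emits a piece of text as a minimal sequence of XHTML SAX events using only the JDK's org.xml.sax types. It is an illustration under stated assumptions, not Tika's implementation: XhtmlEventSketch and its emit method are hypothetical, and the html/body/p element layout is assumed.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch, not Tika code: emits plain text as a minimal XHTML SAX event stream. */
final class XhtmlEventSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Sends html, body and p elements wrapping the text to the handler. */
    static void emit(String text, ContentHandler handler) throws SAXException {
        AttributesImpl noAttrs = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML_NS, "html", "html", noAttrs);
        handler.startElement(XHTML_NS, "body", "body", noAttrs);
        handler.startElement(XHTML_NS, "p", "p", noAttrs);
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML_NS, "p", "p");
        handler.endElement(XHTML_NS, "body", "body");
        handler.endElement(XHTML_NS, "html", "html");
        handler.endDocument();
    }

    public static void main(String[] args) throws SAXException {
        StringBuilder seen = new StringBuilder();
        // A handler that only collects character data, as a text-extracting caller might.
        emit("text produced by the external command", new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.append(ch, start, length);
            }
        });
        System.out.println(seen);
    }
}
```

        A caller of parse supplies its own ContentHandler in the same way and receives the events produced from the external command's output.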
      • check

        public static boolean check(java.lang.String checkCmd,
                                    int... errorValue)
        Checks whether the command can be run. Typically used with something like "myapp --version" to check whether "myapp" is installed and on the path.
        Parameters:
        checkCmd - The check command to run
        errorValue - exit values that are considered errors
      • check

        public static boolean check(java.lang.String[] checkCmd,
                                    int... errorValue)
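
        The idea behind both check overloads can be sketched with java.lang.ProcessBuilder. This is a hedged illustration rather than Tika's implementation: CommandCheckSketch is a hypothetical class, and the default error value of 127, the 60-second timeout, and whitespace splitting of the string form are all assumptions.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch, not Tika's implementation: probes whether a command can be started. */
final class CommandCheckSketch {
    /**
     * Runs the command and returns false if it cannot be started, does not finish
     * within the timeout, or exits with one of the given error values.
     */
    static boolean check(String[] checkCmd, int... errorValue) {
        if (errorValue.length == 0) {
            errorValue = new int[] {127}; // assumed default: a shell's "command not found" status
        }
        Process process = null;
        try {
            process = new ProcessBuilder(checkCmd)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(60, TimeUnit.SECONDS)) {
                return false; // still running after the assumed timeout
            }
            int exit = process.exitValue();
            for (int err : errorValue) {
                if (exit == err) {
                    return false;
                }
            }
            return true;
        } catch (IOException e) {
            return false; // typically means the executable was not found
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (process != null) {
                process.destroy();
            }
        }
    }

    /** Convenience overload that splits the command on whitespace (no quoting support). */
    static boolean check(String checkCmd, int... errorValue) {
        return check(checkCmd.trim().split("\\s+"), errorValue);
    }

    public static void main(String[] args) {
        // The result depends on whether "java" is on the PATH of the current machine.
        System.out.println(check("java -version"));
    }
}
```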