Class EmbeddedDocumentUtil

  • All Implemented Interfaces:
    java.io.Serializable

    public class EmbeddedDocumentUtil
    extends java.lang.Object
    implements java.io.Serializable
    Utility class to handle common issues with embedded documents.

    Use the static methods if all you need is the EmbeddedDocumentExtractor. Otherwise, create an instance.

    Note: This class is not thread-safe. Create one instance per thread.

    See Also:
    Serialized Form
    • Constructor Detail

      • EmbeddedDocumentUtil

        public EmbeddedDocumentUtil(ParseContext context)
    • Method Detail

      • getEmbeddedDocumentExtractor

        public static EmbeddedDocumentExtractor getEmbeddedDocumentExtractor(ParseContext context)
        This offers a uniform way to get an EmbeddedDocumentExtractor from a ParseContext. As of Tika 1.15, an AutoDetectParser will automatically be added to parse embedded documents if no Parser.class is specified in the ParseContext.

        If you'd prefer not to parse embedded documents, set Parser.class to EmptyParser in the ParseContext.

        Parameters:
        context - the ParseContext from which to get the extractor
        Returns:
        EmbeddedDocumentExtractor
      • getDetector

        public Detector getDetector()
      • getMimeTypes

        public MimeTypes getMimeTypes()
      • getTikaConfig

        public TikaConfig getTikaConfig()
        Returns:
        a TikaConfig. The method first looks for one in the ParseContext that was supplied during initialization; if none is found there, it creates one via TikaConfig.getDefaultConfig(). The default config is cached so that it only has to be created once.
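        The lookup order described above can be sketched as follows. This is only an illustration of the "context first, then cached default" pattern, not Tika's source: Config and Context are hypothetical stand-ins for TikaConfig and ParseContext, and createDefault() stands in for TikaConfig.getDefaultConfig().

```java
import java.util.HashMap;
import java.util.Map;

final class ConfigLookupSketch {

    /** Hypothetical stand-in for TikaConfig. */
    static final class Config {
        final String origin;
        Config(String origin) { this.origin = origin; }
    }

    /** Hypothetical stand-in for ParseContext: objects keyed by their class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();
        <T> void set(Class<T> key, T value) { entries.put(key, value); }
        <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
    }

    private final Context context;
    private Config cachedDefault;   // unsynchronized, hence one instance per thread
    int defaultsCreated = 0;        // counts how often the default was built

    ConfigLookupSketch(Context context) { this.context = context; }

    /** Stand-in for the (potentially expensive) default-config factory. */
    private Config createDefault() {
        defaultsCreated++;
        return new Config("default");
    }

    /** Prefer the config in the context; otherwise build the default once and reuse it. */
    Config resolve() {
        Config fromContext = context.get(Config.class);
        if (fromContext != null) {
            return fromContext;
        }
        if (cachedDefault == null) {
            cachedDefault = createDefault();
        }
        return cachedDefault;
    }

    public static void main(String[] args) {
        ConfigLookupSketch util = new ConfigLookupSketch(new Context());
        util.resolve();
        util.resolve();
        System.out.println("defaults created: " + util.defaultsCreated);
    }
}
```

        Because the cached field is written without synchronization in this sketch, sharing one instance across threads could build the default more than once, which is one reason to keep an instance per thread.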
      • getConfig

        @Deprecated
        public TikaConfig getConfig()
        Deprecated.
        as of 1.17, use getTikaConfig() instead
        Returns:
        a TikaConfig. The method first looks for one in the ParseContext that was supplied during initialization; if none is found there, it creates one via TikaConfig.getDefaultConfig().
      • recordException

        public static void recordException(java.lang.Throwable t,
                                           Metadata m)
      • recordEmbeddedStreamException

        public static void recordEmbeddedStreamException(java.lang.Throwable t,
                                                         Metadata m)
      • shouldParseEmbedded

        public boolean shouldParseEmbedded(Metadata m)
      • parseEmbedded

        public void parseEmbedded(java.io.InputStream inputStream,
                                  org.xml.sax.ContentHandler handler,
                                  Metadata metadata,
                                  boolean outputHtml)
                           throws java.io.IOException,
                                  org.xml.sax.SAXException
        Throws:
        java.io.IOException
        org.xml.sax.SAXException
      • tryToFindExistingLeafParser

        public static Parser tryToFindExistingLeafParser(java.lang.Class clazz,
                                                         ParseContext context)
        Tries to find an existing parser of the given class within the ParseContext, looking inside CompositeParsers and ParserDecorators. The use case is a parser that needs to parse an internal stream that is part of the same document, e.g. the RTF body inside an MSG file.

        Can return null if the context contains no parser or the correct parser can't be found.
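        The search through composites and decorators can be pictured as a recursive walk. The following is a minimal sketch under stated assumptions, not Tika's implementation: Parser, Composite, Decorator, RtfParser and MsgParser are hypothetical stand-ins, and matching on the exact class (rather than on subclasses) is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

final class LeafParserSearchSketch {

    /** Hypothetical stand-in for a parser interface. */
    interface Parser {}

    /** Stand-in for a composite: holds several child parsers. */
    static final class Composite implements Parser {
        final List<Parser> children;
        Composite(Parser... children) { this.children = Arrays.asList(children); }
    }

    /** Stand-in for a decorator: wraps exactly one parser. */
    static final class Decorator implements Parser {
        final Parser wrapped;
        Decorator(Parser wrapped) { this.wrapped = wrapped; }
    }

    /** Hypothetical leaf parsers used only for the demonstration. */
    static final class RtfParser implements Parser {}
    static final class MsgParser implements Parser {}

    /** Returns the first parser whose class equals clazz, or null if there is none. */
    static Parser findLeaf(Class<? extends Parser> clazz, Parser root) {
        if (root == null) {
            return null;                      // nothing to search
        }
        if (clazz.equals(root.getClass())) {
            return root;                      // exact class match (assumption)
        }
        if (root instanceof Composite) {
            for (Parser child : ((Composite) root).children) {
                Parser found = findLeaf(clazz, child);
                if (found != null) {
                    return found;
                }
            }
            return null;
        }
        if (root instanceof Decorator) {
            return findLeaf(clazz, ((Decorator) root).wrapped);
        }
        return null;                          // a leaf of some other class
    }

    public static void main(String[] args) {
        Parser rtf = new RtfParser();
        Parser root = new Decorator(new Composite(new MsgParser(), new Decorator(rtf)));
        System.out.println(findLeaf(RtfParser.class, root) == rtf);
    }
}
```

        Callers of a search like this need a null check, since the walk returns null both when no root parser is present and when no parser of the requested class is found.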

        Parameters:
        clazz - parser class to search for
        context - the ParseContext to search
        Returns: