Class AbstractRecursiveParserWrapperHandler

  • All Implemented Interfaces:
    java.io.Serializable, org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler
    Direct Known Subclasses:
    RecursiveParserWrapperHandler

    public abstract class AbstractRecursiveParserWrapperHandler
    extends org.xml.sax.helpers.DefaultHandler
    implements java.io.Serializable
    A handler intended for use only with RecursiveParserWrapper. It allows finer-grained processing of embedded documents than the legacy handlers do; subclasses choose how each embedded document is processed.
    See Also:
    Serialized Form
    • Field Detail

      • TIKA_CONTENT

        public static final Property TIKA_CONTENT
      • TIKA_CONTENT_HANDLER

        public static final Property TIKA_CONTENT_HANDLER
        Simple class name of the content handler
      • PARSE_TIME_MILLIS

        public static final Property PARSE_TIME_MILLIS
      • WRITE_LIMIT_REACHED

        public static final Property WRITE_LIMIT_REACHED
      • EMBEDDED_RESOURCE_LIMIT_REACHED

        public static final Property EMBEDDED_RESOURCE_LIMIT_REACHED
      • EMBEDDED_EXCEPTION

        public static final Property EMBEDDED_EXCEPTION
      • CONTAINER_EXCEPTION

        public static final Property CONTAINER_EXCEPTION
      • EMBEDDED_RESOURCE_PATH

        public static final Property EMBEDDED_RESOURCE_PATH
      • EMBEDDED_DEPTH

        public static final Property EMBEDDED_DEPTH
    • Constructor Detail

      • AbstractRecursiveParserWrapperHandler

        public AbstractRecursiveParserWrapperHandler(ContentHandlerFactory contentHandlerFactory)
      • AbstractRecursiveParserWrapperHandler

        public AbstractRecursiveParserWrapperHandler(ContentHandlerFactory contentHandlerFactory,
                                                     int maxEmbeddedResources)
    • Method Detail

      • getNewContentHandler

        public org.xml.sax.ContentHandler getNewContentHandler()
      • getNewContentHandler

        public org.xml.sax.ContentHandler getNewContentHandler(java.io.OutputStream os,
                                                               java.nio.charset.Charset charset)
      • startEmbeddedDocument

        public void startEmbeddedDocument(org.xml.sax.ContentHandler contentHandler,
                                          Metadata metadata)
                                   throws org.xml.sax.SAXException
        Called before each embedded document is parsed. Override this method for custom behavior, but call super.startEmbeddedDocument(...) from the override, because this method tracks the number of embedded documents.
        Parameters:
        contentHandler - local handler to be used on this embedded document
        metadata - embedded document's metadata
        Throws:
        org.xml.sax.SAXException
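        The super-call requirement above can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not Tika code: CountingBase is a hypothetical stand-in for this class, its counter is an assumed form of the bookkeeping, and a plain Map replaces Tika's Metadata so the example runs with only the JDK.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SuperCallSketch {

    /** Hypothetical stand-in for the base class; NOT Tika code. */
    static class CountingBase extends DefaultHandler {
        private int embeddedCount = 0; // assumed bookkeeping: one increment per embedded document

        public void startEmbeddedDocument(ContentHandler contentHandler,
                                          Map<String, String> metadata) throws SAXException {
            embeddedCount++;
        }

        public int getEmbeddedCount() {
            return embeddedCount;
        }
    }

    /** Custom handler that keeps each embedded document's metadata. */
    static class CollectingHandler extends CountingBase {
        final List<Map<String, String>> collected = new ArrayList<>();

        @Override
        public void startEmbeddedDocument(ContentHandler contentHandler,
                                          Map<String, String> metadata) throws SAXException {
            // Call super first so the base class keeps counting embedded documents.
            super.startEmbeddedDocument(contentHandler, metadata);
            collected.add(metadata);
        }
    }

    /** Simulates n embedded documents and returns the count recorded by the base class. */
    public static int simulate(int n) {
        CollectingHandler handler = new CollectingHandler();
        try {
            for (int i = 0; i < n; i++) {
                Map<String, String> metadata = new HashMap<>();
                metadata.put("name", "embedded-" + i); // illustrative key only
                handler.startEmbeddedDocument(new DefaultHandler(), metadata);
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.getEmbeddedCount();
    }

    public static void main(String[] args) {
        System.out.println(simulate(3)); // prints 3
    }
}
```

        If the override omitted the super call, the model's count would stay at zero however many embedded documents were seen, which is the failure the note above warns about.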
      • endEmbeddedDocument

        public void endEmbeddedDocument(org.xml.sax.ContentHandler contentHandler,
                                        Metadata metadata)
                                 throws org.xml.sax.SAXException
        Called after each embedded document is parsed. Override this method for custom behavior. This implementation is currently a no-op.
        Parameters:
        contentHandler - content handler that was used on this embedded document
        metadata - metadata for this embedded document
        Throws:
        org.xml.sax.SAXException
      • endDocument

        public void endDocument(org.xml.sax.ContentHandler contentHandler,
                                Metadata metadata)
                         throws org.xml.sax.SAXException
        Called after the full parse has completed. Override this method for custom behavior, but call super.endDocument(...) from the override, because this method records in the metadata whether the embedded resource maximum has been hit.
        Parameters:
        contentHandler - content handler that was used on the main document
        metadata - metadata that was gathered for the main document
        Throws:
        org.xml.sax.SAXException
      • hasHitMaximumEmbeddedResources

        public boolean hasHitMaximumEmbeddedResources()
        Returns:
        true if this handler hit the maximum number of embedded resources during the parse
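        How the embedded resource limit, endDocument and hasHitMaximumEmbeddedResources might fit together can be sketched with a self-contained model. Everything here is an assumption made for illustration rather than Tika source: LimitTrackingBase is a hypothetical stand-in for this class, a negative maximum is assumed to mean "no limit", the limit is assumed to count as hit once the number of embedded documents reaches it, the method signatures are simplified, and LIMIT_REACHED_KEY is a placeholder for the EMBEDDED_RESOURCE_LIMIT_REACHED property.

```java
import java.util.HashMap;
import java.util.Map;

public class EmbeddedLimitSketch {

    // Placeholder metadata key; the real class uses the EMBEDDED_RESOURCE_LIMIT_REACHED property.
    public static final String LIMIT_REACHED_KEY = "limit-reached";

    /** Hypothetical stand-in for the base class; NOT Tika code. */
    static class LimitTrackingBase {
        private final int maxEmbeddedResources; // assumption: negative means "no limit"
        private int embeddedResources = 0;

        LimitTrackingBase(int maxEmbeddedResources) {
            this.maxEmbeddedResources = maxEmbeddedResources;
        }

        public void startEmbeddedDocument(Map<String, String> metadata) {
            embeddedResources++;
        }

        public void endDocument(Map<String, String> metadata) {
            // Record on the container's metadata whether the limit was hit.
            if (hasHitMaximumEmbeddedResources()) {
                metadata.put(LIMIT_REACHED_KEY, "true");
            }
        }

        public boolean hasHitMaximumEmbeddedResources() {
            // Assumed comparison: the limit is hit once the count reaches the maximum.
            return maxEmbeddedResources >= 0 && embeddedResources >= maxEmbeddedResources;
        }
    }

    /** Custom handler that adds its own end-of-parse work and still calls super. */
    static class SummaryHandler extends LimitTrackingBase {
        SummaryHandler(int maxEmbeddedResources) {
            super(maxEmbeddedResources);
        }

        @Override
        public void endDocument(Map<String, String> metadata) {
            super.endDocument(metadata); // lets the base class write the limit flag
            metadata.put("summary-written", "true"); // illustrative custom behavior
        }
    }

    /** Simulates a parse with the given limit and number of embedded documents. */
    public static Map<String, String> simulate(int maxEmbeddedResources, int embeddedDocs) {
        SummaryHandler handler = new SummaryHandler(maxEmbeddedResources);
        for (int i = 0; i < embeddedDocs; i++) {
            handler.startEmbeddedDocument(new HashMap<>());
        }
        Map<String, String> containerMetadata = new HashMap<>();
        handler.endDocument(containerMetadata);
        return containerMetadata;
    }

    public static void main(String[] args) {
        System.out.println(simulate(2, 3).containsKey(LIMIT_REACHED_KEY)); // prints true
        System.out.println(simulate(5, 3).containsKey(LIMIT_REACHED_KEY)); // prints false
    }
}
```

        In this model the flag only reaches the metadata because SummaryHandler delegates to super.endDocument(...); an override that skipped the call would silently drop it.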