org.openxml.parser
Class HTMLParser
java.lang.Object
|
+--org.openxml.parser.BaseParser
|
+--org.openxml.parser.ContentParser
|
+--org.openxml.parser.HTMLParser
- public final class HTMLParser
- extends org.openxml.parser.ContentParser
Implements a parser for HTML documents and nodes. The parser creates the HTML
document with DOMFactory, loads the specified DTD document, and ensures that the
HTML, HEAD and BODY elements exist in its structure.
- Version:
- $Revision: 1.9 $ $Date: 1999/04/18 01:53:32 $
- Author:
- Assaf Arkin
- See Also:
ContentParser, SAXException
Fields inherited from class org.openxml.parser.ContentParser
_currentNode, _docType
Fields inherited from class org.openxml.parser.BaseParser
_curChar, _document, _tokenText, CR, EOF, LF, SPACE, TOKEN_CDATA, TOKEN_CLOSE_TAG,
TOKEN_COMMENT, TOKEN_DTD, TOKEN_ENTITY_REF, TOKEN_EOF, TOKEN_OPEN_TAG, TOKEN_PE_REF,
TOKEN_PI, TOKEN_SECTION, TOKEN_SECTION_END, TOKEN_TEXT
Constructor Summary
HTMLParser(java.io.Reader reader, java.lang.String sourceURI)
Parser constructor.
HTMLParser(java.io.Reader reader, java.lang.String sourceURI, short mode, short stopAtSeverity)
Parser constructor.
Methods inherited from class org.openxml.parser.ContentParser
getEntityContents, parseAttrEntity, parseAttributes, parseContentEntity, readTokenContent
Methods inherited from class org.openxml.parser.BaseParser
advanceLineNumber, canReadName, close, error, fatalError, getColumnNumber,
getErrorHandler, getErrorReport, getLastException, getLineNumber, getLocator, getMode,
getPublicId, getReader, getSourcePosition, getSourceURI, getSystemId, isClosed, isMode,
isNamePart, isSpace, isTokenAllSpace, parseDocumentDecl, parseGeneralEntity, pushBack,
pushBack, readChar, readTokenEntity, readTokenMarkup, readTokenName, readTokenPERef,
readTokenQuoted, setEncoding, setErrorHandler, setErrorSink, slicePITokenText, warning
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
HTMLParser
public HTMLParser(java.io.Reader reader,
java.lang.String sourceURI,
short mode,
short stopAtSeverity)
- Parser constructor. Requires source text in the form of a Reader object and a
source URI as an identifier. The parsing mode consists of a combination of
MODE_.. flags. The constructor specifies the error severity level at which to
stop parsing: Parser.STOP_SEVERITY_FATAL, Parser.STOP_SEVERITY_VALIDITY or
Parser.STOP_SEVERITY_WELL_FORMED.
- Parameters:
reader - Any Reader from which entity text can be read
sourceURI - URI of entity source
mode - The parsing mode in effect
stopAtSeverity - Severity level at which to stop parsing
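The idea of a short mode value built from a combination of flags can be sketched as follows. This is an illustration only: the flag names and values below are hypothetical (the real constants are defined by Parser and their values are not given on this page), and it assumes the flags are combined with bitwise OR, which this page does not confirm.

```java
final class ModeFlagsSketch {
    // Hypothetical flag values for illustration; NOT the constants defined by Parser.
    static final short DEMO_STORE_COMMENT = 0x01;
    static final short DEMO_STORE_PI = 0x02;
    static final short DEMO_STORE_CDATA = 0x04;

    // Combines individual flags into a single mode value (assumes bitwise-OR semantics).
    static short combine(short... flags) {
        int mode = 0;
        for (short flag : flags) {
            mode |= flag;
        }
        return (short) mode;
    }

    // Returns true if every bit of the given flag is set in the mode.
    static boolean isSet(short mode, short flag) {
        return (mode & flag) == flag;
    }

    public static void main(String[] args) {
        short mode = combine(DEMO_STORE_COMMENT, DEMO_STORE_CDATA);
        System.out.println("comments stored: " + isSet(mode, DEMO_STORE_COMMENT));
        System.out.println("PIs stored: " + isSet(mode, DEMO_STORE_PI));
    }
}
```

With the real class, the equivalent check would presumably go through the inherited isMode method, whose exact signature is not shown here.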
HTMLParser
public HTMLParser(java.io.Reader reader,
java.lang.String sourceURI)
- Parser constructor. The constructor operates in the default mode of
Parser.MODE_HTML_PARSER with Parser.STOP_SEVERITY_FATAL.
- Parameters:
reader - Any Reader from which entity text can be read
sourceURI - URI of entity source
parseDocument
public Document parseDocument()
throws SAXException
parseNode
public final Node parseNode(Node node)
throws SAXException
- Parses a document fragment. A document fragment by definition does not
contain a header or DTD and is not subject to validation. An empty
document fragment (created from an existing document) must be supplied
and the non-empty fragment is returned.
- Parameters:
fragment - A DocumentFragment that is empty and compatible
- Returns:
The same DocumentFragment object
- Throws:
SAXException - A parsing error has been encountered, and based on
its severity, an exception is thrown to terminate parsing
parseNextNode
protected boolean parseNextNode(int token)
throws SAXException,
java.io.IOException
- Parses the next node based on the supplied token. This method is called
with a read token, parses a node and appends it to ContentParser._currentNode.
If plain text is read, it is accumulated and later converted into a Text node.
If the node is an element, the element is created and its full contents are
read (recursively).
The return value indicates whether the current element (in
ContentParser._currentNode) has been closed with a closing tag (false), or
whether parsing should continue at the same level (true). False is also
returned if the end of file has been reached.
The following rules govern how tokens are translated into nodes:
- CDATA sections are stored as CDATASection if in mode Parser.MODE_STORE_CDATA,
converted to plain text otherwise
- Comments are stored as Comment if in mode Parser.MODE_STORE_COMMENT,
ignored otherwise
- Processing instructions are stored as ProcessingInstruction if in mode
Parser.MODE_STORE_PI, ignored otherwise
- All whitespace characters are converted to space (0x20) and runs of whitespace
are consolidated in text, except in a few space-preserving elements
- Entity references are stored as text; unresolved references are stored
in their textual representation, regardless of Parser.MODE_PARSE_ENTITY
- Attributes are read according to the rules set forth in
ContentParser.parseAttributes(org.w3c.dom.Element, boolean)
- For the HTML elements <PRE>, <SCRIPT> and <STYLE>,
all text until the closing tag is consumed and stored as is, without
parsing markup or character references
- For HTML elements with an optional closing tag, if the closing tag is
missing, an empty element is stored
- A single orphan closing tag is supported (while issuing a well-formedness
error); an orphan closing tag is one that is misplaced relative to the open
tag, e.g. '<P><FONT><B>text</P></B></FONT>'
- White space immediately after the opening tag and immediately before
the closing tag is discarded
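The whitespace rule above can be sketched in isolation as follows. This is a minimal, self-contained illustration and not the parser's own code: the class and method names are hypothetical, and treating PRE, SCRIPT and STYLE as the space-preserving elements is an assumption based on the rule for those three elements. Discarding white space next to the opening and closing tags is not modelled here.

```java
final class WhitespaceSketch {
    // Converts each whitespace character to a space (0x20) and collapses
    // consecutive whitespace into a single space.
    static String collapse(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r') {
                if (!inSpace) {
                    out.append(' ');
                }
                inSpace = true;
            } else {
                out.append(ch);
                inSpace = false;
            }
        }
        return out.toString();
    }

    // Assumption: only PRE, SCRIPT and STYLE keep their text as is.
    static boolean preservesSpace(String elementName) {
        return elementName.equalsIgnoreCase("PRE")
            || elementName.equalsIgnoreCase("SCRIPT")
            || elementName.equalsIgnoreCase("STYLE");
    }

    // Returns the text as it would be stored for the given element under these assumptions.
    static String normalize(String elementName, String text) {
        return preservesSpace(elementName) ? text : collapse(text);
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("P", "a \t\n b\r\nc") + "]");
        System.out.println("[" + normalize("PRE", "a  b") + "]");
    }
}
```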
The proper way to use this method is:
_currentNode = ...;
token = readTokenContent();
while ( parseNextNode( token ) )
token = readTokenContent();
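The contract of this loop (continue while true is returned, stop at a closing tag or end of file, and read element contents recursively) can be imitated with a toy model. Everything in the sketch below is hypothetical: the Token enum, the token queue and the method names only stand in for readTokenContent() and parseNextNode(int), and nodes are merely counted rather than built.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class TokenLoopSketch {
    // Hypothetical token kinds; the real parser uses the TOKEN_.. int constants.
    enum Token { TEXT, OPEN_TAG, CLOSE_TAG, EOF }

    private final Deque<Token> tokens;
    int nodesAppended = 0;

    TokenLoopSketch(Token... input) {
        tokens = new ArrayDeque<>(Arrays.asList(input));
    }

    // Stand-in for readTokenContent(): next token, or EOF once input is exhausted.
    Token readToken() {
        Token next = tokens.poll();
        return next == null ? Token.EOF : next;
    }

    // Stand-in for parseNextNode(): false on a closing tag or EOF, true otherwise.
    boolean parseNext(Token token) {
        switch (token) {
            case CLOSE_TAG:
            case EOF:
                return false;
            case OPEN_TAG:
                nodesAppended++;
                parseLevel(); // read the element's contents recursively
                return true;
            default:
                nodesAppended++;
                return true;
        }
    }

    // Same shape as the usage pattern: read a token, loop while parsing continues.
    void parseLevel() {
        Token token = readToken();
        while (parseNext(token)) {
            token = readToken();
        }
    }

    int remaining() {
        return tokens.size();
    }
}
```

In this model a closing tag with no matching open tag simply ends the current level, leaving any later tokens unread; the real parser's handling of orphan closing tags is more involved.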
- Parameters:
token - The last token read with ContentParser.readTokenContent()
- Returns:
True if parsing should continue, false if the current element has been closed
or the end of file has been reached
- Throws:
SAXException - A parsing error has been encountered, and based on
its severity, an exception is thrown to terminate parsing
java.io.IOException - An I/O exception has been encountered when reading
from the input stream
- See Also:
ContentParser.parseAttributes(org.w3c.dom.Element, boolean),
ContentParser.readTokenContent(), ContentParser._currentNode,
#_orphanClosingTag
parseDTDSubset
protected final void parseDTDSubset()
throws SAXException,
java.io.IOException
- Parses the external DTD subset. A new DTDDocument is created,
the external subset is optionally cached in memory, and public identifiers
are possibly converted to URIs, as per the installed HolderFinder.
This method is called after '<!DOCTYPE' has been consumed and returns
after the terminating '>' has been read.