HTML_Node(3) HTML_Node(3)
NAME
HTML_Node - Abstract base class for nodes in an HTML tree
SYNOPSIS
namespace HTML_Tree {
  class HTML_Node {
  public:
    class visitor {
    public:
      virtual ~visitor();
      virtual bool operator()(
        HTML_Node*, int depth, bool is_end_tag
      ) = 0;
    };
    virtual ~HTML_Node() = 0;
    class iterator : public
      std::iterator< std::forward_iterator_tag, HTML_Node > {
    public:
      iterator();
      iterator& operator++();
      iterator operator++(int);
      HTML_Node& operator* () const;
      HTML_Node* operator->() const;
      friend bool operator==(iterator const&, iterator const&);
      friend bool operator!=(iterator const&, iterator const&);
    };
    iterator begin();
    iterator end();
    class const_iterator : public
      std::iterator< std::forward_iterator_tag, HTML_Node const > {
    public:
      const_iterator();
      const_iterator& operator++();
      const_iterator operator++(int);
      HTML_Node const& operator* () const;
      HTML_Node const* operator->() const;
      friend bool operator==(
        const_iterator const&, const_iterator const&
      );
      friend bool operator!=(
        const_iterator const&, const_iterator const&
      );
    };
    const_iterator begin() const;
    const_iterator end() const;
    std::string as_string( int pretty_print = -1 ) const;
    Content_Node* parent() const;
    void parent( Content_Node *new_parent );
    virtual void visit( visitor&, int depth = 0 );
    std::ostream& write( std::ostream&, int pretty_print = -1 ) const;
    virtual bool write_node(
      std::ostream&, int spaces, bool is_end_tag
    ) const = 0;
    class manip {
    public:
      typedef
        std::ostream& (HTML_Node::*function)(std::ostream&,int) const;
      manip( HTML_Node const&, function f, int arg );
      friend std::ostream& operator<<( std::ostream&, manip const& );
    };
    manip write( int pretty_print = -1 ) const;
    friend bool operator==( HTML_Node const&, HTML_Node const& );
    friend bool operator!=( HTML_Node const&, HTML_Node const& );
  protected:
    HTML_Node( Content_Node *parent = 0 );
    virtual bool similar_to( HTML_Node const& ) const;
  };
  Content_Node* html_parse(
    char const *begin, char const *end, bool include_comments = false
  );
}
DESCRIPTION
HTML_Node is an abstract base class for nodes in an HTML
tree that was built by parsing an HTML file into a tree
structure like the HTML DOM (Document Object Model). Once
built, the nodes of the tree (elements and text from the
HTML file) can be traversed either by a user-defined
visitor class or by an iterator.
Public Interface
std::string as_string( int pretty_print = -1 ) const
Returns the HTML tree converted (back) to an HTML
string. The pretty_print argument, when zero or
greater, specifies that the HTML is to be
``pretty-printed'': text nodes are trimmed of leading
and trailing whitespace and have a single newline
appended; all other nodes appear on lines by
themselves, indented by their depth. Each line's
indentation is additionally increased by
2 * pretty_print spaces.
iterator begin()
const_iterator begin() const
iterator end()
const_iterator end() const
Return an iterator or const_iterator positioned at
the beginning of the HTML tree (begin) or one past
its end (end), in STL style. The iterators can be
used with any STL algorithm that accepts forward
iterators.
Content_Node* parent() const
Returns a pointer to the current parent node for
this node, or null if this node has no parent.
void parent( Content_Node *new_parent )
If this node already has a parent that is not
new_parent, this node is first removed from that
parent's list of child nodes. Then this node's
parent is set to new_parent. If new_parent is not
null, this node is added to the new parent's list
of child nodes.
virtual void visit( visitor&, int depth = 0 )
Performs a depth-first tree traversal starting at
this node. For each node, the visitor's operator()
is called once.
std::ostream& write( std::ostream&, int pretty_print = -1
) const
Writes the HTML text representation of the tree to
the given ostream. The pretty_print argument has
the same meaning as for as_string().
manip write( int pretty_print = -1 ) const
This is a specialized version of write() above that
allows this to be done:
some_ostream << node->write();
i.e., writing to an ostream using ``insertion
style.''
virtual bool write_node( std::ostream&, int spaces, bool
is_end_tag ) const = 0
Writes the HTML text representation of the node to
the given ostream preceded by the given number of
spaces. If is_end_tag is true, write the end tag
for the element; otherwise the start tag. Returns
false only if nothing was written.
friend bool operator==( HTML_Node const&, HTML_Node const&
)
friend bool operator!=( HTML_Node const&, HTML_Node const&
)
Compares two HTML_Nodes (or objects of classes
derived from HTML_Node) for equality or inequality,
respectively, and returns that result.
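The one-argument write() described above returns a manip
object instead of writing anything itself. The following is
a minimal sketch of how such a deferred-write manipulator
can work. Demo_Node and manip_demo() are hypothetical
stand-ins, not part of HTML_Tree, and the indentation rule
in the sketch is a toy assumption.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a node class; NOT the real HTML_Node.
class Demo_Node {
public:
    explicit Demo_Node( std::string const &html ) : html_( html ) { }

    // Two-argument form: writes to the given stream immediately.
    // Toy rule (an assumption): indent by 2 * pretty_print spaces.
    std::ostream& write( std::ostream &o, int pretty_print = -1 ) const {
        if ( pretty_print > 0 )
            o << std::string( 2 * pretty_print, ' ' );
        return o << html_;
    }

    // Remembers the node, the member function, and its argument
    // until operator<< supplies a stream.
    class manip {
    public:
        typedef
            std::ostream& (Demo_Node::*function)( std::ostream&, int ) const;
        manip( Demo_Node const &n, function f, int arg )
            : node_( n ), f_( f ), arg_( arg ) { }
        friend std::ostream& operator<<( std::ostream &o, manip const &m ) {
            assert( m.f_ != 0 );
            return (m.node_.*m.f_)( o, m.arg_ );  // the deferred call
        }
    private:
        Demo_Node const &node_;
        function f_;
        int arg_;
    };

    // One-argument form: returns a manip instead of writing.
    manip write( int pretty_print = -1 ) const {
        return manip( *this, &Demo_Node::write, pretty_print );
    }

private:
    std::string html_;
};

// Writes a node "insertion style" in the middle of a << chain.
std::string manip_demo() {
    Demo_Node node( "<p>hi</p>" );
    std::ostringstream out;
    out << "[" << node.write() << "]";
    return out.str();
}
```

Calling manip_demo() from main() yields the node's text
between the two brackets. The temporary manip lives until
the end of the full << expression, so holding the node by
reference is safe in this usage.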
Protected Interface
HTML_Node( Content_Node *parent = 0 )
Default constructor. If parent is not null, sets
the parent and adds this node to that parent's list
of child nodes.
virtual bool similar_to( HTML_Node const& ) const
Returns true only if this node is the same node as
the given one, i.e., their addresses are equal.
(This is overridden by ``semantically better''
functions in derived classes.)
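The parent bookkeeping described for parent() and for the
constructor can be sketched as follows. Demo_Child,
Demo_Parent and reparent_demo() are hypothetical stand-ins
for HTML_Node and Content_Node; the use of a std::vector for
the child list and the early return when the parent is
unchanged are assumptions of this sketch, not documented
behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

class Demo_Parent;                  // stand-in for Content_Node

class Demo_Child {                  // stand-in for HTML_Node
public:
    // If p is not null, sets the parent and joins its child list.
    explicit Demo_Child( Demo_Parent *p = 0 ) : parent_( 0 ) {
        parent( p );
    }
    Demo_Parent* parent() const { return parent_; }
    void parent( Demo_Parent *new_parent );
private:
    Demo_Parent *parent_;
};

class Demo_Parent {
public:
    std::vector<Demo_Child*> children;
};

inline void Demo_Child::parent( Demo_Parent *new_parent ) {
    if ( parent_ == new_parent )
        return;                     // assumption: avoid duplicate entries
    if ( parent_ ) {
        // Remove this node from the old parent's child list.
        std::vector<Demo_Child*> &c = parent_->children;
        c.erase( std::remove( c.begin(), c.end(), this ), c.end() );
    }
    parent_ = new_parent;
    if ( parent_ )
        parent_->children.push_back( this );
}

// Moves one node from parent a to parent b, then detaches it.
bool reparent_demo() {
    Demo_Parent a, b;
    Demo_Child n( &a );
    bool ok = a.children.size() == 1 && n.parent() == &a;
    n.parent( &b );
    ok = ok && a.children.empty()
            && b.children.size() == 1 && n.parent() == &b;
    n.parent( 0 );
    return ok && b.children.empty() && n.parent() == 0;
}
```

In the sketch, a node is never in two child lists at once,
which is the invariant the description of parent() implies.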
Global Functions
Content_Node* html_parse( char const *begin, char const
*end, bool include_comments = false )
Parses the HTML in the buffer [begin,end) into an
HTML tree and returns a pointer to the root node of
that tree.
Iterator Classes
The classes iterator and const_iterator are STL forward
iterators and can be used like any other forward iterator,
including with every STL algorithm that accepts one.
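As an illustration of using a tree iterator with an STL
algorithm, here is a minimal sketch of a forward iterator
over a small tree, passed to std::count_if. Demo_Tree_Node,
Preorder_Iterator, is_p() and count_p_demo() are
hypothetical names; the pre-order visiting order and the
stack-based implementation are assumptions of the sketch
and say nothing about how HTML_Node::iterator is built. The
sketch declares the iterator typedefs directly because
std::iterator is deprecated as of C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

struct Demo_Tree_Node {             // hypothetical stand-in node
    std::string name;
    std::vector<Demo_Tree_Node*> children;
    explicit Demo_Tree_Node( std::string const &n ) : name( n ) { }
};

class Preorder_Iterator {
public:
    typedef std::forward_iterator_tag iterator_category;
    typedef Demo_Tree_Node            value_type;
    typedef std::ptrdiff_t            difference_type;
    typedef Demo_Tree_Node*           pointer;
    typedef Demo_Tree_Node&           reference;

    Preorder_Iterator() { }         // empty stack == one past the end
    explicit Preorder_Iterator( Demo_Tree_Node *root ) {
        stack_.push_back( root );
    }
    reference operator*() const {
        assert( !stack_.empty() );
        return *stack_.back();
    }
    pointer operator->() const { return &**this; }

    Preorder_Iterator& operator++() {
        assert( !stack_.empty() );
        Demo_Tree_Node *n = stack_.back();
        stack_.pop_back();
        // Push children in reverse so the first child is next.
        for ( std::size_t i = n->children.size(); i > 0; --i )
            stack_.push_back( n->children[ i - 1 ] );
        return *this;
    }
    Preorder_Iterator operator++( int ) {
        Preorder_Iterator tmp( *this );
        ++*this;
        return tmp;
    }
    friend bool operator==( Preorder_Iterator const &a,
                            Preorder_Iterator const &b ) {
        return a.stack_ == b.stack_;
    }
    friend bool operator!=( Preorder_Iterator const &a,
                            Preorder_Iterator const &b ) {
        return !( a == b );
    }
private:
    std::vector<Demo_Tree_Node*> stack_;
};

bool is_p( Demo_Tree_Node const &n ) { return n.name == "p"; }

// Counts the "p" nodes in: body -> ( p -> ( b ), p ).
std::ptrdiff_t count_p_demo() {
    Demo_Tree_Node body( "body" ), p1( "p" ), b( "b" ), p2( "p" );
    p1.children.push_back( &b );
    body.children.push_back( &p1 );
    body.children.push_back( &p2 );
    return std::count_if(
        Preorder_Iterator( &body ), Preorder_Iterator(), is_p
    );
}
```

With the real classes, the corresponding call would pass
node->begin() and node->end() to the algorithm in the same
way.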
The Visitor Class
HTML_Node::visitor is an abstract base class for objects
that ``visit'' nodes.
Public Interface
virtual ~visitor()
Destructor. It does nothing. It's defined only to
ensure it's virtual as it should be for an abstract
base class.
virtual bool operator()( HTML_Node*, int depth, bool
is_end_tag )
The visit function. A derived class must override
this since it's pure virtual. The depth indicates
how ``deep'' the current node is in the tree.
Depths start at zero. The is_end_tag argument is
not used by HTML_Node, so it always passes false.
Iterators vs. Visitors
The iterator and visitor classes are similar in that they
can both be used to iterate over (or visit) every node in
the tree. However, the differences are:
1. An iterator iterates over every node exactly once.
2. A visitor visits non-empty nodes twice: once each for
the start and end tags.
3. A visitor, based on the visitor function's return
values, can either skip nodes by not descending into
portions of the tree or loop back from end tags to start
tags and repeat portions of the tree.
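The differences listed above can be sketched with a small
traversal that calls a visitor for start and end tags.
Demo_Elem, Demo_Visitor, Trace_Visitor, demo_visit() and
visit_demo() are hypothetical names. This page does not
spell out what the visitor's return value means, so the
sketch assumes that returning false from a start-tag call
means "do not descend", that the end-tag call still happens,
and that "non-empty" means "has children". Looping back from
end tags is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Demo_Elem {                  // hypothetical stand-in node
    std::string name;
    std::vector<Demo_Elem*> children;
    explicit Demo_Elem( std::string const &n ) : name( n ) { }
};

struct Demo_Visitor {               // same shape as HTML_Node::visitor
    virtual ~Demo_Visitor() { }
    virtual bool operator()( Demo_Elem*, int depth, bool is_end_tag ) = 0;
};

// Depth-first traversal. Leaves are visited once; nodes with
// children are visited for their start tag and their end tag.
void demo_visit( Demo_Elem *node, Demo_Visitor &v, int depth = 0 ) {
    assert( node != 0 );
    if ( node->children.empty() ) {
        v( node, depth, false );
        return;
    }
    if ( v( node, depth, false ) )  // assumed: false means "skip"
        for ( std::size_t i = 0; i < node->children.size(); ++i )
            demo_visit( node->children[i], v, depth + 1 );
    v( node, depth, true );
}

// Records every call and refuses to descend into "head".
struct Trace_Visitor : Demo_Visitor {
    std::string log;
    bool operator()( Demo_Elem *n, int, bool is_end_tag ) {
        log += is_end_tag ? "</" : "<";
        log += n->name + ">";
        return n->name != "head";
    }
};

// Tree: html -> ( head -> ( title ), body -> ( p ) ).
std::string visit_demo() {
    Demo_Elem html( "html" ), head( "head" ), title( "title" ),
              body( "body" ), p( "p" );
    head.children.push_back( &title );
    body.children.push_back( &p );
    html.children.push_back( &head );
    html.children.push_back( &body );
    Trace_Visitor v;
    demo_visit( &html, v );
    return v.log;
}
```

In the resulting trace, "title" never appears because the
visitor declined to descend into "head", while "html",
"head" and "body" each appear twice and the leaf "p" once.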
SEE ALSO
Comment_Node(3), Content_Node(3), Element_Node(3),
Text_Node(3).
World Wide Web Consortium Document Object Model Working
Group. Document Object Model, December 1998.
http://www.w3.org/DOM/
AUTHOR
Paul J. Lucas
HISTORY
The HTML parser is derived from code in SWISH++, a really
fast file indexing and searching engine (also by the
author).
HTML Tree March 17, 2003 HTML_Node(3)