Tree(3)        User Contributed Perl Documentation        Tree(3)

NAME
       `HTML::Tree' - Perl extension for quickly parsing HTML
       files into trees

SYNOPSIS
        use HTML::Tree;

        $tree1 = HTML::Tree->from_file( 'file.html' );
        $aref  = $tree1->as_array();
        $tree2 = HTML::Tree->from_array( $aref );
        $str   = $tree2->as_string();
        $tree3 = HTML::Tree->from_string( $str );
        $tree3->write( 'new_file.html' );

       then:

        sub visitor {
            my( $node, $depth, $is_end_tag ) = @_;
            # ...
        }

        $tree1->visit( \&visitor );

       or:

        sub visitor {
            my( $hash_ref, $node, $depth, $is_end_tag ) = @_;
            # ...
        }

        my %my_hash;
        # ...
        $tree1->visit( \%my_hash, \&visitor );

       also:

        $aref = $node->children();
        $node->delete();
        $node = $node->find_if( \&predicate_function );
        $node = $node->find_name( 'name' );
        $bool = $node->is_element();
        $bool = $node->is_comment();
        $bool = $node->is_text();
        $name = $node->name();
        $text = $node->text();

DESCRIPTION
       `HTML::Tree' is a fast parser that parses an HTML file
       into a tree structure like the HTML DOM (Document Object
       Model).  Once built, the nodes of the tree (elements and
       text from the HTML file) can be traversed by a
       user-defined visitor function or compiled into an
       array-of-hashes data structure.

       `HTML::Tree' is very similar to the `HTML::Parser' and
       `HTML::TreeBuilder' modules by Gisle Aas, except that it:

       1.  Is several times faster.  `HTML::Tree' owes its speed
           to two things: using mmap(2) to read the HTML file,
           bypassing conventional I/O and buffering, and being
           written entirely in C++ as opposed to Perl.

       2.  Isn't a strict DTD (Document Type Definition) parser.
           The goal is to parse HTML files fast, not to check
           them for validity.  (You should check the validity of
           your HTML files with other tools before you put them
           on your web site anyway.)  For example, `HTML::Tree'
           does not care what attributes a given HTML element
           has so long as the syntax is correct.  This is similar
           to browsers in that both are very permissive in what
           they accept.

       3.  Offers simple conditional and looping mechanisms that
           assist in the generation of dynamic HTML content.

Methods
       For the methods below, the kind of node a method may be
       called on is indicated; `$node' means "any kind of node."
       Calling a method for a node of the wrong kind is a fatal
       error.

       `$parent_node = HTML::Tree->from_file(' file_name [`, {'
       param_hash_ref `}' ] `)'
           Parse the given HTML file and return a reference to a
           new `HTML::Tree' object.  If, for any reason, the file
           cannot be parsed (file does not exist, insufficient
           permissions, etc.), `undef' is returned.

           Parameters that control how the data structure is
           built may be passed via a reference to a hash.  If
           `Include_Comments' is given with a non-zero value,
           then comment nodes are included; otherwise, they are
           elided.

       `$array_ref = $node->as_array(' [ `{' param_hash_ref `}' ]
       `)'
           Returns a reference to an array-of-hashes data
           structure representing the nodes in the HTML tree
           starting at the specified node.  Parameters that
           control how the data structure is built may be passed
           via a reference to a hash.  The parameters are the
           same as for `from_file()' above.

           For example, given this HTML:

                   <a href="file.html">
                     Text A
                     <b>Text B</b>
                     <i>Text C</i>
                   </a>

           the `Data::Dumper' representation of the resulting
           data structure would be:

                   $ref = [
                     {
                       'name' => 'a',
                       'atts' => {
                         'href' => 'file.html'
                       },
                       'content' => [
                         'Text A',
                         {
                           'name' => 'b',
                           'content' => [ 'Text B' ]
                         },
                         {
                           'name' => 'i',
                           'content' => [ 'Text C' ]
                         },
                       ]
                     }
                   ]

           Every HTML element at the same "depth" or "level" is
           contained in the same array, i.e., they are "siblings"
           in the tree.  The order of the elements in the array
           matches the order of the HTML elements in the file.

           A node is either a string (representing text or a
           comment) or a reference to a hash (representing an
           HTML element).  Strings are tied scalars, so modifying
           them changes the underlying tree.  Strings in the HTML
           file that are entirely whitespace are elided from the
           data structure.

           A hash always has a `name' key whose value is the name
           of the HTML element and may also have an `atts' key
           and/or a `content' key.

           The value of the `atts' key is a reference to a tied
           hash where the hash keys are attribute names and the
           hash values are the attribute values.  Attribute names
           are returned in lower case (regardless of how they are
           in the HTML file).  Because the hash is tied,
           assigning to a hash element changes that attribute's
           value; similarly, deleting a hash element deletes the
           attribute.

           The value of the `content' key is a reference to an
           array containing all of the node's child nodes at the
           next level down.

           Note: Modifying the arrays themselves (adding
           elements, deleting, etc.) does not modify the
           underlying tree.  To do that, either use the
           `children()' method or "walk" the tree using a visitor
           function.

       `$parent_node = HTML::Tree->from_array(' array_ref `)'
           Create a new `HTML::Tree' object from a data structure
           in the form returned by `as_array()'.  If, for any
           reason, the data structure isn't in the right form,
           the function will croak with an error message.

       `$string = $node->as_string(' [ `{' param_hash_ref `}' ]
       `)'
           Return the HTML text representation of the portion of
           the tree starting at the given node as a single
           string.

           Parameters that control how the HTML tree is converted
           to a string may be passed via a reference to a hash.
           If the `Pretty_Print' parameter is given with a value
           greater than or equal to zero, then text nodes have
           leading and trailing whitespace removed, are indented
           according to their depth, and have a single newline
           appended.  All other nodes appear on lines by
           themselves and are also indented according to their
           depth.  Indentation is done by spaces where the number
           of spaces at a given depth is `(Pretty_Print + depth)
           * 2'.

           Note: pretty-printing is suspended inside `<pre>'
           elements to preserve the original formatting.
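
           The indentation rule can be tried out without the
           module.  The following is only a minimal sketch, not
           the module's implementation: `render' is a
           hypothetical helper that walks a plain data structure
           in the form returned by `as_array()' and indents each
           line by `(Pretty_Print + depth) * 2' spaces.  As
           simplifying assumptions, attributes are omitted, every
           element gets an end tag, and the `<pre>' exception is
           ignored.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Tree): render an as_array()-style
# structure, indenting each line by ($pretty_print + $depth) * 2 spaces.
sub render {
    my( $nodes, $pretty_print, $depth ) = @_;
    $depth ||= 0;
    my $indent = ' ' x ( ( $pretty_print + $depth ) * 2 );
    my $out = '';
    for my $node ( @$nodes ) {
        if ( ref $node eq 'HASH' ) {            # element node
            $out .= $indent . '<' . $node->{name} . ">\n";
            $out .= render( $node->{content}, $pretty_print, $depth + 1 )
                if $node->{content};
            $out .= $indent . '</' . $node->{name} . ">\n";
        } else {                                # text node
            ( my $text = $node ) =~ s/^\s+|\s+$//g;
            $out .= $indent . $text . "\n";
        }
    }
    return $out;
}

my $tree = [ { name => 'b', content => [ ' Text B ' ] } ];
my $html = render( $tree, 1 );                  # Pretty_Print = 1
print $html;
```

           With a `Pretty_Print' value of 1, the element at depth
           0 is indented by two spaces and its text at depth 1 by
           four.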

       `$parent_node = HTML::Tree->from_string(' string [`, {'
       param_hash_ref `}' ] `)'
	   This is the same as `from_file()' except that the HTML
	   is parsed from the given string rather than a file.

       `$value = $element_node->att(' name `)'
           Returns the value of the element node's attribute
           called name, or `undef' if the node has no such
           attribute.  Attribute names must be specified in lower
           case (regardless of how they are in the HTML file).

       `$element_node->att(' name`, 'new_value `)'
           Sets the value of the element node's attribute called
           name to new_value, adding the attribute if it does not
           already exist.  If new_value is `undef', the attribute
           is deleted instead.  Attribute names must be specified
           in lower case (regardless of how they are in the HTML
           file).

       `$attributes_ref = $element_node->atts()'
           Returns a reference to a tied hash of all of an
           element node's attribute/value pairs or a reference to
           an empty hash if said node does not have any.
           Attribute names are returned in lower case (regardless
           of how they are in the HTML file).  Because the hash
           is tied, assigning to a hash element changes that
           attribute's value; similarly, deleting a hash element
           deletes the attribute.

       `$child_nodes_ref = $parent_node->children()'
           Returns a reference to a tied array of all of an
           element node's child nodes.  Because the array is
           tied, the Perl array manipulation functions pop, push,
           shift, and unshift work and affect the structure of
           the HTML::Tree.  For example:

                   $orphan = shift @{ $node1->children() };

           "detaches" the first child node of $node1 from the
           tree structure and returns a reference to it, now as
           its own distinct HTML::Tree.  Conversely:

		   push @{ $node2->children() }, $orphan;

	   "reattaches" the sub-tree but now at the end of the
	   child nodes of $node2 elsewhere in the tree.

	   Additionally, a child node can also be replaced by
	   assignment as in:

		   $node->children()->[0] = expression

	   where expression is one of: a reference to a data
	   structure in the form returned by `as_array()', a ref-
	   erence to an HTML::Tree (in which case the whole tree
	   is "inserted"), or a string (in which case the string
	   is parsed as HTML).

       `$node->delete()'
	   Delete the node and all of its child nodes, if any,
	   from the tree.  Once deleted, the reference to the
	   node must not be used.

       `$node = $node->find_if(' func_ref `)'
	   Find the first node for which the given predicate
	   function is true starting the find from the given
	   node.  Returns `undef' if no such node is found.  Clo-
	   sures work well to generate the predicate function
	   since additional parameters can be used during the
	   find.  For example:

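                   sub pred_att_re {
                           my( $att, $re ) = @_;
                           return sub {
                                   my $node = shift;
                                   return 0 unless $node->is_element();
                                   # att() returns undef if the attribute is absent
                                   my $value = $node->att( $att );
                                   return defined( $value ) && $value =~ /$re/;
                           };
                   }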

		   $node = $html->find_if( pred_att_re( 'href', '\.jpg$' ) );

	   This would find an element node having an attribute
	   `href' that matches the regular expression `\.jpg$'.
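
           The closure technique itself does not depend on the
           module.  As a minimal sketch, it is applied here to a
           plain data structure in the form returned by
           `as_array()' rather than to a live tree:
           `find_first' is a hypothetical stand-in for
           `find_if()', and `pred_att_re' is adapted to test
           hash-based nodes.  Neither reflects how the module is
           implemented.

```perl
use strict;
use warnings;

# Build a predicate that remembers $att and $re (a closure).
sub pred_att_re {
    my( $att, $re ) = @_;
    return sub {
        my $node = shift;
        return 0 unless ref $node eq 'HASH' && $node->{atts};
        my $value = $node->{atts}{ $att };
        return ( defined( $value ) && $value =~ /$re/ ) ? 1 : 0;
    };
}

# Hypothetical stand-in for find_if(): depth-first search in document
# order over an as_array()-style structure.
sub find_first {
    my( $nodes, $pred ) = @_;
    for my $node ( @$nodes ) {
        return $node if $pred->( $node );
        next unless ref $node eq 'HASH' && $node->{content};
        my $found = find_first( $node->{content}, $pred );
        return $found if $found;
    }
    return;
}

my $tree = [
    { name => 'p', content => [
        { name => 'a', atts => { href => 'page.html' }, content => [ 'Page' ] },
        { name => 'a', atts => { href => 'photo.jpg' }, content => [ 'Photo' ] },
    ] },
];

my $hit  = find_first( $tree, pred_att_re( 'href', '\.jpg$' ) );
my $href = $hit ? $hit->{atts}{href} : 'not found';
print "$href\n";
```

           Each call to `pred_att_re' yields an independent
           predicate, so several searches with different
           attribute names or patterns can share the same search
           routine.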

       `$element_node = $node->find_name(' name `)'
           Find the first element node having the given name
           starting the find from the given node.  The name must
           be specified in lower case.  Returns `undef' if no
           such element node is found.  (This function is a
           special case of `find_if()' and is much faster for
           finding by name alone.)

       `$bool = $node->is_comment()'
	   Returns true (1) only if the current node is a comment
	   node; false (0), otherwise.

       `$bool = $node->is_text()'
           Returns true (1) only if the current node is a text
           node; false (0), otherwise.  (If a node is neither a
           text node nor a comment node, it must be an element
           node.)

       `$name = $element_node->name()'
	   Returns the HTML element name of an element node,
	   e.g., `title'.  All names are returned in lower case
	   (regardless of how they are in the HTML file).

       `$text = $text_node->text(' [ new_text ] `)'
	   Returns the text of a text node as a string.	 If
	   new_text is given, the text is set to that first.

       `$node->visit( \&'visitor` )'
	   Traverse the HTML tree by calling the visitor function
	   for every node starting at the given node previously
	   returned by a constructor.



       `$node->visit( \%'hash`, \&'visitor` )'
	   Same as the previous method except that a hash refer-
	   ence is passed along (see Arguments below).

       `$success = $node->write( 'file_name` '[`, { 'param_hash_ref` } ']` )'
           Write the HTML text representation of the portion of
           the tree starting at the given node as a single string
           to a file.  Returns 1 upon success, 0 otherwise.

	   Parameters that control how the HTML is written may be
	   passed via a reference to a hash.  The `Pretty_Print'
	   parameter has the same meaning as it does for
	   `as_string()'.

The Visitor Function
       The user supplies a visitor function: a Perl function that
       is called back for each node visited during a depth-first
       traversal of the tree in document order.

       For HTML elements that have end tags, the visitor function
       may be called more than once for a given node based on the
       function's return value.	 (See Return Value below.)

       Note that this occurs for such HTML elements even if said
       element's end tag is optional and was not present in the
       HTML file.

       Arguments


       `$hash_ref'    A reference to a hash that is passed only
		      if the two-argument form of the `visit()'
		      method is used.  This provides a mechanism
		      for additional data (or a blessed object)
		      to be passed to and among the calls to the
		      visitor function.	 The argument is not used
		      at all by `HTML::Tree'.

       `$node'	      A reference to the current node.

       `$depth'	      An integer specifying how "deep" the node
		      is in the tree.  (Depths start at zero.)

       `$is_end_tag'  True (1) only if the tag is an end tag of
		      an HTML element; false (0), otherwise.

       Return Value

       The visitor function is expected to return a Boolean value
       (zero or non-zero for false or true, respectively).  There
       are two meanings for the return value:

       1.  If the $is_end_tag argument is false, returning false
	   means: do not visit any of the current node's child
	   nodes, i.e., skip them and proceed directly to the
	   current node's next sibling and also do not call the
	   visitor again for the end tag; returning true means:
	   do visit all child nodes and call the visitor again
	   for the end tag.

       2.  If the $is_end_tag argument is true, returning false
	   means: proceed normally to the next sibling; returning
	   true means: loop back and repeat the visit cycle from
	   the beginning by revisiting the start tag of the
	   current element node (case 1 above).
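
       The two cases above can be sketched in plain Perl.  This
       is a hypothetical re-implementation of the calling
       protocol for illustration only, not the module's own code:
       `visit' here is an ordinary function operating on a data
       structure in the form returned by `as_array()', and it
       assumes every element has an end tag.  The sample visitor
       skips one subtree (case 1) and repeats one element three
       times (case 2).

```perl
use strict;
use warnings;

# Hypothetical model of the visit protocol; not part of HTML::Tree.
sub visit {
    my( $nodes, $visitor, $depth ) = @_;
    $depth ||= 0;
    for my $node ( @$nodes ) {
        if ( ref $node ne 'HASH' ) {            # text node: visited once
            $visitor->( $node, $depth, 0 );
            next;
        }
        while ( 1 ) {
            # Start tag: a false return skips the children and the end tag.
            last unless $visitor->( $node, $depth, 0 );
            visit( $node->{content} || [], $visitor, $depth + 1 );
            # End tag: a true return repeats the cycle for this element.
            last unless $visitor->( $node, $depth, 1 );
        }
    }
}

my $tree = [
    { name => 'ul', content => [ { name => 'li', content => [ 'item' ] } ] },
    { name => 'script', content => [ 'hidden' ] },
];

my $rows = 0;
my @log;
visit( $tree, sub {
    my( $node, $depth, $is_end_tag ) = @_;
    if ( ref $node ne 'HASH' ) {
        push @log, "text:$node";
        return 1;
    }
    my $name = $node->{name};
    return 0 if $name eq 'script';              # case 1: skip this subtree
    push @log, ( $is_end_tag ? '/' : '' ) . $name;
    # Case 2: repeat the <li> element until three rows have been produced.
    return ++$rows < 3 if $name eq 'li' && $is_end_tag;
    return $is_end_tag ? 0 : 1;
} );

print join( ' ', @log ), "\n";
```

       The log shows `li', its text, and `/li' three times inside
       a single `ul' ... `/ul' pair, and nothing for the skipped
       `script' element.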

EXAMPLE
       Here is a sample visitor function that "pretty prints" an
       HTML file:

	       sub visitor {
		       my( $node, $depth, $is_end_tag ) = @_;
		       print "	  " x $depth;
		       if ( $node->is_text() ) {
			       my $text = $node->text();
			       $text =~ s/(?:^\n|\n$)//g;
			       print "$text\n";
			       return 1;
		       }
		       if ( $is_end_tag ) {
                               print "</", $node->name(), ">\n";
			       return 0;
		       }
		       print '<', $node->name();
		       my $atts = $node->atts();
		       while ( my( $att, $val ) = each %{ $atts } ) {
			       print " $att=\"$val\"";
		       }
		       print ">\n";
		       return 1;
	       }


NOTES
       In order for an HTML file to be properly parsed, scripting
       languages must be "comment hidden" as in:

               <script language="JavaScript">
               <!--
                   ... script code ...
               // -->
               </script>


SEE ALSO
       perl(1), mmap(2), Data::Dumper(3), HTML::Parser(3),
       HTML::TreeBuilder(3).

       World Wide Web Consortium Document Object Model Working
       Group.  Document Object Model, December 1998.
       `http://www.w3.org/DOM/'

AUTHOR
       Paul J. Lucas 

HISTORY
       The HTML parser of the C++ part of the module is derived
       from code in SWISH++, a really fast file indexing and
       searching engine (also by the author).



2002-10-28		   perl v5.6.0			  Tree(3)