Validator User’s Guide

Validator is a cross-platform drag-and-drop XML validator. Drag an XML file onto the Validator icon. If the file has a document type declaration, Validator reports the file as valid, not valid, or not well-formed. If the file does not have a doctype, Validator reports the file as well-formed or not well-formed. You can drop multiple files and folders on the icon. For dropped folders, Validator recurses through folders and subfolders looking for files to validate. Validator validates any file dropped directly on the icon, but while scanning folders, Validator validates only files with the following extensions: dita, ditamap, fo, hs, htm, html, jhm, jnlp, plist, rss, wml, xht, xhtml, xml, xsd, xsl, and xslt. Validator can be installed as a command-line utility on Linux and similar platforms. The latest information about Validator can be found at http://homepage.mac.com/rcrews/software/validator/.
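
If you want a rough preview of which files a folder scan would pick up, you can approximate the extension list above with find. This is only a sketch built from the extensions named in this guide: list_scan_candidates is a hypothetical helper, not part of Validator, and it assumes extensions are matched case-sensitively, which this guide does not specify.

```shell
# Hypothetical helper, not part of Validator: list the files under a folder
# whose extensions appear in the guide's folder-scan list.
# Assumption: extensions are matched case-sensitively.
list_scan_candidates() {
  find "$1" -type f \( \
    -name '*.dita' -o -name '*.ditamap' -o -name '*.fo' -o -name '*.hs' \
    -o -name '*.htm' -o -name '*.html' -o -name '*.jhm' -o -name '*.jnlp' \
    -o -name '*.plist' -o -name '*.rss' -o -name '*.wml' -o -name '*.xht' \
    -o -name '*.xhtml' -o -name '*.xml' -o -name '*.xsd' -o -name '*.xsl' \
    -o -name '*.xslt' \) | sort
}
```

For example, list_scan_candidates ~/Sites prints one candidate path per line. Files dropped directly on the icon are not subject to this filter.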

Validator skips documents whose doctype appears on its internal list of known SGML doctypes; however, Validator assumes any *.htm or *.html file without a doctype is XML and checks it for well-formedness using XML rules, not SGML (“HTML”) rules as you might expect. If you don’t want this, either add a doctype to your files as required by the HTML Specification or don’t pass these files to Validator. HTML Tidy converts HTML to XHTML, fixing potentially hundreds of common mark-up errors in the process. If you work with HTML, do yourself a favor and introduce yourself to Tidy.

This document contains the following sections:

  1. Installing and Upgrading Validator on Mac OS X
  2. Installing and Upgrading Validator on Windows
  3. Installing and Upgrading Validator on Linux and Similar Platforms
  4. Doctype Overview
  5. Correct Mark-Up
  6. Support for XML Schema
  7. Understanding Validator Error Messages
  8. Enhancements for XHTML
  9. Enhancements for DocBook 5.0
  10. Validator for XML Experts
    1. Validator for CVS
    2. Validator for Command-Line Saxon
    3. Validator for xsltproc
    4. Validator for Perl
    5. Validator for Java
  11. Command Line Interface
  12. Native Support for Public Identifiers
  13. Release History
  14. License

Installing and Upgrading Validator on Mac OS X

To install Validator on Mac OS X, drag the Validator icon to your Applications folder. You can drag an alias to your Desktop or your Dock as well, if you want.

For access to Validator from the command line, add the following to your ~/.bashrc:

Validator ( ) {
  /usr/bin/perl /Applications/Validator.app/Contents/Resources/script "$@"
}

To upgrade, drag the new Validator to your Applications folder, allowing it to overwrite the previous version.

Installing and Upgrading Validator on Windows

To install Validator on Windows, run the installer. If you’re behind a firewall, choose Start > Control Panel > System > Advanced > Environment Variables and set a system variable called http_proxy to the address of your HTTP proxy (format: “http://proxy.company.com:80”).

You can drop any number of items on the Validator shortcut on your Desktop. Some or all of them can be folders containing XML files. Validator writes to a file called Validator.log on your Desktop. When the validation is complete, Validator launches Microsoft Write to display the log.

To upgrade from earlier versions of Validator, delete C:\Validator.app, then run the installer. For subsequent upgrades, run the uninstaller, then run the installer.

Installing and Upgrading Validator on Linux and Similar Platforms

To install Validator on Linux and similar platforms:

  1. Confirm that you have Perl 5.8.1 or later installed with the following modules: Compress::Zlib, Cwd, Encode, File::Find, File::Spec, FindBin, Getopt::Long, IO::File, Pod::Usage, URI::file, XML::LibXML, XML::LibXML::Common, XML::NamespaceSupport, XML::SAX, and XML::SAX::Base.

    Most of these are standard modules. You will likely need to install only XML::LibXML and its prerequisites (XML::NamespaceSupport, XML::SAX::Base, XML::SAX, and XML::LibXML::Common). You’ll know you have XML::LibXML installed when the following command does not return an error:

    perl -e "use XML::LibXML;"
    
  2. Put Validator.app anywhere on your system.

  3. Add the following to your ~/.bashrc:

    Validator ( ) {
      /path/to/perl /path/to/Validator.app/Contents/Resources/script "$@"
    }
    

    You could create a shell alias if you’d prefer. You cannot, however, launch the program from a file system link (or symbolic link).

Invoke Validator on the command line, following it with paths to any number of XML files or paths to directories containing XML files. Validator writes output to STDOUT, so you might want to use shell redirection to write your report to a file if you’re expecting a lot of data.
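
The redirection mentioned above can be wrapped in a small function that also preserves Validator’s exit status, which this guide later describes as 1 when errors are detected and 0 otherwise. Treat this as a sketch: it assumes the Validator shell function from step 3 is defined, and validate_to_report is an illustrative name, not part of Validator.

```shell
# Sketch only. Assumes the Validator shell function from step 3 is defined.
# validate_to_report REPORT_FILE PATH... is a hypothetical wrapper.
validate_to_report() {
  report=$1
  shift
  # Validator writes its report to STDOUT, so plain redirection captures it.
  Validator "$@" > "$report"
  rc=$?
  echo "Validator exit status: $rc (report in $report)" >&2
  return "$rc"
}
```

For example, validate_to_report ~/validator-report.txt ~/Sites writes the report for everything under ~/Sites to a file and returns a nonzero status if any file had errors.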

To upgrade, delete Validator.app, then put the new Validator.app in its place.

Doctype Overview

The document type declaration generates a lot of confusion; however, the declaration itself serves simply to identify and optionally expand the document type definition for the current document. Don’t confuse the document type declaration (often “doctype”) with the document type definition (DTD). A doctype associates a particular document with a particular DTD, and a DTD defines the document structure for a class of documents.

A document type declaration always begins with <!DOCTYPE and ends with a simple >. Letter case matters. A doctype is neither a tag nor an element, so don’t end it with />. After “DOCTYPE” is the name of the root element for the current document. Following this, a doctype takes one of the following forms:

<!DOCTYPE root>
This form identifies only the root element. For this form, Validator ignores the doctype and reports only whether the document is well-formed or not well-formed.
<!DOCTYPE root>

<!DOCTYPE root [ … ]>
This form defines the entire structure for the current document between square brackets.
<!DOCTYPE sEc [
  <!ELEMENT Para  (#PCDATA)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT sEc   (TITLE, Para+, sEc*)>
  <!ATTLIST sEc
    relation  CDATA  #IMPLIED
    subject   CDATA  #IMPLIED >
  <!ENTITY ddagger "&#x2021;">
]>

<!DOCTYPE root SYSTEM "URI" [ … ]>
This form identifies the DTD for the current document by URI. The URI must be delimited by quotes. The optional square brackets, if present, enclose additional structural information.
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">

<!DOCTYPE root PUBLIC "FPI" "URI" [ … ]>
This form identifies the DTD for the current document with a formal public identifier and a URI. Both the FPI and the URI must be delimited by quotes. The optional square brackets, if present, enclose additional structural information. The URI is not optional: doctype sniffing (used by Mozilla, Internet Explorer, and Opera) is for tag-soup mark-up.
<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.1//EN"
  "http://www.w3.org/2005/SMIL21/SMIL21.dtd">

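To make the four forms concrete, here is a small sketch that labels a doctype by form. It is not how Validator itself works: doctype_form is a hypothetical helper, and it assumes the keyword PUBLIC or SYSTEM is surrounded by single spaces and that no square bracket appears inside the quoted identifiers.

```shell
# Hypothetical helper: label a document type declaration by its form.
# Assumptions: PUBLIC/SYSTEM are surrounded by single spaces, and the
# quoted FPI and URI contain no "[" character.
doctype_form() {
  decl=$1
  head=${decl%%\[*}   # text before any internal subset
  case $head in
    '<!DOCTYPE '*' PUBLIC '*) form=public ;;
    '<!DOCTYPE '*' SYSTEM '*) form=system ;;
    '<!DOCTYPE '*)            form=name-only ;;
    *)                        form=not-a-doctype ;;
  esac
  # An internal subset is present when something was cut off at "[".
  if [ "$head" != "$decl" ] && [ "$form" != not-a-doctype ]; then
    form="$form+internal-subset"
  fi
  echo "$form"
}
```

The internal subset is cut off before the keywords are checked, so a PUBLIC keyword inside an entity declaration does not mislabel a bracket-only doctype.
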
Correct Mark-Up

Validation is the middle tier on the pyramid of correct mark-up. Before validation, the document must be well-formed, and after validation, the document must be checked against the specific rules of the mark-up language specification. Here are the three tiers:

Well-formed
The basic requirement for XML is well-formedness. A document is well-formed when an XML parser fails to discover any “fatal errors” in the document. The XML specification says that a proper parser must immediately stop processing the document and return no useful content data when it encounters a fatal error, assuring well-formedness as the baseline requirement for XML processing. Typical fatal errors are misuse of the >, <, or & characters, improperly nested elements, and improperly closed elements.
Valid
Validation is the middle tier. Each element in a valid document contains only the attributes defined by the DTD and is positioned directly within only those elements defined to enclose it. DTDs can define other items of correct structure as well: named character entities, the correct location of text, and the proper use of element IDs.
Correct
The apex of the pyramid is a document that is well-formed, valid, and satisfies all the requirements of the mark-up language specification. The types of requirements that a validating XML parser cannot check mostly relate to the syntax of attribute values and structural exclusions.

Validator’s validation is based on DTDs. DTDs don’t handle some features of modern XML, notably namespaces and the technologies that rely on them. XML Schema and Relax NG are technologies that can do most things DTDs can do and more; however, DTDs have a history of success in defining and verifying document structure. People who eschew DTDs are often iconoclasts reluctant to check their mark-up in any way beyond “eye-balling” a rendering in a popular browser. Remember that both XML Schema and Relax NG provide more checking than DTDs, meaning they are each more restrictive than DTDs. Additionally, DTDs provide advantages over other validation methods:

Correct syntax should be the price of admission for mark-up. After learning the simple techniques required to master it, immediately proceed to the “best practices” guides available for your mark-up language. For XHTML, start with the W3C Web Accessibility Initiative.

Support for XML Schema

Validator evaluates XML documents associated with XML Schema rules as long as the files define either the schemaLocation or noNamespaceSchemaLocation attributes as prescribed in the XML Schema specification. The noNamespaceSchemaLocation attribute identifies the location of the schema to use for mark-up that is not in a namespace, and the schemaLocation attribute identifies the location of schemas associated with the various namespaces in your XML document. Validator evaluates XML Schema documents themselves against the XML Schema DTD as long as the file uses the appropriate doctype and formal public identifier.

Although the noNamespaceSchemaLocation attribute takes a single URI value, and the schemaLocation attribute takes any number of namespace and schema location pairs with all items separated from one another by whitespace, Validator currently supports only one XML Schema validation per file. If your document defines a noNamespaceSchemaLocation attribute, Validator validates the file using only that identified schema. If your document defines a schemaLocation attribute (and does not define a noNamespaceSchemaLocation attribute), Validator validates the file using only the first namespace/location pair in the attribute value.
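
The selection rule described above can be sketched as follows. This only illustrates the rule as stated in this guide, not Validator’s own code: pick_schema is a hypothetical helper that takes the two attribute values as arguments and prints the one schema location that would be used.

```shell
# Hypothetical helper: pick_schema NO_NAMESPACE_VALUE SCHEMA_LOCATION_VALUE
# Prints the single schema location selected by the rule described above.
# The body runs in a subshell so that "set -f" does not leak to the caller.
pick_schema() (
  no_ns=$1
  schema_loc=$2
  if [ -n "$no_ns" ]; then
    printf '%s\n' "$no_ns"   # noNamespaceSchemaLocation takes precedence
    exit 0
  fi
  set -f                     # no filename expansion while splitting
  set -- $schema_loc         # split the attribute value on whitespace
  [ "$#" -ge 2 ] || exit 1   # no complete namespace/location pair
  printf '%s\n' "$2"         # location half of the first pair
)
```

For example, pick_schema '' 'http://example.com/ns1 one.xsd http://example.com/ns2 two.xsd' prints one.xsd. The example.com namespaces are placeholders.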

Additionally—odd as it sounds—the specification for XML Schema includes a (nonnormative) DTD for XML Schema documents and says,

Authoring XML Schema documents using this DTD and DTD-based authoring tools and specifying it as the DOCTYPE of documents intended to be XML Schema documents and validating them with a validating XML parser, are sensible development strategies which users are encouraged to adopt….

Therefore, XML Schema documents—such as this one—can be validated with Validator:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE schema PUBLIC "-//W3C//DTD XMLSCHEMA 200102//EN"
  "http://www.w3.org/2001/XMLSchema.dtd">
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> 

  <xs:element name="calendar">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="event" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="title" type="xs:string"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="event">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="booth" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="date" type="xs:dateTime"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="booth">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="organization" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="location" type="xs:string"/>
    </xs:complexType>
  </xs:element>

</xs:schema>

However, note that the DTD for XML Schema defines the preferred namespace prefix as “xs.” If your XML Schema document uses a different namespace prefix, you'll need to redefine the namespace prefix for the DTD to validate your schemas, as is done in this valid document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE schema PUBLIC "-//W3C//DTD XMLSCHEMA 200102//EN"
  "http://www.w3.org/2001/XMLSchema.dtd" [
<!ENTITY % p "xsd:">
<!ENTITY % s ":xsd">
  ]>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="dwarves">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="names" type="xsd:string" maxOccurs="7"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

An XML document references XML Schema rules as in the following example, which associates mark-up that is not in a namespace with the calendar.xsd schema listed above:

<?xml version="1.0" encoding="UTF-8"?>
<calendar title="Additions"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="calendar.xsd">

  <event date="2008-12-22T00:00:00.000Z">
    <booth location="B17">
      <organization>PuppyWash.com</organization>
    </booth>
    <booth location="E45">
      <organization>Lighted Collars, Inc.</organization>
    </booth>
  </event>

  <event date="2009-02-16T00:00:00.000Z">
    <booth location="Q29">
      <organization>Rick's Tattoos</organization>
    </booth>
    <booth location="T34">
      <organization>Artistic Tattoo</organization>
    </booth>
  </event>

</calendar>

Validator validates the files like this:

selkie:Desktop rcrews$ Validator calendar.xsd schema.xml dwarves.xsd
calendar.xsd... Valid.
schema.xml... XML Schema... Valid.
dwarves.xsd... Valid.

Understanding Validator Error Messages

Validator error messages are concise and specific. The messages describe what’s wrong with the file, not how to fix it. Keep the following in mind as you fix your files:

Each log entry begins with the file name followed by an ellipsis. Log entries for files without errors are all on one line and end with either “Well-formed” or “Valid.” Log entries for files with errors always take more than one line and end with a blank line.

Syntax errors begin with a line number. On the same line as the line number is the explanation of the error. The next line shows the error in context, and the line following contains a caret (^) pointing to the location of the error. Here’s an example:

/Users/rcrews/Desktop/toc.htm... 
:49: parser error : expected '>'
</p class="BookTitle">
    ^

This shows an error in a file at /Users/rcrews/Desktop/toc.htm. The error is at line 49 of that file. The error is that Validator was expecting a >, but instead found something else, in this case characters after a space. The problem is that end tags cannot contain attributes. Remove the attribute and the error goes away. If you immediately saw the > at the end of the line and were confused, remember that the whole line is shown to give the error context. In this case, it is the entire content of line 49. Direct your attention to the location identified by the caret, the space after the p, because that’s where the error is, not at the end of the line.
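
If you save a long report to a file, you can pull out just the line numbers and explanations of syntax errors. The following is a sketch that relies only on the message shape shown in the example above (“:LINE: parser error : TEXT”). syntax_error_lines is a hypothetical helper, and other message shapes are ignored.

```shell
# Hypothetical helper: read a Validator report on standard input and print
# one "line N: explanation" summary per syntax error.
# Only the ":N: parser error : ..." shape shown in this guide is recognized.
syntax_error_lines() {
  sed -n 's/^:\([0-9][0-9]*\): parser error : \(.*\)$/line \1: \2/p'
}
```

For example, syntax_error_lines < Validator.log prints only the summaries, leaving out the context and caret lines.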

Validation errors do not contain a line number, making them harder to locate. To locate a validation error, you will need to know the structure of your document. Here’s a contrived example:

<!DOCTYPE sample [
<!ELEMENT sample (item)+>
<!ELEMENT item (name)>
<!ELEMENT name (#PCDATA)>
]>
<sample>
  <item>
    <name>Mojave Desert</name>
  </item>
  <name/>
</sample>

In this example, the root element is sample. The element sample can contain one or more item elements, and item elements can contain one and only one name element. The name element must contain parsed character data. The file is well-formed, meaning there are no syntax errors, but there is a structural problem: There is a name element that is not inside an item element. Here’s the Validator error message:

/Users/rcrews/Desktop/sample.xml... 
Element sample content does not follow the DTD, expecting (item)+, got
(item name) 

The message says Validator was expecting the content defined for the element named sample, specifically (item)+, which means one or more item elements. What it got was an item element followed by a name element. The message is concise and specific.
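
One way to read an “expecting … got …” message is to treat the content model as a regular expression over the sequence of child element names. The sketch below illustrates that idea and is not how Validator evaluates DTDs: children_match is a hypothetical helper, and you translate the content model to an extended regular expression by hand.

```shell
# Hypothetical helper: children_match MODEL CHILD...
# MODEL is an extended regular expression in which every element name is
# followed by one space, so the DTD model (item)+ is written '(item )+'.
# Succeeds when the sequence of child names matches the model.
children_match() {
  model=$1
  shift
  children=$(printf '%s ' "$@")   # for example: "item name "
  printf '%s\n' "$children" | grep -Eq "^(${model})\$"
}
```

With this helper, children_match '(item )+' item item succeeds, while children_match '(item )+' item name fails, mirroring the message above.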

Here’s one not so contrived:

/Users/rcrews/Desktop/sample3.html... 
No declaration for attribute target of element a

In this message, Validator is telling you it found an attribute target on an element a and this is an error. You might be thinking, “I’ve worked with HTML for years, and I know that a target attribute is allowed on an a element.” The problem here is not your memory, but the doctype. This document (though the error message doesn’t mention this) is identified as XHTML 1.0 Strict. If you need the target attribute, simply change your doctype to XHTML 1.0 Transitional and all will be well.

Here’s another typical validation error:

/Users/rcrews/Desktop/sample2.xml... 
Element Chapter content does not follow the DTD, expecting (Title , (Caution |
Note | Tip | Warning | BridgeHead | Example | Figure | Table | ItemizedList |
OrderedList | SegmentedList | SimpleList | VariableList | InformalEquation |
InformalExample | InformalTable | Graphic | FormalPara | Para | Comment |
MsgSet | HelpEntry)+ , Sect1* , RefEntry*), got (Title Para Para ItemizedList
Para Sect1 Sect1 Sect1 Sect1 Sect1 Para Sect1 Sect1)

Here, you can see there is a problem with the Chapter element. It does not follow the DTD. The content model is blah, blah, blah, blah and the document contains blah, blah, blah. It’s not as hopeless as it looks. Focus on the “got” clause. This particular Chapter element—which we know has an error—starts with a Title, followed by a Para, a Para, an ItemizedList, another Para, then several Sect1s…. Wait a minute. What’s that Para doing among those Sect1s? This is indeed the problem. You could verify this by carefully reading the content model (or by moving the Para into one of the Sect1s and validating the file again). Remember that a little common sense will direct you to the error without a lot of work.

You’ll get the hang of it with practice.
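
When a report covers many files, it helps to list only the files that produced messages. The sketch below relies on the log layout described at the start of this section: an entry with messages starts with a line that has nothing after the ellipsis, and it ends with a blank line. files_with_messages is a hypothetical helper, and the approach is a heuristic: blank lines inside an entry, as in the Relax NG reports, can confuse it.

```shell
# Hypothetical helper: read a Validator report on standard input and print
# the name of each file whose log entry spans more than one line.
files_with_messages() {
  awk '
    # A blank line closes the current multi-line entry.
    in_entry && /^[ \t]*$/ { in_entry = 0; next }
    # A header line with nothing after the ellipsis opens a new entry.
    !in_entry && /\.\.\.[ \t]*$/ {
      name = $0
      sub(/\.\.\.[ \t]*$/, "", name)
      print name
      in_entry = 1
    }
  '
}
```

For example, files_with_messages < Validator.log prints one file name per line for the entries that need attention.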

Enhancements for XHTML

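Because of the popularity of XHTML, Validator checks several items of XHTML correctness as a separate step after validation. In particular, Validator reports violations of the XHTML 1.0 Appendix B prohibitions as errors and reports deprecated mark-up as warnings.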

Here is a Validator message reporting on a valid XHTML 1.0 Transitional document, a file that meets the technical requirements of XML validity, with one error, a violation of the XHTML 1.0 Appendix B prohibitions, and several warnings about the use of deprecated elements:

/Users/rcrews/Sites/B13866_04/extras.904/b12239/deployment.htm... 
Element pre must not contain sup elements.
Warning: The following mark-up is deprecated in XHTML 1.0:
    The align attribute of the div element (16).
    The font element (83).
    The border attribute of the img element (16).
    The type attribute of the li element (49).
    The start attribute of the ol element (4).
    The type attribute of the ol element (4).
    The align attribute of the table element (2).
Valid, with warnings and errors.

Be aware that the iframe element and the target attribute of the a, area, base, form, and link elements are excluded from XHTML 1.0 Strict without being deprecated. Use of this mark-up in XHTML 1.0 Transitional or XHTML 1.0 Frameset will not cause warnings or errors, but use of it in XHTML 1.0 Strict will simply cause the file to fail to validate.

Note that warnings about deprecated mark-up are not errors and do not trigger the Validator failure result code for scripting purposes; however, violations of the Appendix B prohibitions are errors and do cause Validator to return the error result code. (Validator returns 1 if it detects errors and 0 if it does not.)

Enhancements for DocBook 5.0

The normative schema language for DocBook 5.0 is Relax NG. Validator, therefore, evaluates DocBook 5.0 documents against the DocBook 5.0 Relax NG schema any time it locates DocBook 5.0 documents. Note, however, that the Validator interface requires XML files to identify their document types within each file. The most efficient way to do this is to use doctypes as defined in the XML spec. Rather than the following beginning to a DocBook 5.0 document…

<?xml version="1.0" encoding="UTF-8"?>
<article xmlns="http://docbook.org/ns/docbook" version="5.0" xml:lang="en">

the following remains correct, and speeds document identification:

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V5.0//EN"
  "http://www.oasis-open.org/docbook/xml/5.0/dtd/docbook.dtd">
<article xmlns="http://docbook.org/ns/docbook" version="5.0" xml:lang="en">

Without a document type declaration, Validator identifies DocBook 5.0 documents when both the following are true:

DocBook 5.0 no longer includes the historical ISO character entities. To continue using them, you must reference them using DTD syntax:

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V5.0//EN"
  "http://www.oasis-open.org/docbook/xml/5.0/dtd/docbook.dtd" [
<!ENTITY % w3centities PUBLIC
  "-//W3C//ENTITIES Combined Set//EN//XML"
  "http://www.w3.org/2003/entities/2007/w3centities-f.ent">
%w3centities;
]>
<book xmlns="http://docbook.org/ns/docbook" version="5.0" xml:lang="en">

If you omit the formal public identifier (“-//OASIS//DTD DocBook V5.0//EN”), Validator evaluates the document against the DTD only, not against the Relax NG schema. Here’s an example showing a different doctype. This one uses the system-only format and an alternate acceptable URL:

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE article SYSTEM "http://docbook.org/xml/5.0/dtd/docbook.dtd" [
<!ENTITY % w3centities PUBLIC
  "-//W3C//ENTITIES Combined Set//EN//XML"
  "http://www.w3.org/2003/entities/2007/w3centities-f.ent">
%w3centities;
]>
<article xmlns="http://docbook.org/ns/docbook" version="5.0" xml:lang="en">

Validator provides a token to alert you when it is using the Relax NG parser:

/Users/rcrews/docbook-5.0/docs/howto.xml... Relax NG... Valid.
/Users/rcrews/docbook-5.0/docs/docbook-5.0-spec-cd-01.xml... Relax NG... Valid.

Evaluating documents against a Relax NG schema generates messages in a format different from DTD evaluations. If the Relax NG evaluation produces messages, Validator additionally evaluates the document against the DTD and, if any additional messages are therefore generated, shows those as well. Introducing a bogus element (robert) like this…

<para>This <robert>document</robert> is targeted at DocBook users who
are considering switching from DocBook V4.x to DocBook V5.0. It
describes differences between DocBook V4.x and V5.0 and provides some
suggestions about how to edit and process DocBook V5.0 documents. There
is also a section devoted to conversion of legacy documents from DocBook
4.x to DocBook V5.0.</para>

produces these messages:

/Users/rcrews/Desktop/docbook-5.0/docs/howto.xml... 
Relax NG evaluation:
Did not expect element para there
Expecting element example, got para
Expecting element bridgehead, got para
Element para has extra content: text
Expecting element annotation, got para
Element article failed to validate content

DTD evaluation:
No declaration for attribute xmlns:xl of element article
Element robert is not declared in para list of possible children
No declaration for element robert
…

Introducing a bogus attribute (robert="robert") like this…

<para robert="robert">At the time this was written the current version
of DocBook V5.0 was &version;. However, almost all of the information in
this document is general and applies to any newer version of DocBook
V5.0.</para>

produces these messages:

/Users/rcrews/Desktop/docbook-5.0/docs/howto.xml... 
Relax NG evaluation:
Did not expect element para there
Element article has extra content: para

DTD evaluation:
No declaration for attribute xmlns:xl of element article
No declaration for attribute robert of element para
…

Validator for XML Experts

Because of Validator's cache of DTDs, XML processing of various sorts can be enabled or sped up significantly with assistance from Validator. Validator can assure your XML is correct before you commit it to your source control system, and can help XML processing in various programming languages, such as XSLT, Perl, and Java.

Validator for CVS

The Concurrent Versions System is a version control system for recording changes to computer files. CVS is typically used to manage text files, such as those containing source code and XML mark-up. (It can manage binary files, too, though less well.) CVS is easy to set up and administer and is widely used by distributed teams developing software and Web sites.

Primarily through a configuration file called commitinfo, CVS can prevent files from being stored in the shared repository if they don’t meet pre-established workgroup guidelines. Since it makes little sense to share incorrect XML among authors and developers, this section describes how to configure CVS to require all new and updated XML to pass Validator’s checks before CVS stores these files in the repository.

To configure CVS to use Validator:

  1. Check out the CVSROOT module, then cd to your working copy of the module:

    cvs -d /path/to/cvs/repository checkout CVSROOT
    cd CVSROOT
    
  2. Create the following file as Validator_cvs.pl:

    #!/usr/bin/perl
    use strict;
    use warnings;
    
    my $return_value = 0;
    my @filtered_args = ();
    
    $ENV{'PATH'} = '/bin:/usr/bin';
    delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};
    
    for (@ARGV) {
      if (m{\.(?:dita|ditamap|fo|hs|html?|jhm|jnlp|plist|rng|rss|wml|xht|xhtml|
        xml|xsd|xslt?)\z}xms && m{\A([-+\@\w.\x20]+)\z}xms) {
        push(@filtered_args, $1);
      }
    }
    
    if (@filtered_args) {
      $return_value = system('/usr/bin/perl',
        '/Applications/Validator.app/Contents/Resources/script',
        @filtered_args)/256;
    }
    
    exit($return_value);
    __END__
    

    Make sure the file paths are correct for your installation. The paths shown are correct for a typical Mac OS X installation. Validator_cvs.pl does three things:

    1. Cleans the Perl $ENV{'PATH'} variable to make execution of system utilities safer.
    2. Filters input by suffix, passing to Validator only file names with known XML extensions.
    3. Filters input for safety, disallowing insecure values, such as ';rm *'.

  3. Add the following line to commitinfo:

    ALL /usr/bin/perl -T $CVSROOT/CVSROOT/Validator_cvs.pl %r/%p %{s}
    

    The -T flag enables Perl’s taint checking mode, which prevents Perl from executing programs not explicitly coded to filter for possible insecure input. This Perl feature—not available in most other languages—helps keep distributed systems that run arbitrary programs for verification, logging, and so on—such as CVS—safer than they would be if those programs were coded in languages other than Perl or in Perl without taint checking enabled.

  4. Add the following line to checkoutlist:

    Validator_cvs.pl
    
  5. Register Validator_cvs.pl with your repository:

    cvs add Validator_cvs.pl
    
  6. Commit your changes:

    cvs commit -m "Adding Validator support to CVS."
    

You can easily adapt this procedure to create a “pre-commit hook” for Subversion, another popular version control system.

Validator for Command-Line Saxon

If you do validation from other apps, you can usually configure your software to use the DTDs from Validator to avoid unnecessary network traffic. For example, here’s how you can set up Saxon to explore XSLT.

  1. Copy saxon.jar from Michael Kay’s Saxon 6.5.5 package to your extensions directory (/Library/Java/Extensions on Mac OS X, %JAVA_HOME%\lib\ext on Windows, and so on).
  2. Copy resolver.jar from Norm Walsh’s Resolver 1.2 package to your extensions directory.
  3. Create a file named CatalogManager.properties that contains the following text:
    allow-oasis-xml-catalog-pi=yes
    catalog-class-name=org.apache.xml.resolver.Resolver
    catalogs=catalog.xml;file:///Applications/Validator.app/Contents/Resources/dtds/catalog.xml
    prefer=public
    relative-catalogs=yes
    static-catalog=yes
    verbosity=1
    

    put the file in a JAR, for example:

    rcrews$ cd /Library/Java/Extensions
    rcrews$ jar cf CatalogManager.jar CatalogManager.properties
    rcrews$ rm CatalogManager.properties
    

    and copy the JAR to your extensions directory.

  4. Configure a method to start Saxon from the command line.

    Bash users: add the following to your ~/.bashrc:

    saxon ( ) {
      java com.icl.saxon.StyleSheet -u \
        -x org.apache.xml.resolver.tools.ResolvingXMLReader \
        -y org.apache.xml.resolver.tools.ResolvingXMLReader \
        -r org.apache.xml.resolver.tools.CatalogResolver "$@"
    }
    

    then source your ~/.bashrc (or restart):

    rcrews$ source ~/.bashrc
    

Saxon validates source documents before transforming them if they contain a doctype. After setting this up, you’ll find your transformations finish much faster than before. In fact, transformations that previously failed because you weren’t connected to the Internet will now finish, since Saxon no longer needs to get the DTD files from the identified Internet site.

The Resolver instructions describe getting this going for Xalan, XP, and XT. For example, here is a bat file to help Windows users run Apache Xalan from the command line:

@ECHO OFF
SET CLASSPATH=C:\cp\xalan-j_2_7_0\xalan.jar
SET CLASSPATH=%CLASSPATH%;C:\cp\xalan-j_2_7_0\serializer.jar
SET CLASSPATH=%CLASSPATH%;C:\cp\xml-commons-resolver-1.2\resolver.jar
SET CLASSPATH=%CLASSPATH%;C:\cp\classes
java org.apache.xalan.xslt.Process -EntityResolver org.apache.xml.resolver.tools.CatalogResolver -URIResolver org.apache.xml.resolver.tools.CatalogResolver %*

The last line—beginning with java and ending with %*—absolutely must be all on one long line. Also, make sure the paths to xalan.jar, serializer.jar, and resolver.jar are correct for your system.

Here is a Cygwin bash script to do the same:

#!/bin/bash
CLASSPATH="C:/cp/xalan-j_2_7_0/xalan.jar"
CLASSPATH="$CLASSPATH;C:/cp/xalan-j_2_7_0/serializer.jar"
CLASSPATH="$CLASSPATH;C:/cp/xml-commons-resolver-1.2/resolver.jar"
CLASSPATH="$CLASSPATH;C:/cp/classes"
CLASSPATH="$CLASSPATH;C:/docbook-xsl-1.69.1/extensions/xalan25.jar"
export CLASSPATH
java \
  org.apache.xalan.xslt.Process \
  -EntityResolver org.apache.xml.resolver.tools.CatalogResolver \
  -URIResolver org.apache.xml.resolver.tools.CatalogResolver \
  "$@"

Validator for xsltproc

Setting your XML_CATALOG_FILES environment variable will make Validator's DTD cache available to other programs, such as xmllint and xsltproc (standard on Mac OS X and most Linux distributions and readily available for Windows from http://www.zlatkovic.com/libxml.en.html):

rcrews$ export XML_CATALOG_FILES="file:///Applications/Validator.app/Contents/Resources/dtds/catalog.xml"

Separate entries by a space:

rcrews$ export XML_CATALOG_FILES="catalog.xml $XML_CATALOG_FILES"
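
If you set this variable from your ~/.bashrc, a small function can keep the list free of duplicates and stray spaces. This is a sketch: prepend_catalog is a hypothetical helper, not something Validator or libxml2 provides.

```shell
# Hypothetical helper: put a catalog at the front of XML_CATALOG_FILES
# unless it is already listed. Entries are separated by single spaces.
prepend_catalog() {
  case " ${XML_CATALOG_FILES-} " in
    *" $1 "*) ;;   # already present, leave the list unchanged
    *) XML_CATALOG_FILES="$1${XML_CATALOG_FILES:+ $XML_CATALOG_FILES}" ;;
  esac
  export XML_CATALOG_FILES
}
```

For example, prepend_catalog "file:///Applications/Validator.app/Contents/Resources/dtds/catalog.xml" adds Validator's catalog once, no matter how many times your ~/.bashrc is sourced.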

Validator for Perl

Validator speeds Perl XML processing when Perl programs reference Validator's cached DTDs for local, rather than network, DTD resolution. Using the excellent XML::LibXML module, you improve speed simply by passing the location of Validator's catalog.xml file to the parser's load_catalog() method, as the following program shows.

#!/usr/local/perl/bin/perl
use XML::LibXML;
use strict;
use warnings;

my $parser = XML::LibXML->new();
$parser->load_catalog(
  'file:///Applications/Validator.app/Contents/Resources/dtds/catalog.xml');
my $dom = '';

eval { $dom = $parser->parse_file($ARGV[0]); };
if ($@) {
  print $@;
  exit(1);
}

print $dom->documentElement()->nodeName() . "\n";
exit(0);
__END__

Validator for Java

To allow Validator to speed Java XML processing, you need three items:

  • Apache's XML Commons Resolver software on the classpath.
  • A CatalogManager.properties file on the classpath that identifies the location of Validator's catalog.xml file.
  • Use of the setEntityResolver(EntityResolver) method of the javax.xml.parsers.DocumentBuilder class or org.xml.sax.XMLReader interface to process your CatalogManager.properties file.

The CatalogManager.properties file is as described elsewhere in this document:

allow-oasis-xml-catalog-pi=yes
catalog-class-name=org.apache.xml.resolver.Resolver
catalogs=file:///Applications/Validator.app/Contents/Resources/dtds/catalog.xml
prefer=public
relative-catalogs=yes
static-catalog=yes
verbosity=1

Pass an instance of the CatalogResolver class from Apache's XML Commons Resolver to the setEntityResolver(EntityResolver) method, as shown in the builder.setEntityResolver() call below. The Resolver locates and processes the first CatalogManager.properties file it finds on the classpath. (Counterintuitively, to reference the catalog.xml file using a relative URL, set relative-catalogs in the CatalogManager.properties file to no.)

public class XmlParse implements org.xml.sax.ErrorHandler {

  public void error(org.xml.sax.SAXParseException e) {
    System.err.println("XML not valid at line " + e.getLineNumber() +
      ", column " + e.getColumnNumber() + ": " + e.getLocalizedMessage());
  }

  public void fatalError(org.xml.sax.SAXParseException e) {
    System.err.println("XML not well-formed at line " + e.getLineNumber() +
      ", column " + e.getColumnNumber() + ": " + e.getLocalizedMessage());
  }

  public void warning(org.xml.sax.SAXParseException e) {
    System.err.println("XML warning at line " + e.getLineNumber() +
      ", column " + e.getColumnNumber() + ": " + e.getLocalizedMessage());
  }

  XmlParse (java.io.File file) {
    try {

      javax.xml.parsers.DocumentBuilderFactory domFactory =
        javax.xml.parsers.DocumentBuilderFactory.newInstance();
      javax.xml.parsers.DocumentBuilder builder =
        domFactory.newDocumentBuilder();
      builder.setErrorHandler(this);
      builder.setEntityResolver(new
        org.apache.xml.resolver.tools.CatalogResolver());
      org.w3c.dom.Document doc = builder.parse(file);
      System.out.println(doc.getDocumentElement().getNodeName());

    }
    catch (org.xml.sax.SAXException e) {
      e.printStackTrace();
    }
    catch (javax.xml.parsers.ParserConfigurationException e) {
      e.printStackTrace();
    }
    catch (java.io.IOException e) {
      e.printStackTrace();
    }
  }

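  public static void main(String[] args) {
    // Require the XML file name as the first argument.
    if (args.length < 1) {
      System.err.println("Usage: java XmlParse <XML-file>");
      System.exit(1);
    }
    new XmlParse(new java.io.File(args[0]));
  }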

}
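Compiling and running the class could look like the following sketch. The jar location, the configuration directory, and the names RESOLVER_JAR, CONF_DIR, XMLPARSE_CP, and run_xmlparse are all assumptions made for illustration. The point to note is that the classpath names the directory that holds CatalogManager.properties, not the file itself.

```shell
# Hypothetical locations; adjust both to match your system.
RESOLVER_JAR="${RESOLVER_JAR:-/usr/local/java/xml-commons-resolver-1.2/resolver.jar}"
CONF_DIR="${CONF_DIR:-$HOME/xmlconf}"  # directory containing CatalogManager.properties

# Classpath: current directory (XmlParse.class), the resolver jar, and
# the directory that holds the properties file.
XMLPARSE_CP=".:$RESOLVER_JAR:$CONF_DIR"

# run_xmlparse (hypothetical helper): compile XmlParse.java, then parse
# the file named by the first argument.
run_xmlparse() {
  javac -cp "$XMLPARSE_CP" XmlParse.java &&
    java -cp "$XMLPARSE_CP" XmlParse "$1"
}

# Example (not run here):
#   run_xmlparse chapter.xml
```

On Windows, separate classpath entries with semicolons rather than colons, as in the Cygwin script earlier in this document.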

Command Line Interface

VALIDATOR(1)          User Contributed Perl Documentation         VALIDATOR(1)



NAME
       Validator.pl - Validate XML files

SYNOPSIS
        Validator.pl ( --help | --version | --list_extensions ) |
            ([--add_extension <ext>] ... [--remove_extension <ext>] ... ) |
            (<XML-file-or-dir> ... )

DESCRIPTION
       Reads and parses an XML file. If the file has a system ID (either with
       the SYSTEM declaration or as part of a PUBLIC declaration) in its
       document type declaration, the file will be validated.

       -a --add_extension
           Add an extension to the list of extensions Validator uses to
           identify XML files as it recursively processes folders. Repeat the
           flag with each extension you want to add.

       -h --help
           Print the man page.

       -l --list_extensions
           Lists the extensions Validator uses to identify XML files as it
           recursively processes folders.

       -r --remove_extension
           Remove an extension from the list of extensions Validator uses to
           identify XML files as it recursively processes folders. Repeat the
           flag with each extension you want to remove.

       -v --version
           Print the version information for this program.

RETURN VALUE
       Returns 0 if no errors are found in any file; otherwise, returns 1.

ENVIRONMENT
       Processes the http_proxy environment variable for when HTTP access is
       needed. Reads the USERPROFILE environment variable when running on
       Microsoft Windows to help locate the appropriate "Desktop" folder.

AUTHOR
       Robert Crews <rcrews@mac.com>

COPYRIGHT
       Copyright 2005-9 by Robert Crews.



perl v5.10                        2008-12-26                      VALIDATOR(1)
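
Because Validator.pl returns 0 only when no errors are found, scripts can test its exit status directly. The following is a minimal sketch, assuming Validator.pl is on your PATH; validate_tree is a hypothetical function name and VALIDATOR is a hypothetical override variable, neither of which is part of Validator.

```shell
# validate_tree (hypothetical helper): run Validator on the given files or
# folders and report the outcome based on the documented exit status.
# VALIDATOR (hypothetical) overrides the command name; default Validator.pl.
validate_tree() {
  if "${VALIDATOR:-Validator.pl}" "$@"; then
    echo "No XML errors found."
  else
    echo "XML errors found; see the messages above." >&2
    return 1
  fi
}

# Example (not run here): stop a build script when validation fails.
#   validate_tree docs/ || exit 1
```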

Native Support for Public Identifiers

Aside from ease of use, one of Validator’s primary features is its native support for more than 500 common XML document types. This means you can validate most XML documents extremely quickly whether you’re currently connected to the Internet or not.

To put this into perspective, in order to validate an XHTML document, most validators would download the following files before starting a validation and would then download them each again for each subsequent XHTML validation:

Nevertheless, Validator doesn’t know about every XML document type available. To validate a document whose public identifier Validator does not recognize, or a document that has no public identifier, when its DTD is at a URL on the Web, you’ll need to be connected to the Internet. If you’re behind a firewall, identify your Web proxy with the http_proxy environment variable.
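
For command-line use, the proxy can be set in the shell before running Validator. This is a sketch only: the host name and port below are placeholders, not values taken from this guide.

```shell
# proxy.example.com and port 8080 are placeholders; substitute your own proxy.
export http_proxy="http://proxy.example.com:8080"

# Commands started from this shell inherit the setting, e.g.:
#   Validator.pl document.xml
```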

Validator checks files with the following public identifiers from local files, meaning it does not download the DTDs associated with these public identifiers from the Web:

For convenience, Validator also maps the following system identifiers to local files:

Release History

The following entries itemize the major enhancements for each Validator release:

1.4.1 (2009-01-16)
  • Resolved problem that prevented launching on some versions of Mac OS X.
  • Added XHTML Modularization 1.1, XHTML Basic 1.1, and XHTML+RDFa 1.0 FPIs.
  • Reports on checked files to ensure a message appears for each run.
  • Continues processing when encountering a file or directory multiple times due to symbolic linking.
1.4 (2008-12-26)
  • Packaged into an installer for Windows.
  • Enhanced support for XML Schema validation.
  • Added DocBook 5.0 FPIs.
  • Added Relax NG support for DocBook 5.0 documents.
  • Added W3C (ISO) Combined Entities FPI.
  • Added JavaHelp 1.1.3 and 2.0 FPIs.
  • Added *.dita, *.ditamap, *.hs, *.jhm, *.jnlp, *.xht, and *.xhtml to list of XML extensions.
  • Added older Sun 1.0 and 1.1 JNLP FPIs.
  • Reads gzip’d and compress’d single-file documents.
  • Follows symbolic links on Unix-like systems.
  • Skips non-XML Apple property list files.
  • Fixed chdir errors that prevented command-line validation on Linux.
  • Mac version no longer automatically includes app path as the first item on the command line.
    Validator bash function and CVS commitinfo script correspondingly updated in the documentation.
  • Added new file name extensions to the Validator CVS commitinfo script.
  • Fixed "undefined value" occurring for some XHTML a elements.
  • Normalized catalog processing.
  • Updated the documentation.
1.3 (2007-12-20)
  • Fixed error related to XHTML &MultiLength; entities.
  • Added DITA 1.1 FPIs.
  • Added TEI P5 FPIs.
  • Added W3C XML Spec 2.10 FPI.
  • Added Sun JDO and JNLP FPIs.
  • Updated the documentation.
1.2 (2006-10-30)
  • Added command-line interface with documentation.
  • Added checks for deprecated mark-up in XHTML 1.0 Transitional and XHTML 1.0 Frameset documents.
  • Fixed bug relating to changing directories on Linux and Solaris.
  • Added missing “datatypes” section of XML Schema DTD.
  • Added XHTML-Print 1.0 FPIs.
  • Added DocBook 4.5 FPIs.
  • Added Oasis XML Catalogs 1.1 FPI.
  • Updated DITA 1.0 with DITA Document Definitions 1.0.1.
  • Minor updates to the Validator CVS commitinfo script.
  • Updated the documentation.
1.1 (2006-01-24)
1.01 (2005-12-16)
  • Added checks supporting the XHTML restriction that anchor names share the same name space as element IDs.
  • Added SMIL 2.1 and various Sun and Apache FPIs.
  • Fixed typo in error message for the XHTML prohibition against pre elements containing img, object, big, small, sub, or sup elements.
  • Updated the documentation.
1.0 (2005-09-30)
  • Initial release.

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Validator caches unmodified, publicly available DTDs that are covered by their own licenses.