Parse a file and represent it as a tree
The basic piece of code to parse "filename.xml" is:
let config = Pxp_types.default_config
let spec = Pxp_tree_parser.default_spec
let source = Pxp_types.from_file "filename.xml"
let doc = Pxp_tree_parser.parse_document_entity config source spec
As you can see, some defaults are loaded (Pxp_types.default_config
and Pxp_tree_parser.default_spec). These defaults have these effects
(as far as they matter for an introduction):
The top-most node of the tree, doc#root, is the top-most element.
When a file name is passed to from_file, it is immediately converted into a SYSTEM
ID, which is essentially a URL of the form
file:///dir1/.../dirN/filename.xml. This ID can be processed further;
in particular, it is now clear how to treat relative SYSTEM ID's that
occur in the parsed document. For instance, if another file is
included by "filename.xml", and the SYSTEM ID is "parts/part1.xml",
the usual rules for resolving relative URL's say that the effective
file to read is file:///dir1/.../dirN/parts/part1.xml. Relative
SYSTEM ID's are resolved relative to the URL of the file where the
entity reference occurs that leads to the inclusion of the other file
(this is comparable to how hyperlinks in HTML are treated).
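The resolution rule can be illustrated with plain string handling. The following is only a sketch: resolve_relative is a hypothetical helper, not a PXP function, and it covers just the simple case of a relative path without "." or ".." segments.

```ocaml
(* Hypothetical helper, not part of PXP: resolve a plain relative
   reference against the URL of the document that contains it. *)
let resolve_relative ~base rel =
  match String.rindex_opt base '/' with
  | Some i -> String.sub base 0 (i + 1) ^ rel   (* keep the directory part *)
  | None -> rel                                  (* nothing to resolve against *)

let effective =
  resolve_relative
    ~base:"file:///dir1/dirN/filename.xml"
    "parts/part1.xml"
```

Here effective names the file parts/part1.xml inside the directory of the including document.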
Note that we make here some assumptions about the file system of the
computer. Pxp_reader.make_file_url has to deal with character
encodings of file names. It assumes UTF-8 by default. By passing
arguments to this function, other assumptions about the encoding of
file names can be made. Unfortunately, there is no portable way of
determining the character encoding the system uses for file names
(see the hyperlinks at the end of this section).
The returned doc object is of type Pxp_document.document. This type
is used for all regular documents that exist independently. The root
of the node tree is returned by doc#root, which is a
Pxp_document.node. See Intro_trees for more about the tree
representation.
The call Pxp_tree_parser.parse_document_entity does not only parse,
but it also validates the document. This works only if there is a DTD,
and the document conforms to the DTD. There is a weaker criterion for
formal correctness called well-formedness. See below for how to check
only for well-formedness while parsing, without doing the whole
validation.
Links about the file name encoding problem:
open expect file names in UTF-8 encoding.
It is strongly recommended to compile and link with the help of
ocamlfind. For (byte) compiling use one of
ocamlfind ocamlc -package pxp-engine -c file.ml
ocamlfind ocamlc -package pxp -c file.ml
pxp-engine refers to the core library while pxp refers
to an extended version including the various lexers. For compiling, there
is no big difference between the two because the lexers are usually not
directly invoked. However, at link time you need these lexers. You can
choose between using the pre-defined package pxp and a manually selected
combination of pxp-engine with some lexer packages. So for linking
e.g. use one of:
ocamlfind ocamlc -package pxp -linkpkg -o executable ... 
  to get the standard selection of lexers
ocamlfind ocamlc -package pxp-engine,pxp-lex-iso88591,pxp-ulex-utf8 -linkpkg -o executable ... 
  to get lexers for ISO-8859-1 and UTF-8
pxp includes a standard set of lexers, including UTF-8 and many encodings of
the ISO-8859 series. For more about encodings, see Encodings below.
Variations
Catching and printing exceptions
The relevant exceptions are defined in Pxp_types. You can catch
these exceptions (as thrown by the parser) as in:
try ... 
with
  | Pxp_types.Validation_error _
  | Pxp_types.WF_error _
  | Pxp_types.Namespace_error _
  | Pxp_types.Error _
  | Pxp_types.At(_,_) as error ->
      print_endline ("PXP error " ^ Pxp_types.string_of_exn error)
There are more exceptions, but these are usually caught within PXP and converted to one of the mentioned exceptions.
Printing trees in the O'Caml toploop
There are toploop printers for nodes and documents. They are automatically
activated when the findlib directive #require "pxp" is used to load
PXP into the toploop. Alternatively, one can also do
#install_printer Pxp_document.print_node;;
#install_printer Pxp_document.print_doc;;
For example, the tree <x><y>foo</y></x> would be shown as:
  # tree;;
  - : ('a Pxp_document.node Pxp_document.extension as 'a) Pxp_document.node =
  * T_element "x"
    * T_element "y"
      * T_data "foo"
Parsing in well-formedness mode
In well-formedness mode many checks are not performed regarding the
formal integrity of the document. Note that the terms "valid" and
"well-formed" are rigidly defined in the XML standard, and that PXP
strictly tries to conform to the standard. Especially note that the
DOCTYPE clause is not rejected in well-formedness mode and that the
declarations are parsed although interpreted differently.
In order to call the parser in well-formedness mode, call one of the "wf" functions, e.g.
let doc = Pxp_tree_parser.parse_wfdocument_entity config source spec
Details. Even in well-formedness mode there is a DTD object. The DTD object is, however, treated differently:
arbitrary_allowed
  (see Pxp_dtd.dtd.allow_arbitrary). If enabled, as is done in
  well-formedness mode, the DTD reacts specially when a declaration
  is missing so that the parser knows it has to accept that.
  Note that, if one added a declaration programmatically
  to the DTD object, the DTD would find it, and would actually
  validate against it. Effectively, validation is not disabled in
  well-formedness mode; only the constraints imposed by the DTD
  object on the document are weaker. There is in fact a way
  to add declarations in well-formedness mode to get some of the
  effects of validation: this is called the mixed mode.
DOCTYPE clause (if that clause exists).
Validating well-formed trees
It is possible to validate a tree later that was originally only parsed in well-formedness mode.
Of course, there is one obvious difficulty. As mentioned in the
previous section, the DTD object is incompletely built (declarations
of elements, attributes, and notations are ignored), so the DTD object
is not suitable for validating the document against it. For
validation, however, a complete DTD object is required.  The solution
is to replace the DTD object by a different one. As the DTD object is
referenced from all nodes of the tree, and thus intricately connected
with it, the only way to do so is to copy the entire tree. The
function Pxp_marshal.relocate_subtree can be used for this type of
copy operation.
We assume here that we can get the replacement DTD from an external
file, "file.dtd", and that another constraint is that the root
element must be start (as if we had <!DOCTYPE start SYSTEM "file.dtd">).
Also doc is the parsed "filename.xml" file as retrieved by
let config = Pxp_types.default_config
let spec = Pxp_tree_parser.default_spec
let source = Pxp_types.from_file "filename.xml"
let doc = Pxp_tree_parser.parse_wfdocument_entity config source spec
Now the validation against a different DTD is done by:
let rdtd_source = Pxp_types.from_file "file.dtd"
let rdtd = Pxp_dtd_parser.parse_dtd_entity config rdtd_source
let () = rdtd # set_root "start"
let vroot = Pxp_marshal.relocate_subtree doc#root rdtd spec
let () = Pxp_document.validate vroot
let vdoc = new Pxp_document.document config.warner config.encoding
let () = vdoc#init_root vroot doc#raw_root_name
The vdoc document has now the same contents as doc but points to
a different DTD, namely rdtd. Also, the validation checks have been
performed. A few more comments:
The same config is used here for parsing the original document doc
  and the replacement DTD rdtd. This is not strictly required. However,
  the encoding of the in-memory representation must be identical
  (i.e. config.encoding).
Without rdtd#set_root, any root element is allowed.
It is also possible to modify doc before doing the validation,
  or to validate a doc that is not the result of a parser call but
  programmatically created.
Encodings
In PXP, the encoding of the parsed text (the external encoding), and the encoding of the in-memory representation can be distinct. For processing external encodings PXP relies on Ocamlnet. The external encoding is usually indicated in the XML declaration at the beginning of the text, e.g.
<?xml version="1.0" encoding="ISO-8859-2"?>
...
There is also an autorecognition of the external encoding that works for UTF-8 and UTF-16.
It is generally possible to override the external encoding
(e.g. because the file has already been converted but the XML
declaration was not changed at the same time). Some of the from_*
sources allow overriding the encoding directly, e.g. by setting
the fixenc argument when calling Pxp_types.from_channel. Note
that Pxp_types.from_file does not have this option because this source
can read any file, whereas overriding the encoding is only
interesting for certain files. A workaround is to combine from_file
with a catalog of ID's, and to override the encodings for certain
files there. (Catalogs also allow overriding external encodings.
See Specifying sources below for examples using catalogs.)
As mentioned, the encoding of the in-memory representation can be distinct from the external encoding. It is required that every character in the document can be represented in the representation encoding. Because of this, the chosen encoding should be a superset of all external encodings that may occur. If you choose UTF-8 for the representation every character can be represented anyway.
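To see why the representation encoding matters, consider a check for ISO-8859-1, which covers exactly the code points 0 to 255. The sketch below is independent of PXP: fits_latin1 is a hypothetical name, and the text is modelled simply as a list of Unicode code points.

```ocaml
(* ISO-8859-1 can represent exactly the code points U+0000..U+00FF. *)
let fits_latin1 code_points =
  List.for_all (fun cp -> cp >= 0 && cp <= 0xFF) code_points

let plain = fits_latin1 [0x66; 0xFC; 0x72]   (* the letters f, u-umlaut, r *)
let euro  = fits_latin1 [0x20AC]             (* the euro sign *)
```

A text containing the euro sign could therefore not be held in an ISO-8859-1 representation, whereas a UTF-8 representation has no such limit.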
You set the representation encoding in the config record, e.g.
let config =
  { Pxp_types.default_config
      with encoding = `Enc_utf8
  }
It is strictly required that only a single encoding is used in a document (and PXP also checks that).
The available encodings for the in-memory representation are a subset of the encodings supported by Ocamlnet. Effectively, UTF-8 is supported and a number of 8-bit encodings as far as they are ASCII-compatible (i.e. extensions of 7 bit ASCII).
For every representation encoding PXP needs a different lexer. PXP already comes with a set of lexers for the supported encodings. However, at link time the user program must ensure that the lexer is linked into the executable. The lexers are available as separate findlib packages:
pxp-ulex-utf8: This is the standard lexer for UTF-8.
pxp-wlex-utf8: This is the old, wlex-based lexer for UTF-8. It is not
  built when ulex is available.
pxp-lex-utf8: This is the old, ocamllex-based lexer for UTF-8.
  It is slightly faster than pxp-ulex-utf8, but consumes a lot more
  memory.
pxp-lex-*: These are lexers for various 8 bit character sets.
Event parser (push/pull parsing)
It is sometimes not desirable to represent the parsed XML data as a tree. An important reason is that the amount of data would exceed the available memory resources. Another reason may be to combine XML parsing with a custom grammar. In order to support this, PXP can be called as an event parser. Basically, PXP emits events (tokens) while parsing certain syntax elements, and the caller of PXP processes these events. This mode can only be used together with well-formedness mode, since validation requires the tree representation.
Here we show how to parse "filename.xml" with a pull parser:
let config = Pxp_types.default_config
let source = Pxp_types.from_file "filename.xml"
let entmng = Pxp_ev_parser.create_entity_manager config source
let entry = `Entry_document []
let next = Pxp_ev_parser.create_pull_parser config entry entmng
Now, one can call next() repeatedly to get one event after the other.
The events have type Pxp_types.event option.
More about event parsing can be found in Intro_events.
Low-profile trees
When the tree classes in Pxp_document are too much overhead,
it is easily possible to define a specially crafted tree data type, and
to transform the event-parsed document into such trees. For example,
consider this cute definition:
type tree =
  | Element of string * (string * string) list * tree list
  | Data of string
A tree node is either an Element(name,atts,children) or a 
Data(text) node. Now we event-parse the XML file:
let config = Pxp_types.default_config
let source = Pxp_types.from_file "filename.xml"
let entmng = Pxp_ev_parser.create_entity_manager config source
let entry = `Entry_document []
let next = Pxp_ev_parser.create_pull_parser config entry entmng
Finally, here is a function build_tree that calls the next function to
build our low-profile tree:
let rec build_tree() =
  match next() with
    | Some (E_start_tag(name,atts,_,_)) ->
        let children = build_children [] in
        let tree = Element(name,atts,children) in
        skip_rest();
        tree
    | Some (E_error e) ->
        raise e
    | Some _ ->
        build_tree()
    | None ->
        assert false     
and build_node() =
  match next() with
    | Some (E_char_data data) ->
        Some(Data data)
    | Some (E_start_tag(name,atts,_,_)) ->
        let children = build_children [] in
        Some(Element(name,atts,children))
    | Some (E_end_tag(_,_)) ->
        None
    | Some (E_error e) ->
        raise e
    | Some _ ->
        build_node()
    | None ->
        assert false
and build_children l =
  match build_node() with
    | Some n -> build_children (n :: l)
    | None -> List.rev l
    
and skip_rest() =
  match next() with
    | Some E_end_of_stream ->
        ()
    | Some (E_error e) ->
        raise e
    | Some _ ->
        skip_rest()
    | None ->
        assert false
Of course, all of this is only reasonable for well-formedness mode,
as PXP's validation routines depend on the built-in tree representation
of Pxp_document.
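The recursive descent used by such a builder can be tried out without PXP. The sketch below is an illustration under stated assumptions: the event type is a simplified stand-in with fewer constructor arguments than the real Pxp_types.event, and make_next is a hypothetical helper that replays a fixed list of events the way a pull parser would deliver them.

```ocaml
(* Simplified stand-in for the PXP event type: only the constructors the
   builder needs, with fewer arguments than the real ones. *)
type event =
  | E_start_tag of string * (string * string) list
  | E_end_tag of string
  | E_char_data of string
  | E_end_of_stream

type tree =
  | Element of string * (string * string) list * tree list
  | Data of string

(* Hypothetical helper: turn a list of events into a pull function.
   Each call yields the next event; after the last one it yields None. *)
let make_next events =
  let rest = ref events in
  fun () ->
    match !rest with
    | [] -> None
    | e :: tl -> rest := tl; Some e

let build_tree next =
  (* node () reads one child; None means the enclosing end tag was seen *)
  let rec node () =
    match next () with
    | Some (E_char_data s) -> Some (Data s)
    | Some (E_start_tag (name, atts)) ->
        Some (Element (name, atts, children []))
    | Some (E_end_tag _) -> None
    | Some E_end_of_stream | None -> failwith "unexpected end of events"
  and children acc =
    match node () with
    | Some n -> children (n :: acc)
    | None -> List.rev acc
  in
  match next () with
  | Some (E_start_tag (name, atts)) -> Element (name, atts, children [])
  | _ -> failwith "expected a start tag"

(* The events for <x><y>foo</y></x> *)
let example =
  build_tree
    (make_next
       [ E_start_tag ("x", []);
         E_start_tag ("y", []);
         E_char_data "foo";
         E_end_tag "y";
         E_end_tag "x";
         E_end_of_stream ])
```

Passing next as an argument, instead of referring to a global definition, makes it easy to run the builder on canned event lists in tests.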
Choosing the node types to represent
By default, PXP only represents element and data nodes (both in the normal tree representation and in the event stream). It is possible to enable more node types:
Comments: In the tree, the node type T_comment is used for them.
  In the event stream, the event type E_comment is used.
Processing instructions: In the tree, the T_pinstr node type
  is used, and in the event stream, the event type E_pinstr is
  used.
Super root node: In the tree, the T_super_root
  node type is used, and in the event stream, the event type E_start_super
  marks the beginning of this node, and E_end_super marks the end of
  this node.
These node types are enabled in the config record, e.g.
let config =
  { Pxp_types.default_config
      with enable_comment_nodes = true;
           enable_pinstr_nodes = true;
           enable_super_root_node = true 
  }
Note that the "super root node" is sometimes called "root node" in various XML standards that define a semantic model of XML. For PXP the name "super root node" is preferred because this node type is not obligatory, and the top-most element node can also be considered the root of the tree.
Controlling whitespace
Depending on the mode, PXP applies some automatic whitespace rules. The user can call functions to reduce whitespace even more.
In validating mode, there are whitespace rules for data nodes and
for attributes (the latter below). In this mode it is possible that an
element x is declared such that a regular expression describes the
permitted children.  For instance,
 <!ELEMENT x (y,z)> 
is such a declaration, meaning that x may only have y and z
as children, exactly in this order, as in
 <x><y>why</y><z>zet</z></x> 
XML, however, allows whitespace to be added to make such terms more readable, as in
 
<x>
  <y>why</y>
  <z>zet</z>
</x> 
The additional whitespace should not, however, appear as children of
node x, because it is considered as a purely notational improvement
without impact on semantics. By default, PXP does not create data nodes
for such notational whitespace. It is possible to disable the
suppression of this type of whitespace by setting
drop_ignorable_whitespace to false:
  let config =
    { Pxp_types.default_config 
        with drop_ignorable_whitespace = false
    }
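The effect of dropping ignorable whitespace can be sketched on a small stand-alone tree type. This is only an illustration: the tree type is not the PXP node class, and element_only is a hypothetical predicate standing in for the DTD lookup that tells whether an element is declared with element-only content.

```ocaml
type tree = Element of string * tree list | Data of string

let is_xml_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

(* true if the string consists of XML whitespace only *)
let is_blank s =
  let blank = ref true in
  String.iter (fun c -> if not (is_xml_space c) then blank := false) s;
  !blank

(* Remove whitespace-only data children, but only below elements for
   which element_only (a stand-in for the DTD) returns true. *)
let rec drop_ignorable ~element_only t =
  match t with
  | Data _ -> t
  | Element (name, kids) ->
      let kids =
        if element_only name then
          List.filter
            (function Data s -> not (is_blank s) | Element _ -> true)
            kids
        else kids
      in
      Element (name, List.map (drop_ignorable ~element_only) kids)

let pretty =
  Element ("x",
    [ Data "\n  "; Element ("y", [Data "why"]);
      Data "\n  "; Element ("z", [Data "zet"]); Data "\n" ])

let compact = drop_ignorable ~element_only:(fun name -> name = "x") pretty
```

After the call, compact has exactly the two children y and z, as if the document had been written without line breaks.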
In well-formedness mode, there is no such feature because element declarations are ignored.
Note that although in event mode the parser is restricted to
well-formedness parsing, it is still possible to get the effect of
drop_ignorable_whitespace. See
Pxp_event.drop_ignorable_whitespace_filter for how to selectively
enable this validation feature.
The other whitespace rules apply to attributes. In all modes line
breaks in attribute values are converted to spaces. That means a1
and a2 have identical values:
<x a1="1 2" a2="1
2" a3="1&#10;2"/>
It is possible to suppress this conversion by using &#10; as line
separator, as in a3, which truly includes a line-feed character.
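The conversion of literal line breaks can be mimicked with a one-line function. This is a simplified sketch and not PXP code: normalize_attr is a hypothetical name, a CR LF pair would yield two spaces here although XML treats it as a single line break, and character references are not modelled at all.

```ocaml
(* Replace every literal tab, line feed, or carriage return by a space. *)
let normalize_attr s =
  String.map (function '\t' | '\n' | '\r' -> ' ' | c -> c) s

let a1 = normalize_attr "1 2"
let a2 = normalize_attr "1\n2"
```

Both a1 and a2 end up as the same three-character string.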
In validating mode only there are more rules because attributes
are declared. If the attribute is declared with a list value
(IDREFS, ENTITIES, or NMTOKENS), any amount of whitespace can
be used to separate the list elements. PXP returns the value as
Valuelist l where l is an O'Caml list of strings.
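How a list-valued attribute turns into a list of strings can be sketched as follows. The helper split_tokens is hypothetical and only illustrates the tokenization; it is not the function PXP uses internally.

```ocaml
let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

(* Split a list-valued attribute into its tokens, ignoring the amount
   of whitespace between them. *)
let split_tokens s =
  String.map (fun c -> if is_space c then ' ' else c) s
  |> String.split_on_char ' '
  |> List.filter (fun tok -> tok <> "")

let refs = split_tokens "  id1   id2\n id3 "
```

The resulting list corresponds to the l in Valuelist l.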
If the tree representation is chosen, the function
Pxp_document.strip_whitespace can be called to reduce the amount
of whitespace in data nodes.
Checking the ID consistency and looking up nodes by ID
In XML it is possible to identify elements by giving them an ID
attribute. This requires a DTD, and could be done with declarations
like
  <!ATTLIST x id ID #REQUIRED>
meaning that element x has a mandatory attribute id with the special
ID property: Every node must have a unique id value.
In the same context, it is possible to declare attributes as references
to other nodes, expressed by denoting the id of the other node:
  <!ATTLIST y r IDREF #IMPLIED>
Here, the (optional) attribute r of y is a reference to another node.
It is only allowed to put identifiers into such attributes that also
occur in the ID of another node.
By default, PXP checks neither the uniqueness of ID-declared
attributes nor the existence of the nodes referenced by IDREF-declared
attributes. In tree mode, it is possible to enable that, however.
For that purpose, one has to create a Pxp_tree_parser.index. If
passed to the parser function, the parser adds the ID-values of all
nodes to the index, and checks whether every ID value is unique.
Additionally, when one enables the idref_pass the parser also checks
whether IDREF attributes only point to existing nodes. The code:
let config = { Pxp_types.default_config with idref_pass = true }
let spec = Pxp_tree_parser.default_spec
let source = Pxp_types.from_file "filename.xml"
let hash_index = new Pxp_tree_parser.hash_index
let id_index = (hash_index :> _ Pxp_tree_parser.index)
let doc = Pxp_tree_parser.parse_document_entity ~id_index config source spec
The difference between hash_index and id_index is that the former
object has one additional method index returning the whole index.
The id_index may also be useful after the document has been parsed.
The code processing the parsed document can take advantage of it by
looking up nodes in it. For example, to find the node identified
by "foo", one can call
 id_index # find "foo" 
which either returns this node, or raises Not_found.
Note that the id_index is not automatically updated when the parsed
tree is modified.
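The two checks can be pictured with a small stand-alone sketch. It is not the PXP implementation: check_ids is a hypothetical function that works on plain lists of attribute values, collecting all IDs first and checking the references in a second pass.

```ocaml
(* ids:    the values of all ID attributes, in document order
   idrefs: the values of all IDREF attributes
   Returns the duplicated IDs and the dangling references. *)
let check_ids ~ids ~idrefs =
  let index = Hashtbl.create 50 in
  let duplicates =
    List.filter
      (fun id ->
         if Hashtbl.mem index id then true
         else (Hashtbl.add index id (); false))
      ids
  in
  (* second pass: every reference must point to a collected ID *)
  let dangling = List.filter (fun r -> not (Hashtbl.mem index r)) idrefs in
  (duplicates, dangling)

let result = check_ids ~ids:["a"; "b"; "a"] ~idrefs:["b"; "c"]
```

Here the second "a" is reported as a duplicate and "c" as a reference without target. Checking the references only after all IDs are known is what allows forward references.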
Finding nodes by element names
While we are at it: PXP does not maintain indexes of any kind. Unlike in other tree representations, there is no index of elements that would help one to quickly find elements by their names. The reason for this omission is that such indexes need to be updated when the tree is modified, and these updates can be quite expensive operations.
The ID index explained in the last section is not automatically
updated, and it has only been added to comply fully with the XML
standard (which demands ID checking).
Nevertheless, one can easily define indexes of one's own (and for the advanced programmer it might be an interesting task to develop an extension module to PXP that generically solves this problem). For instance, here is an index of elements:
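  let index = Hashtbl.create 50
  let () =
    Pxp_document.iter_tree
      ~pre:(fun node ->
               (* the node type, not the node object, carries the element name *)
               match node # node_type with
                 | Pxp_document.T_element name -> Hashtbl.add index name node
                 | _ -> ()
           )
      doc#root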
Now, Hashtbl.find can be used to get the last occurrence, and
Hashtbl.find_all to get all occurrences.
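The behaviour of Hashtbl.find and Hashtbl.find_all with repeated keys can be checked on a small stand-alone example. The tree type and iter_pre below are simplified stand-ins chosen for illustration, not the PXP node class or Pxp_document.iter_tree.

```ocaml
type tree = Element of string * tree list | Data of string

(* visit every node in document order (parents before children) *)
let rec iter_pre f t =
  f t;
  match t with
  | Element (_, kids) -> List.iter (iter_pre f) kids
  | Data _ -> ()

let doc =
  Element ("x",
    [ Element ("y", [Data "a"]);
      Element ("z", []);
      Element ("y", [Data "b"]) ])

let index : (string, tree) Hashtbl.t = Hashtbl.create 50

let () =
  iter_pre
    (function
      | Element (name, _) as node -> Hashtbl.add index name node
      | Data _ -> ())
    doc

let last_y = Hashtbl.find index "y"      (* the most recently added y *)
let all_y  = Hashtbl.find_all index "y"  (* all y, most recent first *)
```

Because Hashtbl.add shadows earlier bindings instead of replacing them, find_all returns the elements in reverse document order.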
If it is not worth-while to build an index, one can also call
the functions Pxp_document.find_element and 
Pxp_document.find_all_elements, but these functions rely on
linear searching.
Specifying sources
The Pxp_types.source says where the data to parse comes from.  The
task of the source is more complex than it looks at first glance,
as it not only says where the initially parsed entity comes from, but
also from where further entities can be loaded that are referenced and
included by the first one.
The mentioned function Pxp_types.from_file allows that all files
can be opened as entities, and maps the SYSTEM identifiers to file
names. It is very powerful.
There are three more from_* functions:
Pxp_types.from_string gets the data from a string
Pxp_types.from_channel gets the data from an in_channel
Pxp_types.from_obj_channel gets the data from an in_obj_channel
  (an Ocamlnet definition)
These functions differ from from_file in so far as only
one entity can be parsed at all (unless one passes alternate resolvers
to them). This means it is not possible that the initially parsed
entity includes data from another entity. Example code:
 let source = Pxp_types.from_string "<?xml version='1.0'?><foo/>" 
So the source mechanism has these limitations:
The Pxp_types.from_file function allows one to read from all
  files by using SYSTEM URL's of the form file:///path. It is
  not possible to restrict the file access in any way. There is no support
  for PUBLIC identifiers.
The other from_* functions, such as Pxp_types.from_string, allow one to
  parse data coming from everywhere, but it is not possible to access
  any files (as it is not possible to open any further external entity).
These limitations can be overcome with the Pxp_reader module and its very powerful abstraction
called Pxp_reader.resolver. There are resolvers for files, for
alternate resources like data channels, and there is the possibility
of building more complex resolvers by composing simpler ones.
Please see Pxp_reader and Intro_resolution for deeper explanations.
Here are the most important recipes to use this advanced mechanism:
Read from files, and define a catalog of exceptions:
let catalog =
 new Pxp_reader.lookup_id_as_file
  [ System("http://foo.org/our.dtd"), "/usr/share/foo.org/out.dtd";
    Public("-//W3C//DTD XHTML 1.0 Strict//EN",""), "/home/stuff/xhtml_strict.dtd"
  ]
let source = Pxp_types.from_file ~alt:[catalog] "filename.xml"
This allows one to open all local files using the file:///path 
URL's, but also maps the SYSTEM ID "http://foo.org/our.dtd" and
the PUBLIC ID "-//W3C//DTD XHTML 1.0 Strict//EN" to local files.
There is also Pxp_reader.lookup_id_as_string mapping to strings.
Read from files, but restrict access, and map URL's
let resolver =
  new Pxp_reader.rewrite_system_id
    [ "http://foo.org/", "file:///usr/share/foo.org";
      "file:///", "file:///home/stuff/localxml"
    ]
    (new Pxp_reader.resolve_as_file())
let file_url = Pxp_reader.make_file_url "filename.xml"
let source = ExtID(System(Neturl.string_of_url file_url), resolver)
This allows one to open entities from the whole http://foo.org/
hierarchy, but the data is not downloaded by HTTP, but instead
assumed to reside in the local directory hierarchy 
/usr/share/foo.org. Also, the whole file:/// hierarchy is
re-rooted to /home/stuff/localxml. As the URL's are normalized
before any access is tried, this scheme provides access protection
to other parts of the file system (i.e. one cannot escape from the
new root by "..").
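Why normalizing before rewriting gives this protection can be shown with a small sketch on plain paths. It makes no claim about how Pxp_reader.rewrite_system_id is implemented: normalize_path and reroot are hypothetical helpers that work on absolute paths only.

```ocaml
(* Resolve "." and ".." in an absolute path. A ".." at the root stays
   at the root, so the result never points above "/". *)
let normalize_path path =
  let step acc seg =
    match seg, acc with
    | ("" | "."), _ -> acc
    | "..", [] -> []
    | "..", _ :: up -> up
    | s, _ -> s :: acc
  in
  let segs = List.fold_left step [] (String.split_on_char '/' path) in
  "/" ^ String.concat "/" (List.rev segs)

(* Prepend the new root only after the path has been normalized. *)
let reroot ~new_root path = new_root ^ normalize_path path

let ordinary = reroot ~new_root:"/home/stuff/localxml" "/docs/a.xml"
let escape   = reroot ~new_root:"/home/stuff/localxml" "/docs/../../etc/passwd"
```

The second path tries to climb out with "..", but because the segments are resolved before the prefix is added, it still ends up below /home/stuff/localxml.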
In order to combine with a catalog as defined above, use
let resolver =
  new Pxp_reader.combine
    [ catalog;
      new Pxp_reader.rewrite_system_id ...
    ]
Virtual entity hierarchy
Given we have the three identifiers
http://virtual.com/f1.xml
http://virtual.com/f2.xml
http://virtual.com/f3.xml
as SYSTEM ID's,
and we have O'Caml strings f1_xml, f2_xml, and f3_xml with the
contents, we want to make the virtual.com hierarchy available
while parsing from a string s.
let resolver =
  new Pxp_reader.norm_system_id
    (new Pxp_reader.lookup_id_as_string
       [ System "http://virtual.com/f1.xml", f1_xml;
         System "http://virtual.com/f2.xml", f2_xml;
         System "http://virtual.com/f3.xml", f3_xml
       ]
    )
let source = Pxp_types.from_string ~alt:[resolver] s
The trick is Pxp_reader.norm_system_id. This class makes it possible
that these three enumerated documents can refer to each other by relative
URL. Without the SYSTEM ID normalization, these documents can only be
opened when exactly the URL is referenced that is also mentioned in the
catalog.
Embedding large constant XML in source code
Sometimes one needs to embed XML files into source code. For small files this is no problem at all, just define them as string literals
let s = "<?xml?> ..."
and parse the strings on demand, using the Pxp_types.from_string
source. For larger files, the disadvantage of this approach is that
the whole document has to be parsed again for every run of the
program. There is an efficient way of avoiding that.
The Pxp_codewriter module provides a function 
Pxp_codewriter.write_document that takes an already parsed XML tree
and writes O'Caml code as output that will create the tree again when
executed. This can be used as follows:
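Write a helper program generate that parses the XML file with
  the required configuration options and that outputs the O'Caml code
  for this file using Pxp_codewriter.
Run generate as part of the build, and compile the emitted O'Caml code
  into the program.
There is also Pxp_marshal for marshalling XML trees. The codewriter
module uses it.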
Using the preprocessor to create XML trees
One way of creating XML trees programmatically is to call the create_*
functions in Pxp_document, e.g. Pxp_document.create_element_node.
However, this looks ugly, e.g. for creating <x><y>foo</y></x> one ends
up with
let tree =
  Pxp_document.create_element_node spec dtd "x" []
let y =
  Pxp_document.create_element_node spec dtd "y" []
let data =
  Pxp_document.create_data_node spec dtd "foo"
let () =
  y # append_node data;
  tree # append_node y
It is easier to use the PXP preprocessor, a camlp4 extension of the O'Caml syntax. It simplifies the above code to (line breaks are optional):
  let tree =
    <:pxp_tree<
      <x>
        <y>
          "foo"
    >>
For more about the preprocessor, see Intro_preprocessor.
Namespaces
PXP supports namespaces, but they have to be enabled in the
config record:
let m = Pxp_dtd.create_namespace_manager()
let config =
  { Pxp_types.default_config
      with enable_namespace_processing = Some m
  }
In event mode, this is already enough. In tree mode, you also need to direct PXP to use the special namespace-enabled node classes:
let spec = Pxp_tree_parser.default_namespace_spec
Of course, PXP can also parse namespace directives when namespace
processing is off. However, none of the namespace-specific node methods,
such as Pxp_document.node.namespace_uri, work in that case.
Prefix normalization. PXP implements a technique called prefix
normalization when processing namespaces. The namespace prefix is
the part before the colon in element and attribute names like
prefix:localname. The prefix is changed in the document so every
namespace is uniquely identified by a prefix. Note that this means
that the elements and attributes may be renamed by the parser.
For details how the prefix normalization works, see Intro_namespaces.
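The idea of prefix normalization can be sketched independently of PXP. The policy used here is an assumption made for illustration (the first prefix seen for a namespace URI becomes its normalized prefix, and a number is appended if that prefix is already taken); norm_prefix and rename are hypothetical names, and the actual rules are those described in Intro_namespaces.

```ocaml
(* namespace URI -> normalized prefix *)
let by_uri : (string, string) Hashtbl.t = Hashtbl.create 16
(* normalized prefixes that are already in use *)
let taken : (string, unit) Hashtbl.t = Hashtbl.create 16

(* Assumed policy: the first prefix seen for a URI wins; if it is
   already used for another URI, append the smallest free number. *)
let norm_prefix ~prefix ~uri =
  match Hashtbl.find_opt by_uri uri with
  | Some p -> p
  | None ->
      let rec fresh n =
        let cand = if n = 0 then prefix else prefix ^ string_of_int n in
        if Hashtbl.mem taken cand then fresh (n + 1) else cand
      in
      let p = fresh 0 in
      Hashtbl.add by_uri uri p;
      Hashtbl.add taken p ();
      p

(* Rename prefix:local according to the namespace the prefix is bound to. *)
let rename ~prefix ~uri local = norm_prefix ~prefix ~uri ^ ":" ^ local

let n1 = rename ~prefix:"a" ~uri:"urn:one" "x"
let n2 = rename ~prefix:"b" ~uri:"urn:one" "y"   (* same namespace as n1 *)
let n3 = rename ~prefix:"a" ~uri:"urn:two" "z"   (* same prefix, other namespace *)
```

After normalization, names in the same namespace share one prefix, and a prefix never stands for two namespaces, which is why names may come out of the parser different from how they were written.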
Namespace processing can also be combined with event-oriented
parsing, see Events and namespaces.
Specifying which classes implement nodes - the mysterious spec parameter
For the tree representation PXP defines a set of classes implementing
the various node types. These classes, such as element_impl, are
all defined in Pxp_document.
It is now possible to instruct PXP to use different classes. In the last section we have already seen an example of this, because for namespace-enabled parsing a different set of node classes is used:
let spec = Pxp_tree_parser.default_namespace_spec
The mysterious spec parameter controls which class it uses for
which node type. In the source code of Pxp_tree_parser, we find
let default_spec =
  make_spec_from_mapping
    ~super_root_exemplar:      (new super_root_impl default_extension)
    ~comment_exemplar:         (new comment_impl default_extension)
    ~default_pinstr_exemplar:  (new pinstr_impl default_extension)
    ~data_exemplar:            (new data_impl default_extension)
    ~default_element_exemplar: (new element_impl default_extension)
    ~element_mapping:          (Hashtbl.create 1)
    ()
let default_namespace_spec =
  make_spec_from_mapping
    ~super_root_exemplar:      (new super_root_impl default_extension)
    ~comment_exemplar:         (new comment_impl default_extension)
    ~default_pinstr_exemplar:  (new pinstr_impl default_extension)
    ~data_exemplar:            (new data_impl default_extension)
    ~default_element_exemplar: (new namespace_element_impl default_extension)
    ~element_mapping:          (Hashtbl.create 1)
    ()
The function Pxp_document.make_spec_from_mapping creates a spec
from a set of constructors. In the namespace version of spec, the
only difference is that a special implementation for element nodes is
used.
One can also use this mechanism to let the parser create trees made of
customized classes. Note, however, that it is not possible to simply
create new classes by inheriting from the predefined classes and then
adding new methods. The problem is that the typing constraints of PXP
do not allow that users add methods directly to node classes. However,
there is a special extension mechanism built-in, and one can use it to
add new methods indirectly to nodes. This means these methods do not
appear directly in the class type of nodes, but in the class type of
the node extension. See Intro_extensions for more about this.
What PXP cannot do for you
Although PXP has a long list of features, there are some types of XML parsing it is not designed for:
Entity references cannot be left unresolved: wherever the text contains
  &entity; or %entity;, PXP replaces it with the definition
  of that entity. It is an error if the entity turns out to be undefined,
  and parsing is stopped with an exception.