xml4h: XML for Humans in Python¶
xml4h is an MIT licensed library for Python to make it easier to work with XML.
This library exists because Python is awesome, XML is everywhere, and combining the two should be a pleasure but often is not. With xml4h, it can be easy.
As of version 1.0 xml4h supports Python versions 2.7 and 3.5+.
Features¶
xml4h is a simplification layer over existing Python XML processing libraries such as lxml, ElementTree and the minidom. It provides:
- a rich pythonic API to traverse and manipulate the XML DOM.
- a document builder to simply and safely construct complex documents with minimal code.
- a writer that serialises XML documents with the structure and format that you expect, unlike the machine- but not human-friendly output you tend to get from other libraries.
The xml4h abstraction layer also offers some other benefits, beyond a nice API and tool set:
- A common interface to different underlying XML libraries, so code written against xml4h need not be rewritten if you switch implementations.
- You can easily move between xml4h and the underlying implementation: parse your document using the fastest implementation, manipulate the DOM with human-friendly code using xml4h, then get back to the underlying implementation if you need to.
Installation¶
Install xml4h with pip:
$ pip install xml4h
Or install the tarball manually with:
$ python setup.py install
Links¶
- GitHub for source code and issues: https://github.com/jmurty/xml4h
- ReadTheDocs for documentation: https://xml4h.readthedocs.org
- Install from the Python Package Index: https://pypi.python.org/pypi/xml4h
Introduction¶
With xml4h you can easily parse XML files and access their data.
Let’s start with an example XML document:
$ cat tests/data/monty_python_films.xml
<MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python">
<Film year="1971">
<Title>And Now for Something Completely Different</Title>
<Description>
A collection of sketches from the first and second TV series of
Monty Python's Flying Circus purposely re-enacted and shot for film.
</Description>
</Film>
<Film year="1974">
<Title>Monty Python and the Holy Grail</Title>
<Description>
King Arthur and his knights embark on a low-budget search for
the Holy Grail, encountering humorous obstacles along the way.
Some of these turned into standalone sketches.
</Description>
</Film>
<Film year="1979">
<Title>Monty Python's Life of Brian</Title>
<Description>
Brian is born on the first Christmas, in the stable next to
Jesus'. He spends his life being mistaken for a messiah.
</Description>
</Film>
<... more Film elements here ...>
</MontyPythonFilms>
With xml4h you can parse the XML file and use “magical” element and attribute lookups to read data:
>>> import xml4h
>>> doc = xml4h.parse('tests/data/monty_python_films.xml')
>>> for film in doc.MontyPythonFilms.Film[:3]:
...     print(film['year'] + ' : ' + film.Title.text)
1971 : And Now for Something Completely Different
1974 : Monty Python and the Holy Grail
1979 : Monty Python's Life of Brian
You can also use more explicit (non-magical) methods to traverse the DOM:
>>> for film in doc.child('MontyPythonFilms').children('Film')[:3]:
...     print(film.attributes['year'] + ' : ' + film.children.first.text)
1971 : And Now for Something Completely Different
1974 : Monty Python and the Holy Grail
1979 : Monty Python's Life of Brian
The xml4h builder makes programmatic document creation simple, with a method-chaining feature that allows for expressive but sparse code that mirrors the document itself. Here is the code to build part of the above XML document:
>>> b = (xml4h.build('MontyPythonFilms')
... .attributes({'source': 'http://en.wikipedia.org/wiki/Monty_Python'})
... .element('Film')
... .attributes({'year': 1971})
... .element('Title')
... .text('And Now for Something Completely Different')
... .up()
... .elem('Description').t(
... "A collection of sketches from the first and second TV"
... " series of Monty Python's Flying Circus purposely"
... " re-enacted and shot for film."
... ).up()
... .up()
... )
>>> # A builder object can be re-used, and has short method aliases
>>> b = (b.e('Film')
... .attrs(year=1974)
... .e('Title').t('Monty Python and the Holy Grail').up()
... .e('Description').t(
... "King Arthur and his knights embark on a low-budget search"
... " for the Holy Grail, encountering humorous obstacles along"
... " the way. Some of these turned into standalone sketches."
... ).up()
... .up()
... )
Pretty-print your XML document with xml4h’s writer implementation, which provides methods to write content to a stream or to get the content as text, with flexible formatting options:
>>> print(b.xml_doc(indent=4, newline=True))
<?xml version="1.0" encoding="utf-8"?>
<MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python">
<Film year="1971">
<Title>And Now for Something Completely Different</Title>
<Description>A collection of sketches from ...</Description>
</Film>
<Film year="1974">
<Title>Monty Python and the Holy Grail</Title>
<Description>King Arthur and his knights embark ...</Description>
</Film>
</MontyPythonFilms>
Why use xml4h?¶
Python has three popular libraries for working with XML, none of which are particularly easy to use:
- xml.dom.minidom is a light-weight, moderately-featured implementation of the W3C DOM that is included in the standard library. Unfortunately the W3C DOM API is verbose, clumsy, and not very pythonic, and the minidom does not support XPath expressions.
- xml.etree.ElementTree is a fast hierarchical data container that is included in the standard library and can be used to represent XML, mostly. The API is fairly pythonic and supports some basic XPath features, but it lacks some DOM traversal niceties you might expect (e.g. to get an element’s parent) and when using it you often feel like you’re working with something subtly different from XML, because you are.
- lxml is a fast, full-featured XML library with an API based on ElementTree but extended. It is your best choice for doing serious work with XML in Python but it is not included in the standard library, it can be difficult to install, and it gives you the same it’s-XML-but-not-quite feeling as its ElementTree forebear.
Given these three options it can be difficult to choose which library to use, especially if you’re new to XML processing in Python and haven’t already used (struggled with) any of them.
In the past your best bet would have been to go with lxml for the most flexibility, even though it might be overkill, because at least then you wouldn’t have to rewrite your code if you later find you need XPath support or powerful DOM traversal methods.
This is where xml4h comes in. It provides an abstraction layer over the existing XML libraries, taking advantage of their power while offering an improved API and tool set.
Development Status: beta¶
Currently xml4h includes adapter implementations for three of the main XML processing Python libraries.
If you have lxml available (highly recommended) it will use that; otherwise it will fall back to (c)ElementTree and then to minidom.
History¶
1.0¶
- Add support for Python 3 (3.5+)
- Dropped support for Python versions before 2.7.
- Fix node namespace prefix values for lxml adapter.
- Improve builder’s up() method to accept and distinguish between a count of parents to step up, or the name of a target ancestor node.
- Add xml() and xml_doc() methods to document builder to more easily get string content from it, without resorting to the write methods.
- The write() and write_doc() methods no longer send output to sys.stdout by default. The user must explicitly provide a target writer object, and hopefully be more mindful of the need to set up encoding correctly when providing a text stream object.
- Handling of redundant Element namespace prefixes is now more consistent: we always strip the prefix when the element has an xmlns attribute defining the same namespace URI.
0.2.0¶
- Add adapter for the (c)ElementTree library versions included as standard with Python 2.7+.
- Improved “magical” node traversal to work with lowercase tag names without always needing a trailing underscore. See also improved docs.
- Fixes for: potential errors ASCII-encoding nodes as strings; default XPath namespace from document node; lookup precedence of xmlns attributes.
0.1.0¶
- Initial alpha release with support for lxml and minidom libraries.
User Guide¶
Parser¶
The xml4h parser is a simple wrapper around the parser provided by an underlying XML library implementation.
Parse function¶
To parse XML documents with xml4h you feed the xml4h.parse() function an XML text document in one of three forms:
A file-like object:
>>> import xml4h
>>> xml_file = open('tests/data/monty_python_films.xml', 'rb')
>>> doc = xml4h.parse(xml_file)
>>> doc.MontyPythonFilms
<xml4h.nodes.Element: "MontyPythonFilms">
A file path string:
>>> doc = xml4h.parse('tests/data/monty_python_films.xml')
>>> doc.root['source']
'http://en.wikipedia.org/wiki/Monty_Python'
A string containing literal XML content:
>>> xml_file = open('tests/data/monty_python_films.xml', 'rb')
>>> xml_text = xml_file.read()
>>> doc = xml4h.parse(xml_text)
>>> len(doc.find('Film'))
7
Note
The parse() method distinguishes between a file path string and an XML text string by looking for a < character in the value.
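The check described in this note can be sketched in a few lines. This is an illustration only: looks_like_xml_text is a hypothetical helper name, not part of xml4h, and the library's real check may differ in detail.

```python
def looks_like_xml_text(value):
    """Guess whether a str or bytes value is literal XML rather than a file path."""
    # Use a marker of the same type as the value so the check works for both
    marker = b'<' if isinstance(value, bytes) else '<'
    return marker in value


path_guess = looks_like_xml_text('tests/data/monty_python_films.xml')
text_guess = looks_like_xml_text('<Film year="1971"/>')
bytes_guess = looks_like_xml_text(b'<Film year="1971"/>')
print(path_guess, text_guess, bytes_guess)
```

One consequence of a heuristic like this is that a file path containing a < character would be mistaken for XML text, which is rarely a problem in practice.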
Stripping of Whitespace Nodes¶
By default the parse method ignores whitespace nodes in the XML document – or more accurately, it does extra work to remove these nodes after the document has been parsed by the underlying XML library.
Whitespace nodes are rarely interesting, since they are usually the result of XML content that has been serialized with extra whitespace to make it more readable to humans.
However if you need to keep these nodes, or if you want to avoid the extra
processing overhead when parsing large documents, you can disable this
feature by passing in the ignore_whitespace_text_nodes=False
flag:
>>> # Strip whitespace nodes from document
>>> doc = xml4h.parse('tests/data/monty_python_films.xml')
>>> # No excess text nodes (XML doc lists 7 films)
>>> len(doc.MontyPythonFilms.children)
7
>>> doc.MontyPythonFilms.children[0]
<xml4h.nodes.Element: "Film">
>>> # Don't strip whitespace nodes
>>> doc = xml4h.parse('tests/data/monty_python_films.xml',
... ignore_whitespace_text_nodes=False)
>>> # An extra text node is present
>>> len(doc.MontyPythonFilms.children)
8
>>> doc.MontyPythonFilms.children[0]
<xml4h.nodes.Text: "#text">
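To see where whitespace nodes come from, and roughly what stripping them involves, here is a sketch that uses the standard library's minidom directly. strip_whitespace_text_nodes is a hypothetical helper written for this example; it is not xml4h's implementation, which may work differently.

```python
from xml.dom import minidom

XML_TEXT = """<Films>
    <Film year="1971"/>
    <Film year="1974"/>
</Films>"""


def strip_whitespace_text_nodes(node):
    """Recursively remove child text nodes that contain only whitespace."""
    # Iterate over a copy because we remove children while looping
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_text_nodes(child)


root = minidom.parseString(XML_TEXT).documentElement
count_before = len(root.childNodes)  # includes the indentation text nodes
strip_whitespace_text_nodes(root)
count_after = len(root.childNodes)   # only the two <Film> elements remain
print(count_before, count_after)
```

The extra pass over the whole tree is the processing overhead mentioned above, which is why you may want to switch it off for large documents.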
Builder¶
xml4h includes a document builder tool that makes it easy to create valid, well-formed XML documents using relatively sparse Python code. It makes it so easy to create XML that you will no longer be tempted to cobble together documents with error-prone methods like manual string concatenation or a templating library.
Internally, the builder uses the DOM-building features of an underlying XML library, which means it is (almost) impossible to construct an invalid document.
Here is some example code to build a document about Monty Python films:
>>> import xml4h
>>> xmlb = (xml4h.build('MontyPythonFilms')
... .attributes({'source': 'http://en.wikipedia.org/wiki/Monty_Python'})
... .element('Film')
... .attributes({'year': 1971})
... .element('Title')
... .text('And Now for Something Completely Different')
... .up()
... .elem('Description').t(
... "A collection of sketches from the first and second TV"
... " series of Monty Python's Flying Circus purposely"
... " re-enacted and shot for film.")
... .up()
... .up()
... .elem('Film')
... .attrs(year=1974)
... .e('Title')
... .t('Monty Python and the Holy Grail')
... .up()
... .e('Description').t(
... "King Arthur and his knights embark on a low-budget search"
... " for the Holy Grail, encountering humorous obstacles along"
... " the way. Some of these turned into standalone sketches."
... ).up()
... )
The code above produces the following XML document (abbreviated):
>>> print(xmlb.xml_doc(indent=True))
<?xml version="1.0" encoding="utf-8"?>
<MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python">
<Film year="1971">
<Title>And Now for Something Completely Different</Title>
<Description>A collection of sketches from the first and second...
</Film>
<Film year="1974">
<Title>Monty Python and the Holy Grail</Title>
<Description>King Arthur and his knights embark on a low-budget...
</Film>
</MontyPythonFilms>
Getting Started¶
You typically create a new XML document builder by calling the xml4h.build() function with the name of the root element:
>>> root_b = xml4h.build('RootElement')
The function returns a Builder object that represents the RootElement and allows you to manipulate this element’s attributes or to add child elements.
Once you have the first builder instance, every action you perform to add content to the XML document will return another instance of the Builder class:
>>> # Add attributes to the root element's Builder
>>> root_b = root_b.attributes({'a': 1, 'b': 2}, c=3)
>>> root_b
<xml4h.builder.Builder object ...
The Builder class always represents an underlying element in the DOM. The dom_element attribute returns the element node:
>>> root_b.dom_element
<xml4h.nodes.Element: "RootElement">
>>> root_b.dom_element.attributes
<xml4h.nodes.AttributeDict: [('a', '1'), ('b', '2'), ('c', '3')]>
When you add a new child element, the result is a builder instance representing that child element, not the original element:
>>> child1_b = root_b.element('ChildElement1')
>>> child2_b = root_b.element('ChildElement2')
>>> # The element method returns a Builder wrapping the new child element
>>> child2_b.dom_element
<xml4h.nodes.Element: "ChildElement2">
>>> child2_b.dom_element.parent
<xml4h.nodes.Element: "RootElement">
This feature of the builder can be a little confusing, but it allows for the very convenient method-chaining feature that gives the builder its power.
Method Chaining¶
Because every builder method that adds content to the XML document returns a builder instance representing the nearest (or newest) element, you can chain together many method calls to construct your document without any need for intermediate variables.
For example, the example code in the previous section used the variables root_b, child1_b and child2_b to represent builder instances but this is not necessary. Here is how you can use method-chaining to build the same document with less code:
>>> b = (xml4h
... .build('RootElement').attributes({'a': 1, 'b': 2}, c=3)
... .element('ChildElement1').up() # NOTE the up() method
... .element('ChildElement2')
... )
>>> print(b.xml_doc(indent=4))
<?xml version="1.0" encoding="utf-8"?>
<RootElement a="1" b="2" c="3">
<ChildElement1/>
<ChildElement2/>
</RootElement>
Notice how you can use chained method calls to write code with a structure that mirrors that of the XML document you want to produce? This makes it much easier to spot errors in your code than it would be if you were to concatenate strings.
Note
It is a good idea to wrap the build() function call and all following chained methods in parentheses, so you don’t need to put backslash (\) characters at the end of every line.
The code above introduces a very important builder method: up(). This method returns a builder instance representing the current element’s parent, or indeed any ancestor.
Without the up() method, every time you created a child element with the builder you would end up deeper in the document structure with no way to return to prior elements to add sibling nodes or hierarchies.
To help reduce the number of up() method calls you need to include in your code, this method can also jump up multiple levels or to a named ancestor element:
>>> # A builder that references a deeply-nested element:
>>> deep_b = (xml4h.build('Root')
... .element('Deep')
... .element('AndDeeper')
... .element('AndDeeperStill')
... .element('UntilWeGetThere')
... )
>>> deep_b.dom_element
<xml4h.nodes.Element: "UntilWeGetThere">
>>> # Jump up 4 levels, back to the root element
>>> deep_b.up(4).dom_element
<xml4h.nodes.Element: "Root">
>>> # Jump up to a named ancestor element
>>> deep_b.up('Root').dom_element
<xml4h.nodes.Element: "Root">
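If the way chaining and up() fit together is unclear, the following minimal sketch may help. TinyBuilder is a hypothetical class built on the standard library's ElementTree purely for illustration; it is not xml4h's Builder, and choices such as raising ValueError for an unknown ancestor name are assumptions of this example.

```python
import xml.etree.ElementTree as ET


class TinyBuilder:
    """Hypothetical chaining builder for illustration; not xml4h's Builder."""

    def __init__(self, node, parent=None):
        self.node = node      # the ElementTree element this builder wraps
        self.parent = parent  # the builder for the parent element, if any

    def elem(self, name):
        # Adding a child returns a builder for the NEW child element
        return TinyBuilder(ET.SubElement(self.node, name), parent=self)

    def attrs(self, **attributes):
        for key, value in attributes.items():
            self.node.set(key, str(value))
        return self

    def text(self, value):
        self.node.text = str(value)
        return self

    def up(self, count_or_name=1):
        builder = self
        if isinstance(count_or_name, int):
            # Step up a fixed number of levels, stopping at the root
            for _ in range(count_or_name):
                if builder.parent is None:
                    break
                builder = builder.parent
            return builder
        # Otherwise walk up until an ancestor with the given name is found
        while builder.parent is not None:
            builder = builder.parent
            if builder.node.tag == count_or_name:
                return builder
        raise ValueError('no ancestor named %r' % count_or_name)


b = (TinyBuilder(ET.Element('Root'))
     .elem('Deep')
         .elem('AndDeeper').text('hello')
     .up(2)
     .elem('Sibling').attrs(n=1)
     .up('Root'))
print(ET.tostring(b.node, encoding='unicode'))
```

Because each builder keeps a reference to its parent builder, stepping up is just a matter of following those references, either a fixed number of times or until a name matches.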
Shorthand Methods¶
To make your XML-producing code even less verbose and quicker to type, the builder has shorthand “alias” methods corresponding to the full names.
For example, instead of calling element() to create a new child element, you can instead use the equivalent elem() or e() methods. Similarly, instead of typing attributes() you can use attrs() or a().
Here are the methods and method aliases for adding content to an XML document:
XML Node Created | Builder method | Aliases |
---|---|---|
Element | element | elem, e |
Attribute | attributes | attrs, a |
Text | text | t |
CDATA | cdata | data, d |
Comment | comment | c |
Processing Instruction | processing_instruction | inst, i |
These shorthand method aliases are convenient and lead to even less cruft around the actual XML content you are interested in. But on the other hand they are much less explicit than the longer versions, so use them judiciously.
Access the DOM¶
The XML builder is merely a layer of convenience methods that sits on the xml4h.nodes DOM API. This means you can quickly access the underlying nodes from a builder if you need to inspect them or manipulate them in a way the builder doesn’t allow:
- The dom_element attribute returns a builder’s underlying Element.
- The root attribute returns the document’s root element.
- The document attribute returns a builder’s underlying Document.
See the DOM Nodes API documentation to find out how to work with DOM element nodes once you get them.
Building on an Existing DOM¶
When you are building an XML document from scratch you will generally use the build() function described in Getting Started. However, what if you want to add content to a parsed XML document DOM that you already have?
To wrap an Element DOM node with a builder you simply provide the element node to the same build() function used previously and it will do the right thing.
Here is an example of parsing an existing XML document, locating an element of interest, constructing a builder from that element, and adding some new content. Luckily, the code is simpler than that description…
>>> # Parse an XML document
>>> doc = xml4h.parse('tests/data/monty_python_films.xml')
>>> # Find an Element node of interest
>>> lob_film_elem = doc.MontyPythonFilms.Film[2]
>>> lob_film_elem.Title.text
"Monty Python's Life of Brian"
>>> # Construct a builder from the element
>>> lob_builder = xml4h.build(lob_film_elem)
>>> # Add content
>>> b = (lob_builder.attrs(stars=5)
... .elem('Review').t('One of my favourite films!').up())
>>> # See the results
>>> print(lob_builder.xml())
<Film stars="5" year="1979">
<Title>Monty Python's Life of Brian</Title>
<Description>Brian is born on the first Christmas, in the stable...
<Review>One of my favourite films!</Review>
</Film>
Hydra-Builder¶
Because each builder class instance is independent, an advanced technique for constructing complex documents is to use multiple builders anchored at different places in the DOM. In some situations, the ability to add content to different places in the same document can be very handy.
Here is a trivial example of this technique:
>>> # Create two Elements in a doc to store even or odd numbers
>>> odd_b = xml4h.build('EvenAndOdd').elem('Odd')
>>> even_b = odd_b.up().elem('Even')
>>> # Populate the numbers from a loop
>>> for i in range(1, 11):
...     if i % 2 == 0:
...         even_b.elem('Number').text(i)
...     else:
...         odd_b.elem('Number').text(i)
<...
>>> # Check the final document
>>> print(odd_b.xml_doc(indent=True))
<?xml version="1.0" encoding="utf-8"?>
<EvenAndOdd>
<Odd>
<Number>1</Number>
<Number>3</Number>
<Number>5</Number>
<Number>7</Number>
<Number>9</Number>
</Odd>
<Even>
<Number>2</Number>
<Number>4</Number>
<Number>6</Number>
<Number>8</Number>
<Number>10</Number>
</Even>
</EvenAndOdd>
Writer¶
The xml4h writer produces serialized XML text documents formatted more traditionally – and in our opinion more correctly – than the other Python XML libraries.
Write methods¶
To write out an XML document with xml4h you will generally use the write() or write_doc() methods available on any xml4h node.
The writer methods require a file or any IO stream object as the first argument, and will automatically handle text or binary IO streams.
The write() method outputs the current node and any descendants:
>>> import xml4h
>>> doc = xml4h.parse('tests/data/monty_python_films.xml')
>>> first_film_elem = doc.find('Film')[0]
>>> # Write XML node to stdout
>>> import sys
>>> first_film_elem.write(sys.stdout, indent=True)
<Film year="1971">
<Title>And Now for Something Completely Different</Title>
<Description>A collection of sketches from the first and second...
</Film>
The write_doc() method outputs the entire document no matter which node you call it on:
>>> first_film_elem.write_doc(sys.stdout, indent=True)
<?xml version="1.0" encoding="utf-8"?>
<MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python">
<Film year="1971">
<Title>And Now for Something Completely Different</Title>
<Description>A collection of sketches from the first and second...
</Film>
...
To send output to a file:
>>> # Write to a file
>>> with open('/tmp/example.xml', 'wb') as f:
...     first_film_elem.write_doc(f)
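Handling both text and binary streams comes down to deciding whether to encode the XML text before writing it. The sketch below shows one way to make that decision with the standard library; write_text_or_bytes is a hypothetical helper, and how xml4h itself detects the stream type is not shown here and may differ.

```python
import io


def write_text_or_bytes(stream, text, encoding='utf-8'):
    """Write text to a stream, encoding it first if the stream is binary."""
    if isinstance(stream, io.TextIOBase):
        # Text streams (e.g. sys.stdout, files opened with 'w') take str
        stream.write(text)
    else:
        # Binary streams (e.g. files opened with 'wb') take bytes
        stream.write(text.encode(encoding))


xml_text = '<Film year="1971"/>'

text_stream = io.StringIO()
write_text_or_bytes(text_stream, xml_text)

binary_stream = io.BytesIO()
write_text_or_bytes(binary_stream, xml_text)

print(repr(text_stream.getvalue()), repr(binary_stream.getvalue()))
```

With a text stream the stream's own encoding decides the bytes that end up on disk, which is why the encoding declared in the XML declaration should match how the stream was opened.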
Get XML as a string¶
Because you will often want to generate a string of XML content directly, xml4h includes the convenience methods xml() and xml_doc() to do this easily.
The xml() method works like the write method and will return a string of XML content including the current node and its descendants:
>>> print(first_film_elem.xml())
<Film year="1971">
<Title>And Now for Something Completely...
The xml_doc() method works like the write_doc method and returns a string for the whole document:
>>> print(first_film_elem.xml_doc())
<?xml version="1.0" encoding="utf-8"?>
<MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python">
<Film year="1971">
<Title>And Now for Something Completely Different</Title>
<Description>A collection of sketches from the first and second...
</Film>
...
Note
xml4h assumes that when you directly generate an XML string with these methods it is intended for human consumption, so it applies pretty-print formatting by default.
Format Output¶
The write and xml methods accept a range of formatting options to control how XML content is serialized. These are useful if you expect a human to read the resulting data.
For the full range of formatting options see the code documentation for write() and xml() et al., but here are some pointers to get you started:
- Set indent=True to write a pretty-printed XML document with four space characters for indentation and \n for newlines.
- To use a tab character for indenting and \r\n for newlines: indent='\t', newline='\r\n'.
- xml4h writes utf-8-encoded documents by default; to write with a different encoding: encoding='iso-8859-1'.
- To avoid outputting the XML declaration when writing a document: omit_declaration=True.
Write using the underlying implementation¶
Because xml4h sits on top of an underlying XML library implementation you can use that library’s serialization methods if you prefer, and if you don’t mind having some implementation-specific code.
For example, if you are using lxml as the underlying library you can use its serialisation methods by accessing the implementation node:
>>> # Get the implementation root node, in this case an lxml node
>>> lxml_root_node = first_film_elem.root.impl_node
>>> type(lxml_root_node)
<... 'lxml.etree._Element'>
>>> # Use lxml features as normal; xml4h is no longer in the picture
>>> from lxml import etree
>>> xml_bytes = etree.tostring(
... lxml_root_node, encoding='utf-8', xml_declaration=True, pretty_print=True)
>>> print(xml_bytes.decode('utf-8'))
<?xml version='1.0' encoding='utf-8'?>
<MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python"><Film year="1971"><Title>And Now for Something Completely Different</Title>
<Description>A collection of sketches from the first and second...
</Film>
<Film year="1974"><Title>Monty Python and the Holy Grail</Title>
<Description>King Arthur and his knights embark on a low-budget...
</Film>
...
Note
The output from lxml is a little quirky, at least on the author’s machine. Note for example the single-quote characters in the XML declaration, and the missing newline and indent before the first <Film> element. But don’t worry, that’s why you have xml4h ;)
DOM Nodes¶
xml4h provides node objects and convenience methods that make it easier to work with an in-memory XML document object model (DOM).
This section of the document covers the main features of xml4h nodes. For the full API-level documentation see DOM Nodes API.
Traversing Nodes¶
xml4h aims to provide a simple and intuitive API for traversing and manipulating the XML DOM. To that end it includes a number of convenience methods for performing common tasks:
- Get the Document or root Element from any node via the document and root attributes respectively.
- You can get the name attribute of nodes that have a name, or look up the different name components with prefix to get the namespace prefix (if any) and local_name to get the name portion without the prefix.
- Nodes that have a value expose it via the value attribute.
- A node’s parent attribute returns its parent, while the ancestors attribute returns a list containing its parent, grand-parent, great-grand-parent etc.
- A node’s children attribute returns the child nodes that belong to it, while the siblings attribute returns all other nodes that belong to its parent. You can also get the siblings_before or siblings_after the current node.
- Look up a node’s namespace URI with namespace_uri or the alias ns_uri.
- Check what type of Node you have with Boolean attributes like is_element, is_text, is_entity etc.
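For a sense of what these convenience attributes save you from writing, here is roughly how ancestor and sibling lookups look when done by hand with the standard library's minidom. The ancestors and siblings functions are hypothetical helpers for this example, and stopping the ancestor list at the root element (rather than including the document node) is an assumption of the sketch, not a statement about xml4h's behaviour.

```python
from xml.dom import minidom


def ancestors(node):
    """Return the element ancestors of a node, nearest first."""
    found = []
    parent = node.parentNode
    # Stop once we leave the elements and reach the document node
    while parent is not None and parent.nodeType == parent.ELEMENT_NODE:
        found.append(parent)
        parent = parent.parentNode
    return found


def siblings(node):
    """Return every other node that shares this node's parent."""
    return [n for n in node.parentNode.childNodes if n is not node]


doc = minidom.parseString('<A><B><C/><D/></B></A>')
c_elem = doc.getElementsByTagName('C')[0]

ancestor_names = [n.tagName for n in ancestors(c_elem)]
sibling_names = [n.tagName for n in siblings(c_elem)]
print(ancestor_names, sibling_names)
```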
“Magical” Node Traversal¶
To make it easy to traverse XML documents with a known structure xml4h performs some minor magic when you look up attributes or keys on Document and Element nodes. If you like, you can take advantage of magical traversal to avoid peppering your code with find and xpath searches, or with child and children node attribute lookups.
The principle is simple:
- Child elements are available as Python attributes of the parent element class.
- XML element attributes are available as a Python dict in the owning element.
Here is an example of retrieving information from our Monty Python films document using element names as Python attributes (MontyPythonFilms, Film, Title) and XML attribute names as Python keys (year):
>>> # Parse an example XML document about Monty Python films
>>> import xml4h
>>> doc = xml4h.parse('tests/data/monty_python_films.xml')
>>> for film in doc.MontyPythonFilms.Film:
...     print(film['year'] + ' : ' + film.Title.text)
1971 : And Now for Something Completely Different
1974 : Monty Python and the Holy Grail
...
Python class attribute lookups of child elements work very well when your XML document contains only camel-case tag names LikeThisOne or LikeThat. However, if your document contains lower-case tag names there is a chance the element names will clash with existing Python attribute or method names in the xml4h classes.
To work around this potential issue you can add an underscore (_) character at the end of a magical attribute lookup to avoid the naming clash; xml4h will remove that character before looking for a child element. For example, to look up a child of the element elem1 which is named child, the code elem1.child_ will return the child element whereas elem1.child would access the child() Node method instead.
Note
Not all XML child element tag names are accessible using magical traversal. Names with leading underscore characters will not work, and nor will names containing hyphens because they are not valid Python attribute names. If you have to deal with XML names like this use the full API methods like child() and children() instead.
All the gory details about how magical traversal works are documented at NodeAttrAndChildElementLookupsMixin. Depending on how you feel about magical behaviour this feature might feel like a great convenience, or black magic that makes you wary. The right attitude probably lies somewhere in the middle…
Warning
The behaviour of namespaced XML elements and attributes is inconsistent. You can do magical traversal of elements regardless of what namespace the elements are in, but to look up XML attributes with a namespace prefix you must include that prefix in the name e.g. prefix:attribute-name.
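The general mechanism behind this kind of traversal can be sketched with Python's __getattr__ and __getitem__ hooks. MagicElement below is a hypothetical wrapper around a standard library ElementTree element, written only to illustrate the idea; xml4h's own lookup rules are more involved and are not reproduced here.

```python
import xml.etree.ElementTree as ET


class MagicElement:
    """Hypothetical wrapper illustrating attribute-style child lookups."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails
        if name.startswith('_'):
            raise AttributeError(name)
        # A trailing underscore lets callers dodge clashes with real methods
        tag = name[:-1] if name.endswith('_') else name
        matches = [MagicElement(e) for e in self._element.findall(tag)]
        if not matches:
            raise AttributeError(name)
        # One match yields the element itself, several yield a list
        return matches[0] if len(matches) == 1 else matches

    def __getitem__(self, key):
        # XML attributes are exposed with dict-style access
        return self._element.attrib[key]

    @property
    def text(self):
        return self._element.text

    def child(self, name):
        """Explicit lookup; its existence is why a <child> tag needs child_."""
        found = self._element.find(name)
        return MagicElement(found) if found is not None else None


root = MagicElement(ET.fromstring(
    '<Films>'
    '<Film year="1971"><Title>A</Title></Film>'
    '<Film year="1974"><Title>B</Title></Film>'
    '<child>x</child>'
    '</Films>'))

summaries = [film['year'] + ' : ' + film.Title.text for film in root.Film]
clash_text = root.child_.text  # the <child> element, not the child() method
print(summaries, clash_text)
```

Since __getattr__ is consulted only after normal lookup fails, any real attribute or method wins over a child element of the same name, which is exactly the clash the trailing underscore works around.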
Searching with Find and XPath¶
There are two ways to search for elements within an xml4h document: find and xpath.
The find methods provided by the library are easy to use but can only perform relatively simple searches that return Element results. The xpath method requires familiarity with XPath query syntax, but it lets you perform more complex searches and get results other than just elements.
Find Methods¶
xml4h provides three different find methods:
find() searches descendants of the current node for elements matching the given constraints. You can search by element name, by namespace URI, or with no constraints at all:
>>> # Find ALL elements in the document
>>> elems = doc.find()
>>> [e.name for e in elems]
['MontyPythonFilms', 'Film', 'Title', 'Description', 'Film', 'Title', 'Description',...
>>> # Find the seven <Film> elements in the XML document
>>> film_elems = doc.find('Film')
>>> [e.Title.text for e in film_elems]
['And Now for Something Completely Different', 'Monty Python and the Holy Grail',...
Note that the find() method only finds descendants of the node you run it on:
>>> # Find <Title> elements in a single <Film> element; there's only one
>>> film_elem = doc.find('Film', first_only=True)
>>> film_elem.find('Title')
[<xml4h.nodes.Element: "Title">]
find_first() searches descendants of the current node but only returns the first result element, not a list. If there are no matching element results this method returns None:
>>> # Find the first <Film> element in the document
>>> doc.find_first('Film')
<xml4h.nodes.Element: "Film">
>>> # Search for an element that does not exist
>>> print(doc.find_first('OopsWrongName'))
None
If you were paying attention you may have noticed in the example above that you can make the find() method do exactly the same thing as find_first() by passing the keyword argument first_only=True.
find_doc() is a convenience method that searches the entire document no matter which node you run it on:
>>> # Normal find only searches descendants of the current node
>>> len(film_elem.find('Title'))
1
>>> # find_doc searches the entire document
>>> len(film_elem.find_doc('Title'))
7
This method is exactly like calling xml4h_node.document.find(), which is actually what happens behind the scenes.
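The descendant-only search semantics described above can be approximated with the standard library's ElementTree in a few lines. The find function here is a hypothetical stand-in written for this example, not xml4h's implementation, and it only supports searching by tag name.

```python
import xml.etree.ElementTree as ET


def find(node, name=None, first_only=False):
    """Search the descendants of node, optionally filtered by tag name."""
    # Element.iter() also yields the node itself, so exclude it
    matches = (e for e in node.iter(name) if e is not node)
    if first_only:
        return next(matches, None)
    return list(matches)


root = ET.fromstring(
    '<Films>'
    '<Film><Title>A</Title></Film>'
    '<Film><Title>B</Title></Film>'
    '</Films>')

all_names = [e.tag for e in find(root)]
first_film = find(root, 'Film', first_only=True)
titles_in_first_film = find(first_film, 'Title')
missing = find(root, 'OopsWrongName', first_only=True)
print(all_names, len(titles_in_first_film), missing)
```

Returning None rather than an empty list when first_only is set mirrors the find_first() behaviour shown above.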
XPath Querying¶
xml4h provides a single XPath search method which is available on Document and Element nodes:
xpath() takes an XPath query string and returns the result which may be a list of elements, a list of attributes, a list of values, or a single value. The result depends entirely on the kind of query you perform.
Note
XPath querying is currently only available if you use the lxml or ElementTree implementation libraries. You can check whether the XPath feature is available with has_feature().
Note
Although ElementTree supports XPath queries, this support is very limited and most of the example XPath queries below will not work. If you want to use XPath, you should install lxml for better support.
XPath queries are powerful and complex so we cannot describe them in detail here, but we can at least present some useful examples. Here are queries that perform the same work as the find queries we saw above:
>>> # Query for ALL elements in the document
>>> elems = doc.xpath('//*')
>>> [e.name for e in elems]
['MontyPythonFilms', 'Film', 'Title', 'Description', 'Film', 'Title', 'Description',...
>>> # Query for the seven <Film> elements in the XML document
>>> film_elems = doc.xpath('//Film')
>>> [e.Title.text for e in film_elems]
['And Now for Something Completely Different', 'Monty Python and the Holy Grail',...
>>> # Query for the first <Film> element in the document (returns list)
>>> doc.xpath('//Film[1]')
[<xml4h.nodes.Element: "Film">]
>>> # Query for <Title> elements in a single <Film> element; there's only one
>>> film_elem = doc.xpath('Film[1]')[0]
>>> film_elem.xpath('Title')
[<xml4h.nodes.Element: "Title">]
You can also do things with XPath queries that you simply cannot with the find method, such as find all the attributes of a certain name or apply rich constraints to the query:
>>> # Query for all year attributes
>>> doc.xpath('//@year')
['1971', '1974', '1979', '1982', '1983', '2009', '2012']
>>> # Query for the title of the film released in 1982
>>> doc.xpath('//Film[@year="1982"]/Title/text()')
['Monty Python Live at the Hollywood Bowl']
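To give a feel for the difference between the two XPath-capable libraries, here is a minimal sketch that uses the standard library's ElementTree directly rather than xml4h. The XML string is a cut-down stand-in for the films document, and the sketch assumes that queries such as `//@year` or `text()` fall outside ElementTree's XPath subset, so attribute values are collected in plain Python instead:

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the Monty Python films document (illustrative only)
XML = """
<MontyPythonFilms>
  <Film year="1979"><Title>Monty Python's Life of Brian</Title></Film>
  <Film year="1982"><Title>Monty Python Live at the Hollywood Bowl</Title></Film>
</MontyPythonFilms>
"""
root = ET.fromstring(XML)

# Attribute predicates on a path are within ElementTree's XPath subset
titles_1982 = [t.text for t in root.findall(".//Film[@year='1982']/Title")]

# Attribute values are gathered in Python rather than with an XPath query
years = [film.get('year') for film in root.iter('Film')]

print(titles_1982)
print(years)
```

With lxml installed, the richer queries shown above can be run as written through the xpath() method.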
Namespaces and XPath¶
Finally, let’s discuss how you can run XPath queries on documents with namespaces, because unfortunately this is not a simple subject.
First, you need to understand that if you are working with a namespaced document your XPath queries must refer to those namespaces or they will not find anything:
>>> # Parse a namespaced version of the Monty Python Films doc
>>> ns_doc = xml4h.parse('tests/data/monty_python_films.ns.xml')
>>> print(ns_doc.xml())
<?xml version="1.0" encoding="utf-8"?>
<MontyPythonFilms source="http://en.wikipedia.org/wiki/Monty_Python" xmlns="uri:monty-python" xmlns:work="uri:artistic-work">
<work:Film year="1971">
<Title>And Now for Something Completely Different</Title>
...
>>> # XPath queries without prefixes won't find namespaced elements
>>> ns_doc.xpath('//Film')
[]
To refer to namespaced nodes in your query the namespace must have a prefix
alias assigned to it. You can specify prefixes when you call the xpath method
by providing a namespaces
keyword argument with a dictionary of
alias-to-URI mappings:
>>> # Specify explicit prefix alias mappings
>>> films = ns_doc.xpath('//x:Film', namespaces={'x': 'uri:artistic-work'})
>>> len(films)
7
Or, preferably, if your document node already has prefix mappings you can use them directly:
>>> # Our root node already has a 'work' prefix defined...
>>> ns_doc.root['xmlns:work']
'uri:artistic-work'
>>> # ...so we can use this prefix directly
>>> films = ns_doc.root.xpath('//work:Film')
>>> len(films)
7
Another gotcha is when a document has a default namespace. The default namespace applies to every descendent node without its own namespace, but XPath doesn’t have a good way of dealing with this since there is no such thing as a “default namespace” prefix alias.
xml4h helps out by providing just such an alias: the underscore (_
):
>>> # Our document root has a default namespace
>>> ns_doc.root.ns_uri
'uri:monty-python'
>>> # You need a prefix alias that refers to the default namespace
>>> ns_doc.xpath('//Title')
[]
>>> # You could specify it explicitly...
>>> titles = ns_doc.xpath('//x:Title',
... namespaces={'x': ns_doc.root.ns_uri})
>>> len(titles)
7
>>> # ...or use xml4h's special default namespace prefix: _
>>> titles = ns_doc.xpath('//_:Title')
>>> len(titles)
7
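The same prefix requirement shows up in the underlying libraries. As a rough sketch using only the standard library's ElementTree (not xml4h), with a shortened stand-in document and arbitrary prefix aliases `x` and `w` chosen for illustration:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the namespaced films document (illustrative only)
XML = (
    '<MontyPythonFilms xmlns="uri:monty-python" xmlns:work="uri:artistic-work">'
    '<work:Film year="1971"><Title>And Now for Something Completely Different</Title></work:Film>'
    '<work:Film year="1974"><Title>Monty Python and the Holy Grail</Title></work:Film>'
    '</MontyPythonFilms>'
)
root = ET.fromstring(XML)

# Without a prefix the query matches nothing, even for the default namespace
unprefixed = root.findall('.//Title')

# The aliases are local to the query; only the URIs have to match the document
ns = {'x': 'uri:monty-python', 'w': 'uri:artistic-work'}
titles = root.findall('.//x:Title', ns)
films = root.findall('.//w:Film', ns)

print(len(unprefixed), len(titles), len(films))
```

This is the bookkeeping that the `_` alias saves you from when the namespace in question is the document's default one.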
Filtering Node Lists¶
Many xml4h node attributes return a list of nodes as a
NodeList
object which confers some special filtering
powers. You get this special node list object from attributes like
children
, ancestors
, and siblings
, and from the find
search
method if it has element results.
Here are some examples of how you can easily filter a
NodeList
to get just the
nodes you need:
Get the first child node using the
filter
method:
>>> # Filter to get just the first child
>>> doc.root.children.filter(first_only=True)
<xml4h.nodes.Element: "Film">
>>> # The document has 7 <Film> element children of the root
>>> len(doc.root.children)
7
Get the first child node by treating
children
as a callable:
>>> doc.root.children(first_only=True)
<xml4h.nodes.Element: "Film">
When you treat the node list as a callable it calls the
filter
method behind the scenes, but since doing it the callable way is quicker and clearer in code we will use that approach from now on.
Get the first child node with the
child
filtering method, which accepts the same constraints as the filter
method:
>>> doc.root.child()
<xml4h.nodes.Element: "Film">
>>> # Apply filtering with child
>>> print(doc.root.child('WrongName'))
None
Get the first of a set of children with the
first
attribute:
>>> doc.root.children.first
<xml4h.nodes.Element: "Film">
Filter the node list by name:
>>> for n in doc.root.children('Film'):
...     print(n.Title.text)
And Now for Something Completely Different
Monty Python and the Holy Grail
Monty Python's Life of Brian
Monty Python Live at the Hollywood Bowl
Monty Python's The Meaning of Life
Monty Python: Almost the Truth (The Lawyer's Cut)
A Liar's Autobiography: Volume IV
>>> len(doc.root.children('WrongName'))
0
Note
Passing a node name as the first argument will match the local name of a node. You can match the full node name, which might include a prefix for example, with a call like:
.children(name='SomeName')
.
Filter with a custom function:
>>> # Filter to films released in the year 1979
>>> for n in doc.root.children('Film',
...         filter_fn=lambda node: node.attributes['year'] == '1979'):
...     print(n.Title.text)
Monty Python's Life of Brian
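If the idea of a list that is also callable seems odd, the following sketch shows one way such an object could be put together in plain Python. The class `FilterableList` and the dictionary items are hypothetical stand-ins invented for this illustration; this is not how xml4h's NodeList is actually implemented:

```python
class FilterableList(list):
    """Hypothetical list with filtering powers; not xml4h's NodeList."""

    def filter(self, name=None, first_only=False, filter_fn=None):
        # Keep items that satisfy every constraint that was supplied
        matches = [
            item for item in self
            if (name is None or item.get('name') == name)
            and (filter_fn is None or filter_fn(item))
        ]
        if first_only:
            return matches[0] if matches else None
        return FilterableList(matches)

    # Calling the list is shorthand for calling filter()
    __call__ = filter


# Plain dictionaries stand in for nodes here
films = FilterableList([
    {'name': 'Film', 'year': '1971'},
    {'name': 'Film', 'year': '1979'},
    {'name': 'Other', 'year': '1979'},
])

by_name = films('Film')
from_1979 = films('Film', filter_fn=lambda item: item['year'] == '1979')
first = films(first_only=True)
missing = films('WrongName', first_only=True)
```

Returning another filterable list from each call is what would let filters be chained one after another.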
Manipulating Nodes and Elements¶
xml4h provides simple methods to manipulate the structure and content of an
XML DOM. The methods available depend on the kind of node you are interacting
with, and by far the majority are for working with
Element
nodes.
Delete a Node¶
Any node can be removed from its owner document with
delete()
:
>>> # Before deleting a Film element there are 7 films
>>> len(doc.MontyPythonFilms.Film)
7
>>> doc.MontyPythonFilms.children('Film')[-1].delete()
>>> len(doc.MontyPythonFilms.Film)
6
Note
By default deleting a node also destroys it, but it can optionally be left
intact after removal from the document by including the destroy=False
option.
Name and Value Attributes¶
Many nodes have low-level name and value properties that can be read from and written to. Nodes with names and values include Text, CDATA, Comment, ProcessingInstruction, Attribute, and Element nodes.
Here is an example of accessing the low-level name and value properties of a Text node:
>>> text_node = doc.MontyPythonFilms.child('Film').child('Title').child()
>>> text_node.is_text
True
>>> text_node.name
'#text'
>>> text_node.value
'And Now for Something Completely Different'
And here is the same for an Attribute node:
>>> # Access the name/value properties of an Attribute node
>>> year_attr = doc.MontyPythonFilms.child('Film').attribute_node('year')
>>> year_attr.is_attribute
True
>>> year_attr.name
'year'
>>> year_attr.value
'1971'
The name attribute of a node is not necessarily a plain string: in the case of
nodes within a defined namespace the name
attribute may comprise two
components: a prefix
that represents the namespace, and a local_name
which is the plain name of the node ignoring the namespace. For more
information on namespaces see Namespaces.
Import a Node and its Descendants¶
In addition to manipulating nodes in a single XML document directly, you can also import a node (and all its descendants) from another document using a node clone or transplant operation.
There are two ways to import a node and its descendants:
- Use the
clone_node()
Node method or clone()
Builder method to copy a node into your document without removing it from its original document.
- Use the
transplant_node()
Node method or transplant()
Builder method to transplant a node into your document and remove it from its original document.
Here is an example of transplanting a node into a document (which also happens
to undo the damage we did to our example DOM in the delete()
example
above):
>>> # Build a new document containing a Film element
>>> film_builder = (xml4h.build('DeletedFilm')
... .element('Film').attrs(year='1971')
... .element('Title')
... .text('And Now for Something Completely Different').up()
... .element('Description').text(
... "A collection of sketches from the first and second TV"
... " series of Monty Python's Flying Circus purposely"
... " re-enacted and shot for film.")
... )
>>> # Transplant the Film element from the new document
>>> node_to_transplant = film_builder.root.child('Film')
>>> doc.MontyPythonFilms.transplant_node(node_to_transplant)
>>> len(doc.MontyPythonFilms.Film)
7
When you transplant a node from another document it is removed from that document:
>>> # After transplanting the Film node it is no longer in the original doc
>>> len(film_builder.root.find('Film'))
0
If you need to leave the original document unchanged when importing a node use the clone methods instead.
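The distinction between the two operations can be seen in miniature with the standard library's ElementTree. This sketch does not use xml4h; it only illustrates the copy-versus-move semantics, with `copy.deepcopy` standing in for a clone and a remove-then-append standing in for a transplant:

```python
import copy
import xml.etree.ElementTree as ET

source = ET.fromstring('<DeletedFilm><Film year="1971"/></DeletedFilm>')
target = ET.fromstring('<MontyPythonFilms/>')
film = source.find('Film')

# Clone: the target receives a copy and the source document is unchanged
target.append(copy.deepcopy(film))
films_in_source_after_clone = len(source.findall('Film'))

# Transplant: the node itself moves, so it disappears from the source
source.remove(film)
target.append(film)
films_in_source_after_transplant = len(source.findall('Film'))

print(films_in_source_after_clone, films_in_source_after_transplant)
print(len(target.findall('Film')))
```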
Working with Elements¶
Element nodes have the most methods to access and manipulate their content, which is fitting since this is the most useful type of node and you will deal with elements regularly.
The leaf elements in XML documents often have one or more
Text
node children that contain the element’s data
content. While you could iterate over such text nodes as child nodes, xml4h
provides the more convenient text accessors you would expect:
>>> title_elem = doc.MontyPythonFilms.Film[0].Title
>>> orig_title = title_elem.text
>>> orig_title
'And Now for Something Completely Different'
>>> title_elem.text = 'A new, and wrong, title'
>>> title_elem.text
'A new, and wrong, title'
>>> # Let's put it back the way it was...
>>> title_elem.text = orig_title
Elements also have attributes that can be manipulated in a number of ways.
Look up an element’s attributes with:
the
attributes()
attribute (or aliases attrib
and attrs
) that return an ordered dictionary of attribute names and values:
>>> film_elem = doc.MontyPythonFilms.Film[0]
>>> film_elem.attributes
<xml4h.nodes.AttributeDict: [('year', '1971')]>
or by obtaining an element’s attributes as
Attribute
nodes, though that is only likely to be useful in unusual circumstances:
>>> film_elem.attribute_nodes
[<xml4h.nodes.Attribute: "year">]
>>> # Get a specific attribute node by name or namespace URI
>>> film_elem.attribute_node('year')
<xml4h.nodes.Attribute: "year">
and there’s also the “magical” keyword lookup technique discussed in “Magical” Node Traversal for quickly grabbing attribute values.
Set attribute values with:
the
set_attributes()
method, which allows you to add attributes without replacing existing ones. This method also supports defining XML attributes as a dictionary, list of name/value pairs, or keyword arguments:
>>> # Set/add attributes as a dictionary
>>> film_elem.set_attributes({'a1': 'v1'})
>>> # Set/add attributes as a list of name/value pairs
>>> film_elem.set_attributes([('a2', 'v2')])
>>> # Set/add attributes as keyword arguments
>>> film_elem.set_attributes(a3='v3', a4=4)
>>> film_elem.attributes
<xml4h.nodes.AttributeDict: [('a1', 'v1'), ('a2', 'v2'), ('a3', 'v3'), ('a4', '4'), ('year', '1971')]>
the setter version of the
attributes
attribute, which replaces any existing attributes with the new set:
>>> film_elem.attributes = {'year': '1971', 'note': 'funny'}
>>> film_elem.attributes
<xml4h.nodes.AttributeDict: [('note', 'funny'), ('year', '1971')]>
Delete attributes from an element by:
using Python’s delete-in-dict technique:
>>> del(film_elem.attributes['note'])
>>> film_elem.attributes
<xml4h.nodes.AttributeDict: [('year', '1971')]>
or by calling the
delete()
method on an Attribute
node.
Finally, the Element
class provides a number of methods
for programmatically adding child nodes, for cases where you would rather work
directly with nodes instead of using a Builder.
The most complex of these methods is add_element()
which allows you to add a named child element, and optionally to set the new
element’s namespace, text content, and attributes all at the same time. Let’s
try an example:
>>> # Add a Film element with an attribute
>>> new_film_elem = doc.MontyPythonFilms.add_element(
... 'Film', attributes={'year': 'never'})
>>> # Add a Description element with text content
>>> desc_elem = new_film_elem.add_element(
... 'Description', text='Just testing...')
>>> # Add a Title element with text *before* the description element
>>> title_elem = desc_elem.add_element(
... 'Title', text='The Film that Never Was', before_this_element=True)
>>> print(doc.MontyPythonFilms.Film[-1].xml())
<Film year="never">
<Title>The Film that Never Was</Title>
<Description>Just testing...</Description>
</Film>
There are similar methods for handling simpler cases like adding text nodes, comments etc. Here is an example of adding text nodes:
>>> # Add a text node
>>> title_elem = doc.MontyPythonFilms.Film[-1].Title
>>> title_elem.add_text(', and Never Will Be')
>>> title_elem.text
'The Film that Never Was, and Never Will Be'
Refer to the Element
documentation for more information
about the other methods for adding nodes.
Wrapping and Unwrapping xml4h Nodes¶
You can easily convert to or from xml4h’s wrapped version of an implementation node. For example, if you prefer the lxml library’s ElementMaker document builder approach to the xml4h Builder, you can create a document in lxml…
>>> from lxml.builder import ElementMaker
>>> E = ElementMaker()
>>> lxml_doc = E.DocRoot(
... E.Item(
... E.Name('Item 1'),
... E.Value('Value 1')
... ),
... E.Item(
... E.Name('Item 2'),
... E.Value('Value 2')
... )
... )
>>> lxml_doc
<Element DocRoot at ...
…and then convert (or, more accurately, wrap) the lxml nodes with the appropriate adapter to make them xml4h versions:
>>> # Convert lxml Document to xml4h version
>>> xml4h_doc = xml4h.LXMLAdapter.wrap_document(lxml_doc)
>>> xml4h_doc.children
[<xml4h.nodes.Element: "Item">, <xml4h.nodes.Element: "Item">]
>>> # Get an element within the lxml document
>>> lxml_elem = list(lxml_doc)[0]
>>> lxml_elem
<Element Item at ...
>>> # Convert lxml Element to xml4h version
>>> xml4h_elem = xml4h.LXMLAdapter.wrap_node(lxml_elem, lxml_doc)
>>> xml4h_elem
<xml4h.nodes.Element: "Item">
You can reach the underlying XML implementation document or node at any time from an xml4h node:
>>> # Get an xml4h node's underlying implementation node
>>> xml4h_elem.impl_node
<Element Item at ...
>>> xml4h_elem.impl_node == lxml_elem
True
>>> # Get the underlying implementation document from any node
>>> xml4h_elem.impl_document
<Element DocRoot at ...
>>> xml4h_elem.impl_document == lxml_doc
True
Advanced¶
Namespaces¶
xml4h supports using XML namespaces in a number of ways, and tries to make this sometimes complex and fiddly aspect of XML a little easier to deal with.
Namespace URIs¶
XML document nodes can be associated with a namespace URI which uniquely identifies the namespace. At bottom a URI is really just a name to identify the namespace, which may or may not point at an actual resource.
Namespace URIs are the core piece of the namespacing puzzle; everything else is extras.
Namespace URI values are assigned to a node in one of three ways:
an
xmlns
attribute on an element assigns a namespace URI to that element, and may also define a shorthand prefix for the namespace:
<AnElement xmlns:my-prefix="urn:example-uri">
Note
Technically the
xmlns
attribute must itself also be in the special XML namespacing namespace http://www.w3.org/2000/xmlns/. You needn’t care about this.
a tag or attribute name includes a prefix alias portion that specifies the namespace the item belongs to:
<my-prefix:AnotherElement attr1="x" my-prefix:attr2="i am namespaced">
A prefix alias can be defined using an “xmlns” attribute as described above, or by using the Builder
ns_prefix()
or Node set_ns_prefix()
methods.
in an apparent effort to reduce confusion around namespace URIs and prefixes, some XML libraries avoid prefix aliases altogether and instead require you to specify the full namespace URI as a prefix to tag and attribute names using a special syntax with braces:
>>> tagname = '{urn:example-uri}YetAnotherWayToNamespace'
Note
In the author’s opinion, using a non-standard way to define namespaces does not reduce confusion. xml4h supports this approach technically but not philosophically.
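The standard library's ElementTree is one example of a library that uses the brace syntax. The short sketch below works with ElementTree directly (not xml4h) to show how a prefixed element in the source XML surfaces as a brace-qualified tag name, and how the same syntax is used to create a namespaced child:

```python
import xml.etree.ElementTree as ET

# The source XML uses a prefix alias...
root = ET.fromstring('<my-prefix:AnElement xmlns:my-prefix="urn:example-uri"/>')

# ...but ElementTree reports the tag with the full URI in braces
parsed_tag = root.tag

# New namespaced elements are created with the same brace syntax
child = ET.SubElement(root, '{urn:example-uri}YetAnotherWayToNamespace')

print(parsed_tag)
print(child.tag)
```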
xml4h allows you to assign namespace URIs to document nodes when using the Builder:
>>> # Assign a default namespace with ns_uri
>>> import xml4h
>>> b = xml4h.build('Doc', ns_uri='ns-uri')
>>> root = b.root
>>> # Descendants without a namespace inherit their ancestor's default one
>>> elem1 = b.elem('Elem1').dom_element
>>> elem1.namespace_uri
'ns-uri'
>>> # Define a prefix alias to assign a new or existing namespace URI
>>> elem2 = b.ns_prefix('my-ns', 'second-ns-uri') \
... .elem('my-ns:Elem2').dom_element
>>> print(root.xml())
<Doc xmlns="ns-uri" xmlns:my-ns="second-ns-uri">
<Elem1/>
<my-ns:Elem2/>
</Doc>
>>> # Or use the explicit URI prefix approach, if you must
>>> elem3 = b.elem('{third-ns-uri}Elem3').dom_element
>>> elem3.namespace_uri
'third-ns-uri'
And when adding nodes with the API:
>>> # Define the ns_uri argument when creating a new element
>>> elem4 = root.add_element('Elem4', ns_uri='fourth-ns-uri')
>>> # Attributes can be namespaced too
>>> elem4.set_attributes({'my-ns:attr1': 'value'})
>>> print(elem4.xml())
<Elem4 my-ns:attr1="value" xmlns="fourth-ns-uri"/>
Filtering by Namespace¶
xml4h allows you to find and filter nodes based on their namespace.
The find()
method takes a ns_uri
keyword argument to
return only elements in that namespace:
>>> # By default, find ignores namespaces...
>>> [n.local_name for n in root.find()]
['Elem1', 'Elem2', 'Elem3', 'Elem4']
>>> # ...but will filter by namespace URI if you wish
>>> [n.local_name for n in root.find(ns_uri='fourth-ns-uri')]
['Elem4']
Similarly, a node’s children listing can be filtered:
>>> len(root.children)
4
>>> root.children(ns_uri='ns-uri')
[<xml4h.nodes.Element: "Elem1">]
XPath queries can also filter by namespace, but the
xpath()
method needs to be given a dictionary mapping
of prefix aliases to URIs:
>>> root.xpath('//ns4:*', namespaces={'ns4': 'fourth-ns-uri'})
[<xml4h.nodes.Element: "Elem4">]
Note
Normally, because XPath queries rely on namespace prefix aliases, they cannot find namespaced nodes in the default namespace which has an “empty” prefix name. xml4h works around this limitation by providing the special empty/default prefix alias ‘_’.
Element Names: Local and Prefix Components¶
When you use a namespace prefix alias to define the namespace an element or attribute belongs to, the name of that node will be made up of two components:
- prefix - the namespace alias.
- local - the real name of the node, without the namespace alias.
xml4h makes the full (qualified) name, and the two components, available as node attributes:
>>> # Elem2's namespace was defined earlier using a prefix alias
>>> elem2
<xml4h.nodes.Element: "my-ns:Elem2">
>>> # The full node name...
>>> elem2.name
'my-ns:Elem2'
>>> # ...comprises a prefix...
>>> elem2.prefix
'my-ns'
>>> # ...and a local name component
>>> elem2.local_name
'Elem2'
>>> # Here is an element without a prefix alias
>>> elem1.name
'Elem1'
>>> elem1.prefix == None
True
>>> elem1.local_name
'Elem1'
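The relationship between the three values is simple enough to express in a few lines of plain Python. The helper `split_qname` below is a hypothetical name made up for this illustration and is not part of xml4h; it merely mirrors the prefix and local_name results shown above:

```python
def split_qname(name):
    """Split a qualified node name into (prefix, local_name).

    Hypothetical helper for illustration; the prefix is None when the
    name carries no namespace alias.
    """
    prefix, separator, local_name = name.partition(':')
    if not separator:
        # No colon present, so the whole name is the local name
        return None, name
    return prefix, local_name


print(split_qname('my-ns:Elem2'))
print(split_qname('Elem1'))
```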
xml4h Architecture¶
To best understand the xml4h library and to use it appropriately in demanding situations, you should appreciate what the library is not.
xml4h is not a full-fledged XML library in its own right, far from it. Instead of implementing low-level document parsing and manipulation tools, it operates as an abstraction layer on top of the pre-existing XML processing libraries you already know.
This means the improved API and tool suite provided by xml4h work by mediating operations you perform, asking the underlying XML library to do the work, and packaging up the results of this work as wrapped xml4h objects.
This approach has a number of implications, good and bad.
On the good side:
- you can start using and benefiting from xml4h in existing projects that already use a supported XML library without any impact; it can fit right in.
- xml4h can take advantage of the existing powerful and fast XML libraries to do its work.
- by providing an abstraction layer over multiple libraries, xml4h can make it (relatively) easy to switch the underlying library without you needing to rewrite your own XML handling code.
- by building on the shoulders of giants, xml4h itself can remain relatively lightweight and focussed on simplicity and usability.
- the author of xml4h does not have to write XML-handling code in C…
On the bad side:
- if the underlying XML libraries available in the Python environment do not support a feature (like XPath querying) then that feature will not be available in xml4h.
- xml4h cannot provide radical new XML processing features, since the bulk of its work must be done by the underlying library.
- the abstraction layer xml4h uses to do its work requires more resources than it would to use the underlying library directly, so if you absolutely need maximal speed or minimal memory use the library might prove too expensive.
- xml4h sometimes needs to jump through some hoops to maintain the shared abstraction interface over multiple libraries, which means extra work is done in Python instead of by the underlying library code in C.
The author believes the benefits of using xml4h outweigh the drawbacks in the majority of real-world situations, or he wouldn’t have created the library in the first place, but ultimately it is up to you to decide where you should or should not use it.
Library Adapters¶
To provide an abstraction layer over multiple underlying XML libraries, xml4h
uses an “adapter” mechanism to mediate operations on documents. There is an
adapter implementation for each library xml4h can work with, each of which
extends the XmlImplAdapter
class. This base
class includes some standard behaviour, and defines the interface for adapter
implementations (to the extent you can define such interfaces in Python).
The current version of xml4h includes adapter implementations for the three main XML processing libraries for Python:
LXMLAdapter
works with the excellent lxml library which is very full-featured and fast, but which is not included in the standard library.
cElementTreeAdapter
and ElementTreeAdapter
work with the ElementTree libraries included with the standard library of Python versions 2.7 and later. ElementTree is fast and includes support for some basic XPath expressions. If the C-based version of ElementTree is available, the former adapter is made available and should be used for best performance.
XmlDomImplAdapter
works with the minidom W3C-style XML library included with the standard library. This library is always available but is slower and has fewer features than alternative libraries (e.g. no support for XPath).
The adapter layer allows the rest of the xml4h library code to remain almost entirely oblivious to the underlying XML library that happens to be available at the time. The xml4h Builder, Node objects, writer etc. call adapter methods to perform document operations, and the adapter is responsible for doing the necessary work with the underlying library.
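As a loose illustration of the adapter idea, here is a toy version built on two standard library parsers. Every class and method name in this sketch (`BaseAdapter`, `ETAdapter`, `MinidomAdapter`, `child_names`, `list_children`) is hypothetical and far simpler than xml4h's real adapter interface; the point is only that the calling code never touches the underlying library:

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET


class BaseAdapter:
    """Hypothetical interface; not xml4h's XmlImplAdapter."""

    def parse_string(self, xml_str):
        raise NotImplementedError

    def child_names(self, node):
        raise NotImplementedError


class ETAdapter(BaseAdapter):
    def parse_string(self, xml_str):
        return ET.fromstring(xml_str)

    def child_names(self, node):
        return [child.tag for child in node]


class MinidomAdapter(BaseAdapter):
    def parse_string(self, xml_str):
        return xml.dom.minidom.parseString(xml_str).documentElement

    def child_names(self, node):
        # minidom children include text nodes, so keep only elements
        return [c.nodeName for c in node.childNodes
                if c.nodeType == c.ELEMENT_NODE]


def list_children(adapter, xml_str):
    # Library-agnostic code: it only ever calls adapter methods
    root = adapter.parse_string(xml_str)
    return adapter.child_names(root)


XML = '<DocRoot><Item/><Item/></DocRoot>'
et_result = list_children(ETAdapter(), XML)
minidom_result = list_children(MinidomAdapter(), XML)
```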
“Best” Adapter¶
While xml4h can work with multiple underlying XML libraries, some of these libraries are better (faster, more fully-featured) than others so it would be smart to use the best of the libraries available.
xml4h does exactly that: unless you explicitly choose an adapter (see below) xml4h will find the supported libraries in the Python environment and choose the “best” adapter for you in the circumstances.
Here is the list of libraries xml4h will choose from, best to least-best:
- lxml
- (c)ElementTree
- ElementTree
- minidom
The xml4h.best_adapter
attribute stores the adapter class that xml4h
considers to be the best.
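One plausible way to implement this kind of preference order is to try importing each candidate library in turn and keep the first that succeeds. The sketch below is an assumption about the general technique, not a description of xml4h's actual selection code, and the names `CANDIDATES` and `pick_best` are made up for the example:

```python
import importlib

# Candidate modules, most preferred first (illustrative ordering)
CANDIDATES = ['lxml.etree', 'xml.etree.ElementTree', 'xml.dom.minidom']


def pick_best(candidates=CANDIDATES):
    """Return the name of the first importable module, or None."""
    for module_name in candidates:
        try:
            importlib.import_module(module_name)
        except ImportError:
            continue  # not installed here, try the next candidate
        return module_name
    return None


best = pick_best()
fallback = pick_best(['no_such_xml_library_for_this_sketch', 'xml.dom.minidom'])
print(best, fallback)
```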
Choose Your Own Adapter¶
By default, xml4h will choose an adapter and underlying XML library implementation that it considers the best available. However, in some cases you may need to have full control over which underlying implementation xml4h uses, perhaps because you will use features of the underlying XML implementation later on, or because you need the performance characteristics only available in a particular library.
For these situations it is possible to tell xml4h which adapter implementation, and therefore which underlying XML library, it should use.
To use a specific adapter implementation when parsing a document, or when
creating a new document using the builder, simply provide the optional
adapter
keyword argument to the relevant method:
Parsing:
>>> # Explicitly use the minidom adapter to parse a document
>>> minidom_doc = xml4h.parse('tests/data/monty_python_films.xml',
...                           adapter=xml4h.XmlDomImplAdapter)
>>> minidom_doc.root.impl_node
<DOM Element: MontyPythonFilms at ...
Building:
>>> # Explicitly use the lxml adapter to build a document
>>> lxml_b = xml4h.build('MyDoc', adapter=xml4h.LXMLAdapter)
>>> lxml_b.root.impl_node
<Element {http://www.w3.org/2000/xmlns/}MyDoc at ...
Manipulating:
>>> # Use xml4h with a cElementTree document object
>>> import xml.etree.ElementTree as ET
>>> et_doc = ET.parse('tests/data/monty_python_films.xml')
>>> et_doc
<xml.etree.ElementTree.ElementTree object ...
>>> doc = xml4h.cElementTreeAdapter.wrap_document(et_doc)
>>> doc.root
<xml4h.nodes.Element: "MontyPythonFilms">
Check Feature Support¶
Because not all underlying XML libraries support all the features exposed by xml4h, the library includes a simple mechanism to check whether a given feature is available in the current Python environment or with the current adapter.
To check for feature support call the has_feature()
method on a document node, or
has_feature()
on an adapter class.
List of features that are not available in all adapters:
- xpath - Can perform XPath queries using the xpath()
method.
- More to come later, probably…
For example, here is how you would test for XPath support in the minidom adapter, which doesn’t include it:
>>> minidom_doc.root.has_feature('xpath')
False
If you forget to check for a feature and use it anyway, you will get
a FeatureUnavailableException
:
>>> try:
... minidom_doc.root.xpath('//*')
... except Exception as e:
... e
FeatureUnavailableException('xpath'...
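The check-then-use pattern can be sketched without any XML library at all. The classes below (`FeatureUnavailable`, `SketchAdapter`) are hypothetical stand-ins written for this illustration and do not reproduce xml4h's internals; they only show how a feature table, a has_feature() check and a guarded method might fit together:

```python
class FeatureUnavailable(Exception):
    """Hypothetical stand-in for xml4h's FeatureUnavailableException."""


class SketchAdapter:
    # Feature names mapped to whether this made-up adapter supports them
    SUPPORTED_FEATURES = {'xpath': False}

    @classmethod
    def has_feature(cls, feature_name):
        return cls.SUPPORTED_FEATURES.get(feature_name.lower(), False)

    def xpath(self, query):
        if not self.has_feature('xpath'):
            raise FeatureUnavailable('xpath')
        raise NotImplementedError  # a real adapter would run the query here


adapter = SketchAdapter()

# Checking first lets the caller pick a fallback instead of catching errors
if adapter.has_feature('xpath'):
    results = adapter.xpath('//*')
else:
    results = []  # e.g. fall back to a find()-style search instead
```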
Adapter & Implementation Quirks¶
Although xml4h aims to provide a seamless abstraction over underlying XML library implementations this isn’t always possible, or is only possible by performing lots of extra work that affects performance. This section describes some implementation-specific quirks or differences you may encounter.
LXMLAdapter - lxml¶
- lxml does not have full support for CDATA nodes, which devolve into plain text node values when written (by xml4h or by lxml’s writer).
- Namespaces defined by adding
xmlns
element attributes are not properly represented in the underlying implementation due to the lxml library’s immutable nsmap
namespace map. Such namespaces are written correctly by the xml4h writer, but to avoid quirks it is best to specify the namespace when creating nodes by setting the ns_uri
keyword argument.
- When xml4h writes lxml-based documents with namespaces, some node tag names may have unnecessary namespace prefix aliases.
(c)ElementTreeAdapter - ElementTree¶
- Only the versions of (c)ElementTree included with Python version 2.7 and later are supported.
- ElementTree supports only a very limited subset of XPath for querying, so
although the
has_feature('xpath')
check returns True
don’t expect to get the full power of XPath when you use this adapter.
- ElementTree does not have full support for CDATA nodes, which devolve into plain text node values when written (by xml4h or by ElementTree’s writer).
- Because ElementTree doesn’t retain information about a node’s parent, xml4h needs to build and maintain its own records of which nodes are parents of which children. This extra overhead might harm performance or memory usage.
- ElementTree doesn’t normally remember explicit namespace definition directives when parsing a document. xml4h works around this when it is asked to parse XML data, but if you parse data outside of xml4h then use the library on the resultant document the namespace definitions will get messed up.
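Two of these quirks are easy to observe with the standard library's ElementTree on its own. The sketch below builds a child-to-parent lookup table, a common idiom for working around the missing parent links (xml4h's own bookkeeping may well differ), and then shows a CDATA section coming back out as escaped plain text:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<Films><Film><Title>Life of Brian</Title></Film></Films>')

# Elements have no parent reference, so record parents in a separate map
parent_of = {child: parent for parent in root.iter() for child in parent}
title = root.find('./Film/Title')
title_parent_tag = parent_of[title].tag
root_parent = parent_of.get(root)  # the root has no parent entry

# CDATA is read as ordinary text and written back in escaped form
note = ET.fromstring('<Note><![CDATA[1 < 2]]></Note>')
serialized = ET.tostring(note, encoding='unicode')

print(title_parent_tag, root_parent)
print(serialized)
```

A map like this has to be rebuilt or updated whenever nodes are added or removed, which is where the extra overhead mentioned above comes from.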
XmlDomImplAdapter - minidom¶
- No support for performing XPath queries.
- Slower than alternative C-based implementations.
API¶
Main Interface¶
-
xml4h.
parse
(to_parse, ignore_whitespace_text_nodes=True, adapter=None)[source]¶
Parse an XML document into an xml4h-wrapped DOM representation using an underlying XML library implementation.
Parameters:
- to_parse (a file-like object or string) – an XML document file, document bytes, or the path to an XML file. If a bytes value is given that contains a < character it is treated as literal XML data, otherwise a bytes value is treated as a file path.
- ignore_whitespace_text_nodes (bool) – if True pure whitespace nodes are stripped from the parsed document, since these are usually noise introduced by XML docs serialized to be human-friendly.
- adapter (adapter class or None) – the xml4h implementation adapter class used to parse the document and to interact with the resulting nodes. If None, best_adapter will be used.
Returns: an xml4h.nodes.Document node representing the parsed document.
Delegates to an adapter’s parse_string() or parse_file() implementation.
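The rule for telling literal XML data apart from a file path can be pictured with a tiny helper. The function name `looks_like_xml_data` is hypothetical and the handling of text strings is an assumption added for the sketch; only the "contains a < character" test comes from the description above:

```python
def looks_like_xml_data(to_parse):
    """Guess whether a value holds literal XML rather than a file path.

    Hypothetical helper for illustration; not part of xml4h.
    """
    if isinstance(to_parse, str):
        # Assumption for this sketch: treat text like its encoded bytes
        to_parse = to_parse.encode('utf-8')
    return isinstance(to_parse, bytes) and b'<' in to_parse


print(looks_like_xml_data(b'<Doc/>'))
print(looks_like_xml_data(b'tests/data/monty_python_films.xml'))
```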
-
xml4h.
build
(tagname_or_element, ns_uri=None, adapter=None)[source]¶
Return a Builder that represents an element in a new or existing XML DOM and provides “chainable” methods focussed specifically on adding XML content.
Parameters:
- tagname_or_element (string or Element node) – a string name for the root node of a new XML document, or an Element node in an existing document.
- ns_uri (string or None) – a namespace URI to apply to the new root node. This argument has no effect if this method is acting on an existing element.
- adapter (adapter class or None) – the xml4h implementation adapter class used to interact with the document DOM nodes. If None, best_adapter will be used.
Returns: a Builder instance that represents an Element node in an XML DOM.
-
xml4h.
best_adapter
¶ alias of
xml4h.impls.xml_etree_elementtree.cElementTreeAdapter
Builder¶
Builder is a utility class that makes it easy to create valid, well-formed
XML documents using relatively sparse python code. The builder class works
by wrapping an xml4h.nodes.Element
node to provide “chainable”
methods focussed specifically on adding XML content.
Each method that adds content returns a Builder instance representing the
current or the newly-added element. Behind the scenes, the builder uses the
xml4h.nodes
node traversal and manipulation methods to add content
directly to the underlying DOM.
You will not generally create Builder instances directly, but will instead
call the xml4h.build()
method with the name for a new root element
or with an existing xml4h.nodes.Element
node.
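To make the "chainable" idea concrete, here is a minimal sketch of the pattern written against the standard library's ElementTree. The class `MiniBuilder` is hypothetical and much smaller than xml4h's Builder; its method names echo the ones used in this documentation (element, attrs, text, up, root) purely for familiarity:

```python
import xml.etree.ElementTree as ET


class MiniBuilder:
    """Hypothetical chainable builder; not xml4h's Builder class."""

    def __init__(self, element, parent_builder=None):
        self._element = element
        self._parent = parent_builder

    def element(self, name):
        # Adding a child returns a builder for the new child element
        child = ET.SubElement(self._element, name)
        return MiniBuilder(child, self)

    def attrs(self, **kwargs):
        for key, value in kwargs.items():
            self._element.set(key, str(value))
        return self  # content methods return the current builder

    def text(self, value):
        self._element.text = (self._element.text or '') + value
        return self

    def up(self):
        # Step back to the parent element's builder, if there is one
        return self._parent if self._parent is not None else self

    @property
    def root(self):
        builder = self
        while builder._parent is not None:
            builder = builder._parent
        return builder._element


b = (MiniBuilder(ET.Element('DeletedFilm'))
     .element('Film').attrs(year='1971')
     .element('Title').text('And Now for Something Completely Different').up()
     .element('Description').text('A collection of sketches.'))
xml_out = ET.tostring(b.root, encoding='unicode')
print(xml_out)
```

Because each call hands back a builder, the document's nesting can be read straight off the chain of calls.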
-
class
xml4h.builder.
Builder
(element)[source]¶ Builder class that wraps an
xml4h.nodes.Element
node with methods for adding XML content to an underlying DOM.-
a
(*args, **kwargs)¶ Alias of
attributes()
-
attributes
(*args, **kwargs)[source]¶ Add one or more attributes to the
xml4h.nodes.Element
node represented by this Builder.Returns: the current Builder. Delegates to
xml4h.nodes.Element.set_attributes()
.
-
attrs
(*args, **kwargs)¶ Alias of
attributes()
-
cdata
(text)[source]¶ Add a CDATA node to the
xml4h.nodes.Element
node represented by this Builder.Returns: the current Builder. Delegates to
xml4h.nodes.Element.add_cdata()
.
-
clone
(node)[source]¶ Clone a node from another document to become a child of the
xml4h.nodes.Element
node represented by this Builder.Returns: a new Builder that represents the current element (not the cloned node). Delegates to
xml4h.nodes.Node.clone_node()
.
-
comment
(text)[source]¶ Add a comment node to the
xml4h.nodes.Element
node represented by this Builder.Returns: the current Builder. Delegates to
xml4h.nodes.Element.add_comment()
.
-
document
¶ Returns: the xml4h.nodes.Document
node that contains the element represented by this Builder.
-
dom_element
¶ Returns: the xml4h.nodes.Element
node represented by this Builder.
-
element
(*args, **kwargs)[source]¶ Add a child element to the
xml4h.nodes.Element
node represented by this Builder. Returns: a new Builder that represents the child element. Delegates to
xml4h.nodes.Element.add_element()
.
-
find
(**kwargs)[source]¶ Find descendants of the element represented by this builder that match the given constraints.
Returns: a list of xml4h.nodes.Element
nodes. Delegates to
xml4h.nodes.Node.find()
-
find_doc
(**kwargs)[source]¶ Find nodes in this element’s owning
xml4h.nodes.Document
that match the given constraints. Returns: a list of xml4h.nodes.Element
nodes. Delegates to
xml4h.nodes.Node.find_doc()
.
-
i
(target, data)¶ Alias of
processing_instruction()
-
instruction
(target, data)¶ Alias of
processing_instruction()
-
ns_prefix
(prefix, ns_uri)[source]¶ Set the namespace prefix of the
xml4h.nodes.Element
node represented by this Builder. Returns: the current Builder. Delegates to
xml4h.nodes.Element.set_ns_prefix()
.
-
processing_instruction
(target, data)[source]¶ Add a processing instruction node to the
xml4h.nodes.Element
node represented by this Builder. Returns: the current Builder. Delegates to
xml4h.nodes.Element.add_instruction()
.
-
root
¶ Returns: the xml4h.nodes.Element
root node ancestor of the element represented by this Builder
-
text
(text)[source]¶ Add a text node to the
xml4h.nodes.Element
node represented by this Builder. Returns: the current Builder. Delegates to
xml4h.nodes.Element.add_text()
.
-
transplant
(node)[source]¶ Transplant a node from another document to become a child of the
xml4h.nodes.Element
node represented by this Builder. Returns: a new Builder that represents the current element (not the transplanted node). Delegates to
xml4h.nodes.Node.transplant_node()
.
-
up
(count_or_element_name=1)[source]¶ Returns: a builder representing an ancestor of the current element, by default the parent element. Parameters: count_or_element_name (integer or string) – when an integer, return the nth ancestor element, stopping at the document’s root element; when a string, return the nearest ancestor element with that name, or the document’s root element if there are no matching ancestors. Defaults to the integer value 1, which means the immediate parent.
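The two lookup modes can be sketched with the standard library's minidom. The `up()` function below is a hypothetical stand-alone helper written for illustration; it operates on raw minidom elements rather than Builder instances and is not xml4h's implementation.

```python
from xml.dom import minidom


def up(element, count_or_element_name=1):
    """Return an ancestor element by count or by name (illustrative helper)."""
    root = element.ownerDocument.documentElement
    current = element
    if isinstance(count_or_element_name, int):
        # Climb n levels, stopping early at the document's root element.
        for _ in range(count_or_element_name):
            if current is root:
                break
            current = current.parentNode
        return current
    # Otherwise climb until an ancestor has the requested name.
    while current is not root:
        current = current.parentNode
        if current.tagName == count_or_element_name:
            return current
    return root  # no matching ancestor: fall back to the root element


doc = minidom.parseString("<a><b><c><d/></c></b></a>")
d = doc.getElementsByTagName("d")[0]
parent_name = up(d).tagName            # immediate parent
named_ancestor = up(d, "b").tagName    # nearest ancestor called "b"
```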
-
write
(*args, **kwargs)[source]¶ Write XML bytes for the element represented by this builder.
Delegates to
xml4h.nodes.Node.write()
.
-
write_doc
(*args, **kwargs)[source]¶ Write XML bytes for the Document containing the element represented by this builder.
Delegates to
xml4h.nodes.Node.write_doc()
.
-
xml
(**kwargs)[source]¶ Returns: XML string for the element represented by this builder. Delegates to
xml4h.nodes.Node.xml()
.
-
xml_doc
(**kwargs)[source]¶ Returns: XML string for the Document containing the element represented by this builder. Delegates to
xml4h.nodes.Node.xml_doc()
.
-
Writer¶
Writer to serialize XML DOM documents or sections to text.
-
xml4h.writer.
write_node
(node, writer, encoding='utf-8', indent=0, newline='', omit_declaration=False, node_depth=0, quote_char='"')[source]¶ Serialize an xml4h DOM node and its descendants to text, writing the output to the given writer.
Parameters: - node (an
xml4h.nodes.Node
or subclass) – the DOM node whose content and descendants will be serialized.
- writer (a file, stream, etc) – a file or stream to which XML text is written.
- encoding (string) – the character encoding for serialized text.
- indent (string, int, bool, or None) –
indentation prefix to apply to descendent nodes for pretty-printing. The value can take many forms:
- int: the number of spaces to indent. 0 means no indent.
- string: a literal prefix for indented nodes, such as
\t
.
- bool: no indent if False, four spaces indent if True.
- None: no indent.
- newline (string, bool, or None) –
the string value used to separate lines of output. The value can take a number of forms:
- string: the literal newline value, such as
\n
or \r
. An empty string means no newline.
- bool: no newline if False,
\n
newline if True.
- None: no newline.
- omit_declaration (boolean) – if True the XML declaration header
is omitted, otherwise it is included. Note that the declaration is
only output when serializing an
xml4h.nodes.Document
node.
- node_depth (int) – the indentation level to start at, such as 2 to indent output as if the given node has two ancestors. This parameter will only be useful if you need to output XML text fragments that can be assembled into a document. This parameter has no effect unless indentation is applied.
- quote_char (string) – the character that delimits quoted content. You should never need to mess with this.
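The accepted forms of the indent and newline options can be read as a normalization step followed by simple prefixing. The sketch below is an assumption-laden illustration of that reading: `normalize_indent`, `normalize_newline` and `render` are hypothetical helpers, not functions from xml4h.writer, and `render` works on pre-split (depth, text) pairs rather than DOM nodes.

```python
def normalize_indent(indent):
    """Map the documented indent forms to a literal prefix string."""
    if isinstance(indent, bool):  # check bool first: bool is a subclass of int
        return " " * 4 if indent else ""
    if isinstance(indent, int):
        return " " * indent
    return indent or ""  # a literal string, or None for no indent


def normalize_newline(newline):
    """Map the documented newline forms to a literal separator string."""
    if isinstance(newline, bool):
        return "\n" if newline else ""
    return newline or ""


def render(lines_with_depth, indent=0, newline="", node_depth=0):
    """Join (depth, text) pairs using the normalized options."""
    prefix, sep = normalize_indent(indent), normalize_newline(newline)
    return sep.join(prefix * (node_depth + depth) + text
                    for depth, text in lines_with_depth)


fragment = [(0, "<Film>"), (1, "<Title>Brian</Title>"), (0, "</Film>")]
pretty = render(fragment, indent=True, newline=True)
compact = render(fragment)  # defaults: no indent, no newline
```

With the defaults the prefix is an empty string, which is why a starting depth such as node_depth has no visible effect unless indentation is applied.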
DOM Nodes API¶
-
class
xml4h.nodes.
Attribute
(node, adapter)[source]¶ Node representing an attribute of a
Document
or Element
node.
-
class
xml4h.nodes.
AttributeDict
(attr_impl_nodes, impl_element, adapter)[source]¶ Dictionary-like object of element attributes that always reflects the state of the underlying element node, and that allows for in-place modifications that will immediately affect the element.
-
__init__
(attr_impl_nodes, impl_element, adapter)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
__weakref__
¶ list of weak references to the object (if defined)
-
impl_attributes
¶ Returns: the attribute node objects from the underlying XML implementation.
-
namespace_uri
(name)[source]¶ Parameters: name (string) – the name of an attribute to look up. Returns: the namespace URI associated with the named attribute, or None.
-
prefix
(name)[source]¶ Parameters: name (string) – the name of an attribute to look up. Returns: the prefix component of the named attribute’s name, or None.
-
to_dict
¶ Returns: an OrderedDict
of attribute name/value pairs.
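The idea of a dictionary-like object that always reflects the underlying element can be sketched as a mapping that stores no data of its own. `LiveAttributes` below is a hypothetical class backed by the standard library's minidom; it is not xml4h's AttributeDict and ignores namespaces.

```python
from collections import OrderedDict
from collections.abc import MutableMapping
from xml.dom import minidom


class LiveAttributes(MutableMapping):
    """Hypothetical live view of an element's attributes (minidom-backed)."""

    def __init__(self, element):
        self._element = element

    def __getitem__(self, name):
        if not self._element.hasAttribute(name):
            raise KeyError(name)
        return self._element.getAttribute(name)

    def __setitem__(self, name, value):
        self._element.setAttribute(name, value)  # written straight to the DOM

    def __delitem__(self, name):
        if not self._element.hasAttribute(name):
            raise KeyError(name)
        self._element.removeAttribute(name)

    def __iter__(self):
        return iter(list(self._element.attributes.keys()))

    def __len__(self):
        return len(self._element.attributes.keys())

    def to_dict(self):
        # A detached snapshot, unlike the live mapping itself.
        return OrderedDict(self.items())


film = minidom.parseString('<Film year="1971"/>').documentElement
attrs = LiveAttributes(film)
attrs["rating"] = "PG"            # in-place change reaches the element
film.setAttribute("year", "1975")  # change made on the element is visible
```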
-
-
class
xml4h.nodes.
CDATA
(node, adapter)[source]¶ Node representing character data in an XML document.
-
class
xml4h.nodes.
DocumentFragment
(node, adapter)[source]¶ Node representing an XML document fragment.
-
class
xml4h.nodes.
DocumentType
(node, adapter)[source]¶ Node representing the type of an XML document.
-
class
xml4h.nodes.
Element
(node, adapter)[source]¶ Node representing an element in an XML document, with support for manipulating and adding content to the element.
-
add_cdata
(data)[source]¶ Add a character data node to this element.
Parameters: data (string) – text content to add as character data.
-
add_comment
(text)[source]¶ Add a comment node to this element.
Parameters: text (string) – text content to add as a comment.
-
add_element
(name, ns_uri=None, attributes=None, text=None, before_this_element=False)[source]¶ Add a new child element to this element, with an optional namespace definition. If no namespace is provided the child will be assigned to the default namespace.
Parameters: - name (string) –
a name for the child node. The name may be used to apply a namespace to the child by including:
- a prefix component in the name of the form
ns_prefix:element_name
, where the prefix has already been defined for a namespace URI (such as via set_ns_prefix()
).
- a literal namespace URI value delimited by curly braces, of
the form
{ns_uri}element_name
.
- ns_uri (string or None) – a URI specifying the new element’s namespace. If the
name
parameter specifies a namespace this parameter is ignored.
- attributes (dict, list, tuple, or None) – collection of attributes to assign to the new child.
- text (string or None) – text value to assign to the new child.
- before_this_element (bool) – if True the new element is added as a sibling preceding this element, instead of as a child. In other words, the new element will be a child of this element’s parent node, and will immediately precede this element in the DOM.
Returns: the new child as an
Element
node.
-
add_instruction
(target, data)[source]¶ Add an instruction node to this element.
Parameters: - target (string) – the target of the instruction. - data (string) – text content to add as an instruction.
-
add_text
(text)[source]¶ Add a text node to this element.
Adding text with this method is subtly different from assigning a new text value with the
text
accessor, because it appends to, rather than replaces, this element’s set of text nodes.
Parameters: text (string, or anything that can be coerced by
unicode()
) – text content to add to this element.
-
attrib
¶ Alias of
attributes()
-
attribute_node
(name, ns_uri=None)[source]¶ Parameters: - name (string) – the name of the attribute to return.
- ns_uri (string or None) – a URI defining a namespace constraint on the attribute.
Returns: this element’s attributes that match
ns_uri
as Attribute
nodes.
-
attributes
¶ Get or set this element’s attributes as name/value pairs.
Note
Setting element attributes via this accessor will remove any existing attributes, as opposed to the
set_attributes()
method which only updates and replaces them.
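The difference between the two behaviours in the note can be shown on a raw minidom element. The helper names `update_attributes`, `replace_attributes` and `attr_dict` are hypothetical and stand in for the two xml4h behaviours; they are not part of the library.

```python
from xml.dom import minidom


def update_attributes(element, new_attrs):
    """Add or overwrite attributes, keeping any others (set_attributes-like)."""
    for name, value in new_attrs.items():
        element.setAttribute(name, value)


def replace_attributes(element, new_attrs):
    """Drop all existing attributes first (attributes-accessor-like)."""
    for name in list(element.attributes.keys()):
        element.removeAttribute(name)
    update_attributes(element, new_attrs)


def attr_dict(element):
    """Snapshot the element's attributes as a plain dict."""
    return dict(element.attributes.items())


film = minidom.parseString('<Film year="1971" rating="PG"/>').documentElement
update_attributes(film, {"year": "1975"})
after_update = attr_dict(film)    # "rating" survives the update
replace_attributes(film, {"title": "Holy Grail"})
after_replace = attr_dict(film)   # only the new attribute remains
```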
-
attrs
¶ Alias of
attributes()
-
builder
¶ Returns: a Builder
representing this element with convenience methods for adding XML content.
-
set_attributes
(attr_obj=None, ns_uri=None, **attr_dict)[source]¶ Add or update this element’s attributes, where attributes can be specified in a number of ways.
Parameters: - attr_obj (dict, list, tuple, or None) – a dictionary or list of attribute name/value pairs.
- ns_uri (string or None) – a URI defining a namespace for the new attributes.
- attr_dict (dict) – attribute name and values specified as keyword arguments.
-
set_ns_prefix
(prefix, ns_uri)[source]¶ Define a namespace prefix that will serve as shorthand for the given namespace URI in element names.
Parameters: - prefix (string) – prefix that will serve as an alias for the namespace URI.
- ns_uri (string) – namespace URI that will be denoted by the prefix.
-
text
¶ Get or set the text content of this element.
-
-
class
xml4h.nodes.
EntityReference
(node, adapter)[source]¶ Node representing an entity reference in an XML document.
-
class
xml4h.nodes.
NameValueNodeMixin
(node, adapter)[source]¶ Provide methods to access node name and value attributes, where the node name may also be composed of “prefix” and “local” components.
-
local_name
¶ Returns: the local component of a node name excluding any prefix.
-
name
¶ Get the name of a node, possibly including prefix and local components.
-
prefix
¶ Returns: the namespace prefix component of a node name, or None.
-
value
¶ Get or set the value of a node.
-
-
class
xml4h.nodes.
Node
(node, adapter)[source]¶ Base class for xml4h DOM nodes that represent and interact with a node in the underlying XML implementation.
-
XMLNS_URI
= 'http://www.w3.org/2000/xmlns/'¶ URI constant for XMLNS
-
__init__
(node, adapter)[source]¶ Construct an object that represents and wraps a DOM node in the underlying XML implementation.
Parameters: - node – node object from the underlying XML implementation.
- adapter – the
xml4h.impls.XmlImplAdapter
subclass implementation to mediate operations on the node in the underlying XML implementation.
-
__weakref__
¶ list of weak references to the object (if defined)
-
_convert_nodelist
(impl_nodelist)[source]¶ Convert a list of underlying implementation nodes into a list of xml4h wrapper nodes.
-
adapter
¶ Returns: the xml4h.impls.XmlImplAdapter
subclass implementation that mediates operations on the node in the underlying XML implementation.
-
adapter_class
¶ Returns: the class
of the xml4h.impls.XmlImplAdapter
subclass implementation that mediates operations on the node in the underlying XML implementation.
-
ancestors
¶ Returns: the ancestors of this node in a list ordered by proximity to this node, that is: parent, grandparent, great-grandparent etc.
-
child
(local_name=None, name=None, ns_uri=None, node_type=None, filter_fn=None)[source]¶ Returns: the first child node matching the given constraints, or None if there are no matching child nodes. Delegates to
NodeList.filter()
.
-
clone_node
(node)[source]¶ Clone a node from another document to become a child of this node, by copying the node’s data into this document but leaving the node untouched in the source document. The node to be cloned can be a
Node
based on the same underlying XML library implementation and adapter, or a “raw” node from that implementation. Parameters: node (xml4h or implementation node) – the node in another document to clone.
-
delete
(destroy=True)[source]¶ Delete this node from the owning document.
Parameters: destroy (bool) – if True the child node will be destroyed in addition to being removed from the document. Returns: the removed child node, or None if the child was destroyed.
-
find
(name=None, ns_uri=None, first_only=False)[source]¶ Find
Element
node descendants of this node, with optional constraints to limit the results.
Parameters: - name (string or None) – limit results to elements with this name.
If None or
'*'
all element names are matched.
- ns_uri (string or None) – limit results to elements within this namespace URI. If None all elements are matched, regardless of namespace.
- first_only (bool) – if True only return the first result node or None if there is no matching node.
Returns: a list of
Element
nodes matching any given constraints, or a single node if first_only=True
.
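The search constraints can be illustrated against the standard library's ElementTree. The `find()` function below is a hypothetical stand-alone sketch, not xml4h's method; as a simplifying assumption it compares `name` with each element's local name only.

```python
import xml.etree.ElementTree as ET


def find(node, name=None, ns_uri=None, first_only=False):
    """Search descendant elements with optional name/namespace constraints."""
    results = []
    for elem in node.iter():
        if elem is node:
            continue  # descendants only, not the starting node
        # ElementTree stores namespaced tags as "{uri}local".
        if elem.tag.startswith("{"):
            uri, local = elem.tag[1:].split("}", 1)
        else:
            uri, local = None, elem.tag
        if name not in (None, "*") and local != name:
            continue
        if ns_uri is not None and uri != ns_uri:
            continue
        if first_only:
            return elem
        results.append(elem)
    return None if first_only else results


root = ET.fromstring(
    '<Films xmlns:x="urn:extra">'
    "<Film><Title>A</Title></Film>"
    "<x:Film><Title>B</Title></x:Film>"
    "</Films>"
)
all_films = find(root, "Film")                      # any namespace
extra_films = find(root, "Film", ns_uri="urn:extra")
first_title = find(root, "Title", first_only=True)
```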
-
find_doc
(name=None, ns_uri=None, first_only=False)[source]¶ Find
Element
node descendants of the document containing this node, with optional constraints to limit the results. Delegates to
find()
applied to this node’s owning document.
-
find_first
(name=None, ns_uri=None)[source]¶ Find the first
Element
node descendant of this node that matches any optional constraints, or None if there are no matching elements. Delegates to
find()
with first_only=True
.
-
has_feature
(feature_name)[source]¶ Returns: True if a named feature is supported by the adapter implementation underlying this node.
-
impl_document
¶ Returns: the document object from the underlying XML implementation that contains the node represented by this xml4h node.
-
impl_node
¶ Returns: the node object from the underlying XML implementation that is represented by this xml4h node.
-
is_document_fragment
¶ Returns: True if this is a DocumentFragment
node.
-
is_document_type
¶ Returns: True if this is a DocumentType
node.
-
is_entity_reference
¶ Returns: True if this is an EntityReference
node.
-
is_processing_instruction
¶ Returns: True if this is a ProcessingInstruction
node.
-
is_root
¶ Returns: True if this node is the document’s root element
-
namespace_uri
¶ Returns: this node’s namespace URI or None.
-
node_type
¶ Returns: an int constant value that identifies the type of this node, such as ELEMENT_NODE
or TEXT_NODE
.
-
ns_uri
¶ Alias for
namespace_uri()
-
parent
¶ Returns: the parent of this node, or None if the node has no parent.
-
root
¶ Returns: the root Element
node of the document that contains this node, or self
if this node is the root element.
-
siblings_after
¶ Returns: a list of this node’s siblings that occur after this node in the DOM.
-
siblings_before
¶ Returns: a list of this node’s siblings that occur before this node in the DOM.
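The ordering rules for ancestors and siblings can be sketched with minidom. These three functions are hypothetical helpers on raw minidom nodes, not xml4h properties; as an assumption, `ancestors()` here stops at the root element and does not include the document node.

```python
from xml.dom import minidom


def ancestors(node):
    """Parent first, then grandparent, and so on up to the root element."""
    found = []
    parent = node.parentNode
    while parent is not None and parent.nodeType == parent.ELEMENT_NODE:
        found.append(parent)
        parent = parent.parentNode
    return found


def siblings_before(node):
    """Siblings that occur earlier than node under the same parent."""
    siblings = node.parentNode.childNodes
    return list(siblings[:siblings.index(node)])


def siblings_after(node):
    """Siblings that occur later than node under the same parent."""
    siblings = node.parentNode.childNodes
    return list(siblings[siblings.index(node) + 1:])


doc = minidom.parseString("<a><b/><c><d/></c><e/></a>")
c = doc.getElementsByTagName("c")[0]
d = doc.getElementsByTagName("d")[0]
before_names = [n.tagName for n in siblings_before(c)]
after_names = [n.tagName for n in siblings_after(c)]
ancestor_names = [n.tagName for n in ancestors(d)]
```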
-
transplant_node
(node)[source]¶ Transplant a node from another document to become a child of this node, removing it from the source document. The node to be transplanted can be a
Node
based on the same underlying XML library implementation and adapter, or a “raw” node from that implementation. Parameters: node (xml4h or implementation node) – the node in another document to transplant.
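The contrast between cloning and transplanting can be shown with minidom's `importNode()`. The two functions below are hypothetical stand-alone helpers that share names with the xml4h methods for readability only; they are a sketch of the described behaviour, not the library's code.

```python
from xml.dom import minidom


def clone_node(target_parent, node):
    """Copy node into target_parent's document; the source is untouched."""
    copy = target_parent.ownerDocument.importNode(node, True)  # deep copy
    target_parent.appendChild(copy)
    return copy


def transplant_node(target_parent, node):
    """Move node into target_parent's document, removing it from the source."""
    copy = clone_node(target_parent, node)
    node.parentNode.removeChild(node)
    return copy


source = minidom.parseString("<src><x>1</x><y>2</y></src>")
target = minidom.parseString("<dst/>")
clone_node(target.documentElement, source.getElementsByTagName("x")[0])
transplant_node(target.documentElement, source.getElementsByTagName("y")[0])
source_xml = source.documentElement.toxml()  # still has <x>, lost <y>
target_xml = target.documentElement.toxml()  # gained both
```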
-
write
(writer, encoding='utf-8', indent=0, newline='', omit_declaration=False, node_depth=0, quote_char='"')[source]¶ Serialize this node and its descendants to text, writing the output to the given writer.
Parameters: - writer (a file, stream, etc) – a file or stream to which XML text is written.
- encoding (string) – the character encoding for serialized text.
- indent (string, int, bool, or None) –
indentation prefix to apply to descendent nodes for pretty-printing. The value can take many forms:
- int: the number of spaces to indent. 0 means no indent.
- string: a literal prefix for indented nodes, such as
\t
.
- bool: no indent if False, four spaces indent if True.
- None: no indent.
- newline (string, bool, or None) –
the string value used to separate lines of output. The value can take a number of forms:
- string: the literal newline value, such as
\n
or \r
. An empty string means no newline.
- bool: no newline if False,
\n
newline if True.
- None: no newline.
- omit_declaration (boolean) – if True the XML declaration header
is omitted, otherwise it is included. Note that the declaration is
only output when serializing an
xml4h.nodes.Document
node.
- node_depth (int) – the indentation level to start at, such as 2 to indent output as if the given node has two ancestors. This parameter will only be useful if you need to output XML text fragments that can be assembled into a document. This parameter has no effect unless indentation is applied.
- quote_char (string) – the character that delimits quoted content. You should never need to mess with this.
Delegates to
xml4h.writer.write_node()
applied to this node.
-
write_doc
(writer, *args, **kwargs)[source]¶ Serialize to text the document containing this node, writing the output to the given writer.
Parameters: writer (a file, stream, etc) – a file or stream to which XML text is written. Delegates to
write()
-
-
class
xml4h.nodes.
NodeAttrAndChildElementLookupsMixin
[source]¶ Perform “magical” lookup of a node’s attributes via dict-style keyword reference, and child elements via class attribute reference.
-
__getattr__
(child_name)[source]¶ Retrieve this node’s child element by tag name regardless of the element’s namespace, assuming the name given doesn’t match an existing attribute or method.
Parameters: child_name (string) – tag name of the child element to look up. To avoid name clashes with class attributes the child name may include a trailing underscore ( _
) character, which is removed to get the real child tag name. The child name must not begin with underscore characters. Returns: the type of the return value depends on how many child elements match the name: Raise: AttributeError if the node has no child element with the given name, or if the given name does not match the required pattern.
-
__getitem__
(attr_name)[source]¶ Retrieve this node’s attribute value by name using dict-style keyword lookup.
Parameters: attr_name (string) – name of the attribute. If the attribute has a namespace prefix that must be included, in other words the name must be a qname not local name. Raise: KeyError if the node has no such attribute.
-
__weakref__
¶ list of weak references to the object (if defined)
-
-
class
xml4h.nodes.
NodeList
[source]¶ Custom implementation for
Node
lists that provides additional functionality, such as node filtering.
-
__call__
(local_name=None, name=None, ns_uri=None, node_type=None, filter_fn=None, first_only=False)¶ Alias for
filter()
.
-
__weakref__
¶ list of weak references to the object (if defined)
-
filter
(local_name=None, name=None, ns_uri=None, node_type=None, filter_fn=None, first_only=False)[source]¶ Apply filters to the set of nodes in this list.
Parameters: - local_name (string or None) – a local name used to filter the nodes.
- name (string or None) – a name used to filter the nodes.
- ns_uri (string or None) – a namespace URI used to filter the nodes. If None all nodes are returned regardless of namespace.
- node_type (int node type constant, class, or None) – a node type definition used to filter the nodes.
- filter_fn (function or None) –
an arbitrary function to filter nodes in this list. This function must accept a single
Node
argument and return a bool indicating whether to include the node in the filtered results.
Note
If
filter_fn
is provided, all other filter arguments are ignored.
Returns: the type of the return value depends on the value of the
first_only
parameter and how many nodes match the filter:
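A filterable list of this kind can be sketched as a `list` subclass. `FilterableList` is a hypothetical class that filters plain dictionaries, not xml4h nodes, and supports only a `name` constraint. The return behaviour for `first_only` (a single item, or None when nothing matches) is an assumption modelled on the description of `find()`, not a statement about xml4h's NodeList.

```python
class FilterableList(list):
    """Hypothetical list with filtering (not xml4h's NodeList)."""

    def filter(self, name=None, filter_fn=None, first_only=False):
        if filter_fn is None:
            # Build a predicate from the simple constraint.
            def filter_fn(node):
                return name is None or node["name"] == name
        # When filter_fn is supplied, the name constraint is ignored.
        matches = FilterableList(n for n in self if filter_fn(n))
        if first_only:
            # Assumption: a single item, or None when nothing matches.
            return matches[0] if matches else None
        return matches

    __call__ = filter  # calling the list is an alias for filter()


nodes = FilterableList([
    {"name": "Film", "year": 1971},
    {"name": "Note", "year": 0},
    {"name": "Film", "year": 1975},
])
films = nodes.filter(name="Film")
first_film = nodes(name="Film", first_only=True)
custom = nodes.filter(name="Film", filter_fn=lambda n: n["year"] == 0)
```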
-
first
¶ Returns: the first of the available children nodes, or None if there are no children.
-
-
class
xml4h.nodes.
ProcessingInstruction
(node, adapter)[source]¶ Node representing a processing instruction in an XML document.
-
data
¶ Get or set the value of a node.
-
target
¶ Get the name of a node, possibly including prefix and local components.
-
-
class
xml4h.nodes.
XPathMixin
[source]¶ Provide
xpath()
method to nodes that support XPath searching.
-
__weakref__
¶ list of weak references to the object (if defined)
-
xpath
(xpath, **kwargs)[source]¶ Perform an XPath query on the current node.
Parameters: - xpath (string) – XPath query.
- kwargs (dict) – Optional keyword arguments that are passed through to the underlying XML library implementation.
Returns: results of the query as a list of
Node
objects, or a list of base type objects if the XPath query does not reference node objects.
-
XML Library Adapters¶
-
class
xml4h.impls.interface.
XmlImplAdapter
(document)[source]¶ Base class that defines how xml4h interacts with an underlying XML library that the adapter “wraps” to provide additional (or at least different) functionality.
This class should be treated as an abstract class. It provides some common implementation code used by all xml4h adapter implementations, but mostly it sketches out the methods the real implementation subclasses must provide.
-
clear_caches
()[source]¶ Clear any in-adapter cached data, for cases where cached data could become outdated e.g. by making DOM changes directly outside of xml4h.
This is a no-op if the implementing adapter has no cached data.
-
find_node_elements
(node, name='*', ns_uri='*')[source]¶ Returns: element node descendents of the given node that match the search constraints.
Parameters: - node – a node object from the underlying XML library.
- name (string) – only elements with a matching name will be
returned. If the value is
*
all names will match. - ns_uri (string) – only elements with a matching namespace URI
will be returned. If the value is
*
all namespaces will match.
-
get_ns_info_from_node_name
(name, impl_node)[source]¶ Return a three-element tuple with the prefix, local name, and namespace URI for the given element/attribute name (in the context of the given node’s hierarchy). If the name has no associated prefix or namespace information, None is returned for those tuple members.
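The shape of the returned tuple can be sketched with a stand-alone parser for the `prefix:local` and `{ns_uri}local` name forms. `ns_info_from_name` is a hypothetical helper: instead of inspecting a node's hierarchy as the adapter method does, it takes an explicit prefix-to-URI mapping, which is an assumption made to keep the example self-contained.

```python
def ns_info_from_name(name, prefix_to_uri):
    """Return (prefix, local_name, ns_uri) for an element or attribute name."""
    if name.startswith("{"):
        # Literal namespace form: "{ns_uri}local_name".
        ns_uri, local_name = name[1:].split("}", 1)
        # Recover a prefix if one is mapped to this URI.
        prefix = next(
            (p for p, uri in prefix_to_uri.items() if uri == ns_uri), None
        )
        return prefix, local_name, ns_uri
    if ":" in name:
        # Prefixed form: "ns_prefix:local_name".
        prefix, local_name = name.split(":", 1)
        return prefix, local_name, prefix_to_uri.get(prefix)
    # Plain name: no prefix or namespace information.
    return None, name, None


known = {"x": "urn:extra"}
prefixed = ns_info_from_name("x:Film", known)
braced = ns_info_from_name("{urn:extra}Film", known)
plain = ns_info_from_name("Film", known)
```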
-
classmethod
has_feature
(feature_name)[source]¶ Returns: True if a named feature is supported by this adapter.
-
-
class
xml4h.impls.lxml_etree.
LXMLAdapter
(document)[source]¶ Adapter to the lxml XML library implementation.
-
find_node_elements
(node, name='*', ns_uri='*')[source]¶ Returns: element node descendents of the given node that match the search constraints.
Parameters: - node – a node object from the underlying XML library.
- name (string) – only elements with a matching name will be
returned. If the value is
*
all names will match. - ns_uri (string) – only elements with a matching namespace URI
will be returned. If the value is
*
all namespaces will match.
-
classmethod
is_available
()[source]¶ Returns: True if this adapter’s underlying XML library is available in the Python environment.
-
xpath_on_node
(node, xpath, **kwargs)[source]¶ Return result of performing the given XPath query on the given node.
All known namespace prefix-to-URI mappings in the document are automatically included in the XPath invocation.
If an empty/default namespace (i.e. None) is defined, this is converted to the prefix name ‘_’ so it can be used despite empty namespace prefixes being unsupported by XPath.
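The default-namespace substitution can be illustrated with ElementTree's limited XPath support from the standard library. `xpath_namespaces` is a hypothetical helper and the prefix map is written out by hand here, whereas the adapter is described as collecting the document's mappings automatically.

```python
import xml.etree.ElementTree as ET


def xpath_namespaces(prefix_to_uri):
    """Copy a prefix map, renaming the default (None) prefix to '_'."""
    return {("_" if prefix is None else prefix): uri
            for prefix, uri in prefix_to_uri.items()}


root = ET.fromstring(
    '<Films xmlns="urn:films" xmlns:x="urn:extra">'
    "<Film/><x:Film/><Film/>"
    "</Films>"
)
namespaces = xpath_namespaces({None: "urn:films", "x": "urn:extra"})
# The "_" prefix now selects elements in the document's default namespace.
default_ns_films = root.findall("_:Film", namespaces)
extra_ns_films = root.findall("x:Film", namespaces)
```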
-
-
class
xml4h.impls.xml_etree_elementtree.
ElementTreeAdapter
(document)[source]¶ Adapter to the ElementTree XML library.
This code must work with either the base ElementTree pure python implementation or the C-based cElementTree implementation, since it is reused in the cElementTree class defined below.
-
ET
= <module 'xml.etree.ElementTree' from '/home/docs/.pyenv/versions/3.7.3/lib/python3.7/xml/etree/ElementTree.py'>¶
-
clear_caches
()[source]¶ Clear any in-adapter cached data, for cases where cached data could become outdated e.g. by making DOM changes directly outside of xml4h.
This is a no-op if the implementing adapter has no cached data.
-
find_node_elements
(node, name='*', ns_uri='*')[source]¶ Returns: element node descendents of the given node that match the search constraints.
Parameters: - node – a node object from the underlying XML library.
- name (string) – only elements with a matching name will be
returned. If the value is
*
all names will match. - ns_uri (string) – only elements with a matching namespace URI
will be returned. If the value is
*
all namespaces will match.
-
classmethod
is_available
()[source]¶ Returns: True if this adapter’s underlying XML library is available in the Python environment.
-
xpath_on_node
(node, xpath, **kwargs)[source]¶ Return result of performing the given XPath query on the given node.
All known namespace prefix-to-URI mappings in the document are automatically included in the XPath invocation.
If an empty/default namespace (i.e. None) is defined, this is converted to the prefix name ‘_’ so it can be used despite empty namespace prefixes being unsupported by XPath.
-
-
class
xml4h.impls.xml_etree_elementtree.
cElementTreeAdapter
(document)[source]¶ Adapter to the C-based implementation of the ElementTree XML library.
-
class
xml4h.impls.xml_dom_minidom.
XmlDomImplAdapter
(document)[source]¶ Adapter to the minidom XML library implementation.
-
find_node_elements
(node, name='*', ns_uri='*')[source]¶ Returns: element node descendents of the given node that match the search constraints.
Parameters: - node – a node object from the underlying XML library.
- name (string) – only elements with a matching name will be
returned. If the value is
*
all names will match. - ns_uri (string) – only elements with a matching namespace URI
will be returned. If the value is
*
all namespaces will match.
-
Custom Exceptions¶
Custom xml4h exceptions.
User has attempted to use a feature that is available in some xml4h implementations/adapters, but is not available in the current one.
-
exception
xml4h.exceptions.
IncorrectArgumentTypeException
(arg, expected_types)[source]¶ Richer flavour of a ValueError that describes exactly what argument types are expected.
-
exception
xml4h.exceptions.
UnknownNamespaceException
[source]¶ User has attempted to refer to an unknown or undeclared namespace by prefix or URI.