back to XmlBlob main page
About XPath
XPath is a query language for selecting nodes from an XML document. It's a well known standard defined by the W3CThe VirtualXPath module enables SpatiaLite to directly perform XPath queries on behalf of XmlBlobs.
You can easily find on the WEB many quick references about XPath; the most comprehensive and authoritative (but not necessarily the easiest one) is the official standard reference.
A quick practical intro
We'll start as usual by examining few practical examples and introducing some useful comment where appropriate.Just a very short introduction: an XmlDocument always has a tree structure formed by parent- and child-nodes, and surely has a unique root-node. This strongly resembles the very familiar file-system structure, where a ramified hierarchy of directories aka folders (i.e. parent-nodes) contains many files (i.e. child-nodes).
And not at all surprisingly, the XPath design is closely modelled on pathnames; it's not exactly the same (many relevant differences exist), but you'll be pleasantly surprised by the many familiar analogies you'll find.
Preparing a test-DB
- create a new empty DB-file.
- then import all XML samples from the books folder; performing a full XML validation isn't absolutely necessary in this case.
- be sure to import all XML samples into a DB table named books.
CREATE VIRTUAL TABLE xpath_books USING VirtualXPath( 'books', 'xml_document' );Now you simply have to create a further VirtualXPath table based on the table you've just now created by importing all the XmlDocuments.
The first argument 'books' identifies the main table name, and the second one 'xml_document' identifies the column containing the XmlBlobs.
All VirtualXPath tables always contain the following columns:
- pkid is the corresponding Primary Key value into the main table.
- sub is an ordinal number within the same pkid-value (a single XPath expression may eventually have more than a single match into the same XmlDocument).
- parent is the name of the parent-node (or NULL, in the case of the root-node).
- node is the name of the node itself corresponding to some XPath positive match.
- attribute is the name of the matching attribute (or NULL, if no argument is involved).
- value is the node- or attribute-value matching the XPath expression.
- xpath_expr is the XPath expression being currently queried.
Your first XPath query
SELECT * FROM xpath_books WHERE xpath_expr = '/Books/author/title' LIMIT 5;
pkid | sub | parent | node | attribute | value | xpath_expr |
---|---|---|---|---|---|---|
1 | 0 | author | title | NULL | NULL | /Books/author/title |
1 | 1 | author | title | NULL | NULL | /Books/author/title |
1 | 2 | author | title | NULL | NULL | /Books/author/title |
1 | 3 | author | title | NULL | NULL | /Books/author/title |
1 | 4 | author | title | NULL | NULL | /Books/author/title |
- the WHERE xpath_expr = expr clause instructs the VirtualXPath module to identify all XML-nodes matching the required XPath expression.
- The XPath expression '/Books/author/title' simply means:
- starting from the root-node named Books
- identify any child-node named author containing on its turn a further child-node named title.
SELECT * FROM xpath_books WHERE xpath_expr = '/Books/author/title/text()' LIMIT 5;
pkid | sub | parent | node | attribute | value | xpath_expr |
---|---|---|---|---|---|---|
1 | 0 | author | title | NULL | Buddenbrooks - Verfall einer Familie | /Books/author/title/text() |
1 | 1 | author | title | NULL | Königliche Hoheit | /Books/author/title/text() |
1 | 2 | author | title | NULL | Bekenntnisse des Hochstaplers Felix Krull | /Books/author/title/text() |
1 | 3 | author | title | NULL | Der Tod in Venedig | /Books/author/title/text() |
1 | 4 | author | title | NULL | Der Zauberberg | /Books/author/title/text() |
So it's now time to introduce the XPath text() operator; this is intended to fetch the inner text enclosed between a begin- and the corresponding close-tag: e.g. supposing an XML fragment like <someTag>somevalue</someTag> the XPath expression someTag/text() will then return someValue. And effectively this second query will now allow you to examine all book titles contained within the imported XmlDocuments.
SELECT * FROM xpath_books WHERE xpath_expr = '/Books/author/title/text()' ORDER BY value DESC LIMIT 5 OFFSET 8;
pkid | sub | parent | node | attribute | value | xpath_expr |
---|---|---|---|---|---|---|
7 | 2 | author | title | NULL | Крейцерова соната | /Books/author/title/text() |
7 | 0 | author | title | NULL | Война и мир | /Books/author/title/text() |
7 | 1 | author | title | NULL | Анна Каренина | /Books/author/title/text() |
5 | 10 | author | title | NULL | Τρώαδες | /Books/author/title/text() |
5 | 13 | author | title | NULL | Σφῆκες | /Books/author/title/text() |
That's not all: you can eventually add an ORDER BY clause, so to get all titles lexicographically ordered.
SELECT t.file_name, x.parent, x.node, x.value FROM xpath_books AS x JOIN books AS t ON (t.pk_uid = x.pkid) WHERE xpath_expr = '/Books/author/title/text()' AND t.file_name = 'books_it.xml';
file_name | parent | node | value |
---|---|---|---|
books_it.xml | author | title | I promessi sposi |
books_it.xml | author | title | I Malavoglia |
books_it.xml | author | title | Mastro Don Gesualdo |
books_it.xml | author | title | Il fu Mattia Pascal |
books_it.xml | author | title | Uno, nessuno e centomila |
And you can even JOIN the main table and the corresponding VirtualXPath table: in this example we are now showing all book titles from some Italian author.
More elaborate XPath queries
SELECT t.file_name, x.parent, x.node, x.value FROM xpath_books AS x JOIN books AS t ON (t.pk_uid = x.pkid) WHERE xpath_expr = '//*[text()="Oliver Twist"]/text()';
file_name | parent | node | value |
---|---|---|---|
books_en.xml | author | title | Oliver Twist |
- this new XPath expression doesn't looks at all like an ordinary pathname. Some further explanation is surely required: let's examine all elements one at each time:
- // intends in any tree-position, not only starting from the root.
- * obviously is a wild-card, intending any node-name
- [ ... ] are two special markers enclosing a conditional expression, intending include only the nodes presenting a matching value.
- all together now: the XPath expression '//*[text()="Oliver Twist"]/text()' effectively means:
- retrieve any string-value equal to "Oliver Twist", indifferently from the node-name and not considering its hierarchical position on the three.
SELECT t.file_name, x.parent, x.node, x.value FROM xpath_books AS x JOIN books AS t ON (t.pk_uid = x.pkid) WHERE xpath_expr = '//title[starts-with(., "Les")]/text()';
file_name | parent | node | value |
---|---|---|---|
books_fr.xml | author | title | Les Bijoux indiscrets |
books_fr.xml | author | title | Les Misérables |
But XPath supports many other useful string-comparison operators: start-with(str1, str2) evaluates to TRUE only when srt1 starts with str2.
The single-dot in this case simply intends the node-value itself (in pure pathname fashion).
SELECT t.file_name, x.parent, x.node, x.value FROM xpath_books AS x JOIN books AS t ON (t.pk_uid = x.pkid) WHERE xpath_expr = '//title[contains(., " of ")]/text()';
file_name | parent | node | value |
---|---|---|---|
books_en.xml | author | title | The Luck of Barry Lyndon |
books_en.xml | author | title | Heart of Darkness |
And contains(str1, str2) evaluates to TRUE only when srt1 contains str2. So XPath offers the rough equivalent of the LIKE SQL operator.
Nodes and Attributes
SELECT pkid, sub, parent, node, attribute, value FROM xpath_books WHERE xpath_expr = '//*/@lastName' ORDER BY value DESC LIMIT 5;
pkid | sub | parent | node | attribute | value |
---|---|---|---|---|---|
7 | 0 | Books | author | lastName | Толстой |
7 | 2 | Books | author | lastName | Достоевский |
7 | 1 | Books | author | lastName | Го́голь |
3 | 2 | Books | author | lastName | de Molina |
3 | 0 | Books | author | lastName | de Cervantes |
- the special sign @ marks an attribute-name.
SELECT pkid, sub, parent, node, value FROM xpath_books WHERE xpath_expr = '//*[@firstName="Σοφοκλῆς"]/title/text()' ORDER BY value;
pkid | sub | parent | node | value |
---|---|---|---|---|
5 | 2 | author | title | Oἰδίπoυς ἐπὶ Κολωνῷ |
5 | 1 | author | title | Οιδίπoυς τύραννoς |
5 | 0 | author | title | Ἀντιγόνη |
5 | 3 | author | title | Ἠλέκτρα |
You can obviously apply some conditional expression to an attribute value as well, exactly as we've already seen before for the node-value case.
In this example we've just extracted the list of Sophocles' tragedies.
SELECT pkid, sub, parent, node, value FROM xpath_books WHERE xpath_expr = '//title[contains(., "Katharina Blum")]/../@lastName' ORDER BY value;
pkid | sub | parent | node | value |
---|---|---|---|---|
1 | 0 | Books | author | Böll |
And in this final example we are going to discover who is the author of a novel about Katharina Blum:
- the double-dot simply intending go back to the parent-node, in pure pathname fashion.
XPath and XML namespaces
- now you have to import the XML sample named opera.xml from the misc folder; performing a full XML validation isn't strictly necessary.
- be sure to import this XML sample into a DB table named opera.
- then create and XPath Virtual Table named xpath_opera supporting the above table.
SELECT * FROM xpath_opera WHERE xpath_expr = '//*' LIMIT 3;
pkid | sub | parent | node | attribute | value | xpath_expr |
---|---|---|---|---|---|---|
1 | 0 | NULL | op:Opera | NULL | NULL | //* |
1 | 1 | op:Opera | op:author | NULL | NULL | //* |
1 | 2 | op:author | op:title | NULL | NULL | //* |
Please note well: now all nodes have an op: prefix; this is because this XmlDocuments declares a namespace, and XPath is very pedantic about namespaces.
xmlns:op="http://www.gaia-gis.it/opera"
- You can check the original XmlDocument by yourself; the above XML statement declares an XML namespace (xmlns).
- strictly speaking in technical terms, an XML namespace corresponds to an URI; in this case http://www.gaia-gis.it/opera (please note: this URI doesn't have to necessarily exist; it simply has to be a unique name identifier).
- for practical reasons, this URI corresponds to a short-prefix (xmlns:op); so for any further scope op and http://www.gaia-gis.it/opera are one and the same.
Anyway using the short-prefix is by far more practical than using the full URI.
SELECT sub, parent, node, value FROM xpath_opera WHERE xpath_expr = '//title/text()';Please note well: this query will return absolutely no row: this is because from the XPath own perspective there is absolutely no node being named title.
SELECT sub, parent, node, value FROM xpath_opera WHERE xpath_expr = '//op:title/text()' LIMIT 3;
sub | parent | node | value |
---|---|---|---|
0 | op:author | op:title | Rigoletto |
1 | op:author | op:title | Il trovatore |
2 | op:author | op:title | a traviata |
This second query instead works exactly as expected, because this time we are now correctly passing an XPath expression defining the fully qualified node-name, that is op:title (including the namespace prefix).
The cursed default namespace
- now you have to import the XML sample named movies.xml from the misc folder; performing a full XML validation isn't strictly necessary.
- be sure to import this XML sample into a DB table named movies.
- then create and XPath Virtual Table named xpath_movies supporting the above table.
SELECT * FROM xpath_movies WHERE xpath_expr = '//*' LIMIT 3;
pkid | sub | parent | node | attribute | value | xpath_expr |
---|---|---|---|---|---|---|
1 | 0 | NULL | dflt:Movies | NULL | NULL | //* |
1 | 1 | dflt:Movies | dflt:director | NULL | NULL | //* |
1 | 2 | dflt:director | dflt:movie | NULL | NULL | //* |
Please note well: now all nodes have a dflt: prefix; this is because this XmlDocuments declares yet again a namespace, but this time we have to face the doomed default namespace.
xmlns="http://www.gaia-gis.it/movies"
- You can check the original XmlDocument by yourself; the above XML statement declares an XML namespace (xmlns) corresponding to http://www.gaia-gis.it/movies
- but there is no corresponding prefix for this namespace: it's the infamous default namespace (i.e. lacking any prefix).
- this one is a fully legitimate XML option; but on the other side XPath strictly pretends that any namespaces absolutely has to be identified by its own prefix.
- so Spatialite/VirtualXPath in this case automatically sets a dflt: prefix, otherwise XPath would be absolutely unable to evaluate any expression for these nodes.
SELECT pkid, sub, parent, node, value FROM xpath_movies WHERE xpath_expr = '//*[@lastName="Weir"]/dflt:movie/@title' ORDER BY value;
pkid | sub | parent | node | value |
---|---|---|---|---|
1 | 1 | dflt:director | dflt:movie | Gallipoli |
1 | 2 | dflt:director | dflt:movie | Master and Commander: The Far Side of the World |
1 | 0 | dflt:director | dflt:movie | Picnic at Hanging Rock |
The above XPath query will correctly return all movies directed by Peter Weir.
SELECT pkid, sub, parent, node, value FROM xpath_movies WHERE xpath_expr = '//dflt:movie[@title="Pulp Fiction"]/../@lastName' ORDER BY value;
pkid | sub | parent | node | value |
---|---|---|---|---|
1 | 0 | dflt:Movies | dflt:director | Tarantino |
And this second query will help us in discovering that Quentin Tarantino directed Pulp Fiction.
XPath and numeric comparisons
SELECT pkid, sub, parent, node, value FROM xpath_movies WHERE xpath_expr = '//dflt:movie[@minutes < 110]/@title';
pkid | sub | parent | node | value |
---|---|---|---|---|
1 | 0 | dflt:director | dflt:movie | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb |
1 | 1 | dflt:director | dflt:movie | Dressed to Kill |
XPath has the capability to perform numeric comparisons: this query identifies any moving having a less than 110 minutues running time.
SELECT pkid, sub, parent, node, value FROM xpath_movies WHERE xpath_expr = '//dflt:movie[@minutes > 133 and @minutes < 138]/@title';
pkid | sub | parent | node | value |
---|---|---|---|---|
1 | 0 | dflt:director | dflt:movie | A Clockwork Orange |
1 | 1 | dflt:director | dflt:movie | Goodfellas |
You can eventually compare a range of numeric values: in this latest example we've extracted the list of movies having a running time between 133 and 138 minutes.
back to XmlBlob main page