Vim as XML Editor: More Tasks

Questions

Sometimes you need to find out facts about the buffer you're working on. Here are some examples (enter each command as a single line):

How many indexterm elements are there?
:%w !xmlstar sel -t -v "count(//indexterm)"
What's the title of the first section which has more than one nested section?
:%w !xmlstar sel -t -v "//section[count(.//section)>1]/title"
What's the title of the first section which is nested deeper than one level?
:%w !xmlstar sel -t -v "//section[count(ancestor::section)>1]/title"
How many programlistings are there which don't have a role attribute?
:%w !xmlstar sel -t -v "count(//programlisting[not(@role)])"
How many XHTML div elements are in the document?
:%w !xmlstar sel -N "xh=http://www.w3.org/1999/xhtml"
 -t -v "count(//xh:div)"
How many words are there in the document?
:%w !xmlstar sel -t -v "/" |
 ruby -e 'puts($stdin.read.scan(/\S+/).length)'
On Windows you might need to replace the apostrophes around the Ruby code with double quotes. To count the words inside the (first) chapter with title "Foo", replace "/" with "//chapter[title='Foo']". If wc is available on your system, you can try | wc -w instead of | ruby -e [...].
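Since the queries above share the same prefix, they can be wrapped in a user-defined command. The following is only a minimal sketch for the vimrc: the command name XPath is made up for illustration, and it assumes that xmlstar is on the path as in the examples above.

```vim
" Hypothetical helper: print the value of an XPath expression,
" evaluated against the current buffer (saved or not).
" shellescape(..., 1) quotes the expression for the shell and also
" protects characters such as ! and % that Vim expands in :! commands.
command! -nargs=1 XPath
      \ execute '%w !xmlstar sel -t -v ' . shellescape(<q-args>, 1)
```

With this in place, :XPath count(//indexterm) or :XPath //section[count(.//section)>1]/title should behave like the long forms. Queries which need a namespace binding (-N) are not covered by this sketch.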

Validation

If you just want to check for well-formedness, enter
:!xmllint --noout %
Most often you will want to validate the document. The following command writes the buffer to xmllint:
:%w !xmllint --valid --noout -
This means you can validate the document with all changes, without having to save it first. If you set up a shell script or batch file just for validation, eg

xmllintval.bat

@echo off
xmllint --valid --noout %1 %2 %3 %4 %5
then you can simply do
:%w !xmllintval -
If you want to validate the file, pass the file's path to the validator:
:!xmllint --valid --noout %
or
:!xmllintval %
xmllint also allows you to validate documents which don't have a document type declaration. Let's say you have a DBX chapter file
<?xml version="1.0" encoding="UTF-8"?>
<chapter>
  <title>Setup</title>
...
which is included in the main book file via an entity reference
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book
  PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "https://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
[
 <!ENTITY setup SYSTEM "setup.dbx">
 ...
]>
...
&setup;
...
This means the chapter file can't have a doctype declaration. XInclude is a solution for that problem, but it's not yet as widely supported as the entity reference mechanism, which is part of the XML standard itself. Being able to validate documents which have no document type declaration is generally useful. Factors like the growing popularity of RNG, namespaces and version/profile attributes, and the hopefully decreasing dependence on DTDs to modify the document's information set, will probably mean that fewer documents will have document type declarations. So let's say you edit the chapter in Vim and want to validate the buffer.
:%w !xmllint --valid --noout -
will fail because there is no FPI included and no DTD referenced. So you can try the following (one line):
:%w !xmllint --dtdvalid
 "https://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd" --noout -
If you set up the catalog as described in the previous chapter, then this works offline. (It won't work if the chapter contains references to entities that are declared in the main (book) file. In this case you would probably feed the main file to the validator.) The line doesn't have to be changed when used on a different machine, which means that you can add it to your cross-platform vimrc:
nmap <Leader>d4 :%w !xmllint --dtdvalid 
 \ "https://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
 \ --noout -<CR>
Alternatively you can pass the FPI to xmllint (one line)
:%w !xmllint --dtdvalidfpi
 "-//OASIS//DTD DocBook XML V4.2//EN" --noout -
if it's listed in a catalog. [mapleader] d 4 works great for DBX chapters which are external entities referenced from the main book file as long as the following requirements are met: In addition to being an external parsed entity (can't contain a doctype declaration) the file must also be an XML document. In other words: Since the file can't contain entity declarations it also can't contain entity references. Making sure that modules of a DBX document are well-formed XML documents that can stand alone is generally a good idea since it simplifies collaboration and exchange, reuse, and last but not least editing and validation.
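The FPI variant can be mapped in the same way as the --dtdvalid one. This is a sketch only: the key sequence [mapleader] f 4 is an arbitrary choice, and it assumes the FPI is listed in a catalog that xmllint can find.

```vim
" Assumed key sequence <Leader>f4; validates the buffer against the
" DocBook XML V4.2 DTD, looked up in the catalog by its FPI.
nnoremap <Leader>f4 :%w !xmllint --dtdvalidfpi
      \ "-//OASIS//DTD DocBook XML V4.2//EN"
      \ --noout -<CR>
```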
To validate the buffer with RXP, type
:%w !rxp -V -N -s -x
If you have set up rxpval you can enter
:%w !rxpval
This will not work if there are relative references to external entities and the current directory is not that of the file being validated (this also goes for commands like :%w !xmllintval -). When RXP or xmllint read from stdin, they use the current directory as the base directory for resolving relative references. So if you want to validate a buffer which has relative references to external entities, you can do
:cd /directory/of/the/file/
to change the working directory, and
:pwd
to check it. A more convenient way to change the working directory is to put
map <Leader>cd :exe 'cd ' . fnameescape(expand("%:p:h"))<CR>
into your vimrc then do [mapleader] c d before writing the buffer to the validator via :%w !command....
If you want to validate the file, then it's not necessary to change the working directory. You can simply pass the file path, eg
:!rxp -V -N -s -x %
or
:!rxpval %
RXP works with XHTML documents, but with DBX documents I experienced problems which hopefully will be resolved with future releases. There are various command line options; check the RXP man page.
The commands are long enough to be tedious to type each time. After you've entered a specific one once, you can look for it in the command history:
  • Type : to go to Vim's command line,
  • optionally type the start of the command,
  • then repeat [up] until you see the command you were looking for.
Now you can edit it, or enter it unmodified. Another way of saving typing is to map commands to a very short key sequence in the vimrc.
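As a rough illustration of such mappings, the validation commands from this section could go into the vimrc as follows. The key sequences are arbitrary assumptions, and file names containing spaces or shell metacharacters are not handled by the plain % used here.

```vim
" Assumed key sequences; pick whatever does not clash with your setup.
" Check the file on disk for well-formedness.
nnoremap <Leader>xw :!xmllint --noout %<CR>
" Validate the buffer, including unsaved changes.
nnoremap <Leader>xv :%w !xmllint --valid --noout -<CR>
" Validate the buffer with RXP.
nnoremap <Leader>xr :%w !rxp -V -N -s -x<CR>
```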

Pretty-printing

Pretty-printing is not a trivial task. Relevant details of your document might be changed, so beware. I mainly use pretty-printing when viewing documents with extremely long lines or no indentation, or when editing XML generated by tools, but I almost never use a pretty-printing tool to format the code of documents I author.

Warning

Again: Your data can get corrupted whenever you filter the buffer. This goes for search'n'replace, external tools, etc. Use u to undo.

Select a well-formed fragment (one root element), then filter it through xmllint's pretty-printer, by entering
!xmllint --format -
xmllint will insert an XML prolog; if you didn't filter the whole buffer, this probably isn't desired. You can map this filter command to some shorter key sequence, and also include some commands to delete the prolog.
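One possible shape for such a mapping is sketched below. The key sequence is an assumption, and the sketch relies on xmllint writing the prolog as the first line of its output.

```vim
" Assumed key sequence <Leader>pp, used in Visual mode.
" 1. Filter the selected lines through xmllint's pretty-printer.
" 2. Filtering leaves the cursor on the first line of the result; delete
"    that line (into the black hole register) only if it is the prolog.
xnoremap <Leader>pp :!xmllint --format -<CR>:.g/^<lt>?xml /d _<CR>
```

If xmllint reports an error instead of formatted output, u restores the selection.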

Caution

Namespace prefixes not declared inside the fragment are stripped.

To pretty-print the whole buffer, do
:%!xmllint --format -

Example

Let's say you receive a file which looks like this:
<?xml version="1.0"?>
<chapter><title>Ökopläne</title><simplelist>
<member role="überfällig">übermäßige
Ölförderung stoppen</member></simplelist></chapter>
The code is laid out in a way which makes it hard to work with. This
:%!xmllint --format --encode UTF-8 -
should bring
<?xml version="1.0" encoding="UTF-8"?>
<chapter>
  <title>Ökopläne</title>
  <simplelist>
    <member role="&#xFC;berf&#xE4;llig">übermäßige
Ölförderung stoppen</member>
  </simplelist>
</chapter>
which looks better, but xmllint currently writes "special characters" (probably those outside the US-ASCII range) inside attribute values as numeric character references (NCRs), which can make editing harder.
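If NCRs such as &#xFC; in the output above get in the way, they can be turned back into literal characters from within Vim. The following is a sketch under assumptions, not a tested recipe: the names DecodeHexNcr and DecodeNcr are made up, 'encoding' is assumed to be utf-8, and only hexadecimal references above the ASCII range are touched.

```vim
" Hypothetical helper: return the literal character for a hexadecimal
" NCR; assumes 'encoding' is utf-8.
function! DecodeHexNcr(hex)
  let l:code = str2nr(a:hex, 16)
  " Leave ASCII references such as &#x26; or &#x3C; untouched,
  " since decoding them could break well-formedness.
  return l:code > 127 ? nr2char(l:code) : '&#x' . a:hex . ';'
endfunction

" Apply the helper to the given range (default: whole buffer).
command! -range=% DecodeNcr
      \ <line1>,<line2>s/&#x\(\x\+\);/\=DecodeHexNcr(submatch(1))/ge
```

After pretty-printing, :DecodeNcr should then turn role="&#xFC;berf&#xE4;llig" back into role="überfällig". As with any filter, check the result and use u if something went wrong.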

You could also try tidy for pretty-printing.

Cleaning up

To delete all comments do
:%!xmlstar ed -d "//comment()"

HTML

If you have to deal with tag soup you might want to try turning it into XHTML so that you can enjoy all the advantages of XML. To filter the whole buffer through Tidy, do
:%!tidy
If there are irrecoverable errors, eg unknown elements, everything will be deleted. You can undo the filtering via u.
To filter a block, do
!}tidy
but don't forget that depending on the doctype flag Tidy might insert a doctype declaration if there is none.
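For filtering fragments it may help to pass the relevant options explicitly instead of relying on a configuration file. The mapping below is a sketch: the key sequence is an assumption, and the options should be checked against the documentation of your Tidy version.

```vim
" Assumed key sequence <Leader>ty, used in Visual mode.
" -q                    suppress nonessential output
" -asxhtml              convert to well-formed XHTML
" --doctype omit        do not insert a doctype declaration
" --show-body-only yes  emit only the contents of body, not a full document
" --show-warnings no    keep warnings out of the filtered text
xnoremap <Leader>ty :!tidy -q -asxhtml --doctype omit --show-body-only yes --show-warnings no<CR>
```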
If there's unwanted stuff left, strip tags like this:
:%s,</\?u>,,g