Mastering jq: xml (and any other data format)

In this tutorial, we will go over how to use jq to transform xml data as well as any other data format, including binary formats. The steps assume a basic familiarity with jq and unix shell pipelines. If you’re unfamiliar with jq, check out the first part of the mastering jq series.

XML

In this section, we’ll use jq to transform xml data.

Setup

Consider the following data and let it be stored in before.json:

{
  "root": {
    "a": [
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9,
      10
    ]
  }
}

and we need to double each number so it looks like:

{
  "root": {
    "a": [
      2,
      4,
      6,
      8,
      10,
      12,
      14,
      16,
      18,
      20
    ]
  }
}

Then, with jq, we can do:

cat before.json |\
jq '{root: {a: (.root | .a | map(. * 2))}}'

Simple XML

Let us consider the same problem, but with the data encoded as xml instead. Let it be in a file, before.xml:

<?xml version="1.0" encoding="utf-8"?>
<root>
	<a>1</a>
	<a>2</a>
	<a>3</a>
	<a>4</a>
	<a>5</a>
	<a>6</a>
	<a>7</a>
	<a>8</a>
	<a>9</a>
	<a>10</a>
</root>

And the desired output is:

<?xml version="1.0" encoding="utf-8"?>
<root>
	<a>2</a>
	<a>4</a>
	<a>6</a>
	<a>8</a>
	<a>10</a>
	<a>12</a>
	<a>14</a>
	<a>16</a>
	<a>18</a>
	<a>20</a>
</root>

When the same data was json, we did:

cat before.json |\
jq '{root: {a: (.root | .a | map(. * 2))}}'

For xml, we’ll do:

cat before.xml |\
xml-to-json |\
jq '{root: {a: (.root.a | map((. | tonumber) * 2))}}' |\
json-to-xml

Where:

  • xml-to-json is a command-line program that takes data encoded as xml on stdin and converts it to data encoded as json on stdout.

  • json-to-xml is a command-line program that takes data encoded as json on stdin and converts it to data encoded as xml on stdout.

We also need to use jq’s tonumber filter because all xml contents are strings. Other than that, to use jq with xml data, all we need is two command-line programs: xml-to-json and json-to-xml.
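
To see why the conversion step matters, here is a small Python sketch of the same transformation. It assumes xml-to-json emits every <a> value as a string (the shape the jq filter above expects); the sample is shortened to three values and the variable names are only illustrative.

```python
import json

# Assumed output of xml-to-json for a shortened before.xml:
# element text arrives as strings, not numbers.
xml_as_json = '{"root": {"a": ["1", "2", "3"]}}'
data = json.loads(xml_as_json)

# Multiplying the raw strings repeats them instead of doubling them.
repeated = [value * 2 for value in data["root"]["a"]]

# Converting first (the role tonumber plays in the jq filter) gives real arithmetic.
doubled = {"root": {"a": [int(value) * 2 for value in data["root"]["a"]]}}

print(repeated)
print(json.dumps(doubled))
```

The first list holds repeated strings, while the second holds the doubled numbers we actually want.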

Complex XML

In this section, we’ll use jq to process a much more complex piece of xml. As we’ve seen, jq works with xml to handle a simple case. But so do many other xml and plain text tools. It’s when the xml gets complicated that jq stands out.

I recently migrated my rss feeds from a reader to a custom IFTTT hook. rss readers export data as a feedlist.opml (xml) file, and IFTTT required me to paste in one url at a time. For that I needed a list of the urls as newline-delimited strings. But feedlist files are complex and intricate. There’s structure. There are two types of urls, listurls and xmlurls, and we only want the xmlurls. Grep fails us. xmllint is a pain. But jq? jq handles it beautifully. Don’t worry too much about the specifics; the point is that jq works great even when the xml data is complex.

cat feedlist.opml |\
xml-to-json |\
jq '.opml.body |
to_entries |
map(.value.outline)[0] |
map(select(.outline |
type == "array")) |
map(.outline) |
flatten |
map(.["@xmlUrl"])[]' -r

Installing xml-to-json and json-to-xml

These tools, along with some other json-related tools, are available for free from my github and can be easily installed:

git clone https://github.com/tyleradams/json-toolkit &&\
cd json-toolkit &&\
sudo make install

If anybody would like to help me make this into a debian package (or any other type of package), I’d love some help, DM me on twitter @code_faster.

Any other data format

In this section, we’ll go over how to use jq with any other data format. This includes even binary data.

Textual Data

We’ll start by showing how to use jq with any other textual data format.

To use jq with xml formatted data, we did:

cat before.xml |\
xml-to-json |\
jq '{root: {a: (.root.a | map((. | tonumber) * 2))}}' |\
json-to-xml

If it’s the same data, but in another data format, let’s call it df, we’ll do:

cat before.df |\
df-to-json |\
jq '{root: {a: (.root.a | map(. * 2))}}' |\
json-to-df

Where:

  • df-to-json is a program that reads data encoded as df from stdin and writes the data encoded as json to stdout.

  • json-to-df is a program that reads data encoded as json from stdin and writes the data encoded as df to stdout.

As we can see, to use jq with df data, all we need is two programs: df-to-json and json-to-df. Since df is an arbitrary textual format, we see this works for all textual data.

Installing df-to-json and json-to-df

If df is:

  • csv

  • dsv

  • logfmt

  • xml

  • yaml

then json-toolkit contains both df-to-json and json-to-df.

Binary data

UPDATE: The original post erroneously claimed that unix pipes don’t handle binary data.

Using jq with binary data is exactly the same as with textual data, because unix pipes are indifferent to whether the data is binary or textual.
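
As a quick check of that claim, the following Python sketch pushes every possible byte value through a pipe and compares what comes out. The child process is a stand-in for any program in a pipeline; it is a hypothetical helper written only for this demonstration.

```python
import subprocess
import sys

# Every possible byte value, including NUL and bytes that are not valid UTF-8.
payload = bytes(range(256))

# A child process that copies stdin to stdout byte for byte (a stand-in for `cat`).
copy_program = "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"

result = subprocess.run(
    [sys.executable, "-c", copy_program],
    input=payload,
    capture_output=True,
    check=True,
)

# The bytes arrive unchanged on the other side of the pipe.
print(result.stdout == payload)
```

Note that the child reads and writes through the .buffer attributes; the pipe itself does not care, but Python's text-mode streams would.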

Implementing df-to-json and json-to-df

Leveraging a df library

If we have a df library which provides a df encoder and decoder, then making df-to-json and json-to-df is fast. Python has a large number of libraries, so it’s a good place to start, but we can use any programming language. Assuming we find one in Python, df-to-json will look like:

#!/usr/bin/env python3

import df
import json
import sys

df_encoded_data = sys.stdin.read()

data = df.decode(df_encoded_data)

# json.dumps encodes data as a json string
print(json.dumps(data))

If we find the df library in a non-python language, we could write a similar program in that language.

Similarly, json-to-df can be written as:

#!/usr/bin/env python3

import df
import json
import sys

json_encoded_data = sys.stdin.read()

# json.loads means decode json encoded data
# We don't use json.load(sys.stdin) in order to emphasize the symmetry with df.decode

data = json.loads(json_encoded_data)

print(df.encode(data))

As we can see, the program is almost exactly the same as df-to-json, except we swap which library does the decoding and which does the encoding.

For example, let’s look at the implementation of json-to-xml and xml-to-json. Here’s the implementation of json-to-xml. Just skim it and pay attention to the 2 key lines in main(): the json.load call and the xmltodict.unparse call.

#!/usr/bin/env python3

import json
import sys
import xmltodict

JSON_FROM_PYTHON_NAMES = {
    dict: "object",
    list: "array",
    str: "string",
    int: "Number",
    float: "Number",
    bool: "Boolean",
    # Keyed by type(None), not None, because the lookup below uses type(data)
    type(None): "null"
}

class InvalidXMLSerializableData(Exception):
    pass

class IncompleteIfTreeException(Exception):
    pass

def validate_data(data):
    if type(data) == dict and len(data.keys()) == 1:
        return

    # Prefacing \n makes multierror lines easier to read
    message = "\n    Only a json object with 1 key can be serialized to xml"
    if type(data) != dict:
        type_name = JSON_FROM_PYTHON_NAMES[type(data)]
        if type_name[0] in ["a", "e", "i", "o", "u"]:
            message += "\n    The inputted json value is not an object, it is an {}".format(type_name)
        else:
            message += "\n    The inputted json value is not an object, it is a {}".format(type_name)
    elif type(data) == dict and len(data.keys()) != 1:
        message += "\n    Input object does not have 1 key, it has {} keys".format(len(data.keys()))
    else:
        raise Exception("The code cannot handle this input, to receive support, please file a bug specifying the input")
    raise InvalidXMLSerializableData(message)

def main():
    if len(sys.argv) != 1:
        print("Usage: json-to-xml")
        # Stop here instead of falling through and waiting on stdin
        sys.exit(1)

    data = json.load(sys.stdin)
    validate_data(data)
    print(xmltodict.unparse(data, pretty=True))

if __name__ == "__main__":
    main()

The 2 key lines, the json.load call and the xmltodict.unparse call, do all of the work and are the same as the 3 lines we wrote for json-to-df, just written more tersely. Everything else in the program just helps the programmer.

Here’s an implementation of xml-to-json. Pay attention to the 3 key lines in main() that read stdin, parse the xml, and print the json:

#!/usr/bin/env python3

import json
import sys
import xmltodict

def main():
    if len(sys.argv) != 1:
        print("Usage: xml-to-json")
        # Stop here instead of falling through and waiting on stdin
        sys.exit(1)

    xml_string = sys.stdin.read()
    data = xmltodict.parse(xml_string)
    print(json.dumps(data))

if __name__ == "__main__":
    main()

As we can see, the 3 key lines in xml-to-json (read, parse, print) are implemented exactly as df-to-json suggests.
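
The same pattern carries over to any format with a ready-made decoder and encoder. As a minimal sketch, here it is with df = csv, using Python's standard csv module. This is not how json-toolkit implements its csv tools; the function names, the header-row convention, and the sample data are assumptions made for the illustration.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Decode csv text (first row is the header) and encode it as json."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)


def json_to_csv(json_text):
    """Decode a non-empty json array of flat objects and encode it as csv text."""
    rows = json.loads(json_text)
    out = io.StringIO()
    # Column order is taken from the first object.
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


csv_text = "name,users\n#jsonnet,7\n#aerospike,3\n"
as_json = csv_to_json(csv_text)
print(as_json)
print(json_to_csv(as_json) == csv_text)
```

As with xml, every csv value comes out as a string, so a jq filter in the middle would still need tonumber before doing arithmetic.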

Binary data

UPDATE: An earlier version of the post erroneously claimed that unix pipes could not handle binary data. As such, this section has been simplified.

df-to-json

This is almost the same as df-to-json for textual formats, but we must change sys.stdin.read() to sys.stdin.buffer.read() as sys.stdin.read() assumes textual data.

#!/usr/bin/env python3

import df
import json
import sys

df_encoded_data = sys.stdin.buffer.read()

data = df.decode(df_encoded_data)
# json.dumps encodes data as a json string
print(json.dumps(data))

json-to-df

This is almost the same as json-to-df for textual formats, but we must change print() to sys.stdout.buffer.write() as print() assumes textual data.

#!/usr/bin/env python3

import df
import json
import sys

json_encoded_data = sys.stdin.read()

data = json.loads(json_encoded_data)

sys.stdout.buffer.write(df.encode(data))
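
To make the binary case concrete, here is a sketch with a made-up binary format: a run of little-endian unsigned 32-bit integers. The format, the decode and encode functions, and the sample bytes are all hypothetical; the point is only that bytes go in one end and bytes come out the other, with json in the middle.

```python
import json
import struct


def decode(df_bytes):
    """Decode the hypothetical format: consecutive little-endian uint32 values."""
    count = len(df_bytes) // 4
    return list(struct.unpack("<{}I".format(count), df_bytes))


def encode(numbers):
    """Encode a list of integers back into the hypothetical binary format."""
    return struct.pack("<{}I".format(len(numbers)), *numbers)


binary_input = bytes([1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0])

# df-to-json step: bytes in, json text out.
as_json = json.dumps({"root": {"a": decode(binary_input)}})

# The jq step would sit here; doubling is done inline to keep the sketch self-contained.
doubled = [n * 2 for n in json.loads(as_json)["root"]["a"]]

# json-to-df step: json values in, bytes out.
binary_output = encode(doubled)
print(as_json)
print(binary_output)
```

In a real pair of scripts, binary_input would come from sys.stdin.buffer.read() and binary_output would go to sys.stdout.buffer.write(), as described above.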

No pre-written df library

All of the above assumed we found a pre-written df library which encoded and decoded df data for us. In practice, this is quite likely. We can find a pre-written library if df is a well known data format, if it’s proprietary and we have a relationship with the vendor, or if we designed the data format ourselves.

If we cannot find the library, then we’ll have to implement it. Once we do, we can use the same techniques as above and we’ll be able to use jq with data encoded as df.

The general problem of data encoding/decoding (if the data is textual, decoding is called parsing) is large and beyond the scope of this piece. In my experience, however, I’ve only had to implement custom df libraries for implicitly structured text either from a website, program logs, or a program output. We will go through one such example: channel lists from irc, a lively, ancient chat protocol that predates http.

Encoding/Decoding /list

An IRC server returns a list of channels when sent the command /list. This list is returned as plain text. Because that text has a regular shape, we can treat it as data: parse it, and then use jq to transform it.

Here’s a sample from freenode, a popular irc server for developers.

##solvinglp(11): Liber Primus conversation only; this will be strictly enforced.
#tiddlywiki(8)
#optware(6): Optware-NG is now being built by nslu2-linux.org! | IRC Logs at: http://logs.nslu2-linux.org/livelogs/optware.txt
#app-business(8): Welcome to #app-business, where we discuss app business tactics and marketing!
#upt-packaging(3)
#rsqueak(4): RSqueak • Fast • https://github.com/HPI-SWA-Lab/RSqueak • http://speed.squeak.org/
#sugar-meeting(8): The meeting channel for the Sugar learning platform | Meeting logs at http://meeting.sugarlabs.org/sugar-meeting/meetings | See also #sugar | THIS CHANNEL IS ALWAYS LOGGED
#aerospike(3): Aerospike's Community Channel
#jsonnet(7): New website design http://jsonnet.org

We see

  1. the channel name,

  2. "("

  3. number of users

  4. ")"

    This may be followed by

    1. ": "

    2. The channel description.

  5. "\n"

Here’s an implementation. The essential decoding lines are the ones in decode_line that compute name, users, and title.

def decode_line(line):
    name = line.split("(")[0]
    users = int(line.split("(")[1].split(")")[0])
    # Split only on the first ")" so a title containing ")" is kept whole
    rest = line.split(")", 1)[1]
    if rest.strip() == "":
        title = None
    else:
        title = rest[2:].rstrip()
    return {
        "name": name,
        "users": users,
        "title": title
    }

def decode(lines):
    return [decode_line(line) for line in lines]

def encode_channel(channel):
    if channel["title"]:
        return "{}({}): {}\n".format(channel["name"], channel["users"], channel["title"])
    else:
        # No title means no ": " suffix, matching lines such as "#tiddlywiki(8)"
        return "{}({})\n".format(channel["name"], channel["users"])

def encode(channels):
    return "".join([encode_channel(channel) for channel in channels])

The key decoding lines are cryptic, but built on three simple ideas: string splitting by character, offsets, and whitespace stripping.

The name is everything before the first “(”.

The number of users is everything between “(” and “)”.

The title, if it exists, is everything after “)”, with a 2 character offset to strip out “: ” from the beginning of every title, minus the newline at the end (which is taken care of by whitespace stripping).
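
To make those ideas less cryptic, here they are applied one at a time to a single line from the sample above. This is just a walk-through of the intermediate values, not a replacement for the decoder.

```python
line = "#jsonnet(7): New website design http://jsonnet.org\n"

# Splitting: everything before the first "(" is the channel name.
name = line.split("(")[0]

# Splitting twice: the text between "(" and ")" is the user count.
users = int(line.split("(")[1].split(")")[0])

# Offset and stripping: skip the ": " after ")" and drop the trailing newline.
after_paren = line.split(")", 1)[1]
title = after_paren[2:].rstrip()

print(name, users, title)
```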

Now that we’ve built this, we can use jq to transform it. If we want the channel names sorted by number of users, we can use jq:

cat channels.txt |\
./irc-list-to-json |\
jq -r 'sort_by(.users) | map(.name)[]'
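
The irc-list-to-json program in that pipeline is not shown here, so the following is only a sketch of what it might look like, built from the decoder described above. The function name irc_list_to_json is an assumption, and an in-memory stream stands in for stdin so the example runs on its own. The final loop mirrors the jq filter in Python to show the expected ordering.

```python
import io
import json


def decode_line(line):
    name = line.split("(")[0]
    users = int(line.split("(")[1].split(")")[0])
    rest = line.split(")", 1)[1]
    # An empty remainder means the channel has no title.
    title = rest[2:].rstrip() or None
    return {"name": name, "users": users, "title": title}


def irc_list_to_json(stream):
    """Read /list output line by line and return it encoded as json."""
    channels = [decode_line(line) for line in stream if line.strip()]
    return json.dumps(channels)


# Stand-in for sys.stdin so the sketch runs on its own.
sample = io.StringIO(
    "#jsonnet(7): New website design http://jsonnet.org\n"
    "#upt-packaging(3)\n"
)
encoded = irc_list_to_json(sample)

# Python equivalent of: jq -r 'sort_by(.users) | map(.name)[]'
names_by_users = [c["name"] for c in sorted(json.loads(encoded), key=lambda c: c["users"])]
for name in names_by_users:
    print(name)
```

In a real script, the last step would be print(irc_list_to_json(sys.stdin)) and the sorting would be left to jq.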

Conclusion

  • We saw that jq is extremely versatile. With a little bit of help it works not only on xml, but on any data format even if it’s binary.

  • We saw the json-toolkit has a few tools to enable jq to work with common formats like xml, csv, and yaml.

  • We saw how to leverage existing libraries, in any language, into enabling jq to transform a new data format.

  • We’ve seen an example of parsing informally structured data using simple techniques.
