Avoid these 6 jq features
In this tutorial, we’ll discuss jq features that will slow you down and how to move fast instead. jq is a powerful tool for any fast coder, and by avoiding its pitfalls, we’ll be even faster. This tutorial assumes basic familiarity with jq and bash.
first, last, nth(n)
jq has 3 fabulously dangerous functions, first, last, and nth(n), that break the Rule of Least Surprise. Don’t use them; instead, use simple filters.
Before/After
First
# Before: Using first with an expression
echo 10 | jq 'first(., .*2, .*3)'
# After: Using simple filters
echo 10 | jq '[., .*2, .*3][0]'
# Before: Using first without an expression
echo '[2, 4, 6]' | jq 'first'
# After: Using array index
echo '[2, 4, 6]' | jq '.[0]'
Last
# Before: Using last with an expression
echo 10 | jq 'last(., .*2, .*3)'
# After: Using simple filters
echo 10 | jq '[., .*2, .*3][-1]'
# Before: Using last without an expression
echo '[2, 4, 6]' | jq 'last'
# After: Using array index
echo '[2, 4, 6]' | jq '.[-1]'
nth(n)
# Before: Using nth with an expression
echo 10 | jq 'nth(1; ., .*2, .*3)'
# After: Using simple filters
echo 10 | jq '[., .*2, .*3][1]'
# Before: Using nth without an expression
echo '[2, 4, 6]' | jq 'nth(1)'
# After: Using array index
echo '[2, 4, 6]' | jq '.[1]'
Explanation
What does this return?
echo '["json", "is", "great"]' | jq 'first'
Answer:
"json"
So far so good.
What does this return?
echo '["json", "is", "great"]' | jq 'first(.)'
Well, . is the identity filter, so it should return "json", right? Wrong, it returns the whole array:
["json", "is", "great"]
Huh!? What’s going on?
Turns out, first, last, and nth operate on different types depending on whether an expression is passed in. Normally, they operate on arrays: first takes the first element of the array passed in.
But with an expression? They operate on the stream of JSON values that comes out of the expression: first takes the first JSON value the expression produces.
This is extremely subtle and surprising. Subtle and surprising behavior is where bugs live. Bugs cause debugging which slows us down. Avoid bugs, and use less surprising features.
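We can watch the two behaviors side by side. A minimal sketch, assuming jq is available on the PATH:

```shell
# With an expression, first takes the first value of the stream the
# expression produces. .[] streams out the array's elements, so:
echo '["json", "is", "great"]' | jq -c 'first(.[])'
# "json"

# The identity filter . produces a stream containing exactly one value:
# the whole array. So first(.) returns the whole array:
echo '["json", "is", "great"]' | jq -c 'first(.)'
# ["json","is","great"]
```

The simple filter .[0] has only one behavior, which is why we prefer it.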
Variables
If we find ourselves forced to use variables, we should rewrite our program in Python.
Example
Before
# Taken from the jq manual where variables should make our lives easier.
# It's slightly modified to produce 1 array output rather than 2 json values
echo '{"posts": [{"title": "Frist psot", "author": "anon"}, {"title": "A well-written article", "author": "person1"}], "realnames": {"anon": "Anonymous Coward", "person1": "Person McPherson"}}' |
jq '.realnames as $names | .posts | map({title, author: $names[.author]})'
After
#!/usr/bin/env python3
# cleanup-posts
import sys
import json
def L(data):
    names = data["realnames"]
    posts = data["posts"]
    return [{"title": post["title"], "author": names[post["author"]]} for post in posts]

print(json.dumps(L(json.load(sys.stdin))))
And then call it with
echo '{"posts": [{"title": "Frist psot", "author": "anon"}, {"title": "A well-written article", "author": "person1"}], "realnames": {"anon": "Anonymous Coward", "person1": "Person McPherson"}}' |
./cleanup-posts |
# Use jq for pretty printing
jq '.'
Explanation
jq is very fast for coding simple, stateless pipelines. Any section can be tested by feeding a JSON value to stdin and looking at the JSON value on stdout. Setting a variable adds state, and stateful pipelines can’t be tested so quickly. We need a unit testing framework, which already slows us down, and jq doesn’t even have one. If a section sets a variable, the framework must give us access to that value outside the returned JSON values. If a section expects a variable to be set, the framework must let us set it. Implementing and maintaining all of this is slow. Instead, we should rewrite our program in Python, which has a wonderfully well-supported unit testing framework, pytest.
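To see the testing problem concretely, here is a sketch using the posts example from above: the stateless section of the pipeline runs on any hand-made stdin, while the section that reads $names cannot run at all outside the full program.

```shell
# The stateless section is testable in isolation with a hand-made input:
echo '[{"title": "Frist psot", "author": "anon"}]' |
jq -c 'map({title, author})'
# [{"title":"Frist psot","author":"anon"}]

# The section that reads $names is not: run alone, it fails to compile.
echo '[{"title": "Frist psot", "author": "anon"}]' |
jq 'map({title, author: $names[.author]})'
# jq: error: $names is not defined ...
```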
--arg (with pipes)
If we find ourselves using both --arg and pipes ( | ) in the same jq program, we should split up our jq into two separate jq programs, one which uses --arg and one which uses pipes.
Example
Before
#!/usr/bin/env bash
VERSION=8
echo '{"foo": {"bar": [0, 1, 0, 0]}}' |
jq --arg version "$VERSION" '
.foo.bar |
map(select(. == 1)) |
{version: $version | tonumber, count: length}
'
After
#!/usr/bin/env bash
VERSION=8
echo '{"foo": {"bar": [0, 1, 0, 0]}}' |
jq --arg version "$VERSION" '{data: ., version: $version}' |
jq '
{version: .version | tonumber, bar: .data.foo.bar} |
{version, valid_bar: .bar | map(select(. == 1))} |
{version, count: .valid_bar | length}
'
Explanation
Huh!? Why is the refactored code longer?
It is longer. This rule makes debugging much faster at the expense of verbosity. If debugging is already fast, the verbosity might not be worth it. However, debugging time is hard to estimate, so we err on the side of assuming all code is time-consuming to debug.
Real explanation
--arg creates variables, which, as we just saw, break our stateless pipelines. Worse, bringing in data from outside also breaks a stateless pipeline. However, if all a program does is bring in data from outside, with no pipes, it’s simple enough to be fast to code. The second program has pipes, but since it is a simple stateless pipeline, it’s also fast to code.
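Because the second program is a stateless pipeline, we can debug it by hand-writing its input on stdin, no --arg required. A sketch using a made-up intermediate value:

```shell
# Any hand-written intermediate JSON can be fed straight to the logic program:
echo '{"data": {"foo": {"bar": [1, 1, 0]}}, "version": "9"}' |
jq -c '
  {version: .version | tonumber, bar: .data.foo.bar} |
  {version, valid_bar: .bar | map(select(. == 1))} |
  {version, count: .valid_bar | length}
'
# {"version":9,"count":2}
```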
Look! A wild RL(W)
For long-time readers, this is an RLW, without an explicit W. Bringing in data from outside is an R, so the first jq call is a pure R. Piping data is logic, so the second jq call is a pure L. We don’t need an explicit W because, by default, jq writes the data as JSON to stdout.
env
env is exactly like --arg, and the same rules apply. env is fine. Pipes are fine. But not in the same program. If both appear in the same jq program, split it in two: the first makes the calls to env, and the second uses the pipes.
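As a sketch, the Before/After example above can be written with env instead of --arg; the first jq call only reads the environment, and the second is a pure stateless pipeline:

```shell
# env reads jq's environment, so the shell variable must be exported.
export VERSION=8
echo '{"foo": {"bar": [0, 1, 0, 0]}}' |
jq '{data: ., version: env.VERSION}' |
jq -c '
  {version: .version | tonumber, count: .data.foo.bar | map(select(. == 1)) | length}
'
# {"version":8,"count":1}
```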
Functions
If we find ourselves writing custom jq functions, we should rewrite our jq program in python.
Example
Before
seq 20 | jq -s '
def fib(n):
  if n < 0 then
    fib(n+2) - fib(n+1)
  elif n == 0 then
    0
  elif n == 1 then
    1
  elif n >= 2 then
    fib(n-1) + fib(n-2)
  else
    error("Invalid If Tree:n:\(n)")
  end;
map(. - 10) | map(fib(.))'
After
#!/usr/bin/env python3
# Call this, fib, with seq 20 | jq -s '.' | ./fib | jq '.'
import json
import sys
def fib(n):
    if n < 0:
        return fib(n+2) - fib(n+1)
    elif n == 0:
        return 0
    elif n == 1:
        return 1
    elif n >= 2:
        return fib(n-1) + fib(n-2)
    else:
        raise Exception("Invalid If Tree:n:{}".format(n))

def L(data):
    offset_data = [n - 10 for n in data]
    return [fib(n) for n in offset_data]

print(json.dumps(L(json.load(sys.stdin))))
And then call it with
seq 20 | jq -s '.' | ./fib | jq '.'
Explanation
Defining a function inline, like defining a variable, adds state to our pipeline. Any section of the pipeline that expects a function to be defined cannot be debugged in isolation; it needs the function defined inline. This is more complexity, which means more chances for us to make a mistake, which means we move more slowly. If jq had a stable package manager, we could import the function and every section could be debugged in isolation. It doesn’t. There is an experimental one, jqnpm, which hasn’t been touched in almost a year. Experimental software is buggy, which means debugging, which means we move more slowly. Instead, we should rewrite our program in Python. Python is fast to debug even with variables. It also has a very stable, fantastic package manager.
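A sketch of the difference in testability (jq error text abbreviated): the pipeline section that calls fib cannot run without the inline definition, while the Python fib can be checked with a bare assert, no pipeline at all.

```shell
# Run in isolation, the section that expects fib fails to compile:
echo '[1, 2, 3]' | jq 'map(fib(.))'
# jq: error: fib/1 is not defined ...

# The Python function is testable directly (inlined here for brevity):
python3 - <<'EOF'
def fib(n):
    if n < 0:
        return fib(n+2) - fib(n+1)
    elif n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)

assert [fib(n) for n in range(7)] == [0, 1, 1, 2, 3, 5, 8]
print("ok")
EOF
```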
Conclusion
We’ve seen various features which will slow us down and what to do to move fast.
Next week, we’ll start a new series, “mastering sed”, an ancient, powerful Unix command. jq is even called the “sed for JSON data”. If you don’t want to miss out, just click the “Subscribe Now” button below.