Avoid these 6 jq features

In this tutorial, we’ll discuss jq features that will slow you down, and what to do instead to move fast. jq is a powerful tool for any fast coder, and by avoiding its pitfalls, we’ll be even faster. This tutorial assumes basic familiarity with jq and bash.

first, last, nth(n)

jq has three fabulously dangerous functions, first, last, and nth(n), which break the Rule of Least Surprise. Don’t use them; instead, use simple filters.

Before/After

First

# Before: Using first with an expression
echo 10 | jq 'first(., .*2, .*3)'
# After: Using simple filters
echo 10 | jq '[., .*2, .*3][0]'

# Before: Using first without an expression
echo '[2, 4, 6]' | jq 'first'
# After: Using array index
echo '[2, 4, 6]' | jq '.[0]'

Last

# Before: Using last with an expression
echo 10 | jq 'last(., .*2, .*3)'
# After: Using simple filters
echo 10 | jq '[., .*2, .*3][-1]'

# Before: Using last without an expression
echo '[2, 4, 6]' | jq 'last'
# After: Using array index
echo '[2, 4, 6]' | jq '.[-1]'

nth(n)

# Before: Using nth with an expression
echo 10 | jq 'nth(1; ., .*2, .*3)'
# After: Using simple filters
echo 10 | jq '[., .*2, .*3][1]'

# Before: Using nth without an expression
echo '[2, 4, 6]' | jq 'nth(1)'
# After: Using array index
echo '[2, 4, 6]' | jq '.[1]'

Explanation

What does this return?

echo '["json", "is", "great"]' | jq 'first'

Answer:

"json"

So far so good.

What does this return?

echo '["json", "is", "great"]' | jq 'first(.)'

Well, . is the identity filter, so it should return "json", right? Wrong, it returns the whole array:

["json", "is", "great"]

Huh!? What’s going on?

Turns out, first, last, and nth operate on different types when an expression is passed in. Normally they operate on arrays: first takes the first element of the array passed in.

But with an expression? They operate on the json values that come out of the expression: first takes the first json value the expression generates.

This is extremely subtle and surprising. Subtle and surprising behavior is where bugs live. Bugs cause debugging which slows us down. Avoid bugs, and use less surprising features.
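The difference can be sketched in python terms. This is a mental model only, not jq’s actual implementation:

```python
# Rough python model of jq's two `first` behaviors.

def first_plain(arr):
    # jq: 'first' with no expression indexes the input array
    return arr[0]

def first_expr(stream):
    # jq: 'first(f)' takes the first value the expression f generates
    return next(iter(stream))

arr = ["json", "is", "great"]
print(first_plain(arr))   # "json"

# 'first(.)': the identity expression generates exactly one value,
# the whole array, so the whole array comes back.
print(first_expr([arr]))  # ["json", "is", "great"]
```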

Variables

If we find ourselves forced to use variables, we should rewrite our program in python.

Example

Before

# Taken from the jq manual where variables should make our lives easier.
# It's slightly modified to produce 1 array output rather than 2 json values 

echo '{"posts": [{"title": "Frist psot", "author": "anon"}, {"title": "A well-written article", "author": "person1"}], "realnames": {"anon": "Anonymous Coward", "person1": "Person McPherson"}}' |
jq '.realnames as $names | .posts | map({title, author: $names[.author]})'

After

#!/usr/bin/env python3
# cleanup-posts
import sys
import json
def L(data):
  names = data["realnames"]
  posts = data["posts"]
  return [{"title": post["title"], "author": names[post["author"]]} for post in posts]

print(json.dumps(L(json.load(sys.stdin))))

And then call it with

echo '{"posts": [{"title": "Frist psot", "author": "anon"}, {"title": "A well-written article", "author": "person1"}], "realnames": {"anon": "Anonymous Coward", "person1": "Person McPherson"}}' |
./cleanup-posts |
# Use jq for pretty printing
jq '.'

Explanation

jq is very fast for coding simple, stateless pipelines. Any particular section can be tested by feeding a json value to stdin and looking at the json value on stdout. Setting a variable adds state, and stateful pipelines can’t be tested so quickly. We need a unit testing framework, which already slows us down, and jq doesn’t even have one. If a section sets a variable to a value, the framework must give us access to the value outside the returned json values. If a section expects a variable to be set, the framework must let us set the variable. Implementing and maintaining all of this is slow. Instead, we should rewrite our program in python, which has a wonderfully well-supported unit testing framework, pytest.
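As a sketch of what that buys us, the L function from cleanup-posts can be tested directly with pytest (the test name is ours, and L is copied inline to keep the example self-contained):

```python
# test_cleanup_posts.py -- a minimal pytest sketch

def L(data):
    # copied from cleanup-posts for a self-contained example
    names = data["realnames"]
    posts = data["posts"]
    return [{"title": p["title"], "author": names[p["author"]]} for p in posts]

def test_L_maps_authors_to_real_names():
    data = {
        "posts": [{"title": "Frist psot", "author": "anon"}],
        "realnames": {"anon": "Anonymous Coward"},
    }
    assert L(data) == [{"title": "Frist psot", "author": "Anonymous Coward"}]
```

Run it with pytest test_cleanup_posts.py: no stdin plumbing, no framework to build ourselves.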

--arg (with pipes)

If we find ourselves using both --arg and pipes ( | ) in the same jq program, we should split up our jq into two separate jq programs, one which uses --arg and one which uses pipes.

Example

Before

#!/usr/bin/env bash

VERSION=8

echo '{"foo": {"bar": [0, 1, 0, 0]}}' |
jq --arg version "$VERSION" '
  .foo.bar |
  map(select(. == 1)) |
  {version: $version | tonumber, count: length}
'

After

#!/usr/bin/env bash

VERSION=8

echo '{"foo": {"bar": [0, 1, 0, 0]}}' |
jq --arg version "$VERSION" '{data: ., version: $version}' |
jq '
{version: .version | tonumber, bar: .data.foo.bar} |
{version, valid_bar: .bar | map(select(. == 1))} |
{version, count: .valid_bar | length}
'

Explanation

Huh!? Why is the refactored code longer?

It is longer. This rule makes debugging much faster at the expense of verbosity. If debugging is already fast, the verbosity might not be worth it. However, debugging time is hard to estimate, so we err on the side of assuming all code is time-consuming to debug.

Real explanation

--arg creates variables, which, as we saw, break our stateless pipelines. Not only that: bringing in data from outside also breaks a stateless pipeline. However, if all a program does is bring in data from outside, with no pipes, that’s simple enough to be fast to code. The second program has pipes, but since it is a simple stateless pipeline, it’s also fast to code.

Look! A wild RL(W)

For long-time readers, this is an RLW, without an explicit W. Bringing in data from outside is an R, so the first jq call is a pure R. Piping data is logic, so the second jq call is a pure L. We don’t need an explicit W because, by default, jq writes the data as json to stdout.
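The same R/L split reads naturally in python too. This is a sketch: the R and L names follow the convention above, and everything else here is ours:

```python
import json

def R(raw, version):
    # R: bring data in from outside -- no logic here
    return {"data": json.loads(raw), "version": version}

def L(obj):
    # L: pure, stateless logic -- one json value in, one out,
    # testable with a single dict
    bar = obj["data"]["foo"]["bar"]
    return {"version": int(obj["version"]),
            "count": len([x for x in bar if x == 1])}

# W: jq writes to stdout by default; here we print explicitly
print(json.dumps(L(R('{"foo": {"bar": [0, 1, 0, 0]}}', "8"))))
```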

env

env is exactly like --arg, and the same rules apply. env is fine. Pipes are fine. But not in the same program. If they appear in the same jq program, split it in two: the first makes calls to env, and the second uses pipes.

Functions

If we find ourselves writing custom jq functions, we should rewrite our jq program in python.

Example

Before

seq 20 | jq -s '
def fib(n):
  if n < 0 then
    fib(n+2) - fib(n+1)
  elif n == 0 then
    0
  elif n == 1 then
    1
  elif n >= 2 then
    fib(n-1) + fib(n-2)
  else
    error("Invalid If Tree:n:\(n)")
  end; 
map(. - 10) | map(fib(.))'

After

#!/usr/bin/env python3
# Call this, fib, with seq 20 | jq -s '.' | ./fib | jq '.'

import json
import sys

def fib(n):
  if n < 0:
    return fib(n+2) - fib(n+1)
  elif n == 0:
    return 0
  elif n == 1:
    return 1
  elif n >= 2:
    return fib(n-1) + fib(n-2)
  else:
    raise Exception("Invalid If Tree:n:{}".format(n))
  
def L(data):
  offset_data = [n - 10 for n in data]
  return [fib(n) for n in offset_data]

print(json.dumps(L(json.load(sys.stdin))))

And then call it with

seq 20 | jq -s '.' | ./fib | jq '.'

Explanation

Defining a function inline, like defining a variable, adds state to our pipeline. Any section of the pipeline that expects a function to be defined cannot be debugged in isolation; it needs the function defined inline. This is more complexity, which means more chances for us to make a mistake, which means we move more slowly. If jq had a stable package manager, we could import the function and every section could be debugged in isolation. It doesn’t. There is an experimental one, jqnpm, which hasn’t been touched in almost a year. Experimental software is buggy, which means debugging, which means we move more slowly. Instead, we should rewrite our program in python. Python is fast to debug even with variables, and it has a very stable, fantastic package manager.
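As an illustration, once fib lives in python it can be checked in isolation, pytest-style. The negative branch relies on the identity fib(n) = fib(n+2) - fib(n+1); the test name below is ours:

```python
# fib from the rewrite above, checkable in isolation
def fib(n):
    if n < 0:
        # negative branch: fib(n) = fib(n+2) - fib(n+1)
        return fib(n + 2) - fib(n + 1)
    elif n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)

def test_fib():
    # negafibonacci: fib(-n) == (-1)**(n+1) * fib(n)
    assert [fib(n) for n in range(-5, 6)] == [5, -3, 2, -1, 1, 0, 1, 1, 2, 3, 5]
```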

Conclusion

We’ve seen various jq features that will slow us down, and what to do instead to move fast.

Next week, we’ll start a new series, “mastering sed”, about an ancient, powerful unix command. jq is even called the “sed for JSON data”. If you don’t want to miss out, just click the “Subscribe Now” button below.