Make JSON your lingua franca

Code faster by making json your lingua franca.

Making json your lingua franca means making json the way you and your programs represent data, with few exceptions.

Why? It's handhackable, transparent, widely supported, and fast to code against.

Handhackable

Handhackable data formats can be written directly by hand, which cuts the coding-speed drag of dependencies and toolchains.

Let's say we need to whip up a custom piece of data for a test. Maybe it's a weird corner case. Or maybe we’re running experiments to learn how the code works.

In another format, like protobuf, we'd first have to import some libraries, learn the textual representation of a protobuf object, write that representation of the data we really want, use the libraries to convert it into the object we really want, and only then send it to the program.

With json, we just write the data directly, which is what we really want, and send it to the program. No libraries, no boilerplate, no intermediate representations or documentation to slow us down. Just the essential work.
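Concretely, hand-writing a corner case is one string away; the shape of the data here is just for illustration:

```python
import json

# Hand-write the corner case directly: no libraries to import, no
# intermediate representation, just the data we actually want.
corner_case = json.loads('{"user": {"name": "", "age": -1, "items": []}}')

print(corner_case["user"]["age"])
```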

Transparent

Transparency is the yin to handhackable's yang. If it can be written directly, that's less drag. If it can be read directly, without converters or viewers, that's less drag for the same reasons. Write it directly with your hands. Read it directly with your eyes.

Widely supported

A widely supported data format means our questions have already been asked and answered on Stack Overflow, and the tools we dream of having have already been built.

Using Stack Overflow to answer a question is much faster than figuring it out ourselves. Need an example of using a json API? Stack Overflow's got us. It's worth figuring things out ourselves when we're trying to grow as programmers, but when coding fast, a properly cited solution works just as well as our own, and often better.

Dreaming of a tool? It's out there. jq is the most powerful of them, one every fast coder should have in their toolbox.

Fast to code against

The last important property of json is that it's fast to code against. Data needs to be fast to read and write both by hand and by program, since we do a lot of both.

On the command line, jq enables very fast coding against json data:

cat log.json | jq -r '.response.name | select(. != null)'

In Python, the json library enables very fast coding against json data:

import json
o = json.load(open("data.json"))
print(json.dumps([f(e) for e in o]))  # f is whatever transform we need

But how can we use json everywhere? Let's go through some use cases to see how to use json in various scenarios.

Database

When working with data from a database, it should be serialized as json: data read out of the DB is expressed as json, and data written to the DB is expressed as json.

Let's say we're whipping up a prototype and need a quick DB. Use json text in a file. If we need to make manual edits to the data, we can use our favorite text editor. If we need to change the data schema, we can write a program to refactor the data for us. Let's say we need to add a new field, color, to our data, and existing data happens to all be "Red".

cat db.json | jq 'map(.color = "Red")' > new-db.json; mv new-db.json db.json 

Or if we need to remove the color field:

cat db.json | jq 'map(del(.color))' > new-db.json; mv new-db.json db.json
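The same schema refactors can be scripted in Python when jq isn't at hand. A sketch, assuming db.json holds a top-level array of records (the sample records here are made up):

```python
import json

def add_color(records, color="Red"):
    # Mirrors jq's map(.color = "Red"): new records with the field set.
    return [dict(r, color=color) for r in records]

def drop_color(records):
    # Mirrors jq's map(del(.color)): new records without the field.
    return [{k: v for k, v in r.items() if k != "color"} for r in records]

# Stand-in for: db = json.load(open("db.json"))
db = [{"name": "apple"}, {"name": "cherry"}]
with_color = add_color(db)
print(json.dumps(with_color))
```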

Once we've demo'd our prototype and we need to make our code production ready, can we still use json?

Yes.

Both NoSQL DBs like AWS DynamoDB and SQL DBs like PostgreSQL support a json datatype. PostgreSQL even supports json queries and indexes, giving us most of the benefits of SQL while still letting the DB speak json.

HTTP API payload

When we're writing an API over HTTP, the payload should be serialized as json. But what if the API is only for legacy clients that don't speak json? Then the API can't be json, right?

Bilingual APIs

If we're writing a server which needs to interface with, say, a legacy XML client, it needs to speak XML because that's its job. But it also needs to speak json, because that's our lingua franca. Every tool we have speaks json: development tools, testing tools, debugging tools. We'll code faster if we can use those tools with our server. This makes the server bilingual: it speaks both XML and json. In theory, we could let the server speak only XML and reach for XML-to-json translators whenever we want to use a json tool with it. But it's easier to have the server speak json natively than to require translators every time we interact with the API.
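One way to sketch the bilingual idea in Python: normalize each request body into one internal dict keyed by content type, so everything downstream of the parser speaks json. The content types and the flat user element shape are assumptions for illustration:

```python
import json
import xml.etree.ElementTree as ET

def parse_user(body, content_type):
    """Parse a flat user record from either wire format into one dict."""
    if content_type == "application/json":
        return json.loads(body)
    if content_type == "application/xml":
        # Flat XML record: each child element becomes a key/value pair.
        root = ET.fromstring(body)
        return {child.tag: child.text for child in root}
    raise ValueError("unsupported content type: " + content_type)

from_json = parse_user('{"id": "1", "name": "John Doe"}', "application/json")
from_xml = parse_user("<user><id>1</id><name>John Doe</name></user>",
                      "application/xml")
```

Both calls yield the same dict, so the rest of the server never cares which kind of client it's talking to.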

Log lines

When we log, we should log in json. Some example log lines:

{"level": "INFO", "type": "system", "message": "The system has booted"}

{"level": "INFO", "type": "api", "api": "/users", "request": {"id": 1}, "response": {"name": "John Doe"}}

We should definitely not log with unstructured strings or a custom format. XML is too verbose. But what about logfmt?

logfmt

logfmt seems like a natural choice to log given its name and very light syntax:

level=INFO type=system message="The system has booted"

Compare this to the same logline in json:

{"level": "INFO", "type": "system", "message": "The system has booted"}

The logfmt line looks shorter and cleaner. But the json is still pretty legible, so json is only at a minor disadvantage here. And logfmt can only represent strings, which hands json a major advantage. Consider this json log line with complex data:

{"level": "INFO", "type": "api", "api": "/users", "request": {"id": 1}, "response": {"name": "John Doe"}}

How does logfmt handle this? It doesn't. If we try to force it with the Python logfmt library, the library just prints each value's Python repr. For dictionaries that repr is json-like, so we end up with a hybrid logfmt, json-like data format which is very slow to work with.

level="INFO" time="2020-05-28T13:13:45Z" type="api" api="/users" request="{'id': 1}" response="{'name': 'John Doe'}"

Let's say we need to get the response names for these log lines. Here’s a solution, but it’s slow to code.

#!/usr/bin/env python3

import json
import logfmt
import sys

for l in logfmt.parse(sys.stdin):
  if "response" in l:
    r = l["response"]
    # Turn the Python repr back into json: escape any embedded double
    # quotes, then rewrite the single-quoted keys and values.
    jr = json.loads(r.replace("\"","\\\"").replace("'","\""))
    if "name" in jr:
      print(jr["name"])

See those subtle string replaces inside the json.loads? Subtle is easy to get wrong, which means slow to get right. Checking for field presence is also slow: it's easy to assume the data is always nice, until processing the logs spits out an error and we have to go back and do it right.

Compare this to the json log line:

{"level": "INFO", "type": "api", "api": "/users", "request": {"id": 1}, "response": {"name": "John \"Steve\" Doe"}}

The data model is simple: json. And because it's json, we can leverage jq to solve this problem very quickly:

cat log.json | jq -r '.response.name | select(. != null)'

Much faster.

Misc data

Any other data our program deals with can be formatted as json. If we're writing a program with a debug command that dumps program state, the state should be json. If we're writing a program that imports data, that data should be json. If we're writing an academic paper with graphs, we can have some fun: the graphs are a derived artifact of the raw data, so we can store the raw data in a json file, use gnuplot to plot it into an eps, and import that from the latex file. Wire it all into a make target and we can "make paper" to compile our paper from the raw data! And how should we store the data? json.
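That pipeline might look like the following make sketch; the file names and the plot script are hypothetical:

```make
# Hypothetical pipeline: raw json -> gnuplot -> eps -> pdf.
paper.pdf: paper.tex figure.eps
	pdflatex paper.tex

figure.eps: data.json plot.gp
	# Flatten the json records into the x/y columns gnuplot expects.
	jq -r '.[] | "\(.x) \(.y)"' data.json > data.dat
	gnuplot plot.gp  # plot.gp reads data.dat and writes figure.eps
```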

Testing

Testing... like test case data? That's usually written inline with the program, sometimes embedded in objects and other values. Seems messy and tedious?

It is. In a future post, we'll use gold standard testing which turns testing into a data problem and that data will be json. But today, we'll talk about test results.

Tests create results. Which tests ran. Which tests failed. What was expected. What was actual. That's data. json data:

[
    { "test": "check_even_arguments", "status": "PASSED", "time": 0.0001 },
    { "test": "check_odd_arguments", "status": "FAILED", "expected": 0, "actual": 1, "time": 0.0001 }
]
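Because the results are json, summarizing a run takes a couple of lines (or one jq call). A Python sketch over the results above:

```python
import json

results = json.loads("""[
    {"test": "check_even_arguments", "status": "PASSED", "time": 0.0001},
    {"test": "check_odd_arguments", "status": "FAILED",
     "expected": 0, "actual": 1, "time": 0.0001}
]""")

# Pull out just the failures, the part of a test run we actually act on.
failed = [r["test"] for r in results if r["status"] == "FAILED"]
print(failed)
```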

Okay, so we've seen many cases of when we should use json, but there are exceptions. When shouldn't we use json?

Performance

In theory, json can bottleneck performance, and if it does, it can't be used. But make sure it actually does before optimizing. Optimizations are difficult and slow, and optimized code is rigid and less readable, which makes it slower to change. From a fast-coding perspective this is very expensive, so code should only be optimized when completely necessary. That means profiling the system to find exactly which part is the performance bottleneck, optimizing only that part, and proving the optimization provided the necessary performance benefit.
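Measuring comes first, and for json the standard library makes it cheap. A sketch with timeit; the record shape and the 10k iteration count are arbitrary:

```python
import json
import timeit

record = {"level": "INFO", "api": "/users", "response": {"name": "John Doe"}}

# Time encode and decode separately so we know which side, if either,
# is the bottleneck.
encode_s = timeit.timeit(lambda: json.dumps(record), number=10_000)
decode_s = timeit.timeit(lambda: json.loads(json.dumps(record)), number=10_000)
print(f"encode: {encode_s:.3f}s  decode: {decode_s:.3f}s per 10k records")
```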

Configuration files

json is almost a suitable format for configuration files. But json doesn't support comments, and comments are critical in configuration: they provide sample configuration and explain surprising choices. This is important enough to warrant another data format: yaml. yaml is similar to json but more complicated and less popular. Still, it's handhackable and transparent, so it's almost as good as json.
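For example, the comments below carry information json has no place for; the keys and values are hypothetical:

```yaml
# Maximum concurrent connections. Raised from the default after load
# testing; explain surprising choices like this one right here.
max_connections: 200

# Sample upstream entry; copy this block to add another upstream.
upstreams:
  - host: api.example.com
    port: 443
```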

Temporary debug print statements

When writing temporary debug print statements, json serialization is sometimes too cumbersome just to log a few quick values, especially in a language like C. Cumbersome is a drag on our speed, so given the ephemeral and simple nature of the data, we'll use Colon:Separated:Values instead. Colon:Separated:Values is the classic unix data format, predating json by 30 years. It's very fast to write, but like logfmt it can only express strings. Still, thanks to its age and history, there are powerful tools that make working with Colon:Separated:Values fast. Imagine we have logs like:

DEBUG:N:0:M:0
Random thread log!
DEBUG:N:2:M:0
DEBUG:N:5:M:0
Another random thread log!
...

Clearly the “DEBUG” log lines are our temporary debug print statements. We want the 2nd-5th values of N after M goes from 0 to 1. Getting this data is fast using generic text processing tools:

cat system.log | grep "^DEBUG" | grep "M:1" | sed -n 2,5p | cut -d : -f 3

It's a bit of a drag to learn more tools that seem to do the same thing as jq. If these were json log lines like

{"level": "TEMP_DEBUG", "values": {"m": 0, "n": 0}}
{"level": "INFO", "message": "Random thread log!"}

we could write it as

cat system.log | jq -s 'map(select(.level == "TEMP_DEBUG")) | map(select(.values.m == 1))[1:5] | map(.values.n)'

So these tools appear redundant. Why should we slow down to learn tools that do the same exact thing?

Sadly, not all data we come across will be json, or even structured. Processing unstructured data is an unfortunately slow part of programming. Creating structure from it is much slower than processing it ad hoc as it comes up, and that ad hoc processing is much faster if we arm ourselves with generic text processing tools. They're a worthwhile investment for any aspiring fast coder.