Printf Debugging
Printf debugging is essential, so since we’re going to do it anyways, let’s do it fast. In this post we’ll focus specifically on how to write printf debugging log lines that are fast to read and fast to write programs against. There are two important features: a machine-parsable format like JSON or DSV, and explicitly labeled fields.
JSON
As we saw last week, JSON is the ideal way to structure machine-parsable data. Here are some sample JSON logs from the field:
{"level": "DEBUG", "data": [[0, "type", "text", 0, null, 0], [1, "name", "text", 0, null, 0], [2, "tbl_name", "text", 0, null, 0], [3, "rootpage", "int", 0, null, 0], [4, "sql", "text", 0, null, 0]], "query": "pragma table_info(sqlite_master);"}
{"level": "DEBUG", "data": [], "query": "select * from sqlite_master;"}
Notice that each field is labeled explicitly. If we’re logging a query, we might be tempted to log it as:
print(json.dumps(query))
If we’re whipping up a log line that isn’t pushed to production, we can cut some corners to go faster, right? No. Even with temporary data, more time is spent reading and thinking than writing. Explicitly labeled logs are much faster to read than implicit logs. When we’re juggling 100 things in our head about a program, it’s slow to also use our head to reverse engineer logs. If we ever have to question what a log is logging, we’re moving too slowly. If we think “‘select *’ corresponds to what again? Oh right, a query,” that’s a sign that implicit logs need more context. So, we should log it as:
print(json.dumps({"level": "DEBUG", "query": query}))
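If we’re writing a lot of these, a tiny helper keeps the labeling habit cheap. Here’s a minimal sketch; the debug helper is our own invention, not a standard API:
import json

def debug(**fields):
    # Emit one explicitly labeled JSON object per line.
    print(json.dumps({"level": "DEBUG", **fields}))

query = "select * from sqlite_master;"
debug(query=query)
# {"level": "DEBUG", "query": "select * from sqlite_master;"}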
The other benefit of explicitly labeled logs is that they’re structured and therefore very fast to write programs against. If we need to extract the queries from logs, it’s just a jq command away:
cat logs.json | jq -s 'map(select(.level == "DEBUG")) | map(select(.query)) | map(.query)'
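Run against the sample logs above, that pipeline prints:
[
  "pragma table_info(sqlite_master);",
  "select * from sqlite_master;"
]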
JSON is also fantastic because the data values themselves are JSON, which means we can use jq not only to parse the log’s fields, but to dig into the fields themselves. If we want to know what the first row of data is for a given query, jq can handle it, no sweat.
cat logs.json | jq -s 'map(select(.level == "DEBUG")) | map(select(.data)) | map({query: .query, first_data_row: .data[0]})'
[
  {
    "query": "pragma table_info(sqlite_master);",
    "first_data_row": [
      0,
      "type",
      "text",
      0,
      null,
      0
    ]
  },
  {
    "query": "select * from sqlite_master;",
    "first_data_row": null
  }
]
Again, we take the time in our jq pipeline to explicitly label the data even though it’s a derived log format that’s not persisted anywhere and may not be programmed against. This will save us valuable time and brain power when reading logs.
DSV
DSV, delimiter-separated values, is a good alternative when writing out JSON is laborious, like in bash. DSV is very easy to write by hand, and non-escaped DSV can be parsed quickly with grep and cut.
How do we write DSV? Simple: if % is the delimiter character, then ['A', 'B', 'C'] is written in DSV as:
A%B%C
If the delimiter is in a value, it’s escaped with \, so ['A%B', 'C'] is written in DSV as:
A\%B%C
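Here’s a minimal sketch of that rule in Python; the dsv helper and the choice of % are ours:
DELIM = "%"

def dsv(fields):
    # Escape literal delimiters in each value, then join.
    # (A fuller version would also escape the backslash itself.)
    return DELIM.join(str(f).replace(DELIM, "\\" + DELIM) for f in fields)

print(dsv(["A", "B", "C"]))  # A%B%C
print(dsv(["A%B", "C"]))     # A\%B%C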
In unix, DSV with : as the separator is common. For example, from /etc/passwd:
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
However, a lot of data formats like ISO 8601 timestamps and JSON dictionaries contain :, which means the data has to be escaped (a timestamp like 2024-01-01T12:00:00 would need every colon escaped). Non-escaped DSV is much faster to write programs against, so % is an ideal delimiter because it makes it much more likely that the data does not need to be escaped.
Remember the JSON logs from above? Here’s the equivalent in DSV.
DEBUG%QUERY%pragma table_info(sqlite_master);%DATA%[(0, 'type', 'text', 0, None, 0), (1, 'name', 'text', 0, None, 0), (2, 'tbl_name', 'text', 0, None, 0), (3, 'rootpage', 'int', 0, None, 0), (4, 'sql', 'text', 0, None, 0)]
DEBUG%QUERY%select * from sqlite_master;%DATA%[]
How do we extract the queries without jq? grep and cut.
cat log.log | grep DEBUG%QUERY | cut -d % -f 3
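Run against the DSV logs above, this prints:
pragma table_info(sqlite_master);
select * from sqlite_master;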
How would we extract the first row of data? Well, that’s not so easy, because the data is encoded as a Python literal (a list of tuples), which isn’t DSV. We might be able to concoct a very complex regular expression to parse it, but that’ll be slow. Instead we should serialize the data itself in an easy-to-parse format, like JSON. Nesting JSON inside DSV is a weird situation, but it happens. This then lets us use grep to select debug query log lines, cut to extract the data field, and jq to parse the field and return the first element.
cat log.log | grep DEBUG%QUERY | cut -d % -f 5 | jq '.[0]'
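On the writing side, here’s a minimal Python sketch of emitting such a line, with the data field serialized as JSON so jq can parse it; the field layout matches the logs above:
import json

query = "select * from sqlite_master;"
data = []

# DSV with %, nested data serialized as JSON for jq.
# (Values containing % would still need the escaping from above.)
print("%".join(["DEBUG", "QUERY", query, "DATA", json.dumps(data)]))
# DEBUG%QUERY%select * from sqlite_master;%DATA%[]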
If we also wanted to tack the query onto the first row, that’s not so easy, which shows some of the limitations of DSV compared to JSON. Still, it’s pretty fast.