I’ll let you in on a little secret. Unix “command line wizards” are just highly textually fluent. Textual fluency is powerful only because it’s a widely applicable special case of data fluency. If we understand and master general data fluency, we’ll be even faster coders. In this post, we’ll go over what data fluency is, why it’s important, which data formats to focus on, and then two real-life examples of how I’ve applied it.
Data Fluency
Data fluency is proficiency at reading, writing, and transforming data, both manually (by hand) and programmatically (with code).
Manual data fluency is done in your favorite text editor, or your favorite hex editor if the data’s binary. Opening a JSON file in Sublime and changing values is manual data fluency. Opening a compiled executable in GHex and changing values is also manual data fluency. Generating JSON text using echo is manual data fluency too, for example:
# Echo is great for generating test values to try out complicated jq scripts.
echo '[1]' | jq 'map(select(. % 2 == 0))'
Programmatic data fluency is manipulating data with code, using any programming language: both general purpose languages like Python and domain-specific languages like sed’s. Using jq to manipulate JSON is programmatic. Using grep to filter out health-check server logs:
grep -v "/health-check"
is programmatic data fluency.
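The same filter could just as easily live in a general purpose language. Here’s a rough Python equivalent of the grep one-liner above, reading log lines from STDIN (a sketch, not a full replacement for grep’s regex features):
#!/usr/bin/env python3
# Rough equivalent of: grep -v "/health-check"
# Reads log lines from STDIN and drops any line mentioning /health-check.
import sys

for line in sys.stdin:
    if "/health-check" not in line:
        sys.stdout.write(line)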
Why data fluency?
We’re not data scientists, and we might not be data engineers, so why should we learn data fluency?
Programming-adjacent tasks are often data tasks
Although we’re programmers, we often do programming-adjacent tasks, like testing, database seeding, and extracting data from logs. These examples are all data tasks, which means we’re working directly with data. The more fluent we are with data, the faster we can get these done.
We’ve pushed complexity into our data
Last week, we discussed pushing complexity into our data. When we do this, our programs become smaller and simpler, but our data becomes larger and more complicated. This means we end up working with data more. We need to generate lots of it, read it, parse it, and validate it.
Which types of data fluency?
There are many types of data fluency, but two in particular provide large productivity gains.
Textual Fluency
Textual fluency is the underlying power of command line wizards and Unix greybeards. grep, cut, cat, sed, et al. are all tools that make it easy to work with text. Almost all data is textual (as opposed to raw binary), so textual fluency is incredibly versatile and can speed up many programming and programming-adjacent tasks. When mastered, textual fluency can look something like this:
# This command looks through my custom bash history for good examples of cut... until I realized this command itself was a good example
cat .bashlog \
| cut -d : -f 5 \
| grep -v bashlog \
| grep cut
JSON Fluency
JSON is the lingua franca of structured data. If we’re fluent in JSON, we can extend that fluency to almost every structured data format by using translators. Here’s an example that selects the rows of a CSV where “createdDate” and “deletedDate” are equal, using only JSON fluency:
cat data.csv \
| csv-to-json \
| json-table-to-objs | jq 'map(select(.createdDate == .deletedDate))' | json-objs-to-table \
| json-to-csv
The structure of this program is:
read data - cat data.csv
convert to JSON - csv-to-json
JSON magic - json-table-to-objs | jq 'map(select(.createdDate == .deletedDate))' | json-objs-to-table
convert to CSV - json-to-csv
This structure, with task-dependent JSON magic in the middle, can be used to quickly manipulate other data formats like XML, YAML, TSV, DSV, and logfmt.
If a translator doesn’t exist for your data format, translators to and from JSON are very easy to write in your favorite programming language, as long as it has libraries for both JSON and your data format.
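For example, here’s a minimal sketch of a CSV-to-JSON translator in Python. It’s a hypothetical stand-in, not the csv-to-json tool used in the pipeline above, but it produces the same shape: a JSON 2D array with the header row included.
#!/usr/bin/env python3
# Hypothetical minimal CSV-to-JSON translator: read CSV from STDIN and
# write a JSON 2D array (header row included) to STDOUT.
import csv
import json
import sys

rows = list(csv.reader(sys.stdin))
print(json.dumps(rows))
The reverse direction, JSON back to CSV, is just as short using csv.writer from the same library.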
Real World Examples
Last week, I had two tasks that happened to be great examples of data fluency; they’re shared below.
Making Users (text and JSON)
I’m working on a product, and it needs to be tested. One task I needed to complete was seeding the testing database with my test users. Before I became fluent in data, I would’ve opened the tests on one monitor and Sequel Pro on the other, and manually typed in the values for each user. It’d take a while, but it’d work. Now that I’m fluent in data, I can do it all in one short pipeline.
First question: where is the data? It’s in my test file (called test-generator). Since the test users are identified by email addresses, I can pull them out using grep.
Let’s pull them out:
cat test-generator | egrep -o '[^"]*@[^"]*' | sort | uniq
Output (toy data):
1@d.c
2@d.c
3@d.c
Great, now I have structured, newline-delimited data. Time to convert to JSON: jq -R reads each raw input line as a JSON string, and jq -s slurps those strings into a single array.
cat test-generator | egrep -o '[^"]*@[^"]*' | sort | uniq \
| jq -R . | jq -s .
Output:
[
"1@d.c",
"2@d.c",
"3@d.c"
]
Now that it’s JSON, we can use jq to make user objects. I’ll set the password to “password” in cleartext just for educational purposes. In practice, don’t store passwords in cleartext:
cat test-generator | egrep -o '[^"]*@[^"]*' | sort | uniq \
| jq -R . | jq -s . \
| jq 'map({type: "USER", email: ., password: "password"})'
Output:
[
{
"type": "USER",
"email": "1@d.c",
"password": "password"
},
{
"type": "USER",
"email": "2@d.c",
"password": "password"
},
{
"type": "USER",
"email": "3@d.c",
"password": "password"
}
]
And now, to put them in the database, I have a utility called insert-test-data, based on json-sql, which takes a JSON list of objects and inserts them into the test database.
cat test-generator | egrep -o '[^"]*@[^"]*' | sort | uniq \
| jq -R . | jq -s . \
| jq 'map({type: "USER", email: ., password: "password"})' \
| ./insert-test-data
Done at 162 characters. Although the toy example has only 3 users, this solution works no matter how many test users there are. And even if there are only a few users, the time is spent programming and practicing data fluency, not mindlessly typing meaningless strings. Time spent programming and practicing data fluency is an investment in making ourselves more fluent with data.
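insert-test-data itself isn’t shown in this post, but to make the idea concrete, here’s a hypothetical sketch of what such a utility could look like. It is not the real json-sql based tool: it assumes a users table that already exists with columns matching the object keys, and it writes to a local SQLite file instead of the actual test database.
#!/usr/bin/env python3
# Hypothetical sketch of an insert-test-data style utility (not the real,
# json-sql based tool): read a JSON list of objects from STDIN and insert
# each one into a "users" table, using the object keys as column names.
import json
import sqlite3
import sys

rows = json.load(sys.stdin)
conn = sqlite3.connect("test.db")  # stand-in for the real test database
                                   # assumes the users table already exists

for row in rows:
    columns = ", ".join(row.keys())
    placeholders = ", ".join("?" for _ in row)
    conn.execute(
        f"INSERT INTO users ({columns}) VALUES ({placeholders})",
        list(row.values()),
    )

conn.commit()
conn.close()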
Trading signal CSV (JSON)
Sometimes in my work, I get CSVs that I need to compare to the product’s output. This presents two problems:
1 - The product output is in JSON; the CSV is a CSV.
2 - The CSV and product output have different structures. The CSV only has raw data, but the product output has “transactions”.
This is a classic data fluency problem. Doing it by hand is impossible.
I can’t share the actual data, but we can use toy data to explain the idea:
Here’s the toy CSV:
Date,Field 1,Field 2
2020-01-01,Field 1-Row 1,Field 2-Row 1
2020-01-02,Field 1-Row 2,Field 2-Row 2
2020-01-03,Field 1-Row 3,Field 2-Row 3
Here’s the toy product output:
[
{
"Date": "2020-01-01",
"Transaction": true
},
{
"Date": "2020-01-02",
"Transaction": false
},
{
"Date": "2020-01-03",
"Transaction": true
}
]
And “Transaction” should be true when “Field 1” is less than “Field 2”.
Before we go on, think for a second and imagine the data is much, much larger, and “Transaction” is computed in a more complex way. How would you set this up? What tools would you use? Anything reusable? How’d you debug it?
Alright, let’s begin.
Step 1: make the CSV data JSON and look at it
cat data.csv | csv-to-json | json-format - | sed 10q
Output:
[
[
"Date",
"Field 1",
"Field 2"
],
[
"2020-01-01",
"Field 1-Row 1",
"Field 2-Row 1"
Step 2: CSV data is naturally a 2D array, but what we really want is a list of objects. This is common enough that I have a tool for it, json-table-to-objs (a rough sketch of what it does follows the output below).
cat data.csv | csv-to-json | json-table-to-objs | json-format - | sed 10q
Output:
[
{
"Date": "2020-01-01",
"Field 1": "Field 1-Row 1",
"Field 2": "Field 2-Row 1"
},
{
"Date": "2020-01-02",
"Field 1": "Field 1-Row 2",
"Field 2": "Field 2-Row 2"
Now the data’s easy to work with. Next up is extracting the trades from the raw data. Let’s whip up a quick Python script:
#!/usr/bin/env python3
import json
import sys

# Compute one output row: keep the date, and mark a transaction
# whenever Field 1 is less than Field 2.
def f(row):
    return {
        "Date": row["Date"],
        "Transaction": row["Field 1"] < row["Field 2"]
    }

# Transform the whole list of input rows.
def t(data):
    return [f(row) for row in data]

# Handle I/O: JSON in on STDIN, JSON out on STDOUT.
print(json.dumps(t(json.load(sys.stdin))))
extract-trades is an RLW that reads JSON data from STDIN and writes JSON to STDOUT. This makes it easy to use and reuse.
Step 3: extract trades using extract-trades.
cat data.csv | csv-to-json | json-table-to-objs | ./extract-trades | json-format - | sed 10q
Output:
[
{
"Date": "2020-01-01",
"Transaction": true
},
{
"Date": "2020-01-02",
"Transaction": true
},
{
Now, to compare the data, I have two tools: diff, which compares text files, and json-diff (from my json-toolkit), which finds the differences between JSON files. For this, I prefer json-diff.
# Save trade data
cat data.csv | csv-to-json | json-table-to-objs | ./extract-trades | json-format - > trade-data.json
json-diff trade-data.json simulation-output.json | json-format - | sed 10q
Output:
[
{
"leftValue": true,
"path": [
1,
"Transaction"
],
"rightValue": false
}
]
What this means is that for trade-data.json (the left file), the value at .[1].Transaction is true, but for simulation-output.json (the right file), the value at .[1].Transaction is false.
And now that I know the difference, I can debug.
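Because json-diff emits JSON, its output can be re-consumed by the same JSON tooling. As a small, hypothetical example (summarize-diff is a made-up name, not part of the json-toolkit), a one-off script could turn each difference into a single readable line:
#!/usr/bin/env python3
# Hypothetical one-off helper: read json-diff output from STDIN and print
# each difference as "path: left=... right=..." on its own line.
import json
import sys

for entry in json.load(sys.stdin):
    path = ".".join(str(part) for part in entry["path"])
    print(f"{path}: left={entry['leftValue']!r} right={entry['rightValue']!r}")
Usage would be something like: json-diff trade-data.json simulation-output.json | ./summarize-diff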
Looking back, we see that data fluency allowed us to very quickly convert the raw data (data.csv) into a format (trade-data.json) that’s comparable to simulation-output.json. We were able to leverage prewritten tools like csv-to-json and json-diff. We only had to write 7 lines of custom Python (the overall Python RLW file came from a template), which means very little debugging time. If the custom code had a bug, it’d be very easy to test, because we can create test cases as a JSON file, pass it in using:
cat test-data.json | ./extract-trades | json-format - > test-output.json
and inspect the test-output manually using our favorite text editor.
Conclusion
As we’ve seen, data fluency speeds us up immensely. Data tasks can be quickly completed by reusing prewritten tools and small amounts of custom code. The custom code is fast to write and fast to debug. Data tasks are common, so these productivity gains are significant as part of overall programming productivity. And, as we push more complexity into our data, data tasks become even more common, further increasing our productivity gains.
Next week, we’ll apply data fluency and folding complexity into our data to synthesize “Wicked Fast Testing”, a fast testing methodology which speeds up one of the biggest development time sinks: testing.
If you don’t want to miss out, hit the subscribe now button below: