Push complexity into your data
“Fold knowledge into data, so program logic can be stupid and robust.” — The Art of Unix Programming
Good design principles can help us code 1000% faster. 10x engineers are not just 10x because they fly through jq, vi, i3, and tmux on a black and white screen. They’re 10x because they design code that’s fast to implement. Good design principles help us write code that doesn’t need to be debugged (debugging is incredibly time consuming) and code that doesn’t need to be refactored as requirements change (refactoring is also incredibly time consuming).
In this post, we’ll cover one of my favorite design principles, “pushing complexity into your data.” We’ll start by going over a story of how I applied it just last week and then we’ll dig into the theory and related ideas.
Partial Test Updates
I’m building a new product where customers value our correctness. Correctness requires comprehensive testing. Comprehensive testing using standard techniques is time consuming to write: we have to write out each case and assert its expected value.
To handle this, I invented a new testing style called Wicked Fast Testing. It works like this:
cat expected-tests.json | ./tester | json-format - > actual-tests.json
# Regular diff, vimdiff, and git diff also work well to see failures
json-diff expected-tests.json actual-tests.json
# assuming tests pass
mv actual-tests.json expected-tests.json
It’s fast because the computer writes the test output and the test input/output format is JSON. However, in my implementation at the time, the tester would always run every test it’s fed, which means I could only update all of the tests at once. I want to update only some tests.
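For context, here’s a minimal sketch of the tester’s shape at the time (the names and boilerplate here are hypothetical, not the real implementation): a JSON list of tests comes in on stdin, every single one runs, and the updated list goes out on stdout.

import json
import sys

def run_test(test):
    # Call the code under test and fill in test["result"] (elided here).
    ...
    return test

# Run every test we're fed -- there's no way to run only some of them.
tests = json.load(sys.stdin)
json.dump([run_test(t) for t in tests], sys.stdout)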
Here’s my thought process: how I messed up, and how I was saved by the design principle of pushing complexity into your data:
Since I have a tester which already works for any list of tests, maybe I can use jq to drive it, writing something like:
jq 'map(if condition then ([.] | ./tester | .[0]) else . end)'
Dang, jq doesn’t support shelling out yet. Maybe I could do it in python?
./run-on-match "lambda x: …" ./tester
But… this feels complicated and do I really need a generic run-on-match program? I probably won’t need it.
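For a sense of that complexity, here’s a hedged sketch of what the hypothetical run-on-match would entail (the name and interface are made up from the command line above):

# run-on-match: filter a JSON list by a predicate, pipe the matches
# through a command, and splice the updated items back into place.
import json
import subprocess
import sys

predicate = eval(sys.argv[1])  # e.g. "lambda x: x['name'] == 'foo'"
command = sys.argv[2]          # e.g. "./tester"

items = json.load(sys.stdin)
matches = [x for x in items if predicate(x)]
result = subprocess.run([command], input=json.dumps(matches),
                        capture_output=True, text=True, check=True)
updated = iter(json.loads(result.stdout))
json.dump([next(updated) if predicate(x) else x for x in items], sys.stdout)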
Okay, since I’m not pursuing a generic solution, I’ll bake the functionality directly into tester. Well, the easiest way to do that is to have an optional skip field: if it’s present and true, skip the test; if it’s not present, run the test. The code looks like:
def run_test(test):
    if test.get("skip"):
        return test
    # otherwise do actual test
But the expected test file should never have skipped tests, so the tester has to clear the skip fields from its output. Which now means the code is:
def run_test(test):
    skip = test.get("skip")
    if "skip" in test:
        del test["skip"]
    if skip:
        return test
    # otherwise do actual test
Not bad, it’s only a few lines, but it still feels more complicated than it should be. If I can’t simplify this in a few minutes, I’ll move on; it’s pretty small. Should I just make skip required in the data? That keeps the code simpler… oh right! Push complexity into the data. Alright, now I have a solution. Make skip a required field, push it into the test file, and keep the code as simple as possible:
# In tester (a python file)
def run_test(test):
    if test["skip"]:
        return test
    # otherwise do actual test
and to migrate all of the tests (a good example of why wicked fast testing is wicked fast):
# In a bash shell
cat expected-tests.json | jq 'map(.skip = false)' > a
mv a expected-tests.json
Done.
Although the tiny simplicity gains in this example don’t live up to the headline of 10x faster coding, the small, self-contained nature of this example is great pedagogically to showcase how to apply the idea. When applied at a larger scale, pushing complexity into data can speed up development time 10x. Complex systems turn into simple filters piping data from one end to the other.
Why?
Complexity has to live somewhere. Either it lives in the data, or in the code. And for humans and programs, data is faster to work with than code.
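As a generic illustration (my example, not from the tester story): the same lookup can live as branches in code or as a table in data.

# Complexity living in code: every new country means another branch.
def shipping_cost_code(country):
    if country == "US":
        return 5
    elif country == "CA":
        return 8
    elif country == "MX":
        return 7
    else:
        raise KeyError(country)

# Complexity living in data: the logic stays stupid and robust.
SHIPPING_COST = {"US": 5, "CA": 8, "MX": 7}

def shipping_cost_data(country):
    return SHIPPING_COST[country]

The data version is also the one you can diff, query, and generate.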
Self-contained
Sometimes understanding one small piece of code requires understanding every piece it interacts with, and every piece that interacts with that. And now you’ve got 9 files open across 2 monitors tracing the code just to understand one line.
Data is self-contained. It has a specification and values. Each value of the data is also self-contained. That’s it. All of the information about the data is there, right in front of you. This means reasoning about data is fast. There’s no need to learn about data somewhere else to understand this piece of data.
There is a half-exception in SQL, where a row references other rows, often in other tables. But understanding the data that’s directly in the row does not require knowing the data in the other rows.
Time Invariant
Code, when run, has time-varying state. Understanding code, in both the average and worst case, requires simulating the code in our head. This is slow and tedious. Stepping through the code. Line by line. Tracking every variable and every update. Humans are smart, but we’re very slow at simulating programs.
Data, however, isn’t “run”. It doesn’t have time varying state. Data doesn’t do anything. It just exists. Data can be transformed, but that’s new data. It’s not this data. Everything about this data is right in front of you. No stepping needed.
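To make that concrete, here’s a small made-up example:

# To know what "table" holds, you simulate the loop in your head,
# iteration by iteration, tracking i and the dict as they change.
table = {}
for i in range(3):
    table[f"key{i}"] = i * i

# The equivalent data needs no simulation. It just sits there.
table = {"key0": 0, "key1": 1, "key2": 4}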
Data’s faster to manipulate
When pushing complexity out of code and into data, our code gets smaller and our data files get larger. Luckily, large data files are much faster to manipulate than large code files.
Imagine the tests were embedded in python as pytest tests rather than as data. Each test looks like:
def test_sample_test_foo():
    expected_result = ...
    actual_result = code.foo(bar, baz)
    assert actual_result == expected_result
How would we add skipping to our program? Well, if our code is formatted right, and we’re sed and regular expression experts:
sed -i '/^def test_/a\
    skip = False\
    if skip:\
        return' test-file.py
Not so bad. Now, let’s say we want to skip tests that call the foo method with baz equal to 3. Well, if our code is styled right, our data is nice, and we know vi and regular expressions well, we might be able to record a vi macro that searches for "actual_result = code.foo([^,]*, 3.*" and then goes up a few lines and flips skip to True. To undo it, again assuming our code is styled right, we could use a global regex to set every skip back to False.
What if our test data is data? Adding skipping is easy with jq:
cat expected-tests.json | jq 'map(.skip = false)' > a
mv a expected-tests.json
# Plus add 'if test["skip"]: return test' to the code
Using jq here is significantly easier than sed, but if you’re an expert in both, it’s about the same. Now let’s implement skipping tests that run foo with baz equal to 3, because for whatever reason those tests are slow. Imagine the test runner looks like:
# actually run test
import code

def run_test(test):
    if test["skip"]:
        return test
    return getattr(code, test["name"])(*test["params"])

# More boilerplate to handle JSON and IO
paired along with test data that looks like:
[
  { "name": "foo", "params": ["bar", 0], "result": ... },
  { "name": "foo", "params": ["bar", 3], "result": ... },
  ...
]
then, we write a quick jq script to pick which tests to skip:
cat expected-tests.json | jq 'map(.skip = (.name == "foo" and .params[1] == 3))' | ./tester > actual-tests.json
This is fast to write, likely correct, leaves no trail, and works for an arbitrarily large number of tests.
Although manipulating data and code by hand are about equally time consuming, data is much easier to manipulate programmatically. Manipulating code can be done, but it is more difficult and error prone, depends on a consistent, clean code style to even work, and requires expertise in regular expressions and ancient unix text manipulation tools.
Code as data…is not a magic bullet
To an interpreter, code is data represented as an Abstract Syntax Tree. There are ways to render ASTs as JSON. In theory, we could manipulate the code by manipulating the JSON AST with something like:
code -> ast -> json | jq jq_cmd | json -> ast -> code
But the AST structure for code is quite complicated, and the jq_cmd would be correspondingly complicated to write. To insert code that skips tests, we’d have to learn the AST structure for if statements and return statements, navigate the AST to the top-level functions starting with “test_”, and insert our custom ASTs, which we’ll probably have to debug since ASTs are non-trivial. These complications take time and are much slower than modifying a simpler data file with map(.skip = true).
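To see just how much heavier that is, here’s a hedged sketch using Python’s own ast module (assuming Python 3.9+ for ast.unparse; this is my illustration, not tooling from the tester project). The equivalent edit on the JSON test data is a one-line jq.

import ast

source = open("test-file.py").read()
tree = ast.parse(source)

# Navigate to top-level functions named test_* and prepend the AST
# equivalents of "skip = False" and "if skip: return" -- two node
# shapes we had to go look up.
for node in tree.body:
    if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
        skip_flag = ast.Assign(
            targets=[ast.Name(id="skip", ctx=ast.Store())],
            value=ast.Constant(value=False),
        )
        skip_guard = ast.If(
            test=ast.Name(id="skip", ctx=ast.Load()),
            body=[ast.Return(value=None)],
            orelse=[],
        )
        node.body = [skip_flag, skip_guard] + node.body

# New nodes lack line numbers, so let ast fill them in before unparsing.
ast.fix_missing_locations(tree)
print(ast.unparse(tree))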
Conclusion
We saw a real world use case where we could push complexity into data. And we’ve discussed some reasons why complexity in data is faster to work with than complexity in code.
Next week, since we opened the subject of using data with code, we’ll discuss Data Fluency, a critical skill for any fast coder. If you don’t want to miss out, just hit the “Subscribe now” button below.