Someone said this about FAANG jobs:
It’s so true! But why?
It turns out, FAANG interviews are all about algorithms, and software engineering isn’t. Does Python use quicksort or bubble sort when you call sorted? Doesn’t matter, just call sorted. What matters? Data structures. And not the kind from CS61B.
Rob Pike
Rob Pike, a Unix legend and co-creator of Go, has this to say about data and algorithms:
Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
Personally, I’d change “data structures” to “data schemas”, but otherwise Rob Pike is spot on.
If you’ve never heard this before, sit on it. It’ll change how you see design and make you code much, much faster. We’ll dig into it deeper in this post.
Today’s post is on…
how to think about algos and data when architecting systems.
“Bad algos” are usually good enough
If you worked hard to get a degree in computer science, you’ve probably had it drilled into you that an inefficient algorithm is THE WORST THING EXCEPT RACE CONDITIONS. O(N^2) never terminates and O(NlogN) is so much better that when you replace bubble sort with quicksort at work, you’ll get a standing ovation.
Just kidding, someone already did that. Just use sorted and don’t worry about it.
C++ standard allocator
Once in my travels, I read a story about a C++ compiler that shipped with a simple memory allocator. Allocating memory efficiently requires knowledge of the program, so the compiler used a simple but extremely inefficient algorithm to allocate memory. It said “FIXME” and the authors assumed users of the compiler would tune it to allocate memory efficiently for their particular application.
To the horror of the author, many people never touched it. Why? The inefficient one was good enough. So, when you’re thinking about your design, don’t think too hard about the algos; which one you use rarely matters. Generally, it’ll only matter when:
You’ve built a system and it can’t handle load because of an inefficient algo
The data schema doesn’t permit swapping in a more efficient algo (even when algos are a problem, they’re not really the problem)
Why should data dominate?
Neat theory that data dominates, but why? Why does data matter so much? And why don’t schools teach this?
Data depends on nobody
It turns out, programs depend on data schemas, but data schemas don’t depend on the programs. This means data and data schemas form the foundation of the architecture. Everything else stems from there.
Programs are mediums for data
Another way to think about this, and one I personally use a lot, is that programs and computers are merely mediums for data. Programs, at the end of the day, just read data and write data. Data flows through programs and computers. It’s pooled in databases and people’s heads. From people’s heads, data flows through their hands, onto keyboards, into their computers, into the front end, into the back end, and settles in the DB. Or it rises up from the DB, gets plumbed through the back end, into the front end, onto a monitor, and into a person’s head.
Thinking at this level, web applications look like the following:
User <-> Front end (browser or native app) <-> Back end <-> DB
Data flows between the user and DB via the frontend and backend. How the data moves is mostly irrelevant. It’d work perfectly fine even if the internet traffic was carried by pigeons.
For something like Facebook messenger, it looks like:
User A <-> Front end <-> Back end <-> Front end <-> User B.
Group chats would look like a hub and spoke.
What technology is used for the front end and backend doesn’t matter.
Python or Java? Doesn’t matter
React.js or vue.js? Doesn’t matter
Protobufs or JSON? Matters a bit, but not a lot
The fields of a message object? Matters a whole lot
It doesn’t really matter how each system is made. Java vs Python? Doesn’t affect the architecture. The front ends can be a browser, iOS, Android, or even a CLI client. It doesn’t affect the architecture.
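To make that concrete, here’s a hypothetical sketch of a chat message schema. The names and fields are assumptions for illustration; the point is that whether this object travels as JSON or protobuf, through Python or Java, matters far less than which fields exist.

```python
from dataclasses import dataclass

# Hypothetical message schema for a chat app like the one sketched above.
# The fields, not the transport or the language, shape the architecture.
@dataclass
class ChatMessage:
    message_id: str    # unique id, lets receivers deduplicate retries
    sender_id: str
    recipient_id: str  # a user id, or a group id for group chats
    sent_at: float     # unix timestamp
    body: str
```

Swap the backend from Python to Java and this schema survives untouched; drop the message_id field and suddenly no system downstream can deduplicate.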
What does affect the architecture: data schemas
Persisted data schemas
Once you persist data in a particular schema, it lives forever. All future systems must handle it, whether directly or by migrating old data schemas into more modern ones.
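A hypothetical sketch of what “handle it or migrate it” looks like in practice. The record shape and the v1-to-v2 change here are invented for illustration:

```python
# Hypothetical sketch: once v1 records are persisted, every future reader
# must handle them, either directly or by upgrading on read.
def upgrade_record(record: dict) -> dict:
    """Migrate an old record to the current (v2) schema."""
    if record.get("version", 1) == 1:
        # Assumed change: v1 stored a single "name" field, v2 splits it.
        first, _, last = record["name"].partition(" ")
        record = {"version": 2, "first_name": first, "last_name": last}
    return record
```

Every reader funnels records through this function forever, because the v1 rows live forever (or until a one-time backfill rewrites them all).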
Non-persisted data
Persisted data at least can be migrated. Non-persisted data is gone forever. Be careful when throwing away data if you might need it in the future.
States a data schema can represent
This is the big one. When writing a data schema, be careful to ensure it can represent all possible states. I saw a small crisis happen because a data schema couldn’t represent a valid state.
What happens when you mess up? Pain.
I once saw an architecture that only kept around the most recent id of processed items. This was based on the assumption that ids were always processed in order. If 5 was processed, then we know 4, 3, 2, 1, and 0 were also processed, so we only need to store that 5 was processed. It cleverly uses O(1) space instead of O(n) space and makes finding the next id to process O(1) instead of O(n). Clever. Too clever. It could only represent states that matched this assumption. 0 was processed. 0 and 1 were processed. 0, 1, and 2 were processed. It could not represent processing 0 and 2, but not 1.
That assumption was radically violated. A new feature was added and now ids were not necessarily processed in order. Sweating ensued. Frustration and redesigns and thinking. Migrations. Code being rewritten. Collapsing state is extremely dangerous. It adds efficiency, but can blow up when you collapse too much.
I don’t remember how they fixed it, but if I had to, I’d migrate systems back to the naive O(n) solution. That can represent every possible state. There would be a table recording which ids were processed. Was 4 processed? Check row 4 in the table. Was 3 processed? Check row 3. Looking to process a new id? Scan the table. It always works.
If I needed more operational efficiency, I’d add other state types to handle common cases. For example, most of the time ids are processed in order, so one state type would be “last processed id.” However, should that assumption be violated, the system would fall back on a for-sure way to handle all ids.
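Here’s a minimal sketch of that hybrid design. It’s my own illustration, not the original system: an explicit set of processed ids (the always-correct O(n) representation) plus a “last contiguous id” fast path for the common in-order case.

```python
class ProcessedTracker:
    """Tracks which ids have been processed.

    Every possible state is representable: ids processed out of order
    sit in an explicit set, while the in-order common case collapses
    into a single "contiguous prefix" integer.
    """

    def __init__(self):
        self.contiguous_upto = -1   # all ids <= this have been processed
        self.out_of_order = set()   # processed ids beyond the prefix

    def mark_processed(self, id_):
        if id_ == self.contiguous_upto + 1:
            self.contiguous_upto = id_
            # Absorb any previously out-of-order ids that are now contiguous.
            while self.contiguous_upto + 1 in self.out_of_order:
                self.out_of_order.remove(self.contiguous_upto + 1)
                self.contiguous_upto += 1
        elif id_ > self.contiguous_upto:
            self.out_of_order.add(id_)

    def is_processed(self, id_):
        return id_ <= self.contiguous_upto or id_ in self.out_of_order
```

Unlike the “most recent id” scheme, this can represent “0 and 2 processed, but not 1,” and it degrades gracefully back to O(1)-ish state as the gaps fill in.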
And while we’re on the topic, O(n) is perfectly fine. If we have to process 1 billion ids, a bitmap takes up…125MB, or $0.0027 (based on a 4TB drive that costs $85). If that data becomes cumbersome in practice, we can add other state types, knowing we have the fallback to handle every case.
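The back-of-envelope arithmetic, spelled out (the 4TB-for-$85 drive price is the assumption from the text):

```python
# One bit per id: how much does tracking 1 billion ids actually cost?
n_ids = 1_000_000_000
bitmap_bytes = n_ids // 8              # one bit per id -> 125,000,000 bytes
bitmap_mb = bitmap_bytes / 1_000_000   # 125 MB

drive_bytes = 4 * 10**12               # assumed: a 4TB drive
drive_cost_usd = 85.0                  # assumed: costing $85
cost_usd = bitmap_bytes / drive_bytes * drive_cost_usd  # ~$0.0027
```

A quarter of a cent to represent every possible state correctly. “Inefficient” O(n) state is often this cheap.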
Serialization format matters…a little bit.
JSON vs protobuf? Which should you use? Doesn’t matter so much, both can represent roughly the same set of data schemas.
However, protocols are more persistent than systems. If one system speaks JSON, all of its sister systems must too. All of the systems can be swapped out, but the JSON will remain. This is like the Ship of Theseus, which remains a ship even when all of its parts change.
Whether or not it’s the same ship is a debate amongst people who don’t build things. As someone who builds things, I can firmly say the ship is the same. The ship is the protocol between the parts. The physical nature of the parts is irrelevant. An application in the cloud can run on different hardware every day. It’s still the same application.
However, converting between serialization formats is a mechanical task. As such, even though protocols live, working around poor choices is easy.
Data schemas matter
Of course, what matters most is what data gets serialized. What data objects exist and their relationships with each other. It’s hard to talk about, however, without specific examples. I’ll discuss one of my favorite rules when moving money.
When moving money, use an idempotence token
Idempotence is worth a post in its own right, so I’ll give a quick explanation here:
An API is idempotent if “it’s safe to call it again.” For example, you call an API to move $50,000. You get a network timeout. Do you
A) resend $50,000, and risk sending $100,000 total.
B) Not resend and risk paying $0 total.
These are both terrible options. If the API is “idempotent”, option A becomes
A) Try sending $50,000 again, if it hasn’t been sent.
So, when designing financial systems, it’s extremely important that money-moving APIs have an idempotence token field in their schema. Designing the architecture with idempotence in mind is important. And it matters far more than whether you use JSON, XML, Python, Erlang, or a 56k baud modem to call the API.
Conclusion
FAANG interviews are about algorithms, but engineering is about data schemas.
Think carefully about the data schemas, ensure all states can be represented.
Don’t think too much about the algorithms or the technology itself.
If you learned something today, or just want to show your support, take a second to click the like button.