Dense Code
One of my hobbies is taking a model from Huggingface Transformers and just progressively removing every piece of code until all that's left is a tiny, hundred-or-so-line definition of the raw model.
Occasionally there are some dense bits, and you'll need to play with the ops a bit to see if there are any opportunities to compress the code. But usually not. Usually it's just thousands of lines of boilerplate, strange abstractions, and a lot of branching.
There’s something crazy about doing this. You’d think by removing all these parts - I’m just removing features. Surely I won’t be able to run inference now. Surely I won’t be able to do distributed training. Surely it’s not even possible to load the weights anymore.
And yet that tiny remaining codebase, plus a small amount of trainer code, usually runs substantially faster, produces perfectly identical outputs (unless you swap in FlashAttention, which I usually do), and loads all the weights just fine.
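For concreteness, here's roughly the shape of what's usually left over. This is a minimal sketch assuming a vanilla GPT-style decoder, not the actual code of any particular Transformers model - the names (Config, Block, TinyLM) and the hyperparameters are made up. The scaled_dot_product_attention call is also where the FlashAttention swap tends to happen in practice, since PyTorch dispatches to a fused kernel when one is available.

```python
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class Config:
    vocab_size: int = 32000
    d_model: int = 768
    n_heads: int = 12
    n_layers: int = 12
    max_seq_len: int = 2048


class Block(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.n_heads = cfg.n_heads
        self.ln1 = nn.LayerNorm(cfg.d_model)
        self.ln2 = nn.LayerNorm(cfg.d_model)
        self.qkv = nn.Linear(cfg.d_model, 3 * cfg.d_model, bias=False)
        self.proj = nn.Linear(cfg.d_model, cfg.d_model, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(cfg.d_model, 4 * cfg.d_model, bias=False),
            nn.GELU(),
            nn.Linear(4 * cfg.d_model, cfg.d_model, bias=False),
        )

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        # The "swap in FlashAttention" step: this dispatches to a fused
        # attention kernel when one is available on the current hardware.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(y.transpose(1, 2).reshape(B, T, C))
        return x + self.mlp(self.ln2(x))


class TinyLM(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.d_model)
        self.pos_emb = nn.Embedding(cfg.max_seq_len, cfg.d_model)
        self.blocks = nn.ModuleList(Block(cfg) for _ in range(cfg.n_layers))
        self.ln_f = nn.LayerNorm(cfg.d_model)
        self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)

    def forward(self, idx):  # idx: (batch, seq) of token ids
        pos = torch.arange(idx.shape[1], device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))  # logits: (batch, seq, vocab)
```

Loading the original checkpoint then mostly comes down to renaming state_dict keys to match.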
Why is this so often the case? I don't have a solid answer. Chesterton's Fence says maybe I'm missing something. Job security for MLEs?
One theory I have is around the habits of programmers. Working under the assumption that you can only hold so much in your brain's working memory at any given moment, when you encounter a complex system, there are two paths (a toy contrast follows):
Slice up the problem, creating abstraction at the borders. Now you can load a small chunk into memory without worrying about the others.
Conceptually compress the problem down, until all that remains is its essence.
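To make the two paths concrete, here's the toy contrast - deliberately exaggerated, and not lifted from any real codebase - of the same feed-forward computation expressed both ways:

```python
import torch
import torch.nn.functional as F


# Path 1: slice and abstract. Each piece is tiny, but the actual computation
# is scattered across a registry, a base class, and three helper methods.
class ActivationRegistry:
    _fns = {"gelu": F.gelu, "relu": F.relu, "silu": F.silu}

    @classmethod
    def get(cls, name):
        return cls._fns[name]


class FeedForwardBase:
    def __init__(self, w_in, w_out, act_name):
        self.w_in, self.w_out = w_in, w_out
        self.act = ActivationRegistry.get(act_name)

    def forward(self, x):
        return self._project_out(self._activate(self._project_in(x)))

    def _project_in(self, x):
        return x @ self.w_in

    def _activate(self, h):
        return self.act(h)

    def _project_out(self, h):
        return h @ self.w_out


# Path 2: compress. The whole idea fits in one line you can read at a glance.
def feed_forward(x, w_in, w_out):
    return F.gelu(x @ w_in) @ w_out


# Both compute exactly the same thing.
x, w_in, w_out = torch.randn(2, 8), torch.randn(8, 32), torch.randn(32, 8)
assert torch.allclose(FeedForwardBase(w_in, w_out, "gelu").forward(x),
                      feed_forward(x, w_in, w_out))
```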
Obviously there’s a time for each, but I think sometimes a habit emerges where engineers instinctively reach for the “abstraction” approach for every marginally complex problem they see.
Maybe early on this was the only way they could understand the big thing. Maybe a few of the particularly complex problems they encountered really did need all that abstraction, so now they put it in everywhere for good measure. Or maybe it's for consistency - making sure the whole codebase has the same feel to it.
There's a big advantage to be had here, in just practicing writing dense code. I'm not talking about random thousand-line monstrosities or code golf, just a hundred lines or so of pretty thick code. The obvious starting advantage is that you can actually see the entire algorithm all at once - and that unlocks a lot of insight into where the code is malleable and how you can improve it holistically.
But downstream of this, I think, a bigger advantage arises - you start to grow the ability to sense how big new problems really are, how much complexity is really there. When you open a new library up and start skimming the code, you can feel how much is actually going on. Sometimes you'll rip it out, just so the logic is all out in front of you. Or sometimes it'll be very obvious how deep the rabbit hole goes.
This is a huge agency unlock - it lets you decide which problems are worth filing away behind big abstractions, and which problems you should peek behind the curtain for. When every abstraction ever created is a leaky one, there's lots of utility in knowing when it's worth figuring out what's really going on.
Random points:
I think this works a lot better in ML stuff - there generally aren't hundreds of corner cases or a giant set of features that all need to mesh together.
It’s fantastic that stuff like Karpathy’s nanoGPT exists - a perfect demonstration of what happens when you remove all the layers. Hint: nothing bad happens.
TinyGrad is a great public case study in trying to keep things dense even with a very substantial amount of scope. On first contact it feels too dense, but after a few hours with it, it clicks that you could definitely wrap your head around everything in it with some work. The same could never be said for Torch.