An Artisanal Way to Do Data.
There’s much debate over what the most critical aspect of a model is: the perfect architecture, the best data, or pure FLOPs. I’ve got no idea, but I definitely know data scores pretty well.
A little ‘tacit’ piece of knowledge I’ve picked up talking to really great ML people is that obsession with your data really is one of the most under-appreciated yet highly predictive qualities of amazing researchers.
The more I dig into the data I train models on, the more viscerally I understand why this is true. A while back I spent a considerable amount of time constructing my own subset of Common Crawl. Even starting from a heavily pre-cleaned dataset, The Pile, I was pretty shocked.
Imagine every single shitty real estate listing you’ve seen. Imagine the footer section of your local dentist’s website. Imagine the most incoherent 2015 YouTube comments section you’ve read. I skimmed for quite some time, just in shock. Imagine what a child would learn if this were all we exposed them to. Yet that’s all our models see - LLMs are insane.
Back in the days of MNIST or LJSpeech, I can imagine it wasn’t too difficult to gain exposure to your entire dataset. But at the scale of modern ML this is firmly impossible. You learn a lot playing with toy models on MNIST - you notice which examples are hard to learn, where all the loss is going. You learn that some of the labels are pretty iffy, and maybe you’re not even sure you’d agree with them. But this just doesn’t scale. LJSpeech, for example, comprises 24 hours of speech, and its 13,000 segments have trained huge numbers of speech models. Go up to LibriSpeech, 40x larger at 1,000 hours. You could probably sit through all of LJSpeech - LibriSpeech is getting pretty rough, though. Now look at what Whisper v3 was trained on: a total of 5 million hours of audio. Just a small 5,000x increase again.
What’s even going on in those big datasets? The modern approach I’ve mostly seen has thoroughly embraced scale - and thoroughly given up on any method of actually seeing the data.
The industry has really bought into the fantastic utility of giant reproducible pipelines, but it hasn’t found a way to reconcile those pipelines with the old-fashioned hand-crafting of a high-quality dataset. Tooling to move 5TB of text from here to there as fast as possible is far easier to find today than tooling to help you see what text is actually in there.
It’s my opinion that in reality, most datasets today are hilariously irreproducible, despite all the ‘science’ and tooling that goes into making them so. Really, they’re still cobbled together from the intuitions and heuristics of a few researchers. Those researchers were probably stuck dealing with a bunch of parquet files in a Jupyter notebook, so they printed out the first 200 examples, wrote whatever code they could to get those 200 examples looking clean, and pressed the big green “SCALE” button.
So, what’s the alternative I’m proposing here? Hand-crafted datasets. Instead of pretending our pipelines make everything reproducible, I’d say we embrace the hand-crafting.
Maybe our datasets should be stored in relational databases with fancy web UIs built over them, rather than in giant parquet files. Rather than a giant deterministic set of pipelines, maybe a dataset should be something you hand-craft and nurture.
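To make that less abstract, here’s a rough sketch of the kind of thing I mean - SQLite and these particular tables are placeholders I made up for illustration, not a recommendation:

```python
# A minimal sketch of "dataset as a database": one table for the samples
# themselves, one for the history of hand edits. Table and column names
# here are invented for the example.
import sqlite3

conn = sqlite3.connect("shed.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS samples (
    id     INTEGER PRIMARY KEY,
    text   TEXT NOT NULL,
    source TEXT,               -- e.g. which Common Crawl dump it came from
    status TEXT DEFAULT 'raw'  -- 'raw', 'edited', 'excluded', ...
);
CREATE TABLE IF NOT EXISTS edits (
    id        INTEGER PRIMARY KEY,
    sample_id INTEGER REFERENCES samples(id),
    old_text  TEXT,
    new_text  TEXT,
    reason    TEXT,            -- why the sample was touched
    edited_at TEXT DEFAULT CURRENT_TIMESTAMP
);
""")
conn.commit()
```

The point isn’t the schema; it’s that every sample gets an identity you can hang a UI, a comment, or an edit history off.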
Make it possible - encourage it, even - to make tiny changes to individual samples by hand. Then find ways to scale those changes. Put as much effort into tooling for playing with your data as you put into training on it. Make a “CRM” for your data. Make it possible to see all the pages, not just the first 100. Make finding strange and wacky things as easy as possible, everywhere.
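And when one of those tiny manual fixes turns out to generalise, promoting it into a heuristic over the whole table should be trivial. Something in this spirit - `fix_footer` and the regex are invented for the example, and it assumes the toy schema above:

```python
# Hedged sketch: a hand edit generalised into a heuristic, applied everywhere,
# with the old text kept so the change is inspectable and reversible.
import re
import sqlite3

conn = sqlite3.connect("shed.db")

FOOTER_RE = re.compile(r"(copyright \d{4}.*|all rights reserved.*)$",
                       re.IGNORECASE | re.DOTALL)

def fix_footer(text: str) -> str:
    """Strip the boilerplate footer I first deleted by hand from one sample."""
    return FOOTER_RE.sub("", text).rstrip()

rows = conn.execute("SELECT id, text FROM samples WHERE status != 'excluded'").fetchall()
for sample_id, text in rows:
    cleaned = fix_footer(text)
    if cleaned != text:
        conn.execute("UPDATE samples SET text = ?, status = 'edited' WHERE id = ?",
                     (cleaned, sample_id))
        conn.execute("INSERT INTO edits (sample_id, old_text, new_text, reason) VALUES (?, ?, ?, ?)",
                     (sample_id, text, cleaned, "footer boilerplate, generalised from a manual fix"))
conn.commit()
```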
Show metrics everywhere, like the loss various models report on each of these samples. Embed everything and make it searchable. Auto-cluster things in 15 different ways. Make writing heuristics, finding similar samples, or finding dissimilar ones part of the app.
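One plausible way to wire up the embedding / clustering / similarity side, again against the toy schema above - sentence-transformers and scikit-learn are assumptions here, just one stack you could pick:

```python
# Hedged sketch of the "embed everything, cluster it, find neighbours" part.
import sqlite3
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

conn = sqlite3.connect("shed.db")
rows = conn.execute("SELECT id, text FROM samples WHERE status != 'excluded'").fetchall()
ids, texts = zip(*rows)

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(list(texts), normalize_embeddings=True)

# One of the "15 different ways" to cluster; rerun with other k or other algorithms.
labels = KMeans(n_clusters=50, n_init=10).fit_predict(embeddings)

conn.execute("CREATE TABLE IF NOT EXISTS clusters (sample_id INTEGER, k INTEGER, label INTEGER)")
conn.executemany("INSERT INTO clusters (sample_id, k, label) VALUES (?, 50, ?)",
                 list(zip(ids, labels.tolist())))
conn.commit()

# "Find similar samples" is just a dot product once the embeddings are normalised.
def nearest(query_idx: int, top_k: int = 10) -> list:
    sims = embeddings @ embeddings[query_idx]
    return [ids[i] for i in np.argsort(-sims)[1 : top_k + 1]]
```

Per-sample losses from whatever checkpoints you have lying around slot into the same table-per-metric pattern.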
This concept’s obviously not novel, but it’s worth thinking about more.
I’ve been treating my own datasets this way. I renamed my data warehouse “Shed”.