Sam Kaplan
January 9, 2023
While GPT3 feels like the beginning of a whole new frontier, in many ways it’s the last of an old wave. For the past decade, improvements in AI and machine learning have been driven by one main force: increasing model size. I first remember this race with ImageNet, one of the first large-scale datasets for image classification. The biggest and baddest teams would compete each year for the crown. Each year brought new techniques and tricks, but it also brought larger models and more computation. GPT is no different. If you read the last three GPT papers, they all use almost exactly the same architecture.
For a second, let’s think about what a deep neural network really is. I think it’s easy to get confused by thinking of it as inspired by the brain. Its structure isn’t really the beginning, but the natural conclusion given the constraints. Thinking from first principles, what do we need to train a machine learning model? Well, a model that predicts Y from X is, at its core, a function of two things: X and weights. So, to create a good generalizable machine learning architecture, we need something that can predict Y for large classes of functions just by changing the weights. So, we need an expressive architecture, but we need one more thing - an architecture where we can derive the weights tractably. The key to neural networks is really one big thing: stochastic gradient descent. If you can write your model as a differentiable function, then you can easily train it.
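To make "a differentiable function of X and weights" concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything from a real network: the model is a one-input linear function, the target y = 2x + 1 is made up, and the names predict and sgd_fit are hypothetical. The point is only that once the model is differentiable, training reduces to repeatedly nudging the weights against the gradient of the error on one random example.

```python
import random


def predict(x, weights):
    """The model: a differentiable function of the input x and the weights."""
    w, b = weights
    return w * x + b


def sgd_fit(data, steps=5000, lr=0.01, seed=0):
    """Stochastic gradient descent on the squared error, one example per step."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(data)          # the "stochastic" part: one random example
        error = predict(x, (w, b)) - y
        # Gradient of 0.5 * error**2 with respect to w and b.
        w -= lr * error * x
        b -= lr * error
    return w, b


# Made-up training data drawn from y = 2x + 1 (an assumption for illustration).
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-20, 21)]
w, b = sgd_fit(data)
print(f"learned w={w:.4f}, b={b:.4f}")
```

Nothing in the training loop knows what the function is supposed to be. Swap predict for any other differentiable function of the weights and the same loop applies, which is the sense in which the architecture follows from the constraints.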
Neural networks aren’t really anything smart. They’re just flexible differentiable functions which allow us to brute force complicated models. This is exactly why adversarial examples will always exist for a single network. Neural networks aren’t exactly learning the input distribution. They’re just brute forcing a very close approximation.
A deep learning network, AlexNet, won the ImageNet competition for the first time in 2012. So, it makes complete sense that the past 10 years have been dominated by larger and larger improvements to model size. Researchers who have believed in limits to improvements from model size have been left in the dust.
And, GPT3 is a great example of why. Language models are in many ways the utopian dream of AI. Language, at its core, is just the encoding of ideas. So, a sufficiently advanced language model is a human or even something more. GPT3 is impressive and surprising, and most of all fun. It and its successors will have lots of applications and impact on the world. But, we should also be cognizant of its limitations. GPT3, no matter how intelligent it seems, is just capturing patterns in the distribution of text data. It’s performing a geometric transformation. Translating between shapes, high dimensional shapes, but shapes nonetheless.
This is most visible when you try to do math with GPT3. It fails pretty quickly. Why? Because if you’re not actually performing logic and computation, and just transforming between shapes, then the problem space is giant. Multiplication of just two four-digit numbers (which it fails for some numbers) involves roughly eighty million possible pairs of inputs (9,000 × 9,000). Try giving GPT3 problems with nested loops or recursion. It may work one or two or even three levels deep, but it will very rarely work deeper than that. That’s because all computation is being performed at training time. So, the levels of recursion are bounded by the complexity of encoding additional data in the weights of the neural network. I’m not sure, but my guess for the scaling factors is:
To add a single data point or example: O(log(N)) where N is the size of the whole network.
To support another level of recursion: O(M^K) where M is the number of data points involved in the recursion and K is the number of levels of recursion.
So, a language model based on a single large network can efficiently add more examples (though there is some increasing cost as the network gets bigger). But, it’s very hard to store more levels of thought or computation.
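To get a feel for how differently those two guesses grow, here is a small sketch that simply evaluates them. The formulas are the guesses above, not measured results, and the function names and sample values of N, M and K are arbitrary assumptions chosen for illustration.

```python
import math


def cost_to_add_example(network_size):
    """Guessed cost of storing one more example: O(log(N))."""
    return math.log(network_size)


def cost_to_add_recursion_level(data_points, levels):
    """Guessed cost of supporting K levels of recursion over M data points: O(M^K)."""
    return data_points ** levels


# Doubling the network barely changes the cost of adding an example...
for n in (1e9, 2e9, 4e9):
    print(f"N={n:.0e}: log(N) = {cost_to_add_example(n):.2f}")

# ...while each extra level of recursion multiplies the cost by M.
for k in range(1, 6):
    print(f"M=10, K={k}: M^K = {cost_to_add_recursion_level(10, k)}")
```

Under these assumptions, doubling N adds a constant (log 2) to the first cost, while every additional level of recursion multiplies the second by M.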
If you think this makes sense, then it’s easy to predict the future of GPT. GPT4 and GPT5 and 6 and beyond will be better and better and better. I’m sure GPT5 will blow our socks off. But, as long as GPT stays a single model, it will have strong bounds on how much computation / levels of thinking it can do, and looser ones on the amount of information it can store.
As I said, language models are the holy grail of AI. So, I think it’s fitting to say that language models will also be the clearest example of each generation of AI technology. GPT’s star is only beginning to shine, but its limits are also in sight. So, it’s time to start thinking about what’s next.
Right now, building machine learning models is like the early days of programming. Coming up with different network architectures is essentially a research problem. Implementing neural nets is complex and time-consuming. It’s like we’re still programming on punch cards or in Fortran. Early programmers spent more time thinking about performance than the program itself. Machine learning is still highly, highly constrained around performance.
However, as machine learning technologies get more mature, the role of ML engineers will go from scientists and magicians to what programmers do today: writing business logic in expressive programming languages at high levels of abstraction. ML engineers will be no different, but instead of just writing code, they will be fundamentally encoding human priors into a larger model.
With a sufficiently large amount of data, ML can learn any problem. But, for more complicated problems, data are too sparse for ML to learn by itself. Imagine a company building an AI assistant which automatically plans your life (aka Decision Tech). There’s not enough data to automatically learn such a complex problem. Instead, you could imagine this company having thousands of engineers, each team responsible for writing one part of the larger model. There could be an exercise team which teaches the model about different types of exercise and when they’re appropriate. There could be an employment team which connects the model to employment marketplaces and teaches it how to match people to jobs.
Each team would be writing in something close to English - just like code. Their goal is simply to give the model the best priors possible. The cool part is that this isn’t static. The model doesn’t solely follow the “code” / priors that humans encode. Instead, these priors are just that - initializations that the network uses as the basis to learn from. At the end of the day, all these subcomponents from thousands of engineers are combined automatically and then given a life of their own as the model learns for itself.
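As a toy illustration of "priors as initializations", here is a sketch in plain Python. It is a deliberately tiny stand-in, not how such a system would actually be built: the "model" is a two-weight linear function, the "human prior" is a hand-picked starting guess (PRIOR), the data are made up, and all names are hypothetical. It compares training from the prior with training from scratch on the same small dataset and the same small training budget.

```python
def loss(weights, data):
    """Mean squared error of the linear model w * x + b on the data."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(weights, data, steps=20, lr=0.02):
    """A few steps of full-batch gradient descent starting from `weights`."""
    w, b = weights
    n = len(data)
    for _ in range(steps):
        # Gradient of half the mean squared error with respect to w and b.
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Sparse, made-up data from y = 3x + 1 (an assumption for illustration).
data = [(float(x), 3.0 * x + 1.0) for x in range(5)]

PRIOR = (2.5, 1.0)  # hypothetical human-encoded guess: roughly right, not exact

from_prior = train(PRIOR, data)
from_scratch = train((0.0, 0.0), data)

print("loss of the raw prior:      ", round(loss(PRIOR, data), 4))
print("loss after training (prior):", round(loss(from_prior, data), 4))
print("loss after training (zero): ", round(loss(from_scratch, data), 4))
```

In this toy setup the prior is only a starting point: training moves the weights away from it and improves on it, and with the same small budget the model that started from the prior ends up with lower error than the one that started from nothing.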
Pretty cool, right? So, what’s next? We’re a long way from this final vision. However, progress is being made every day. GPT3 might be the most advanced single neural net. But, lots of other commercial applications are being built with heavy investment. For instance, Tesla has publicly shared the architecture of its self-driving project. They’re using lots of separate models, developed by different teams, which combine to power a very complex real-world application.
Still, it’s early days for machine learning and AI. To get to this vision, we need a few things. First, maturity in neural network architectures so they can be abstracted away from the average engineer. Second, new “programming” languages and tooling to make encoding priors as easy as writing English. Third, new techniques to train and learn very large models across multiple separate neural networks and sub-models. I think this will require moving past the feed-forward differentiable architecture of current networks. The reason the brain is so powerful is that it operates in 3D, not just in one direction.
So, AI won’t progress simply by increasing scale, or by one genius new architecture. Rather, the field will progress slowly as these fundamental problems are solved. Single neural network models produced by a small group of exceptional researchers will give way to complex models across much bigger domains, built by large teams collaborating on interoperable systems. But, this transformation won’t happen on its own. Over the coming decades, we’ll need an army of bright, committed researchers and innovators to push progress forward one breakthrough at a time. All in the pursuit of building something that will surprise, scare, delight, and, above all, change us.