Is Bigger Always Better for Large Language Models?
While increasing the size of LLMs has led to impressive gains in capabilities such as complex reasoning and factual accuracy, it now appears that this approach is hitting its limits.
A recent article shows that while pretraining loss in LLMs follows a predictable power law as a function of compute, model parameters, and data, downstream capabilities, such as the ability to perform complex reasoning, are not predictable. On hard tasks, especially math reasoning, simply scaling up model parameters may not lead to significant improvements; models may remain unable to solve the most challenging problems even with substantial increases in size.
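For context on just how regular the pretraining side is, a widely used fit is the Chinchilla-style scaling law from Hoffmann et al. (2022); I show the form only to contrast it with the lack of any comparably clean formula for downstream accuracy:

$$ L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} $$

Here $L$ is the pretraining loss, $N$ the number of parameters, $D$ the number of training tokens, and $E$, $A$, $B$, $\alpha$, $\beta$ are empirically fitted constants. Loss keeps falling smoothly as $N$ and $D$ grow; benchmark accuracy on hard reasoning tasks does not have to.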
This sounds a little concerning, given that large corporations like X are spending billions of dollars hoarding NVIDIA GPUs to build massive data centers, on the assumption that scaling compute will keep delivering higher performance.
So what can we do about it?
A workaround that has recently emerged is to scale test-time compute, meaning the model is allowed to “think” longer and self-correct before giving an answer. This has led to real gains in performance (OpenAI’s o1 model, for example) over previous models, but it is a temporary patch over a more fundamental problem.
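One simple flavor of test-time compute scaling is self-consistency: sample several reasoning chains and take a majority vote over the final answers. The sketch below is only illustrative; `generate_answer` is a hypothetical placeholder for whatever LLM call you use, and systems like o1 do something far more sophisticated than a plain vote.

```python
# Hedged sketch of one form of test-time compute scaling: self-consistency voting.
# `generate_answer` is a hypothetical placeholder for an LLM call that samples one
# reasoning chain and returns a final answer string.
from collections import Counter

def generate_answer(question: str, temperature: float = 0.8) -> str:
    """Placeholder: sample one chain of thought from an LLM and return its final answer."""
    raise NotImplementedError("wire this up to your LLM of choice")

def self_consistency(question: str, n_samples: int = 16) -> str:
    """Spend more inference-time compute by sampling many answers and majority-voting."""
    answers = [generate_answer(question) for _ in range(n_samples)]
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```

The point is simply that accuracy is bought with more inference compute per question, not with a bigger model.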
LLMs, by their very nature, are designed to predict the next token given the preceding tokens using a self-attention mechanism. For example, given the sentence “I will take my dog for a…”, the LLM will predict that the most likely next word is “walk,” based on patterns in its training data.
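To make that concrete, here is a minimal sketch using the Hugging Face transformers library with GPT-2; the model and prompt are just for illustration, and any causal LLM works the same way.

```python
# Minimal sketch of next-token prediction with a causal LLM (GPT-2 here,
# purely for illustration -- any autoregressive model behaves the same way).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "I will take my dog for a"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# The distribution over the next token comes from the last position.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(token_id.item()):>10}  {prob.item():.3f}")
# A well-trained model puts high probability on " walk" here.
```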
Simple enough. Impressively, though, from this very basic rule, LLMs somehow learned real-world concepts encoded in the structure of natural language. They picked up human thinking patterns from language alone, which is incredible. To this day, we still don’t fully understand how this emergent capability developed.
Realistically, though, it is unlikely that LLMs can reason independently. They simply match patterns found in the structure of human language. And this is where the problem lies (see this paper from Apple). With the current neural network architecture, which is based solely on pattern recognition and a limited amount of internet data (think of it as a database for mapping patterns), it is unlikely that we will reach the AGI dream.
A new architecture that genuinely understands concepts from logic, math, and physics needs to be developed. For example, Wolfram Alpha is pairing symbolic computation with LLMs to make their outputs more deterministic. This is a good step forward, but we most likely need to embed symbolic computation within the model architecture itself.
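As a rough sketch of the pairing idea (not Wolfram’s actual integration), the LLM can translate a word problem into a symbolic expression and hand the actual computation to a library like SymPy, which solves it deterministically; `llm_translate` below is a hypothetical placeholder.

```python
# Hedged sketch of pairing an LLM with symbolic computation (illustrative only,
# not Wolfram's actual integration). The LLM translates natural language into an
# equation string; SymPy then solves it exactly, so the arithmetic cannot hallucinate.
import sympy as sp

def llm_translate(problem: str) -> str:
    """Hypothetical placeholder: ask an LLM to emit an equation string such as 'Eq(2*x + 3, 11)'."""
    raise NotImplementedError("wire this up to your LLM of choice")

def solve_with_symbolic_backend(problem: str):
    x = sp.Symbol("x")
    equation = sp.sympify(llm_translate(problem), locals={"x": x, "Eq": sp.Eq})
    return sp.solve(equation, x)  # exact, deterministic result

# The symbolic half on its own:
# sp.solve(sp.Eq(2 * sp.Symbol("x") + 3, 11))  ->  [4]
```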
Meanwhile, as we work toward a breakthrough in symbolic AI, we should focus on using the current LLMs to extract maximum value and increase efficiency and productivity. Inefficiencies are everywhere, from customer support to airlines to drug discovery and beyond. The opportunities right now are tremendous for builders to take advantage of the current LLM capabilities and help solve some of the most pressing issues.