It has been more than a year since the release of ChatGPT and the ensuing repositioning of the industry. I jotted down some quick notes at the time, and they seem not to have aged all that poorly. Yet, it is time to check the pulse as the year ends.
These days, everyone is a Nostradamus when it comes to AI, and I am going to try to avoid direct predictions. The future is always uncertain, yet some clusters of opportunity are still visible.
To refresh the basics, the main breakthroughs so far have come from:
Compute (training/inference) and the ability to store and handle large amounts of data
Development of Word2Vec and subsequent advances in efficiently learning word embeddings
Transformer architecture that dispensed with the sequential processing of RNN-type architectures, enabling the practical training of large models
The main unknown factor was when exactly the compute and the data would lead to breakthroughs. Just like with CNNs for vision, there turned out to be a threshold past which things just started to work.
Again, it's important to remember that all an LLM does is predict the next token. So, how is it exactly useful?
Well - isn't that what people do? That's why hallucination is not a bug, but the key feature. One could liken the LLM temperature setting (a hyperparameter used to tweak the probability distribution for selecting the next word) to the number of drinks at a party.
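Mechanically, the knob is simple: the logits are divided by the temperature before the softmax, so higher values flatten the distribution and make unlikely next tokens more probable. A minimal illustrative sketch (generic Python, not any particular model's internals):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Sample a token id from raw logits, scaled by temperature.

    Temperature near 0 approaches greedy (argmax) decoding;
    temperature above 1 flattens the distribution - more drinks at the party.
    """
    scaled = logits / max(temperature, 1e-6)   # guard against division by zero
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```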
It has also been fascinating to observe the evolution of open-source LLMs and the various claims of models reaching GPT-3.x quality (interestingly, no one so far has reached GPT-4 quality even according to benchmarks - and one should be very skeptical of benchmarks at this stage). Why is that?
Well - let's look deeper at what influences LLM quality:
1. The dataset to train the base model: if your base dataset contains clean, large, high-quality data - such as well-written books, legal libraries, scientific articles, as opposed to data scraped from the web - your next-word predictions will be more aligned with what Seneca would have said versus a modern-day guru on Twitter/X.
2. The amount and quality of effort put into instructing the model - that is, old-fashioned continuous human-assisted feedback.
3. The ability to train a large enough model, appropriately matched (in terms of the number of parameters) to the size of the dataset.
The bottleneck is actually not (3), but (1) and (2). Training the model is mostly just money and fixed time (not to minimize the effort that goes into optimizing and managing training pipelines, but it is still reasonably straightforward), while (1) and (2) require a lot of custom infrastructure (both compute and human), sufficient user feedback (how many users are using the platform), and a lot of (indeterminable) time, and are quite error-prone. The outcome is highly sensitive to the quality of the data and feedback. Finally, they inevitably raise a number of legal and data-ownership issues. Most importantly, these advantages (or lack thereof) will compound over time, making it largely a winner-take-all game.
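As a rough illustration of point (3), the Chinchilla-style heuristic says a compute-optimal model wants on the order of ~20 training tokens per parameter; the constant is approximate and debated, but the back-of-the-envelope arithmetic looks like this:

```python
def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Very rough compute-optimal token budget for a given model size.

    The ~20 tokens-per-parameter ratio is the approximate Chinchilla heuristic;
    real training runs deviate from it for all sorts of practical reasons.
    """
    return n_params * tokens_per_param

# e.g. a 70B-parameter model "wants" roughly 1.4 trillion training tokens
print(f"{compute_optimal_tokens(70e9):.2e}")
```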
So - how probable is it that open-source LLMs will catch up? Not very. That’s not to say that there’s no room for smaller task-oriented models (e.g., for classification tasks) - but it is hard to see, short of a seismic shift in how data is shared and infrastructure is made available, how open-source/small players will be practically viable in a general-purpose sense.
I should also mention the dangers of overdoing the fine-tuning step. You can see this in the constant flux of quality in well-known public models as they keep being fine-tuned for safety, etc. Fine-tune long enough - and you will get mostly canned answers (a phenomenon known as catastrophic forgetting). Fine-tuning in general, without access to the original data and training artifacts, is tricky (for obvious basic reasons) - that’s why releasing “open-source” model weights without releasing the original dataset is hardly “open”.
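One way to keep the over-fine-tuning problem visible is to keep evaluating the model on a held-out slice of its original, broad distribution while fine-tuning on the narrow one. A schematic sketch - `evaluate` and `train_step` are hypothetical placeholders, not any particular library's API:

```python
def finetune_with_forgetting_check(model, finetune_batches, general_eval_set,
                                   evaluate, train_step, max_drop=0.02):
    """Fine-tune, but stop early if quality on the original distribution degrades.

    `evaluate` and `train_step` stand in for whatever stack is actually used;
    the point is the control flow, not the API.
    """
    baseline = evaluate(model, general_eval_set)
    for step, batch in enumerate(finetune_batches):
        train_step(model, batch)
        if step % 100 == 0:
            score = evaluate(model, general_eval_set)
            if baseline - score > max_drop:   # drifting into canned-answer territory
                print(f"stopping at step {step}: general quality dropped {baseline - score:.3f}")
                break
    return model
```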
Now - what about the AGI that people are so concerned about? Here, you must agree with Jedi Master Yann LeCun that auto-regressive generation (LLMs) is “an exponentially-divergent diffusion process, hence not controllable”. Ultimately, a new architecture is needed, combined with a practical way to efficiently learn and optimize against a world model, capable of hierarchical planning. So far, this is nowhere in sight.
In the meantime, we have to do the planning part by hand, via software (even if written with assistance from LLMs), where LLMs become building blocks in the operating system.
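In practice that looks like chaining model calls inside ordinary code, with the control flow - the planning - written by hand. A toy sketch, where `llm()` is a hypothetical stand-in for any chat-completion call, not a specific SDK:

```python
def llm(prompt: str) -> str:
    """Placeholder for a real completion call (hosted API, local model, etc.)."""
    raise NotImplementedError

def answer_with_review(question: str) -> str:
    # Step 1: draft an answer.
    draft = llm(f"Answer concisely:\n{question}")
    # Step 2: have the model critique its own draft.
    critique = llm(f"List factual or logical problems with this answer:\n{draft}")
    # Step 3: revise using the critique - the 'planning' lives in this hand-written flow.
    return llm(f"Question: {question}\nDraft: {draft}\nIssues: {critique}\n"
               "Rewrite the draft, fixing the issues.")
```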
What does this mean then for the ecosystem and for us as humans? Here are some possible outcomes:
Jobs
While everyone is talking about many jobs that will become unnecessary for humans to perform, a more important question is: what kind of jobs will be created instead?
It is already clear that “clean high-quality” data is the new oil. Human expertise that can be put to use to improve these data (in a broad sense, including expert fine-tuning) is necessarily going to be valuable. Hence, it is not a stretch to imagine people will be paid to instruct the models.
For example, a legal brief can be presented to several top lawyers for a review, or a code PR can be presented to top coders for review and be incorporated into the model. Tools and services that enable that will therefore be new machine tools for humans. Of course - AI will determine how much those humans will be paid, possibly in AI usage credits 🙂
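The plumbing behind that is mundane but important: each expert review has to land as a structured record that can later be folded into an instruction-tuning or preference dataset. A minimal sketch of what such a record might look like - the field names are illustrative, not any standard:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ExpertReview:
    """One unit of expert feedback, ready to be folded into a tuning dataset."""
    task_id: str        # e.g. the brief or PR under review
    model_output: str   # what the model produced
    expert_output: str  # the expert's corrected/preferred version
    rating: int         # coarse quality score, e.g. 1-5
    reviewer_id: str    # who gets paid (perhaps in AI usage credits)

review = ExpertReview("brief-0042", "draft text...", "corrected text...", 4, "lawyer-17")
print(json.dumps(asdict(review)))  # append to the feedback corpus
```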
Inept regulation could become a monkey wrench for the ecosystem and its progress - yet, done reasonably well, it could take society to the next level.
Model Wars
It is hard to see open-source or second-tier LLMs becoming practical due to data and effort limitations, and it is not likely that hardware advancements will change that equation much (given the constant need to improve quality/performance/cost, where there is still a long way to go).
Furthermore - if, shall we say, a major player has a large, good-quality LLM with all the datasets and infrastructure - it is easy to downsize it and produce special-purpose small models almost at will.
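The standard mechanism for that downsizing is knowledge distillation: the small model is trained to match the large model's output distribution rather than just the hard labels. A schematic PyTorch-style loss, assuming teacher and student logits over the same vocabulary - illustrative, not a specific production recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between softened teacher and student distributions.

    Softening with a temperature exposes the teacher's relative preferences
    among "wrong" answers, which is much of what the small model learns from.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```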
The overlooked point here is that when one has the infrastructure to collect and manage data and feedback, that affords an interesting advantage at scale. Companies like Scale.ai could be extremely well-positioned here in the long run.

The Software
Until radical new architectures come into play, a combination of LLM pipelines will need to be used to construct applications. Runtime performance will be increasingly important, as LLMs continue to be exceedingly resource-hungry and multiple interactions are required to produce a good-quality outcome. What are the components of such frameworks?
- Inference pipeline execution/orchestration - both in streaming and batch fashion, focused on controlling response times, failure management, costs, and routing between different task-specific models (a minimal sketch follows this list).
- The routing logic needs to be a lot more sophisticated, controllable, and self-programmable.
- The usual conveniences for prompt management, context, history, tools, and RAG.
- The multi-modal (image/video) side of the world will increasingly need to be incorporated.
- User feedback response and dataset management for fine-tuning.
- System performance evaluation/testing in a continuous fashion, especially when more advanced fine-tuning tricks are used (such as LwF and EWC).
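To make the first two bullets concrete, here is a minimal sketch of cost-aware routing between task-specific models with a fallback to a general-purpose one. The model names, cost figures, and classifier are made up; a real implementation would also track response-time budgets and failure rates:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ModelEndpoint:
    name: str
    cost_per_1k_tokens: float      # illustrative numbers only
    call: Callable[[str], str]     # the actual inference call

def route(prompt: str,
          classify: Callable[[str], str],
          endpoints: Dict[str, ModelEndpoint],
          fallback: ModelEndpoint) -> str:
    """Send the prompt to the cheapest model believed capable of the task,
    falling back to the big general-purpose model on failure."""
    task = classify(prompt)               # e.g. 'extraction', 'code', 'open_ended'
    endpoint = endpoints.get(task, fallback)
    try:
        return endpoint.call(prompt)
    except Exception:                     # timeouts, refusals, malformed output...
        return fallback.call(prompt)
```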
The bigger elephant in the room is Python. While super easy to use and convenient, it is arguably a JavaScript/Ruby/Perl-class language in terms of its (non-)safety, extreme error-proneness (which compounds exponentially with larger teams), and (lack of) performance. So - either a JIT-enabled runtime plus a syntactic layer (akin to TypeScript) will need to be invented, or a lot more multi-language frameworks will need to go mainstream.
One interesting evolution to observe will be how much of self-writing code capability will be incorporated natively into the frameworks (and their evolution) themselves to produce a truly LLM-first system.
Let’s see how this ages.
Happy New Year,
Ruslan