In this article, I want to clearly outline the difference between “traditional” machine learning (ML) and Generative AI. It is targeted towards a non-technical audience. If you’ve ever wondered how the two approaches differ, and why the newer one suddenly seems capable of so much more, I want to give simple, intuitive answers to those questions here.
Reading time: 6 minutes
Machine learning is broadly split into supervised learning, unsupervised learning and reinforcement learning. Traditional ML approaches within supervised and unsupervised learning were promising, but previously faced certain bottlenecks.

Supervised learning

Supervised learning is a method where a computer learns to perform a task by studying many labeled examples provided by a teacher, allowing it to then accurately handle new, unseen instances of the task on its own. For example, if you are training a model to identify the difference between trucks and cars, you show it thousands of images of trucks and cars, and tell the model for each image whether it is a truck or a car. It learns to recognize the differences between the two, and can now make guesses about images it hasn’t seen before to identify trucks and cars.
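To make this concrete, here is a toy sketch of supervised learning in Python using the scikit-learn library. The two made-up features (vehicle length and height, in meters) and the numbers are purely illustrative stand-ins for real image data, not how a production image classifier would actually work.

```python
# A toy supervised-learning sketch with scikit-learn.
# Two made-up numeric features stand in for each vehicle "image".
from sklearn.linear_model import LogisticRegression

# Labeled examples: the "teacher" tells the model what each one is.
features = [
    [4.2, 1.4],  # car
    [4.5, 1.5],  # car
    [4.3, 1.4],  # car
    [7.5, 3.0],  # truck
    [8.1, 3.2],  # truck
    [7.9, 3.1],  # truck
]
labels = ["car", "car", "car", "truck", "truck", "truck"]

model = LogisticRegression()
model.fit(features, labels)  # learn from the labeled examples

# The trained model can now guess labels for vehicles it has never seen.
print(model.predict([[4.4, 1.5], [7.7, 3.0]]))  # -> ['car' 'truck']
```

The key point is that every training example came with a human-provided label.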
This approach has led to significant advancements in object detection (like facial recognition on Snapchat) and image classification (the foundation of Google Image Search). However, there are two major bottlenecks with supervised learning:

1. It needs large amounts of human-labeled data, which is slow and expensive to create.
2. Each model can only do the one specific task it was trained for; a new task means collecting new labels and training a new model.
Unsupervised learning

Unsupervised learning is a method where a computer learns to find hidden patterns or structures in a dataset on its own, without being given any labeled examples or specific guidance on what to look for. It can be used to automatically group similar news articles together based on their content, without needing humans to pre-categorize any articles, allowing readers to easily discover related stories and topics.
We can then look at examples of datapoints in each cluster to identify what categories the model has divided the data into. If done correctly, we could identify the topic of each cluster and give it a sensible name, all without labelling anything! The AI just finds and marks out patterns, and we need to make sense of those patterns to see if there are any interesting insights that come out of them.
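For illustration, here is a minimal clustering sketch in Python with scikit-learn. The four snippets, the choice of two clusters, and the TF-IDF representation are all assumptions made for this example; the algorithm only returns group numbers, and naming the groups is still up to us.

```python
# A toy unsupervised-learning sketch: grouping short "news" snippets
# without any labels. The snippets and cluster count are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "Stock markets rallied after the central bank cut interest rates.",
    "The striker scored twice as the home team won the championship.",
    "Investors are watching inflation data ahead of earnings season.",
    "The goalkeeper's late save kept the title race alive.",
]

# Turn each article into a vector of word statistics, then cluster.
vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# The model only outputs group numbers; a human still has to inspect
# each group and name it (e.g. "finance" vs "sports").
for article, cluster in zip(articles, cluster_ids):
    print(cluster, article)
```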
Two of the biggest problems with traditional unsupervised learning approaches were that the patterns they find still need humans to interpret them, and that there is no guarantee those patterns line up with anything useful for the task at hand.
Language models

Developing a model like ChatGPT involves three main phases: pretraining, supervised finetuning, and RLHF.

Pre-training for completion

The first phase is pre-training, where a large language model (LLM) is trained on a massive amount of text data scraped from the internet to learn statistical information about language, i.e. how likely something (e.g. a word, a character) is to appear in a given context. If you're fluent in a language, you have this statistical knowledge too. For example, if you’re fluent in English, you instinctively know that "green" fits better than "car" in the sentence "My favorite color is __."

A good language model should also be able to fill in that blank. You can think of a language model as a “completion machine”: given a text (prompt), it can generate a response to complete that text. As simple as it sounds, completion turned out to be incredibly powerful, as many tasks can be framed as completion tasks: translation, summarization, writing code, doing math, etc. For example, given the prompt “How are you in French is ...”, a language model might be able to complete it with “Comment ça va”, effectively translating from one language to another.

This completion approach also gave us the ability to train models on HUGE amounts of data from across the internet without needing human labelers. Here’s the approach: take any piece of text from the internet, hide the next word, and have the model guess it. The original text already contains the right answer, so no human has to label anything.
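As a rough illustration of a “completion machine”, the sketch below uses the Hugging Face transformers library with the small, openly available GPT-2 model, which was trained purely on next-word prediction. It is far weaker than ChatGPT, but it completes prompts in exactly this fill-in-the-blank fashion.

```python
# A toy "completion machine": GPT-2 continues whatever text you give it.
# Requires the `transformers` package (pip install transformers).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "My favorite color is"
# Greedy decoding (do_sample=False) just picks the most likely next words.
result = generator(prompt, max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```

Nothing in this setup needs a human labeler: the next word in any sentence on the internet is its own “correct answer”.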
Since we don’t need to create data for each example, and can instead leverage the trillions of words that we have on the internet, this was a massive step up in the amount of data we could train the model on.

Supervised Finetuning

We’ve taught the model about completing language through pretraining. However, when you ask the pretrained model a question like "What are the best tourist attractions in Paris", any of the following could be correct completions:
1. for a family of four?
2. ? Where should I stay?
3. The Eiffel Tower, the Champs-Élysées, and the Arc de Triomphe are all popular destinations.
When using a model like ChatGPT, we are probably looking for an answer like option 3. Supervised finetuning allows us to show the model examples of questions and ideal answers (known as demonstration data), so the model mimics the behavior that we want. OpenAI calls this approach “behavior cloning”.

Demonstration data for behavior cloning is generated by highly educated labelers who pass a screening test. Among those who labeled demonstration data for InstructGPT, ~90% have at least a college degree and more than one-third have a master’s degree. These people are called “AI tutors” in industry, and they need subject-specific expertise: they are the ceiling on what your model can be. If you want to make your model better at code generation, you need great software engineers. Improving summarization capabilities? You could benefit from hiring specialized newspaper editors.

Reinforcement Learning from Human Feedback (RLHF)

The third phase is RLHF, which consists of two parts. First, a reward model is trained to act as a judge that measures the quality of a response given a prompt. This is done by having human labelers compare different responses and decide which one is better. Once trained, the reward model simulates human preferences: if it likes a response, a human user probably would too, so that response gets rewarded. The reward model is then used to further improve the finetuned model using reinforcement learning techniques, such as Proximal Policy Optimization (PPO).
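To give a feel for the data involved, here is a small sketch of the two kinds of human-generated data described above, plus the core idea behind the reward model's training objective. The example texts and reward numbers are made up, and a real reward model would compute the scores itself; this only shows the shape of the pairwise comparison loss (score the preferred response above the rejected one).

```python
import torch
import torch.nn.functional as F

# Illustrative shapes of the two kinds of human-generated data
# (the texts themselves are made up):
demonstration_example = {   # used for supervised finetuning
    "prompt": "What are the best tourist attractions in Paris?",
    "ideal_response": "The Eiffel Tower, the Champs-Élysées, and the Arc de Triomphe ...",
}
comparison_example = {      # used to train the reward model
    "prompt": "What are the best tourist attractions in Paris?",
    "preferred": "The Eiffel Tower, the Champs-Élysées, and the Arc de Triomphe ...",
    "rejected": "for a family of four?",
}

# Core of the reward-model objective: the preferred response should be
# scored higher than the rejected one. These scores are placeholder
# numbers; a real reward model would produce them from the text.
reward_preferred = torch.tensor(1.8)
reward_rejected = torch.tensor(-0.4)
loss = -F.logsigmoid(reward_preferred - reward_rejected)
print(loss)  # smaller when the preferred response is scored clearly higher
```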
Limitations

While RLHF has been shown to improve the overall performance of language models, it is not without its limitations. One major issue is hallucination, where the model generates convincing but false or misleading information. Some researchers believe that hallucination occurs because the model lacks an understanding of the cause and effect of its actions, while others attribute it to the mismatch between the model's internal knowledge and the human labelers' knowledge. Efforts are being made to address this issue, such as asking the model to cite the sources of its answers or penalizing it more severely for making things up.

DPO

Another approach to improving language models is Direct Preference Optimization (DPO), which aims to directly optimize the model's output to align with human preferences. Unlike RLHF, which relies on a separate reward model to guide the optimization process, DPO uses human preference data to update the language model's parameters directly.

Imagine you're teaching a student to write an essay. With RLHF, you'd train a separate "grading model" that scores the student's essays, and the student would try to optimize their writing to get a better grade from that model. With DPO, you'd give feedback on the student's writing directly, and they would update their writing style based on your suggestions. According to the original DPO paper, this direct approach can lead to more efficient and effective optimization, as the model learns directly from human preferences rather than relying on a potentially imperfect reward model. However, DPO still requires large amounts of human preference data, and its results depend heavily on the quality of that data.

Further improvements

Techniques such as few-shot prompting, chain-of-thought reasoning, and Retrieval Augmented Generation (RAG) all improve the accuracy of these models and mitigate hallucinations. In a future article, I’ll go into further detail about what these techniques involve, their strengths and weaknesses, and how you can implement them in real-world applications.

To summarize…

Language models offer immense flexibility. You can think of them as college students who can be instructed to complete a large variety of tasks. In contrast, traditional ML models are like genetically editing and growing a new person in a test tube each time you want a new task completed.

With a better understanding of how these LLMs are trained, you can hopefully find ways to improve your LLM for your application.
Please reach out to me at vedantk@stanford.edu; I love talking about this.