Imagine: a lightweight AI model with just 1.5 billion parameters outsmarting giants like GPT-4o and Claude-3.5-Sonnet in advanced math competitions. Sounds impossible?
Meet DeepSeek-R1-Distill-Qwen-1.5B, the open-source underdog rewriting the rules of AI reasoning. In this deep dive, we’ll unpack how this “David” slayed the “Goliaths” of AI.
DeepSeek-R1-Distill-Qwen-1.5B is distilled from the DeepSeek-R1 model, which I will write more about. Please subscribe below to stay tuned! Thank you for your support!
The Secret Sauce: How a Tiny Model Outperformed Giants
Knowledge Distillation — Squeezing Genius into a Smaller Package
The magic lies in knowledge distillation, a process that can be thought of as “boiling down a 671B-parameter supermodel into a concentrated 1.5B essence”. Here’s how it works:
Teacher-Student Learning: The massive DeepSeek-R1 (671B parameters) acts as a “teacher,” transferring its reasoning patterns to smaller “student” models like the 1.5B version.
Efficiency Meets Power: Despite its size, the distilled model retains the self-verification, reflection, and chain-of-thought (CoT) abilities critical for solving complex problems (the teacher-student idea is sketched in code just below).
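In DeepSeek’s recipe, distillation is refreshingly simple: the big teacher writes out full reasoning traces, and the small student is fine-tuned to reproduce them token by token. Here is a minimal sketch of one such training step, assuming a Hugging Face-style causal LM and tokenizer; the function and variable names are illustrative, not DeepSeek’s actual code.

```python
# Toy sketch of "teacher-student" distillation: the student is fine-tuned
# with plain next-token cross-entropy on a reasoning trace written by the
# teacher model. Model/optimizer setup is assumed and names are illustrative.
import torch.nn.functional as F

def distill_step(student, tokenizer, teacher_trace, optimizer):
    """One update: imitate a trace the teacher wrote for this problem."""
    ids = tokenizer(teacher_trace, return_tensors="pt").input_ids.to(student.device)
    logits = student(input_ids=ids[:, :-1]).logits    # predict each next token
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # (num_tokens, vocab)
        ids[:, 1:].reshape(-1),                       # targets shifted by one
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# A teacher_trace would look roughly like:
# "Question: ... <think> step-by-step reasoning from DeepSeek-R1 ... </think> Answer: ..."
```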
Benchmark Smackdown: By the Numbers
Let’s break down the jaw-dropping results:
This 1.5B model isn’t just outperforming rivals — it’s doubling GPT-4o’s AIME score and outperforming Claude-3.5 by a wide margin.
The Training Breakthrough: No Hand-Holding, Just Pure RL
A. Reinforcement Learning (RL) Without Training Wheels
Unlike traditional models that rely on supervised fine-tuning (SFT), DeepSeek-R1-Zero, the precursor to DeepSeek-R1 (and thus to the 1.5B distilled model), learned entirely through RL. Think of it as teaching a robot to solve puzzles by trial and error, not by memorizing answers. The key innovations:
Group Relative Policy Optimization (GRPO): A novel RL algorithm that skips the need for a separate critic model, slashing training costs while encouraging creative problem-solving.
Reward System: Combines accuracy rewards (did the answer match ground truth?) and format rewards (was the reasoning structured clearly?) to guide learning. Both ideas are sketched in code below.
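To make the mechanics concrete, here is a toy Python sketch of the two pieces: a rule-based reward (accuracy plus a small bonus for following the `<think>`/`<answer>` template that R1-Zero is trained with) and GRPO’s group-relative advantage, where each sampled answer is scored against the mean and standard deviation of its own group instead of a learned critic. The exact reward values and helper names are illustrative, not DeepSeek’s actual reward code.

```python
import re
import statistics

def format_reward(completion: str) -> float:
    """Small bonus if the output follows the expected template:
    reasoning inside <think>...</think>, final answer inside <answer>...</answer>."""
    ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL)
    return 0.2 if ok else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the extracted final answer matches the reference exactly, else 0.0."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's core trick: normalize each reward against its own group's
    mean and standard deviation, so no separate critic model is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Toy example: four completions sampled for the same math question.
completions = [
    "<think>try x = 3, check it works</think><answer>3</answer>",
    "the answer is 7",
    "<think>factor, then solve</think><answer>3</answer>",
    "<think>guess</think><answer>5</answer>",
]
ground_truth = "3"

rewards = [accuracy_reward(c, ground_truth) + format_reward(c) for c in completions]
print(rewards)                              # [1.2, 0.0, 1.2, 0.2]
print(group_relative_advantages(rewards))   # correct, well-formatted answers stand out
```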
B. “Aha Moment”: When the AI Learned to Self-Correct
During training, researchers observed an “Aha Moment”: the model spontaneously began re-evaluating flawed solutions and testing alternative approaches, mirroring human problem-solving. This emergent behavior wasn’t programmed; it evolved purely through RL’s reward-driven exploration. Doesn’t that sound a bit like evolution?
The Future: Small Models, Big Dreams
Beyond Math: Multimodal Reasoning
While the model excels at math and code, DeepSeek’s team hints at expanding into visual reasoning: imagine an AI that solves geometry problems by “seeing” diagrams.
The Best Part: Run It on Your Laptop
Forget needing a supercomputer: the 1.5B model runs easily on consumer-grade GPUs, opening advanced reasoning to indie developers and researchers. Even the larger 32B and 70B distilled versions outperform OpenAI’s o1-mini while costing 1/50th the compute.
The model is available here.
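If you want to try it yourself, loading the released checkpoint takes only a few lines. This is a minimal sketch assuming you have `transformers` and a recent PyTorch installed; the prompt and generation settings are just examples.

```python
# Minimal local test of the 1.5B distilled model with Hugging Face transformers.
# Assumes `pip install transformers torch accelerate`; a single consumer GPU
# (or, slowly, a CPU) is enough at this size.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # fits comfortably in a few GB of VRAM
    device_map="auto",
)

# The distilled models expect a chat-style prompt; the chat template
# inserts the special tokens for us.
messages = [{"role": "user", "content": "What is 17 * 23? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```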
Thanks for reading!