Distill

Create a mathematical reasoning model through distillation.

Model distillation is a process where a large teacher model trains a smaller student model to replicate its behavior, allowing the student to achieve comparable performance with far fewer parameters. By transferring the teacher’s knowledge, the student retains much of the accuracy while being faster, cheaper, and easier to deploy.

Distilled models can be trained to specialize in particular tasks or domains such as science, coding, mathematics, and information extraction by inheriting the teacher’s expertise in those areas while remaining compact and efficient for on-prem deployment.

 


Types of Model Distillation

Model distillation can be carried out in several different ways, each offering a balance between simplicity, effectiveness, and computational cost.

The most common approaches are response distillation and feature distillation, which differ in how much of the teacher’s knowledge is exposed to the student.

1. Response Distillation

Response distillation is the simplest form of model distillation, where the student learns to mimic the outputs of the teacher model. This can happen at the token level, where the student matches the teacher’s predicted responses directly, or at the logit level, where the student learns from the teacher’s full probability distribution over tokens. Both approaches transfer knowledge from the teacher’s outputs, but the logit level provides richer guidance than hard token targets.

a. Token Response Distillation

Concept:
The student model learns directly from the final token outputs (responses) generated by the teacher.

In Token Response Distillation, also known as token-level or sequence-level distillation, the student model is trained to directly mimic the final token output of the teacher model. For a given input, the teacher model generates a response (e.g., a sequence of text). This output is then used as the target for training the student model. The student's goal is to produce an output that is as close as possible to the teacher's output. This method is straightforward to implement as it only requires the final predictions of the teacher model.

Advantages:

  • Simple to implement.
  • Requires no access to internal teacher information (e.g., logits or hidden states).
  • Works well when you only need the student to mimic end results.

Disadvantages:

  • Provides limited guidance since only the final outputs are used.
  • Less effective at capturing the teacher’s nuanced internal decision boundaries.

Complexity:

  • Easiest and fastest method.
  • Minimal additional infrastructure needed beyond standard supervised learning.
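
As a rough illustration of the idea above, token-level training reduces to ordinary cross-entropy against whatever tokens the teacher emitted. The sketch below uses plain Python and a toy 3-token vocabulary; the function names and the numbers are illustrative assumptions, not code from distill.py.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_target_loss(student_logits, teacher_token_ids):
    """Mean cross-entropy of the student against the teacher's chosen tokens.

    student_logits: one list of vocabulary scores per output position.
    teacher_token_ids: the token the teacher actually emitted at each position.
    """
    losses = []
    for logits, target in zip(student_logits, teacher_token_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy example (assumed values): 3-token vocabulary, 2 output positions.
student_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
loss_agree = hard_target_loss(student_logits, [0, 2])     # student already favors these tokens
loss_disagree = hard_target_loss(student_logits, [1, 0])  # student favors other tokens
```

The loss is lower when the student already prefers the teacher's tokens, which is the only signal this method provides: nothing about the teacher's second or third choices reaches the student.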

b. Logit Response Distillation

Concept:
The student is trained on the teacher’s probability distribution (soft logits) over possible outputs rather than just the hard labels.

Logit-level distillation involves training the student model to match the logits of the teacher model. Logits are the raw, unnormalized prediction scores that a model produces before the very final activation function (like softmax) is applied. These logits contain richer information than the final hard predictions, as they reveal the teacher model's confidence for each possible output. By training on these soft targets, the student model can learn a more nuanced understanding of the data and often generalize better than if it were trained solely on the final, hard target labels.

Temperature is often used to soften the probability distribution of the teacher's logits, further enhancing the information transferred to the student.

Advantages:

  • Encodes richer information about uncertainty and relationships between classes.
  • Helps the student generalize better than token response distillation.
  • Higher accuracy with fewer parameters.

Disadvantages:

  • Requires access to teacher logits during training, which can be memory and compute intensive.
  • More sensitive to hyperparameters such as temperature.

Complexity:

  • Moderate complexity.
  • Needs additional training pipelines to capture and use logits.
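
To make the role of temperature concrete, here is a minimal sketch of a soft-target loss for a single output position. It assumes the common formulation (KL divergence between temperature-softened distributions, scaled by the squared temperature); the function names and toy logits are illustrative and are not taken from this repository.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature; higher values flatten the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Toy logits (assumed values) over a 3-token vocabulary.
teacher_logits = [4.0, 2.0, 0.5]
student_logits = [1.0, 1.5, 0.2]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

loss_same = soft_target_loss(teacher_logits, teacher_logits)
loss_diff = soft_target_loss(student_logits, teacher_logits)
```

Raising the temperature lowers the probability of the top token and raises the others, so the student sees more of the teacher's relative preferences. The loss is zero only when the two softened distributions match.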

2. Feature Distillation

Concept:
The student is trained to match internal hidden states or representations of the teacher at various layers.

Feature distillation goes a step further by training the student model to mimic the internal representations of the teacher model at one or more of its hidden layers. The outputs of these intermediate layers, often referred to as embeddings or feature maps, capture abstract representations of the input data. By aligning the student's hidden layer activations with the teacher's, the student can learn the teacher's feature extraction process. This method allows for a more fine-grained transfer of knowledge, as the student learns not just what the teacher predicts, but also how it arrives at that prediction by understanding the intermediate data transformations.

Advantages:

  • Transfers internal structural and representational knowledge, not just outputs.
  • Can yield the closest performance to the teacher.
  • Useful when student and teacher architectures are similar.

Disadvantages:

  • Requires significant access to teacher internals.
  • More computationally expensive due to alignment across multiple layers.
  • Risk of over-constraining the student if architectures differ.

Complexity:

  • Highest complexity.
  • Requires careful mapping of teacher layers to student layers.
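
The layer mapping and alignment loss can be sketched as follows. This is one possible scheme under stated assumptions: layers are paired by even spacing, the teacher has at least as many layers as the student, and a linear projection bridges different hidden widths. The helper names, layer counts, and vectors are all hypothetical.

```python
def map_layers(num_student_layers, num_teacher_layers):
    """Pair each student layer with an evenly spaced teacher layer (1-indexed).

    Assumes num_teacher_layers >= num_student_layers.
    """
    return {
        s: round(s * num_teacher_layers / num_student_layers)
        for s in range(1, num_student_layers + 1)
    }

def project(vector, matrix):
    """Linear map from student width to teacher width (matrix is teacher_dim x student_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def feature_loss(student_hidden, teacher_hidden, projection):
    """Mean squared error between the projected student state and the teacher state."""
    projected = project(student_hidden, projection)
    return sum((p - t) ** 2 for p, t in zip(projected, teacher_hidden)) / len(teacher_hidden)

# Toy setup (assumed sizes): 4 student layers, 12 teacher layers.
layer_map = map_layers(4, 12)

# Student width 2, teacher width 3; in practice the projection would be learned.
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_hidden = [1.0, 2.0]

loss_match = feature_loss(student_hidden, [1.0, 2.0, 3.0], projection)
loss_off = feature_loss(student_hidden, [1.0, 2.0, 5.0], projection)
```

In a real setup this loss would be summed over every mapped layer pair and every token position, which is where the extra compute and memory cost comes from.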

Let's Have Qwen3 Teach Gemma3 to do Math!!

Here, we demonstrate simple Token-level Response Distillation.

In distill.py, we load the 4B-parameter Qwen/Qwen3-4B-Thinking-2507 as the teacher model because it outputs properly formatted Chain-of-Thought (CoT) thinking traces before answering.

We use Google's 1B-parameter Gemma3-1b instruct model as the smaller student. It doesn't emit CoT by default and struggles with tasks like math.

For speed and simplicity, we use the Unsloth.ai Python library to perform causal supervised fine-tuning (SFT) with the SFTTrainer trainer, teaching the student Gemma3:1b model how to reason about math.

Method

  • We use token-level response distillation
  • Load the teacher model (Qwen3)
  • Load the student model (Gemma3)
  • Load the training data (GSM8K)
  • Use Quantized Low-Rank Adaptation (QLoRA) applied to the student (Gemma3)
  • The teacher model reads the questions from GSM8K and produces a full response for each
  • This creates an intermediate training set:
{
  "question": "Mimi picked up 2 dozen seashells...",
  "qwen3_answer": "<think>First Mimi has 24...</think> Therefore, answer is 16."
}
  • The student is then presented with properly formatted training data from the teacher:
messages = [
    {"role": "user", "content": "Question: Mimi picked up 2 dozen seashells..."},
    {"role": "assistant", "content": "<think>First Mimi has 24...</think> Therefore, answer is 16."},
]
  • We trained for 4 epochs and saved QLoRA checkpoints at the end of each epoch
  • With the current batch size and models, this required at least 48 GB of GPU memory. I used two NVIDIA RTX 4090 GPUs.
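
The conversion from a teacher record to the chat-style training example shown in the list above can be sketched like this. The field names come from the sample record; the helper functions are hypothetical and may differ from what distill.py actually does. The trace check in particular is an assumed extra filtering step, not one described in this README.

```python
def to_chat_example(record):
    """Turn one teacher-generated record into a user/assistant message pair."""
    return [
        {"role": "user", "content": f"Question: {record['question']}"},
        {"role": "assistant", "content": record["qwen3_answer"]},
    ]

def has_complete_trace(answer):
    """Assumed sanity check: the answer has an opening and a later closing think tag."""
    start = answer.find("<think>")
    end = answer.find("</think>")
    return start != -1 and end > start

record = {
    "question": "Mimi picked up 2 dozen seashells...",
    "qwen3_answer": "<think>First Mimi has 24...</think> Therefore, answer is 16.",
}

messages = to_chat_example(record)
keep = has_complete_trace(record["qwen3_answer"])
```

Keeping the thinking trace inside the assistant turn is what lets plain SFT teach the student to produce its own trace before answering.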

EXAMPLE #1

Gemma3:1b output before distillation

(i.e., BEFORE learning how to think from Qwen3)

QUESTION:
Sarah has 3 boxes of crayons. Each box has 24 crayons. She gives 
10 crayons to her friend. How many crayons does Sarah have left?

MODEL'S ANSWER:
----------------------------------------------------------------------
Let C be the number of crayons in each box. Let N be the number of 
boxes Sarah has. We are given that Sarah has 3 boxes of crayons, 
so N = 3. Each box has 24 crayons, so C = 24. The total number of 
crayons is N × C = 3 × 24 = 72 crayons. Sarah gives 10 crayons to her 
friend. So, the number of crayons Sarah has left is 72 - 10 = 62. 

Final Answer: The final answer is 62.

Gemma3:1b output after QLoRA response distillation:

(i.e., AFTER learning how to think from Qwen3)

Note the apparent reflection, self-doubt, double-checking

QUESTION:
Sarah has 3 boxes of crayons. Each box has 24 crayons. She gives 
10 crayons to her friend. How many crayons does Sarah have left?

MODEL'S ANSWER:
----------------------------------------------------------------------
<think>
First, I need to find out how many crayons Sarah has left after giving 
10 crayons to her friend. She starts with 3 boxes of crayons. Each box 
has 24 crayons. So, the total number of crayons she has initially is 
3 boxes * 24 crayons/box = 72 crayons.

Then, she gives 10 crayons to her friend. So, I need to subtract 
10 from the total.

72 - 10 = 62.

Wait, let me double-check. If she has 3 boxes * 24 = 72 crayons. 
Then she gives 10, so 72 - 10 = 62. That seems right.

I don't think I made a mistake here. The problem says "she gives 
10 crayons to her friend," so it's straightforward subtraction.
</think>

Sarah starts with 3 boxes of crayons, each containing 24 crayons.
Total crayons = 3 * 24 = 72 crayons.

She gives 10 crayons to her friend.
Crayons left = 72 - 10 = 62.

The final answer is 62.

62

EXAMPLE #2

Gemma3:1b output before distillation

(i.e., BEFORE learning how to think from Qwen3)

QUESTION:
At a local café, a small smoothie costs $4. A group of friends buys 7 small smoothies. How much do they spend in total, and what is the average cost per smoothie?

MODEL'S ANSWER:
----------------------------------------------------------------------
Here's the breakdown:

  Total cost:  7 smoothies * $4/smoothie = $28
  Average cost per smoothie: $28 / 7 = $4

Answer: They spend $28 in total, and the average cost per smoothie is $4.

Gemma3:1b output after distillation

(i.e., AFTER learning how to think from Qwen3)

Note the apparent reflection, self-doubt, double-checking

QUESTION:
At a local café, a small smoothie costs $4. A group of friends buys 7 small smoothies. How much do they spend in total, and what is the average cost per smoothie?

MODEL'S ANSWER:
----------------------------------------------------------------------
<think>
First, I need to find out how much the group spends in total. The problem says a small smoothie costs $4. Since they buy 7 small smoothies, I should multiply 7 by 4.

7 times 4 is 28. So the total cost is $28.

Next, I need to find the average cost per smoothie. To do this, I divide the total cost by the number of smoothies.

28 divided by 7 is 4.

So the average cost per smoothie is $4.

Let me double-check to make sure I didn't make a mistake. If they buy 7 smoothies at $4 each, that's 7*4=28. 28 divided by 7 is 4. That seems right.

I don't think there's any trick here. The problem is straightforward: calculate the total cost and then find the average.
</think>

To determine the total cost and the average cost per smoothie, I will follow these steps:

1. **Calculate the total cost**: A small smoothie costs $4.  
   Since there are 7 small smoothies, the total cost is:  
   $7 times 4 = $28.

2. **Calculate the average cost**: Divide the total cost by the number of smoothies:  
   $28 div 7 = $4.

The average cost per smoothie is $4.

I have verified this by multiplying 7 by 4 and dividing by 7.  
$7 times 4 = $28, $28 div 7 = $4.  
The result confirms the average cost per smoothie is $4.

Final answer: $4

EXAMPLE #3: IMPOSSIBLE PROBLEM

(INCONSISTENT CONSTRAINTS)

Gemma3:1b output before distillation

(i.e., BEFORE learning how to think from Qwen3)

In the town of Maple Junction, three bus routes connect the school, the library, and the train station.

* The number of passengers on the school bus was twice the number on the library bus.
* The number of passengers on the train station bus was 15 more than the number on the library bus.
* The total number of passengers on all three buses was exactly 100.
* Together, the school bus and the train station bus carried exactly 70 passengers.
* More passengers rode the library bus than the school bus.

Question: How many passengers rode each bus?

MODEL'S ANSWER:
----------------------------------------------------------------------

Let's use variables to represent the number of passengers on each bus:

   s = Number of passengers on the school bus
   l = Number of passengers on the library bus
   t = Number of passengers on the train station bus

We can translate the given information into equations:

   s = 2l  (The school bus had twice the number of passengers as the library bus)
   t = l + 15 (The train station bus had 15 more passengers than the library bus)
   s + t = 100 (The total number of passengers was 100)
   s + l = 70 (More passengers rode the library bus than the school bus)

Now we can solve the system of equations:

1. Substitute:  Substitute the first two equations into the third equation:
   (2l) + (l + 15) = 100

2. Simplify and solve for l:
   3l + 15 = 100
   3l = 85
   l = 85/3  = 28.33 (approximately)

Since the number of passengers must be an integer, let's round l to 28.
If l = 28, then s = 2 * 28 = 56
Then, s + t = 56 + 28 = 84, which is not 100.

Let's try to adjust the numbers slightly to make the sum 100.
Let's assume the school bus has 56 passengers.
Then, the library bus has 28 passengers.
The train station bus has 15 more passengers than the library bus, so it has 28 + 15 = 43 passengers.
The total number of passengers is 56 + 28 + 43 = 127. This is not 100.

Let's try to adjust the numbers slightly to make the sum 100.
Let's assume the school bus has 56 passengers.
Then, the library bus has 28 passengers.
The train station bus has 15 more passengers than the library bus, so it has 28 + 15 = 43 passengers.
The total number of passengers is 56 + 28 + 43 = 127. This is not 100.

Let's try to adjust the numbers slightly to make the sum 100.
Let's assume the school bus has x passengers.
Then, the library bus has x - 15 passengers.
The train station bus has x + 15 passengers.
We have the equation: x + (x - 15) + (x + 15) = 100
3x = 100
x = 33.333...

Let's try x = 33.
School bus = 33
Library bus = 33 - 15 = 18
Train station bus = 33 + 15 = 48
Total = 33 + 18 + 48 = 99

Let's try x = 34.
School bus = 34
Library bus = 34 - 15 = 19
Train station bus = 34 + 15 = 49
Total = 34 + 19 + 49 = 102

So, we need to adjust the numbers slightly.

Let's assume the school bus has 56 passengers.
The library bus has 28 passengers.
The train station bus has 15 more passengers than the library bus, so it has 28 + 15 = 43 passengers.
Then, the total number of passengers is 56 + 28 + 43 = 127.
We need to add 23 to this total.

Let's say the school bus has x passengers.
The library bus has x - 15 passengers.
The train station bus has x + 15 passengers.
x + (x - 15) + (x + 15) = 100
3x = 100
x = 33.333...

We need to adjust the numbers so that the school bus has twice the number of passengers as the library bus.
Let the school bus be x.
The library bus is x - 15.
The train station bus is x + 15.
x + (x - 15) + (x + 15) = 100
3x = 100
x = 33.333...

Let's try x = 33.
School bus = 33
Library bus = 33 - 15 = 18
Train station bus = 33 + 15 = 48
33 + 18 + 48 = 99

We need to add 1 to the total.
If we increase the school bus by 1, then the school bus is 34.
The library bus is 18.
The train station bus is 49.
The total is 34 + 18 + 49 = 101.

Let's try to adjust the numbers so that the school bus has twice the number of passengers as the library bus.
Let the school bus be x.
The library bus is x - 15.
The train station bus is x + 15.
x + (x - 15) + (x + 15) = 100
3x = 100
x = 33.333...

We need to increase the school bus by 1, so the school bus is 34.
The library bus is 33.
The train station bus is 49.
The total is 34 + 33 + 49 = 116.

Let's try to increase the school bus by 1, so the school bus is 34.
The library bus is 33.
The train station bus is 49.
The total is 34 + 33 + 49 = 116.

We need to add 36 to the total.
Let the school bus be x.
The library bus is x - 15.
The train station bus is x + 15.
x + (x - 15) + (x + 15) = 100
3x = 100
x = 33.333...

We need to increase the school bus by 1, so let's say the school bus is 34.
The library bus is 34 - 15 = 19.
The train station bus is 34 + 15 = 49.
34 + 19 + 49 = 102.

Let's say the school bus is 33.
The library bus is 33 - 15 = 18.
The train station bus is 33 + 15 = 48.
33 + 18 + 48 = 99.

We need to add 1 to the total.
If we increase the school bus by 1, then the school bus is 34.
The library bus is 33.
The train station bus is 49.
The total is 34 + 33 + 49 = 116.

Final Answer: The final answer is 33

NOTE: The answer of 33 is incorrect as this problem has no solution.
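
For reference, the inconsistency can be checked directly. The short sketch below substitutes the first two constraints into the others using exact fractions; the variable names are my own.

```python
from fractions import Fraction

# Constraints 1 and 2: S = 2L and T = L + 15.

# Constraint 3: S + L + T = 100  ->  2L + L + (L + 15) = 100  ->  4L = 85
l_from_total = Fraction(100 - 15, 4)

# Constraint 4: S + T = 70  ->  2L + (L + 15) = 70  ->  3L = 55
l_from_pair = Fraction(70 - 15, 3)

# The two constraints demand different values of L, and neither is a whole number.
constraints_agree = l_from_total == l_from_pair
any_whole_number = l_from_total.denominator == 1 or l_from_pair.denominator == 1

# Constraint 5: L > S with S = 2L means L > 2L, which fails for every positive L.
library_can_exceed_school = any(L > 2 * L for L in range(1, 101))
```

Constraints 3 and 4 require L = 85/4 and L = 55/3 respectively, and constraint 5 contradicts constraint 1 on its own, so no assignment of passengers satisfies the problem.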

Gemma3:1b output after distillation

(i.e., AFTER learning how to think from Qwen3)

Note the apparent reflection, self-doubt, double-checking

This distilled version of Gemma correctly determines the problem is unsolvable!!!

In the town of Maple Junction, three bus routes connect the school, the library, and the train station.

* The number of passengers on the school bus was twice the number on the library bus.
* The number of passengers on the train station bus was 15 more than the number on the library bus.
* The total number of passengers on all three buses was exactly 100.
* Together, the school bus and the train station bus carried exactly 70 passengers.
* More passengers rode the library bus than the school bus.

Question: How many passengers rode each bus?

MODEL'S ANSWER:
----------------------------------------------------------------------
<think>Okay, let's try to figure out how many passengers rode each bus. Hmm, the problem says: "More passengers rode the library bus than the school bus." So, I need to find the number of passengers on the library bus and the number of passengers on the school bus.

First, let's list out what we know:

1. The number of passengers on the school bus was twice the number on the library bus.
2. The number of passengers on the train station bus was 15 more than the number on the library bus.
3. The total number of passengers on all three buses was exactly 100.
4. Together, the school bus and the train station bus carried exactly 70 passengers.
5. More passengers rode the library bus than the school bus.

Let me break down each part.

First, let's define variables to make it easier. Let's say:

- Let's call the number of passengers on the library bus as L.
- Let's call the number of passengers on the school bus as S.

From the problem, we know:

1. S = 2L  (The number of passengers on the school bus was twice the number on the library bus)
2. T = L + 15 (The number of passengers on the train station bus was 15 more than the number on the library bus)
3. S + T = 70 (Together, the school bus and the train station bus carried exactly 70 passengers)
4. L > S (More passengers rode the library bus than the school bus)

Let me see if I can use substitution to find L and S.

From (1), S = 2L. So, substitute that into (3):

2L + T = 70

Also, since S + T = 70, and S = 2L, then T = 70 - S = 70 - 2L.

Now, substitute S = 2L into the equation 2L + T = 70:

2L + (70 - 2L) = 70

This simplifies to 70 = 70, which is true. So, the equation is consistent.

Let me check if I can use the fact that L > S.

If L > S, then L is the number of passengers on the library bus, and S is the number of passengers on the school bus.

So, L > 2L. That doesn't make sense because 2L is 2 times L, so L > 2L is false.

Wait, no, let me think again.

The problem says "More passengers rode the library bus than the school bus." So, L > S.

Let's try to find L and S.

From (1), S = 2L.

From (3), S + T = 70, so 2L + T = 70.

From (2), T = L + 15.

Substituting T into the first equation: 2L + (L + 15) = 70

This simplifies to 3L + 15 = 70

3L = 55

L = 55 / 3 = 18.333...

So, L = 18.333...

Then, S = 2L = 36.666...

Let me check if that makes sense.

If L = 18, then S = 2L = 36.

Then, T = L + 15 = 18 + 15 = 33.

S + T = 36 + 33 = 69, which is not 70.

Oh, that's a problem. Let me re-read the problem.

"More passengers rode the library bus than the school bus." So, L > S.

"The number of passengers on the train station bus was 15 more than the number on the library bus." So, T = L + 15.

"The total number of passengers on all three buses was exactly 100." So, S + T = 70.

"Together, the school bus and the train station bus carried exactly 70 passengers." So, S + T = 70.

"More passengers rode the library bus than the school bus." So, L > S.

Hmm, the problem says "More passengers rode the library bus than the school bus." So, L > S.

Let me try to find L and S.

From (1), S = 2L.

From (3), T = L + 15.

From (2), S + T = 70, so 2L + (L + 15) = 70 => 3L + 15 = 70 => 3L = 55 => L = 55/3 = 18.333...

S = 2L = 36.666...

Let's see if that makes sense.

If L = 18, then S = 36.

T = 18 + 15 = 33.

S + T = 36 + 33 = 69, which is not 70.

Hmm, that's a problem. Let me think again.

The problem says "More passengers rode the library bus than the school bus." So, L > S.

Let's try to find L and S.

From (1), S = 2L.

From (3), T = L + 15.

From (2), S + T = 70, so 2L + (L + 15) = 70 => 3L + 15 = 70 => 3L = 55 => L = 55/3 = 18.333...

S = 2L = 36.666...

Hmm, that's still not consistent.

Let me try to model it with variables.

Let L = L (library bus)
Let S = 2L (school bus)
Let T = L + 15 (train station bus)
Let T' = T = 70 - S = 70 - (L + 15) = 55 - L (train station bus)

Then, L > S => L > 2L => 0 = L => L = 0 (not possible)

Let's try to find L and S such that L > S and S + T = 70.

S + T = 70 => S + (L + 15) = 70 => 2L + L + 15 = 70 => 3L = 55 => L = 55/3 = 18.333...

S = 2L = 36.666...

Hmm, that's still not consistent.

I think there's a mistake in the problem statement. The problem says "More passengers rode the library bus than the school bus." So, L > S.

Let me re-read the problem:

"More passengers rode the library bus than the school bus." So, L > S.

"The number of passengers on the train station bus was 15 more than the number on the library bus." So, T = L + 15.

"The total number of passengers on all three buses was exactly 100." So, S + T = 70.

"Together, the school bus and the train station bus carried exactly 70 passengers." So, S + T = 70.

"More passengers rode the library bus than the school bus." So, L > S.

Hmm, the problem says "More passengers rode the library bus than the school bus." So, L > S.

Let's try to find L and S.

From (1), S = 2L.

From (3), T = L + 15.

From (2), S + T = 70 => 2L + (L + 15) = 70 => 3L + 15 = 70 => 3L = 55 => L = 55/3 = 18.333...

S = 2L = 36.666...

Hmm, that's still not consistent.

I think there's a problem with the numbers. The problem says "More passengers rode the library bus than the school bus." So, L > S.

Let me try to find a combination that works.

Suppose L = 18. Then S = 2L = 36.

T = L + 15 = 18 + 15 = 33.

S + T = 36 + 33 = 69, which is not 70.

I think there's a mistake in the problem statement.

Note the correct conclusion with the distilled Qwen3->Gemma3 model:

I think there's a mistake in the problem statement.

About

This repository shows how to create a reasoning model through distillation.
