Understanding Knowledge Distillation: Making "Small Models" Have "Big Wisdom".

Knowledge Distillation in the Era of Big Models

In the fields of artificial intelligence (AI) and machine learning, Knowledge Distillation (KD), an important model compression technique, is receiving more and more attention in the current era of large models.

Simply put, it transfers knowledge from a complex "teacher" model to a smaller, lighter "student" model, allowing the smaller model to run on less capable devices while maintaining performance close to the higher-parameter model. This addresses the challenge of deploying complex models in resource-constrained real-world environments. Here's a brief overview of the basics of knowledge distillation:

I. Basic principles of knowledge distillation

First of all, let's be clear: in the current era of large models, taking a 70B (or even larger) model and distilling it into a small model such as an 8B or 10B one can give the small model a significant performance boost (although this does not seem to hold on the OCR side).

The core idea of knowledge distillation is knowledge transfer. Teacher models are trained on large amounts of data and are able to capture complex data patterns and features, and this knowledge is critical for student models to learn.


Before understanding distillation techniques for large models, it is important to understand some of the basic concepts of model distillation:

  1. Model definitions:
    • Teacher model: usually a large, complex deep learning model, such as a deep convolutional neural network (CNN) or a Transformer, that has been fully trained and has high accuracy and robustness.
    • Student model: a relatively small model with a simpler structure, designed to reduce computational complexity while maintaining high performance.
  2. Soft versus hard targets:
    • Hard target: the traditional training objective, such as the actual label (0 or 1) in a classification task.
    • Soft target: the probability distribution output by the teacher model, which contains the relative confidence in each category. Compared with hard targets, soft targets provide richer information, especially about the similarity between categories.
  3. Temperature scaling:
    • When generating soft targets, a temperature parameter (T) is introduced to control the smoothness of the output probability distribution. A higher temperature makes the distribution flatter and helps the student model learn finer-grained information.
  4. Loss function:
    • In knowledge distillation, the loss function usually consists of two parts:
      1. Distillation loss: measures the difference between the soft targets produced by the teacher model and the predictions of the student model, usually computed with the Kullback-Leibler divergence or cross-entropy.
      2. Student loss: the standard cross-entropy loss between the student model's predictions and the true labels.
    • The forward Kullback-Leibler divergence (often abbreviated as KL divergence) measures the difference between two probability distributions. Specifically, it assesses the relative loss of information when one distribution (usually the "approximate" distribution) is used in place of another (usually the "true" distribution).

The final total loss is a weighted sum of these two components, where the balance is controlled by a hyperparameter α. By adjusting α, we can flexibly control the influence of the distillation loss and the student loss in the total loss, and thereby optimize model performance.
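
To make this concrete, here is a minimal PyTorch sketch of the combined loss described above. The temperature T = 4 and the weight α = 0.5 are illustrative values only, not recommendations from any particular paper, and the T² scaling on the distillation term follows the common convention of keeping gradient magnitudes comparable across temperatures.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of distillation loss (soft targets) and student loss (hard labels)."""
    # Soften both output distributions with the temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # Distillation loss: KL divergence between teacher and student soft distributions.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
    # Student loss: standard cross-entropy against the hard labels.
    student = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1.0 - alpha) * student

# Toy usage: a batch of 2 samples over 3 classes.
student_logits = torch.randn(2, 3)
teacher_logits = torch.randn(2, 3)
labels = torch.tensor([0, 2])
loss = distillation_loss(student_logits, teacher_logits, labels)
```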

II. Implementation steps for knowledge distillation


The specific implementation of the classical knowledge distillation process consists of the following steps:

  1. Teacher model training:
    • Select an appropriate deep learning architecture and dataset and train the teacher model to ensure high performance on the target task. After training is complete, evaluate its performance on the validation and test sets.
  2. Generate soft targets:
    • Run the trained teacher model on the training data to generate the corresponding soft targets. These soft targets serve as the learning reference for the student model.
  3. Train the student model:
    • When training the student model, both the generated soft targets and the original hard targets are used. By minimizing the difference between the student model's output and the teacher's soft targets (usually with a KL-divergence or cross-entropy loss), the student model gradually absorbs the teacher's knowledge (see the sketch after this list).
  4. Model evaluation and optimization:
    • After training is complete, evaluate the student model on the test set. Perform hyperparameter tuning as needed so that it strikes a balance between inference speed and accuracy.
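
As a rough end-to-end illustration of steps 1-4, here is a minimal PyTorch sketch that reuses the distillation_loss function from the previous snippet. The two small MLPs and the random tensors are stand-ins for a real teacher, student, and dataset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 20)          # toy inputs
y = torch.randint(0, 5, (256,))   # toy labels over 5 classes

teacher = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 5))
student = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 5))

# Step 1: assume the teacher has already been trained; freeze it.
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
for epoch in range(10):
    # Step 2: generate soft targets with the frozen teacher.
    with torch.no_grad():
        teacher_logits = teacher(X)
    # Step 3: train the student on soft targets plus hard labels.
    student_logits = student(X)
    loss = distillation_loss(student_logits, teacher_logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Step 4: evaluate the student (here, just accuracy on the toy data).
accuracy = (student(X).argmax(dim=-1) == y).float().mean()
```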

III. Application cases of knowledge distillation

Knowledge distillation is used in a wide range of applications; here are some classic examples:

  1. Mobile devices and embedded systems:
    • On mobile devices, computational resources and battery life are limited. Through knowledge distillation, knowledge from large models can be transferred to small models, allowing small models to reason quickly and run at low power while maintaining high accuracy.
  2. Speech recognition and natural language processing:
    • In speech recognition systems, knowledge distillation can be used to simplify complex speech recognition models to improve response speed. In natural language processing tasks, distillation techniques can transfer knowledge from large language models to lighter models, enabling them to efficiently process textual tasks.
  3. Image classification and object detection:
    • In computer vision, knowledge distillation has been widely used for image classification and object detection tasks. By transferring knowledge from a teacher model to a smaller model, researchers can reduce inference time while maintaining high accuracy.

There are also some classic distillation techniques in the field of large models, which I'll walk you through briefly below:

MiniLLM: Knowledge Distillation of Large Language Models

White-box classification models: A white-box model is one that is easy to understand and explain. Like a transparent box, its inner workings are visible, and users can clearly see how the model makes decisions. For example, an open-source LLM lets everyone see what is going on inside it, and anyone can understand its structure and workflow from the open-source information.

Black-box classification models: Black-box models are the opposite: their internal structure and workings are not easily understood. We can only see the inputs and outputs, without knowing how the model processes the data. The ChatGPT family, for example, is a typical black-box model: we cannot see how it works internally, only its inputs and outputs (in practice, we take the text generated by GPT to train our own model).

Whereas previous knowledge distillation methods have mainly been applied to white-box classification models, or to training small models to imitate black-box models such as ChatGPT, this paper proposes replacing the forward Kullback-Leibler divergence (KLD) objective in standard knowledge distillation with the reverse KLD, which is better suited to knowledge distillation for generative language models, in order to prevent the student model from overestimating the low-probability regions of the teacher's distribution.

The difference between these two KL divergences can be understood as follows (see also the sketch after this list):

  1. Forward KL divergence: First, we sample a prompt from the corpus and use the teacher model to generate a response to that prompt. The student model's task is then to approximate the teacher model's conditional probability on this generated text.
  2. Reverse KL divergence: Conversely, we sample a generated response from the student model, and then let the teacher model evaluate this response according to its own preferences.
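
Numerically, the two objectives differ only in which distribution the expectation is taken under, but this changes the student's behavior. A small illustrative sketch (the distributions p and q below are made up):

```python
import torch

p = torch.tensor([0.7, 0.2, 0.1])   # teacher distribution (toy values)
q = torch.tensor([0.4, 0.4, 0.2])   # student distribution (toy values)

# Forward KL, KL(p || q): the expectation is taken under the teacher, so the
# student is penalized wherever the teacher puts mass and tends to spread out
# to "cover" all of the teacher's modes, including low-probability regions.
forward_kl = torch.sum(p * torch.log(p / q))

# Reverse KL, KL(q || p): the expectation is taken under the student, so the
# student is penalized only where it itself puts mass and tends to concentrate
# on the teacher's high-probability modes ("mode-seeking"), which MiniLLM
# argues is better suited to generative language models.
reverse_kl = torch.sum(q * torch.log(q / p))
```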

They also derive an efficient optimization method for learning this objective. The authors make three improvements:

  1. Single-step decomposition: this separates the quality of each individual generation step out of the accumulated gradient, which reduces fluctuations during training and helps the model converge faster.
  2. Teacher-mixed sampling: the reverse KL divergence mentioned above is estimated with samples from the student model. In this improvement, the authors mix the teacher and student distributions when sampling, controlled by a hyperparameter that is set to 0.2 in the paper. The authors also adjust the loss computation by adding an importance weight (see the sketch after this list).
  3. Length regularization: the authors noticed that the loss could lead the distilled model to generate shorter sequences, so a regularization term was added to the loss function to address this.
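
As a rough sketch of the teacher-mixed sampling idea in improvement 2, under my reading of the description above (the exact formulation and the importance-weighting details in the MiniLLM paper may differ), next-token sampling from a mixture could look like this, with mix playing the role of the 0.2 hyperparameter:

```python
import torch

def mixed_sample(teacher_probs, student_probs, mix=0.2):
    """Sample the next token from a mixture of the teacher and student distributions."""
    mixture = mix * teacher_probs + (1.0 - mix) * student_probs
    return torch.multinomial(mixture, num_samples=1)

teacher_probs = torch.tensor([0.6, 0.3, 0.1])  # toy next-token distributions
student_probs = torch.tensor([0.2, 0.5, 0.3])
next_token = mixed_sample(teacher_probs, student_probs)
```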

The experimental results are shown below. The solid lines represent models trained with the knowledge distillation method described here; the results show that the distilled model needs less than half the parameters to reach the performance of models two or three times its size:


Meta: Distilling System 2 into System 1

Large models have two types of reasoning systems, System 1 and System 2. System 1 recognizes and responds quickly, and is also called fast thinking. System 2 handles complex, logical problems that require some time before responding, and is also called slow thinking. A typical example is the Chain of Thought (CoT): when we add a "step by step" instruction to the prompt, the large model works through our problem step by step before giving its final answer.

This paper explores how to distill the "System 2" reasoning process in LLM into a "System 1" output. Simply put, this means that reasoning processes that require complex thinking and intermediate steps can be turned into results that can be directly generated by the model, thus improving efficiency and reducing costs.


What did this paper do?

  1. Defines the System 2 distillation process:
    • An unsupervised approach is used to distill high-quality System 2 outputs from a large amount of unlabeled data.
    • This process does not require the intermediate sequence of reasoning tokens; the capabilities of System 2 are "compiled" directly into System 1 (see the sketch after this list).
  2. Experimental verification:
    • Experiments were conducted on four different System 2 methods and five tasks.
    • The results show that the distilled System 1 model matches or exceeds the original System 2 in many cases, at a greatly reduced computational cost.
  3. Explores similar phenomena in human behavior:
    • Analogous to the human process of moving from conscious complex reasoning to unconscious automation.
    • Point out that this automated process is equally important for the development of artificial intelligence.
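
For intuition, here is a hedged sketch of what such an unsupervised System 2 to System 1 distillation pipeline could look like: sample several System 2 answers per unlabeled prompt, keep the prompts whose answers agree (a simple self-consistency filter), and fine-tune the model on prompt/answer pairs with no intermediate reasoning. The helpers generate_system2_answer and finetune are hypothetical, not functions from the paper's code, and the filtering criterion is only one plausible choice.

```python
from collections import Counter

def build_distillation_set(prompts, generate_system2_answer,
                           n_samples=8, min_agreement=0.75):
    """Keep (prompt, answer) pairs whose sampled System 2 answers mostly agree."""
    dataset = []
    for prompt in prompts:
        # Sample several System 2 (e.g. chain-of-thought) answers per prompt.
        answers = [generate_system2_answer(prompt) for _ in range(n_samples)]
        answer, count = Counter(answers).most_common(1)[0]
        if count / n_samples >= min_agreement:    # self-consistency filter
            dataset.append((prompt, answer))      # final answer only, no reasoning tokens
    return dataset

# The resulting pairs would then be used to fine-tune the same model to answer
# directly, System 1 style, e.g. finetune(model, build_distillation_set(...)).
```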

The System 2 methods used in the paper:

  • Rephrase and Respond (RaR) method:
    • The model first rewrites the input question to produce richer textual information.
    • The model then responds based on the rewritten text.
    • The advantage is that the large model can make full use of its own knowledge to understand and answer the question.
    • Significant gains were obtained on the last-letter concatenation and coin-flip reasoning tasks.
    • For example, on the coin-flip task, the distilled model achieved an accuracy of 75.69%, compared with 56.1% for the original System 1 model.
  • System 2 Attention (S2A) method:
    • Have the large model filter out unimportant or irrelevant information and focus on the problem to be solved, then rewrite the question and generate a response based on it.
    • Performs well on biased inputs and can greatly reduce the effect of bias.
    • The distilled model achieves an accuracy of 78.69% on the relevant evaluation set.
  • Branch-Solve-Merge (BSM) method:
    • Decompose a question into multiple sub-questions, reason out the answer to each sub-question, and then merge all the sub-answers to obtain the final output.
    • Outperforms the original System 2 method on multiple evaluation benchmarks.
    • Improves agreement with human evaluation.
    • The downside is that the inference cost is high.
  • Chain of Thought (CoT) method:
    • This method performed poorly on some tasks, and the paper points out its limitations.

IV. Modes of knowledge distillation

Online Distillation
Online distillation is a training approach in which, simply put, the teacher model and the student model "learn" at the same time. During each training iteration, the student model is influenced by the teacher model and adjusts its parameters in real time. This allows the student model to learn directly from the teacher, which is especially useful when the teacher model is complex and expressive.

In this process, the teacher model and the student model use the same training data set, and the student model can learn directly from the teacher model. The key to online distillation is that students not only learn the labels of the training data, but also get soft predictions from the teacher's model, which is like getting some "suggestions" rather than hard and fast rules.

In traditional supervised learning, model predictions are usually hard predictions: each sample is assigned a single definite label, usually the category with the highest probability. Soft predictions, in contrast, are a probability distribution output by the model that indicates its confidence in each possible category. Using the softmax function, the output layer of a deep learning model typically produces such a distribution, where the probability of each category reflects how likely the model believes the input belongs to that category.
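
A tiny illustration of the difference (the logits below are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])        # raw model outputs for 3 classes
soft_prediction = F.softmax(logits, dim=-1)   # approx. [0.66, 0.24, 0.10]
hard_prediction = logits.argmax()             # class 0; the other probabilities are discarded
```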

Offline Distillation
Offline distillation is a static learning approach in which the student model is trained with prior knowledge learned by the teacher model, which remains unchanged. The advantage of offline distillation is that it is simple and easy to use. Moreover, the teacher model is trained beforehand and the student model can acquire knowledge from a large amount of data without direct access to that data.

In this case, the teacher model is first trained on a large dataset until optimal results are achieved. Once training is complete, the knowledge of the teacher model (usually its soft predictions) is used to guide the learning of the student model. The student model does not have direct access to the raw data, but learns by imitating the output of the teacher model.

Self-Distillation
Self-distillation is an approach where the network plays both teacher and student at different stages of training to improve performance through self-learning. This approach has the advantage of saving computational resources by eliminating the need for an additional teacher model. In addition, the student model learns directly from its own early predictions, better capturing internal knowledge.

In self-distillation, the teacher model and the student model are actually the same network, just playing different roles at different stages of training. The network generates soft predictions at an early stage and subsequently utilizes these predictions to train itself further, thus improving performance.
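
As one concrete way to picture this (a minimal sketch of one self-distillation variant; others exist, such as distilling between a network's deeper and shallower layers), a frozen snapshot of the network from an earlier training stage can act as the teacher for its later self:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 5))
X = torch.randn(128, 20)                      # toy data
y = torch.randint(0, 5, (128,))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

snapshot = None
for epoch in range(20):
    if epoch == 10:
        # After the early training phase, freeze a copy of the network as the "teacher".
        snapshot = copy.deepcopy(model).eval()
        for p in snapshot.parameters():
            p.requires_grad_(False)
    logits = model(X)
    loss = F.cross_entropy(logits, y)
    if snapshot is not None:
        with torch.no_grad():
            soft = F.softmax(snapshot(X), dim=-1)
        # Pull the current network toward its own earlier soft predictions.
        loss = loss + F.kl_div(F.log_softmax(logits, dim=-1), soft, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```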

Differences between distillation modes
The main difference between online and offline distillation is the level of involvement of the teacher model in the training of the student model. In online distillation, the teacher model is dynamic, whereas in offline distillation, the teacher model is static. Self-distillation can be seen as a special kind of online distillation where both the teacher and the student are representations of the same model at different stages.

Online distillation typically requires more computational resources because two models have to be trained at the same time, but it is more flexible and can adjust the student model in real time to cope with data changes. In contrast, offline distillation is suitable for situations where computational resources are limited because only the student model needs to be trained.

Self-distillation provides a compromise that does not require an additional teacher model, but still benefits from one. This approach is particularly suitable for scenarios where knowledge transfer is desired within the same model.

Each of these three knowledge distillation models has its own advantages and applicability. Choosing the appropriate distillation mode depends on specific task requirements, available computational resources, and expectations of model performance. By choosing and using these modes appropriately, we can effectively transfer and utilize the knowledge in deep learning models, thus achieving more efficient model deployment in various applications.

V. Future development of knowledge distillation

With the continuous progress of artificial intelligence technology, the development of knowledge distillation is promising, and possible research directions include:

  1. Adaptive knowledge distillation:
    • Develop methods that can dynamically adjust distillation strategies based on the architecture and tasks of student models to achieve higher training efficiency and effectiveness.
  2. Multi-task learning and transfer learning:
    • In multi-task learning scenarios, investigate how knowledge from multiple tasks can be shared through knowledge distillation to enhance the overall performance of the student model, and explore how to transfer the teacher model's knowledge to completely new tasks.
  3. Combining with other model compression techniques:
    • Combine knowledge distillation with other model compression techniques such as pruning and quantization to further improve model efficiency and usability.

VI. Summary

Knowledge distillation, as an efficient model compression and knowledge transfer technique, can significantly reduce computational costs and resource requirements while maintaining model performance. It will play an increasingly important role in edge computing, mobile devices, and large-scale machine learning tasks. Through continuous research and practice, knowledge distillation is expected to bring more innovations and breakthroughs to the development of AI.
