Study Notes: A Close Reading of the CLS-RL (R1-V) Code

Preface

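Whenever I start on a new line of work, I first read its code closely; otherwise I would not know how to build follow-up work on it.

This time the code is a follow-up to R1-V, currently one of the main frameworks for MLLMs. It adapts R1-V to general classification tasks.

This is my first time working with MLLM code, so I am taking notes as I read. I hope they are of some help.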

grpo_direct

Main program entry point

def main()

reward_funcs = [reward_funcs_registry['accuracy'] ]

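First comes the definition of the reward functions. direct refers to the direct-answer variant of CLS-RL: there is no reward for the <think> format, and the only reward is for whether the answer in <answer> is correct.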
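
To make the idea of an answer-only reward concrete, here is a minimal sketch. It is not the repository's actual accuracy function: the name accuracy_reward_sketch, the regular expression, and the case-insensitive comparison are all my assumptions. The completions use the conversational format that the trainer builds later.

```python
import re

# Hypothetical stand-in for the registered 'accuracy' reward; not the repo's code.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def accuracy_reward_sketch(completions, solution, **kwargs):
    """Return 1.0 where the answer matches the label, else 0.0.

    Assumes each completion looks like [{"role": "assistant", "content": str}].
    """
    rewards = []
    for completion, label in zip(completions, solution):
        text = completion[0]["content"]
        match = ANSWER_RE.search(text)
        # Fall back to the whole text when no <answer> tag is present.
        answer = match.group(1) if match else text
        rewards.append(1.0 if answer.strip().lower() == label.strip().lower() else 0.0)
    return rewards


demo_completions = [
    [{"role": "assistant", "content": "<answer> frilly </answer>"}],
    [{"role": "assistant", "content": "grid"}],
]
demo_rewards = accuracy_reward_sketch(demo_completions, solution=["frilly", "frilly"])
```

Only correctness is scored here; nothing in the sketch looks at whether a <think> block exists, which is the point of the direct variant.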

Next the dataset is loaded. I used the DTD dataset for my test run.

dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)

dataset is a dict-like object (a DatasetDict) of shape (92, 5). Since we are in the training stage, its only key is 'train', and the value is a Hugging Face Dataset with 5 columns and 92 rows.

problem	image	image_width	image_height	solution
What type of texture is in the photo?\nPlease choose one from list [ interlaced, …	PIL image object	500	500	frilly

If you are interested in the actual data, see https://huggingface.co/datasets/afdsafas/dtd-4shot-b2n. The CLS-RL datasets follow the naming pattern afdsafas/{DATASET}-4shot-b2n.

Next the conversation is built, consisting of a system prompt and the question.

    # Format into conversation
    def make_conversation(example):
        return {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": example["problem"]},
            ],
        }

The system prompt is defined in advance:

SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)

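This defines the basic response format, namely <think> and <answer>.

As for example["problem"], it is the problem column of the dataset loaded above. At this point we have not yet seen how example is constructed, so for now we can assume it is a single sample from that dataset.

Reading on, we reach the construction of the question.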

QUESTION_TEMPLATE = "{Question}\n Please directly output the answer."

Here {Question} is a placeholder that is filled in later, and "Please directly output the answer." is the design of the No-Thinking-CLS-RL method from the paper, where we find the corresponding description:

Instruction prompt. Instead of the prompt in CLS-RL, which encourages models to think, the
prompt in the No-Thinking-CLS-RL method discourages or even prohibits the model from thinking. The prompt is designed as: {Question} Please directly output the answer. Here
{Question} will be replaced by each specific question.

Right after that comes the substitution of {Question}:

    def make_conversation_image(example):
        #print(example['solution'])
        #print(example["problem"])
        return {
            "prompt": [
                {
                    "role": "user",
                    "content": [
                        {"type": "image"},
                        {"type": "text", "text": QUESTION_TEMPLATE.format(Question=example['problem']) },
                    ],
                },
            ],
        }

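Assuming example is a single sample of dataset, the function name suggests that this is where the image-related prompt is built. In the text part, QUESTION_TEMPLATE.format(Question=example['problem']) replaces {Question} in the template with the sample's problem field when the function is called, and the result goes into prompt.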
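
The substitution is easy to check on a toy sample. The sketch below reuses the template and function from the excerpts above; only the sample dict (toy_example) is made up for illustration.

```python
QUESTION_TEMPLATE = "{Question}\n Please directly output the answer."


def make_conversation_image(example):
    # Same structure as the function above: one image slot plus the formatted text.
    return {
        "prompt": [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": QUESTION_TEMPLATE.format(Question=example["problem"])},
                ],
            },
        ],
    }


# Toy sample; the problem string is shortened for illustration.
toy_example = {"problem": "What type of texture is in the photo?", "solution": "frilly"}
toy_prompt = make_conversation_image(toy_example)["prompt"]
toy_text = toy_prompt[0]["content"][1]["text"]
```

Note that the image entry carries no pixel data. It only marks where the image goes; the actual image is attached later by the processor.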

Reading on:

    if "image" in dataset[script_args.dataset_train_split].features:
        print("has image in dataset")
        dataset = dataset.map(make_conversation_image)

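There is a new name here, script_args.dataset_train_split. As noted earlier, the key in dataset is 'train', and script_args.dataset_train_split resolves to exactly that, while script_args.dataset_test_split resolves to 'test'. dataset['train'].features is the column schema of the dataset described above (5 columns, 92 rows), so the if checks whether one of the 5 columns is named "image". For an image dataset, every sample is then transformed with make_conversation_image, which means make_conversation is normally never called.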

After this step, dataset['train'].features contains an extra 'prompt' column. Its text part is the final combination described above: "What type of texture is in the photo?\nPlease choose one from list [ interlaced, … \n Please directly output the answer."
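
The way map adds a column can be imitated with plain dicts. toy_map below is a hypothetical helper, not the datasets API; it only mimics the behavior that the keys returned by the function are merged into each row.

```python
def toy_map(rows, fn):
    """Hypothetical stand-in for Dataset.map: merge fn(row) into a copy of each row."""
    return [{**row, **fn(row)} for row in rows]


def add_prompt(row):
    # Simplified text-only prompt; the real function also adds an image slot.
    return {"prompt": row["problem"] + "\n Please directly output the answer."}


rows = [{"problem": "What type of texture is in the photo?", "solution": "frilly"}]
mapped = toy_map(rows, add_prompt)
new_columns = sorted(mapped[0].keys())
```

The original columns survive and prompt appears next to them, which is why the later code can read both x["prompt"] and x["image"] from the same sample.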

Next the configuration is passed to the trainer. First, the trainer class is selected:

    trainer_cls = Qwen2VLGRPOTrainer if not training_args.use_vllm else Qwen2VLGRPOVLLMTrainer
    print("using: ", trainer_cls)

The output here is Qwen2VLGRPOTrainer, so later we will look at how that class sets up the model in detail.

Next the configuration and the dataset are passed to the trainer:

    trainer = trainer_cls(
        model=model_args.model_name_or_path,
        reward_funcs=reward_funcs,
        args=training_args,
        train_dataset=dataset[script_args.dataset_train_split],
        eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
        peft_config=get_peft_config(model_args),
        attn_implementation=model_args.attn_implementation,
        max_pixels=script_args.max_pixels,
        min_pixels=script_args.min_pixels,
    )

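Here model is Qwen/Qwen2-VL-2B-Instruct, the experimental setting of the paper. The reward function is the accuracy reward defined above. training_args holds the training configuration, such as the loss weights. The training set is the 'train' split of the dataset and the evaluation set is the 'test' split, which is not loaded because we are only training. peft_config is the PEFT fine-tuning configuration; it is empty, meaning LoRA is not used. For details see this issue in the project: https://github.com/minglllli/CLS-RL/issues/5

The attention implementation is flash-attn-2. The minimum and maximum pixel counts bound the size of the images fed to the model, roughly from 56×56 to 3.5K; as far as I can tell, the Qwen2-VL image processor rescales images into this range rather than dropping them. Most datasets should not contain images below or above this range, but if yours is a special case (for example the classic 24×24), keep an eye on this setting.

That concludes the entry file. We now know the reward function introduced by the paper, how the dataset is built, and how the prompt is built. Next we move on to the trainer definition.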

class Qwen2VLGRPOTrainer(Trainer)

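Trainer definition file

Initialization

Entering the Qwen GRPO trainer class selected above, we start from initialization. I have removed some less important parts; see the full file if you are interested. Here we just follow the path the code takes at run time: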

    def __init__(...):
        # Models
        # Trained model
        model_init_kwargs = args.model_init_kwargs or {}
        model_init_kwargs["attn_implementation"] = attn_implementation
        if isinstance(model, str):
            model_id = model
            torch_dtype = model_init_kwargs.get("torch_dtype")
            if isinstance(torch_dtype, torch.dtype) or torch_dtype == "auto" or torch_dtype is None:
                pass  # torch_dtype is already a torch.dtype or "auto" or None
            elif isinstance(torch_dtype, str):  # it's a str, but not "auto"
                torch_dtype = getattr(torch, torch_dtype)
                model_init_kwargs["torch_dtype"] = torch_dtype
            else:
                raise ValueError(
                    "Invalid `torch_dtype` passed to `GRPOConfig`. Expected either 'auto' or a string representing "
                    f"a `torch.dtype` (e.g., 'float32'), but got {torch_dtype}."
                )
            # Disable caching if gradient checkpointing is enabled (not supported)
            model_init_kwargs["use_cache"] = (
                False if args.gradient_checkpointing else model_init_kwargs.get("use_cache")
            )
            if "Qwen2-VL" in model_id:
                model = Qwen2VLForConditionalGeneration.from_pretrained(model, **model_init_kwargs)
            elif "Qwen2.5-VL" in model_id:
                model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model, **model_init_kwargs)
            elif "Aria" in model_id:
                model_init_kwargs.pop("use_cache")
                model = AriaForConditionalGeneration.from_pretrained(model, **model_init_kwargs)
            else:
                model = AutoModelForCausalLM.from_pretrained(model, **model_init_kwargs)

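First, 'attn_implementation' is set; as defined earlier it is flash_attention_2. Then comes the isinstance(model, str) check. Since we passed the model as a name string and it has not been instantiated yet, the check is True.

model_id is that model name, i.e. Qwen/Qwen2-VL-2B-Instruct.

Then the model is loaded, with weights fetched from Hugging Face: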

            if "Qwen2-VL" in model_id:
                model = Qwen2VLForConditionalGeneration.from_pretrained(model, **model_init_kwargs)

At this point the model is loaded. Next comes the reference model, which this branch creates when DeepSpeed ZeRO-3 is enabled:

        # Reference model
        if is_deepspeed_zero3_enabled():
            if "Qwen2-VL" in model_id:
                self.ref_model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, **model_init_kwargs)

Then the processing class is set up:

        # Processing class
        if processing_class is None:
            if "Qwen2-VL" in model_id or "Qwen2.5-VL" in model_id or "Aria" in model_id:
                processing_class = AutoProcessor.from_pretrained(model_id)
                pad_token_id = processing_class.tokenizer.pad_token_id
                processing_class.pad_token_id = pad_token_id
                processing_class.eos_token_id = processing_class.tokenizer.eos_token_id
                if "Qwen" in model_id or "Qwen2.5-VL" in model_id:
                    processing_class.image_processor.max_pixels = max_pixels
                    processing_class.image_processor.min_pixels = min_pixels

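Here AutoProcessor first loads the processor that belongs to the model, and then some settings are read from it and written to it. The ID of the tokenizer's pad_token is read first; pad_token is the padding token used to pad sequences to equal length. That ID is then assigned to the processor object itself, one level above the tokenizer, presumably for easier access.

The same is done for eos_token and min/max_pixels. EOS is the end-of-sequence token; min/max_pixels were covered above.

Then the reward functions are set up: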

        # Reward functions
        if not isinstance(reward_funcs, list):
            reward_funcs = [reward_funcs]
        for i, reward_func in enumerate(reward_funcs):
            if isinstance(reward_func, str):
                reward_funcs[i] = AutoModelForSequenceClassification.from_pretrained(
                    reward_func, num_labels=1, **model_init_kwargs
                )
        self.reward_funcs = reward_funcs

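There is a loop here, but we only have one reward function, the accuracy reward. Because it is a Python function rather than a string, the from_pretrained branch is not taken.

Next come the processing classes for standalone reward models. The code first reads the tokenizers passed in for the reward functions; the default is None, so it takes the branch reward_processing_classes = [None] * len(reward_funcs) and yields [None]. With two reward functions it would be [None, None]: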

        # Reward processing class
        if reward_processing_classes is None:
            reward_processing_classes = [None] * len(reward_funcs)
        elif not isinstance(reward_processing_classes, list):
            reward_processing_classes = [reward_processing_classes]
        else:
            if len(reward_processing_classes) != len(reward_funcs):
                raise ValueError("The number of reward processing classes must match the number of reward functions.")

        for i, (reward_processing_class, reward_func) in enumerate(zip(reward_processing_classes, reward_funcs)):
            if isinstance(reward_func, PreTrainedModel):
                if reward_processing_class is None:
                    reward_processing_class = AutoTokenizer.from_pretrained(reward_func.config._name_or_path)
                if reward_processing_class.pad_token_id is None:
                    reward_processing_class.pad_token = reward_processing_class.eos_token
                # The reward model computes the reward for the latest non-padded token in the input sequence.
                # So it's important to set the pad token ID to the padding token ID of the processing class.
                reward_func.config.pad_token_id = reward_processing_class.pad_token_id
                reward_processing_classes[i] = reward_processing_class
        self.reward_processing_classes = reward_processing_classes

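Because our reward function is a plain function defined in the entry file, not a model, the branch if isinstance(reward_func, PreTrainedModel) is skipped.

After the loop has run once per reward function (once here), the list is stored on self. Here it is [None], meaning our custom reward function needs no tokenizer. It stays None except in special cases, such as passing a tokenizer explicitly or using a pretrained reward model.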
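
The branching on reward_processing_classes can be reproduced without any models. The sketch below mirrors the three branches shown above; normalize_processing_classes is a name I made up for illustration.

```python
def normalize_processing_classes(reward_processing_classes, num_reward_funcs):
    """Mirror the branching above with plain Python objects (no models involved)."""
    if reward_processing_classes is None:
        return [None] * num_reward_funcs
    if not isinstance(reward_processing_classes, list):
        return [reward_processing_classes]
    if len(reward_processing_classes) != num_reward_funcs:
        raise ValueError("The number of reward processing classes must match the number of reward functions.")
    return reward_processing_classes


one_func = normalize_processing_classes(None, 1)
two_funcs = normalize_processing_classes(None, 2)
```

With the default of None, the list simply gets one None per reward function, and the later loop leaves those entries untouched for function-based rewards.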

Next, some of the training arguments are stored:

        # Training arguments
        self.max_prompt_length = args.max_prompt_length
        self.max_completion_length = args.max_completion_length  # = |o_i| in the GRPO paper
        self.num_generations = args.num_generations  # = G in the GRPO paper
        self.generation_config = GenerationConfig(
            max_new_tokens=self.max_completion_length,
            do_sample=True,  
            temperature=1, # HACK
            num_return_sequences=self.num_generations,
            pad_token_id=pad_token_id,
        )
        self.beta = args.beta

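max_prompt_length is the maximum prompt length. The CLS-RL default is 2048, which is normally not exceeded. ImageNet is the exception, so the paper reduces its class list to 100 classes, randomly sampled from the 1000.

max_completion_length is the maximum length of the generated completion, which prevents unbounded generation. The default is 1024.

num_generations is the number of completions in a group. GRPO uses the group to compute relative advantages and rewards or penalizes each completion accordingly. CLS-RL sets it to 4.

The key setting in generation_config is do_sample, which makes generation stochastic so that the completions within a group are not all identical. temperature controls how spread out the sampling is: the higher the temperature, the more diverse the output and the more likely it drifts off topic. It is set to 1 here.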
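
The effect of temperature can be seen on a tiny made-up distribution. This is a generic softmax sketch in pure Python, not code from the trainer; the logits are invented for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
p_default = softmax_with_temperature(toy_logits, temperature=1.0)
p_sharp = softmax_with_temperature(toy_logits, temperature=0.5)
p_flat = softmax_with_temperature(toy_logits, temperature=2.0)
```

A lower temperature concentrates probability on the top token, a higher one flattens the distribution. At temperature 1 the model's own distribution is sampled unchanged, which gives GRPO the variety it needs inside a group.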

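self.beta = args.beta is the weight of the KL penalty. It controls how strongly the per-token KL divergence between the model being trained and the reference model is penalized, trading off staying close to the original model against maximizing reward. It appears later in the loss computation: per_token_loss = -(per_token_loss - self.beta * per_token_kl)

Before moving on to the loss, there is one more piece of initialization: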

        self.model_accepts_loss_kwargs = False

Let us see how the Trainer class describes it:

How the loss is computed by Trainer. By default, all models return the loss in the first element. Subclass and override for custom behavior. If you are not using num_items_in_batch when computing your loss, make sure to overwrite self.model_accepts_loss_kwargs to False. Otherwise, the loss calculationg might be slightly inacurate when performing gradient accumulation.

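Our custom loss is computed outside the model, in compute_loss of the trainer class, and it does not use num_items_in_batch, so this is set to False. You can see later how the loss is computed from the model's logits, or simply ignore this. If you ever need the loss returned directly by the model, setting this to True should be enough.

That covers initialization. We can now analyze the loss computation.

Loss computation

The loss function is fairly long, so we will go through it piece by piece: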

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        if return_outputs:
            raise ValueError("The GRPOTrainer does not support returning outputs")
        prompts = [x["prompt"] for x in inputs]
        prompts_text = [maybe_apply_chat_template(example, self.processing_class)["prompt"] for example in inputs]

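First, prompts = [x["prompt"] for x in inputs] collects the prompts. Since our batch_size is 1, inputs contains a single sample and the list has one element.

So inputs[0]['prompt'] ends up in prompts, and prompts[0] is the prompt built earlier in the entry file. Visualized: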

'content'	{'text': None, 'type': 'image'} {'text': 'What type of texture is in the photo?\nPlease choose one from list [ bubbly, knitted, …aced].\n Please directly output the answer.', 'type': 'text'}
'role'	'user'

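Note that the prompt is still plain text at this point, but its structure is already in place.

Next, the template function maybe_apply_chat_template renders the messages into a single formatted string, giving prompts_text. Tokenization itself happens later, when processing_class is called.

This template function is documented at: https://huggingface.co/docs/trl/v0.21.0/en/data_utils#trl.maybe_apply_chat_template

According to the documentation, each example passed in is one of our inputs, and self.processing_class, initialized earlier, supplies the chat template applied to the prompt.

Next the images are processed: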

        images = []
        for x in inputs:
            img_temp = x["image"].resize((384, 384), Image.Resampling.LANCZOS)
            images.append(img_temp)
        prompt_inputs = self.processing_class(
            text=prompts_text,
            images=images,
            return_tensors="pt",
            padding=True,
            padding_side="left",
            add_special_tokens=False,
        )
        prompt_inputs = super()._prepare_inputs(prompt_inputs)
        #print(prompt_inputs)

        prompt_ids, prompt_mask = prompt_inputs["input_ids"], prompt_inputs["attention_mask"]
        pixel_values = prompt_inputs["pixel_values"]
        image_grid_thw = prompt_inputs["image_grid_thw"]

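First the PIL image is taken from each sample and resized to 384×384, which is presumably the SigLIP input size.

The text and images are then passed together to the Transformers processor, which returns the tensorized prompt containing both text and image.

Some of its fields are then pulled out for the loss computation later.

The prompt is then truncated to max_prompt_length if necessary, keeping the rightmost tokens: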

        if self.max_prompt_length is not None:
            prompt_ids = prompt_ids[:, -self.max_prompt_length :]
            prompt_mask = prompt_mask[:, -self.max_prompt_length :]
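
The negative slice keeps the last max_prompt_length positions of every row. With left padding, that means padding and the oldest tokens are dropped first. A list-based sketch with invented token IDs (0 stands in for a hypothetical pad ID):

```python
max_prompt_length = 4  # illustrative value; the notes mention 2048 as the default here
toy_prompt_ids = [[0, 0, 11, 12, 13, 14]]   # left-padded with 0 (hypothetical pad id)
toy_prompt_mask = [[0, 0, 1, 1, 1, 1]]

# Equivalent of prompt_ids[:, -max_prompt_length:] for nested lists.
truncated_ids = [row[-max_prompt_length:] for row in toy_prompt_ids]
truncated_mask = [row[-max_prompt_length:] for row in toy_prompt_mask]
```

When a row is already shorter than the limit, the slice returns it unchanged, so the step is a no-op for typical prompts.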

Next the completions are generated:

        # Generate completions
        with unwrap_model_for_generation(model, self.accelerator) as unwrapped_model:
            prompt_completion_ids = unwrapped_model.generate(**prompt_inputs, generation_config=self.generation_config)

            prompt_length = prompt_ids.size(1)
            prompt_ids = prompt_completion_ids[:, :prompt_length]
            completion_ids = prompt_completion_ids[:, prompt_length:]
            prompt_mask = prompt_mask.repeat_interleave(self.num_generations, dim=0)

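Setting aside the with unwrap_model_for_generation wrapper, let us focus on the generation itself.

The generated prompt_completion_ids consist of the input prompt followed by the answer. We set the number of generations to 4, so the first dimension of prompt_completion_ids is 4 and its shape is (4, 316): prompt plus answer take 316 tokens.

Reading prompt_length gives the length of the input prompt, 312 here. That is long because, as the CLS-RL prompt format shows, the prompt contains the full class list. The generated answer, by contrast, is only 4 tokens, so the model evidently produced the class name directly.

The prompt and the answer are then sliced out as prompt_ids and completion_ids, and the prompt mask is repeated once per generation so that prompt_mask has the same shape as prompt_ids.

The prompt itself needs no masking here (with a batch size of 1 there is no padding); only the answer will be masked, so every value in the prompt mask is 1.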
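
The split and the mask repetition can be followed on nested lists. The sketch uses a toy prompt length of 3 instead of 312 and invented token IDs; repeat_interleave_rows is my list-based analogue of tensor.repeat_interleave(..., dim=0).

```python
def repeat_interleave_rows(rows, repeats):
    """List-based analogue of tensor.repeat_interleave(repeats, dim=0)."""
    return [list(row) for row in rows for _ in range(repeats)]


num_generations = 4
prompt_length = 3  # toy value; the run described above had 312
# Each generated row is prompt tokens followed by completion tokens (made-up ids).
toy_prompt_completion_ids = [
    [7, 8, 9, 21, 99],
    [7, 8, 9, 22, 99],
    [7, 8, 9, 23, 24],
    [7, 8, 9, 25, 26],
]
split_prompt_ids = [row[:prompt_length] for row in toy_prompt_completion_ids]
split_completion_ids = [row[prompt_length:] for row in toy_prompt_completion_ids]
toy_mask = repeat_interleave_rows([[1, 1, 1]], num_generations)
```

All four rows share the same prompt prefix, so one mask row repeated four times is enough to cover them.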

That completes the generation part. Next, the EOS token is located and everything after it is masked:

        # Mask everything after the first EOS token
        is_eos = completion_ids == self.processing_class.eos_token_id
        device = self.accelerator.device
        eos_idx = torch.full((is_eos.size(0),), is_eos.size(1), dtype=torch.long, device=device)
        eos_idx[is_eos.any(dim=1)] = is_eos.int().argmax(dim=1)[is_eos.any(dim=1)]
        sequence_indices = torch.arange(is_eos.size(1), device=device).expand(is_eos.size(0), -1)
        completion_mask = (sequence_indices <= eos_idx.unsqueeze(1)).int()

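First the positions of the stop token in the generated answers are found by comparing every token ID with the EOS ID. The resulting tensor is_eos has the same shape as completion_ids, (4, 4), with True marking an EOS.

The next two lines are a little more involved. eos_idx is initialized with shape (4,) and filled with the sequence length, 4, which is the fallback for rows that contain no EOS. For rows that do contain an EOS, is_eos is converted from True/False to 1/0 and argmax returns the index of the first 1, i.e. the first EOS, which overwrites the fallback.

sequence_indices is just the position index along the sequence: torch.arange(4) expanded to one row per generation, so every row is [0, 1, 2, 3].

sequence_indices exists to build completion_mask: the position of every token is compared with the EOS position of its row, and the type conversion turns the comparison into the mask. Say eos_idx is [1, 1, 3, 3], meaning the first EOS is at position 1 in the first generation and at position 3 in the third. Since sequence_indices is ordered, positions less than eos_idx are valid tokens, the position equal to eos_idx is the EOS itself (which is kept), and positions greater than it are masked. The cast turns True/False into 1/0.

So the completion_mask row for the first generation is [1, 1, 0, 0], and the row for the third generation is [1, 1, 1, 1].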
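
The same masking logic can be written in pure Python and checked against the numbers above. The token IDs and the EOS value 99 are invented; only the EOS positions [1, 1, 3, 3] come from the run described here.

```python
EOS = 99  # hypothetical EOS token id for illustration


def completion_mask_sketch(completion_ids, eos_token_id):
    """Keep tokens up to and including the first EOS; mask everything after it."""
    mask = []
    for row in completion_ids:
        # Default to len(row) when the row has no EOS, like torch.full(..., is_eos.size(1)).
        eos_idx = row.index(eos_token_id) if eos_token_id in row else len(row)
        mask.append([1 if position <= eos_idx else 0 for position in range(len(row))])
    return mask


# Made-up ids arranged so the first-EOS positions are [1, 1, 3, 3], as in the notes.
toy_completion_ids = [
    [5, EOS, EOS, EOS],
    [6, EOS, EOS, EOS],
    [7, 8, 9, EOS],
    [7, 8, 9, EOS],
]
toy_completion_mask = completion_mask_sketch(toy_completion_ids, EOS)
```

A row without any EOS keeps every token, because its fallback index equals the row length and every position is smaller than that.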

Next, the inputs for the forward pass are assembled:

        # Concatenate prompt_mask with completion_mask for logit computation
        attention_mask = torch.cat([prompt_mask, completion_mask], dim=1)  # (B*G, P+C)
        pixel_values = prompt_inputs["pixel_values"].repeat(self.num_generations, 1)
        image_grid_thw = prompt_inputs["image_grid_thw"].repeat_interleave(self.num_generations, dim=0)

        per_token_logps = self._get_per_token_logps(model, prompt_completion_ids, attention_mask, pixel_values, image_grid_thw)
        # Get rid of the prompt (-1 because of the shift done in get_per_token_logps)
        per_token_logps = per_token_logps[:, prompt_length - 1 :]

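First the attention mask: the prompt mask and the completion mask are concatenated, so the final shape is (4, 316).

Then the image tensors are repeated to match the number of generations. With a batch_size of 1, this gives 1*4 copies of the image.

image_grid_thw describes the grid of patches each image was split into, as (T, H, W). For how the preprocessing works, see: https://zhuanlan.zhihu.com/p/28205969434

The process is too involved to cover here.

Now the computation itself: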

        per_token_logps = self._get_per_token_logps(model, prompt_completion_ids, attention_mask, pixel_values, image_grid_thw)

With all required arguments passed in, the log-probability of every token is computed. The function is:

def _get_per_token_logps(self, model, input_ids, attention_mask, pixel_values, image_grid_thw):
        logits = model(input_ids, attention_mask=attention_mask, pixel_values=pixel_values, image_grid_thw=image_grid_thw).logits  # (B, L, V)
        logits = logits[:, :-1, :]  # (B, L-1, V), exclude the last logit: it corresponds to the next token pred
        input_ids = input_ids[:, 1:]  # (B, L-1), exclude the first input ID since we don't have logits for it
        # Compute the log probabilities for the input tokens. Use a loop to reduce memory peak.
        per_token_logps = []
        for logits_row, input_ids_row in zip(logits, input_ids):
            log_probs = logits_row.log_softmax(dim=-1)
            token_log_prob = torch.gather(log_probs, dim=1, index=input_ids_row.unsqueeze(1)).squeeze(1)
            per_token_logps.append(token_log_prob)
        return torch.stack(per_token_logps)

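This function returns the log-probability of every token in every sequence. The first line runs the forward pass to obtain the logits. Then comes the shift: the logits at the last position predict a token beyond the end of the sequence, which we do not have, so they are dropped. The first input ID is dropped as well, because no position precedes it and so no logits predict it. After the shift, the logits at position t line up with the token at position t+1.

Finally log_softmax is applied and the log-probability of each actual token is picked out with gather. The code uses an explicit loop over rows to lower peak memory, trading time for space.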
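
The shift and the gather are easier to see on a single toy sequence. This is a pure-Python sketch of the same idea with invented logits over a two-token vocabulary; it handles one sequence, whereas the real function loops over a batch of tensors.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one list of scores."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def per_token_logps_sketch(logits, input_ids):
    """logits: L rows of V scores for one sequence; input_ids: L token ids."""
    shifted_logits = logits[:-1]   # drop the last position: it predicts a token we do not have
    targets = input_ids[1:]        # drop the first id: no position predicts it
    return [log_softmax(scores)[token] for scores, token in zip(shifted_logits, targets)]


# Toy sequence of 3 tokens over a vocabulary of 2 (values are made up).
toy_logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 5.0]]
toy_ids = [0, 1, 0]
toy_logps = per_token_logps_sketch(toy_logits, toy_ids)
```

A sequence of L tokens yields L-1 log-probabilities, which is where the "- 1" in the later slicing comes from.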

Back in the loss function, after obtaining the log-probabilities we drop the prompt part, since this work only needs the log-probabilities of the generated tokens:

per_token_logps = per_token_logps[:, prompt_length - 1 :]
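
The offset prompt_length - 1 can be verified with the lengths from the run above (312 prompt tokens, 316 in total). The sketch only does index arithmetic; the variable names are mine.

```python
prompt_length = 312       # prompt tokens in the run described above
total_length = 316        # prompt + completion tokens

# After the shift, entry j of per_token_logps scores token j + 1 of the sequence.
shifted_length = total_length - 1
scored_token_positions = list(range(1, total_length))  # token index scored by each entry

# Slicing from prompt_length - 1 keeps exactly the entries that score completion tokens.
kept_positions = scored_token_positions[prompt_length - 1:]
```

The four kept entries score tokens 312 to 315, which are exactly the four completion tokens, so the result lines up with completion_mask.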

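Next come the log-probabilities under the reference model. The reference model is a frozen model used to measure how far the model being trained has drifted from it. The frozen model's outputs reflect reliable prior knowledge; if the current model diverges too much from it, then even with a very high reward its outputs are far from where they started, and catastrophic forgetting may well have set in.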

        with torch.inference_mode():
            if self.ref_model is not None:
                ref_per_token_logps = self._get_per_token_logps(self.ref_model, prompt_completion_ids, attention_mask, pixel_values, image_grid_thw)
            else:
                with self.accelerator.unwrap_model(model).disable_adapter():
                    ref_per_token_logps = self._get_per_token_logps(model, prompt_completion_ids, attention_mask, pixel_values, image_grid_thw)
        ref_per_token_logps = ref_per_token_logps[:, prompt_length - 1 :]

        # Compute the KL divergence between the model and the reference model
        per_token_kl = torch.exp(ref_per_token_logps - per_token_logps) - (ref_per_token_logps - per_token_logps) - 1

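ref_per_token_logps is obtained in the same way, and the prompt part is removed in the same way. The else branch, which runs the same model with its adapter disabled, applies when no separate ref_model exists.

Then the KL divergence between the two is computed per token, using the estimator adopted by GRPO.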
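
The estimator exp(d) - d - 1, with d the difference between reference and policy log-probabilities, is never negative and is zero only when the two agree. A scalar sketch with invented log-probabilities:

```python
import math


def per_token_kl_sketch(ref_logp, logp):
    """exp(d) - d - 1 with d = ref_logp - logp, as in the line above."""
    diff = ref_logp - logp
    return math.exp(diff) - diff - 1.0


kl_same = per_token_kl_sketch(-1.2, -1.2)       # identical log-probabilities
kl_lower = per_token_kl_sketch(-2.0, -1.0)      # policy more confident than the reference
kl_higher = per_token_kl_sketch(-1.0, -2.0)     # policy less confident than the reference
```

Drifting in either direction is penalized, which is what keeps the trained model near the frozen one.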

Next the generated tokens are decoded and wrapped in the conversational answer format:

        # Decode the generated completions
        completions = self.processing_class.batch_decode(completion_ids, skip_special_tokens=True)
        if is_conversational(inputs[0]):
            completions = [[{"role": "assistant", "content": completion}] for completion in completions]

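Now that the token IDs are decoded back to ordinary strings, we can get a glimpse of the output: the 4 generated answers are ['grid', 'grid', 'Chequered', 'chequered'].

Next the rewards for these results are computed: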

        rewards_per_func = torch.zeros(len(prompts), len(self.reward_funcs), device=device)
        for i, (reward_func, reward_processing_class) in enumerate(
            zip(self.reward_funcs, self.reward_processing_classes)
        ):
            # Repeat all input columns (but "prompt" and "completion") to match the number of generations
            reward_kwargs = {key: [] for key in inputs[0].keys() if key not in ["prompt", "completion"]}
            for key in reward_kwargs:
                for example in inputs:
                    # Repeat each value in the column for `num_generations` times
                    reward_kwargs[key].extend([example[key]] * self.num_generations)
            output_reward_func = reward_func(prompts=prompts, completions=completions, **reward_kwargs)
            rewards_per_func[:, i] = torch.tensor(output_reward_func, dtype=torch.float32, device=device)

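Because CLS-RL uses a custom reward function, I removed a conditional here for brevity.

As you can see, this part mainly lines up the inputs of the reward function rather than computing anything; the computation is delegated to the reward function defined in the entry file. rewards_per_func is initialized first. Then a dictionary is built from every key of the inputs except 'prompt' and 'completion', with each value repeated num_generations times, and it is passed to the reward function together with prompts and completions. The result is the reward of each generation under each reward function. Note that for the shapes to match, prompts must already hold one entry per generation at this point; that step is not shown in the excerpt.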
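
The construction of reward_kwargs can be run on its own with plain dicts. The function below wraps the same lines from the excerpt; build_reward_kwargs is my name for it, and the sample values are placeholders.

```python
def build_reward_kwargs(inputs, num_generations):
    """Repeat every extra column so it lines up with the generated completions."""
    reward_kwargs = {key: [] for key in inputs[0].keys() if key not in ["prompt", "completion"]}
    for key in reward_kwargs:
        for example in inputs:
            reward_kwargs[key].extend([example[key]] * num_generations)
    return reward_kwargs


# One toy sample; values are placeholders rather than real dataset content.
toy_inputs = [{"prompt": "...", "problem": "What type of texture is in the photo?", "solution": "frilly"}]
toy_kwargs = build_reward_kwargs(toy_inputs, num_generations=4)
```

This is how the reward function receives solution as a keyword argument: one label per generated completion, in the same order as completions.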

Next the rewards from all reward functions are summed for each generation:

        # Sum the rewards from all reward functions
        rewards = rewards_per_func.sum(dim=1)

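With a single reward function this changes nothing except the shape: rewards_per_func is (4, 1), i.e. 4 generations each with one reward value, and after the sum rewards is (4,).

Then the group mean and standard deviation are computed: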

        # Compute grouped-wise rewards
        mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1)
        std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1)

Next the relative advantage:

        # Normalize the rewards to compute the advantages
        mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
        std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
        advantages = (rewards - mean_grouped_rewards) / (std_grouped_rewards + 1e-4)
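
The group normalization can be checked by hand on a toy group. The sketch uses the standard library instead of torch; the rewards [0, 0, 1, 1] are hypothetical, and statistics.stdev is used because it is the sample standard deviation, like the default of torch.std.

```python
import statistics


def group_advantages(rewards, num_generations, eps=1e-4):
    """Normalize rewards within each group of num_generations completions."""
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mean = statistics.mean(group)
        std = statistics.stdev(group)  # sample std, matching torch.std's default
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages


toy_rewards = [0.0, 0.0, 1.0, 1.0]  # hypothetical: two wrong and two correct answers
toy_advantages = group_advantages(toy_rewards, num_generations=4)
uniform_advantages = group_advantages([1.0, 1.0, 1.0, 1.0], num_generations=4)
```

When every completion in a group gets the same reward, all advantages are zero and the group contributes no policy-gradient signal; only the KL term remains.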

The advantage computation above matches the formula in the paper:
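
Written out from the code above, the group-normalized advantage of the i-th of G completions with reward r_i is as follows. The notation is my reconstruction and may differ from the paper's; the 1e-4 added to the denominator in the code is only there for numerical stability.

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
```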

Then the loss is computed, combining the advantage term with the KL penalty:

        # x - x.detach() allows for preserving gradients from x
        per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
        per_token_loss = -(per_token_loss - self.beta * per_token_kl)
        loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
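
The forward value of these three lines can be reproduced with lists. In the forward pass exp(per_token_logps - per_token_logps.detach()) equals 1, so the sketch uses 1.0 in its place; it has no autograd and only illustrates the masking and averaging. All inputs, including beta = 0.1, are made up.

```python
def grpo_loss_value_sketch(advantages, per_token_kl, completion_mask, beta):
    """Forward value of the loss above for nested lists (no autograd).

    exp(logp - logp.detach()) equals 1 in the forward pass, so each token
    contributes -(advantage - beta * kl) to the value.
    """
    per_sequence = []
    for adv, kl_row, mask_row in zip(advantages, per_token_kl, completion_mask):
        token_losses = [-(1.0 * adv - beta * kl) for kl in kl_row]
        masked_sum = sum(loss * m for loss, m in zip(token_losses, mask_row))
        per_sequence.append(masked_sum / sum(mask_row))
    return sum(per_sequence) / len(per_sequence)


# Toy inputs: 2 completions of 3 tokens; all numbers are made up.
toy_loss = grpo_loss_value_sketch(
    advantages=[1.0, -1.0],
    per_token_kl=[[0.0, 0.5, 9.0], [0.5, 0.5, 0.5]],
    completion_mask=[[1, 1, 0], [1, 1, 1]],
    beta=0.1,
)
```

The large KL value 9.0 in the first row sits behind the EOS and is masked out, so it does not affect the result. Each sequence is averaged over its own valid tokens before the mean over sequences is taken.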

This corresponds to the objective in the paper:
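
Written out term by term from the three lines of code, with q the prompt, o_i the i-th completion, and the masked mean expressed as an average over the |o_i| valid tokens. This is my reconstruction from the code, so the paper's notation may differ.

```latex
\mathcal{L}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\left[
  \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\left[\pi_\theta(o_{i,t} \mid q, o_{i,<t})\right]_{\text{no grad}}} \hat{A}_i
  - \beta \, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
\right],
\qquad
\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
= \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}
- \log \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1
```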

That completes the main computation. Next comes metric logging:

        # Log the metrics
        completion_length = self.accelerator.gather_for_metrics(completion_mask.sum(1)).float().mean().item()
        self._metrics["completion_length"].append(completion_length)

        reward_per_func = self.accelerator.gather_for_metrics(rewards_per_func).mean(0)
        for i, reward_func in enumerate(self.reward_funcs):
            if isinstance(reward_func, PreTrainedModel):
                reward_func_name = reward_func.config._name_or_path.split("/")[-1]
            else:
                reward_func_name = reward_func.__name__
            self._metrics[f"rewards/{reward_func_name}"].append(reward_per_func[i].item())

        self._metrics["reward"].append(self.accelerator.gather_for_metrics(rewards).mean().item())

        self._metrics["reward_std"].append(self.accelerator.gather_for_metrics(std_grouped_rewards).mean().item())

        mean_kl = ((per_token_kl * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
        self._metrics["kl"].append(self.accelerator.gather_for_metrics(mean_kl).mean().item())

        return loss

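The first metric is completion_length, the mean length of the completions.

The second is rewards/{reward_func_name}, the mean raw reward from each reward function.

The third is reward, the total reward after summing.

The fourth is reward_std, the within-group standard deviation of the rewards, which measures how much the scores differ inside a group.

The fifth is kl, the mean KL divergence between the current model and the reference model on the generated sequences.

Finally the loss is returned.

That covers essentially the whole class.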

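Closing remarks

Next I will do a close reading of the Visionary-R1 code. As with CoOp, parts that are similar will only be touched on briefly.

Visionary-R1 makes more changes on the prompt side, so the code in this post is much simpler by comparison and easier to get started with, which is why I decided to begin here.

I plan to take notes as I go on this new project and tidy them up for posting when I get the chance. I also said earlier that I would write up style transfer; that will have to wait until I have time.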
