No Labeled Dataset? How to Instruction-Tune a Large Model: Introducing a Promising Model for Generating Labeled Datasets

Published 2024-6-20 09:49

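When building LLM applications, there are usually two ways to improve results. One is to build an external knowledge base and use retrieval-augmented generation (RAG); the other is fine-tuning. RAG is not a cure-all: for domain-specific LLM applications, and for cases where the model must complete a particular task without in-context examples, fine-tuning is needed. Fine-tuning takes more compute and time than RAG, but the bigger bottleneck is that it requires labeled sample data. Many enterprises have not accumulated high-quality data of this kind. Their data is usually unlabeled: individual articles or policy documents rather than question-answer pairs.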

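The traditional way to obtain fine-tuning data is to construct question-answer pairs by hand. Building on that, a Stanford research team proposed Alpaca, which uses a strong model (OpenAI's text-davinci-003 in the original release) to imitate seed examples and generate a labeled dataset.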

https://arxiv.org/pdf/2402.18334

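This article introduces Bonito (https://github.com/BatsResearch/bonito), a newer project for generating sample data. It is an open-source model for conditional task generation that converts unannotated text into task-specific training datasets for instruction tuning. According to the paper, the model is fine-tuned from mistralai/Mistral-7B-v0.1 on a dataset of 1.65 million examples (https://huggingface.co/datasets/BatsResearch/ctga-v1) and supports multiple task types, including multiple-choice question answering, yes-no question answering, natural language inference, and topic classification.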

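The Bonito project itself is an LLM application for data generation. The model is served with vLLM for faster inference, and usage is fairly simple: extract the content of your documents (PDFs, for example) into a dataset (datasets), choose the type of task to generate, and pass both to bonito.generate_tasks.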
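The first step, getting document text into dataset records, can be sketched as follows. This is a minimal sketch and not part of Bonito: `split_into_passages` and its `max_chars` limit are hypothetical, and the `input` key is chosen only to match the `context_col="input"` argument used in this article's usage example. It assumes the text has already been extracted from the source file (PDF or otherwise) as a plain string.

```python
def split_into_passages(text: str, max_chars: int = 1000) -> list[dict]:
    """Split raw document text into passage records.

    Paragraphs are separated by blank lines. Consecutive paragraphs are
    merged until adding another would exceed max_chars. A single paragraph
    longer than max_chars is kept whole rather than cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    records, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # The running passage is full: emit it and start a new one.
            records.append({"input": current})
            current = para
        else:
            current = candidate
    if current:
        records.append({"input": current})
    return records


document = (
    "Clause 1. All data is confidential.\n\n"
    "Clause 2. Data must be returned.\n\n"
    "Clause 3. This agreement lasts two years."
)
records = split_into_passages(document, max_chars=80)
for record in records:
    print(repr(record["input"]))
```

A list of records like this can be converted with `datasets.Dataset.from_list(records)` before it is passed to the generator.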

Bonito definition:

class Bonito(LLM, AbstractBonito):
    def generate_tasks(
        self,
        text_dataset: Dataset,
        context_col: str,
        task_type: str,
        sampling_params: SamplingParams,
        **kwargs,
    ):
        """
        Generates tasks using the Bonito model.


        This method takes a text dataset, a context column name,
        a task type, and sampling parameters, and generates tasks
        using the Bonito model. It processes the input dataset,
        generates outputs, collects multiple generations into
        one dataset object, and filters out the examples that
        cannot be parsed.


        Args:
            text_dataset (Dataset): The dataset that provides the text
                for the tasks.
            context_col (str): The name of the column in the dataset
                that provides the context for the tasks.
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (SamplingParams): The parameters for
                sampling.
            **kwargs: Additional keyword arguments.


        Returns:
            Dataset: The synthetic dataset with the generated tasks.
        """
        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col, **kwargs
        )
        outputs = self.generate(processed_dataset["input"], sampling_params)


        # collect multiple generations into one dataset object
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            for output in outputs[i].outputs:
                examples.append(
                    {"context": example[context_col], "prediction": output.text.strip()}
                )


        synthetic_dataset = Dataset.from_list(examples)


        # filter out the examples that cannot be parsed
        synthetic_dataset = self._postprocess_dataset(
            synthetic_dataset, context_col="context", **kwargs
        )


        return synthetic_dataset
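The collection loop in generate_tasks flattens the generations: with `n` samples per prompt, each context produces `n` rows. Here is a minimal sketch of that step alone. `Completion` and `RequestResult` are hypothetical stand-ins that mimic only the `.outputs` and `.text` attributes the loop reads, so the example runs without vLLM or a GPU.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    """Stand-in for one sampled completion (exposes .text)."""
    text: str


@dataclass
class RequestResult:
    """Stand-in for the per-prompt result (exposes .outputs)."""
    outputs: list


def collect_generations(rows, outputs, context_col):
    """Pair every sampled completion with the context it was generated from."""
    examples = []
    for row, result in zip(rows, outputs):
        for completion in result.outputs:
            examples.append(
                {"context": row[context_col], "prediction": completion.text.strip()}
            )
    return examples


rows = [{"input": "passage A"}, {"input": "passage B"}]
outputs = [
    RequestResult([Completion(" task A1 "), Completion("task A2")]),
    RequestResult([Completion("task B1\n"), Completion("task B2")]),
]
examples = collect_generations(rows, outputs, context_col="input")
for example in examples:
    print(example)
```

Two contexts with two samples each give four rows, which is why raising `n` in the sampling parameters multiplies the size of the synthetic dataset before unparseable rows are filtered out.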

Basic usage:

from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset


# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")


# load dataset with unannotated text
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment",
    "unannotated_contract_nli"
)["train"].select(range(10))


# Generate synthetic instruction tuning dataset
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params
)

To run on a GPU with less memory, such as a T4, you can quantize the model:

from typing import Optional, List, Dict
from datasets import Dataset
from awq import AutoAWQForCausalLM
from bonito import AbstractBonito
from transformers import AutoTokenizer




class QuantizedBonito(AbstractBonito):
    def __init__(self, model_name_or_path):
        self.model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True).cuda()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)


    def generate_task(
        self,
        unannotated_paragraph: str,
        task_type: str,
        sampling_params: dict,
    ) -> Dict:
        """
        Generates synthetic instruction tuning pair using the Quantized Bonito model.
        This method takes an unannotated text, a task type, and sampling parameters,
        and generates synthetic input-output pair.


        Args:
            unannotated_paragraph (str): The unannotated text or a paragraph
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (dict): The parameters for
                sampling.


        Returns:
            Dict: The synthetic input-output pair for the task type.
        """


        text_dataset = Dataset.from_list([{"input": unannotated_paragraph}])


        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col="input"
        )


        outputs = self._generate_text(processed_dataset["input"], sampling_params)
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            output = outputs[i]
            example["prediction"] = output.strip()
            examples.append(example)


        synthetic_dataset = Dataset.from_list(examples)


        # filter out the examples that cannot be parsed
        synthetic_dataset_dict = self._postprocess_dataset(
            synthetic_dataset, context_col="input"
        ).to_list()[0]


        return synthetic_dataset_dict


    def _generate_text(
        self,
        dataset: Dataset,
        sampling_params: dict,
        ) -> List[str]:
        """
        Generate text using huggingface transformers generate function.


        This method takes a dataset of prompts, encodes them,
        generates text using the model, decodes the generated
        text, and appends it to a list.


        Args:
            dataset (Dataset): A dataset containing prompts for text generation.
            sampling_params (dict): Parameters for sampling during generation.


        Returns:
            List[str]: A list of generated texts corresponding to the prompts.
        """
        generated_texts = []


        for prompt in dataset:
            input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
            input_ids = input_ids.cuda()


            output = self.model.generate(
                input_ids,
                do_sample=True,
                **sampling_params
            )


            generated_text = self.tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
            generated_texts.append(generated_text)


        return generated_texts
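To see why quantization matters on a 16 GB card such as the T4, a rough estimate of weight memory helps. The sketch below rests on stated assumptions: a round 7 billion parameters for a Mistral-7B-based model, 4-bit weights for the AWQ checkpoint, and weights only (the KV cache, activations, and quantization metadata all add to the real footprint).

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 7e9  # assumption: round figure for a 7B-parameter model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # half-precision weights
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # assumption: 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions the half-precision weights alone take up most of a 16 GB card, while the 4-bit weights need about a quarter of that, which leaves room for the cache and activations during generation.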

Taking task_type "ynqa" (yes-or-no question answering) as an example, the generated result looks like this:

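from pprint import pprint

# bonito is a QuantizedBonito instance; unannotated_paragraph is the passage to convert
sampling_params = {'max_new_tokens':256, 'top_p':0.95, 'temperature':0.7, 'num_return_sequences':1}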
synthetic_dataset = bonito.generate_task(
    unannotated_paragraph,
    task_type="ynqa",
    sampling_params=sampling_params
)
pprint("----Generated Instructions----")
pprint(f'Input: {synthetic_dataset["input"]}')
pprint(f'Output: {synthetic_dataset["output"]}')


'----Generated Instructions----'
('Input: Based on the following passage, is a written communication '
 'confidential? 1. “Confidential Information”, whenever used in this '
 'Agreement, shall mean any data, document, specification and other '
 'information or material, that is delivered or disclosed by UNHCR to the '
 'Recipient in any form whatsoever, whether orally, visually in writing or '
 'otherwise (including computerized form), and that, at the time of disclosure '
 'to the Recipient, is designated as confidential.')
'Output: Yes'
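Each generated pair has an `input` and an `output` field, as the printout above shows. For fine-tuning, such pairs are commonly written out as JSON Lines. The sketch below is one way to do that; the `to_jsonl` helper, the `instruction`/`response` field names, and the sample pair are all hypothetical, so adapt them to whatever your training framework expects.

```python
import json


def to_jsonl(examples, instruction_key="instruction", response_key="response"):
    """Serialize input/output pairs as one JSON object per line."""
    lines = []
    for example in examples:
        record = {
            instruction_key: example["input"],
            response_key: example["output"],
        }
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical generated pair, shaped like the output shown in the article.
generated = [
    {
        "input": "Based on the following passage, does the agreement last two years? "
                 "This agreement lasts two years.",
        "output": "Yes",
    },
]
jsonl_text = to_jsonl(generated)
print(jsonl_text)
```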

The task types supported by task_type are:

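Extractive question answering (exqa): generates an answer to a question from a given text passage, extracting the answer directly from the text.
Multiple-choice question answering (mcqa): provides answers to a set of multiple-choice questions.
Question generation (qg): creates questions from the provided text.
Question answering without choices (qa): answers questions without multiple-choice options.
Yes-no question answering (ynqa): generates a yes or no answer to a question.
Coreference resolution (coref): identifies expressions in the text that refer to the same entity.
Paraphrase generation (paraphrase): rewrites a sentence or phrase with different wording while preserving its meaning.
Paraphrase identification (paraphrase_id): determines whether two sentences or phrases convey the same meaning.
Sentence completion (sent_comp): fills in the missing part of a sentence.
Sentiment analysis (sentiment): identifies the sentiment expressed in a text, such as positive, negative, or neutral.
Summarization (summarization): condenses a longer text into a shorter summary that captures the main points.
Text generation (text_gen): creates coherent, contextually relevant text from a prompt.
Topic classification (topic_class): assigns a text to one of a set of predefined topics.
Word sense disambiguation (wsd): determines the meaning of a word from its context.
Textual entailment (te): predicts whether a given text logically follows from another text.
Natural language inference (nli): determines the relationship between two texts, such as contradiction, entailment, or neutral.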
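The docstring of generate_tasks notes that a task type can be given in short form or full form. A small lookup that accepts either and rejects typos before any GPU time is spent might look like the sketch below. The short codes come from the list above; the full-form strings are descriptive labels taken from the same list, not necessarily the exact strings Bonito accepts, and `normalize_task_type` is a hypothetical helper.

```python
TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment analysis",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}


def normalize_task_type(task_type: str) -> str:
    """Return the short code for a task type given in short or full form."""
    key = task_type.strip().lower()
    if key in TASK_TYPES:
        return key
    for short, full in TASK_TYPES.items():
        if key == full:
            return short
    raise ValueError(
        f"Unknown task type {task_type!r}; expected one of {sorted(TASK_TYPES)}"
    )


print(normalize_task_type("ynqa"))
print(normalize_task_type("Natural Language Inference"))
```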

In terms of performance, Bonito outperformed the GPT-4-based approach on two of the three datasets compared.

Summary:

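Compared with generating labeled samples with GPT-4, Bonito, a model fine-tuned specifically for dataset generation, supports zero-shot sample generation and is built on an open-source model, which gives it clear advantages in openness, cost, and performance.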

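As fine-tuning becomes more widespread, sample quality and the cost of producing data will likely receive more and more attention, and dataset-generation models such as Bonito should see further development.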

Reposted from AI工程化. Author: ully
