score = log2 (1 + upvotes) rounded to the nearest integer, plus 1 if the questioner accepted the answer (we assign a score of −1 if the number of upvotes is negative).
对奖励模型,我们将看到每个问题总是需要两个答案对比。有些问题有很多答案,可以产生很多对,我们只取十个以限制每个问题的数据量。最后,我们把格式从 HTML 转化到 Markdown 以提高输出的可读性。你可以看到数据集和处理过程的 [笔记本]。(https://huggingface.co/datasets/lvwerra/stack-exchange-paired。)
# load model in 8bit model = AutoModelForCausalLM.from_pretrained( args.model_path, load_in_8bit=True, device_map={"": Accelerator().local_process_index} ) model = prepare_model_for_int8_training(model)
# add LoRA to model lora_config = LoraConfig( r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM", )
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)): question_tensors = batch["input_ids"]
# sample from the policy and generate responses response_tensors = ppo_trainer.generate( question_tensors, return_prompt=False, length_sampler=output_length_sampler, **generation_kwargs, ) batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
# Compute sentiment score texts = [q + r for q, r in zip(batch["query"], batch["response"])] pipe_outputs = sentiment_pipe(texts, **sent_kwargs) rewards = [torch.tensor(output[0]["score"] - script_args.reward_baseline) for output in pipe_outputs]
# Run PPO step stats = ppo_trainer.step(question_tensors, response_tensors, rewards) # Log stats to WandB ppo_trainer.log_stats(stats, batch, rewards)
@misc {beeching2023stackllama, author = { Edward Beeching and Younes Belkada and Kashif Rasul and Lewis Tunstall and Leandro von Werra and Nazneen Rajani and Nathan Lambert }, title = { StackLLaMA: An RL Fine-tuned LLaMA Model for Stack Exchange Question and Answering }, year = 2023, url = { https://huggingface.co/blog/stackllama }, doi = { 10.57967/hf/0513 }, publisher = { Hugging Face Blog } }
感谢
我们感谢 Philipp Schmid 分享了他对文本生成绝妙的 demo, 我们的 demo 也是基于他的。我们也感谢 Omar Sanseviero 和 Louis Castricato 对我们博客的草稿提供宝贵详尽的反馈。
英文原文: https://hf.co/blog/stackllama
作者: Edward Beeching, Kashif Rasul, Younes Belkada, Lewis Tunstall, Leandro von Werra Nazneen Rajani, Nathan Lambert
Nacos /nɑ:kəʊs/ 是 Dynamic Naming and Configuration Service 的首字母简称,一个易于构建 AI Agent 应用的动态服务发现、配置管理和AI智能体管理平台。Nacos 致力于帮助您发现、配置和管理微服务及AI智能体应用。Nacos 提供了一组简单易用的特性集,帮助您快速实现动态服务发现、服务配置、服务元数据、流量管理。Nacos 帮助您更敏捷和容易地构建、交付和管理微服务平台。
Sublime Text
Sublime Text具有漂亮的用户界面和强大的功能,例如代码缩略图,Python的插件,代码段等。还可自定义键绑定,菜单和工具栏。Sublime Text 的主要功能包括:拼写检查,书签,完整的 Python API , Goto 功能,即时项目切换,多选择,多窗口等等。Sublime Text 是一个跨平台的编辑器,同时支持Windows、Linux、Mac OS X等操作系统。