RLHF (Reinforcement Learning from Human Feedback) [devfest Cloud 2023]

oneonlee 2023. 12. 10. 17:37

This post summarizes the session "How to do supervised tuning for a language model using Vertex AI," presented by Erwin Huizenga, ML Tech Lead at Google, at Devfest Cloud 2023 in December 2023.

Posts in the "How to do supervised tuning for a language model using Vertex AI" series
- (1) Why Adapter Tuning?
- ~~(2) Supervised Fine Tuning~~
- (3) RLHF (Reinforcement Learning from Human Feedback)

0. Papers on RLHF

[NeurIPS 2017] Deep Reinforcement Learning from Human Preferences (https://arxiv.org/abs/1706.03741)

[NeurIPS 2022] Training language models to follow instructions with human feedback (https://arxiv.org/abs/2203.02155)

1. What is RLHF?

Reinforcement learning from human feedback (RLHF) is a technique for training large language models, and it has played a central role in OpenAI's ChatGPT and InstructGPT, DeepMind's Sparrow, and Anthropic's Claude. RLHF uses reinforcement learning to optimize a language model directly against human feedback.
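
As a rough illustration of how human feedback can become a training signal, the sketch below shows the pairwise preference loss commonly used to fit a reward model: given scalar scores for a human-preferred response and a rejected one, the loss is `-log(sigmoid(r_chosen - r_rejected))`. The function name and the scalar-score interface are assumptions made for this example; the session itself did not show this code.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Hypothetical helper for illustration: the inputs stand for scalar scores
    that a reward model assigns to two responses to the same prompt.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the preferred response is scored further above the other.
loss_tie = pairwise_preference_loss(0.0, 0.0)      # equals log(2)
loss_correct = pairwise_preference_loss(3.0, 0.0)  # smaller than log(2)
loss_wrong = pairwise_preference_loss(0.0, 3.0)    # larger than log(2)
```

Minimizing this loss over many human comparisons pushes the reward model to rank responses the way the raters did.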

It is important to strike a balance so that the tuned model over-fits to neither the pre-trained model nor the reward model.
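
One common way to express this balance, used for example in the InstructGPT paper listed above, is to subtract a KL penalty from the reward model score, so that the policy is rewarded for pleasing the reward model but penalized for drifting away from a reference model. The sketch below is a minimal version under stated assumptions: the function name, the per-token log-probability inputs, and the default coefficient are illustrative, and the pre-trained model is assumed to serve as the reference.

```python
from typing import Sequence


def kl_penalized_reward(
    reward_model_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    kl_coeff: float = 0.1,  # illustrative default, not a recommended value
) -> float:
    """Return the reward model score minus kl_coeff times a KL estimate.

    The KL estimate is the sum over generated tokens of
    log pi_policy(token) - log pi_reference(token).
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_model_score - kl_coeff * kl_estimate


# Policy identical to the reference: no penalty is applied.
unchanged = kl_penalized_reward(2.0, [-1.0, -0.5], [-1.0, -0.5])
# Policy has drifted (its tokens are more likely than under the reference).
drifted = kl_penalized_reward(2.0, [-0.5, -0.5], [-1.0, -1.5])
```

A larger `kl_coeff` keeps the tuned model closer to the reference, while a smaller one lets it chase the reward model more aggressively.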

2. RLHF Performance

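Win Rate (using AutoSxS)

| Baseline model | Baseline win rate | RLHF-tuned model | RLHF win rate |
|---|---|---|---|
| SFT-T5-11B | 18.15% | RLHF-T5-11B | 81.85% |
| Flan-T5-11B (prompted) | 3.97% | RLHF-T5-11B | 96.03% |
| Text-Bison (prompted) | 16.04% | RLHF Text-Bison (prompted) | 83.96% |

Win Rate (using human raters)

| Baseline model | Baseline win rate | RLHF-tuned model | RLHF win rate |
|---|---|---|---|
| Text-Bison (prompted) | 20.12% | RLHF Text-Bison (prompted) | 79.88% |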

[Internal Benchmarks]
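
For readers unfamiliar with the metric, a win rate in a side-by-side comparison is simply the share of comparisons in which a rater (human or automatic) preferred one model's response. The sketch below shows the arithmetic on made-up toy judgments; the function name and the "A"/"B" labels are assumptions for illustration and are unrelated to the benchmark data above.

```python
from collections import Counter
from typing import Dict, Iterable


def win_rates(judgments: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of comparisons won by model "A" and model "B".

    Each judgment is the label of the preferred model for one prompt.
    """
    counts = Counter(judgments)
    unknown = set(counts) - {"A", "B"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = counts["A"] + counts["B"]
    if total == 0:
        raise ValueError("no judgments given")
    return {label: 100.0 * counts[label] / total for label in ("A", "B")}


# Toy data: the rater preferred model B on 4 of 5 prompts.
toy_judgments = ["B", "B", "A", "B", "B"]
rates = win_rates(toy_judgments)
```

With only two models and no ties, the two percentages always sum to 100, which is why each row of a side-by-side table adds up to 100%.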

3. RLHF tuning with Vertex AI and conclusion

Conclusion(?): Implementing an RLHF pipeline from scratch on your own is hard, so try the Vertex AI pipeline instead.
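
To give a feel for what configuring such a pipeline might involve, the sketch below only assembles a dictionary of pipeline parameters in plain Python; it does not call Google Cloud. The key names follow the RLHF sample notebook listed in the resources below as best recalled, so treat them as assumptions and check them against the current documentation. The helper name, bucket paths, model reference, and default step counts are all hypothetical.

```python
from typing import Any, Dict


def build_rlhf_parameter_values(
    preference_dataset: str,
    prompt_dataset: str,
    large_model_reference: str,
    instruction: str,
    reward_model_train_steps: int = 1000,           # hypothetical default
    reinforcement_learning_train_steps: int = 300,  # hypothetical default
    kl_coeff: float = 0.1,                          # hypothetical default
) -> Dict[str, Any]:
    """Assemble parameters for an RLHF tuning pipeline run (assumed key names)."""
    for name, uri in (
        ("preference_dataset", preference_dataset),
        ("prompt_dataset", prompt_dataset),
    ):
        if not uri.startswith("gs://"):
            raise ValueError(f"{name} must be a Cloud Storage URI, got {uri!r}")
    if kl_coeff < 0:
        raise ValueError("kl_coeff must be non-negative")
    return {
        "preference_dataset": preference_dataset,
        "prompt_dataset": prompt_dataset,
        "large_model_reference": large_model_reference,
        "instruction": instruction,
        "reward_model_train_steps": reward_model_train_steps,
        "reinforcement_learning_train_steps": reinforcement_learning_train_steps,
        "kl_coeff": kl_coeff,
    }


params = build_rlhf_parameter_values(
    preference_dataset="gs://my-bucket/rlhf/preferences/*.jsonl",  # hypothetical path
    prompt_dataset="gs://my-bucket/rlhf/prompts/*.jsonl",          # hypothetical path
    large_model_reference="text-bison@001",                        # assumed model name
    instruction="Summarize in less than 50 words",
)
# In a real run, a dictionary like this would be passed as `parameter_values`
# when creating a google.cloud.aiplatform.PipelineJob from a compiled pipeline.
```

The `kl_coeff` entry is where the balance between the reward model and the original model, discussed in section 1, would be tuned.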

4. Related resources

- Tune language foundation models | Vertex AI | Google Cloud (cloud.google.com)

- Tune text models by using RLHF tuning | Vertex AI | Google Cloud (cloud.google.com)

- GoogleCloudPlatform GitHub - vertex-ai-samples/notebooks/official/generative_ai/rlhf_tune_llm.ipynb

- GoogleCloudPlatform GitHub - generative-ai

Disclaimer: This post was written by an individual. Its content, and the opinions expressed at the 'devfest Cloud 2023' event, do not represent Google.

License: CC BY-NC-SA (Attribution-NonCommercial-ShareAlike)