Improving Language Models to Incorporate Self-Enhancement Autonomously

Leveraging the implicit signal in human preference data, rather than explicitly defining improvement criteria in prompts, can make language models better at improving their own responses.

Researchers have introduced a new approach called PIT (imPlicit Self-ImprovemenT) that allows large language models (LLMs) to implicitly learn self-improvement from human preference data instead of relying on explicit prompts.

The key insight behind this innovation is that the preference data used to train the LLM already provides implicit guidance on what constitutes an improvement in quality. By employing curriculum reinforcement learning, PIT starts with easy-to-improve references, such as human-labeled bad responses, and then switches to the LLM's own samples for further enhancement.
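
To make the two-stage curriculum concrete, here is a minimal sketch of the training loop. All names (gap_reward, train_step, curriculum_train) are ours, the reward is a random stand-in for a learned model, and a real run would apply a PPO-style policy update where the comment indicates; the sketch only illustrates how stage one uses human-labeled bad responses as references before stage two switches to the model's own samples.

```python
import random

def gap_reward(prompt, candidate, reference):
    """Placeholder for the learned gap reward model: scores how much
    `candidate` improves on `reference` for this prompt."""
    return random.random()

def train_step(model, prompt, reference, temperature=0.5):
    """One policy step: sample an improved response conditioned on the
    reference and score the quality gap. In a real run the reward would
    drive a PPO-style policy update; here we only show the data flow."""
    candidate = model(prompt, reference, temperature)
    reward = gap_reward(prompt, candidate, reference)
    return candidate, reward

def curriculum_train(model, human_bad_pairs, prompts, epochs=(1, 1)):
    # Stage 1: easy-to-improve references -- human-labeled bad responses.
    for _ in range(epochs[0]):
        for prompt, bad_response in human_bad_pairs:
            train_step(model, prompt, bad_response)
    # Stage 2: harder references -- the model's own samples.
    for _ in range(epochs[1]):
        for prompt in prompts:
            own_sample = model(prompt, None, 0.5)
            train_step(model, prompt, own_sample)

# Toy usage with a stand-in "model" so the control flow runs end to end.
toy_model = lambda prompt, reference, temperature: f"improved answer to: {prompt}"
curriculum_train(
    toy_model,
    human_bad_pairs=[("How do I sort a list in Python?", "idk just google it")],
    prompts=["How do I sort a list in Python?"],
)
```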

Comprehensive experiments have validated PIT's capabilities on two real-world dialog datasets, as well as one synthetic instruction-following dataset. Across conditions, PIT improved response quality by 7-34% compared to the original LLM samples, as measured by third-party evaluator models.

One of the critical aspects of PIT is its ability to train a reward model to judge quality gaps without hand-engineering criteria into prompts. Rather than learning from manually crafted rubric prompts, the model learns from human preference signals that compare candidate responses to the same prompt. It implicitly improves by adjusting its behavior to maximize alignment with the preferred responses, as judged by humans or simulated human labelers.
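
The sketch below shows one way such a gap reward model could be trained from preference pairs. The toy MLP over fixed-size embeddings and the pairwise logistic loss are our assumptions for illustration; the paper builds its reward model on the LLM itself rather than a small scorer like this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GapRewardModel(nn.Module):
    """Scores how much `candidate` improves on `reference` for a given prompt."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(3 * embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, prompt_emb, candidate_emb, reference_emb):
        feats = torch.cat([prompt_emb, candidate_emb, reference_emb], dim=-1)
        return self.scorer(feats).squeeze(-1)  # predicted quality gap

def preference_loss(model, prompt_emb, preferred_emb, rejected_emb):
    """Pairwise loss: the preferred response should get a larger gap score over
    the rejected one than the reverse ordering. No criteria are hand-written
    into a prompt; the signal comes entirely from the preference pairs."""
    gap_pos = model(prompt_emb, preferred_emb, rejected_emb)
    gap_neg = model(prompt_emb, rejected_emb, preferred_emb)
    return -F.logsigmoid(gap_pos - gap_neg).mean()

# Toy usage with random embeddings standing in for encoded text.
rm = GapRewardModel()
p, w, l = (torch.randn(4, 768) for _ in range(3))
loss = preference_loss(rm, p, w, l)
loss.backward()
```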

The standard Reinforcement Learning from Human Feedback (RLHF) objective optimizes an LLM policy to maximize the expected quality of generated responses. PIT reformulates this objective: conditioned on a reference response, the policy is trained to maximize the quality gap between its new response and that reference.
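
For concreteness, the two objectives can be written as follows. The notation is ours rather than the paper's, and the usual KL regularization toward the original policy is omitted for brevity.

```latex
% Standard RLHF: maximize the expected quality of a response y to prompt x.
\max_{\pi}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
  \bigl[\, r(x, y) \,\bigr]

% PIT: condition generation on a reference response y_ref and maximize the
% quality gap judged by a learned gap reward model r_gap.
\max_{\pi}\;\; \mathbb{E}_{(x,\, y_{\mathrm{ref}}) \sim \mathcal{D},\;
  y \sim \pi(\cdot \mid x,\, y_{\mathrm{ref}})}
  \bigl[\, r_{\mathrm{gap}}(x, y, y_{\mathrm{ref}}) \,\bigr]
```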

Ablation studies confirm the importance of the full curriculum reinforcement learning procedure. Removing either the first stage of easy examples or the second stage of improving the LLM's own samples substantially degrades performance. The second stage of improving samples drawn from the LLM itself is crucial but highly challenging.

These techniques open the door to LLMs that continuously align better with human values as they learn from experience. PIT can learn nuanced objectives, such as making responses more helpful, harmless, or relevant, without prompts that explicitly define those criteria. In practice, lower sampling temperatures of around 0.4-0.6 work best for PIT, restricting diversity so that generation stays focused on improving the reference.
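
As a purely illustrative example of that sampling setting, the snippet below generates an "improved" response at temperature 0.5 with Hugging Face transformers. Here "gpt2" merely stands in for a PIT-trained improver model, and the prompt format (reference response appended to the instruction) is an assumption.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = (
    "Instruction: Explain photosynthesis to a child.\n"
    "Reference response: Plants eat sunlight.\n"
    "Improved response:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.5,   # lower temperature (~0.4-0.6) keeps generation focused
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```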

Reducing reliance on human intervention also facilitates expanding access to LLMs by allowing them to adapt to niche domains or under-served use cases that lack resources for oversight. With PIT, the need for manual distillation of criteria into prompts is minimized, as this implicit information can be leveraged instead.

In summary, PIT enables language models to learn self-improvement directly from human preference data rather than from hand-written prompting criteria. This matters because autonomous self-improvement will only become more important as these models grow in capability and are deployed in sensitive real-world applications.

To recap: the PIT technique allows large language models to implicitly learn self-improvement from human preference data rather than relying on explicit prompts. By training a reward model to judge quality gaps without hand-engineered criteria, PIT lets models learn from preference signals that compare candidate responses, adjusting their behavior to maximize alignment with what human (or simulated human) labelers prefer.
