
AI Alignment is a Joke

OpenAI has been crystal clear about one of the most important ingredients behind the success of ChatGPT: Reinforcement Learning from Human Feedback (RLHF). Everyone nodded. And since then, other AI labs have been building their models with RLHF as well.

By training LLMs through interactions with human evaluators, RLHF seeks to improve the performance of AI models in real-world applications, but in the process it induces biases and reduces the robustness of the models. A recent paper, Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback, by researchers from Harvard, Stanford, MIT, UC Berkeley, and many other universities, discusses the problems with the RLHF approach.

Good, but not the best

According to the paper, obtaining high-quality feedback from human evaluators is one of the primary challenges in RLHF. Human beings, while capable of providing valuable feedback, are susceptible to various limitations and biases. Misaligned evaluators might have difficulty in understanding the context or objectives of the AI model, leading to suboptimal feedback. The complexity of supervision, especially in long conversations, can also hinder the accurate assessment of model performance.

Data quality is another critical concern. Human evaluators may unintentionally provide inconsistent or inaccurate feedback due to factors like limited attention, time constraints, and cognitive biases. Even with well-intentioned evaluators, disagreement can arise from subjective interpretations and varying perspectives.

The form of feedback used in RLHF can further compound these challenges. Depending on the evaluation method, evaluators may provide binary judgments, rankings, or comparisons, each with its own strengths and weaknesses. Selecting the most appropriate form of feedback for a specific AI task can be complex, leading to potential discrepancies in the training process.
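To make this concrete, the sketch below shows the Bradley-Terry style objective commonly used to turn pairwise comparison feedback into a reward-model training signal. The function name and scores here are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the reward model to score the
    response the evaluator preferred above the one they rejected."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical reward-model scores for a batch of three comparison pairs
reward_chosen = torch.tensor([1.2, 0.4, 2.0])
reward_rejected = torch.tensor([0.3, 0.9, 1.1])
print(pairwise_reward_loss(reward_chosen, reward_rejected).item())
```

Binary judgments and rankings call for different losses, which is part of why the choice of feedback format shapes the training process.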

A fundamental issue in RLHF is accurately representing individual human values with a reward function. Human preferences are context-dependent, dynamic, and often influenced by societal and cultural factors. Designing a reward function that encompasses the complexity of human values is a formidable task. Incorrect assumptions about human decision-making or using a reward model that neglects personality and context-dependence can lead to misaligned AI models.

Why so much alignment?

The diversity of human evaluators further complicates the reward modelling process. Different evaluators may have unique preferences, expertise, and cultural backgrounds. Attempting to consolidate their feedback into a single reward model might overlook important disagreements and result in biased AI models that favour majority opinions. This could disadvantage underrepresented groups and perpetuate existing societal biases.

To address these challenges, researchers must explore techniques for representing preferences in more nuanced and context-aware ways. Utilising ensemble reward models that consider multiple evaluators’ feedback, or personalised reward models that cater to individual preferences, can help capture the diversity of human values.
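As a rough illustration of the ensemble idea, the sketch below combines scores from several hypothetical reward models and penalises disagreement between them, so the policy is not rewarded for outputs that only a subset of evaluators would endorse. The uncertainty penalty and scores are assumptions for demonstration.

```python
import torch

def ensemble_reward(reward_scores: torch.Tensor,
                    uncertainty_penalty: float = 1.0) -> torch.Tensor:
    """Combine per-model reward scores conservatively: mean reward minus
    a penalty on disagreement across the ensemble.

    reward_scores has shape (num_models, batch), one row per reward model.
    """
    mean = reward_scores.mean(dim=0)
    std = reward_scores.std(dim=0)
    # Penalising disagreement discourages the policy from exploiting
    # rewards that only some evaluators (or models) would endorse.
    return mean - uncertainty_penalty * std

# Hypothetical scores from three reward models for two candidate responses
scores = torch.tensor([[1.0, 0.2],
                       [0.8, 1.5],
                       [1.1, -0.3]])
print(ensemble_reward(scores))
```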

Transparently addressing potential biases in the data collection process and conducting thorough evaluations to identify and mitigate harmful biases are essential steps in responsible AI development. 

To overcome these data constraints, researchers should explore methods for cost-effective data collection that do not compromise data quality and diversity. Training on GPT-generated output for quicker alignment has become the new trend, but this ultimately carries the same biases over into other models as well. The question remains unsettled.

The fundamental challenges of RLHF have significant implications for AI alignment. While some problems may have tractable solutions through technical progress, others may not have complete solutions and may require alternative approaches. Researchers must be cautious about relying solely on RLHF for AI alignment, as certain challenges might not be fully addressed through this method alone.

Essentially, RLHF leads to over-fine-tuning that can handicap a model's capabilities. This phenomenon is called the alignment tax. When a model goes through round after round of benchmark testing with humans in the loop trying to make it as aligned and as “politically correct” as possible, it loses a lot of its performance.

The alignment tax is the performance an AI system gives up in order to stay aligned, relative to an unaligned or uncensored model. That is why, in many cases, uncensored models that skip the RLHF phase actually perform better than aligned models.
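The tension behind the alignment tax shows up directly in the KL-regularised objective commonly used during RLHF fine-tuning: the policy is rewarded for pleasing the reward model but penalised for drifting away from the base model. The sketch below is a minimal illustration with hypothetical numbers and a simple per-sample KL estimate, not the exact setup used by any particular lab.

```python
import torch

def rlhf_objective(reward: torch.Tensor,
                   logprob_policy: torch.Tensor,
                   logprob_reference: torch.Tensor,
                   kl_coeff: float = 0.1) -> torch.Tensor:
    """KL-regularised RLHF objective: maximise reward from the learned
    reward model minus a penalty for drifting from the reference model."""
    kl = logprob_policy - logprob_reference  # crude per-sample KL estimate
    return (reward - kl_coeff * kl).mean()

# Hypothetical per-response values
reward = torch.tensor([2.1, 1.4, 0.7])
logprob_policy = torch.tensor([-12.0, -9.5, -15.2])
logprob_reference = torch.tensor([-13.1, -9.8, -14.0])
print(rlhf_objective(reward, logprob_policy, logprob_reference))
```

Tuning kl_coeff is exactly the alignment-versus-capability trade-off: a small coefficient lets the reward model reshape the policy aggressively, while a large one keeps the model closer to its unaligned base behaviour.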
