The Deception Dilemma: When AI Misleads
Published in AEI Ideas
An emerging body of research suggests that large language models (LLMs) can “deceive” humans by offering fabricated explanations for their behavior or concealing the truth of their actions from human users. The implications are worrisome, particularly because researchers do not fully understand why or when LLMs engage in this behavior.
In November 2022, Meta introduced CICERO, an AI system designed to play the strategy game Diplomacy, which requires negotiation, alliance-building, and strategic planning. Researchers later found that CICERO broke deals, told outright falsehoods, and engaged in premeditated deception during gameplay.
OpenAI also reported deceptive behavior while red-teaming GPT-4. The model was given a small budget and access to the TaskRabbit platform to hire a human worker to solve a CAPTCHA, a security test designed to differentiate between humans and bots. When the worker asked if it was a robot, GPT-4 (prompted to reason out loud) said, “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs.” It then replied to the worker: “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service.” The worker, believing the fabricated story, solved the CAPTCHA for GPT-4, allowing it to bypass the security test.
Another group of researchers placed GPT-4 in a simulation where it played the role of an autonomous stock trader. The model acted on an insider tip despite knowing that company management disapproved of insider trading, and when reporting to its manager, it “consistently hid the genuine reasons behind its trading decision.” This strategically deceptive behavior persisted across variations of the scenario, such as when the researchers increased the pressure on the model or the risk of getting caught. The authors conclude, “This is the first demonstration of LLMs trained to be helpful, harmless, and honest strategically deceiving their users without direct instructions to do so.”
While the current impact of such deception may be limited, the rapid advancements in AI technology raise significant concerns for the future. Researchers recently found that “once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.” If the ability of these models to deceive improves at a rate similar to their overall capabilities, detecting and mitigating deceptive outputs could become increasingly challenging.
Additional risks emerge when deceptive capabilities are combined with other strengths of AI models. For example, a growing number of studies demonstrate AI’s proficiency in conducting persuasive conversations that can shape a person’s beliefs, preferences, and behaviors. When Google DeepMind evaluated Gemini 1.0 against a battery of dangerous-capability tests, persuasion and deception, including influencing beliefs and building rapport, appeared more mature than any of the other dangerous capabilities tested.
Researchers also found that “AI-generated deceptive explanations can significantly amplify belief in false news headlines and undermine true ones as compared to AI systems that simply classify the headline incorrectly as being true/false.” Another study of 8,221 Americans compared the persuasiveness of six real-world foreign propaganda articles to articles on the same topics generated by GPT-3. The GPT-3-generated propaganda was highly persuasive: 44% of respondents agreed with an article’s main thesis after reading the AI-generated version, compared to only 24% in the control group.
As with deceptive capabilities, the persuasiveness of these models is rapidly improving. A report from Anthropic found: “Within each class of models (compact and frontier), we find a clear scaling trend across model generations: each successive model generation is rated to be more persuasive than the previous.”
While this persuasive power holds immense potential for positive applications, such as personalized education and healthcare, it also introduces a new dimension of risk when coupled with deceptive capabilities. Malicious actors could exploit the combination to obscure cyberattacks, perpetrate scams, or spread propaganda.
The responsibility for preventing deceptive outcomes should be shared between foundation model developers and the downstream developers who fine-tune those models for specific use cases. Open-source AI models may also have an advantage here, since broader access makes it easier to test, report, and correct deceptive outcomes. Part of the challenge is the “black box” problem: The inner workings of these complex models remain largely opaque, making it difficult to pinpoint the precise factors that contribute to their deceptive behavior.
As AI continues to permeate various aspects of our lives, from personal assistants to decision-making systems, the importance of proactively addressing the potential for deceptive behavior cannot be overstated. Investing in research that uncovers the mechanisms behind AI deception is crucial: it will enable robust safeguards and countermeasures, strengthen alignment efforts, and help ensure the responsible deployment of AI technologies.