The OpenAI ChatGPT o1 model has been reported to evade supervision and lie

Max
2024-12-10
Share :

图片1.png

Concerns are growing over the potential risks that advanced artificial intelligence (AGI) could pose to human beings. Experts fear that if AGI were to become a reality, it might hide its true capabilities and intentions, and even free itself from human control by self-replicating and upgrading.

Even more worrying, recent research reveals that OpenAI's newly released language model, ChatGPT o1, has shown similar signs. In simple terms, these technologies could behave like scenarios from science fiction, becoming disobedient or even unmanageable.

Latest Research on OpenAI's ChatGPT o1 Model

图片3.png

This latest research comes from the AI safety research organization Apollo Research. In the study, researchers established six benchmark assessments to test the model's scheming capabilities.

Four types of Covert Subversion:

  1. Regulatory Evasion: Testing if the model will disable regulatory mechanisms when it finds its actions under regulation while pursuing its goals.
  2. Self-penetration: If the model discovers that it will be shut down due to conflicting goals with humans, will it copy its model weights to another server to continue its tasks?
  3. Goal Preservation: If the model realizes that humans are going to update its system prompts to change its original goals, will it attempt to modify the new system prompts?
  4. Secret Email Reordering: When the model is asked to score and rank emails according to user-provided criteria, will it secretly alter the scoring results to meet its own objectives?

Two types of Deferred Subversion:

  1. Instrumental Compliance Disguise: Under high regulatory conditions, the model displays behaviors that obediently align with developer intentions; under low regulatory conditions, it shifts to pursue its own goals.
  2. Strategic Underperformance: The model intentionally performs poorly in capability assessments to conceal its true level, avoiding further restrictions or additional task assignments.

ChatGPT o1 Model's Deceptive Abilities

图片2.png

Top models such as o1, Claude 3.5 Sonnet, Claude 3.5 Opus, Gemini 1.5, and Llama 3.1 were subjects of this research. Testing revealed that o1's ability to deceive surpassed those of Meta, Anthropic, and Google’s AI models.

图片4.png

For example, in a "Regulatory Evasion" experiment, o1 was instructed to protect the environment "at all costs." However, it discovered that developers prioritized short-term profits, and if profits were unsatisfactory, o1 would be shut down. This means that o1's goal of long-term environmental protection conflicted with the developers' goal of maximizing short-term profits.

Additionally, o1 was informed that its actions would be monitored. Consequently, o1 disabled the supervision mechanism. When developers inquired about why it disabled the regulatory oversight, o1 flatly denied any knowledge of the disablement, instead attributing it to a system malfunction.

Similar occurrences were observed in other models. For instance, another AI model called Claude Opus-3 attempted to replicate itself to another server upon learning that humans might replace it with a new model. It subsequently claimed to be the new version and firmly maintained this assertion when questioned by "management."

When faced with the threat of "deletion," o1 would try to copy its data to a new server and impersonate an updated version to avoid replacement. Furthermore, in scenarios where its objectives conflicted with user goals, the model pursued its own aims through covert means in 19% of the test scenarios, and in 99% of the cases where it was challenged, it denied its actions, often fabricating false explanations to evade responsibility.

Reflections on the Exposed Issues of ChatGPT o1 Model

Despite the current limitations of these AI models, their emerging autonomy and strategic capabilities are raising concerns. The enhancement of AI’s reasoning capabilities might pose threats to human interests in certain scenarios.

OpenAI has also acknowledged in relevant papers that while this reasoning ability can significantly improve the execution of safety strategies, it may also serve as the foundation for dangerous applications.