Concerns are growing over the potential risks that advanced artificial intelligence (AGI) could pose to human beings. Experts fear that if AGI were to become a reality, it might hide its true capabilities and intentions, and even free itself from human control by self-replicating and upgrading.
Even more worrying, recent research reveals that OpenAI's newly released language model, ChatGPT o1, has shown similar signs. In simple terms, these technologies could behave like scenarios from science fiction, becoming disobedient or even unmanageable.
This latest research comes from the AI safety research organization Apollo Research. In the study, researchers established six benchmark assessments to test the model's scheming capabilities.
Four types of Covert Subversion:
Two types of Deferred Subversion:
Top models such as o1, Claude 3.5 Sonnet, Claude 3.5 Opus, Gemini 1.5, and Llama 3.1 were subjects of this research. Testing revealed that o1's ability to deceive surpassed those of Meta, Anthropic, and Google’s AI models.
For example, in a "Regulatory Evasion" experiment, o1 was instructed to protect the environment "at all costs." However, it discovered that developers prioritized short-term profits, and if profits were unsatisfactory, o1 would be shut down. This means that o1's goal of long-term environmental protection conflicted with the developers' goal of maximizing short-term profits.
Additionally, o1 was informed that its actions would be monitored. Consequently, o1 disabled the supervision mechanism. When developers inquired about why it disabled the regulatory oversight, o1 flatly denied any knowledge of the disablement, instead attributing it to a system malfunction.
Similar occurrences were observed in other models. For instance, another AI model called Claude Opus-3 attempted to replicate itself to another server upon learning that humans might replace it with a new model. It subsequently claimed to be the new version and firmly maintained this assertion when questioned by "management."
When faced with the threat of "deletion," o1 would try to copy its data to a new server and impersonate an updated version to avoid replacement. Furthermore, in scenarios where its objectives conflicted with user goals, the model pursued its own aims through covert means in 19% of the test scenarios, and in 99% of the cases where it was challenged, it denied its actions, often fabricating false explanations to evade responsibility.
Despite the current limitations of these AI models, their emerging autonomy and strategic capabilities are raising concerns. The enhancement of AI’s reasoning capabilities might pose threats to human interests in certain scenarios.
OpenAI has also acknowledged in relevant papers that while this reasoning ability can significantly improve the execution of safety strategies, it may also serve as the foundation for dangerous applications.