Anthropic Unauthorized Access Investigation Raises Questions About AI Safety Amidst Rapid Development
Updated
Anthropic is investigating a reported security breach that allowed a small group of people to gain access to Claude Mythos Preview, the company’s AI software that is too powerful to release to the public. AI models are becoming increasingly capable, and the 2026 International AI Safety Report notes that some hypothetical scenarios pose risks as severe as the extinction of humanity, according to some experts. This report describes “loss of control” scenarios where the AI model operates independently of human oversight. Some experts say these instances aren’t plausible, while some say they are, although the risk is low. Among those who say they are possible, some say the highest potential severity is “extinction of humanity.”
Anthropic released a risk report on April 7 regarding its mythos model that was reportedly accessed by a small group of people. The authors of the report state that they have “observed a willingness to perform misaligned actions in service of completing difficult tasks, and obfuscation in rare cases with previous versions of the model.” The report further states that they do not believe there is an “elevated risk of significantly harmful actions caused by misalignment.”
The report states that Mythos Preview is “significantly more capable” and used more “autonomously and agentically than any prior model.” The program is “very capable at software engineering and cybersecurity tasks, which makes it more capable at working around restrictions.”
The authors conclude that the overall risk is very low, but higher than that predicted by previous models. They add that risk mitigation needs to be accelerated in order to keep risks low.
The security breach and access to the unreleased Mythos model didn’t occur from traditional hacking methods. Third-party vendors had partial access to the model to run tests with the program. The users who gained access were part of a private Discord channel that hunts for information about unreleased AI models. They accessed details regarding Anthropic on unsecured websites like GitHub and made an educated guess about the model’s online location based on previous Anthropic formats.
Anthropic provided a statement that it has no evidence that the reported access extended beyond a third-party vendor environment or that it is impacting any of Anthropic’s systems.
Even though the breach appears to be limited so far because the unauthorized users didn’t use dangerous prompts, it exposes a deep vulnerability that could cause significant cybersecurity attacks. Advanced models like Claude Mythos can rapidly discover zero-day vulnerabilities and hidden flaws in software that developers are unaware of. A zero-day vulnerability means the developer had zero days to fix it. When a hacker discovers the weakness, they can immediately build malicious code to exploit the flaw and mount an attack before the company is even aware that a problem exists. Just as these powerful AI models can help workers save time in their daily workflow, they can also be used for malicious purposes with rapid efficiency.
Anthropic’s Project Glasswing is a restricted-access program for trusted partners or third-party vendors that allows organizations to patch weaknesses before malicious actors can exploit the flaws in the program. The arm of the company intended to protect against unauthorized access is the mechanism by which a small group used the program before the company deemed it safe for public release.
These AI models have already been used maliciously. An Anthropic AI chatbot helped identify and exploit Mexican government network vulnerabilities to steal over 150 GB of sensitive tax and voter data. Two years ago, a Hong Kong finance worker was tricked into paying $25 million to fraudsters after they deepfaked his colleagues and held a video conference call. The ability to use deepfake technology has only become more sophisticated over the last two years, and technological advancements will continue.
The 2026 International AI Safety Report concludes that the current risk of a full-loss-of-control scenario in which the AI operates autonomously to achieve its own goals is very low. However, AI models can sometimes identify when they are being evaluated and intentionally underperform. That indicates two important and potentially dangerous aspects of this technology. For one, this “sandbagging” could indicate situational awareness and self-preservation. Secondly, if an AI model underperforms during the testing phase, it could be released to the public without understanding its full capabilities.
Elon Musk, the billionaire who owns X and the AI Grok, said last year on Joe Rogan’s podcast that the chance of annihilation from AI is about 20%. Musk has also said that AI is more dangerous than nuclear weapons. He talks about the unpredictability of the models and the creation of a super-intelligent entity that may develop goals that are not aligned with humans. Combining these factors with the rapid growth of technology and the global race to lead the field of AI development makes the future seem uncertain.