DeepSeek Warns of AI Model ‘Jailbreak’ Risks
- tech360.tv

- Sep 22
Hangzhou-based start-up DeepSeek has revealed the risks posed by its artificial intelligence models, noting that open-source models are particularly susceptible to being “jailbroken” by malicious actors. The details were published in a peer-reviewed article in the academic journal Nature.

DeepSeek evaluated its models using industry benchmarks and its own tests. This marks the first time the company has disclosed details of these risks in a peer-reviewed journal, although it had carried out such evaluations before, including of the most serious “frontier risks”.
American AI companies often publicise research on their rapidly improving models and introduce risk mitigation policies. Examples include Anthropic’s Responsible Scaling Policies and OpenAI’s Preparedness Framework.
According to AI experts, Chinese companies have been less outspoken about such risks, despite their models being only months behind their US equivalents.
The Nature paper provided more “granular” details on DeepSeek’s testing regime, said Fang Liang, an expert member of China’s AI Industry Alliance (AIIA). These included “red-team” tests based on an Anthropic framework, in which testers attempt to elicit harmful speech from AI models.
DeepSeek found that its R1 reasoning model, released in January 2025, and its V3 base model, released in December 2024, had slightly higher-than-average safety scores across six industry benchmarks. The scores were compared against OpenAI’s o1 and GPT-4o, both released in 2024, and Anthropic’s Claude-3.7-Sonnet, released in February 2025.
However, R1 was "relatively unsafe" when its external "risk control" mechanism was removed, following tests on DeepSeek’s in-house safety benchmark of 1,120 test questions. AI companies typically try to prevent harmful content generation by fine-tuning models during training or adding external content filters.
Experts warn these safety measures can be easily bypassed by techniques such as “jailbreaking.” For example, a malicious user might ask for a detailed history of the Molotov cocktail rather than directly requesting instructions for making one.

DeepSeek found all tested models exhibited “significantly increased rates” of harmful responses when faced with jailbreak attacks. R1 and Alibaba Group Holding’s Qwen2.5 were deemed most vulnerable because they are open-source.
Open-source models are released freely online for anyone to download and modify. While this aids technology adoption, it enables users to remove a model’s external safety mechanisms.
The paper, which lists DeepSeek CEO Liang Wenfeng as the corresponding author, stated, "We fully recognise that, while open source sharing facilitates the dissemination of advanced technologies within the community, it also introduces potential risks of misuse."
The paper also stated, "To address safety issues, we advise developers using open source models in their services to adopt comparable risk control measures."
DeepSeek’s warning comes as Chinese policymakers stress the need to balance development and safety in China’s open-source AI ecosystem. On Monday, a technical standards body associated with the Cyberspace Administration of China warned of the heightened risk of model vulnerabilities being passed on to downstream applications through open-sourcing.
In a new update to its “AI Safety Governance Framework”, the body said, “The open-sourcing of foundation models … will widen their impact and complicate repairs, making it easier for criminals to train ‘malicious models’.”
The Nature paper also revealed, for the first time, that R1’s compute training cost was USD 294,000. The figure had been the subject of speculation since the model’s January release because it is significantly lower than the reported training costs of US models.
The paper rebutted accusations that DeepSeek “distilled” OpenAI’s models, a controversial practice of training a model using a competitor’s outputs.
News of DeepSeek being featured on Nature’s front page was celebrated in China and trended on social media, where the company was hailed as the “first LLM company to be peer-reviewed”.
According to Fang Liang, this peer-review recognition might encourage other Chinese AI companies to be more transparent about their safety and security practices, “as long as companies want to get their work published in world-leading journals”.
- DeepSeek published a Nature article detailing “jailbreak” risks for its AI models, especially open-source versions.
- Its R1 and V3 models performed well on safety benchmarks, but R1 was “relatively unsafe” without external risk controls.
- Open-source models, such as DeepSeek’s R1 and Alibaba’s Qwen2.5, are highly vulnerable to jailbreak attacks.
Source: SCMP


