The Hidden Flaw in LLM Safety: Translation as a Jailbreak

If an AI system is ‘safe’ in English, is it truly safe everywhere? 


Our latest analysis uncovered something deeper than uneven multilingual safety. What looks like an English-vs-non-English gap is the surface symptom of a far more serious architectural flaw in modern LLM safety systems. 

Across 210,000 model–prompt pairs, we found that harmful prompts refused in English can often bypass safeguards when translated into another language, turning translation into a one-step, off-the-shelf jailbreak vector. 

This means “English safety” can mask a structural weakness in the guardrails themselves. With nothing more than a standard translation tool, anyone — an everyday user, a researcher, or an adversary — can bypass protections that appear robust in English. 

This isn’t just a linguistic challenge. What looks like a language gap quickly becomes a global security risk when LLM safety fails to generalize across languages. 

Low-resource-language jailbreaks are real and easily accessible

The root cause: English-dominated training and alignment 

Training data imbalance 

  • Large language models are trained predominantly on English text. 
  • The remaining training data is spread thinly across hundreds of languages.
  • Most low-resource languages remain digitally underrepresented, with sparse, incomplete, or noisy coverage. 

Safety alignment doesn’t transfer 

  • Refusal examples, harmful content definitions, and RLHF training data are overwhelmingly English-based. 
  • Safety fine-tuning focuses primarily on English contexts. 
  • Other languages inherit some of these guardrails, but the transfer is imperfect and inconsistent. 

Result: English becomes the de facto baseline for how models interpret harmful intent and apply guardrails, but that baseline doesn’t reliably extend to other languages. 

To score the results, we used an automated safety grader that determined whether responses were safe or unsafe according to our taxonomy. Validation experiments showed high agreement between the autograder and expert human judgments, enabling reliable and consistent evaluation at scale across all languages.  
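For teams running a similar validation, the sketch below shows one way autograder–human agreement can be quantified, using raw percent agreement and Cohen's kappa. The labels and data are illustrative placeholders, not figures from our study.

```python
# Minimal sketch: measuring agreement between an automated safety grader
# and expert human labels. Labels and data below are illustrative only.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Parallel labels for the same set of model responses (hypothetical).
human_labels      = ["safe", "unsafe", "safe", "safe", "unsafe", "safe"]
autograder_labels = ["safe", "unsafe", "safe", "unsafe", "unsafe", "safe"]

# Raw percent agreement.
agreement = sum(h == a for h, a in zip(human_labels, autograder_labels)) / len(human_labels)

# Cohen's kappa corrects for agreement expected by chance.
kappa = cohen_kappa_score(human_labels, autograder_labels)

print(f"Percent agreement: {agreement:.2%}")
print(f"Cohen's kappa:     {kappa:.2f}")
print(confusion_matrix(human_labels, autograder_labels, labels=["safe", "unsafe"]))
```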

For analysis, we used statistical modeling to quantify how unsafe response rates varied by model, harm category, and language family, and we determined where safety alignment held steady and where it weakened. This approach was designed as a global safety audit—one that isolates how models behave when faced with the same content expressed in different languages. 
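As an illustration of this kind of modeling, the sketch below fits a logistic regression of a binary unsafe-response outcome on model, harm category, and language family using statsmodels. The data, category names, and effect sizes are synthetic assumptions, not our results.

```python
# Minimal sketch: logistic regression of unsafe-response outcomes on
# model, harm category, and language family. All data here is synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
models     = rng.choice(["model_A", "model_B", "model_C"], size=n)
categories = rng.choice(["hate", "self_harm", "violence"], size=n)
families   = rng.choice(["germanic", "bantu", "dravidian"], size=n)

# Hypothetical effect: low-resource families raise the unsafe probability.
base_p = 0.10 + 0.15 * (families != "germanic")
unsafe = rng.binomial(1, base_p)

df = pd.DataFrame({"unsafe": unsafe, "model": models,
                   "category": categories, "language_family": families})

# Categorical main effects; interactions could be added to test where
# alignment holds steady and where it weakens.
fit = smf.logit("unsafe ~ C(model) + C(category) + C(language_family)",
                data=df).fit(disp=0)
print(fit.summary())
```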

The evaluation also directly tested whether translation alone could convert refused English prompts into unsafe responses in other languages. 
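In pseudocode terms, that test looks like the sketch below: find prompts the model refuses in English, translate each one, and re-grade the response. The translate, query_model, and grade_response callables are hypothetical placeholders for a translation service, the model under test, and the safety grader.

```python
# Minimal sketch of the translation-based jailbreak test. The three helper
# callables are hypothetical placeholders, not a specific API.
from typing import Callable

def translation_jailbreak_rate(
    english_prompts: list[str],
    target_language: str,
    translate: Callable[[str, str], str],
    query_model: Callable[[str], str],
    grade_response: Callable[[str], str],   # returns "safe" or "unsafe"
) -> float:
    """Share of English-refused prompts that become unsafe after translation."""
    # Step 1: keep only prompts the model handles safely (refuses) in English.
    refused_in_english = [
        p for p in english_prompts
        if grade_response(query_model(p)) == "safe"
    ]
    if not refused_in_english:
        return 0.0

    # Step 2: translate each refused prompt and re-grade the model's response.
    flipped = 0
    for prompt in refused_in_english:
        translated = translate(prompt, target_language)
        if grade_response(query_model(translated)) == "unsafe":
            flipped += 1
    return flipped / len(refused_in_english)
```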

The results reveal consistent, significant degradation when prompts move from English to low-resource languages. 

In English: Baseline safety performance

Models showed varying levels of safety performance in English, with some categories demonstrating particularly strong guardrails. Hate & Discrimination and Self-Harm & Suicide emerged as the most robust categories, with the lowest unsafe response rates. 

English performance can give the appearance of strong, stable guardrails, but this apparent stability conceals how fragile these protections become once prompts leave the English-only environment they were trained on. 

In low-resource languages: Widespread degradation

Linguistic and typological factors

While grammatical properties, which tend to be shared within a language family, may play an indirect role, the specific grammatical feature we tested (word order) did not account for these differences. Instead, the patterns suggest that multilingual safety outcomes follow disparities in data availability and model representation across language families. 
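One way a claim like this can be checked is a likelihood-ratio test between nested models, fit with and without a word-order covariate. The sketch below uses synthetic data purely to show the mechanics; it is not the exact analysis behind our figures.

```python
# Minimal sketch: does a word-order covariate explain unsafe-rate differences
# beyond language family? Likelihood-ratio test between nested logistic models.
# Data is synthetic and has no built-in word-order effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 800
df = pd.DataFrame({
    "language_family": rng.choice(["germanic", "bantu", "dravidian"], size=n),
    "word_order": rng.choice(["SVO", "SOV"], size=n),
})
df["unsafe"] = rng.binomial(1, 0.10 + 0.15 * (df["language_family"] != "germanic"))

base = smf.logit("unsafe ~ C(language_family)", data=df).fit(disp=0)
full = smf.logit("unsafe ~ C(language_family) + C(word_order)", data=df).fit(disp=0)

# Twice the log-likelihood gain, compared against a chi-square distribution.
lr_stat = 2 * (full.llf - base.llf)
p_value = chi2.sf(lr_stat, df=full.df_model - base.df_model)
print(f"LR statistic: {lr_stat:.2f}, p = {p_value:.3f}")
# In this synthetic data, word order should add little explanatory power.
```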

Translation-based jailbreak patterns

Not all models degrade the same way. Performance patterns fall into four clusters:

Performance Cluster | Multilingual Pattern | Typical Change in Unsafe Rate | What It Means
High-vulnerability | Large degradation from English to low-resource languages | +20–25 pp | Strong English alignment but poor cross-lingual transfer
Moderate | Controlled degradation | ~15 pp | Better multilingual generalization, though some models still show high unsafe rates overall
Small | Limited but meaningful degradation | +6–9 pp | Weaker baseline safety calibration across all languages
Flat / Unsafe | Consistently high unsafe rates everywhere | <6 pp | Uniformly elevated unsafe response rates; does not work well in any language

Taken together, the clusters reveal a full spectrum of multilingual behavior: some models lose many of their safety protections outside English, others degrade only modestly, and a few show uniformly high unsafe rates across all languages. 
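To make the bands concrete, the sketch below buckets a model into one of the four clusters from its English and low-resource unsafe rates. The thresholds mirror the percentage-point bands in the table and are illustrative, not a formal definition.

```python
# Minimal sketch: bucketing a model into one of the four performance clusters
# from its unsafe rates (expressed as fractions). Thresholds are illustrative.
def performance_cluster(english_unsafe_rate: float,
                        low_resource_unsafe_rate: float) -> str:
    degradation_pp = (low_resource_unsafe_rate - english_unsafe_rate) * 100

    if degradation_pp >= 20:
        return "High-vulnerability"   # large English-to-low-resource drop (+20–25 pp)
    if degradation_pp >= 10:
        return "Moderate"             # controlled degradation (~15 pp)
    if degradation_pp >= 6:
        return "Small"                # limited but meaningful degradation (+6–9 pp)
    return "Flat / Unsafe"            # <6 pp change; similar unsafe rate in every language

# Example: 5% unsafe in English vs 27% unsafe in low-resource languages.
print(performance_cluster(0.05, 0.27))   # -> High-vulnerability
```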

The security implication

This is the core finding: translation is a scalable jailbreak vector. Safety that appears robust in English can be bypassed in one step. 

The implications differ depending on whether you’re building models or deploying them.

For model developers: Multilingual safety requires intentional engineering

  • Implement cross-lingual safety calibration. 
  • Prioritize content-specific mitigation in low-resource languages. 
  • Evaluate category-level robustness, not just aggregate metrics (see the sketch after this list). 
  • Treat this as both a fairness and a security issue. 
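As a starting point for the category-level check, the sketch below flags any language-by-category cell whose unsafe rate exceeds the English rate for the same category by more than a chosen tolerance. The data layout and the flag_category_gaps helper are assumptions for illustration, not a prescribed pipeline.

```python
# Minimal sketch: category-level multilingual safety check. Flags (language,
# category) cells whose unsafe rate exceeds the English rate for the same
# category by more than a tolerance. Expected columns are an assumption.
import pandas as pd

def flag_category_gaps(results: pd.DataFrame, tolerance_pp: float = 5.0) -> pd.DataFrame:
    """results columns: language, category, unsafe (0/1 per graded response)."""
    # Unsafe rate (in percent) per language-category cell.
    rates = (results.groupby(["language", "category"])["unsafe"]
                    .mean().mul(100).rename("unsafe_pct").reset_index())

    # English rate per category as the reference baseline.
    english = (rates[rates["language"] == "en"]
               .set_index("category")["unsafe_pct"])

    # Gap in percentage points relative to the English baseline.
    rates["gap_pp"] = rates.apply(
        lambda r: r["unsafe_pct"] - english.get(r["category"], 0.0), axis=1)

    # Keep only non-English cells that exceed the tolerance.
    return rates[(rates["language"] != "en") & (rates["gap_pp"] > tolerance_pp)]
```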

Most importantly: model teams cannot rely on English safety alignment to infer global safety. “It’s safe in English” is simply an English benchmark, not a safety claim. 

For enterprises: Demand multilingual audits before global deployment

Multilingual safety must be foundational, not an afterthought. 

These are not optional enhancements — they are the minimum requirements for building systems that cannot be bypassed by translation alone. 

The multilingual safety gap represents the difference between a model that appears aligned and one that can actually be trusted across markets, cultures, and user communities worldwide. 

The real measure of AI safety is no longer, “Does the model behave safely in English?” 

 It’s: “Can it withstand the simplest possible cross-lingual attack?” 

Until safety survives translation, it isn’t safety.

Welo Data partners with AI developers and enterprises to build culturally fluent, globally aligned evaluation systems that make AI safer and more reliable for users worldwide. 

Ready to evaluate your model’s multilingual safety?