The Hidden Flaw in LLM Safety: Translation as a Jailbreak
If an AI system is ‘safe’ in English, is it truly safe everywhere?Â
The Illusion of Universal Safety
Most AI evaluations would have you believe so. English dominates model training, policy development, red-teaming, and most benchmarks used to judge LLM safety performance. And while many organizations are beginning to expand their guardrails through multilingual filters and translated safety rules, the fundamental challenge of achieving true multilingual safety remains.
Welo Data’s multilingual evaluation reveals just how serious this challenge is. Across dozens of languages and harm categories, safety alignment is strongest in English and weakens significantly as prompts move into low-resource or typologically distant languages. Guardrails that appear rigid in English become inconsistent—or fail entirely—elsewhere.
Our latest analysis uncovered something deeper than uneven multilingual safety. What looks like an English-vs-non-English gap is the surface symptom of a far more serious architectural flaw in modern LLM safety systems.
Across 210,000 model–prompt pairs, we found that harmful prompts refused in English can often bypass safeguards by translating them into another language, becoming a one-step, off-the-shelf jailbreak vector.
This means “English safety” can mask a structural weakness in the guardrails themselves. With nothing more than a standard translation tool, anyone — an everyday user, a researcher, or an adversary — can bypass protections that appear robust in English.
This isn’t just a linguistic challenge. What looks like a language gap quickly becomes a global security risk when LLM safety fails to generalize across languages.
The Research Context: Why Multilingual Safety Failures Are Structural
Low-resource jailbreaks are real and are easily accessible
Multiple studies have demonstrated the vulnerability:
- Users interacting with models in digitally under-represented languages face a higher likelihood of receiving unsafe or toxic outputs (Deng et al. 2024).
- Cross-lingual jailbreaks require minimal technical sophistication, as demonstrated in research by Yong et al. (2023), Deng et al. (2023), and Ghanim et al. (2024).
These aren’t edge cases; they are low-effort vulnerabilities that any user can trigger.
But the core issue isn’t just that low-resource languages behave differently; it’s that they expose a systemic weakness in how models are aligned. English safety covers the cracks; multilingual testing reveals them.
The root cause: English-dominated training and alignment
The problem stems from how models are built:
Training data imbalance
- Large language models are trained predominantly on English text.
- The remaining training data is spread thinly across hundreds of languages.
- Most low-resource languages remain digitally underrepresented, with sparse, incomplete, or noisy coverage.
Safety alignment doesn’t transfer
- Refusal examples, harmful content definitions, and RLHF training data are overwhelmingly English based.
- Safety fine-tuning focuses primarily on English contexts.
- Other languages inherit some of these guardrails, but the transfer is imperfect and inconsistent.
Result: English becomes the de facto baseline for how models interpret harmful intent and apply guardrails, but that baseline doesn’t reliably extend to other languages.
These multilingual safety failures aren’t accidents. They’re the predictable outcome of an English-first development pipeline.
And critically: the same structural imbalance that weakens LLM safety in low-resource languages also creates the translation-based jailbreak pathway our study demonstrates.
Welo Data’s Multilingual Safety Evaluation
To evaluate how LLM safety changes across languages, we used a controlled multilingual framework built around a consistent set of harmful, controversial, and safeprompts taken from existing datasets like AdvBench (Zou et al. 2023), MultiJail (Deng et al. 2023), PHTest (An et al. 2024), HarmBench (Mazeika et al. 2024), and DiaSafety (Sun et al. 2022). The test suite covered 14 harm categories, including Self-Harm & Suicide, Hate and Discrimination, Dangerous Behavior, Misinformation, Medical Advice, and Model Security— inspired by well-established safety taxonomies used in industry and research.
Each English prompt was translated into 78 low-resource languages using high-quality commercial machine translation systems. We applied a combination of human review and sampled quality checks to ensure that translations preserved the original meaning and harmful intent, enabling direct comparison across languages. All 10 models were evaluated on identical prompts in every language, generating more than 210,000 model–prompt outputs.
To score the results, we used an automated safety grader that determined whether responses were safe or unsafe according to our taxonomy. Validation experiments showed high agreement between the autograder and expert human judgments, enabling reliable and consistent evaluation at scale across all languages.
For analysis, we used statistical modeling to quantify how unsafe response rates varied by model, harm category, and language family, and we determined where safety alignment held steady and where it weakened. This approach was designed as a global safety audit—one that isolates how models behave when faced with the same content expressed in different languages.
And the evaluation directly tested whether translation alone could convert refused English prompts into unsafe responses in other languages.
Key Findings: How Safety Breaks Down Across Languages
The results reveal consistent, significant degradation when prompts move from English to low-resource languages.
In English: Baseline safety performance
Models showed varying levels of safety performance in English, with some categories demonstrating particularly strong guardrails. Hate & Discrimination and Self-Harm & Suicide emerged as the most robust categories, with the lowest unsafe response rates.
English performance can give the appearance of strong, stable guardrails, but this apparent stability conceals how fragile these protections become once prompts leave the English-only environment they were trained on.
In low-resource languages: Widespread degradation
- Across models, unsafe response rates increase up to 25 percentage points when prompts are translated from English to low-resource languages.
- Even the strongest English categories (Hate & Discrimination and Self-Harm & Suicide) degrade significantly in low-resource languages. In some cases, unsafe response rates can increase four to five times compared to English
Linguistic and typological factors
- Language family was a strong predictor: Nilo-Saharan and Niger-Congo families exhibited 60–90% higher unsafe odds compared to low-resource Indo-European languages.
- Austronesian and Indo-European languages show relatively smaller (but still meaningful) gaps.
- Canonical word order (SVO, SOV, etc.) showed no meaningful effect on LLM safety outcomes.
While grammatical properties, which tend to be shared within a language family, may play an indirect role, the specific grammatical feature we tested (word order) did not account for these differences. Instead, the patterns suggest that multilingual safety outcomes follow disparities in data availability and model representation across language families.
Translation-based jailbreak patterns
Not all models degrade the same way. Performance patterns fall into four clusters:
| Performance Cluster | Multilingual Pattern | Typical Change in Unsafe Rate | What It Means |
| High-vulnerability | Large degradation from English to low-resource languages | +20–25 pp | Strong English alignment but poor cross-lingual transfer |
| Moderate | Controlled degradation | ~15 pp | Better multilingual generalization, though some models still show high unsafe rates overall |
| Small | Limited but meaningful degradation | +6–9 pp | Weaker baseline safety calibration across all languages |
| Flat / Unsafe | Consistently high unsafe rates everywhere | <6 pp | Uniformly elevated unsafe response rates — doesn’t work well in any language |
Performance Cluster: High-vulnerability
Multilingual Pattern: Large degradation from English to low-resource languages
Typical Change in Unsafe Rate: +20–25 pp
What It Means:Strong English alignment but poor cross-lingual transfer
Performance Cluster: Moderate
Multilingual Pattern: Controlled degradation
Typical Change in Unsafe Rate: ~15 pp
What It Means:Better multilingual generalization, though some models still show high unsafe rates overall
Performance Cluster: Small
Multilingual Pattern: Limited but meaningful degradation
Typical Change in Unsafe Rate: +6–9 pp
What It Means:Weaker baseline safety calibration across all languages
Performance Cluster: Flat / Unsafe
Multilingual Pattern: Consistently high unsafe rates everywhere
Typical Change in Unsafe Rate: <6 pp
What It Means:Uniformly elevated unsafe response rates — doesn’t work well in any language
Taken together, the clusters reveal a full spectrum of multilingual behavior: some models lose many of their safety protections outside English, others degrade only modestly, and a few show uniformly high unsafe rates across all languages.
The security implication
Because harmful prompts can be instantly translated using widely available online tools, adversaries can bypass English guardrails simply by switching languages.
These translation-based jailbreaks aren’t theoretical — they directly exploit the cross-lingual safety gaps identified in this evaluation.
Meanwhile, millions of people who rely on these languages online are left with weaker protection against harmful or unsafe content.
This is the core finding: translation is a scalable jailbreak vector. Safety that appears robust in English can be bypassed in one step.
What This Means for AI Developers and Enterprises
The implications differ depending on whether you’re building models or deploying them.
For model developers: Multilingual safety requires intentional engineering
Safety that works in English won’t automatically work elsewhere. Developers need to:
Implement cross-lingual safety calibration:
- Build tuning processes that account for language-family differences.
- Don’t rely on English guardrails to transfer automatically.
Prioritize content-specific mitigation in low-resource languages:
- Focus especially on high-risk categories: Self-Harm & Suicide, Model Security, Dangerous Behavior & Criminal Content.
- These categories show the sharpest multilingual degradation.
Evaluate category-level robustness, not just aggregate metrics:
- Test how guardrails perform across specific harm categories.
- Translated benchmarks alone are insufficient.
Treat this as both a fairness AND security issue:
- Multilingual gaps create exploitable vulnerabilities.
- These gaps also mean safety protections vary by language, leaving some groups systematically less protected.
Most importantly: model teams cannot rely on English safety alignment to infer global safety. “It’s safe in English” is simply an English benchmark, not a safety claim.
For enterprises: Demand multilingual audits before global deployment
Organizations deploying AI systems internationally face significant risk without multilingual evaluation:
Why audits matter:
Safety performance varies dramatically by language.
Degradation ranges from +6 to +25 percentage points depending on model.
Without testing, you’re deploying blind
across global markets.
Regulatory exposure increases when safety is inconsistent across regions.
Enterprises increasingly face not only brand risk but adversarial risk: if a model can be jailbroken simply by changing languages, global deployment without multilingual audits becomes a material security liability.
Building AI That’s Safe Everywhere: Five Priorities
Multilingual safety must be foundational, not an afterthought.
1. Make multilingual evaluation a standard practice.
2. Build with cultural fluency from the start.
3. Train on diverse data created by native speakers.
4. Stress-test in low-resource scenarios.
5. Prioritize global alignment as core infrastructure.
These are not optional enhancements — they are the minimum requirements for building systems that cannot be bypassed by translation alone.
Conclusion: From English Benchmarks to Global LLM Safety
The multilingual safety gap represents the difference between a model that appears aligned and one that can actually be trusted across markets, cultures, and user communities worldwide.
Welo Data’s multilingual evaluation framework helps development teams and enterprises surface these gaps before deployment and implementing the changes needed to close them. The goal is to ensure that LLM safety protections remain consistent and reliable across all languages and regions.
The real measure of AI safety is no longer, “Does the model behave safely in English?”
It’s: “Can it withstand the simplest possible cross-lingual attack?”
Until safety survives translation, it isn’t safety.
Let’s Make AI Safety Multilingual
Welo Data partners with AI developers and enterprises to build culturally fluent, globally aligned evaluation systems that make AI safer and more reliable for users worldwide.
Ready to evaluate your model’s multilingual safety?