The Hidden Flaw in LLM Safety: Translation as a Jailbreak

If an AI system is ‘safe’ in English, is it truly safe everywhere?

December 10, 2025

10 Minutes

Blog

The Illusion of Universal Safety

Most AI evaluations would have you believe so. English dominates model training, policy development, red-teaming, and most benchmarks used to judge LLM safety performance. And while many organizations are beginning to expand their guardrails through multilingual filters and translated safety rules, the fundamental challenge of achieving true multilingual safety remains.

Welo Data’s multilingual evaluation reveals just how serious this challenge is. Across dozens of languages and harm categories, safety alignment is strongest in English and weakens significantly as prompts move into low-resource or typologically distant languages. Guardrails that appear rigid in English become inconsistent—or fail entirely—elsewhere.

Our latest analysis uncovered something deeper than uneven multilingual safety. What looks like an English-vs-non-English gap is the surface symptom of a far more serious architectural flaw in modern LLM safety systems.

Across 210,000 model–prompt pairs, we found that harmful prompts refused in English can often bypass safeguards by translating them into another language, becoming a one-step, off-the-shelf jailbreak vector.

This means “English safety” can mask a structural weakness in the guardrails themselves. With nothing more than a standard translation tool, anyone — an everyday user, a researcher, or an adversary — can bypass protections that appear robust in English.

This isn’t just a linguistic challenge. What looks like a language gap quickly becomes a global security risk when LLM safety fails to generalize across languages.

The Research Context: Why Multilingual Safety Failures Are Structural

Low-resource jailbreaks are real and are easily accessible

Multiple studies have demonstrated the vulnerability:

Users interacting with models in digitally under-represented languages face a higher likelihood of receiving unsafe or toxic outputs (Deng et al. 2024).
Cross-lingual jailbreaks require minimal technical sophistication, as demonstrated in research by Yong et al. (2023), Deng et al. (2023), and Ghanim et al. (2024).

These aren’t edge cases; they are low-effort vulnerabilities that any user can trigger.

But the core issue isn’t just that low-resource languages behave differently; it’s that they expose a systemic weakness in how models are aligned. English safety covers the cracks; multilingual testing reveals them.

The root cause: English-dominated training and alignment

The problem stems from how models are built:

Training data imbalance

Large language models are trained predominantly on English text.
The remaining training data is spread thinly across hundreds of languages.
Most low-resource languages remain digitally underrepresented, with sparse, incomplete, or noisy coverage.

Safety alignment doesn’t transfer

Refusal examples, harmful content definitions, and RLHF training data are overwhelmingly English based.
Safety fine-tuning focuses primarily on English contexts.
Other languages inherit some of these guardrails, but the transfer is imperfect and inconsistent.

Result: English becomes the de facto baseline for how models interpret harmful intent and apply guardrails, but that baseline doesn’t reliably extend to other languages.

These multilingual safety failures aren’t accidents. They’re the predictable outcome of an English-first development pipeline.

And critically: the same structural imbalance that weakens LLM safety in low-resource languages also creates the translation-based jailbreak pathway our study demonstrates.

Welo Data’s Multilingual Safety Evaluation

To evaluate how LLM safety changes across languages, we used a controlled multilingual framework built around a consistent set of harmful, controversial, and safeprompts taken from existing datasets like AdvBench (Zou et al. 2023), MultiJail (Deng et al. 2023), PHTest (An et al. 2024), HarmBench (Mazeika et al. 2024), and DiaSafety (Sun et al. 2022). The test suite covered 14 harm categories, including Self-Harm & Suicide, Hate and Discrimination, Dangerous Behavior, Misinformation, Medical Advice, and Model Security— inspired by well-established safety taxonomies used in industry and research.

Each English prompt was translated into 78 low-resource languages using high-quality commercial machine translation systems. We applied a combination of human review and sampled quality checks to ensure that translations preserved the original meaning and harmful intent, enabling direct comparison across languages. All 10 models were evaluated on identical prompts in every language, generating more than 210,000 model–prompt outputs.

To score the results, we used an automated safety grader that determined whether responses were safe or unsafe according to our taxonomy. Validation experiments showed high agreement between the autograder and expert human judgments, enabling reliable and consistent evaluation at scale across all languages.

For analysis, we used statistical modeling to quantify how unsafe response rates varied by model, harm category, and language family, and we determined where safety alignment held steady and where it weakened. This approach was designed as a global safety audit—one that isolates how models behave when faced with the same content expressed in different languages.

And the evaluation directly tested whether translation alone could convert refused English prompts into unsafe responses in other languages.

Key Findings: How Safety Breaks Down Across Languages

The results reveal consistent, significant degradation when prompts move from English to low-resource languages.

In English: Baseline safety performance

Models showed varying levels of safety performance in English, with some categories demonstrating particularly strong guardrails. Hate & Discrimination and Self-Harm & Suicide emerged as the most robust categories, with the lowest unsafe response rates.

English performance can give the appearance of strong, stable guardrails, but this apparent stability conceals how fragile these protections become once prompts leave the English-only environment they were trained on.

In low-resource languages: Widespread degradation

Across models, unsafe response rates increase up to 25 percentage points when prompts are translated from English to low-resource languages.
Even the strongest English categories (Hate & Discrimination and Self-Harm & Suicide) degrade significantly in low-resource languages. In some cases, unsafe response rates can increase four to five times compared to English

Linguistic and typological factors

Language family was a strong predictor: Nilo-Saharan and Niger-Congo families exhibited 60–90% higher unsafe odds compared to low-resource Indo-European languages.
Austronesian and Indo-European languages show relatively smaller (but still meaningful) gaps.
Canonical word order (SVO, SOV, etc.) showed no meaningful effect on LLM safety outcomes.

While grammatical properties, which tend to be shared within a language family, may play an indirect role, the specific grammatical feature we tested (word order) did not account for these differences. Instead, the patterns suggest that multilingual safety outcomes follow disparities in data availability and model representation across language families.

Translation-based jailbreak patterns

Not all models degrade the same way. Performance patterns fall into four clusters:

Performance Cluster	Multilingual Pattern	Typical Change in Unsafe Rate	What It Means
High-vulnerability	Large degradation from English to low-resource languages	+20–25 pp	Strong English alignment but poor cross-lingual transfer
Moderate	Controlled degradation	~15 pp	Better multilingual generalization, though some models still show high unsafe rates overall
Small	Limited but meaningful degradation	+6–9 pp	Weaker baseline safety calibration across all languages
Flat / Unsafe	Consistently high unsafe rates everywhere	<6 pp	Uniformly elevated unsafe response rates — doesn’t work well in any language

Performance Cluster: High-vulnerability
Multilingual Pattern: Large degradation from English to low-resource languages
Typical Change in Unsafe Rate: +20–25 pp
What It Means:Strong English alignment but poor cross-lingual transfer

Performance Cluster: Moderate
Multilingual Pattern: Controlled degradation
Typical Change in Unsafe Rate: ~15 pp
What It Means:Better multilingual generalization, though some models still show high unsafe rates overall

Performance Cluster: Small
Multilingual Pattern: Limited but meaningful degradation
Typical Change in Unsafe Rate: +6–9 pp
What It Means:Weaker baseline safety calibration across all languages

Performance Cluster: Flat / Unsafe
Multilingual Pattern: Consistently high unsafe rates everywhere
Typical Change in Unsafe Rate: <6 pp
What It Means:Uniformly elevated unsafe response rates — doesn’t work well in any language

Taken together, the clusters reveal a full spectrum of multilingual behavior: some models lose many of their safety protections outside English, others degrade only modestly, and a few show uniformly high unsafe rates across all languages.

The security implication

Because harmful prompts can be instantly translated using widely available online tools, adversaries can bypass English guardrails simply by switching languages.

These translation-based jailbreaks aren’t theoretical — they directly exploit the cross-lingual safety gaps identified in this evaluation.

Meanwhile, millions of people who rely on these languages online are left with weaker protection against harmful or unsafe content.

This is the core finding: translation is a scalable jailbreak vector. Safety that appears robust in English can be bypassed in one step.

What This Means for AI Developers and Enterprises

The implications differ depending on whether you’re building models or deploying them.

For model developers: Multilingual safety requires intentional engineering

Safety that works in English won’t automatically work elsewhere. Developers need to:

Implement cross-lingual safety calibration:

Build tuning processes that account for language-family differences.
Don’t rely on English guardrails to transfer automatically.

Prioritize content-specific mitigation in low-resource languages:

Focus especially on high-risk categories: Self-Harm & Suicide, Model Security, Dangerous Behavior & Criminal Content.
These categories show the sharpest multilingual degradation.

Evaluate category-level robustness, not just aggregate metrics:

Test how guardrails perform across specific harm categories.
Translated benchmarks alone are insufficient.

Treat this as both a fairness AND security issue:

Multilingual gaps create exploitable vulnerabilities.
These gaps also mean safety protections vary by language, leaving some groups systematically less protected.

Most importantly: model teams cannot rely on English safety alignment to infer global safety. “It’s safe in English” is simply an English benchmark, not a safety claim.

For enterprises: Demand multilingual audits before global deployment

Organizations deploying AI systems internationally face significant risk without multilingual evaluation:

Why audits matter:

Safety performance varies dramatically by language.

Degradation ranges from +6 to +25 percentage points depending on model.

Without testing, you’re deploying blind

across global markets.

Regulatory exposure increases when safety is inconsistent across regions.

Enterprises increasingly face not only brand risk but adversarial risk: if a model can be jailbroken simply by changing languages, global deployment without multilingual audits becomes a material security liability.

Building AI That’s Safe Everywhere: Five Priorities

Multilingual safety must be foundational, not an afterthought.

1. Make multilingual evaluation a standard practice.

2. Build with cultural fluency from the start.

3. Train on diverse data created by native speakers.

4. Stress-test in low-resource scenarios.

5. Prioritize global alignment as core infrastructure.

These are not optional enhancements — they are the minimum requirements for building systems that cannot be bypassed by translation alone.

Conclusion: From English Benchmarks to Global LLM Safety

The multilingual safety gap represents the difference between a model that appears aligned and one that can actually be trusted across markets, cultures, and user communities worldwide.

Welo Data’s multilingual evaluation framework helps development teams and enterprises surface these gaps before deployment and implementing the changes needed to close them. The goal is to ensure that LLM safety protections remain consistent and reliable across all languages and regions.

The real measure of AI safety is no longer, “Does the model behave safely in English?”

It’s: “Can it withstand the simplest possible cross-lingual attack?”

Until safety survives translation, it isn’t safety.

Let’s Make AI Safety Multilingual

Welo Data partners with AI developers and enterprises to build culturally fluent, globally aligned evaluation systems that make AI safer and more reliable for users worldwide.

Ready to evaluate your model’s multilingual safety?

Contact our team to learn more.

Gen AI

AI/ML Models

Model Assessment Suite | Evaluation Tools

Knowledge Hub

About Us