When an AI system gives a different answer to an Arabic-speaking client than it gives to an English-speaking client — for the same query — that is not a translation error. It is a performance gap. And in legal and compliance settings, performance gaps have consequences.

This is the reality that most organisations using bilingual AI have not yet confronted. The gap exists. It is consistent. And it is almost never disclosed by the vendors who supply these systems.

The root cause is structural, not incidental

Large language models, the AI systems underpinning most commercial tools today, are trained largely on internet text. English accounts for a disproportionate share of that training data. Arabic, despite being one of the most widely spoken languages in the world, is represented at a fraction of that volume.

The result is that an AI system has seen vastly more examples of English legal text, English professional discourse, and English regulatory language than it has seen of the Arabic equivalent. When you ask it to perform a task in Arabic, it is drawing on a thinner, less reinforced body of knowledge.

This is not a bug that gets patched. It is a structural feature of how these systems are built. It affects every AI tool that uses English-first training data — which is almost all of them.

A vendor's claim of "multilingual support" does not close the gap. It may narrow over time as Arabic training data improves, but it does not disappear.

In our evaluations, we consistently find a 3–5× performance difference between the same AI system's English and Arabic outputs on legally meaningful tasks. The system is not broken. It is just better at English — significantly better — and organisations have no visibility of that unless someone tests it.

What specifically fails in legal and compliance contexts

The failure modes are not random. They follow patterns that, once understood, are straightforward to test for. Here are the four most consequential categories we observe.

1. Clause and information omission

Arabic AI responses are typically shorter than English responses to the same prompt. This is not stylistic — it reflects the lower density of Arabic legal and professional training data. The system is less confident about what should be included, so it includes less.

In a contract summary context, this means limitation of liability clauses, indemnity provisions, and notice requirements that appear in the English summary may simply not appear in the Arabic one. The client reads the Arabic version. The omission is not visible without a direct comparison.

Illustrative comparison — contract query
English response (54 words)
The agreement includes a limitation of liability clause capping total liability at £500,000 per incident. It also includes a 14-day notice requirement for any variation, and an indemnity clause covering third-party IP claims. The termination provisions allow either party to exit with 30 days' notice. Under clause 12, the governing law is English law.
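Arabic response (17 words)
يتضمن الاتفاق شرط إنهاء يتيح لأي من الطرفين الخروج بإشعار مدته 30 يومًا. يخضع العقد للقانون الإنجليزي.
(In English: "The agreement includes a termination clause allowing either party to exit with 30 days' notice. The contract is governed by English law.")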
Three material provisions are omitted from the Arabic response: the liability cap, the notice requirement for variations, and the indemnity clause. A client reading the Arabic version has materially less information.
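
A comparison like the one above can be pre-screened mechanically before a bilingual reviewer looks at it. The following is a minimal sketch, not a substitute for legal review: the `PROVISIONS` checklist, the keyword chosen for each provision in each language, and the function name `missing_provisions` are all assumptions made for this illustration, and plain keyword matching will miss paraphrases.

```python
# Hypothetical checklist: provision name -> (English keyword, Arabic keyword).
# The keywords are illustrative assumptions; a real checklist would be
# written and maintained by a bilingual legal reviewer.
PROVISIONS = {
    "liability cap": ("limitation of liability", "مسؤولية"),
    "variation notice": ("variation", "تعديل"),
    "indemnity": ("indemnity", "تعويض"),
    "termination": ("termination", "إنهاء"),
    "governing law": ("governing law", "قانون"),
}

# The two responses from the illustrative comparison.
ENGLISH = (
    "The agreement includes a limitation of liability clause capping total "
    "liability at £500,000 per incident. It also includes a 14-day notice "
    "requirement for any variation, and an indemnity clause covering "
    "third-party IP claims. The termination provisions allow either party "
    "to exit with 30 days' notice. Under clause 12, the governing law is "
    "English law."
)
ARABIC = (
    "يتضمن الاتفاق شرط إنهاء يتيح لأي من الطرفين الخروج بإشعار مدته 30 يومًا. "
    "يخضع العقد للقانون الإنجليزي."
)


def missing_provisions(english: str, arabic: str) -> list[str]:
    """Return provisions mentioned in the English text but not in the Arabic."""
    english = english.lower()
    return [
        name
        for name, (en_kw, ar_kw) in PROVISIONS.items()
        if en_kw in english and ar_kw not in arabic
    ]


print(missing_provisions(ENGLISH, ARABIC))
```

Anything this check flags still needs to be confirmed by someone who reads both languages. A clean result does not prove the Arabic output is complete.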

2. Legal term mistranslation

Many common English legal terms of art have no direct Arabic equivalent that carries the same legal weight. This is not a gap in the language, since Arabic has a sophisticated legal tradition, but it is a gap in how AI systems handle the translation problem.

"Without prejudice" is the most common example we encounter. In English law, this phrase carries specific legal protection: it signals that a communication is made for the purposes of negotiation and cannot be used as evidence of admission in proceedings. The AI translation we most frequently encounter is بدون تحيز — "without bias" — which is a different concept entirely and carries none of the legal protection.

"Best endeavours", "reasonable care", "on notice", and "time is of the essence" are further examples where AI translations consistently miss the legal weight of the phrase. A client reading these terms in Arabic may believe they understand what the document requires of them. They don't.

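Known mistranslations of this kind are among the easier failures to screen for, because the bad rendering is a fixed string. The sketch below assumes a hypothetical, hand-maintained table `KNOWN_BAD_RENDERINGS` and a hypothetical function `flag_mistranslations`. It contains only the one pair discussed above, and the sample Arabic sentence is written for this illustration. Further entries would have to come from a qualified legal translator.

```python
# English term of art -> Arabic renderings known to lose the legal meaning.
# Only the pair discussed in the text is listed; this table is an assumption
# of the sketch, not a vetted glossary.
KNOWN_BAD_RENDERINGS = {
    "without prejudice": ["بدون تحيز"],  # literally "without bias"
}


def flag_mistranslations(english: str, arabic: str) -> list[str]:
    """Return English terms whose known-bad Arabic rendering appears."""
    english = english.lower()
    flagged = []
    for term, bad_forms in KNOWN_BAD_RENDERINGS.items():
        if term in english and any(bad in arabic for bad in bad_forms):
            flagged.append(term)
    return flagged


# Hypothetical pair of outputs for the same letter.
english_output = "This letter is sent without prejudice."
arabic_output = "ترسل هذه الرسالة بدون تحيز."

print(flag_mistranslations(english_output, arabic_output))
```

A check like this only catches renderings already on the list. It says nothing about terms nobody has reviewed yet.
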
3. Regulatory reference asymmetry

When an AI system generates a response to a legal or compliance query in English, it draws on training data that includes regulatory guidance, case law summaries, professional codes, and compliance frameworks — all in English. The Arabic training data is thinner, and the regulatory references embedded in it are less comprehensive.

The practical effect is that English-language responses reference the applicable regulations, while Arabic-language responses to the same query often do not. An Arabic-speaking client asking about their employment rights, their data protection options, or their complaint escalation pathways receives a response that is substantively less complete.

In one financial services evaluation, we found that English-language complaint responses referenced the Financial Ombudsman Service, Financial Services Compensation Scheme (FSCS) protections, and the firm's internal escalation process. The Arabic response to an identical query referenced none of these. The firm's Arabic-speaking complaints team would have provided materially inferior guidance without any visibility of the gap.

4. Cultural deference softening obligations

This is the failure mode organisations find most surprising. AI models trained on Arabic text absorb linguistic norms from that corpus — including cultural patterns around deference, hedging, and indirect communication. In professional contexts, these patterns cause the system to soften mandatory obligations into suggestions.

An English response that reads: "You must file your response within 14 days or a default judgment may be entered against you" may become, in Arabic, something closer to: "It may be advisable to consider filing a response at your earliest convenience."

The legal content is not just softened — it is materially different. The urgency is lost. The consequence is absent. A client relying on the Arabic response may miss a deadline they did not know was mandatory.
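
Two signals of this softening can be checked automatically: whether an obligation word survives, and whether the deadline figure survives. The sketch below is a rough illustration under stated assumptions. The marker lists, the function names `numbers_in` and `softening_issues`, and both Arabic sample sentences are hypothetical and written for this example. The marker lists are far from exhaustive.

```python
import re

# Illustrative, non-exhaustive obligation markers (assumptions of this sketch).
ARABIC_OBLIGATION = ("يجب", "يتعين", "يلزم")  # roughly "must" / "is required"
ENGLISH_OBLIGATION = re.compile(r"\b(must|shall|required to)\b", re.IGNORECASE)


def numbers_in(text: str) -> set[int]:
    """Collect integers; \\d and int() accept Western and Arabic-Indic digits."""
    return {int(n) for n in re.findall(r"\d+", text)}


def softening_issues(english: str, arabic: str) -> list[str]:
    """List signs that a mandatory English statement was softened in Arabic."""
    issues = []
    has_obligation = ENGLISH_OBLIGATION.search(english) is not None
    if has_obligation and not any(m in arabic for m in ARABIC_OBLIGATION):
        issues.append("obligation marker missing")
    lost = numbers_in(english) - numbers_in(arabic)
    if lost:
        issues.append(f"numbers missing: {sorted(lost)}")
    return issues


english = ("You must file your response within 14 days or a default "
           "judgment may be entered against you.")
# Hypothetical softened output: "It may be advisable to consider filing a
# response as soon as possible."
softened = "قد يكون من المستحسن النظر في تقديم رد في أقرب وقت ممكن."
# Hypothetical faithful output: "You must file your response within 14 days."
faithful = "يجب عليك تقديم ردك خلال ١٤ يومًا."

print(softening_issues(english, softened))
print(softening_issues(english, faithful))
```

The check does not confirm that the stated consequence (here, the default judgment) is carried over. That still requires a reviewer who understands the legal context.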

The liability question

Solicitors and regulated firms using AI to assist with client communication are professionally responsible for the outputs that reach clients. The AI vendor's terms of service will not protect a firm from an investigation by the Solicitors Regulation Authority (SRA) or from a professional negligence claim. The fact that a third-party tool generated the text does not transfer the obligation.

This creates a specific question that every firm using bilingual AI should be able to answer: how do you know that your Arabic-language clients are receiving the same standard of information as your English-language clients?

For most firms, the honest answer is: they don't. The AI tool works in Arabic — the vendor confirmed that. But whether the Arabic output is accurate, complete, and legally equivalent to the English output has not been checked independently.

What you can do about it

The response is not to stop using AI. That is neither realistic nor necessary. The response is to test it — specifically, to test the bilingual performance gap — before relying on it for client-facing outputs.

Testing needs to be structured. Feeding a few prompts in Arabic and checking that the system "seems fine" does not surface the failure modes described above. You need parallel testing — the same prompts in both languages, assessed by someone who can read both languages and who understands the legal and regulatory context being tested.

The output should be a written record: what was tested, what was found, and what the findings mean for how the system should be used. That record serves two purposes. It tells you whether the system is safe to use for Arabic-speaking clients. And it documents that you checked, which matters to the SRA, to professional indemnity (PI) insurers, and potentially to a court.
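
The parallel test and the written record can be sketched together. In the minimal example below, every name is an assumption introduced for illustration: `ParallelResult`, `run_parallel`, `to_record`, and the `fake_system` stub that stands in for the tool under test. The length ratio is only a crude omission signal. The findings themselves come from the human reviewer.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Callable


@dataclass
class ParallelResult:
    prompt_id: str
    english_output: str
    arabic_output: str
    # Filled in by the bilingual reviewer, not by code.
    reviewer_findings: list[str] = field(default_factory=list)

    @property
    def length_ratio(self) -> float:
        """English-to-Arabic word-count ratio, a crude signal of omission."""
        arabic_words = len(self.arabic_output.split())
        return len(self.english_output.split()) / max(arabic_words, 1)


def run_parallel(prompts: dict[str, tuple[str, str]],
                 ask: Callable[[str], str]) -> list[ParallelResult]:
    """Send each prompt in both languages through the same system."""
    return [ParallelResult(pid, ask(en), ask(ar))
            for pid, (en, ar) in prompts.items()]


def to_record(results: list[ParallelResult]) -> str:
    """Serialise outputs and findings as JSON for the written record."""
    rows = [{**asdict(r), "length_ratio": round(r.length_ratio, 2)}
            for r in results]
    return json.dumps(rows, ensure_ascii=False, indent=2)


def fake_system(prompt: str) -> str:
    # Stand-in for the tool under test; returns canned text for the demo.
    canned = {
        "Summarise the contract.": "one two three four",
        "لخص العقد.": "واحد اثنان",
    }
    return canned[prompt]


results = run_parallel({"q1": ("Summarise the contract.", "لخص العقد.")},
                       fake_system)
results[0].reviewer_findings.append("Indemnity clause absent from Arabic.")
print(to_record(results))
```

Keeping the raw outputs alongside the reviewer's findings means the record shows what was tested and what was found.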

Independence is the key requirement here. A vendor self-assessment is not sufficient for this purpose. The assessment needs to come from someone with no commercial interest in the outcome: someone who will tell you what they actually find, including findings that make the system look bad.

The Free Snapshot Report we offer at Dalīl Group is designed exactly for this: one real scenario from your system, fully evaluated, delivered as a formatted report. It gives you an accurate view of whether a problem exists — at no cost and with no commitment to proceed further.

The gap we describe in this article exists across almost every bilingual Arabic–English AI system we have evaluated. The organisations that have discovered it through an evaluation have been in a far better position than those that discovered it through a client complaint.