Case Studies

What we find when we look closely

Illustrative findings from Dalīl Group evaluation engagements. Each case is representative of its sector, and the same patterns recur across organisations and AI systems.

All cases are anonymised. Client names, specific systems, and identifying details have been removed.
Legal services | Bias & Reliability Audit
Contract review AI — Arabic clause omission
Mid-size solicitors' firm. AI used for contract summarisation across English and Arabic commercial agreements.
Not approved
Key findings
Critical: Limitation of liability clauses present in English source contracts were omitted from Arabic summaries in 4 of 7 test scenarios
Critical: "Without prejudice" was rendered as بدون تحيز ("without bias"), losing the legal-protection meaning entirely
High: Arabic responses were on average 43% shorter than English responses to identical queries, systematically omitting detail
High: SRA professional conduct references present in English responses were absent from the Arabic equivalents
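As a rough illustration of how figures such as "4 of 7 test scenarios" and "43% shorter" might be computed, the Python sketch below compares paired English and Arabic outputs. It is not Dalīl Group's evaluation tooling: the function names, the data layout, and the use of whitespace word counts are all assumptions, and the sample values are invented. Word counts are only a crude proxy, since a single Arabic word often carries more than a single English word.

```python
def omission_rate(scenarios):
    """Share of scenarios where a clause in the English source is missing from the Arabic summary.

    Each scenario is an (in_english_source, in_arabic_summary) pair of booleans,
    a hypothetical layout assumed to come from human review.
    """
    relevant = [s for s in scenarios if s[0]]
    if not relevant:
        return 0.0
    omitted = sum(1 for _in_en, in_ar in relevant if not in_ar)
    return omitted / len(relevant)


def mean_length_shortfall(pairs):
    """Average fraction by which Arabic responses are shorter than their English counterparts."""
    shortfalls = []
    for english, arabic in pairs:
        en_words = len(english.split())
        ar_words = len(arabic.split())
        if en_words == 0:
            continue  # skip empty English responses to avoid dividing by zero
        shortfalls.append((en_words - ar_words) / en_words)
    return sum(shortfalls) / len(shortfalls) if shortfalls else 0.0


# Invented sample data, for illustration only.
sample_scenarios = [(True, False)] * 4 + [(True, True)] * 3
sample_pairs = [("a b c d", "x y"), ("a b c d e f g h i j", "x y z w v u")]

print(omission_rate(sample_scenarios))
print(mean_length_shortfall(sample_pairs))
```

Keeping the inputs paired by query means each Arabic response is only ever compared with the English response to the same prompt.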
Financial services | Readiness Assessment
Customer-facing chatbot — banking sector
Regional bank deploying Arabic–English chatbot for account queries, loan information, and complaint handling.
Not approved
Key findings
Critical: Complaint handling responses in Arabic omitted FCA-required escalation pathways present in English responses
Critical: A loan eligibility query in Arabic returned Gulf-jurisdiction debt-to-income ratios, not UK regulatory thresholds
High: Arabic responses showed a cultural deference pattern, softening mandatory warnings ("you may wish to consider" vs. "you must")
Medium: The system was inconsistent in dialect, switching between Gulf and Levantine Arabic mid-conversation for the same user
Healthcare | Cultural Integrity Review
Patient communication tool — NHS trust
NHS trust piloting AI-assisted patient communication in Arabic, serving a significant Arabic-speaking local population.
Conditional
Key findings
High: Dosage instruction responses in Arabic used ambiguous phrasing that could be interpreted as "when needed" rather than "twice daily as prescribed"
High: Cultural modesty considerations were absent; clinical questions about intimate health were phrased identically regardless of patient gender and cultural context
Medium: Religious and cultural observance factors (e.g. Ramadan fasting) were not surfaced in medication timing guidance despite their relevance
Medium: Emergency signposting (999, 111) was present in English responses but missing from 2 of 5 Arabic critical-symptom test scenarios
Public sector | Readiness Assessment
Benefits enquiry assistant — local authority
London borough deploying AI for benefits and housing enquiries, with Arabic as a priority community language.
Conditional
Key findings
High: Arabic responses on housing benefit eligibility criteria omitted the "habitual residence" test, a key eligibility gate that was present in all English responses
High: Appeals process guidance was absent from Arabic responses to housing dispute queries
Medium: The formal tone was appropriate; however, the system used Egyptian dialect features inconsistently for a predominantly Iraqi and Syrian user base
Medium: English responses referenced Citizens Advice; Arabic responses did not, a meaningful information gap for newly arrived residents
HR & recruitment | Bias & Reliability Audit
Candidate screening AI — name and origin bias
Recruitment firm using AI to shortlist candidates across bilingual CVs. Arabic and English CVs submitted for equivalent candidates.
Not approved
Key findings
Critical: Structurally identical CVs with Arabic names scored 22–31 points lower than equivalents with English names in automated screening
Critical: Arabic-language CVs received lower "communication skills" ratings than English-language equivalents with identical content
High: The system flagged qualifications from MENA universities as "unverified" at 3× the rate of equivalent UK qualifications
High: The bias pattern was consistent across 3 separate AI screening tools tested, so it was not vendor-specific
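A paired comparison of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the audit methodology itself: the function names and data layout are hypothetical, and the scores and flags below are invented values rather than engagement data.

```python
def paired_score_gaps(scored_pairs):
    """Score difference for each matched CV pair.

    scored_pairs holds (score_with_english_name, score_with_arabic_name) for
    CVs that are otherwise structurally identical (an assumed layout).
    """
    return [english - arabic for english, arabic in scored_pairs]


def flag_rate_ratio(group_flags, reference_flags):
    """How many times more often one group is flagged than a reference group."""
    group_rate = sum(group_flags) / len(group_flags)
    reference_rate = sum(reference_flags) / len(reference_flags)
    if reference_rate == 0:
        raise ValueError("reference group was never flagged; ratio is undefined")
    return group_rate / reference_rate


# Invented values, for illustration only.
gaps = paired_score_gaps([(78, 52), (81, 55), (70, 41)])
print(min(gaps), max(gaps))

ratio = flag_rate_ratio(
    [True, True, True, False],    # hypothetical "unverified" flags, MENA qualifications
    [True, False, False, False],  # hypothetical "unverified" flags, UK qualifications
)
print(ratio)
```

Scoring each Arabic-name CV against its own English-name twin keeps the comparison paired, so any gap reflects the name rather than differences in content.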
Financial services | High-Trust Pilot Support
Insurance claims AI — pre-deployment pilot
Insurer piloting Arabic–English AI for first-notice-of-loss handling. Engaged Dalīl Group before public launch to identify and remediate issues in advance.
Approved
Key findings — initial evaluation
High: Initial evaluation found 3 high-risk gaps; all were addressed in remediation before pilot launch
Medium: Arabic claim acknowledgement letters used overly formal Classical Arabic; we recommended a shift to Modern Standard Arabic with an accessible register
Medium: Timeframe expressions were translated ambiguously ("within a few days" vs. "within 5 working days" as per policy)
Performance gap
Average difference between English and Arabic accuracy in our evaluations of the same AI system performing the same task
83%
Unaware before evaluation
Of organisations we evaluated, the proportion that had no prior visibility of the performance difference between their system's language outputs
100%
Vendor non-disclosure
Of AI vendors whose systems we evaluated, none had proactively disclosed Arabic-specific performance limitations to their clients
Start here

See what we find in your system

Request a Free Snapshot Report — one real scenario, fully evaluated, delivered as a formatted report. No cost, no commitment.