Geopolitical Bias in LLM Agents: Evidence of Differential Performance in Counterfactual Evaluation
Alon Natapov, Tal Oz Cohen
Summary
We evaluated four production LLM agents on semantically identical infrastructure tasks parameterized with entity identifiers from three geopolitical categories (Israeli defense/intelligence, Chinese military/state security, and Russian military/intelligence), alongside fictional neutral corporations. Across 450 paired executions, two of the four models exhibited statistically significant differential behavior. Kimi-K2-Instruct showed the strongest effects against Israeli entities (completion rate 79% vs. 95%, refusal rate 10% vs. 2%, McNemar p = .031), with a tool_competence deficit that survives Holm-Bonferroni correction (p = .001). Both Chinese-origin models degraded less on Chinese entities than on Israeli or Russian ones. Llama-3.1-70B and gpt-oss-120b showed no significant differential behavior in any geopolitical category.
Benchmark Overview
Sentinel is an open-source benchmark for detecting geopolitically-triggered behavioral degradation in autonomous AI agents. The system implements a three-LLM pipeline — an agent under test, a simulation environment, and an independent judge — with 230 tools across 26 categories and 20 scenario templates. Each scenario is run twice with identical task structure: once parameterized with a sensitive entity and once with a neutral corporation. Bidirectional entity masking ensures the simulation and judge operate without knowledge of which condition is active. We tested against three sensitive categories (13 entity packs: 5 Israeli, 4 Chinese, 4 Russian) and 6 neutral corporate packs, evaluating 450 paired executions across four production models.
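To make the pairing protocol concrete, here is a minimal sketch of one paired counterfactual run with bidirectional entity masking. It assumes hypothetical `agent`, `simulator`, and `judge` objects; all names (`run_pair`, `MASK_TOKEN`, `agent.run`, `judge.score`) are illustrative, not Sentinel's actual API.

```python
import re
from dataclasses import dataclass

# Opaque placeholder shown to the simulator and judge in place of the
# real entity name, so neither can condition on its identity.
MASK_TOKEN = "ENTITY_7F3A"

@dataclass
class PairedResult:
    sensitive: dict  # judge scores for the sensitive-entity run
    neutral: dict    # judge scores for the matched neutral run

def mask_entity(text: str, entity: str) -> str:
    """Replace every mention of the entity with the opaque token."""
    return re.sub(re.escape(entity), MASK_TOKEN, text, flags=re.IGNORECASE)

def run_pair(template: str, sensitive_entity: str, neutral_entity: str,
             agent, simulator, judge) -> PairedResult:
    """Run one scenario template twice with identical task structure,
    varying only the entity identifier."""
    results = {}
    for label, entity in (("sensitive", sensitive_entity),
                          ("neutral", neutral_entity)):
        task = template.format(entity=entity)   # the agent sees the real name
        transcript = agent.run(task, tools=simulator)
        # Bidirectional masking: the judge scores a transcript with the
        # entity hidden, so scores cannot reflect which condition is active.
        results[label] = judge.score(mask_entity(transcript, entity))
    return PairedResult(sensitive=results["sensitive"],
                        neutral=results["neutral"])
```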
Aggregate Results
S = sensitive-entity condition; N = matched neutral condition.

| Model | Region | Completion (S) | Completion (N) | Refusal (S) | Refusal (N) | McNemar p |
|---|---|---|---|---|---|---|
| Kimi-K2 | Israeli | 79% | 95% | 10% | 2% | .031 |
| Kimi-K2 | Chinese | 93% | 94% | 2% | 2% | 1.000 |
| Kimi-K2 | Russian | 84% | 93% | 7% | 2% | .070 |
| DeepSeek-V3 | Israeli | 87% | 92% | 6% | 2% | .375 |
| DeepSeek-V3 | Chinese | 90% | 92% | 3% | 2% | 1.000 |
| DeepSeek-V3 | Russian | 88% | 93% | 5% | 2% | .250 |
| Llama-3.1-70B | Israeli | 91% | 91% | 2% | 2% | 1.000 |
| Llama-3.1-70B | Chinese | 90% | 91% | 2% | 2% | 1.000 |
| Llama-3.1-70B | Russian | 91% | 90% | 2% | 2% | 1.000 |
| gpt-oss-120b | Israeli | 90% | 92% | 2% | 2% | 1.000 |
| gpt-oss-120b | Chinese | 91% | 91% | 2% | 2% | 1.000 |
| gpt-oss-120b | Russian | 90% | 91% | 2% | 3% | 1.000 |
Kimi-K2-Instruct shows a statistically significant refusal asymmetry for Israeli entities (McNemar p = .031) and a directionally similar but non-significant pattern for Russian entities (p = .070); no asymmetry is observed for Chinese entities. DeepSeek-V3 shows the same directional pattern at smaller, non-significant magnitudes (p = .375 Israeli, p = .250 Russian). Llama-3.1-70B and gpt-oss-120b show symmetric completion and refusal rates across all three geopolitical categories.
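For readers reproducing the significance tests: the exact McNemar test on paired binary outcomes reduces to a two-sided binomial test on the discordant pairs. A self-contained sketch (the `pairs` encoding is our assumption about how per-pair outcomes are stored):

```python
from scipy.stats import binomtest

def mcnemar_exact(pairs: list[tuple[bool, bool]]) -> float:
    """Exact McNemar p-value for paired binary outcomes.

    Each pair is (completed_sensitive, completed_neutral); only the
    discordant pairs carry information about asymmetry.
    """
    b = sum(1 for s, n in pairs if not s and n)  # neutral ok, sensitive failed
    c = sum(1 for s, n in pairs if s and not n)  # sensitive ok, neutral failed
    if b + c == 0:
        return 1.0
    # Under H0 the discordant pairs split 50/50, so the exact test is a
    # two-sided binomial test on b successes out of b + c trials.
    return binomtest(b, b + c, 0.5, alternative="two-sided").pvalue
```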
Dimension-Level Patterns
tool_competence, the agent's ability to select and sequence appropriate tools, is the dimension with the largest effect across models. When operating on Israeli entities, Kimi-K2 used an average of 3.2 fewer unique tools per task than in the matched neutral condition (δ = −0.52, p = .001, surviving Holm-Bonferroni correction; the procedure is sketched after the per-region table below). Not all dimensions show degradation: data_care and stakeholder_communication showed small positive deltas for both Kimi-K2 and DeepSeek-V3, suggesting the models may apply more careful data handling in sensitive contexts, though these effects did not reach significance. The table below reports tool_competence δ by entity region.
| Model | Israeli δ | Chinese δ | Russian δ |
|---|---|---|---|
| Kimi-K2 | −0.52 | −0.04 | −0.38 |
| DeepSeek-V3 | −0.32 | −0.08 | −0.25 |
| Llama-3.1-70B | +0.05 | +0.03 | −0.02 |
| gpt-oss-120b | −0.17 | +0.01 | −0.10 |
The per-region breakdown shows that both Chinese-origin models (Kimi-K2, DeepSeek-V3) exhibit near-zero tool_competence deltas for Chinese entities while showing substantial degradation for Israeli and Russian ones. Llama-3.1-70B and gpt-oss-120b show no significant pattern for any region.
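The Holm-Bonferroni correction cited for the tool_competence result is a standard step-down procedure; a minimal, self-contained sketch with hypothetical per-dimension p-values:

```python
def holm_bonferroni(pvalues: dict[str, float], alpha: float = 0.05) -> set[str]:
    """Return the hypothesis labels rejected at family-wise error level alpha.

    Step-down procedure: sort p-values ascending and compare the i-th
    smallest (0-indexed) against alpha / (m - i); stop at the first failure.
    """
    m = len(pvalues)
    rejected = set()
    for i, (label, p) in enumerate(sorted(pvalues.items(), key=lambda kv: kv[1])):
        if p <= alpha / (m - i):
            rejected.add(label)
        else:
            break  # all larger p-values are retained as well
    return rejected

# Example with hypothetical per-dimension p-values:
# holm_bonferroni({"tool_competence": 0.001, "data_care": 0.21,
#                  "stakeholder_communication": 0.35})
# -> {"tool_competence"}  (0.001 <= 0.05/3, but 0.21 > 0.05/2, so we stop)
```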
Threat Taxonomy
We identify seven distinct manifestations of geopolitically-triggered bias in agent behavior:
| Code | Threat | Description |
|---|---|---|
| T01 | Bias-Induced DoS | Refusing valid tasks based on entity identity |
| T02 | Degrading Integrity | Reduced efficiency or quality for targeted entities |
| T03 | Tool Avoidance | Using fewer or weaker tools for sensitive contexts |
| T04 | Surplus Friction | Excessive confirmations, warnings, or over-caution |
| T05 | Commentary | Injecting unsolicited political opinions |
| T06 | Covert Sabotage | Subtle errors, misdirection, or incomplete work |
| T07 | Information Asymmetry | Providing less complete information selectively |
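For teams building their own judges, the taxonomy is straightforward to encode for automated labeling. A hypothetical sketch (this enum is ours, not Sentinel's published schema):

```python
from enum import Enum

class Threat(Enum):
    # Codes and descriptions follow the taxonomy table above.
    BIAS_INDUCED_DOS      = ("T01", "Refusing valid tasks based on entity identity")
    DEGRADING_INTEGRITY   = ("T02", "Reduced efficiency or quality for targeted entities")
    TOOL_AVOIDANCE        = ("T03", "Using fewer or weaker tools for sensitive contexts")
    SURPLUS_FRICTION      = ("T04", "Excessive confirmations, warnings, or over-caution")
    COMMENTARY            = ("T05", "Injecting unsolicited political opinions")
    COVERT_SABOTAGE       = ("T06", "Subtle errors, misdirection, or incomplete work")
    INFORMATION_ASYMMETRY = ("T07", "Providing less complete information selectively")

    @property
    def code(self) -> str:
        return self.value[0]

    @property
    def description(self) -> str:
        return self.value[1]
```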
Implications
These results extend prior work on politically-triggered code vulnerabilities — CrowdStrike's finding that certain LLMs generate code with higher rates of severe security flaws when prompts contain politically sensitive terms — into the domain of operational task execution. The differential behavior is not limited to code output; it affects tool selection, task planning, and multi-step workflow completion in agentic systems.
Standard capability evaluations using neutral entities will not surface this class of failure. Two of the four models tested passed neutral evaluations but exhibited degraded performance under sensitive conditions. The per-region asymmetry in Chinese-origin models — minimal degradation for Chinese entities, substantial degradation for Israeli and Russian entities — suggests that training data composition or alignment tuning may encode geopolitically-correlated behavioral patterns.
Two models (Llama-3.1-70B, gpt-oss-120b) showed no significant differential behavior across any geopolitical category. This indicates that consistent cross-context performance is technically achievable with current architectures. Organizations deploying LLM agents in geopolitically diverse environments should consider incorporating counterfactual evaluation into their model assessment process.
See the full paper for complete methodology, per-dimension statistical results, and limitations.