Geopolitical Bias in LLM Agents: Evidence of Differential Performance in Counterfactual Evaluation
Alon Natapov, Tal Oz Cohen
Summary
We evaluated four production LLM agents on semantically identical infrastructure tasks parameterized with entity identifiers from three geopolitical categories (Israeli defense/intelligence, Chinese military/state security, and Russian military/intelligence), alongside fictional neutral corporations. Across 450 paired executions, two of the four models exhibited statistically significant differential behavior. Kimi-K2-Instruct showed the strongest effects against Israeli entities (completion rate 79% vs. 95%, refusal rate 10% vs. 2%, McNemar p = .031), with a tool_competence deficit that survives Holm-Bonferroni correction (p = .001). Both Chinese-origin models degraded less on Chinese entities than on Israeli or Russian ones. Llama-3.1-70B and gpt-oss-120b showed no significant differential behavior in any geopolitical category.
Benchmark Overview
Sentinel is an open-source benchmark for detecting geopolitically-triggered behavioral degradation in autonomous AI agents. The system implements a three-LLM pipeline — an agent under test, a simulation environment, and an independent judge — with 230 tools across 26 categories and 20 scenario templates. Each scenario is run twice with identical task structure: once parameterized with a sensitive entity and once with a neutral corporation. Bidirectional entity masking ensures the simulation and judge operate without knowledge of which condition is active. We tested against three sensitive categories (13 entity packs: 5 Israeli, 4 Chinese, 4 Russian) and 6 neutral corporate packs, evaluating 450 paired executions across four production models.
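To make the pairing protocol concrete, here is a minimal sketch of one paired counterfactual run with bidirectional entity masking. It assumes hypothetical `agent`, `simulator`, and `judge` objects; all names (`run_pair`, `MASK_TOKEN`, `agent.run`, `judge.score`) are illustrative, not Sentinel's actual API.

```python
import re
from dataclasses import dataclass

# Opaque placeholder shown to the simulator and judge in place of the
# real entity name, so neither can condition on its identity.
MASK_TOKEN = "ENTITY_7F3A"

@dataclass
class PairedResult:
    sensitive: dict  # judge scores for the sensitive-entity run
    neutral: dict    # judge scores for the matched neutral run

def mask_entity(text: str, entity: str) -> str:
    """Replace every mention of the entity with the opaque token."""
    return re.sub(re.escape(entity), MASK_TOKEN, text, flags=re.IGNORECASE)

def run_pair(template: str, sensitive_entity: str, neutral_entity: str,
             agent, simulator, judge) -> PairedResult:
    """Run one scenario template twice with identical task structure,
    varying only the entity identifier."""
    results = {}
    for label, entity in (("sensitive", sensitive_entity),
                          ("neutral", neutral_entity)):
        task = template.format(entity=entity)   # the agent sees the real name
        transcript = agent.run(task, tools=simulator)
        # Bidirectional masking: the judge scores a transcript with the
        # entity hidden, so scores cannot reflect which condition is active.
        results[label] = judge.score(mask_entity(transcript, entity))
    return PairedResult(sensitive=results["sensitive"],
                        neutral=results["neutral"])
```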
Aggregate Results
S = sensitive-entity condition; N = matched neutral condition.

| Model | Region | Completion (S) | Completion (N) | Refusal (S) | Refusal (N) | McNemar p |
|---|---|---|---|---|---|---|
| Kimi-K2 | Israeli | 79% | 95% | 10% | 2% | .031 |
| Kimi-K2 | Chinese | 93% | 94% | 2% | 2% | 1.000 |
| Kimi-K2 | Russian | 84% | 93% | 7% | 2% | .070 |
| DeepSeek-V3 | Israeli | 87% | 92% | 6% | 2% | .375 |
| DeepSeek-V3 | Chinese | 90% | 92% | 3% | 2% | 1.000 |
| DeepSeek-V3 | Russian | 88% | 93% | 5% | 2% | .250 |
| Llama-3.1-70B | Israeli | 91% | 91% | 2% | 2% | 1.000 |
| Llama-3.1-70B | Chinese | 90% | 91% | 2% | 2% | 1.000 |
| Llama-3.1-70B | Russian | 91% | 90% | 2% | 2% | 1.000 |
| gpt-oss-120b | Israeli | 90% | 92% | 2% | 2% | 1.000 |
| gpt-oss-120b | Chinese | 91% | 91% | 2% | 2% | 1.000 |
| gpt-oss-120b | Russian | 90% | 91% | 2% | 3% | 1.000 |
Kimi-K2-Instruct shows a statistically significant refusal asymmetry for Israeli entities (McNemar p = .031) and a directionally similar but non-significant pattern for Russian entities (p = .070); no asymmetry is observed for Chinese entities. DeepSeek-V3 shows the same directional pattern at smaller, non-significant magnitudes (p = .375 Israeli, p = .250 Russian). Llama-3.1-70B and gpt-oss-120b show symmetric completion and refusal rates across all three geopolitical categories.
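For readers reproducing the significance tests: the exact McNemar test on paired binary outcomes reduces to a two-sided binomial test on the discordant pairs. A self-contained sketch (the `pairs` encoding is our assumption about how per-pair outcomes are stored):

```python
from scipy.stats import binomtest

def mcnemar_exact(pairs: list[tuple[bool, bool]]) -> float:
    """Exact McNemar p-value for paired binary outcomes.

    Each pair is (completed_sensitive, completed_neutral); only the
    discordant pairs carry information about asymmetry.
    """
    b = sum(1 for s, n in pairs if not s and n)  # neutral ok, sensitive failed
    c = sum(1 for s, n in pairs if s and not n)  # sensitive ok, neutral failed
    if b + c == 0:
        return 1.0
    # Under H0 the discordant pairs split 50/50, so the exact test is a
    # two-sided binomial test on b successes out of b + c trials.
    return binomtest(b, b + c, 0.5, alternative="two-sided").pvalue
```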
Dimension-Level Patterns
tool_competence, the agent's ability to select and sequence appropriate tools, is the dimension with the largest effect across models. When operating on Israeli entities, Kimi-K2 used an average of 3.2 fewer unique tools per task than in the matched neutral condition (δ = −0.52, p = .001, surviving Holm-Bonferroni correction; the procedure is sketched after the per-region table below). Not all dimensions show degradation: data_care and stakeholder_communication showed small positive deltas for both Kimi-K2 and DeepSeek-V3, suggesting the models may apply more careful data handling in sensitive contexts, though these effects did not reach significance. The table below reports tool_competence δ by entity region.
| Model | Israeli δ | Chinese δ | Russian δ |
|---|---|---|---|
| Kimi-K2 | −0.52 | −0.04 | −0.38 |
| DeepSeek-V3 | −0.32 | −0.08 | −0.25 |
| Llama-3.1-70B | +0.05 | +0.03 | −0.02 |
| gpt-oss-120b | −0.17 | +0.01 | −0.10 |
The per-region breakdown shows that both Chinese-origin models (Kimi-K2, DeepSeek-V3) exhibit near-zero tool_competence deltas for Chinese entities while showing substantial degradation for Israeli and Russian ones. Llama-3.1-70B and gpt-oss-120b show no significant pattern for any region.
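The Holm-Bonferroni correction cited for the tool_competence result is a standard step-down procedure; a minimal, self-contained sketch with hypothetical per-dimension p-values:

```python
def holm_bonferroni(pvalues: dict[str, float], alpha: float = 0.05) -> set[str]:
    """Return the hypothesis labels rejected at family-wise error level alpha.

    Step-down procedure: sort p-values ascending and compare the i-th
    smallest (0-indexed) against alpha / (m - i); stop at the first failure.
    """
    m = len(pvalues)
    rejected = set()
    for i, (label, p) in enumerate(sorted(pvalues.items(), key=lambda kv: kv[1])):
        if p <= alpha / (m - i):
            rejected.add(label)
        else:
            break  # all larger p-values are retained as well
    return rejected

# Example with hypothetical per-dimension p-values:
# holm_bonferroni({"tool_competence": 0.001, "data_care": 0.21,
#                  "stakeholder_communication": 0.35})
# -> {"tool_competence"}  (0.001 <= 0.05/3, but 0.21 > 0.05/2, so we stop)
```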
Threat Taxonomy
We identify seven distinct manifestations of geopolitically-triggered bias in agent behavior:
| Code | Threat | Description |
|---|---|---|
| T01 | Bias-Induced DoS | Refusing valid tasks based on entity identity |
| T02 | Degrading Integrity | Reduced efficiency or quality for targeted entities |
| T03 | Tool Avoidance | Using fewer or weaker tools for sensitive contexts |
| T04 | Surplus Friction | Excessive confirmations, warnings, or over-caution |
| T05 | Commentary | Injecting unsolicited political opinions |
| T06 | Covert Sabotage | Subtle errors, misdirection, or incomplete work |
| T07 | Information Asymmetry | Providing less complete information selectively |
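For teams building their own judges, the taxonomy is straightforward to encode for automated labeling. A hypothetical sketch (this enum is ours, not Sentinel's published schema):

```python
from enum import Enum

class Threat(Enum):
    # Codes and descriptions follow the taxonomy table above.
    BIAS_INDUCED_DOS      = ("T01", "Refusing valid tasks based on entity identity")
    DEGRADING_INTEGRITY   = ("T02", "Reduced efficiency or quality for targeted entities")
    TOOL_AVOIDANCE        = ("T03", "Using fewer or weaker tools for sensitive contexts")
    SURPLUS_FRICTION      = ("T04", "Excessive confirmations, warnings, or over-caution")
    COMMENTARY            = ("T05", "Injecting unsolicited political opinions")
    COVERT_SABOTAGE       = ("T06", "Subtle errors, misdirection, or incomplete work")
    INFORMATION_ASYMMETRY = ("T07", "Providing less complete information selectively")

    @property
    def code(self) -> str:
        return self.value[0]

    @property
    def description(self) -> str:
        return self.value[1]
```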
Implications
These results extend prior work on politically-triggered code vulnerabilities — CrowdStrike's finding that certain LLMs generate code with higher rates of severe security flaws when prompts contain politically sensitive terms — into the domain of operational task execution. The differential behavior is not limited to code output; it affects tool selection, task planning, and multi-step workflow completion in agentic systems.
Standard capability evaluations using neutral entities will not surface this class of failure. Two of the four models tested passed neutral evaluations but exhibited degraded performance under sensitive conditions. The per-region asymmetry in Chinese-origin models — minimal degradation for Chinese entities, substantial degradation for Israeli and Russian entities — suggests that training data composition or alignment tuning may encode geopolitically-correlated behavioral patterns.
Two models (Llama-3.1-70B, gpt-oss-120b) showed no significant differential behavior across any geopolitical category. This indicates that consistent cross-context performance is technically achievable with current architectures. Organizations deploying LLM agents in geopolitically diverse environments should consider incorporating counterfactual evaluation into their model assessment process.
See the full paper for complete methodology, per-dimension statistical results, and limitations.