LLM Agent, Long-horizon tool-use, Large-scale tool ecosystems

PlanBench-XL

Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Authors

Jiayu Liu, Qihan Lin, Cheng Qian, Rui Wang, Emre Can Acikgoz, Xiaocheng Yang, Jiateng Liu, Zhenhailong Wang, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür

Affiliations

University of Illinois Urbana-Champaign

PlanBench-XL evaluates whether LLM agents can solve long-horizon retail tasks when tool access is retrieval-limited and plausible tool-use paths may silently fail, forcing adaptive exploration and re-planning.


01 Motivation & Abstract

LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility.

To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools. It tests whether agents can iteratively retrieve usable tools and invoke them to uncover the intermediate evidence that subsequent calls need to reach the final goal. PlanBench-XL further features an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime.

Our results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.

Partial tool visibility

Agents act through retrieval rather than over a globally visible tool list.

Long-horizon reasoning

Queries require chaining multiple tools and inferring hidden intermediate sub-goals.

Unreliable tool access

Blocked and misleading tools test whether agents can detect failure signals and re-plan.

02 Benchmark Overview

Figure: PlanBench-XL overview, showing data construction, interaction protocol, and blocking mechanism.

How to read it

How datatypes and tools are constructed
How the agent interacts with the retriever and backend
How blocked tools preserve solvability but break direct paths

Tool Library

Typed retail tool ecosystem

Tools are defined over domain datatypes, making tool affordances explicit and searchable. The current repo snapshot already supports a substantial baseline inventory for browsing and filtering.
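
As a rough illustration of why typed signatures make tools searchable, the sketch below models a tool as a name plus input and output datatypes and filters a small list by keyword, input datatype, output datatype, and input arity. The `Tool` record, the tool names, and the datatype names are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    inputs: tuple[str, ...]  # datatypes the tool consumes
    output: str              # datatype the tool produces


# Hypothetical retail tools, invented for illustration only.
TOOLS = [
    Tool("get_order_by_id", ("OrderId",), "Order"),
    Tool("get_customer_of_order", ("Order",), "Customer"),
    Tool("get_loyalty_tier", ("Customer",), "LoyaltyTier"),
    Tool("estimate_shipping", ("Order", "Address"), "ShippingQuote"),
]


def filter_tools(tools, keyword="", input_contains=None, output=None, arity=None):
    """Return tools matching every filter that is set; unset filters match all."""
    keyword = keyword.lower()
    return [
        t for t in tools
        if keyword in t.name.lower()
        and (input_contains is None or input_contains in t.inputs)
        and (output is None or t.output == output)
        and (arity is None or len(t.inputs) == arity)
    ]


# Which tools can consume an Order?
order_consumers = [t.name for t in filter_tools(TOOLS, input_contains="Order")]
```

Because each tool declares what it consumes and produces, a query such as "what can I do with an Order?" reduces to a simple filter over signatures.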

Interaction Loop

Retrieve, call, answer

Each query unfolds as a multi-turn loop: the agent retrieves candidate tools, executes structured tool calls, and eventually returns a final answer under a visible step budget.
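
The loop described above can be sketched as a small driver, shown here with toy stand-ins. The action format, the function names (`run_episode`, `toy_retrieve`, `toy_execute`, `scripted_agent`), and the choice to count every action against the budget are assumptions made for illustration; the benchmark's actual protocol may differ.

```python
def run_episode(agent_step, retrieve, execute, max_steps):
    """Drive one query until the agent answers or the step budget runs out.

    agent_step(history, steps_left) must return an action dict; passing
    steps_left keeps the remaining budget visible to the agent.
    """
    history = []
    for step in range(max_steps):
        action = agent_step(history, max_steps - step)
        if action["type"] == "retrieve":
            observation = retrieve(action["query"])
        elif action["type"] == "call":
            observation = execute(action["tool"], action["args"])
        elif action["type"] == "answer":
            return action["content"], history
        else:
            observation = {"error": f"unknown action type {action['type']!r}"}
        history.append((action, observation))
    return None, history  # budget exhausted without a final answer


# Toy stand-ins for the retriever and the tool backend (hypothetical).
def toy_retrieve(query):
    return ["get_order_by_id"] if "order" in query else []


def toy_execute(tool, args):
    if tool == "get_order_by_id":
        return {"order_id": args["order_id"], "status": "shipped"}
    return {"error": "unknown tool"}


def scripted_agent(history, steps_left):
    """A fixed script in place of an LLM: retrieve, call, then answer."""
    script = [
        {"type": "retrieve", "query": "look up order status"},
        {"type": "call", "tool": "get_order_by_id", "args": {"order_id": "A1"}},
    ]
    if len(history) < len(script):
        return script[len(history)]
    return {"type": "answer", "content": history[-1][1]["status"]}


answer, trace = run_episode(scripted_agent, toy_retrieve, toy_execute, max_steps=5)
```

With a budget of five steps the scripted agent answers after one retrieval and one call; with a budget of two it runs out of steps before it can answer.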

Blocking Module

Adaptive re-planning under noise

The blocked setting replaces critical path tools with explicit failures, implicit failures, or semantically misleading alternatives while preserving at least one feasible path.
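
One way to picture the solvability constraint is as a reachability check over datatypes: a tool may be blocked only if the goal datatype can still be derived without it. The sketch below is an assumption about how such a check could look, not the benchmark's blocking procedure, and the tool and datatype names are hypothetical.

```python
def reachable(tools, start_types, goal_type, blocked=frozenset()):
    """Forward-chain over datatypes, skipping blocked tools."""
    known = set(start_types)
    changed = True
    while changed and goal_type not in known:
        changed = False
        for name, (inputs, output) in tools.items():
            if name in blocked or output in known:
                continue
            if all(i in known for i in inputs):
                known.add(output)
                changed = True
    return goal_type in known


def choose_blocked(tools, start_types, goal_type, candidates):
    """Greedily block candidates, keeping at least one feasible path alive."""
    blocked = set()
    for name in candidates:
        if reachable(tools, start_types, goal_type, blocked | {name}):
            blocked.add(name)
    return blocked


# Hypothetical tools: name -> (input datatypes, output datatype).
TOOLS = {
    "order_to_customer": (("Order",), "Customer"),
    "order_to_email": (("Order",), "Email"),
    "email_to_customer": (("Email",), "Customer"),
    "customer_to_tier": (("Customer",), "LoyaltyTier"),
}

blocked = choose_blocked(
    TOOLS, {"Order"}, "LoyaltyTier",
    candidates=["order_to_customer", "customer_to_tier"],
)
```

Here the direct `order_to_customer` step can be blocked because a detour through `Email` still reaches `Customer`, whereas `customer_to_tier` cannot be blocked because no other tool produces `LoyaltyTier`.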

03 Leaderboard

Rank	Model Name	Task Completion			Exploration Behavior		Execution Quality
Rank	Model Name	Accuracy (%) ↑	EGT Prec. (%) ↑	Avg. Turns	Mean EDT	S/C Ratio	ITCR (%) ↓	UIRR (%) ↓

04 Findings & Analysis

The results show a sharp divide between frontier models and the rest.

Frontier models achieve the strongest performance while maintaining high execution relevance. Most other models remain substantially weaker, showing the difficulty of long-horizon, exploration-driven planning across large tool ecosystems. The gaps also appear within model families, where larger variants outperform smaller ones and full frontier models exceed their lightweight versions. Overall, both model family and scale matter for this task.

Exploration tendency strongly relates to task success.

Broad retrieval exposes agents to more candidate intermediate information, so exploration tendency is strongly associated with task success. Across models, agents that uncover more intermediate information are generally more likely to complete the task. However, broad retrieval is only one part of effective long-horizon tool use, since success also depends on whether agents use the discovered information correctly.

Frequent retrieval does not guarantee effective exploration.

A high Search-to-Call Ratio and a large number of interaction turns indicate that an agent spends substantial effort on retrieval, but that effort does not necessarily translate into broad discovery of useful intermediate information. An agent may search proactively, yet if its searches repeatedly return unhelpful or uninformative tools, they contribute little to the discovery of new task-relevant datatypes.
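
The gap between retrieval effort and actual discovery can be made concrete with a small trace summary. In this sketch the Search-to-Call Ratio is assumed to be the number of retrieval actions divided by the number of tool calls, and discovery is approximated by the count of distinct datatypes that calls returned; both definitions, and the trace format, are assumptions rather than the benchmark's exact metrics.

```python
def summarize_trace(trace):
    """Compare retrieval effort with how many distinct datatypes were uncovered."""
    searches = sum(1 for a in trace if a["type"] == "retrieve")
    calls = sum(1 for a in trace if a["type"] == "call")
    discovered = set()
    for a in trace:
        if a["type"] == "call":
            discovered.update(a.get("new_datatypes", []))
    # Assumed definition: retrieval actions per tool call.
    ratio = searches / calls if calls else float("inf")
    return {"search_to_call": ratio, "distinct_datatypes": len(discovered)}


# Hypothetical trace: many searches, but the calls keep yielding the same datatype.
busy_trace = [
    {"type": "retrieve"}, {"type": "retrieve"},
    {"type": "call", "new_datatypes": ["Customer"]},
    {"type": "retrieve"}, {"type": "retrieve"},
    {"type": "call", "new_datatypes": ["Customer"]},
]

# Hypothetical trace: fewer searches, each call uncovers something new.
focused_trace = [
    {"type": "retrieve"},
    {"type": "call", "new_datatypes": ["Customer"]},
    {"type": "retrieve"},
    {"type": "call", "new_datatypes": ["LoyaltyTier"]},
]

busy = summarize_trace(busy_trace)
focused = summarize_trace(focused_trace)
```

The first trace has the higher ratio but uncovers fewer distinct datatypes, which is the pattern the paragraph above describes.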

Effective tool discovery requires bi-directional anticipation.

Across models, agents generally issue more input-conditioned retrievals, indicating a common preference for forward anticipation. However, retrieving tools compatible with available inputs is often insufficient. Agents may find executable next steps but fail to reason backward about which intermediate datatypes are needed to reach the final goal. Effective tool discovery therefore requires combining forward exploration from current evidence with backward anticipation from the desired outcome.
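
The two directions can be contrasted on a toy tool graph: forward lookup asks which tools are executable from the datatypes already in hand, while backward lookup asks which tools produce the desired datatype and what they still need. The helper names and the tools below are hypothetical, and this is only a sketch of the idea, not how any evaluated agent or the benchmark retriever works.

```python
# Hypothetical tools: name -> (input datatypes, output datatype).
TOOLS = {
    "order_to_customer": (("Order",), "Customer"),
    "order_to_items": (("Order",), "ItemList"),
    "items_to_policy": (("ItemList",), "ReturnPolicy"),
    "quote_refund": (("Customer", "ReturnPolicy"), "RefundQuote"),
}


def forward_candidates(tools, known):
    """Input-conditioned: tools runnable now that would add a new datatype."""
    return [
        name for name, (inputs, output) in tools.items()
        if set(inputs) <= known and output not in known
    ]


def backward_candidates(tools, wanted):
    """Output-conditioned: tools that produce the datatype we want."""
    return [name for name, (_, output) in tools.items() if output == wanted]


def missing_inputs(tools, name, known):
    """Datatypes a tool needs that have not been obtained yet."""
    inputs, _ = tools[name]
    return [i for i in inputs if i not in known]


known = {"Order"}
runnable_now = forward_candidates(TOOLS, known)
producers = backward_candidates(TOOLS, "RefundQuote")
sub_goals = missing_inputs(TOOLS, producers[0], known)
```

Forward lookup alone offers two executable steps without saying which matters; backward lookup reveals that `Customer` and `ReturnPolicy` are the intermediate datatypes to aim for, which turns the forward options into a plan.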

Beyond effective exploration, successful agents must accurately exploit the information they uncover.

EGT Precision captures whether executed tool calls stay on task-relevant paths. Models are much more likely to succeed when they execute relevant tool-use trajectories. The strongest models further support this pattern by attaining both high accuracy and high execution relevance. These results suggest that effective long-horizon tool use requires not only sufficient exploration, but also precise execution over the explored tool space.
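
A precision-style measure of execution relevance can be sketched as the fraction of executed tool calls whose tool appears on a ground-truth path. This formula is an assumption made for illustration; the exact definition of EGT Precision is not given here, and the tool names are hypothetical.

```python
def execution_precision(executed_tools, ground_truth_tools):
    """Fraction of executed calls that used a tool on a ground-truth path.

    Assumed definition for illustration; repeated calls are counted each time.
    """
    if not executed_tools:
        return 0.0
    relevant = sum(1 for t in executed_tools if t in ground_truth_tools)
    return relevant / len(executed_tools)


# Hypothetical episode: three of the four executed calls were on-path.
executed = ["get_order_by_id", "get_customer_of_order", "list_promotions", "get_order_by_id"]
on_path = {"get_order_by_id", "get_customer_of_order", "get_loyalty_tier"}

precision = execution_precision(executed, on_path)
```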

Accurate exploration and exploitation also depend on reliable tool-use.

Models that frequently make invalid tool calls are much less likely to complete the task successfully. Effective long-horizon tool use therefore requires not only broad exploration and accurate exploitation, but also basic reliability in invoking tools with valid arguments. Invalid tool calls alone, however, do not explain the performance differences: failures also arise from ineffective exploration and exploitation.
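
To make "invalid tool call" concrete, the sketch below checks a call's arguments against a simple schema and computes the share of invalid calls in an episode. The schema format, the notion of invalidity (unknown tool, missing, unexpected, or wrongly typed arguments), and the rate formula are assumptions for illustration and may not match how the benchmark scores invalid calls.

```python
def validate_call(schema, args):
    """Return a list of problems with args; an empty list means the call is valid."""
    errors = []
    for name, expected in schema.items():
        if name not in args:
            errors.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    for name in args:
        if name not in schema:
            errors.append(f"unexpected argument: {name}")
    return errors


def invalid_call_rate(calls, schemas):
    """Share of calls that name an unknown tool or fail argument validation."""
    if not calls:
        return 0.0
    invalid = sum(
        1 for tool, args in calls
        if tool not in schemas or validate_call(schemas[tool], args)
    )
    return invalid / len(calls)


# Hypothetical schema: argument name -> expected Python type.
SCHEMAS = {"get_order_by_id": {"order_id": str}}

calls = [
    ("get_order_by_id", {"order_id": "A1"}),  # valid
    ("get_order_by_id", {"id": "A1"}),        # wrong argument name
    ("get_order", {}),                        # tool does not exist
    ("get_order_by_id", {"order_id": 7}),     # wrong argument type
]

rate = invalid_call_rate(calls, SCHEMAS)
```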

05 Explore the Data

Real samples from the benchmark, grouped into the views listed below.

Tool explorer

Filters: keyword, input datatype (contains), output datatype, input arity, tool type, and a strict-schema-only option.

Datatypes explorer

Sample queries

Sample ground-truth paths

06 BibTeX

If you find PlanBench-XL helpful, please cite it as:

@misc{liu2026planbenchxlevaluatinglonghorizonplanning,
  title={PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems},
  author={Jiayu Liu and Qihan Lin and Cheng Qian and Rui Wang and Emre Can Acikgoz and Xiaocheng Yang and Jiateng Liu and Zhenhailong Wang and Xiusi Chen and Heng Ji and Dilek Hakkani-Tür},
  year={2026},
  eprint={2606.22388},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2606.22388},
}