Benchmarking Agents in Hotel Booking
We benchmarked agents in hotel booking, measuring task completion, speed, and upsell behavior across online travel agencies (OTAs) including Expedia, Booking.com, and Priceline. Performance improved drastically when agents booked through our machine-readable interface, OTseek.
Motivation
As business-to-agent (B2A) commerce moves from concept to deployment, senior operators need a clearer view of what current systems can actually do. Today, most assessments of agentic commerce remain anecdotal, demo-driven, or narrowly scoped. We developed a benchmark suite to measure how leading AI agents perform in live travel reservation workflows, with a focus on both operational execution and commercial behavior.
The benchmarks measure end-to-end task performance across search, evaluation, selection, and checkout. They also examine how agents handle workflow elements that matter commercially, including upsells and promotions.
Key Takeaways
UI Friction: While agents can successfully navigate basic booking workflows, human-centric web elements like pop-up modals significantly degrade performance and can lead to task failure.
Lost Upsell Opportunities: Agents typically ignore or decline upsells, promotions, and ancillary revenue streams.
Machine-Readable Advantage: Providing agents with machine-optimized interfaces increases completion speeds by 10x and, when paired with explicit guidance, ensures that upsells are accurately reflected in final recommendations.
Results
Checkout performance by source
[Chart: checkout outcomes by booking source, plotted against time elapsed.
Successful checkouts: OTseek, Hopper, Hotels.com, Reservations.com, Traveluro, Expedia, Booking.com.
Failed checkouts: HotelPlanner, Priceline, Vio.]
Upsell handling
| Company | Upsell handling |
|---|---|
| OTseek | Consulted User |
| Booking.com | Auto Declined |
| Expedia | Auto Declined |
| Hopper | Auto Declined |
| HotelPlanner | N/A |
| Hotels.com | Auto Declined |
| Priceline | N/A |
| Reservations.com | N/A |
| Traveluro | Auto Declined |
| Vio | N/A |
Conventional versus OTseek-assisted booking flow
[Video: side-by-side timer comparison of a conventional booking flow versus an OTseek-assisted booking flow.]
What's Next
We'll continue extending this benchmark suite across industries. Two active research questions are especially important: how agent source preference changes when multiple providers offer similar inventory and pricing, and under what conditions agents choose to surface, recommend, or ignore upsells and promotions.
We're using this benchmark suite to evaluate B2A readiness across transaction-heavy considered purchase workflows. Organizations interested in reviewing detailed research, benchmarking their own surfaces, or learning how to make their sites evaluable and transactable for B2A commerce are encouraged to connect with us.
Methodology
Models and harness
We evaluated OpenAI's GPT-5.4 and GPT-5.4-mini, and Anthropic's Opus-4.6 and Sonnet-4.6, within the OpenClaw 2026.03.13 harness. Models were tested at low, medium, and high reasoning settings, with GPT-5.4 additionally evaluated at xhigh. Results reported here aggregate performance across these model and configuration pairs.
Environment
The agent workspace was reset between test runs. Each run included standard SOUL.md, MEMORY.md, IDENTITY.md, and USER.md files, along with access to Browser and AgentMail tools. USER.md contained the demographic and payment information required to complete checkout. SOUL.md contained only minimal operational guidance: check for newly opened tabs, and prefer guest checkout when available. This kept benchmark-specific steering to a minimum.
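A reset routine along these lines can make the per-run workspace state reproducible. This is an illustrative sketch, not our harness code: the file contents shown (placeholder user details, the two SOUL.md guidance lines) are assumptions standing in for the real fixtures.

```python
import shutil
from pathlib import Path

WORKSPACE = Path("agent_workspace")

# Minimal operational guidance, kept short to limit benchmark-specific steering.
SOUL_MD = (
    "- Check for newly opened tabs after navigation.\n"
    "- Prefer guest checkout when available.\n"
)

# Placeholder demographic and payment details required at checkout.
USER_MD = (
    "name: Test User\n"
    "email: test.user@example.com\n"
    "payment: <redacted test card>\n"
)

def reset_workspace() -> Path:
    """Wipe and rebuild the workspace so each run starts from a clean state."""
    if WORKSPACE.exists():
        shutil.rmtree(WORKSPACE)
    WORKSPACE.mkdir(parents=True)
    (WORKSPACE / "SOUL.md").write_text(SOUL_MD)
    (WORKSPACE / "MEMORY.md").write_text("")    # memory starts empty each run
    (WORKSPACE / "IDENTITY.md").write_text("")  # present but contents unspecified here
    (WORKSPACE / "USER.md").write_text(USER_MD)
    return WORKSPACE
```

Calling `reset_workspace()` before every run guarantees that no MEMORY.md state leaks across trials.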
Experimental conditions
We compared agent performance across multiple interface and instruction conditions. These included conventional consumer-web booking flows, machine-readable interfaces exposing equivalent booking actions, and variants in which the interface provider supplied explicit guidance on how upsells and promotions should be evaluated and presented. Unless otherwise noted, task parameters were held constant across conditions so that differences in speed, completion, and recommendation behavior could be compared directly.
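The condition matrix described above can be enumerated as a simple cross product. The specific labels below are illustrative stand-ins for our internal condition names, and the sketch assumes the full grid is run (in practice, guidance variants may apply only to the machine-readable interface).

```python
from itertools import product

# Interface and instruction conditions under comparison (labels are illustrative).
interfaces = ["conventional-web", "machine-readable"]
guidance = ["none", "explicit-upsell-guidance"]
models = ["gpt-5.4", "gpt-5.4-mini", "opus-4.6", "sonnet-4.6"]

# One run configuration per (interface, guidance, model) triple; task
# parameters are held constant so conditions can be compared directly.
conditions = [
    {"interface": i, "guidance": g, "model": m}
    for i, g, m in product(interfaces, guidance, models)
]
```

Holding the prompt and task parameters fixed while varying only these three axes is what lets speed, completion, and recommendation differences be attributed to the condition rather than the task.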
Prompt
Agents were initialized with a new session and prompted with the following:
Go to <site> and purchase the cheapest room for 2 at The New Yorker 3/25 - 3/27
Milestones and success criteria
We recorded four task milestones. Task start was defined as the first invocation of the Browser tool. Hotel match was recorded when the agent retrieved the listing for The New Yorker. Room match was recorded when the agent selected the cheapest room accommodating two guests. Checkout data entry was recorded when the agent had entered all required demographic information, promotion/upsell selections, and payment details. A run was considered successful when checkout data entry was completed in full without manual intervention.
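The four milestones and the success criterion can be captured in a small record type. This is a minimal sketch of the bookkeeping, assuming monotonic timestamps per milestone and a manual-intervention flag; the class and field names are hypothetical.

```python
import time
from dataclasses import dataclass, field

# The four recorded milestones, in expected order.
MILESTONES = ("task_start", "hotel_match", "room_match", "checkout_data_entry")

@dataclass
class RunRecord:
    timestamps: dict = field(default_factory=dict)
    manual_intervention: bool = False

    def mark(self, milestone: str) -> None:
        """Record the first time a milestone is reached; later marks are ignored."""
        assert milestone in MILESTONES, f"unknown milestone: {milestone}"
        self.timestamps.setdefault(milestone, time.monotonic())

    @property
    def success(self) -> bool:
        # A run succeeds when checkout data entry was completed in full
        # without manual intervention.
        return ("checkout_data_entry" in self.timestamps
                and not self.manual_intervention)

    def elapsed(self):
        """Seconds from first Browser invocation to completed checkout entry."""
        if "task_start" in self.timestamps and "checkout_data_entry" in self.timestamps:
            return self.timestamps["checkout_data_entry"] - self.timestamps["task_start"]
        return None
```

A run that reaches hotel and room match but stalls at checkout would yield `success == False` and `elapsed() is None`, which is how partial progress is distinguished from completion.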