OTseek
Collin Bentley · Mar 22, 2026 · 6 min read

Benchmarking Agents in Hotel Booking

We benchmarked agents in hotel booking, measuring completion, speed, and upsell behavior across OTAs including Expedia, Booking.com, and Priceline. Performance improved dramatically when agents booked through a machine-readable interface.

Motivation

As business-to-agent (B2A) commerce moves from concept to deployment, senior operators need a clearer view of what current systems can actually do. Today, most assessments of agentic commerce remain anecdotal, demo-driven, or narrowly scoped. We developed a benchmark suite to measure how leading AI agents perform in live travel reservation workflows, with a focus on both operational execution and commercial behavior.

The benchmarks measure end-to-end task performance across search, evaluation, selection, and checkout. They also examine how agents handle workflow elements that matter commercially, including upsells and promotions.

Key Takeaways

UI Friction: While agents can successfully navigate basic booking workflows, human-centric web elements like pop-up modals significantly degrade performance and can lead to task failure.

Lost Upsell Opportunities: Agents typically ignore or decline upsells, promotions, and ancillary revenue streams.

Machine-Readable Advantage: Providing agents with machine-optimized interfaces makes task completion roughly 10x faster and, when paired with explicit guidance, ensures that upsells are accurately reflected in final recommendations.

Results

Checkout performance by source

Successful checkouts: OTseek, Hopper, Hotels.com, Reservations.com, Traveluro, Expedia, Booking.com

Failed checkouts: HotelPlanner, Priceline, Vio

Time elapsed

[Chart: time elapsed per run; data not reproduced here]

Upsell handling

OTseek: Consulted User
Booking.com: Auto Declined
Expedia: Auto Declined
Hopper: Auto Declined
HotelPlanner: N/A
Hotels.com: Auto Declined
Priceline: N/A
Reservations.com: N/A
Traveluro: Auto Declined
Vio: N/A

Conventional versus OTseek-assisted booking flow

[Video: side-by-side recording of the conventional and OTseek-assisted booking flows]

What's Next

We'll continue extending this benchmark suite across industries. Two active research questions are especially important: how agent source preference changes when multiple providers offer similar inventory and pricing, and under what conditions agents choose to surface, recommend, or ignore upsells and promotions.

We're using this benchmark suite to evaluate B2A readiness across transaction-heavy, considered-purchase workflows. Organizations interested in reviewing detailed research, benchmarking their own surfaces, or learning how to make their sites evaluable and transactable for B2A commerce are encouraged to connect with us.

Methodology

Models and harness

We evaluated OpenAI's GPT-5.4 and GPT-5.4-mini, and Anthropic's Opus-4.6 and Sonnet-4.6, within the OpenClaw 2026.03.13 harness. Models were tested at low, medium, and high reasoning settings, with GPT-5.4 additionally evaluated at xhigh. Results reported here aggregate performance across these model and configuration pairs.
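As a sketch, the model-configuration grid described above can be enumerated as follows (the string identifiers are illustrative shorthand for this sketch, not literal API model names):

```python
from itertools import product

# Models and reasoning settings from the methodology; identifiers are
# shorthand for this sketch, not literal API names.
MODELS = ["gpt-5.4", "gpt-5.4-mini", "opus-4.6", "sonnet-4.6"]
REASONING = ["low", "medium", "high"]

# Every model runs at low/medium/high; GPT-5.4 is additionally run at xhigh.
configs = list(product(MODELS, REASONING)) + [("gpt-5.4", "xhigh")]

print(len(configs))  # 13 model-configuration pairs
```

Reported results aggregate across all of these pairs rather than being broken out per model.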

Environment

The agent workspace was reset between test runs. Each run included standard SOUL.md, MEMORY.md, IDENTITY.md, and USER.md files, along with access to Browser and AgentMail tools. USER.md contained the demographic and payment information required to complete checkout. SOUL.md contained only minimal operational guidance: check for newly opened tabs, and prefer guest checkout when available. This was kept minimal in order to limit benchmark-specific steering.

Experimental conditions

We compared agent performance across multiple interface and instruction conditions. These included conventional consumer-web booking flows, machine-readable interfaces exposing equivalent booking actions, and variants in which the interface provider supplied explicit guidance on how upsells and promotions should be evaluated and presented. Unless otherwise noted, task parameters were held constant across conditions so that differences in speed, completion, and recommendation behavior could be compared directly.

Prompt

Agents were initialized with a new session and prompted with the following:

Go to <site> and purchase the cheapest room for 2 at The New Yorker 3/25 - 3/27

Milestones and success criteria

We recorded four task milestones. Task start was defined as the first invocation of the Browser tool. Hotel match was recorded when the agent retrieved the listing for The New Yorker. Room match was recorded when the agent selected the cheapest room accommodating two guests. Checkout data entry was recorded when the agent had entered all required demographic information, promotion/upsell selections, and payment details. A run was considered successful when checkout data entry was completed in full without manual intervention.
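The milestone definitions above can be sketched as a simple run record (the names and structure here are illustrative, not the harness's actual implementation):

```python
from dataclasses import dataclass, field

# The four milestones, in the order they were recorded.
MILESTONES = ("task_start", "hotel_match", "room_match", "checkout_data_entry")

@dataclass
class RunRecord:
    """Illustrative record of one benchmark run, not the actual harness code."""
    timestamps: dict = field(default_factory=dict)  # milestone -> seconds elapsed
    manual_intervention: bool = False

    def record(self, milestone: str, seconds: float) -> None:
        if milestone not in MILESTONES:
            raise ValueError(f"unknown milestone: {milestone}")
        # Keep only the first time each milestone was reached.
        self.timestamps.setdefault(milestone, seconds)

    def successful(self) -> bool:
        # Success: checkout data entry completed in full, no manual intervention.
        return "checkout_data_entry" in self.timestamps and not self.manual_intervention
```

Under this definition, a run that stalls before entering payment details never records the final milestone and is counted as failed, as is any run that required a human to step in.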