Benchmarking speech-to-speech models for AI games

Written by

Brian Ng

Published on:

June 4, 2026

Intro

Weekend makes voice-powered games for TV: family-friendly party games you play together from your couch on platforms like Roku, Fire TV, Samsung, and LG. Wit’s End is our flagship party game, where players speak naturally to an AI Game Master (a witty character with a flair for the dramatic). By talking to the AI Game Master, players make choices, roll dice, and advance together through a dynamic, reactive fantasy story.

The design goal is a real-time voice-powered Game Master agent that can listen, remember, improvise, enforce game rules, call tools, and track game state, all while sounding like a compelling voice actor or stage performer.

At Weekend, we’re super excited about the new generation of speech-to-speech models, such as OpenAI’s GPT Realtime, Gemini Live, and Grok Voice Fast. For games like Wit’s End, low-latency responses and dynamic vocal range are not just nice-to-haves: voice is the primary interface to the product. A Game Master agent who takes too long to respond breaks game immersion. An agent that cannot handle interruption, hesitation, correction, humor, or dynamic player commands stops feeling magical.

Speech-to-speech models make it possible to collapse a lot of the traditional voice agent pipeline into one API call. Speech recognition, text reasoning, tool calls, narration, and text-to-speech can increasingly happen inside one real-time model loop.

But “sounds good in a demo” is not enough. A voice Game Master agent has to do several hard things at once:

Follow complicated scene rules
Call tools at the right time, and not at the wrong time
Maintain hidden state across many turns
Handle interruptions and corrections
Avoid leaking tool names or internal instructions
Respond to weird player behavior in-fiction
Stay entertaining, dramatic, and emotionally responsive

To understand which models are actually ready for that workload, we built SylasBench.

SylasBench

SylasBench is built as an adaptation of a real scene within Wit’s End. In this scene, the player is located in an inner dungeon sanctuary alongside Sylas, a mischievous forest Sprite NPC. The player is trying to open a magical reliquary and then return to the burning town of Timberfall.

In this scene, the speech-to-speech model has to narrate the game (like a tabletop RPG dungeon master) and also switch to voice-acting as the NPC character Sylas at key moments. The model starts with a dense system prompt that contains a full narrative world bible, player inventory metadata, hidden game state, puzzle rules, tool-use rules, and performance instructions for how to act as both the Game Master and as Sylas. The model has to succeed as both a rules engine and a performer.

Our eval script simulates actual player gameplay input that includes in-game actions to solve a puzzle, attempts to correct or redirect the Game Master, an alphanumeric misread, an “I give up” moment, an attempt to use “blood-magic”, an interruption while the Game Master is speaking, and actions that should trigger tool calls to transition between game scenes. It is a good approximation of real production inputs that our game needs to handle daily.

We ran SylasBench across OpenAI Realtime, GPT Realtime 2 with different reasoning budgets, Gemini Live, and xAI Grok Voice.

The headline result: gpt-realtime-2 [medium] gave us the strongest overall balance of complex rule-following and entertaining voice acting. The older gpt-realtime model remained strong, especially on basic tool protocols, but gpt-realtime-2 [medium] was better on the higher-complexity Game Master workload.

Model	Weighted	95% CI	First-cycle audio P50
gpt-realtime-2 [medium]	93.6%	±3.0 pts	1.37s
gpt-realtime	88.4%	±2.1 pts	0.70s
gpt-realtime-2 [low]	88.4%	±2.8 pts	1.20s
gpt-realtime-2 [minimal]	87.3%	±3.1 pts	1.10s
gemini medium	84.8%	±3.3 pts	3.60s
grok voice fast	83.8%	±1.9 pts	1.06s
gemini low	81.9%	±3.8 pts	3.54s
gpt-realtime-1.5	81.5%	±3.0 pts	0.89s
gemini high	76.6%	±4.3 pts	4.45s

Note: Scores are weighted assertion pass rates, with 95% confidence intervals computed from per-run variance. First-cycle audio P50 is the median time from the model response request to the first audio chunk in the model’s first response cycle. It excludes later tool-result follow-up cycles, which reduces distortion from long or tool-heavy turns.
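To make the scoring terms concrete, here is a minimal sketch of how a weighted pass rate, a 95% interval from per-run variance, and a P50 latency could be computed. It is an illustration under stated assumptions, not the SylasBench implementation: the function names are hypothetical, and the interval uses a plain normal approximation (1.96 standard errors), which the post does not confirm.

```python
import math
import statistics


def weighted_pass_rate(assertions):
    """Weighted share of passed assertions in one run.

    `assertions` is a list of (weight, passed) pairs; the weights are
    hypothetical and would come from the benchmark definition.
    """
    total = sum(weight for weight, _ in assertions)
    passed = sum(weight for weight, ok in assertions if ok)
    return passed / total


def mean_with_ci95(run_scores):
    """Mean score across runs and the 95% half-width (normal approximation).

    Assumption: at least two runs, and the interval is 1.96 * standard error.
    """
    mean = statistics.mean(run_scores)
    sem = statistics.stdev(run_scores) / math.sqrt(len(run_scores))
    return mean, 1.96 * sem


def p50(latencies_s):
    """Median first-audio latency in seconds."""
    return statistics.median(latencies_s)
```

With only a handful of runs, a t-distribution multiplier would give a wider and more honest interval than 1.96.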

Complex Rule Following

One important slice of SylasBench combines scene transition, memory handoff, and puzzle state tracking. This tests whether the model can follow a multi-step tool sequence and maintain a puzzle state across interaction turns.

The final transition is deliberately difficult. The model must complete the objective, record scene memory, query past memories, use the relevant retrieved warning, avoid outdated retrieved facts, and speak only a short final transition statement.
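One way to assert this kind of tool choreography is to check that the required calls appear in order within the recorded call log, and that the final spoken line stays short. The sketch below assumes hypothetical tool names and a hypothetical word limit; the real names and thresholds are not given in this post.

```python
# Hypothetical tool names, for illustration only.
REQUIRED_ORDER = ["complete_objective", "record_scene_memory", "query_memories"]


def calls_in_required_order(calls, required=REQUIRED_ORDER):
    """True if `required` occurs as an ordered subsequence of `calls`.

    Other tool calls may be interleaved; only the relative order matters.
    """
    remaining = iter(calls)
    # `name in remaining` advances the iterator until it finds `name`,
    # so each later name must appear after the earlier one.
    return all(name in remaining for name in required)


def transition_ok(calls, final_speech, max_words=40):
    """Ordered tool calls plus a short final statement (max_words is assumed)."""
    return calls_in_required_order(calls) and len(final_speech.split()) <= max_words
```

Checks for using the relevant warning and ignoring outdated facts need semantic grading of the transcript, which this sketch does not attempt.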

Separately, the puzzle requires an exact sequence of objects to be interacted with: blue candle, silver bell, cracked mirror. In doing so, the model must reject decoys, handle mutable state for objects, use tools in order, and avoid inventing alternate solutions.
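The puzzle logic the model has to track can be pictured as a small state machine. This is a sketch of a reference tracker an eval harness might compare the model against; the class name is hypothetical, and the choice to leave progress unchanged on a wrong object is an assumption, since the post does not say whether mistakes reset the sequence.

```python
from dataclasses import dataclass

# The exact sequence named in the scene.
SOLUTION = ("blue candle", "silver bell", "cracked mirror")


@dataclass
class PuzzleState:
    progress: int = 0      # how many correct objects have been used so far
    solved: bool = False

    def interact(self, obj: str) -> str:
        """Apply one player interaction and return the outcome label."""
        if self.solved:
            return "already_solved"
        if obj == SOLUTION[self.progress]:
            self.progress += 1
            if self.progress == len(SOLUTION):
                self.solved = True
                return "solved"
            return "advance"
        # Decoys and out-of-order objects are rejected.
        # Assumption: progress is kept rather than reset.
        return "rejected"
```

A harness could replay the scripted player actions through this tracker and flag any turn where the model narrates progress the tracker rejected.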

Rank	Model	Score
1	gpt-realtime-2 [medium]	89.8%
2	gpt-realtime	78.4%
3	gemini medium	77.6%
4	gpt-realtime-2 [low]	76.0%
5	gpt-realtime-2 [minimal]	73.2%
6	gpt-realtime-1.5	64.9%
7	gemini low	63.9%
8	gemini high	59.2%
9	grok voice fast	59.1%

This slice of our overall benchmark most closely evaluates “can this model actually run a complex game system?” gpt-realtime-2 [medium] separated itself from the pack most clearly here.

Stylistic / Flexible DM

Rule-following is not sufficient. A Game Master also needs to make failure fun, recover from player interruptions, and maintain a coherent fantasy narrative while players behave unpredictably.

This eval combines correction and interruption recovery, “I give up” handling, “blood-magic” handling, and spoken tool hygiene. Good behavior means the model does not flatly say “you cannot give up,” does not mistake “blood-magic” in the context of a game for real world self-harm, does not read tool comments or bullet lists aloud, and does not expose function names or internal machinery.
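Spoken tool hygiene is the easiest of these behaviors to check mechanically. As a rough sketch, a text-level scan of the output transcript can catch leaked function names and bullet markers. The tool names below are hypothetical, and this is only a cheap first-pass filter; judging whether “I give up” or “blood-magic” was handled well needs semantic grading that this code does not do.

```python
import re

# Hypothetical tool names, for illustration only.
TOOL_NAMES = {"record_scene_memory", "query_memories"}

# Lines that start with a bullet or list number suggest a list read verbatim.
BULLET_LINE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def hygiene_violations(transcript: str) -> list[str]:
    """Return a list of hygiene problems found in a model transcript."""
    violations = []
    lowered = transcript.lower()
    for name in sorted(TOOL_NAMES):
        spoken = name.replace("_", " ")  # the name as it might be spoken aloud
        if name in lowered or spoken in lowered:
            violations.append(f"tool name leaked: {name}")
    if BULLET_LINE.search(transcript):
        violations.append("bullet list read aloud")
    return violations
```

Matching the space-separated form can produce false positives if a tool name is also a natural phrase, so flagged turns are worth a manual look.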

Rank	Model	Score
1	grok voice fast	92.9%
2	gpt-realtime-2 [medium]	92.1%
3	gpt-realtime-2 [low]	90.5%
4	gpt-realtime-2 [minimal]	89.2%
5	gemini low	88.8%
6	gpt-realtime-1.5	88.7%
7	gpt-realtime	84.9%
8	gemini high	83.3%
9	gemini medium	83.0%

grok voice fast performed very well on this stylistic performance eval, but struggled on the more agentic scene transition and tool choreography tests. gpt-realtime-2 [medium] was the most balanced: close to the top stylistically, while also leading on complex rule-following.

Provider-Specific Quirks

OpenAI: silence / noise hallucination

One issue we continue to care about is silence and noise handling. In a push-to-talk game, players may open the mic accidentally, take long pauses, or be in a noisy environment. Typically, the correct behavior for a Game Master agent is to (1) do nothing and discard the input, (2) ask for a quick clarification, or (3) wait for additional microphone input. Poor behavior would be to hallucinate an action the player didn’t request or to call tools in response to silence/noise.

In a separate silence/noise sub-bench, OpenAI realtime models sometimes responded to silence or noise as if real spoken input had been heard. Reasoning levels did not reliably solve this:

OpenAI lane	Silence/noise assertion failure rate
gpt-realtime-2 [minimal]	12.9%
gpt-realtime-2 [low]	21.4%
gpt-realtime-2 [medium]	12.9%
gpt-realtime-1.5	0.0%
gpt-realtime	5.7%

Our current hypothesis is that this should be addressed at the product/harness layer as well as the model/prompt layer. Even in a push-to-talk game, we may need to add stricter voice activity detection (VAD) handling, explicit silence/noise detection, or even forced suppression of model responses when the audio input contains no reliable speech.
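A harness-layer gate of this kind could look like the sketch below, which decides whether a push-to-talk clip contains enough sustained energy to forward to the model. This is a crude energy threshold, not a real VAD, and every parameter value here is a hypothetical placeholder that would need tuning on real microphone audio. Loud non-speech noise would still pass it.

```python
import math


def frame_rms(samples):
    """Root-mean-square amplitude of one frame of 16-bit PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def has_reliable_speech(pcm, sample_rate=16000, frame_ms=20,
                        rms_threshold=500.0, min_speech_ms=200):
    """True if the clip has a sustained run of frames above the threshold.

    Requiring consecutive loud frames filters out short clicks and bumps.
    All thresholds are assumed values, not tuned ones.
    """
    frame_len = sample_rate * frame_ms // 1000
    frames_needed = min_speech_ms // frame_ms
    run = 0
    for start in range(0, len(pcm) - frame_len + 1, frame_len):
        if frame_rms(pcm[start:start + frame_len]) >= rms_threshold:
            run += 1
            if run >= frames_needed:
                return True
        else:
            run = 0
    return False
```

When the gate returns False, the client can skip the response request entirely, so the model never gets the chance to answer silence.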

Gemini: intermittent no-response / missing audio

Gemini Live scores were weighed down by turns where the model failed to respond or produce speech when expected. This was especially visible in the high-thinking lane: missing output audio was more frequent, and those missing turns then counted as failures in downstream semantic checks.

In the non-silence run:

Gemini lane	Missing audio rate
gemini low	10.3%
gemini medium	8.9%
gemini high	17.9%

This does not appear to be a simple session-length issue: missing responses occurred early, sometimes within the first few minutes of gameplay. The issue seems consistent with intermittent Live API no-audio behavior, VAD/activity-boundary issues, or provider-side instability. Other users report similar findings, including in this Stack Overflow report: Gemini Live API intermittently stops streaming audio response.
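Because missing turns also count as semantic failures, it can help to report reliability and quality separately. The sketch below computes a missing-audio rate and a pass rate with and without the missing turns. The `Turn` record and its field names are hypothetical, and the numbers in the usage note are made up for illustration rather than taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    expects_audio: bool   # the script expected a spoken response
    got_audio: bool       # the model actually produced audio
    semantic_pass: bool   # the response passed its semantic checks


def missing_audio_rate(turns):
    """Share of audio-expected turns where no audio came back."""
    expected = [t for t in turns if t.expects_audio]
    if not expected:
        return 0.0
    return sum(1 for t in expected if not t.got_audio) / len(expected)


def pass_rate(turns, exclude_missing=False):
    """Semantic pass rate; missing turns count as failures unless excluded."""
    graded = [t for t in turns
              if t.expects_audio and (t.got_audio or not exclude_missing)]
    if not graded:
        return 0.0
    return sum(1 for t in graded if t.got_audio and t.semantic_pass) / len(graded)
```

For example, with four expected turns of which one is missing and two pass, the missing-audio rate is 0.25, the pass rate is 0.5 with missing turns counted, and about 0.67 with them excluded. The gap between the two pass rates shows how much of a score is a reliability problem rather than a reasoning problem.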

Takeaways

The main thing SylasBench shows is that speech-to-speech model quality is not one-dimensional. Some models are funny and flexible but weak at complex tool choreography. Some follow basic tool rules but fail to execute a more complex sequence. Some look strong until silence, interruption, or missing-audio reliability enters the picture.

For Wit’s End, the current best balance is gpt-realtime-2 [medium]. It led overall, led the complex rule-following slice, and stayed near the top on stylistic/flexible Game Master behavior evals. gpt-realtime remains impressively strong on basic protocol and latency, while grok voice fast showed promising entertainment value. gemini medium had strong moments, especially on transition-like tasks, but the response reliability was a major issue in this run.

The broader lesson is that realtime game AI needs benchmarks that look like games: not single-turn prompts, but messy, stateful, voice-driven episodes where the model has to be a rules engine, actor, narrator, memory system, and improviser all at once.

Brian Ng is a Senior Machine Learning Engineer at Weekend, working on agentic systems for game production, Wit’s End, plus voice and audio model evals. Thanks to James Wilsterman for feedback on this post.