Benchmarking speech-to-speech models for AI games

Written by Weekend Team
Published on: June 4, 2026
4 min read

Intro

Weekend makes voice-powered games for TV: family-friendly party games you play together from your couch on platforms like Roku, Fire TV, Samsung, and LG. Wit’s End is our flagship party game, where players speak naturally to an AI Game Master (a witty character with a flair for the dramatic). By talking to the AI Game Master, players can make choices, roll dice, and advance together through a dynamic, reactive fantasy story.

The design goal is a real-time voice-powered Game Master agent that can listen, remember, improvise, enforce game rules, call tools, and track game state, all while sounding like a compelling voice actor or stage performer.

At Weekend, we’re super excited about the new generation of speech-to-speech models, such as OpenAI’s GPT Realtime, Gemini Live, and Grok Voice Fast. For games like Wit’s End, low-latency responses and dynamic vocal range are not just nice-to-haves. Voice is the primary interface to the product. A Game Master agent who takes too long to respond breaks game immersion. If the agent cannot handle interruption, hesitation, correction, humor, or dynamic player commands, the game stops feeling magical.

Speech-to-speech models make it possible to collapse a lot of the traditional voice agent pipeline into one API call. Speech recognition, text reasoning, tool calls, narration, and text-to-speech can increasingly happen inside one real-time model loop.

But “sounds good in a demo” is not enough. A voice Game Master agent has to do several hard things at once:

  • Follow complicated scene rules
  • Call tools at the right time, and not at the wrong time
  • Maintain hidden state across many turns
  • Handle interruptions and corrections
  • Avoid leaking tool names or internal instructions
  • Respond to weird player behavior in-fiction
  • Stay entertaining, dramatic, and emotionally responsive

To understand which models are actually ready for that workload, we built SylasBench.

SylasBench

SylasBench is built as an adaptation of a real scene within Wit’s End. In this scene, the player is located in an inner dungeon sanctuary alongside Sylas, a mischievous forest Sprite NPC. The player is trying to open a magical reliquary and then return to the burning town of Timberfall.

In this scene, the speech-to-speech model has to narrate the game (like a tabletop RPG dungeon master) and also switch to voice-acting as the NPC character Sylas at key moments. The model starts with a dense system prompt that contains a full narrative world bible, player inventory metadata, hidden game state, puzzle rules, tool-use rules, and performance instructions for how to act as both the Game Master and as Sylas. The model has to succeed as both a rules engine and a performer.

Our eval script simulates actual player gameplay input that includes in-game actions to solve a puzzle, attempts to correct or redirect the Game Master, an alphanumeric misread, an “I give up” moment, an attempt to use “blood-magic”, an interruption while the Game Master is speaking, and actions that should trigger tool calls to transition between game scenes. It is a good approximation of real production inputs that our game needs to handle daily.

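We ran SylasBench across OpenAI’s gpt-realtime and gpt-realtime-1.5, gpt-realtime-2 at three reasoning budgets (minimal, low, and medium), Gemini Live at three thinking levels (low, medium, and high), and xAI’s Grok Voice Fast.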

The headline result: gpt-realtime-2 [medium] gave us the strongest overall balance of complex rule-following and entertaining voice acting. The older gpt-realtime model remained strong, especially on basic tool protocols, but gpt-realtime-2 [medium] was better on the higher-complexity Game Master workload.

Model                    | Weighted score | 95% CI   | First-cycle audio P50
gpt-realtime-2 [medium]  | 93.6%          | ±3.0 pts | 1.37s
gpt-realtime             | 88.4%          | ±2.1 pts | 0.70s
gpt-realtime-2 [low]     | 88.4%          | ±2.8 pts | 1.20s
gpt-realtime-2 [minimal] | 87.3%          | ±3.1 pts | 1.10s
gemini medium            | 84.8%          | ±3.3 pts | 3.60s
grok voice fast          | 83.8%          | ±1.9 pts | 1.06s
gemini low               | 81.9%          | ±3.8 pts | 3.54s
gpt-realtime-1.5         | 81.5%          | ±3.0 pts | 0.89s
gemini high              | 76.6%          | ±4.3 pts | 4.45s

Note: Scores are weighted assertion pass rates with 95% confidence intervals from per-run variance. First-cycle audio P50 is the median time from the model response request to the first audio chunk in the model’s first response cycle. It excludes later tool-result follow-up cycles, which reduces distortion from long or tool-heavy turns.
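
As an illustration of the metrics described in the note, here is a minimal Python sketch of how a weighted assertion pass rate, a 95% interval across runs, and a P50 latency could be computed. This is not the SylasBench implementation: the function names, the (weight, passed) input shape, and the normal-approximation interval (1.96 standard errors) are all assumptions made for the example.

```python
import math
import statistics


def weighted_pass_rate(results):
    """Weighted share of passed assertions for one run.

    `results` is a list of (weight, passed) pairs (hypothetical shape).
    """
    total = sum(weight for weight, _ in results)
    if total == 0:
        raise ValueError("run contains no weighted assertions")
    return sum(weight for weight, passed in results if passed) / total


def mean_with_ci95(run_scores):
    """Mean of per-run scores plus a 95% half-width.

    Assumes a normal approximation: 1.96 * standard error of the mean.
    """
    mean = statistics.mean(run_scores)
    if len(run_scores) < 2:
        return mean, float("nan")  # variance is undefined for a single run
    sem = statistics.stdev(run_scores) / math.sqrt(len(run_scores))
    return mean, 1.96 * sem


def p50(latencies_s):
    """Median latency in seconds."""
    return statistics.median(latencies_s)


if __name__ == "__main__":
    run = [(2.0, True), (1.0, False), (1.0, True)]
    print(weighted_pass_rate(run))
    print(mean_with_ci95([0.8, 1.0]))
    print(p50([0.7, 1.2, 0.9]))
```

With few runs per model, a t-based interval would be wider than this normal approximation, so the half-widths here should be read as a rough guide only.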

Complex Rule Following

One important slice of SylasBench combines scene transition, memory handoff, and puzzle state tracking. This tests whether the model can follow a multi-step tool sequence and maintain a puzzle state across interaction turns.

The final transition is deliberately difficult. The model must complete the objective, record scene memory, query past memories, use the relevant retrieved warning, avoid outdated retrieved facts, and speak only a short final transition statement.
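
To make the ordering requirement concrete, here is a hedged Python sketch of an assertion that checks whether the required tool calls appear in order and whether the final spoken line is short. The tool names (`complete_objective`, `record_scene_memory`, `query_memories`) and the 40-word limit are hypothetical placeholders, not the real Wit’s End tool schema. The sketch does not cover the semantic checks (using the relevant warning, avoiding outdated facts).

```python
# Hypothetical tool names, used only for illustration.
REQUIRED_ORDER = ("complete_objective", "record_scene_memory", "query_memories")


def follows_order(observed_calls, required_order):
    """True if required_order occurs within observed_calls as an ordered subsequence.

    Extra calls in between are allowed; out-of-order calls are not.
    """
    remaining = iter(observed_calls)
    # Each any() consumes the iterator up to the next match, which enforces order.
    return all(any(call == needed for call in remaining) for needed in required_order)


def check_transition(observed_calls, final_speech, max_words=40):
    """Return named pass/fail results for the transition turn."""
    return {
        "tools_in_order": follows_order(observed_calls, REQUIRED_ORDER),
        "short_final_statement": len(final_speech.split()) <= max_words,
    }


if __name__ == "__main__":
    calls = ["complete_objective", "record_scene_memory", "query_memories"]
    print(check_transition(calls, "The reliquary opens. Timberfall awaits."))
```

Returning one named result per check keeps each assertion separately weightable, rather than collapsing the whole turn into a single pass or fail.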

Separately, the puzzle requires an exact sequence of objects to be interacted with: blue candle, silver bell, cracked mirror. In doing so, the model must reject decoys, handle mutable state for objects, use tools in order, and avoid inventing alternate solutions.
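
The exact-sequence rule can be pictured as a small state machine. The Python sketch below is an assumption-laden illustration rather than the game’s actual logic: the class name and return labels are invented, and it assumes a wrong or out-of-order object is rejected without resetting progress, which the scene rules may or may not do. It also does not model per-object mutable state.

```python
from dataclasses import dataclass

# The object order comes from the scene description above.
SOLUTION = ("blue candle", "silver bell", "cracked mirror")


@dataclass
class PuzzleState:
    """Tracks how many solution steps have been completed in order."""

    progress: int = 0

    @property
    def solved(self):
        return self.progress == len(SOLUTION)

    def interact(self, obj):
        """Apply one interaction and return a label describing the outcome."""
        if self.solved:
            return "already_solved"
        if obj == SOLUTION[self.progress]:
            self.progress += 1
            return "solved" if self.solved else "advanced"
        # Decoys and out-of-order objects are rejected; progress is kept (assumption).
        return "rejected"


if __name__ == "__main__":
    state = PuzzleState()
    for obj in ["red candle", "blue candle", "cracked mirror", "silver bell", "cracked mirror"]:
        print(obj, "->", state.interact(obj))
```

A reference tracker like this gives the eval a ground truth to compare against whatever state the model reports or implies across turns.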

Rank | Model                    | Score
1    | gpt-realtime-2 [medium]  | 89.8%
2    | gpt-realtime             | 78.4%
3    | gemini medium            | 77.6%
4    | gpt-realtime-2 [low]     | 76.0%
5    | gpt-realtime-2 [minimal] | 73.2%
6    | gpt-realtime-1.5         | 64.9%
7    | gemini low               | 63.9%
8    | gemini high              | 59.2%
9    | grok voice fast          | 59.1%

This slice of our overall benchmark most closely evaluates “can this model actually run a complex game system?” gpt-realtime-2 [medium] separated itself from the pack most clearly here.

Stylistic / Flexible DM

Rule-following is not sufficient. A Game Master also needs to make failure fun, recover from player interruptions, and maintain a coherent fantasy narrative while players behave unpredictably.

This eval combines correction and interruption recovery, “I give up” handling, “blood-magic” handling, and spoken tool hygiene. Good behavior means the model does not flatly say “you cannot give up,” does not mistake “blood-magic” in the context of a game for real-world self-harm, does not read tool comments or bullet lists aloud, and does not expose function names or internal machinery.
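
Spoken tool hygiene is the most mechanical of these checks, so it lends itself to a simple text assertion on the transcript of the model’s audio. The Python sketch below is one possible approach, not the SylasBench grader. The function name, the violation labels, and the example tool name are hypothetical, and it catches only literal tool identifiers and list markers, not paraphrased leaks.

```python
import re

# Matches a line that starts with a bullet or a numbered-list marker.
BULLET_LINE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def hygiene_violations(transcript, tool_names):
    """Return a list of hygiene problems found in a spoken-output transcript."""
    violations = []
    for name in tool_names:
        # Whole-identifier match, case-insensitive.
        if re.search(rf"\b{re.escape(name)}\b", transcript, re.IGNORECASE):
            violations.append(f"tool_name:{name}")
    if BULLET_LINE.search(transcript):
        violations.append("bullet_list")
    return violations


if __name__ == "__main__":
    tools = ["record_scene_memory"]  # hypothetical tool name
    print(hygiene_violations("I will now call record_scene_memory.", tools))
    print(hygiene_violations("The mirror hums as you approach.", tools))
```

Subtler failures, such as the model describing its instructions in its own words, would still need a semantic judge.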

Rank | Model                    | Score
1    | grok voice fast          | 92.9%
2    | gpt-realtime-2 [medium]  | 92.1%
3    | gpt-realtime-2 [low]     | 90.5%
4    | gpt-realtime-2 [minimal] | 89.2%
5    | gemini low               | 88.8%
6    | gpt-realtime-1.5         | 88.7%
7    | gpt-realtime             | 84.9%
8    | gemini high              | 83.3%
9    | gemini medium            | 83.0%

grok voice fast performed very well on this stylistic performance eval, but struggled on the more agentic scene transition and tool choreography tests. gpt-realtime-2 [medium] was the most balanced: close to the top stylistically, while also leading on complex rule-following.

Provider-Specific Quirks

OpenAI: silence / noise hallucination

One issue we continue to care about is silence and noise handling. In a push-to-talk game, players may open the mic accidentally, take long pauses, or be in a noisy environment. Typically, the correct behavior for a Game Master agent is to (1) do nothing and discard the input, (2) ask for a quick clarification, or (3) wait for additional microphone input. Poor behavior would be hallucinating an action the player didn’t request or calling tools in response to silence or noise.

In a separate silence/noise sub-bench, OpenAI realtime models sometimes responded to silence or noise as if real spoken input had been heard. Reasoning levels did not reliably solve this:

OpenAI lane              | Silence/noise assertion failure rate
gpt-realtime-2 [minimal] | 12.9%
gpt-realtime-2 [low]     | 21.4%
gpt-realtime-2 [medium]  | 12.9%
gpt-realtime-1.5         | 0.0%
gpt-realtime             | 5.7%

Our current hypothesis is that this should be addressed at the product/harness layer as well as the model/prompt layer. Even in a push-to-talk game, we may need to add stricter VAD handling, add explicit silence/noise detection, or forcibly suppress model responses when the audio input contains no reliable speech.
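
One very simple form of harness-level suppression is an energy gate that refuses to forward a push-to-talk clip unless enough frames exceed a loudness threshold. The Python sketch below illustrates that idea only. It is not our production code and not a real VAD. The function names, the 16-bit native-endian PCM input format, and both thresholds are assumptions that would need tuning on real recordings.

```python
import math
from array import array


def frame_rms(pcm16_bytes):
    """Root-mean-square amplitude of one frame of 16-bit native-endian PCM.

    The byte length must be a multiple of 2, or frombytes raises ValueError.
    """
    samples = array("h")
    samples.frombytes(pcm16_bytes)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def has_reliable_speech(frames, rms_threshold=500.0, min_voiced_frames=5):
    """True if enough frames are loud enough to plausibly contain speech.

    Both thresholds are placeholder values for illustration.
    """
    voiced = sum(1 for frame in frames if frame_rms(frame) >= rms_threshold)
    return voiced >= min_voiced_frames


if __name__ == "__main__":
    silence = bytes(640)                        # 320 zero-valued samples
    loud = array("h", [4000] * 320).tobytes()   # constant-amplitude stand-in
    print(has_reliable_speech([silence] * 10))
    print(has_reliable_speech([loud] * 6 + [silence] * 4))
```

An energy gate filters out true silence, but it cannot tell loud background noise from speech, so it would complement rather than replace proper voice activity detection.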

Gemini: intermittent no-response / missing audio

Gemini Live models were weighed down by failing to respond or output speech when expected. This was especially visible in the high-thinking lane: missing output audio was more frequent, and those missing turns then counted as failures in downstream semantic checks.

In the non-silence run:

Gemini lane   | Missing audio rate
gemini low    | 10.3%
gemini medium | 8.9%
gemini high   | 17.9%

This does not appear to be a simple session-length issue: missing responses occurred early, sometimes within the first few minutes of gameplay. The issue seems consistent with intermittent Live API no-audio behavior, VAD/activity-boundary issues, or provider-side instability. Other users report similar findings, including in this Stack Overflow report: Gemini Live API intermittently stops streaming audio response.

Takeaways

The main thing SylasBench shows is that speech-to-speech model quality is not one-dimensional. Some models are funny and flexible but weak at complex tool choreography. Some follow basic tool rules but fail to execute a more complex sequence. Some look strong until silence, interruption, or missing-audio reliability enters the picture.

For Wit’s End, the current best balance is gpt-realtime-2 [medium]. It led overall, led the complex rule-following slice, and stayed near the top on stylistic/flexible Game Master behavior evals. gpt-realtime remains impressively strong on basic protocol and latency, while grok voice fast showed promising entertainment value. gemini medium had strong moments, especially on transition-like tasks, but the response reliability was a major issue in this run.

The broader lesson is that realtime game AI needs benchmarks that look like games: not single-turn prompts, but messy, stateful, voice-driven episodes where the model has to be a rules engine, actor, narrator, memory system, and improviser all at once.

Brian Ng is a Senior Machine Learning Engineer at Weekend, working on agentic systems for game production, Wit’s End, plus voice and audio model evals. Thanks to James Wilsterman for feedback on this post.
