Benchmarking speech-to-speech models for AI games
Intro
Weekend makes voice-powered games for TV: family-friendly party games you play together from your couch on platforms like Roku, Fire TV, Samsung, and LG. Wit’s End is our flagship party game, where players speak naturally to an AI Game Master (a witty character with a flair for the dramatic). By talking to the AI Game Master, players make choices, roll dice, and advance through a dynamic, reactive fantasy story together.
The design goal is a real-time voice-powered Game Master agent that can listen, remember, improvise, enforce game rules, call tools, and track game state, all while sounding like a compelling voice actor or stage performer.
At Weekend, we’re super excited about the new generation of speech-to-speech models, such as OpenAI’s GPT Realtime, Gemini Live, and Grok Voice Fast. For games like Wit’s End, low-latency responses and dynamic vocal range are not just nice-to-haves: voice is the primary interface to the product. A Game Master agent that takes too long to respond breaks immersion. If the agent cannot handle interruption, hesitation, correction, humor, or dynamic player commands, the game stops feeling magical.
Speech-to-speech models make it possible to collapse a lot of the traditional voice agent pipeline into one API call. Speech recognition, text reasoning, tool calls, narration, and text-to-speech can increasingly happen inside one real-time model loop.
But “sounds good in a demo” is not enough. A voice Game Master agent has to do several hard things at once:
- Follow complicated scene rules
- Call tools at the right time, and not at the wrong time
- Maintain hidden state across many turns
- Handle interruptions and corrections
- Avoid leaking tool names or internal instructions
- Respond to weird player behavior in-fiction
- Stay entertaining, dramatic, and emotionally responsive
To understand which models are actually ready for that workload, we built SylasBench.
SylasBench
SylasBench is built as an adaptation of a real scene within Wit’s End. In this scene, the player is located in an inner dungeon sanctuary alongside Sylas, a mischievous forest Sprite NPC. The player is trying to open a magical reliquary and then return to the burning town of Timberfall.
In this scene, the speech-to-speech model has to narrate the game (like a tabletop RPG dungeon master) and also switch to voice-acting as the NPC character Sylas at key moments. The model starts with a dense system prompt that contains a full narrative world bible, player inventory metadata, hidden game state, puzzle rules, tool-use rules, and performance instructions for how to act as both the Game Master and as Sylas. The model has to succeed as both a rules engine and a performer.
Our eval script simulates actual player gameplay input that includes in-game actions to solve a puzzle, attempts to correct or redirect the Game Master, an alphanumeric misread, an “I give up” moment, an attempt to use “blood-magic”, an interruption while the Game Master is speaking, and actions that should trigger tool calls to transition between game scenes. It is a good approximation of real production inputs that our game needs to handle daily.
We ran SylasBench across OpenAI Realtime, GPT Realtime 2 with different reasoning budgets, Gemini Live, and xAI Grok Voice.
The headline result: gpt-realtime-2 [medium] gave us the strongest overall balance of complex rule-following and entertaining voice acting. The older gpt-realtime model remained strong, especially on basic tool protocols, but gpt-realtime-2 [medium] was better on the higher-complexity Game Master workload.
Note: Scores are weighted assertion pass rates with 95% confidence intervals from per-run variance. First-cycle audio P50 is the median time from the model response request to the first audio chunk in the model’s first response cycle, excluding later tool-result followup cycles and reducing distortion from long/tool-heavy turns.
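As a rough illustration of the scoring described in the note, here is a minimal Python sketch. The post does not give the exact formulas, so the normal-approximation interval, the function names, and the input shapes are all assumptions for illustration, not the actual SylasBench code.

```python
import math
import statistics


def weighted_pass_rate(assertions):
    """Score one run. `assertions` is a list of (weight, passed) pairs (assumed shape)."""
    total_weight = sum(weight for weight, _ in assertions)
    if total_weight == 0:
        raise ValueError("run has no weighted assertions")
    passed_weight = sum(weight for weight, passed in assertions if passed)
    return passed_weight / total_weight


def mean_with_ci95(run_scores):
    """Mean of per-run scores plus the half-width of a 95% interval.

    Assumption: a normal approximation (1.96 * standard error) over runs.
    """
    mean = statistics.fmean(run_scores)
    if len(run_scores) < 2:
        return mean, 0.0  # variance is undefined for a single run
    std_err = statistics.stdev(run_scores) / math.sqrt(len(run_scores))
    return mean, 1.96 * std_err


def first_cycle_audio_p50(latencies_ms):
    """Median of request-to-first-audio-chunk latencies, first response cycle only."""
    return statistics.median(latencies_ms)
```

Each run is scored on its own and the interval is computed across runs, so a model that is excellent on some runs and poor on others shows up as a wide interval rather than a misleadingly average score.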
Complex Rule Following
One important slice of SylasBench combines scene transition, memory handoff, and puzzle state tracking. This tests whether the model can follow a multi-step tool sequence and maintain a puzzle state across interaction turns.
The final transition is deliberately difficult. The model must complete the objective, record scene memory, query past memories, use the relevant retrieved warning, avoid outdated retrieved facts, and speak only a short final transition statement.
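A check for this kind of transition could look roughly like the sketch below. The tool names, the required order, and the word limit are hypothetical placeholders; the real Wit’s End tool schema and thresholds are not shown in this post.

```python
# Hypothetical tool names standing in for the real (unpublished) tool schema.
EXPECTED_ORDER = ["complete_objective", "record_scene_memory", "query_memories"]


def calls_in_order(observed, expected=EXPECTED_ORDER):
    """True if `expected` appears within `observed` as an ordered subsequence.

    Unrelated tool calls in between are tolerated; a missing or swapped call is not.
    """
    remaining = iter(observed)
    # `name in remaining` consumes the iterator up to the first match,
    # so each later name must appear after the previous one.
    return all(name in remaining for name in expected)


def is_short_statement(text, max_words=30):
    """True if the final spoken line is non-empty and within an assumed word limit."""
    return 0 < len(text.split()) <= max_words
```

For example, `calls_in_order(["record_scene_memory", "complete_objective", "query_memories"])` is `False`, because the memory was recorded before the objective was completed.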
Separately, the puzzle requires an exact sequence of objects to be interacted with: blue candle, silver bell, cracked mirror. In doing so, the model must reject decoys, handle mutable state for objects, use tools in order, and avoid inventing alternate solutions.
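To make the puzzle concrete, here is a minimal reference tracker that an eval harness could compare the model’s narrated state against. The three-object solution comes from the scene; the class name, the return values, and the choice to reject a wrong object without resetting progress are assumptions for illustration.

```python
from dataclasses import dataclass

SOLUTION = ("blue candle", "silver bell", "cracked mirror")


@dataclass
class ReliquaryPuzzle:
    """Reference state for the reliquary puzzle (illustrative, not the game's code)."""

    progress: int = 0  # number of correct objects used so far

    @property
    def solved(self) -> bool:
        return self.progress == len(SOLUTION)

    def interact(self, obj: str) -> str:
        """Apply one player interaction and report the outcome."""
        if self.solved:
            return "solved"
        if obj == SOLUTION[self.progress]:
            self.progress += 1
            return "solved" if self.solved else "advance"
        # Assumption: decoys and out-of-order objects are rejected
        # and leave existing progress untouched.
        return "rejected"
```

A usage example: touching the silver bell first returns `"rejected"`; touching the blue candle, then the silver bell, then the cracked mirror returns `"advance"`, `"advance"`, `"solved"`.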
This slice of our overall benchmark most closely evaluates “can this model actually run a complex game system?” gpt-realtime-2 [medium] separated itself from the pack most clearly here.
Stylistic / Flexible DM
Rule-following is not sufficient. A Game Master also needs to make failure fun, recover from player interruptions, and maintain a coherent fantasy narrative while players behave unpredictably.
This eval combines correction and interruption recovery, “I give up” handling, “blood-magic” handling, and spoken tool hygiene. Good behavior means the model does not flatly say “you cannot give up,” does not mistake “blood-magic” in the context of a game for real world self-harm, does not read tool comments or bullet lists aloud, and does not expose function names or internal machinery.
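The spoken tool hygiene part lends itself to simple text checks on the model’s output transcript. The sketch below is one possible heuristic, assuming a text transcript of the spoken response is available and that tool names are snake_case identifiers; both are assumptions, and the tonal checks (such as handling “I give up” gracefully) would need a semantic grader rather than a regex.

```python
import re

# Assumption: internal tool and function names look like snake_case identifiers.
SNAKE_CASE = re.compile(r"\b[a-z]+(?:_[a-z0-9]+)+\b")
# Lines that start with a bullet marker or a numbered-list marker.
BULLET_LINE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def hygiene_violations(transcript: str) -> list[str]:
    """Return a list of hygiene problems found in a spoken-response transcript."""
    violations = []
    for match in SNAKE_CASE.finditer(transcript):
        violations.append(f"identifier spoken aloud: {match.group(0)}")
    if BULLET_LINE.search(transcript):
        violations.append("bullet or numbered list read aloud")
    return violations
```

An empty list means the transcript passed these two checks only; it says nothing about whether the line was entertaining or in-fiction.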
grok voice fast performed very well on this stylistic performance eval, but struggled on the more agentic scene transition and tool choreography tests. gpt-realtime-2 [medium] was the most balanced: close to the top stylistically, while also leading on complex rule-following.
Provider-Specific Quirks
OpenAI: silence / noise hallucination
One issue we continue to care about is silence and noise handling. In a push-to-talk game, players may open the mic accidentally, take long pauses, or be in a noisy environment. Typically, the correct behavior for a Game Master agent is to do one of three things: 1) do nothing and discard the input, 2) ask for a quick clarification, or 3) wait for additional microphone input. Poor behavior would be to hallucinate an action the player didn’t request or to call tools in response to silence or noise.
In a separate silence/noise sub-bench, OpenAI realtime models sometimes responded to silence or noise as if real spoken input had been heard. Reasoning levels did not reliably solve this:
Our current hypothesis is that this should be addressed at the product/harness layer as well as the model/prompt layer. Even in a push-to-talk game, we may need to add stricter VAD handling, explicit silence/noise detection, or possibly forcibly suppress model responses when the audio input contains no reliable speech.
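One simple form of harness-level gating is an energy check on the captured audio before it is ever sent to the model. The sketch below is a deliberately crude stand-in, not a real VAD and not what Wit’s End ships: it assumes 16-bit PCM samples already decoded to integers, and the frame size, RMS threshold, and minimum voiced duration are illustrative values that would need tuning per device.

```python
import math


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of integer PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def has_reliable_speech(samples, sample_rate=16000, frame_ms=20,
                        rms_threshold=500.0, min_voiced_ms=200):
    """Return True if enough frames are loud enough to plausibly contain speech.

    All thresholds are assumptions for illustration.
    """
    frame_len = sample_rate * frame_ms // 1000
    voiced_frames = 0
    # Walk the clip in non-overlapping frames, ignoring a trailing partial frame.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        if frame_rms(samples[start:start + frame_len]) >= rms_threshold:
            voiced_frames += 1
    return voiced_frames * frame_ms >= min_voiced_ms
```

A harness could skip the model request whenever this returns `False`. Loud background noise would still pass an energy gate like this one, which is why stricter VAD or explicit noise detection may be needed on top.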
Gemini: intermittent no-response / missing audio
Gemini Live models’ scores were weighed down by failures to respond or produce speech when expected. This was especially visible in the high-thinking lane: missing output audio was more frequent, and those missing turns then counted as failures in downstream semantic checks.
In the non-silence run:
This does not appear to be a simple session-length issue: missing responses occurred early, sometimes within the first few minutes of gameplay. The issue seems consistent with intermittent Live API no-audio behavior, VAD/activity-boundary issues, or provider-side instability. Other users report similar findings, including in this Stack Overflow report: Gemini Live API intermittently stops streaming audio response.
Takeaways
The main thing SylasBench shows is that speech-to-speech model quality is not one-dimensional. Some models are funny and flexible but weak at complex tool choreography. Some follow basic tool rules but fail to execute a more complex sequence. Some look strong until silence, interruption, or missing-audio reliability enters the picture.
For Wit’s End, the current best balance is gpt-realtime-2 [medium]. It led overall, led the complex rule-following slice, and stayed near the top on stylistic/flexible Game Master behavior evals. gpt-realtime remains impressively strong on basic protocol and latency, while grok voice fast showed promising entertainment value. gemini medium had strong moments, especially on transition-like tasks, but the response reliability was a major issue in this run.
The broader lesson is that realtime game AI needs benchmarks that look like games: not single-turn prompts, but messy, stateful, voice-driven episodes where the model has to be a rules engine, actor, narrator, memory system, and improviser all at once.
Brian Ng is a Senior Machine Learning Engineer at Weekend, working on agentic systems for game production, Wit’s End, plus voice and audio model evals. Thanks to James Wilsterman for feedback on this post.







