Building a Voice Agent for Media: Why the LLM Is Not the Brain

Workshop · ContentWise Connect Day 2026

Recorded at ContentWise Connect Day 2026, Milan. Renato Bonomini, VP Sales Engineering, ContentWise by Moviri.

Talking to a TV used to feel awkward. You held the remote up like a microphone, repeated yourself, and got back something generic or wrong. The problem wasn’t the hardware. It was that nothing behind the microphone actually knew who you were or what you had available to watch.

That’s changed, but not because LLMs got smarter. It changed because the architecture around them got better. At ContentWise Connect Day 2026, VP of Sales Engineering Renato Bonomini walked a room of video operators and product leaders through the working prototype he built on UX Engine and explained what every voice agent demo gets wrong.

The stack has four layers. Only one of them is yours.

A voice agent has a simple mechanical shape: the user speaks, speech-to-text converts it to text, an LLM reasons about the query, a recommendation engine handles the lookup, and text-to-speech converts the answer back to audio. Four layers. Three of them, STT, LLM, and TTS, are commodity. You can buy them, license them, or swap them. The fourth layer is UX Engine, and it’s the only one that knows your catalog and your users. That distinction is the whole architecture.
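The four-layer shape described above can be sketched as a small pipeline. This is a minimal illustration, not the prototype's code: the class name, field names, and stub stages are all assumptions, and the only point is that each commodity stage is a swappable callable while the catalog lookup stays with the engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VoicePipeline:
    """Hypothetical wiring of the four layers; every stage is swappable."""
    speech_to_text: Callable[[bytes], str]       # STT (commodity)
    plan_tool_call: Callable[[str], Dict]        # LLM: intent -> tool call
    engine_lookup: Callable[[Dict], List[str]]   # recommendation engine
    narrate: Callable[[List[str]], str]          # LLM: results -> reply text
    text_to_speech: Callable[[str], bytes]       # TTS (commodity)

    def handle(self, audio: bytes) -> bytes:
        text = self.speech_to_text(audio)
        call = self.plan_tool_call(text)
        titles = self.engine_lookup(call)  # only the engine touches the catalog
        reply = self.narrate(titles)
        return self.text_to_speech(reply)


# Stub stages standing in for real vendors, for illustration only.
pipeline = VoicePipeline(
    speech_to_text=lambda audio: audio.decode("utf-8"),
    plan_tool_call=lambda text: {"tool": "search_content", "args": {"query": text}},
    engine_lookup=lambda call: ["Jurassic Park"] if "dinosaur" in call["args"]["query"] else [],
    narrate=lambda titles: ("You might like " + ", ".join(titles)) if titles else "Nothing found.",
    text_to_speech=lambda reply: reply.encode("utf-8"),
)
```

Swapping an STT or TTS vendor here means replacing one callable; nothing else in the pipeline changes.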

In Renato’s prototype, a synthetic user called Jack starts a session. Before Jack says a word, UX Engine has already handed the LLM a 7,531-character context block: metadata field definitions, tool routing rules, and Jack’s viewing preferences across five explicit and seven implicit categories. Jack never filled out a form. He just watched things over time, and UX Engine was paying attention.
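As a rough sketch of what assembling such a context block might look like, the function below joins the three ingredients named above into one text block. The function name, section headings, and sample values are hypothetical; the actual 7,531-character block is not reproduced in this write-up.

```python
from typing import Dict, List


def build_session_context(
    field_definitions: Dict[str, str],
    routing_rules: List[str],
    profile: Dict[str, Dict[str, List[str]]],
) -> str:
    """Render the three parts of the context into one plain-text block."""
    lines = ["## Metadata fields"]
    lines += [f"- {name}: {meaning}" for name, meaning in field_definitions.items()]
    lines.append("## Tool routing rules")
    lines += [f"- {rule}" for rule in routing_rules]
    lines.append("## Viewer preferences")
    for kind in ("explicit", "implicit"):
        for category, values in profile.get(kind, {}).items():
            lines.append(f"- {kind} {category}: {', '.join(values)}")
    return "\n".join(lines)


# Illustrative values only, not taken from the prototype.
context = build_session_context(
    field_definitions={"mood": "editorial mood tag such as 'funny' or 'tense'"},
    routing_rules=["Attribute words (mood, genre, actor) route to search."],
    profile={
        "explicit": {"genres": ["science fiction"]},
        "implicit": {"actors": ["Jeff Goldblum"], "directors": ["Steven Spielberg"]},
    },
)
```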

250 milliseconds is the hard ceiling

Voice breaks at latency in a way that text doesn’t. Telecommunication engineering established the 250ms threshold for natural conversation decades ago when designing GSM encoding, and that limit applies directly to voice AI. Ask a question and wait a second for an answer: the experience falls apart.

Testing across commercially available LLMs, Renato found the premium reasoning models clock in at around 1.2 seconds per query and cost roughly 500 times more per transaction than the smallest capable models. A latency-tuned open-weight model like Qwen 3.6 delivers 233ms responses at a fraction of the cost, with no detectable quality difference for this task. For a US operator with 10 million subscribers, running a premium model would cost approximately $10 million per month. That math ends the conversation quickly.
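The figures quoted above can be worked through in a few lines. This is back-of-the-envelope arithmetic under one stated assumption: that the roughly 500-times per-transaction ratio carries over unchanged to the monthly bill at the same query volume. The variable names are illustrative.

```python
LATENCY_CEILING_MS = 250          # ceiling quoted in the talk

premium_monthly_usd = 10_000_000  # quoted estimate for a premium model
subscribers = 10_000_000
cost_ratio = 500                  # "roughly 500 times more per transaction"

# Premium model: cost per subscriber per month.
premium_per_subscriber_usd = premium_monthly_usd / subscribers

# Small model: same volume, 1/500 of the bill (assumption stated above).
small_monthly_usd = premium_monthly_usd / cost_ratio


def within_ceiling(latency_ms: float) -> bool:
    """True if a response time fits under the conversational ceiling."""
    return latency_ms <= LATENCY_CEILING_MS


print(premium_per_subscriber_usd)                 # 1.0
print(small_monthly_usd)                          # 20000.0
print(within_ceiling(1200), within_ceiling(233))  # False True
```

Under that assumption the premium model works out to about $1 per subscriber per month, against roughly $20,000 per month in total for the small model.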

“At the end of the day, what might be driving it is you need a certain latency, and then you find out it costs so much.”
Renato Bonomini Cremonesi

The reason smaller models work here is that the LLM’s job is tightly scoped. It translates intent, picks the right tool call, and narrates the result. It does not rank content. That job stays with the recommendation engine, which is designed for exactly that purpose.
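One way to picture that narrow scope: the LLM emits a tool call, and a thin dispatcher forwards it to the engine. The sketch below uses the six tools named in this write-up's summary figures. The snake_case identifiers, the `dispatch` helper, and the stub handler are assumptions for illustration, not UX Engine's API.

```python
from typing import Callable, Dict, List

# Snake_case names are an assumption; the source lists the tools only in prose.
TOOL_NAMES = (
    "get_recommendations",
    "get_related_titles",
    "search_content",
    "get_user_profile",
    "get_metadata_values",
    "get_item_details",
)


def dispatch(call: Dict, handlers: Dict[str, Callable[..., List[str]]]) -> List[str]:
    """Run one LLM-proposed tool call; reject anything outside the six tools."""
    name = call.get("tool")
    if name not in TOOL_NAMES:
        raise ValueError(f"unknown tool: {name!r}")
    return handlers[name](**call.get("args", {}))


# Stub handler standing in for the engine; ranking happens inside it, not in the LLM.
handlers = {"search_content": lambda mood: [f"top {mood} title", f"second {mood} title"]}
result = dispatch({"tool": "search_content", "args": {"mood": "funny"}}, handlers)
```

The order of `result` comes from the handler. The LLM only chooses which tool to call and narrates what comes back.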

Grounding is two-sided: catalog and profile

Ask an ungrounded LLM for a movie recommendation and it will give you one. The title probably sounds right. It probably doesn’t exist in your service. Catalog grounding means every title the agent names came from a tool call to UX Engine, not from the model’s training data. No tool call, no title.
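The "no tool call, no title" rule can be enforced with a small guard. This is a minimal sketch assuming the agent's candidate titles are available as a list before narration; the function name and the made-up title in the usage note are hypothetical.

```python
from typing import List, Optional


def grounded_titles(proposed: List[str], tool_results: Optional[List[str]]) -> List[str]:
    """Keep only titles that the engine actually returned in this turn."""
    if not tool_results:  # no tool call (or an empty result): no titles
        return []
    allowed = {title.casefold() for title in tool_results}
    return [title for title in proposed if title.casefold() in allowed]
```

For example, `grounded_titles(["Jurassic Park", "Dino Heist 3"], ["Jurassic Park", "The Fly"])` keeps only "Jurassic Park", and passing `None` as the tool result returns an empty list.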

Profile grounding works the same way. When Jack asks “why would I like Jurassic Park?”, the agent answers: “Because it stars Jeff Goldblum, an actor you’ve watched frequently, and is directed by Steven Spielberg, one of your favorite directors.” That explanation came from UX Engine’s user model, not from the LLM’s imagination. Jeff Goldblum: 12 confirmed signals from viewing behavior. The LLM read the profile; it didn’t construct it.
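A sketch of how an explanation might be derived from profile signals rather than generated freely follows. The data shapes and the `explain_match` helper are assumptions. The Goldblum count of 12 comes from the talk, while the Spielberg count is a placeholder.

```python
from typing import Dict, List


def explain_match(item: Dict, profile: Dict[str, Dict[str, int]]) -> List[str]:
    """Return only reasons that are backed by at least one profile signal."""
    reasons = []
    for actor in item.get("cast", []):
        count = profile.get("actors", {}).get(actor, 0)
        if count > 0:
            reasons.append(f"stars {actor} ({count} viewing signals)")
    for director in item.get("directors", []):
        if profile.get("directors", {}).get(director, 0) > 0:
            reasons.append(f"directed by {director}")
    return reasons


jack = {
    "actors": {"Jeff Goldblum": 12},       # count quoted in the talk
    "directors": {"Steven Spielberg": 5},  # placeholder count, not from the source
}
jurassic_park = {"cast": ["Sam Neill", "Jeff Goldblum"], "directors": ["Steven Spielberg"]}
reasons = explain_match(jurassic_park, jack)
```

Sam Neill produces no reason because the profile holds no signal for him. The LLM would only narrate the reasons it is handed.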

Business rules belong in the engine, not the prompt.

Editorial teams change priorities daily. A title goes into first position. A content rights window closes. A partner promotion starts. If those rules live in the LLM prompt, every editorial decision is a software deployment. Someone edits the prompt, tests it, and ships it. The gap between intent and production can stretch to days.

In this architecture, the LLM never sees the rules. It calls UX Engine, which applies current business logic and returns filtered, ranked results. The same prompt has been running across different ContentWise demo environments and partner pilots without modification. When rules change, they change in UX Engine. The voice agent surfaces whatever the engine returns.
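The separation can be illustrated with a toy engine whose rules change while the prompt stays constant. The `ToyEngine` class, its fields, and the prompt text are hypothetical stand-ins, not how UX Engine is implemented.

```python
from typing import List

# The prompt never mentions pins, rights windows, or promotions.
SYSTEM_PROMPT = "Call get_recommendations and narrate what it returns."


class ToyEngine:
    """Stand-in for the engine: owns the catalog and the business rules."""

    def __init__(self, catalog: List[str]) -> None:
        self.catalog = list(catalog)
        self.pinned: List[str] = []    # editorial picks, highest priority first
        self.blocked: set = set()      # e.g. titles whose rights window closed

    def get_recommendations(self) -> List[str]:
        available = [t for t in self.catalog if t not in self.blocked]
        pinned = [t for t in self.pinned if t in available]
        return pinned + [t for t in available if t not in pinned]


engine = ToyEngine(["Title A", "Title B", "Title C"])
before = engine.get_recommendations()

# Editorial change: pin one title and close a rights window. No prompt edit.
engine.pinned = ["Title C"]
engine.blocked.add("Title A")
after = engine.get_recommendations()
```

`before` is the catalog order and `after` is `["Title C", "Title B"]`, while `SYSTEM_PROMPT` is untouched throughout.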

Teach the agent. It doesn’t know what you know.

The most instructive part of the workshop is the bug that took the longest to diagnose. The query “show me something funny” would route to a generic recommendations call rather than a mood-filtered search. The LLM read “something” as the operative word and lost “funny.” The fix was not a code change. It was five counter-examples added to the prompt, explicitly mapping phrases that look open-ended but contain an attribute word to their correct tool calls.
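A prompt fix of this kind might look like the sketch below. The counter-examples are hypothetical, since the five used in the prototype are not listed in this write-up, and the tool name and argument keys are assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical counter-examples, not the ones used in the prototype.
COUNTER_EXAMPLES: List[Tuple[str, str, Dict[str, str]]] = [
    ("show me something funny", "search_content", {"mood": "funny"}),
    ("put on anything scary", "search_content", {"mood": "scary"}),
    ("find me whatever with Jeff Goldblum", "search_content", {"actor": "Jeff Goldblum"}),
]


def render_counter_examples(examples: List[Tuple[str, str, Dict[str, str]]]) -> str:
    """Format the examples as a prompt section mapping phrases to tool calls."""
    lines = ["Phrases that sound open-ended but contain an attribute word:"]
    for utterance, tool, args in examples:
        arg_text = ", ".join(f"{key}={value!r}" for key, value in args.items())
        lines.append(f'- "{utterance}" -> {tool}({arg_text})')
    return "\n".join(lines)


prompt_section = render_counter_examples(COUNTER_EXAMPLES)
```

Keeping the examples as data rather than hand-edited prose makes it easier to add a new phrasing when a routing miss is found.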

The underlying lesson generalizes: an LLM is capable and contextually ignorant at the same time. Assume it knows nothing about how your users phrase requests or what your metadata fields mean. The prompt is where that institutional knowledge lives, and building it is where most of the work actually goes, not in the SDK and not in the tools.

250 ms

Latency ceiling for natural voice conversation

Above this threshold the interaction feels unnatural. Premium LLMs average 1.2 seconds per query, well outside the limit for production use.

~500×

Cost difference between premium and edge LLMs

A latency-tuned open-weight model delivers comparable quality for this task at a fraction of the cost, with responses under 250ms.

7,531

Characters of user context passed at session start

Catalog metadata fields, tool routing rules, and the user’s preference profile, before the user says a single word.

6

Tools needed to build a complete voice agent

Get recommendations, get related titles, search content, get user profile, get metadata values, get item details. The full feature set in six calls.

Renato Bonomini
VP, Sales Engineering, ContentWise

Renato runs sales engineering at ContentWise, sitting at the intersection of operator requirements and platform capability. He builds working prototypes to find what breaks before it reaches a customer, and runs workshops on AI architecture for media and streaming operators.