Voice context is a timing problem

Long voice calls slow down for a familiar reason: too much history, not enough context window. The usual advice is to summarize the old turns. That assumes you have time to run the summary. In voice, you don't. The fix isn't a better summary. It's deciding when the summary runs.

The 800ms you don't have

A voice agent has about 800 milliseconds between the user finishing a sentence and the first word of audio coming back. Most of that is already spent on speech recognition, the model, and text-to-speech.

If a long call has lots of history, something has to shrink it before the model sees it. Run that work on the main path and you blow the budget. The user hears dead air where the reply should be.
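
The arithmetic is easy to sketch. Only the 800 ms budget comes from the text above. The per-stage latencies and the 400 ms summarization cost are made-up illustrative numbers, not measurements, and `headroom_ms` is a hypothetical helper.

```python
BUDGET_MS = 800  # response budget quoted in the text

# Illustrative stage latencies (assumptions, not measurements).
stages_ms = {
    "speech_recognition": 200,
    "model_first_token": 350,
    "tts_first_audio": 150,
}


def headroom_ms(stages: dict, extra_ms: int = 0, budget_ms: int = BUDGET_MS) -> int:
    """Milliseconds left in the budget after the stages plus any extra work."""
    return budget_ms - sum(stages.values()) - extra_ms


# Without context work on the main path there is a little slack.
print(headroom_ms(stages_ms))                 # 100
# An assumed 400 ms inline summarization pushes the reply past the budget.
print(headroom_ms(stages_ms, extra_ms=400))   # -300
```

Under these assumed numbers, any inline context work over 100 ms shows up as silence on the line.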

It looks like a memory problem. It's a timing problem.

Do the work while the agent is talking

While the agent speaks, the user is silent. That window is where context work belongs.

While the agent says "sure, let me look that up," a background job summarizes the older part of the call. By the time the user starts speaking again, the summary is ready. The next prompt is built from the cached summary plus a few recent turns kept in full. The main path stays cheap and fast, at roughly the same cost no matter how long the call runs.
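
One way this might look with `asyncio`, as a minimal sketch. `summarize` and `speak` are hypothetical stand-ins (simulated with sleeps) for a real summarization call and real audio playback. The `cache` dictionary and the `keep_recent` parameter are also assumptions. For brevity the sketch re-summarizes the older turns from scratch each time. A real system would more likely fold new turns into the existing summary.

```python
import asyncio


async def summarize(turns: list[str]) -> str:
    # Stand-in for a real summarization call (hypothetical).
    await asyncio.sleep(0.01)
    return f"[summary of {len(turns)} turns]"


async def speak(text: str) -> None:
    # Stand-in for audio playback; the user is silent for this long.
    await asyncio.sleep(0.1)


async def agent_turn(history: list[str], cache: dict, reply: str,
                     keep_recent: int = 2) -> None:
    history.append(f"agent: {reply}")
    cutoff = max(len(history) - keep_recent, 0)
    # Start summarizing the older turns, then speak without waiting for it.
    job = asyncio.create_task(summarize(history[:cutoff]))
    await speak(reply)
    if job.done():
        # Summary finished inside the speaking window: publish it.
        cache["summary"], cache["cutoff"] = job.result(), cutoff
    else:
        # Not ready: keep the previous cache rather than wait for it.
        job.cancel()


def build_prompt(history: list[str], cache: dict, new_user_turn: str) -> str:
    # Main path: cached summary + recent turns in full + the new user turn.
    recent = history[cache["cutoff"]:]
    parts = [cache["summary"], *recent, new_user_turn]
    return "\n".join(p for p in parts if p)
```

Note that `build_prompt` never awaits anything. If the background job misses the window, the prompt is built from the older cache and a longer run of full turns.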

The watermark

The piece that makes this work is small but worth naming. Call it the watermark: the point in the call history up to which the summary is done.

Below the watermark, a summary. Above it, the recent turns kept in full. Every prompt is summary + recent_full_turns + new_user_turn. The watermark only moves forward in the background. Never on the main path.
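
As a data structure, the watermark can be a single index into the turn list. The sketch below is one possible shape, not a reference implementation. `CallContext`, `build_prompt`, and `advance` are hypothetical names, and turns are plain strings for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class CallContext:
    """Hypothetical container for a call's history and its watermark."""
    turns: list[str] = field(default_factory=list)
    summary: str = ""    # covers turns[:watermark]
    watermark: int = 0   # index of the first turn not yet summarized

    def build_prompt(self, new_user_turn: str) -> str:
        # Main path: summary + recent_full_turns + new_user_turn.
        # Only slicing and joining happen here, no summarization.
        recent_full_turns = self.turns[self.watermark:]
        parts = [self.summary, *recent_full_turns, new_user_turn]
        return "\n".join(p for p in parts if p)

    def advance(self, new_watermark: int, new_summary: str) -> None:
        # Background path: the watermark only moves forward.
        if new_watermark < self.watermark:
            raise ValueError("watermark must not move backward")
        if new_watermark > len(self.turns):
            raise ValueError("watermark cannot pass the end of the history")
        self.summary = new_summary
        self.watermark = new_watermark
```

The background job is the only caller of `advance`. The main path only ever calls `build_prompt`, so its cost depends on the summary size and the number of turns above the watermark, not on the length of the call.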

That is the rule that keeps long calls fast. The main path can't get slower as the call gets longer, because the part that would get slower has been pushed off it.

Where it breaks

The plan needs the agent's reply to be long enough for the background work to finish. Short replies and interruptions shrink that window. Two-word back-and-forth defeats it entirely. The watermark falls behind, the unsummarized tail of the history grows, and the call starts to drag.
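
A toy model makes the failure mode visible. It assumes a fixed summarization time, one user turn and one agent turn per exchange, and that the watermark advances only when the agent's reply lasts at least as long as the summarization. All of these are simplifying assumptions, and `tail_lengths` is a hypothetical helper.

```python
def tail_lengths(reply_seconds: list[float],
                 summarize_seconds: float = 1.5,
                 keep_recent: int = 2) -> list[int]:
    """Number of full (unsummarized) turns in the prompt after each exchange."""
    total_turns = 0
    watermark = 0
    tail = []
    for reply in reply_seconds:
        total_turns += 2  # one user turn plus one agent turn
        if reply >= summarize_seconds:
            # The background job finished inside the speaking window.
            watermark = max(total_turns - keep_recent, watermark)
        tail.append(total_turns - watermark)
    return tail


print(tail_lengths([3.0, 3.0, 3.0, 3.0]))  # [2, 2, 2, 2]
print(tail_lengths([0.5, 0.5, 0.5, 0.5]))  # [2, 4, 6, 8]
print(tail_lengths([3.0, 0.5, 0.5, 3.0]))  # [2, 4, 6, 2]
```

With long replies the tail stays flat. With short ones it grows by two turns per exchange until a reply is long enough for the background job to catch up.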

The fix is product, not architecture. Agents that handle long calls pace themselves: they confirm, they read back, they take a beat. That's not stalling. It's keeping the background window open.

The takeaway

You can't fix this with a better memory trick. Time is the constraint. Build two paths: one cheap and live on the user's clock, one in the background on the agent's. Keep a watermark, and never make the main path wait for it.

When it's working, the agent stops getting slower as the call gets longer. The thirty-minute call at turn forty feels the same as the ninety-second call at turn three.