Token budgets are a product problem now
There’s a conversation happening in every engineering team building AI products right now, and most PMs aren’t in it.
It’s about tokens — specifically, how many to send, how many to generate, and what happens to your product when you get that decision wrong.
For most of AI’s current wave, token management was treated as an infrastructure concern. Engineers handled it. PMs wrote product specs, engineers figured out how to call the model efficiently, and both sides pretended this was a clean division of labour.
That pretense is breaking down.
Why token budgets have become a product decision
When you call a large language model, you’re sending a context window — everything the model knows about the current task. That context has a ceiling. Exceed it, and the model can’t process the full input. Approach it, and quality degrades. Stay well inside it, and you’re leaving capability on the table.
The cost dimension makes this sharper. Tokens cost money. In simple single-turn interactions, the cost is small and the context management is trivial. In agentic workflows — where an AI agent calls other AI agents, accumulates tool outputs, maintains conversation state — the token count compounds with every step. A workflow that costs $0.004 per run in testing can cost $0.40 in production when real conversation length and tool use are factored in. Scale that to 10,000 daily users and the economics are suddenly a board-level conversation.
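The compounding is easy to see with back-of-envelope arithmetic. The sketch below uses invented per-token prices and step counts (stand-ins for whatever your provider actually charges); the shape is the point: in an agent loop the input context accumulates, so cost grows much faster than a single-call estimate suggests.

```python
# Back-of-envelope cost model for an agentic workflow. Prices and
# token counts are invented for illustration, not any provider's rates.

INPUT_PRICE_PER_1K = 0.003   # assumed $ per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.015  # assumed $ per 1K output tokens

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single model call."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# Prototype: one short, single-turn call.
prototype = run_cost(input_tokens=1_000, output_tokens=200)

# Production: an 8-step agent loop where each step replays the growing
# context, accumulating ~2K tokens of tool output per step.
production = sum(
    run_cost(input_tokens=4_000 + step * 2_000, output_tokens=500)
    for step in range(8)
)

print(f"prototype:  ${prototype:.4f} per run")
print(f"production: ${production:.4f} per run")
print(f"at 10,000 runs/day: ${production * 10_000:,.0f}/day")
```

The per-run numbers look negligible in isolation; multiplying by daily volume is what turns them into the board-level conversation.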
But cost is the tractable problem. The quality failure is subtler.
What actually breaks
I’ve watched two failure modes emerge repeatedly as teams try to scale AI features beyond prototypes.
The first is context bloat. The team is cautious — they want the model to have everything it might need. So they include full conversation history, all user metadata, every system instruction, the complete document the user uploaded. The model works beautifully in testing, where conversations are short and inputs are clean. In production, conversations get longer. Documents are bigger. The context window fills. When it does, models don’t throw an error — they start quietly losing the thread. Recent instructions get deprioritised in favour of earlier context. The model gives confident answers that contradict what the user said three messages ago.
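The growth is mechanical. A minimal sketch, assuming a send-everything prompt that adds a fixed number of tokens per turn and an invented 32K-token context window:

```python
# Context bloat in miniature: a "send everything" prompt grows linearly
# with conversation length. All numbers are invented for illustration.

CONTEXT_WINDOW = 32_000  # assumed model context limit, in tokens

def naive_context_tokens(turn: int, base: int = 1_500, per_turn: int = 600) -> int:
    """System prompt + metadata (base) plus the full history, every call."""
    return base + turn * per_turn

# Testing stops after a few turns; production conversations keep going.
for turn in (5, 20, 60):
    tokens = naive_context_tokens(turn)
    print(f"turn {turn:>2}: {tokens:>6} tokens "
          f"({100 * tokens / CONTEXT_WINDOW:.0f}% of the window)")
```

At test-length conversations the window looks comfortably empty; at production lengths the same policy blows past it.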
The second failure is aggressive truncation without a strategy. The team hits its cost ceiling and decides to cut context. They start dropping earlier messages, summarising conversation history, trimming system prompts. But they do it arbitrarily — cutting by recency rather than relevance. Now the model loses the part of the conversation that actually mattered. A user says “as I mentioned earlier, the invoice date needs to match the LC” — but “earlier” was truncated. The model confidently ignores the constraint. The error surfaces to a user, not a developer.
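A minimal sketch of that failure, with invented messages and word count standing in for a real tokenizer: truncating by recency alone silently drops the one message that carried the binding constraint.

```python
# Why recency-only truncation fails. Messages and the word-count token
# estimate are illustrative assumptions, not a real tokenizer.

def truncate_by_recency(messages, budget):
    """Keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):
        tokens = len(msg["text"].split())  # crude token estimate
        if used + tokens > budget:
            break
        kept.append(msg)
        used += tokens
    return list(reversed(kept))

history = [
    {"role": "user", "text": "The invoice date needs to match the LC."},
    {"role": "assistant", "text": "Noted, I'll keep that in mind."},
    {"role": "user", "text": "Here is the draft invoice for review " + "x " * 40},
    {"role": "user", "text": "As I mentioned earlier, check the dates."},
]

window = truncate_by_recency(history, budget=60)
constraint_survived = any("LC" in m["text"] for m in window)
print(constraint_survived)  # False: the binding constraint was dropped
```

Nothing errors, nothing logs; the model simply never sees the constraint. A relevance-aware policy would have kept the short, old, important message and cut the long, recent, unimportant one.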
Both failures share a root cause: the team never asked what the model actually needs to know at each step.
The framing that helps
Think of token budgets the way good engineers think about memory architecture.
You have fast, expensive memory (RAM) and slow, cheap storage (disk). Good system design isn’t “put everything in RAM just in case.” It’s “put in RAM exactly what’s needed right now, and design a retrieval path for everything else.”
Applied to AI systems: what does the model need in context right now? What can be retrieved on demand using vector search or a tool call? What should be compressed into a summary rather than passed verbatim? What can be dropped entirely because it’s no longer relevant to the current task?
These aren’t engineering questions. They’re product questions — because the answers determine what the user experience feels like. A model with a well-managed context window feels coherent, consistent, responsive. A model with a bloated or arbitrarily truncated context feels unreliable, even when the underlying model quality is identical.
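The triage those questions describe can be written down as an explicit policy. Everything in this sketch is an assumption (the tiers, the item fields, the budget), but naming the decision as a function is what forces the team to answer the product questions deliberately rather than by default:

```python
# A context-assembly policy along the memory-hierarchy analogy.
# Tier names, fields, and the policy itself are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ContextItem:
    name: str
    tokens: int
    relevant_now: bool   # needed for the current step?
    retrievable: bool    # fetchable on demand (vector search, tool call)?

def plan_context(items, budget):
    """Per item: keep in context, retrieve on demand, summarise, or drop."""
    plan, used = {}, 0
    for item in sorted(items, key=lambda i: not i.relevant_now):
        if item.relevant_now and used + item.tokens <= budget:
            plan[item.name] = "in-context"
            used += item.tokens
        elif item.retrievable:
            plan[item.name] = "retrieve-on-demand"
        elif item.relevant_now:
            plan[item.name] = "summarise"  # relevant but over budget
        else:
            plan[item.name] = "drop"
    return plan

items = [
    ContextItem("system prompt", 400, relevant_now=True, retrievable=False),
    ContextItem("current user turn", 150, relevant_now=True, retrievable=False),
    ContextItem("uploaded document", 9000, relevant_now=True, retrievable=True),
    ContextItem("old small talk", 1200, relevant_now=False, retrievable=False),
]

print(plan_context(items, budget=2_000))
```

The interesting part is not the code but the inputs: deciding what counts as “relevant now” and what is safely retrievable is exactly the product judgement the section above argues for.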
Where PMs need to be in this conversation
The practical intervention isn’t for PMs to become token engineers. It’s to include token budget decisions in product design — not as an afterthought, but as a first-class constraint alongside latency and accuracy.
This means asking, when designing an AI feature: what’s the longest realistic conversation this feature will have? What context will accumulate over time? At what point does the context become a liability rather than an asset? What’s the strategy for handling that?
It also means building observability into the product from day one. If you can’t see what your model is receiving in context during a production failure, you cannot debug it. Token count per call, context composition, truncation events — these should be as visible as response time and error rates.
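As a sketch of what that telemetry could look like, assuming a structured-logging pipeline (the field names are invented, not any vendor’s schema):

```python
# One structured record per model call: total tokens, context
# composition, and whether truncation occurred. Field names are
# illustrative assumptions, not a standard schema.

import json
import time

def log_context_event(call_id, sections, budget, truncated):
    """Emit a JSON record describing what the model actually received."""
    record = {
        "call_id": call_id,
        "ts": time.time(),
        "tokens_total": sum(sections.values()),
        "context_budget": budget,
        "composition": sections,        # tokens per prompt section
        "truncation_event": truncated,
    }
    print(json.dumps(record))
    return record

rec = log_context_event(
    call_id="c-123",
    sections={"system": 400, "history": 5200, "tools": 1800, "user": 150},
    budget=8_000,
    truncated=False,
)
```

With records like this, a production failure can be replayed: you can see that history crowded out the tool outputs, or that a truncation event preceded the bad answer, the same way you would read a latency trace.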
The teams getting AI features into reliable production are treating token management as a product design constraint. The teams still struggling are treating it as a backend implementation detail someone else will sort out.
It’s not. And the sooner PMs own that, the fewer expensive surprises they’ll ship to users.