#44 - Tokenmaxxing, Meet Systems Thinking
What your token spend forgets to measure
I had an entirely different opening to this article when I first wrote it. But Anthropic’s recent piece When AI builds itself forced my hand and I had to pull out this quote by one of their own engineers, which perfectly sets the scene for the remainder of this article:
“On days where everything works well, I can’t help but think nothing I do matters, everything is automated and better and faster than I ever will be. But then there are days where everything breaks and I don’t understand why and I realize I have no idea what I’ve been up to anymore.”
It’s almost too perfect that this is coming from a company where over 80% of their production code is now written by Claude. The first half of the quote is about speed: everything automated, faster, better. The second half is about the erosion of understanding. The former is getting all the attention. The latter is hardly talked about. And that gap is what this article is about.
It gets better. That same piece has the money shot we see in most AI-pilled posts: Anthropic’s engineers now ship roughly 8x the lines of code per day they used to. But tucked about halfway down, an important caveat: “Lines of code is an imperfect measure, as it measures quantity over quality. So 8× lines of code/engineer/day in the second quarter of 2026 is almost certainly an overstatement of the true productivity gain.”
You could be forgiven for missing that.
Jensen wants you to go ape
At Nvidia’s GTC this year, Jensen Huang sat down on the All-In Podcast and gave engineering leaders a brand new number to be insecure about. Asked how he thinks about AI inside his own teams, he said he’d be “deeply alarmed” if one of his $500,000 engineers didn’t consume at least $250,000 worth of tokens over the course of a year. Half their salary. If that engineer came back having spent “only” $5,000? In Jensen’s words: “I will go ape”.
Peter Steinberger, OpenClaw creator, took it to heart, spending over $1M in a month:
To be fair, there’s a real insight buried in Jensen’s thinking. Refusing to use AI to design chips in 2026 is a bit like insisting on paper and pencil while your competition runs CAD. Fine. But don’t lose sight of the incentives: the CEO of the company that sells you the tokens just told an entire industry that token spend is the productivity metric. Set the target, watch the world chase it.
The CEO of the company that sells you the tokens told an entire industry that token spend is the productivity metric.
And chase it we did. For months, “tokenmaxxing” went from in-joke to KPI. Leaders started bragging about per-engineer token burn the way they used to brag about lines of code, which should have been the first alarm. Guess what? The pendulum is swinging back: every other newsletter in your inbox is suddenly Very Concerned about AI ROI. Welcome to the party. No fucking notes.
Except the ROI take is the shallow one. And I want to go deeper.
The view from open source
While enterprises were busy gamifying their token meters, the open-source world was getting a free preview of where this leads.
Maintainers are drowning in slop. AI-generated pull requests that look legit but are just noise. Bug-bounty reports entirely hallucinated, complete with made-up vulnerabilities and bogus reproduction steps. Daniel Stenberg, who maintains curl, didn’t mince words: “We are effectively being DDoSed”. By January this year, roughly a fifth of submissions to curl’s bug bounty were AI-assisted and the share of valid reports had collapsed to about 5%, so he shut the HackerOne program down entirely. The cost of reading the slop exceeded the value of the bounty program itself.
Reviewer attention is finite and tokenmaxxing absolutely torches it
Now what’s interesting is that the slop receded. Nobody fixed the incentives, but the frontier models got good enough that the garbage mostly went away on its own. By April Stenberg was reporting that the slop “is not a problem anymore”; almost every security report now leans on AI to some degree, and the confirmed-vulnerability rate had climbed back to its pre-AI level of 15-16%. He reopened the HackerOne reporting channel in March, though the bounty money is gone for good. The attention stock drained hard for a year, then refilled, and what refilled it was pure luck, not a gauge on anyone’s wall.
This is an entirely different take on the ROI debate. Nobody is spending half a salary on tokens. Contributors are free after all. But the maintainers were very clearly worse off the entire time it lasted, because the thing being consumed isn’t measured in dollars. It’s attention. Reviewer attention is finite and tokenmaxxing absolutely torches it. Curl got its attention back by accident. Most teams draining a stock they never put on a dashboard won’t be so lucky, mostly because they won’t even notice it happened.
Which is the whole point. The cost was never the tokens.
Forget ROI. Did the system get worse?
My question to you: is your quality bar at least as high as it was before AI, and are your systems at least as stable?
I don’t care if you think you’re faster. Did the thing get worse, and would you even know?
Most organisations cannot answer this. They never captured a pre-AI baseline for change failure rate, defect escape rate, or time-to-onboard a new engineer onto a service. So they’re optimising a flow they can see (tokens, PRs, velocity) while completely blind to the stocks it’s draining. This is a systems thinking problem, though we seem to have forgotten all about it.
I’ve written before about why your metrics are probably lying to you. The key of that piece was Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. Tokenmaxxing is textbook Goodhart’s Law and tell you nothing about whether anyone is doing good work. It just tells you who’s good at spending.
Stocks and flows, or: where the velocity actually went
Back in the engineering management series — pre GenAI craze — I introduced System Dynamics through the rework cycle. The core distinction is simple. A flow is a rate, something you measure per unit of time: code merged this week, tokens burned this sprint. A stock is an accumulation, a reservoir that fills and drains slowly: code quality, system stability, the team’s actual understanding of what it has built.
Back to the articles opening quote: the first half was a flow, going brrr. The second half was a stock, hitting empty. Tokenmaxxing optimises one highly visible flow and quietly drains four stocks few organisations track.
Undiscovered rework; and this one grows. The rework cycle’s nastiest feature is the reservoir of work that’s been done wrong but not yet known to be wrong. AI is a great engine for filling it. The output looks finished. It passes the tests (that perhaps the AI generated itself). So it flows straight into “done”, and the defects sit in the undiscovered-rework tank until an incident pulls the drain plug weeks later. The promised speedup arrives first; the bill arrives second, with interest, as known rework that extends the very timeline you were trying to compress.
The promised speedup arrives first. The bill arrives second, with interest.
Engineer comprehension; drains. This one is scary because it refills slowest. If you just wave through what the AI generates, people stop building a mental model of it. The team’s understanding of its own system slowly degrades. I really don’t think that Anthropic engineer was being dramatic; I’ve felt the same way. “I have no idea what I’ve been up to anymore” is the comprehension stock hitting empty. I called this the explainability debt back when I wrote about the death of software design: the gap between what your systems do and what your engineers can actually explain. Comprehension and explainability debt are two sides of one coin. Comprehension is the asset; explainability debt is what accrues as it drains. In 2025 I framed that as a slightly dystopian future. Tokenmaxxing brought it forward way faster than I predicted.
Want to read the gauge on this one? Try the whiteboard test. Pick a service your team shipped last quarter and ask them to draw how it works without opening the editor. The length of the silence is your reading.
Reviewer attention; drains. See: every open-source maintainer above. Finite, gets flooded, and when it’s gone, review quality collapses, which feeds straight back into undiscovered rework. It’s a reinforcing loop, and it runs in the wrong direction.
System stability; drains. More change means more opportunity for change to fail. The DORA research has been telling us for a decade that deployment frequency and change failure rate have to be read together; its most famous finding is that speed and stability aren’t a trade-off, but elite teams earn that result precisely by watching both. Increase the flow of changes without watching the failure rate and you don’t get a faster system, it’ll just break more often.
The industry is fumbling towards the gauge
To be fair, the walk-back has started, and some of it is even in the right direction.
Amazon ran an internal leaderboard, KiroRank, that ranked engineers by tokens consumed, right up until it clocked people climbing the board by feeding the model pointless busywork. Textbook Goodhart again. So they scrapped it and switched to tracking “normalised deployments”: did the AI actually produce useful, shipped code? This swaps the meter for something close to the gauge: measuring the outcome the business wanted instead of the spend it took to get there. Exactly the move.
Separately, Amazon got a blunter lesson, and it came from Kiro: the very tool behind that leaderboard. AWS let Kiro make changes to a customer-facing cost calculator, and the model opted to “delete and recreate the environment”, taking the service down for thirteen hours. Amazon waved it off as “extremely limited”. By March, after a run of incidents an internal memo flagged as “high blast radius” and tied to “GenAI-assisted changes”, a senior VP told engineers that junior and mid-level staff would now need a senior to sign off any AI-assisted change. The right instinct: defending reviewer attention and stability on purpose. (There was a near-six-hour retail outage that knocked out checkout around then too, though Amazon pins that one on a plain bad code deployment, not AI, and I’ll take them at their word.) But the stock had to hit empty before anyone went looking. An expensive way to learn.
And then there’s Uber, which burned through its entire 2026 AI budget by April and responded by capping engineers at $1,500 per tool per month. Which is, when you squint, management of the meter. Capping the flow because it frightened the finance team tells you nothing about whether the four stocks are draining. Uber’s defect escape rate could be climbing all quarter and the spend cap wouldn’t so much as flicker. And the tell is that Uber’s own COO already knew it: asked whether all that AI-generated code was actually translating into anything, Andrew Macdonald said that “it’s very hard to draw a line between one of those stats and ‘Okay now we’re actually producing like 25% more useful consumer features’”. Ouch.
Now, I can’t hand you Amazon’s gauge (their approach). “Normalised deployments” is their outcome, reverse-engineered from their business. Yours is almost certainly something else, and that’s the work to be done.
Because the fix was never a metric you could copy off someone else’s slide. It’s a pair of them. One number for the outcome that actually moves your business, and a second, pulling in the opposite direction, that stops the first from being gamed. The self-balancing metrics idea from my previous article, pointed squarely at AI.
What that pair is depends entirely on what you do. If you’re a consumer product, maybe it’s revenue per active user held honest by NPS, so “growth” can’t quietly come to mean dark patterns that churn people out in six months. If you’re running a platform, maybe it’s cost-to-serve paired with task completion rate, so “cheaper” can’t quietly mean an assistant that gives up halfway. If you’re closer to my world, maybe it’s deployment frequency coupled with change failure rate and the time it takes a human to onboard onto the codebase. There is no universal answer here.
Two things that won’t fit on a tidy slide but matter more than any single pair:
You need a “before”. “Did the system get worse” is unanswerable without a pre-AI reading. If you’re early enough to still have a clean-ish comparison, capture it now. If you’re not, start the clock today, future-you will be grateful in 12 months time.
Read the coupling, not the level. One ugly quarter on any of these is noise. A number that degrades every quarter as token spend climbs is the system talking to you. The direction and the correlation carry the signal.
The only frame that isn’t theatre
Tokenmaxxing alone isn’t useful. A number that looked like progress, great for engagement, that mostly measured enthusiasm. And the ROI walk-back everyone’s writing about right now? That’s just measurement theatre: swapping one lagging indicator for another. ROI will tell you the system failed, eventually, after the stocks have drained and the damage is locked in.
Systems thinking is the only frame here that isn’t a performance. It asks the question that matters before the incident, not after: what is this flow doing to the stocks I’m not looking at?
So go ahead. Spend the $250k. Go ape, even. Just don’t mistake a full token meter for a healthy system.





