Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering

6/7/2026 at 11:00:04 AM

I have a multi-agent (MA) system set up for personal use.

You give it a problem, then refine it: a fast, cheaper model asks you questions, and your answers produce a better input prompt. You then choose an MA strategy, for example breaking the problem into sections with a final judge that concludes, or a multi-turn setup where agents debate and a judge summarises the debate.

The best approach is what I call 'all angles', where all these strategies run in parallel and a final meta-judge synthesises the response. The most useful part, which I recently added, is a view that shows the variance in each strategy.

Been using this for life stuff - housing search, schools, family challenges!

Perhaps I should make a video of it in action. If people in the HN community are interested, let me know.

by monkeydust

6/7/2026 at 4:12:23 PM

Right, here is the video demo of what I built - https://streamable.com/e49cgt

by monkeydust

6/8/2026 at 4:12:38 PM

Details and repo post on ShowHN here - https://github.com/monkeydust/rightmind

by monkeydust

6/7/2026 at 6:10:41 PM

I have also developed a similar system, though not focused on the exploratory refinement of prompts. It is more focused on cybernetic-style feedback loops: maintaining the stability of the prompt outputs through a growing library of deterministic checks and autofixes. Anything that is a "problem" and isn't covered by that library is surfaced to the human driving the process.
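A minimal sketch of what such a check/autofix loop could look like. The `Check` class, the `stabilise` function, and the two example checks are hypothetical illustrations, not the actual system described here:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Check:
    """One deterministic check, with an optional deterministic fix (hypothetical)."""
    name: str
    has_problem: Callable[[str], bool]
    autofix: Optional[Callable[[str], str]] = None


def stabilise(output: str, checks: List[Check], max_rounds: int = 3) -> Tuple[str, List[str]]:
    """Apply autofixes until nothing fixable fails, then report what is left."""
    for _ in range(max_rounds):
        fixable = [c for c in checks if c.autofix is not None and c.has_problem(output)]
        if not fixable:
            break
        for check in fixable:
            output = check.autofix(output)
    # Anything still failing is not covered by the library: surface it to the human.
    unresolved = [c.name for c in checks if c.has_problem(output)]
    return output, unresolved


# Example library: one check with an autofix, one without.
checks = [
    Check("trailing_whitespace", lambda s: s != s.rstrip(), str.rstrip),
    Check("contains_todo", lambda s: "TODO" in s),
]

fixed, unresolved = stabilise("answer TODO  \n", checks)
print(repr(fixed), unresolved)
```

The library grows by appending new `Check` entries; only checks without a working autofix ever reach the person driving the process.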

by ethanwillis

6/7/2026 at 12:30:10 PM

You mention cost in one of the replies. Can you elaborate on the cost profile (ballpark) for various problem types? I would also be curious to understand the strategies employed and what the costs look like across each.

by chrisss395

6/7/2026 at 12:21:12 PM

Definitely interested, would love to see a video :)

by Folcon

6/7/2026 at 2:09:06 PM

Sure, let me do that. Can I post this as a ShowHN if it's just a video? The rules say people need to be able to try it out, but that would cost me a small fortune :) ...I could perhaps post it on GitHub and people can set up the repo themselves with their own Openrouter key, if that works. I have never done a ShowHN but it would be fun to try.

by monkeydust

6/7/2026 at 6:03:18 PM

The cheap models may ask subpar questions leading to subpar solutions

by whattheheckheck

6/7/2026 at 11:10:16 AM

So what harness are you using? And what LLMs?

by uxhacker

6/7/2026 at 11:30:20 AM

Homebrew harness and all frontier ones plus deepseek. All via Openrouter at the moment. Works well enough but can get expensive so use for real high value challenges. Interestingly the refine feature has been most useful to me and people I have shown, essentially people are lazy when expressing the initial problem (me included!), refine asks relevant questions to initial problem then refines the initial statement, user can accept/reject/edit before submitting.

by monkeydust

6/7/2026 at 1:09:52 PM

I came to a similar conclusion. I think the default options in many IDEs (Ask/Plan/Agent) are limited... 'Refine' feels like an improved 'Plan' in that it doesn't just jump right into building a list of tasks based on the initial prompt, because who knows what sort of flaws or deficiencies were present in the initial prompt! Can't always get everything right in the first try. XP

I don't think a specific harness is even necessary to get a boost from 'Refine'. Even a simple custom agent is portable enough... it's easy enough to take the existing 'Plan' agent definition present in VS Code and tweak it to be 'Refine' instead.

by Cherub0774

6/8/2026 at 1:17:26 AM

There is a 5 line skill I’ve been using for refinement called grill-me that works quite well

by SOLAR_FIELDS

6/8/2026 at 1:04:27 PM

The problem with these kinds of systems (they have been well studied) is that the overall output is ultimately anchored to the dumbest models used.

I.e. you cannot end up with a more intelligent output by using a larger number of dumber models (that is: dumber than the most intelligent model used).

It's generally always best to refine your prompt and send it (at most) to the two smartest frontier models possible. And then have the smartest model review the output from the second smartest.

by saberience

6/7/2026 at 4:20:45 AM

amusing side note:

Was in a meeting reviewing a potential new product, it was going well until they showed us that they had added AI to it (of course they have). It was pretty obviously just shoehorned in, and one part of that obviousness was that they had a column that showed how many tokens it took to make each query.

I asked who is paying for the tokens; they said it's included in the license. I said, so is there a budget, or is it all-you-can-eat? They said good question, they didn't know and would get back to me. I said the reason I asked was that just one query there had a 250k token burn on it, and it was a fairly simple query about one device.

Then one of the execs on their side was heard saying out loud, "Why are we even showing this to the customers?"

It gave us quite a chuckle. But lesson learned... the cost of adding AI to anything isn't really being accounted for, let alone the true cost of actually running the AI.

All things AI are going to get more expensive, even if you don't want the AI aspect.

by senectus1

6/7/2026 at 7:33:52 AM

AIshittification

by prymitive

6/7/2026 at 5:54:04 AM

One month I could use Github Copilot fully with no disruptions. The next month, after pricing changes, I’ve run out of tokens in two days.

Such drastic changes tell me that pricing of tokens is arbitrary, and AI business is running out of money fast.

by sedatk

6/7/2026 at 6:15:54 AM

I think it's more a consequence of pushing for the biggest valuation/IPO. Rumoured profits on inference are north of 70%.

Taking SpaceX as an example, they have increased prices across all their consumer products over the past six months. But they definitely aren't short on money with Alphabet and Anthropic combined paying them over $2 billion per month.

Microsoft/GitHub lost out here as they were just repackaging other people's products.

by lucaspiller

6/7/2026 at 6:48:25 AM

Inference can only happen after having invested in training and datacenter construction. Arguing about "inference profitability" sounds a lot to me like ignoring large cost centers of these companies.

by lefra

6/7/2026 at 8:25:07 AM

> Rumoured profits on inference are north of 70%.

Rumors are worth squat when they’re most likely put in motion by the people with a vested interest in this industry.

Let’s talk about profits when there’s real data from the IPO documentation.

by jurgenburgen

6/7/2026 at 8:39:02 AM

> Rumors are worth squat

You can make some educated guesses and find out some limits on inferencing cost by looking at 3rd party providers on platforms like openrouter. You can get some median cost /tok for a given model size. Then make some educated guesses on SotA model sizes, and you can get an estimate on pure cost of serving a model. Error bars and all that, of course. But still a range, with some limits.
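As a rough illustration of that kind of bounding exercise, here is a sketch that turns a handful of third-party prices into a low / median / high band. The prices in the list are invented placeholders, not real provider quotes:

```python
from statistics import median


def price_band(usd_per_mtok):
    """Return (lowest, median, highest) price per million tokens."""
    ordered = sorted(usd_per_mtok)
    return ordered[0], median(ordered), ordered[-1]


# Hypothetical prices from five providers serving the same open-weight model.
hypothetical_prices = [0.90, 0.40, 1.20, 0.55, 0.60]
low, mid, high = price_band(hypothetical_prices)
print(f"low ${low:.2f}, median ${mid:.2f}, high ${high:.2f} per 1M tokens")
```

The band is only an estimate of serving cost for that model size; it says nothing about training or staffing costs.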

by NitpickLawyer

6/7/2026 at 2:59:00 PM

No, you can't really make educated guesses unless people start opening their books. Especially in an industry where the vast majority of firms make up valuations out of thin air and not based on any reproducible insights.

by shimman

6/7/2026 at 3:30:27 PM

Opening their books would let you know things like profitability. I'm talking about cost per token, model development and human costs being irrelevant.

by NitpickLawyer

6/7/2026 at 6:11:44 PM

Yeah take the gpu rental cost, what it can run, how many tokens per second come out and see the true rate per token. Plus the margin on harness special sauce
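A quick sketch of that arithmetic. The rental price and throughput below are made-up placeholders, not measurements, and the function name is just for illustration:

```python
def usd_per_million_tokens(gpu_usd_per_hour: float, tokens_per_second: float) -> float:
    """Raw serving cost per 1M tokens for hardware rented by the hour."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical: a $2.00/hour GPU sustaining 1,000 tokens/second across all users.
cost = usd_per_million_tokens(gpu_usd_per_hour=2.00, tokens_per_second=1000)
print(f"${cost:.2f} per 1M tokens")
```

Throughput here means aggregate tokens per second over all concurrent requests, so batching efficiency and utilisation dominate the result.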

by whattheheckheck

6/7/2026 at 5:13:39 PM

How is SpaceX not short on money when no one will pay them to use their models and they lose money every quarter? Sure, they’re now transitioning to being a data center provider, away from actually being an AI company, because they’re losing less money that way, but it doesn’t sound like a strategic success.

by TSiege

6/7/2026 at 2:21:01 PM

SpaceX is increasing prices because they're trying really hard to get into the S&P 500.

by phyzix5761

6/7/2026 at 12:41:20 PM

The GitHub example is also a bit of an outlier because they made a recent change to their pricing, so that's why it's such a drastic jump.

Also, I mean, prices in general for all things are based on underlying factors; that doesn't make them arbitrary (i.e. GitHub executives using a random number generator for token pricing would be arbitrary).

by altmanaltman

6/8/2026 at 4:42:31 PM

What I mean by arbitrary is something like raising bread prices from $5 to $50. That’s not a sign of cost-based pricing. It’s arbitrary.

by sedatk

6/7/2026 at 8:59:28 AM

> Furthermore, we observe that input tokens consistently constitute the largest share of consumption for an average of 53.9%

I'm seeing a ratio of around 10:1 in my usage. A vast majority of the tokens consumed are on the input side. The agent will often read a million tokens just to patch one line of code.

I think if you are seeing something closer to 1:1 or more on the output side, there is either a problem with the agent or the codebase is new / empty.

by bob1029

6/7/2026 at 9:29:47 AM

Did you experiment with giving the agent better tools to navigate and document the codebase? ASTs, language servers and so on?

A million tokens (not cached) sounds like a lot.

by kolinko

6/7/2026 at 9:38:46 AM

The target codebase is very large. A million tokens is a drop in the proverbial bucket.

I still don't understand how caching helps me very much. I must be misunderstanding it because I thought the user's prompt (which is the biggest variable) necessarily sits prior to all of these token intensive tool calls. How can we cache the reading of codebase if the prefix is always moving?

by bob1029

6/7/2026 at 10:25:42 AM

If an agent makes a tool call, the LLM provider will receive the full context again after the result of the tool call becomes available in order to decide the next move. Everything up to the point of the tool call being made will no longer change and could thus in theory be cached. If the agent makes a ton of tool calls, then for every tool call one should be hitting the cache an equal amount of times.

A new instruction by the user will be appended at the end if it is done in the same conversation. It thus only influences the cacheability of the original agent prompt, not of subsequent tool calls.
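A small simulation of that argument, assuming idealised exact-prefix caching: everything sent in the previous request is still cached, nothing expires, and the context only grows by appending. The function name and token counts are hypothetical:

```python
def simulate_prefix_cache(prompt_tokens, tokens_added_per_step):
    """Return (total_input_tokens, cacheable_tokens) over one agent run.

    Each step re-sends the whole context; the part already sent in the
    previous request is counted as a potential cache hit.
    """
    context = prompt_tokens
    previous_context = 0
    total_input = 0
    cacheable = 0
    # The first request carries only the prompt; each later one adds a tool result.
    for added in [0] + list(tokens_added_per_step):
        context += added
        total_input += context
        cacheable += previous_context
        previous_context = context
    return total_input, cacheable


# Hypothetical run: 1,000-token prompt, then three tool results of 500 tokens each.
total, hits = simulate_prefix_cache(1000, [500, 500, 500])
print(total, hits, f"{hits / total:.0%} of input tokens could be cache hits")
```

The more tool calls in a run, the larger the cacheable share becomes, since every request repeats an ever longer unchanged prefix.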

by Phemist

6/7/2026 at 11:42:23 AM

Often to me it seems like using MA is like letting a million monkeys loose.

Has AI forgotten about high-level design? Surely all it needs to know is what the methods, objects or functions in the codebase actually do, plus the actual code it is meant to be fixing?

I wonder if half the issue is that the LLMs try to change too much?

by uxhacker

6/7/2026 at 11:08:46 AM

> The target codebase is very large.

But, does every prompt need the entire codebase?

by frumiousirc

6/7/2026 at 11:59:47 AM

How could it not? Can you ever guarantee accurate answers about a book you haven't entirely read?

by amazingamazing

6/7/2026 at 9:20:40 AM

If input tokens dominate the cost to that extent, this implies that major gains are possible by making better use of caching. You could basically ask the model to do a one-time "compaction" step including a dump of the relevant portions of the code, and use that as the cached prefix for a large amount of "swarm" subagent calls.
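A back-of-the-envelope sketch of why a shared cached prefix helps. All numbers (prefix size, price, cached-read ratio) are hypothetical placeholders, and it ignores cache-write surcharges, cache expiry, and the cost of the compaction step itself:

```python
def swarm_input_cost(prefix_tokens, task_tokens, n_agents,
                     usd_per_mtok, cached_price_ratio):
    """Input cost in USD for n_agents calls that share one prefix.

    Returns (uncached_cost, cached_cost). cached_price_ratio is the price of
    a cached token relative to a normal input token (assumed, e.g. 0.1).
    """
    per_token = usd_per_mtok / 1_000_000
    uncached = n_agents * (prefix_tokens + task_tokens) * per_token
    first_call = prefix_tokens * per_token  # pays full price, fills the cache
    later_calls = (n_agents - 1) * prefix_tokens * per_token * cached_price_ratio
    task_part = n_agents * task_tokens * per_token  # per-agent suffix, never cached
    return uncached, first_call + later_calls + task_part


# Hypothetical: 200k-token compacted prefix, 2k-token task each, 50 subagents,
# $3 per 1M input tokens, cached reads at 10% of the normal price.
uncached, cached = swarm_input_cost(200_000, 2_000, 50, 3.0, 0.1)
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")
```

Under these assumptions the saving scales with the number of subagents, since only the first call pays full price for the prefix.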

by zozbot234

6/7/2026 at 2:59:50 AM

One thing I've noticed using agents for coding is that they really like to write thousands of unit tests but not dynamically test.

by sakuraiben

6/7/2026 at 3:17:35 AM

And they like to burn a ton of tokens writing and debugging tests that are semantically corrupt.

by drivebyhooting

6/7/2026 at 3:52:58 AM

And AWS heavily pushes a complex lambda solution stringing together as many chargeable AWS services as possible for a simple requirement

Their interests are often not your interests. In this case they want you to spend unnecessary money on useless work (let's stop the euphemism of "tokens" btw)

by gib444

6/7/2026 at 12:14:04 PM

These kinds of cute conspiracy theories don’t actually hold true in real life. The companies want to make useful products.

by simianwords

6/7/2026 at 12:35:41 PM

They're all just in it for the love of making customers happy, for sure. Amazon actually donates all their profit to charity. Bezos is currently building a foundation to shift all his wealth to, and he's going to become an alpaca farmer

by gib444

6/7/2026 at 5:41:48 AM

Unit tests are a type of dynamic testing. As opposed to static testing which is linting/typechecking etc.

If you want a different kind of dynamic testing besides unit tests, have you tried writing it in as a requirement during the planning/PRD phase?

by esperent

6/7/2026 at 3:03:55 AM

you can just tell them to do more dynamic testing. I think dynamic testing is partly frowned upon because it slows things down & can take down software where you wouldn't expect

by make3

6/7/2026 at 4:07:20 AM

Reminded me of this paper from last year trying to optimize efficient token usage providing budget guidance information. [1]

[1] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Stee...

by SubiculumCode

6/7/2026 at 6:46:22 AM

It’s just like Airline reward miles and offers no benefit to companies over just renting bare metal GPU time

by gmerc

6/7/2026 at 6:59:24 AM

I hope this horrible time will soon be over, once cheaper NPUs become available from more hardware companies and model sizes get optimized down further.

I wonder what hyperscaled compute farms and models will be good for at that running cost when most AI needs can be fulfilled by on-prem and on-device hardware and models. Probably the only customers left will be big governments. So in the end the taxpayer has to pay for those billions of investments by the AI cartel.

by emsign

6/7/2026 at 7:33:35 AM

The typical NPU is only marginally helpful for on-prem inference. A GPU can read quantized data from main memory and dequantize/pad it locally (making effective use of memory throughput); a NPU often needs to read padded data directly from memory, which is wasteful. So it only helps a little bit wrt. prefill.

Also, smaller models can obviously be used but a smaller model will be a lot weaker in real-world knowledge and this tends to limit their smarts in a way that can't be compensated by more thinking.

by zozbot234

6/7/2026 at 6:52:54 AM

In its current iteration the AI tech market is not economically sustainable: not for the other markets outside the AI economy, and most fatally not even for the main target customers or the AI tech companies themselves. There have been several news reports of companies overspending their token budget month after month. The hardware monopolist and his network of buddy companies can set the token price as freely as they want; there are no competitors. Their only "competitor" is people stopping using AI altogether.

by emsign

6/7/2026 at 11:24:41 AM

I don't think business is interested in any sustainability of anything. There's zero incentives for that for anyone.

by scotty79

6/7/2026 at 3:17:00 AM

In the past Google et al would hire engineers based on how well they could optimize the infrastructure.

Maybe soon companies will look at how engineers can optimize the token efficiency of AI.

by drivebyhooting

6/7/2026 at 3:32:05 AM

That assumes Tokens will remain a meaningful expense. I’m not sure developers will find uses for ever more tokens nearly as quickly as the prices fall.

by Retric

6/7/2026 at 4:55:21 AM

How are we so confident that prices will fall? Isn't the exact opposite happening, right now, during arguably the most critical part of this whole saga (pre-IPO to make things appear as beautiful and as not-obviously-illegal as possible)? And the only reason they were "falling" previously was for hyper growth.

by ares623

6/7/2026 at 5:58:28 AM

The Growth aspect mentioned is that VCs are subsidizing the bill right now, so it is hard to know if at the current moment the demand curve would promote as much usage without it, but assuming demand remained constant (not even growing), you could expect token prices to be competed down. It is a commodity without a moat.

Now that we have pretty decent open source models, anyone can create a new business to supply more tokens. Sure there’s short term scarcity: energy, GPUs, cooling, but this is a scale up problem. More token demand = more data center build = more energy plant build. This downward pressure will also keep frontier private model prices in check.

Differentiation seems to be happening at the harness level, whereby we can expect token spend to be a metric to compete on and drive down for the customer (at least hoping tools in the application space don’t continue token based billing as their primary revenue stream).

These are not short term hyper growth forces, but a fundamental alignment of incentives.

by jpatt

6/7/2026 at 11:43:31 AM

Pricing on SotA models probably won’t fall; there’s no reason for the frontier labs to lower their prices.

But we’re seeing lots of open-weight models that are either pretty close to SotA or, more importantly, perfectly capable of doing all the low-level, token-intensive grunt work when writing code. Pair them with SotA models for long-horizon task management, and you’ve got a very cost-effective system.

The frontier labs have put little effort into cost-efficient inference because they don’t need to, but folks like DeepSeek clearly have, and have achieved some impressive cost improvements. Given DeepSeek’s models give you 70% of the capabilities for 30% of the cost, expect people to start moving lots of workloads to providers that offer cheap inference for open models, and huge competition to appear to provide that cheap inference. It’s truly commodity LLM inference.

In turn, expect more companies to focus on building inference-efficient models, because someone who can build a model that provides 70% of SotA capabilities for 10% of the token cost immediately eats up a huge amount of the available inference market.

Another factor in all this is that it’s becoming increasingly clear that building custom agents/workflows for LLMs to operate in is required to get the best out of these models. That means people are implicitly building the infra needed to use multiple model types and evaluate workflow performance end-to-end, which in turn means they have everything they need to plug in future, cheaper inference providers and quickly evaluate whether they can change their model provider.

by avianlyric

6/7/2026 at 6:04:04 AM

It is falling if you look elsewhere: DeepSeek made the 75% discount on their V4 models permanent. On one hand there are LLM improvements that make inference cheaper (e.g. MoE, hybrid attention); on the other hand we're getting more inference-focused chips that break the Nvidia monopoly.

I don't think a lot of people know this, but a cluster of GPUs can serve multiple clients without much of a drop in performance. E.g. in the worst-case scenario you band together with 6-16 people to run a 2-3 H100 server to host DeepSeek V4 Flash, or 4-6 to run Pro, and you're getting the same performance as if you ran it alone. This means a lot of companies can afford to throw 50-100k into their own LLM server cluster.

We're at a price point where if you push it further people will move, there's no real vendor lock in, your agent config, skills, MCP servers etc are all reusable with other models and harnesses, so unless you get all providers to collude on a price hike, you risk an exodus of customers

by mobelkh

6/7/2026 at 6:42:24 AM

In the one direction the hardware continues to improve, new buildouts continue to come online, and methods for improving the parameter efficiency of models continue to be discovered.

In the other direction models continue to grow larger, new customers continue to arrive, and existing customers continue to find ever more creative ways to burn large quantities of tokens as the prices fall.

I doubt anyone can say with certainty where the equilibrium will be 1 or 5 years from now largely because (among many other things) it's impossible to predict how much of the current economy AI will end up eating. In general though the third party providers of open weights models are probably the most reliable data source available since they have little to no incentive to subsidize usage.

by fc417fc802

6/7/2026 at 5:53:27 AM

I don’t think we can extrapolate from current API pricing, but dramatically improving hardware in terms of cost:performance is the underlying reality.

Betting against that you need to assume exponentially more expensive models every year.

by Retric

6/7/2026 at 8:14:34 AM

Prices have fallen dramatically over the last few years. It’s just that our standards have increased because we are using AI in ways that were not possible with worse models. But for the same level of “intelligence” as we had a couple years ago, the prices are so much lower.

by oersted

6/7/2026 at 9:12:03 AM

I know how to drop a company’s token costs to zero: treat tokens as a utility same as internet and make engineers pay for it.

by deadbabe

6/7/2026 at 11:27:29 AM

I would easily pay a lot of money to have access to AI for my job. I actually do pay. If the cost was significant I'd just add it to hourly rate that I consider acceptable. Company always pays in the end, because company is the only entity with money in this setup.

by scotty79

6/8/2026 at 2:54:37 PM

You will be undercut easily by someone running a cheaper LLM setup.

by deadbabe

6/8/2026 at 4:44:11 PM

That depends entirely on how they and I are using tools available. There must be a sweet spot. Best person at finding that sweet spot wins on price. I'd be up for that.

by scotty79

6/8/2026 at 9:57:54 PM

It will trigger a race to the bottom. The sweeter the juice the better the squeeze.

by deadbabe

6/9/2026 at 10:38:52 AM

Isn't that the entire point of the economy and the free market? Finding out how cheaply things can be done and doing them?

by scotty79

6/7/2026 at 3:40:46 PM

I wrote a Substack post on this topic back in December https://open.substack.com/pub/zacharywhitley/p/the-coming-ag...

by zcw100

6/7/2026 at 4:04:45 AM

Tokenomics is already a word used to describe cryptocurrency economics, not sure why they'd try to redefine it for AI even if a different sort of token is used.

by satvikpendem

6/7/2026 at 10:52:29 AM

Tokenomics had been already used by marijuana enthusiasts for a long time.

by mariusor

6/7/2026 at 4:24:15 AM

New fad. Forget about the old fad. This one will be old soon, you better get on board before it's too late!

by NewJazz

6/7/2026 at 10:54:28 AM

cryptocurrency economics = cryptonomics

You're welcome! =)

by alchemism

6/7/2026 at 12:24:10 PM

Neal Stephenson wrote a book about it.

by layer8

6/7/2026 at 7:45:44 PM

The cryptonomics con

by badc0ffee

6/7/2026 at 7:49:49 AM

Crypto was already a term before cryptocurrencies made it about them. Web 3.0 was already a thing before crypto bros made web 3 about cryptocurrencies.

So what? Terms are reused in different contexts all the time. And most people have moved on from cryptocurrencies anyway, so there’s little chance it’ll confuse anyone.

by dkersten

6/7/2026 at 6:59:37 AM

First thought was "only 30 tasks"; however, the findings map to what I've seen personally: code review consumes the majority of tokens

by becomevocal

6/7/2026 at 7:37:15 AM

Code review could also be run as an unattended/batched task though, possibly with at least some use of on-prem inference (which excels at this). That would be a major saving compared to the usual cloud inference scenario.

by zozbot234

6/7/2026 at 12:51:04 PM

with which models, though?

by jwnin

6/7/2026 at 11:38:32 PM

Yeah, wasn’t there a report recently on how local models, after energy costs, weren’t actually more efficient at completing the same task?

by becomevocal
