The Token Compression Illusion: Why I'm Skeptical of RTK

6/18/2026 at 6:17:52 PM

I am glad articles like this are finally starting to get some momentum around what I call the LLM magic box industry. From caveman mode to RTK to semantic search and everything in between. Developers have become magicians that cast spells instead of engineers. It sucks at work especially with everyone so sure that their magic spell is the one for ultimate token savings.

My criteria are: if it’s not in a harness it’s probably not that good (the best ideas float up to Codex/Claude imo) and any GitHub advertising some percent of token savings is not to be trusted.

It’s hard to avoid the snake oil and I hope people start thinking critically on this stuff.

by cityofdelusion

6/18/2026 at 9:01:41 PM

Totally wrong, you underestimate the frontier's incompetence in anything other than building LLM models (ehm ehm flickering TUI for a year "written like a game engine").

I ran a bunch of benchmarks and there are proven ways to reduce tokens while achieving the same results (finding the same CVEs / finding the same bugs in CRs, etc...).

See https://maki.sh, it's my own little proof.

by tontinton

6/19/2026 at 7:20:55 AM

Downloaded it and giving it a try. I love the aesthetics and how colorful it is! It is also pleasant to use and fast, although my personal preference is TUIs that don't break scrollback (claude code actually does a good job at this).

by ChadNauseam

6/19/2026 at 6:54:30 AM

Tried it just now. The onboarding process could be better, for example by guiding the user to pick from the available models when a provider is set up but it's not Anthropic. I wasted a bit of time figuring out that the provider was detected and just the model was wrong.

But I like it, code processing is freaking fast.

by aaulia

6/18/2026 at 11:38:32 PM

an agent harness built in rust with ratatui - checks out. i've built one myself. i don't maintain it, and continue to use opencode, but it was worth it to learn how agent harnesses work.

anyway, what's the real pitch on why i should move on from opencode to maki?

by dfee

6/19/2026 at 7:27:32 AM

> what’s the real pitch on why

I’m not OP, but parent comment and linked site https://maki.sh talk about token reduction.

by no-name-here

6/19/2026 at 4:43:30 AM

I just tried it. It is awesome!

Can you add an indicator to show whether the tool is currently running or not running (due to - no prompt, API error, waiting for permission etc)

by saaspirant

6/18/2026 at 9:40:39 PM

What is your approach for reducing token usage and is it different than rtk?

by lackoftactics

6/18/2026 at 9:55:59 PM

The biggest ones are: using tree-sitter to index code files as a tool, code_execution tool running a workflow of tools inside a python interpreter (monty), and not being a harness developed by the company profiting from selling you the shovels (and introducing "dynamic workflows" aka spawning 50 agents).

by tontinton
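
The thread does not show how Maki's tree-sitter index works, so the following is only a rough sketch of the general idea: give the agent a compact outline of a file instead of the whole file. It swaps in Python's standard-library `ast` module as a stand-in for tree-sitter, and the `outline` function and `sample` source are hypothetical names made up for illustration.

```python
import ast

def outline(source: str) -> list[str]:
    """List top-level functions and classes (plus methods) with line numbers."""
    entries = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entries.append(f"def {node.name} (line {node.lineno})")
        elif isinstance(node, ast.ClassDef):
            entries.append(f"class {node.name} (line {node.lineno})")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    entries.append(f"  def {item.name} (line {item.lineno})")
    return entries

# Hypothetical file contents; a real index would read files from disk.
sample = '''
class Cache:
    def get(self, key):
        return self._data.get(key)

def main():
    print("hello")
'''

for entry in outline(sample):
    print(entry)
```

An outline like this is a few lines long regardless of how large the function bodies are, which is where the token reduction would come from; whether it preserves task accuracy is a separate question that only a benchmark can answer.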

6/18/2026 at 10:46:03 PM

Tbh I could buy into what you are proposing more than rtk. It feels sane in comparison

by lackoftactics

6/19/2026 at 1:54:08 AM

Looks very cool. I would like to try it, but don't want to use API billing. OpenAI I think would allow it to use account login. Would you support that?

by raylad

6/19/2026 at 4:44:27 AM

Not OP, but it is already supported via Codex auth. Please run `maki auth login openai`

by saaspirant

6/18/2026 at 9:24:31 PM

Maki is awesome. Thanks! I'm using it on my X220 and it flies in comparison to OpenCode et al.

by rescbr

6/18/2026 at 9:53:29 PM

Enjoy, I can't go back to other agents now, too spoiled by the speed

by tontinton

6/18/2026 at 10:56:22 PM

I also have become a maki convert and I really like it. I ran into an issue with the dynamic model provider that I should probably make a patch for; list_models doesn’t use the `<provider> models` output at all but instead tries to look up `<provider> resolve`’s base URL + /v1/models, which breaks on a provider like Z.ai which doesn’t have /v1/ anywhere in the path…

by 0xc133

6/19/2026 at 7:16:13 AM

Create an issue! :)

by tontinton

6/18/2026 at 10:16:48 PM

Wait what? I thought Maki is a Pi/OpenCode replacement i.e. just the TUI for whatever you plug in it i.e. API-based Codex / Claude?

In another comment you said "I can't get back to other agents". What gives? Feels like I completely misunderstood what Maki is.

by pdimitar

6/18/2026 at 10:23:12 PM

It's a TUI you're right, but it's also a harness.

As much as I hate to admit it, the tools you provide, the descriptions, and the prompts all amount to pretty big changes in experience, even when using the same models.

by tontinton

6/18/2026 at 10:27:29 PM

That didn't help very much. What did you mean by "agents" earlier on? The tool/harness or the LLM itself?

Also -- can you make Maki enforce the underlying LLM to use stuff like fd/rg and not always default to find/grep, for example? And stop trying to do bash-isms in a zsh system?

by pdimitar

6/19/2026 at 7:21:49 AM

I meant harnesses / TUIs, sorry to confuse.

Using fd/rg sounds interesting, honestly it would require little tweaks to the bash tool lua plugin, either add to the description to prefer these binaries instead or something like that.

In general though I much prefer "advising" and encouraging the LLM to use the native tools like grep/glob; they are implemented to be super fast, and you will get better parser output.

by tontinton

6/18/2026 at 7:58:56 PM

This is why I Blind A/B test everything.

I burn a ton of tokens, but things actually have to prove their value. And the vast majority of things do not come close to doing so.

I have my own AI agent full of stuff. I blind A/B test everything, but I also don't think the results are all that useful as a signal to others.

Just because I blind A/B tested it 4 months ago doesn't mean the result is still meaningful today.

Maybe the word choices I use dramatically impact things.

I do it, because I can prove the value, and see it with my own eyes. I don't even bother publishing the specific Blind A/B tests.

Also, I've seen other people try to Blind A/B test and get it very wrong. If your measurements aren't good, the test is meaningless.

I don't know. We're all working on these problems together. There's a lot of black magic (which is why I rely on hooks a lot). I'm sure I have tons of black magic, I have a large little AI Agent.

But what I know for certain, is it works for me. All it takes is for me to not use it, and I honestly don't know how everyone currently works with AI.

I will link it, but it is not an endorsement for what you do. Mostly only other software engineers use it. And it's so very specific to the things I have to do.

At best, maybe it sparks an idea for you to implement on your own.

https://github.com/notque/vexjoy-agent

by AndyNemmity
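
The comment above does not describe its test setup, so this is only a minimal sketch of the blinding step in a blind A/B comparison, under the assumption that a reviewer picks the better of two outputs per task. The names `blind_pairs` and `tally` are hypothetical.

```python
import random

def blind_pairs(outputs_a, outputs_b, seed=0):
    """Shuffle each A/B pair so position no longer reveals the variant."""
    rng = random.Random(seed)
    trials = []
    for a, b in zip(outputs_a, outputs_b):
        order = [("A", a), ("B", b)]
        rng.shuffle(order)
        trials.append(order)
    return trials

def tally(trials, picks):
    """picks[i] is 0 or 1: the position the reviewer preferred in trial i."""
    wins = {"A": 0, "B": 0}
    for order, pick in zip(trials, picks):
        label, _text = order[pick]
        wins[label] += 1
    return wins

trials = blind_pairs(["a-out-1", "a-out-2"], ["b-out-1", "b-out-2"])
# The reviewer only ever sees the texts, never the labels.
shown = [[text for _label, text in order] for order in trials]
print(tally(trials, picks=[0, 1]))
```

The labels are only joined back to the picks after all judgments are recorded. This covers the blinding only; it says nothing about whether the judgments themselves measure the right thing, which is the harder part the comment points at.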

6/18/2026 at 9:41:01 PM

>the best ideas float up to Codex/Claude imo

They only float up if people create things like RTK and other people try them though.

It's fair to sit this one out and let others figure out if it's worth it or not but tools like RTK, Headroom, caveman mode and others do reduce input and output tokens that need to be processed, and for local LLMs that can have measurable speedups. Whether or not that ultimately hurts the resulting output I don't have enough data to say, but I am happy to play with them to find out.

by evilduck

6/18/2026 at 9:52:38 PM

Also the incentives aren’t exactly aligned. Yes, Anthropic et al. want you to have efficient token usage (because you’ll use it more, and because of some competitive pressure). But it’s not their first priority, especially when they make more money with more tokens.

If a tool like rtk improves token efficiency, but has some negative impact on quality, should Anthropic integrate it immediately? Where is the line? This kind of decision is arguably better left to the user.

What they should maybe do, is have a parameter similar to effort level, that allows the user to opt into native features for token minimizing. Make the tools available but leave the choice of the fidelity/savings tradeoff up to the user.

by chatmasta

6/18/2026 at 6:23:41 PM

The idea itself is sound: If you can improve the signal-to-noise ratio in the context window, then that's a good thing.

Whether or not RTK actually does this has not been established. I would be glad to see some proper benchmarks done on the actual difference this tool makes (not some meaningless "up to 90%" type of language).

by arcanemachiner

6/19/2026 at 5:24:31 AM

I found this, which has some: https://arxiv.org/pdf/2605.28876 TLDR: RTK does not look good according to the author's benchmark.

by celrod

6/18/2026 at 6:26:00 PM

I was wondering if that impacts the accuracy, obviously the rtk output wasn't in the training dataset, but maybe it doesn't matter at the end

by lackoftactics

6/18/2026 at 7:58:46 PM

I'll go further and note that some of the optimizations I've seen in rtk for things like `git status` have actually bubbled up into the model layer -- Codex is regularly making tool calls like `git status --short` instead of `git status`.

by philipbjorge

6/18/2026 at 6:20:29 PM

There is a conflict of interest, though.

by blubber

6/19/2026 at 4:53:05 AM

Only in inference, but if you consider that they’re reinvesting inference performance in training I think the conflict argument is overblown.

by baq

6/18/2026 at 7:33:47 PM

I have to say I made a similar mistake in trusting that semantic search was the next big thing. My opinion has shifted, but it made sense to me for too long.

by lackoftactics

6/18/2026 at 9:44:42 PM

But Claude especially copies open-source ideas after they have been widely used for months

by mingqiz

6/18/2026 at 9:24:24 PM

Oh, this gold rush has breathed new life into the old school Semantic guys.

Lord knows the DITA priesthood has been running low on rubes, so this new era is a godsend.

Re-coding all of your org's content into a verbose granular schema, that's what will fix these AI things. It's going to give your LLM superpowers! Semantic superpowers!

While everyone completely ignores the utter lack of coupling between the actual language and whatever nonsense is in the element / structure naming. Or the fact that every single thing has to go through some horrible 1990s era parser, which breaks constantly, and now everyone's shovelling the full markup into the very tiny confused mouth of the AI. Or that now everyone needs specialized software to display anything. Or the everything.

My dudes, the thing you're trying to do with this stuff is already done in the vectorizations. You can use math for a lot of it now, instead of someone hand coding "poplar" as "tree" in a totally flat tree structure.

by lopsotronic

6/19/2026 at 7:45:53 AM

My criteria is "do they measure performance, or at least even try to?". Caveman [1], RTK [2] and more recently ponytail [3] don't or use a few trivial tests. Those projects don't measure performance on widely used benchmarks (like SWE Pro and stuff), that have their issues but at least it would give some indication. They also don't measure "big model + caveman vs smaller model".

I've had a few times where removing all custom instructions that I started using with model N-2 made model N perform way better, so I'm very suspicious of everything that changes how the model works, it's easy to get degraded performance silently and suddenly you're paying latest Opus costs for 6 months old Sonnet performance.

[1]: https://github.com/JuliusBrussee/caveman

[2]: https://github.com/rtk-ai/rtk

[3]: https://github.com/DietrichGebert/ponytail

by Zababa

6/18/2026 at 7:07:49 PM

I mean it kind of already is in harnesses. Codex and Claude Code both have subagent tools. You could probably get a similar token output cut just by asking Claude Code to run all commands with Haiku as a summarizing subagent.

by striking

6/18/2026 at 6:24:55 PM

Author of the text here. I will be honest about why I wrote it: rtk-ai looks very odd to me as a software engineer, given the number of stars, no mention of accuracy, and how management is pushing this stuff to optimize costs. Now people are wrapping every possible command in rtk, trying to handle all the major commands and decide which output you should get.

by lackoftactics

6/18/2026 at 9:35:44 PM

Would sincerely love to hear your thoughts on https://www.github.com/jahala/tilth - it’s a different approach than RTK, benchmarked to reduce cost per correct answer by ~40%

by jahala

6/19/2026 at 12:07:50 AM

Looked at your repo, even starred. On the surface, I like your approach a bit better. It looks like your idea sits at the space between semantic search and compressing tokens. I was into semantic search before, but mostly trying to vectorize codebase instead of tree sitter and couldn’t make the semantic search work for me. Thanks for sharing!

by lackoftactics

6/19/2026 at 1:28:50 AM

An ex colleague is working on Headroom, a much more legit alternative to RTK. They provide accuracy benchmarks in the repo and are transparent about the compression algorithms used for the different output types. I liked their approach a lot better than RTK and thought it might be relevant for you.

https://github.com/chopratejas/headroom

by jvican

6/19/2026 at 5:16:10 AM

This thread is gold, looks like setting up a combination of both tools could reduce token consumption by 50% essentially doubling the subscription? Will be testing this out after morning coffee for sure

by baq

6/19/2026 at 6:36:30 PM

Headroom uses RTK under the hood.

I applaud the benchmarking though.

by oxavier

6/19/2026 at 7:53:01 AM

That's already better than RTK because you measure task accuracy AND savings! So I'm more confident in this one than in the RTK/caveman/ponytail stuff.

There are still two things that bother me:

1) I don't really know when tilth is called, how it works kind of. Does the model itself select it when it needs it? Do you need to instruct the model to use it?

2) If the model itself chooses to use it, I'd like to have a benchmark of non-regressions on tasks where tilth isn't helping, to ensure you made the model + harness + tools as a whole better rather than more specialized; or be upfront about it being more specialized and give more details on when to use it and when not to.

by Zababa

6/19/2026 at 2:20:48 PM

[flagged]

by jahala

6/19/2026 at 4:00:39 AM

Very cool. I'll probably switch away from AFT to this. Can you add tree-sitter-bash?

by polski-g

6/18/2026 at 7:59:50 PM

Why didn’t you offer any real world usage numbers to illustrate your point? I found this unhelpful.

by ianwalter

6/18/2026 at 8:51:19 PM

I read another, oddly similar post earlier today that has more explicit data on that author's codebase: https://codepointer.substack.com/p/cutting-llm-token-costs-w...

TLDR; ~3-4% savings to actual API costs with rtk, caveman, and headroom combined, but nothing tangible on if those cost reductions came at a cost of quality. By their calculations, rtk saved them $4.96 on a $926 bill.

by lloyd-christmas
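
To put the two dollar figures quoted above in proportion, here is a small worked calculation. It uses only the numbers from the comment ($4.96 saved, $926 bill); the function name is made up for illustration.

```python
# Figures quoted in the comment above.
rtk_savings_usd = 4.96
total_bill_usd = 926.00

def savings_share(saved: float, bill: float) -> float:
    """Return the savings as a percentage of the bill."""
    return 100.0 * saved / bill

share = savings_share(rtk_savings_usd, total_bill_usd)
print(f"rtk savings as a share of the bill: {share:.2f}%")  # prints 0.54%
```

That is roughly half a percent of the bill, which is a very different number from a per-command "tokens saved" percentage.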

6/18/2026 at 11:18:55 PM

^recommend reading this one

by bcollins34

6/18/2026 at 11:32:53 PM

https://en.wikipedia.org/wiki/Brandolini%27s_law

by fumeux_fume

6/19/2026 at 12:10:12 AM

Thanks for the link! I thought I know every obscure law and I was so wrong.

by lackoftactics

6/18/2026 at 9:01:11 PM

That’s a fair point. The rtk promotional posts point to 60-90% token savings, and there is no mention of how they perform accuracy-wise. The commenter below did a great job pointing to a resource showing caveman and rtk saving just a couple of bucks on a $926 bill. Thanks, lloyd-christmas, for linking to a useful Substack.

by lackoftactics

6/18/2026 at 6:14:06 PM

> 1. Gamified Savings vs. Your Actual API Bill

Tool use output represents a large amount of my output. I'll take 3.7M tokens saved on 3.9M tokens of input. Tokens saved are tokens saved.

> 3. Where Are the Accuracy Benchmarks?

As a user of RTK, it would be nice to see accuracy benchmarks. However, I've seen no evidence of the model missing anything critical as a result of the compression. As part of their design philosophy they are very strict about preserving correctness to the point that if a filter fails they fall back to raw output. For my most frequently used commands I've inspected the source, was happy with what I saw, they've earned my trust thus far.

> The day git, cargo, npm, or grep updates its terminal formatting by a few spaces or changes an error layout, RTK's regex and parsing filters will break. And returning to the silent failure trap, it won't throw an explicit error; it will fail quietly, feeding corrupted or partial text to your agent.

Again, any filter that fails simply falls back to the raw output. One of their core pillars is avoiding this exact scenario you described. RTK should never feed corrupted or partial text to an agent.

Your concerns are fair but I'd like to see your criticism backed up with evidence. Have you used RTK? Have you found evidence that they are failing to preserve correctness?

by compuficial
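
The fall-back behaviour described above can be sketched as follows. This is not RTK's actual code, which is not shown in the thread; `with_raw_fallback` and `short_status` are hypothetical names, and the sample outputs are invented.

```python
from typing import Callable

def with_raw_fallback(filter_fn: Callable[[str], str], raw: str) -> str:
    """Apply a compression filter; return the raw text if the filter fails."""
    try:
        compressed = filter_fn(raw)
    except Exception:
        return raw
    # Treat an empty result for non-empty input as a failure too.
    if raw.strip() and not compressed.strip():
        return raw
    return compressed

def short_status(raw: str) -> str:
    """Hypothetical filter: keep only the lines that name modified files."""
    lines = [ln.strip() for ln in raw.splitlines()
             if ln.strip().startswith("modified:")]
    if not lines:
        raise ValueError("unrecognised layout")
    return "\n".join(lines)

known_layout = "On branch main\n\tmodified:   src/app.py\n"
changed_layout = "On branch main\n\tM src/app.py\n"

print(with_raw_fallback(short_status, known_layout))    # compressed
print(with_raw_fallback(short_status, changed_layout))  # raw, unchanged
```

Note the limit of this pattern: it only protects against failures the filter can detect. A filter that matches a changed layout partially, without raising, would still return incomplete text, which is the scenario the quoted article worries about.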

6/18/2026 at 6:21:21 PM

I was looking through the issues as investigation. Some issues that caught my attention are looking quite bad https://github.com/rtk-ai/rtk/issues/2494 https://github.com/rtk-ai/rtk/issues/2462 https://github.com/rtk-ai/rtk/issues/2395

by lackoftactics

6/18/2026 at 6:25:17 PM

Fwiw, I just ran the steps to reproduce and got `Error: prettier produced no output` on rtk (0.42.2). Not saying this isn't valid for the user's environment, but I could not reproduce on Linux.

by compuficial

6/18/2026 at 6:46:19 PM

appreciate the engineering effort and skin in the game. I might try on macos today as the author of issue.

by lackoftactics

6/18/2026 at 8:53:23 PM

> Tokens saved are tokens saved.

Not always. RTK strips flags and other information. Sometimes you spend more tokens getting them back later. Sure, you saved 70% of the tokens on that tool call, but nothing in the metrics says whether you ran 3 tool calls instead of 1.

There is also a question of whether that stripped output requires more thinking tokens or not.

by Sayrus
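
A toy calculation makes the point above concrete. All numbers here are invented for illustration (1,000 tokens of raw output, 60 tokens of per-call overhead for the request and surrounding reasoning); only the 70% per-call saving and the "3 calls instead of 1" come from the comment.

```python
def task_tokens(calls: int, output_tokens: int, overhead_tokens: int) -> int:
    """Total tokens for a task: every call pays its output plus fixed overhead."""
    return calls * (output_tokens + overhead_tokens)

raw_output = 1000                        # assumed size of one raw tool result
compressed_output = raw_output * 30 // 100  # 70% saved per call -> 300 tokens
overhead = 60                            # assumed per-call overhead

baseline = task_tokens(1, raw_output, overhead)             # one call, full output
with_retries = task_tokens(3, compressed_output, overhead)  # three compressed calls

print(baseline, with_retries)  # prints: 1060 1080
```

Under these assumptions the per-call metric reports a 70% saving while the task as a whole costs slightly more. Different assumptions give different results, which is why per-task totals are the number to measure.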

6/19/2026 at 8:49:28 AM

I don't think being very strict about preserving correctness is enough. Considering the cost differences between the latest model and an open weight one that's behind, or between the biggest model and the one below it, I think you have to measure performance very carefully.

Rather than the criticism needing to be backed up with evidence, it's up to RTK to prove they don't degrade performance.

by Zababa

6/18/2026 at 8:03:34 PM

The core of the problem is that there are a million tools that make AI better, and no ways to measure whether AI is working better.

Big companies with popular products have it. They do something between normal product analytics and chatbot evals to figure out if users are being successful in their sessions. That's the job.

But any given dev, with between 3 and 50 sessions a day? Like, I have no idea what makes the LLM better. It's all vibes.

My company has a whole stack here. Preferred harnesses, preferred models, skills, the shape of our code, everything. There's gotta be a way to measure whether this setup is working for us, at 1 / 1-million-th the scale of a Claude Code.

by trjordan

6/19/2026 at 12:48:33 AM

> and no ways to measure whether AI is working better.

What I do with my product is explicitly tell you to ask your agent. I have real-world examples and real-world repositories that you can try it with:

https://gitsense.com

https://github.com/gitsense/smart-ripgrep

https://github.com/gitsense/smart-codex

Token saving on average is not what I am mostly interested in though. I am more interested in knowing that the AI doesn't load unnecessary files in context, which can affect reasoning.

You can just ask the agent after a task: how many files do you think you would not have read if you had known each file's purpose first?

by sdesol

6/19/2026 at 12:13:42 AM

And the effort to produce valid benchmarks is tremendous. You are probably right and that’s very annoying. We already had flame wars over frameworks and this is way worse: your vibes vs. my vibes. Who would have thought non-deterministic outputs would lead us here?

by lackoftactics

6/18/2026 at 9:37:39 PM

There is an answer- these tools should benchmark by cost per correct answer - not just tokens saved.

by jahala
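
The metric proposed above is easy to state in code. The sketch below uses invented run results purely to show how "tokens saved" and "cost per correct answer" can point in opposite directions; `RunSummary` is a hypothetical name, not part of any tool mentioned in the thread.

```python
from dataclasses import dataclass

@dataclass
class RunSummary:
    name: str
    total_cost_usd: float
    tasks: int
    correct: int

    @property
    def cost_per_correct(self) -> float:
        """Total spend divided by the number of correctly solved tasks."""
        if self.correct == 0:
            return float("inf")
        return self.total_cost_usd / self.correct

# Invented numbers: the compressed run is 30% cheaper but solves fewer tasks.
baseline = RunSummary("baseline", total_cost_usd=10.00, tasks=100, correct=80)
compressed = RunSummary("compressed", total_cost_usd=7.00, tasks=100, correct=50)

for run in (baseline, compressed):
    print(f"{run.name}: ${run.cost_per_correct:.3f} per correct answer")
```

In this made-up case the compressed run costs $0.140 per correct answer against $0.125 for the baseline, so it is worse on the proposed metric despite the lower bill.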

6/18/2026 at 6:22:01 PM

I tried it and it does not compress messages, which were 90% of my context, so it only compresses a small part of my token usage. If you read it carefully you will realize that is exactly what is stated. If you look at /context you will probably see that tool calls are not where you are spending tokens, so a proxy that compresses tool calls will not make much impact, whilst it is still true that it compresses tool calls by 8x. It's just not that important for long coding sessions for me.

"native/built-in Read or cat tools, the data is not intercepted by RTK's shell hook"

by tlarkworthy
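
The arithmetic behind this observation is the same as Amdahl's law. As a sketch, assume the remaining 10% of the context is all tool output and that it is compressed 8x, as the comment reports; the function name is made up for illustration.

```python
def overall_reduction(tool_share: float, compression_ratio: float) -> float:
    """Fraction of the total context removed when only tool output shrinks."""
    remaining = (1.0 - tool_share) + tool_share / compression_ratio
    return 1.0 - remaining

# 10% of the context is tool output, compressed 8x; messages are untouched.
print(f"{overall_reduction(0.10, 8.0):.2%}")  # prints 8.75%
```

Even an infinitely good compressor could not remove more than that 10% share, so the headline 8x figure and the overall saving are both true at the same time.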

6/19/2026 at 1:25:34 AM

This post offers virtually no data to back up their objections and reads as LLM-generated for the most part.

by Bnjoroge

6/19/2026 at 4:34:48 AM

We’ve been on the receiving end of this complaint with Semble. I think it is a valid complaint, but constructing a benchmark for this kind of thing is just very difficult and expensive because of the (harness) x (model) x (mcp/cli) combination.

With traditional ml/tooling, not showing benchmarks was usually a red flag. But for llm tooling, I’m not so sure.

by stephantul

6/18/2026 at 11:17:58 PM

First of all, there is a way to make agents spot truncation: make them aware of RTK compression and give them a bypass option (I use RTK_DISABLE=1) as a way of restoring the original full text.

Works fine, yeah it only compresses command output so only input tokens are affected in terms of "compression".

by ilia-a

6/18/2026 at 8:40:02 PM

> Mainstream CLIs and developer tools can easily ship a native --compact or --json-stream flag tailored for LLM consumption.

Until they do, and they won't soon, rtk, caveman, ponytail and many others are just trying to address ever-growing costs (for a 2K org, it's around 2.5M, for now). These are trade-offs we all know and are adjusting to. Unlike what the author claims, we know the trade-offs well: we are forking these tools, benchmarking them, and verifying that the output quality matches our needs to make them work for us, so we are not using them blindly.

For solo devs, yes, they might not really need it; self-hosting another model to save money would be a better option. But for orgs, that's the spicy part.

Yes, it's good that these articles are shedding some light, but as we do with these tools, let's also take these articles with a grain of salt.

by ziyasal

6/18/2026 at 9:25:36 PM

I just typed in rtk gain on my Mac. Unfortunately I reimaged my main dev machine due to memory issues that were messing up a few things, but on my Mac I've shaved off roughly 51k input tokens and 23k output tokens, and saved an average of 3 seconds per command. Not sure what the outrage is for or why they cared enough to write this up really.

Not sure who is piping stacktraces through RTK, I only use it for very specific programs, shoving compiler output through it seems silly, but you can always instruct your agent to only use RTK for very specific sets of commands.

by giancarlostoro

6/18/2026 at 9:26:24 PM

Many points about maintainability that this article makes seem to hold, especially with update and version output changes, but it doesn't even offer the simplest alternative. Most of these supported commands have flags to strip out noise and reduce output. Maybe agents aren't well trained on these.

As a side note, has anyone tried a dual agent setup where the command output is proxied through a lightweight local model? I can imagine a scenario where all tool output is filtered through Qwen or similar locally to compact the tool output.

by cephei

6/18/2026 at 6:12:47 PM

I've been trying out RTK and it seems kinda alright. I doubt it's saving much, but the quality of the work feels similar.

But if it's making a dent in token usage (which I have not personally measured), then that's great.

I had to add some system prompt instructions to Pi to help it work (GPT 5.5 initially got confused when `git status` looked different than expected). The Claude Code extension appears to do a proper job of informing the agent about the unexpected shape of the output without any extra work on my part.

by arcanemachiner

6/18/2026 at 6:16:13 PM

So how do you justify its usage if it's not saving much and the work feels similar? They have 664 issues open and some of them are quite funny: tools are called and return success even though they aren't even installed.

My take is that handling so many versions and so many different tools shouldn't be the work of any single repo. The responsibility should be either on the coding agent to compress or, best-case scenario, on the people who are responsible for the CLI tool.

by lackoftactics

6/18/2026 at 6:18:00 PM

I'm not justifying its usage, and I don't have to.

I've been trying it out for a couple days and it seems kinda OK or whatever. If that upsets you, then that's your problem.

I might dump it later on if it doesn't provide much of a benefit. I typically try out new things, then cull whatever doesn't work. This tool seems pretty neutral for now, at least.

by arcanemachiner

6/18/2026 at 6:29:19 PM

No, it doesn't upset me. I am open to discussion; there might be things I miss and don't understand. I am just trying to get why it's been pushed so hard lately and whether the benefits are really there. Sorry if I sounded upset to you, but I am trying to be really civil and am just genuinely curious.

by lackoftactics

6/18/2026 at 6:59:45 PM

Well, I'm sorry as well. I mistakenly assumed you were being confrontational.

There are a lot of people who have negative knee jerk reactions to any AI stuff, new workflows (I'll agree there is a lot of garbage being shilled in this space), etc., and I jumped the gun by lumping you into that group.

by arcanemachiner

6/18/2026 at 7:21:01 PM

Nope, I am doing my master's thesis on fine-tuning LLMs at 36, so I am into this stuff, but it’s been very weird lately. I’ve been a self-taught dev and I was definitely missing computer science concepts, so I'm excited to fill the gaps, although the timing wasn’t perfect.

Good conversation! Great pushback against my arguments. That’s what I signed up to Hacker News for, and I've been missing that spirit recently.

by lackoftactics

6/19/2026 at 5:30:05 AM

I completely agree with this post. After I used it in one session of 300k tokens, I had maybe 3k tokens saved. Plus, if commits really are an issue for you in terms of token consumption, you can always ask to be handed the reins and apply the commits yourself as a rule (unless you're operating in a loop).

by Otterly99

6/18/2026 at 7:20:28 PM

I don't disagree with the article, but I also don't disagree with RTK. The output of these commands is not optimized for agents (or humans, for that matter).

by graphememes

6/18/2026 at 6:13:41 PM

I feel like the state of the art is baked into the compaction logic, and I've had a lot of problems with compaction (absent other prompting) losing key bits of state.

https://github.com/toon-format/toon is another interesting one, and I feel like it takes on a much more achievable goal - reduce whitespace and verbosity of JSON, not overall context compression.

by old_sysadmin
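
As a rough illustration of the narrower goal mentioned above (less whitespace and verbosity in JSON), the sketch below compares pretty-printed and compact JSON using only Python's standard `json` module. It does not implement the TOON format itself, and the payload is an invented example.

```python
import json

# Hypothetical tool result; the field names are illustrative only.
payload = {
    "files": [
        {"path": "src/app.py", "status": "modified"},
        {"path": "README.md", "status": "untracked"},
    ],
    "branch": "main",
}

pretty = json.dumps(payload, indent=2)
compact = json.dumps(payload, separators=(",", ":"))

# Character counts are only a rough proxy for token counts.
print(len(pretty), len(compact))

# The compact form is lossless: it parses back to the same data.
assert json.loads(compact) == payload
```

Because the transformation is lossless and mechanical, it avoids the question of which information is safe to drop, which is what makes it a more achievable goal than general context compression.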

6/18/2026 at 6:16:38 PM

Personally, I find compaction to be unreliable, which forces me to rely heavily on session-specific planning documents and inter-agent handoff messages.

by arcanemachiner

6/18/2026 at 11:27:24 PM

Agree. I've watched agents go around in circles or use ridiculous workarounds after being confused by rtk output.

by RVuRnvbM2e

6/19/2026 at 12:03:36 AM

Anybody have experience with https://github.com/chopratejas/headroom? They seem to have similar goals in token reduction, but headroom appears to be broader in scope.

by akman

6/19/2026 at 1:29:58 AM

They provide benchmarks as well. I like headroom’s approach a lot better. More transparent.

by jvican

6/19/2026 at 3:36:13 AM

[flagged]

by anonymars

6/18/2026 at 6:21:56 PM

I don't agree with the conclusion at all. I can see the value of RTK - whether it is buggy or vibe coded is kind of secondary. That basically comes down to how severe and often the bugs are.

There's no gamification of savings here. Tool output can be meaty.

Is the author skeptical of the concept, or the implementation? Because only one of those is worth critiquing.

by Catloafdev

6/18/2026 at 6:43:14 PM

Hey, author here, I am skeptical of implementation starting from Rust Token Killer and looking to monetize on Rust love by other developers.

Concept is fine to me and I believe we should optimize, but a repo that will handle all tools sounds like Sisyphus rolling a rock up the hill.

by lackoftactics

6/19/2026 at 7:43:04 AM

Yeah, RTK is problematic because of its focus on associations between kanji and arbitrary English keywords, many of which are poorly chosen and... oh it's an LLM thing.

by wren6991

6/19/2026 at 7:46:50 AM

Don't worry, what is kept vs what is removed by the RTK LLM thing is just as arbitrary as the RTK English keywords with no measure of performance!

by Zababa

6/18/2026 at 5:54:24 PM

I feel like what is needed is not compression, but aggressive context management with subagents.

by SubiculumCode

6/18/2026 at 6:02:36 PM

so burn more tokens to save more tokens, so that we can spend more on X token but save on Y tokens?

now the question is which X tokens and which Y tokens? and since the output is non-deterministic how do you validate this?

LLMs aren't random and that enforces something that people are too dumb to realize that random-ness could be normally distributed but LLMs have no reason to be normally distributed or follow any sort of curve of understanding.

They are non-deterministic but with bias, so their output might just be worse with a T' transformation for the class of problems A is solving but work great for B, or vice versa.

You can't reproducibly test LLMs and that allows all sorts of benchmarks to exist which can make any model look good or bad as much as we want. Enlightening stuff.

Not much different from sociological or psychological sciences where with enough bias in data you can prove anything.

by minraws

6/18/2026 at 5:57:41 PM

I am the author of the text.

What do you mean by aggressive context management with subagents? Would you add a loop that would trim the context?

Both of those tasks seem even more difficult

by lackoftactics

6/18/2026 at 6:33:25 PM

First, I only say this because of what I learned as a PhD in human memory, not as someone who authors agentic workflows or does AI.

Human cognition tends to work by simultaneously utilizing and combining/separating multiple frequency scales of information. A simple way of thinking about it is this: we tend to encode and retrieve both the gist of what is happening and the verbatim details of what happened. The gist can be thought of as low-frequency information, almost like bullet points, that contains the big overview (goal, key points). The verbatim traces are the high-resolution memory that contains all the details. The gist helps encoding and recall by providing encoding and retrieval context cues. There are also levels in between those two, but I am keeping it simple. During human development, verbatim memory capacity increases first, but then hits a wall/plateau. Further performance increases begin to depend on the ability to utilize and gain from gist-like representations that can guide encoding and retrieval of verbatim details within contexts.

You don't need to keep everything in the context window. My untested, perhaps naive hypothesis is that for sub-agents dealing with verbatim tasks (actually writing code), the context window should be managed by an agent above that is tuned to information at a lower frequency, and that agent in turn by another above it tuned to even lower-frequency information. Context windows holding the lowest-frequency information fill up slowly. High-frequency information fills up fast. Use the low-frequency information to retrieve the needed high-frequency information.

by SubiculumCode

6/18/2026 at 6:07:24 PM

I believe they mean aggressive delegation to minimize context bloat in the coordinating agent.

by skinfaxi

6/18/2026 at 8:05:44 PM

This is a really useful technique in my experience. The harnesses are starting to do it more on their own but if you encourage the use of more subagents, I find it's typically nothing but win.

by svachalek

6/18/2026 at 6:08:13 PM

that would make more sense, trimming context with subagents sounds like overkill

by lackoftactics

6/18/2026 at 6:09:19 PM

Use the right tool for the job.

If you need a piece of information that is buried somewhere, or a high-level summary/distillation of a larger body of info, then subagents may be the right tool for the job.

If you need all the gathered context for later use (i.e. distilled context is insufficient), then subagents probably are not the right tool for the job.

by arcanemachiner

6/18/2026 at 6:14:28 PM

if your codebase requires a million tokens, then you're probably going to break more than you fix

by cyanydeez

6/18/2026 at 6:20:47 PM

If you are using a million tokens in a single context window, you are using the entire toolbox incorrectly.

by arcanemachiner

6/19/2026 at 1:44:55 AM

I feel like what is needed is more local tool usage, with small local models doing the heavy lifting, rather than the paid-for LLM burning tokens at all.

by nullbio

6/18/2026 at 6:25:25 PM

"Where Are the Accuracy Benchmarks?"

I wish the author had provided one.

by blubber

6/18/2026 at 10:51:36 PM

I feel bad that I wasted my time reading this.

On the points in the article:

1. Yes, "gain" is a vanity metric but it's harmless, nobody is being "fooled" here.

2. This could be a problem in principle, sure, but unless you're actually vetting bug reports you're just spreading FUD.

3. Again, do you have any reason to believe that the thousands of devs using rtk are silently tanking their performance without noticing? Here's a thought: instead of reporting that SOMEONE SHOULD MEASURE THIS, you could, you know, measure it yourself.

4. Good lord, what is this doing in a purportedly technical article?

5. Yes, this is inherent in the problem domain, again, nobody is being "fooled".

Yes, I'm grumpy; reading this article was a waste of time.

Bias: had my first RTK pr accepted today, so I guess I probably know more about it than this guy who got offended by "gain" and spit out the first thoughts that came to mind.

by jbellis

6/18/2026 at 11:27:52 PM

1. Are you sure no one is fooled? It's the main thing managers are praising rtk for and using as an argument for its validity. If this is gamed, then it paints a very different picture.

2. No, I didn't vet all the reports. But they paint quite a convincing picture of the problems present in the library, which has the very ambitious goal of handling every popular command and making it less verbose.

3. You know this is not a valid point. Engineers tanking performance and choosing based on hype is nothing new. GitHub stars and usage are not a valid argument when the tool is not very transparent and could quietly fail. If it's only a couple of percent less accuracy, most wouldn't easily recognize it with the whole stack of skills, MCPs and agents.md.

4. Is it something more than a feature? If the benefit is $3 on $900, as another commenter pointed out, citing the codepointer article (maybe better and better researched than mine), why would I risk all the possible bugs and worse accuracy for that?

5. Hard to address this one. It is a tough problem domain, with endless CLI commands to capture and process properly.
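
On point 3, the measurement itself would not need much code. Below is a minimal sketch of a paired accuracy check. It assumes a hypothetical `run_agent(task, compress)` callable that runs the same task with and without output compression and returns whether the agent solved it. None of this is rtk's API, and I have not run it against rtk.

```python
from typing import Callable, Sequence


def paired_accuracy(
    tasks: Sequence[str],
    run_agent: Callable[[str, bool], bool],
) -> dict:
    """Run every task twice (raw output vs. compressed output) and compare.

    `run_agent(task, compress)` is a hypothetical hook that returns True
    when the agent solved the task.
    """
    raw = [run_agent(task, False) for task in tasks]
    compressed = [run_agent(task, True) for task in tasks]
    # Tasks solved with raw output but lost once compression is enabled.
    regressions = [
        task for task, r, c in zip(tasks, raw, compressed) if r and not c
    ]
    return {
        "raw_accuracy": sum(raw) / len(tasks),
        "compressed_accuracy": sum(compressed) / len(tasks),
        "regressions": regressions,
    }
```

The `regressions` list matters more than the two averages: a tool can keep the same overall accuracy while quietly breaking specific commands.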

Congratulations on your accepted PR. I didn’t want to make you grumpy today. If you feel I am wrong, it’s very possible. I am just a guy who wrote my point of view, it doesn’t automatically make it valid. Once again sorry for making you grumpy.

by lackoftactics

6/18/2026 at 10:56:52 PM

How is 1 not more damning? It sounds like the fundamental service they are purportedly providing is not real. Am I reading it wrong?

by beepbooptheory

6/19/2026 at 12:23:10 PM

hope you feel better knowing your effort, reading and then commenting, is appreciated here, and convinced me to read OP's article. it's short, and raises valid points, but i'm left wondering why your reply is so defensive

let me try that style

  1. it's not *just vanity* if it feeds into *rtk*'s pitch. it's the hook, it's meant to convince users, *rtk* will reduce token waste.
  2. OP's article is not spreading fear, uncertainty, or doubt. at best it disputes *rtk*'s claims that it is effective in reducing token waste, and it does so directly with the question: "Where Are the Accuracy Benchmarks?"
  3. a) *beep* - you are disqualified for failing to recognize that the *burden of proof* lies with *rtk*, not OP; b) OP made no claims, except for the ones you conveniently dismiss: the github issues. furthermore, the "reason[s] to believe that the thousands of devs using rtk are silently tanking their performance without noticing" were already given. you missed them because you couldn't see past the joy of having your pull-request recently merged.
  4. really, you were so disturbed by the article, you couldn't even ignore the *one* non-technical point, in an article *you chose* to interpret as being technical, all of it being your own fault. never mind how relevant it is as a signal for the effectiveness of such techniques.
  5. is it inherent? are we doomed to live with broken tool outputs? note, the issue, here, is not that *rtk* will fail when output changes, *that* is inherent to *rtk*'s current implementation — as i understand it, but that "it will fail quietly, feeding corrupted or partial text to your agent".
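
to illustrate point 5: failing quietly is a design choice, not a law of nature. a minimal sketch, assuming a made-up test-runner output format and a hypothetical `compress_test_output` filter (this is not how *rtk* is implemented, as far as i know): when the filter does not recognize the output, it passes the raw text through instead of guessing.

```python
import re

# Assumed summary line of a made-up test runner, e.g. "3 passed, 1 failed".
SUMMARY = re.compile(r"^(\d+) passed, (\d+) failed$")


def compress_test_output(raw: str) -> str:
    """Hypothetical filter: keep failing tests plus the summary line."""
    lines = raw.strip().splitlines()
    match = SUMMARY.match(lines[-1]) if lines else None
    if match is None:
        # Unrecognized format: return the raw text untouched rather than
        # feeding the agent a partial or corrupted summary.
        return raw
    passed, failed = match.groups()
    failures = [line for line in lines if line.startswith("FAIL")]
    return "\n".join(failures + [f"{passed} passed, {failed} failed"])
```

the cost is fewer tokens saved whenever a tool changes its output, which is exactly the trade a "gain" counter will never show.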

you are not better informed, than gp, because you have commits to your name in rtk. you're just biased by the proximity. we're all at a loss for how effective rtk is, because there are no benchmarks measuring its performance beyond some "vanity metric[s]".

you were so close to getting it here:

> instead of reporting that SOMEONE SHOULD MEASURE THIS, you could, you know, measure it yourself

but hey, thanks for getting me to take another look at rtk & co., i am now further convinced these are just the flavor of the month tricks for speed running context rot

by 0123456789ABCDE

6/19/2026 at 3:26:06 AM

There are ways to improve token usage, but no tool will work right on all prompts every time.

by mumin00

6/18/2026 at 6:04:35 PM

Am I the only one that thought RTK was Real-Time Kinematics used for precision with satellite navigation?

by iam-TJ

6/18/2026 at 6:07:16 PM

No. I clicked here for the same reason.

by dayjaby

6/18/2026 at 6:09:58 PM

I might have picked a better title, but they are literally called rtk https://github.com/rtk-ai/rtk

and it stands for Rust Token Killer

by lackoftactics

6/18/2026 at 5:52:09 PM

slop complaining about other slop

by breadislove

6/18/2026 at 6:01:29 PM

thank you, author here. I will stay civil here and focus on rtk, as that was the goal of the article.

So do you think the rtk cli is ai slop? I had some suspicions looking at their repo, the number of issues and their style. The prettier issue, where it reported running successfully while the binary wasn't even installed, was quite entertaining.

by lackoftactics

6/18/2026 at 6:02:42 PM

Did you use an LLM for the blog post? It reads like it in places.

by grey-area

6/18/2026 at 6:06:14 PM

I have a Raycast shortcut for fixing grammar; it might have done more damage than just adding a, an, the or changing tenses.

by lackoftactics

6/18/2026 at 6:17:14 PM

A content-free 2nd "paragraph" like this turned me off immediately.

> But in the current dev tools gold rush, if something sounds too good to be true, it almost always is.

The people who are interested in RTK and in criticism of RTK aren't interested in pablum like this.

by gowld

6/18/2026 at 6:30:35 PM

ok, this one is all mine. So that's even more hurtful, as this is 100% me.

by lackoftactics

6/18/2026 at 9:10:45 PM

this is absolutely entirely written by AI

by danr4

6/18/2026 at 10:58:09 PM

As the author of the text, I can say you are "absolutely" not correct. I might already be spending too much time with LLMs and they are starting to shape my texts, so I am not proud of that either. But thanks for bringing a very valuable insight to an otherwise interesting discussion.

by lackoftactics

6/19/2026 at 5:14:42 AM

I hope you do take this to heart and stop using LLMs in any form for writing. People are not telling you this just to annoy you, but because you owe your readers more than that. LLMs generate superficially plausible text, not good writing; use your own voice always.

by grey-area

6/19/2026 at 1:33:24 PM

[dead]

by nexroo