alt.hn

2/26/2026 at 6:11:35 PM

Show HN: Librarian – Cut token costs by up to 85% for LangGraph and OpenClaw

https://uselibrarian.dev/

by Pinkert

2/26/2026 at 7:11:53 PM

One architectural tradeoff we are actively working on right now is the latency of the "Select" step for shorter conversations.

Currently, the open-source version of Librarian uses a general-purpose model to read the summary index and route the relevant messages. It works great for accuracy and drastically cuts token costs, but it does introduce a latency penalty for shorter conversations because it requires an initial LLM inference step before your actual agent can respond.
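To make the shape of that "Select" step concrete, here is a minimal, hedged sketch. In Librarian the router is an LLM reading a summary index; below, the LLM call is stubbed with simple keyword overlap so the pipeline is runnable. All names (`IndexEntry`, `select_relevant`) are illustrative, not the actual API.

```python
# Toy sketch of a context-selection ("Select") step over a summary index.
# The real system routes with an LLM; this stub scores by token overlap
# purely to show where that inference step sits in the pipeline.

from dataclasses import dataclass

@dataclass
class IndexEntry:
    message_id: int
    summary: str  # short summary of one stored message

def select_relevant(index: list[IndexEntry], query: str, top_k: int = 3) -> list[int]:
    """Return ids of the messages most relevant to the query.

    Stand-in scorer: token overlap between query and summary.
    A hosted version would swap this for a small fine-tuned model.
    """
    q_tokens = set(query.lower().split())
    scored = [
        (len(q_tokens & set(e.summary.lower().split())), e.message_id)
        for e in index
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))  # best score first, stable by id
    return [mid for score, mid in scored[:top_k] if score > 0]

index = [
    IndexEntry(1, "user asked about billing refunds"),
    IndexEntry(2, "agent explained the refund policy window"),
    IndexEntry(3, "user shared an unrelated deployment error log"),
]
print(select_relevant(index, "what is the refund policy?", top_k=2))  # → [2]
```

The latency penalty described above comes from the fact that `select_relevant` (in reality, an LLM inference) must complete before the agent's own call can even start.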

To solve this, we are currently training a heavily quantized, fine-tuned model optimized solely for this context-selection task. The goal is to push the selection latency below 1 second so the entire pipeline feels transparent. (We have a waitlist up for this hosted version on the site.)

If anyone here has experience fine-tuning smaller models (like Llama 3 or Mistral) strictly for high-speed classification/routing over context indexes, I'd love to hear what pitfalls we should watch out for.

by Pinkert

2/26/2026 at 9:58:02 PM

won't this essentially disable the prompt caching that you get from a standard append-only chat history?
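The concern can be made concrete with a toy model of prefix caching. Assuming a provider whose cache keys on an exact shared message prefix (as the major LLM APIs do), an append-only history keeps the prefix stable between turns, while per-turn selection can reshuffle it. The message lists below are illustrative.

```python
# Toy illustration: prefix-based prompt caching vs. dynamic selection.
# A cache hit requires the new request to share an exact prefix with
# a previous request's message list.

def shared_prefix_len(prev: list[str], curr: list[str]) -> int:
    """Number of leading messages the two requests have in common."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n

# Append-only history: the old request is a strict prefix of the new one.
append_only_t1 = ["sys", "msg1", "msg2"]
append_only_t2 = ["sys", "msg1", "msg2", "msg3"]

# Selected context: the chosen messages can change between turns.
selected_t1 = ["sys", "msg2", "msg5"]
selected_t2 = ["sys", "msg1", "msg7"]

print(shared_prefix_len(append_only_t1, append_only_t2))  # → 3 (full hit)
print(shared_prefix_len(selected_t1, selected_t2))        # → 1 (only "sys")
```

Whether this matters in practice depends on whether the per-token savings from sending fewer messages outweigh the lost cache discount on the messages that would have been reused.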

by findjashua