3/31/2025 at 3:08:44 PM
Yeah, the "book a flight" agent thing is a running joke now - it was a punchline in the Swyx keynote for the recent AI Engineer event in NYC: https://www.latent.space/p/agentI think this piece is underestimating the difficulty involved here though. If only it was as easy as "just pick a single task and make the agent really good at that"!
The problem is that if your UI involves human beings typing or talking to you in a human language, there is an unbounded set of ways things could go wrong. You can't test against every possible variant of what they might say. Humans are bad at clearly expressing things, but even worse is the challenge of ensuring they have a concrete, accurate mental model of what the software can and cannot do.
by simonw
3/31/2025 at 9:11:13 PM
> The problem is that if your UI involves human beings typing or talking to you in a human language, there is an unbounded set of ways things could go wrong. You can't test against every possible variant of what they might say.

It's almost like we really might benefit from using the advances in AI for stuff like speech recognition to build concrete interfaces with specific predefined vocabularies and a local-first UX. But stuff like that undermines a cloud-based service, a constantly changing interface, and the opportunities for general spying and manufacturing "engagement" while people struggle to use the stuff you've made. And of course, producing actual specifications means that you would have to own bugs. Besides eliminating employees, much of the interest in AI is about completely eliminating responsibility. As a user of ML-based monitoring products and such for years, I've found that "intelligence" usually implies no real specifications, no specifications implies no bugs, and no bugs implies rent-seeking behaviour without the burden of any actual responsibilities.
It's frustrating to see how often even technologists buy the story that "users don't want/need concrete specifications" or that "users aren't smart enough to deal with concrete interfaces". It's a trick.
by photonthug
3/31/2025 at 11:23:49 PM
> concrete interfaces with specific predefined vocabularies and a local-first UX

An app? We don't even need to put AI in it; turns out you can book flights without one.
by freeone3000
4/1/2025 at 2:21:34 AM
I see the AI push as a turnkey WALL-E future.
by cyanydeez
4/1/2025 at 12:22:05 AM
Tech won't freeze in place exactly where it is today, even if some people want that, and even if in some cases it actually would make sense. And if you advocate for this, I think you risk losing credibility. Especially amongst technologists, it's better to think critically about structural problems with the trends and trajectories. AI is fine, change is fine; the question now is really more like why, what for, and in the interest of whom. To the extent models work locally, we'll be empowered in the end.

Similarly, software eating the world was actually pretty much fine, but SaaS is/was a bit of a trap. And anyone who thought SaaS was bad should be terrified about the moats and platform lock-in that billion-dollar models might mean, the enshittification that inevitably follows market dominance, etc.
Honestly we kinda need a new Stallman for the brave new world, someone who is relentlessly beating the drum on this stuff even if they come across as anticorporate and extreme. An extremist might get traction, but a call to preserve things as they are probably cannot / should not.
by photonthug
4/1/2025 at 2:49:34 PM
> And if you advocate for this, I think you risk losing credibility

It's a shame if new interface = credible by default. Look at all the car manufacturers (well, some; probably not enough) finally conceding, after many years, that the change to touch interfaces "because new" was a terrible idea, when the right old tool for the job was simply better... and obvious to end-users very quickly.
by PKop
4/1/2025 at 10:42:23 PM
Again, in that case the newness of different tech isn't actually the real problem, and it feels like the wrong critique. What's problematic is trajectory and intent, with things like planned obsolescence, subscriptions, and ongoing costs in repairs after the initial sale. I'd say that a new interface is barely even an issue compared with that... although FWIW, yes, I prefer buttons rather than touch screens.
by photonthug
4/2/2025 at 1:29:00 AM
> the newness of different tech isn't actually the real problem and feels like the wrong critique

I'm not equating new = bad. I'm saying new = good is wrong. And based on your last sentence, you do think car manufacturers all switching over to all-touch controls was a problem. Almost everyone prefers buttons to touch screens; that's my point. The better, more popular option was rejected because of a false premise, or false belief.
by PKop
4/1/2025 at 4:06:05 AM
If you believe in this to that extent, why can't you be the "new Stallman"?
by MichaelZuo
4/1/2025 at 6:29:52 AM
It's not about what I believe, it's about what we already know. Computing is old enough now that you don't need to be some kind of mad prophet to know things about the future, because you can just look at how things have played out already.

More to the point, though: at the beginning at least, Stallman was a respected hacker, not just some random person pushing politics on a community he was barely involved with. It's got to be that way, I think; anyone who's not a respected AI/ML insider won't get far.
by photonthug
4/1/2025 at 4:18:00 PM
If you are a random outsider, then how do you know there is the room and potential for such an individual?
by MichaelZuo
4/1/2025 at 10:48:58 PM
I remember you now, and I would block you if I could. On the off chance you're not doing this on purpose, read this please: https://en.m.wikipedia.org/wiki/Sealioning
by photonthug
4/2/2025 at 2:21:49 AM
Regardless of what you believe, you still need to write the actual claim/argument down. You don't have any more credibility than most other HN users, so just stating insinuations as if they were self-evident doesn't even make sense.
by MichaelZuo
4/1/2025 at 12:54:42 PM
I am worried about a more modest enshittification. I am already starting to encounter models that are just plain out of date in non-obvious ways. It has the same feeling as trying to explain to someone over the phone how to troubleshoot a version of Windows from two releases ago (e.g., in Vista this was slightly different).
by xemdetia
4/1/2025 at 7:12:39 AM
> for general spying and manufacturing "engagement"

"Oh, there's one tiny feature that management is really, really interested in: make the AI gently upsell the user on a higher tier of subscription if an opportunity presents itself."
by Terr_
4/1/2025 at 3:11:53 PM
With today's models that means it will pitch the upsell every three sentences. They're happy to comply.
by genewitch
3/31/2025 at 3:26:42 PM
Perhaps the solution(s) need to focus less on output quality and more on having a solid process for dealing with errors. Think undo, containers, git, CRDTs, or whatever, rather than zero tolerance for errors. That probably also means some kind of review for the irreversible bits of any process, and perhaps even process changes where possible to make common processes more reversible (which sounds like an extreme challenge in some cases).

I can't imagine we're anywhere even close to the kind of perfection required not to need something like this, if it's even possible. Humans use all kinds of review and audit processes precisely because perfection is rarely attainable, and that might be fundamental.
by emn13
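[Editor's note] The reversible-process idea above can be made concrete with a small sketch: every reversible step records its own undo, and any step without an undo is held for explicit human review before it runs. All names here are illustrative, not from any real framework.

```python
# Sketch: an action log where each reversible step carries its own undo,
# and irreversible steps are queued for explicit review before running.

class ActionLog:
    def __init__(self):
        self.undo_stack = []       # (description, undo_fn) pairs
        self.pending_review = []   # irreversible steps awaiting approval

    def run(self, description, do_fn, undo_fn=None):
        """Run a step now if it is reversible; otherwise queue it for review."""
        if undo_fn is None:
            self.pending_review.append((description, do_fn))
            return None
        result = do_fn()
        self.undo_stack.append((description, undo_fn))
        return result

    def undo_all(self):
        """Roll back the reversible steps in reverse order."""
        while self.undo_stack:
            _, undo_fn = self.undo_stack.pop()
            undo_fn()

    def approve_pending(self):
        """A human reviewer signs off on the irreversible steps."""
        for _, do_fn in self.pending_review:
            do_fn()
        self.pending_review.clear()


# Example: holding a seat is reversible, charging a card is not.
state = {"booked": False, "paid": False}
log = ActionLog()
log.run("hold seat",
        lambda: state.update(booked=True),
        lambda: state.update(booked=False))
log.run("charge card", lambda: state.update(paid=True))  # no undo -> review
assert state["booked"] and not state["paid"]  # payment waits for approval
log.approve_pending()
```

The point is not the specific API but the shape: an agent can act freely inside the undoable region, while the irreversible bits get the review step the comment calls for.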
3/31/2025 at 4:30:00 PM
The biggest issue I've seen is "context window poisoning", for lack of a better term. If it screws something up, it's highly prone to repeating that mistake. It then makes a bad fix that propagates two more errors, then says, "Sure! Let me address that," repeating the cycle to poorly fix those rather than the underlying issue (say, by restructuring code to mitigate it).

It is almost impossible to produce a useful result, as far as I've seen, unless one eliminates that mistake from the context window.
by _bin_
3/31/2025 at 4:55:44 PM
I really really wish that LLMs had an "eject" function - as in I could click on any message in a chat, and it would basically start a new clone chat with the current chat's thread history.There are so many times where I get to a point where the conversation is finally flowing in the way that I want and I would love to "fork" into several directions from that one specific part of the conversation.
Instead I have to rely on a prompt that requests the LLM to compress the entire conversation into a non-prose format that attempts to be as semantically lossless as possible; this sadly never works as intended.
by instakill
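[Editor's note] The "fork" feature being asked for here is simple to model: a chat is just a list of messages, and forking copies the history up to a chosen point so branches can diverge independently. A minimal sketch (not any vendor's actual API):

```python
# Sketch: fork a chat at a given message, keeping the history up to and
# including that message as the start of a new, independent thread.

def fork_chat(messages, at_index):
    """Return a new thread sharing history up to messages[at_index]."""
    if not 0 <= at_index < len(messages):
        raise IndexError("no such message to fork from")
    return list(messages[:at_index + 1])  # a copy, so threads diverge freely

chat = [
    {"role": "user", "content": "Plan a trip"},
    {"role": "assistant", "content": "Where to?"},
    {"role": "user", "content": "Tokyo"},
]
branch = fork_chat(chat, 1)
branch.append({"role": "user", "content": "Osaka"})
assert chat[2]["content"] == "Tokyo" and branch[2]["content"] == "Osaka"
```

This is essentially what the "edit" buttons discussed below do under the hood: resend a copied prefix of the context rather than mutate the original thread.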
3/31/2025 at 7:51:22 PM
This is precisely what the poorly named Edit button does in Claude.
by mvdtnz
4/1/2025 at 3:16:16 PM
LM Studio has a fork button on every chat part (sorry, can't think of a better word): you can fork on any human or AI part. You can also edit, but editing isn't really editing; it essentially creates a copy of the context with the edit and sends the whole thing to the AI. This can overflow your context window, so it isn't recommended. Forking of course does the same thing, but it is obvious that it is doing so, whereas people are surprised to learn that editing sends everything.
by genewitch
3/31/2025 at 5:52:20 PM
Google's UI supports branching and deleting; someone recently made a blog post about how great it is.
by tough
3/31/2025 at 7:47:16 PM
Which Google UI?
by marlott
4/1/2025 at 3:40:03 AM
ai.dev (AI Studio), sorry
by tough
3/31/2025 at 5:50:13 PM
You can use LibreChat, which allows you to fork messages: https://www.librechat.ai/docs/features/fork
by theblazehen
4/1/2025 at 2:52:07 PM
"If it screws something up it’s highly prone to repeating that mistake"Certainly true, but coaching it past sometimes helps (not always).
- roll back to the point before the mistake.
- add instructions so as to avoid the same path. "Do not try X. We tried X it does not work as it leads to Y.
- add resources that could aid a misunderstanding (api documentation, library code)
- rerun the request (improve/reword with observed details or insights)
I feel like some of the agentic frameworks are already including some of these heuristics, but a helping hand still can work to your benefit
by PeterStuer
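[Editor's note] Those heuristics can be sketched as one small loop over the conversation history: truncate to just before the bad turn, inject the avoidance note and reference material, then rerun. The message format and the stand-in "model" below are illustrative only, not a real client.

```python
# Sketch: recover from a poisoned context by truncating the history to
# just before the bad turn, then retrying with explicit guardrails.

def retry_after_mistake(history, bad_turn_index, avoid_note, resources, call_model):
    """Truncate, inject corrective context, and rerun the request."""
    clean = history[:bad_turn_index]           # roll back past the mistake
    clean.append({"role": "user",
                  "content": avoid_note})      # "Do not try X; X leads to Y."
    for doc in resources:                      # API docs, library code, etc.
        clean.append({"role": "user", "content": f"Reference: {doc}"})
    clean.append(call_model(clean))            # rerun with the repaired context
    return clean

# Stand-in "model" that just reports how many turns it saw.
fake_model = lambda msgs: {"role": "assistant",
                           "content": f"seen {len(msgs)} turns"}
history = [{"role": "user", "content": "fix the bug"},
           {"role": "assistant", "content": "broken fix"}]  # the mistake
repaired = retry_after_mistake(history, 1, "Do not try X; it leads to Y.",
                               ["api docs"], fake_model)
assert all("broken fix" not in m["content"] for m in repaired)
```

The essential move, per the comments above, is that the mistaken turn never reaches the model again.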
3/31/2025 at 5:28:15 PM
I think this is one of the core issues people have when trying to program with them. If you have a long conversation with a bunch of edits, it will start to get unreliable. I frequently start new chats to get around this, and it seems to work well for me.
by bongodongobob
3/31/2025 at 11:15:49 PM
Yes, this definitely helps. It's just incredibly annoying because you have to dump context back into it, re-type stuff, consolidate stuff from the prior conversation, etc.
by _bin_
4/1/2025 at 3:08:34 AM
Have the AI maintain a document (a local file or in canvas) with project goals, structure, setup instructions, current state, change log, todos, caveats, etc. You might need to remind it to keep it up to date, but I find this approach quite useful.
by dr_kiszonka
3/31/2025 at 8:45:11 PM
This is what I find. If it makes a mistake, trying to get it to fix the mistake is futile, and you can't "teach" it to avoid that mistake in the future.
by donmcronald
4/1/2025 at 12:39:29 PM
It depends. I ran into this a lot with GPT, but less so with Claude.

But then again, I know how it could avoid the mistake, so I point that out; from that point onwards it seems fine (in that chat).
by johnisgood
3/31/2025 at 7:19:35 PM
> Perhaps the solution(s) need to focus less on output quality and more on having a solid process for dealing with errors. Think undo, containers, git, CRDTs

LLMs are supposed to save us from the toils of software engineering, but it looks like we're going to reinvent software engineering to make AI useful.
Problem: Programming languages are too hard.
Solution: AI!
Problem: AI is not reliable, it's hard to specify problems precisely so that it understands what I mean unambiguously.
Solution: Programming languages!
by ModernMech
3/31/2025 at 7:58:07 PM
With pretty much every new technology, society has bent towards the tech too.

When smartphones first popped up, browsing the web on them was a pain. Now pretty much the whole web has phone versions that make it easier*.
*I recognize the folly of stating this on HN.
by Workaccount2
3/31/2025 at 10:35:05 PM
No, it's still a pain.

There are apps that open links in their embedded browser, where ads aren't blocked. So I need to copy the link and open it in my real browser.
by LtWorf
4/1/2025 at 1:42:12 AM
Or my other favorite trap: an embedded browser where I'm not authenticated. Great, now I have to roll the dice about pasting a password into your "trust me, bro"-looking login page, because I cannot see the URL and the autofill is all "nope".
by mdaniel
4/2/2025 at 5:58:56 AM
> LLMs are supposed to save us from the toils of software engineering

Well, cryptocurrency was supposed to save us from the inefficiencies of the centralized banking system.

There's a lesson to be learned here, but alas our society's collective context window is less than five years.
by otabdeveloper4
3/31/2025 at 3:49:48 PM
But, assuming this is a general thing, not just focused on, say, software development: can you make the tooling around creating this easier than defining the process itself? Everyone, loosely speaking, sees the value in test-driven development, but I think that with complex processes, writing the test is often harder than writing the process.
by techpineapple
3/31/2025 at 3:56:05 PM
I want to make a simple solution where data is parsed by a vision model, and "engineer for the unhappy path" is my assumption from the get-go. Changing the prompt or swapping the model is cheap.
by RicoElectrico
3/31/2025 at 8:40:13 PM
Vision models are also faulty, and sometimes all paths are unhappy paths, so there's really no viable solution. Most of the time, swapping the model completely randomizes the problem space (unless you measure every single corner case, it's impossible to tell if everything got better or if some things got worse).
by herval
3/31/2025 at 4:29:32 PM
[dead]
by dfilppi
3/31/2025 at 4:50:24 PM
I'm old enough to remember having to talk to a (human) agent in order to book flights, and can confirm that in my experience, the modern flight booking website is an order of magnitude better UX than talking to someone about your travel plans.
by yujzgzc
3/31/2025 at 5:19:01 PM
That still exists. The last time I did onsite interviews, every single company that wanted to fly me to their office to interview me asked me to talk to a human agent to book flights. But of course the human agent is just a travel agent with no budgetary power, so I ended up calling the agent to inquire about a booking, then calling the recruiter to confirm that the price was acceptable, and then calling the agent back to confirm the booking.

It doesn't have to be this way. Even before the pandemic, I remember some companies simply gave me access to an internal app to choose flights, where the only flights shown were those of the right date, right airport, and right price.
by kccqzy
3/31/2025 at 7:53:55 PM
Yeah, I much prefer using a well-designed self-service system to trying to explain things over the phone.

The only problem with most of the flights I book now is that they're with low-cost airlines and packed with dark patterns designed to push upgrades.

Would an AI salesman be any better, though? At least the website can't actively try to persuade me to upgrade.
by leoedin
4/1/2025 at 2:42:39 PM
An AI agent will likely be worse, in that you would have to actively haggle with it so it doesn't upsell you by default, which IMO is harder than circumventing the dark patterns.

An actually useful agent is something that is totally doable with technologies even from a decade ago, which you by necessity need to host yourself, with a sizeable amount of DIY and duct tape, since it won't be allowed to exist as a hosted product. The purveyor of goods and services cannot bargain with it so that it puts useless junk into your shopping cart on impulse. You cannot really upsell it, all the ad impressions are lost on it, and you cannot phish it with ad buttons that look like the UI of your site. It goes in with the sole purpose of making your bookings/arrangements; it's a quick in-and-out. It is, by its very definition and design, very adversarial to how most companies with Internet presences run things.
by WesolyKubeczek
4/1/2025 at 4:25:49 AM
I think what we'll come to widely realize is that syncing state between two minds (in your example, the travel agent's mind and your mind; more widely, AI agents and their users' minds) is extremely expensive and slow, and it's going to be very hard to make these systems good enough to overcome the super-low latency of keeping a task contained to a single mind, your own, and just doing most stuff yourself. The CPU/GPU dichotomy as a lens for viewing the world is widely applicable, IME.
by toasterlovin
3/31/2025 at 3:36:33 PM
Even in Operator's original demo, the first thing they showed was booking restaurant reservations and ordering groceries. I understand their need to demo something intuitive, but it's still debatable whether these are tasks that most people want delegated to black-box agents.
by serjester
3/31/2025 at 6:52:29 PM
They don't. I have never once in my life wanted to talk to my smart speaker about what I wanted for dinner. Not because a smart speaker is/can be creepy, not because of social anxiety; no, it's just simpler and more straightforward to open DoorDash on my damn phone and look at a list of restaurants nearby to order from. Or browse a list of products on Amazon to buy. Or just call a restaurant to get a reservation. These tasks are trivial.

And like, as a socially anxious millennial, no, I don't particularly like phone calls. However, I also recognize that, setting my discomfort aside, a direct connection to a human being who can help reason out a problem I'm having is not something easily replaced with a chatbot or an AI assistant. It just isn't. Perfect example: I called a place to make a reservation for myself, my wife, and girlfriend (poly, long story) and found the place didn't usually do reservations on the day in question, but the person did ask when we'd be there. As I was talking to a person, I could provide that information immediately and say "if you don't take reservations, don't worry, that's fine," but it was an off-busy hour, so we got one anyway. How does an AI navigate that conversation more efficiently than me?
As a techie person I basically spend the entire day interacting with various software to perform various tasks, work related and otherwise. I cannot overstate: NONE of these interactions, not a single one, is improved one iota by turning it into a conversation, verbal or text-based, with my or someone else's computer. By definition it makes basic tasks take longer, every time, without fail.
by ToucanLoucan
3/31/2025 at 8:06:02 PM
I've more than once been on a road trip and realized that I wanted something to help me find a meal where I'll be sometime in the next two hours. I have no idea what the options are, and I can't find them. All too often I've settled for some generic fast food when I really wanted something local, but I couldn't get maps to tell me, and such things are one street away where I wouldn't see them. (Remember, too, that if I'm driving I can't spend time scrolling through a list; but even when I'm the navigator, the interface I can find in maps isn't good.)
by bluGill
4/1/2025 at 3:18:53 AM
You definitely would not want the existing SEO-enhanced search results. And definitely not the not-too-distant future of SEO-enhanced, AI-poisoned listings where every eating place proudly declares itself "most likely/probably the best burger joint".

We need to go back to a more innocent time when we could ask a select group of friends and their trusted chain of friends for recommendations. Not what social media is today.
by xarope
4/1/2025 at 3:30:27 PM
I don't have friends all over the country, and I submit that, if the adage "150 people" is true, no one has friends "all over the country".I dislike driving through Texas, and so, most road trips involve McDonalds - the only time I eat the junk.
My car's built-in nav is 13 years out of date, so it knows major throughways but not, for instance, that the road I live on has its own interchange with the "highway", and so on, up to restaurants. Phones are unreliable in a lot of the US, and at one point I had a spare phone with all of its storage dedicated to offline Google Maps just so I wouldn't get stuck in the Rockies somewhere.
Microsoft used to sell trip planning software and those were the good old days.
by genewitch
3/31/2025 at 8:27:52 PM
I'm on a road trip across Utah and Colorado right now, and I've been experimenting with both Gemini and OpenAI Deep Research for this kind of thing, with surprisingly decent results. Here's one transcript from this morning: https://chatgpt.com/share/67e9f968-4e88-8006-b672-13381d5e95...
by simonw
4/1/2025 at 6:54:41 AM
I'm curious what the problem is with that task. I'd open Google Maps, find a larger place in the right direction, confirm with directions that it's about two hours away, search for "dinner/lunch/restaurant/Japanese/tacos/..." in the visible area, and choose something highly rated. I've done that lots of times successfully. What part of that fails for you? (As a non-driver, of course.)
by viraptor
4/1/2025 at 12:58:55 PM
The problem is choice. I don't care about Japanese/tacos; either would be fine, but Argentine would be better (I have no idea if it is even a thing, but if it is, I want to try it). I don't want a chain (well, maybe a local chain); I have plenty of McDonald's near my house if I want that. I want something I can't get near home. Maps will put all the big chains that pay for that top spot right at the top, and I need to scroll through them. More than once I've seen something that might be interesting, but then the map scrolls/resizes and I can't find it anymore.
by bluGill
4/1/2025 at 1:07:47 PM
But you're taking as a given that the AI is going to have any better idea than Google Maps, or be subject to less interference from marketing/paid-placement stuff, when like... I'd be willing to bet a small amount of money that it's going to do what you're decrying: it's going to search $localized_area for "restaurant" and, if you're lucky, maybe add -chain to it. What you want here are locals' notions of what's good and not, and while I absolutely respect the shit out of that (and would love it myself!), I don't really know how to facilitate that at scale without immediately caving to the same negative influences that are screwing it up right now.

Like, really what you're wanting is legitimate information not bound to the whims of advertisers and marketers (and again, to be clear, don't we fucking all), but I don't think an LLM is going to do that for you. If it does it now, and that's a load-bearing if, I have a strong feeling that's because this tech, like all tech, is in its infancy. It hasn't yet gotten enough attention from corporations and their slimy marketing divisions, but that's a temporary state of affairs, as it has been for every past tech too. Like, OpenAI just closed another funding round and its valuation is now THREE HUNDRED BILLION. Do you REALLY think they, and by extension/as a result their competitors, are going to be thinking about editorial independence when existing established information institutions already can't?
by ToucanLoucan
4/1/2025 at 7:24:05 AM
Agreed; verbally asking for X might make it easier for Aunt "where's the Any key" Tillie to get a solution, but it doesn't necessarily give a better solution for everyone else.

Or, for that matter, solutions you can trust. Remember the pitch for Amazon Dash buttons, where you press it and it maybe-reorders a product for delivery, instantly and sight-unseen? What if the price changed? What if it's not exactly the same product anymore? Wait, did someone else already press it? Maybe I can get a better deal? Etc.
Actually, that spurs a random thought: Perhaps some of these smart-speaker ordering pitches land differently if someone is in a socioeconomic class where they're already accustomed to such tasks being done competently by human office-assistants, nannies, etc. Their default expectation might be higher, and they won't need to invest time pinching pennies like the rest of us.
by Terr_
4/1/2025 at 3:25:06 PM
Not to detract from your overall message: are there studies that say that millennials have more social anxiety? My wife is 9 months younger than me, and a millennial, whereas I am Gen X. I have no social anxiety at all; she and our kids do. Like, calling people on the phone requires a sit-down and breathing exercises; I'm always the one to "run in to the store", with them not wanting to attend non-concert-related venues that may be crowded.

My parents were way older than boomers, and hers were boomers, so maybe that's it?
by genewitch
3/31/2025 at 5:37:54 PM
It's no different than the old Amazon button thing. I'm not going to automatically pay whatever price Amazon is going to charge to push-button-replenish household goods. Especially in those days, when "The World's Biggest" fence would have pretty wild swings in price.

If I were rich enough to have some bot fly me somewhere, I'd have a real-life minion do it for me.
by Spooky23
3/31/2025 at 4:06:11 PM
Any customer service or tech support rep can tell you that even humans can't always understand what other humans are attempting to say.
by 3p495w3op495
3/31/2025 at 4:18:53 PM
It's so funny when people try to build robots imitating people. I mean, part funny, part the tragedy of the upcoming bust. The irony being, we would have been better off with an interoperable flight-booking API standard which a deterministic headless agent could use to make perfect bookings every single time. There is a reason current user interfaces stem from a scientific discipline once called "Human-Computer Interaction".
by hansmayer
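[Editor's note] The deterministic-agent idea above is worth making concrete: against a typed, standardized schema, a headless client validates every field up front, so nothing can be misheard or misphrased. No such standard exists; the field names and validation rules below are invented purely for illustration.

```python
# Sketch of a client for a hypothetical interoperable flight-booking
# standard: every field is typed and validated before any request is
# sent, so the same input always produces the same booking request.

from dataclasses import dataclass

@dataclass(frozen=True)
class FlightQuery:
    origin: str        # IATA airport code, e.g. "JFK"
    destination: str   # IATA airport code, e.g. "SFO"
    date: str          # ISO 8601 date, e.g. "2025-04-01"

    def validate(self):
        """Reject malformed fields deterministically, before anything runs."""
        if not (len(self.origin) == 3 and self.origin.isalpha()):
            raise ValueError(f"bad origin code: {self.origin!r}")
        if not (len(self.destination) == 3 and self.destination.isalpha()):
            raise ValueError(f"bad destination code: {self.destination!r}")
        if len(self.date.split("-")) != 3:
            raise ValueError(f"bad date: {self.date!r}")
        return self

query = FlightQuery("JFK", "SFO", "2025-04-01").validate()
# A headless agent would now submit `query` to the (hypothetical)
# standard endpoint; there is no natural language anywhere to go wrong.
```

Contrast this with the unbounded input space of a conversational agent, which is the thread's opening complaint.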
3/31/2025 at 4:56:22 PM
It's a business problem, not a tech problem. We don't have the solution you described because half of the air travel industry relies on things not being interoperable. AI is the solution at the limit: one set of companies selling users the ability to show a middle finger to a much wider set of companies; interoperability by literally having a digital human approximation pretending to be the user.
by TeMPOraL
3/31/2025 at 5:03:38 PM
I've been a sentient human for at least the last 15 years of tech advancement. Assuming this stuff actually works, it's only a matter of time before these AI services claw back all that value for themselves and hold users and businesses hostage to one another, just like social media and e-commerce before: https://en.wikipedia.org/wiki/Enshittification

Unless these tools can be run locally, independent of a service provider, we're just trading one boss for another.
by the_snooze
3/31/2025 at 7:01:16 PM
The difference is that social media isn't special because of its hardware, or even its software. People are stuck on Facebook because everyone else is on it. It's network effects. LLMs currently have no network effects. Your friends and family aren't "on" ChatGPT, so why use that over something else?

Once the performance of a local setup is on par with online ones, or good enough, that'll be game over for them.
by polishdude20
by polishdude20
4/1/2025 at 8:51:21 AM
All it takes is for the "omg AI slop!!111" and "would someone think of my copyrights?" crowd to get their way, resulting in a conventional or legal ban on using AI user-agents on the Internet without the express consent of a site/service provider. From there, it will be APIs all over again: much like today, you can't easily pipe your Facebook photo to your OneDrive and make a calendar invite, but you can use (for example) Zapier with Facebook Integration, OneDrive Integration, and Google Calendar Integration. We'll end up with LLM/chatbot companies whose main value is in the exclusive set of integrations they offer.

So true, it's not going to be "I use PolishDude20GPTBook because my family and friends are on it". It's going to be, "I use PolishDude20GPTBook because they have contracts with Gazeta.pl, Onet, TVN24, OLX and Allegro, so I can use it to get local news and find the best-priced products in a convenient way, whereas I can't use TeMPOraLxity for any of that".
Contracts over APIs, again.
As long as the "think of my copyright / AI slop oneoneone" crowd wins. It must not.
by TeMPOraL
4/2/2025 at 8:08:01 AM
The only reason there is an "AI-slop crowd" (as you call it) is that, well... there is a lot of (Gen-)AI slop. If the technology were as miraculous as it has been hyped up to be for several years now, there would be no such crowd. Everyone would just get on board; if a tech just does what it says it does, everyone gets on board. The Internet is a great example of this, as were smartphones after the iPhone moment. There was never an Anti-Internet Crowd; I wonder why that might be?
by hansmayer
4/2/2025 at 9:43:36 PM
> There was never an Anti-Internet-Crowd, I wonder why that might be?

You forgot the dotcom boom? :)
Existence of AI slop has nothing to do with whether the tech itself is exceeding or falling short of its hype. It exists because it's good enough for advertising, the cancer on modern society that metastasizes to every new medium and technology, defiling and destroying everything it touches.
by TeMPOraL
4/1/2025 at 5:08:08 AM
> Unless these tools can be run locally independent of a service provider, we're just trading one boss for another.

Not only that; we have to be careful about all the integrations being built around it. Thankfully the MCP standard is becoming mainstream (used by Anthropic and OpenAI, and next could be Google), and it's an open standard, even if started by Anthropic, so we won't have e.g. Anthropic-specific integrations.
by aledalgrande
4/1/2025 at 9:00:55 AM
See my replies to other comments parallel to yours. But in short: MCP doesn't help us any more than cURL lets you replicate Zapier in a shell script. The bad future is that, as with APIs, service providers get to differentiate between humans and AI user-agents, and restrict the latter to endpoints governed by B2B contracts.
by TeMPOraL
3/31/2025 at 5:16:59 PM
> Unless these tools can be run locally independent of a service provider, we're just trading one boss for another.

Many of them already can be. Many more existing models will become local options if/when RAM prices decline.
But this won't necessarily prevent enshittification, as there's always a possibility of a new model being tasked with pushing adverts or propaganda. And perhaps existing models already have been — certainly some people talk as if it's so.
by ben_w
4/1/2025 at 8:56:07 AM
People are worried about the wrong side of the equation. Other problems with them notwithstanding, it's not the browser wars that killed interoperability on the Web; it's everyone else. Any browser you ever used could issue the same HTTP calls (up to the standards of a given time, of course), but it helps you with nothing if the endpoint only works when you've signed a contract to access the private API.

The same fate may come to AI, and that worries me. It won't matter whether you're using OpenAI models, Anthropic models, or locally run models, any more than it matters whether you use Firefox, Chrome, or raw cURL. If businesses get to differentiate further between users and AI agents working as users, and especially if they get legal backing for doing that, you can kiss all the benefits of LLMs goodbye. They won't be yours as an end-user; they'll all accrue to capitalists, who in turn will lend slivers of them to you, for the price of a subscription.
by TeMPOraL
4/1/2025 at 3:03:06 PM
> Any browser you ever used could issue the same HTTP calls (up to standards of a given time, ofc.) - but it helps you with nothing if the endpoint only works when you've signed a contract to access the private API.

Oh, you mean like everyone who shows up to the Cloudflare submissions pointing out how they've been blocklisted from about 50% of the Internet, without recourse, due to the audacity of not running Chrome? In that circumstance, it's actually worse(?), because to the best of my knowledge I cannot subscribe to Cloudflare Verified to avoid the :fu:; I just have to hope the Eye of Sauron doesn't find me.
That reminds me, it's probably time for my semi-annual Google Takeout
by mdaniel
4/1/2025 at 7:21:20 PM
Yeah, that's just an extension of what I said. After all, it's not Google/Chrome that's creating this problem - it's Cloudflare and the people who buy this service from them, making the lazy/economically prudent assumption that anyone who has an opinion on how they consume services can be bucketed together with scammers and denied access.

It stems from the problem I described, though - blocking you for not using Chrome is just "only illegitimate users don't use Chrome", which is the next step after "only illegitimate users would want to use our API endpoints without starting a business and signing a formal contract with us".
by TeMPOraL
3/31/2025 at 8:00:21 PM
The airlines rely on things not interoperating for you. Their agents, however, interoperate all the time via code sharing. They don't want normal people to do this, but if something goes wrong with the airplane you were supposed to be on, they would prefer you get there than not.

by bluGill
4/1/2025 at 9:27:29 AM
> They don't want normal people to do this

That's the root of the problem. That's precisely why computers are not the "bicycles for the mind" they were imagined to be.
It's not a conspiracy theory, either. Most of the tech industry makes money inserting themselves between you and your problem and trying to make sure you're stuck with them.
by TeMPOraL
3/31/2025 at 4:44:41 PM
But that's the promise of AI, right? You can't put an API on everything, for human and technological reasons.

by jatins
3/31/2025 at 4:48:38 PM
You can’t put an API on everything because it’d take a ton of time and money to pull that off.

I can’t think of any technological reason why every digital system can’t have an API (barring security concerns, which would need to be handled case by case).
So instead, we put 100s of billions of dollars into statistical models hoping they could do it for us.
It’s kind of backwards.
by dartos
3/31/2025 at 6:12:41 PM
A web page is an Application/Human Interface. Outside of security concerns, companies can make more money if they control the Application/Human Interface, and it reduces the risk of a middleman / broker extorting them for margins.

If I run a flight aggregator that has a majority of flight bookings, I can start charging 'rents' by allowing featured/sponsored listings to be promoted above the 'best' result, leading to a prisoner's dilemma where airlines should pay up to their margins to keep market share.
If an AI company becomes the default application human interface, they can do the same thing. Pay OpenAI tribute or be ended as a going concern.
by datadrivenangel
4/1/2025 at 4:38:05 AM
LLMs as a natural language interface are fine.

What I’m saying is that if there were a standard protocol for making travel plans over the internet, we wouldn’t need an AI agent to book a trip.
We could just create great user experiences that expose those APIs like we do for pretty much everything on the web.
by dartos
3/31/2025 at 6:52:07 PM
Exactly. It should take around 10 parameters to book a flight. Not 30,000,000,000 and a dedicated nuclear power plant.

by daxfohl
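As a sketch of what such a minimal interface might look like - a hypothetical request shape, not any real airline's API, with all field names invented for illustration - roughly ten parameters really do cover a booking:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical sketch: the handful of parameters a flight-booking
# request plausibly needs. Field names are illustrative only.
@dataclass
class FlightBookingRequest:
    origin: str                  # IATA airport code, e.g. "ORD"
    destination: str             # IATA airport code, e.g. "LAX"
    depart_date: date
    return_date: Optional[date]  # None for a one-way trip
    passengers: int
    cabin_class: str             # "economy", "premium", "business", "first"
    max_price_usd: float
    refundable: bool
    preferred_alliance: Optional[str]
    max_stops: int

# Example request: one economy round trip, at most one stop.
request = FlightBookingRequest(
    origin="ORD", destination="LAX",
    depart_date=date(2025, 9, 17), return_date=date(2025, 9, 25),
    passengers=1, cabin_class="economy", max_price_usd=600.0,
    refundable=False, preferred_alliance=None, max_stops=1,
)
```

Ten scalar fields, no statistical model required - which is the commenter's point.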
3/31/2025 at 5:49:39 PM
You change who's paying.

by Scene_Cast2
3/31/2025 at 6:01:18 PM
Sure, as a biz it makes sense, but as a society, it’s obviously a big failure.

by dartos
3/31/2025 at 4:47:17 PM
It is a promise alright :)

by hansmayer
3/31/2025 at 7:32:11 PM
Your use of the word "perfect" is doing a lot of heavy lifting. "Perfect" is a word embedded in a high-dimensional space whose local maxima are different for every human on the planet.

by doug_durham
4/1/2025 at 10:22:06 AM
No, it's just the intuitively perfect that comes to mind in this context, i.e. reliable and guaranteed to produce a safe outcome - much like the Amazon checkout process. I am fine giving my credit card details to a near-perfect automaton like that. I will never give it to a statistical model, which may or may not hallucinate the sum it is supposed to enter into an interface built for humans, not computers.

by hansmayer
3/31/2025 at 7:53:21 PM
Yep, and AI agents essentially throw up a boundary blocking the user from understanding the capabilities of the system they're using. They're like the touch screens in cars that no one asked for, but for software.

by davesque
3/31/2025 at 3:21:08 PM
Case in point: look how long it’s taken for self-driving cars to mature. And many would argue they still have a ways to go until they’re truly reliable.

I think this highlights how we still haven’t cracked intelligence. Many of these issues come from the model’s very limited ability to adapt on the fly.
If you think about it every little action we take is a micro learning opportunity. A small-scale scientific process of trying something and seeing the result. Current AI models can’t really do that.
by CooCooCaCha
3/31/2025 at 6:41:30 PM
Even maps. I was driving to Chicago last week and Apple Maps insisted I take the exit for Danville. Fortunately I knew better; I only had the map on in case an accident required rerouting. I find it hard to drive with maps navigation because they are usually correct, but wrong often enough that I don't fully trust them. So I have to double check everything they tell me with the reality in front of me, and that takes more mental effort than it ideally should.

by SoftTalker
4/1/2025 at 2:38:52 AM
> double check everything they tell me with the reality in front of me

I believe that's a famous Army Ranger expression: "the map is not the terrain" (I tried to find an attribution for it, but it seems it comes in "the map is not the territory" flavors, too).
by mdaniel
3/31/2025 at 3:22:43 PM
Isn't the point he's making:

>> Yet too many AI projects consistently underestimate this, chasing flashy agent demos promising groundbreaking capabilities—until inevitable failures undermine their credibility.
This is the problem with the 'MCP for Foo' posts that have been appearing recently.
Adding a capability to your agent that the agent can't use just gives us exactly that:
> inevitable failures undermine their credibility
It should be relatively easy for everyone to agree that giving agents an unlimited set of arbitrary capabilities will just make them terrible at everything; and that promising that giving them these capabilities will make them better is:
A) false
B) undermining the credibility of agentic systems
C) undermining the credibility of the people making these promises
...I get it, it is hard to write good agent systems, but surely, a bunch of half-baked, function-calling wrappers that don't really work... like, it's not a good look right?
It's just vibe coding for agents.
I think it's quite reasonable to say, if you're building a system now, then:
> The key to navigating this tension is focus—choosing a small number of tasks to execute exceptionally well and relentlessly iterating upon them.
^ This seems like exceptionally good advice. If you can't make something that's actually good by iterating on it until it is good and it does work, then you're going to end up being a Devin (i.e. an over-promised, over-hyped failure).
by noodletheworld
3/31/2025 at 7:48:12 PM
> Yeah, the "book a flight" agent thing is a running joke now

I literally sat in a meeting with one of our board members who used this exact example of how "AI can do everything now!" and it was REALLY hard not to laugh.
by burnte
3/31/2025 at 7:52:33 PM
Can Google Flights find the best flight dates to a destination within a time frame? E.g. get flights to LA within an up-to-15-day window, ensuring attendance on 17 September. Fly with SkyAlliance airlines only. Flexible with any dates, but I need to be there on 17 Sept with a minimum stay of eight days.

I'd love it if it could help with that, but I haven't figured it out with Google Flights yet. My dream is to tell an AI agent the above and let it figure out the best deal.
by wdb
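The constraint search described above is mechanical once itineraries are structured data. A minimal sketch (all names hypothetical, candidate data invented): filter to itineraries that arrive by the must-attend date, satisfy the minimum stay, and match the alliance, then take the cheapest.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sketch of the search described in the comment above.
@dataclass
class Itinerary:
    arrive: date       # arrival at the destination
    depart: date       # departure back home
    alliance: str
    price_usd: float

def best_itinerary(candidates, must_be_there, min_stay_days, alliance):
    # Keep only itineraries meeting all three hard constraints.
    ok = [
        it for it in candidates
        if it.arrive <= must_be_there
        and (it.depart - it.arrive).days >= min_stay_days
        and it.alliance == alliance
    ]
    # Among the feasible ones, pick the cheapest (None if infeasible).
    return min(ok, key=lambda it: it.price_usd) if ok else None

# Invented candidates illustrating each constraint:
candidates = [
    Itinerary(date(2025, 9, 16), date(2025, 9, 25), "SkyAlliance", 540.0),
    Itinerary(date(2025, 9, 18), date(2025, 9, 30), "SkyAlliance", 410.0),   # arrives too late
    Itinerary(date(2025, 9, 15), date(2025, 9, 20), "SkyAlliance", 380.0),   # stay too short
    Itinerary(date(2025, 9, 14), date(2025, 9, 24), "OtherAlliance", 300.0), # wrong alliance
]
pick = best_itinerary(candidates, date(2025, 9, 17), 8, "SkyAlliance")
# → the first itinerary: arrives 16 Sept, 9-day stay, $540.0
```

The hard part is not this filter but getting clean, structured fare data to run it over - which loops back to the thread's point about missing APIs.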