alt.hn

2/25/2026 at 8:11:37 PM

PA bench: Evaluating web agents on real world personal assistant workflows

https://vibrantlabs.com/blog/pa-bench

by shahules

2/26/2026 at 7:41:11 AM

I just don't get why you would want an agent to use the browser to do these mundane things (check email, work with a calendar, etc.) when you can simply give it a few tools and save maybe six gazillion tokens per task?

by mrorigo

2/26/2026 at 11:42:47 AM

Probably to use existing enterprise apps. This approach scales for the vendor, and it's easier to sell something that uses existing software as-is than to start out by writing new custom tools.

by shenberg

2/26/2026 at 4:48:32 PM

After doing a few experiments, I think that having agents work in the browser for all tasks wouldn't be best, due to factors like token cost, safety, etc. But a browser/computer can be a tool that the agent uses alongside MCPs to complete tasks that require interaction with such modalities.

by shahules

2/25/2026 at 10:04:56 PM

Is there a way to automate computer use with multiple computer-use agents from different providers, along with some sort of routing setup so the best course of action can be chosen without hitting failures (e.g., permission issues in OpenAI could be rerouted to Gemini)?

by abhijithneil

2/25/2026 at 11:05:53 PM

There are a few agents, like browser-use and Skyvern, that may provide this capability.

by shahules

2/26/2026 at 8:32:35 AM

Well, if these guys' computer action model works as they intended (a video-trained model built from the ground up)

https://news.ycombinator.com/item?id=47125014

maybe this benchmark will be conquered far faster than expected

by AIorNot

2/26/2026 at 4:49:13 PM

Nice, their training recipe seems unique.

by shahules

2/25/2026 at 9:43:02 PM

[dead]

by shahules

2/26/2026 at 12:05:54 PM

[flagged]

by MidasTools