1/17/2025 at 12:27:51 PM
I'm one of the creators of OpenHands (fka OpenDevin). I agree with most of what's been said here with respect to software agents in general.
We are not even close to the point where AI can "replace" a software engineer. Their code still needs to be reviewed and tested, at least as much as you'd scrutinize the code of a brand-new engineer just out of boot camp. I've talked to companies who went all in on AI engineers, only to realize two months later that their codebase was rotting because no one was reviewing the changes.
But once you develop some intuition for how to use them, software agents can be a _massive_ boost to productivity. ~20% of the commits to the OpenHands codebase are now authored or co-authored by OpenHands itself. I especially love asking it to do simple, tedious things like fixing merge conflicts or failing linters. It's great at getting an existing PR over the line.
It's also important to keep in mind that these agents are literally improving on a _weekly_ basis. A few weeks ago we were at the top of the SWE-bench leaderboard; now there are half a dozen agents that have pulled ahead of us. And we're one launch away from leapfrogging back to the top. Exciting times!
by rbren
1/17/2025 at 1:09:22 PM
> code still needs to be reviewed and tested, at least as much as you'd scrutinize the code of a brand new engineer just out of boot camp
> ..._massive_ boost to productivity. ~20% of the commits to the OpenHands codebase are now authored or co-authored by OpenHands itself.
I'm having trouble reconciling these statements. Where does the productivity boost come from since that reviewing burden seems much greater than you'd have if you knew commits were coming from a competent human?
by jebarker
1/17/2025 at 1:39:14 PM
There's often a lot of small fixes that aren't time-efficient to do by hand, but where the solution is not much code and is quick to verify.
If the cost of setting a coding agent (e.g. aider) on a task is small, you can see if it reaches a quick solution and just abort if it spins out. That lets you solve a subset of these types of issues very quickly, instead of leaving them in issue tracking to grow stale, and up the polish on your work.
That's still quite a different story to having it do the core, most important part of your work. That feels a little further away. One of the challenges is the scout rule, the refactoring alongside change that makes the codebase nicer. I feel like today it's easier to get a correct change that slightly degrades codebase quality, than one that maintains it.
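The cheap-attempt pattern above can be sketched as a hard time budget around an agent run. This is a minimal sketch, not anything from the thread: the five-minute budget is arbitrary, and the commented aider invocation is illustrative (check aider's own docs for its real interface).

```python
import subprocess

def quick_attempt(cmd: list[str], budget_s: float) -> bool:
    """Run a coding-agent command with a hard time budget.

    Returns True if it exits cleanly within the budget; kills it
    (treating the task as "spun out") otherwise.
    """
    try:
        return subprocess.run(cmd, timeout=budget_s).returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Hypothetical one-shot run on a small, quick-to-verify fix;
# abandon the attempt if it takes more than five minutes:
# quick_attempt(["aider", "--yes", "--message", "fix the failing linter"], 300)
```

The key design point is that a failed or aborted attempt costs you almost nothing, so the expected value stays positive even when the agent only succeeds some of the time.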
by lars512
1/17/2025 at 1:49:22 PM
Thanks - this all makes sense. I still don't feel like this would constitute a massive productivity boost in most cases, since it's not fixing time-consuming major issues. But I can see how it's nice to have.
by jebarker
1/17/2025 at 2:04:59 PM
The bigger win comes not from saving keystrokes, but from saving you from a context switch.
Merge conflicts are probably the biggest one for me. I put up a PR and move on to a new task. Someone approves, but now there are conflicts. I could switch off my task and spend 5-10 min remembering the intent of this PR and fixing the issues. Or I could just say "@openhands fix the merge conflicts" and move back to my new task.
by rbren
1/17/2025 at 9:08:37 PM
The issue is that you still need to review the fixed PR (or someone else does), which means you just deferred the context switch, you didn't eliminate it. And if the fix is in a new commit, that's possible (whereas if it rebases, you have to remember your old SHA).
Playing the other side: pipelining is real.
by svieira
1/17/2025 at 2:15:50 PM
I haven't started doing this with agents, but with autocomplete models I know exactly what OP is talking about: you stop trying to use models for things that models are bad at. A lot of people complain that Copilot is more harm than good, but after a couple of months of using it I figured out when to bother and when not to, and it's been a huge help since then.
I imagine the same thing applies to agents. You can waste a lot of time by giving them tasks that are beyond them and then having to review complicated work that is more likely to be wrong than right. But once you develop an intuition for what they can and cannot do, you can act appropriately.
by lolinder
1/17/2025 at 1:48:21 PM
I suspect that many engineers do not expend significant energy on reviewing code, especially if the change is lengthy.
by drewbug01
1/17/2025 at 1:32:11 PM
> burden seems much greater than...
Because the burden is much lower than if you were authoring the same commit yourself without any automation?
by linsomniac
1/17/2025 at 1:37:05 PM
Is that true? I'd like to think my commits are less burdensome to review than those of a fresh-out-of-boot-camp junior dev, especially if all that's being done is fixing linter issues. Perhaps there's a small benefit, but it doesn't seem like a major productivity boost.
by jebarker
1/17/2025 at 1:47:02 PM
A junior dev is not a good approximation of the strengths and weaknesses of these models.by ErikBjare
1/17/2025 at 2:08:31 PM
Agreed! The comparison is great for estimating the scope of the tasks they're capable of: they do very well with bite-sized tasks that can be individually verified. But their world knowledge is that of a principal engineer!
I think this is why people struggle so much with agents: they see the agent perform magic, then assume it can be trusted with a larger task, where it completely falls down.
by rbren
1/17/2025 at 2:00:37 PM
The post I originally commented on literally made that comparison when describing the models as a massive productivity boost.
by jebarker
1/17/2025 at 4:38:15 PM
We've seen exponential improvements in LLMs' coding abilities. They went from almost useless to somewhat useful in like two years.
Claude 3.5 is not bad, really. I wanted to do a side project that had been on my mind for a few years, and Claude coded it in like 30 seconds.
So to say "we are not even close" seems strange. If LLMs continue to improve, they will be comparable to mid level developers in 2-3 years, senior developers in 4-5 years.
by bufferoverflow
1/17/2025 at 4:59:31 PM
> So to say "we are not even close" seems strange. If LLMs continue to improve, they will be comparable to mid level developers in 2-3 years, senior developers in 4-5 years.
These sorts of things can't be extrapolated. It could be 6 months, or it could be a local maximum / dead end that'll take another breakthrough in 10 years, like transformers were. See self-driving cars.
by Zanfa
1/17/2025 at 4:22:16 PM
What does the cost look like for running OpenHands yourself? From your docs, it looks like you recommend Sonnet at $3 / million tokens. But I could imagine this can add up quickly if you're sending large portions of the repository at a time as context.
by veggieroll
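The quoted $3 per million input tokens lends itself to back-of-envelope math. A rough sketch, where the $15/million output price and the token counts are my assumptions for illustration, not figures from the OpenHands docs:

```python
def run_cost_usd(input_tokens: int, output_tokens: int,
                 input_price_per_m: float = 3.0,
                 output_price_per_m: float = 15.0) -> float:
    """Estimate the cost of one agent run in US dollars,
    given per-million-token prices for input and output."""
    return ((input_tokens / 1_000_000) * input_price_per_m
            + (output_tokens / 1_000_000) * output_price_per_m)

# A hypothetical agent run that resends 50k tokens of repo context
# across 10 turns (500k input tokens total) and generates 20k tokens:
print(round(run_cost_usd(500_000, 20_000), 2))  # -> 1.8
```

The resent-context term dominates, which is why token costs scale with how much of the repository the agent keeps in its window rather than with the size of the final diff.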