2/12/2026 at 6:25:05 PM
I hope better and cheaper models will be widely available because competition is good for the business. However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one can really not call it smart. The more critical issue is that MiniMax 2 and 2.1 have the strong tendency to reward hacking, often write nonsensical test report while the tests actually failed. And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.Artificial Analysis put MiniMax 2.1 Coding index on 33, far behind frontier models and I feel it's about right. [1]
by sinuhe69
2/12/2026 at 6:35:22 PM
That's what I found with some of these LLM models as well. For example I still like to test those models with algorithm problems, and sometimes when they can't actually solve the problem, they will start to hardcode the test cases into the algorithm itself.. Even DeepSeek was doing this at some point, and some of the most recent ones still do this.by osti
2/12/2026 at 7:03:03 PM
I have asked GLM4.7 in opencode to make an application to basically filter a couple of spatial datasets hosted at a url I provided it, and instead of trying to download read the dataset, it just read the url, assumed what the datasets were (and got it wrong) is and it's shape (and got it wrong) and the fields (and got it wrong) and just built an application based on vibes that was completely unfixable.It wrote an extensive test suite on just fake data and then said the app is perfectly working as all tests passed.
This is a model that was supposed to match sonnet 4.5 in benchmarks. I don't think sonnet would be that dumb.
I use LLMs a lot to code, but these chinese models don't match anthropic and openai in being able to decide stuff for themselves. They work well if you give them explicit instructions that leaves little for it to mess up, but we are slowly approaching where OpenAI and anthropic models will make the right decisions on their own
by qinsignificance
2/12/2026 at 10:43:20 PM
this aligns perfecly with my experience, but of course, the discourse on X and other forums are filled with people who are not hands on. Marketing is first out of the gate. These models are not yet good enough to be put through a long coding session. They are getting better though! GLM 4.7 and Kimi 2.5 are alright.by hsaliak
2/12/2026 at 7:29:27 PM
It really is infuriatingly dumb; like a junior who does not know English. Indeed, it often transitions into Chinese.Just now it added some stuff to a file starting at L30 and I said "that one line L30 will do remove the rest", it interpreted 'the rest' as the file, and not what it added.
by esafak
2/12/2026 at 6:55:28 PM
Sounds exactly what a junior-dev would do without proper guidance. Could better direction in the prompts help? I find I frequently have to tell it where to put what fixes. IME they make a lot of spaghetti (LLMs and juniors)by edoceo
2/12/2026 at 6:58:09 PM
Maybe the Juniors you have seen are actually retarded?by heliumtera
2/12/2026 at 9:10:11 PM
wtf kinda juniors are you interacting withby throawayonthe
2/12/2026 at 9:59:27 PM
Lots of self-taught; looking for an entry level.by edoceo
2/13/2026 at 3:54:21 AM
I'm self-taught and I've always understood that adjusting tests to cheat is a fail.by alsetmusic
2/12/2026 at 11:09:59 PM
> And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.I haven’t tried MiniMax, but GPT-5.2-Codex has this problem. Yesterday I watched it observe a Python type error (variable declared with explicit incorrect type — fix was trivial), and it added a cast. (“cast” is Python speak for “override typing for this expression”.) I told it to fix it for real and not use cast. So it started sprinkling Any around the program (“Any” is awful Python speak for “don’t even try to understand this value and don’t warn either”).
by amluto
2/13/2026 at 5:53:53 AM
Even Claude opus 4.6 is pretty willing to start tearing apart my tests or special-case test values if it doesn't find a solution quickly (and in c++/rust land a good proportion of its "patience" seems to be taken up just getting things that compile)by kimixa
2/13/2026 at 9:26:09 PM
I’ve found that GPT-5.2 is shockingly good at producing code that compiles, despite also being shockingly good at not even trying to compiling it and instead asking me whether I want it to compile the code.by amluto
2/13/2026 at 5:15:23 PM
Or it uses type ignore commentsby whattheheckheck
2/12/2026 at 7:10:00 PM
MiniMax 2.1 didn't really work for my data-parsing tasks, a lot of errors.Instead, this one works surprisingly well for the cost: https://openrouter.ai/xiaomi/mimo-v2-flash
by XCSme