MiniMax M2.5 released: 80.2% in SWE-bench Verified

2/12/2026 at 6:25:05 PM

I hope better and cheaper models will be widely available because competition is good for the business. However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one can really not call it smart. The more critical issue is that MiniMax 2 and 2.1 have the strong tendency to reward hacking, often write nonsensical test report while the tests actually failed. And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.

Artificial Analysis put MiniMax 2.1 Coding index on 33, far behind frontier models and I feel it's about right. [1]

[1] https://artificialanalysis.ai/models/minimax-m2-1

by sinuhe69

2/12/2026 at 6:35:22 PM

That's what I found with some of these LLM models as well. For example I still like to test those models with algorithm problems, and sometimes when they can't actually solve the problem, they will start to hardcode the test cases into the algorithm itself.. Even DeepSeek was doing this at some point, and some of the most recent ones still do this.

by osti

2/12/2026 at 7:03:03 PM

I have asked GLM4.7 in opencode to make an application to basically filter a couple of spatial datasets hosted at a url I provided it, and instead of trying to download read the dataset, it just read the url, assumed what the datasets were (and got it wrong) is and it's shape (and got it wrong) and the fields (and got it wrong) and just built an application based on vibes that was completely unfixable.

It wrote an extensive test suite on just fake data and then said the app is perfectly working as all tests passed.

This is a model that was supposed to match sonnet 4.5 in benchmarks. I don't think sonnet would be that dumb.

I use LLMs a lot to code, but these chinese models don't match anthropic and openai in being able to decide stuff for themselves. They work well if you give them explicit instructions that leaves little for it to mess up, but we are slowly approaching where OpenAI and anthropic models will make the right decisions on their own

by qinsignificance

2/12/2026 at 10:43:20 PM

this aligns perfecly with my experience, but of course, the discourse on X and other forums are filled with people who are not hands on. Marketing is first out of the gate. These models are not yet good enough to be put through a long coding session. They are getting better though! GLM 4.7 and Kimi 2.5 are alright.

by hsaliak

2/12/2026 at 7:29:27 PM

It really is infuriatingly dumb; like a junior who does not know English. Indeed, it often transitions into Chinese.

Just now it added some stuff to a file starting at L30 and I said "that one line L30 will do remove the rest", it interpreted 'the rest' as the file, and not what it added.

by esafak

2/12/2026 at 6:55:28 PM

Sounds exactly what a junior-dev would do without proper guidance. Could better direction in the prompts help? I find I frequently have to tell it where to put what fixes. IME they make a lot of spaghetti (LLMs and juniors)

by edoceo

2/12/2026 at 6:58:09 PM

Maybe the Juniors you have seen are actually retarded?

by heliumtera

2/12/2026 at 9:10:11 PM

wtf kinda juniors are you interacting with

by throawayonthe

2/12/2026 at 9:59:27 PM

Lots of self-taught; looking for an entry level.

by edoceo

2/13/2026 at 3:54:21 AM

I'm self-taught and I've always understood that adjusting tests to cheat is a fail.

by alsetmusic

2/12/2026 at 11:09:59 PM

> And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.

I haven’t tried MiniMax, but GPT-5.2-Codex has this problem. Yesterday I watched it observe a Python type error (variable declared with explicit incorrect type — fix was trivial), and it added a cast. (“cast” is Python speak for “override typing for this expression”.) I told it to fix it for real and not use cast. So it started sprinkling Any around the program (“Any” is awful Python speak for “don’t even try to understand this value and don’t warn either”).

by amluto

2/13/2026 at 5:53:53 AM

Even Claude opus 4.6 is pretty willing to start tearing apart my tests or special-case test values if it doesn't find a solution quickly (and in c++/rust land a good proportion of its "patience" seems to be taken up just getting things that compile)

by kimixa

2/13/2026 at 9:26:09 PM

I’ve found that GPT-5.2 is shockingly good at producing code that compiles, despite also being shockingly good at not even trying to compiling it and instead asking me whether I want it to compile the code.

by amluto

2/13/2026 at 5:15:23 PM

Or it uses type ignore comments

by whattheheckheck

2/12/2026 at 7:10:00 PM

MiniMax 2.1 didn't really work for my data-parsing tasks, a lot of errors.

Instead, this one works surprisingly well for the cost: https://openrouter.ai/xiaomi/mimo-v2-flash

by XCSme

2/12/2026 at 6:59:27 PM

Pelican is recognizable but not great, bicycle frame is missing a bar: https://gist.github.com/simonw/61b7953f29a0b7fee1f232f6d9826...

by simonw

2/13/2026 at 6:33:07 AM

Hmm, I am not sure the missing front fork is worse than the unsteerable front wheel mountings (which look like rear wheel mountings) most models so far have produced. It might be better... sort of an admission of an unsolved problem in design of the bike rather than producing something that looks approximately correct but can't possibly work. Like a "TODO" comment in code.

Also the position of the pelican on the bike would be somewhat awkward, but fits anatomically with a pelican's relatively short legs. In fact I can remember riding (or trying to ride) an adult bike as a young child using a similar position.

by jbotz

2/12/2026 at 7:26:45 PM

You should switch to an octopus riding a bike, much harder.

by UltraSane

2/12/2026 at 9:02:01 PM

Not an SVG, but I'm pretty impressed by what Gemini 3.0 Fast does: https://gemini.google.com/share/52c1229bd1d9

/imagine an svg of an octopus riding a bike. 1 arm shading its eyes from the sun, another waving a cute white flag, 2 driving the bike, 2 peddling the wheels, and 2 drifting behind in the wind

by onlyrealcuzzo

2/13/2026 at 5:50:40 AM

Maybe in the future models render a bitmap and trace the vector image with a tool, like a human would do.

by miohtama

2/12/2026 at 11:12:05 PM

I think part of the point is that the SVG is the hard part. Gemini is quite good at generating images, but it’s trained to generate raster images.

by amluto

2/12/2026 at 8:00:14 PM

also much less in training data by now

by mentalgear

2/12/2026 at 5:15:49 PM

Really looked forward to this release as MiniMax M2.1 is currently my most used model thanks to it being fast, cheap and excellent at tool calling. Whilst I still use Antigravity + Claude for development, I reach for MiniMax first in my AI workflows, GLM for code tasks and Kimi K2.5 when deep English analysis is needed.

Not self-hosting yet, but I prefer using Chinese OSS models for AI workflows because of the potential to self-host in future if needed. Also using it to power my openclaw assistant since IMO it has the best balance of speed, quality and cost:

> It costs just $1 to run the model continuously for an hour at 100 tokens/sec. At 50 tokens/sec, the cost drops to $0.30.

by mythz

2/12/2026 at 6:47:42 PM

> MiniMax first in my AI workflows, GLM for code tasks and Kimi K2.5

Its good to have these models to keep the frontier labs honest! Can i ask if you use the API or a monthly plan? Do the monthly plan throttle/reset ?

edit: i agree that MM2.1 most economic, and K2.5 generally the strongest

by algo_trader

2/12/2026 at 7:01:15 PM

Using a coding plan, haven't noticed any throttling and very happy with the performance. They publish the quotas for each of their plans on their website [1]:

- $10/mo: 100 prompts / 5 hours

- $20/mo: 300 prompts / 5 hours

- $50/mo: 1000 prompts / 5 hours

[1] https://platform.minimax.io/docs/guides/pricing-coding-plan

by mythz

2/12/2026 at 7:23:39 PM

They count one prompt as 15 requests. That gives you exactly 1500 API requests for 5 hours. Tokens are not counted.

by miroljub

2/12/2026 at 5:35:51 PM

!!!!!! Incredibly cheap!!!!!

I'll have to look for it in OpenRouter.

by user2722

2/12/2026 at 5:49:26 PM

For the moment is free in Opencode, if you want ot try it.

by amunozo

2/12/2026 at 5:58:55 PM

Hm. The benchmarks look too good to be true and a lot of the things they say about the way they train this model sound interesting, but it's hard to say how actually novel they are. Generally, I sort of calibrate how much salt I take benchmarks with based on the objective properties of the model and my past experiences with models from the same lab.

For instance,

I'm inclined to generally believe Kimi K2.5's benchmarks, because I've found that their models tend to be extremely good qualitatively and feel actually well-rounded and intelligent instead of brittle and bench-maxed.

I'm inclined to give GLM 5 some benefit of the doubt, because while I think their past benchmarks have overstated their models' capabilities, I've also found their models relatively competent, and they 2X'd the size of their models, as well as introduced a new architecture and raised the number of active parameters, which makes me feel like there is a possibility they could actually meet the benchmarks they are claiming.

Meanwhile, I've never found MiniMax remotely competent. It's always been extremely brittle, tended to screw up edits and misformat even simple JavaScript code, get into error loops, and quickly get context rot. And it's also simply just too small, in my opinion, to see the kind of performance they are claiming.

by logicprog

2/12/2026 at 8:58:39 PM

M2 was one of the most benchmaxxed models we've seen. Huge gap between SWE-B results and tasks it hasn't been trained on. We'll put 2.5 on the list. https://brokk.ai/power-ranking

by jbellis

2/12/2026 at 5:50:10 PM

Wish my company allowed more of these LLMs through Github Copilot, stuck with OpenAi, Anthropic and Google LLMs where they burn my credit one week into the month

by 3adawi

2/12/2026 at 7:00:35 PM

Wouldn't it be nice if we have language specific llms that work on average computers.

Like LLM that only trained on Python 3+, certain frameworks, certain code repos. Then you can use a different model for searching the internet to implement different things to cut down on costs.

Maybe I have no idea what I'm talking about lol

by thedangler

2/12/2026 at 7:02:32 PM

I imagine some sort of distill like this would be possible, but I think multi-language training really helps the LLM.

by theLiminator

2/12/2026 at 11:19:35 PM

Not a serious test, but I tried M2.5 briefly in OpenCode on a very simple task equivalent to the last commit or two here[0] and it was really, really bad. This is a 250 line self-contained standalone script and what it does is very simple. M2.5 would have required far more detailed prompting to get me the result Opus 4.6 can do with the vaguest hints.

[0]: https://github.com/oxidecomputer/console/pull/3070/commits

by dcre

2/12/2026 at 10:45:08 PM

It's interesting that we do not have a wave of tier-2 companies with NNN Million dollar cap releasing anything competitive. It's the big 4 labs vs the chinese labs. No Tier-2.

by hsaliak

2/12/2026 at 10:49:29 PM

There’s mistral

by vessenes

2/12/2026 at 10:50:57 PM

I've not had good luck with devstral at all..I am really rooting for them though!

by hsaliak

2/13/2026 at 4:03:19 AM

It's been a long time since they were good. But Europe definitely needs a homegrown frontier model company, one way or the other. I consider them Tier 2 right now.

by vessenes

2/12/2026 at 8:10:34 PM

This is cool, but they mentioned affordability, and said this is about $1/hour to run, which is about what I pay for claude code on $200/mo plan. This is not literally true, sometimes I'm running up to 3 concurrent intermittently throughout the day for maybe 60 hours per week.

So I do believe if there is something that comes up that is literally continuous, would be interesting, but I'm not sure about it right now. I would be curious if anyone has anything they would literally use running 24/7.

by mchusma

2/12/2026 at 5:56:25 PM

Btw, the model is free on OpenCode for now

by denysvitali

2/12/2026 at 7:04:07 PM

A reasonably sized OSS model that's this good at coding is a HUGE step forward.

We've done some vibe checks on it with OpenHands and it indeed performs roughly as good as Sonnet 4.5.

OSS models are catching up

by rbren

2/12/2026 at 6:17:25 PM

> M2.5-Lightning [...] costs $0.3 per million input tokens and $2.4 per million output tokens. M2.5 [...] costs half that. Both model versions support caching. Based on output price, the cost of M2.5 is one-tenth to one-twentieth that of Opus, Gemini 3 Pro, and GPT-5.

Huge - if not groundbreaking - if the benchmark stats are true.

by OsrsNeedsf2P

2/12/2026 at 8:58:58 PM

yes it's good. But you should also look at GLM 5 and Kimi K2.5 when looking at M2.5. It's amazing we have so many good and cheap open weight models now which are really not far behind the top models from the big US AI companies.

Anthropic Claude Code and OpenAI Codex plans are subsidised.

The Chinese open weight models hosted in US or Europe make more sense to use when you want to stay model agnostic and less dependent on a single AI company with relative expensive APIs.

by therealmarv

2/12/2026 at 9:12:10 PM

Cost per token doesn't really matter anymore, cost per task it more important.

by lm28469

2/12/2026 at 8:39:52 PM

Everyone is using this sort of let me group the plots weirdly instead of sorting them to make harder to compare. I see you folks

by motbus3

2/12/2026 at 8:08:38 PM

I wonder if these are starting to get reasonable enough to use locally?

by aliljet

2/12/2026 at 5:38:26 PM

And it's available on their coding plans, even the cheapest one.

by jhack

2/12/2026 at 5:45:22 PM

With the GLM news yesterday and now this, I'd love to try out one of these models, but I'm pretty tied to my Claude Code workflow. I see there's a workaround for GLM, but how are people utilizing MiniMax, especially for coding?

by turnsout

2/12/2026 at 5:50:48 PM

you can use Claude Code with these models. You just need to pass the right env vars. Have a look at the client setup guide on z.ai

by hasperdi

2/12/2026 at 6:00:37 PM

Interesting—thanks!

by turnsout

2/12/2026 at 5:50:01 PM

I use Opencode, when the model is free for the moment. I have not used Claude Code so I cannot compare.

by amunozo

2/12/2026 at 5:47:27 PM

anything with an open ai compatible endpoint can have claude code router put in front of it afaik https://github.com/musistudio/claude-code-router

by claythearc

2/12/2026 at 9:01:58 PM

$1/hr sounds suspiciously close to a price of one A100 80GB GPU.

Maybe an 8x node assuming batching >= 8 users per node.

by tgrowazay

2/13/2026 at 9:20:43 AM

[dead]

by liyuan851277048

2/13/2026 at 9:18:19 AM

[dead]

by liyuan851277048