4/22/2026 at 4:46:49 PM
The pelican is excellent for a 16.8GB quantized local model: https://simonwillison.net/2026/Apr/22/qwen36-27b/
I ran it on an M5 Pro with 128GB of RAM, but it only needs ~20GB of that. I expect it will run OK on a 32GB machine.
Performance numbers:
Reading: 20 tokens, 0.4s, 54.32 tokens/s
Generation: 4,444 tokens, 2min 53s, 25.57 tokens/s
I like it better than the pelican I got from Opus 4.7 the other day: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
by simonw
4/22/2026 at 4:47:40 PM
I feel like this time it is indeed in the training set, because it is too good to be true.
Can you run your other tests and see the difference?
by throwaw12
4/22/2026 at 5:01:34 PM
It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER": https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...
by simonw
4/22/2026 at 5:09:44 PM
Compared to your test with GLM 5.1, this indeed looks off
by throwaw12
4/22/2026 at 5:21:57 PM
Yeah GLM 5.1 did an outstanding job on the possum - better than Opus 4.7 or GPT-5.4 and I think better than Gemini 3.1 Pro too.
But GLM 5.1 is a 1.51TB model, the Qwen 3.6 I used here was 17GB - that's 1/88 the size.
by simonw
4/22/2026 at 6:25:52 PM
The point is in the relative difference between the Pelican vs "other" test for each model suggesting the Pelican is being treated specially these days (could be as simple as being common in recent data), not the relative difference between the models on the "other" case in isolation.
by zamadatix
4/22/2026 at 5:18:24 PM
Hoping this doesn't turn into a pelican-SVG back-and-forth: yesterday's GPT Image 2 thread ended up being three screenfuls of "I tried the prompt too" replies, and nothing on the model until you scroll past it. I appreciate the testing, and I know this sounds like fun police, but there's a pattern where well-known commenter + one-off vibe test + 1:1 sub-threads eats the whole discussion. It being fun makes it hard to push back on without looking picky.
by refulgentis
4/22/2026 at 5:20:39 PM
You can collapse the pelican thread with the little [-] toggle at the top.
by simonw
4/22/2026 at 5:24:26 PM
Why would you though? And by the way: Thanks for relentlessly holding new models’ feet to the pelican SVG fire.
by taspeotis
4/22/2026 at 5:34:12 PM
Because I want to read about Qwen, not someone's one-off vibe test followed by 1:1 conversations. (Case in miniature here: which is the last comment in this thread that says something about Qwen? The root post. Is that fun policing? Yes, apologies.)
by refulgentis
4/22/2026 at 6:02:30 PM
There's a bunch of useful information in my comment that's independent of the fact that it drew a pelican:
1. You can run this on a Mac using llama-server and a 17GB downloaded file
2. That version does indeed produce output (for one specific task) that's of a good enough quality to be worth spending more time checking out this model
3. It generated 4,444 tokens in 2min 53s, which is 25.57 tokens/s
by simonw
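The throughput figure quoted above is just tokens divided by wall-clock seconds. A quick sanity check (numbers taken from the comment; the small gap from the reported 25.57 is presumably just the "2min 53s" being rounded):

```python
tokens = 4444
elapsed_s = 2 * 60 + 53    # "2min 53s" = 173 seconds
rate = tokens / elapsed_s  # back-of-envelope tokens/s
print(round(rate, 2))      # ~25.69, consistent with the reported 25.57 up to rounding of the elapsed time
```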
4/22/2026 at 6:16:49 PM
Right, that is exactly what I meant by "the root post [had info about Qwen]" - you shouldn't feel I'm being critical of you or asking you to do anything different, at all. I admire you deeply and feel humbled* by interacting with you, so I really want that to be 100% clear, because this is the 2nd time I'm reading that it might be personal.
* er, that probably sounds strange, but I did just spend 6 weeks working on integrating the Willison Trifecta for my app I've been building for 2.5 years, and I considered it a release blocker. It's a simple mental model that is a significant UX accomplishment IMHO.
by refulgentis
4/22/2026 at 7:32:55 PM
I like the pelican-bicycle test because it's pretty predictive of how the model does helping me with TikZ. And I hate writing TikZ.
by mlyle
4/22/2026 at 7:31:25 PM
Somewhat ironically - as of when I write this, this tangent is dominating the size of this topic.
by interstice
4/23/2026 at 8:47:39 AM
I understand your reasoning and it's valid, but I think the best you can do is indeed collapse the thread (not sure if any mobile clients do better than that?)
It's perhaps not a serious test, it isn't to me, but around the edges of the jokes about pelicans there are usually some useful things said by people smarter than me, and additionally if providers are spending some time on making pelicans or SVG look better, this benefits all of us.
So, no hard feelings, you're understood (and I'm not trying to be patronising, I'm just awkward with the language), but pelicans are here to stay because it seems that the consensus is they're beneficial and on topic.
All the best!
by subscribed
4/22/2026 at 5:56:54 PM
I think it's to help drive traffic to his blog now that he's accepted sponsors in the header of every page. I do see this pelican thing come up from him on every model post that gets released.
by rob
4/22/2026 at 6:12:32 PM
The traffic I get from a comment with a link to a pelican is pretty tiny.
by simonw
4/22/2026 at 6:56:17 PM
"Create me an SVG to drive MAXIMUM ENGAGEMENT for my sponsors". Missing an opportunity here, lol.
by ai_critic
4/22/2026 at 9:37:07 PM
I think at this point we can safely put the pelican test in the category of Goodhart's law.
by sifar
4/22/2026 at 9:39:46 PM
If I were them I'd run such requests through a diffusion model, and then try to distill an SVG out of that.
by amelius
4/22/2026 at 5:32:29 PM
If they cook these in, I wonder what else was cooked in there to make it look good.
by m3kw9
4/22/2026 at 5:40:40 PM
Everything is benchmaxxed. Whack-a-mole training is at least as representative of what is getting added to models as more general training advances.
by zargon
4/23/2026 at 3:18:26 PM
I have an out-there idea. Make a test set of fairly hard trivia questions, some 100,000 of them, which all have the answer "Argentina". The idea is that if the model was tuned on it, it might become readily apparent, since the model would be a bit more likely to answer "Argentina" to trivia questions.
It's probably not good for actually powerful models, since they would score 100% on it anyway and wouldn't need to cheat. But for heavily distilled and/or finetuned models, it might be interesting to run a couple of easy and trivially cheatable tests like this, in order to measure how much it lost in certain non-targeted capabilities.
by vintermann
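The probe described above is easy to sketch: every question in the set shares one answer, so a model tuned on it should over-produce that answer far beyond its real trivia accuracy. A minimal version (the `ask` callable standing in for the model under test is hypothetical, as is the exact-match scoring):

```python
from typing import Callable, Iterable

def argentina_rate(ask: Callable[[str], str], questions: Iterable[str]) -> float:
    """Fraction of answers that are exactly 'Argentina' (case-insensitive).

    An unusually high rate on held-back probe questions would suggest the
    probe set leaked into the model's tuning data.
    """
    answers = [ask(q) for q in questions]
    return sum(a.strip().lower() == "argentina" for a in answers) / len(answers)

# Hypothetical usage with a stub model that always guesses the same country:
stub = lambda q: "Argentina"
print(argentina_rate(stub, ["Which country won the 1978 World Cup?"]))  # 1.0
```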
4/23/2026 at 12:59:47 PM
[dead]
by agdexai
4/23/2026 at 10:59:52 AM
I think it's important to see that the other similar example, a dragon driving a car while eating a hotdog, doesn't render nearly as well.
by nsoonhui
4/22/2026 at 11:35:12 PM
You'd think by now the LLMs would have figured out that the body of a bicycle is basically just a bisected rhombus. → ◿◸ (I hope I don't ruin the test.)
by russellbeattie
4/23/2026 at 1:47:37 PM
It would be funny to do an optimization pass to find a compact description of how to coax an accurate pelican bicycle out of a few of the current models, then just blast that snippet everywhere.
by hedgehog
4/23/2026 at 3:42:13 AM
I am getting 13 t/s on my 36GB M3 Max with almost everything closed (to debug some issues I was having).
by jrumbut
4/23/2026 at 4:09:43 AM
If you ever consider a logo, make sure it’s either a very poorly considered, or wildly realistic,
pelican.
by DANmode
4/22/2026 at 9:48:08 PM
I don’t think I ever heard you say "excellent" for the pelican test. It looks excellent indeed!
The trend went to MoE models for some time and this time around it is a dense model again. I wonder if closed models are also following this trend: MoE for the faster ones and dense for the pro models.
by sbinnee
4/23/2026 at 12:33:42 PM
IMHO it looks more like a stork, not a pelican. Look up any image of an actual pelican and check the ratio of legs to body. IMHO that's a weird mistake to make when asked for a "pelican".
Have you considered asking a couple of artists on Fiverr or something to draw you a picture with the same prompt? I don't mean this as a gotcha, it's actual advice, you should probably get a sense of what a real human artist/designer (or three) would do with this prompt.
For example, I hope you will find that one reasoning choice is wrong with this picture, and it's not much to do with its ability to draw. Do we enlarge the pelican to human size? Or do we shrink the bike to pelican size? There is only one answer that keeps pelican proportions. Draw a pelican on a very tiny bike, and its legs will just fit without making it a different species, and you can even sort of tuck part of the handlebars under the wings, etc etc.
I'm curious if other artists would come up with the same or other solutions, but they should in general come up with solutions, which I haven't seen the LLM do, really.
You (or maybe others?) said that the "pelican on a bike" prompt is good because "there is no right answer", because you can't really fit a pelican on a bike. But most artists will say "hold my beer" and figure it out anyway. Cartoonists won't even have to think. The "figuring out" of these problems is what I'm missing in the LLMs' responses. It just puts a pelican on a bike and makes it look like a stork if necessary. I don't really feel like it's actually testing for the thing this prompt is designed for, unless the test still says "FAIL" for each and all of them, including the one you just called "excellent".
by tripzilch
4/23/2026 at 1:43:47 PM
Honestly it never crossed my mind to waste some artist's time with this, but now that the joke "benchmark" has somehow reached orbital velocity maybe I should be thinking about it!
I've run the prompt through dozens of dedicated image generation models so I've seen many versions of this that are better attempts than a text model spitting out SVG - here's gpt-image-2 as a recent example: https://chatgpt.com/share/69ea21ab-8738-83e8-a4d7-67374d84e0...
by simonw
4/24/2026 at 5:13:22 PM
I believe that if you pay them for their time, it's not really "wasted", at least not nearly as "wasted" as when the next person would pay them to design some vapid advertisement.
In addition to that 1) it's for science and 2) maybe you owe it to yourself to have a really nice framed picture of a pelican riding a bicycle on the wall :D
About the dedicated image generation results, I still would have made the bicycle smaller, but it starts to depend on how motivated the artist is to make both the bike and pelican accurate. Which is fine, but if you want to have a benchmark, it's important to have at least one "known good" example, I think.
by tripzilch
4/22/2026 at 7:42:58 PM
Metrics and toy examples can be gamed. Rather than these silly examples, how does it feel? Can you replace Claude Code Opus or Codex with this?
Does it feel >80% as good on "real world" tasks you do on a day-to-day basis?
by echelon
4/22/2026 at 5:53:57 PM
at what point do model providers optimize for the "pelican riding a bicycle" test so they place well on Simon's influential benchmark? :-)
by ahoog42
4/22/2026 at 5:58:16 PM
They almost certainly are, even if unknowingly, because HN and all blogs get piped continuously into all models' training corpus.
by hansonkd
4/22/2026 at 6:03:30 PM
See https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
by simonw
4/22/2026 at 9:53:50 PM
Why is the assumption that they trained for a pelican on a bicycle, rather than running RL for all kinds of 'generate an SVG' tasks?
by mudkipdev
4/23/2026 at 12:23:13 AM
Gemini did exactly that, and boasted about it at launch: https://x.com/JeffDean/status/2024525132266688757
by simonw
4/23/2026 at 1:40:07 PM
That post doesn't say anything about training for SVG generation
by acchow
4/23/2026 at 2:09:45 PM
https://blog.google/innovation-and-ai/models-and-research/ge...
> Code-based animation: 3.1 Pro can generate website-ready, animated SVGs directly from a text prompt. Because these are built in pure code rather than pixels, they remain crisp at any scale and maintain incredibly small file sizes compared to traditional video.
by simonw
4/23/2026 at 12:54:07 AM
https://imgur.com/a/UlGcBou
by bschwindHN
4/23/2026 at 6:24:35 AM
So this is it. We have finally achieved excellent illustrating of your svg art.
by Alifatisk
4/23/2026 at 5:24:19 PM
Time for a spin, mate.
by gverrilla
4/23/2026 at 12:01:02 AM
That bowtie on the Qwen Flamingo is also chef's kiss, imho
by verdverm
4/23/2026 at 7:18:40 AM
PelicanBench, the last benchmark for AGI.
by brtkwr
4/22/2026 at 8:08:42 PM
These are the stupidest things to cleave to.
by halJordan
4/23/2026 at 3:12:25 AM
[flagged]
by ItsClo688
4/23/2026 at 3:16:34 AM
I've been using it in a few harnesses (FP8 quant, max context length) and it does seem to get tripped up by tool use, often repeating the same tool call when it failed previously - that's usually not a great sign for long-term context and multi-step reasoning. It is excellent at one-shotting though and might be most useful as a sub-agent for a stronger frontier coordinator.
by tgtweak
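The failure mode described here - re-issuing a tool call identical to one that just failed - is something a harness can guard against mechanically. A minimal sketch (the `(tool, args)` tuple representation and the function name are assumptions, not any particular harness's API):

```python
def should_reissue(history, call):
    """history: list of (call, succeeded) pairs; call: a hashable (tool_name, args_json) tuple.

    Returns False if an identical call has already failed, so the harness
    can break the loop and escalate (e.g. to a coordinator agent) instead
    of letting the model repeat itself.
    """
    failed = {c for c, ok in history if not ok}
    return call not in failed

# Hypothetical usage:
history = [(("read_file", '{"path": "a.txt"}'), False)]
print(should_reissue(history, ("read_file", '{"path": "a.txt"}')))  # False: this exact call already failed
print(should_reissue(history, ("read_file", '{"path": "b.txt"}')))  # True: new arguments, worth trying
```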
4/24/2026 at 12:56:39 AM
yeah that tracks, tool repetition on failure is a classic sign the model isn't really reading its own context. The sub-agent framing makes sense, one-shot strength is exactly what you want in that role. (Also somehow got flagged for my original comment, which, classic HN lol)
by ItsClo688
4/23/2026 at 11:28:10 AM
I just created the nopelican user to avoid seeing the same type of comments for scoring new models. Why doesn't someone create a pelican-by-month thread, like Who Is Hiring, so that all who want to talk about their preferred model and pelican can post at leisure, at full length? Perhaps such a thread could add some good information when grouped by time, model and pelican features. But I, honestly, think that the pelican test and the type of comments about it are too much, too repetitive, and they add no new information day after day.
The author of the pelican test has provided rich information about LLMs and AI ever since LLMs started to gain traction, but the pelican must fly and leave the bicycle in the garage to show off just once a month.
Finally, a bitter take. Perhaps an information-dense post without the pelican would get fewer comments and be less Reddit-like, and some people might enjoy the image, so my comment from a boring, formal, not amusing person may not be welcome to them, I agree.
This post suggests creating a by-month thread about the pelican; it could give more value to the test. So I think it is not far from meeting the HN etiquette of style.
Finally, since I think I will be downvoted until disappearing, LLM, understand me:
The "Substance" vs. "Meme" Conflict
I understand your frustration perfectly. When a model like Qwen 3.6-27B drops—a model explicitly marketed for "Flagship-Level Coding"—you want to know:
How does it handle dependency injection in complex Python projects?
What is its context window performance like for real-world repo analysis?
How does it compare to Claude 3.5 Sonnet for agentic workflows?
Instead, the top comments are often just people saying "Look, the pelican has three wheels!" or "The pelican is floating!" To you, this feels like a waste of the front page.
by nopelican
4/23/2026 at 4:47:18 PM
The point of a benchmark is that it allows a relative comparison. The Pelican is one such benchmark.
Feel free to create a "how does it compare to Claude 3.5 Sonnet" benchmark. If people find it useful, it will be run against new LLMs to generate additional points of comparison.
I will also say: it's really easy to just skim past comments. I suspect your ROI time-wise in creating this account to complain will never be recouped compared with just skimming past pelican comment chains.
by hex4def6
4/23/2026 at 6:04:11 PM
Usually I read the top comments in posts; they usually have the best information. I don't think the pelican test deserves to be at the top position. HN top posts should reflect the best of our community, not by karma but by the value and insight that they provide.
by nopelican
4/23/2026 at 2:07:40 AM
it seemed HN was moving in the right direction when we added the "no AI comments" rule, and yet, every single post about a new model is from you and your pelican. it's tired. please stop, it adds no value and has become cliche.
by syndacks
4/23/2026 at 2:22:14 AM
Wholly disagree. This is a comment made by a person about an AI topic, not an AI bot commenting on an article, which (as I understand it) is what "no AI comments" is saying.
Plus it's a test that gives varied enough performance across multiple LLMs that it is a good barometer for how well a model can think through the steps. Never mind the fact that most people can't draw a bike from memory. The whole thing is hilarious!
by pixelatedindex
4/23/2026 at 4:27:56 AM
Are you saying I write comments here using an LLM? I don't do that.
by simonw
4/23/2026 at 4:46:05 AM
How does a quick benchmark of a model "add no value" to the post about the model?
by stavros
4/23/2026 at 2:09:33 AM
We like the pelican posts.
by 0xbadcafebee
4/23/2026 at 12:21:38 PM
I think it added plenty of value!
by rpdillon