5/20/2025 at 6:22:12 PM
After doing some testing, Imagen 4 doesn't score any higher than Imagen 3 on my comparison chart: roughly 60% prompt adherence accuracy.
by vunderba
5/20/2025 at 10:56:19 PM
I'm curious why you decided to declare victory after one successful attempt, but tried many times for unsuccessful models. Are you trying to measure whether a model _can_ get it right, or whether it frequently _does_ get it right? I feel like success rate is a better metric here, or at least a fixed number of trials with some success-rate threshold to determine model success.
by bigmadshoe
5/21/2025 at 3:21:15 AM
It's hard to nail down a good objective metric on something that is always going to be marginally qualitative in nature, but it's a good call-out - I should probably add a FAQ to the site.
To clarify, this test is purely PASS/FAIL - unsuccessful means that the model NEVER managed to generate an image adhering to the prompt. So as an example, Midjourney 7 did not manage to generate the correct vertical stack of translucent cubes ordered by color in 64 gen attempts.
It's a little beyond the scope of my site but I do like the idea of maintaining a more granular metric for the models that were successful to see how often they were successful.
by vunderba
5/21/2025 at 3:32:21 AM
Makes sense. It just set off some statistical alarm bells in my head to see a model marked as passing with 1 trial, and some models marked as failing with 5. What if the probability of success is 5% for both models? How confident are we that our grading of the models is correct? It's an interesting problem.
Cool site btw! Thanks for sharing.
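To make those alarm bells concrete (treating the 5% per-image success rate as a pure hypothetical), a quick back-of-the-envelope in Python:

    def p_at_least_one_pass(p, n):
        # Chance of at least one success in n independent attempts,
        # each with per-image success probability p.
        return 1 - (1 - p) ** n

    print(p_at_least_one_pass(0.05, 1))   # 0.05  -> one lucky hit still earns a PASS
    print(p_at_least_one_pass(0.05, 5))   # ~0.23 -> a "5% model" clears a 5-try test almost a quarter of the time
    print(p_at_least_one_pass(0.05, 64))  # ~0.96 -> with 64 tries, nearly any model eventually passes

So two models with identical underlying success rates can easily land on opposite sides of a PASS/FAIL line depending on how many tries they were given.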
by bigmadshoe
5/21/2025 at 6:21:11 AM
The current metric is actually quite strong -- it mirrors the real-world use case of people trying a few times and being satisfied if any of them's what they're looking for. It rewards diversity of responses.
Actually, search engines do this too: search something with many possible meanings -- like "egg" -- on Google, and you'll get a set of intentionally diversified results. I get Wikipedia; then a restaurant; then YouTube cooking videos; Big Green Egg's homepage; news stories about egg shortages. Each individual link is very unlike the others to maximize the chance that one of them's the one you want.
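That diversification trick has a classic formulation, maximal marginal relevance: greedily pick the next result that is relevant but unlike everything already picked. A rough sketch - relevance() and similarity() are hypothetical stand-ins for whatever scoring a real engine uses:

    def mmr_select(candidates, relevance, similarity, k, lam=0.7):
        # Greedily build a result list of size k, trading off relevance
        # (weight lam) against redundancy with already-selected results.
        selected = []
        pool = list(candidates)
        while pool and len(selected) < k:
            best = max(
                pool,
                key=lambda c: lam * relevance(c)
                - (1 - lam) * max((similarity(c, s) for s in selected), default=0.0),
            )
            selected.append(best)
            pool.remove(best)
        return selected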
by npinsker
5/21/2025 at 6:06:18 AM
It's made a little better by the fact that there are something like a dozen different prompts. Across all of the prompts, each model had a fair number of opportunities to show off.
by Taek
5/21/2025 at 2:51:50 AM
It is indicative of marginal improvements instead of new breakthroughs. iPhone 1 was a paradigm shift. iPhone 10 was essentially iPhone 9 with tweaks. As an AI optimist I would be disappointed to find we are already seeing diminishing returns on R&D.
by ipnon
5/21/2025 at 10:48:57 AM
Just moments ago, I managed to turn a photo of a person into a short clip of them dancing, in half-decent quality, fully locally, on a mid-range gaming GPU (RTX 4070 Ti, 12GB VRAM). I almost ran out of RAM (32GB), but it worked, worked well, and took only a couple of minutes.
Half a year ago, that was sort of possible for some genius really bent on making it happen. A year ago, that was unthinkable. Today, it's a matter of drag-and-dropping a workflow into a fresh ComfyUI install and downloading a couple dozen GB of img2vid models.
The returns on R&D are not diminishing, the progress is just not happening everywhere evenly and at the same time.
by TeMPOraL
5/21/2025 at 4:35:59 AM
Not only did the iPhone 9 never exist, the iPhone X was a huge paradigm shift in design and capabilities. That was the phone that introduced edge-to-edge OLED screens to the iPhone line, as well as the IR camera that enabled FaceID and the first generation of Portrait mode. I know it well since it also introduced the ability for developers to build facial motion capture apps that would’ve previously required expensive pro hardware and allowed people like me to build live facial motion capture effects for theatre.
Sorry to dunk so hard, but your example of technology stagnating is actually an example of breakthrough technological innovation deep into a product’s lifecycle: the very thing you were trying to say doesn’t happen.
by Uehreka
5/21/2025 at 12:17:44 PM
Arguably, OLED screens and IR cameras are not a paradigm shift - at least nothing comparable to the jump from "no smartphone" to iPhone 1.
by ofrzeta
5/21/2025 at 2:26:25 PM
There were smartphones before the iPhone. One could also describe the difference as "just a touchscreen".
by jere
5/21/2025 at 2:48:40 PM
The iPhone 1 featured "touch screen, GPS, camera, iPod, and internet access. Its software capabilities were a turning point for the smartphone industry" (random source: https://www.textline.com/blog/smartphone-history).
If you want to doubt that it was in fact a turning point, you'd need to provide very strong arguments.
by ofrzeta
5/21/2025 at 6:37:19 PM
All of the things you mentioned were available in phones before the first iPhone (assuming by iPod you mean an MP3 player). In fact, from a software point of view it was lacking a bunch of functionality and software ecosystem that some competitors had.
In my view, the reason the iPhone felt so new was almost entirely the incredibly responsive capacitive touch screen with a finger UI; everything I'd used before it had a resistive screen and preferred a pen for detail. A pen actually is better for detail, so in some ways it was that, more than anything else, that turned the device from a creation device into a consumption device, which was a whole new way of thinking about smart personal devices.
Of course it was also sold in a decent package too where Apple did deals that ensured it was available with good mobile internet plans which were also unusual at the time.
by kybernetikos
5/21/2025 at 3:24:55 PM
Also, the touchscreen was the type that unlike all previous touchscreens (except the ones made by a startup that Apple had bought) could detect touches at more than one screen location simultaneously.by hollerith
5/21/2025 at 5:59:50 PM
As someone who owned a number of smartphones and PDAs prior to the first iPhone coming out, the real advance was a usable mobile browser. I'd had all the same capabilities on devices for quite some time before the iPhone came out, but their browsers were painful to use. The touch interface was also a big advance over previous touch interfaces. In other areas the first iPhone was lacking compared to other smartphones - copy and paste and 3rd-party apps were missing, for example.
by spogbiper
5/21/2025 at 5:06:58 AM
Technological advancements, yes, but did it drastically change how the majority of users use the iPhone? I'd say marginally. Fancy selfie filters, OK, I'll give it that. But edge-to-edge screens, meh - give me back my home button :D
by vincnetas
5/21/2025 at 4:10:51 AM
At the risk of unbearable pedantry, there's never been an iPhone 9. (There was never a 2 either; there was kind of a 3, although it was really called the 3G.)
by sethaurus
5/20/2025 at 7:20:29 PM
The figure in the winning entry for "The Yarrctic Circle" by OpenAI 4o doesn't actually wield a cutlass. It's very aesthetically pleasing, even though it's so wrong in all fundamental aspects (perspective is nonsensical and anatomy is messed up, with one leg 150% longer than the other, ...).
It's a very interesting resource to map some of the limits of existing models.
by woolion
5/20/2025 at 10:37:49 PM
In my own testing between the two, this is what I’ve noticed: Imagen will follow the instructions, and 4o often won’t, but 4o produces aesthetically more pleasing images.
I don’t know which is more important, but I would say that people mostly won’t pay for fun-but-disposable images, and I think people will pay for art, though with an increased emphasis on the human artist. However, users might pay for reliable tools that can generate images for a purpose - things like educational illustrations - and those need to be able to follow the spec very well.
by danpalmer
5/21/2025 at 12:19:05 AM
People pay for digital sticker packs so their memoji in iMessage are customized. How much money they make on sticker packs is unknown to me, but image generation platform Midjourney seems to be doing alright.
by fragmede
5/21/2025 at 3:29:04 AM
Midjourney got in REALLY early in the GenAI game despite only allowing image generation through Discord for at least a year. I heard that it was one of the largest Discord servers ever, with something absurd like 20+ million members.
I'd love to see some financials, but I'd tend to agree they're probably doing pretty well.
by vunderba
5/21/2025 at 1:00:55 PM
In personal use, I’ve noticed o4-mini-high is far better than 4o on prompt adherence in image generation.
by ilikehurdles
5/20/2025 at 9:36:43 PM
Google Flow is remarkable as a video editing UX, but Imagen 4 doesn't really stand out amongst its image gen peers.
I want to interrupt all of this hype over Imagen 4 to talk about the totally slept-on Tencent Hunyuan Image 2.0, which stealthily launched last Friday. It's absolutely remarkable and features:
- millisecond generation times
- real time image-to-image drawing capabilities
- visual instructivity (e.g. you can circle regions, draw arrows, and write prompts addressing them)
- incredible prompt adherence and quality
Nothing else on the market has these properties in quite this combination, so it's rather unique.
Release Tweet: https://x.com/TencentHunyuan/status/1923263203825549457
Tencent Hunyuan had a bunch of model releases all wrapped up in a product that they call "Hunyuan Game", but the Hunyuan Image 2.0 real time drawing canvas is the real star of it all. It's basically a faster, higher quality Krea: https://x.com/TencentHunyuan/status/1924713242150273424
More real time canvas samples: https://youtu.be/tVgT42iI31c?si=WEuvie-fIDaGk2J6&t=141 (I haven't found any other videos on the internet apart from these two.)
You can see how this is an incredible illustration tool. If they were to open source this, it would immediately become the top image generation model over Flux, Imagen 4, etc. At this point, really only gpt-image-1 stands apart as having godlike instructivity, but it's on the other end of the [real time <--> instructive] spectrum.
A total creative image tool kit might just be gpt-image-1 and Hunyuan Image 2.0. The other models are degenerate cases.
More image samples: https://x.com/Gdgtify/status/1923374102653317545
If anyone from Tencent or the Hunyuan team is reading this: PLEASE, PLEASE, PLEASE OPEN SOURCE THIS. (PLEASE!!)
by echelon
5/20/2025 at 10:35:57 PM
> but Imagen 4 doesn't really stand out amongst its image gen peers.
In this AI rat race, whenever one model gets ahead, they all tend to reach parity within 3-6 months. If you can wait 6 months to create your video, I'm sure Imagen 5 will be more than good enough.
It's honestly kind of ridiculous the pace things are moving at these days. 10 years ago, waiting a year for something was very normal; nowadays people are judging the model-of-the-week against last week's model-of-the-week, but last week's org will probably not sleep, and they'll release another one next week.
by dheera
5/20/2025 at 10:12:53 PM
This is amazing - can’t see how I missed it. Thank you!
by Narciss
5/21/2025 at 12:58:28 PM
I've given this some more thought. Even if Imagen 4 isn't that great on its own, all of Google's models and UX products in conjunction (Veo 3, Flow, etc.) are orders of magnitude above the rest of the playing field.
If Tencent wants to keep Google from winning the game, they should open source their models. From my perspective right now, it looks like Google is going to win this entire game, and open source AI might be the only way to stop that from being a runaway victory.
by echelon
5/21/2025 at 12:11:08 AM
Good catch - that's on me, I accidentally uploaded the wrong image for gpt-image-1. Fixed!
by vunderba
5/20/2025 at 9:16:29 PM
I can't find the image you're talking about. Link pls?
by NoahZuniga
5/20/2025 at 8:26:09 PM
The hands in the winning entry for "Not the Bees" look very unlike any driver's. I wouldn't count it as a pass.
by tintor
5/20/2025 at 10:32:05 PM
I hate to say it, but after staring at so many equivalents of Tyrone Rugen since the dark ages of Stable Diffusion 1.5, I literally DID NOT EVEN notice that until you called it out. The training data in my wetware has been corrupted.
by vunderba
5/20/2025 at 8:22:43 PM
More difficult examples:
- wine glass that is full to the brim with wine (i.e. not half full)
- wristwatch not showing a V (hands at 10 and 2 o'clock)
- 9-step IKEA shelf assembly instruction diagram
- any kind of gymnastics / sport acro
by tintor
5/21/2025 at 12:10:56 AM
What's the reason to test the "not showing ..."? I've never seen anyone make that kind of request in real life. They ask for what they actually want instead. You'd ask for a clock showing 3:25 rather than "not 10:10".
I mean, it's a fun edge case, but I'm practice - does it matter?
by viraptor
5/27/2025 at 8:24:48 PM
The problem is that watchmakers always set watches to show a V with the hands (10:10) when they market them. This causes a very strong bias in image generation models, making it very difficult for them to generate a watch showing any other time, even if the user requests it.
by tintor
5/21/2025 at 12:32:03 AM
> I mean, it's a fun edge case, but I'm practice - does it matter?
*in practice, not I'm practice. (I swear I have a point, I'm not being needlessly pedantic.) In images, as in English, mistakes stick out. Thus negative prompts are used a lot for iterative image generation. Even when you're working with a human graphics designer, you may not know exactly what you want, but you know that you don't want (some aspect of) the image in front of you.
I.e.: "Not that", for varying values of "that".
by fragmede
5/21/2025 at 1:29:49 AM
> Thus negative prompts are used a lot for iterative image generation.
Are they still? Negative keywords were popular in the SD era, and the negative prompt was popular with later models in advanced tools. But modern iteration looks different - models capable of editing are perfectly fine with processing the previous image with a prompt like "remove the elephant" or "make the clock show a different time". Are the negative parts of the initial prompt still actually used in iteration?
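For reference, the SD-era mechanism being contrasted here looks something like this with the Hugging Face diffusers library (model choice and prompts are just illustrative):

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a Stable Diffusion checkpoint (illustrative model id).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe(
        prompt="product photo of a wristwatch on a wooden table",
        negative_prompt="hands at 10:10",  # steer away from the marketing-photo bias
    ).images[0]
    image.save("watch.png")

whereas the edit-style flow is a positive instruction ("make the clock show 3:25") applied to the previous image.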
by viraptor
5/20/2025 at 8:16:48 PM
How can you tell you're using Imagen 4 and not Imagen 3? Gemini seems unable to tell me which model it's using. Are you using Vertex AI?
by strongpigeon
5/20/2025 at 10:24:12 PM
I used Whisk. The model listing shows 3/4 because testing against Imagen 4 did not result in a measurable increase in accuracy over Imagen 3.
by vunderba
5/20/2025 at 9:49:10 PM
Well, they've labelled it 3/4 so I'm guessing they can't, but you can use 4 in Whisk.
by sidibe
5/20/2025 at 8:40:06 PM
Tell me you’re using Imagen 3 without telling me you’re using Imagen 4… or somethingby EGreg
5/21/2025 at 9:33:36 AM
Side note: it's my understanding that being a pith helmet is pretty orthogonal to having a spike. Plenty of helmets with spikes aren't pith helmets, and plenty of pith helmets don't have spikes.
Not sure if this affects your results or not, but I couldn't resist chiming in!
by andybak
5/21/2025 at 9:45:46 AM
Also "Hippity Hop" is a Space Hopper! Wikipedia agrees with me: https://en.wikipedia.org/wiki/Space_hopper :)I wonder how much the commonality or frequency of names for things affects image generation? My hunch is that it it roughly correlates and you'd get better results for terms with more hits in the training data. I'd probably use Google image search as a rough proxy for this.
by andybak
5/26/2025 at 8:08:04 PM
Well, I'm about a billion years late to this but going to reply anyway. It's not mentioned, but internally I give each model several "iterations" of the prompts themselves. This included using "hippity hop", "space hopper", and even a physical description of the toy itself. But it's a good call-out!
by vunderba
5/20/2025 at 6:32:04 PM
How do companies like https://icon.com do their image gen if the existing SOTA for prompt adherence is so poor?
by Onavo
5/20/2025 at 7:10:44 PM
People who generate images for ads probably don't often need strict prompt adherence, just a random backdrop to slap a picture of their product on top of. The kind of thing they'd have used a stock image library for before.
Also, "create static + video ads that are 0-99% complete" suggests the performance is hit or miss.
by yorwba
5/21/2025 at 2:21:01 AM
Exactly this. It just helps with the foundation, which doesn’t need specific details in most cases.
by AsmodiusVI
5/20/2025 at 7:00:48 PM
Fine-tuning and prompt techniques can go a long way. That + cherry-picking results.
by peab
5/20/2025 at 9:30:25 PM
Multi-shot generation with discriminators.
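Roughly: sample several candidates, score each with a discriminator (a CLIP-style image-text similarity model, an aesthetic scorer, etc.), and keep the best. A minimal sketch - generate() and score() are hypothetical stand-ins for a real pipeline and scorer:

    def best_of_n(prompt, generate, score, n=8):
        # Multi-shot generation: sample n images for the prompt and
        # return the one the discriminator scores highest.
        # generate(prompt) -> image; score(prompt, image) -> float.
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda img: score(prompt, img))

by htrp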
5/20/2025 at 8:49:08 PM
> "A dolphin is using its fluke to discipline a mermaid by paddling it across the backside."Hmm.
by mcphage
5/20/2025 at 7:19:24 PM
How do you determine how many attempts to make before declaring the result a failure?
by snug
5/20/2025 at 8:48:38 PM
It's listed in purple to the right of the model name.
by mcphage
5/20/2025 at 9:32:04 PM
I think they're asking how the number to stop at was determined, not what the number stopped at was.
My guess as to why it's 64 attempts to a pass for one model and 5 attempts to a fail for another is simply "whether or not the author felt there was a chance random variance would result in a pass with a few more tries, based on the initial 5-ish". I.e., a bit subjective, as is the overall grading in the end anyway.
by zamadatix
5/20/2025 at 10:29:40 PM
That's exactly what it was. It's hard to define a discrete rubric for grading at an inherently qualitative level. Usually more attempts means that it seemed like the model had the "potential" to get across the finish line, so I gave it more opportunities.
If there are only a few attempts and it ends in a failure, there's a pretty good chance that I could sort of tell the model had ZERO chance.
by vunderba
5/21/2025 at 2:30:50 PM
They failed, but man, those snakes are cool. Awesome website!
by anton-c
5/20/2025 at 6:42:32 PM
Awesome showcase! Fun descriptions. Are there similar sites?
by xixixao
5/21/2025 at 12:30:16 AM
Thanks! There are definitely other GenAI image comparison sites out there, but I found that the majority of them were more concerned with visual fidelity, which IMHO is a less challenging problem than prompt adherence.
This is probably one of the better-known benchmarks, but when I see Midjourney 7 and Imagen 3 within spitting distance of each other, it makes me question what kind of metrics they are using.
by vunderba
5/20/2025 at 9:32:23 PM
I love the writing style in this.
by zamadatix
5/21/2025 at 1:58:00 AM
The website is broken.
by mvdtnz
5/21/2025 at 3:17:30 AM
That's unusual - I don't see anything in the logs, and perf tests / website speed tests show everything is good. Maybe Cloudflare had a hiccup.
by vunderba
5/20/2025 at 7:00:09 PM
Great website!
by peab