1/12/2025 at 4:29:22 AM
Over the holidays, we published a post[1] on using high-precision few-shot examples to get `gpt-4o-mini` to perform similarly to `gpt-4o`. I just re-ran that same experiment, but swapped out `gpt-4o-mini` for `phi-4`.

`phi-4` really blew me away in terms of learning from few-shots. It measured as 97% consistent with `gpt-4o` when using high-precision few-shots! Without the few-shots, it was only 37%. That's a huge improvement!
For comparison, with few-shots it performs as well as `gpt-4o-mini` does (though `gpt-4o-mini`'s baseline without few-shots was 59% – quite a bit higher than `phi-4`'s).
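For anyone curious about the mechanics: the few-shot part is just prepending known-good large-model outputs as prior turns before the new task. A minimal sketch (the endpoint, system prompt, and example data here are made up for illustration, not our production setup):

    # Prepend curated high-precision examples as few-shot turns, then ask
    # the small model to do the new task. Assumes an OpenAI-compatible
    # endpoint serving phi-4 (e.g. a local inference server).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    SYSTEM = "You are a content moderator. Follow the SOP strictly."

    # (task input, known-good gpt-4o output) pairs; illustrative only
    EXAMPLES = [
        ("Listing: 'Vintage lamp, slight scratch'", '{"verdict": "approve"}'),
        ("Listing: 'Designer bag, authentic replica'", '{"verdict": "reject"}'),
    ]

    def build_messages(task_input):
        messages = [{"role": "system", "content": SYSTEM}]
        for example_input, example_output in EXAMPLES:
            messages.append({"role": "user", "content": example_input})
            messages.append({"role": "assistant", "content": example_output})
        messages.append({"role": "user", "content": task_input})
        return messages

    resp = client.chat.completions.create(
        model="phi-4",
        messages=build_messages("Listing: 'Signed baseball, no certificate'"),
    )
    print(resp.choices[0].message.content)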
[1] https://bits.logic.inc/p/getting-gpt-4o-mini-to-perform-like
by sgk284
1/12/2025 at 7:15:24 AM
I like the direction, but have a pretty different experience in practice. This spans legal analytics, social media analytics, code synthesis, news analysis, cyber security LLMs, etc:

1. The only absolute quality metric I saw in that blog post, afaict, was expert agreement... at 90%. All of our customers would fire us at that level across all of the different b2b domains we work in. I'm surprised 90% is considered acceptable quality in a paying business context like retail.
2. Gpt-4o-mini is great. For the kind of simple tasks you describe, I find we can get gpt-4o-mini to about 95-98% agreement with gpt-4o by iteratively, manually improving prompts over increasingly large synthetic evals. Given data and a good dev, we do this basically same-day for a lot of simple tasks, which is astounding.
I do expect automatic prompt optimizers to win here long-term, and I keep hopefully revisiting dspy et al. For now, they fall short of standard prompt engineering. Likewise, I do believe in example learning over time for areas like personalization... but doing semantic search recall of high-rated answers was a V1 thing we had to rethink due to too many issues.
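Concretely, the loop is: score a candidate prompt by its agreement with gpt-4o over an eval set, tweak the prompt, re-run, repeat. A simplified sketch (the eval rows, prompt, and naive exact-match comparison are stand-ins for whatever the task needs):

    # Score a candidate prompt by small-vs-large model agreement on an
    # eval set. Assumes OPENAI_API_KEY is set; data is illustrative.
    from openai import OpenAI

    client = OpenAI()

    EVAL_SET = [
        {"input": "Listing: 'Brand new sealed phone'"},
        # ... hundreds more rows, ideally covering the rare classes too
    ]

    def classify(model, prompt, item):
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": item["input"]},
            ],
        )
        # Naive exact-match; real evals parse structured output instead.
        return resp.choices[0].message.content.strip().lower()

    def agreement(prompt, small="gpt-4o-mini", large="gpt-4o"):
        hits = sum(
            classify(small, prompt, item) == classify(large, prompt, item)
            for item in EVAL_SET
        )
        return hits / len(EVAL_SET)

    print(agreement("You are a content moderator. Answer 'approve' or 'reject'."))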
by lmeyerov
1/12/2025 at 7:55:25 AM
re: 90% – this particular case is a fairly subjective and creative task, where humans (and the LLM) are asked to follow a 22 page SOP. They've had a team of humans doing the task for 9 years, with exceptionally high variance in performance. The blended performance of the human team is meaningfully below this 90% threshold (~76%) – which speaks to the difficulty of the task.

It's, admittedly, a tough task to measure objectively though, in that it's like a code review. If a Principal Engineer pointed out 20 deficiencies in a code change and another Principal Engineer pointed out 18 of the same 20 things, but also pointed out 3 other things that the first reviewer didn't, it doesn't necessarily mean either review is wrong – they just meaningfully deviate from each other.
In this case, we chose an expert that we treat as an objective "source of truth".
re: simple tasks – We run hundreds of thousands of tasks every month with more-or-less deterministic behavior (in that we'll reliably do them correctly a million out of a million times). We chose a particularly challenging task for the case-study though.
re: in a paying business context – FWIW, most industries are filled with humans doing tasks where the rate of perfection is far below 90%.
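To make the code-review analogy concrete: for review-like outputs, the agreement we're describing is overlap-based rather than exact-match. A purely illustrative sketch (not the exact scoring from the post):

    # Overlap-style agreement for review-like tasks, where two reviewers
    # can both be "right" while flagging different items.
    def overlap_agreement(review_a, review_b):
        a, b = set(review_a), set(review_b)
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)  # Jaccard overlap

    reviewer_1 = {f"issue_{i}" for i in range(20)}             # 20 deficiencies
    reviewer_2 = set(list(reviewer_1)[:18]) | {"a", "b", "c"}  # 18 shared + 3 new

    print(round(overlap_agreement(reviewer_1, reviewer_2), 2))  # ~0.78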
by sgk284
1/12/2025 at 3:41:22 PM
I'm more confused now. If this is a tough and high-value task, we would not use gpt-4o-mini on its own (e.g., we'd add more steps like a verifier & retry, or just do gpt-4o to begin with), and we would more seriously consider fine-tuning in addition to the prompt engineering. The blog argued against that, but maybe I read too quickly.

And agreed, people expect the $ they invest into computer systems to do much better than their bad & avg employees. AI systems get the added challenge where they must do ~100% on what non-AI rules would catch ("why are you using AI?") plus deliver extra lift from AI ("what did this add?"). We generally get evaluated on matching experts (low bar) and exceeding them (high bar). Comparing to average staff is, frustratingly, a breakout.
Each scenario is different, obviously.
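By 'verifier & retry' I mean something as lightweight as the following (a sketch; `generate` and `verify` stand in for whatever the task allows, e.g. a schema check or a judge model):

    # Verify-and-retry wrapper: run the cheap model, check the output,
    # retry on failure, and fall back if retries are exhausted.
    def run_with_verification(task, generate, verify, max_retries=2):
        for _ in range(max_retries + 1):
            output = generate(task)
            if verify(task, output):
                return output
        return None  # escalate to a stronger model or a human here

    # generate = lambda t: call_small_model(t)    # hypothetical helper
    # verify   = lambda t, o: schema_is_valid(o)  # hypothetical helper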
by lmeyerov
1/12/2025 at 11:37:38 PM
One point of confusion might be that this is a tough but relatively low-value task (on a per-unit basis). The budget per item moderated is measured in small double-digit cents, but there are hundreds of thousands of items regularly being ingested.

FWIW – across all of these, we already do automated prompt rewriting, self-reflection, verification, and a suite of other things that help maximize reliability / quality, but those tokens add up quickly, and being able to dynamically switch over to a smaller model without degrading performance improves margin substantially.
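Mechanically, the dynamic switch is just a routing decision in front of each task. A rough sketch (the threshold, model names, and helpers are illustrative, not our exact setup):

    # Route to the small model when we have good precedent (high-precision
    # few-shots from similar past tasks), otherwise pay for the large model.
    def route(task, retrieve_similar, run_model, similarity_threshold=0.85):
        examples = retrieve_similar(task, threshold=similarity_threshold)
        if examples:
            # Enough precedent: the small model can stay consistent with it.
            return run_model("phi-4", task, few_shots=examples)
        # Novel task: use the large model, and store its output so future
        # similar tasks can be routed to the small model.
        return run_model("gpt-4o", task, few_shots=[])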
Fine-tuning is a non-starter for a number of reasons, but that's a much longer post.
by sgk284
1/12/2025 at 4:18:50 PM
I feel like using LLMs is going to be a skill to have, similar to the ability to google or type: they can give good answers, but also bad ones when you don't know the subject matter.

by zitterbewegung
1/12/2025 at 4:40:30 PM
Agreed, and that's where teams like the OP's come in.

OpenAI does great at training for general tasks, and we should not be disappointed when specialized tasks fail. Interestingly, OpenAI advertises increasingly many subjects they are special-casing, like math, code, & law, and so holding them to standards is fair there IMO.
For specialized contexts OpenAI doesn't eval on, it's worth hiring consultants / product teams to add the last-mile LLM data & tuning for the specific task. And at least in my experience, people paying money for AI experts & tech expect expert-level performance to be met and, ultimately, exceeded.
by lmeyerov
1/12/2025 at 7:06:12 PM
What's your loop for prompt engineering with GPT-4o? Do you feed the meta-prompter the misclassified examples? Also, does the evaluation drive the synthetic data production, almost like boosting?

by potatoman22
1/12/2025 at 7:28:09 PM
'It varies', b/c we do everything from an interactive analytics chat agent (louie.ai UI) to data-intensive continuous monitoring (louie.ai pipelines) to one-off customer assists like $B court cases.

1. Common themes in our development-time loop:
* We don't do synthetic data. We do real data or anonymized data. When we lack data, we go and get some. That may mean paying people, doing it ourselves, setting up simulation environments, etc.
* We start with synthetic judges, esp for scale tasks that are simple enough that we're considering smaller models like gpt-4o-mini (the topic here). Before we worry about expert agreement, we worry about gpt-4o agreement, and make evals that cover concerns like sample size and class imbalance (see the sketch at the end of this comment)...
* ... When the task is high value, e.g., tied closely to a paying customer deliverable or core product workflow, we invest more in expert evals, making calls like how many experts and of what caliber. Informally, we've learned multiple of our teammates, despite being good at what they do, can be lousy experts, while others are known for precision even if they're not data people (ex: our field staff can be great!). Likewise, we hire subject matter experts as full-timers (ex: former Europol/FBI equivs!), source them as contractors, and partner with our customers here.
* After a year+ of prompt engineering with different tasks, models, data, and prompt styles, there are a lot of rote tricks & standard practices we know. Most are 'static' -- you can audit a prompt for gotchas & top examples to fill in -- and a smaller number are like the OP's suggestion of dynamic prompts, where we include elements like RAG.
On the last point, it seems incredibly automatable, so I keep trying tools. I've found automatic prompt optimizers like dspy to be disappointing in being unable to match what our prompt engineers can do here: they did not do better than prompts we wrote as experts with bare-bones iteration, and leaning into the tools failed to get noticeable lift. I don't think this is inherent, just that they're probably eval'ing against people we would consider trainees. Ex: I see what Stanford medical fellows+PhDs are doing for their genai publications, and they would probably benefit from dspy if it were easier, but again, we would classify them as 'interns' wrt the quality of prompt engineering I see them doing behind-the-scenes. I'm optimistic that by 2026 tools here will be useful for skilled AI engineers too, they're just not there yet.
2. It's a lot more murky when we get into online+active learning loops for LLMs & agentic pipelines.
E.g., louie.ai works with live operational databases, where there is a lot you can learn from wrt people + systems, and issues like databases changing, differences in role & expertise, data privacy, adversarial data, and even the workflows changing. Another area we deal with is data streams where the physical realities they reflect change (questions+answers about logs, news, social, etc).
IMO these are a lot harder, and they're one of the areas a lot of our 2025 energy is going into. Conversely, 'automatic prompt engineering' seems like something PhDs can make big strides on in a vacuum...
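Re: the synthetic-judge bullet above, here's roughly what 'cover sample size and class imbalance' means in practice. Illustrative only; real evals track a lot more than this:

    # Per-class agreement, so a good headline number can't hide a class
    # that is rare but important. Labels and rows are illustrative.
    from collections import defaultdict

    def per_class_agreement(rows):
        hits, totals = defaultdict(int), defaultdict(int)
        for row in rows:
            totals[row["label"]] += 1
            hits[row["label"]] += int(row["judge"] == row["candidate"])
        return {
            label: (hits[label] / totals[label], totals[label])  # (agreement, n)
            for label in totals
        }

    # Overall agreement here is 80%, but the rare class is only at 50%.
    rows = [
        {"label": "benign", "judge": "benign", "candidate": "benign"},
        {"label": "benign", "judge": "benign", "candidate": "benign"},
        {"label": "benign", "judge": "benign", "candidate": "benign"},
        {"label": "threat", "judge": "threat", "candidate": "benign"},
        {"label": "threat", "judge": "threat", "candidate": "threat"},
    ]
    print(per_class_agreement(rows))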
by lmeyerov
1/12/2025 at 10:44:57 PM
Thanks! I love your focus on evaluation, it's missing in a lot of LLM products. I worked in the medical field and we valued model validation with similar importance. Our processes sound similar, too. One difference is that our customers still saw utility in models with much lower F1 than 90%. Rare events are hard to predict.

by potatoman22
1/12/2025 at 6:26:45 AM
This is really nice. I loved the detailed process and I'm definitely gonna use it. One nit though: I didn't understand what the graphs mean; maybe you should add axis names.

by yard2010
1/12/2025 at 7:14:46 AM
Thanks! Great suggestion for improving the graphs – I just updated the post with axis labels.

by sgk284
1/13/2025 at 5:57:04 PM
As a minor criticism, "Tasks Completed (Time)" is hard to evaluate without time intervals or units. I'm not sure if it should just be "Time"?
1/13/2025 at 5:52:53 PM
> m ∈ ℤ is the threshold for determining high or low novelty
> search(T, θ, m) retrieves the first m historical tasks that are semantically similar above the θ threshold
Are both m's here the same or different numbers? I found this a bit confusing
by blharr
1/14/2025 at 4:06:17 AM
In our case, yes, we treat them the same, though it might be interesting to decouple them.

You could, for example, include all few-shots that meet the similarity threshold, but you'll use more tokens for (I assume) marginal gain. Definitely worth a try though.
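For reference, the decoupled version would look roughly like this (a sketch; the embedding representation and `corpus` layout are placeholders, and `m_novelty` is the extra knob being discussed):

    # search(T, theta, m) with the retrieval cap decoupled from the
    # novelty threshold. Embeddings and corpus layout are placeholders.
    def cosine(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den if den else 0.0

    def search(task_embedding, corpus, theta, m):
        # corpus: [(embedding, historical_task), ...]
        matches = []
        for emb, hist in corpus:
            if cosine(task_embedding, emb) >= theta:
                matches.append(hist)
            if len(matches) == m:
                break  # first m similar tasks above the theta threshold
        return matches

    def is_high_novelty(task_embedding, corpus, theta, m_novelty):
        # In the post both m's are the same value; decoupling just means
        # this cutoff differs from how many few-shots you include.
        return len(search(task_embedding, corpus, theta, m_novelty)) < m_novelty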
by sgk284
1/12/2025 at 6:27:05 AM
Have you also tried using the large model as the FSKD model?

by vincent_s
1/12/2025 at 7:30:14 AM
We have, and it works great! We currently do this in production, though we use it to help us optimize for consistency between task executions (vs the linked post, which is about improving the capabilities of a model).

Phrased differently, when a task has many valid and correct conclusions, this technique lets the LLM ask "How did I do similar tasks before?" and it'll tend to solve new tasks by making decisions similar to those it made for previous, similar tasks.
Two things to note:
- You'll typically still want some small epsilon where you choose to run the task without few-shots (sketched below). This helps prevent mistakes from propagating forward indefinitely.
- You can have humans correct historical examples, and use their feedback to improve the large model dynamically in real-time. This is basically FSKD where the human is the "large model" and the large foundation model is the "small model".
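Concretely, the selection step for both notes ends up looking something like this (a sketch; the epsilon value, the example schema, and `retrieve_similar` are illustrative):

    # With small probability epsilon, run without few-shots so stale
    # mistakes can't propagate forever; prefer human-corrected outputs
    # when they exist. Helper names and fields are illustrative.
    import random

    def pick_few_shots(task, retrieve_similar, epsilon=0.05):
        if random.random() < epsilon:
            return []  # occasionally solve "from scratch" to break feedback loops
        few_shots = []
        for example in retrieve_similar(task):
            # A human correction acts as the "large model" here: it overrides
            # whatever the foundation model originally produced.
            output = example.get("corrected_output") or example.get("model_output")
            few_shots.append((example["input"], output))
        return few_shots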
by sgk284
1/12/2025 at 12:37:18 PM
Nice blog

by nothrowaways