5/7/2026 at 2:41:04 PM
Quick question, for average joe do we still need to "train" LLM or we can just use off the shelf model and use it ("inference"?) for normal use cases like business process augmentation (e.g. helping read paper receipts, or generate cat videos)?by wiradikusuma
5/7/2026 at 3:55:12 PM
Modern smaller LLMs like Qwen3.6 27B is quite good at visual tasks like describing images. I wouldn't trust it on receipts unless you're fine with a bit less than 100% accuracy, say 90-ish%. For descriptions of images and such I've found they do quite well indeed. A key change was the introduction of more or even dynamic visual tokens, that really helped the model "see" more details.Generating cat videos is the domain of diffusion models. If you have at least a 16GB GPU and a fair bit of patience you can get quite good results, check out ComfyUI reddit for example.
by magicalhippo
5/7/2026 at 4:09:51 PM
Just as example, here's what Qwen3.6 27B Q5_K_XL can do given this[1] image. I didn't do any prompt engineering here just a dead simple prompt: "Transcribe the following receipt. Put line items in a separate section, each line item separated by a double newline". Temperature set to 0.5.Here's the output:
Publix.
Bradenton Commons Shopping Center
4651 Cortez Rd. W.
Bradenton, FL 34210
Store Manager: Joe Galati
941-792-7195
N/O LF WHEAT BREAD 3.99 F
PBX THCK L/S BACON 7.82 F
PUBLIX BROWN GRAVY 0.83 F
TOP SIRLOIN STEAK 11.74 F
You Saved 3.92
VITA PRTY SNK WINE 6.99 F
You Saved 3.00
ORGANIC CARROTS 1.69 F
BRC FLRT EAT SMART 3.34 F
1 @ 3 FOR 10.00
You Saved 0.15
GINGER ROOT 0.65 F
0.13 lb @ 4.99/ lb
POTATOES RUSSET 0.84 F
0.65 lb @ 1.29/ lb
POTATOES SWEET 0.49 F
0.49 lb @ 0.99/ lb
DELECT BSQUE CK/TN 10.99 T
FS OUTSTRETCH UNSC 15.99 T
Order Total 65.36
Sales Tax 1.89
Grand Total 67.25
Credit Payment 67.25
Change 0.00
Savings Summary
Special Price Savings 7.07
************************************************************
* Your Savings at Publix *
* 7.07 *
************************************************************
Receipt ID: 5957 6249 2191 1277 712
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
PRESTO!
Trace #: 766630
Reference #: 0098440513
Acct #: XXXXXXXXXXXX2034
Purchase VISA
[1]: https://i.pinimg.com/originals/41/08/dc/4108dcf51f15af464bb6...
by magicalhippo
5/8/2026 at 11:44:11 AM
What is the difference between this and using normal OCR and then running that output through a LLM? It seems such a bazooka way to kill a fly to me using a modelime Qwen.by sebazzz
5/8/2026 at 9:57:07 PM
For this example, perhaps not much, but a similar query also works. given this image[0] I queried qwen3.5-35b-a3b "where is waldo", and after thinking some got this back:---
Waldo is located in the *center* of the image.
Here is how to spot him:
1. Look towards the middle of the beach scene.
2. Find the *red and white striped lifeguard stand/booth* on the right side of the center area.
3. Just to the left of that booth, there is a man standing up wearing his signature *red and white striped shirt*, blue jeans, and red hat with a pom-pom. He also has a camera around his neck.
---
Which is not generally something OCR can do. If you instead ask "how many horses are in this image?" you get this response:
---
Based on a careful look at the image, there are *3* equine animals (which appear to be donkeys or ponies used for beach rides).
1. One is near the top left, being ridden by someone in yellow shorts.
2. Right next to it is another one, being ridden by someone in blue and yellow.
3. On the far right edge of the image (near the water), there is a third grey animal being ridden by someone wearing a hat.
--
Now is this all anything you can't do with more boring machine learning? sure, but there's something incredibly convenient about how generic LLMs are. You don't need to train anything, just point the LLM at an image and ask.
[0] https://i.pinimg.com/originals/18/64/44/1864444c819a7adae742...
by Doxin
5/8/2026 at 12:17:22 PM
For most tasks I agree. However once you've done your OCR you already have lost a lot of positional and context information, so for some tasks it might not be good enough.If you have scanned PDFs that follow a template, like an invoice from a repeat supplier, then yeah OCR is definitely the way to go.
by magicalhippo
5/7/2026 at 3:43:11 PM
You can use modern off-the-shelf models for those types of tasks, however a smaller-but-bespoke model will usually be more cost-efficient if used at scale.by minimaxir
5/7/2026 at 4:30:36 PM
And smaller bespoke models running locally are better for regulated workflows (healthcare, banking etc) as wellby najarvg
5/7/2026 at 6:35:18 PM
I think nowadays a lot of models are trained more at doing this than at knowing things, while being smaller. So I’d say yes!At least that’s my impression.
by jiehong