5/11/2026 at 8:52:05 PM
Amazing! I just tried the OCR capabilities with a photo of a DIN A4 page that was typed on a typewriter. The image isn't the easiest to interpret: the perspective is distorted because the page is part of a book and the margin toward the spine is very small. There are also many inline corrections due to typing errors made while the page was written (backspace couldn't erase characters back then, and arrow keys couldn't be used to insert text between existing words). Over the past months I've already tried several LLMs on this very same image (1 of 200 pages awaiting digitization). This result is by far the most accurate so far. Only some very minor errors were made, and those are also non-trivial for human transcribers.
This page cost about 25 cents. I assume I could tweak the input image a bit more to consume fewer input tokens. OCR-ing all 200 pages would otherwise cost a juicy $50, although there is a generous $20 of free credits.
Induced cost:
108.8k input tokens => 16.32 cents
24.5k output tokens => 8.58 cents
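As a sanity check on the figures above, the per-million-token rates below are *derived* from the quoted numbers (16.32 cents for 108.8k input tokens, 8.58 cents for 24.5k output tokens), not taken from any official price list:

```python
# Back-of-the-envelope check of the per-page OCR cost quoted above.
# Rates are implied by the quoted figures, not official pricing.
INPUT_RATE_USD_PER_M = 1.50   # implied: $0.1632 / 0.1088M input tokens
OUTPUT_RATE_USD_PER_M = 3.50  # implied: $0.0858 / 0.0245M output tokens

def page_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one page in USD, given token counts and the implied rates."""
    return (input_tokens * INPUT_RATE_USD_PER_M
            + output_tokens * OUTPUT_RATE_USD_PER_M) / 1_000_000

cost = page_cost_usd(108_800, 24_500)
print(f"per page: ${cost:.2f}, 200 pages: ${200 * cost:.2f}")
# ~25 cents per page, matching the ~$50 estimate for 200 pages
```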
// Edit: I just re-tried the same task using a capability of the API to run only a specific part of the model (e.g. _only_ OCR). This cuts the cost by 3x (to ~8 cents/page) but significantly worsens the result: entire lines of the original document are missing, and there are also many errors in the text that was recognized.
by schanz
5/11/2026 at 11:22:54 PM
Yup, run task mode runs a much smaller part of the model, which can drop the quality of scans. The issue with run task that we have to figure out is how much of the model is needed just for OCR and how to activate the right parts. A lot more improvements are coming here at the same cost reduction. I'd be happy to test it against your sample and see how we can get good results at a lower per-page cost. Feel free to email me at yoeven@interfaze.ai
by yoeven
5/11/2026 at 9:55:18 PM
Have you tried this task with an actual OCR model like Google Cloud Vision AI? I'm not sure if that is what Gemini uses under the hood, but multi-modal LLMs are not designed to extract text like this, so it should be no surprise that they're not good at it.
by AnthonyR
5/12/2026 at 8:25:59 PM
I don't think I've tried Google Cloud Vision on that particular image, no. In my experience, based on some tests from about a year ago, Azure Document Intelligence impressed me the most in terms of OCR out of the big three players (GCP, AWS, and Azure). I should retry the experiment, because there has been a lot of progress since then and I could imagine that GCP has improved their vision models.
by schanz
5/11/2026 at 11:27:09 PM
Google Cloud Vision AI is a specialized model built on CNN frameworks. A CNN stage is also part of the Interfaze architecture, which is a hybrid, so you get the best of both worlds. Google Cloud Vision was pretty far behind other specialized models like PaddleOCR anyway, so if you're looking for a pure CNN, check those out. You can find the explanation and the comparison in the article, where we benchmarked pure CNN models, pure LLM models, and a hybrid architecture like ours.
by yoeven
5/12/2026 at 8:45:48 AM
New account created ~5 hours after this post, with a single comment specifically praising the model/product. I want to believe, but this sort of astroturfing isn't very encouraging.
by woadwarrior01
5/12/2026 at 8:21:53 PM
I totally understand, and I can't blame you for that; I wouldn't think otherwise. I am a long-time follower of YC but have never posted any comments. I wanted to share that experience, which is why I created the account. I don't know how I can prove to you that I am a legitimate person with _no_ affiliation whatsoever with Interfaze. I can only ask you to try it out for yourself. I was genuinely impressed by the results.
by schanz