alt.hn

5/11/2026 at 4:22:31 PM

Interfaze: A new model architecture built for high accuracy at scale

https://interfaze.ai/blog/interfaze-a-new-model-architecture-built-for-high-accuracy-at-scale

by yoeven

5/11/2026 at 8:52:05 PM

Amazing!

I just tried the OCR capabilities with a photo of a DIN A4 page that was written on a typewriter. The image isn't the easiest to interpret: the text perspective is distorted because the page is part of a book and the margin toward the spine is very small. There are also many inline corrections due to typing errors made while the page was written (backspace couldn't erase characters back then, and arrow keys couldn't be used to insert text between existing words). Over the past months I've already tried several LLMs on this very same image (1 out of 200 pages awaiting digitization). This result is by far the most accurate so far. Only some very minor errors were made (ones that are also non-trivial for human transcribers).

This page cost about 25 cents to process. I assume I could tweak the input image a little more to consume fewer input tokens. OCR-ing all 200 pages would otherwise cost a juicy $50, although there is a generous $20 of free credits.

Induced cost: 108.8k input tokens => 16.32 cents; 24.5k output tokens => 8.58 cents.
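For the curious, the implied per-token rates can be back-calculated from those numbers (my own arithmetic, not published pricing):

    # Back-calculating implied rates from the figures above; an inference
    # from my invoice, not published pricing.
    input_tokens, input_cost = 108_800, 0.1632    # USD
    output_tokens, output_cost = 24_500, 0.0858   # USD

    print(f"~${input_cost / input_tokens * 1e6:.2f} per 1M input tokens")    # ~$1.50
    print(f"~${output_cost / output_tokens * 1e6:.2f} per 1M output tokens") # ~$3.50

    per_page = input_cost + output_cost
    print(f"~${per_page:.2f}/page, ~${per_page * 200:.0f} for all 200 pages") # ~$0.25, ~$50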

// Edit: I just re-tried the same task using a capability of the API that runs only a specific part of the model (e.g. _only_ OCR). This cuts cost by 3x (to ~8 cents/page) but significantly worsens the result: entire lines of the original document are missing, and there are also many errors in the text that was recognized.

by schanz

5/11/2026 at 11:22:54 PM

Yup, run-task mode runs a much smaller part of the model, which can drop quality on harder scans. The issue we have to figure out with run task is how much of the model is needed just for OCR and how to activate the right parts. A lot more improvements are coming here at the same cost reduction.

I'd be happy to test it against your sample and see how we can get good results at a lower per-page cost. Feel free to email me at yoeven@interfaze.ai.

by yoeven

5/11/2026 at 9:55:18 PM

Have you tried this task using an actual OCR model like Google Cloud Vision AI? I'm not sure if this is what Gemini uses under the hood, but multi-modal LLMs aren't designed to extract text like this, so it should be no surprise that they're not great at it.

by AnthonyR

5/12/2026 at 8:25:59 PM

I don't think I've tried Google Cloud Vision on that particular image, no. In my experience, based on some tests from a year ago or so, Azure Document Intelligence impressed me the most in terms of OCR - out of the big three players: GCP, AWS and Azure.

I should retry the experiment because there has been a lot of progress since then, and I could imagine that GCP has improved their vision models.

by schanz

5/11/2026 at 11:27:09 PM

Google Cloud Vision AI is a specialized model built on CNN frameworks; that kind of specialized model is one component of the Interfaze architecture, which is a hybrid, so you get the best of both worlds. Google Cloud Vision was pretty far behind other specialized models like PaddleOCR anyway, so if you're looking for a pure CNN, check those out.

You can find the explanation and the comparison in the article, where we benchmarked pure CNN models, pure LLMs, and a hybrid architecture like ours.

by yoeven

5/12/2026 at 8:45:48 AM

New account created ~5 hours after this post, with a single comment specifically praising the model / product. I want to believe, but this sort of astroturfing isn't very encouraging.

by woadwarrior01

5/12/2026 at 8:21:53 PM

I totally understand, and I can't blame you for that; I wouldn't think otherwise myself. I'm a long-time follower of YC but have never posted any comments. I wanted to share that experience, which is the reason I created the account. I don't know how I can prove to you that I'm a legitimate person with _no_ affiliation whatsoever with Interfaze. I can only ask you to try it out for yourself. I was genuinely impressed by the results.

by schanz

5/11/2026 at 6:40:36 PM

Potentially stupid question: does that mean we can chain them together like UNIX command-line programs? That would be so, so intuitive.

by euroderf

5/12/2026 at 11:32:10 AM

Gave it a try for structured data extraction. Tested returning a JSON object from images.

The output was correct, and seemed deterministic, although I ran it only 2-3 times on the same image.

The main problem is response time: it took about 20-25 seconds for a simple structure of 5 fields. As such, it's unusable at scale, let alone for "real time" processing.

The other problem is cost: it is considerably more expensive than more established models, like Flash-Lite, for the same document.

Shame, the architecture is very interesting.

by nickserv

5/13/2026 at 1:06:02 AM

Thanks for the feedback!

We're working a lot more on speed in the coming few weeks :) More GPUs and more optimizations.

Our focus has been on quality of output first, and we'll make optimizations as we grow :)

The lite models are great for simple use cases but won't do well in more complex OCR use cases.

by yoeven

5/12/2026 at 12:32:56 AM

Ok that's... just cheating. You can't take a benchmark like MMLU, which was designed to test the performance of a single general language model, and compare it to the performance of a small specialized model designed to do well on MMLU.

by gok

5/12/2026 at 1:28:21 AM

It wasn't designed to do well on MMLU; it's a general model designed for deterministic tasks like OCR, object detection, STT, and more, and a byproduct of that is strong language ability. It still has a transformer backbone, giving it good language skills while being good at other stuff.

See the full benchmark: https://interfaze.ai/leaderboards

by yoeven

5/11/2026 at 5:42:46 PM

> These are deep neural network architectures that are task-specific for things like OCR, translation, or GUI detection. The way they consume and see data is trained to be task specific, which makes them up to 100x more accurate at their specific task. They also produce useful metadata like bounding boxes and confidence scores, letting developers build predictable workflows they can rely on.

Does code extraction and manipulation fit in that? Would interfaze be the agent that a coding agent uses?

by wood_spirit

5/11/2026 at 7:40:48 PM

Code extraction, maybe - not something we have tested or built for, but you could give it a try.

Code manipulation, probably not, since it's a much smaller model compared to something like Claude Opus, which is SOTA for code generation/manipulation.

Generally, code generation is a non-deterministic task by nature, and general LLMs tend to be better at it.

by yoeven

5/11/2026 at 8:02:37 PM

The idea of what to change is perhaps an LLM task, but the job of doing the find-and-replace and that kind of tooling is something LLMs actually struggle with - coding agents have all kinds of crutches and try/retry loops to paper over it.

by wood_spirit

5/12/2026 at 6:13:35 AM

Interesting approach! One question though: can the model do column detection?

The first OCR example returns output that does not detect the article columns - the bounding box is the entire first line.

by bazzmt

5/12/2026 at 7:10:57 AM

It can. You could try prompting the model to use object-detection vision together with text extraction. We realized that when we purely extract text, it does amazingly well at word/sentence-level bounds since the text acts as the anchor. However, when you treat it as an object detection problem, it sees that chunk of text as one segment, allowing you to extract it as a single column bound. Give that a try.
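Something along these lines (the endpoint and field names here are hypothetical, purely to illustrate the prompting idea - the real request shape is in the docs):

    import requests

    # Hypothetical request shape, purely to illustrate the idea above;
    # the actual endpoint and fields are documented at interfaze.ai/docs.
    resp = requests.post(
        "https://api.interfaze.ai/v1/run",  # placeholder endpoint
        headers={"Authorization": "Bearer <API_KEY>"},
        json={
            "image_url": "https://example.com/newspaper_scan.jpg",
            # Treat columns as objects first, then extract text per segment:
            "prompt": "Detect each text column as a separate object, return "
                      "its bounding box, then extract the text inside each box.",
        },
    )
    print(resp.json())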

by yoeven

5/11/2026 at 7:44:54 PM

This is very cool, though I don't understand exactly what they've done here. Is it some kind of LLM with convolutional layers added?

The graph doesn't exactly make it clear but it describes a pipeline that goes beyond the LLM, so the CNN could be a separate model there.

by andai

5/11/2026 at 9:53:44 PM

Here’s the academic paper behind it: https://arxiv.org/abs/2602.04101

by tomsyouruncle

5/12/2026 at 5:29:15 PM

Thanks. Well this is fascinating.

>Instead of a single transformer, we combine (i) a stack of heterogeneous DNNs paired with small language models as perception modules

It seems that we're reinventing the brain's organs one by one from first principles. (Though Transformer + Common Crawl unintentionally builds a whole bunch of them we don't even understand yet.)

I found some broader context and the whole thing is indeed very harness-shaped:

>Using Interfaze as a Tool Inside Your Agent

https://interfaze.ai/blog/using-interfaze-as-a-tool-inside-y...

Well, Harness is the wrong word here... "environment/tools the LLM interacts with" definitely fits though. Or "other organoid" to use the previous metaphor.

by andai

5/13/2026 at 1:12:33 AM

Yup, it really does depend on the use case.

We see two types: workflows & agents.

Workflows are the most common: there's a pipeline, like processing loan documents before the data gets loaded into the next step, or translating user comments before they're stored in the database.

Agents are where you have a chat-based system, or a brain of sorts, that calls many tools to achieve a user's goal. The model doing this is a lot better at non-deterministic tasks, and it delegates to Interfaze for specific deterministic actions like OCR or web extraction, then consumes that data. That's the article you referenced :)
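As a rough illustration of that delegation, here's a generic OpenAI-style function-calling schema (the tool name and parameters are made up for illustration, not an actual Interfaze tool spec):

    # Generic OpenAI-style tool schema illustrating the delegation pattern:
    # the agent model plans, then hands deterministic steps to a tool.
    # Tool name and parameters are illustrative, not an official spec.
    tools = [{
        "type": "function",
        "function": {
            "name": "interfaze_ocr",  # hypothetical tool name
            "description": "High-accuracy OCR on a document image; returns "
                           "text, bounding boxes, and confidence scores.",
            "parameters": {
                "type": "object",
                "properties": {"image_url": {"type": "string"}},
                "required": ["image_url"],
            },
        },
    }]
    # The agent receives `tools`, decides when to call interfaze_ocr,
    # and consumes the structured result before continuing.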

by yoeven

5/11/2026 at 5:32:45 PM

Smaller models really aren't great at structured output. If this works, it would be great: a local model might not be as good overall, but as long as it respects structured output, it will be vastly more useful.

by sareiodata

5/11/2026 at 6:44:47 PM

> Smaller models really arent great at structured output.

That doesn't seem to hold true. Consider gpt-5.4-nano, which supports structured output just fine.

https://developers.openai.com/api/docs/models/gpt-5.4-nano

It seems like a concern that's orthogonal to the model size.
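For example, a minimal structured-output call pinned to a JSON schema (the schema and fields here are illustrative; see the docs linked above for the exact parameters):

    from openai import OpenAI

    client = OpenAI()

    # Ask for output constrained to a JSON schema. The schema below is
    # illustrative; the point is that even a nano-sized model honors it.
    resp = client.chat.completions.create(
        model="gpt-5.4-nano",
        messages=[{"role": "user",
                   "content": "Extract the invoice number and total from: ..."}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "invoice",
                "schema": {
                    "type": "object",
                    "properties": {
                        "invoice_number": {"type": "string"},
                        "total": {"type": "number"},
                    },
                    "required": ["invoice_number", "total"],
                    "additionalProperties": False,
                },
                "strict": True,
            },
        },
    )
    print(resp.choices[0].message.content)  # guaranteed to match the schema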

by OutOfHere

5/11/2026 at 7:04:46 PM

I genuinely doubt that they are just lying though lol

by nosyke

5/11/2026 at 8:05:59 PM

So is this basically a task-specific MoA transformer arch with a DNN that helps make routing decisions? Trying to understand this.

by fraywing

5/11/2026 at 11:31:49 PM

The other way round: task-specific DNNs adapted to share the same vector space as an omni-transformer with generalized vision and audio encoders.

E.g. for an OCR task, the first pass is handled by the CNN and converted to shared tokens that the transformer can consume and correct if needed, with a decoder that can handle both the DNN and transformer output.
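A toy PyTorch sketch of that flow, as I understand it from this thread (all modules and shapes are illustrative stand-ins, nowhere near the real architecture):

    import torch
    import torch.nn as nn

    # Sketch of the hybrid idea: a task-specific CNN does the first pass,
    # its features are projected into the transformer's token space, and
    # the transformer consumes/corrects them before decoding.
    class HybridOCR(nn.Module):
        def __init__(self, d_model=512, vocab=32000):
            super().__init__()
            self.cnn = nn.Sequential(  # task-specific perception module
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.project = nn.Linear(128, d_model)  # CNN features -> shared token space
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=4)
            self.decoder = nn.Linear(d_model, vocab)  # handles both sources

        def forward(self, image):
            feats = self.cnn(image)                    # (B, 128, H', W')
            tokens = feats.flatten(2).transpose(1, 2)  # (B, H'*W', 128)
            shared = self.project(tokens)              # (B, seq, d_model) shared space
            corrected = self.transformer(shared)       # transformer refines/corrects
            return self.decoder(corrected)             # per-token logits

    logits = HybridOCR()(torch.randn(1, 3, 64, 256))
    print(logits.shape)  # torch.Size([1, 1024, 32000])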

by yoeven

5/11/2026 at 10:43:51 PM

Can this run locally or is this a service?

by jadbox

5/11/2026 at 11:27:50 PM

It's a service API, but we do have on-prem deployment in certain regions for enterprises.

by yoeven

5/11/2026 at 5:30:06 PM

This is cool, I'd love to be able to fine-tune on this architecture. Is that on the roadmap at some point?

by sweaterkokuro

5/11/2026 at 7:47:18 PM

It isn't on our roadmap right now, since in most cases it should work out of the box, and if it doesn't, we'll generally work with you to train that capability into the model.

However, if enough people have something super niche that our model can't handle, we might start considering a fine-tuning service.

by yoeven

5/11/2026 at 8:12:33 PM

What I want are precise and tight bounding boxes. Why is this so difficult?

by florians

5/11/2026 at 8:35:34 PM

The PP-DocLayoutV3 [1] bounding boxes are pretty good in my experience, if you want boxes around individual document headings or paragraphs. If you want boxes around individual words, similar to what's shown in the Interfaze screenshot [2], Apple has a LiveText "token" model that's proprietary but free/bundled with macOS and iOS. There are easy-to-use Python bindings here: https://github.com/straussmaximilian/ocrmac
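A minimal ocrmac call looks roughly like this (from memory of the project README, so treat it as a sketch):

    from ocrmac import ocrmac

    # Runs Apple's bundled Vision OCR via the ocrmac bindings (macOS only).
    # recognize() returns (text, confidence, bounding_box) tuples, with
    # boxes in normalized coordinates.
    annotations = ocrmac.OCR("typewritten_page.png").recognize()
    for text, confidence, bbox in annotations:
        print(f"{confidence:.2f} {bbox} {text}")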

I presume that some otherwise-great OCR models (like Chandra) have terrible bounding boxes because generating good bounding boxes just wasn't a training priority. A lot of people are using OCR models to bulk-process documents without a lot of care for how the layout is preserved. It matters a lot if (e.g.) you want to be able to update and re-print old documents, but it doesn't matter if you are just transcribing whole documents for indexing/chunking/translation.

[1] https://huggingface.co/PaddlePaddle/PP-DocLayoutV3

[2] https://r2public.jigsawstack.com/interfaze/examples/dense_te...

by philipkglass

5/11/2026 at 11:39:05 PM

For sure, there are tons of OCR bounding-box models and tons of other models like SAM 3 for segmentation.

Interfaze is a more powerful version of them combined into a single model; you can run multi-turn tasks like extracting all the text and objects from a document, then translating them or generating a report.

It's like getting the best of both worlds: pure DNN/CNN models like Paddle on one side, and the flexibility and nuance of an LLM on the other, while outperforming both in accuracy.

by yoeven

5/11/2026 at 8:07:23 PM

Great in the benchmarks but not as good in the real world, sorry to say. Just gave it a try in my STT bot; it's worse than Whisper.

by icemaze

5/12/2026 at 3:19:34 AM

Does it handle source code extraction from images?

How do I run it locally?

by vivzkestrel

5/12/2026 at 7:12:25 AM

Yeah, it would treat it like an OCR task and extract it; you could prompt it to format the output better with the code alignment preserved.

We serve it through an API. Check out the docs: https://interfaze.ai/docs

It's free to get started.

by yoeven

5/11/2026 at 7:55:54 PM

Similar to a large action model?

by redwood

5/12/2026 at 1:32:25 AM

Not directly. LAMs tend to be focused heavily on tool calling, or trained for a specific set of actions, for example in the robotics field. Good tool calling might be a byproduct of Interfaze, but it wasn't specifically trained for that use case.

The focus has been on deterministic outputs that require high accuracy - situations where there is "one right answer".

by yoeven

5/11/2026 at 5:27:44 PM

[flagged]

by a7om_com

5/11/2026 at 11:06:44 PM

[dead]

by qzgrid37