Nvidia-Ingest: Multi-modal data extraction

1/10/2025 at 12:47:20 PM

Is there an OCR toolkit or an ML model that can reliably extract tables from invoices?

by hammersbald

1/10/2025 at 2:58:40 PM

By far the best one I've come across is Microsoft Azure Document Intelligence with the Layout Model[0].

It's really, really good at tables.

You have to use the Layout Model and not just the base Document Intelligence.

A bit pricey, but if you're processing content one time and it's high value (my use case was clinical trial protocol documents and the trial will run anywhere from 6-24 months), then it's worth it, IMO.

[0] https://learn.microsoft.com/en-us/azure/ai-services/document...

by CharlieDigital

1/10/2025 at 1:04:58 PM

All frontier multi modal LLMs can do this - there’s likely something lighter weight as well.

In my experience, the latest Gemini is best at vision and OCR

by benpacker

1/10/2025 at 2:48:05 PM

> All frontier multi modal LLMs can do this

There's reliable, and there's reliable. For example [1] is a conversation where I ask ChatGPT 4o questions about a seven-page tabular PDF from [2] which contains a list of election polling stations.

The results are simultaneously impressive and unimpressive. The document contains some repeated addresses, and the LLM correctly identifies all 11 of them... then says it found ten.

It gracefully deals with the PDF table, and converts the all-caps input data into Title Case.

The table is split across multiple pages, and the title row repeats each time. It deals with that easily.

It correctly finds all five schools mentioned.

When asked to extract an address that isn't in the document it correctly refuses, instead of hallucinating an answer.

When asked to count churches, "Bunyan Baptist Church" gets missed out. Of two church halls, only one gets counted.

The "Friends Meeting House" also doesn't get counted, but arguably that's not a church even if it is a place of worship.

Longmeadow Evangelical Church has one address, three rows and two polling station numbers. When asked how many polling stations are in the table, the LLM counts that as two. A reasonable person might have expected one, two, three, or a warning. If I was writing an invoice parser, I would want this to be very predictable.

So, it's a mixed bag. I've certainly seen worse attempts at parsing a PDF.

[1] https://chatgpt.com/share/67812ad9-f2bc-8011-96be-faea40e48d...

[2] https://www.stevenage.gov.uk/documents/elections/2024-pcc-el...

by michaelt

1/10/2025 at 10:34:51 PM

You can try asking it to list all the churches and assign them incremental numbers starting from 1, then print the last number. It's a variation of counting the 'r's in 'raspberry', and it works better than a simple direct question.
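A minimal sketch of that idea, assuming the model answers with one item per line in a `1. Name` style. The sample response below is made up for illustration and is not real model output; `parse_numbered_list` is a hypothetical helper. The count comes from the parsed list rather than from the model's own arithmetic:

```python
import re


def parse_numbered_list(response_text: str) -> list[str]:
    """Return the items of a '1. foo' or '2) bar' style list, in order."""
    items = []
    for line in response_text.splitlines():
        match = re.match(r"^\s*(\d+)[.)]\s+(.+?)\s*$", line)
        if match:
            items.append(match.group(2))
    return items


# Hypothetical model response, for illustration only.
sample_response = """Here are the churches:
1. Bunyan Baptist Church
2. Longmeadow Evangelical Church
Total: 2
"""

churches = parse_numbered_list(sample_response)
print(len(churches))  # count the parsed items; ignore the model's "Total" line
```

Lines that do not start with a number (the intro and the model's own total) are ignored, so a wrong total from the model does not affect the result.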

by numba888

1/10/2025 at 7:32:36 PM

> There's reliable, and there's reliable. For example [1] is a conversation where I ask ChatGPT 4o questions about a seven-page tabular PDF from [2] which contains a list of election polling stations.

From your description, it does perfectly at the task asked about upthread (extraction) and has mixed results on other, question-answering, tasks, that weren't the subject.

by dragonwriter

1/10/2025 at 9:00:32 PM

> From your description, it does perfectly at the task asked about upthread (extraction) and has mixed results on other, question-answering, tasks, that weren't the subject.

¯\_(ツ)_/¯

Which do you think was which?

by michaelt

1/10/2025 at 6:24:16 PM

Do I understand correctly that nearly all the issues were related to counting (i.e. numerical operations)? That still makes it impressive, because you can do the counting client-side with the structured data.
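A rough sketch of what client-side counting could look like, assuming the extraction step produced CSV. The column names, station numbers, and addresses are placeholders, shaped like the "one address, three rows, two station numbers" case described upthread. Counting the table three ways makes the chosen definition explicit instead of leaving it to the model:

```python
import csv
import io

# Hypothetical extracted rows; station numbers and addresses are placeholders.
EXTRACTED_CSV = """station_number,venue,address
101,Longmeadow Evangelical Church,Address A
101,Longmeadow Evangelical Church,Address A
102,Longmeadow Evangelical Church,Address A
103,Friends Meeting House,Address B
"""


def count_polling_stations(rows):
    """Count the same rows three ways so the definition in use is explicit."""
    return {
        "rows": len(rows),
        "station_numbers": len({row["station_number"] for row in rows}),
        "addresses": len({row["address"] for row in rows}),
    }


rows = list(csv.DictReader(io.StringIO(EXTRACTED_CSV)))
longmeadow = [r for r in rows if r["venue"] == "Longmeadow Evangelical Church"]

print(count_polling_stations(longmeadow))
print(count_polling_stations(rows))
```

Whichever definition is right for the task, the answer is deterministic once the rows are extracted correctly.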

by NeedMoreTime4Me

1/10/2025 at 9:29:44 PM

Some would say the numerical information is among the most important parts of an invoice.

by michaelt

1/10/2025 at 3:20:50 PM

I wonder if performance would improve if you asked it to create csvs from the tables first, then fed the CSVs in to a new chat?

by philomath_mn

1/10/2025 at 3:52:53 PM

https://github.com/microsoft/table-transformer

This is much lighter weight and more reliable than vllm

by ttt3ts

1/10/2025 at 4:35:50 PM

As someone that spent quite a bit of time with table-transformers, I would definitely not recommend it. It was one of the first libraries we added for parsing tables into our chunking library [1] and the results were very underwhelming. This was a while back and at this point, it's just so much easier to use an LLM end to end for parsing docs (Gemini Flash can parse 20k pages per dollar) and I'm wary of any approach that stitches together different models.

[1] https://github.com/Filimoa/open-parse/

by serjester

1/11/2025 at 6:05:54 PM

Do you have some benchmark results I can look at that compares results?

by ttt3ts

1/10/2025 at 8:52:09 PM

I would like to throw our project into the ring. We use ColQwen2 over a ColPali implementation. Basically, it's a search & extract pipeline: https://docs.colivara.com/guide/markdown

by jonathan-adly

1/11/2025 at 1:59:06 PM

Surya is a great open source toolkit for table parsing, layout analysis and OCR: https://github.com/VikParuchuri/surya

by m_ke

1/10/2025 at 12:03:45 PM

Ah so like NIM is a set of microservices on top of various models, and this is another set of microservices using NIM microservices to do large scale OCR?

and that too integrated with prometheus, 160GB VRAM requirement and so on?

Looks like this is targeted for enterprises or maybe governments etc trying to digitalize at scale.

by ixaxaar

1/10/2025 at 11:10:43 AM

I have a hard time understanding what they mean by "early access micro services"...?

Does it mean that it is yet another wrapper library to call their proprietary cloud API?

Or that when you have the specific access right, you can retrieve a proprietary Docker image with secret proprietary binaries inside that will be the server used by the library available on GitHub?

by greatgib

1/10/2025 at 12:08:43 PM

The latter. NIMs is Nvidia's umbrella branding for proprietary containerized AI models, which is being pushed hard by Jensen. They build models and containers, then push them to ngc.nvidia.com. They then provide reference architectures which rely on them. In this case the images are in an invite only org, so to use the helm chart you have to sign up, request access, then use an API key to pull the image.

You can imagine how fun it is to debug.

by theossuary

1/10/2025 at 11:53:36 PM

How is this different than elasticsearch and solr? That’s not any kind of challenging question… I really don’t know that much about these different tools and I just want to know what this one is about.

Also: I noticed that it mentioned images… does it do any kind of OCR or summary of them?

by joeevans1000

1/11/2025 at 1:38:21 AM

It is a method of extracting structured data from messy documents meant for human consumption that can then be indexed by tools like Elasticsearch and solr.

by UltraSane

1/10/2025 at 8:49:38 PM

Before you get too excited, this needs two A100s or H100s minimum.

by PeterStuer

1/10/2025 at 9:59:06 PM

GH200 $1.49 / GPU / hr

https://lambdalabs.com/nvidia-gh200

by alecco

1/11/2025 at 12:45:18 AM

yes, that's the whole idea, they want you to rent.

by numba888

1/10/2025 at 7:27:01 PM

This requires Nvidia GPUs to run.

The open question is whether to use rule-based parsing using simpler software or model-based parsing using this software.
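For comparison, a minimal sketch of the rule-based side, assuming the invoice text has already been extracted and the columns are separated by two or more spaces. The layout, the field names, and the `parse_line_items` helper are hypothetical, and real invoices are usually messier than this:

```python
import re
from decimal import Decimal, InvalidOperation


def parse_line_items(text: str) -> list[dict]:
    """Parse 'description  qty  unit price  total' rows split by 2+ spaces."""
    items = []
    for line in text.splitlines():
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) != 4:
            continue  # not a table row
        description, qty, unit_price, total = cols
        try:
            items.append({
                "description": description,
                "qty": int(qty),
                "unit_price": Decimal(unit_price),
                "total": Decimal(total),
            })
        except (ValueError, InvalidOperation):
            continue  # header row or non-numeric cell
    return items


# Hypothetical invoice text, for illustration only.
SAMPLE = """
Description      Qty  Unit price  Total
Widget A         2    9.99        19.98
Service fee      1    40.00       40.00
"""

items = parse_line_items(SAMPLE)
print(sum(item["total"] for item in items))
```

This runs on a CPU and is fully predictable, but it breaks as soon as the layout changes, which is the trade-off against model-based parsing.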

by OutOfHere

1/10/2025 at 6:52:18 PM

So who is going to deploy this and turn this into a service/API?

by lyime

1/11/2025 at 1:36:04 AM

What is the effective $/document of this method?

by UltraSane

1/10/2025 at 7:16:52 PM

Is this like Nvidia's version of MCP? (https://modelcontextprotocol.io/introduction)

by wiradikusuma

1/10/2025 at 7:25:22 PM

No relation.

by OutOfHere

1/10/2025 at 11:15:18 AM

lol, while checking which OCR it is using (PaddleOCR) I found a line with the text: "TODO(Devin)" and was pretty excited thinking they were already using Devin AI...

"Devin Robison" is the author of the package!! Funny, guess it will be similar with the name Alexa

by joaquincabezas

1/10/2025 at 11:09:31 AM

Sounds pretty useful. What are the system requirements?

  Prerequisites
  Hardware
  GPU    Family        Memory   # of GPUs (min.)
  H100   SXM or PCIe   80GB     2
  A100   SXM or PCIe   80GB     2

Hmm, perhaps this is not for me.

by vardump

1/10/2025 at 4:18:11 PM

Seems pretty ridiculous to me to parse some PDFs. Almost like they made this as bloated as possible to justify buying $5,000+ GPUs for an office.

by neuroelectron

1/10/2025 at 5:11:45 PM

I think those GPUs cost between $25-40k each.

by vardump

1/10/2025 at 8:58:02 PM

Why even buy them at this point... just rent neocloud for $1-2... even at $2/hr, that's over a year of rental for $25k... by then you'd have made your money off the implementation.
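A back-of-the-envelope sketch of that arithmetic, assuming one GPU rented 24/7 at a flat hourly rate against a $25k purchase price. It ignores power, hosting, and resale value, and `breakeven_years` is a hypothetical helper name:

```python
HOURS_PER_YEAR = 24 * 365


def breakeven_years(purchase_price_usd: float, hourly_rate_usd: float) -> float:
    """Years of continuous rental that cost as much as buying the GPU."""
    return purchase_price_usd / hourly_rate_usd / HOURS_PER_YEAR


for rate in (1.0, 2.0, 3.0):
    years = breakeven_years(25_000, rate)
    print(f"${rate:.2f}/hr -> {years:.2f} years")
# $1.00/hr -> 2.85 years
# $2.00/hr -> 1.43 years
# $3.00/hr -> 0.95 years
```

Since the stated minimum is two GPUs, both the purchase price and the hourly cost double, so the break-even time stays the same.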

by latchkey

1/10/2025 at 10:15:11 PM

Not sure whether I'd like to send potentially sensitive documents to a lesser-known provider. Or even to a well-known one.

by vardump

1/10/2025 at 10:33:57 PM

Even at $3/hour (which is above the current market rate), that's roughly a year.

I genuinely appreciate your perspective, but as a smaller, lesser-known provider, I’d like to understand your concerns better.

Are you worried that I might misuse your data and compromise my entire business, by selling it to the highest bidder? Do you feel uncertain about the security of my systems? Or is it a belief that owning and managing the hardware yourself gives you greater control over security?

What kind of validation or reassurance would help address these concerns?

by latchkey

1/12/2025 at 7:50:52 AM

The main issue is a lot of smaller providers are clearly incentivized to get into the industry simply for the opportunity to eavesdrop into what other companies are doing. If you want to confuse or cloud this perceived motivation, also provide security services.

by neuroelectron

1/12/2025 at 6:18:51 PM

This is why we have a clearly defined shared responsibility model.

https://hotaisle.xyz/shared-responsibility-model/

I'm not sure what you mean by "security services"? Can you please expand on that?

by latchkey

1/10/2025 at 11:39:12 AM

Wow, I perhaps need a kubernetes cluster just for a demo:

    CONTAINER ID   IMAGE
    0f2f86615ea5   nvcr.io/ohlfw0olaadg/ea-participants/nv-ingest:24.10
    de44122c6ddc   otel/opentelemetry-collector-contrib:0.91.0
    02c9ab8c6901   nvcr.io/ohlfw0olaadg/ea-participants/cached:0.2.0
    d49369334398   nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.1.0
    508715a24998   nvcr.io/ohlfw0olaadg/ea-participants/nv-yolox-structured-images-v1:0.2.0
    5b7a174a0a85   nvcr.io/ohlfw0olaadg/ea-participants/deplot:1.0.0
    430045f98c02   nvcr.io/ohlfw0olaadg/ea-participants/paddleocr:0.2.0
    8e587b45821b   grafana/grafana
    aa2c0ec387e2   redis/redis-stack
    bda9a2a9c8b5   openzipkin/zipkin
    ac27e5297d57   prom/prometheus:latest

by shutty

1/10/2025 at 1:38:48 PM

It may be the least of your worries, considering it requires 2x [A/H]100 with 80GB of VRAM each.

by fsniper

1/10/2025 at 1:36:54 PM

You can just use k3s/rke2 and run everything on the same node.

by threeseed

1/10/2025 at 5:33:09 PM

You can run vanilla k8s on a single node too

by verdverm

1/10/2025 at 4:17:27 PM

Also, they're rolling the dice continuing to use Redis https://github.com/redis/redis/blob/21aee83abdbfe8878d8b870b...

by mdaniel

1/10/2025 at 4:37:57 PM

You think there is a risk of them pivoting from this project to providing redis as a service?

by mirekrusin

1/10/2025 at 12:05:51 PM

Nvidia getting in on the lucrative gpt-wrapper market.

by jappgar

1/10/2025 at 7:35:19 PM

If it were a GPT wrapper, it wouldn't require an A100/H100 GPU; the container has a model wrapper, sure, but it also has the wrapped, standalone model; it's not calling OpenAI's model.

by dragonwriter