Rolling your own serverless OCR in 40 lines of code

2/16/2026 at 3:08:39 PM

Not sure what “your own” in the title is supposed to mean if you are running a model that you didn’t train using a framework that you didn’t write on a server that you don’t own.

by eapriv

2/16/2026 at 5:36:33 PM

I think in this case "your own" means under your control, rather than a service or license you pay for. "your own" as in ownership of artefacts, not as in being the creator.

by ddevnyc

2/16/2026 at 3:24:18 PM

I originally tried to do this on my own server but my GPU is too old :(

by ckrapu

2/16/2026 at 4:37:34 PM

Slammed an A380 in my old server that doesn't even have a GPU power connector & it works pretty well for stuff that will fit on it. They're only like, $150 brand new nowadays; could be a decent option.

by LoganDark

2/16/2026 at 7:32:05 PM

And then call it serverless

by croes

2/16/2026 at 6:46:19 PM

by self_awareness

2/16/2026 at 12:44:42 PM

Wouldn't "Serverless OCR" mean something like running tesseract locally on your computer, rather than creating an AI framework and running it on a server?

by voidUpdate

2/16/2026 at 12:47:49 PM

Serverless means spinning compute resources up on demand in the cloud vs. running a server permanently.

by cachius

2/16/2026 at 1:21:06 PM

~99.995% of the computing resources used on this are from somebody else's servers, running the LLM model.

by dsr_

2/16/2026 at 5:05:02 PM

> Serverless means spinning compute resources up on demand in the cloud vs. running a server permanently.

Not quite. Serverless means you can run a server permanently, but you need pay someone else to manage the infrastructure for you.

by locknitpicker

2/16/2026 at 7:19:14 PM

You might be conflating "cloud" with serverless. Serverless is where developers can focus on code, with little care of the infrastructure it runs on, and is pay-as-you-go.

by Stefan-H

2/16/2026 at 6:34:07 PM

Close. It means there's no persistent infra charges and you're charged on use. You dont run anything permanently.

by turtlebits

2/16/2026 at 8:31:16 PM

It still doesn't capture the concept because, say, both AWS Lambda and EC2 can be run just for 5 minutes and only one of them is called serverless.

by dvfjsdhgfv

2/16/2026 at 7:07:31 PM

Depends if you mean "server" as in piece of metal (or vm), or as in "a daemon"

by jwiz

2/16/2026 at 1:18:18 PM

Thanks for noting this - for a moment I was excited.

by normie3000

2/16/2026 at 2:59:29 PM

You can still be excited! Recently, GLM-OCR was released, which is a relatively small OCR model (2.5 GB unquantized) that can run on CPU with good quality. I've been using it to digitize various hand-written notes and all my shopping receipts this week.

https://github.com/zai-org/GLM-OCR

(Shameless plug: I also maintain a simplified version of GLM-OCR without dependency on the transformers library, which makes it much easier to install: https://github.com/99991/Simple-GLM-OCR/)

by xml

2/16/2026 at 1:44:12 PM

When people mentions the number of lines of code, I've started to become suspicious. More often than not it's X number of lines, calling a massive library loading a large model, either locally or remote. We're just waiting for spinning up your entire company infrastructure in two lines of code, and then just being presented a Terraform shell script wrapper.

I do agree with the use of serverless though. I feel like we agree long ago that serverless just means that you're not spinning up a physical or virtual server, but simply ask some cloud infrastructure to run your code, without having to care about how it's run.

by mrweasel

2/16/2026 at 4:42:47 PM

>implement RSA with this one simple line of python!

by goodmythical

2/16/2026 at 5:07:22 PM

> When people mentions the number of lines of code, I've started to become suspicious.

Low LoC count is a telltale sign that the project adds little to no value. It's a claim that the project integrates third party services and/or modules, and does a little plumbing to tie things together.

by locknitpicker

2/16/2026 at 5:26:52 PM

No, that would be "Running OCR locally..."

'Serverless' has become a term of art: https://en.wikipedia.org/wiki/Serverless_computing

by esafak

2/16/2026 at 8:32:37 PM

It's good they note explicitly:

> Serverless is a misnomer

by dvfjsdhgfv

2/16/2026 at 2:37:38 PM

Running it locally would typically be called “client(-)side”.

But this caught me for a bit as well. :-)

by spockz

2/16/2026 at 5:19:42 PM

That's the beauty of such stupid terms.

I use carless transportation (taxis).

by ahartmetz

2/16/2026 at 5:36:30 PM

taxis are cars, aren't they?

by wolfi1

2/16/2026 at 7:54:33 PM

Precisely. And serverless uses servers.

by BenjiWiebe

2/16/2026 at 1:30:49 PM

Deepseek OCR is no longer state of the art. There are much better open source OCR models available now.

ocrarena.ai maintains a leaderboard, and a number of other open source options like dots [1] or olmOCR [2] rank higher.

[1] https://www.ocrarena.ai/compare/dots-ocr/deepseek-ocr

[2] https://www.ocrarena.ai/compare/olmocr-2/deepseek-ocr

by kbyatnal

2/16/2026 at 2:07:29 PM

I wasn't aware of dots when I wrote the blog post. This is really good to know!! I would like to try again with some newer models.

by ckrapu

2/16/2026 at 9:09:28 PM

A bit surprised to learn that Rednote maintains one of the leading open-source OCR models on the market, nice.

by vovavili

2/16/2026 at 2:10:42 PM

you are comparing to DeepSeek's old OCR, there's DeepSeek-OCR2 which btw is amazing from my experimentations. https://huggingface.co/deepseek-ai/DeepSeek-OCR-2

by segmondy

2/16/2026 at 1:58:51 PM

The article mentions choosing the model for its ability to parse math well.

by tclancy

2/16/2026 at 3:02:39 PM

hi. i run "ocr" with dmenu on linux, that triggers maim where i make a visual selection. a push notification shows the body (nice indicator of a whiff), but also it's on my clipboard

  #!/usr/bin/env bash

  # requires: tesseract-ocr imagemagick maim xsel

  IMG=$(mktemp)
  trap "rm $IMG*" EXIT

  # --nodrag means click 2x
  maim -s --nodrag --quality=10 $IMG.png

  # should increase detection rate
  mogrify -modulate 100,0 -resize 400% $IMG.png

  tesseract $IMG.png $IMG &>/dev/null
  cat $IMG.txt | xsel -bi
  notify-send "Text copied" "$(cat $IMG.txt)"

  exit

by grimgrin

2/16/2026 at 2:14:12 PM

I am working on a client project, originally built using Google Vision APIs, and then I realized Tesseract is so good. Like really good. Also, if PDF text is available, then pdftotext tools are awesome.

My client's usecase was specific to scanning medical reports but since there are thousands of labs in India which have slightly different formats, I built an LLM agent which works only after the pdf/image to text process - to double check the medical terminology. That too, only if our code cannot already process each text line through simple string/regex matches.

There are perhaps extremely efficient tools to do many of the work where we throw the problem at LLMs.

by brainless

2/16/2026 at 1:06:45 PM

Slight tangent: i was wondering why DeepSeek would develop something like this. In the linked paper it says

> In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G).

That... doesn't sound legal

by coolness

2/16/2026 at 2:56:24 PM

HathiTrust (https://en.wikipedia.org/wiki/HathiTrust) has 6.7 millions of volumes in the public domain, in PDF from what I understand. That would be around a billion pages, if we consider a volume is ~200 pages. 5000 days to go through that with an A100-40G at 200k pages a day. That is one way to interpret what they say as being legal. I don't have any information on what happens at DeepSeek so I can't say if it's true or not.

by Zababa

2/16/2026 at 2:45:34 PM

Tried adding a receipt itemization feature into an app using OpenAI. It does 95% right but the remaining 5% are a mess. Mostly it mixes prices between items (Olive oil 0.99 while Banana 7.99). Is there some lightweight open source lib that can do this better?

by Bishonen88

2/16/2026 at 3:06:07 PM

So I'm trying to OCR 1000s of pages of old french dictionaries from the 1700s, has anything popped up that doesn't cost an arm and a leg, and works pretty decently?

by lkm0

2/16/2026 at 6:10:39 PM

I use Gemini for that. Split the PDF into 50 page chunks, throw it into aistudio and ask it to convert it. A couple of 1000 pages can be done with the free tier.

by grumbel

2/16/2026 at 5:13:11 PM

Take a look at Mistral, https://mistral.ai/news/mistral-ocr-3

by ks2048

2/16/2026 at 3:08:06 PM

Qwen3 VL.

by speedgoose

2/16/2026 at 3:34:52 PM

Thanks! I'll have a look

by lkm0

2/16/2026 at 1:03:11 PM

How does this compare to Tesserect?

by ddtaylor

2/16/2026 at 2:27:54 PM

Different tools for different jobs. Tesseract is free, runs on CPU, and handles clean printed text well. For standard documents with simple layouts, it's hard to beat.

Where it falls apart is complex pages. Multi-column layouts, tables, equations, handwriting. Tesseract works line-by-line with no understanding of page structure, so a two-column paper gets garbled into interleaved text. VLM-based models like DeepSeek treat the page as an image and infer structure visually, which handles those cases much better.

For this specific use case (stats textbook with heavy math), Tesseract would really struggle with the equations. LaTeX-rendered math has unusual character spacing and stacked symbols that confuse traditional OCR engines. The author chose DeepSeek specifically because it outputs markdown with math notation intact.

The tradeoff is cost and infrastructure. Tesseract runs on your laptop for free. The author spent $2 on A100 GPU time for 600 pages. For a one-off textbook that's nothing, but at scale the difference between "free on CPU" and "$0.003/page on GPU" matters. Worth noting that newer alternatives like dots and olmOCR (mentioned upthread by kbyatnal) are also worth comparing if accuracy on complex layouts is the priority.

by newzino

2/17/2026 at 12:37:48 AM

> The author spent $2 on A100 GPU time for 600 pages

Hopefully that cost would come down quite a bit, because that doesn't compete with most offerings right now IMO. I haven't tested it, but I can use models that have vision as an input modality for much cheaper, closer to 25k images per $1.

by ddtaylor

2/16/2026 at 1:18:30 PM

Question for the crowd -- with autoscaling, when a new pod is created it will still download the model right from huggingface?

I like to push everything into the image as much as I can. So in the image modal, I would run a command to trigger downloading the model. Then in the app just point to the locally downloaded model. So bigger image, but do not need to redownload on start up.

by apwheele

2/16/2026 at 2:29:17 PM

[dead]

by newzino

2/16/2026 at 2:50:34 PM

That book is freely available from its author in pdf format already… but I guess it’s about the journey?

by bovinejoni

2/16/2026 at 3:01:48 PM

If I had to guess, I would say that this method might be applicable to other books besides the one featured in the post.

by velcrovan

2/16/2026 at 3:22:21 PM

I wanted to let an LLM be able to grep and read through it.

by ckrapu

2/16/2026 at 9:02:30 PM

Why "rolling"? Is this a reference to baking or what's the origin?

by jbs789

2/16/2026 at 2:26:26 PM

Always wondered how auth validation works on these. Could I use your serverless ocr?

by sails

2/16/2026 at 4:38:19 PM

The cold-boot time on this model can hardly be called “serverless”

by fzysingularity

2/16/2026 at 5:37:46 PM

Uh... So I've been telling AI to write a single page html/js OCR app. And I'll include the pdf I want as an attachment.

I have 4 of these now, some are better than others. But all worked great.

by PlatoIsADisease

2/16/2026 at 2:12:30 PM

tl'dr version:

  step 1 draw a circle
  step 2 import the rest of the owl

by zeroq