6/23/2026 at 12:21:29 PM
Very interesting. The way I understand it, the researchers found a clever architectural hack to stop the AI from hoarding memory when reading long documents.
Normally, when an AI transcribes a 100-page PDF, it tries to remember every single word it has already ingested. This short-term memory (the KV cache) grows linearly, O(N), until the model runs out of VRAM and crashes (or hits a cap). To avoid this, developers are forced to build janky code that chops PDFs into individual pages, processes them one by one, and glues the text back together.
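To put rough numbers on that linear growth, here is a back-of-the-envelope sketch. The model dimensions (32 layers, 8 KV heads, head size 128, 2 bytes per value) are hypothetical defaults picked for illustration, not figures from the paper or from any specific model.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for n_tokens.

    All default dimensions are hypothetical, chosen only for illustration.
    """
    # Factor of 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


per_token = kv_cache_bytes(1)        # cost of remembering one token
full_doc = kv_cache_bytes(100_000)   # cache for 100k generated tokens, nothing forgotten
windowed = kv_cache_bytes(128)       # cache if only the last 128 tokens are kept

print(per_token, full_doc, windowed)
```

With these assumed dimensions each token costs 128 KiB, so 100,000 tokens need about 12 GiB while a 128-token window stays at 16 MiB no matter how long the output gets.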
Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to split the AI's focus into two paths:
Global Reference: The AI keeps full, uncompromised sight of the original document image so it never loses context.
Local Generation: The AI restricts its memory of its own typed text to a tight, moving window (like the last 128 words) and safely forgets the rest.
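A minimal sketch of what such a two-path attention mask could look like, going only by the description above. The function name `rswa_mask` and its parameters are made up for illustration, and the paper's actual implementation may differ (for example in which layers use the window, or in counting tokens rather than words).

```python
def rswa_mask(n_ref, n_gen, window):
    """Return mask[i][j] == True when generated token i may attend to position j.

    Positions 0..n_ref-1 are reference (document image) tokens;
    positions n_ref..n_ref+n_gen-1 are generated text tokens.
    """
    mask = []
    for i in range(n_gen):
        row = [True] * n_ref            # global path: every reference token stays visible
        for j in range(n_gen):
            causal = j <= i             # never look at future tokens
            recent = i - j < window     # local path: only the last `window` generated tokens
            row.append(causal and recent)
        mask.append(row)
    return mask


# 3 reference tokens, 5 generated tokens, window of 2
for row in rswa_mask(3, 5, 2):
    print("".join("x" if allowed else "." for allowed in row))
```

Each row allows at most `n_ref + window` positions, so the amount of generated text that has to stay cached does not grow with output length.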
Will be very interesting for local AI and can’t wait to see what the community builds and extends with it!
by robotswantdata
6/23/2026 at 2:11:16 PM
This hits a sweet spot for conversations too, I think. I've been playing (for quite a while) with ways to encapsulate long-running conversations.
You have the overriding context: facts that don't change very often at all, like the participants' names, their backgrounds, etc.
Then you have some very fine grained facts (what they ate for breakfast this morning) which might be useful right now, but are irrelevant outside of a general trend over the longer term.
When trying to reconstruct a conversation you really need to find the right balance without pulling in everything that has ever been discussed.
This definitely is worth further investigation.
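A rough sketch of that split between slow-changing context and short-lived detail. The class `ConversationMemory` and its method names are hypothetical, and this ignores the hard part (deciding which facts are stable):

```python
from collections import deque


class ConversationMemory:
    """Two tiers: stable facts kept forever, recent turns kept in a sliding window."""

    def __init__(self, window=3):
        self.stable_facts = {}               # slow-changing context: names, backgrounds
        self.recent = deque(maxlen=window)   # fine-grained turns; the oldest is dropped automatically

    def set_fact(self, key, value):
        self.stable_facts[key] = value

    def add_turn(self, text):
        self.recent.append(text)

    def build_context(self):
        # Stable facts first, then only the turns still inside the window.
        facts = [f"{key}: {value}" for key, value in self.stable_facts.items()]
        return "\n".join(facts + list(self.recent))


memory = ConversationMemory(window=2)
memory.set_fact("name", "Ada")
for turn in ["had eggs for breakfast", "running late", "meeting moved to 3pm"]:
    memory.add_turn(turn)
print(memory.build_context())
```

The breakfast turn falls out of the window while the name stays, which is the balance described above.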
by _puk
6/23/2026 at 2:26:29 PM
This sounds like we are trying to add an LSTM into a transformer
by ewild
6/23/2026 at 3:07:59 PM
Sepp would like a word
by htrp
6/23/2026 at 11:20:40 PM
I tried to do that for very long translations: I had a sliding window, a memory for the important things to keep it consistent, a loop for repairs, etc. https://jeena.net/loop-engineering
But for some reason the local models I used back then (that was almost 2 years ago) weren't good enough, so none of my optimizations did anything good for the translation quality.
by jeena
6/23/2026 at 3:47:32 PM
Can you say more about how this applies to long-running conversations? I've been thinking about them as well, but can't wrap my head around how this would be better than (or even different to) standard compaction.
by timwis
6/23/2026 at 4:40:33 PM
Standard compaction doesn't really distinguish between long-term and short-term ephemeral facts?
by dominotw
6/23/2026 at 4:59:25 PM
Forgive me if I'm being naive, but can't you just tweak the compaction prompt to differentiate? Presumably that's what you would do in the separate prompt anyway, right?
by timwis
6/23/2026 at 3:58:34 PM
Haven't read the full paper, but the local generation window is a little small, especially since image inputs are so token-heavy. Depending on where the local attention layer is located, it would be nicer if it were bigger, e.g. at least 4096 words.
by storywatch
6/23/2026 at 4:01:59 PM
I do OCR of images, and that's exactly what I do. I take one big image and slice it into many smaller ones, and send those to the LLM. Perfect every time, unlike using the whole image, which resulted in hot garbage.
by MattRogish
6/23/2026 at 4:30:56 PM
It works with relatively good scans. With bad or skewed scans, and especially with documents that have many label/value pairs that aren't nicely tucked inside sentences, the more context you have, the better you can find the correct words and fix the errors.
There is a whole class of tricky documents. A decent (if you ignore the marketing bias) post about this problem can be found here:
by freefaler
6/23/2026 at 5:07:36 PM
How do you know where to slice an image? What if you slice an image mid-word?
by ryanisnan
6/24/2026 at 3:24:19 AM
Paddle-VL and GLM-OCR do this by using PP-DocLayoutv3 as their "detector/slicer" and then just batching the OCR on the clips, which does pretty darn well at a tiny size.
A lower-tech version is to use a good detector and XY-cut, or just a naive Y-cut or orientation-away cut, to slice up the page. But if you're doing that you're getting closer and closer to DocVLM-style OCR + low-res image. Been playing around with something like this using the new PPOCRv6, which itself punches well above most traditional OCRs and is multi-language without the hassle of language detection and dict-loading for rec.
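For anyone unfamiliar with the naive Y-cut, here is a minimal sketch: split the page wherever a completely blank row separates two bands of ink. It assumes an already binarized, deskewed page represented as plain lists of 0/1, and the function name `y_cut` is made up here. It is not how PP-DocLayoutv3 or any of the models above work.

```python
def y_cut(rows):
    """Split a binarized page into horizontal bands separated by blank rows.

    `rows` is a list of pixel rows, each a list of 0 (background) or 1 (ink).
    Returns (start, end) row-index pairs, with `end` exclusive.
    """
    bands, start = [], None
    for y, row in enumerate(rows):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                     # a band of content begins
        elif not has_ink and start is not None:
            bands.append((start, y))      # a blank row closes the band
            start = None
    if start is not None:
        bands.append((start, len(rows)))  # content ran to the bottom edge
    return bands


page = [
    [0, 0],
    [1, 0],
    [1, 1],
    [0, 0],
    [0, 0],
    [0, 1],
]
print(y_cut(page))
```

A skewed scan rarely has fully blank rows, which is one reason a real detector does better than this.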
by vrc
6/23/2026 at 6:20:29 PM
I calculate* the appropriate overlap and the slicer overlaps a certain amount of the previous slice. There is some post-processing assembly required, but it's trivial.
[*] SWAG line height, trial and error to figure out the right amount of overlap given LLM error rates, etc.
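The overlapping-slice idea can be sketched like this. The function name `slice_bounds` and the numbers are illustrative only; the actual slice height and overlap would come from the estimated line height and the trial and error mentioned above.

```python
def slice_bounds(height, slice_height, overlap):
    """Return (top, bottom) pixel ranges covering `height`.

    Each slice repeats the last `overlap` rows of the previous one, so a line
    of text cut by one slice boundary appears whole in the neighbouring slice.
    """
    if not 0 <= overlap < slice_height:
        raise ValueError("overlap must be non-negative and smaller than slice_height")
    step = slice_height - overlap
    bounds, top = [], 0
    while True:
        bottom = min(top + slice_height, height)
        bounds.append((top, bottom))
        if bottom == height:
            return bounds
        top += step


print(slice_bounds(1000, 400, 100))
```

Picking an overlap of at least one full line height (plus some margin) is what makes the later assembly step simple: duplicated lines in the overlap region can be matched and merged.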
by MattRogish
6/23/2026 at 6:30:45 PM
Interesting. Do you have a uniform data set? E.g. documents of a specific type that you know consistently have similar formats, or is this training something you need to do per-document?
by ryanisnan
6/24/2026 at 12:40:13 AM
We have some broad shapes - it’s a finite set of “things that are interesting to us” and the dataset is bounded. It’s not “Google Image Search”. But it is kinda like “we have a giant pile of PDFs, pictures, etc. and the user wishes to run an arbitrary query on them and extract the information they want”. Ex: “I need to know $something about the data embedded in the corpus, that looks like Excel data with line charts describing some particular class of metric that are to the left of gray dogs and are about $something_else earlier in the document”
Gemini has a very specific mode where it has been trained on making boxes normalized to a 1000x1000 grid (https://docs.cloud.google.com/gemini-enterprise-agent-platfo...) and in our experience this “just works” AND is very fast on 3.5 and 3.1 models without needing much thinking (so it is not terrifically expensive).
(BTW A+++ gold star triple thumbs up, give this person a bonus, to whoever did that magic: it basically made this task tractable for us. When we first found it nobody else had anything like it - it’s worked so well I haven’t felt any need to look.)
So we say, “Hey Gemini draw box_2d […] around #{things we are interested in}” and then it is pretty easy to then go - ok if this is here and that is there, let’s slice the image in this particular way, making sure to overlap by some amount because the boxes are fuzzy, then send the chunks to a thing that turns it into JSON, then we use something like edge detection to reconstruct the whole from the parts. (Squint and it looks like whole genome shotgun sequencing)
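A small sketch of the "boxes are fuzzy, so overlap by some amount" step: turning a box on the 1000x1000 normalized grid into a padded pixel crop. This assumes the box is ordered [ymin, xmin, ymax, xmax] (check the linked docs for the exact format); the function name `box_to_pixels` and the `pad` parameter are made up for illustration.

```python
def box_to_pixels(box_2d, width, height, pad=0):
    """Convert a box on a 0-1000 normalized grid to a padded pixel crop.

    Assumes box_2d is ordered [ymin, xmin, ymax, xmax].
    Returns (left, top, right, bottom), clamped to the image bounds.
    """
    ymin, xmin, ymax, xmax = box_2d
    # Floor the near edges and ceil the far edges so rounding never cuts into the box.
    left = max(0, xmin * width // 1000 - pad)
    top = max(0, ymin * height // 1000 - pad)
    right = min(width, -(-xmax * width // 1000) + pad)
    bottom = min(height, -(-ymax * height // 1000) + pad)
    return left, top, right, bottom


print(box_to_pixels([100, 200, 500, 600], width=2000, height=1000, pad=10))
```

The returned tuple is in (left, top, right, bottom) order, which is the order Pillow's `Image.crop` expects if that is what does the slicing.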
by MattRogish
6/23/2026 at 9:02:54 PM
I thought all the major LLM tools already supported sliding window attention?
by ranger_danger
6/24/2026 at 6:32:01 AM
I mean, sliding window attention is the most basic way of getting a long context window. For the OCR case it seems like it should be even simpler, since you don't even need the "sliding" portion: unless I'm missing something, you don't need to retain anything about the previous pages to OCR a new page, so you could just pick a short context window and restart from scratch each time. [^1]
Were people really trying to do OCR with vanilla attention?
[^1] Although maybe I guess looking at their demo, tables that span multiple pages might be a use-case for having some look back.
by krackers
6/23/2026 at 1:21:13 PM
See, leetcode is useful. As I do this leetcode grind, I’ve been learning why techniques exist / how they’re used irl. Lots of interesting stuff there
by d675
6/23/2026 at 1:41:42 PM
Who said it wasn't useful? Don't listen to those people.
by ai_fry_ur_brain
6/23/2026 at 2:00:56 PM
People who are applying to jobs and are tested with LeetCode problems to assess their skill level, despite the two not really being correlated or relevant for the position.
by Xevion
6/23/2026 at 2:27:48 PM
As someone that gets very annoyed when having to do LeetCode in interviews...
Knowing algorithms, data structures and their memory and time complexities is very relevant for SWE. I've had teammates that didn't understand them, and everything was fine until it wasn't (scaling and performance issues).
Or, as I put it to a teammate: "Would you rather review the PR of someone that understands the difference between a set and a list or the PR of someone who doesn't?". This was after we interviewed a candidate with ~15 YoE, on paper, that didn't know the difference.
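To make the set-versus-list point concrete, a small illustrative sketch (not from the interview in question): both functions below return the same result, but the list version rescans everything seen so far on each lookup.

```python
def dedupe_with_list(items):
    """Order-preserving dedupe that tracks seen items in a list."""
    seen, out = [], []
    for x in items:
        if x not in seen:    # linear scan of `seen`: O(n) per lookup, O(n^2) overall
            seen.append(x)
            out.append(x)
    return out


def dedupe_with_set(items):
    """Order-preserving dedupe that tracks seen items in a set."""
    seen, out = set(), []
    for x in items:
        if x not in seen:    # hash lookup: O(1) on average, O(n) overall
            seen.add(x)
            out.append(x)
    return out


data = [3, 1, 3, 2, 1]
print(dedupe_with_list(data), dedupe_with_set(data))
```

On a handful of items nobody notices the difference; on millions of items the list version is the kind of thing that is fine until it isn't.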
by galbar
6/23/2026 at 2:55:53 PM
> Knowing algorithms, data structures and their memory and time complexities is very relevant for SWE
Agree with this; however knowing how to roll your own BFS/LRU/etc isn't -- in that case I'd rather review the PR of someone who understands how to leverage tested and known implementations than the PR of someone who decided to roll their own.
by elliottcarlson
6/23/2026 at 3:31:20 PM
Who cares if the leetcode question doesn't relate to the job itself? It shows whether or not the person is willing to put in the work and gives you a glimpse into their ability to reason about hard problems.
by ai_fry_ur_brain
6/24/2026 at 12:17:24 AM
Just the level of questions being asked seems to be high, idk. Just passed round 1 for big tech. Not feeling great about the rest.
Main comment was a bit tongue in cheek.
by d675