Claude-real-video － any LLM can watch a video

7/2/2026 at 11:43:46 PM

Pretty terribly expensive way to watch a video with Claude.

Use Gemini or some local VLM to do this way more efficiently. We spent quite a bit of time on video understanding, and Claude will just burn tokens.

Check out this library: https://vlm-run.github.io/mm/

You can swap models and try out different encoding methods for videos (https://vlm-run.github.io/mm/encoders/#video)

by fzysingularity

7/3/2026 at 4:49:04 AM

Exactly this. Gemini is best at this. Just give it a video link (YouTube works best) and it will analyse the video.

by thisisit

7/3/2026 at 6:09:49 AM

Really, does this work now? What about NotebookLM? I was using it a lot until I realised it was only analysing the transcripts and not the video, which mattered because I was mostly using it for technical videos with important charts.

by snthpy

7/3/2026 at 6:59:07 AM

It can tell you what’s on the screen at a given point in time. My pipeline is mostly around simple questions like “does this video contain cars?” Not sure if it can spot charts on screen.

by thisisit

7/3/2026 at 2:44:51 PM

NotebookLM still uses the transcript method I think. But Gemini is wonderful. I have been using it to analyze the youtube videos of wrestling matches (trying to build a fan website for WXM, the best pro wrestling promotion to come out of India in a while). It does move by move analysis, audience reaction based match flow tracking, isolates interesting parts of the video (big moves, botches, story beats etc). I have run some experiments to get video editing plans out of it. I think I can combine it with something like remotion skill to make highlight videos.

Edit: BTW, you can analyze about 8 hours a day on free tier.

by newswasboring

7/3/2026 at 5:23:20 AM

Seems cool from the docs page, I was about to give it a shot but https://github.com/vlm-run/mm goes 404 …

by n0on3

7/3/2026 at 11:59:02 AM

It’s unclear if that’s intentional since it’s listed also under open source on the main company site: https://www.vlm.run/open-source/mm

by rancar2

7/3/2026 at 1:34:02 AM

Do you mean that Gemini is the most token-efficient at watching videos? Is that the case for e.g. just giving it a video in the browser? I admit, I don't give LLMs videos as I just assume it'll burn too many tokens.

by Tenoke

7/3/2026 at 4:30:03 AM

Yes, Gemini is very token efficient at video. It also has "lower resolution" options which can make it even cheaper. With Gemini 3.1 flash lite an hour of video works out to $0.24 at the API rates.

by achatham

7/3/2026 at 1:15:57 AM

Assuming that's your project, the GitHub link from the PyPI page is a 404.

by mh-

7/2/2026 at 10:23:27 PM

"Where the video goes: stays on your machine" - No, the frames (that this tool extracts) obviously get sent to Anthropic if you use Claude.

by bonoboTP

7/3/2026 at 12:36:45 AM

"Or any LLM" on your machine.

by fny

7/3/2026 at 1:13:35 AM

I’m currently punishing Fable by making it watch the entire series of 7th Heaven.

by nickpeterson

7/3/2026 at 6:37:57 AM

It's going to make itself unavailable again. Actually... that's probably a litmus test for sentience.

by chaboud

7/3/2026 at 1:22:32 AM

Inhumane

by kingkawn

7/3/2026 at 3:35:13 AM

Was it bad? I was too young to tell and thought it was nice.

by testycool

7/2/2026 at 10:54:52 PM

This looks cool but this should be renamed without having Claude in the name.

by zitterbewegung

7/2/2026 at 11:00:55 PM

llm-real-video would be a much better name

by walrus01

7/3/2026 at 8:36:58 PM

Took this — pip install llm-real-video works now, same tool. Kept the original repo name so existing links don't break.

by cortexosmain

7/3/2026 at 6:25:44 AM

llmrv.

by coss

7/3/2026 at 7:30:15 AM

Elmerview

by fragmede

7/3/2026 at 5:59:20 AM

I was creating a scene-by-scene remake of a cutscene from an old DOS game. The sprite sheet had several sprites which were cycled (e.g. a horse with its head down and up). The engine would cycle through these regularly to create some "liveliness" in the background. It was tedious and I didn't want to figure out which sprites belonged at which pixel location.

I recorded a video of the relevant part of the cutscene using dosbox and then split it into numbered frames using ffmpeg. Then I gave that + the spritesheet to Claude Code and asked it to figure it out and tell me which ones are at what position. I should probably have deduped it but in any case, it churned through the whole thing and got one or two out of 15 or 16 sprites right. The rest, it just dropped into random places. YMMV

by noufalibrahim

7/2/2026 at 9:31:43 PM

Nice @OP, I put together something similar as well. Incidentally, I found that for motion design specifically, an LLM can't infer specific animations from frames as well as it can follow a plain, accurate description of what is happening and the timing.

One thing which sort of worked decently was actually take the frames and put them into a grid and have the agent look at the image of all of the frames together. It did surprisingly well but missed a lot of subtle details that it couldn’t see.
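A minimal sketch of the frame-grid idea described above, assuming frames are already decoded and all the same size. Here each frame is just a list of rows of grayscale values so the example has no dependencies; `make_contact_sheet` and its parameters are hypothetical names, not part of any tool mentioned in this thread. With real images you would more likely paste frames onto a canvas with an imaging library.

```python
from math import ceil
from typing import List

# A frame is a list of rows of grayscale values: a stand-in for real image data.
Frame = List[List[int]]


def make_contact_sheet(frames: List[Frame], columns: int, fill: int = 0) -> Frame:
    """Tile equally sized frames, row-major, into a single grid image."""
    if not frames:
        raise ValueError("need at least one frame")
    if columns < 1:
        raise ValueError("columns must be >= 1")

    height, width = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must share the same size")

    grid_rows = ceil(len(frames) / columns)
    # Start from a blank canvas; unused cells in the last row keep the fill value.
    sheet = [[fill] * (width * columns) for _ in range(grid_rows * height)]

    for index, frame in enumerate(frames):
        top = (index // columns) * height
        left = (index % columns) * width
        for y, row in enumerate(frame):
            sheet[top + y][left:left + width] = row
    return sheet


if __name__ == "__main__":
    demo = [[[v, v], [v, v]] for v in (1, 2, 3)]  # three 2x2 frames
    for line in make_contact_sheet(demo, columns=2):
        print(line)
```

The trade-off matches the comment: one grid image costs far less than sending every frame, but each frame gets fewer pixels, so subtle details are easy to miss.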

I also tried various kinds of vision embeddings, motion heat maps, and blur to show motion, but none really worked as well, so I ended up just describing it until it got it. Haven’t quite found the right solution yet.

by gvkhna

7/3/2026 at 9:03:50 PM

[flagged]

by cortexosmain

7/3/2026 at 10:29:42 AM

So I did this yesterday for a video analysis sample with ChatGPT and it took the video, pulled out frames, did difference tests across the frames to look for significant frames to focus on, did image recognition on each frame, and interpolated motion and action between.

So I’m not sure why this says ChatGPT doesn’t “see” video and reads transcripts. Obviously if the video is already labeled that’s the shortcut. But it did an impressive job describing a video I have no reason to think is in its training data. One could argue it wasn’t “native” and relied on an agent orchestrator and external tools to accomplish the goal… but it worked.

by Frost1x

7/3/2026 at 11:45:28 AM

Had the same experience with Claude, just somehow the entire thing felt (token) expensive.

by frb

7/3/2026 at 12:40:16 AM

Are models any good at discerning motion from multiple frames?

For instance, if I gave models multiple animations of a bouncing ball as individual frames, would they be able to tell which bounce was the more realistic motion?

(Is this a potential new benchmark? maybe also variations of stair dismount)

by Lerc

7/3/2026 at 1:00:17 AM

I’d imagine they could. I’d try Gemini 3.5 flash with high fps.

by danbrooks

7/2/2026 at 9:37:37 PM

I was just thinking about this exact use case yesterday:

For me it's about measuring charging speeds at different starting battery capacities and different temperatures. I thought: what if I just had a video camera pointed at the voltage going in and out, so I could see the battery percentage increase, with a temperature gun pointed at the phone as well? Then I could know the temperature of the phone too, and it could just figure it all out and create charts.

This would make reviewing different charging equipment really easy, as all you really have to do is plug it in, take a video of it, and feed it to the system, and tell other people to do the same thing.

I might very well give this a try!

by ElijahLynn

7/2/2026 at 11:30:07 PM

It's kind of wild how much we are abandoning basic problem solving skills in favor of just pointing an enormous stack of GPUs at it

by idiotsecant

7/2/2026 at 11:36:33 PM

Identifying objects in pictures was considered an insurmountable task only a few years ago, like in the xkcd comic https://xkcd.com/1425/

by siriusastrebe

7/3/2026 at 12:55:08 AM

In the general case, I guess. But watching gauges and dials like battery capacity only takes a little work with a deterministic computer vision library.

by smallerize

7/3/2026 at 6:51:08 AM

Yeah - the correct way to use an LLM in this scenario is to ask it to write such a model.

by mindok

7/3/2026 at 6:55:58 PM

Or just using voltage pickups like every system that monitors battery voltage ever, or about a dozen other very simple solutions.

by idiotsecant

7/3/2026 at 12:12:19 AM

[dead]

by yieldcrv

7/2/2026 at 10:28:50 PM

Cool idea, but keyframes are not videos. Motion, object permanence, are not things Claude can infer from a set of images. Nice demo though!

by octember

7/2/2026 at 11:46:09 PM

Exactly! We experimented with a whole bunch of video encoding techniques for LLMs here: https://vlm-run.github.io/mm/encoders/#video

by fzysingularity

7/3/2026 at 12:11:59 AM

I have been going through this with claude and qwenvl3:8b this week. Both are pretty decent at inferring context and analyzing contact sheets. Finding high visual interest moments with a mixture of coarse and fine keyframes.

by sawjet

7/3/2026 at 1:50:35 PM

Might be time to check gemma :)

by octember

7/3/2026 at 8:35:55 PM

[flagged]

by cortexosmain

7/3/2026 at 7:18:53 PM

[flagged]

by cortexosmain

7/2/2026 at 9:57:11 PM

I think this is much more useful than just LLM related applications. I'd suggest renaming it to not make it seem like it's LLM related.

by BeetleB

7/3/2026 at 6:16:16 AM

Based on my tests, a frame rate of 2fps is generally sufficient to resolve video content very well.

by dingody

7/3/2026 at 1:13:52 PM

my experience with ffmpeg scene detection is that it's flaky. it works, sometimes, but not reliable by any means

by high_byte

7/3/2026 at 12:51:26 PM

Interesting, but how expensive does it get?

by kraflio

7/3/2026 at 8:38:56 PM

[flagged]

by cortexosmain

7/3/2026 at 12:55:07 AM

Curious as to how many tokens are used per second of video.

by nickvec

7/3/2026 at 5:17:13 AM

So this basically works by dividing the video into frames.

by virajk_31

7/2/2026 at 9:44:40 PM

How do you handle things like scrolling quickly in a video?

by fred123123

7/3/2026 at 8:37:58 PM

[dead]

by cortexosmain

7/3/2026 at 12:48:19 PM

So Claude is a murderbot?

by speedgeek

7/3/2026 at 11:23:48 AM

better off using a cloud solution.

by Jeff9James

7/3/2026 at 9:42:06 AM

Gemini does read videos.

by wesleywt

7/2/2026 at 10:06:20 PM

this is really clever, props

by nxtfari

7/2/2026 at 7:10:12 PM

Hi HN! I built this because I was frustrated that no LLM actually "sees" a video — Claude won't accept video files, ChatGPT reads the transcript only, and Gemini samples at a fixed 1fps (missing fast cuts, over-sampling static slides).

claude-real-video takes a URL or local file and:

1. Extracts frames at every scene change (not fixed intervals) + a density floor
2. Deduplicates with a sliding-window pixel-diff algorithm (so A-B-A interview cutaways don't re-send the same shot)
3. Transcribes audio (prefers embedded subtitles, falls back to Whisper)
4. Optionally keeps the full soundtrack for audio-capable models
5. Writes a clean MANIFEST.txt you can drop into any LLM chat
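For readers wondering what the first step might look like in practice, here is a rough sketch using ffmpeg's `select` filter and its `scene` score. This is an assumption about one common way to do it, not necessarily how claude-real-video does it internally; the function names, the 0.3 threshold, and the 10-second gap are all illustrative.

```python
from typing import List, Optional


def scene_select_filter(threshold: float = 0.3, max_gap_s: Optional[float] = None) -> str:
    """Build an ffmpeg `select` expression that keeps frames at scene changes.

    If max_gap_s is given, also keep the first frame and any frame that comes
    at least max_gap_s seconds after the previously selected one, so long
    static stretches still get sampled (a simple "density floor").
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be between 0 and 1")
    expr = f"gt(scene,{threshold})"
    if max_gap_s is not None:
        expr += f"+isnan(prev_selected_t)+gte(t-prev_selected_t,{max_gap_s})"
    return f"select='{expr}'"


def scene_change_command(source: str, out_pattern: str,
                         threshold: float = 0.3,
                         max_gap_s: Optional[float] = None) -> List[str]:
    """Return an argv list suitable for subprocess.run (hypothetical helper)."""
    return [
        "ffmpeg", "-i", source,
        "-vf", scene_select_filter(threshold, max_gap_s),
        "-vsync", "vfr",  # emit only selected frames; newer builds prefer -fps_mode vfr
        out_pattern,
    ]


if __name__ == "__main__":
    print(" ".join(scene_change_command("talk.mp4", "frames/%04d.jpg", max_gap_s=10)))
```

The right threshold depends heavily on the footage, so it is worth checking the selected frames by eye before trusting any fixed value.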

A 10-min presentation goes from ~600 fixed-interval frames to 5-15 meaningful keyframes. 90%+ token savings with better comprehension.

The dedup approach (v0.2.0) uses real pixel difference on 16x16 RGB thumbnails against a sliding window of the last N kept frames — inspired by videostil's pixelmatch, but simpler and self-contained.
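To make the sliding-window idea concrete, here is a minimal sketch under stated assumptions: thumbnails are already downscaled to 16x16 RGB and flattened to 768 integers, and "difference" is the mean absolute channel difference as a percentage. The actual metric, window size, and threshold in the tool may differ; `diff_percent` and `dedupe` are hypothetical names.

```python
from collections import deque
from typing import Deque, List, Sequence

# Flattened 16x16 RGB thumbnail: 768 channel values in 0..255 (assumed layout).
Thumb = Sequence[int]


def diff_percent(a: Thumb, b: Thumb) -> float:
    """Mean absolute channel difference, scaled to 0..100."""
    if len(a) != len(b) or not a:
        raise ValueError("thumbnails must be non-empty and the same size")
    total = sum(abs(x - y) for x, y in zip(a, b))
    return 100.0 * total / (255 * len(a))


def dedupe(thumbs: List[Thumb], window: int = 5, threshold: float = 5.0) -> List[int]:
    """Return indices of frames that differ from every recently kept frame."""
    recent: Deque[Thumb] = deque(maxlen=window)  # the last `window` kept frames
    kept: List[int] = []
    for index, thumb in enumerate(thumbs):
        # Keep a frame only if it is far enough from all frames in the window.
        if all(diff_percent(thumb, other) >= threshold for other in recent):
            kept.append(index)
            recent.append(thumb)
    return kept


if __name__ == "__main__":
    shot_a = [0] * 768
    shot_b = [255] * 768
    # A-B-A cutaway: the second A matches a frame still in the window.
    print(dedupe([shot_a, shot_b, shot_a, shot_a]))
```

Comparing against a window rather than only the previous frame is what lets an A-B-A cutaway drop the second A; with `window=1` the same input would keep it.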

`--report` generates a self-contained HTML showing every keep/drop decision with diff percentages, so you can tune the threshold visually.

pip install claude-real-video && crv "https://youtube.com/watch?v=..." --report

MIT licensed, pure Python + ffmpeg. Happy to answer questions!

by cortexosmain

7/2/2026 at 10:01:39 PM

I gave Claude a video provided by a county attorney for a speeding ticket I got. It was spot on in its analysis, even though I don’t like what the video showed.

What does it mean that Claude can’t view video? It did it just fine. Or do you mean tool-less?

by garciasn

7/2/2026 at 10:23:58 PM

yeah I'm pretty sure claude code can handle videos. it's been doing frame by frame analysis for me with generated video to iterate on pipelines

by torhorway

7/2/2026 at 9:54:47 PM

I think a more or less clunky name like 'llm video preprocessor' would be a better description? In any case, it seems like you came up with a good project idea. I wonder how long until the SOTA models will just have this kind of functionality built in.

by AmazingEveryDay

7/2/2026 at 9:15:32 PM

Very cool, I have something along these lines as well. I’ll dig into yours over the next few days and contribute where and if I can too, awesome to see!

by ProofHouse

7/3/2026 at 9:02:47 PM

[flagged]

by cortexosmain

7/3/2026 at 6:08:45 AM

[flagged]

by mlpicker