Show HN: CPU-only transcription for YouTube, TikTok, X, Instagram videos

5/21/2026 at 5:45:03 AM

If anyone is interested, these are the super-short zsh/bash functions I keep in my .zshrc for doing the same thing with plain whisper.cpp, ffmpeg and yt-dlp (`brew install whisper-cpp yt-dlp` on a Mac). I output VTT subtitles, but it's easy enough to change that to txt.

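  # Download a video's audio with yt-dlp and transcribe it to "$output_base.vtt".
  # Usage: yt_to_srt <url> <output_base> [language]; $WHISPER_MODEL must point at a ggml model file.
  yt_to_srt() {
    local url="$1"
    local output_base="$2"
    local language="${3:-en}"

    # Extract 16 kHz WAV audio; %(ext)s lets yt-dlp name the result "$output_base.wav".
    yt-dlp -x --audio-format wav --postprocessor-args "-ar 16000" -o "$output_base.%(ext)s" "$url" || return
    whisper-cli --language "$language" --model "$WHISPER_MODEL" --split-on-word --max-len 65 --output-vtt --output-file "$output_base" --file "$output_base.wav"
    rm "$output_base.wav"
  }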

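  # Transcribe a local audio/video file to "<name>.vtt" in the current directory.
  # Usage: file_to_srt <path> [language]; $WHISPER_MODEL must point at a ggml model file.
  file_to_srt() {
    local filepath="$1"
    local language="${2:-en}"

    local filename
    filename=$(basename "$filepath")
    local output_base="${filename%.*}"
    # Distinct temp name, so a .wav input in the current directory is never overwritten or deleted.
    local temp_wav="$output_base.tmp16k.wav"

    # Convert to 16 kHz mono 16-bit PCM before handing the file to whisper-cli.
    ffmpeg -i "$filepath" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$temp_wav" || { rm -f "$temp_wav"; return 1; }
    whisper-cli --language "$language" --model "$WHISPER_MODEL" --split-on-word --max-len 65 --output-vtt --output-file "$output_base" --file "$temp_wav"
    rm "$temp_wav"
  }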

Plus an additional bootstrap script for the large-v3-turbo model from my chezmoi dotfiles:

  #!/bin/bash
  # Download whisper.cpp models from Hugging Face (runs once per machine).
  set -euo pipefail
  MODELS_DIR="$HOME/whisper-models"
  BASE_URL="https://huggingface.co/ggerganov/whisper.cpp/resolve/main"
  MODELS=("ggml-large-v3-turbo.bin" "ggml-tiny.bin")
  mkdir -p "$MODELS_DIR"
  for model in "${MODELS[@]}"; do
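    if [ ! -f "$MODELS_DIR/$model" ]; then
      echo "Downloading $model..."
      # --fail keeps an HTTP error page from being saved as a model; the .part file
      # keeps an interrupted download from being treated as complete on the next run.
      curl -L --fail --progress-bar -o "$MODELS_DIR/$model.part" "$BASE_URL/$model"
      mv "$MODELS_DIR/$model.part" "$MODELS_DIR/$model"
    else
      echo "$model already exists, skipping."
    fi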
  done
  echo "Whisper models ready at $MODELS_DIR"
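The two functions read `$WHISPER_MODEL`, which neither snippet sets. A minimal sketch of the missing glue, assuming the bootstrap script's `$HOME/whisper-models` directory; the helper name `whisper_model_path` and the `WHISPER_MODELS_DIR` override are hypothetical, not part of whisper.cpp:

```shell
# Hypothetical helper: print the path of a downloaded model, or fail if it is missing.
# WHISPER_MODELS_DIR is an assumed override; the default matches the bootstrap script.
whisper_model_path() {
  local dir="${WHISPER_MODELS_DIR:-$HOME/whisper-models}"
  local model_file="$dir/ggml-${1:-large-v3-turbo}.bin"
  if [ ! -f "$model_file" ]; then
    echo "missing model: $model_file (run the bootstrap script first)" >&2
    return 1
  fi
  printf '%s\n' "$model_file"
}

# In .zshrc / .bashrc: export the variable only when the model is present.
if model_path=$(whisper_model_path large-v3-turbo 2>/dev/null); then
  export WHISPER_MODEL="$model_path"
fi
```

With that in place, `yt_to_srt "<url>" talk` should leave `talk.vtt` in the current directory; passing `tiny` instead of `large-v3-turbo` trades accuracy for speed.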

by piotrrojek

5/21/2026 at 7:28:49 AM

yt-dlp can download auto-subtitles and regular subtitles, why not do that and fall back to whisper?
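A rough sketch of that idea, assuming the `yt_to_srt` function from the parent comment is already defined; `fetch_subs_or_transcribe` is a hypothetical name, and the file check is there because, as far as I know, yt-dlp only warns (and still exits successfully) when a video has no subtitles:

```shell
# Sketch: try the platform's own subtitles first; fall back to local transcription.
# fetch_subs_or_transcribe is a hypothetical name; yt_to_srt comes from the parent comment.
fetch_subs_or_transcribe() {
  local url="$1" output_base="$2" language="${3:-en}"

  # --skip-download fetches only subtitle files, written as "<base>.<lang>.vtt".
  yt-dlp --skip-download --write-subs --write-auto-subs \
    --sub-langs "$language" --sub-format vtt \
    -o "$output_base.%(ext)s" "$url"

  # Decide based on whether a subtitle file actually appeared.
  if [ -f "$output_base.$language.vtt" ]; then
    echo "using platform subtitles: $output_base.$language.vtt"
  else
    echo "no subtitles found, transcribing locally"
    yt_to_srt "$url" "$output_base" "$language"
  fi
}
```

Note the two paths produce differently named files (`talk.en.vtt` from yt-dlp, `talk.vtt` from whisper-cli), so rename one of them if downstream tooling expects a single name.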

by ramon156

5/21/2026 at 11:58:21 AM

To be frank I didn't know there's such an option :-)

by piotrrojek

5/21/2026 at 1:16:59 PM

In my experience Whisper is several orders of magnitude slower though.

by ranger_danger

5/21/2026 at 7:51:44 AM

Works extremely well. Command to install on Debian 13:

  sudo apt update && sudo apt install -y ffmpeg python3-pip python3-venv && \
    git clone https://github.com/kouhxp/yapsnap.git && cd yapsnap && \
    python3 -m venv ~/yapsnap-venv && source ~/yapsnap-venv/bin/activate && \
    pip install --upgrade pip && pip install .

On a 32GB ThinkPad X13, yapsnap processed a 21-minute YouTube video in under 2 minutes.

Very well done!

by throw98226

5/21/2026 at 1:50:44 PM

thank you!

by mrkn1

5/21/2026 at 8:26:47 AM

Maybe I'm a bit thick, but first we created this amazing way to transfer any text very cheaply and quickly over a network, then we (well, I think it was Meta and Google) decided that no, everything must be a video, then we added subtitles and AI transcriptions to those videos, and now we just download transcriptions of those videos, presumably to feed an LLM to make summaries of them, in order to… Read. Them.

I think I’m gonna go read a book.

by delis-thumbs-7e

5/21/2026 at 1:52:50 PM

Good point! I haven't found a faster way to consume info than reading. But it depends on the type of learner you are (visual, auditory, hands-on/interactive, etc.).

by mrkn1

5/20/2026 at 11:17:00 PM

So, this project consists of a ~175 line README and a ~500 line Python program that glues yt-dlp and Kroko together. Neat.

I guess if it encourages you to install and figure out how to use ffmpeg, yt-dlp, kroko, numpy, and onnx that's a good thing. Sometimes just knowing a thing is possible is a huge benefit.

by spudlyo

5/21/2026 at 12:14:51 AM

thank you. You nailed the actual value, that's right. The real win is just knowing you can do this on a laptop CPU, offline, no GPU or cloud bill. There are tiny done-for-you details, like rescaling token timestamps back to real time after the atempo speedup so --timestamps doesn't lie to you, but they are minor.
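To make the timestamp rescaling concrete: if the audio is sped up by a factor s before recognition, a timestamp t measured on the sped-up audio corresponds to t × s in the original. The helper below is a hypothetical illustration of that arithmetic, not yapsnap's actual code:

```shell
# Hypothetical helper: map a timestamp (seconds) measured on sped-up audio
# back to the original timeline. Illustrates the arithmetic only.
# usage: rescale_timestamp <seconds_in_sped_up_audio> <atempo_factor>
rescale_timestamp() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.3f\n", t * s }'
}

# A token at 40.0 s in audio sped up 1.5x was spoken at 60.0 s in the original.
rescale_timestamp 40.0 1.5   # prints 60.000
```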

by mrkn1

5/21/2026 at 3:22:42 AM

Why the choice of Kroko over something like parakeet-tdt-0.6b-v3, which is also faster than realtime on CPU?

by mscdex

5/21/2026 at 5:59:24 AM

Kroko models are more accurate and their size is just a hundred megabytes compared to parakeet (2.5 gigabytes in default fp32)
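The fp32 figure can be sanity-checked from the parameter count. A back-of-the-envelope sketch, assuming the "0.6b" in the Parakeet model name means roughly 0.6 billion parameters and counting weights only; `model_size_gb` is a hypothetical helper:

```shell
# Rough weight size: parameters x bytes per parameter (ignores file-format overhead).
# usage: model_size_gb <parameter_count> <bytes_per_parameter>
model_size_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f\n", p * b / 1000000000 }'
}

model_size_gb 600000000 4   # fp32: prints 2.40
model_size_gb 600000000 2   # fp16: prints 1.20
```

That lands close to the 2.5 GB quoted above, and shows how much a lower-precision export would shrink it.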

by nshm

5/21/2026 at 6:40:12 AM

Do you have a link to results confirming this? Kroko does not seem to be on the Open ASR Leaderboard. Parakeet has an average WER of 6.32 across several common datasets.
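For anyone unfamiliar with the metric: WER is (substitutions + deletions + insertions) divided by the number of words in the reference transcript, so the same model can score very differently depending on the dataset. A small sketch with made-up counts; `wer_percent` is a hypothetical helper:

```shell
# WER = (substitutions + deletions + insertions) / words in the reference transcript.
# usage: wer_percent <subs> <dels> <ins> <reference_words>
wer_percent() {
  awk -v s="$1" -v d="$2" -v i="$3" -v n="$4" \
    'BEGIN { printf "%.2f\n", 100 * (s + d + i) / n }'
}

# Illustrative counts only (not from any benchmark): 30 + 12 + 8 errors over 1000 words.
wer_percent 30 12 8 1000   # prints 5.00
```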

by mscdex

5/21/2026 at 3:17:41 PM

Kroko's website says benchmarks aren't formalized yet. FWIW, this URL says 5% WER for English [0], though it doesn't specify the dataset, so it's not directly comparable to Parakeet's 6.32 on the Open ASR Leaderboard.

Best way to judge is to try it on your own audio

[0] https://huggingface.co/hudaiapa88/sherpa-stt-onnx

by mrkn1

5/21/2026 at 12:51:52 AM

I see the value as a centralized anti-content-blocker.

This repo is now a good way to centralize hacks around the sure-to-come blockers those platforms will add to prevent download.

Just like uBlockOrigin was a way to centralize all the "just run this greasemonkey script" comments, I can see this getting a huge following for people who really value transcriptions.

by iririririr

5/21/2026 at 12:55:50 AM

I appreciate the perspective! higher ceiling than I'd put on it, but if it gets there awesome. PRs welcome!

by mrkn1

5/21/2026 at 3:38:05 PM

I thought ONNX models were only for text-to-speech? How does one tell them apart if I find some files online?

by majorchord

5/21/2026 at 3:39:35 PM

[dead]

by mrkn1

5/21/2026 at 11:24:39 AM

Very cool, I'm also working on a captioning/subtitling project for the lecture recordings for the university I work at.

My biggest challenge is finding a proper language model that is fast enough and accurate enough since I have to caption about 600 hours of video per week and I preferably want to run all of this on a tiny server (2 cores 4 GB memory). This tool could easily do that with the kroko model but I'll have to test if the accuracy is good enough.
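As a quick feasibility check on those numbers: a week has 7 × 24 = 168 hours, so keeping up with 600 hours of video per week means sustaining about 3.6x realtime around the clock, with no headroom for downtime. A tiny sketch of that arithmetic; `required_realtime_factor` is a hypothetical helper:

```shell
# Minimum sustained speed (as a multiple of realtime) to keep up with a weekly backlog,
# assuming the server transcribes 24/7.
# usage: required_realtime_factor <hours_of_audio_per_week>
required_realtime_factor() {
  awk -v h="$1" 'BEGIN { printf "%.2f\n", h / (7 * 24) }'
}

required_realtime_factor 600   # prints 3.57
```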

Also, in my own scripts I'm using ffmpeg to download just the audio of the videos that I want to caption, which saves a lot of bandwidth and speeds up the whole process. As far as I can see this tool doesn't do that. It would be a nice feature to add, plus an option to turn the output into a working .srt file.
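Roughly what that looks like with stock tools; this is a sketch with hypothetical function names, using plain ffmpeg and yt-dlp rather than anything yapsnap does internally:

```shell
# Sketch under assumptions: function names are hypothetical.

# Pull the audio of a direct media URL as 16 kHz mono WAV (-vn drops the video stream).
audio_from_url() {
  ffmpeg -nostdin -i "$1" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$2"
}

# For sites that need yt-dlp, -f bestaudio selects an audio-only format when one exists.
audio_from_site() {
  yt-dlp -f bestaudio -x --audio-format wav -o "$2.%(ext)s" "$1"
}

# Convert WebVTT captions to SubRip; ffmpeg infers both formats from the extensions.
vtt_to_srt() {
  ffmpeg -nostdin -i "$1" "${1%.vtt}.srt"
}
```

For example, `vtt_to_srt lecture.vtt` writes `lecture.srt` next to the input.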

by jorritpr

5/21/2026 at 1:53:37 PM

thanks! making a note of the feature request

by mrkn1

5/21/2026 at 12:53:34 AM

Had Claude test it out on 3 videos. Worked at 5-8x realtime. The beauty of it is that it works on all videos, not just the ones with transcripts. Combine it with YouTube search and LLM takeaways from transcripts, and you have super-efficient content consumption. There are SaaS products that charge 1 cent per video for those with transcripts. There is a viable product in here somewhere, methinks.

by niraj-agarwal

5/21/2026 at 12:56:56 AM

thanks for running it Niraj. I see something similar on my machine, which still surprises me every time lol

by mrkn1

5/21/2026 at 4:31:58 AM

Wouldn't it still be more efficient to do GPU transcription anyway? Is this something we could actually put the effectively useless NPUs in modern laptops to use for?

by HDBaseT

5/21/2026 at 6:03:13 AM

Yes, GPU is significantly faster, but CPU-only lets you do it anywhere: WASM in the browser, any server, etc.

NPUs are definitely a good use case for at least part of it. There are ports of Whisper that use Core ML/ANE with less power and 3x the speed of CPU-only.

by dharma1

5/21/2026 at 5:16:20 AM

Possibly, but you may want to use the GPUs for other things, or have under-utilized CPU-only servers lying around.

by KingMob

5/21/2026 at 1:16:18 PM

How is this so much faster than even GPU-based whisper?

by ranger_danger

5/21/2026 at 1:55:08 PM

small, ONNX-optimized models designed specifically for low-latency CPU streaming, so it avoids overhead of large transformer arch and GPU memory transfers

by mrkn1

5/21/2026 at 3:40:36 AM

Nice. Can it do speaker diarization?

by canadiantim

5/21/2026 at 1:56:38 PM

will work on it, that would be neat. I love pyannote but not happening on CPU at reasonable speeds lol

by mrkn1

5/21/2026 at 9:53:33 AM

This is very simple and very cool! Just installed it on my Hetzner box where I run a remote-controlled local agent, so now I can basically chat/email a video link to get a summary and/or ask questions. The only issue was YouTube's PO Token requirement (web/mweb clients refuse to serve formats from datacenter IPs without a valid Proof-of-Origin token). So I first had to find a client that still works without a PO Token. Thanks for sharing!
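In case it helps someone on a datacenter IP, here is a sketch of how one might probe for a usable client with plain yt-dlp. `first_working_client` is a hypothetical name, the client names you pass are up to you, and which ones work without a PO Token changes over time, so treat this as a starting point rather than a fix:

```shell
# Sketch: print the first yt-dlp YouTube client that can resolve an audio format.
# first_working_client is a hypothetical name; pass the client names you want to try.
first_working_client() {
  local url="$1" client
  shift
  for client in "$@"; do
    # --simulate resolves formats without downloading; it fails if no audio format is offered.
    if yt-dlp --simulate -f bestaudio \
      --extractor-args "youtube:player_client=$client" "$url" >/dev/null 2>&1; then
      printf '%s\n' "$client"
      return 0
    fi
  done
  return 1
}
```

Usage would be along the lines of `first_working_client "<url>" tv ios android`, then reusing the printed name in the same `--extractor-args` option for the real download.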

by 7777777phil

5/21/2026 at 1:56:03 PM

thank you! good use case, what hetzner box specs have you chosen?

by mrkn1

5/21/2026 at 5:23:56 AM

Now make it distinguish speakers and we really have something. As far as I know, that's significantly harder though.

by dmos62

5/21/2026 at 1:56:45 PM

in the roadmap!

by mrkn1

5/21/2026 at 1:38:00 AM

How can we transcribe other languages besides English?

by ranger_danger

5/21/2026 at 1:46:15 AM

Just download the model for your preferred language, all hosted on the Kroko-ASR collection here: https://huggingface.co/Banafo/Kroko-ASR/tree/main Right now you have Dutch, French, Portuguese, Spanish, German, Italian, Swedish, Swiss German, Hebrew, and Turkish. Grab the one that matches your audio, point yapsnap at it with --model (or set KROKO_MODEL), and you're set!

by mrkn1

5/21/2026 at 1:15:59 PM

Was hoping for CJK languages but I don't see any there. Thanks anyway

by ranger_danger

5/21/2026 at 12:20:49 AM

Most of these platforms already have transcriptions built in.

by charcircuit

5/21/2026 at 12:23:58 AM

YouTube has transcripts on most videos, not all. The others don't expose them. If you mean the "transcript APIs" for TikTok/IG/X, they are all transcribing audio like yapsnap does. If you have a way to pull native ones, let me know, genuinely curious.

by mrkn1

5/21/2026 at 12:54:45 AM

YouTube's is transcribing the audio too. The others do expose them as subtitles while the video is playing.

by charcircuit

5/21/2026 at 12:58:17 AM

Yes fair point, asr cached and exposed. I meant to draw the line more on fetchable or not.

by mrkn1

5/21/2026 at 6:04:47 PM

[flagged]

by photonair

5/20/2026 at 11:59:40 PM

[dead]

by xnx

5/21/2026 at 7:02:45 AM

[flagged]

by chris_explicare