Making deep learning go brrrr from first principles (2022)

5/23/2026 at 4:31:44 PM

This post is a classic! Also recommended: Horace also gave a related talk (covering the high-level picture of modern ML Systems) at Jane Street in Dec 2024 https://www.youtube.com/watch?v=139UPjoq7Kw

by ollin

5/23/2026 at 8:59:29 PM

One thing people seem not to acknowledge, and this post made super clear, is that NVIDIA kept their lead extremely well through a few years of very high growth. The TFLOPs, the bandwidth, and the interconnect mentioned in this post continue to grow at an exponential rate with no sign of stopping yet. This is a 30-year-old incumbent reminding you. The willingness to compete from NVIDIA is simply remarkable.

by liuliu

5/24/2026 at 11:19:32 AM

the real lesson: GPUs win on memory bandwidth not just FLOPs. batching ops keeps VRAM fed at 2TB/s instead of tripping to RAM at 50GB/s for every operation

by cold_harbor

5/23/2026 at 12:23:05 PM

> in the time that Python can perform a single FLOP, an A100 could have chewed through 9.75 million FLOPS

wild

by tosh

5/23/2026 at 1:07:51 PM

Why are we comparing a programming language and a GPU? This is a category error. Programming languages do not do any operations. They perform no FLOPs; they are the thing the FLOPs are performing.

"The i7-4770K can perform 20k more FLOPs than C++" is an equally sensible statement (i.e. not)

by patmorgan23

5/23/2026 at 2:36:36 PM

> Why are we comparing a programing language and a GPU.

You are taking the statement too literally and forgetting it's a figure of speech, specifically metonymy.

When the author says it's millions of flops faster in a gpu than in an interpreted programming language, it's not comparing them directly, but algorithms that run in them, so the substitution is the algorithms for the tools used to implement/run them.

It makes sense if you say "running similar logic -- like multiplying vectors and matrices -- on the CPU is millions of flops slower than on the GPU". There is no category error there.

by gchamonlive

5/23/2026 at 2:25:40 PM

the sentence is ambiguous because "Python" can mean python + a certain library and even a different Python implementation

but I find it illuminating to compare what a certain hardware can do in principle (what is possible) vs what I can "reach" as programmer within a certain system/setup

in this case NVIDIA A100 vs "Python" that does not reach an A100 (without the help of CUDA and PyTorch)

another analogy:

I find it useful to be able to compare what the fastest known way is to move a container from A to B using a certain vehicle (e.g. truck) and how that compares to how fast a person that can not drive that truck can do it + variants of it (on foot, using a cargo bike, using a boat via waterway, …)

I'm also interested in how much energy is needed, how much the hw costs and so on

Often there are many ways to do things, comparing is a great starting point for learning more

by tosh

5/23/2026 at 2:29:00 PM

related to the truck analogy: an advantage of the way slower Python approach is: it does not need a GPU

that said: Python can get to more FLOPs by changing the representation: https://docs.python.org/3/library/array.html
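A minimal sketch of what "changing the representation" means here, using only the standard library. A `list` stores pointers to separately allocated float objects, while `array("d")` stores raw 8-byte doubles in one contiguous buffer. The element count is arbitrary and the byte totals depend on the CPython build, so read the printed numbers as illustrative.

```python
import sys
from array import array

n = 1_000_000
boxed = [float(i) for i in range(n)]   # list of pointers to separate float objects
dense = array("d", boxed)              # one contiguous buffer of raw C doubles

# Bytes per element in the dense buffer is fixed by the type code.
bytes_per_dense_item = dense.itemsize

# Rough footprint of the boxed version: the pointer array plus one float object each.
boxed_bytes = sys.getsizeof(boxed) + n * sys.getsizeof(1.0)
dense_bytes = sys.getsizeof(dense)

print(f"dense: {dense_bytes / n:.1f} bytes/element")
print(f"boxed: {boxed_bytes / n:.1f} bytes/element")
```

This only shows the layout difference. Arithmetic on `array` elements written as a Python loop still goes through the interpreter, so the big speedups need a library that does the looping in compiled code.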

by tosh

5/23/2026 at 3:52:13 PM

> This is a category error.

Okay, but surely you know what they actually mean right, or are you being willfully obtuse? They are comparing CPython (the main python implementation)'s implementation that runs on the CPU with a kernel running on the GPU.

by smasher164

5/23/2026 at 4:14:28 PM

I’m not 100% sure, in context. Sorry for the big quote:

> Overhead is when your code is spending time doing anything that's not transferring tensors or computing things. For example, time spent in the Python interpreter? Overhead. Time spent in the PyTorch framework? Overhead. Time spent launching CUDA kernels (but not executing them)? Also... overhead.

> The primary reason overhead is such a pernicious problem is that modern GPUs are really fast. An A100 can perform 312 trillion floating point operations per second (312 TeraFLOPS). In comparison, Python is really slooooowwww. Benchmarking locally, Python can perform 32 million additions in one second.

> That means that in the time that Python can perform a single FLOP, an A100 could have chewed through 9.75 million FLOPS.

> Even worse, the Python interpreter isn't even the only source of overhead - frameworks like PyTorch also have many layers of dispatch before you get to your actual kernel. If you perform the same experiment with PyTorch, we can only get 280 thousand operations per second. Of course, tiny tensors aren't what PyTorch is built for, but... if you are using tiny tensors (such as in scientific computing), you might find PyTorch incredibly slow compared to C++.

Emphasis mine.

It’s all a bit jumbled up. I get that he was going for an informal tone and this isn’t exactly a benchmark. But I’m still not sure, based on the second emphasized part I think the “bad” measurements are coming from Python+PyTorch but with too-small workloads, and dispatching to CPU, maybe? But the first one looks like naive Python loops.
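For reference, the 9.75 million figure is just the ratio of the two numbers in the quoted passage. The sketch below redoes that division and adds a rough local measurement of a naive Python addition loop. The helper `local_adds_per_second` is hypothetical and is only a guess at what "benchmarking locally" might have looked like; the post does not show its benchmark.

```python
import timeit

A100_FLOPS = 312e12          # 312 TFLOP/s, figure quoted from the post
PYTHON_ADDS_PER_SEC = 32e6   # 32 million additions/s, figure quoted from the post

ratio = A100_FLOPS / PYTHON_ADDS_PER_SEC
print(f"A100 FLOPs per Python addition: {ratio:,.0f}")

def local_adds_per_second(n=1_000_000):
    """Time n float additions in a plain Python loop (hypothetical helper)."""
    def loop():
        x = 0.0
        for _ in range(n):
            x += 1.0
        return x
    seconds = min(timeit.repeat(loop, number=1, repeat=3))
    return n / seconds

print(f"local additions/s: {local_adds_per_second():,.0f}")
```

The local number will vary by machine and Python version, and it measures loop overhead as much as the addition itself, which is arguably the point the post is making.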

by bee_rider

5/23/2026 at 12:48:22 PM

This statement makes zero sense

by p1esk

5/23/2026 at 2:12:05 PM

re comments:

yes of course this is apples to oranges but that's kind of the point

it shows the vast span between specialized hardware throughput IFF you can use an A100 at its limit vs overhead of one of the most popular programming languages in use today that eventually does the "same thing" on a CPU

the interesting thing is why that is so

CPU vs GPU (latency vs throughput), boxing vs dense representation, interpreter overhead, scalar execution, layers upon layers, …

by tosh

5/23/2026 at 2:33:09 PM

A100 FP32 throughput “at its limit”: 19.5 TFLOP/s.

AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35GHz).
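The peak figure above is a straight product of the three numbers in parentheses. A small sketch of that arithmetic, with a hypothetical helper name, using only the values quoted in this comment:

```python
def peak_tflops(cores, flops_per_cycle_per_core, clock_ghz):
    """Theoretical peak = cores x FLOP/cycle/core x clock (GHz), in TFLOP/s."""
    gflops = cores * flops_per_cycle_per_core * clock_ghz
    return gflops / 1e3  # GFLOP/s -> TFLOP/s

epyc_9965_fp32 = peak_tflops(cores=192, flops_per_cycle_per_core=64, clock_ghz=3.35)
print(f"EPYC 9965 FP32 peak: {epyc_9965_fp32:.1f} TFLOP/s")
```

This is a ceiling, not a measurement: it assumes every core issues its full SIMD width every cycle at the quoted clock.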

by p1esk

5/23/2026 at 7:13:27 PM

EPYC 9965: 614GBps of 12-channel DDR5-6400

A100: 1935GBps of HBM2e

Most of those FLOPS are constrained by memory bandwidth.
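One way to make "constrained by memory bandwidth" concrete is to divide peak FLOPs by bandwidth, which gives the number of FLOPs a kernel must perform per byte loaded before compute becomes the limit. A rough sketch using only the figures quoted in this thread (the helper name is made up):

```python
def flops_per_byte(peak_tflops, bandwidth_gb_per_s):
    """Machine balance: FLOPs the chip can do in the time it takes to load one byte."""
    return (peak_tflops * 1e12) / (bandwidth_gb_per_s * 1e9)

# FP32 peak and memory bandwidth as quoted in the comments above.
epyc = flops_per_byte(peak_tflops=41.2, bandwidth_gb_per_s=614)
a100 = flops_per_byte(peak_tflops=19.5, bandwidth_gb_per_s=1935)

print(f"EPYC 9965: {epyc:.1f} FLOP/byte")
print(f"A100:      {a100:.1f} FLOP/byte")
```

On these numbers the EPYC needs roughly 67 FLOPs of work per byte to reach its peak, versus roughly 10 for the A100, so code with little data reuse gets much closer to peak on the GPU.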

by zzzoom

5/23/2026 at 9:39:26 PM

> Most of those FLOPS are constrained by memory bandwidth

I believe inference with large enough batch size is almost always compute bound, simply due to algorithmic complexity.

Each step of tiled matrix multiplication with square tiles of size N^2 takes O(N^2) memory loads and O(N^3) compute operations. With N = 32 or 64, you will likely saturate compute even on iGPUs with DDR4 or DDR5 memory pretending to be VRAM.
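A back-of-the-envelope version of that argument, assuming FP32 elements and counting only the two input tiles loaded per step (output writes and cache effects are ignored, so this is a simplification):

```python
def tile_step_intensity(n, bytes_per_element=4):
    """FLOPs per byte loaded for one step of a tiled matmul with n x n tiles."""
    loads = 2 * n * n        # one tile of A plus one tile of B
    flops = 2 * n ** 3       # n^3 multiply-adds, counted as 2 FLOPs each
    return flops / (loads * bytes_per_element)

for n in (32, 64):
    print(n, tile_step_intensity(n))
```

The intensity works out to N / 4 FLOP per byte for FP32, so it grows linearly with the tile size. Comparing it against a chip's FLOP-per-byte balance indicates which side of the compute/bandwidth line a given tile size lands on.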

by Const-me

5/24/2026 at 1:03:41 AM

Prefill (GEMM) is compute bound, decode (GEMV) is memory bound.

by zzzoom

5/24/2026 at 6:56:18 AM

> decode (GEMV) is memory bound

Decode with batch size 1 is GEMV. Batching makes the decode GEMM too.
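A toy model of why batching changes the picture: one weight matrix applied to a batch of activation vectors. The hidden size `d=4096` and 2-byte elements are assumptions picked for illustration, and attention/KV-cache traffic is left out entirely.

```python
def decode_intensity(batch, d=4096, bytes_per_element=2):
    """FLOPs per byte for y = x @ W, with x of shape (batch, d) and W of shape (d, d)."""
    flops = 2 * batch * d * d
    # Weights are read once per step; inputs and outputs scale with the batch.
    bytes_moved = bytes_per_element * (d * d + 2 * batch * d)
    return flops / bytes_moved

for batch in (1, 8, 64, 256):
    print(batch, round(decode_intensity(batch), 1))
```

At batch size 1 the intensity is just under 1 FLOP per byte (the GEMV case), and it rises almost linearly with batch size because the same weights are reused for every row of the batch.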

by Const-me

5/23/2026 at 2:46:17 PM

A100: 312 TFLOP/s for FP16

but it is very impressive how far modern CPUs get as well (also in smart phones!)

by tosh

5/23/2026 at 3:11:12 PM

Intel Xeon 6980P: 128 cores x 1024 FP16 FLOP/cycle/core x 3.2 GHz: 419 TFLOP/s

by p1esk

5/23/2026 at 4:08:25 PM

I'm not saying "GPU more brrt than CPU"

I found the comparison interesting

on Intel Xeon 6980P with 419 TFLOP/s it is still (maybe even more?) interesting to ask:

how much throughput can you reach with Python, Python with lib x, y, z, with C++ like this, with C++ like that etc etc and why?

no?

by tosh

5/23/2026 at 4:31:11 PM

No one in their right mind would use pure Python to do matrix multiplication. It’s like using a screwdriver to hammer nails into wood.

But this discussion is even more bizarre than comparing a screwdriver to a hammer, it’s like comparing a screwdriver to a nail.

by p1esk

5/23/2026 at 7:45:05 PM

That's also a CPU that came out four years later than the A100. The contemporaneous B200 is not optimized for FP32 and does 74.45 TFLOP/s. For FP16 it's at ~2 PFLOP/s.

by aesthesia

5/23/2026 at 9:07:29 PM

The point is that modern CPUs are not as slow as most DL people think. Roughly 10x slower but with a lot more memory.

by p1esk

5/23/2026 at 2:57:28 PM

Which, let's be honest, is probably still being orchestrated by Python somewhere.

Python is 9.75 million times faster than Python.

by itishappy

5/23/2026 at 3:01:44 PM

I was researching whether there was much benefit to using Rust or C++ over Python for AI, and it turns out the GPU doesn't care once the instructions are in, because it's an entirely different spec running on the GPU. The only thing you might save on is the "startup" cost of getting your code into the GPU, I guess? I assume that time cost is minuscule though; once it's all in memory, nobody cares that you spent any time "booting it up", any more than how long Windows takes these days.

by giancarlostoro

5/23/2026 at 3:53:45 PM

As long as you don't keep calling out to the CPU, that is.

Tool calling, searches, cache movement if used, and even debug steps all stall the GPU waiting for the CPU.

There was a test of turning one of the under-1B Qwen3+ models into a kernel that ran as one GPU pass without stalling on the CPU, and it saw quite a bit of perf lift over vLLM, I believe, showing this is still an issue.

It's been a month, so I don't remember more details than this.

by BillStrong

5/23/2026 at 5:17:13 PM

PyTorch dataloaders are often horribly inefficient; a lot of stuff there can benefit from Rust/C++

by jmalicki

5/23/2026 at 5:49:22 PM

you can port anything python is doing with a couple prompts into rust/c++, including parity validation. when the barrier to migrating is that thin, you are losing money and time even continuing to talk about it. python is miserably slow, so dont let it touch any part of your system. no snakes in the house.

by hashmap

5/23/2026 at 12:31:45 PM

Single core vs multi core accounts for much of this

by xyzsparetimexyz

5/23/2026 at 12:55:45 PM

Not really. The GPU's many cores, at least for fp32, give you 2 to 4 orders of magnitude compared to a high-speed CPU.

The rest will be from "python float" (e.g. not from numpy) to C, which already gives you 2 to 3 orders of magnitude, and then another 2 to 3 from plain C to optimized SIMD.

See e.g. https://github.com/Avafly/optimize-gemm for how you can get 2 to 3 orders of magnitude just from C.

by cdavid

5/23/2026 at 2:37:36 PM

Theoretical FP32 performance of AMD EPYC 9965 is double that of A100: 41.2 TFLOP/s vs 19.5 TFLOP/s

by p1esk

5/23/2026 at 10:15:44 PM

Isn't that because the A100 is optimizing for memory bandwidth per TF?

by fc417fc802

5/23/2026 at 8:04:02 PM

I'd want to see more about the failure modes. Production systems need graceful degradation more than optimal performance.

by xiaod

5/23/2026 at 5:42:58 PM

I feel like there is no portable advice for performance. A torch model exported as onnx is a different model.

That onnx model run using onnxruntime with cuda ep is a different model than the one run with TRT ep.

And even among the same runtime, depending on the target hardware and the memory available during tuning, the model behaves differently. It is a humongous mess

by ThouYS

5/24/2026 at 12:58:58 AM

That's interesting as I was considering GGUF --> ONNX conversions (via Olive), but if this creates unknown distortions in the effectiveness and stability, it might be a dead-end idea.

by mycall

5/23/2026 at 4:30:46 PM

How does x.cos().cos() work faster than doing two cos calls separately? Like the first cos call returns a tensor either way, the only difference is that it's not assigned to a variable. But how is it even possible to know that difference in Python?

by big-chungus4

5/24/2026 at 6:09:25 AM

The author forgot to add "fused" here, like they did in other parts of the same section.

Non-fused:

  foreach i
    y[i] = cos(x[i])
  foreach i
    z[i] = cos(y[i])

Fused, no intermediate variable:

  foreach i
    t = cos(x[i])
    z[i] = cos(t)

The temporary "t" doesn't leave the GPU. Sweeping the array twice makes you twice as dependent on memory bandwidth.
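The same two loops as a runnable Python sketch. It runs on the CPU and simply counts array reads and writes as a stand-in for global-memory traffic; the counter is an illustration device, not a model of real GPU behavior.

```python
import math

def unfused(x):
    """Two sweeps: the intermediate array y is written to memory and read back."""
    traffic = 0                      # counts array element reads and writes
    y = [0.0] * len(x)
    for i in range(len(x)):
        y[i] = math.cos(x[i])
        traffic += 2                 # read x[i], write y[i]
    z = [0.0] * len(x)
    for i in range(len(x)):
        z[i] = math.cos(y[i])
        traffic += 2                 # read y[i], write z[i]
    return z, traffic

def fused(x):
    """One sweep: the temporary t never touches an array."""
    traffic = 0
    z = [0.0] * len(x)
    for i in range(len(x)):
        t = math.cos(x[i])
        z[i] = math.cos(t)
        traffic += 2                 # read x[i], write z[i]
    return z, traffic

x = [i * 0.1 for i in range(1000)]
z_unfused, traffic_unfused = unfused(x)
z_fused, traffic_fused = fused(x)
print(traffic_unfused, traffic_fused)
```

Both versions produce identical results; the fused one does half the array traffic, which is the whole benefit when the operation is bandwidth bound.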

by teo_zero

5/23/2026 at 4:41:06 PM

It’s really not a concept you can express in idiomatic Python very easily. This comes from the actual generated assembly involving copies from global GPU memory into registers (slow, bandwidth saturates quickly) and back in between the cosines. If you can avoid the intermediate roundtrip that cuts the cost approximately in half.

by vrm

5/23/2026 at 4:51:41 PM

Yeah, that part should not be read literally; `x.cos().cos()` and `x1 = x.cos(); x2 = x1.cos()` both launch the same number of kernels (two in unfused/eager mode, one in fused/torch.compile, see this test notebook [1]). I think the author chained the two cos calls to symbolize the idea of combining them (without exposing the intermediate result), but chaining the two cos calls doesn't literally trigger operator fusion.

[1] https://colab.research.google.com/drive/13a4Y-ko6QLMPAhBz64c...
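To illustrate why chaining alone cannot trigger fusion in eager mode, here is a toy class that counts simulated kernel launches. `ToyTensor` is hypothetical and is not a PyTorch API; it only mimics the fact that each method call runs immediately and sees nothing but its own input.

```python
import math

class ToyTensor:
    """Hypothetical stand-in for an eager-mode tensor; not a PyTorch API."""
    launches = 0  # counts simulated kernel launches

    def __init__(self, values):
        self.values = list(values)

    def cos(self):
        # Each call executes right away, with no knowledge of what comes next.
        ToyTensor.launches += 1
        return ToyTensor(math.cos(v) for v in self.values)

x = ToyTensor([0.0, 1.0, 2.0])

ToyTensor.launches = 0
chained = x.cos().cos()
chained_launches = ToyTensor.launches

ToyTensor.launches = 0
x1 = x.cos()
x2 = x1.cos()
separate_launches = ToyTensor.launches

print(chained_launches, separate_launches)
```

Both spellings make the same two calls in the same order, so an eager runtime cannot tell them apart. Fusion needs something that sees both operations before running either, which is the role a tracing compiler plays.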

by ollin

5/23/2026 at 12:41:38 PM

Right now, all I know how to do is pull models from Hugging Face, but someday I want to build my own small LLM from scratch

by jdw64

5/23/2026 at 1:20:30 PM

If you aren't already aware, Karpathy has several videos that could get you there in a few hours https://www.youtube.com/@AndrejKarpathy

by kflansburg

5/23/2026 at 1:22:47 PM

very thanks!

by jdw64

5/23/2026 at 4:37:47 PM

Also check out his nanochat repo. I used the repo, claude and shadeform to train my own mini model for about $300. Would have been less but I screwed up and let the cloud gpu rental run for a few hours even though the training run errored out.

Of course the model was dumber than GPT2 but still it was a great learning experience.

by lancekey

5/23/2026 at 1:31:20 PM

If you want a written resource I have a blog post about the mathematics behind building a feed forward from scratch, https://max-amb.github.io/blog/the_maths_behind_the_mlp/. Kinda focuses on translation from individual components to matrix operations.

by max-amb

5/23/2026 at 1:16:41 PM

It’s just linear algebra. Work your way from feed forward to CNN to RNN to LSTM to attention then maybe a small inference engine. Karpathy’s llama2.c is only ~300 lines on the latter and it pragma simds so you don’t need fancy GPUs

by glouwbug

5/23/2026 at 4:32:23 PM

Needs 2022 in title

by axpy906

5/23/2026 at 5:03:12 PM

Deep learning is just glorified linear algebra. Master the progression: Feed-forward -> CNN -> RNN -> LSTM -> Attention. You don't even need a GPU to understand the climax; Karpathy’s llama2.c implements a full transformer inference engine in just ~300 lines of C using SIMD pragmas for CPU execution.

by marketingan

5/23/2026 at 5:40:52 PM

I wish more people pursued that approach to teaching neural networks.

First teach what the network does and why, writing it as a loopy, inference-only Python function. Explain training only in an abstract way, e.g. with the "take a random weight, twist it a little and see if the loss improves" algorithm. This lets you focus on the architecture and on why it is what it is.
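A minimal sketch of that "twist a random weight" training loop, on a made-up toy problem (fitting y = 3x + 1 with a two-parameter linear model). The data, step size, and iteration count are all arbitrary choices for illustration.

```python
import random

rng = random.Random(0)
xs = [i / 10 for i in range(-10, 11)]
ys = [3.0 * x + 1.0 for x in xs]            # toy data from y = 3x + 1

def loss(weights):
    """Mean squared error of the model w * x + b on the toy data."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

weights = [0.0, 0.0]
initial_loss = loss(weights)
best = initial_loss

for _ in range(5000):
    i = rng.randrange(len(weights))          # take a random weight
    old = weights[i]
    weights[i] = old + rng.gauss(0.0, 0.1)   # twist it a little
    new = loss(weights)
    if new < best:
        best = new                           # keep the change if the loss improved
    else:
        weights[i] = old                     # otherwise undo it

print(weights, best)
```

No derivatives appear anywhere, which is why it works as a first explanation. It also scales terribly with the number of weights, which is a natural lead-in to gradients.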

Then, teach the intuitions behind derivatives and gradient descent. You don't need the entirety of calculus, there's no benefit to knowing how a sequence or limit works if you only want to understand neural networks. With autograd, you won't be manually doing derivatives of weird functions either, so intuitive understanding is a lot more important than doing dozens of traditional calculus exercises on paper like it's the 1800s. You could probably explain the little bit of calculus you need in an hour or two, even to somebody with a 12-year-old's understanding of math and a good bit of programming knowledge.

Only when people understand the training and inference, implemented with loops and descriptive variable names, teach the tensor, explain how a modern CPU and GPU works (because many programmers still think a modern computer is just a much faster 6502), and then teach the tricks we use to make it fast.

by miki123211

5/23/2026 at 9:05:35 PM

I just assume that people who are going to do useful things in ML have a basic foundation in math and science. If you don’t know what a derivative is, what are we doing talking about multi-variable optimization?

And it’s not about gate-keeping it’s really about being able to reason about these concepts. What this looks like in programming is people memorizing a million clean code rules and not being able to write binary search.

by groundzeros2015

5/24/2026 at 5:35:10 PM

When you learn calculus, you learn three things: the intuitions behind the concepts, the formal definitions of those concepts, and the techniques to efficiently solve problems using these concepts without a computer; things like integration by parts or by substitution.

If what you want to understand is neural networks, even at a deep level, you need a very good intuitive grasp of what derivatives are (without necessarily understanding what a limit is, if you really want to show a definition, teach the infinitesimal). You also need to understand the rules of derivation, which you can relatively easily explain if you explain derivatives. You don't need other calculus concepts (like limits, sequences or integrals). You don't need the formal definitions. You don't need to solve large derivatives on paper, and you certainly don't need to be fast at it and be able to do it in a closed-book exam setting.

by miki123211

5/23/2026 at 10:09:00 PM

There's a wide gulf between knowing what a derivative is and proficiently working out the derivatives of arbitrary functions. The extent of understanding required for most applied ML is "rate of change".

by fc417fc802

5/23/2026 at 10:14:49 PM

Is it that wide though? For example, how do you explain why you cannot autograd through sampling (and thus you use either a reparameterization trick or Gumbel)? Sure, instead of relying on differentiability, you can intuitively explain it "the output changes only when you literally reach the next threshold, so all the way in between you don't really get a good direction", but how far are you going to take this?
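The threshold intuition can be shown numerically without any framework. The sketch below uses finite differences rather than autograd, holds the random draw fixed, and compares a discrete sample against a reparameterized Gaussian sample. All function names and constants are made up for illustration.

```python
def sample_bernoulli(p, u):
    """Inverse-CDF sample from Bernoulli(p) using a fixed uniform draw u."""
    return 1.0 if u < p else 0.0

def sample_gaussian(mu, sigma, eps):
    """Reparameterized sample: the noise eps is drawn independently of mu and sigma."""
    return mu + sigma * eps

def finite_diff(f, x, h=1e-4):
    """Central finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

u, eps = 0.8, 0.5   # noise held fixed, as it is within a single forward pass

# The discrete sample is a step function of p: flat unless p crosses u.
grad_discrete = finite_diff(lambda p: sample_bernoulli(p, u), 0.3)

# The reparameterized sample is a smooth function of mu.
grad_reparam = finite_diff(lambda mu: sample_gaussian(mu, 2.0, eps), 0.3)

print(grad_discrete, grad_reparam)
```

The discrete sample gives a slope of exactly zero at this point, while the reparameterized one gives a slope of about one with respect to `mu`, which is the usable signal the trick exists to recover.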

I agree with your general point, that we don't need insane levels of math, but I would say a college level of calculus, linalg and probability is baseline.

A basic benchmark off the top of my head:

Being able to pick up, without stumbling on the fundamentals:

- what LoRA is doing

- how a RBF-kernel SVM works

- why KL and reverse-KL are different

- why using mean squared error is equivalent to MLE on a gaussian

Not saying the four above pieces are all necessary, but that you should be able to learn them on demand without needing to revisit what a basis vector is.
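As an example of the level involved, the MSE/MLE item is a short derivation. The sketch below assumes a model f_theta, i.i.d. Gaussian noise, and a fixed variance sigma^2; the notation is chosen here for illustration and is not taken from the thread.

```latex
% Assumption: y_i = f_\theta(x_i) + \epsilon_i, with \epsilon_i \sim \mathcal{N}(0, \sigma^2) i.i.d. and \sigma fixed.
\begin{aligned}
-\log p(y_{1:n} \mid x_{1:n}; \theta)
  &= -\sum_{i=1}^{n} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left( -\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2} \right) \right] \\
  &= \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2
     + \frac{n}{2} \log(2\pi\sigma^2) \\
\arg\max_\theta \; p(y_{1:n} \mid x_{1:n}; \theta)
  &= \arg\min_\theta \; \frac{1}{n} \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2
\end{aligned}
```

The second term and the 1/(2 sigma^2) factor do not depend on theta, so maximizing the likelihood and minimizing the mean squared error select the same parameters.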

"Working out derivatives of arbitrary functions" is school level.

by porridgeraisin

5/23/2026 at 10:20:28 PM

Rate of change -> it is flat -> that is not a useful signal. I don't see the issue?

We aren't talking about doing cutting edge research, just educating people on the basics of how ML does what it does. I agree that the things you list should follow at some point in the sequence for any rigorous education. But it's a question of at what point those things should come up and what the corresponding depth of education is.

For the initial introduction I think everything you listed is entirely out of scope. You don't need any of that to get a basic MLP working using a for loop and naive gradient descent.

by fc417fc802

5/23/2026 at 11:39:25 PM

> For the initial introduction I think everything you listed is entirely out of scope.

Who are we giving an intro to who doesn’t have 2 years of stem education?

by groundzeros2015

5/24/2026 at 6:41:30 AM

> You don't need any of that to get a basic MLP working using a for loop and naive gradient descent.

Well sure. Your initial statement was about "most applied ML".

> Rate of change -> it is flat -> that is not a useful signal. I don't see the issue?

It's not going to be zero if you sample in your practicum setting. You're gonna get `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`. So yeah.

by porridgeraisin

5/23/2026 at 10:12:03 PM

So you created a new account to blatantly plagiarize another comment from this same page? What's even going on here?

by fc417fc802

5/23/2026 at 10:37:35 PM

[flagged]

by hottrends

5/23/2026 at 12:24:23 PM

>For example, getting good performance on a dataset with deep learning also involves a lot of guesswork. But, if your training loss is way lower than your test loss, you're in the "overfitting" regime, and you're wasting your time if you try to increase the capacity of your model.

https://arxiv.org/abs/1912.02292

by noosphr

5/23/2026 at 12:33:50 PM

Generally, posting a link-only reply without further elaboration comes across as a bit rude. Are you providing support for the above point? Refuting it? You felt compelled to comment, a few words to indicate what you’re actually trying to say would go a long way.

by appplication

5/23/2026 at 12:38:23 PM

>We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better.

by noosphr

5/23/2026 at 1:43:30 PM

Right, isn't double descent one of the reasons why modern Extremely Large Language Models work at all? I think I heard somewhere that basically all today's "smart" (reasoning, solving math problems, etc) LLMs are trained in the "double descent" territory (whatever this means, I'm not entirely sure).

by ForceBru

5/23/2026 at 4:25:17 PM

No, there are more training tokens than parameters in LLMs. They are in the classical first descent setting.

by mxwsn

5/23/2026 at 2:11:06 PM

No, double descent is a symptom of whatever it is that makes the deep models work at all. It's just the name for something you see happen when it works. The reason it works has something to do with how all those extra dimensions work as a regularisation term in the fit.

by SiempreViernes

5/23/2026 at 2:53:45 PM

Does this mean that if your model is "overfitting", the solution is to train for even more epochs?

by smallerize

5/23/2026 at 10:13:18 PM

Maybe. Just means that the conventional wisdom was wrong and substantially over training can be a good thing. No one I knew at the time suspected that, including the people who wrote the paper.

by noosphr