Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

5/29/2026 at 10:36:49 AM

This looks very interesting. Possible to get those rates without exotic hardware.

But I have to say that the comparison is not really fair. It pits a 2B model against frontier models that are likely hundreds of times larger. Also, Taalas, with their 15,000 tok/s inference, is suspiciously missing from the comparison.

We need to see the comparison with this framework and useful models, which at present seems to mean ~30 B.

by mungoman2

5/29/2026 at 10:59:35 AM

Great points.

We strove to be as fair as possible in the benchmark, but it's indeed not perfect. Taalas should have been added in the dedicated hardware section, even though they use 3-bit quantization while we are on FP16 (to be fair in both directions) and they burn the model directly onto the card.

Our tech preview is about the speed (hence the small dense model, it was easier to implement).

The math checks out, though, to support large frontier MoE models at similar speeds:

- At batch size 1, GPT-OSS-120B has 5.1B active parameters. In FP8, that is in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).

- DeepSeek V4 Flash has 13B in mixed FP4/FP8, so let's say ballpark around 3x bigger than 4 GB. In theory we could reach >1,000 tok/s on it with MI300X/H200 and up to 4k on next-generation GPUs.
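The sizing argument above can be reproduced with a few lines of arithmetic. This is a rough sketch, not Kog's actual model: it assumes batch-1 decode is purely memory-bound (tokens/s inversely proportional to the weight bytes streamed per token), takes the 3k tokens/s from the thread title as the baseline for the 2B FP16 model, and uses the "around 3x bigger than 4 GB" ballpark for DeepSeek V4 Flash. The function names are made up for illustration.

```python
def weight_gb(active_params_billion: float, bytes_per_param: float) -> float:
    """Weight bytes (in GB) streamed per decoded token at batch size 1."""
    return active_params_billion * bytes_per_param


def scaled_tok_per_s(baseline_tok_per_s: float, baseline_gb: float, target_gb: float) -> float:
    """Assumption: speed is inversely proportional to bytes streamed per token."""
    return baseline_tok_per_s * baseline_gb / target_gb


BASELINE_TOK_PER_S = 3000.0        # headline figure for the 2B model (from the title)
kog_2b_gb = weight_gb(2.0, 2.0)    # 2B params in FP16 (2 bytes each)
gpt_oss_gb = weight_gb(5.1, 1.0)   # 5.1B active params in FP8 (1 byte each)
flash_gb = 3 * kog_2b_gb           # "around 3x bigger than 4 GB" ballpark

if __name__ == "__main__":
    for name, gb in [("GPT-OSS-120B", gpt_oss_gb), ("DeepSeek V4 Flash", flash_gb)]:
        est = scaled_tok_per_s(BASELINE_TOK_PER_S, kog_2b_gb, gb)
        print(f"{name}: {gb:.1f} GB per token, ~{est:.0f} tok/s under this assumption")
```

Under these assumptions the Flash estimate lands around 1,000 tok/s on current hardware, the same ballpark as the figure quoted above. Real speeds would also depend on fixed per-token overheads that this sketch ignores.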

Check out the math at the end of our blog post:

https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...

by gaeld

5/29/2026 at 1:04:36 PM

Your playground/write-up is very interesting, and I would be really interested to see something like the DeepSeek V4 Flash model (49B) running, as you are suggesting.

I haven't read the article yet, though I will try to, but I want to ask: can this approach work for trillion-parameter or other very large models as well, or is there some wall that makes it valuable only for smaller models?

That being said, it's still really incredible: these small models are getting really good for many use cases, and speed is becoming their bottleneck. With greater speeds on consumer hardware, I think it's gonna be amazing work!

by Imustaskforhelp

5/29/2026 at 3:48:20 PM

Consumer inference scenarios tend to be highly bespoke so it's difficult to apply a monokernel approach based on deep manual optimization. I suppose this could become applicable to rare scenarios where both the model and the hardware are fixed and self-contained, e.g. I'm running Apple's AI model on the latest Apple Silicon hardware. Then this becomes a viable approach even for 'consumer' use.

The authors' approach also encompasses multi-node approaches that won't apply easily to consumer inference since consumer GPUs have very low-performance interconnects, hence why layer parallelism is usually favored. (But that doesn't work very well with the monokernel approach, since it involves running distinct logic on each separate GPU. It also doesn't speed up single inference, though you can get that throughput back by pipelining small minibatches.)

by zozbot234

5/29/2026 at 5:47:49 PM

scenarios where both the model and the hardware are fixed and self-contained

That's basically antirez's DS4 and it works pretty well because there are few leading models and few hardware platforms (Apple, GB10, Strix Halo) that are worth using.

by wmf

5/29/2026 at 1:38:15 PM

Thanks for the comment and the question!

The last section of the article lays out the scaling laws that apply when porting this approach to another model. In a nutshell, DeepSeek V4 Pro with 49B active params is close to the upper bound.

Also worth noting that our results are currently for standard datacenter GPUs. On consumer hardware, though the same low-level optimization approach applies, the bandwidth limitations will cap the achievable speed.

by gaeld

5/29/2026 at 10:56:17 AM

They got 1K tok/s with Deepseek v4 Pro. That's kinda cool..

by kirtivr

5/29/2026 at 8:01:45 PM

No they didn't, they predict they'll get that much. Also worth noting the prediction assumes running at MXFP4/FP8 quantization.

by gbnwl

5/29/2026 at 11:00:25 AM

Thanks. To be fair, this number is what we expect to get once we port DeepSeek V4 to our engine on the upcoming generation of GPUs!

by gaeld

5/30/2026 at 12:05:21 PM

> single-request decode speed is now the metric that matters

Benched at 96 input tokens, 4000 output tokens.

by Terretta

5/29/2026 at 12:19:35 PM

Fallacies look interesting? As if we weren't getting dubious claims every day?

by hirako2000

5/29/2026 at 10:52:03 AM

likely the small model lets whatever fuzzer they designed to poke the GPUs find optimizations much faster.

they seem to think it scales up because they're shortening the stack.

by cyanydeez

5/29/2026 at 11:09:04 AM

Follow-up reading for the most technical and research-minded people here:

Monokernel deep dive (GPU Engineering): http://blog.kog.ai/building-a-single-kernel-latency-optimize...

Delayed Tensor Parallelism (research): http://blog.kog.ai/delayed-tensor-parallelism-for-faster-tra...

To try the speed on the playground: http://playground.kog.ai

by gaeld

5/29/2026 at 2:46:22 PM

It looks like DTP is a distinct architectural choice that would require training new models accordingly? This wouldn't be able to just run inference for existing models.

by zozbot234

5/29/2026 at 3:06:07 PM

Totally, though DTP is not required for this kind of speed. Standard TP also works.

DTP is something we built for our roadmap in order to get to extremely high speeds (like 10k+ tokens/s). When the budget is under 10 µs per layer, any little overhead matters.

For 1k to 5k tokens/s, regular TP still works because we are able to optimize the inter-GPU all-reduce collectives to under 3 µs, which lets us keep streaming model weights into shared memory, registers and caches while the GPUs exchange data.
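To see why a few microseconds matter at these speeds, here is a small per-layer time budget calculation. It is only a sketch under stated assumptions: the layer count (`N_LAYERS = 20`) is hypothetical, two all-reduces per layer is the usual count for standard tensor parallelism rather than something confirmed for this engine, and the 3 µs figure is the upper bound quoted above.

```python
def per_token_budget_us(tokens_per_s: float) -> float:
    """Wall-clock time available per generated token, in microseconds."""
    return 1e6 / tokens_per_s


def per_layer_budget_us(tokens_per_s: float, n_layers: int) -> float:
    """Time available per transformer layer if layers run back to back."""
    return per_token_budget_us(tokens_per_s) / n_layers


def allreduce_share(tokens_per_s: float, n_layers: int,
                    allreduce_us: float = 3.0, allreduces_per_layer: int = 2) -> float:
    """Fraction of the per-layer budget taken by all-reduces if NOT overlapped."""
    comm_us = allreduce_us * allreduces_per_layer
    return comm_us / per_layer_budget_us(tokens_per_s, n_layers)


N_LAYERS = 20  # hypothetical layer count, for illustration only

if __name__ == "__main__":
    for speed in (1_000, 5_000, 10_000):
        budget = per_layer_budget_us(speed, N_LAYERS)
        share = allreduce_share(speed, N_LAYERS)
        print(f"{speed} tok/s: {budget:.1f} us per layer, all-reduce share {share:.0%}")
```

With these made-up numbers, the communication cost is a small slice of the budget at 1k tokens/s but exceeds the whole per-layer budget at 10k tokens/s unless it is hidden behind weight streaming or removed from the critical path.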

by gaeld

5/31/2026 at 7:04:29 AM

> Test the speed in our live coding playground: playground.kog.ai

> Dsatur in Haskell

  #include <iostream>
  #include <vector>
  #include <algorithm>
  using namespace std;
  int main() {
    int n;
    cin >> n;
    vector<int> v(n);
    for (int i = 0; i < n; i++) cin >> v[i];
    sort(v.begin(), v.end());
    int i = 0, j = n - 1;
    while (i < j) {
      while (i < j && v[i] == v[j]) i++;
      while (i < j && v[j] == v[i]) j--;
      cout << v[i] << " ";
      i++;
      j--;
    }
    return 0;
  }

Haha. It was fast though.

by mrkeen

5/29/2026 at 10:56:30 AM

When I read "Standard GPUs" in the title I got excited for a second then I read the article itself..

by 0-bad-sectors

5/29/2026 at 11:09:29 AM

Yeah, it should have been "Datacenter GPUs" or "Nvidia and AMD GPUs".

by roosgit

5/29/2026 at 10:58:51 AM

what did you have in mind when you read "Standard GPUs"?

by Oras

5/29/2026 at 1:14:18 PM

The GPU in my desktop. (A normal-ish decent gaming machine that runs LLMs and txt2img well enough.)

In contrast, not enterprise GPUs that cost as much as a car.

by yjftsjthsd-h

5/29/2026 at 11:04:03 AM

I guess you were thinking of consumer GPUs. We are indeed talking about standard datacenter GPUs.

by gaeld

5/29/2026 at 2:10:25 PM

What a lot of us on here are salivating for is the ability to run these on prosumer hardware at home. So we tend to jump to the conclusion that "standard" means "consumer-grade" because that's what we want to see. Still, very cool work!

by deflator

5/29/2026 at 3:15:05 PM

thank you deflator, I understand this now! much appreciated

by gaeld

5/30/2026 at 2:59:54 AM

A consumer "Standard GPU" could mean about a 6-8gb VRAM GPU still in support by the manufacturer, independent of CUDA/etc proprietary technology.

Recent Steam hardware survey top GPU list is:

- RTX 3060 (6 or 12gb VRAM)

- RTX 4060 (8 or 16gb)

- RTX 3050 (6 or 8gb)

- RTX 5070 (12gb)

- RTX 5060 (8gb)

- GTX 1650 (4gb!)

That list only covers about 22% of survey respondents but sets a 6-8gb VRAM baseline for consumer GPUs.

Can this run on an RX 570 8gb from 2017? Maybe that's a ways back. A 1660 6gb from 2019? Intel? They had a decent budget run in recent years.

https://store.steampowered.com/hwsurvey/videocard/

by selicos

5/29/2026 at 3:20:21 PM

How would you classify a datacenter GPU as standard/non-standard? That doesn't seem to be a meaningful distinction. It's click bait.

by nightski

5/29/2026 at 4:49:53 PM

The blog makes it clear that "standard" GPU here is in opposition to purpose-built hardware like Cerebras. The selling point is reaching the same order of magnitude in generative speed as those approaches.

by averne_

5/29/2026 at 8:21:06 PM

Certainly not 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200

by felooboolooomba

5/29/2026 at 2:13:33 PM

You know, Radeon 9800 pro ago

by bcjdjsndon

5/29/2026 at 2:45:18 PM

This is very cool.

I have been lamenting for a while that the memory-bandwidth <-> tps relationship was pretty much working for small models on consumer cards, but not at all on datacenter hardware.

It's great to see that with proper care on the inference engine implementation the relationship can be restored.

by stymaar

5/29/2026 at 10:41:49 AM

> Standard GPUs

> 8× NVIDIA H200

by 867-5309

5/29/2026 at 10:59:54 AM

As in, not custom chips like Groq and Cerebras. Did you expect a single GPU chip to reach 3k tps?

by Oras

5/29/2026 at 11:10:04 AM

I think many would assume "not enterprise" or "not datacenter grade" when someone says "Standard GPUs", but maybe that specific phrase has a specific meaning I'm not familiar with.

Edit: I just tried a 4B model on a RTX Pro 6000, getting ~500 tok/s with llama.cpp not even trying to optimize or change anything, just default settings. I'm sure with vLLM it'd be a lot faster already, still before manually tuning configs. I wouldn't call that card "Standard GPU" either FWIW, but it makes the claimed performance numbers feel not as exciting, especially given the hardware they were using.

by embedding-shape

5/29/2026 at 11:13:00 AM

I expected a 4090, maybe 2. I did not expect 8xH200 for a 2B model.

by ismailmaj

5/29/2026 at 11:45:45 AM

Great points, let me clarify:

- model size: 2B is just for this preview (it was faster to implement), our article explains how we expect to support large frontier MoE at 1,000 to 5,000 tokens/s

- reaching 500 tok/s, or even up to ~1,000 tok/s, on a consumer GPU card is possible with existing inference engines like vLLM. But there is a ceiling.

The hard part comes when you try to be faster than that: these frameworks won't scale higher just by adding GPUs or using faster GPUs. There is a "glass ceiling" due to microseconds lost everywhere in the stack (grid syncs, inter-GPU comms, kernel launches, CPU sampling, etc.).

All our work at Kog is about removing these bottlenecks.
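The "glass ceiling" argument can be illustrated with a toy latency model: time per token is the time to stream the weights plus a fixed overhead that more bandwidth does not shrink. This is a simplified sketch, not a description of any real engine; the per-GPU bandwidth (4,000 GB/s) and the 500 µs of accumulated overhead are invented numbers, and the model assumes bandwidth aggregates perfectly across GPUs.

```python
def tok_per_s(weight_gb: float, n_gpus: int,
              gb_per_s_per_gpu: float, overhead_us: float) -> float:
    """Toy model: per-token time = weight streaming time + fixed overhead."""
    stream_s = weight_gb / (n_gpus * gb_per_s_per_gpu)
    return 1.0 / (stream_s + overhead_us * 1e-6)


WEIGHT_GB = 4.0          # 2B model in FP16, as in the thread
GB_PER_S = 4000.0        # illustrative per-GPU memory bandwidth (assumption)
OVERHEAD_US = 500.0      # illustrative syncs + launches + sampling (assumption)

if __name__ == "__main__":
    for gpus in (1, 2, 4, 8, 64):
        ideal = tok_per_s(WEIGHT_GB, gpus, GB_PER_S, 0.0)
        real = tok_per_s(WEIGHT_GB, gpus, GB_PER_S, OVERHEAD_US)
        print(f"{gpus:>2} GPUs: ideal {ideal:8.0f} tok/s, with overhead {real:6.0f} tok/s")
```

In this model the achievable speed can never exceed `1 / overhead` (2,000 tok/s for the made-up 500 µs), no matter how many GPUs are added, which is why shaving the fixed costs matters more than adding hardware past a certain point.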

by gaeld

5/29/2026 at 5:15:44 PM

Thank you for explaining. Do you think there are still opportunities for stack optimizations to meaningfully speed up inference on single consumer-grade GPUs?

by dr_kiszonka

5/29/2026 at 6:13:32 PM

I'm sure there are, and I really hope we can work on consumer-grade GPUs at some point.

It should be possible to apply the same methodology (digging deep into the hardware details to understand all its little characteristics, and rethinking the inference stack around that).

by gaeld

5/29/2026 at 2:12:46 PM

That doesn't clarify anything lol. It's a bit click baity.

by bcjdjsndon

5/29/2026 at 2:09:44 PM

> Did you expect a single GPU chip to reach 3k tps?

Did the article headline not say Standard GPU?

by bcjdjsndon

5/29/2026 at 11:56:42 AM

so what would be the above-standard GPUs then that they are excluding? Cerebras is not GPU

by WithinReason

5/29/2026 at 10:48:27 AM

Everyone beholden to a data center or subject to the installation on the corner of your property of course. Keep up with the times... /s

by imputation

5/29/2026 at 2:32:13 PM

Don't miss trying their demo: https://playground.kog.ai/

Feels like a preview of the future

by rashkov

5/29/2026 at 4:17:52 PM

You can get something pretty fast right now with a Cerebras Coder subscription, sadly I think the best model they had last I checked was the somewhat dated GLM 4.7: https://inference-docs.cerebras.ai/models/overview

I feel like if they got DeepSeek V4 Flash and Pro running on their hardware, even if at less than 1000 tok/s, they’d still be crushing it with any subscription they’d provide, given how generous their token limits were.

by KronisLV

5/29/2026 at 4:04:06 PM

As for the demo it's fast and extremely dumb like expected for 2B. I asked how to stop drinking habit and in just one follow-up message it recommended trying 8% ABV. Hilarious.

by __natty__

5/29/2026 at 4:06:22 PM

it's also a coding model

by gaeld

5/29/2026 at 4:36:38 PM

Nah. it says it can't even write python code

by nethi

5/29/2026 at 4:47:05 PM

I tried with some simple prompts (fibonacci, linked list manipulation) and it worked nicely.

by averne_

5/29/2026 at 4:44:18 PM

https://chatjimmy.ai/ from Taalas also feels like that.

by WASDx

5/29/2026 at 10:36:44 AM

Could be amazing, but it's hard to judge if it will really work with say a 27 B model or larger. We can already get pretty good speed with a 2B model.

by ilaksh

5/29/2026 at 11:03:03 AM

thanks! we explain how it scales to larger models in the last section of the OP blog post

by gaeld

5/29/2026 at 2:14:12 PM

Shame you stopped short of actually benchmarking that scale though, eh?

by bcjdjsndon

5/29/2026 at 3:10:22 PM

will do - we are a small team and it takes time to implement and optimize a new model, whatever the size.

by gaeld

5/29/2026 at 9:10:55 PM

Oh

by bcjdjsndon

5/29/2026 at 9:03:05 PM

You don't even need to train the model just to see if you can infer it at the claimed speed

by lostmsu

5/29/2026 at 9:24:33 PM

True, and for third-party models we'll just re-use their public open weights.

There is a time-consuming part, though, that is performed manually by our (human) team: implement the logic of the model in C++ and assembly code in a super-optimized way, co-designed for each specific hardware card.

This can take months.

We hope to accelerate the process with AI agents, but we're not there yet.

by gaeld

5/29/2026 at 5:44:17 PM

For me it's 3.4k tok/s of pure nonsense. The model is bad: you tell it it's wrong, it acknowledges it's wrong, and then it repeats the same nonsense. It reminds me of my nephew, though. Ask it something like: "I want to play the guitar on the surface of the Moon. What speakers do you suggest?" and then "But the Moon has no atmosphere, how will the sound travel?".

by blindr

5/29/2026 at 6:10:36 PM

Note that this coding model is trained on programming use cases, and is also not tuned for multi-turn chat.

You can ask it to implement an algorithm; we provide suggested prompts you can test.

Also, this tech preview is really about the speed of the inference engine (not the model itself) so I'm glad you got 3.4k tok/s!

by gaeld

5/30/2026 at 11:04:17 AM

That's what I tested first and the model failed, even after suggestions that it was wrong and how to fix its errors.

by blindr

5/29/2026 at 11:03:43 AM

An NVIDIA H200 is not a standard GPU. Eight of them in a box with a CPU and RAM cost close to the same as a house.

I am 100% all about using local models instead of sending someone else all my data and paying for the privilege of doing so, this article is misleading.

I can get a 27b model to kick out 40 tok/s on 16 gb vram. This is the area ripe for development.

If you can’t connect a monitor, it isn’t a standard GPU, at least not in the way people have spoken about GPUs until a few years ago.

by irishcoffee

5/29/2026 at 11:05:13 AM

I guess you were thinking of consumer GPUs. We are indeed talking about standard datacenter GPUs.

Sorry for the confusion

by gaeld

5/29/2026 at 11:27:54 AM

Do you think maybe changing your article's title from "Real-time LLM Inference on Standard GPUs" to "Real-time LLM Inference on Standard Datacenter GPUs" might make sense here? More people seem confused by the title than not, and you could clear this up relatively easily, at least on your website, although it might be too late to fix the HN title.

by embedding-shape

5/29/2026 at 11:37:35 AM

YES - I just updated the title of our article according to your suggestion.

by gaeld

5/29/2026 at 11:35:46 AM

Oh, it isn't confusing, it is misleading. A standard GPU lets you connect a monitor. A datacenter GPU lets you do headless math.

by irishcoffee

5/29/2026 at 11:38:09 AM

I updated the article title accordingly

by gaeld

5/29/2026 at 2:15:36 PM

Standard != Datacentre

by bcjdjsndon

5/30/2026 at 3:13:35 PM

This blog post clearly targets VCs, but what they are doing is legit and can improve the performance of local models on low-end hardware as well, especially since their priority is to optimize non-batched inference.

by dandanua

5/29/2026 at 2:08:56 PM

H200 isn't a standard GPU at all

by bcjdjsndon

5/29/2026 at 3:12:18 PM

I think they accidentally left out “standard data-center GPUs” from the title. That probably needs fixing. My “standard” GPU is still a 3090

by infocollector

5/29/2026 at 11:56:45 AM

Looks super promising! A couple of questions:

For new open weights models, will you need to adapt model code and optimization for your inference engine by hand?

It's true that BS=1 is king when it comes to agentic workflows; however, these kinds of systems serve multiple requests concurrently with dynamic batching. Do you think it will scale as well?

Any plans to release it open source?

Congratz again for the release

by CastFX

5/29/2026 at 12:53:03 PM

Thanks a lot! Much appreciated.

To answer your questions:

- yes, we rewrite the whole model code (while keeping the same logic) in CUDA/HIP and assembly, in order to optimize by hand for each GPU type. It's quite tedious for sure, but I guess this is the price to pay to get these kinds of results.

- the batching question is a great one. In agentic systems, there is probably a trade-off between sequential thinking/iterations vs parallel exploration of multiple solutions. Also, there could just be multiple independent tasks running in parallel, depending on the use case.

We plan to support a small amount of batching, but it quickly becomes a trade-off vs speed. Pick one for your use case, I guess.

Also to consider: because we answer requests much faster, we are also able to process lots of them without needing high batches - and scaling on multiple nodes is possible.

- open sourcing: maybe, maybe not. I'm still undecided on this. We are a small startup and I'm told that giving our IP away might be shooting ourselves in the foot. On the other hand, I think it could be of great benefit to the community and to us... we'll see

by gaeld

5/29/2026 at 2:30:17 PM

Congrats gaeld and team

The demo is very impressive!

disclaimer: I've known the founder for a while, as legitimate as it gets in deep tech, real years of research and engineering behind this, not vaporware

by cataflam

5/29/2026 at 12:01:07 PM

Making these claims on a 2B parameter model seems a bit like seeing linear scalability from 1 to 4 cores and then assuming 256 cores will give you a 256x speedup. Or demonstrating massive improvement on datasets that fit in cache and then assuming the same improvements will be present on problem sizes that span the memory of multiple machines. Something tells me that scaling to larger models will be more difficult than assumed.

by robmccoll

5/29/2026 at 12:11:48 PM

Yeah, I agree: I'm actually not expecting it to be easy, and there will certainly be several unknown unknowns we'll discover along the way.

Our process has been, and will continue to be, a sequence of (tedious) R&D experiments where the GPU never behaves as expected when pushed to its limits in ways no-one really tested before (I still have nightmares of the L3 cache cross-IOD bottlenecks on MI300X).

IMHO, we did solve the multi-GPU memory bandwidth scaling problem, and thus the linear scaling of the size of the model towards infinity. But the main difficulties will come from keeping the speed, with steady and continuous memory streaming, while implementing the much more complex architecture of modern frontier MoEs (attention compression tricks, hash layers, routing logic, etc.)

by gaeld

5/29/2026 at 5:10:45 PM

Do you think the work will still apply to speculative/alternative decoding methods like MTP and block diffusion, which are making batch=1 decoding less memory bound? Kernel launch overhead and memory transfer become less and less significant as a % of time when computing multiple tokens at once.

by joefourier

5/29/2026 at 5:32:32 PM

Why not, it's one way to look at it! Although I have yet to see other work with speculative decoding go higher than ~1,000 tokens/s, because the other bottlenecks start to matter at that point, and they need to be solved to go further.

Our view is that MTP / speculative decoding could help get an X multiplier (X = 2 to 6) on the tokens-per-second speed we currently achieve.

We are a bit greedy, we want to stack optimizations on top of each other to get the maximum speed possible.

It involves additional compute to verify the predicted tokens during the forward pass (it's like a small batch), which should be totally doable for dense models, and will be more tricky for MoEs because it could mean activating more experts and thus more active parameters.
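For intuition on where a 2x to 6x multiplier could come from, the sketch below uses the standard textbook estimate for speculative decoding: if each drafted token is accepted independently with probability `acceptance_rate`, and `draft_len` tokens are drafted per step, the expected number of tokens produced per verification pass is `(1 - a^(k+1)) / (1 - a)`. This is the generic formula, not Kog's method, and it ignores the cost of drafting and of the wider verification pass; both parameter names and values are illustrative.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass under i.i.d. acceptance."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token accepted, plus the one token the verifier adds.
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    for rate in (0.6, 0.8, 0.9):
        for k in (2, 4, 8):
            gain = expected_tokens_per_step(rate, k)
            print(f"acceptance={rate:.1f} draft_len={k}: ~{gain:.2f} tokens per pass")
```

The upper end of the range needs both a high acceptance rate and a long draft, which is also where the verification pass starts to look like a small batch and, for MoEs, may activate more experts.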

by gaeld

5/30/2026 at 12:43:23 AM

I know all is relative but when I think of > 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200

I still find it mind boggling. That's a lot of compute power and still considered "low end" for the purpose it serves.

by BiraIgnacio

5/29/2026 at 4:00:30 PM

Huh, interesting. Some parts of this do generalize even to an RTX 6000 Pro Blackwell, I imagine, though we're going to be solidly bottlenecked then on inter-card throughput through the PCIe interface.

by arjie

5/29/2026 at 4:00:36 PM

An article with a title saying tokens per second throughput without any qualifier e.g. what size the model is should immediately be classified as spam.

by dchftcs

5/29/2026 at 10:38:43 AM

I feel the comparison to Groq is unfair. They're running much larger models (orders of magnitude) and still reaching competitive speeds.

by LoganDark

5/29/2026 at 10:52:20 AM

Fair point - this tech preview is about the speed (hence the small dense model, it was easier to implement).

The math checks out though to allow support for large frontier MoE models at similar speeds.

At batch size 1, GPT-OSS-120B has 5.1B active parameters - in FP8, it's in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).

DeepSeek V4 Flash has 13B in mixed FP4/FP8.

Check out the math at the end of our blog post: https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...

by gaeld

5/29/2026 at 5:33:34 PM

>This preview runs a 2B model

I guess with 1B or 500M model inference would be even faster?

by DeathArrow

5/29/2026 at 6:55:10 PM

In theory yes, although not in a linearly proportional way, because in practice our memory streaming is not yet perfect. There are still some fixed costs that we did not fully optimize (for now).

by gaeld

5/29/2026 at 2:16:42 PM

I had to test it myself to believe this unreal inference speed.

each time getting 3300+ tps.

by paul-rohan

5/29/2026 at 10:47:53 AM

I can think of real-time video, shader generation, and real-time worldbuilding as the type of problems that could require such high token throughput.

For instant code generation, 400-500 tok/s should be sufficient, though most frontier models give us closer to 70 tok/s.

by kirtivr

5/29/2026 at 11:29:50 AM

That sounds a little bit like "64kb memory is enough", and then someone invented Electron ;P

But joking aside, I think we don't even know yet what is possible once you hit very high tokens/second numbers, provided the whole ecosystem behind it can handle it.

You could literally implement the same solution 100x, benchmark all of them, and keep only the best result.

You could build and architect a whole stack in parallel.

You could do massive thinking token / chain of thought.

You could let the LLM analyse everything around you while you type. Like it could tell you that this might create a bug in a different file and why.

We could start doing some type of monte-carlo search with this.
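The "implement it 100x and keep the best" idea above is essentially best-of-N selection. A minimal sketch follows; `fake_generate` and `fake_benchmark` are hypothetical stand-ins for a model call and a real benchmark harness, neither of which appears in the thread.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[int], T], score: Callable[[T], float], n: int) -> T:
    """Generate n candidates (one per seed) and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[T] = [generate(seed) for seed in range(n)]
    return max(candidates, key=score)


def fake_generate(seed: int) -> int:
    """Stand-in for one sampled solution; here a candidate is just a number."""
    return seed


def fake_benchmark(candidate: int) -> float:
    """Stand-in benchmark: candidates closer to 42 score higher."""
    return -abs(candidate - 42)


if __name__ == "__main__":
    print(best_of_n(fake_generate, fake_benchmark, n=100))
```

At a few thousand tokens per second the generation side of such a loop becomes cheap, so the benchmark itself is likely to dominate the wall-clock time.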

by Gomotono

5/29/2026 at 12:48:16 PM

Title is pure bait. Where is Datacenter GPU gone?

by ekianjo

5/29/2026 at 1:47:58 PM

I have a naive question here. First, the token speed is very impressive, but why is this the highlight? I would prefer to see the actual performance.

by frankensteins

5/29/2026 at 3:13:06 PM

Token generation speed matters for sequential agentic workflows, like software engineering / vibe coding, where a lot of reasoning tokens, code generation, refactoring, testing, etc. happen in a loop before an actual outcome is served to the user.
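A quick calculation shows how decode speed compounds across a sequential loop. The workload here (20 steps of 2,000 output tokens each) is invented for illustration; 70 tok/s is the frontier-model figure mentioned elsewhere in this thread and 3,000 tok/s is the headline number. Prefill, tool execution and network time are ignored.

```python
def loop_seconds(steps: int, tokens_per_step: int, tokens_per_s: float) -> float:
    """Pure decode time for a chain of steps that cannot run in parallel."""
    return steps * tokens_per_step / tokens_per_s


STEPS = 20              # hypothetical number of reason/code/test iterations
TOKENS_PER_STEP = 2000  # hypothetical output tokens per iteration

if __name__ == "__main__":
    for speed in (70.0, 500.0, 3000.0):
        secs = loop_seconds(STEPS, TOKENS_PER_STEP, speed)
        print(f"{speed:>6.0f} tok/s: {secs / 60:.1f} min of decoding")
```

Because each step waits on the previous one, batching does not shorten this chain; only faster single-request decoding does.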

About model performance, we plan to support the latest frontier models (this tech preview is about the speed of the engine)

by gaeld

5/29/2026 at 12:15:45 PM

Is this the new gateway to a "Model On a Chip"? Is it possible to etch the weights on silicon and get a very efficient way to use a LLM?

by bartkappenburg

5/29/2026 at 6:45:17 PM

Who cares about token speed? What is the quality of the results like? I don't know why people are so fixated on token speed, since no one cares how quickly it can spew garbage. Most reasonable people are happier waiting a bit more for accurate results.

by freediddy

5/29/2026 at 7:35:50 PM

It matters on consumer hardware since barely any model runs at reasonable speed.

by esafak

5/29/2026 at 7:55:50 PM

It also matters for thinking models and for agentic workflows, especially in software engineering, where a lot of tokens need to be output in iterative loops before the user sees any result.

This is our main use case.

by gaeld

5/29/2026 at 11:26:41 PM

What a crap https://drive.google.com/file/d/1ud0yYmkSBrTDAOkLx8K7RFWHMJq...

by gitowiec

5/29/2026 at 9:17:17 PM

a "standard" GPU would be an nvidia geforce RTX 5090

by pdntspa

5/29/2026 at 7:36:00 PM

cool.. so now approximations of copies of copies can be approximated and copied faster?

by nickphx

5/29/2026 at 12:56:16 PM

That's really nice of them.

That means Jensen can add another 30 times faster when comparing Rubin to Blackwell without having to actually do anything.

Hopefully that means he won't have any problem making another 150 billion in profit in the next year.

Sorry for the sarcasm. Looks like interesting work.

by Hfuffzehn