5/17/2026 at 3:18:53 AM
I have written a lot of SIMD for both x86 and ARM over many years and many microarchitectures. Every abstraction, including autovectorization, is universally pretty poor outside of narrow cases because they don’t (and mostly can’t) capture what is possible with intrinsics and their rather extreme variation across microarchitectures. If I want good results, I have to write intrinsics. No library can optimally generate non-trivial SIMD code. Neither can the compiler. Portability just amplifies this gap.
I think a legitimate criticism is that it is unclear who std::simd is for. People that don’t use SIMD today are unlikely to use std::simd tomorrow. At the same time, this does nothing for people that use SIMD for serious work. Who is expected to use this?
The intrinsics are not difficult but you do have to learn how the hardware works. This is true even if you are using a library. A good software engineer should have a rough understanding of this regardless.
by jandrewrogers
5/17/2026 at 10:58:47 AM
I think what a SIMD library does, above all else, is get the programmer to write code in a way that can be directly translated into SIMD instructions. A big issue that compilers have to contend with is that they aren't allowed (unless you enable -ffast-math) to rearrange floating point operations. Putting an add or a multiply in the wrong place can spoil SIMD optimizations that the compiler could otherwise pull off.
But the problem is as you state. For people that really care about that sort of thing, they are likely going to have the exact SIMD sequence they want to execute in mind anyways. That leaves you with a definition that is doomed to be both not low level enough and too low level.
I think what this is useful for is a fallback description of the desired SIMD operations. It won't be ideal on non-targeted platforms, but it will be something.
by cogman10
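To make the reordering point concrete, here is a minimal sketch (the function names are made up for illustration). The first loop is a strict left-to-right float sum, which the compiler may not reassociate without something like -ffast-math. The second spells out four independent partial sums in the source, so mapping each one to a vector lane requires no reordering on the compiler's part. Whether a given compiler actually vectorizes it is not guaranteed.

```cpp
#include <cstddef>
#include <vector>

// Strict left-to-right sum: every add depends on the previous one, and the
// compiler may not reorder float adds unless told to (e.g. -ffast-math).
float sum_sequential(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Four independent accumulators: the source itself states the reordered
// summation, so each accumulator can map to one vector lane. The result can
// differ from sum_sequential in the last bits because rounding order differs.
float sum_four_lanes(const std::vector<float>& v) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    const std::size_t n = v.size();
    const std::size_t blocked = n - (n % 4);
    for (std::size_t i = 0; i < blocked; i += 4) {
        for (std::size_t lane = 0; lane < 4; ++lane) acc[lane] += v[i + lane];
    }
    float total = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (std::size_t i = blocked; i < n; ++i) total += v[i];  // scalar tail
    return total;
}
```

The lane count of 4 is an arbitrary choice here; a SIMD library makes the same restructuring explicit through its vector type instead of a hand-written inner loop.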
5/17/2026 at 3:31:21 AM
For me the main issue is that if you're serious about SIMD, you need to use a state-of-the-art library and can't rely on some standard library whose quality is variable, unreliable, and which is by design always behind.
by mgaunard
5/17/2026 at 3:48:09 AM
For some algorithms you have to compromise the data layout for compatibility across the widest number of microarchitectures by nerfing the performance on advanced SIMD microarchitectures working on the same data structures. There really isn’t a way to square that circle. You can make it portable or you can make it optimal, and the performance gap across those two implementations can be vast.
In the 15-20 years I’ve been doing it, I’ve seen zero evidence that there is a solution to this tradeoff. And people that are using SIMD are people that care about state-of-the-art performance, so portability takes a distant back seat.
by jandrewrogers
5/17/2026 at 4:39:56 AM
For Boost.SIMD (which is what became Eve), a large part of what we did to tackle those problems was building an overload dispatching system so that we could easily inject increasingly specialized implementations depending on the types and instruction set available, in such a way that operations could combine efficiently.
That, however, performed quite poorly at compile-time, and was not really ODR-safe (forceinline was used as a workaround). At least one of the forks moved to using a dedicated meta-language and a custom compiler to generate the code instead. There are better ways to do that in modern C++ now.
We also focused on higher-level constructs trying to capture the intent rather than trying to abstract away too low-level features; some of the features were explicitly provided as kernels or algorithms instead of plain vector operations.
by mgaunard
5/17/2026 at 4:25:57 AM
NumPy has a whole dispatch mechanism to deal with the tradeoffs. The main problem is code bloat: how many microarchitectures are you going to support with dispatch at runtime?
by mattip
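For readers who have not seen runtime dispatch, a minimal sketch of the general idea follows. This is not NumPy's mechanism; the kernel names are hypothetical and both bodies are plain scalar loops so the example runs anywhere. In a real build each variant would be compiled separately with its own target flags, which is where the code bloat comes from: one more compiled copy of every kernel per supported microarchitecture. The `__builtin_cpu_supports` check is a GCC/Clang builtin on x86 and is guarded accordingly.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical baseline kernel: works on every CPU.
std::uint64_t sum_bytes_baseline(const std::uint8_t* p, std::size_t n) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Placeholder for an AVX2-specific kernel. The body is scalar here (an
// assumption for portability); a real one would be built with AVX2 enabled.
std::uint64_t sum_bytes_avx2(const std::uint8_t* p, std::size_t n) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

using SumFn = std::uint64_t (*)(const std::uint8_t*, std::size_t);

// Pick a kernel based on what the running CPU reports.
SumFn choose_sum_bytes() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_bytes_avx2;
#endif
    return sum_bytes_baseline;
}

// Resolve once on first use; later calls go through one function pointer.
std::uint64_t sum_bytes(const std::uint8_t* p, std::size_t n) {
    static const SumFn fn = choose_sum_bytes();
    return fn(p, n);
}
```

Resolving the pointer once keeps the per-call cost to an indirect call, so dispatch overhead mostly matters when the kernels are invoked on very small inputs.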
5/17/2026 at 10:51:49 AM
Numpy is interesting in that regard since its dispatch mechanism adds up to a lot of overhead. There are a lot of problems where a naive list comprehension is faster, even when SIMD could be used to great effect.
by hansvm
5/17/2026 at 5:30:24 AM
The data layout can often be done dynamically based on your target architecture.
by camel-cdr
5/17/2026 at 7:19:43 AM
Don't let the best be the enemy of the good. I got amazing performance from swapping for-loops for some simple SIMD patterns. Moreover, by doing this, I noticed that the codebase started to become better shaped for performance as well. By writing SIMD patterns, you get into the mindset of tight, hot loops.
by portly
5/17/2026 at 8:59:55 AM
The problem is that you're better off by defining SIMD friendly data structures and letting the compiler figure it out than by hand coding the actual SIMD operations.
If you wanted to explicitly opt into bundling/batching of operations, you wouldn't actually want to define a fixed register size. You'd want a data type that represents an arbitrarily sized register and exposes some across batch operations. Then the compiler can make use of this mini DSL to optimize your SIMD code to actual instructions.
The problem is solvable, but it requires cooperation from all parties. CPU vendors must offer a basic set of vector instructions that is supported on all architectures. The language committee must be willing to support function local variable size data types that are never exposed in the ABI. The compiler developers must increase the quality of their auto vectorizers.
by imtringued
5/17/2026 at 9:39:22 AM
This works today :) Highway provides such an abstraction for arbitrary vector lengths and maps them to intrinsics. All on the library level, no need to wait years for compiler or language updates.
by janwas
5/17/2026 at 9:22:00 AM
> The problem is that you're better off by defining SIMD friendly data structures and letting the compiler figure it out than by hand coding the actual SIMD operations.
This will work only for the most basic SIMD usages.
> CPU vendors must offer a basic set of vector instructions that is supported on all architectures.
This will take decades because you cannot change existing architectures/processors.
by SkiFire13
5/17/2026 at 10:29:44 AM
> This will take decades because you cannot change existing architectures/processors.
I think once AVX-512, SVE and RVV are widespread enough, you'll have a rather powerful baseline you can target. But this will take a lot of time.
by camel-cdr
5/17/2026 at 10:50:52 AM
> AVX-512
Which subset though? Some of them are not supported by some recent CPUs (e.g. 2024).
Not to mention Alder Lake not supporting AVX512.
by SkiFire13
5/17/2026 at 12:05:23 PM
Yeah AVX-512 is basically dead as a universal target for x86, the future is now AVX-10. But I believe there is a reasonable subset that will work on both.
by sgerenser
5/17/2026 at 6:13:48 PM
It's a little dramatic to say avx512 is dead versus 10 - rather, I would say that avx10 finalizes a universally available set of avx512 extensions. For AVX 10.1, there's essentially no difference after Intel backed out of reducing the vector length.
For at least the next decade AVX 512 will be the high performance target, reaching all of the zen4/5/6 CPUs as well as whatever avx-10 enabled CPUs Intel produces.
by Remnant44
5/17/2026 at 11:25:38 AM
what you effectively said is "there should be only one isa".
Because if that was all it took, why wouldn't it also apply to every other instruction set too?
by vrighter
5/17/2026 at 3:52:21 AM
> I think a legitimate criticism is that it is unclear who std::simd is for.
I think it's for people like me, who recognize that, depending on the dataset, a lot of performance is left on the table when you don't take advantage of SIMD, but are not interested in becoming experts on intrinsics for a multitude of processor combinations.
Having a way to be able to say "flag bytes in this buffer matching one of these five characters, choose the appropriate stride for the actual CPU" and then "OR those flags together and do a popcount" (as I needed to do writing my own wc(1) as an exercise), and have that at least come close to optimal performance with intrinsics would be great.
Just like I'd rather use a ranged-for than to hand count an index vs. a size.
> People that don’t use SIMD today are unlikely to use std::simd tomorrow.
I mean, why not? That's exactly my use case. I don't use SIMD today as it's a PITA to do properly despite advancements in glibc and binutils to make it easier to load in CPU-specific codes. And it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions. But it is legitimately important for improving performance for many workloads, so I don't want to miss it where it will help.
And even gaining 60, 70% of the "optimal" SIMD still puts you much closer to highest performance than the alternative.
In the end I did end up having to write some direct SIMD intrinsics, I forget what issue I'd run into starting off with std::simd, but std::simd was what had made that problem seem approachable for the first time.
by mpyne
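As a rough illustration of the "flag matching bytes, OR the flags together, popcount" pattern described above, here is a portable sketch in plain C++ with no SIMD types at all. The five characters are an assumption (the comment does not list them), and the 64-byte block size is arbitrary. The point is the shape: a fixed-trip-count inner loop that builds a mask, followed by a population count, which is the form a SIMD library or an autovectorizer can map onto compare and movemask style instructions.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <string>

// True for the five bytes this sketch flags (assumed set, for illustration).
inline bool is_flagged(unsigned char c) {
    return c == ' ' || c == '\n' || c == '\t' || c == '\r' || c == '\f';
}

// Count flagged bytes by building one 64-bit mask per 64-byte block and
// popcounting it. The inner loop has a fixed trip count of 64.
std::size_t count_flagged(const std::string& buf) {
    const std::size_t n = buf.size();
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + 64 <= n; i += 64) {
        std::uint64_t mask = 0;
        for (std::size_t j = 0; j < 64; ++j) {
            const auto c = static_cast<unsigned char>(buf[i + j]);
            mask |= static_cast<std::uint64_t>(is_flagged(c)) << j;
        }
        count += std::bitset<64>(mask).count();  // portable popcount
    }
    // Scalar tail for the final partial block.
    for (; i < n; ++i) {
        count += is_flagged(static_cast<unsigned char>(buf[i]));
    }
    return count;
}
```

How close this gets to hand-written intrinsics depends entirely on the compiler and target, so treat it as a starting point for measurement rather than a performance claim.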
5/17/2026 at 4:28:29 AM
You raise some good points. I think a lot about how to make SIMD more accessible, and spend an inordinate amount of time experimenting with abstractions, because I’ve experienced its many inadequacies.
The intrinsics libraries do themselves no favors with their design, and there are many inconsistencies. Basic things could be made more accessible but are somewhat limited by a requirement for C compatibility. This is something a C++ standard can actually address: it can be C++ native, which can hide many things. Hell, I have my own libraries that clean this up by thinly wrapping the existing intrinsics, improving their conciseness and expressiveness for common use cases. It significantly improves the ergonomics.
An argument I would make though is that the lowest common denominator cases that are actually portable are almost exactly the cases that auto-vectorization should be able to address. Auto-vectorization may not be good enough to consistently address all of those cases today but you can see a future where std::simd is essentially vestigial because auto-vectorization subsumes what it can do but it can’t be leveled up to express more than what auto-vectorization can see due to limitations imposed by portability requirements.
The other argument is that SIMD is the wrong level of abstraction for a library. Depending on the microarchitecture, the optimal code using SIMD may be an entirely different data structure and algorithm, so you are swapping out SIMD details at a very abstract macro level, not at the level of abstraction that intrinsics and auto-vectorization provide. You miss a lot of optimization if you don’t work a couple levels up.
SIMD abstraction and optimization is deeply challenging within programming languages designed around scalar ALU operators. We can’t even fully abstract the expressiveness of modern scalar ALUs across microarchitectures because programming languages don’t define a concept that maps to the capabilities of some modern ALUs.
That said, I love that silicon has become so much more expressive.
by jandrewrogers
5/17/2026 at 6:02:49 AM
IMO what's needed is ISPC-like guided autovec with a lot of hinting support to control codegen (e.g. hint for generating an unrolled version only or an unrolled and non-unrolled version).
Basically something like #pragma omp simd, but actually designed for the SIMD model, not the parallel one, that errors when vectorization isn't possible.
Ideally it would support things like reductions, scans, reference of elements from other iterations (e.g. out[i] = in[i-1]+in[i+1]), full gather scatter, early break, conditional execution control (masking or also a fast-path, when no active elements), latency vs throughput sensitive (don't unroll or unroll to max without spilling), data dependent termination (fault-only-first load or page aligned for things like strlen), ...
by camel-cdr
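For reference, this is roughly what two of the listed cases look like with today's `#pragma omp simd`: the neighbor stencil and a reduction. It is a minimal sketch with made-up function names. Note what it lacks compared to the wish list: the pragma is a hint, a compiler built without OpenMP SIMD support simply ignores it, and nothing here turns a failure to vectorize into a hard error.

```cpp
#include <cstddef>
#include <vector>

// Neighbor stencil from the comment: out[i] = in[i-1] + in[i+1].
// Boundary elements are left at zero in this sketch.
std::vector<int> neighbor_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    if (in.size() < 3) return out;
    const int* src = in.data();
    int* dst = out.data();
    const std::size_t last = in.size() - 1;
    // Hint only: ignored when OpenMP SIMD support is not enabled.
    #pragma omp simd
    for (std::size_t i = 1; i < last; ++i) {
        dst[i] = src[i - 1] + src[i + 1];
    }
    return out;
}

// Reduction: the clause states that partial sums may be combined in any
// order, which is what lets the loop be split across vector lanes.
long long sum_all(const std::vector<int>& v) {
    long long sum = 0;
    const int* p = v.data();
    const std::size_t n = v.size();
    #pragma omp simd reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += p[i];
    }
    return sum;
}
```

With GCC or Clang the pragma takes effect under `-fopenmp` or `-fopenmp-simd`; without either flag the code still compiles and produces the same results as ordinary scalar loops.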
5/17/2026 at 9:21:25 AM
Have you considered our Highway library? Runtime dispatch need not be a PITA :) It's basically portable intrinsics, and a much more complete set (>300) than the ~50 in std.
by janwas
5/17/2026 at 2:40:19 PM
I hadn't but it would make sense for doing my own personal programming challenges.
Given the ongoing disasters around the software supply chains I've been fighting the creeping NPM-ism that people are trying to introduce to C++, where you just FetchContent 20 different libraries to build your own app upon.
I do use gtest, fmt and a few others though, so something as broadly used as Highway would probably be fine by that standard as well. But I'd still like it better if there was a Good Enough solution that was part of C++ stdlib to reduce the number of external integrations that are deemed required for a modern C++ program.
by mpyne
5/17/2026 at 4:06:55 PM
Fair point. If it helps, our security team has called Highway critical infrastructure and helped to harden the repo. The flip side of standardization is that it would be much harder and slower to add ops as the need arises, which we do regularly.
by janwas
5/17/2026 at 10:05:40 AM
Does it have fallback paths for everything, though? Scalar if necessary?
Projects that depend on Highway drop support for CPUs not listed in the Highway documentation, saying that they can't support these CPUs because they are incompatible with Highway: https://google.github.io/highway/en/master/README.html#curre...
Are these projects somehow mistaken?
by fweimer
5/17/2026 at 10:48:39 AM
Yes, the EMU128 target is scalar only, with for loops. This is a fun way to see how well autovectorization works, with the same source code. That works on any CPU. Curious which projects have such concerns, any link?
by janwas
5/17/2026 at 12:53:04 PM
People reported challenges building V8 (whether upstream or the Node.js variant) on s390x with z13 support. I don't know if it was discussed on the porters mailing list because it's not public: https://groups.google.com/g/v8-s390-ports
Elsewhere, some people interpreted https://github.com/google/highway/issues/1895 as meaning that Highway code does not work on z13 at all.
by fweimer
5/17/2026 at 4:12:37 PM
Thanks for sharing. The first link seems non public indeed. I can imagine there is some compile issue we could reasonably fix, with the help of someone who has Z13 access. Please encourage them to raise an issue. I will be back on May 26. After that, it should at least be able to use the scalar fallback. The issue with Z14 is that it lacks fp32 support. Would their usage be integer only?
by janwas
5/17/2026 at 4:40:07 AM
> it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions
This is one complaint I toss back at Intel and AMD.
If an instruction/intrinsic is worse than another set of intrinsics in the P90/P95/P99 use case where it's going to be used, then it shouldn't exist. Stop wasting the die space and instruction decode on it, if only to save the developer time wasted finding out that your dot product instruction is useless.
There are a lot of smart people that have worked on compilers, optimized subroutines for LAPACK/BLAS, and designed the decoders and hardware. A lot of that effort is wasted because no one knows how to program these weird little machines. A little manual on "here's how to program SIMD, starting from linear algebra basics" would be worth more to Intel than all the money they've wasted trying to improve autovectorization passes in ICC and now, LLVM.
by duped
5/17/2026 at 9:30:12 AM
:) I agree a tutorial would be helpful. We are working on one with Fastcode.
by janwas
5/17/2026 at 2:06:36 PM
A manual is not a tutorial, and having AI anywhere near this task is actively harmful. Please do not build this.
by duped
5/17/2026 at 4:13:36 PM
?? Where did you see mention of AI?
by janwas
5/17/2026 at 5:02:28 PM
I searched the name "fastcode" and the only results were AI.
by duped
5/17/2026 at 9:50:42 AM
In such discussions, whenever you mention abstractions are universally "pretty poor", to the extent anyone is listening, I think this hyperbole can do real damage. Maybe it prevents people from getting relevant performance gains, even if not 100% of the optimum, which is anyway unattainable. And what is the alternative? Not many projects can afford to hand write intrinsics for all platforms. And are you aware that Highway is basically a thin wrapper over intrinsics, which you can still drop down to where it helps?
by janwas
5/18/2026 at 1:33:07 AM
I am aware of Highway. It doesn’t add much value for the kind of SIMD code I write. I have better abstractions because I don’t have to consider portability nearly as much. Some useful constructions don’t have a good expression on weaker SIMD architectures.
by jandrewrogers
5/17/2026 at 10:03:07 AM
> 100% of the optimum, which is anyway unattainable.
Can you expand on this? Sounds like an interesting discussion.
by CoastalCoder
5/17/2026 at 4:02:48 PM
:) I figure there is always something left to improve. For some kernels which really want to keep 30+ live registers, the compiler might not do as good a job as careful manual tuning, so intrinsics can have a bit of a cost. But I also figure optimization time is limited, so better to get 90% of several kernels rather than one to 99%.
by janwas
5/17/2026 at 11:15:10 AM
Not who you asked but I think the meaning is that since intrinsics for simd are different in each platform, being able to have something that is portable and sometimes works faster is something, while writing for Intel, ARM and a zoo of instruction sets is not an option for some.
by germandiago
5/17/2026 at 12:01:05 PM
Besides Spolsky's law of leaky abstractions, "abstractions" can also result in "lowest common denominator" situations, which are the opposite of performance optimization. Talking negatively about abstractions is not what deals damage; you are shooting the messenger here. It's the abstractions themselves that deal damage when misplaced. "Zero-cost abstractions" is the true hyperbole.
by astrobe_
5/17/2026 at 4:37:32 PM
Is this a good faith reply? The particular abstraction we built, and is being discussed, is manifestly and obviously not a lowest common denominator. Looks like you are deploying a second straw man, that of zero cost. In other comments here I acknowledge a cost to intrinsics.
by janwas
5/17/2026 at 9:06:29 AM
Yep, same here and agree.
Compilers have definitely got better though: another issue in the past (maybe still is to a degree? although compilers have got a lot better at this in the past 15 years, but it used to be one of the things only Intel's ICC actually got right), that if you wrapped the base-level '__m128' or 'float32x4_t' in a struct/union in order to provide some abstraction, the compiler would often lose track of this when passing the struct/union through functions (either by value or const ref), and would often end up 'spilling' (not entirely the correct terminology in this context, but...) the variable from registers, and just producing asm which ended up uselessly loading the variable again from a stack address further up the call stack, when it didn't actually need to do that. So that was the situation even when using intrinsics within custom wrappers.
From 2011 to around 2013 ICC seemed to be the only compiler on amd64 which wouldn't do this. If you passed the actual '__m128' down the function call chain instead, clang and gcc would then do the right thing.
by pixelesque
5/17/2026 at 10:38:26 AM
Part of that could be ABI constraints. There are some surprising calling convention differences between a vector and a struct or union with vectors in it, and they vary platform to platform. E.g. on ARM a struct with two 128-bit vectors will pass in two registers where on x86 it must pass via the stack.
Using __attribute__ to tweak calling conventions can often really clean this up, but that's just as obscure and non-portable as the problem it fixes. So you either end up writing weird non-portable code one way or weird non-portable code another... Code working with these types doesn't get to benefit from zero-cost abstraction to the degree we're used to with normal scalar code.
by mtklein
5/17/2026 at 4:55:25 PM
That's an ABI constraint of the x86 32-bit API.
People invented x32 to fix this. Or just use amd64.
by mgaunard
5/17/2026 at 5:59:47 PM
This was with amd64.
ICC was at the time the only compiler that would not do that.
by pixelesque
5/17/2026 at 10:22:55 AM
Do you say that from the perspective of compiled languages? I hear good things about .net core wrt SIMD, but that has the advantage it can decide at JIT.
by exceptione
5/17/2026 at 10:47:34 AM
I'm not the person you're asking, but I share that opinion for both compiled languages and JIT solutions, including .net core specifically. All but the most trivial use cases can't be autovectorized, by JIT or otherwise. One of the recent things I worked on (reed-solomon decoding) offers basically zero opportunities for autovectorization unless the compiler reinterprets certain scalar loops as dedicated galois instructions on AVX512F hardware, but that optimization isn't implemented, it wouldn't help other architectures anyway, and it's still 10x slower than a well thought out vectorized approach.
by hansvm
5/17/2026 at 10:53:23 AM
Thanks, are you talking about using plain loops with regular arrays, or do you mean the specific types like here <https://learn.microsoft.com/en-us/dotnet/standard/simd>?
EDIT: A bit more background @<https://medium.com/@meriffa/net-core-concepts-simd-avx-intri...>
by exceptione
5/17/2026 at 11:24:52 AM
> People that don’t use SIMD today are unlikely to use std::simd tomorrow.why? at least I see that I will start with std::simd in my pets. If this would not enough, I would go forward to intrinsics. But, I think, starting with std::simd would be much simpler for beginner.
by feelamee
5/17/2026 at 4:10:20 AM
Is this a technical impossibility or just it hasn't been done yet? Could a library support generating intrinsics for a large set of architectures?
by cortesoft
5/17/2026 at 4:43:07 AM
The full scope of what SIMD is used for is much larger than parallelizing evaluation of numeric types and algorithms.
For example, it is used for parallel evaluation of complex constraints on unrelated types simultaneously while packed into a single vector. Think a WHERE clause on an arbitrary SQL schema evaluated in full parallel in a handful of clock cycles. SIMD turns out to be brilliant for this but it looks nothing like auto-vectorization.
None of the SIMD libraries like Google Highway cover this case.
by jandrewrogers
5/17/2026 at 5:35:39 AM
I don't quite get how something like highway doesn't cover this, while intrinsics do.
Can you explain the usecase more concretely?
by camel-cdr
5/17/2026 at 6:33:34 AM
Almost literally what I stated. Consider a row in Postgres table or similar. Convert the entire WHERE clause across all columns in that table into a very short sequence of SIMD instructions against the same memory. All of the columns, regardless of type, are evaluated simultaneously using SIMD. For many complex constraints you can match rows in single digit clock cycles even across many unrelated types. This is much faster than using secondary indexes in many cases.
It isn’t hypothetical, I’ve shipped systems that worked this way. You can match search patterns across a random dozen columns across a schema of hundreds of columns at essentially full memory bandwidth.
by jandrewrogers
5/17/2026 at 6:49:45 AM
OK, I thought it couldn't be that, because that should be doable with std::simd or a SIMD abstraction. Well, unless you JIT it, in which case intrinsics wouldn't help either.
> You can match search patterns across a random dozen columns across a schema of hundreds of columns at essentially full memory bandwidth
Do I understand it correctly, that this would only work if you have multiple of the same comparisons (e.g. equality check with same sized data) in the WHERE clause and the relevant columns are within one multiple of the SIMD width of each other?
by camel-cdr
5/17/2026 at 7:22:25 AM
Every column has its own independent constraint: equality, order, range intersection, bit sets, etc that is evaluated concurrently in single operations. Independent per column in parallel. It does require handling the representation of columns to enable it but that isn’t onerous in practice.
It isn’t intuitive but it is one of those things that is obvious in hindsight once you see how it works. The gap is that people struggle to understand how to make this something SIMD native, especially in high-performance systems.
by jandrewrogers
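The thread does not show how such a system is built, so the following is only a toy illustration of the general flavor, not the commenter's method. It uses a single 64-bit word as the "vector" (SIMD within a register) with a hypothetical row layout, and covers only equality and bit-set constraints. Order and range constraints need more machinery than this, and a real implementation would use much wider registers. What it does show is several unrelated columns being tested by the same three operations at once.

```cpp
#include <cstdint>

// Hypothetical row layout packed into one 64-bit word:
//   bits  0..7   status  (uint8)
//   bits  8..15  region  (uint8)
//   bits 16..31  flags   (uint16 bit set)
//   bits 32..63  user_id (uint32)
constexpr std::uint64_t pack_row(std::uint8_t status, std::uint8_t region,
                                 std::uint16_t flags, std::uint32_t user_id) {
    return static_cast<std::uint64_t>(status) |
           (static_cast<std::uint64_t>(region) << 8) |
           (static_cast<std::uint64_t>(flags) << 16) |
           (static_cast<std::uint64_t>(user_id) << 32);
}

// A "compiled" predicate: which bits matter and what they must equal.
struct Predicate {
    std::uint64_t care;
    std::uint64_t expect;
};

// WHERE status = 2 AND region = 7 AND (flags & 0x0005) = 0x0005
// user_id is unconstrained, so none of its bits appear in `care`.
constexpr Predicate kExample{
    0xFFull | (0xFFull << 8) | (0x0005ull << 16),
    2ull | (7ull << 8) | (0x0005ull << 16)};

// One XOR, one AND and one compare evaluate all three constraints together.
constexpr bool matches(std::uint64_t row, Predicate p) {
    return ((row ^ p.expect) & p.care) == 0;
}
```

The predicate is data rather than code, which is why this style does not resemble a vectorized loop: the WHERE clause is translated once into masks, and the per-row work is the same fixed instruction sequence regardless of how many columns are constrained.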
5/17/2026 at 7:48:25 AM
Ah, so you're just doing SoA or AoSoA layout? It sounded like you were doing something more special than the standard SIMD usecase.
This does easily work with SIMD abstractions and even length-agnostic vector ISAs, unless you're doing AoSoA and your storage format has to match your memory format and it has to be the same on all machines. In which case you probably want to do something like 4K blocks anyways, in which case you can make it agnostic for all vector lengths anybody reasonably cares about for this type of application anyways.
by camel-cdr
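For anyone unfamiliar with the term, a structure-of-arrays (SoA) layout keeps each column contiguous, so a filter becomes a simple loop over parallel arrays. The sketch below uses a hypothetical two-column table and assumes both columns have the same length. No vector width appears anywhere in the source, which is the sense in which this layout is length-agnostic.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Structure of arrays: each column is stored contiguously.
// Assumption for this sketch: both vectors have the same length.
struct Table {
    std::vector<std::int32_t> price;
    std::vector<std::int32_t> quantity;
};

// Count rows where price < max_price AND quantity >= min_quantity.
std::size_t count_matching(const Table& t, std::int32_t max_price,
                           std::int32_t min_quantity) {
    const std::size_t n = t.price.size();
    const std::int32_t* price = t.price.data();
    const std::int32_t* qty = t.quantity.data();
    std::size_t hits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Bitwise & rather than && so both comparisons are always evaluated,
        // keeping the loop body free of short-circuit control flow.
        hits += static_cast<std::size_t>((price[i] < max_price) &
                                         (qty[i] >= min_quantity));
    }
    return hits;
}
```

An AoSoA variant would group rows into fixed-size blocks of such columns; the block size then becomes part of the storage format, which is the constraint discussed in the comment above.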
5/17/2026 at 4:19:11 AM
Google Highway gets mentioned in the article.
by loeg
5/17/2026 at 4:23:29 AM
There is google’s highway, that provides an abstraction layer. It is used by NumPy.
by mattip
5/17/2026 at 8:23:43 AM
Autovectorisation is the main way SIMD hardware gets put into use, whether you think it's pretty poor or not.
SIMD came to mainstream in 1995 Pentium MMX and has been proven rather difficult for compilers to target, but after 30+ years is doing a bit better despite PLT conspiring against it. (see eg CUDA, Futhark etc)
by fulafel
5/17/2026 at 1:32:11 PM
I think the main way SIMD hardware gets put to use is probably memcpy.
by saagarjha
5/17/2026 at 10:03:18 AM
In my limited experience with looking at autovectorisation compiler output, gcc is quite bad unless you hold its hand, and clang tries to autovectorise everything it sees.
by secondcoming
5/17/2026 at 6:49:19 PM
> I think a legitimate criticism is that it is unclear who std::simd is for. People that don’t use SIMD today are unlikely to use std::simd tomorrow. At the same time, this does nothing for people that use SIMD for serious work. Who is expected to use this?
There is plenty of vectorization that is simple enough to be done with std::simd today and that will still bring any autovectorizer to its knees for various reasons.
As an anecdote, I recently got an 8x speedup with std::simd (AVX2 & SVE2) on a rather trivial parser of mine that the autovectorizer failed miserably to handle properly.
Would I have gotten better results using intrinsics? Likely, yes.
Did I want to suffer the maintainability and portability pain associated with it for a simple parser? Certainly not.
For these use cases, std::simd does the job. And it will probably do a better and wider job with time as it gets enriched by the committee.
The blog brings some valid criticism but really looks like a flame war trying to destroy an already opened door.
(1) Are there more performant solutions than std::simd for vectorization?
Yes, of course. The STL evolves slowly; its main goal is to provide generic and portable implementations of a set of algorithms. Not to provide the best implementation in existence.
The best implementation of most algorithms (including SIMD patterns) evolves every 6 months; you cannot expect a standard library with 3 different implementations to keep up with that.
(2) Is the future of vectorization ISPC?
Nope. ISPC has been around for > 10y and is still niche. There are very good reasons for that: yes, it can generate better code, but in most use cases, adding a massive dependency on a compiler + an arbitrary LLVM version + a DSL to your project is not worth it.
Especially considering that it is an Intel project and that Intel (almost) abandoned the project multiple times (in pure Intel fashion).
So yes, criticism is easy, and yes std::simd is full of problems.
But I am glad it exists, and thanks to the people that made it happen... Because it is useful, even in the current state.
by adev_
5/17/2026 at 4:40:54 AM
what about Google highway project?
by synergy20
5/17/2026 at 4:08:33 AM
> I think a legitimate criticism is that it is unclear who std::simd is for
It's for people that don't use SIMD today.
SIMD is hard, or at least nuanced and platform-dependent. To say that std::simd doesn't lower the learning curve is intellectually dishonest.
---
Despite the title, the primary criticism of the article is that the compilers' auto-vectorizers have improved better than the current shipped stdlib version.
by paulddraper
5/17/2026 at 4:35:31 AM
My criticism could mostly be summarized similarly. The scope of what a portable std::simd can do is almost exactly the scope that you would expect auto-vectorization to subsume over time. SIMD, to the extent it is covered by std::simd, is the part of SIMD that should be pretty simple to learn.
There isn’t an obvious path to elevate it above what auto-vectorization should theoretically be capable of in a portable way. This leads to a potential long-term outcome where std::simd is essentially a no-op because scalar code is automagically converted into the equivalent and it is incapable of supporting more sophisticated SIMD code.
by jandrewrogers
5/17/2026 at 4:44:04 AM
[dead]
by kent-tokyo