Agreed, fixed-width vectors need to be a language feature; it would mean better compile times and would solve the issue for most people.
Personally, I think that something like Clang's way of adding GLSL-like vectors and semantics would've gone a long way. SVE might be an elegant design, but in reality there is probably several times more game and other 3D code being written that needs vectors than code in other fields, and their limited vector sizes aren't really a problem there.
And honestly, considering the story of AVX-512, with 512-bit vectors being removed from mainstream parts by Intel, do we really need longer ones just because they come from a "scalable design"?
The point about the optimizer only seeing "opaque templates and function calls" makes little sense.
First off, templates are the opposite of opaque due to the fundamental requirement that the implementation be visible to every translation unit using a template. This makes any function calls trivially inlinable.
Second, and the reason for the above requirement, templates are compiled by monomorphization – making a distinct, separately optimizable copy of each concrete instantiation of a template. By the time the compiler backend sees the intermediate representation, there’s nothing about templates left.
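A toy illustration of that point (my example, not from the thread):

```cpp
// Monomorphization in miniature: each distinct T stamps out a separate,
// independently optimizable function. By the time the compiler backend
// sees the intermediate representation, there is no "template" left.
template <typename T>
T twice(T x) { return x + x; }

// twice(21) instantiates twice<int>; twice(1.5) instantiates twice<double>.
// Both are trivially inlinable at any call site that can see this header.
```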
There are of course reasons why highly abstracted template code may be difficult to optimize, for instance if function call chains are so deep that the inliner gives up. There are also legitimate reasons why a fully language-based solution might beat a library-based one. But one of the points of adding a library to the std is that the standard library is allowed to cheat as much as it wants. It can be deeply integrated to the compiler and implemented entirely using compiler magic if necessary.
std::simd may be too little, too late for many reasons, but I doubt any of them is that the compiler can’t see through the code.
It's just a compiler hint, like all the other hints: parity with other languages.
I have written a lot of SIMD for both x86 and ARM over many years and many microarchitectures. Every abstraction, including autovectorization, is universally pretty poor outside of narrow cases because they don’t (and mostly can’t) capture what is possible with intrinsics and their rather extreme variation across microarchitectures. If I want good results, I have to write intrinsics. No library can optimally generate non-trivial SIMD code. Neither can the compiler. Portability just amplifies this gap.
I think a legitimate criticism is that it is unclear who std::simd is for. People that don’t use SIMD today are unlikely to use std::simd tomorrow. At the same time, this does nothing for people that use SIMD for serious work. Who is expected to use this?
The intrinsics are not difficult but you do have to learn how the hardware works. This is true even if you are using a library. A good software engineer should have a rough understanding of this regardless.
In such discussions, whenever someone claims abstractions are universally "pretty poor", to the extent anyone is listening, I think the hyperbole can do real damage. Maybe it prevents people from getting relevant performance gains, even if those aren't 100% of the optimum, which is unattainable anyway. And what is the alternative? Not many projects can afford to hand-write intrinsics for all platforms. And are you aware that Highway is basically a thin wrapper over intrinsics, which you can still drop down to where it helps?
> 100% of the optimum, which is anyway unattainable.
Can you expand on this? Sounds like an interesting discussion.
Do you say that from the perspective of compiled languages? I hear good things about .net core wrt SIMD, but that has the advantage it can decide at JIT.
Yep, same here and agree.
Compilers have definitely gotten better, though. Another issue in the past (maybe still, to a degree? compilers have improved a lot at this in the past 15 years, but it used to be one of the things only Intel's ICC actually got right): if you wrapped the base-level '__m128' or 'float32x4_t' in a struct/union in order to provide some abstraction, the compiler would often lose track of it when the struct/union was passed through functions (either by value or by const ref), and would end up 'spilling' (not entirely the correct terminology in this context, but...) the variable from registers, producing asm that uselessly reloaded the variable from a stack address further up the call stack when it didn't actually need to. So that was the situation even when using intrinsics within custom wrappers.
From 2011 to around 2013 ICC seemed to be the only compiler on amd64 which wouldn't do this. If you passed the actual '__m128' down the function call chain instead, clang and gcc would then do the right thing.
Part of that could be ABI constraints. There are some surprising calling convention differences between a vector and a struct or union with vectors in it, and they vary platform to platform. E.g. on ARM a struct with two 128-bit vectors will pass in two registers where on x86 it must pass via the stack.
Using __attribute__ to tweak calling conventions can often really clean this up, but that's just as obscure and non-portable as the problem it fixes. So you either end up writing weird non-portable code one way or weird non-portable code another... Code working with these types doesn't get to benefit from zero-cost abstraction to the degree we're used to with normal scalar code.
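A minimal sketch of the wrapper pattern being discussed; `Vec4` and `add` are my illustrative names, and the GCC/Clang vector extension stands in for `__m128`/`float32x4_t` so the example isn't tied to one ISA. Whether `Vec4` travels in registers or gets bounced via the stack is exactly the platform-dependent calling-convention behavior described above:

```cpp
// A 128-bit vector type via the GCC/Clang vector extension.
typedef float v4f __attribute__((vector_size(16)));

// The thin abstraction layer: wrapping the raw vector in a struct. Depending
// on target and calling convention, passing Vec4 by value may go through
// vector registers, or via memory -- the "spilling" behavior described above.
struct Vec4 { v4f v; };

inline Vec4 add(Vec4 a, Vec4 b) { return Vec4{a.v + b.v}; }
```

Passing the raw `v4f` down the call chain instead sidesteps the struct-passing rules, at the cost of losing the abstraction.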
For me the main issue is that if you're serious about SIMD, you need to use a state-of-the-art library and can't rely on some standard library whose quality is variable, unreliable, and which is by design always behind.
For some algorithms you have to compromise the data layout for compatibility across the widest number of microarchitectures by nerfing the performance on advanced SIMD microarchitectures working on the same data structures. There really isn’t a way to square that circle. You can make it portable or you can make it optimal, and the performance gap across those two implementations can be vast.
In the 15-20 years I’ve been doing it, I’ve seen zero evidence that there is a solution to this tradeoff. And people that are using SIMD are people that care about state-of-the-art performance, so portability takes a distant back seat.
For Boost.SIMD (which is what became Eve), a large part of what we did to tackle those problems was building an overload dispatching system so that we could easily inject increasingly specialized implementations depending on the types and instruction set available, in such a way that operations could combine efficiently.
That, however, performed quite poorly at compile-time, and was not really ODR-safe (forceinline was used as a workaround). At least one of the forks moved to using a dedicated meta-language and a custom compiler to generate the code instead. There are better ways to do that in modern C++ now.
We also focused on higher-level constructs trying to capture the intent rather than trying to abstract away too low-level features; some of the features were explicitly provided as kernels or algorithms instead of plain vector operations.
NumPy has a whole dispatch mechanism to deal with the tradeoffs. The main problem is code bloat: how many microarchitectures are you going to support with dispatch at runtime?
The data layout can often be done dynamically based on your target architecture.
[deleted]
Don't let the best be the enemy of the good. I got amazing performance by swapping for-loops for some simple SIMD patterns. Moreover, by doing this, I noticed that the codebase started to become better shaped for performance as well. Writing SIMD patterns gets you into the mindset of tight, hot loops.
The problem is that you're better off by defining SIMD friendly data structures and letting the compiler figure it out than by hand coding the actual SIMD operations.
If you wanted to explicitly opt into bundling/batching of operations, you wouldn't actually want to define a fixed register size. You'd want a data type that represents an arbitrarily sized register and exposes some across batch operations. Then the compiler can make use of this mini DSL to optimize your SIMD code to actual instructions.
The problem is solvable, but it requires cooperation from all parties. CPU vendors must offer a basic set of vector instructions supported on all architectures. The language committee must be willing to support variable-size, function-local data types that are never exposed in the ABI. And compiler developers must raise the quality of their auto-vectorizers.
This works today :) Highway provides such an abstraction for arbitrary vector lengths and maps them to intrinsics. All on the library level, no need to wait years for compiler or language updates.
> The problem is that you're better off by defining SIMD friendly data structures and letting the compiler figure it out than by hand coding the actual SIMD operations.
This will work only for the most basic SIMD usages.
> CPU vendors must offer a basic set of vector instructions that is supported on all architectures.
This will take decades because you cannot change existing architectures/processors.
> This will take decades because you cannot change existing architectures/processors.
I think once AVX-512, SVE and RVV are widespread enough, you'll have a rather powerful baseline you can target. But this will take a lot of time.
> I think a legitimate criticism is that it is unclear who std::simd is for.
I think it's for people like me, who recognize that depending on the dataset that a lot of performance is left on the table for some datasets when you don't take advantage of SIMD, but are not interested in becoming experts on intrinsics for a multitude of processor combinations.
Having a way to be able to say "flag bytes in this buffer matching one of these five characters, choose the appropriate stride for the actual CPU" and then "OR those flags together and do a popcount" (as I needed to do writing my own wc(1) as an exercise), and have that at least come close to optimal performance with intrinsics would be great.
Just like I'd rather use a ranged-for than to hand count an index vs. a size.
> People that don’t use SIMD today are unlikely to use std::simd tomorrow.
I mean, why not? That's exactly my use case. I don't use SIMD today because it's a PITA to do properly, despite advancements in glibc and binutils to make it easier to load in CPU-specific code. And it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions. But it is legitimately important for improving performance on many workloads, so I don't want to miss out where it will help.
And even gaining 60-70% of the "optimal" SIMD still puts you much closer to the highest performance than the alternative.
In the end I did end up having to write some direct SIMD intrinsics, I forget what issue I'd run into starting off with std::simd, but std::simd was what had made that problem seem approachable for the first time.
You raise some good points. I think a lot about how to make SIMD more accessible, and spend an inordinate amount of time experimenting with abstractions, because I’ve experienced its many inadequacies.
The design of the intrinsics libraries does them no favors and there are many inconsistencies. Basic things could be made more accessible but are somewhat limited by a requirement for C compatibility. This is something a C++ standard can actually address — it can be C++ native, which can hide many things. Hell, I have my own libraries that clean this up by thinly wrapping the existing intrinsics, improving their conciseness and expressiveness for common use cases. It significantly improves the ergonomics.
An argument I would make though is that the lowest common denominator cases that are actually portable are almost exactly the cases that auto-vectorization should be able to address. Auto-vectorization may not be good enough to consistently address all of those cases today but you can see a future where std::simd is essentially vestigial because auto-vectorization subsumes what it can do but it can’t be leveled up to express more than what auto-vectorization can see due to limitations imposed by portability requirements.
The other argument is that SIMD is the wrong level of abstraction for a library. Depending on the microarchitecture, the optimal code using SIMD may be an entirely different data structure and algorithm, so you are swapping out SIMD details at a very abstract macro level, not at the level of abstraction that intrinsics and auto-vectorization provide. You miss a lot of optimization if you don’t work a couple levels up.
SIMD abstraction and optimization is deeply challenging within programming languages designed around scalar ALU operators. We can’t even fully abstract the expressiveness of modern scalar ALUs across microarchitectures because programming languages don’t define a concept that maps to the capabilities of some modern ALUs.
That said, I love that silicon has become so much more expressive.
IMO what's needed is ISPC-like guided autovec with a lot of hinting support to control codegen (e.g. a hint for generating only an unrolled version, or both unrolled and non-unrolled versions).
Basically something like #pragma omp simd, but actually designed for the SIMD model rather than the parallel one, and that errors out when vectorization isn't possible.
Ideally it would support things like reductions, scans, references to elements from other iterations (e.g. out[i] = in[i-1]+in[i+1]), full gather/scatter, early break, conditional execution control (masking, or also a fast path when no elements are active), latency- vs throughput-sensitive codegen (don't unroll, or unroll to the max without spilling), data-dependent termination (fault-only-first loads, or page-aligned for things like strlen), ...
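For comparison, the closest thing available today. This is only a hint, not a contract, and it won't error when vectorization fails; the code also compiles fine without -fopenmp-simd, since the pragma is then simply ignored:

```cpp
#include <cstddef>

// A reduction expressed through OpenMP's simd pragma. With -fopenmp-simd
// (GCC/Clang) the compiler is told the loop is safe to vectorize and that
// 's' is a (+) reduction across lanes; without the flag the pragma is inert
// and the loop runs scalar -- precisely the "hint, not guarantee" problem.
float sum(const float* x, std::size_t n) {
    float s = 0.0f;
    #pragma omp simd reduction(+ : s)
    for (std::size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}
```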
Have you considered our Highway library? Runtime dispatch need not be a PITA :) It's basically portable intrinsics, and a much more complete set (>300) than the ~50 in std.
Does it have fallback paths for everything, though? Scalar if necessary?
> it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions
This is one complaint I toss back at Intel and AMD.
If an instruction/intrinsic is universally worse than some other set of intrinsics for the P90/P95/P99 use cases it's going to be used in, then it shouldn't exist. Stop wasting the die space and instruction decode on it, to say nothing of the developer time wasted finding out that your dot product instruction is useless.
There are a lot of smart people that have worked on compilers, optimized subroutines for LAPACK/BLAS, and designed the decoders and hardware. A lot of that effort is wasted because no one knows how to program these weird little machines. A little manual on "here's how to program SIMD, starting from linear algebra basics" would be worth more to Intel than all the money they've wasted trying to improve autovectorization passes in ICC and now, LLVM.
:)
I agree a tutorial would be helpful. We are working on one with Fastcode.
Autovectorisation is the main way SIMD hardware gets put into use, whether you think it's pretty poor or not.
SIMD came to the mainstream with the Pentium MMX in 1995 and has proven rather difficult for compilers to target, but after 30+ years it's doing a bit better, despite PLT conspiring against it (see e.g. CUDA, Futhark etc).
In my limited experience with looking at autovectorisation compiler output, gcc is quite bad unless you hold its hand, and clang tries to autovectorise everything it sees.
Is this a technical impossibility or just it hasn't been done yet? Could a library support generating intrinsics for a large set of architectures?
The full scope of what SIMD is used for is much larger than parallelizing evaluation of numeric types and algorithms.
For example, it is used for parallel evaluation of complex constraints on unrelated types simultaneously while packed into a single vector. Think a WHERE clause on an arbitrary SQL schema evaluated in full parallel in a handful of clock cycles. SIMD turns out to be brilliant for this but it looks nothing like auto-vectorization.
None of the SIMD libraries like Google Highway cover this case.
I don't quite get how something like highway doesn't cover this, while intrinsics do.
Can you explain the usecase more concretely?
Almost literally what I stated. Consider a row in a Postgres table or similar. Convert the entire WHERE clause across all columns in that table into a very short sequence of SIMD instructions against the same memory. All of the columns, regardless of type, are evaluated simultaneously using SIMD. For many complex constraints you can match rows in single-digit clock cycles even across many unrelated types. This is much faster than using secondary indexes in many cases.
It isn’t hypothetical, I’ve shipped systems that worked this way. You can match search patterns across a random dozen columns across a schema of hundreds of columns at essentially full memory bandwidth.
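For the equality-constraint subset, here is a scalar miniature of the idea (the types and packing are my own illustration; the real thing does the same with wide vectors, many more columns, and richer per-column predicates):

```cpp
#include <cstdint>

// Four uint16_t columns packed into one 64-bit word per row. A WHERE clause
// with equality constraints on any subset of columns becomes one XOR, one
// AND and one compare -- columns that carry no constraint have a zero mask
// and drop out automatically, so unrelated columns evaluate "for free".
struct Predicate {
    std::uint64_t expect;  // packed expected values in the constrained columns
    std::uint64_t mask;    // all-ones in constrained columns, zero elsewhere
};

inline bool matches(std::uint64_t row, const Predicate& p) {
    return ((row ^ p.expect) & p.mask) == 0;
}
```

Scanning rows is then a straight, branch-light sweep over memory, which is where the "essentially full memory bandwidth" behavior comes from.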
OK, I thought it couldn't be that, because that should be doable with std::simd or a SIMD abstraction. Well, unless you JIT it, in which case intrinsics wouldn't help either.
> You can match search patterns across a random dozen columns across a schema of hundreds of columns at essentially full memory bandwidth
Do I understand it correctly that this would only work if you have multiples of the same comparison (e.g. an equality check with same-sized data) in the WHERE clause, and the relevant columns are within one multiple of the SIMD width of each other?
Every column has its own independent constraint: equality, order, range intersection, bit sets, etc that is evaluated concurrently in single operations. Independent per column in parallel. It does require handling the representation of columns to enable it but that isn’t onerous in practice.
It isn’t intuitive but it is one of those things that is obvious in hindsight once you see how it works. The gap is that people struggle to understand how to make this something SIMD native, especially in high-performance systems.
Ah, so you're just doing SoA or AoSoA layout? It sounded like you were doing something more special than the standard SIMD use case.
This does easily work with SIMD abstractions and even length-agnostic vector ISAs, unless you're doing AoSoA and your storage format has to match your memory format and has to be the same on all machines. In which case you probably want to do something like 4K blocks anyway, and then you can make it agnostic over every vector length anybody reasonably cares about for this type of application.
Google Highway gets mentioned in the article.
There is google’s highway, that provides an abstraction layer. It is used by NumPy.
what about Google highway project?
> I think a legitimate criticism is that it is unclear who std::simd is for
It's for people that don't use SIMD today.
SIMD is hard, or at least nuanced and platform-dependent. To say that std::simd doesn't lower the learning curve is intellectually dishonest.
---
Despite the title, the article's primary criticism is that the compilers' auto-vectorizers have improved to the point of beating the currently shipped stdlib version.
My criticism could mostly be summarized similarly. The scope of what a portable std::simd can do is almost exactly the scope that you would expect auto-vectorization to subsume over time. SIMD, to the extent it is covered by std::simd, is the part of SIMD that should be pretty simple to learn.
There isn’t an obvious path to elevate it above what auto-vectorization should theoretically be capable of in a portable way. This leads to a potential long-term outcome where std::simd is essentially a no-op because scalar code is automagically converted into the equivalent and it is incapable of supporting more sophisticated SIMD code.
[dead]
The linked[1] "six reasons to use std::simd" was just what I needed after a long week. Hilarious!
It should have been "eight reasons to use std::simd". Inefficient.
Isn't that just a QoI issue? There's a reason why the libstdc++ folks labelled their implementation as experimental.
That certainly convinced me. When I was doing my taxes recently and had to watch those forced loading animations, I kept asking myself "why can't my compiler do this?" Thanks to std::simd, now it can!
I made the first proposal to the C++ standard committee to introduce SIMD in 2011, before Matthias Kretz got involved with his own version (which is what became std::simd). This was based on what eventually became Eve (mentioned in the article).
Back then, it was rejected, for the same arguments that people are making today, such as not mapping to SVE well, having a separate way to express control flow etc.
There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language. Then that died out (I'm not sure why), and SIMD became trendy, so the committee was more open to doing something to show that they were keeping up with the times.
> There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language.
I think this is the best solution for truly portable SIMD. Sure, it doesn't cover everything, but it makes autovec explicit, guaranteed and more powerful.
One of the biggest problems with "portable" SIMD libraries is that when they're used for simple things, autovec is often better, as it has access to the direct ISA semantics and can much more easily do things like unrolling.
To me it’s clear adding the ability to express intent to parallelise is the Right Thing. This is the only way the compiler can actually know what you want it to do.
Trying to abstract over SVE with a SIMD library is a bit of a fool's errand. The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it. All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.
Frankly, the length agnostic stuff is a mistake that I hope hardware designers will eventually see the light on, like delay slots.
> Trying to abstract over SVE with a SIMD library is a bit of a fool's errand
It really isn't. You just make SIMD-width-agnostic code the default and anything less portable opt-in.
You can still specialize for a specific width on scalable vector ISAs.
> The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it.
Such as?
> All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.
Google Highway doesn't. And while Arm is stuck with 128-bit SVE, because they also have to implement NEON as fast as possible to be competitive, RVV already has a large diversity of hardware with different vector lengths available: 128, 256, 512, 1024.
Such as?
I have a database with big columns that get functions applied to them to compute the result set. This is a perfect case for length-agnostic instructions, except it ends up horribly memory bound. A nice optimization is to only compute those lanes containing rows that might actually be in the result set, by keeping track of a sparse record that depends on the lane size. But the cnt instructions are optional, and this also inhibits compiler optimizations in that lookup.
CNT and CNTP don't seem to be optional for SVE, from what I found. (unless you mean HISTCNT)
It seems to me like you want to use CNTP on a bitset that tells you which rows are relevant, skipping them if the count is 0?
Is that what you were describing?
I'm no C++ dev, but as an outsider, it sure reads like the whole "int is variable length" mistake again.
That abstraction is occasionally usable in low level systems code, that is why Go, Rust, D and C# support it as well.
Also to note that is C not C++.
That's a mistake for ABI visible types, yes.
In a way it's worse because at least with int you're not really expecting to run the same binary on architectures with different int lengths, and also for several decades there have only been two realistic options (32 or 64), which makes it easy to deal with.
With RVV (and SVE I assume) there are a wider range of realistic options - at least 128, 256 and 512. The RVV spec allows up to 65536! Also it's totally reasonable to want a single binary to work with all of them so then you're into compiling parts of your code multiple times with runtime dispatch which is a right pain.
I haven't looked into how Highway does it, but I don't really know how you write length-agnostic code in high-level languages. It's easy in assembly, but it sucks if you have to do it in assembly.
There is a bit of boilerplate to get dynamic dispatch working, but apart from that it's quite simple to use.
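One library-free way to get a length-agnostic flavor in C++, as a sketch (mine, using the GCC/Clang vector extension; under SVE/RVV proper, the lane count would come from the hardware rather than a typedef):

```cpp
#include <cstddef>
#include <cstring>

// The lane count is derived from the vector type, so the loop body is
// written once; retargeting from 8 to 16 lanes is a one-line typedef change.
typedef float vf __attribute__((vector_size(32)));  // 8 floats here
constexpr std::size_t kLanes = sizeof(vf) / sizeof(float);

void scale(float* x, std::size_t n, float a) {
    vf va;
    for (std::size_t k = 0; k < kLanes; ++k) va[k] = a;  // splat the scalar
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {               // vector body
        vf v;
        std::memcpy(&v, x + i, sizeof v);
        v *= va;
        std::memcpy(x + i, &v, sizeof v);
    }
    for (; i < n; ++i) x[i] *= a;                        // scalar tail
}
```

The dynamic-dispatch boilerplate (compiling this per target and picking at runtime) is the part that libraries like Highway take off your hands.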
I don't know how SVE works but I thought the point of it was to let implementations pick a larger size than the CPU supports and then get an automatic speedup from better processors with more vector lanes.
Currently experimental, but looks like the first Intel arch will arrive in the next release in about 3 months. They are also going to support a portable layer.
Wondering what people here think about the approach the Go team is taking; I think they would appreciate more eyeballs on their design. (I’m not competent in this space (yet))…
Looks like that isn't a portable SIMD abstraction, but more like adding architecture-specific SIMD intrinsics support to Go, with nicer syntax.
Unnecessarily negative article. Let's not forget how awful C++98 was for years. Standardisation doesn't mean useful.
GCC already solved it:
https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
The operations behave like C++ valarrays. Addition is defined as the addition of the corresponding elements of the operands. For example, in the code below, each of the 4 elements in a is added to the corresponding 4 elements in b and the resulting vector is stored in c.
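Concretely, the extension looks like this (GCC/Clang only; MSVC has no equivalent):

```cpp
// Declare a vector type with __attribute__((vector_size)) and the ordinary
// operators become element-wise, as the docs describe.
typedef int v4si __attribute__((vector_size(16)));  // 4 ints, 16 bytes

// c[i] = a[i] + b[i] for each of the 4 lanes.
inline v4si add4(v4si a, v4si b) { return a + b; }
```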
Those type attributes are also used for the x86 intrinsics API, and they override default C behaviors like promotions and presumptions around aliasing (ironically they make type punning easier, though maybe it was just the few use cases I explored, and this isn't an area where I have a lot of experience). C23 also gained the _BitInt type, which discards all the old promotion rules, which should help autovectorization.
I think ISPC is still the proper way to go. But these days everybody wants One Language to Rule Them All along with standard libraries for doing everything out-of-the-box. And while in principle ISPC's approach could be stitched into C or C++ in a fairly clean manner (perhaps with well-defined and enforced segregation of constructs to minimize complexity), it's just not gonna happen: C++ is too enamored with constructing libraries through deeply complex templated types (hammer, nail, yada yada), and C is just too conservative (though if GCC or clang went the distance with a full implementation, there's a good chance the C committee would adopt it).
“Supports the GCC, OpenCL, AltiVec, NEON, SVE and RVV vector extensions”
Thanks!
I'd actually rather just have the compiler give some guarantees on producing SIMD code when you write regular C++ code doing sums, multiplications, etc... in a particular way. And perhaps add a few more operators/keywords to the language for modern CPU instructions (we got things like popcount, countl_zero and fma, but what about e.g. pext, pdep, aes, ...)
Just write inline asm for x86 and aarch64 (if you care about that) and don't worry about the rest. Is it even useful to do SIMD on other processors?
Having the compiler optimize even the code around the SIMD code based on the semantics of arithmetic or other things sounds silly once you've written some of this kind of code.
So you "just" write 4 assembly implementations?
If you thought std::simd was a library nobody asked for, just wait until you hear about <linalg>. I feel like half the people looking forward to that think they're just going to get standard C++ bindings to LAPACK, when instead they're probably going to get an unoptimized, slapdash implementation of LAPACK written by people who aren't good at BLAS.
As for SIMD itself, designing a good SIMD library is difficult because there are several different SIMD approaches and some of them work poorly for certain use cases. For example, you can take an HPC-ish approach of "vectorize this loop" (à la #pragma omp simd) and have the compiler take care of a fairly mechanical transformation. Or you can take an opposite approach of treating a 128-bit SIMD vector as a fundamental data type in your language. Which approach is better depends on your use case.
Just wait until you hear about std::hive.
The work of one obsessive author, who never gave a good explanation for why the thing needed to be in the standard library instead of an external one. The committee was apathetic about the proposal and kept bringing up various trivial issues, in a clear attempt to stall him, but he refused to take the hint. So eventually they relented. Outside coverage I have seen so far seems to be to the tune of "WTF is this weird thing?" and quickly glosses over it.
I wonder if it's going to end up like the export keyword.
I feel like std::hive fits right in to the C++ stdlib group of collections
The least stupid is std::vector, which is just the typical O(1)-amortized growable array type found in most modern languages, with a mediocre API. 8/10, could do better.
std::array is just the built-in array type C++ should have but doesn't. This shouldn't be a library type, that's embarrassing.
std::deque looks like you're getting something like Rust's VecDeque but you aren't, it's a weird hybrid optimisation which presumably made sense on some 1980s hardware. I asked STL once to explain what it's even for and they didn't know. [[For reference STL is the name of the guy in charge of Microsoft's implementation of the C++ standard library, Microsoft also calls that library STL for reasons we needn't address]]
std::list is the extrusive doubly linked list. This type makes sense in a DSA class. Why is it in the C++ standard library? I dunno, maybe C++ is intended only as a teaching language?
std::forward_list is the extrusive singly linked list. You know, for a different seminar in that same DSA class. You might want the intrusive linked list, you don't want this.
std::map and std::set are probably red-black trees. OK, you might need those and for some reason not care about the details (which aren't specified here)
std::multimap and std::multiset are even less obviously useful. I have never seen them used in real software. Why are they in the standard library?
std::unordered_all_of_the_above_maps_and_sets look like the simplistic hash table you'd be shown in an intro DSA class either taught by somebody who doesn't know the subject well or aiming to cover the basics and get back to their research. This will perform poorly on any hardware with features like a cache.
The C++ stdlib carries broken garbage basically indefinitely. C++ doesn't have the same library stability promise that Rust has, but in practice stuff that nobody cares about is never removed.
I'm not sure what the argument is here?
These are in the standard library because someone proposed their inclusion.
They're fine for the majority of people who really don't want to roll their own data structures each time.
They're not compulsory to use, you're still free to roll your own.
[deleted]
The article's point in a nutshell:
> The problem is that std::simd in 2026 is the 2012 solution arriving after the world moved on. The committee spent a decade polishing a library-based approach while compilers solved the easy cases automatically and ISPC solved the hard cases with language-level support.
I find it interesting that the C++ committee would make that kind of mistake. Shouldn't they know better?
> I find it interesting that the C++ committee would make that kind of mistake. Shouldn't they know better?
The main reason why people attend WG21 meetings is to get their pet features into the C++ language or the associated standard library. To some extent you can further that goal by shooting down other people's suggestions, especially if they would conflict with yours ‡
C++ is a vast sprawling language. There are no genuine "C++ experts" for the same reason there aren't any people who know all of mathematics. There are a lot of people who are experts on some corner of the language or its libraries, and some who know a little bit about almost everything but no overarching experts.
‡ A good way to do this work would be to have such rivals all work together to improve the language, a sort of "yes, and" collaborative approach. Although this has occasionally worked in C++, the whole WG21 structure works against it; in particular, they vote to "achieve consensus", which is not what the word "consensus" means, and which rewards appeasing haters much more than finding out what the problems are and working to fix them.
Why isn't just writing inline assembly enough?
You optimize for a specific target.
The problem is that you cannot be cross-platform. Sure.
But that is why software is incremental.
I write for my HW, not yours. You can write for yours.
Make folders with implementations
x86_v1
x86_v2
arm64
riscv64
...
...
...
and include
sadly inline assembly is still at the ergonomics of "one compiler doesn't support it in x64 mode" and "you can choose between the readable syntax (which is a black box to the compiler) and the unreadable syntax (which can specify I/O/clobber regs)"
sigh
C++ sits at that weird abstraction level where it wants to be a higher-level language, but it keeps grinding its gears on stuff like pointer sizes, pointer arithmetic or vector sizes, while at the same time it wants to stay C-compatible and needs that interface with the lower-level world
Now compare with how numpy does things: you care about the data size but not the implementation.
Still, I expected nothing less (of a crap fest) from the C++ committee, as presented here
numpy is a python wrapper over a C library written by people who have ground those gears
Yes but not all of them
It would be easy to push complexity up at the level of Numpy/Pytorch/Tensorflow but it mostly gets hidden
(also a lot of it relies on LAPACK which is Fortran - which kinda works with SIMD better than C/C++)
Slop.
Nobody should read that AI slop article. Nobody.
Maybe there's an interesting story in there, it's certainly possible. But the "author" could not be bothered to write it, so why should we waste our time reading it?
I love people praise Claude for doing their work, every day on HN, while at the same time complaining about AI in articles.
Who says these are the same people?
Statistics.
Glad to see the classic goomba fallacy in action even here on HN.
I praise Claude and hate AI articles because I could've asked Claude to dumb down the debate if I wanted.
Articles should be high information density and summarizable with Claude.
Some would argue code should be the product of craftsmanship and vibe coding has no place in it.
I hate AI in code, I hate AI in articles, I hate when AI sticks to the bottom of my shoe.
Overly wordy and repetitive, taking 3x the words a human would have used.
Agreed, fixed-width vectors need to be a language feature; that gives better compile times and would solve the issue for most people.
Personally, I think that something like Clang's way of adding GLSL-like vectors and semantics would've gone a long way. SVE might be an elegant design, but in reality there is probably several times more game and other 3D code being written that needs vectors than code in other fields, and there the limited vector sizes aren't really a problem.
And honestly, considering the story of AVX-512, with 512-bit vectors being removed from mainstream parts by Intel, do we really need longer ones just because they come from a "scalable design"?
The point about the optimizer only seeing "opaque templates and function calls" makes little sense.
First off, templates are the opposite of opaque due to the fundamental requirement that the implementation be visible to every translation unit using a template. This makes any function calls trivially inlinable.
Second, and the reason for the above requirement, templates are compiled by monomorphization – making a distinct, separately optimizable copy of each concrete instantiation of a template. By the time the compiler backend sees the intermediate representation, there’s nothing about templates left.
There are of course reasons why highly abstracted template code may be difficult to optimize, for instance if function call chains are so deep that the inliner gives up. There are also legitimate reasons why a fully language-based solution might beat a library-based one. But one of the points of adding a library to the std is that the standard library is allowed to cheat as much as it wants. It can be deeply integrated to the compiler and implemented entirely using compiler magic if necessary.
std::simd may be too little, too late for many reasons, but I doubt any of them is that the compiler can’t see through the code.
It's just a compiler hint, like all other hints; parity with other languages.
I have written a lot of SIMD for both x86 and ARM over many years and many microarchitectures. Every abstraction, including autovectorization, is universally pretty poor outside of narrow cases because they don’t (and mostly can’t) capture what is possible with intrinsics and their rather extreme variation across microarchitectures. If I want good results, I have to write intrinsics. No library can optimally generate non-trivial SIMD code. Neither can the compiler. Portability just amplifies this gap.
I think a legitimate criticism is that it is unclear who std::simd is for. People that don’t use SIMD today are unlikely to use std::simd tomorrow. At the same time, this does nothing for people that use SIMD for serious work. Who is expected to use this?
The intrinsics are not difficult but you do have to learn how the hardware works. This is true even if you are using a library. A good software engineer should have a rough understanding of this regardless.
In such discussions, whenever you mention abstractions are universally "pretty poor", to the extent anyone is listening, I think this hyperbole can do real damage. Maybe it prevents people from getting relevant performance gains, even if not 100% of the optimum, which is anyway unattainable. And what is the alternative? Not many projects can afford to hand write intrinsics for all platforms. And are you aware that Highway is basically a thin wrapper over intrinsics, which you can still drop down to where it helps?
> 100% of the optimum, which is anyway unattainable.
Can you expand on this? Sounds like an interesting discussion.
Do you say that from the perspective of compiled languages? I hear good things about .net core wrt SIMD, but that has the advantage it can decide at JIT.
Yep, same here and agree.
Compilers have definitely got better though. Another issue in the past (maybe still, to a degree? compilers have got a lot better at this in the past 15 years, but it used to be one of the things only Intel's ICC actually got right): if you wrapped the base-level '__m128' or 'float32x4_t' in a struct/union in order to provide some abstraction, the compiler would often lose track of this when passing the struct/union through functions (either by value or by const ref), and would often end up 'spilling' (not entirely the correct terminology in this context, but...) the variable from registers, producing asm which uselessly reloaded the variable from a stack address further up the call stack when it didn't actually need to. So that was the situation even when using intrinsics within custom wrappers.
From 2011 to around 2013 ICC seemed to be the only compiler on amd64 which wouldn't do this. If you passed the actual '__m128' down the function call chain instead, clang and gcc would then do the right thing.
Part of that could be ABI constraints. There are some surprising calling convention differences between a vector and a struct or union with vectors in it, and they vary platform to platform. E.g. on ARM a struct with two 128-bit vectors will pass in two registers where on x86 it must pass via the stack.
Using __attribute__ to tweak calling conventions can often really clean this up, but that's just as obscure and non-portable as the problem it fixes. So you either end up writing weird non-portable code one way or weird non-portable code another... Code working with these types doesn't get to benefit from zero-cost abstraction to the degree we're used to with normal scalar code.
For me the main issue is that if you're serious about SIMD, you need to use a state-of-the-art library and can't rely on some standard library whose quality is variable, unreliable, and which is by design always behind.
For some algorithms you have to compromise the data layout for compatibility across the widest number of microarchitectures by nerfing the performance on advanced SIMD microarchitectures working on the same data structures. There really isn’t a way to square that circle. You can make it portable or you can make it optimal, and the performance gap across those two implementations can be vast.
In the 15-20 years I’ve been doing it, I’ve seen zero evidence that there is a solution to this tradeoff. And people that are using SIMD are people that care about state-of-the-art performance, so portability takes a distant back seat.
For Boost.SIMD (which is what became Eve), a large part of what we did to tackle those problems was building an overload dispatching system so that we could easily inject increasingly specialized implementations depending on the types and instruction set available, in such a way that operations could combine efficiently.
That, however, performed quite poorly at compile-time, and was not really ODR-safe (forceinline was used as a workaround). At least one of the forks moved to using a dedicated meta-language and a custom compiler to generate the code instead. There are better ways to do that in modern C++ now.
We also focused on higher-level constructs trying to capture the intent rather than trying to abstract away too low-level features; some of the features were explicitly provided as kernels or algorithms instead of plain vector operations.
NumPy has a whole dispatch mechanism to deal with the tradeoffs. The main problem is code bloat: how many microarchitectures are you going to support with dispatch at runtime?
The data layout can often be done dynamically based on your target architecture.
Don't let the best be the enemy of the good. I got amazing performance for swapping for-loops with some simple SIMD patterns. Moreover. By doing this. I noticed that the codebase started to become better shaped for performance as well. By writing SIMD patterns, you get into the mindset of tight, hot loops.
The problem is that you're better off by defining SIMD friendly data structures and letting the compiler figure it out than by hand coding the actual SIMD operations.
If you wanted to explicitly opt into bundling/batching of operations, you wouldn't actually want to define a fixed register size. You'd want a data type that represents an arbitrarily sized register and exposes some across batch operations. Then the compiler can make use of this mini DSL to optimize your SIMD code to actual instructions.
The problem is solvable, but it requires cooperation from all parties. CPU vendors must offer a basic set of vector instructions that is supported on all architectures. The language committee must be willing to support function local variable size data types that are never exposed in the ABI. The compiler developers must increase the quality of their auto vectorizers.
This works today :) Highway provides such an abstraction for arbitrary vector lengths and maps them to intrinsics. All on the library level, no need to wait years for compiler or language updates.
> The problem is that you're better off by defining SIMD friendly data structures and letting the compiler figure it out than by hand coding the actual SIMD operations.
This will work only for the most basic SIMD usages.
> CPU vendors must offer a basic set of vector instructions that is supported on all architectures.
This will take decades because you cannot change existing architectures/processors.
> This will take decades because you cannot change existing architectures/processors.
I think once AVX-512, SVE and RVV are widespread enough, you'll have a rather powerful base level you can target. But that will take a lot of time.
> I think a legitimate criticism is that it is unclear who std::simd is for.
I think it's for people like me, who recognize that depending on the dataset that a lot of performance is left on the table for some datasets when you don't take advantage of SIMD, but are not interested in becoming experts on intrinsics for a multitude of processor combinations.
Having a way to be able to say "flag bytes in this buffer matching one of these five characters, choose the appropriate stride for the actual CPU" and then "OR those flags together and do a popcount" (as I needed to do writing my own wc(1) as an exercise), and have that at least come close to optimal performance with intrinsics would be great.
Just like I'd rather use a ranged-for than to hand count an index vs. a size.
> People that don’t use SIMD today are unlikely to use std::simd tomorrow.
I mean, why not? That's exactly my use case. I don't use SIMD today as it's a PITA to do properly despite advancements in glibc and binutils to make it easier to load in CPU-specific codes. And it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions. But it is legitimately important for improving performance for many workloads, so I don't want to miss it where it will help.
And even gaining 60 or 70% of the "optimal" SIMD still puts you much closer to peak performance than the alternative.
In the end I did end up having to write some direct SIMD intrinsics, I forget what issue I'd run into starting off with std::simd, but std::simd was what had made that problem seem approachable for the first time.
You raise some good points. I think a lot about how to make SIMD more accessible, and spend an inordinate amount of time experimenting with abstractions, because I’ve experienced its many inadequacies.
The design of the intrinsics libraries do themselves no favors and there are many inconsistencies. Basic things could be made more accessible but are somewhat limited by a requirement for C compatibility. This is something a C++ standard can actually address — it can be C++ native, which can hide many things. Hell, I have my own libraries that clean this up by thinly wrapping the existing intrinsics, improving their conciseness and expressiveness for common use cases. It significantly improves the ergonomics.
An argument I would make though is that the lowest common denominator cases that are actually portable are almost exactly the cases that auto-vectorization should be able to address. Auto-vectorization may not be good enough to consistently address all of those cases today but you can see a future where std::simd is essentially vestigial because auto-vectorization subsumes what it can do but it can’t be leveled up to express more than what auto-vectorization can see due to limitations imposed by portability requirements.
The other argument is that SIMD is the wrong level of abstraction for a library. Depending on the microarchitecture, the optimal code using SIMD may be an entirely different data structure and algorithm, so you are swapping out SIMD details at a very abstract macro level, not at the level of abstraction that intrinsics and auto-vectorization provide. You miss a lot of optimization if you don’t work a couple levels up.
SIMD abstraction and optimization is deeply challenging within programming languages designed around scalar ALU operators. We can’t even fully abstract the expressiveness of modern scalar ALUs across microarchitectures because programming languages don’t define a concept that maps to the capabilities of some modern ALUs.
That said, I love that silicon has become so much more expressive.
IMO what's needed is ISPC like guided autovec with a lot of hinting support to control codegen (e.g. hint for generating an unrolled version only or an unrolled and non-unrolled version).
Basically something like #pragma omp simd, but actually designed for the SIMD model, not the parallel one, and that errors when vectorization isn't possible.
Ideally it would support things like reductions, scans, references to elements from other iterations (e.g. out[i] = in[i-1]+in[i+1]), full gather/scatter, early break, conditional execution control (masking, or also a fast path when no elements are active), latency vs throughput sensitivity (don't unroll, or unroll to the max without spilling), data-dependent termination (fault-only-first load or page-aligned for things like strlen), ...
Have you considered our Highway library? Runtime dispatch need not be a PITA :) It's basically portable intrinsics, and a much more complete set (>300) than the ~50 in std.
Does it have fallback paths for everything, though? Scalar if necessary?
Projects that depend on Highway drop support for CPUs not listed in the Highway documentation, saying that they can't support these CPUs because they are incompatible with Highway: https://google.github.io/highway/en/master/README.html#curre...
Are these projects somehow mistaken?
> it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions
This is one complaint I toss back at Intel and AMD.
If an instruction/intrinsic is universally worse than some other set of intrinsics for the P90/P95/P99 use cases where it would actually be used, then it shouldn't exist. Stop wasting the die space and instruction decode on it, to say nothing of the developer time wasted finding out that your dot product instruction is useless.
There are a lot of smart people that have worked on compilers, optimized subroutines for LAPACK/BLAS, and designed the decoders and hardware. A lot of that effort is wasted because no one knows how to program these weird little machines. A little manual on "here's how to program SIMD, starting from linear algebra basics" would be worth more to Intel than all the money they've wasted trying to improve autovectorization passes in ICC and now, LLVM.
:) I agree a tutorial would be helpful. We are working on one with Fastcode.
Autovectorisation is the main way SIMD hardware gets put into use, whether you think it's pretty poor or not.
SIMD came to the mainstream with Pentium MMX in the late 1990s and has proven rather difficult for compilers to target, but after nearly 30 years it is doing a bit better despite PLT conspiring against it (see e.g. CUDA, Futhark, etc.)
In my limited experience with looking at autovectorisation compiler output, gcc is quite bad unless you hold its hand, and clang tries to autovectorise everything it sees.
Is this a technical impossibility, or has it just not been done yet? Could a library support generating intrinsics for a large set of architectures?
The full scope of what SIMD is used for is much larger than parallelizing evaluation of numeric types and algorithms.
For example, it is used for parallel evaluation of complex constraints on unrelated types simultaneously while packed into a single vector. Think a WHERE clause on an arbitrary SQL schema evaluated in full parallel in a handful of clock cycles. SIMD turns out to be brilliant for this but it looks nothing like auto-vectorization.
None of the SIMD libraries like Google Highway cover this case.
I don't quite get how something like highway doesn't cover this, while intrinsics do.
Can you explain the usecase more concretely?
Almost literally what I stated. Consider a row in Postgres table or similar. Convert the entire WHERE clause across all columns in that table into a very short sequence of SIMD instructions against the same memory. All of the columns, regardless of type, are evaluated simultaneously using SIMD. For many complex constraints you can match rows in single digit clock cycles even across many unrelated types. This is much faster than using secondary indexes in many cases.
It isn’t hypothetical, I’ve shipped systems that worked this way. You can match search patterns across a random dozen columns across a schema of hundreds of columns at essentially full memory bandwidth.
OK, I thought it couldn't be that, because that should be doable with std::simd or a SIMD abstraction. Well, unless you JIT it, in which case intrinsics wouldn't help either.
> You can match search patterns across a random dozen columns across a schema of hundreds of columns at essentially full memory bandwidth
Do I understand it correctly that this would only work if you have multiples of the same comparison (e.g. an equality check on same-sized data) in the WHERE clause, and the relevant columns are within one multiple of the SIMD width of each other?
Every column has its own independent constraint: equality, order, range intersection, bit sets, etc that is evaluated concurrently in single operations. Independent per column in parallel. It does require handling the representation of columns to enable it but that isn’t onerous in practice.
It isn’t intuitive but it is one of those things that is obvious in hindsight once you see how it works. The gap is that people struggle to understand how to make this something SIMD native, especially in high-performance systems.
Ah, so you're just doing SoA or AoSoA layout? It sounded like you were doing something more special than the standard SIMD use case.
This does easily work with SIMD abstractions and even length-agnostic vector ISAs, unless you're doing AoSoA and your storage format has to match your in-memory format and has to be the same on all machines. In which case you probably want to do something like 4K blocks anyway, in which case you can make it agnostic to all vector lengths anybody reasonably cares about for this type of application anyway.
Google Highway gets mentioned in the article.
There is google’s highway, that provides an abstraction layer. It is used by NumPy.
what about Google highway project?
> I think a legitimate criticism is that it is unclear who std::simd is for
It's for people that don't use SIMD today.
SIMD is hard, or at least nuanced and platform-dependent. To say that std::simd doesn't lower the learning curve is intellectually dishonest.
---
Despite the title, the primary criticism in the article is that the compilers' auto-vectorizers have improved faster than the currently shipped stdlib version.
My criticism could mostly be summarized similarly. The scope of what a portable std::simd can do is almost exactly the scope that you would expect auto-vectorization to subsume over time. SIMD, to the extent it is covered by std::simd, is the part of SIMD that should be pretty simple to learn.
There isn’t an obvious path to elevate it above what auto-vectorization should theoretically be capable of in a portable way. This leads to a potential long-term outcome where std::simd is essentially a no-op because scalar code is automagically converted into the equivalent and it is incapable of supporting more sophisticated SIMD code.
[dead]
The linked[1] "six reasons to use std::simd" was just what I needed after a long week. Hilarious!
[1]: https://github.com/NoNaeAbC/std_simd
It should have been "eight reasons to use std::simd". Inefficient.
Isn't that just a QoI issue? There's a reason why the libstdc++ folks labelled their implementation as experimental.
That certainly convinced me. When I was doing my taxes recently and had to watch those forced loading animations, I kept asking myself "why can't my compiler do this?" Thanks to std::simd, now it can!
I made the first proposal to the C++ standard committee to introduce SIMD in 2011, before Matthias Kretz got involved with his own version (which is what became std::simd). This was based on what eventually became Eve (mentioned in the article).
Back then, it was rejected, for the same arguments that people are making today, such as not mapping to SVE well, having a separate way to express control flow etc.
There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language. Then that died out (I'm not sure why), and SIMD became trendy, so the committee was more open to doing something to show that they were keeping up with the times.
> There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language.
I think this is the best solution for truly portable SIMD. Sure, it doesn't cover everything, but it makes autovec explicit, guaranteed and more powerful.
One of the biggest problems with "portable" SIMD libraries is that when they're used for simple things, autovec is often better, since it has access to the direct ISA semantics and can much more easily do things like unrolling.
To me it’s clear adding the ability to express intent to parallelise is the Right Thing. This is the only way the compiler can actually know what you want it to do.
Trying to abstract over SVE with a SIMD library is a bit of a fool's errand. The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it. All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.
Frankly, the length agnostic stuff is a mistake that I hope hardware designers will eventually see the light on, like delay slots.
> Trying to abstract over SVE with a SIMD library is a bit of a fool's errand
It really isn't. You just make the default SIMD-width-agnostic and anything less portable opt-in.
You can still specialize for a specific width on scalable vector ISAs.
> The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it.
Such as?
> All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.
Google Highway doesn't. And while Arm is stuck with 128-bit SVE, because they also have to implement NEON as fast as possible to stay competitive, RVV already has a large diversity of hardware with different vector lengths available: 128, 256, 512, 1024.
CNT and CNTP don't seem to be optional for SVE, from what I found. (unless you mean HISTCNT)
It seems to me like you want to use CNTP on a bitset that tells you which rows are relevant, skipping them if the count is 0? Is that what you were describing?
I'm no C++ dev, but as an outsider, it sure reads like the whole "int is variable length" mistake again.
That abstraction is occasionally usable in low level systems code, that is why Go, Rust, D and C# support it as well.
Also, note that that is C, not C++.
That's a mistake for ABI visible types, yes.
In a way it's worse because at least with int you're not really expecting to run the same binary on architectures with different int lengths, and also for several decades there have only been two realistic options (32 or 64), which makes it easy to deal with.
With RVV (and SVE I assume) there is a wider range of realistic options: at least 128, 256 and 512. The RVV spec allows up to 65536! Also it's totally reasonable to want a single binary to work with all of them, so then you're into compiling parts of your code multiple times with runtime dispatch, which is a right pain.
I haven't looked into how Highway does it, but I don't really know how you'd write length-agnostic code in high-level languages. It's easy in assembly, but it sucks if you have to do it in assembly.
Here is a highway example: https://gcc.godbolt.org/z/7sdPr61W6
There is a bit of boilerplate to get dynamic dispatch working, but apart from that it's quite simple to use.
I don't know how SVE works but I thought the point of it was to let implementations pick a larger size than the CPU supports and then get an automatic speedup from better processors with more vector lanes.
Curious if people here have looked at the upcoming SIMD support in Go: https://go.dev/doc/go1.26#simd
Currently experimental, but looks like the first Intel arch will arrive in the next release in about 3 months. They are also going to support a portable layer.
Wondering what people here think about the approach the Go team is taking; I think they would appreciate more eyeballs on their design. (I’m not competent in this space (yet))…
Looks like that isn't a portable SIMD abstraction, but more similar to adding architecture-specific SIMD intrinsics support to go, with nicer syntax.
Unnecessarily negative article. Let's not forget how awful C++98 was for years. Standardisation doesn't mean useful.
GCC already solved it: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html The operations behave like C++ valarrays. Addition is defined as the addition of the corresponding elements of the operands. For example, in the code below, each of the 4 elements in a is added to the corresponding 4 elements in b and the resulting vector is stored in c.
Those type attributes are also used for the x86 intrinsics API, and they override default C behaviors like promotions and assumptions around aliasing (ironically they make type punning easier, though maybe that was just in the few use cases I explored; this isn't an area where I have a lot of experience). C23 also gained the _BitInt type, which discards all the old promotion rules and should help autovectorization.
I think ISPC is still the proper way to go. But these days everybody wants One Language to Rule Them All along with standard libraries for doing everything out-of-the-box. And while in principle ISPC's approach could be stitched into C or C++ in a fairly clean manner (perhaps with well-defined and enforced segregation of constructs to minimize complexity), it's just not gonna happen: C++ is too enamored with constructing libraries through deeply complex templated types (hammer, nail, yada yada), and C is just too conservative (though if GCC or clang went the distance with a full implementation, there's a good chance the C committee would adopt it).
And these are also available in clang. https://clang.llvm.org/docs/LanguageExtensions.html#vectors-...:
“Vectors and Extended Vectors
Supports the GCC, OpenCL, AltiVec, NEON, SVE and RVV vector extensions”
Thanks!
I'd actually rather just have the compiler give some guarantees on producing SIMD code when you write regular C++ code doing sums, multiplications, etc... in a particular way. And perhaps add a few more operators/keywords to the language for modern CPU instructions (we got things like popcount, countl_zero and fma, but what about e.g. pext, pdep, aes, ...)
Just write inline asm for x86 and aarch64 (if you care about that) and don't care about the rest. Is it even useful to do SIMD on other processors?
Having the compiler optimize even the code around the SIMD code, based on the semantics of arithmetic or other things, sounds silly once you've written some of this kind of code
So you "just" write 4 assembly implementations?
If you thought std::simd was a library nobody asked for, just wait until you hear about <linalg>. I feel like half the people looking forward to that think they're just going to get standard C++ bindings to LAPACK, when instead they're probably going to get an unoptimized, slapdash implementation of LAPACK written by people who aren't good at BLAS.
As for SIMD itself, designing a good SIMD library is difficult because there are several different SIMD approaches and some of them work poorly for certain use cases. For example, you can take an HPC-ish approach of "vectorize this loop" (à la #pragma omp simd) and have the compiler take care of a fairly mechanical transformation. Or you can take an opposite approach of treating a 128-bit SIMD vector as a fundamental data type in your language. Which approach is better depends on your use case.
Just wait until you hear about std::hive.
The work of one obsessive author, who never gave a good explanation for why the thing needed to be in the standard library instead of an external one. The committee was apathetic about the proposal and kept bringing up various trivial issues, in a clear attempt to stall him, but he refused to take the hint. So eventually they relented. Outside coverage I have seen so far seems to be to the tune of "WTF is this weird thing?" and quickly glosses over it.
I wonder if it's going to end up like the export keyword.
I feel like std::hive fits right in to the C++ stdlib group of collections
The least stupid is std::vector which is just the typical O(1) amortized growable array type found in most modern languages, with a mediocre API. 8/10 could do better.
std::array is just the built-in array type C++ should have but doesn't. This shouldn't be a library type, that's embarrassing.
std::deque looks like you're getting something like Rust's VecDeque but you aren't, it's a weird hybrid optimisation which presumably made sense on some 1980s hardware. I asked STL once to explain what it's even for and they didn't know. [[For reference STL is the name of the guy in charge of Microsoft's implementation of the C++ standard library, Microsoft also calls that library STL for reasons we needn't address]]
std::list is the extrusive doubly linked list. This type makes sense in a DSA class. Why is it in the C++ standard library? I dunno, maybe C++ is intended only as a teaching language?
std::forward_list is the extrusive singly linked list. You know, for a different seminar in that same DSA class. You might want the intrusive linked list, you don't want this.
std::map and std::set are probably red-black trees. OK, you might need those and for some reason not care about the details (which aren't specified here)
std::multimap and std::multiset are even less obviously useful. I have never seen them used in real software. Why are they in the standard library?
std::unordered_all_of_the_above_maps_and_sets look like the simplistic hash table you'd be shown in an intro DSA class either taught by somebody who doesn't know the subject well or aiming to cover the basics and get back to their research. This will perform poorly on any hardware with features like a cache.
The C++ stdlib carries broken garbage basically indefinitely. C++ doesn't have the same library stability promise that Rust has, but in practice stuff that nobody cares about is never removed.
I'm not sure what the argument is here?
These are in the standard library because someone proposed their inclusion.
They're fine for the majority of people who really don't want to roll their own data structures each time.
They're not compulsory to use, you're still free to roll your own.
The article's point in a nutshell:
> The problem is that std::simd in 2026 is the 2012 solution arriving after the world moved on. The committee spent a decade polishing a library-based approach while compilers solved the easy cases automatically and ISPC solved the hard cases with language-level support.
I find it interesting that the C++ committee would make that kind of mistake. Shouldn't they know better?
> I find it interesting that the C++ committee would make that kind of mistake. Shouldn't they know better?
The main reason why people attend WG21 meetings is to get their pet features into the C++ language or the associated standard library. To some extent you can further that goal by shooting down other people's suggestions, especially if they would conflict with yours. ‡
C++ is a vast sprawling language. There are no genuine "C++ experts" for the same reason there aren't any people who know all of mathematics. There are a lot of people who are experts on some corner of the language or its libraries, and some who know a little bit about almost everything but no overarching experts.
‡ A good way to do this work would be to have such rivals all work together to improve the language, a sort of "yes, and" collaborative approach. Although this has occasionally worked in C++, the whole WG21 structure works against it: in particular, they vote to "achieve consensus", which is not what the word consensus means, and which rewards appeasing haters much more than finding out what the problems are and working to fix them.
Why is just writing inline assembly not enough?
You optimize for a specific target.
The problem is that you cannot be cross-platform. Sure.
But that is why software is incremental.
I write for my HW, not yours. You can write for yours.
Make folders with implementations
x86_v1 x86_v2 arm64 riscv64 ... ... ...
and include the right one.
Sadly, inline assembly is still at the ergonomics level of "one compiler doesn't support it in x64 mode" and "you can choose between the readable basic syntax (which is a black box to the compiler) and the unreadable extended syntax (which can specify input/output/clobber regs)".
sigh
C++ sits at that weird abstraction level where it wants to be a higher-level language but keeps grinding its gears on stuff like pointer sizes, pointer arithmetic, and vector sizes, while at the same time it wants to stay C-compatible and needs that interface with the lower-level world.
Now compare with how numpy does things: you care about the data size but not the implementation.
Still, I expected no less (of a crap fest) from the C++ committee as presented here.
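A minimal sketch of the numpy point (assuming numpy is installed): you pick the element type and the operation, and whether the loop runs through SSE, AVX, or NEON kernels is an internal detail you never see.

```python
import numpy as np

# You specify the data size (float32) and the math; the element-wise
# implementation (SIMD or not) is numpy's problem, not yours.
a = np.arange(8, dtype=np.float32)
b = a * 2.0 + 1.0          # no per-platform code in sight

assert b.dtype == np.float32
assert b[3] == 7.0         # 3 * 2 + 1
```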
numpy is a python wrapper over a C library written by people who have ground those gears
Yes but not all of them
It would be easy to push complexity up at the level of Numpy/Pytorch/Tensorflow but it mostly gets hidden
(also a lot of it relies on LAPACK, which is Fortran, which kinda works better with SIMD than C/C++ does)
Slop.
Nobody should read that AI slop article. Nobody.
Maybe there's an interesting story in there; it's certainly possible. But the "author" could not be bothered to write it, so why should we waste our time reading it?
I love how people praise Claude for doing their work, every day on HN, while at the same time complaining about AI in articles.
Who says these are the same people?
Statistics.
Glad to see the classic goomba fallacy in action even here on HN.
I praise Claude and hate AI articles because I could've asked Claude to dumb down the debate if I wanted.
Articles should be high information density and summarizable with Claude.
Some would argue code should be the product of craftsmanship and vibe coding has no place in it.
I hate AI in code, I hate AI in articles, I hate when AI sticks to the bottom of my shoe.
Overly wordy and repetitive, taking 3x the number of words a human would have needed.
I read it and found it interesting
... because it makes some decent points?