
Both GCC and Clang generate strange/inefficient code

The OP should try with -march=native so the compiler can use vector instructions.

Slightly off-topic, but I like this way to test if memory is all zeroes: https://rusty.ozlabs.org/2015/10/20/ccanmems-memeqzero-itera... (see "epiphany #2" at the bottom of the page). I really wish there were a standard libc function for it.
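
The trick in question, reconstructed from memory (a sketch, not the exact CCAN code), looks roughly like this:

    #include <cstddef>
    #include <cstring>

    // Reconstructed from memory, not the exact CCAN code: check a small
    // prefix by hand, then let the heavily optimized libc memcmp compare
    // the buffer against itself shifted by the prefix length. If the
    // prefix is zero and every byte equals the byte 16 positions before
    // it, the whole buffer must be zero.
    bool memeqzero(const void* data, std::size_t length) {
        const unsigned char* p = static_cast<const unsigned char*>(data);
        for (int i = 0; i < 16; ++i) {
            if (!length) return true;
            if (*p) return false;
            ++p;
            --length;
        }
        // Prefix is all zero; compare the rest against the shifted buffer.
        return std::memcmp(data, p, length) == 0;
    }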

3 hours ago | rwmj

It's common for compilers to generate mildly unusual code because they translate high-level code into an abstract intermediate representation, run a variety of optimization steps on that representation, and then emit machine-specific code to perform whatever the optimizations yielded. There's no constraint along the lines of "but select the most logical opcode for this task".

The claim that the code is inefficient is really not substantiated well in this blog post. Sometimes, long-winded assembly actually runs faster because of pipelining, register aliasing, and other quirks. Other times, a "weird" way of zeroing a register may actually take up less space in memory (the classic example being `xor eax, eax`, which encodes in fewer bytes than `mov eax, 0`), etc.

5 days ago | the_fall

> The claim that the code is inefficient is really not substantiated well in this blog post. Sometimes, long-winded assembly actually runs faster because of pipelining, register aliasing, and other quirks.

I had a case back in the 2010s where I was trying to optimize a hot loop. The loop involved an integer division by a factor common to all elements, similar to a vector normalization pass. For reasons I don't recall, I couldn't get rid of the division entirely.

I saw the compiler emitted an "idiv [mem]" instruction, and I thought surely that was suboptimal. So I reproduced the assembly but changed the code slightly so I could have "idiv reg" instead. All it involved was loading the variable into an unused register before the loop and using that inside the loop.
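
Roughly this shape, reconstructed (the names and element type are my assumptions, not the original code):

    #include <cstddef>

    // Hypothetical reconstruction of the loop described above; names and
    // types are assumptions. Compilers will happily emit `idiv` with a
    // memory operand for `factor` here, and hand-hoisting it into a
    // register is not guaranteed to be faster.
    void scale_down(int* values, std::size_t n, int factor) {
        for (std::size_t i = 0; i < n; ++i)
            values[i] /= factor;  // the division that became "idiv [mem]"
    }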

So I benchmarked it and, much to my surprise, it was a fair bit slower.

I thought it might have been due to loop target alignment, so I spent some time inserting no-ops to align things in various supposedly optimal ways, but it never got as fast. I changed my assembly back to mirror what the compiler had spit out and voila, back to the fastest speed again...

Tried to ask around, and someone suggested it had to do with some internal register load/store contention or something along those lines.

At that point I knew I was done optimizing code by writing assembly. Not my cup of tea.

3 hours ago | magicalhippo

Not to suggest you weren't competent, but did you consider, and try to control for, the possibility that your measurement itself was the problem?

2 hours ago | sidewndr46

Not going to dismiss it, but I did try to not do stupid stuff. I used QueryPerformanceCounter outside the loop, pinned the benchmark thread to a single core, and the array of elements it processed was fairly large. So I don't think overhead or throttling was an issue. The measurements were very consistent and repeatable.

an hour ago | magicalhippo

Fair enough. I've only really ever found assembly-level optimization on embedded microcontrollers to make any degree of sense. Performance optimization usually means something along the lines of "convince co-workers not to implement their own bubble sort" in my line of work.

an hour ago | sidewndr46

It's also rarely worth being optimal in scalar code anymore, particularly at the cost of compilation speed. The exception here is memory accesses and branches that will miss. So the writing of useless zeros is egregious, but other stuff just isn't usually worth caring about these days. It's "good enough" in an age where, even in embedded land, I can run a 48 MHz Cortex-M0 for 10 years on a battery and not worry about a few extra ANDs. I'm much more likely to hit size than speed limitations.

Not to mention, for anything not super battery-limited, you can get a Cortex-M55 running at 800 MHz with a separate 1 GHz NPU, hardware video encoders, etc.

This is before you move into the Rockchip/etc space.

We really just aren't scalar-compute limited in tons of places these days. There are certainly places where we are, but 10-15 years ago, missing little scalar optimizations could make very noticeable differences in the performance of lots of apps, and now it just doesn't anymore.

3 hours ago | DannyBee

> The claim that the code is inefficient is really not substantiated well in this blog post.

I didn't run benchmarks, but in the case of clang writing zeros to memory (which are never used thereafter), there's no way that particular code is optimal.

For the gcc output, it seems unlikely that the three versions are all optimal, given the inconsistent strategies used. In particular, the code that sets the output value to 0 or 1 in the size = 3 version is highly unlikely to be optimal in my opinion. I'd be amazed if it is!

Your point that unintuitive code is sometimes actually optimal is well taken though :)

5 days ago | rsf

In my experience, C++ abstractions give the optimizer a harder job, and thus it generates worse code. In this case, clang emits different code if you write a C version[0] versus the C++ original[1].

Usually abstraction like this means that the compiler has to emit generic code first, which makes it harder to propagate constraints through and arrive at the same final assembly, since it's less similar to the "canonical" version of the code that wouldn't use a magic `==` (in this case) or std::vector methods or something else like that.

[0] https://godbolt.org/z/vso7xbh61

[1] https://godbolt.org/z/MjcEKd9Tr
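
To make the contrast concrete, the two shapes are roughly these (a hypothetical illustration, not the exact code behind the links):

    #include <array>
    #include <cstddef>

    // Hypothetical illustration, not the code behind the godbolt links:
    // the same question asked with a plain loop and with the "magic"
    // operator== of std::array, which the optimizer has to see through
    // before both can reach the same assembly.
    bool all_zero_loop(const unsigned* a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            if (a[i] != 0) return false;
        return true;
    }

    bool all_zero_magic(const std::array<unsigned, 8>& a) {
        return a == std::array<unsigned, 8>{};  // compares against all zeros
    }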

4 days ago | btdmaster

To back up the other commenter: it's not the same. https://godbolt.org/z/r6e443x1c shows that if you write imperfect C++, clang is perfectly capable of optimizing it.

3 hours ago | maccard

I see, yeah, that makes sense. I wanted to highlight that "magic" will, on average, give the optimizer a harder time. Explicit offset loops like that are generally avoided in many C++ styles in favor of iterators.

9 minutes ago | btdmaster

What's strange is I'm finding that gcc really struggles to correctly optimize this.

This was my function (signature and element type reconstructed; the original comment showed only the body):

    // element type assumed; only the loop appeared in the comment
    bool is_all_zero(const std::array<int, 42>& array) {
        for (auto v : array) {
            if (v != 0)
                return false;
        }
        return true;
    }
Clang emits basically the same thing yours does. But gcc ends up really struggling to vectorize this for larger array sizes.

Here's gcc for 42 elements:

https://godbolt.org/z/sjz7xd8Gs

and here's clang for 42 elements:

https://godbolt.org/z/frvbhrnEK

Very bizarre. Clang pretty readily sees that it can use SIMD instructions and really optimizes this, while GCC really struggles to use them. I've even seen strange output where GCC will emit SIMD instructions for the first loop and then fall back on regular x86 compares for the rest.

Edit: Actually, it looks like for large enough array sizes, it flips. At 256 elements, gcc ends up emitting SIMD instructions while clang does pure scalar x86. So strange.

2 hours ago | cogman10

I've had to coerce gcc into emitting SIMD code by using int instead of bool. Also, the early return may be putting it off.
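
Something along these lines (a sketch of the kind of rewrite meant, not the exact code):

    #include <cstddef>

    // Sketch of the kind of rewrite described, not the exact code: OR all
    // elements into an accumulator instead of returning early, giving the
    // auto-vectorizer a branch-free loop body.
    bool all_zero_no_early_exit(const unsigned* a, std::size_t n) {
        unsigned acc = 0;
        for (std::size_t i = 0; i < n; ++i)
            acc |= a[i];
        return acc == 0;
    }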

26 minutes ago | secondcoming

Except that the C++ version doesn't need to be written like that.

Abstractions are welcome when it doesn't matter; when it matters, there are other ways to write the code, and it remains compliant C++.

4 days ago | pjmlp

Compilers also like to unnecessarily copy data to the stack: https://github.com/llvm/llvm-project/issues/53348. This can be particularly annoying in cryptographic code, where you want to minimize the number of copies of sensitive data.

2 hours ago | newpavlov

Sure, the code is strange, but it is not necessarily inefficient. The only way to determine whether it is inefficient is to profile the generated code, and perhaps compare the performance of the compiler-generated code with tweaked or hand-written assembler code that you think might be better.

GCC and Clang both have highly detailed models of the execution pipelines of literally hundreds of processors, which they use during optimization and instruction scheduling to determine which instructions can execute in parallel and to predict and avoid pipeline stalls. This allows them to perform optimizations that mere mortals can only do with assistance from tools like Intel VTune, which provides insight into how execution pipelines are running, and where and when they are stalling.

It's entirely possible that the multiple memory fetches may fuse in the execution pipeline, that seemingly unnecessary instructions may dual-issue and execute in parallel, or that minor variations in the generated instructions may allow four instructions to be decoded in parallel instead of three at a critical moment in the code. These are the kinds of insights that GCC and Clang have into how the code will actually execute that you do not.

Counter-intuitively, many code optimization problems depend not on which instructions are being executed, and not even on how many instructions are being executed, since pretty much every instruction in passably well-optimized code will execute in parallel with at least one other instruction. The actual problem becomes one of predicting whether any operation in a 7- to 20-stage execution pipeline will stall, and whether there are ways to schedule instructions so that the stalls either don't occur at all, or don't matter at all.

Optimizations that are dependent on memory access are particularly perilous. Modern processors have elaborate and sometimes unpredictable methods for optimizing memory accesses: not just cache optimizations, but also fusing of reads and writes, optimizations for streaming reads and writes, single-cycle reads and writes for memory operations that look like they are stack-related, strategies for scheduling reads and writes to avoid bus-turnaround time, and probably others. Very often, the only thing that matters is the memory access stalls, with all other instructions operating in parallel in the time that it takes for the memory reads and writes to complete. (Does your processor have handling in the execution pipeline that prevents a potentially expensive branch misprediction in that tight code loop? I don't know. But GCC and Clang do!).

For a human to compete with GCC or Clang, intuition about how code executes isn't sufficient. If you are not using sophisticated profiling tools like Intel VTune, you really won't have insight into whether your hand-written assembler is stalling in the execution pipeline, and that is typically the problem that determines how well code executes. How the data must flow is invariant from input to output: in this case, the input array must be read, and a register must be set to zero or one on output. Both compilers and processor execution pipelines are capable of doing quite extraordinary things to maximize opportunities for parallel execution and pipelining. So: the ONLY way to tell whether any of that generated code is inefficient is to benchmark it. Intuition is not remotely sufficient.

As far as I can see, both compilers have done quite heroic and spectacular jobs of optimizing the code. It is not at all clear whether the compilers know something about how memory operations fuse in the instruction pipeline that you don't. The only oddity is the extra memory write to initialize the zero array, which shows up in a single case and, in fairness, occurs because you introduced a faux optimization in the original code. One of the compilers heroically (and probably correctly) optimized the bulk of the code, and tragically missed an opportunity to remove a faux optimization that YOU introduced. Even then, it's still not clear that an extra memory write is going to execute slower. (A write to L1 cache, taking one or two CPU clock cycles, followed by a bunch of reads from L1 cache: does the cache controller allow parallel reads and writes, or does it not? I don't know, but GCC and Clang do!) Obviously not a good thing, but does it ACTUALLY impact performance? I don't know. And the only way to tell is for you to actually profile the code.
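
For what it's worth, a minimal harness along these lines is enough to start (an illustrative sketch; the function under test, sizes, and iteration counts are stand-ins):

    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // The variant under test; swap in whichever implementation you want
    // to compare. This one is just a stand-in.
    static bool is_all_zero(const unsigned* a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            if (a[i] != 0) return false;
        return true;
    }

    int main() {
        std::vector<unsigned> data(1 << 20, 0u);
        volatile bool sink = false;  // keeps the result observably "used"
        auto best = std::chrono::nanoseconds::max();
        for (int run = 0; run < 100; ++run) {
            auto t0 = std::chrono::steady_clock::now();
            sink = is_all_zero(data.data(), data.size());
            auto t1 = std::chrono::steady_clock::now();
            best = std::min(best,
                std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0));
        }
        std::printf("best run: %lld ns\n", static_cast<long long>(best.count()));
        return sink ? 0 : 1;
    }

Reporting the best of many runs filters out scheduler noise; real measurements should also pin the thread to a core, as the commenter above did.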

Also worth mentioning in passing: if you are not compiling with -march=native, all your code is being optimized for some prehistoric, least-common-denominator processor, probably a 1990s-era 486 that nobody actually has anymore, with god-only-knows what inadequacies in its execution pipeline. So make sure you are.

- Credentials: professional programmer with 45 years of experience, including extensive experience optimizing and profiling high-performance graphics device drivers and audio plugin code, some of it done in the era when humans actually could speed up compiler-generated code by (typically) 2 or 3%, in an industry where 2 or 3% improvements in benchmark scores could increase profits by millions of dollars. Currently of the opinion that any optimization that produces less than a 25% performance improvement is just not worth the extra effort and risk.

13 minutes ago | rerdavies

With `u32` as the element type, rustc 1.93 (with `-O`) does the correct thing for size=1, checks both elements separately (i.e. worse than in the article) for size=2, checks all three elements separately (i.e. not being crazy like in the article) for size=3, and starts doing SIMD at size=4.

https://godbolt.org/z/5PETM5bbn

3 hours ago | gspr

This code is not equivalent to the C++ version. You can directly use `*x == [0_u32; SIZE]`. The code generated by the two is different. (But the iterator version not producing optimal code is also an issue.)

2 hours ago | usamoi

Very good point! Thanks!

With the correction, it interestingly enough produces the good behavior at size=2 as well. It also delays SIMD until size=5. But then it bizarrely stops doing SIMD again after size=64.

https://godbolt.org/z/P979nY4nf

The iterator version stays SIMD-y past size=64 too, but stops at some point. What?! I don't know enough to understand what's going on. Anyone?