
Inlining – The Ultimate Optimisation

There was a weird period in JavaScript’s history where the threshold for inlining was rather fixed and counted comments as part of the function weight. So there was code that would go faster if you deleted the comments.

I believe it was counting AST nodes rather than bytes; otherwise it would have created problems for descriptive function names too, and that's what we would have heard about instead of comments.

8 hours ago | hinkley

I ran into a similar-ish issue in C++ (MSVC++) where a small change that improved an error message led to a 10% slowdown.

The function was something like:

  int ReturnAnswerIfCharIsValid(char* c)
  {
    if(c == nullptr)
      throw std::exception("ERROR!");

    return 42;
  }
The exception line was changed to something like:

  throw std::exception("Char is not valid, please fix it!"); // String is now longer
The performance of this hot-path function went down the drain.

I fixed it by replacing the exception call with yet another function call:

  if(c == nullptr)
    ThrowException();
Other fixes might have included something like __forceinline in the function signature.
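A minimal sketch of the cold-path extraction described above, using GCC/Clang attribute spellings (on MSVC the analogue is a plain out-of-line helper, since there is no `cold` attribute); the names and the `std::runtime_error` type are illustrative, not from the original code:

```cpp
#include <stdexcept>

// Cold, never-inlined helper: keeps the throw machinery and the long
// string literal out of the hot function's body.
__attribute__((noinline, cold, noreturn))
static void ThrowInvalidChar() {
    throw std::runtime_error("Char is not valid, please fix it!");
}

// The hot function is now just a compare, a branch, and (rarely) a
// call, so it stays under the inliner's size threshold regardless of
// how descriptive the error message gets.
inline int ReturnAnswerIfCharIsValid(const char* c) {
    if (c == nullptr)
        ThrowInvalidChar();
    return 42;
}
```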
3 hours ago | koyote

Is there a name for duplicating function calls such that different optimizations for the same function can be compiled, but they are not fully duplicated at every call site?

9 hours ago | jayd16

I think that is called specialization (https://www.linkedin.com/posts/compilers-lab_compiler-progra...).

Even if the compiler doesn’t explicitly do it, it can happen when doing subsequent optimization steps after inlining such as constant folding and dead code elimination.
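A hypothetical sketch of that implicit specialization (the function names are illustrative):

```cpp
static int slow_path(int x) { return x * x + 1; }  // placeholder expensive path

inline int f(int x, bool fast) {
    if (fast)
        return x;         // cheap path
    return slow_path(x);  // expensive path
}

// After f is inlined here, 'fast' is the constant true; constant
// folding makes the branch unconditional and dead-code elimination
// removes the slow_path call entirely -- this call site has been
// specialized without any explicit cloning pass.
int call_site(int v) { return f(v, true); }
```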

9 hours ago | Someone

Specialization is one of the reasons my call trees are just a little bit deeper than one would expect, given my loud but moderate stance on function splitting. Uncle Bob is nuts for espousing one-line functions, but the answer to Bob being a lunatic is not two-page functions. I think you can say a lot in five to six lines without overshooting meaningful names into word salad because you've run out of ideas. That's still small enough for branch prediction, inlining, and specialization to kick in per call site, particularly if some callers follow one conditional branch and the others favor the other.

8 hours ago | hinkley

I think this is what the C++ world calls template specialization?

9 hours ago | taeric

If I understand what you're asking for correctly, function cloning.

If you have f(x, y) and the compiler realizes the function optimizes nicely when y == 2 it can create a clone of f with a fixed argument y == 2, optimize that, and rewrite the appropriate call sites to call the clone.
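A hand-written sketch of what such a cloning pass effectively produces (GCC does this under -fipa-cp-clone, enabled at -O3; the functions here are illustrative):

```cpp
// General function: cost depends heavily on y.
static int f(int x, int y) {
    int r = 0;
    for (int i = 0; i < y; ++i)
        r += x * i;
    return r;
}

// Clone the compiler might generate for call sites where y == 2: with
// the trip count fixed, the loop unrolls and folds down to just x.
static int f_clone_y2(int x) {
    return x * 0 + x * 1;  // iterations i = 0 and i = 1
}

// A call site f(v, 2) is then rewritten to f_clone_y2(v); other call
// sites keep using the general f, so the body isn't duplicated at
// every call site.
```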

9 hours ago | khuey

Compilers aren't as good at doing that one unfortunately.

7 hours ago | mgaunard
8 hours ago | [deleted]

specialization - i don't know if general purpose compilers do this but ML compilers specialize the hell out of kernels (based on constants, types, tensor dimensions, etc).

EDIT: i'm le dumb - this is the whole point of JIT compilers.

6 hours ago | mathisfun123

That's the reason why polymorphism is sometimes described as slow. It's not really slow, but it prevents inlining, so it is always a function call, as opposed to sometimes no function call. It's not that polymorphism is slow; it's that the alternatives can sometimes compile down to nothing.
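A minimal C++ sketch of that difference (illustrative types, not from the thread):

```cpp
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

struct Triangle : Shape {
    int sides() const override { return 3; }
};

// Dynamic type unknown: the compiler must emit an indirect call
// through the vtable and generally cannot inline sides().
int via_base(const Shape& s) { return s.sides(); }

// Concrete type known: the call is direct and trivially inlined,
// often compiling to "return 3" with no call instruction at all.
int via_concrete(const Triangle& t) { return t.sides(); }
```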

8 hours ago | on_the_train

Every 8 virtual methods knock out at least 1 cache line for you (on x64, at least). You're probably not calling 8 adjacent methods on the same exact type either; you're probably doing something with a larger blast radius, which means sacrificing even more of the cache. And this doesn't show up in the microbenchmarks people normally write, because there the vtables are hot in the cache.

So you're really banking on this not affecting your program. Which it doesn't, if you keep it in mind and do it sparingly. But if you start making everything virtual, it will hit you harder than merely making everything noinline would.
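The arithmetic behind "every 8 methods knock out a cache line" is just pointer size versus cache-line size; this sketch assumes a 64-bit target with 64-byte cache lines (exact vtable layout is ABI-dependent):

```cpp
#include <cstddef>

// One vtable slot is one function pointer: 8 bytes on x86-64.
constexpr std::size_t kPointerSize = sizeof(void*);
constexpr std::size_t kCacheLineBytes = 64;  // typical x64 cache line

// 64 / 8 = 8 vtable slots per cache line, so a class with 8 virtual
// methods whose vtable is cold costs at least one cache-line fill.
constexpr std::size_t kSlotsPerLine = kCacheLineBytes / kPointerSize;
```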

3 hours ago | dataflow

On the other hand, if the compiler can prove at compile-time what type the object must have at run-time, it can eliminate the dynamic dispatch and effectively re-enable inlining.
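One common way to hand the compiler that proof is `final`; a small illustrative sketch:

```cpp
struct Animal {
    virtual ~Animal() = default;
    virtual int legs() const { return 4; }
};

// 'final' guarantees no further override can exist, so any call made
// through a Spider reference can be devirtualized into a direct,
// inlinable call to Spider::legs().
struct Spider final : Animal {
    int legs() const override { return 8; }
};

int count_legs(const Spider& s) { return s.legs(); }  // devirtualized
```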

8 hours ago | branko_d

Which is why runtime polymorphism in Rust is very hard to do. Its focus on zero-cost abstractions means that the natural way to write polymorphic code is compiled (and must be compiled) to static dispatch.

6 hours ago | MarsIronPI

Pedantic, but I assume you're referring to virtual methods?

Ad hoc polymorphism (C++ templates) and parametric polymorphism (Rust) can be inlined. Although those examples are slow to compile, because they must be specialized for each set of generic arguments.
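A tiny C++ illustration of that trade-off (the function is made up for the example):

```cpp
// Each instantiation of a template is a separate, fully concrete
// function, so calls to it inline like ordinary code -- but every
// distinct T costs another compile-time specialization.
template <typename T>
T twice(T x) { return x + x; }

// twice<int> and twice<double> are compiled and optimized
// independently of each other.
```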

7 hours ago | armchairhacker

C++ toolchains can also devirtualize when doing whole-program optimization, and tools like BOLT can promote indirect calls generated by any language.

3 hours ago | jeffbee

Force-inline and related attributes are critical for getting the right decision made consistently.

There's also flatten; unfortunately no equivalent with MSVC.
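A short sketch of both attributes in their GCC/Clang spellings (MSVC has __forceinline for the first but, as noted, no equivalent of flatten); the functions are illustrative:

```cpp
// always_inline forces this callee to be inlined wherever it's called.
__attribute__((always_inline)) inline int add1(int x) { return x + 1; }

// flatten goes on the *caller*: every call inside compute() is inlined
// into it, recursively, without annotating each callee.
__attribute__((flatten)) int compute(int x) {
    return add1(x) + add1(add1(x));
}
```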