
ZML - High-performance AI inference stack

What would be the benefit of using ZML instead of relying on StableHLO/PJRT directly? The cost of porting models is surely high.

2 days ago · ismailmaj

ZML, the Zig library, is mostly a wrapper around StableHLO/PJRT. But it's a high-quality wrapper, and the tagged-tensor syntax is really helpful for writing complex ops like dot or gather.
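For readers unfamiliar with the idea, here is a rough sketch of what "tagged" axes buy you. This is not ZML's API; it uses NumPy's einsum labels as a stand-in for named tensor dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((2, 4, 8))   # (batch, seq, d_model)
w = rng.random((8, 8))      # (d_model, d_out)

# Positional style: you must remember which axis is which.
y_pos = x @ w

# "Tagged" style via einsum labels: the contraction is spelled out by name,
# so mixing up axis order becomes a visible error instead of a silent bug.
y_named = np.einsum("bsd,de->bse", x, w)

assert np.allclose(y_pos, y_named)
```

The benefit grows with op complexity: for a gather or a multi-head attention dot, named axes make the intended contraction obvious at the call site.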

And ZML, the framework, also resolves issues with the complex dependency graph of StableHLO/PJRT.

a day ago · gwenzek

Hiya! Just want to say this looks awesome :) Really interested in the sharded inference demo! You said it was experimental; is it in the examples folder at all? (On phone atm, so apologies for not investigating further.)

2 days ago · hsjdhdvsk

First of all, great job! I think inference will become more and more important.

That being said, I have a question regarding ease of use. How difficult is it for someone with a Python/C++ background to get used to Zig and (re)write a model for use with ZML?

3 days ago · onurcel

Hi, co-author here. Zig is way simpler than C++. Simple as in: in an afternoon I was able to onboard onto the language, rewrite the core of a C++ algorithm, and see speed gains (fastBPE, for reference).

Coming from Python, the hardest part is learning memory management. What helps with ZML is that the model code is mostly metaprogramming, so we can be a bit flexible there.

We have a high-level API that should feel familiar to PyTorch users (like myself), but improves on it in a few ways.

3 days ago · gwenzek

Pretty easy; usually the hardest part is figuring out what the Python code is doing.

3 days ago · steeve

Given that the focus is performance, do you have any benchmarks to compare against the likes of TensorRT-LLM?

2 days ago · Palmik

It's a bit early to compare directly to TensorRT because we don't have a full-blown equivalent.

Note that our focus is being platform-agnostic, easy to deploy/integrate, with good all-around performance and ease of tweaking. We use the same compiler as JAX, so our performance is on par. But generally we believe we can gain on overall "tok/s/$" by having shorter startup times, choosing the most efficient hardware available, and easily implementing new tricks like multi-token prediction.
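To make the "tok/s/$" metric concrete, here is a tiny sketch with made-up numbers (none of these figures come from ZML or any real hardware):

```python
def tok_per_s_per_dollar(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Cost efficiency: tokens generated per second, per dollar of hourly cost."""
    return tokens_per_second / dollars_per_hour

# Two hypothetical deployment options:
option_a = tok_per_s_per_dollar(tokens_per_second=1200, dollars_per_hour=4.0)  # 300.0
option_b = tok_per_s_per_dollar(tokens_per_second=800, dollars_per_hour=2.0)   # 400.0

# The slower-but-cheaper option wins on cost efficiency here, which is why
# being able to choose hardware freely matters for this metric.
assert option_b > option_a
```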

2 days ago · gwenzek

I second this; it would help justify the time investment in a framework if it's clear how it stacks up!

2 days ago · koe123

My dreams have come true: hardware-agnostic ML primitives in a typed, compiled language.

My only question is: is Zig stable enough to base such a project on?

2 days ago · montyanderson

The core language has been relatively stable for the past few years. What has changed the most is the `build.zig` build system (which we aren't using).

We are also looking ahead at the Zig roadmap, trying to anticipate upcoming breaking changes, and isolating our users from them.

2 days ago · gwenzek

Stable as in unchanging, no.

Stable as in reliable enough, I’d say so.