The project is quite new, but we have seen good performance results with it so far.
PRs are welcome!
Their dashboard [1] shows the library only gets ~7 Gbps on what is presumably a single core (given that it is a single-connection benchmark). Is that considered fast?
The memory bus on that system probably pushes at least ~400 Gbps, and even a fairly simple memcpy() implementation could probably push ~100 Gbps even at standard MTU (i.e. small) sizes. So ~7 Gbps works out to roughly 14 full copies' worth of execution cost per byte delivered. That seems extremely high, especially given how much people complain about the number of copies, which, by this metric, would only cost ~7% per extra full copy.

[1] https://microsoft.github.io/msquic/
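For a rough sense of where a memcpy-style bound like that comes from, here is a minimal Go sketch: it times copy() over a single cache-hot 1500-byte buffer, so it overstates what a real packet path would see, and the iteration count is arbitrary.

    // Back-of-the-envelope copy bound: time copy() over MTU-sized chunks
    // and report Gbps. Cache-hot and single-buffer, so treat this as an
    // optimistic upper bound, not a realistic packet path.
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        const mtu = 1500
        const iterations = 10_000_000

        src := make([]byte, mtu)
        dst := make([]byte, mtu)

        start := time.Now()
        for i := 0; i < iterations; i++ {
            copy(dst, src)
        }
        elapsed := time.Since(start)

        bits := float64(mtu) * float64(iterations) * 8
        fmt.Printf("copy throughput: %.1f Gbps\n", bits/elapsed.Seconds()/1e9)
    }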
QUIC involves a lot more than copying packet data under a specific header, so the bounds provided by memcpy() end up saying very little. For example: QUIC mandates TLS 1.3; you have congestion control and loss recovery (not every packet you send is guaranteed to make it); the tests measure goodput, not raw network bandwidth; and on a single core, the latency of switching between and waiting on tasks may itself be limiting.
Whether it's "considered fast" is better answered by simply comparing the practical numbers, like those in that dashboard, rather than by supposition. This kind of question is really aimed at finding out "how fast could it possibly be".
> Is [~7 Gbps] considered fast?
With QUIC, yes. I don't have hands-on experience with this lib, but throughput is not the strength of QUIC. In fact, that's true for untuned UDP in general. Often the CPU will be the bottleneck unless you apply bespoke, platform-specific hacks to avoid the default, portable status quo of one syscall per ~1260-byte datagram.
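For illustration, that portable status quo looks roughly like this in Go: one write syscall per datagram. The peer address, payload size, and packet count below are placeholders.

    // One sendto/sendmsg syscall per ~1.2 KB datagram: the syscall count
    // scales 1:1 with packets sent.
    package main

    import (
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("udp", "198.51.100.10:4433") // hypothetical peer
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        payload := make([]byte, 1252) // 1252 + 8 (UDP) + 20 (IPv4) = 1280 bytes on the wire
        for i := 0; i < 100_000; i++ {
            // Each Write is its own syscall.
            if _, err := conn.Write(payload); err != nil {
                log.Fatal(err)
            }
        }
    }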
I don’t know that benchmark, but throughput varies wildly with the environment, cross traffic, and a number of other mystical parameters.
Do we know if that is true for whatever Google is using internally on their servers? Are they also only getting ~7 Gbps per core or are they going much faster either due to software or hardware?
What are the actual exemplary implementations? Based on my read of the QUIC standard, ~7 Gbps per core seems really slow for an actual quality production implementation.
> ~7 Gbps per core seems really slow for an actual quality production implementation
Not really? With a 128-core machine, that's in the ballpark of 900 Gbps; you're hitting other bottlenecks far earlier than that. And in practice, we're talking about 256 hardware threads for a dual-socket Epyc Milan server, which has been the machine-of-the-day at Google for years now. AMD server processors are so big these days that you could spend a quarter of your compute on QUIC without blinking an eye, be able to serve 450 Gbps on paper per machine, and ultimately hit bottlenecks because disk I/O isn't able to feed the NIC that fast for YouTube serving.
One of the biggest things holding QUIC performance back is a chicken-and-egg problem: vendors don't want to implement NIC offload because there aren't enough companies that want it, and companies don't want to use QUIC because it represents such a big performance drop relative to TCP's decades of tuning with the Linux kernel (in part because of the lack of NIC offload).
You are arguing: “Why should Google care about wasting 25% of their compute costs?” I do not know how much that is, but presumably it is in the billions per year. A 1% saving would be tens of millions per year, and that would only require a 4% improvement in the implementation (4% of the 25% supposedly spent on QUIC is 1% of total compute).
Having done the majority of a QUIC implementation myself, achieving (on the non-encryption portion) 10 Gbps (1.5x faster) seems trivial, 30 Gbps per core (4x faster) seems straightforward, and 100 Gbps per core (15x faster) looks possible.
I was looking for benchmarks of professional implementations to see the limits of the protocol, but all I see are rates in the single-digit Gbps, which I assumed came from toy re-implementations based on my analysis of what should be possible. But apparently these are state of the art implementations so now I am trying to figure out if anybody knows the specific reasons for the performance disparity.
> You are arguing: “Why should Google care about wasting 25% of their compute costs?”
So your estimate is that 25% of Google's compute costs are spent on terminating QUIC connections? I'd be very curious to hear how you arrived at that estimate.
Is that estimate excluding time spent on encryption and networking, as per your other posts?
> Having done the majority of a QUIC implementation myself, achieving (on the non-encryption portion) 10 Gbps (1.5x faster) seems trivial, 30 Gbps per core (4x faster) seems straightforward, and 100 Gbps per core (15x faster) looks possible.
100 Gbps goodput at typical internet MTU sizes means about 10M ingress packets per second on the receiving side. That gives you a time budget of about 100 nanoseconds per packet, i.e. a single cache miss takes up the entire budget. Just computing a hash of the 5-tuple to look up the connection in the socket table takes on the order of 10 ns.
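The arithmetic behind that budget, assuming ~1250-byte datagrams (roughly a full payload at a 1500-byte path MTU):

    // Per-packet time budget at 100 Gbps goodput.
    package main

    import "fmt"

    func main() {
        const goodputBitsPerSec = 100e9
        const packetBytes = 1250.0 // roughly a full QUIC datagram at a 1500-byte MTU

        packetsPerSec := goodputBitsPerSec / (packetBytes * 8) // ~10M packets/s
        nsPerPacket := 1e9 / packetsPerSec                     // ~100 ns

        fmt.Printf("%.1fM packets/s, %.0f ns per packet\n", packetsPerSec/1e6, nsPerPacket)
    }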
> But apparently these are state of the art implementations so now I am trying to figure out if anybody knows the specific reasons for the performance disparity.
It would be a performance disparity if you had a working implementation that was as fast as you claim, but as far as I can tell you don't have one yet?
The person I responded to stated: "AMD server processors are so big these days that you could spend a quarter of your compute on QUIC without blinking an eye". I was pointing out that that seems ridiculous, since even minor performance improvements would be valuable if you were, in fact, burning 25% of your compute. Even just restricting that to YouTube video delivery would be significant.
I am aware of the performance characteristics demanded by 100 Gbps. That is why I said possible, not "trivial". I am also talking about the protocol implementation itself, not the entire network stack. I doubt ~7 Gbps per core is hitting their UDP stack bottlenecks, so I doubt that is what is actually limiting their performance. And if they were hitting syscall or OS network stack limits, then this entire line of questioning is easily answered by saying that. But then I would wonder why their network stacks are so slow since UDP handling is even more trivial than managing the QUIC data plane, so should not constitute a bottleneck in any sanely designed full stack.
It is a performance disparity if people complain about "copies" and push for zero-copy over 1-copy transport stacks when you can do a full payload copy at ~7% overhead. Copies are basically irrelevant at that level of protocol overhead. I also happen to have enough technical ability to evaluate the problem, observe that the implementations seem to fall extremely short of what seems like it should be possible, and ask for technical clarification on what the practical limitations are.
But sure, I do not have a full implementation at the speeds I theorize should be possible. I only have a data plane implementation that I consider a toy going at ~5 Gbps with all optimizations turned off and having done zero of the planned performance work. I would frankly be shocked if I cannot get a 4-6x improvement, but that is still only speculative, so maybe it will happen.
> The person I responded to stated: "AMD server processors are so big these days that you could spend a quarter of your compute on QUIC without blinking an eye".
Ah, gotcha. Sorry about missing that context. I agree that nobody would be wasting that kind of compute on protocol overhead.
> I doubt ~7 Gbps per core is hitting their UDP stack bottlenecks, so I doubt that is what is actually limiting their performance.
Everything will limit the performance, and those limits will add up. Amdahl's Law is a harsh mistress.
> But then I would wonder why their network stacks are so slow since UDP handling is even more trivial than managing the QUIC data plane, so should not constitute a bottleneck in any sanely designed full stack.
A single cheap system call will likely cost you around 200 ns. A typical server would need like three system calls per packet (a poll, a read, and a write to ack it).
You can get much lower overhead with various kinds of kernel-bypass networking, but then the deployment story gets a lot harder.
> I only have a data plane implementation that I consider a toy going at ~5 Gbps with all optimizations turned off and having done zero of the planned performance work.
Right, but it sounds like both the current results and the projected ones are with neither encryption nor network I/O? I'm pretty sure that nobody else is publishing benchmark results from that kind of setup. They'd be sending the traffic over a network (at least looping back over a network card) using standard operating system functionality, as well as doing encryption. And still doing it at 7 Gbps.
Three system calls per packet is not a serious design.
First of all, QUIC supports ack ranges which allow ack coalescing, so you do not need to average one ack per packet.
Second of all, QUIC supports frame packing, so you can piggyback on the opposing flow, though in these one-way benchmarks that should not matter.
Third of all, even one syscall per packet is not a serious design. You should be doing batched packet reads and writes to amortize that overhead, and such APIs (recvmmsg/sendmmsg on Linux) have existed for over a decade; see the sketch at the end of this comment.
And again, these other aspects should have zero meaningful performance impact relative to ~7 Gbps unless they are, themselves, outrageously slow. But then these would not be QUIC benchmarks; they would be “my outrageously slow encryption and UDP stack” benchmarks. And it still comes back to my overarching point, which is that ~7 Gbps per core for your end-to-end networking seems awfully slow. If QUIC is not your bottleneck, then why not benchmark non-bottlenecked systems? If QUIC is your bottleneck, then that seems really slow.
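Here is the sketch referenced above: batched UDP reads from Go via golang.org/x/net/ipv4, which maps to recvmmsg(2) on Linux. This only illustrates the API shape (msquic has its own C datapath); the port and batch size are placeholders, and WriteBatch does the same thing on the send side via sendmmsg.

    // Drain up to len(msgs) datagrams per syscall instead of one.
    package main

    import (
        "log"
        "net"

        "golang.org/x/net/ipv4"
    )

    func main() {
        conn, err := net.ListenUDP("udp4", &net.UDPAddr{Port: 4433})
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        pc := ipv4.NewPacketConn(conn)

        const batchSize = 64
        msgs := make([]ipv4.Message, batchSize)
        for i := range msgs {
            msgs[i].Buffers = [][]byte{make([]byte, 1500)}
        }

        for {
            // One recvmmsg call can return up to batchSize packets,
            // amortizing the syscall cost across the whole batch.
            n, err := pc.ReadBatch(msgs, 0)
            if err != nil {
                log.Fatal(err)
            }
            for i := 0; i < n; i++ {
                packet := msgs[i].Buffers[0][:msgs[i].N]
                _ = packet // hand off to the QUIC datagram handler here
            }
        }
    }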
GGGGP here. I just don’t think throughput is a current design goal. To me, it’s critical. But to Google and many others, it was more about latency and connectivity when moving across networks. If you have a search or LLM application, bandwidth isn’t gonna be a constraint.
I am not saying that we are wasting 25% of our compute costs on QUIC, but rather that we hit many other bottlenecks far before we hit that point.
> Having done the majority of a QUIC implementation myself, achieving (on the non-encryption portion) 10 Gbps (1.5x faster) seems trivial
Encryption is a huge part of the cost, especially with the general lack of NIC offload, as I mentioned. Lossy environments exacerbate the problem: because the packet number changes across retransmissions, you need to re-encrypt retransmitted packets (a problem TCP does not have). If you're not doing any sender-side pacing, you're also likely to overwhelm your NIC with microbursts.
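A sketch of why that re-encryption is forced: per RFC 9001, the AEAD nonce is the per-key IV XORed with the packet number, and retransmitted data gets a new packet number. The key, IV, header byte, and packet numbers below are placeholders, and real QUIC additionally applies header protection on top.

    // Same payload, different packet number => different nonce => the
    // ciphertext must be recomputed for the retransmission.
    package main

    import (
        "bytes"
        "crypto/aes"
        "crypto/cipher"
        "encoding/binary"
        "fmt"
    )

    // buildNonce left-pads the packet number to the IV length and XORs it in,
    // as described in RFC 9001, Section 5.3.
    func buildNonce(iv []byte, packetNumber uint64) []byte {
        nonce := make([]byte, len(iv))
        copy(nonce, iv)
        var pn [8]byte
        binary.BigEndian.PutUint64(pn[:], packetNumber)
        for i := 0; i < 8; i++ {
            nonce[len(nonce)-8+i] ^= pn[i]
        }
        return nonce
    }

    func main() {
        key := make([]byte, 16) // placeholders; real key/IV come from the TLS key schedule
        iv := make([]byte, 12)

        block, _ := aes.NewCipher(key) // errors ignored for brevity
        aead, _ := cipher.NewGCM(block)

        payload := []byte("same stream data, sent twice")
        header := []byte{0x40} // stand-in for the associated data (the packet header)

        ct1 := aead.Seal(nil, buildNonce(iv, 7), payload, header)
        ct2 := aead.Seal(nil, buildNonce(iv, 8), payload, header) // retransmission, new packet number

        fmt.Println(bytes.Equal(ct1, ct2)) // false
    }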
100 Gbps per core is quite frankly unrealistic; the only way you get that even in TCP is if you have your smart NIC talking directly to your SSDs and bypassing main memory and the kernel TCP stack entirely, and all of your encryption is done by the NIC. But as mentioned, we don't have that level of NIC support.
Even 30 Gbps per core doesn't leave you much wiggle room if you have to perform AES via your CPU.
Why would you assume QUIC is equivalent to copying memory?
Most of the benchmarks comparing msquic with other libraries are showing it on top.
That's the reason we decided to go ahead and see how it performs within our Go codebase.
On our setup we are seeing a 50% latency reduction compared with other implementations. Definitely worth a try if you are looking for performance.
Are these benchmarks published anywhere?
> what is presumably a single core
I would guess that it's not a single core benchmark and that's the speed of the overall multi-threaded system.
> Is that considered fast?
You can squeeze out around 5 GB/s per core with the current fastest standard TLS 1.3 cipher (AES-128-GCM). 10+ GB/s is possible with the AEGIS variants, which are somewhat popular as an extension in TLS libraries.
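If you want to sanity-check that figure on a given core, here is a minimal Go sketch using crypto/aes plus GCM, sealing ~1350-byte records. The record size and iteration count are arbitrary, and the fixed nonce is fine for timing only; it must never be reused in real traffic.

    // Single-core AES-128-GCM seal throughput over packet-sized records.
    package main

    import (
        "crypto/aes"
        "crypto/cipher"
        "fmt"
        "time"
    )

    func main() {
        key := make([]byte, 16)
        block, _ := aes.NewCipher(key) // errors ignored for brevity
        aead, _ := cipher.NewGCM(block)

        nonce := make([]byte, aead.NonceSize())
        plaintext := make([]byte, 1350) // roughly one QUIC packet's payload
        out := make([]byte, 0, len(plaintext)+aead.Overhead())

        const iterations = 2_000_000
        start := time.Now()
        for i := 0; i < iterations; i++ {
            out = aead.Seal(out[:0], nonce, plaintext, nil)
        }
        elapsed := time.Since(start)

        gb := float64(len(plaintext)) * float64(iterations) / 1e9
        fmt.Printf("AES-128-GCM seal: %.2f GB/s on this core\n", gb/elapsed.Seconds())
    }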
5 GB/s per core would still be 40 Gbps per core, so only ~15% of their time would be spent in encryption. They spend 5x longer doing the non-encryption stuff.
Also, it would be silly to bottleneck your protocol implementation benchmark on encryption that would be shared amongst implementations because that does not highlight your overhead advantages. In addition, the benchmarking RFC explicitly allows for the null encryption case in benchmarking for exactly that reason.
> Also, it would be silly to bottleneck your protocol implementation benchmark on encryption that would be shared amongst implementations because that does not highlight your overhead advantages
It would be great if benchmarks with no encryption were a thing.
There are massive overheads, and I explicitly avoided saying whether it's "fast" or not, because to a lot of people serving 1000 req/s seems "fast", and TLS is basically the main algorithmic complexity you'd expect from a data-transfer protocol.
Network throughput and memory throughput are quite different.
I assume that their network protocol benchmark, which is meant to show how fast their protocol implementation is relative to others, is not bottlenecked on network bandwidth, because that would be positively idiotic.
Given that the machine the benchmarks run on has a 50 Gbps NIC, that would be doubly stupid, since the bottleneck would then have to be either the producer side not generating enough data to show how fast the implementation is, or a bad network configuration doing the same.
And all of that assumes they are not bypassing the NIC entirely; since they are benchmarking the protocol implementation, the source of the packets is largely irrelevant, except for making sure the packets are no longer in cache after the producer synthesizes them.
Beyond that, we look to memory bandwidth as a fundamental limit on shuffling data from packets to the protocol client, to bound the theoretical maximum throughput and see how far off a protocol implementation is from that maximum.
> CGO_ENABLED=1
It's not Go then
To be fair, the first line in the README.md says
go-msquic is a Go wrapper for the Microsoft's QUIC library
So it's only the HN headline that is (technically) wrong.
And tbf it says for Go, not in Go
Never claimed the library to be a full Go implementation; it is a library for Go. If you want a pure Go implementation, it already exists: quic-go.
The README mentions that and advises using it. However, if you need more performance, you might want to give go-msquic a try.
msquic is one of the most performant QUIC protocol libraries out there. go-msquic is a wrapper around msquic so you can use it inside your Go project.
https://github.com/noboruma/go-msquic