
vLLM large-scale serving: DeepSeek at 2.2k tok/s per H200 with wide-EP

As a heavy user of coding tokens, I'm most interested in latency; these numbers are presumably for heavily batched workloads. I dearly wish Claude had a Cerebras endpoint.

I'm sure I'd use more tokens because I'd get more revisions in, but I don't think token usage would increase linearly with speed: I need time to think about what I want to do and about what's happened or is being proposed. But I feel like I could stay in a flow state if the responses were faster, and that's super appealing.

vessenes · an hour ago

Impressive performance work. It's interesting that we're still seeing 40+% perf gains like this.

It makes you think the cost of a fixed level of "intelligence" will keep dropping.

kingstnap · 2 hours ago

Absolutely. LLM inference is still greenfield territory: things like overlap scheduling and JIT-compiled CUDA kernels are very recent. We're just getting started optimizing for modern LLM architectures, so cost/perf will keep improving fast.
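For anyone unfamiliar with overlap scheduling, here's a minimal sketch of the idea (the model, batch shapes, and helper names are hypothetical stand-ins, not vLLM's actual scheduler): launch the GPU work for the current decode step asynchronously, then do the CPU-side scheduling for the next step while that kernel runs, so scheduling cost hides behind kernel time.

```python
# Minimal sketch of overlap scheduling; all names here are illustrative.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
hidden = 4096
# Stand-in "model": one big matmul per decode step.
weight = torch.randn(hidden, hidden, device=device)

def prepare_next_batch(step: int, batch_size: int = 32) -> torch.Tensor:
    # CPU-side scheduler work: decide which requests run and build their inputs.
    return torch.randn(batch_size, hidden)

def launch_step(cpu_inputs: torch.Tensor) -> torch.Tensor:
    # Enqueue the GPU work; on CUDA the kernel launch returns before it finishes.
    return cpu_inputs.to(device, non_blocking=True) @ weight

batch = prepare_next_batch(0)
for step in range(1, 8):
    out = launch_step(batch)          # GPU is busy with step N
    batch = prepare_next_batch(step)  # CPU builds step N+1 in the meantime
    if device == "cuda":
        torch.cuda.synchronize()      # block only when step N's output is needed
    print(f"step {step}: output shape {tuple(out.shape)}")
```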

whoevercares · an hour ago

Love vLLM!

danielhanchen · 2 hours ago

Now all we need is better support for AMD GPUs, both CDNA and RDNA.

androiddrew · an hour ago