
Fault Tolerant Llama training

[deleted]
2 hours ago

Hey, nice to see this here!

I'm the primary author so happy to answer any questions you might have!

16 hours ago d4l3k

Why isn't there more investment in semi-synchronous training? Is it that the convergence is iffy? Also, it would be great to refactor this code into a typed language, so it is easier to reason about and maintain.

2 hours ago bwfan123

This is awesome, can’t wait to try out these techniques. At least a week a year of my time for the past few years has gone towards recovering from a fault crashing a training run. Sometimes it's environment related, sometimes shared storage, sometimes just a slightly faulty IB cable.

10 hours ago zxexz

This is severely underrated work. Why aren't there more mid-sized companies helping with this? Ultra Ethernet just got released.

15 hours ago bjt12345

Ultra Ethernet will do almost nothing. It’s a rubber-stamped version of Broadcom’s design, and Marvell/Cisco/etc. will just add it to their ASICs. It remains to be seen whether Spectrum-X or ConnectX will. If not, none of it matters.

These chips are $30m-$100m projects a pop. After the embarrassingly brutal failure of Barefoot, nobody is going to do ASICs.

11 hours ago foobiekr

What kind of failures are you typically concerned with here?

7 hours ago anonymousDan

300 L40s? What's this, 1998?

16 hours ago timzaman

Hey Tim, how's it going?

Interested in lending PyTorch some compute? :)

torchft can handle much larger scales, but for a public multi-day demonstration run this is what we had available. The point of this blog post was to demonstrate the correctness of the quorum algorithm and recovery with a stock PyTorch stack, not so much peak FLOPS.

Stay tuned though -- planning on doing some much larger demos on B200s!

15 hours ago d4l3k