
The Canva outage: another tale of saturation and resilience

We had a similar CDN problem when releasing major Warframe updates: our CDN partner would inadvertently DDoS our origin servers when we launched an update, because thousands of cold edges would call home simultaneously when all players relogged at the same time.

One CDN vendor didn't even offer a tiered distribution system, so every edge called home at the same time. Another vendor did have a tiered distribution system designed to avoid this problem, but it was overwhelmed by the absurd number of files we'd serve multiplied by the large user count, so we'd still end up with too much traffic hitting the origin.

The interesting thing was that no vendor we evaluated offered a robust preheating solution -- if they offered one at all. One vendor even went so far as to say that they wouldn't allow it because it would let customers unfairly dominate the shared storage cache at the edge (which felt a bit like airlines overbooking seats on a flight to me).

These days we run an army of VMs that fetch all assets from every point of presence we can cover right before launching an update.

Another thing we've had to deal with, mentioned in the article, is overloading back-end nodes. Our solution is somewhat ham-fisted but works quite well for us: we cap the connection count to the back end and return 503s when we saturate. The trick, however, is getting your load balancer to leave the client connection open when this happens -- by default, several LBs we've used would slam the connection closed, so that when you're serving up 50K 503s a second the firewall would buckle under the runaway connection pool lingering in TIME_WAIT. Good times.
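
For illustration only, here's a minimal sketch of that kind of connection-capped load shedding as a small Node/TypeScript reverse proxy; the backend host, cap, and Retry-After value are placeholders, not details from the setup described above:

```ts
import http from "node:http";

const MAX_IN_FLIGHT = 1024; // assumed cap on concurrent requests to the backend
let inFlight = 0;

const server = http.createServer((req, res) => {
  if (inFlight >= MAX_IN_FLIGHT) {
    // Saturated: shed load with a 503. We deliberately do NOT send
    // "Connection: close", so the client's keep-alive connection stays open
    // and sockets don't pile up in TIME_WAIT.
    res.writeHead(503, { "Retry-After": "2" });
    res.end("backend saturated");
    return;
  }

  inFlight++;
  // Proxy the request to the capped backend (hypothetical host/port).
  const upstream = http.request(
    { host: "backend.internal", port: 8080, path: req.url, method: req.method, headers: req.headers },
    (upstreamRes) => {
      res.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
      upstreamRes.pipe(res);
    }
  );
  upstream.on("error", () => {
    if (!res.headersSent) res.writeHead(502);
    res.end();
  });
  upstream.on("close", () => {
    inFlight--; // runs whether the proxied request succeeded or failed
  });
  req.pipe(upstream);
});

server.listen(3000);
```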

5 days ago | shaggie76

Something I've been wondering for a while is whether BitTorrent or other P2P protocols are ever a consideration for pushing game updates? Naively, it seems like an ideal fit, since a large swarm of leechers quickly turns into a large swarm of (partial) seeders mostly chattering amongst themselves. I recall Facebook and Twitter used to internally torrent their updates in the 2010s, and BitTorrent scales just fine to thousands of peers and tens-of-GB files at least. But I think I've only ever played one game whose updater was a torrent client, so I'm guessing it's a nonstarter for one reason or another. Are game publishers just allergic to it due to the piracy association? Are end-user upload speeds too slow to meaningfully make a difference? Are swarms of ~100k just too large to manage?

Edit: Silly me for posting while sleep deprived. It's not the update itself that you're saying is causing thundering herd issues, but the log-ins all being synced up afterwards much like in TFA, duh. My curiosity wrt the apparent lack of P2P game updaters still stands though.

5 days ago | snackbroken

See my related comment. It was a popular idea around 2005-10. As mentioned, Red Swoosh was primarily sold as a “p2p” CDN, was bought up by Akamai for a billionty dollars, and promptly disappeared. AWS S3 also implemented a torrent interface early on. AFAIK they keep it alive in name at least, but it's effectively dead code with $0 revenue as far back as I've ever known. A handful of private companies built p2p themselves, but eventually moved off it. As an example, p2p is where Spotify started in this time range, and they then moved to a CDN (us) for better consistency and to avoid having to deal with it themselves.

The primary business problem is one of visibility and control. The customer UX would be entirely out of your control, and exceedingly variable, based on factors you (the provider) can't even see. At the same time, CDNs were pushing down to cents per GB delivered by 2010, and ~1¢/GB by 2015. At a penny per GB -- with higher throughput, better visibility, and control -- CDN distribution costs started to not matter compared to other costs and priorities.

Oh! Porn delivery companies, they're an interesting content distribution case. AFAIK commercial CDNs are still way too expensive to meet their business model needs. My recollection is that they all built their own in-house CDNs, like the GP's “run a bunch of VMs” approach, or used a peer's. This was accelerated as all of those companies consolidated a la MindGeek in the 2010s.

5 days ago | donavanm

One reason for Spotify's move away from p2p was that it was absolutely a no-go on mobile platforms, which were rapidly becoming dominant at the time.

4 days ago | dikei

Microsoft Store and Xbox games/updates are distributed with a proprietary P2P protocol, which also includes ISP appliances. afaik it's the largest P2P network in the world. https://learn.microsoft.com/en-us/windows/deployment/do/mcc-...

Steam recently introduced LAN-based P2P to complement their significant appliance/CDN infrastructure, but I don't know if anyone has pulled it apart yet, and I don't think it does tunnelling like the MSFT network.

4 days ago | pl4nty

Blizzard used to have p2p support, they removed it around 2015. It’s not hard to think of a bunch of problematic cases which become absolute hell to diagnose because they’re client side.

5 days ago | masklinn

Their downloaders for classic games still have the option to enable peer-to-peer. It failed to initialise for me, though; I'm not sure if that's because their tracker is down or because it demands UPnP. I recently did this with Diablo 2 and its expansion.

5 days ago | AndrewDavis

Windows Update has the option to download signed updates from Microsoft and from any other computer that has downloaded them. It says that 38% (247MB) of all Windows Update bytes have been downloaded from "PCs on the Internet", and that I have uploaded 340MB to "PCs on the Internet".

5 days ago | UltraSane

Around 2010, we (Zynga at the time) used BitTorrent to distribute the MafiaWars code/assets to all servers in a couple of data centers. It worked without much challenge.

5 days ago | tupshin

As someone who worked on a major CDN I have some perspective.

> thousands of cold edges would call home simultaneously when all players relogged at the same time.

Our more mature customers (esp. console gaming) would enable early background downloads, spaced out over a few hours, the day/hours before 'launch'. Otherwise ad hoc/JIT is definitely best effort, though we did a few things to help:

Conceptually each CDN POP is ~3 logical layers: 1) a client-request-terminating 'hot' cache distributed across all nodes in the POP, 2) a shared POP cache segmented by content/resource ID, and 3) a shared origin-request-facing egress layer. Every layer would attempt to perform request coalescing, with 90% efficacy or more. E.g., 10 client requests to the same layer 1 node _should_ only generate a single request to the segmented layer 2 cache. The same layer 2 node would be serving multiple requests to the layer 1 nodes while making a single request back towards the origin.
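
As a rough illustration of request coalescing (not the CDN's actual implementation), concurrent misses for the same object can share a single in-flight upstream fetch; the key and function names below are invented:

```ts
// Map of object key -> the single in-flight upstream fetch for that object.
const inFlight = new Map<string, Promise<ArrayBuffer>>();

async function coalescedFetch(objectKey: string, upstreamUrl: string): Promise<ArrayBuffer> {
  const existing = inFlight.get(objectKey);
  if (existing) {
    // Another request for this object is already on its way upstream;
    // piggyback on it instead of issuing a duplicate.
    return existing;
  }

  const pending = fetch(upstreamUrl)
    .then((res) => {
      if (!res.ok) throw new Error(`upstream returned ${res.status}`);
      return res.arrayBuffer();
    })
    .finally(() => inFlight.delete(objectKey)); // allow a fresh fetch once this one settles

  inFlight.set(objectKey, pending);
  return pending;
}
```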

Some exceptional behavior occurred around, or was driven by, 'load' and trying to account for 1) head-of-line blocking and 2) tail latencies from unequal load distribution. Based on the load for an object, or a node's current total load, we used forward signaling to distribute requests to peers. That is, a 'busy' layer 2 node would signal to the layer 1 nodes to use additional/alternate peers. This increased the number of copies of a popular object in the segmented cache, increasing the total throughput available to populate the 'hot' L1 cache nodes _or_ to serve objects that were not consistently popular enough to stay in that distributed L1 cache. And, relevant to your example, we had similar problems when going back to the origin. In the first case we want to minimize the number of new TCP/TLS connections, which have a large RTT setup penalty, by reusing active & idle 'layer 3' connections to the origin. This, however, introduces hotspots and head-of-line blocking on those active origin connections, which, again, based on 'load' would be forward signaled so that additional layer 3 nodes/processes would be used to fetch _additional_ origin content.
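
A hedged sketch of that kind of load-based spreading: requests for an object normally land on one layer 2 peer, but the set of eligible peers grows as the object gets hot, so more cached copies appear. Rendezvous hashing and the scaling rule here are my own illustrative choices, not necessarily what the CDN actually used:

```ts
import { createHash } from "node:crypto";

// Deterministic per-(object, peer) weight for rendezvous (highest-random-weight) hashing.
function weight(objectKey: string, peer: string): number {
  const digest = createHash("sha1").update(`${objectKey}|${peer}`).digest();
  return digest.readUInt32BE(0);
}

// Pick the top-k peers for an object; k grows with the object's observed request rate,
// so a hot object ends up cached on more layer 2 nodes.
function peersForObject(objectKey: string, peers: string[], requestsPerSec: number): string[] {
  const k = Math.min(peers.length, 1 + Math.floor(requestsPerSec / 1000)); // assumed scaling rule
  return [...peers]
    .sort((a, b) => weight(objectKey, b) - weight(objectKey, a))
    .slice(0, k);
}

// Example: a cold object maps to one peer, a hot one to several.
const pop = ["l2-a", "l2-b", "l2-c", "l2-d"];
console.log(peersForObject("asset/v2/bundle.bin", pop, 50));   // -> 1 peer
console.log(peersForObject("asset/v2/bundle.bin", pop, 3500)); // -> 4 peers
```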

Normally this all means 1 origin request can serve a few orders of magnitude more concurrent client requests. For very large content, or exceedingly large client numbers, you'd see the CDN 'scale out' on concurrency in an effort to minimize blocking and maximize throughput in the system.

> One CDN vendor didn't even offer a tiered distribution system, so every edge called home at the same time. Another vendor did have a tiered distribution system designed to avoid this problem

See above on request coalescing. In the vast, vast majority of cases it was effective in reducing the problem by a few orders of magnitude; AFAIK every CDN does/did that. _In addition_ we did have a distributed hierarchical system for caching between edge POPs and origins, _but_ it was non-public/invite-only/managed by us for a long time. The reason being that the _vast_ majority of customers would incur additional latency (& cost to us) without meaningful benefit from this intermediate cache layer.

> The interesting thing was that no vendor we evaluated offered a robust preheating solution if they offered one at all.

This is interesting to me. AFAIK Akamai NetStorage was sold to solve the origin distribution angle, _and_ drove something like 50% of the revenue from large-object distribution customers. For us the customer use case of 'prefetch' was a perennial 'top 5' but never one that would drive revenue, and it conflicted with other system tenets.

> One vendor even went so far as to say that they wouldn't allow it because it would let customers unfairly dominate the shared storage cache at the edge

That could have been us. And yes, a huge problem is that you're fundamentally asking for control over a shared resource so that you can bias performance towards _your content_ at the expense of _all other customers_. Even without intentional 'prefetch' control we still had some customers with pseudo-degenerate access patterns that might consume 25-50% of the shared cache space in a POP. We did build shared quotas and such, but (when I was there) we couldn't see a way to align the pricing & incentives to confidently expose that to customers. It also felt very, very bad to tell a customer 'pay us $$$ to care about your bits' when that was our entire job, and what we were doing to the best extent possible already.

> we cap the connection counts to the back end and return 503s when we saturate.

Depending on the CDN you may be able to use `max-age` or `s-maxage` to implement pseudo-backoff from the CDN. For us at least, those 'negative hits' would be cached with a short (seconds by default) TTL to act as a dampener in failure scenarios. Ensure that your client can handle/recover from the 503 as well; I'd expect the CDN to return those all the way through in the response.
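
Sketching what that could look like from the origin side (the `overloaded()` check and header values are hypothetical, and whether the CDN honors `s-maxage` on a 503 depends on its negative-caching configuration):

```ts
import http from "node:http";

// Hypothetical saturation check; stands in for the connection cap described above.
function overloaded(): boolean {
  return false;
}

http.createServer((req, res) => {
  if (overloaded()) {
    res.writeHead(503, {
      "Cache-Control": "public, s-maxage=5", // let the CDN cache the negative hit briefly
      "Retry-After": "5",                    // hint well-behaved clients to back off
    });
    res.end("try again shortly");
    return;
  }
  res.writeHead(200);
  res.end("ok");
}).listen(8080);
```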

5 days ago | donavanm

> Otherwise adhoc/jit is definitely best effort, though we did a few things to help

I should also give a sense of scale here. Hundreds of GB/s to multiple TB/s of throughput for a single customer was pretty normal a decade ago. CDNs, classically, are also biased towards latency & throughput. Once you have millions of client requests per second and are pushing that kind of volume, it's kind of expected/implied that the origin is capable of meeting the demand necessary to maximize that throughput.

While cost-efficiency-maximizing CDNs _were_ a thing, they kind of died out with Red Swoosh AFAIK. We repeatedly investigated 'follow the moon' use cases to take advantage of the diurnal cycle. Outside of a handful of game companies there wasn't any real interest, and the price/revenue wasn't worth investing in compared to other priorities. The market wanted better performance, not minimal costs, in the 2000-10s.

5 days ago | donavanm

I have always found it remarkable how well Warframe handles updates. I've seen other games do the "Update live now, everyone restart!" thing and then no one can get in due to the thundering herd.

But you close Warframe after the red text, the game updates pretty fast, even if it's a massive update like 1999 was, and then you are back in the game (unless you say yes to "Optimising download cache", which takes an absolute age for some reason, plsfix). Definitely a pretty amazing engineering achievement.

5 days ago | gsck

I remember I liked the Fastly API because they seemed to offer preheating, but this was a long time ago, and perhaps not sufficient for your needs.

5 days ago | robertlagrant

Really one of those "has anyone that built this tried using it for its intended purpose?" things. Not having a carefully considered cache warming solution* is like someone building a CDN based on a description someone gave them, instead of actually understanding the problem a CDN sets out to solve.

* EDIT: actually, any solution that at least attempts to mitigate a thundering herd. I am at least somewhat empathetic to the “indiscriminately allowing pre-warming destroys the shared cache” viewpoint. But there are still plenty of things that can be done!

5 days ago | bolognafairy

The easiest solution to the pre-warming problem is to charge quite a bit for it. Then only those who really need it will pay (or you'll collect more money to build out the cache).

5 days ago | bombcar

This problem is similar to what electric utilities call "load takeup". After a power outage, when power is turned back on, there are many loads that draw more power at startup.

The shortest term effects are power supplies recharging their capacitors and incandescent bulbs warming up. That's over within a second.

Then it's the motors, which draw 2x-3x their running load when starting as they bring their rotating mass up to speed. That extra load lasts for tens of seconds.

If power has been off for more than a few minutes, everything in heating and cooling which normally cycles on and off will want to start. That high load lasts for minutes.

Bringing up a power grid is thus done by sections, not all at once.

5 days ago | Animats

If you're subject to peak load billing it's also a good idea to bring your loads online in sections, too. My family owns a small grocery store. I was taught the process for "booting-up" the store after a power outage. It basically amounted to a one-by-one startup of the refrigeration compressors, waiting between each for them to come up to operating pressure and stabilize their current demand.

5 days ago | EvanAnderson

An insightful share. You might be interested to know that startup current is called 'inrush current'. For a Direct On Line (DOL) start (no soft starters or variable speed drives), electrical engineers usually model it as 6x normal full-load current.

Other electrical devices such as transformers and long overhead power lines also exhibit inrush when they are energised.

5 days ago | ElusiveA

I live in a somewhat rural area and we got bit hard by this last winter.

Our road used to have a handful of houses on it but now has around 85 (a mix of smaller lots around an acre and larger farming parcels). Power infrastructure to our street hasn't been updated recently and it just barely keeps up.

We had a few days that didn't get above freezing (very unusual here). Power was out for about 6 hours after a limb fell on a line. The power company was actually pretty quick to fix it, but the power went out 3 more times in pretty quick succession.

Apparently a breaker kept blowing as every house regained power and all the various compressors surged on. The solution at the time was for them to jam in a larger breaker. I hope they came back pretty quickly to undo that "fix" but we still haven't had any infrastructure updates to increase capacity.

5 days ago | _heimdall

"The solution at the time was for them to jam in a larger breaker"

I've seen some cowboy sh!t in my time but jeez, that's rough.

5 days ago | alvah

That’s “it can’t keep tripping if I jam in a penny instead” level of engineering from the utility! Wow!

5 days ago | cr125rider

Good thing none of your houses burnt down.

5 days ago | cudgy

It'd likely have been the equipment in the street. That said, in winter you can overload this a bit. After all, the failure mode would be the wires getting so hot they begin to melt. If you know they're covered in ice, or are currently being rained on in near-freezing air temperatures, you can push more current through them than they could handle at 2pm on a hot summer's day.

4 days ago | corint

The whole incident report is interesting, but I feel like the most important part of the solution is buried here [0]:

* "We're adding timeouts to prevent user requests from waiting excessively long to retrieve assets."

When you get to the size of Canva, you can't forget your AbortController and exponential backoff on your Fetch API calls.

--

0: https://www.canva.dev/blog/engineering/canva-incident-report...
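
For concreteness, here's a minimal sketch of that pattern (a hard per-request timeout via AbortSignal.timeout plus exponential backoff with full jitter); the attempt counts and delays are arbitrary placeholders, not anything from Canva's report:

```ts
async function fetchAssetWithBackoff(url: string, maxAttempts = 5): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      // Hard per-request timeout so a stalled asset download can't hang forever.
      const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
      if (res.ok) return res;
      if (res.status < 500) return res; // don't retry client errors
    } catch {
      // Timed out or network error; fall through to the backoff below.
    }
    // Exponential backoff with full jitter: ~1s, 2s, 4s, ... capped at 30s.
    const base = Math.min(30_000, 1_000 * 2 ** attempt);
    await new Promise((resolve) => setTimeout(resolve, Math.random() * base));
  }
  throw new Error(`failed to fetch ${url} after ${maxAttempts} attempts`);
}
```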

5 days ago | emmanueloga_

Why didn't they do the first obvious thing - roll back the deployment to the previous checkpoint? Those older files were readily available on the edge nodes, so the problem would be solved.

3 hours ago | SergeAx

The incident report said "the growth of off-heap memory" was a cause of the OOM.

Why would too much traffic have caused that specifically to increase? The overhead of a connection in the kernel isn't that high.

To reduce pressure in the future, they could smear the downloading of new assets over time by background fetching. E.g., when the canary of a new Canva release starts, clients on the existing version could probabilistically download the new assets in the background, so that when they switch, there's nothing new to download.
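
A rough sketch of that smearing idea (the rollout fraction, smear window, and function name are invented for illustration):

```ts
// When a new release enters canary, each client on the old version decides
// probabilistically whether to warm the new assets, and if so spreads the
// fetch randomly across a window instead of doing it at switch-over time.
function maybePrefetchNewAssets(
  assetUrls: string[],
  rolloutFraction = 0.1,     // assumed fraction of clients that prefetch per wave
  windowMs = 60 * 60 * 1000  // assumed 1-hour smear window
): void {
  if (Math.random() > rolloutFraction) return; // most clients skip this wave

  const delay = Math.random() * windowMs;      // random point within the window
  setTimeout(() => {
    for (const url of assetUrls) {
      // Fire-and-forget warm-up; the response only needs to land in the cache.
      fetch(url).catch(() => {});
    }
  }, delay);
}
```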

Collapse forwarding and stale-while-revalidate are powerful features for CDNs, but they come with non-intuitive failure modes that you have to be aware of. Anything that synchronizes huge numbers of requests is dangerous to stability.

5 days ago | ec109685

I see a few blind spots in the write-up.

1. Traffic for the new version was ramped up too quickly. I usually lobby for releasing updates slowly. This alone would have prevented the issue.

2. Tasks cannot be allowed to fail under load. Load shedding should be in place exactly for this reason: you don't bite off more than you can chew, and if more arrives you slowly and politely refuse the request. You need to be both slow and polite, so that clients retry gradually and you don't run into the herding issue.

3. The monitoring issue should (most likely) have shown up as an increase in latency. That should have been enough to stop the deployment from completing and to roll back carefully.

I am sure the engineers at Canva had their reasons, and that the write-up does not account for everything. Just some food for thought for other engineers.

4 days ago | siscia

This is about penny pinching. If you have created a system that cannot autoscale fast enough, then the triggers for when it does scale up should be much lower.

I also think that enormous amounts of headache can be saved by spinning up beefy instances and scaling up before scaling out.

When a nice big beefy instance gets over 50% on whatever metric is used to spin up a new one, make the new one an even beefier version.

Scaling "just in time", persumably to lower costs, is much more of a gamble and a lot more complicated.

5 days ago | ThinkBeat

Many outages can be summarized simply as "Too many clients attempting to perform an action at the same time." This is a common situation after a sudden crash or reboot... After recovery, sometimes clients try to reconnect to the servers so quickly that it crashes the servers again and the cycle repeats... Particularly problematic with WebSockets and other stateful connections; hence we use mechanisms like exponential backoff with randomization to spread out the load over time.
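
As a generic sketch of that kind of jittered reconnection (the URL, caps, and delays are placeholders):

```ts
// Jittered reconnection for a stateful connection, so that a mass disconnect
// doesn't turn into a synchronized reconnect storm against the servers.
function connectWithBackoff(url: string): void {
  let attempt = 0;

  const open = () => {
    const ws = new WebSocket(url);
    ws.onopen = () => { attempt = 0; };   // healthy again; reset the backoff
    ws.onclose = () => {
      // Exponential backoff with full jitter, capped at 60s.
      const base = Math.min(60_000, 1_000 * 2 ** attempt++);
      setTimeout(open, Math.random() * base);
    };
  };

  open();
}
```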

5 days ago | jongjong

As the OG post states, CF uses "Concurrent Streaming Acceleration" to batch those "270,000+" requests into one to the origin.

Now, let's grant that the public Internet is not CF's private backbone … but TFA makes it out to be more akin to a mobile connection in a tunnel than the Internet? Like transferring across the planet isn't going to be amazing … but that fails to explain how a download couldn't complete at all over multiple minutes…?

5 days ago | deathanatos

The term of art is normally “request coalescing” or “collapse forwarding”; I believe the latter came from the 90s/00s via squid or ocean.

Yes, multiple minutes to complete is very believable. Cloudflare reported 60% packet loss over a ~100ms distance. That's going to kill window sizes and goodput. I wouldn't be surprised if this pathological case also exposed problems in their concurrent streaming window access across so many clients as well.

5 days ago | donavanm

> Yes, multiple minutes to complete is very believable. Cloudflare reported 60% packet loss over a ~100ms distance. That's going to kill window sizes and goodput.

You're begging the question: that 60% packet loss is exactly what I'm questioning. That's not normal for public Internet connectivity, so we need something beyond "oops, we routed the request over the public Internet" in order to fully explain the outage.

Sure, given 60% packet loss, "multiple minutes to complete is very believable" and "that's going to kill window sizes and goodput"; I agree with those points. But it's the premise — that packet loss on the external link was also absurd — that needs more explaining?

(… this is where I wish Canva had linked that quote to its source. AFAICT, Cloudflare never published that, so IDK if that's private correspondence, or what.)

4 days ago | deathanatos

Fuck Canva. I remember visiting it from Georgia and being greeted by a non-working page and a banner shaming me for the war in Ukraine.

I know there's probably some US sanctions list somewhere which the company had to adhere to. But experiencing it in Georgia, where the streets are covered with Ukrainian flags and people are very open with their opinion on the war, is just surreal.

5 days ago | tryauuum

That indeed sounds remarkably puzzling, so much so that I find it a bit hard to believe.

5 days ago | perching_aix

They are mentioning the country, not the US state.

Supposedly Georgia has asked to join the EU since the Ukraine invasion, which implies at the very least empathy towards Ukraine and not support for the war.

Having said that, and taking into account that IP geolocation is a fantasy and not something that really works reliably in practice, I would totally understand if some people living in Georgia were geolocated in Russia because their ISP is a Russian company or is using IPs associated with Russia.

I am regularly geolocated by some websites more than 3000km away from my home. My ISP's headquarters and datacenters are in a different country, and I guess some of the IP ranges they use are geolocated there.

5 days ago | prmoustache

> They are mentioning the country, not the US state.

Yes, I know :) I don't think IP geolocation is so poor that it'd put Georgian residents into Russia. Could be wrong though, of course.

5 days ago | perching_aix

Then why is it so poor that it sometimes puts me in Romania, while I am in Spain and closer to Africa than to any other European country but Portugal?

5 days ago | prmoustache

> Then why is it so poor that it sometimes

it being a company that estimates the location based on publicly available information like "This ASN belongs to this corporate entity which is registered in this country/related to this association" and so on.

There is no official hashmap of "IP => geographical location"; they're all guesses and estimates.

5 days ago | diggan

A large chunk of Georgian territory is occupied by Russia: Abkhazia (which essentially functions as a breakaway state but is de facto Russian-controlled) and South Ossetia (which essentially functions as a de facto Russian oblast). That's probably the issue.

4 days ago | laken

I think OP would have mentioned that if their goal was to have an honest discussion.

4 days ago | perching_aix

That was in Tbilisi.

3 days ago | tryauuum

The distinction between resilience and robustness strikes me as a useful one. Really great article overall.

4 days ago | adamc

Perhaps a canary deployment per region might help in such situations? Prime the CDN assets with a smaller set of users.

5 days ago | cpatil

So what is the suggestion at the end of the post? Did I understand correctly that a sandboxed-replica simulator with the fundamental training would harden the system design? Cool! Can you run the simulator from a basic but complete architectural drawing as input? I'd be curious to know whether LLMs are able to abstract it all across the public network and come back with attention to all possible known scenarios. Frankly, you could even feed those scenarios into financial forecast models to move the right levers for appropriate actions.

These blind spots are exploits waiting to be discovered.