The article offers a simplified world model: Poisson arrivals and an infinite queue, which is fine as a math model.
In the real world however, the bursts can be correlated, due to factors like timeouts/retries and thundering herds.
So the real economics of a load-balanced system is a simple reliability story: being able to reasonably serve the peak traffic, which leads to over-provisioning of those systems.
Using the cloud allows some form of scaling resources up/down, but doesn't completely solve the problem. I think migrating away from synchronous systems towards async systems and letting clients gradually absorb the delays is a better approach (rather than forcing infrastructure to be dynamically scaled up/down and being billed per request-second by your cloud provider).
Another technique to add to the mix, if you can handle the additional complexity, is load or feature shedding. If you can delay or just drop expensive, non-essential application features at the exact time you need to scale or handle a burst, then your system has more capacity left for the core app logic that handles requests. This can prevent the system from getting wedged in a positive feedback loop.
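To make the feature-shedding idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Shedder` class, the two thresholds, and the `core`/`extras` callables are hypothetical names, and a real system would probably shed on measured latency or queue depth rather than a plain in-flight counter.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Shedder:
    """Hypothetical feature shedder: drops optional work as load rises."""
    soft_limit: int = 50    # above this many in-flight requests, skip optional features
    hard_limit: int = 100   # at this many in-flight requests, reject new ones outright
    in_flight: int = 0

    def admit(self) -> str:
        """Return 'full', 'core_only', or 'reject' for a new request."""
        if self.in_flight >= self.hard_limit:
            return "reject"
        self.in_flight += 1
        return "core_only" if self.in_flight > self.soft_limit else "full"

    def done(self) -> None:
        """Mark one admitted request as finished."""
        self.in_flight = max(0, self.in_flight - 1)


def handle(shedder: Shedder,
           core: Callable[[], str],
           extras: Callable[[], str]) -> Optional[List[str]]:
    """Run the core logic always, and the expensive extras only when there is headroom."""
    mode = shedder.admit()
    if mode == "reject":
        return None  # the caller would map this to something like HTTP 503
    try:
        result = [core()]
        if mode == "full":
            result.append(extras())  # shed first when load climbs
        return result
    finally:
        shedder.done()
```

The point of the two thresholds is that the service degrades in steps (extras first, then whole requests) instead of slowing down for everyone at once.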
See also the gamedev technique of having sacrificial assets or code, so when you need to free up space late in the schedule to ship, you have something you can actually shed.
> In the real world however, the bursts can be correlated
Very true, as application-layer load-balancing often explicitly pre-bakes the traffic schedule to several hundred distributed IPs for data locality, essentially bypassing the functional need for DNS and local round-robin traffic balancers.
One trades concurrent bandwidth for slightly higher latency, and dynamically adapted capacity as traffic load changes. =3
If your clients are all this well behaved, then you’re definitely not exposed to the public internet.
The global edge networks that I’m aware of all use L4 LBs and L7 LBs. Cloudflare picks anycast over DNS LB, but DNS LB is still widely used.
I don’t see these things changing.
A dead comment says:
> Of course, this assumes independent events. World Cup, super bowls, etc break these assumptions.
Yes, this is very true. The model here assumes Poisson arrivals and exponential service times (the M/M in Kendall notation), which are poor approximations of real-world traffic patterns (which tend to be non-stationary and non-ergodic, and include substantial seasonality). However, the frequency of that seasonality is typically rather low (e.g. daily cycles), and so these stronger assumptions are quite defensible for short time periods.
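For reference, the closed-form result under those M/M assumptions is the standard Erlang C formula. The sketch below is just that textbook formula, not code from the article; the function names and the choice of a service rate of 1 per server in the usage note are my own assumptions.

```python
from math import factorial


def erlang_c(c: int, lam: float, mu: float) -> float:
    """Probability that an arriving request has to queue in an M/M/c system.

    c   -- number of servers
    lam -- arrival rate (requests per unit time)
    mu  -- service rate of one server (requests per unit time)
    """
    a = lam / mu            # offered load in Erlangs
    rho = a / c             # per-server utilization
    if rho >= 1:
        raise ValueError("unstable: utilization must be below 1")
    tail = a**c / factorial(c) / (1 - rho)
    head = sum(a**k / factorial(k) for k in range(c))
    return tail / (head + tail)


def mean_response_time(c: int, lam: float, mu: float) -> float:
    """Mean time in system: mean queueing delay plus mean service time."""
    return erlang_c(c, lam, mu) / (c * mu - lam) + 1 / mu
```

Holding utilization at 50% and comparing `mean_response_time(1, 0.5, 1.0)`, `mean_response_time(2, 1.0, 1.0)` and `mean_response_time(10, 5.0, 1.0)` shows the mean dropping as servers are added, which is the effect the article is about.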
A better approach is to do simulation with real traffic patterns, or even with more sophisticated parametric models, and get better answers (e.g. https://stability-sim.systems/). The good news is that kind of simulation is cheaper to do than ever before.
It's not surprising if your mental model is the probability that a request gets enqueued. Then, when you add variable request processing times, it becomes clearer why some requests can take unexpectedly long (there is a >0 probability that a request gets queued behind several of the slowest endpoints, for example). So even if 90% of the endpoints are fast and most of the requests aren't even queued, there will still be some that end up being quite slow.
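For the pure M/M/c case the tail even has a closed form. This assumes exponential service times and a FIFO queue, so it is not the mixed fast/slow endpoint scenario described above, only the simplest version of the same effect. With arrival rate \lambda, per-server service rate \mu, c servers, and C(c, \lambda/\mu) the Erlang C probability that a request has to queue at all:

```latex
P(W_q > t) = C\!\left(c, \tfrac{\lambda}{\mu}\right) e^{-(c\mu - \lambda)\,t}, \qquad t \ge 0
```

So even when C is small, the probability of a long wait decays exponentially in t but never reaches zero.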
What's conspicuously missing is the plot of performance when you do have a well tuned queue in front of the service. Yes, having a queue becomes less important the more backend servers you have, but here even with 10 servers the plot shows your latency remains >25% worse than it would be with a queue.
Also missing is a discussion of how the variance in processing times affects you when you rely on load balancing alone.
> What's conspicuously missing is the plot of performance when you do have a well tuned queue in front of the service.
As in between the service and the load balancer? There's already an infinite queue in the load balancer. You can try that out on https://stability-sim.systems/ to see the effect, but the short version is that (in this model) it makes things worse.
If you're saying that the queue in the load balancer should be limited in size to reduce tail latency, then I agree.
No, I mean when you have a queue broker that the backends can pull work from when they become idle, rather than relying on load balancing which will send work to backends while they're still busy.
This scenario already works that way. The very first sentence says "servers, each of which can only handle a single concurrent request, and has no internal queuing". This implies that the load balancer waits for a server to finish a request then immediately sends the next one.
I don't believe it does. As I understand it, the load balancer has a queue in which it can buffer infinite requests, but it drains that queue by pushing work to the backend servers in what's probably a round-robin fashion. So there is secondary queueing at each server. Even the "least connections" strategies available through some load balancers do not usually behave as you might expect (by always sending the next request to a server that's idle). Pull-based load balancing via a queue has its own downsides but the big upside is to make latency essentially a constant low overhead regardless of the number of servers in the typical case.
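To make the distinction being argued here concrete, a toy simulation can compare the two readings on the same traffic. This is only a sketch under my own assumptions (Poisson arrivals, exponential service times with rate 1 per server, blind round-robin for the push case, FIFO for the pull case); the function name `compare_push_vs_pull` is made up, and it says nothing about which reading of the article is the right one.

```python
import heapq
import random
from typing import Tuple


def compare_push_vs_pull(c: int, utilization: float,
                         n: int = 200_000, seed: int = 1) -> Tuple[float, float]:
    """Return (mean latency with round-robin push, mean latency with pull from one queue)."""
    rng = random.Random(seed)
    lam = c * utilization        # total arrival rate; each server serves at rate 1
    now = 0.0
    rr_free = [0.0] * c          # time at which each round-robin server next goes idle
    pull_free = [0.0] * c        # min-heap of idle times for the pull-based pool
    rr_total = pull_total = 0.0

    for i in range(n):
        now += rng.expovariate(lam)
        service = rng.expovariate(1.0)

        # Push: request i goes to server i % c even if that server is still busy,
        # so it waits in that server's own backlog.
        s = i % c
        rr_free[s] = max(now, rr_free[s]) + service
        rr_total += rr_free[s] - now

        # Pull: whichever server frees up first takes the next request
        # from the shared FIFO queue.
        earliest = heapq.heappop(pull_free)
        finish = max(now, earliest) + service
        heapq.heappush(pull_free, finish)
        pull_total += finish - now

    return rr_total / n, pull_total / n
```

Calling `compare_push_vs_pull(10, 0.8)` feeds identical arrivals and service times to both strategies, so any gap in the two means comes from the dispatch policy alone. With a single server the two policies are the same thing and the numbers match.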
Of course, this assumes independent events. World Cup, super bowls, etc break these assumptions.
Still, queuing theory is so cool.
Why would anyone think that it would get linearly worse? What's the (wrong) assumption there?
I thought the same thing. But, should we be surprised about what people believe in these days?
I think that the issue is in part due to the variables. Plotting the mean request time is less intuitive than plotting throughput.
If you plot throughput vs number of servers, it'll be a straight line. And asking people that, I think most would agree on a straight line. But who knows!
One explanation would be that more load could mean higher (absolute) variance in queue length, and therefore higher latency especially at higher percentiles. It doesn't work out that way (for reasons that Erlang actually writes about in one of his original works), but it's not an entirely unreasonable intuition.
I think the author made it up just to have something more to show on the graph.
It was a poll on Twitter. Do you really expect good responses?
I saw a seemingly inconsequential article on Hacker News and assumed it was probably the kind of article that describes a profound idea under a naive title. It turns out it's actually very confusing, as it puts overwrought drama ahead of mundane intuition. That type of writing belongs in the literary sphere, not in technology writing.