I Stopped Wearing Shoes. My Wife Is Happier Than Ever
I used to own 10,000 pairs of shoes, so I had to buy an extra house just to have room to store all of them. My fridge was completely filled with shoes, and my wife dreaded even walking around in our home.
After going barefoot, I was able to sell the other house and now have room for groceries in my fridge, and my kids can now eat.
This is why shoes are bad.
> We were managing 47 Kubernetes clusters across three cloud providers.
Not a Kubernetes guy, so perhaps this is an ignorant question: why would you run 47 clusters?
I thought the point of a Kubernetes cluster is that you just throw your workload at it and be happy?
I get that you want a few for testing and development, etc., and perhaps failover to another provider or similar. But 47?
> Why would you run 47 clusters?
Entirely possible for an enterprise-y or B2B use case: some clients might want rigid data/network isolation in a separate account/VPC, plus it reduces the blast radius compared to running everything in one big cluster. There are ways of achieving this in a single cluster with a lot of added complexity, and spinning up a new VPC + K8s cluster might be easier if you have the Terraform modules ready to go.
It's not an ignorant question! Running 47 different clusters is insane.
Speaking in absolute numbers without any reference point makes no sense.
The article states they moved all of their stateless stuff to ECS, stateful stuff to Docker containers on EC2, batch jobs to AWS Batch, and event-driven stuff to AWS Lambda. Previously they ran 47 different K8s clusters on 3 different clouds.
Given that all of their needs could be satisfied with 4 AWS services, at nearly half a million dollars a year less than their previous setup, I think running 47 different clusters in 3 clouds was insane.
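For a rough sense of what that consolidation looks like in practice, here's a minimal boto3 sketch; the cluster, queue, and function names are made up, and the stateful-containers-on-EC2 piece is just plain Docker on instances, so there's no API call to show for it.

    # Hypothetical mapping of the three API-driven destinations: ECS for
    # stateless services, AWS Batch for batch jobs, Lambda for event-driven work.
    import json
    import boto3

    ecs = boto3.client("ecs")
    batch = boto3.client("batch")
    lam = boto3.client("lambda")

    # Stateless service -> ECS: run a task from an existing task definition.
    ecs.run_task(
        cluster="web-cluster",            # hypothetical names throughout
        taskDefinition="api-service",
        launchType="FARGATE",
        networkConfiguration={"awsvpcConfiguration": {"subnets": ["subnet-0abc"]}},
    )

    # Batch job -> AWS Batch: submit against an existing queue and job definition.
    batch.submit_job(jobName="nightly-etl", jobQueue="default", jobDefinition="etl-job")

    # Event-driven work -> Lambda: invoke a function asynchronously.
    lam.invoke(FunctionName="on-upload", InvocationType="Event",
               Payload=json.dumps({"bucket": "uploads", "key": "object"}))

None of this removes the need to define the task definitions, queues, and functions somewhere, but it is four managed services instead of 47 control planes.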
Yes, in this context I don't see any reason for that number of separate clusters.
While I'm pretty sure the article is clickbait (can't tell, paywalled too soon), having many clusters ain't dumb nowadays.
The automation of Kubernetes maintenance is great nowadays, for any setup: bare metal, on-prem, or managed public cloud. Leveraging that can make it easier to manage multiple clusters and give each team/project its own cluster than to implement proper mechanisms, like RBAC between projects, network boundaries, etc., on a single cluster.
Nowadays you can easily move that complexity one layer up and treat whole clusters as volatile components that are defined in code and are not snowflakes (see the sketch at the end of this comment).
That way each team/project can have core things configured differently and managed by themselves, like implementing their contrasting opinions on network meshes, or CRDs that would otherwise be in conflict, etc.
The overhead is not that huge, or at least doesn't have to be. My test clusters with multiple environments of my apps consume around 4 GB of memory in total (aside from the apps themselves), and that includes all the k8s stuff, log aggregation, metrics aggregation, and so on. You don't even have to manage your own control plane: the cloud can give you a shared one (like Azure does in its two lower tiers), or you can use services that provide just the control plane while the nodes run on whatever hardware you like (like Scaleway Kosmos).
So yeah, it's not for everyone, but it can surely grow to those numbers of clusters, especially if you multiply by dev/staging/QA/prod for each team and add some infrastructure for actually testing your infra/IaC.
Although why they had such overhead is a mystery to me; it would be cool to see that part described.
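To make the clusters-as-code point concrete, here's a minimal sketch; the team names are made up and eksctl just stands in for whatever provisioner you actually use (Terraform, Cluster API, etc.). Two teams times three environments already puts you at six clusters, so numbers like 47 aren't as far away as they sound.

    # Hypothetical per-team, per-environment clusters, all defined in code.
    import subprocess

    TEAMS = ["payments", "search"]       # made-up team names
    ENVS = ["dev", "staging", "prod"]    # one cluster per team per environment

    for team in TEAMS:
        for env in ENVS:
            name = f"{team}-{env}"
            # Each cluster is reproducible from this code, not a hand-built snowflake.
            subprocess.run(
                ["eksctl", "create", "cluster", "--name", name, "--region", "eu-west-1"],
                check=True,
            )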
Indeed… I wonder what the path not taken, a more optimal k8s rearchitecting, would have looked like.
https://archive.is/GoiDF
Simplicity always wins over complexity. I don't think the problem here is Kubernetes, but more like the way they used it. Any system can be made utterly complex, if you don't take the time to make it simple.
> Any system can be made utterly complex
Kubernetes makes it easier to end up with something utterly complex.
Strange article. Saying "99.99% uptime maintained" and that they had 4 major outages in a week is kind of strange, since 99.99% uptime only allows for about 4 minutes of downtime a month...
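For reference, the back-of-the-envelope arithmetic:

    # Downtime budget for a given availability target.
    def downtime_minutes(availability: float, days: float) -> float:
        return (1.0 - availability) * days * 24 * 60

    for target in (0.999, 0.9999, 0.99999):
        print(f"{target:.3%}: {downtime_minutes(target, 30):.1f} min per 30-day month")
    # 99.900%: 43.2 min per 30-day month
    # 99.990%: 4.3 min per 30-day month
    # 99.999%: 0.4 min per 30-day month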
It would be weird if these two points were talking about the same time frame.
If I read the article correctly, that number relates to statistics collected only after they ditched k8s.
Can we normalize sending the archive link by default? (in the submission)
If they have 3 cloud providers and, say, 5 regions, that's potentially 15 clusters. My assumption is that they're running in even more regions...
Edit: https://archive.is/GoiDF, thanks @nosefrog!
Um. Interesting. I don't think anyone should be operating 47 different Kubernetes clusters for an application. You should probably max out at three: production, staging, and dev, if you even need a dev cluster (ideally you can just run your dev server locally). You can probably also get away with colocating staging and production in the same cluster, in different namespaces or using different sets of services/labels (see the sketch at the end of this comment), and ultimately just run one Kubernetes cluster.
They mentioned they run on three different cloud providers at the same time (...why...?), but even then, I'm not clear how that results in forty-seven different K8s clusters. 47 isn't even divisible by three!
Sadly the rest of the article post-paywall doesn't explain anything about how they ended up in that mess. Apparently they have "8 senior DevOps engineers," and you... really shouldn't be operating 7x more clusters than you have senior DevOps engineers in my opinion.
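To be clear about the staging/production colocation mentioned above, here's a minimal sketch with the official Python Kubernetes client (it assumes kubeconfig access; the label scheme is illustrative):

    # Colocate staging and production in one cluster as separate namespaces.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    for env in ("staging", "production"):
        ns = client.V1Namespace(
            metadata=client.V1ObjectMeta(
                name=env,
                labels={"environment": env},  # lets NetworkPolicies/ResourceQuotas target the env
            )
        )
        v1.create_namespace(ns)

Resource quotas, network policies, and RBAC can then be scoped per namespace, which gets you a fair amount of the isolation people otherwise reach for separate clusters to achieve.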
This is a useless article without explaining what their use case is exactly. Running 47 kubernetes clusters sounds weird as hell.
Paywall
A DevOps team sounds so ridiculous to me. I stopped reading right at the beginning. They should probably try to reduce their complexity first.
It seems like that's what they did?