For what it's worth, I've worked at multiple places that ran shell scripts just fine for their deploys.
- One had only 2 services [php] and ran over 1 billion requests a day. Deploy was trivial, ssh some new files to the server and run a migration, 0 downtime.
- One was in an industry that didn't need "Webscale" (retirement accounts). Prod deploys were just docker commands run by Jenkins. We ran two servers per service from the day I joined to the day I left 4 years later (3x growth), and ultimately removed one service and one database during all that growth.
Another outstanding thing about both of these places was that we had all the testing environments you need, on-demand, in minutes.
The place I'm at now is trying to do kubernetes and is failing miserably (an ongoing nightmare 4 months in and probably at least 8 to go, when it was allegedly supposed to take only 3 total). It has one shared test environment in which it takes 3 hours to see your changes.
I don't fault kubernetes directly, I fault the overall complexity. But at the end of the day kubernetes feels like complexity trying to abstract over complexity, and often I find that's less successful than removing complexity in the first place.
If your application doesn't need and likely won't need to scale to large clusters, or multiple clusters, then there's nothing wrong, per se, with your solution. I don't think k8s is that hard, but there are a lot of moving pieces and a fair bit to learn. Finding someone with experience to help you can make a ton of difference.
Questions worth asking:
- Do you need a load balancer?
- TLS certs and rotation?
- Horizontal scalability.
- HA/DR
- dev/stage/production + being able to test/stage your complete stack on demand.
- CI/CD integrations, tools like ArgoCD or Spinnaker
- Monitoring and/or alerting with Prometheus and Grafana
- Would you benefit from being able to deploy a lot of off-the-shelf software (let's say Elasticsearch, or some random database, or a monitoring stack) via Helm quickly/easily?
- "Ingress"/proxy.
- DNS integrations.
If you answer yes to many of those questions there's really no better alternative than k8s. If you're building web applications at a large enough scale, most of these will end up being yes at some point.
Every item on that list is "boring" tech. Approximately everyone has used load balancers, test environments and monitoring since the 90s just fine. What is it that you think makes Kubernetes especially suited for this compared to every other solution during the past three decades?
There are good reasons to use Kubernetes, mainly if you are using public clouds and want to avoid lock-in. I may be partial, since managing it pays my bills. But it is complex, mostly unnecessarily so, and no one should be able to say with a straight face that it achieves better uptime or requires less personnel than any alternative. That's just sales talk, and should be a big warning sign.
It's the way things work together. If you want to add a new service you just annotate that service and DNS gets updated, your ingress gets the route added, and cert-manager gets you the certs from Let's Encrypt. If you want Prometheus to monitor your pod, you just add the right annotation. When your server goes down, k8s will move your pod around. k8s storage will take care of having the storage follow your pod. Your entire configuration is highly available and replicated in etcd.
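For what it's worth, here's a minimal sketch of what that wiring can look like, assuming external-dns, cert-manager (with a ClusterIssuer named letsencrypt-prod) and a Prometheus configured for the common scrape-annotation convention are already installed in the cluster; the names and hostname are made up:

  apiVersion: v1
  kind: Service
  metadata:
    name: my-service                    # hypothetical service name
    annotations:
      external-dns.alpha.kubernetes.io/hostname: my-service.example.com   # external-dns publishes the DNS record
      prometheus.io/scrape: "true"      # picked up by the usual Prometheus scrape config
      prometheus.io/port: "8080"
  spec:
    selector:
      app: my-service
    ports:
      - port: 80
        targetPort: 8080
  ---
  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: my-service
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod   # cert-manager requests the Let's Encrypt cert
  spec:
    rules:
      - host: my-service.example.com
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: my-service
                  port:
                    number: 80
    tls:
      - hosts:
          - my-service.example.com
        secretName: my-service-tls      # cert-manager stores the certificate here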
It's just very different than your legacy "standard" technology.
None of this is difficult to do or automate, and we've done it for years. Kubernetes simply makes it more complex by adding additional abstractions in the pursuit of pretending hardware doesn't exist.
There are, maybe, a dozen companies in the world with a large enough physical footprint where Kubernetes might make sense. Everyone else is either engaged in resume-driven development, or has gone down some profoundly wrong path with their application architecture to where it is somehow the lesser evil.
I used to feel the same way, but have come around. I think it's great for small companies for a few reasons. I can spin up effectively identical dev/ci/stg/prod clusters for a new project in an hour for a medium sized project, with CD in addition to everything GP mentioned.
I basically don't have to think about ops anymore until something exotic comes up, it's nice. I agree that it feels clunky, and it was annoying to learn, but once you have something working it's a huge time saver. The ability to scale without drastically changing the system is a bonus.
> I can spin up effectively identical dev/ci/stg/prod clusters for a new project in an hour for a medium sized project, with CD in addition to everything GP mentioned.
I can do the same thing with `make local` invoking a few bash commands. If the complexity increases beyond that, a mistake has been made.
You could say the same thing about Ansible or Vagrant or Nomad or Salt or anything else.
I can say with complete confidence however, that if you are running Kubernetes and not thinking about ops, you are simply not operating it yourself. You are paying someone else to think about it for you. Which is fine, but says nothing about the technology.
> Every item on that list is "boring" tech. Approximately everyone has used load balancers, test environments and monitoring since the 90s just fine. What is it that you think makes Kubernetes especially suited for this compared to every other solution during the past three decades?
You could make the same argument against using cloud at all, or against using CI. The point of Kubernetes isn't to make those things possible, it's to make them easy and consistent.
Kubernetes is boring tech as well.
And the advantage of it is having one way to manage resources, scaling, logging, observability, hardware, etc.
All of which is stored in Git and so audited, reviewed, versioned, tested etc in exactly the same way.
Kubernetes is a great example of the "second-system effect".
Kubernetes only works if you have a webapp written in a slow interpreted language. For anything else it is a huge impedance mismatch with what you're actually trying to do.
P.S. In the real world, Kubernetes isn't used to solve technical problems. It's used as a buffer between the dev team and the ops team, who usually have different schedules/budgets, and might even be different corporate entities. I'm sure there might be an easier way to solve that problem without dragging in Google's ridiculous and broken tech stack.
Contrary to popular belief, k8s is not Google's tech stack.
My understanding is that it was initially sold as Google's tech to benefit from Google's tech reputation (exploiting the confusion caused by the fact that some of the original k8s devs were ex-googlers), and today it's also Google trying to pose as k8s inventor, to benefit from its popularity. Interesting case of host/parasite symbiosis, it seems.
Just my impression though, I can be wrong, please comment if you know more about the history of k8s.
Is there anyone that works at Google that can confirm this?
What's left of Borg at Google? Did the company switch to the open source Kubernetes distribution at any point? I'd love to know more about this as well.
> exploiting the confusion caused by the fact that some of the original k8s devs were ex-googlers
What about the fact that many active Kubernetes developers, are also active Googlers?
Borg isn't going anywhere, Kubernetes isn't Google-scale
> It's used as a buffer between the dev team and the ops team, who usually have different schedules/budgets
That depends on your definition. If the ops team is solely responsible for running the Kubernetes cluster, then yes. In reality that's rarely how things turn out. Developers want Kubernetes, because... I don't know. Ops doesn't even want Kubernetes in many cases. Kubernetes is amazing, for those few organisations that really need it.
My rule of thumb is: if your worker nodes aren't entire physical hosts, then you might not need Kubernetes. I've seen some absolutely crazy setups where developers had designed the entire solution around Kubernetes, only to run one or two containers. The reasoning is pretty much always the same: they know absolutely nothing about operations, and fail to understand that load balancers exist outside of Kubernetes, or that their solution could be an nginx configuration, 100 lines of Python and some systemd configuration.
I accept that I lost the fight of arguing that Kubernetes is overly complex and a nightmare to debug. In my current position I can even see some advantages to Kubernetes, so I was at least a little off in my criticism. Still, I don't think Kubernetes should be your default deployment platform unless you have very specific needs.
kubernetes is an API for your cluster, that is portable between providers, more or less.
there are other abstractions, but they are not portable, e.g. fly.io, DO etc.
so unless you want a vendor lock-in, you need it.
for one of my products, I had to migrate, due to business reasons, 4 times into different kube flavors, from self-managed (2 times) to GKE and EKS.
> there are other abstractions, but they are not portable
Not true. Unix itself is an API for your cluster too, like the original post implies.
Personally, as a "tech lead" I use NixOS. (Yes, I am that guy.)
The point is, k8s is a shitty API because it's built only for Google's "run a huge webapp built on shitty Python scripts" use case.
Most people don't need this, what they actually want is some way for dev to pass the buck to ops in some way that PM's can track on a Gantt chart.
> If you answer yes to many of those questions there's really no better alternative than k8s.
This is not even close to true with even a small number of resources. The notion that k8s somehow is the only choice is right along the lines of “Java Enterprise Edition is the only choice” — ie a real failure of the imagination.
For startups and teams with limited resources, DO, fly.io and render are doing lots of interesting work. But what if you can’t use them? Is k8s your only choice?
Let's say you're a large org with good engineering leadership, and you have high-revenue systems where downtime isn't okay. Also, for compliance reasons, public cloud isn't okay.
DNS in a tightly controlled large enterprise internal network can be handled with relatively simple microservices. Your org will likely have something already though.
Dev/Stage/Production: if you can spin up instances on demand this is trivial. Also financial services and other regulated biz have been doing this for eons before k8s.
Load Balancers: lots of non-k8s options exist (software and hardware appliances).
Prometheus / Grafana (and things like Netdata) work very well even without k8s.
Load Balancing and Ingress is definitely the most interesting piece of the puzzle. Some choose nginx or Envoy, but there’s also teams that use their own ingress solution (sometimes open-sourced!)
But why would a team do this? Or more appropriately, why would their management spend on this? Answer: many don’t! But for those that do — the driver is usually cost*, availability and accountability, along with engineering capability as a secondary driver.
(*cost because it’s easy to set up a mixed ability team with experienced, mid-career and new engineers for this. You don’t need a team full of kernel hackers.)
It costs less than you think, it creates real accountability throughout the stack and most importantly you’ve now got a team of engineers who can rise to any reasonable challenge, and who can be cross pollinated throughout the org. In brief the goal is to have engineers not “k8s implementers” or “OpenShift implementers” or “Cloud Foundry implementers”.
> DNS in a tightly controlled large enterprise internal network can be handled with relatively simple microservices. Your org will likely have something already though.
And it will likely be buggy with all sorts of edge cases.
> Dev/Stage/Production: if you can spin up instances on demand this is trivial. Also financial services and other regulated biz have been doing this for eons before k8s.
In my experience financial services have been notably not doing it.
> Load Balancers: lots of non-k8s options exist (software and hardware appliances).
The problem isn't running a load balancer with a given configuration at a given point in time. It's how you manage the required changes to load balancers and configuration as time goes on. It's very common for that to be a pile of perl scripts that add up to an ad-hoc informally specified bug-ridden implementation of half of kubernetes.
> And it will likely be buggy with all sorts of edge cases.
I have seen this view in corporate IT teams who’re happy to be “implementers” rather than engineers.
In real life, many orgs will in fact have third party vendor products for internal DNS and cert authorities. Writing bridge APIs to these isn’t difficult and it keeps the IT guys happy.
A relatively few orgs have written their own APIs, typically to manage a delegated zone. Again, you can say these must be buggy, but here’s the thing — everything’s buggy. Including k8s. As long as bugs are understood and fixed, no one cares. The proof of the pudding is how well it works.
Internal DNS in particular is easy enough to control and test if you have engineers (vs implementers) in your team.
> manage changes to load balancers … perl
That’s a very black and white view, that teams are either on k8s (which to you is the bees knees) or a pile of Perl (presumably unmaintainable). Speaks to interesting unconscious bias.
Perhaps it comes from personal experience, in which case I’m sorry you had to be part of such a team. But it’s not particularly difficult to follow modern best practices and operate your own stack.
But if your starter stance is that “k8s is the only way”, no one can talk you out of your own mental hard lines.
> Again, you can say these must be buggy, but here’s the thing — everything’s buggy. Including k8s. As long as bugs are understood and fixed, no one cares.
Agreed, but internal products are generally buggier, because an internal product is in a kind of monopoly position. You generally want to be using a product that is subject to competition, that is a profit center rather than a cost center for the people who are making it.
> Internal DNS in particular is easy enough to control and test if you have engineers (vs implementers) in your team.
Your team probably aren't DNS experts, and why should they be? You're not a DNS company. If you could make a better DNS - or a better DNS-deployment integration - than the pros, you'd be selling it. (The exception is if you really are a DNS company, either because you actually do sell it, or because you have some deep integration with DNS that enables your competitive advantage)
> Perhaps it comes from personal experience, in which case I’m sorry you had to be part of such a team. But it’s not particularly difficult to follow modern best practices and operate your own stack.
I'd say that's a contradiction in terms, because modern best practice is to not run your own stack.
I don't particularly like kubernetes qua kubernetes (indeed I'd generally pick nomad instead). But I absolutely do think you need a declarative, single-source-of-truth way of managing your full deployment, end-to-end. And if your deployment is made up of a standard load balancer (or an equivalent of one), a standard DNS, and prometheus or grafana, then you've either got one of these products or you've got an internal product that does the same thing, which is something I'm extremely skeptical of for the same reason as above - if your company was capable of creating a better solution to this standard problem, why wouldn't you be selling it? (And if an engineer was capable of creating a better solution to this standard problem, why would they work for you rather than one of the big cloud corps?)
In the same way I'm very skeptical of any company with an "internal cloud" - in my experience such a thing is usually a significantly worse implementation of AWS, and, yes, is usually held together with some flaky Perl scripts. Or an internal load balancer. It's generally NIH, or at best a cost-cutting exercise which tends to show; a company might have an internal cloud that's cheaper than AWS (I've worked for one), but you'll notice the cheapness.
Now again, if you really are gaining a competitive advantage from your things then it may make sense to not use a standard solution. But in that case you'll have something deeply integrated, i.e. monolithic, and that's precisely the case where you're not deploying separate standard DNS, separate standard load balancers, separate standard monitoring etc.. And in that case, as grandparent said, not using k8s makes total sense.
But if you're just deploying a standard Rails (or what have you) app with a standard database, load balancer, DNS, monitoring setup? Then 95% of the time your company can't solve that problem better than the companies that are dedicated to solving that problem. Either you don't have a solution at all (beyond doing it manually), you use k8s or similar, or you NIH it. Writing custom code to solve custom problems can be smart, but writing custom code to solve standard problems usually isn't.
> if your company was capable of creating a better solution to this standard problem, why wouldn't you be selling it?
Let's pretend I'm the greatest DevOps software developer engineer ever, and I write a Kubernetes replacement that's 100x better. Since it's 100x better, I simply charge 100x as much as it costs per CPU/RAM for a Kubernetes license to a 1,000 customers, and take all of that money to the bank and I deposit my check for $0.
I don't disagree with the rest of the comment, but the market for the software to host a web app is a weird market.
> If you answer yes to many of those questions there's really no better alternative than k8s.
Nah, most of that list is basically free for any company that uses an Amazon load balancer and an autoscaling group. In terms of likelihood of incidents, time, and cost, each of those will be an order of magnitude higher with a team of kubernetes engineers than with a less complex setup.
People really underestimate the power of shell scripts, ssh, and trusted developers.
Besides the fact that shell scripts aren't scalable (in terms of horizontal scalability, like the actor model), I would also like to point out that shell scripts should be simple; if you want to handle something that big, you essentially end up using shell as a programming language in disguise -- not ideal, and I would rather go with Go or Rust instead.
We don't live in 1999 any more. A big machine with a database can serve everyone in the US, and I can fit it in my closet.
It's like people are stuck in the early 2000s when they start thinking about computer capabilities. Today I have more flops in a single GPU under my desk than the world's largest supercomputer had in 2004.
> It's like people are stuck in the early 2000s when they start thinking about computer capabilities.
This makes sense, because the code people write makes machines feel like they're from the early 2000's.
This is partially a joke, of course, but I think there is a massive chasm between the people who think you immediately need several computers to do things for anything other than redundancy, and the people who see how ridiculously much you can do with one.
I added performance testing to all our endpoints from the start, so that people don’t start to normalize those 10s response times that our last system had (cry)
> Besides the fact that shell scripts aren't scalable…
What are you trying to say there? My understanding is that, way under the hood, a set of shell scripts is in fact enabling the scalable nature of… the internet.
> My understanding is that, way under the hood, a set of shell scripts is in fact enabling the scalable nature of… the internet.
I sure hope not. The state of error handling in shell scripts alone is enough to disqualify them for serious production systems.
If you're extremely smart and disciplined it's theoretically possible to write a shell script that handles error states correctly. But there are better things to spend your discipline budget on.
...that's only for early internet, and the early internet is effing broken at best
On the other hand, my team slapped 3 servers down in a datacenter, had each of them configured in a Proxmox cluster within a few hours. Some 8-10 hours later we had a fully configured kubernetes cluster running within Proxmox VMs, where the VMs and k8s cluster are created and configured using an automation workflow that we have running in GitHub Actions. An hour or two worth of work later we had several deployments running on it and serving requests.
Kubernetes is not simple. In fact it's even more complex than just running an executable with your linux distro's init system. The difference in my mind is that it's more complex for the system maintainer, but less complex for the person deploying workloads to it.
And that's before exploring all the benefits of kubernetes-ecosystem tooling like the Prometheus operator for k8s, or the horizontally scalable Loki deployments, for centrally collecting infrastructure and application metrics, and logs. In my mind, making the most of these kinds of tools, things start to look a bit easier even for the systems maintainers.
Not trying to discount your workplace too much. But I'd wager there's a few people that are maybe not owning up to the fact that it's their first time messing around with kubernetes.
As long as your organisation can cleanly either a) split the responsibility for the platform from the responsibility for the apps that run on it, and fund it properly, or b) do the exact opposite and accommodate all the responsibility for the platform into the app team, I can see it working.
The problems start when you're somewhere between those two points. If you've got a "throw it over the wall to ops" type organisation, it's going to go bad. If you've got an underfunded platform team so the app team has to pick up some of the slack, it's going to go bad. If the app team have to ask permission from the platform team before doing anything interesting, it's going to go bad.
The problem is that a lot of organisations will look at k8s and think it means something it doesn't. If you weren't willing to fund a platform team before k8s, I'd be sceptical that moving to it is going to end well.
Are you self hosting kubernetes or running it managed?
I've only used it managed. There is a bit of a learning curve but it's not so bad. I can't see how it can take 4 months to figure it out.
We are using EKS
> I can't see how it can take 4 months to figure it out.
Well have you ever tried moving a company with a dozen services onto kubernetes piece-by-piece, with zero downtime? How long would it take you to correctly move and test every permission, environment variable, and issue you run into?
Then if you get a single setting wrong (e.g. memory size) and don't load-test with realistic traffic, you bring down production, potentially lose customers, and have to do a public post-mortem about your mistakes? [true story for current employer]
I don't see how anybody says they'd move a large company to kubernetes in such an environment in a few months with no screwups and solid testing.
It sounds like it's not easy to figure out the permissions, envvars, memory size, etc. of your existing system, and that's why the migration is so difficult? That's not really one of Kubernetes' (many) failings.
Yes, and now we are back at the ancestor comment's original point: "at the end of the day kubernetes feels like complexity trying to abstract over complexity, and often I find that's less successful than removing complexity in the first place"
Which I understand to mean “some people think using Kubernetes will make managing a system easier, but it often will not do that”
Can you elaborate on other things you think Kubernetes gets wrong? Asking out of curiosity because I haven't delved deep into it.
Took us three to four years to go from self-hosted multi-DC to getting the main product almost fully in k8s (some parts didn't make sense in k8s and were pushed to our geo-distributed edge nodes). Dozens of services and teams, and keeping the old stuff working while changing the tire on the car while driving. All while the company continues to grow and scale doubles every year or so. It takes maturity in testing and monitoring, and it takes longer than everyone estimates.
It largely depends on how customized each microservice is, and how many people are working on the project.
I've seen migrations of thousands of microservices happen within the span of two years. A longer timeline, yes, but the number of microservices is orders of magnitude larger.
Though I suppose the organization works differently at this level. The Kubernetes team built a tool to migrate the microservices, and each owner was asked to perform the migration themselves. Small microservices could be migrated in less than three days, while the large and risk-critical ones took a couple of weeks. This all happened in less than two years, but it took more than that in terms of engineer-weeks.
The project was very successful though. The company spends way less money now because of the autoscaling features, and the ability to run multiple microservices in the same node.
Regardless, if the company is running 12 microservices and this number is expected to grow, this is probably a good time to migrate. How did they account for the different shapes of services (stateful, stateless, leader-elected, cron, etc.), networking settings, styles of deployment (blue-green, rolling updates, etc.), secret management, load testing, bug bashing, gradual rollouts, containerizing the services, etc.? If it's taking 4x longer than originally anticipated, it seems like there was a massive failure in project design.
2000 products sounds like you made 2000 engineers learn kubernetes (a week, optimistically, 2000/52 = 38 engineer years, or roughly one wasted career).
Similarly, the actual migration times you estimate add up to decades of engineer time.
It’s possible kubernetes saves more time than using the alternative costs, but that definitely wasn’t the case at my previous two jobs. The jury is out at the current job.
I see the opportunity cost of this stuff every day at work, and am patiently waiting for a replacement.
> 2000 products sounds like you made 2000 engineers learn kubernetes (a week, optimistically, 2000/52 = 38 engineer years, or roughly one wasted career).
Not really, they only had to use the tool to run the migration and then validate that it worked properly. As the other commenter said, a very basic setup for kubernetes is not that hard; the difficult setup is left to the devops team, while the service owners just need to know the basics.
But sure, we can estimate it at 38 engineering years. That's still 38 years for 2,000 microservices; it's way better than 1 year for 12 microservices like in OP's case. The savings we got were enough to offset those 38 years of work, so this project is now paying dividends.
> 2000 products sounds like you made 2000 engineers learn kubernetes (a week, optimistically, 2000/52 = 38 engineer years, or roughly one wasted career).
Learning k8s enough to be able to work with it isn't that hard. Have a centralized team write up a decent template for a CI/CD pipeline, a Dockerfile for the most common stacks you use, and a Helm chart with an example Deployment, PersistentVolumeClaim, Service and Ingress; distribute that, and be available for support in case a team's needs go beyond "we need 1-N pods for this service, they get some environment variables from which they are configured, and maybe a Secret/ConfigMap if the application would rather have configuration done in files". That is enough in my experience.
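To illustrate how small that template can be, here's a hedged sketch of just the Deployment part of such a Helm chart (the values.yaml keys and names are made up; the real chart would sit alongside the Service, Ingress and PersistentVolumeClaim mentioned above):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: {{ .Release.Name }}
  spec:
    replicas: {{ .Values.replicaCount }}       # e.g. 2 in values.yaml
    selector:
      matchLabels:
        app: {{ .Release.Name }}
    template:
      metadata:
        labels:
          app: {{ .Release.Name }}
      spec:
        containers:
          - name: app
            image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
            ports:
              - containerPort: {{ .Values.containerPort }}
            env:
              - name: LOG_LEVEL                # plain environment variables taken from values.yaml
                value: {{ .Values.logLevel | quote }}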
> Learning k8s enough to be able to work with it isn't that hard.
I’ve seen a lot of people learn enough k8s to be dangerous.
Learning it well enough to not get wrapped around the axle with some networking or storage details is quite a bit harder.
For sure, but that's the job of a good ops department. Where I work, for example, every project's CI/CD pipeline has its own IAM user mapping to a Kubernetes role that only has explicitly defined capabilities: create, modify and delete just the utter basics. Even if they committed something into the Helm chart that could cause an annoyance, the service account wouldn't be able to call the required APIs. And the templates themselves come with security built in: privileges are all explicitly dropped, pod UIDs/GIDs are hardcoded to non-root, and we're deploying network policies, at least for ingress, as well now. Only egress network policies aren't in place yet; we haven't been able to make those work with services.
Anyone wishing to do stuff like use the RDS database provisioner gets an introduction from us on how to use it and what the pitfalls are, and regular reviews of their code. They're flexible but we keep tabs on what they're doing, and when they have done something useful we aren't shy from integrating whatever they have done to our shared template repository.
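For what it's worth, the "only explicitly defined capabilities" part can be expressed as a namespaced Role bound to whatever identity the pipeline's IAM user maps to. A rough sketch with made-up names, not their actual policy:

  apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: ci-deployer              # hypothetical role for one project's pipeline
    namespace: my-project
  rules:
    - apiGroups: ["apps"]
      resources: ["deployments"]
      verbs: ["get", "list", "create", "update", "patch", "delete"]
    - apiGroups: [""]
      resources: ["services", "configmaps", "secrets"]
      verbs: ["get", "list", "create", "update", "patch", "delete"]
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: ci-deployer
    namespace: my-project
  subjects:
    - kind: User
      name: ci-my-project          # the user the pipeline's IAM identity maps to
      apiGroup: rbac.authorization.k8s.io
  roleRef:
    kind: Role
    name: ci-deployer
    apiGroup: rbac.authorization.k8s.io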
Comparing the simplicity of two PHP servers against a setup with a dozen services is always going to be one sided. The difference in complexity alone is massive, regardless of whether you use k8s or not.
My current employer did something similar, but with fewer services. The upshot is that with terraform and helm and all the other yaml files defining our cluster, we have test environments on demand, and our uptime is 100x better.
Fair enough that sounds hard.
Memory size is an interesting example. A typical Kubernetes deployment has much more control over this than a typical non-container setup. It is costing you to figure out the right setting but in the long term you are rewarded with a more robust and more re-deployable application.
> has much more control over this than a typical non-container setup
Actually not true, k8s uses the exact same cgroups API for this under the hood that systemd does.
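To make the comparison concrete, a hedged sketch (image and numbers are made up): the memory knob in a pod spec ends up as a cgroup limit, the same mechanism a systemd unit exposes as MemoryMax=.

  apiVersion: v1
  kind: Pod
  metadata:
    name: example
  spec:
    containers:
      - name: app
        image: example/app:1.0          # hypothetical image
        resources:
          requests:
            memory: "256Mi"             # what the scheduler reserves on a node
          limits:
            memory: "512Mi"             # enforced by the cgroup memory controller, much like MemoryMax=512M in a unit file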
> I don't see how anybody says they'd move a large company to kubernetes in such an environment in a few months with no screwups and solid testing.
Unfortunately, I do. Somebody says that when the culture of the organization expects to be told and hear what they want to hear rather than the cold hard truth. And likely the person saying that says it from a perch up high and not responsible for the day to day work of actually implementing the change. I see this happen when the person, management/leadership, lacks the skills and knowledge to perform the work themselves. They've never been in the trenches and had to actually deal face to face with the devil in the details.
Canary deploy, dude (or dudette): route 0.001% of service traffic and then slowly move it over. Then set error budgets. Then a bad service won't "bring down production".
That's how we did it at Google (I was part of the core team responsible for ad serving infra - billions of ads to billions of users a day)
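In a Kubernetes setup, one common (if coarse) way to approximate this is the ingress-nginx canary annotations; a sketch assuming that controller is in use, with made-up names (the weight is a whole percentage, so a 0.001% split like the one described would need LB-level weighting or a service mesh instead):

  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: my-service-canary             # hypothetical canary ingress, alongside the main one
    annotations:
      nginx.ingress.kubernetes.io/canary: "true"
      nginx.ingress.kubernetes.io/canary-weight: "1"   # send ~1% of traffic to the new version, then ramp up
  spec:
    rules:
      - host: my-service.example.com
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: my-service-v2   # hypothetical Service pointing at the canary pods
                  port:
                    number: 80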
Using microk8s or k3s on one node works fine. As the author of "one big server," I am now working on an application that needs some GPUs and needs to be able to deploy on customer hardware, so k8s is natural. Our own hosted product runs on 2 servers, but it's ~10 containers (including databases, etc).
Yup, I like this approach a lot. With cloud providers considering VMs durable these days (they get new hardware for your VM if the hardware it's on dies, without dropping any TCP connections), I think a 1 node approach is enough for small things. You can get like 192 vCPUs per node. This is enough for a lot of small companies.
I occasionally try non-k8s approaches to see what I'm missing. I have a small ARM machine that runs Home Assistant and some other stuff. My first instinct was to run k8s (probably kind honestly), but didn't really want to write a bunch of manifests and let myself scope creep to running ArgoCD. I decided on `podman generate systemd` instead (with nightly re-pulls of the "latest" tag; I live and die by the bleeding edge). This was OK, until I added zwavejs, and now the versions sometimes get out of sync, which I notice by a certain light switch not working anymore. What I should have done instead was have some sort of git repository where I have the versions of these two things, and to update them atomically both at the exact same time. Oh wow, I really did need ArgoCD and Kubernetes ;)
I get by with podman by angrily ssh-ing in in my winter jacket when I'm trying to leave my house but can't turn the lights off. Maybe this can be blamed on auto-updates, but frankly anything exposed to a network that is out of date is also a risk, so, I don't think you can ever really win.
I think porting to k8s can succeed or fail, like any other project. I switched an app that I alone worked on, from Elastic Beanstalk (with Bash), to Kubernetes (with Babashka/Clojure). It didn't seem bad. I think k8s is basically a well-designed solution. I think of it as a declarative language which is sent to interpreters in k8s's control plane.
Obviously, some parts of it took a while to figure out. For example, I needed to figure out an AWS security group problem with Ingress objects that I recall wasn't well documented. So I think parts of that declarative language can suck if the declarative parts aren't well factored out from the imperative parts, or if the log messages don't help you diagnose errors, or if there isn't some kind of (dynamic?) linter that helps you notice problems quickly.
In your team's case, more information seems needed to help us evaluate the problems. Why was it easier before to make testing environments, and harder now?
Yea but that doesn't sound shiny on your resume.
I never chose any single thing in my job just because of how it would look on my resume.
After 20+ years of Linux sysadmin/devops work, and because of a spinal disc herniation last year, I'm now looking for a job.
99% of job offers will ask for EKS/Kubernetes now.
It's like the VMware of the years 200[1-9], or like the "Cloud" of the years 201[1-9].
I've always specialized in physical datacenters and servers, being it on-premises, colocation, embedded, etc... so I'm out of the market now, at least in Spain (which always goes like 8 years behind the market).
You can try to avoid it, and it's nice when you save your company thousands of operational/performance/security/etc. issues and dollars over the years, and you look like a guru who stays ahead of industry issues in your boss's eyes. But it will make finding a job... 99% harder.
It doesn't matter if you demonstrate the highest level of skill in Linux, scripting, Ansible, networking, security, hardware, performance tuning, high availability, all kinds of balancers, switching, routing, firewalls, encryption, backups, monitoring, log management, compliance, architecture, isolation, budget management, team management, provider/customer management, debugging, automation, full-stack programming, and a long etc. If you say "I never worked with Kubernetes, but I learn fast", with your best sincerity at the interview, then you're automatically out of the process. No matter if you're talking with human resources, a helper of the CTO, or the CTO. You're out.
Depends on what kind of company you want to join. Some value simplicity and efficiency more.
So, my current experience at a place where most of the old apps are very old school:
- most server software is waaaaaaay out of date, so getting a dev / test env is a little harder (the last problem we hit was that the HAProxy version doesn't do ECDSA keys for SSL certs, which is the default with certbot)
- yeah pushing to prod is "easy": FTP directly. But now which version of which files are really in prod? No idea. Yeah when I say old school it's old school before things like Jenkins.
- need something done around the servers? That's the ops team's job. A team which also has too much other work to do, so now you'll have to wait a week or two for this simple "upload a file" endpoint on that old API, because you need somewhere to put those files.
Now we've started setting up some on-prem k8s nodes for the new developments. Not because we need crazy scaling, but so the dev team can do most of the ops they need. It takes time to get everything set up, but once it started chugging along it felt good to be able to just declare whatever we need and get it.
You still need to get the devs to learn k8s which is not fun but that's the life of a dev: learning new things every day.
Also k8s does not do data. You want a database or anything managing files: you want to do most of the job outside k8s.
Kubernetes is so easy that you only need two or three dedicated full-time employees to keep the mountains of YAML from collapsing in on themselves before cutting costs and outsourcing your cluster management to someone else.
Sure, it can be easy, just pick one of the many cloud providers that fix all the complicated parts for you. Though, when you do that, expect to pay extra for the privilege, and maybe take a look at the much easier proprietary alternatives. In theory the entire thing is portable enough that you can just switch hosting providers, in practice you're never going to be able to do that without seriously rewriting part of your stack anyway.
The worst part is that the mountains of YAML were never supposed to be written by humans anyway, they're readable configuration your tooling is supposed to generate for you. You still need your bash scripts and your complicated deployment strategies, but rather than using them directly you're supposed to compile them into YAML first.
Kubernetes is nice and all but it's not worth the effort for the vast majority of websites and services. WordPress works just fine without automatic replication and end-to-end microservice TLS encryption.
I went down the Kubernetes path. The product I picked 4 years ago is no longer maintained :(
The biggest breaking change to docker compose since it was introduced was that the docker-compose command stopped working and I had to switch to «docker compose» with a space. Had I stuck with docker and docker-compose I could have trivially kept everything up to date and running smoothly.
I ran a small bootstrapped startup, and I used GKE. Everything was templated.
Each app had its own template, e.g. nodejs-worker, and you didn't change the template unless you really needed to.
I spent ~2% of my time (as manager + eng leader + hiring manager + god knows what else people do at a startup) managing 100+ microservices, because they were templated.
That works great until you want to change something low-level and have to apply it to all those templates.
This is so unnuanced that it reads like rationalization to me. People seem to get stuck on mantras that simple things are inherently fragile which isn't really true, or at least not particularly more fragile than navigating a jungle of yaml files and k8s cottage industry products that link together in arcane ways and tend to be very hard to debug, or just to understand all the moving parts involved in the flow of a request and thus what can go wrong. I get the feeling that they mostly just don't like that it doesn't have professional aesthetics.
This reminds me of the famous Taco Bell Programming post [1]. Simple can surprisingly often be good enough.
> People seem to get stuck on mantras that simple things are inherently fragile which isn't really true...
Ofc it isn't true.
Kubernetes was designed at Google at a time when Google was already a behemoth. 99.99% of all startups and SMEs out there shall never ever have the same scaling issues and automation needs that Google has.
Now that said... When you begin running VMs and containers, even only a very few of them, you immediately run into issues and then you begin to think: "Kubernetes is the solution". And it is. But it is also, in many cases, a solution to a problem you created. Still... the justification for creating that problem, if you're not Google scale, are highly disputable.
And, deep down, there's another very fundamental issue IMO: many of those "let's have only one process in one container" solutions actually mean "we're totally unable to write portable software working on several configs, so let's start with a machine with zero libs and dependencies and install exactly the minimum deps needed to make our ultra-fragile piece of shit of a software kinda work. And because it's still going to be a brittle piece of shit, let's make sure we use heartbeats and try to shut it down and back up again once it'll invariably have memory leaked and/or whatnots".
Then you also gained the right to be sloppy in the software you write: not respecting it. Treating it as cattle to be slaughtered, so it can be shitty. But you've now added an insane layer of complexity.
How do you like your uninitialized var when a container launches but then silently doesn't work as expected? How do you like them logs in that case? Someone here has described the lack of instant failure on any uninitialized var as the "billion dollar mistake of the devops world".
Meanwhile look at some proper software like, say, the Linux kernel or a distro like Debian. Or compile Emacs or a browser from source and marvel at what's happening. Sure, there may be hiccups but it works. On many configs. On many different hardware. On many different architectures. These are robust software that don't need to be "pid 1 on a pristine filesystem" to work properly.
In a way this whole "let's have all our software each as pid 1 each on a pristine OS and filesystem" is an admission of a very deep and profound failure of our entire field.
I don't think it's something to be celebrated.
And don't get me started on security: you now have ultra-complicated LANs and VLANs, with near-impossible-to-monitor traffic, with shitloads of ports open everywhere, the most gigantic attack surface of them all, and heartbeats and whatnots constantly polluting the network, where nobody even knows anymore what's going on. Where the only actual security seems to rely on the firewall being up and correctly configured, which is incredibly complicated to do given the insane network complexity you added to your stack. "Oh wait, I have an idea, let's make configuring the firewall a service!" (and make sure not to forget to initialize one of the countless vars, or it'll all silently break and just not configure firewalling for anything).
Now, though, love is true love: even at home I'm running a hypervisor with VMs and OCI containers ; )
> Meanwhile look at some proper software like, say, the Linux kernel or a distro like Debian. Or compile Emacs or a browser from source and marvel at what's happening. Sure, there may be hiccups but it works. On many configs. On many different hardware. On many different architectures. These are robust software
Lol no. The build systems flake out if you look at them funny. The build requirements are whatever Joe in Nebraska happened to have installed on his machine that day (I mean sure there's a text file supposedly listing them, but it hasn't been accurate for 6 years). They list systems that they haven't actually supported for years, because no-one's actually testing them.
I hate containers as much as anyone, but the state of "native" unix software is even worse.
+1 for talking about attack surface. Every service is a potential gateway for bad people. Locking them all down is incredibly difficult to get right.
99.99% of startups and SMEs should not be writing microservices.
But "I wrote a commercial system that served thousands of users, it ran on a single process on a spare box out the back" doesn't look good on resumes.
I sense a lot of painful insights written in blood here.
I love that the only alternative is a "pile of shell scripts". Nobody has posted a legitimate alternative to the complexity of K8S or the simplicity of docker compose. Certainly feels like there's a gap in the market for an opinionated deployment solution that works locally and on the cloud, with less functionality than K8S and a bit more complexity than docker compose.
I am puzzled by the fact that no successful forks of Nomad and Consul have emerged since the licence change and acquisition of Hashicorp.
If you need a quick scheduler, orchestrator and services control plane without fully embracing containers, you might soon be out of luck.
Nomad was amazing at every step of my experiments on it, except one. Simply including a file from the Nomad control to the Nomad host is... impossible? I saw indications of how to tell the host to get it from a file host, and I saw people complaining that they had to do it through the file host, with the response being security (I have thoughts about this and so did the complainants).
I was rather baffled to an extent. I was just trying to push a configuration file that would be the primary difference between a couple otherwise samey apps.
Thumbs up for Nomad. We've been running it for about 3 years in prod now and it hasn't failed us a single time.
Docker Swarm is exactly what tried to fill that niche. It's basically an extension to Docker Compose that adds clustering support and overlay networks.
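For reference, a hedged sketch of what that looks like: the same Compose file format plus a deploy section and an overlay network, pushed out with `docker stack deploy` (service name and image are made up):

  version: "3.8"
  services:
    web:
      image: example/web:latest         # hypothetical image
      ports:
        - "80:8080"
      deploy:
        replicas: 3                     # spread across the swarm's nodes
        update_config:
          parallelism: 1                # rolling update, one task at a time
      networks:
        - appnet
  networks:
    appnet:
      driver: overlay                   # multi-host networking across the cluster

Rolled out with `docker swarm init` on the first node, `docker swarm join` on the others, and `docker stack deploy -c docker-compose.yml myapp`.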
This looks cool and +1 for the 37Signals and Basecamp folks. I need to verify that I'll be able to spin up GPU enabled containers, but I can't imagine why that wouldn't work...
I coined a term for this because I see it so often.
“People will always defend complexity, stating that the only alternative is shell scripts”.
I saw people defending docker this way, ansible this way and most recently systemd this way.
Now we’re on to kubernetes.
>and most recently systemd this way.
To be fair, most people attacking systemd say they want to return to shell scripts.
No, there are alternatives like runit and SMF that do not use shell scripts.
It's conveniently ignored by systemd supporters, and the conversation always revolves around the fact that we used to use shell scripts, despite the fact that there are sensible inits predating systemd that did not use shell languages.
At least I never saw anyone arguing that the only alternative to git was shell scripts.
Wait. Wouldn't that be a good idea?
While not opinionated, you can go with cloud-specific tools (e.g. ECS in AWS).
Sure, but those don't support local deployment, at least not in any sort of easy way.
This is basically exactly what we needed at the start up I worked at, with the added need of being able to host open source projects (airbyte, metabase) with a reasonable level of confidence.
We ended up migrating from Heroku to Kubernetes. I tried to take some of the learnings to build https://github.com/czhu12/canine
It basically wraps Kubernetes and tries to hide as much complexity from Kubernetes as possible, and only expose the good parts that will be enough for 95% of web application work loads.
Docker Swarm mode? I know it’s not as well maintained, but I think it’s exactly what you talk about here (forget K3s, etc). I believe smaller companies run it still and it’s perfect for personal projects. I myself run mostly docker compose + shell scripts though because I don’t really need zero-downtime deployments or redundancy/fault tolerance.
Capistrano, Ansible et al. have existed this whole time if you want to do that.
The real difference in approaches is between short lived environments that you redeploy from scratch all the time and long lived environments we nurse back to health with runbooks.
You can use lambda, kube, etc. or chef, puppet etc. but you end up at this same crossroad.
Just starting a process and keeping it alive for a long time is easy to get started with but eventually you have to pay the runbook tax. Instead you could pay the kubernetes tax or the nomad tax at the start instead of the 12am ansible tax later.
I hate to shill my own company, but I took the job because I believe in it.
You should check out DBOS and see if it meets your middle ground requirements.
Works locally and in the cloud, has all the things you’d need to build a reliable and stateful application.
Looking at your page, it looks like Lambdas/Functions but on your system, not Amazon/Microsoft/Google.
Every company I've ever seen try to do this has ended in crying after some part of the system doesn't fit neatly into the serverless box and it becomes painful to extract it from your system into "Run FastAPI in containers."
We run on bare metal in AWS, so you get access to all your other AWS services. We can also run on bare metal in whatever cloud you want.
Sure, but I'm still wrapped around your library, no? So if your "Process Kafka events" decorator in Python doesn't quite do what I need, I'm forced to grab the Kafka library, write my code, and then learn to build my own container, since I assume you were handling the build part. Finally, figure out which of the 17 ways to run containers on AWS (https://www.lastweekinaws.com/blog/the-17-ways-to-run-contai...) is proper for me, and away I go?
That's my SRE recommendation of "These serverless platforms are a trap: it's quick to get going, but you can quickly get locked into a bad place."
No, not at all. We run standard python, so we can build with any kafka library. Our decorator is just a subclass of the default decorator to add some kafka stuff, but you can use the generic decorator around whatever kafka library you want. We can build and run any arbitrary Python.
But yes, if you find there is something you can't do, you would have to build a container for it or deploy it to an instance however you want. Although I'd say that most likely we'd work with you to make whatever it is you want to do possible.
I'd also consider that an advantage. You aren't locked into the platform, you can expand it to do whatever you want. The whole point of serverless is to make most things easy, not all things. If you can get your POC working without doing anything, isn't that a great advantage to your business?
Let's be real, if you start with containers, it will be a lot harder to get started and then still hard to add whatever functionality you want. Containers don't really make anything easier; they just make things more consistent.
Nice, but I like my servers and find serverless difficult to debug.
That's the beauty of this system. You build it all locally, test it locally, debug it locally. Only then do you deploy to the cloud. And since you can build the whole thing with one file, it's really easy to reason about.
And if somehow you get a bug in production, you have the time travel debugger to replay exactly what the state of the cloud was at the time.
Great to hear you've improved serverless debugging. What if my endpoint wants to run ffmpeg and extract frames from video. How does that work on serverless?
That particular use case requires some pretty heavy binaries and isn't really suited to serverless. However, you could still use DBOS to manage chunking the work and managing to workflows to make sure every frame is only processed once. Then you could call out to some of the existing serverless offerings that do exactly what you suggest (extract frames from video).
Or you could launch an EC2 instance that is running ffmpeg and takes in videos and spits out frames, and then use DBOS to manage launching and closing down those instances as well as the workflows of getting the work done.
Looks interesting, but this is a bit worrying:
    ... build reliable AI agents with automatic retries and no limit on how long they can run for.
It's pretty easy to see how that could go badly wrong. ;)
(and yeah, obviously "don't deploy that stuff" is the solution)
---
That being said, is it all OSS? I can see some stuff here that seems to be, but it mostly seems to be the client side stuff?
Maybe that is worded poorly. :). It's supposed to mean there are no timeouts -- you can wait as long as you want between retries.
> That being said, is it all OSS?
The Transact library is open source and always will be. That is what you gets you the durability, statefulness, some observability, and local testing.
We also offer a hosted cloud product that adds in the reliability, scalability, more observability, and a time travel debugger.
Agreed, something simpler than Nomad as well hopefully.
Ansible and the podman Ansible modules
I'm giggling at the idea you'd need Kubernetes for a mere two servers. We don't run any application with less than two instances for redundancy.
We've just never seen the need for Kubernetes. We're not against it as much as the need to replace our working setup just never arrived. We run EC2 instances with a setup shell script under 50loc. We autoscale up to 40-50 web servers at peak load of a little over 100k concurrent users.
Different strokes for different folks but moreso if it ain't broke, don't fix it
> The inscrutable iptables rules?
You mean the list of calls right there in the shell script?
> Who will know about those undocumented sysctl edits you made on the VM?
You mean those calls to `sysctl` conveniently right there in the shell script?
> your app needs to programmatically spawn other containers
Or you could run a job queue and push tasks to it (gaining all the usual benefits of observability, concurrency limits, etc), instead of spawning ad-hoc containers and hoping for the best.
"We don't know how to learn/read code we are unfamiliar with... Nor do we know how to grok and learn things quickly. Heck, we don't know what grok means "
Who do you quote?
This quote mostly applies to people who don't want to spend the time learning existing tooling, making improvements and instead create a slightly different wheel but with different problems. It also applies to people trying to apply "google" solutions to a non-google company.
Kubernetes and all the tooling in the Cloud Native Computing Foundation (CNCF) were created to have people adopt the cloud and build communities, which then created job roles that facilitated hiring people to maintain cloud presences, which in turn fund cloud providers.
This is the same playbook that Microsoft ran at universities. They would give the entire suite of tools in the MSDN library away, and then roughly four years later collect when another seat needed to be purchased for a new hire who had only used Microsoft tools for the last four years.
> You mean the list of calls right there in the shell script?
This is about the worst encoding for network rules I can think of.
Worse than yaml generated by string interpolation?
You'd have to give me an example. YAML is certainly better at representing tables of data than a shell script is.
Not entirely a fair comparison, but here. Can you honestly tell me you'd take the yaml over the shell script?
(If you've never had to use Helm, I envy you. And if you have, I genuinely look forward to you showing me an easier way to do this, since it would make my life easier.)
-------------------------------------
Shell script:

    iptables -A INPUT -p tcp --dport 8080 -j ACCEPT

Multiple ports:

    for port in 80 443 8080; do
      iptables -A INPUT -p tcp --dport "$port" -j ACCEPT
    done
Because if one of those iptables calls above fails, you're in an inconsistent state.
Also if I want to swap from iptables to something like Istio then it's basically the same YAML.
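For comparison, a hedged sketch of the declarative counterpart (a NetworkPolicy; the label is made up): the whole desired state goes in as one object and gets reconciled, instead of being the end result of a sequence of imperative calls.

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: allow-web
  spec:
    podSelector:
      matchLabels:
        app: web                        # hypothetical label on the pods being protected
    policyTypes:
      - Ingress
    ingress:
      - ports:
          - protocol: TCP
            port: 80
          - protocol: TCP
            port: 443
          - protocol: TCP
            port: 8080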
You obviously didn't use k8s (or k3s or any other implementation) a lot, because it also messed up iptables randomly sometimes due to bugs, version mismatches, etc.
Have been running Kubernetes for the last decade across multiple implementations.
Never had an iptables issue, and these days eBPF is the standard.
Highly amateurish take if you call shell spaghetti a Kubernetes, especially if we compare the complexity of both...
You know what would be even worse? Introducing Kubernetes for your non-Google/Netflix/WhateverPlanetaryScale app instead of just writing a few scripts...
Hell, I’m a fan of k8s even for sub-planetary scale (assuming that scale is ultimately a goal of your business, it’s nice to build for success). But I agree that saying “well, it’s either k8s or you will build k8s yourself” is just ignorant. There are a lot of options between the two poles that can be both cheap and easy and offload the ugly bits of server management for the right price and complexity that your business needs.
Both this piece and the piece it’s imitating seem to have 2 central implicit axioms that in my opinion don’t hold. The first, that the constraints of the home grown systems are all cost and the second that the flexibility of the general purpose solution is all benefit.
You generally speaking do not want a code generation or service orchestration system that will support the entire universe of choices. You want your programs and idioms to follow similar patterns across your codebase and you want your services architected and deployed the same way. You want to know when outliers get introduced and similarly you want to make it costly enough to require introspection on if the value of the benefit out ways the cost of oddity.
The compiler one read to me like a reminder not to ignore the lessons of compiler design. The premise being that even though you have a small-scope project compared to a "real" compiler, you will evolve towards analogues of those design ideas. The databases and k8s pieces are more like: don't even try a small-scope project, because you'll want the same features eventually.
I suppose I can see how people are taking this piece that way, but I don't see it like that. It is snarky and ranty, which makes it hard to express or perceive nuance. They do explicitly acknowledge that "a single server can go a long way" though.
I think the real point, better expressed, is that if you find yourself building a system with like a third of the features of K8s but composed of hand-rolled scripts and random third-party tools kludged together, maybe you should have just bit the bullet and moved to K8s instead.
You probably shouldn't start your project on it unless you have a dedicated DevOps department maintaining your cluster for you, but don't be afraid to move to it if your needs start getting more complex.
Author here. Yes there were many times while writing this that I wanted to insert nuance, but couldn't without breaking the format too much.
I appreciate the wide range of interpretations! I don't necessarily think you should always move to k8s in those situations. I just want people to not dismiss k8s outright for being overly-complex without thinking too hard about it. "You will evolve towards analogues of those design ideas" is a good way to put it.
That's also how I interpreted the original post about compilers. The reader is stubbornly refusing to acknowledge that compilers have irreducible complexity. They think they can build something simpler, but end up rediscovering the same path that led to the creation of compilers in the first place.
I had a hard time putting my finger on what was so annoying about the follow-ons to the compiler post, and this nails it for me. Thanks!
> You generally speaking do not want a code generation or service orchestration system that will support the entire universe of choices.
This. I will gladly give up the universe of choices for a one size fits most solution that just works. I will bend my use cases to fit the mold if it means not having to write k8s configuration in a twisty maze of managed services.
I like to say, you can make anything look good by considering only the benefits and anything look bad by considering only the costs.
It's a fun philosophy for online debates, but an expensive one to use in real engineering.
outweighs*
Only offering the correction because I was confused at what you meant by “out ways” until I figured it out.
The elephant in the room: People who have gotten over the K8s learning curve almost all tell you it isn't actually that bad.
Most people who have not attempted the learning curve, or have just dipped their toe in, will tell you they are scared of the complexity.
An anecdotal datapoint: My standard lecture teaching developers how to interact with K8s takes almost precisely 30 minutes to have them writing Helm charts for themselves. I have given it a whole bunch of times and it seems to do the job.
> My standard lecture teaching developers how to interact with K8s takes almost precisely 30 minutes to have them writing Helm charts for themselves
And I can teach someone to write "hello world" in 10 languages in 30 minutes, but that doesn't mean they're qualified to develop or fix production software.
One has to start from somewhere I guess. I doubt anyone would learn K8s thoroughly before getting any such job. I tried once and the whole thing bored me by the fourth video.
I personally know many k8s experts that vehemently recommend against using it unless you have no other option.
Much like Javascript, the problem isn't Kubernetes, its the zillions of half-tested open-source libraries that promise to make things easier but actually completely obfuscate what the system is doing while injecting fantastic amounts of bugs.
Dear Amazon Elastic Beanstalk, Google App Engine, Heroku, Digital Ocean App Platform, and friends,
Thank you for building "a kubernetes" for me so I don't have to muck with that nonsense, or have to hire people that do.
I don't know what that other guy is talking about.
Most of the complaints in this fun post are just bad practice, and really nothing to do with “making a Kubernetes”.
Sans bad engineering practices, if you built a system that did the same things as kubernetes I would have no problem with it.
In reality I don’t want everybody to use k8s. I want people finding different solutions to solve similar problems. Homogenized ecosystems create walls that block progress.
One of the big things that is overlooked when people move to k8s, and why things get better when they do, is that k8s made a set of rules that forced service owners to fix all of their bad practices.
Most deployment systems would work fine if the same work to remove bad practices from their stack occurred.
K8s is the hot thing today, but mark my words, it will be replaced with something far simpler and much nicer to integrate with. And that will come from some engineer "creating a kubernetes".
Don’t even get me started on how crappy the culture of “you are doing something hard that I think is already a solved problem” is. This goes for compilers and databases too. None of these are hard, and neither is k8s, and any good engineer tasked with making one would be able to do so.
I welcome a k8s replacement! Just as there are better compilers and better databases than we had 10-20 years ago, we need better deployment methods. I just believe those better methods will come from really understanding the compilers and databases that came before, rather than dismissing them out of hand.
Can you give examples of the "bad practices" k8s forces you to fix?
To name a few:
K8s really kills the urge to say “oh well I guess we can just drop that file onto the server as part of startup rather than use a db/config system/etc.” No more “oh shit the VM died and we lost the file that was supposed to be static except for that thing John wrote to update it only if X happened, but now X happens every day and the file is gone”.. or worse: it’s in git but now you have 3 different versions that have all drifted due to the John code change. (See the ConfigMap sketch after this list.)
K8s makes you use containers, which makes you not run things on your machine, which makes you better at CI, which.. (the list goes on, containers are industry standard for a lot of reasons). In general the 12 Factor App is a great set of ideas, and k8s lets you do them (this is not exclusive, though). Containers alone are a huge game changer compared to “cp a JAR to the server and restart it”
K8s makes it really really really easy to just split off that one weird cronjob part of the codebase that Mike needed and man, it would be really nice to just use the same code and dependencies rather than boilerplating a whole new app and deploy, CI, configs, and yamls to make that run. See points about containerization.
K8s doesn’t assume that your business will always be a website/mobile app. See the whole “edge computing” trend.
I do want to stress that k8s is not the only thing in the world that can do these or promote good development practices, and I do think it’s overkill to say that it MAKES you do things well - a foolhardy person can mess any well-intentioned system up.
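To make the first point concrete, the k8s-native home for "that file on the server" is a ConfigMap, which lives in the API (and in git) rather than on any one VM. The name and contents below are purely illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-settings        # hypothetical name
data:
  settings.json: |
    {"x_enabled": false}

Pods mount this read-only (or consume it as env vars), so there is no per-VM copy for John's code to quietly mutate and lose.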
So you're saying companies should move to k8s and then immediately move to bash scripts
No. I am saying that companies should have their engineers understand why k8s works and make those reasons an engineering practice.
As it is today, the pattern is: spend a ton of money moving to k8s (mostly to costly managed solutions), fixing all the bad engineering patterns in the process because k8s forces you to. Then an engineer saves the company money by moving back to a more home-grown solution, one that fits the company's needs, something that would only have been possible once the engineering practices were fixed.
Kubernetes’ biggest competitor isn’t a pile of bash scripts and Docker running on a server; it’s something like ECS, which comes with a lot of the benefits but a hell of a lot less complexity.
FWIW I’ve been using ECS at my current work (previously K8s) and to me it feels just flat worse:
- only some of the features
- none of the community
- all of the complexity but none of the upsides.
It was genuinely a bit shocking that it was considered a serious product seeing as how chaotic it was.
Can you elaborate on some of the issues you faced? I was considering deploying to ECS fargate as we are all-in on AWS.
Any kind of git-ops style deployment was out.
ECS merges “AWS config” and “app/deployment config” together, so it was difficult to separate what should go in TF from what is runtime app configuration. In comparison this is basically trivial ootb with K8s.
I personally found a lot of the moving parts and names needlessly confusing. Tasks e.g. were not your equivalent to “Deployment”.
Want to just deploy something like Prometheus Agent? Well, too bad, the networking doesn’t work the same, so here’s some overly complicated guide where you have to deploy some extra stuff which will no doubt not work right the first dozen times you try. Admittedly, Prom can be a right pain to manage, but the fact that ECS makes you do _extra_ work on top of an already fiddly piece of software left a bad taste in my mouth.
I think ECS gets a lot of airtime because of Fargate, but you can use Fargate on K8s these days, or, if you can afford the small increase in initial setup complexity, you can just use Fargate’s less-expensive, less-restrictive, better sibling: Karpenter on Spot instances.
I think the initial setup complexity is less with ECS personally, and the ongoing maintenance cost is significantly worse on K8s when you run anything serious which leads to people taking shortcuts.
Every time you have a cluster upgrade with K8s there’s a risk something breaks. For any product at scale, you’re likely to be using things like Istio and Metricbeat. You have a whole level of complexity in adding auth to your cluster on top of your existing SSO for the cloud provider. We’ve had to spend quite some time changing the plugin for AKS/EntraID recently which has also meant a change in workflow for users. Upgrading clusters can break things since plenty of stuff (less these days) lives in beta namespaces, and there’s no LTS.
Again, it’s less bad than it was, but many core things live(d) in plugins for clusters which have a risk of breaking when you upgrade cluster.
My view was that the initial startup cost for ECS is lower and once it’s done, that’s kind of it - it’s stable and doesn’t change. With K8s it’s much more a moving target, and it requires someone to actively be maintaining it, which takes time.
In a small team I don’t think that cost and complexity is worth it - there are so many more concepts that you have to learn even on top of the cloud specific ones. It requires a real level of expertise so if you try and adopt it without someone who’s already worked with it for some time you can end up in a real mess
If your workloads are fairly static, ECS is fine. Bringing up new containers and nodes takes ages with very little feedback as to what's going on. It's very frustrating when iterating on workloads.
Also fargate is very expensive and inflexible. If you fit the narrow particular use case it's quicker for bringing up workloads, but you pay extra for it.
Can confirm. I've used ECS with Fargate successfully at multiple companies. Some eventually outgrew it. Some failed first. Some continue to use ECS happily.
Regardless of the outcome, it always felt more important to keep things simple and focus on product and business needs.
It sometimes blows my mind how reductionist and simplistic a world-view it's possible to have and yet still attain some degree of success.
Shovels and mechanical excavators both exist and have a place on a building site. If you talk to a workman he may well tell you he has regular hammer with him at all times but will use a sledgehammer and even rent a pile driver on occasion if the task demands it.
And yet somehow we as software engineers are supposed to restrict ourselves to The One True Tool[tm] (which varies based on time and fashion) and use it for everything. It's such an obviously dumb approach that even people who do basic manual labour realise its shortcomings. Sometimes they will use a forklift truck to move things, sometimes an HGV, sometimes they will put things in a wheelbarrow and sometimes they will carry them by hand. But us? No. Sophisticated engineers as we are, there is One Way, and it doesn't matter if you're a 3 person startup or you're Google, if you deploy once per year to a single big server or multiple times per day to a farm of thousands of hosts, you're supposed to do it that one way no matter what.
The real rule is this: Use your judgement.
You're supposed to be smart. You're supposed to be good. Be good. Figure out what's actually going on and how best to solve the problems in your situation. Don't rely on everyone else to tell you what to do or blindly apply "best practises" invented by someone who doesn't know a thing about what you're trying to do. Yes consider the experiences of others and learn from their mistakes where possible, but use your own goddamn brain and skill. That's why they pay you the big bucks.
I think one thing that is under appreciated with kubernetes is how massive the package library is. It becomes trivial to stand up basically every open source project with a single command via helm. It gets a lot of hate but for medium sized deployments, it’s fantastic.
Before helm, just trying to run third party containers on bare metal resulted in constant downtime when the process would just hang for no reason, and an engineer would have to SSH in and manually restart the instance.
We used this at a previous startup to host Metabase, Sentry and Airbyte seamlessly, on our own cluster, which let us break out of the constant price increases we faced for the hosted versions of these products.
Shameless plug: I’ve been building https://github.com/czhu12/canine to try to make Kubernetes easier to use for solo developers. Would love any feedback from anyone looking to deploy something new to K8s!
Right, but this isn't a post about why K8s is good, it's a post about why K8s is effectively mandatory, and it isn't, which is why the post rankles some people.
Yeah I mostly agree. I'd even add that k8s YAMLs are not trivial to maintain, especially if you need to have them produced by a templating engine.
They become trivial once you stop templating them with a text templating engine.
They are serialized JSON objects; the YAML is there just because raw JSON is not user friendly when you need something done quick and dirty or need to include comments.
Proper templating should never use text templating on manifests.
> Inevitably, you find a reason to expand to a second server.
The author has some good points, but not every project needs multiple servers for the same reasons as a typical Kubernetes setup. In many scenarios those servers are dedicated to separate tasks.
For example, you can have a separate server for a redundant copy of your application layer, one server for load balancing and caching, one or more servers for the database, another for backups, and none of these servers requires anything more than separate Docker Compose configs for each server.
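For instance, the load-balancing/caching box in that setup might be nothing more than a Compose file along these lines (a sketch; images, ports, and file paths are illustrative):

services:
  proxy:
    image: nginx:stable
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro   # upstreams point at the app servers
    restart: unless-stopped
  cache:
    image: redis:7
    restart: unless-stopped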
I'm not saying that Kubernetes is a bad idea, even for the hypothetical setup above, but you don't necessarily need advanced service discovery tools for every workload.
I don’t think scale is the only consideration for using Kubernetes. The ops overhead in managing traditional infrastructure, especially if you’re a large enterprise, drops massively if you really buy into cloud native. Kubernetes converges application orchestration, job scheduling, scaling, monitoring/observability, networking, load balancing, certificate management, storage management, compute provisioning - and more. In a typical enterprise, doing all this requires multiple teams. Changes are request driven and take forever. Operating systems need to be patched. This all happens after hours and costs time and money. When properly implemented and backed by the right level of stakeholder, I’ve seen orgs move to business day maintenance, while gaining the confidence to release during peak times. It’s not just about scale, it’s about converging traditional infra practices into a single, declarative and eventually consistent platform that handles it all for you.
> Spawning containers, of course, requires you to mount the Docker socket in your web app, which is wildly insecure
Dear friend, you are not a systems programmer
To expand on this, the author is describing the so-called "Docker-out-of-Docker (DooD) pattern", i.e. exposing Docker's Unix socket into the container. Since Docker was designed to work remotely (CLI on another machine than DOCKER_HOST), this works fine, but essentially negates all isolation.
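Concretely, the DooD pattern the author is warning about is just a bind mount of the host's Docker socket, e.g. in a Compose file (the service name and image tag are illustrative):

services:
  ci-runner:
    image: docker:cli
    volumes:
      # Hands the container full control of the host's Docker daemon,
      # which is effectively root on the host.
      - /var/run/docker.sock:/var/run/docker.sock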
For many years now, all major container runtimes have supported nesting. Some make it easy (podman and runc just work), some hard (systemd-nspawn requires setting many flags to work nested). This is called "Docker-in-Docker (DinD)".
FreeBSD has supported nesting of jails natively since version 8.0, which dates back to 2009.
I prefer FreeBSD to K8s.
I think we need to distinguish between two cases:
For a hobby project, using Docker Compose or Podman combined with systemd and some shell scripts is perfectly fine. You’re the only one responsible, and you have the freedom to choose whatever works best for you.
However, in a company setting, things are quite different. Your boss may assign you new tasks that could require writing a lot of custom scripts. This can become a problem for other team members and contractors, as such scripts are often undocumented and don’t follow industry standards.
In this case, I would recommend using Kubernetes (k8s), but only if the company has a dedicated Kubernetes team with an established on-call rotation. Alternatively, I suggest leveraging a managed cloud service like ECS Fargate to handle container orchestration.
There’s also strong competition in the "Container as a Service" (CaaS) space, with smaller and more cost-effective options available if you prefer to avoid the major cloud providers. Overall, these CaaS solutions require far less maintenance compared to managing your own cluster.
> dedicated Kubernetes team with an established on-call rotation.
Using EKS or GKE is basically this. K8s is much nicer than ECS in terms of developing and packaging your own apps.
How would you feel if bash scripts were replaced with Ansible playbooks?
At a previous job at a teeny startup, each instance of the environment is a docker-compose instance on a VPS. It works great, but they’re starting to get a bunch of new clients, and some of them need fully independent instances of the app.
Deployment gets harder with every instance because it’s just a pile of bash scripts on each server. My old coworkers have to run a build for each instance for every deploy.
None of us had used ansible, which seems like it could be a solution. It would be a new headache to learn, but it seems like less of a headache than kubernetes!
Ansible is better than Bash if your goals include:
* Automating repetitive tasks across many servers.
* Ensuring idempotent configurations (e.g., setting up web servers, installing packages consistently).
* Managing infrastructure as code for better version control and collaboration.
* Orchestrating complex workflows that involve multiple steps or dependencies.
However, Ansible is not a container orchestrator.
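As a rough sketch of the idempotent-configuration case (host group and package names are illustrative):

- hosts: webservers
  become: true
  tasks:
    - name: Ensure nginx is installed
      ansible.builtin.apt:
        name: nginx
        state: present
    - name: Ensure nginx is running and enabled
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

Running it twice changes nothing the second time, which is the property plain Bash makes you earn by hand.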
Kubernetes (K8s) provides capabilities that Ansible or Docker-Compose cannot match. While Docker-Compose only supports a basic subset, Kubernetes offers:
* Advanced orchestration features, such as rolling updates, health checks, scaling, and self-healing.
* Automatic maintenance of the desired state for running workloads.
* Restarting failed containers, rescheduling pods, and replacing unhealthy nodes.
* Horizontal pod auto-scaling based on metrics (e.g., CPU, memory, or custom metrics).
* Continuous monitoring and reconciliation of the actual state with the desired state.
* Immediate application of changes to bring resources to the desired configuration.
* Service discovery via DNS and automatic load balancing across pods.
* Native support for Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) for storage management.
* Abstraction of storage providers, supporting local, cloud, and network storage.
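To make one of those bullets concrete, horizontal pod autoscaling is a short manifest of its own; the target Deployment name and thresholds here are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web               # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas above ~70% average CPU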
If you need these features but are concerned about the complexity of Kubernetes, consider using a managed Kubernetes service like GKE or EKS to simplify deployment and management. Alternatively, and this is my preferred option, combining Terraform with a Container-as-a-Service (CaaS) platform allows the provider to handle most of the operational complexity for you.
Ansible ultimately runs scripts, in parallel, in a defined order across machines. It can help a lot, but it's subject to a lot of the same state bitrot issues as a pile of shell scripts.
Up until a few thousand instances, a well designed setup should be a part time job for a couple of people.
To that scale you can write a custom orchestrator that is likely to be smaller and simpler than the equivalent K8S setup. Been there, done that.
Yes, but people just cannot comprehend the complexity of it. Even my academic professor for my FYP back when I was an undergrad has now reverted to Docker Compose, citing that the integration is so convoluted that developing for it is very difficult. That's why I'm aiming to cut down the complexity of Kubernetes with a low-friction, turnkey solution, but I guess the angel investors in Hong Kong aren't buying into it yet. I'm still aiming to try again in 2 years when I can at least get a complete MVP (I don't like to present imperfect stuff; either you just have the idea or you give me the full product, not half baked shit).
Like, okay, if that's how you see it, but what's with the tone and content?
The tone's vapidity is only comparable to the content's.
This reads like mocking the target audience rather than showing them how you can help.
A write up that took said "pile of shell scripts that do not work" and showed how to "make it work" with your technology of choice would have been more interesting than whatever this is.
One can build a better container orchestration than kubernetes; things don't need to be that complex.
I was using some Ansible playbook scripts to deploy a web app to production. One day the scripts stopped working because of a boring error about a Python version mismatch.
I rewrote all the deployment scripts in bash (took less than an hour) and never had a problem since.
Moral: it's hard to find the right tool for the job.
For my own websites, I host everything on a single $20/month Hetzner instance using https://dokploy.com/ and I'm never going back.
Why do I feel this is not so simple as the compiler scenario?
I've seen a lot of "piles of YAML", even contributed to some. There were some good projects that didn't end up in disaster, but to me the same could be said for the shell.
I was very scared of K8s for a long time then we started using it and it's actually great. Much less complex than its reputation suggests.
I had the exact opposite experience. I had a cloud run app in gcp and experimented with moving it to k8s and I was astonished with the amount of new complexity I had to manage
I thought k8s might be a solution so I decided to learn through doing. It quickly became obvious that we didn't need 90% of its capabilities but more important it'd put undue load/training on the rest of the team. It would be a lot more sensible to write custom orchestration using the docker API - that was straightforward.
Experimenting with k8s was very much worthwhile. It's an amazing thing and was in many ways inspirational. But using it would have been swimming against the tide so to speak. So sure I built a mini-k8s-lite, it's better for us, it fits better than wrapping docker compose.
My only doubt is whether I should have used podman instead but at the time podman seemed to be in an odd place (3-4 years ago now). Though it'd be quite easy to switch now it hardly seems worthwhile.
You did a no-SQL, you did a serverless, you did a micro-services. This makes it abundantly clear you do not understand the nature of your architectural patterns and the multiplicity of your offenses.
I wish the world hadn't consolidated around Kubernetes. Rancher was fantastic. Did what 95% of us need, and dead simple to add and manage services.
Did you find Rancher v2 (which uses Kubernetes instead of their own Cattle system) is worse?
Started with a large shell script; the next iteration was written in Go and less specific. I still think that for some things, k8s is just too much.
I am 100% sure that the author of this post has never "built a kubernetes", holds at least one kubernetes cert, and maybe even works for a company that sells kubernetes products and services. Never been more certain of anything in my life. You could go point by point but its just so tiring arguing with these people. Like, the whole "who will maintain these scripts when you go on vacation" my brother in christ have you seen the kubernetes setups some of these people invent? They are not easier to be read into, this much is absolute. At least a shell script has a chance of encoding all of its behavior in the one file, versus putting a third of its behavior in helm variables, a third in poorly-named and documented YAML keys, and a third in some "manifest orchestrator reconciler service deployment system" that's six major versions behind an open source project that no one knows who maintains anymore because their critical developer was a Belarusian 10x'er who got mad about a code of conduct that asked him to stop mispronouning contributors.
Swarm Mode ftw!
For the uninitiated: how does k8s handle OS upgrades? If development moves to the next version of Debian, because it should eventually, are upgrades, for example, 2x harder vs docker-compose? 2x easier? About the same? Is it even the right question to ask?
It doesn't. The usual approach is to create new nodes with the updated OS, migrate all workloads over and then throw away the old ones
Are you talking about upgrades of the host OS or the base of the image? I think you are talking about the latter. Others covered updating the host.
Upgrades of the Docker image are done by pushing a new image, updating the Deployment to use it, and applying the change. Kubernetes will start new containers for the new image and, once they are running, kill off the old containers. There should be no interruption. It isn't any different from a normal deploy.
Your cluster consists of multiple machines ('nodes'). Upgrading is as simple as adding a new, upgraded node, then evicting everything from one of the existing nodes, then take it down. Repeat until every node is replaced.
Downtime is the same as with a deployment, so if you run at least 2 copies of everything there should be no downtime.
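If you want that "at least 2 copies" guarantee to hold while nodes are being drained, a PodDisruptionBudget is the piece that enforces it (the name and label selector are illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 1            # never drain below one running copy
  selector:
    matchLabels:
      app: web               # hypothetical label on the workload's pods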
As for updating the images of your containers, you build them again with the newer base image, then deploy.
>Tired, you parameterize your deploy script and configure firewall rules, distracted from the crucial features you should be working on and shipping.
Where's your Sysop?
i'm at this crossroads right now. somebody talk me out of deploying a dagster etl on azure kubernetes service rather than deploying all of the pieces onto azure container apps with my own bespoke scripts / config
writing this out helped me re-validate what i need to do
what did you decide to do?
Dear friend, you have made a slippery slope argument.
Yes, because the whole situation is a slippery slope (only upwards). In the initial state, k8s is obviously overkill; in the end state, k8s is obviously adequate.
The problem is choosing the point of transition, and allocating resources for said transition. Sometimes it's easier to allocate a small chunk to update your bespoke script right now instead of sinking more to a proper migration. It's a typical dilemma of taking debt vs paying upfront.
(BTW the same dilemma exists with running in the cloud vs running on bare metal; the only time when a migration from the cloud is easy is the beginning, when it does not make financial sense.)
Odds are you have 100 DAUs and your "end state" is an "our incredible journey" blog post. I understand that people want to pad their resume with buzzwords on the way, but I don't accept making a virtue out of it.
Exactly. Don't start with k8s unless you're already comfortable troubleshooting it at 3am half asleep. Start with one of the things you're comfortable with. Among these things, apply YAGNI liberally, only making certain that you're not going to paint yourself into a corner.
Then, if and when you've become so large that the previous thing has become painful and k8s started looking like a really right tool for the job, allocate time and resources, plan a transition, implement it smoothly. If you have grown to such a size, you must have had a few such transitions in your architecture and infrastructure already, and learned to handle them.
Dear friend, you should first look into using Nomad or Kamal deploy instead of K8S
As for Kamal, I shudder to think of the hubris required to say "pfft, haproxy is for lamez, how hard can it be to make my own lb?!" https://github.com/basecamp/kamal-proxy
Why add complexity when many services don't even need horizontal scaling? Servers are powerful enough that, if you're not stupid enough to write horrible code, they're fine for millions of requests a day without much work.
Infra person here, this is such the wrong take.
> Do I really need a separate solution for deployment, rolling updates, rollbacks, and scaling.
Yes it's called an ASG.
> Inevitably, you find a reason to expand to a second server.
ALB, target group, ASG, done.
> Who will know about those undocumented sysctl edits you made on the VM
You put all your modifications and CIS benchmark tweaks in a repo and build a new AMI off it every night. Patching is switching the AMI and triggering a rolling update.
> The inscrutable iptables rules
These are security groups, lord have mercy on anyone who thinks k8s network policy is simple.
> One of your team members suggests connecting the servers with Tailscale: an overlay network with service discovery
Nobody does this, you're in AWS. If you use separate VPCs you can peer them but generally it's just editing some security groups and target groups. k8s is forced into needing to overlay on an already virtual network because they need to address pods rather than VMs, when VMs are your unit you're just doing basic networking.
You reach for k8s when you need control loops beyond what ASGs can provide. The magic of k8s is "continuous terraform," you will know when you need it and you likely never will. If your infra moves from one static config to another static config on deploy (by far the usual case) then no k8s is fine.
You’d be swapping an open-source vendor independent API for a cloud-specific vendor locked one. And paying more for the “privilege”
I mean that's the sales pitch but it's really not vendor independent in practice. We have a mountain of EKS specific code. It would be easier for me to migrate our apps that use ASGs than to migrate our charts. AWS's API isn't actually all that special, they're just modeling the datacenter in code. Anywhere you migrate to will have all the same primitives because the underlying infrastructure is basically the same.
EKS isn't any cheaper either from experience and in hindsight of course it isn't, it's backed by the same things you would deploy without EKS just with another layer. The dream of gains from "OS overhead" and efficient tight-packed pod scheduling doesn't match the reality that our VMs are right-sized for our workloads already and aren't sitting idle. You can't squeeze that much water from the stone even in theory and in practice k8s comes with its own overhead.
Another reason to use k8s is the original:
When you deploy on physical hardware, not VMs, or have to otherwise optimize maximum utilization out of gear you have.
Especially since sometimes Cloud just means hemorrhaging money in comparison to something else, especially with ASGs
We found that the savings from switching from VMs in ASGs to k8s never really materialized. OS overhead wasn't actually that much and once you're requesting cpu / memory you can't fit as many pods per host as you think.
Plus you're competing with hypervisors for maxing out hardware which is rock solid stable.
My experience was quite the opposite, but it depends very much on the workload.
That is, I wasn't saying the competition is between AWS ASGs and k8s running on EC2; it's about already having a certain amount of capacity that you want to max out in flexible ways.
You don't need to use an overlay network. Calico works just fine without an overlay.
I'm sure the American Sewing Guild is fantastic, but how do they help here?
Even without needing to spawn additional Docker containers, I think people are more afraid of Kubernetes than is warranted. If you use a managed K8s service like Azure, AWS, GCP, and tons of others provide, it's... Pretty simple and pretty bulletproof, assuming you're doing simple stuff with it (i.e. running a standard web app).
The docs for K8s are incredibly bad for solo devs or small teams, and introduce you to a lot of unnecessary complexity upfront that you just don't need: the docs seem to be written with megacorps in mind who have teams managing large infrastructure migrations with existing, complex needs. To get started on a new project with K8s, you just need a pretty simple set of YAML files:
1. An "ingress" YAML file that defines the ports you listen to for the outside world (typically port 80), and how you listen to them. Using Helm, the K8s package manager, you can install a simple default Nginx-based ingress with minimal config. You probably were going to put Nginx/Caddy/etc in front of your app anyway, so why not do it this way?
2. A "service" YAML file that allocates some internal port mapping used for your web application (i.e. what port do you listen on within the cluster's network, and what port should that map to for the container).
3. A "deployment" YAML file that sets up some number of containers inside your service.
And that's it. As necessary you can start opting into more features: you can add health checks to your deployment file so that K8s auto-restarts your containers when they die, and you can add deployment strategies there as well, such as rolling deployments and limits on how many new containers can be started before old ones are killed during the deploy. You can add resource requests and limits, e.g. make sure my app has at least 500MB RAM, and kill+restart it if it crosses 1GB.
But it's actually really simple to get started! I think it compares pretty well even to the modern Heroku replacements like Fly.io... It's just that the docs are bad and the reputation is that it's complicated, and a large part of that reputation comes from existing teams who try to do a large migration and who have very complex needs that have evolved over time. K8s generally is flexible enough to support even those complex needs, but it's gonna be complex if you have them. For new projects, it really isn't.
Part of the reason other platforms are viewed as simpler IMO is just that they lack so many features that teams with complex needs don't bother trying to migrate (and thus never complain about how complicated it is to do complicated things with them).
You can have Claude or ChatGPT walk you through a lot of this stuff though, and thereby get an easier introduction than having to pore through the pretty corporate official docs. And since K8s supports both YAML and JSON, in my opinion it's worth just generating JSON using whatever programming language you already use for your app; it'll help reduce some of the verbosity of YAML.
What you’re saying is that starting a service in kubernetes as a dev is ok, what other people say is that operating a k8s cluster is hard.
Unless I’m mistaken the managed kubernetes instances were introduced by cloud vendors because regular people couldn’t run kubernetes clusters reliably, and when they went wrong they couldn’t fix them.
Where I am, since cloud is not an option ( large mega corp with regulatory constraints ) they’ve decided to run their own k8s cluster. It doesn’t work well, it’s hard to debug, and they don’t know why it doesn’t work.
Now if you have the right people or can have your cluster managed for you, I guess it’s a different story.
Most megacorps use AWS. It's regrettable that your company can't, but that's pretty atypical. Using AWS Kubernetes is easy and simple.
Not sure why you think this is just "as a dev" rather than operating in production — K8s is much more battle-hardened than someone's random shell scripts.
Personally, I've run K8s clusters for a Very Large tech megacorp (not using managed clusters; we ran the clusters ourselves). It was honestly pretty easy, but we were very experienced infra engineers, and I wouldn't recommend doing it for startups or new projects. However, most startups and new projects will be running in the cloud, and you might as well use managed K8s: it's simple.
> Most megacorps use AWS. It's regrettable that your company can't, but that's pretty atypical.
Even then, it seems like you can run EKS yourself:
"EKS Anywhere is free, open source software that you can download, install on your existing hardware, and run in your own data centers."
(Never done it myself, no idea if it's a good option)
Now compare cloud bills.
k8s is the API. Forget the implementation, it's really not that important.
Folks that get tied up in the "complexity" argument are forever missing the point.
The thing that the k8s api does is force you to do good practices, that is it.
Dear Friend,
This fascination with this new garbage-collected language from a Santa Clara vendor is perplexing. You’ve built yourself a COBOL system by another name.
/s
I love the “untested” criticism in a lot of these use-k8s screeds, and also the suggestion that they’re hanging together because of one guy. The implicit criticism is that doing your own engineering is bad, really, you should follow the crowd.
Here’s a counterpoint.
Sometimes just writing YAML is enough. Sometimes it's not: there are times when managed k8s is just not on the table, e.g. because of compliance or business issues. Then you have to think about self-managed k8s. That's rather hard to do well. And often, you don't need all of that complexity.
Yet — sometimes availability and accountability reasons mean that you need to have a really deep understanding of your stack.
And in those cases, having the engineering capability to orchestrate isolated workloads, move them around, resize them, monitor them, etc is imperative — and engineering capability means understanding the code, fixing bugs, improving the system. Not just writing YAML.
It’s shockingly inexpensive to get this started with a two-pizza team that understands Linux well. You do need a couple really good, experienced engineers to start this off though. Onboarding newcomers is relatively easy — there’s plenty of mid-career candidates and you’ll find talent at many LUGs.
But yes, a lot of orgs won’t want to commit to this because they don’t want that engineering capability. But a few do - and having that capability really pays off in the ownership the team can take for the platform.
For the orgs that do invest in the engineering capability, the benefit isn’t just a well-running platform, it’s having access to a team of engineers who feel they can deal with anything the business throws at them. And really, creating that high-performing trusted team is the end-goal, it really pays off for all sorts of things. Especially when you start cross-pollinating your other teams.
There are, maybe, a dozen companies in the world with a large enough physical footprint where Kubernetes might make sense. Everyone else is either engaged in resume-driven development, or has gone down some profoundly wrong path with their application architecture to where it is somehow the lesser evil.
I used to feel the same way, but have come around. I think it's great for small companies for a few reasons. I can spin up effectively identical dev/ci/stg/prod clusters for a new project in an hour for a medium sized project, with CD in addition to everything GP mentioned.
I basically don't have to think about ops anymore until something exotic comes up, it's nice. I agree that it feels clunky, and it was annoying to learn, but once you have something working it's a huge time saver. The ability to scale without drastically changing the system is a bonus.
> I can spin up effectively identical dev/ci/stg/prod clusters for a new project in an hour for a medium sized project, with CD in addition to everything GP mentioned.
I can do the same thing with `make local` invoking a few bash commands. If the complexity increases beyond that, a mistake has been made.
You could say the same thing about Ansible or Vagrant or Nomad or Salt or anything else.
I can say with complete confidence however, that if you are running Kubernetes and not thinking about ops, you are simply not operating it yourself. You are paying someone else to think about it for you. Which is fine, but says nothing about the technology.
> Every item on that list is "boring" tech. Approximately everyone have used load balancers, test environments and monitoring since the 90s just fine. What is it that you think make Kubernetes especially suited for this compared to every other solution during the past three decades?
You could make the same argument against using cloud at all, or against using CI. The point of Kubernetes isn't to make those things possible, it's to make them easy and consistent.
Kubernetes is boring tech as well.
And the advantage of it is one way to manage resources, scaling, logging, observability, hardware etc.
All of which is stored in Git and so audited, reviewed, versioned, tested etc in exactly the same way.
Kubernetes is a great example of the "second-system effect".
Kubernetes only works if you have a webapp written in a slow interpreted language. For anything else it is a huge impedance mismatch with what you're actually trying to do.
P.S. In the real world, Kubernetes isn't used to solve technical problems. It's used as a buffer between the dev team and the ops team, who usually have different schedules/budgets, and might even be different corporate entities. I'm sure there might be an easier way to solve that problem without dragging in Google's ridiculous and broken tech stack.
Contrary to popular belief, k8s is not Google's tech stack.
My understanding is that it was initially sold as Google's tech to benefit from Google's tech reputation (exploiting the confusion caused by the fact that some of the original k8s devs were ex-googlers), and today it's also Google trying to pose as k8s inventor, to benefit from its popularity. An interesting case of host/parasite symbiosis, it seems.
Just my impression though, I can be wrong, please comment if you know more about the history of k8s.
Is there anyone that works at Google that can confirm this?
What's left of Borg at Google? Did the company switch to the open source Kubernetes distribution at any point? I'd love to know more about this as well.
> exploiting the confusion caused by the fact that some of the original k8s devs were ex-googlers
What about the fact that many active Kubernetes developers, are also active Googlers?
Borg isn't going anywhere, Kubernetes isn't Google-scale
> It's used as a buffer between the dev team and the ops team, who usually have different schedules/budgets
That depends on your definition. If the ops team is solely responsible for running the Kubernetes cluster, then yes. In reality that's rarely how things turn out. Developers want Kubernetes, because.... I don't know. Ops doesn't even want Kubernetes in many cases. Kubernetes is amazing, for those few organisations that really need it.
My rule of thumb is: If your worker nodes aren't entire physical hosts, then you might not need Kubernetes. I've seen some absolutely crazy setups where developers had designed this entire solution around Kubernetes, only to run one or two containers. The reasoning is pretty much always the same, they know absolutely nothing about operations, and fail to understand that load balancers exists outside of Kubernetes, or that their solution could be an nginx configuration, 100 lines of Python and some systemd configuration.
I accept that I lost the fight that Kubernetes is overly complex and a nightmare to debug. In my current position I can even see some advantages to Kubernetes, so I was at least a little off in my criticism. Still, I don't think Kubernetes should be your default deployment platform unless you have very specific needs.
Kubernetes is an API for your cluster that is portable between providers, more or less. There are other abstractions, but they are not portable, e.g. fly.io, DO, etc. So unless you want vendor lock-in, you need it. For one of my products, I had to migrate for business reasons 4 times to different kube flavors, from self-managed (2 times) to GKE and EKS.
> there are other abstractions, but they are not portable
Not true. Unix itself is an API for your cluster too, like the original post implies.
Personally, as a "tech lead" I use NixOS. (Yes, I am that guy.)
The point is, k8s is a shitty API because it's built only for Google's "run a huge webapp built on shitty Python scripts" use case.
Most people don't need this, what they actually want is some way for dev to pass the buck to ops in some way that PM's can track on a Gantt chart.
> If you answer yes to many of those questions there's really no better alternative than k8s.
This is not even close to true with even a small number of resources. The notion that k8s somehow is the only choice is right along the lines of “Java Enterprise Edition is the only choice” — ie a real failure of the imagination.
For startups and teams with limited resources, DO, fly.io and render are doing lots of interesting work. But what if you can’t use them? Is k8s your only choice?
Let’s say you’re a large org with good engineering leadership, and you have high-revenue systems where downtime isn’t okay. Also, for compliance reasons, public cloud isn’t okay.
DNS in a tightly controlled large enterprise internal network can be handled with relatively simple microservices. Your org will likely have something already though.
Dev/Stage/Production: if you can spin up instances on demand this is trivial. Also financial services and other regulated biz have been doing this for eons before k8s.
Load Balancers: lots of non-k8s options exist (software and hardware appliances).
Prometheus / Grafana (and things like Netdata) work very well even without k8s.
Load Balancing and Ingress is definitely the most interesting piece of the puzzle. Some choose nginx or Envoy, but there’s also teams that use their own ingress solution (sometimes open-sourced!)
But why would a team do this? Or more appropriately, why would their management spend on this? Answer: many don’t! But for those that do — the driver is usually cost*, availability and accountability, along with engineering capability as a secondary driver.
(*cost because it’s easy to set up a mixed ability team with experienced, mid-career and new engineers for this. You don’t need a team full of kernel hackers.)
It costs less than you think, it creates real accountability throughout the stack and most importantly you’ve now got a team of engineers who can rise to any reasonable challenge, and who can be cross pollinated throughout the org. In brief the goal is to have engineers not “k8s implementers” or “OpenShift implementers” or “Cloud Foundry implementers”.
> DNS in a tightly controlled large enterprise internal network can be handled with relatively simple microservices. Your org will likely have something already though.
And it will likely be buggy with all sorts of edge cases.
> Dev/Stage/Production: if you can spin up instances on demand this is trivial. Also financial services and other regulated biz have been doing this for eons before k8s.
In my experience financial services have been notably not doing it.
> Load Balancers: lots of non-k8s options exist (software and hardware appliances).
The problem isn't running a load balancer with a given configuration at a given point in time. It's how you manage the required changes to load balancers and configuration as time goes on. It's very common for that to be a pile of perl scripts that add up to an ad-hoc informally specified bug-ridden implementation of half of kubernetes.
> And it will likely be buggy with all sorts of edge cases.
I have seen this view in corporate IT teams who’re happy to be “implementers” rather than engineers.
In real life, many orgs will in fact have third party vendor products for internal DNS and cert authorities. Writing bridge APIs to these isn’t difficult and it keeps the IT guys happy.
A relatively few orgs have written their own APIs, typically to manage a delegated zone. Again, you can say these must be buggy, but here’s the thing — everything’s buggy. Including k8s. As long as bugs are understood and fixed, no one cares. The proof of the pudding is how well it works.
Internal DNS in particular is easy enough to control and test if you have engineers (vs implementers) in your team.
> manage changes to load balancers … perl
That’s a very black and white view, that teams are either on k8s (which to you is the bees knees) or a pile of Perl (presumably unmaintainable). Speaks to interesting unconscious bias.
Perhaps it comes from personal experience, in which case I’m sorry you had to be part of such a team. But it’s not particularly difficult to follow modern best practices and operate your own stack.
But if your starter stance is that “k8s is the only way”, no one can talk you out of your own mental hard lines.
> Again, you can say these must be buggy, but here’s the thing — everything’s buggy. Including k8s. As long as bugs are understood and fixed, no one cares.
Agreed, but internal products are generally buggier, because an internal product is in a kind of monopoly position. You generally want to be using a product that is subject to competition, that is a profit center rather than a cost center for the people who are making it.
> Internal DNS in particular is easy enough to control and test if you have engineers (vs implementers) in your team.
Your team probably aren't DNS experts, and why should they be? You're not a DNS company. If you could make a better DNS - or a better DNS-deployment integration - than the pros, you'd be selling it. (The exception is if you really are a DNS company, either because you actually do sell it, or because you have some deep integration with DNS that enables your competitive advantage)
> Perhaps it comes from personal experience, in which case I’m sorry you had to be part of such a team. But it’s not particularly difficult to follow modern best practices and operate your own stack.
I'd say that's a contradiction in terms, because modern best practice is to not run your own stack.
I don't particularly like kubernetes qua kubernetes (indeed I'd generally pick nomad instead). But I absolutely do think you need a declarative, single-source-of-truth way of managing your full deployment, end-to-end. And if your deployment is made up of a standard load balancer (or an equivalent of one), a standard DNS, and prometheus or grafana, then you've either got one of these products or you've got an internal product that does the same thing, which is something I'm extremely skeptical of for the same reason as above - if your company was capable of creating a better solution to this standard problem, why wouldn't you be selling it? (And if an engineer was capable of creating a better solution to this standard problem, why would they work for you rather than one of the big cloud corps?)
In the same way I'm very skeptical of any company with an "internal cloud" - in my experience such a thing is usually a significantly worse implementation of AWS, and, yes, is usually held together with some flaky Perl scripts. Or an internal load balancer. It's generally NIH, or at best a cost-cutting exercise which tends to show; a company might have an internal cloud that's cheaper than AWS (I've worked for one), but you'll notice the cheapness.
Now again, if you really are gaining a competitive advantage from your things then it may make sense to not use a standard solution. But in that case you'll have something deeply integrated, i.e. monolithic, and that's precisely the case where you're not deploying separate standard DNS, separate standard load balancers, separate standard monitoring etc.. And in that case, as grandparent said, not using k8s makes total sense.
But if you're just deploying a standard Rails (or what have you) app with a standard database, load balancer, DNS, monitoring setup? Then 95% of the time your company can't solve that problem better than the companies that are dedicated to solving that problem. Either you don't have a solution at all (beyond doing it manually), you use k8s or similar, or you NIH it. Writing custom code to solve custom problems can be smart, but writing custom code to solve standard problems usually isn't.
> if your company was capable of creating a better solution to this standard problem, why wouldn't you be selling it?
Let's pretend I'm the greatest DevOps software engineer ever, and I write a Kubernetes replacement that's 100x better. Since it's 100x better, I simply charge 100x what a Kubernetes license costs per CPU/RAM to 1,000 customers, take all of that money to the bank, and deposit my check for $0.
I don't disagree with the rest of the comment, but the market for the software to host a web app is a weird market.
> If you answer yes to many of those questions there's really no better alternative than k8s.
Nah, most of that list is basically free for any company that uses an Amazon load balancer and an autoscaling group. In terms of likelihood of incidents, time, and cost, each of those will be an order of magnitude higher with a team of Kubernetes engineers than with a less complex setup.
Oz Nova nailed it nicely in "You Are Not Google"
https://blog.bradfieldcs.com/you-are-not-google-84912cf44afb
People really underestimate the power of shell scripts, ssh, and trusted developers.
Besides the fact that shell scripts aren't scalable (in terms of horizontal scalability, like the actor model), I would also like to point out that shell scripts should be simple; if you want to handle something that big, you are essentially using the shell as a programming language in disguise -- not ideal, and I would rather reach for Go or Rust instead.
We don't live in 1999 any more. A big machine with a database can serve everyone in the US, and I can fit it in my closet.
It's like people are stuck in the early 2000s when they start thinking about computer capabilities. Today I have more flops in a single GPU under my desk than the world's largest supercomputer had in 2004.
> It's like people are stuck in the early 2000s when they start thinking about computer capabilities.
This makes sense, because the code people write makes machines feel like they're from the early 2000's.
This is partially a joke, of course, but I think there is a massive chasm between the people who think you immediately need several computers to do things for anything other than redundancy, and the people who see how ridiculously much you can do with one.
I added performance testing to all our endpoints from the start, so that people don’t start to normalize those 10s response times that our last system had (cry)
> Besides the fact that shell scripts aren't scalable…
What are you trying to say there? My understanding is that, way under the hood, a set of shell scripts is in fact enabling the scalable nature of… the internet.
> My understanding is that, way under the hood, a set of shell scripts is in fact enabling the scalable nature of… the internet.
I sure hope not. The state of error handling in shell scripts alone is enough to disqualify them for serious production systems.
If you're extremely smart and disciplined it's theoretically possible to write a shell script that handles error states correctly. But there are better things to spend your discipline budget on.
[dead]
...that's only for early internet, and the early internet is effing broken at best
On the other hand, my team slapped 3 servers down in a datacenter, had each of them configured in a Proxmox cluster within a few hours. Some 8-10 hours later we had a fully configured kubernetes cluster running within Proxmox VMs, where the VMs and k8s cluster are created and configured using an automation workflow that we have running in GitHub Actions. An hour or two worth of work later we had several deployments running on it and serving requests.
Kubernetes is not simple. In fact it's even more complex than just running an executable with your linux distro's init system. The difference in my mind is that it's more complex for the system maintainer, but less complex for the person deploying workloads to it.
And that's before exploring all the benefits of kubernetes-ecosystem tooling like the Prometheus operator for k8s, or the horizontally scalable Loki deployments, for centrally collecting infrastructure and application metrics and logs. In my mind, when you make the most of these kinds of tools, things start to look a bit easier even for the systems maintainers.
Not trying to discount your workplace too much. But I'd wager there are a few people who are maybe not owning up to the fact that it's their first time messing around with kubernetes.
As long as your organisation can cleanly either a) split the responsibility for the platform from the responsibility for the apps that run on it, and fund it properly, or b) do the exact opposite and accommodate all the responsibility for the platform into the app team, I can see it working.
The problems start when you're somewhere between those two points. If you've got a "throw it over the wall to ops" type organisation, it's going to go bad. If you've got an underfunded platform team so the app team has to pick up some of the slack, it's going to go bad. If the app team have to ask permission from the platform team before doing anything interesting, it's going to go bad.
The problem is that a lot of organisations will look at k8s and think it means something it doesn't. If you weren't willing to fund a platform team before k8s, I'd be sceptical that moving to it is going to end well.
Are you self hosting kubernetes or running it managed?
I've only used it managed. There is a bit of a learning curve but it's not so bad. I can't see how it can take 4 months to figure it out.
We are using EKS
> I can't see how it can take 4 months to figure it out.
Well have you ever tried moving a company with a dozen services onto kubernetes piece-by-piece, with zero downtime? How long would it take you to correctly move and test every permission, environment variable, and issue you run into?
Then if you get a single setting wrong (e.g. memory size) and don't load-test with realistic traffic, you bring down production, potentially lose customers, and have to do a public post-mortem about your mistakes? [true story for current employer]
I don't see how anybody says they'd move a large company to kubernetes in such an environment in a few months with no screwups and solid testing.
It sounds like it's not easy to figure out the permissions, envvars, memory size, etc. of your existing system, and that's why the migration is so difficult? That's not really one of Kubernetes' (many) failings.
Yes, and now we are back at the ancestor comment’s original point: “at the end of the day kubernetes feels like complexity trying to abstract over complexity, and often I find that's less successful that removing complexity in the first place”
Which I understand to mean “some people think using Kubernetes will make managing a system easier, but it often will not do that”
Can you elaborate on other things you think Kubernetes gets wrong? Asking out of curiosity because I haven't delved deep into it.
Took us three to four years to go from self-hosted multi-DC to getting the main product almost fully in k8s (some parts didn't make sense in k8s and were pushed to our geo-distributed edge nodes). Dozens of services and teams, and keeping the old stuff working while changing the tire on the car while driving. All while the company continues to grow and scale doubles every year or so. It takes maturity in testing and monitoring, and it takes longer than everyone estimates.
It largely depends how customized each microservice is, and how many people are working on this project.
I've seen migrations of thousands of microservices happening within the span of two years. Longer timeline, yes, but the number of microservices is orders of magnitude larger.
Though I suppose the organization works differently at this level. The Kubernetes team built a tool to migrate the microservices, and each owner was asked to perform the migration themselves. Small microservices could be migrated in less than three days, while the large and risk-critical ones took a couple of weeks. This all happened in less than two years, but it took more than that in terms of engineer-weeks.
The project was very successful though. The company spends way less money now because of the autoscaling features, and the ability to run multiple microservices in the same node.
Regardless, if the company is running 12 microservices and this number is expected to grow, this is probably a good time to migrate. How did they account for the different shapes of services (stateful, stateless, leader-elected, cron, etc.), networking settings, styles of deployment (blue-green, rolling updates, etc.), secret management, load testing, bug bashing, gradual rollouts, containerizing the services, and so on? If it's taking 4x longer than originally anticipated, it seems like there was a massive failure in project design.
2000 products sounds like you made 2000 engineers learn kubernetes (a week, optimistically, 2000/52 = 38 engineer years, or roughly one wasted career).
Similarly, the actual migration times you estimate add up to decades of engineer time.
It’s possible kubernetes saves more time than using the alternative costs, but that definitely wasn’t the case at my previous two jobs. The jury is out at the current job.
I see the opportunity cost of this stuff every day at work, and am patiently waiting for a replacement.
> 2000 products sounds like you made 2000 engineers learn kubernetes (a week, optimistically, 2000/52 = 38 engineer years, or roughly one wasted career).
Not really, they only had to use the tool to run the migration and then validate that it worked properly. As the other commenter said, a very basic setup for Kubernetes is not that hard; the difficult setup is left to the devops team, while the service owners just need to know the basics.
But sure, we can estimate it at 38 engineering years. That's still 38 years for 2,000 microservices; it's way better than 1 year for 12 microservices like in OP's case. The savings we got were enough to offset those 38 years of work, so this project is now paying dividends.
> 2000 products sounds like you made 2000 engineers learn kubernetes (a week, optimistically, 2000/52 = 38 engineer years, or roughly one wasted career).
Learning k8s enough to be able to work with it isn't that hard. Have a centralized team write up a decent template for a CI/CD pipeline, a Dockerfile for the most common stacks you use, and a Helm chart with examples for a Deployment, PersistentVolumeClaim, Service, and Ingress; distribute that, and be available for support should the need go beyond "we need 1-N pods for this service, they get some environment variables from which they are configured, and maybe a Secret/ConfigMap if the application wants its configuration in files instead". That is enough in my experience.
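For a sense of scale, the per-service part of such a template usually boils down to something like the following. This is a sketch with hypothetical names and registry, applied here with a kubectl heredoc rather than Helm for brevity; only the image, env vars, and port would normally change per service.

    # Hypothetical starter manifest a platform team might hand to service owners.
    kubectl apply -f - <<'EOF'
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-service
    spec:
      replicas: 2
      selector:
        matchLabels: { app: example-service }
      template:
        metadata:
          labels: { app: example-service }
        spec:
          containers:
            - name: app
              image: registry.example.com/example-service:1.0.0
              ports:
                - containerPort: 8080
              env:
                - name: LOG_LEVEL
                  value: "info"
              envFrom:
                - secretRef:
                    name: example-service-secrets
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: example-service
    spec:
      selector:
        app: example-service
      ports:
        - port: 80
          targetPort: 8080
    EOF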
> Learning k8s enough to be able to work with it isn't that hard.
I’ve seen a lot of people learn enough k8s to be dangerous.
Learning it well enough to not get wrapped around the axle with some networking or storage details is quite a bit harder.
For sure, but that's the job of a good ops department - where I work, for example, every project's CI/CD pipeline has its own IAM user mapping to a Kubernetes role that only has explicitly defined capabilities: create, modify, and delete just the utter basics. Even if they committed something into the Helm chart that could cause an annoyance, the service account wouldn't be able to call the required APIs. And the templates themselves come with security built in - privileges are all explicitly dropped, pod UIDs/GIDs are hardcoded to non-root, and we're now deploying network policies at least for ingress. Only egress network policies aren't in place yet; we haven't been able to make those work with services.
Anyone wishing to do stuff like use the RDS database provisioner gets an introduction from us on how to use it and what the pitfalls are, and regular reviews of their code. They're flexible but we keep tabs on what they're doing, and when they have done something useful we aren't shy from integrating whatever they have done to our shared template repository.
Comparing the simplicity of two PHP servers against a setup with a dozen services is always going to be one sided. The difference in complexity alone is massive, regardless of whether you use k8s or not.
My current employer did something similar, but with fewer services. The upshot is that with terraform and helm and all the other yaml files defining our cluster, we have test environments on demand, and our uptime is 100x better.
Fair enough that sounds hard.
Memory size is an interesting example. A typical Kubernetes deployment has much more control over this than a typical non-container setup. It is costing you to figure out the right setting but in the long term you are rewarded with a more robust and more re-deployable application.
> has much more control over this than a typical non-container setup
Actually not true, k8s uses the exact same cgroups API for this under the hood that systemd does.
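A quick illustration of that point, with made-up values; the knob is the same cgroup memory controller whichever front end you use:

    # systemd: cap a process with the cgroup memory controller
    systemd-run --unit=myapp -p MemoryMax=512M /usr/local/bin/myapp

    # Kubernetes: the per-container equivalent lives in the pod spec:
    #   resources:
    #     requests: { memory: "256Mi" }
    #     limits:   { memory: "512Mi" }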
> I don't see how anybody says they'd move a large company to kubernetes in such an environment in a few months with no screwups and solid testing.
Unfortunately, I do. Somebody says that when the culture of the organization expects to be told and hear what they want to hear rather than the cold hard truth. And likely the person saying that says it from a perch up high and not responsible for the day to day work of actually implementing the change. I see this happen when the person, management/leadership, lacks the skills and knowledge to perform the work themselves. They've never been in the trenches and had to actually deal face to face with the devil in the details.
Canary deploy, dude (or dude-ette): route 0.001% of service traffic and then slowly move it over. Then set error budgets. Then a bad service won't "bring down production".
That's how we did it at Google (I was part of the core team responsible for ad-serving infra - billions of ads to billions of users a day).
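One concrete way to get that kind of weighted split in a k8s setup is ingress-nginx's canary annotations; a sketch with a hypothetical Ingress name (the integer weight bottoms out at 1%, so a 0.001% slice needs finer-grained tooling such as a service mesh or header-based routing):

    # A second Ingress marked as canary, receiving ~1% of the traffic
    # that otherwise matches the main Ingress for the same host/path.
    kubectl annotate ingress myapp-canary \
      nginx.ingress.kubernetes.io/canary="true" \
      nginx.ingress.kubernetes.io/canary-weight="1"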
Using microk8s or k3s on one node works fine. As the author of "one big server," I am now working on an application that needs some GPUs and needs to be able to deploy on customer hardware, so k8s is natural. Our own hosted product runs on 2 servers, but it's ~10 containers (including databases, etc).
Yup, I like this approach a lot. With cloud providers considering VMs durable these days (they get new hardware for your VM if the hardware it's on dies, without dropping any TCP connections), I think a 1 node approach is enough for small things. You can get like 192 vCPUs per node. This is enough for a lot of small companies.
I occasionally try non-k8s approaches to see what I'm missing. I have a small ARM machine that runs Home Assistant and some other stuff. My first instinct was to run k8s (probably kind honestly), but didn't really want to write a bunch of manifests and let myself scope creep to running ArgoCD. I decided on `podman generate systemd` instead (with nightly re-pulls of the "latest" tag; I live and die by the bleeding edge). This was OK, until I added zwavejs, and now the versions sometimes get out of sync, which I notice by a certain light switch not working anymore. What I should have done instead was have some sort of git repository where I have the versions of these two things, and to update them atomically both at the exact same time. Oh wow, I really did need ArgoCD and Kubernetes ;)
I get by with podman by angrily ssh-ing in in my winter jacket when I'm trying to leave my house but can't turn the lights off. Maybe this can be blamed on auto-updates, but frankly anything exposed to a network that is out of date is also a risk, so, I don't think you can ever really win.
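For anyone wanting to try the approach described above, its rough shape looks like this (image and paths are illustrative, not the commenter's actual setup):

    # Run the container with podman's auto-update label
    podman run -d --name homeassistant --network=host \
      --label io.containers.autoupdate=registry \
      -v /opt/hass/config:/config \
      ghcr.io/home-assistant/home-assistant:latest

    # Let systemd own it, recreating the container on restart
    podman generate systemd --new --name homeassistant \
      > /etc/systemd/system/container-homeassistant.service
    systemctl daemon-reload && systemctl enable --now container-homeassistant

    # The nightly "re-pull :latest" is then just a timer around:
    podman auto-update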
I think porting to k8s can succeed or fail, like any other project. I switched an app that I alone worked on, from Elastic Beanstalk (with Bash), to Kubernetes (with Babashka/Clojure). It didn't seem bad. I think k8s is basically a well-designed solution. I think of it as a declarative language which is sent to interpreters in k8s's control plane.
Obviously, some parts of it took a while to figure out. For example, I needed to figure out an AWS security group problem with Ingress objects that I recall wasn't well documented. So I think parts of that declarative language can suck if the declarative parts aren't well factored out from the imperative parts, or if the log messages don't help you diagnose errors, or if there isn't some kind of (dynamic?) linter that helps you notice problems quickly.
In your team's case, more information seems needed to help us evaluate the problems. Why was it easier before to make testing environments, and harder now?
Yea but that doesn't sound shiny on your resume.
I never did choose any single thing in my job, just because of how it could look in my resume.
After 20+ years of Linux sysadmin/devops work, and because of a spinal disc herniation last year, I'm now looking for a job.
99% of job offers will ask for EKS/Kubernetes now.
It's like the VMware of the years 200[1-9], or like the "Cloud" of the years 201[1-9].
I've always specialized in physical datacenters and servers, being it on-premises, colocation, embedded, etc... so I'm out of the market now, at least in Spain (which always goes like 8 years behind the market).
You can try to avoid it, and it's nice when you save your company thousands of operational/performance/security/etc. issues and dollars over the years, and you look to your boss like a guru who stays ahead of industry problems, but it will make finding a job... 99% harder.
It doesn't matter if you demonstrate the highest level in Linux, scripting, Ansible, networking, security, hardware, performance tuning, high availability, all kinds of balancers, switching, routing, firewalls, encryption, backups, monitoring, log management, compliance, architecture, isolation, budget management, team management, provider/customer management, debugging, automation, programming full stack, and a long etc. If you say "I never worked with Kubernetes, but I learn fast", with your best sincerity at the interview, then you're automatically out of the process. No matter if you're talking with human resources, a helper of the CTO, or the CTO. You're out.
Depends on what kind of company you want to join. Some value simplicity and efficiency more.
So, my current experience somewhere most old apps are very old school:
- Most server software is waaaaaaay out of date, so getting a dev/test env is a little harder (the last problem we hit was that the HAproxy version doesn't do ECDSA keys for SSL certs, which is the default with certbot).
- Yeah, pushing to prod is "easy": FTP directly. But now which version of which files are really in prod? No idea. When I say old school, it's old school from before things like Jenkins.
- Need something done around the servers? That's the OPS team's job. A team which also has too much other work to do, so now you'll have to wait a week or two for this simple "add an upload file" endpoint on this old API, because you need somewhere to put those files.
Now we've started setting up some on-prem k8s nodes for the new developments. Not because we need crazy scaling, but so the dev team can do most of the ops they need. It takes time to get everything set up, but once it started chugging along it felt good to be able to just declare whatever we need and get it. You still need to get the devs to learn k8s, which is not fun, but that's the life of a dev: learning new things every day.
Also, k8s does not really do data. If you want a database or anything that manages files, you'll want to do most of that job outside k8s.
Kubernetes is so easy that you only need two or three dedicated full-time employees to keep the mountains of YAML from collapsing in on themselves before cutting costs and outsourcing your cluster management to someone else.
Sure, it can be easy, just pick one of the many cloud providers that fix all the complicated parts for you. Though, when you do that, expect to pay extra for the privilege, and maybe take a look at the much easier proprietary alternatives. In theory the entire thing is portable enough that you can just switch hosting providers, in practice you're never going to be able to do that without seriously rewriting part of your stack anyway.
The worst part is that the mountains of YAML were never supposed to be written by humans anyway, they're readable configuration your tooling is supposed to generate for you. You still need your bash scripts and your complicated deployment strategies, but rather than using them directly you're supposed to compile them into YAML first.
Kubernetes is nice and all but it's not worth the effort for the vast majority of websites and services. WordPress works just fine without automatic replication and end-to-end microservice TLS encryption.
I went down the Kubernetes path. The product I picked 4 years ago is no longer maintained :(
The biggest breaking change to docker compose since it was introduced was that the docker-compose command stopped working and I had to switch to «docker compose» with a space. Had I stuck with docker and docker-compose I could have trivially kept everything up to date and running smoothly.
I ran a small bootstrapped startup; I used GKE. Everything was templated.
Each app has its own template, e.g. nodejs-worker, and you don't change the template unless you really need to.
I spent ~2% of my manager + eng leader + hiring manager + god-knows-what-else-people-do-at-a-startup time on managing 100+ microservices, because they were templated.
That works great until you want to change something low-level and have to apply it to all those templates.
This is so unnuanced that it reads like rationalization to me. People seem to get stuck on mantras that simple things are inherently fragile which isn't really true, or at least not particularly more fragile than navigating a jungle of yaml files and k8s cottage industry products that link together in arcane ways and tend to be very hard to debug, or just to understand all the moving parts involved in the flow of a request and thus what can go wrong. I get the feeling that they mostly just don't like that it doesn't have professional aesthetics.
This reminds me of the famous Taco Bell Programming post [1]. Simple can surprisingly often be good enough.
[1] http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...
> People seem to get stuck on mantras that simple things are inherently fragile which isn't really true...
Ofc it isn't true.
Kubernetes was designed at Google at a time when Google was already a behemoth. 99.99% of all startups and SMEs out there shall never ever have the same scaling issues and automation needs that Google has.
Now that said... When you begin running VMs and containers, even only a very few of them, you immediately run into issues and then you begin to think: "Kubernetes is the solution". And it is. But it is also, in many cases, a solution to a problem you created. Still... the justification for creating that problem, if you're not Google scale, is highly disputable.
And, deep down, there's another very fundamental issue IMO: many of those "let's have only one process in one container" solutions actually mean "we're totally unable to write portable software working on several configs, so let's start with a machine with zero libs and dependencies and install exactly the minimum deps needed to make our ultra-fragile piece of shit of a software kinda work. And because it's still going to be a brittle piece of shit, let's make sure we use heartbeats and try to shut it down and back up again once it'll invariably have memory leaked and/or whatnots".
Then you've also gained the right to be sloppy in the software you write: not respecting it, treating it as cattle to be slaughtered, so it can be shitty. But you've now added an insane layer of complexity.
How do you like your uninitialized var when a container launches but then silently doesn't work as expected? How do you like them logs in that case? Someone here has described the lack of instant failure on any uninitialized var as the "billion dollar mistake of the devops world".
Meanwhile look at some proper software like, say, the Linux kernel or a distro like Debian. Or compile Emacs or a browser from source and marvel at what's happening. Sure, there may be hiccups, but it works. On many configs. On many different hardware. On many different architectures. These are robust software that don't need to be "pid 1 on a pristine filesystem" to work properly.
In a way, this whole "let's have all our software run as pid 1, each on a pristine OS and filesystem" approach is an admission of a very deep and profound failure of our entire field.
I don't think it's something to be celebrated.
And don't get me started on security: you now have ultra-complicated LANs and VLANs, with near impossible to monitor traffic, with shitloads of ports open everywhere, the most gigantic attack surface of them all, and heartbeats and whatnots constantly polluting the network, where nobody even knows anymore what's going on. Where the only actual security seems to rely on the firewall being up and correctly configured, which is incredibly complicated to do given the insane network complexity you added to your stack. "Oh wait, I have an idea, let's make configuring the firewall a service!" (and make sure not to forget to initialize one of the countless vars, or it'll all silently break and just not configure firewalling for anything).
Now though, love is true love: even at home I'm running a hypervisor with VMs and OCI containers ; )
> Meanwhile look at some proper software like, say, the Linux kernel or a distro like Debian. Or compile Emacs or a browser from source and marvel at what's happening. Sure, there may be hiccups, but it works. On many configs. On many different hardware. On many different architectures. These are robust software
Lol no. The build systems flake out if you look at them funny. The build requirements are whatever Joe in Nebraska happened to have installed on his machine that day (I mean sure there's a text file supposedly listing them, but it hasn't been accurate for 6 years). They list systems that they haven't actually supported for years, because no-one's actually testing them.
I hate containers as much as anyone, but the state of "native" unix software is even worse.
+1 for talking about attack surface. Every service is a potential gateway for bad people. Locking them all down is incredibly difficult to get right.
99.99% of startups and SMEs should not be writing microservices.
But "I wrote a commercial system that served thousands of users, it ran on a single process on a spare box out the back" doesn't look good on resumes.
I sense a lot of painful insights written in blood here.
I love that the only alternative is a "pile of shell scripts". Nobody has posted a legitimate alternative to the complexity of K8s or the simplicity of docker compose. Certainly feels like there's a gap in the market for an opinionated deployment solution that works locally and on the cloud, with less functionality than K8s and a bit more complexity than docker compose.
K8s just drowns out all other options. Hashicorp Nomad is great, https://www.nomadproject.io/
I am puzzled by the fact that no successful forks of Nomad and Consul have emerged since the licence change and acquisition of Hashicorp.
If you need a quick scheduler, orchestrator, and services control plane without fully embracing containers, you might soon be out of luck.
Nomad was amazing at every step of my experiments on it, except one. Simply including a file from the Nomad control to the Nomad host is... impossible? I saw indications of how to tell the host to get it from a file host, and I saw people complaining that they had to do it through the file host, with the response being security (I have thoughts about this and so did the complainants).
I was rather baffled to an extent. I was just trying to push a configuration file that would be the primary difference between a couple otherwise samey apps.
Thumbs up for Nomad. We've been running it for about 3 years in prod now and it hasn't failed us a single time.
Docker Swarm is exactly what tried to fill that niche. It's basically an extension to Docker Compose that adds clustering support and overlay networks.
Kamal was also built with that purpose in mind.
https://kamal-deploy.org/
This looks cool and +1 for the 37Signals and Basecamp folks. I need to verify that I'll be able to spin up GPU enabled containers, but I can't imagine why that wouldn't work...
I coined a term for this because I see it so often.
“People will always defend complexity, stating that the only alternative is shell scripts”.
I saw people defending docker this way, ansible this way and most recently systemd this way.
Now we’re on to kubernetes.
>and most recently systemd this way.
To be fair, most people attacking systemd say they want to return to shell scripts.
No, there are alternatives like runit and SMF that do not use shell scripts.
It's conveniently ignored by systemd supporters, and the conversation always revolves around the fact that we used to use shell scripts, despite the fact that there are sensible inits predating systemd that did not use shell languages.
At least I never saw anyone arguing that the only alternative to git was shell scripts.
Wait. Wouldn't that be a good idea?
While not opinionated, you can go with cloud-specific tools (e.g. ECS in AWS).
Sure, but those don't support local deployment, at least not in any sort of easy way.
This is basically exactly what we needed at the start up I worked at, with the added need of being able to host open source projects (airbyte, metabase) with a reasonable level of confidence.
We ended up migrating from Heroku to Kubernetes. I tried to take some of the learnings to build https://github.com/czhu12/canine
It basically wraps Kubernetes and tries to hide as much of its complexity as possible, only exposing the good parts, which will be enough for 95% of web application workloads.
Docker Swarm mode? I know it’s not as well maintained, but I think it’s exactly what you talk about here (forget K3s, etc). I believe smaller companies run it still and it’s perfect for personal projects. I myself run mostly docker compose + shell scripts though because I don’t really need zero-downtime deployments or redundancy/fault tolerance.
Capistrano, Ansible et al. have existed this whole time if you want to do that.
The real difference in approaches is between short lived environments that you redeploy from scratch all the time and long lived environments we nurse back to health with runbooks.
You can use lambda, kube, etc. or chef, puppet etc. but you end up at this same crossroad.
Just starting a process and keeping it alive for a long time is easy to get started with but eventually you have to pay the runbook tax. Instead you could pay the kubernetes tax or the nomad tax at the start instead of the 12am ansible tax later.
I hate to shill my own company, but I took the job because I believe in it.
You should check out DBOS and see if it meets your middle ground requirements.
Works locally and in the cloud, has all the things you’d need to build a reliable and stateful application.
[0] https://dbos.dev
Looking at your page, it looks like Lambdas/Functions but on your system, not Amazon/Microsoft/Google.
Every company I've ever seen try to do this has ended in crying after some part of the system doesn't fit neatly into the serverless box and it becomes painful to extract it from your system into "run FastAPI in containers."
We run on bare metal in AWS, so you get access to all your other AWS services. We can also run on bare metal in whatever cloud you want.
Sure, but I'm still wrapped around your library, no? So if your "process Kafka events" decorator in Python doesn't quite do what I need, I'm forced to grab the Kafka library, write my code, and then learn to build my own container, since I assume you were handling the build part. Finally, I figure out which of the 17 ways to run containers on AWS (https://www.lastweekinaws.com/blog/the-17-ways-to-run-contai...) is right for me, and away I go?
That's my SRE recommendation of "These serverless are a trap, it's quick to get going but you can quickly get locked into a bad place."
No, not at all. We run standard python, so we can build with any kafka library. Our decorator is just a subclass of the default decorator to add some kafka stuff, but you can use the generic decorator around whatever kafka library you want. We can build and run any arbitrary Python.
But yes, if you find there is something you can't do, you would have to build a container for it or deploy it to an instance however you want. Although I'd say that most likely we'd work with you to make whatever it is you want to do possible.
I'd also consider that an advantage. You aren't locked into the platform, you can expand it to do whatever you want. The whole point of serverless is to make most things easy, not all things. If you can get your POC working without doing anything, isn't that a great advantage to your business?
Let's be real: if you start with containers, it will be a lot harder to get started and then still hard to add whatever functionality you want. Containers don't really make anything easier; they just make things more consistent.
Nice, but I like my servers and find serverless difficult to debug.
That's the beauty of this system. You build it all locally, test it locally, debug it locally. Only then do you deploy to the cloud. And since you can build the whole thing with one file, it's really easy to reason about.
And if somehow you get a bug in production, you have the time travel debugger to replay exactly what the state of the cloud was at the time.
Great to hear you've improved serverless debugging. What if my endpoint wants to run ffmpeg and extract frames from video. How does that work on serverless?
That particular use case requires some pretty heavy binaries and isn't really suited to serverless. However, you could still use DBOS to manage chunking the work and to manage the workflows, making sure every frame is only processed once. Then you could call out to some of the existing serverless offerings that do exactly what you suggest (extract frames from video).
Or you could launch an EC2 instance that is running ffmpeg and takes in videos and spits out frames, and then use DBOS to manage launching and closing down those instances as well as the workflows of getting the work done.
Looks interesting, but this is a bit worrying:
It's pretty easy to see how that could go badly wrong. ;) (And yeah, obviously "don't deploy that stuff" is the solution.)
---
That being said, is it all OSS? I can see some stuff here that seems to be, but it mostly seems to be the client side stuff?
https://github.com/dbos-inc
Maybe that is worded poorly. :). It's supposed to mean there are no timeouts -- you can wait as long as you want between retries.
> That being said, is it all OSS?
The Transact library is open source and always will be. That is what gets you the durability, statefulness, some observability, and local testing.
We also offer a hosted cloud product that adds in the reliability, scalability, more observability, and a time travel debugger.
Agreed, something simpler than Nomad as well hopefully.
Ansible and the podman Ansible modules
I'm giggling at the idea you'd need Kubernetes for a mere two servers. We don't run any application with less than two instances for redundancy.
We've just never seen the need for Kubernetes. We're not against it as much as the need to replace our working setup just never arrived. We run EC2 instances with a setup shell script under 50loc. We autoscale up to 40-50 web servers at peak load of a little over 100k concurrent users.
Different strokes for different folks but moreso if it ain't broke, don't fix it
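Not that commenter's actual script, but a sub-50-line bootstrap for a stateless web box tends to have roughly this shape (account ID, region, paths, and image are placeholders):

    #!/usr/bin/env bash
    # Hypothetical EC2 user-data for a stateless web server joining the fleet.
    set -euo pipefail

    dnf install -y docker
    systemctl enable --now docker

    aws ecr get-login-password --region us-east-1 |
      docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

    docker run -d --restart=always -p 80:8080 \
      --env-file /etc/myapp/env \
      123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest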
> The inscrutable iptables rules?
You mean the list of calls right there in the shell script?
> Who will know about those undocumented sysctl edits you made on the VM?
You mean those calls to `sysctl` conveniently right there in the shell script?
> your app needs to programmatically spawn other containers
Or you could run a job queue and push tasks to it (gaining all the usual benefits of observability, concurrency limits, etc), instead of spawning ad-hoc containers and hoping for the best.
"We don't know how to learn/read code we are unfamiliar with... Nor do we know how to grok and learn things quickly. Heck, we don't know what grok means "
Who do you quote?
This quote mostly applies to people who don't want to spend the time learning existing tooling, making improvements and instead create a slightly different wheel but with different problems. It also applies to people trying to apply "google" solutions to a non-google company.
Kubernetes and all the tooling in the Cloud Native Computing Foundation (CNCF) were created to have people adopt the cloud and build communities, which then created job roles, which facilitated hiring people to maintain cloud presences, which then fund cloud providers.
This is the same playbook Microsoft ran at universities. They would give the entire suite of tools in the MSDN library away, then in roughly four years collect when another seat needed to be purchased for a new hire who had only used Microsoft tools for the last four years.
[dead]
> You mean the list of calls right there in the shell script?
This is about the worst encoding for network rules I can think of.
Worse than yaml generated by string interpolation?
You'd have to give me an example. YAML is certainly better at representing tables of data than a shell script is.
Not entirely a fair comparison, but here. Can you honestly tell me you'd take the yaml over the shell script?
(If you've never had to use Helm, I envy you. And if you have, I genuinely look forward to you showing me an easier way to do this, since it would make my life easier.)
-------------------------------------
Shell script: (snippet lost in formatting)

Multiple ports: Easy and concise.
-------------------------------------
Kubernetes (disclaimer: untested, obviously): (snippet lost in formatting; a rough stand-in for this comparison appears a couple of comments below)

Multiple ports: (snippet lost in formatting)

I would take the YAML any day.
Because if one of those iptables calls above fails, you're in an inconsistent state.
Also if I want to swap from iptables to something like Istio then it's basically the same YAML.
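Since the snippets in the comparison above were lost to formatting, here is a rough stand-in for the kind of thing being compared (hypothetical app name and ports):

    # Shell flavour: expose the app by redirecting port 80 to 8080
    iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 8080

    # Kubernetes flavour: the same exposure expressed as a Service
    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Service
    metadata:
      name: myapp
    spec:
      selector:
        app: myapp
      ports:
        - name: http
          port: 80
          targetPort: 8080
    EOF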
You obviously didn't use k8s (or k3s or any other implementation) a lot, because it also messed up iptables randomly sometimes due to bugs, version mismatches, etc.
I have been running Kubernetes for the last decade across multiple implementations.
Never had an iptables issue, and these days eBPF is the standard.
Highly amateurish take if you call shell spaghetti a Kubernetes, especially if we compare the complexity of both...
You know what would be even worse? Introducing Kubernetes for your non-Google/Netflix/whatever-planetary-scale app instead of just writing a few scripts...
Hell, I’m a fan of k8s even for sub-planetary scale (assuming that scale is ultimately a goal of your business, it’s nice to build for success). But I agree that saying “well, it’s either k8s or you will build k8s yourself” is just ignorant. There are a lot of options between the two poles that can be both cheap and easy and offload the ugly bits of server management for the right price and complexity that your business needs.
Both this piece and the piece it’s imitating seem to have 2 central implicit axioms that in my opinion don’t hold. The first, that the constraints of the home grown systems are all cost and the second that the flexibility of the general purpose solution is all benefit.
You generally speaking do not want a code generation or service orchestration system that will support the entire universe of choices. You want your programs and idioms to follow similar patterns across your codebase and you want your services architected and deployed the same way. You want to know when outliers get introduced and similarly you want to make it costly enough to require introspection on if the value of the benefit out ways the cost of oddity.
The compiler one read to me like a reminder not to ignore the lessons of compiler design. The premise being that even though you have a small-scope project compared to a "real" compiler, you will evolve towards analogues of those design ideas. The databases and k8s pieces are more like "don't even try a small-scope project, because you'll want the same features eventually."
I suppose I can see how people are taking this piece that way, but I don't see it like that. It is snarky and ranty, which makes it hard to express or perceive nuance. They do explicitly acknowledge that "a single server can go a long way" though.
I think the real point, better expressed, is that if you find yourself building a system with like a third of the features of K8s but composed of hand-rolled scripts and random third-party tools kludged together, maybe you should have just bit the bullet and moved to K8s instead.
You probably shouldn't start your project on it unless you have a dedicated DevOps department maintaining your cluster for you, but don't be afraid to move to it if your needs start getting more complex.
Author here. Yes there were many times while writing this that I wanted to insert nuance, but couldn't without breaking the format too much.
I appreciate the wide range of interpretations! I don't necessarily think you should always move to k8s in those situations. I just want people to not dismiss k8s outright for being overly-complex without thinking too hard about it. "You will evolve towards analogues of those design ideas" is a good way to put it.
That's also how I interpreted the original post about compilers. The reader is stubbornly refusing to acknowledge that compilers have irreducible complexity. They think they can build something simpler, but end up rediscovering the same path that led to the creation of compilers in the first place.
I had a hard time putting my finger on what was so annoying about the follow-ons to the compiler post, and this nails it for me. Thanks!
> You generally speaking do not want a code generation or service orchestration system that will support the entire universe of choices.
This. I will gladly give up the universe of choices for a one size fits most solution that just works. I will bend my use cases to fit the mold if it means not having to write k8s configuration in a twisty maze of managed services.
I like to say, you can make anything look good by considering only the benefits and anything look bad by considering only the costs.
It's a fun philosophy for online debates, but an expensive one to use in real engineering.
outweighs*
Only offering the correction because I was confused at what you meant by “out ways” until I figured it out.
The elephant in the room: People who have gotten over the K8s learning curve almost all tell you it isn't actually that bad. Most people who have not attempted the learning curve, or have just dipped their toe in, will tell you they are scared of the complexity.
An anecdotal datapoint: My standard lecture teaching developers how to interact with K8s takes almost precisely 30 minutes to have them writing Helm charts for themselves. I have given it a whole bunch of times and it seems to do the job.
> My standard lecture teaching developers how to interact with K8s takes almost precisely 30 minutes to have them writing Helm charts for themselves
And I can teach someone to write "hello world" in 10 languages in 30 minutes, but that doesn't mean they're qualified to develop or fix production software.
One has to start from somewhere I guess. I doubt anyone would learn K8s thoroughly before getting any such job. Tried once and the whole thing bored me out in the fourth video.
I personally know many k8s experts that vehemently recommend against using it unless you have no other option.
Much like Javascript, the problem isn't Kubernetes, its the zillions of half-tested open-source libraries that promise to make things easier but actually completely obfuscate what the system is doing while injecting fantastic amounts of bugs.
Dear Amazon Elastic Beanstalk, Google App Engine, Heroku, Digital Ocean App Platform, and friends,
Thank you for building "a kubernetes" for me so I don't have to muck with that nonsense, or have to hire people that do.
I don't know what that other guy is talking about.
Most of the complaints in this fun post are just bad practice, and really have nothing to do with "making a Kubernetes".
Sans bad engineering practices, if you built a system that did the same things as kubernetes I would have no problem with it.
In reality I don’t want everybody to use k8s. I want people finding different solutions to solve similar problems. Homogenized ecosystems create walls they block progress.
One of the big things that is overlooked when people move to k8s, and why things get better when moving to k8s, is that k8s made a set of rules that forced service owners to fix all of their bad practices.
Most deployment systems would work fine if the same work to remove bad practices from their stack occurred.
K8s is the hot thing today, but mark my words, it will be replaced with something far more simple and much nicer to integrate with. And this will come from some engineer “creating a kubernetes”
Don't even get me started on how crappy the culture of "you are doing something hard that I think is already a solved problem" is. This goes for compilers and databases too. None of these are hard, and neither is k8s; any good engineer tasked with making one would be able to do so.
I welcome a k8s replacement! Just as there are better compilers and better databases than we had 10-20 years ago, we need better deployment methods. I just believe those better methods come from really understanding the compilers, databases, and deployment systems that came before, rather than dismissing them out of hand.
Can you give examples of what "bad practices" does k8s force to fix?
To name a few:
K8s really kills the urge to say "oh well, I guess we can just drop that file onto the server as part of startup rather than use a db/config system/etc." No more "oh shit, the VM died and we lost the file that was supposed to be static except for that thing John wrote to update it only if X happened, but now X happens every day and the file is gone"... or worse: it's in git, but now you have 3 different versions that have all drifted due to John's code change.
K8s makes you use containers, which makes you not run things on your machine, which makes you better at CI, which.. (the list goes on, containers are industry standard for a lot of reasons). In general the 12 Factor App is a great set of ideas, and k8s lets you do them (this is not exclusive, though). Containers alone are a huge game changer compared to “cp a JAR to the server and restart it”
K8s makes it really, really easy to just split off that one weird cronjob part of the codebase that Mike needed, and man, it is really nice to just use the same code and dependencies rather than boilerplating a whole new app, deploy, CI, configs, and YAMLs to make that run. See the points about containerization, and the sketch after this list of points.
K8s doesn’t assume that your business will always be a website/mobile app. See the whole “edge computing” trend.
I do want to stress that k8s is not the only thing in the world that can do these or promote good development practices, and I do think it’s overkill to say that it MAKES you do things well - a foolhardy person can mess any well-intentioned system up.
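To make the cronjob point above concrete, here is a sketch of what splitting that job off looks like when it can reuse the service's existing image and config (all names are hypothetical):

    # The "weird cronjob" reuses the service's image and ConfigMap; no new app scaffolding.
    kubectl apply -f - <<'EOF'
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: nightly-cleanup
    spec:
      schedule: "0 3 * * *"
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: cleanup
                  image: registry.example.com/example-service:1.0.0
                  command: ["python", "-m", "app.cleanup"]
                  envFrom:
                    - configMapRef:
                        name: example-service-config
    EOF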
So you're saying companies should move to k8s and then immediately move to bash scripts
No. I am saying that companies should have their engineers understand why k8s works and make those reasons an engineering practice.
As it is today, the pattern is: spend a ton of money moving to k8s (mostly costly managed solutions), fixing all the bad engineering patterns in the process because k8s forces it. Then an engineer saves the company money by moving back to a more home-grown solution, one that fits the company's needs, something that would only be possible once the engineering practices were fixed.
Kubernetes' biggest competitor isn't a pile of bash scripts and Docker running on a server; it's something like ECS, which comes with a lot of the benefits but a hell of a lot less complexity.
FWIW I’ve been using ECS at my current work (previously K8s) and to me it feels just flat worse:
- only some of the features
- none of the community
- all of the complexity but none of the upsides.
It was genuinely a bit shocking that it was considered a serious product seeing as how chaotic it was.
Can you elaborate on some of the issues you faced? I was considering deploying to ECS fargate as we are all-in on AWS.
Any kind of git-ops style deployment was out.
ECS merges “AWS config” and “app/deployment config” together, so it was difficult to separate what should go in TF from what is runtime app configuration. In comparison, this is basically trivial out of the box with K8s.
I personally found a lot of the moving parts and names needlessly confusing. Tasks e.g. were not your equivalent to “Deployment”.
Want to just deploy something like Prometheus Agent? Well, too bad, the networking doesn’t work the same, so here’s some overly complicated guide where you have to deploy some extra stuff which will no doubt not work right the first dozen times you try. Admittedly, Prom can be a right pain to manage, but the fact that ECS makes you do _extra_ work on top of an already fiddly piece of software left a bad taste in my mouth.
I think ECS gets a lot of airtime because of Fargate, but you can use Fargate on K8s these days, or, if you can afford the small increase in initial setup complexity, you can just have Fargate's less-expensive, less-restrictive, better sibling: Karpenter on Spot instances.
I think the initial setup complexity is less with ECS personally, and the ongoing maintenance cost is significantly worse on K8s when you run anything serious which leads to people taking shortcuts.
Every time you have a cluster upgrade with K8s there’s a risk something breaks. For any product at scale, you’re likely to be using things like Istio and Metricbeat. You have a whole level of complexity in adding auth to your cluster on top of your existing SSO for the cloud provider. We’ve had to spend quite some time changing the plugin for AKS/EntraID recently which has also meant a change in workflow for users. Upgrading clusters can break things since plenty of stuff (less these days) lives in beta namespaces, and there’s no LTS.
Again, it’s less bad than it was, but many core things live(d) in plugins for clusters which have a risk of breaking when you upgrade cluster.
My view was that the initial startup cost for ECS is lower and once it’s done, that’s kind of it - it’s stable and doesn’t change. With K8s it’s much more a moving target, and it requires someone to actively be maintaining it, which takes time.
In a small team I don’t think that cost and complexity is worth it - there are so many more concepts that you have to learn even on top of the cloud specific ones. It requires a real level of expertise so if you try and adopt it without someone who’s already worked with it for some time you can end up in a real mess
If your workloads are fairly static, ECS is fine. Bringing up new containers and nodes takes ages with very little feedback as to what's going on. It's very frustrating when iterating on workloads.
Also fargate is very expensive and inflexible. If you fit the narrow particular use case it's quicker for bringing up workloads, but you pay extra for it.
Can confirm. I've used ECS with Fargate successfully at multiple companies. Some eventually outgrew it. Some failed first. Some continue to use ECS happily.
Regardless of the outcome, it always felt more important to keep things simple and focus on product and business needs.
It sometimes blows my mind how reductionist and simplistic a world-view it's possible to have and yet still attain some degree of success.
Shovels and mechanical excavators both exist and have a place on a building site. If you talk to a workman he may well tell you he has a regular hammer with him at all times, but will use a sledgehammer and even rent a pile driver on occasion if the task demands it.
And yet somehow we as software engineers are supposed to restrict ourselves to The One True Tool[tm] (which varies based on time and fashion) and use it for everything. It's such an obviously dumb approach that even people who do basic manual labour realise its shortcomings. Sometimes they will use a forklift truck to move things, sometimes an HGV, sometimes they will put things in a wheelbarrow and sometimes they will carry them by hand. But us? No. Sophisticated engineers as we are there is One Way and it doesn't matter if you're a 3 person startup or you're Google, if you deploy once per year to a single big server or multiple times per day to a farm of thousands of hosts you're supposed to do it that one way no matter what.
The real rule is this: Use your judgement.
You're supposed to be smart. You're supposed to be good. Be good. Figure out what's actually going on and how best to solve the problems in your situation. Don't rely on everyone else to tell you what to do or blindly apply "best practices" invented by someone who doesn't know a thing about what you're trying to do. Yes, consider the experiences of others and learn from their mistakes where possible, but use your own goddamn brain and skill. That's why they pay you the big bucks.
I think one thing that is under appreciated with kubernetes is how massive the package library is. It becomes trivial to stand up basically every open source project with a single command via helm. It gets a lot of hate but for medium sized deployments, it’s fantastic.
Before Helm, just trying to run third-party containers on bare metal resulted in constant downtime when the process would just hang for no reason, and an engineer would have to SSH in and manually restart the instance.
We used this at a previous startup to host Metabase, Sentry, and Airbyte seamlessly on our own cluster, which let us break out of the constant price increases we faced for hosted versions of those products.
Shameless plug: I’ve been building https://github.com/czhu12/canine to try to make Kubernetes easier to use for solo developers. Would love any feedback from anyone looking to deploy something new to K8s!
Right, but this isn't a post about why K8s is good, it's a post about why K8s is effectively mandatory, and it isn't, which is why the post rankles some people.
Yeah, I mostly agree. I'd even add that K8s YAMLs are not trivial to maintain, especially if you need to have them produced by a templating engine.
They become trivial once you stop templating them with a text templating engine.
They are serialized JSON objects; the YAML is there just because raw JSON is not user friendly when you need something done quick and dirty or need to include comments.
Proper templating should never use text templating on manifests.
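For the curious, the usual non-text-templating routes look something like this (names and tags are placeholders; both kubectl's generators and kustomize operate on structured objects rather than strings):

    # Generate a manifest as a structured object rather than string-pasting YAML
    kubectl create deployment myapp \
      --image=registry.example.com/myapp:1.0.0 \
      --dry-run=client -o yaml > deployment.yaml

    # Layer per-environment changes with kustomize, which patches objects, not text
    cat > kustomization.yaml <<'EOF'
    resources:
      - deployment.yaml
    images:
      - name: registry.example.com/myapp
        newTag: "1.0.1"
    EOF
    kubectl kustomize .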
> Inevitably, you find a reason to expand to a second server.
The author has some good points, but not every project needs multiple servers for the same reasons as a typical Kubernetes setup. In many scenarios those servers are dedicated to separate tasks.
For example, you can have a separate server for a redundant copy of your application layer, one server for load balancing and caching, one or more servers for the database, another for backups, and none of these servers requires anything more than separate Docker Compose configs for each server.
I'm not saying that Kubernetes is a bad idea, even for the hypothetical setup above, but you don't necessarily need advanced service discovery tools for every workload.
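As a sketch of what that looks like in practice, the database host in the hypothetical setup above might carry nothing more than its own compose file (all names and paths made up):

    # The database host's whole job, as a single per-server compose file
    cat > /srv/db/docker-compose.yml <<'EOF'
    services:
      postgres:
        image: postgres:16
        restart: always
        env_file:
          - /srv/db/postgres.env
        volumes:
          - /srv/db/data:/var/lib/postgresql/data
        ports:
          - "5432:5432"
    EOF
    docker compose -f /srv/db/docker-compose.yml up -d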
I don’t think scale is the only consideration for using Kubernetes. The ops overhead in managing traditional infrastructure, especially if you’re a large enterprise, drops massively if you really buy into cloud native. Kubernetes converges application orchestration, job scheduling, scaling, monitoring/observability, networking, load balancing, certificate management, storage management, compute provisioning - and more. In a typical enterprise, doing all this requires multiple teams. Changes are request driven and take forever. Operating systems need to be patched. This all happens after hours and costs time and money. When properly implemented and backed by the right level of stakeholder, I’ve seen orgs move to business day maintenance, while gaining the confidence to release during peak times. It’s not just about scale, it’s about converging traditional infra practices into a single, declarative and eventually consistent platform that handles it all for you.
> Spawning containers, of course, requires you to mount the Docker socket in your web app, which is wildly insecure
Dear friend, you are not a systems programmer
To expand on this, the author is describing the so-called "Docker-out-of-Docker" (DooD) pattern, i.e. exposing Docker's Unix socket into the container. Since Docker was designed to work remotely (the CLI can run on a different machine than the DOCKER_HOST daemon), this works fine, but it essentially negates all isolation.
For many years now, all major container runtimes have supported nesting. Some make it easy (podman and runc just work), some make it hard (systemd-nspawn requires setting many flags to work nested). This is called "Docker-in-Docker" (DinD).
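To make the DooD risk concrete, a rough sketch of what the pattern looks like from inside a container that has the host's socket mounted, using the Python docker SDK (the SDK choice and the image are just illustrative, not anything prescribed above):

    import docker  # pip install docker

    # Inside a container started with
    #   -v /var/run/docker.sock:/var/run/docker.sock
    # the "remote" daemon is really the host's own daemon.
    client = docker.DockerClient(base_url="unix://var/run/docker.sock")

    # Anything holding the socket can launch arbitrary sibling containers on the
    # host (including privileged ones with host paths mounted), which is why
    # DooD hands out root-equivalent access rather than an isolated sandbox.
    client.containers.run("alpine:3", "echo hello from the host daemon", remove=True)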
FreeBSD has supported nesting of jails natively since version 8.0, which dates back to 2009.
I prefer FreeBSD to K8s.
I think we need to distinguish between two cases:
For a hobby project, using Docker Compose or Podman combined with systemd and some shell scripts is perfectly fine. You’re the only one responsible, and you have the freedom to choose whatever works best for you.
However, in a company setting, things are quite different. Your boss may assign you new tasks that could require writing a lot of custom scripts. This can become a problem for other team members and contractors, as such scripts are often undocumented and don’t follow industry standards.
In this case, I would recommend using Kubernetes (k8s), but only if the company has a dedicated Kubernetes team with an established on-call rotation. Alternatively, I suggest leveraging a managed cloud service like ECS Fargate to handle container orchestration.
There’s also strong competition in the "Container as a Service" (CaaS) space, with smaller and more cost-effective options available if you prefer to avoid the major cloud providers. Overall, these CaaS solutions require far less maintenance compared to managing your own cluster.
> dedicated Kubernetes team with an established on-call rotation.
Using EKS or GKE is basically this. K8s is much nicer than ECS in terms of development and packaging your own apps.
How would you feel if bash scripts were replaced with Ansible playbooks?
At a previous job at a teeny startup, each instance of the environment was a docker-compose setup on a VPS. It worked great, but they're starting to get a bunch of new clients, and some of them need fully independent instances of the app.
Deployment gets harder with every instance because it’s just a pile of bash scripts on each server. My old coworkers have to run a build for each instance for every deploy.
None of us had used ansible, which seems like it could be a solution. It would be a new headache to learn, but it seems like less of a headache than kubernetes!
Ansible is better than Bash if your goals include:
* Automating repetitive tasks across many servers.
* Ensuring idempotent configurations (e.g., setting up web servers, installing packages consistently).
* Managing infrastructure as code for better version control and collaboration.
* Orchestrating complex workflows that involve multiple steps or dependencies.
However, Ansible is not a container orchestrator.
Kubernetes (K8s) provides capabilities that Ansible or Docker-Compose cannot match. While Docker-Compose only supports a basic subset, Kubernetes offers:
* Advanced orchestration features, such as rolling updates, health checks, scaling, and self-healing.
* Automatic maintenance of the desired state for running workloads.
* Restarting failed containers, rescheduling pods, and replacing unhealthy nodes.
* Horizontal pod auto-scaling based on metrics (e.g., CPU, memory, or custom metrics).
* Continuous monitoring and reconciliation of the actual state with the desired state.
* Immediate application of changes to bring resources to the desired configuration.
* Service discovery via DNS and automatic load balancing across pods.
* Native support for Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) for storage management.
* Abstraction of storage providers, supporting local, cloud, and network storage.
If you need these features but are concerned about the complexity of Kubernetes, consider using a managed Kubernetes service like GKE or EKS to simplify deployment and management. Alternatively, and this is my preferred option, combining Terraform with a Container-as-a-Service (CaaS) platform allows the provider to handle most of the operational complexity for you.
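To illustrate the desired-state point from the list above: a rolling update is just a patch to the Deployment, and the control loop does the rest. A rough sketch with the official kubernetes Python client; the deployment name, namespace and image tag are placeholders.

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()

    # Declare the new desired state: same Deployment, new image tag.
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "web", "image": "registry.example.com/web:1.2.4"}
                    ]
                }
            }
        }
    }

    # Kubernetes rolls pods over gradually, honoring readiness probes and the
    # configured deployment strategy; `kubectl rollout undo` reverses it.
    apps.patch_namespaced_deployment(name="web", namespace="default", body=patch)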
Ansible ultimately runs scripts, in parallel, in a defined order across machines. It can help a lot, but it's subject to a lot of the same state bitrot issues as a pile of shell scripts.
Up until a few thousand instances, a well designed setup should be a part time job for a couple of people.
Up to that scale you can write a custom orchestrator that is likely to be smaller and simpler than the equivalent K8s setup. Been there, done that.
Yes, but people just cannot comprehend the complexity of it. Even my academic professor for my FYP, back when I was an undergrad, has since reverted to Docker Compose, citing that the integration is so convoluted that developing for it is very difficult. That's why I'm aiming to cut down the complexity of Kubernetes with a low-friction, turnkey solution, but I guess the angel investors in Hong Kong aren't buying into it yet. I'm still aiming to try again in 2 years, when I can at least get a complete MVP (I don't like to present imperfect stuff; either you just have the idea or you give me the full product, not half-baked shit).
Like, okay, if that's how you see it, but what's with the tone and content?
The tone's vapidity is only comparable to the content's.
This reads like mocking the target audience rather than showing them how you can help.
A write up that took said "pile of shell scripts that do not work" and showed how to "make it work" with your technology of choice would have been more interesting than whatever this is.
One can build a better container orchestrator than Kubernetes; things don't need to be that complex.
I was using some ansible playbooks to deploy a web app to production. One day the scripts stopped working because of a boring error about a Python version mismatch.
I rewrote all the deployment scripts in bash (took less than an hour) and never had a problem since.
Moral: it's hard to find the right tool for the job.
For my own websites, I host everything on a single $20/month Hetzner instance using https://dokploy.com/ and I'm never going back.
Why do I feel this is not so simple as the compiler scenario?
I've seen a lot of "piles of YAML", even contributed to some. There were some good projects that didn't end up in disaster, but to me the same could be said for the shell.
I was very scared of K8s for a long time, then we started using it and it's actually great. Much less complex than its reputation suggests.
I had the exact opposite experience. I had a Cloud Run app in GCP and experimented with moving it to k8s, and I was astonished at the amount of new complexity I had to manage.
I thought k8s might be a solution, so I decided to learn by doing. It quickly became obvious that we didn't need 90% of its capabilities, but more importantly, it would put an undue load/training burden on the rest of the team. It was a lot more sensible to write custom orchestration using the docker API; that part was straightforward.
Experimenting with k8s was very much worthwhile. It's an amazing thing and was in many ways inspirational. But using it would have been swimming against the tide, so to speak. So sure, I built a mini k8s-lite; it's better for us and fits us better than wrapping docker compose.
My only doubt is whether I should have used podman instead but at the time podman seemed to be in an odd place (3-4 years ago now). Though it'd be quite easy to switch now it hardly seems worthwhile.
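For anyone wondering what "custom orchestration using the docker API" amounts to in the small, the core of it is a reconcile loop. A bare-bones sketch with the Python docker SDK; the container names are placeholders, and a real version would add backoff, health checks and logging.

    import time
    import docker  # pip install docker

    client = docker.from_env()

    # Containers we consider "ours"; in practice this would come from config.
    MANAGED = {"web", "worker"}

    while True:
        for c in client.containers.list(all=True):
            if c.name in MANAGED and c.status in ("exited", "dead"):
                print(f"restarting {c.name} (status={c.status})")
                c.restart()
        time.sleep(10)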
You did a no-SQL, you did a serverless, you did a micro-services. This makes it abundantly clear you do not understand the nature of your architectural patterns and the multiplicity of your offenses.
I wish the world hadn't consolidated around Kubernetes. Rancher was fantastic. Did what 95% of us need, and dead simple to add and manage services.
Did you find Rancher v2 (which uses Kubernetes instead of their own Cattle system) is worse?
Started with a large shell script; the next iteration was written in Go and less specific. I still think that for some things, k8s is just too much.
https://github.com/mildred/conductor.go/
Friend, you built a J2EE server
I am 100% sure that the author of this post has never "built a kubernetes", holds at least one kubernetes cert, and maybe even works for a company that sells kubernetes products and services. Never been more certain of anything in my life. You could go point by point but it's just so tiring arguing with these people. Like, the whole "who will maintain these scripts when you go on vacation" my brother in christ have you seen the kubernetes setups some of these people invent? They are not easier to be read into, this much is absolute. At least a shell script has a chance of encoding all of its behavior in the one file, versus putting a third of its behavior in helm variables, a third in poorly-named and documented YAML keys, and a third in some "manifest orchestrator reconciler service deployment system" that's six major versions behind an open source project that no one knows who maintains anymore because their critical developer was a Belarusian 10x'er who got mad about a code of conduct that asked him to stop mispronouning contributors.
Swarm Mode ftw!
For the uninitiated: how does k8s handle OS upgrades? If development moves to the next version of Debian, because it should eventually, are upgrades, for example, 2x harder vs docker-compose? 2x easier? About the same? Is it even the right question to ask?
It doesn't. The usual approach is to create new nodes with the updated OS, migrate all workloads over and then throw away the old ones
Are you talking about upgrades of the host OS or the base of the image? I think you are talking about the latter. Others covered updating the host.
Upgrades of the Docker image are done by pushing a new image, updating the Deployment to use the new image, and applying it. Kubernetes will start new containers for the new image and, when they are running, kill off the old containers. There should be no interruption. It isn't any different than a normal deploy.
Your cluster consists of multiple machines ('nodes'). Upgrading is as simple as adding a new, upgraded node, then evicting everything from one of the existing nodes, then take it down. Repeat until every node is replaced.
Downtime is the same as with a deployment, so if you run at least 2 copies of everything there should be no downtime.
As for updating the images of your containers, you build them again with the newer base image, then deploy.
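In API terms, "evict everything from one of the existing nodes" is a cordon followed by evictions; in practice `kubectl drain <node>` does this for you (handling DaemonSets, local storage and PodDisruptionBudgets). A rough sketch with the kubernetes Python client, with the node name invented:

    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    NODE = "old-node-1"  # placeholder

    # 1. Cordon: mark the node unschedulable so nothing new lands on it.
    core.patch_node(NODE, {"spec": {"unschedulable": True}})

    # 2. Evict each pod; its Deployment/StatefulSet recreates it elsewhere,
    #    and the eviction API respects PodDisruptionBudgets.
    pods = core.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={NODE}")
    for pod in pods.items:
        body = client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name, namespace=pod.metadata.namespace
            )
        )
        core.create_namespaced_pod_eviction(
            name=pod.metadata.name, namespace=pod.metadata.namespace, body=body
        )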
>Tired, you parameterize your deploy script and configure firewall rules, distracted from the crucial features you should be working on and shipping.
Where's your Sysop?
i'm at this crossroads right now. somebody talk me out of deploying a dagster etl on azure kubernetes service rather than deploying all of the pieces onto azure container apps with my own bespoke scripts / config
writing this out helped me re-validate what i need to do
what did you decide to do?
Dear friend, you have made a slippery slope argument.
Yes, because the whole situation is a slippery slope (only upwards). In the initial state, k8s is obviously overkill; in the end state, k8s is obviously adequate.
The problem is choosing the point of transition, and allocating resources for said transition. Sometimes it's easier to allocate a small chunk to update your bespoke script right now instead of sinking more to a proper migration. It's a typical dilemma of taking debt vs paying upfront.
(BTW the same dilemma exists with running in the cloud vs running on bare metal; the only time when a migration from the cloud is easy is the beginning, when it does not make financial sense.)
Odds are you have 100 DAUs and your "end state" is an "our incredible journey" blog post. I understand that people want to pad their resume with buzzwords on the way, but I don't accept making a virtue out of it.
Exactly. Don't start with k8s unless you're already comfortable troubleshooting it at 3am half asleep. Start with one of the things you're comfortable with. Among these things, apply YAGNI liberally, only making certain that you're not going to paint yourself into a corner.
Then, if and when you've become so large that the previous thing has become painful and k8s started looking like a really right tool for the job, allocate time and resources, plan a transition, implement it smoothly. If you have grown to such a size, you must have had a few such transitions in your architecture and infrastructure already, and learned to handle them.
Dear friend, you should first look into using Nomad or Kamal deploy instead of K8S
You mean the rugpull-stack? "Pray we do not alter the deal further when the investors really grumble" https://github.com/hashicorp/nomad/blob/v1.9.3/LICENSE
As for Kamal, I shudder to think of the hubris required to say "pfft, haproxy is for lamez, how hard can it be to make my own lb?!" https://github.com/basecamp/kamal-proxy
Why add complexity when many services don't even need horizontal scaling? Servers are powerful enough that, if you're not stupid enough to write horrible code, a single one is fine for millions of requests a day without much work.
Infra person here, this is such the wrong take.
> Do I really need a separate solution for deployment, rolling updates, rollbacks, and scaling.
Yes it's called an ASG.
> Inevitably, you find a reason to expand to a second server.
ALB, target group, ASG, done.
> Who will know about those undocumented sysctl edits you made on the VM
You put all your modifications and CIS benchmark tweaks in a repo and build a new AMI off it every night. Patching is switching the AMI and triggering a rolling update.
> The inscrutable iptables rules
These are security groups, lord have mercy on anyone who thinks k8s network policy is simple.
> One of your team members suggests connecting the servers with Tailscale: an overlay network with service discovery
Nobody does this, you're in AWS. If you use separate VPCs you can peer them but generally it's just editing some security groups and target groups. k8s is forced into needing to overlay on an already virtual network because they need to address pods rather than VMs, when VMs are your unit you're just doing basic networking.
You reach for k8s when you need control loops beyond what ASGs can provide. The magic of k8s is "continuous terraform"; you will know when you need it, and you likely never will. If your infra moves from one static config to another static config on deploy (by far the usual case), then skipping k8s is fine.
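For comparison with the k8s flow, the "switch the AMI and trigger a rolling update" step above is roughly two API calls. A boto3 sketch, assuming the ASG's launch template specification points at $Latest; the template name, ASG name and AMI ID are placeholders.

    import boto3

    ec2 = boto3.client("ec2")
    autoscaling = boto3.client("autoscaling")

    # Point the launch template at the freshly baked AMI.
    ec2.create_launch_template_version(
        LaunchTemplateName="web-lt",
        SourceVersion="$Latest",
        LaunchTemplateData={"ImageId": "ami-0123456789abcdef0"},
    )

    # Roll the group: instances are replaced in batches while capacity stays up.
    autoscaling.start_instance_refresh(
        AutoScalingGroupName="web-asg",
        Preferences={"MinHealthyPercentage": 90, "InstanceWarmup": 120},
    )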
You'd be swapping an open-source, vendor-independent API for a cloud-specific, vendor-locked one. And paying more for the "privilege".
I mean that's the sales pitch but it's really not vendor independent in practice. We have a mountain of EKS specific code. It would be easier for me to migrate our apps that use ASGs than to migrate our charts. AWS's API isn't actually all that special, they're just modeling the datacenter in code. Anywhere you migrate to will have all the same primitives because the underlying infrastructure is basically the same.
EKS isn't any cheaper either, from experience, and in hindsight of course it isn't: it's backed by the same things you would deploy without EKS, just with another layer. The dream of gains from "OS overhead" and efficient tight-packed pod scheduling doesn't match the reality that our VMs are already right-sized for our workloads and aren't sitting idle. You can't squeeze that much water from the stone even in theory, and in practice k8s comes with its own overhead.
Another reason to use k8s is the original:
When you deploy on physical hardware, not VMs, or have to otherwise optimize maximum utilization out of gear you have.
Especially since sometimes the cloud just means hemorrhaging money in comparison to something else, particularly with ASGs.
We found that the savings from switching from VMs in ASGs to k8s never really materialized. OS overhead wasn't actually that much and once you're requesting cpu / memory you can't fit as many pods per host as you think.
Plus you're competing with hypervisors at maxing out hardware, and those are rock-solid stable.
My experience was quite the opposite, but it depends very much on the workload.
That is, I wasn't saying the competition was between AWS ASGs and k8s running on EC2, but about already having a certain amount of capacity that you want to max out in flexible ways.
You don't need to use an overlay network. Calico works just fine without an overlay.
I'm sure the American Sewing Guild is fantastic, but how do they help here?
ASG = Auto-Scaling Group
https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-s...
Even without needing to spawn additional Docker containers, I think people are more afraid of Kubernetes than is warranted. If you use a managed K8s service like the ones Azure, AWS, GCP, and tons of others provide, it's... pretty simple and pretty bulletproof, assuming you're doing simple stuff with it (i.e. running a standard web app).
The docs for K8s are incredibly bad for solo devs or small teams, and introduce you to a lot of unnecessary complexity upfront that you just don't need: the docs seem to be written with megacorps in mind who have teams managing large infrastructure migrations with existing, complex needs. To get started on a new project with K8s, you just need a pretty simple set of YAML files:
1. An "ingress" YAML file that defines the ports you listen to for the outside world (typically port 80), and how you listen to them. Using Helm, the K8s package manager, you can install a simple default Nginx-based ingress with minimal config. You probably were going to put Nginx/Caddy/etc in front of your app anyway, so why not do it this way?
2. A "service" YAML file that allocates some internal port mapping used for your web application (i.e. what port do you listen on within the cluster's network, and what port should that map to for the container).
3. A "deployment" YAML file that sets up some number of containers inside your service.
And that's it. As necessary you can start opting into more features: for example, you can add health checks to your deployment file so that K8s auto-restarts your containers when they die, and you can add deployment strategies there as well, such as rolling deployments and limits on how many new containers can be started before old ones are killed during the deploy. You can add resource requests and limits, e.g. make sure my app has at least 500MB RAM, and kill+restart it if it crosses 1GB. But it's actually really simple to get started! I think it compares pretty well even to the modern Heroku replacements like Fly.io.

It's just that the docs are bad and the reputation is that it's complicated, and a large part of that reputation comes from existing teams who try to do a large migration and who have very complex needs that have evolved over time. K8s generally is flexible enough to support even those complex needs, but... it's gonna be complex if you have them. For new projects, it really isn't. Part of the reason other platforms are viewed as simpler, IMO, is just that they lack so many features that teams with complex needs don't bother trying to migrate (and thus never complain about how complicated it is to do complicated things with them).
You can have Claude or ChatGPT walk you through a lot of this stuff though, and thereby get an easier introduction than having to pore through the pretty corporate official docs. And since K8s supports both YAML and JSON, in my opinion it's worth just generating JSON using whatever programming language you already use for your app; it'll help reduce some of the verbosity of YAML.
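In that spirit (generate JSON from your own language instead of hand-writing YAML), the whole three-file starter above fits in one short script. This is only a sketch: the app name, image, ports, probe path and resource numbers are placeholders, and the ingress assumes an nginx ingress controller was installed via Helm as described in point 1.

    import json

    APP, IMAGE, PORT = "myapp", "registry.example.com/myapp:1.0.0", 8080

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": APP},
        "spec": {
            "replicas": 2,
            "selector": {"matchLabels": {"app": APP}},
            "template": {
                "metadata": {"labels": {"app": APP}},
                "spec": {"containers": [{
                    "name": APP,
                    "image": IMAGE,
                    "ports": [{"containerPort": PORT}],
                    # The optional extras mentioned above: health check + limits.
                    "readinessProbe": {"httpGet": {"path": "/healthz", "port": PORT}},
                    "resources": {
                        "requests": {"memory": "500Mi"},
                        "limits": {"memory": "1Gi"},
                    },
                }]},
            },
        },
    }

    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": APP},
        "spec": {
            "selector": {"app": APP},
            "ports": [{"port": 80, "targetPort": PORT}],
        },
    }

    ingress = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": APP},
        "spec": {
            "ingressClassName": "nginx",
            "rules": [{"http": {"paths": [{
                "path": "/",
                "pathType": "Prefix",
                "backend": {"service": {"name": APP, "port": {"number": 80}}},
            }]}}],
        },
    }

    # kubectl apply -f accepts a JSON List just as happily as YAML files.
    manifest = {"apiVersion": "v1", "kind": "List",
                "items": [deployment, service, ingress]}
    print(json.dumps(manifest, indent=2))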
What you're saying is that starting a service on Kubernetes as a dev is OK; what other people are saying is that operating a k8s cluster is hard.
Unless I’m mistaken the managed kubernetes instances were introduced by cloud vendors because regular people couldn’t run kubernetes clusters reliably, and when they went wrong they couldn’t fix them.
Where I am, since cloud is not an option (large megacorp with regulatory constraints), they've decided to run their own k8s cluster. It doesn't work well, it's hard to debug, and they don't know why it doesn't work.
Now if you have the right people or can have your cluster managed for you, I guess it’s a different story.
Most megacorps use AWS. It's regrettable that your company can't, but that's pretty atypical. Using AWS Kubernetes is easy and simple.
Not sure why you think this is just "as a dev" rather than operating in production — K8s is much more battle-hardened than someone's random shell scripts.
Personally, I've run K8s clusters for a Very Large tech megacorp (not using managed clusters; we ran the clusters ourselves). It was honestly pretty easy, but we were very experienced infra engineers, and I wouldn't recommend doing it for startups or new projects. However, most startups and new projects will be running in the cloud, and you might as well use managed K8s: it's simple.
> Most megacorps use AWS. It's regrettable that your company can't, but that's pretty atypical.
Even then, it seems like you can run EKS yourself:
https://github.com/aws/eks-anywhere
"EKS Anywhere is free, open source software that you can download, install on your existing hardware, and run in your own data centers."
(Never done it myself, no idea if it's a good option)
Now compare cloud bills.
k8s is the API. Forget the implementation, it's really not that important.
Folks that get tied up in the "complexity" argument are forever missing the point.
The thing the k8s API does is force you to follow good practices; that's it.
Dear Friend,
This fascination with this new garbage-collected language from a Santa Clara vendor is perplexing. You’ve built yourself a COBOL system by another name.
/s
I love the “untested” criticism in a lot of these use-k8s screeds, and also the suggestion that they’re hanging together because of one guy. The implicit criticism is that doing your own engineering is bad, really, you should follow the crowd.
Here’s a counterpoint.
Sometimes just writing YAML is enough. Sometimes it's not. For example, there are times when managed k8s is just not on the table, say because of compliance or business issues. Then you have to think about self-managed k8s. That's rather hard to do well. And often, you don't need all of that complexity.
Yet — sometimes availability and accountability reasons mean that you need to have a really deep understanding of your stack.
And in those cases, having the engineering capability to orchestrate isolated workloads, move them around, resize them, monitor them, etc is imperative — and engineering capability means understanding the code, fixing bugs, improving the system. Not just writing YAML.
It's shockingly inexpensive to get this started with a two-pizza team that understands Linux well. You do need a couple of really good, experienced engineers to start this off though. Onboarding newcomers is relatively easy: there are plenty of mid-career candidates and you'll find talent at many LUGs.
But yes, a lot of orgs won’t want to commit to this because they don’t want that engineering capability. But a few do - and having that capability really pays off in the ownership the team can take for the platform.
For the orgs that do invest in the engineering capability, the benefit isn’t just a well-running platform, it’s having access to a team of engineers who feel they can deal with anything the business throws at them. And really, creating that high-performing trusted team is the end-goal, it really pays off for all sorts of things. Especially when you start cross-pollinating your other teams.
This is definitely not for everyone though!