Ephemeral Infrastructure: Why Short-Lived Is a Good Thing

What does "compute" mean here? It is supposed to be a verb, but here seems to be a noun. I have recently started seeing it apparently mean "computational capacity" but here it seems to mean "an instance of a virtualized computer installation." All this verbing is confusing. Excuse me, I have an eat.

I've written this about four times for two employers and two clients: ABC: Always Be Cycling

The basic premise is to encode behavior, whether through lifecycle rules or a cron job, such that every instance is cycled after at most 7 days, and there is always an instance being cycled (with some cooldown period, of course).
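A minimal sketch of that selection logic, assuming a one-at-a-time policy that always replaces the oldest instance once the cooldown has passed. The `Instance` type, the one-hour `COOLDOWN`, and the function names are hypothetical; only the 7-day cap comes from the comment above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Sequence

MAX_AGE = timedelta(days=7)     # from the comment: no instance older than 7 days
COOLDOWN = timedelta(hours=1)   # assumed value; tune to how fast your fleet recovers


@dataclass(frozen=True)
class Instance:
    instance_id: str
    launched_at: datetime


def next_to_cycle(
    instances: Sequence[Instance],
    last_cycle_at: Optional[datetime],
    now: datetime,
) -> Optional[Instance]:
    """Return the instance to replace now, or None while cooling down."""
    if not instances:
        return None
    if last_cycle_at is not None and now - last_cycle_at < COOLDOWN:
        return None
    # Always cycle the oldest instance, even if it is younger than MAX_AGE,
    # so that something is always being cycled.
    return min(instances, key=lambda i: i.launched_at)


def overdue(instances: Sequence[Instance], now: datetime) -> List[Instance]:
    """Instances that already exceed MAX_AGE; a non-empty result is worth alerting on."""
    return [i for i in instances if now - i.launched_at > MAX_AGE]
```

Run from a cron job or scheduler, this replaces the whole fleet roughly every `len(instances) * COOLDOWN`, so the cooldown has to be short enough that this stays under 7 days.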

It has never not improved overall system stability and in a few cases even decreased costs significantly.

I do appreciate the way Kubernetes forces you to plan for instance failure from the beginning, and that it creates standards on how to deal with it.

However, I feel like this article really glosses over the challenge of stateful workloads by simply handing over that responsibility to the cloud providers.

A lot of us have to run our own servers in our own datacenters for various reasons, so we have to solve that problem ourselves.

Luckily, the same principles apply to stateful workloads; it is just more challenging. You have to plan for instance failures while still preserving your data.

Even more luckily, the tools for this have gotten better and better. Various database controllers are getting much better at handling clustering and failover for you, so you can handle instances and nodes going down without losing data and without having to outsource the management to the cloud.

This seems to be rediscovering "pets vs. cattle."

Less effectively too (depending on how you travel). In most hotel rooms I'm there for a couple of days minimum, most normally a vacation week. I settle in, move the chair somewhere sensible, unpack clothes and chargers, and set up for the short term. The poster seems to be talking about very short-lived instances that you can kill at any time. I'm never able to leave a hotel room at a moment's notice - that's where my stuff is...

Pets vs. cattle seems much clearer: cattle are there to be culled. You feed them and look after them, but you don't get attached. If the herd has a weak member, you kill it.

I'd be a heartless farmer, but that analogy radically improved my infrastructure.

Pets vs. cattle is a more generic term that's applied to lots of things (originally file naming, later server naming); ephemeral infrastructure is the specific technical term for throwing away your infrastructure and replacing it with a copy.

For me it feels like this: everything is stateful by default, or for convenience. Building robust systems is in part about containing statefulness, confining it to as few parts as possible. That buys you some time and capacity. Yet the toughest problems often arise in the stateful parts of the system, as well as in quasi-stateless parts that sometimes develop hidden statefulness (think of syncing web client and server state). So being good at handling stateful systems is valuable. Maybe one should even embrace statefulness. However, the AWS Solutions Architect will tell you otherwise.

I think most of us learned this from an early age - computer systems often degrade as they keep running and need to be reset from time to time.

I remember when I had my first desktop PC at home (Windows 95) and it would need a fresh install of Windows every so often as things went off the rails.

This has got to be a failure of early Windows versions -- I've had systems online for 5+ years without needing to be restarted, updating and restarting the software running on them without service interruption. RAID storage makes hotswapping failing drives easy, which is the most common part needing periodic replacement.

Yes. With Windows 3.x there wasn’t a lot to go wrong that couldn’t be fixed in a single ini file. Windows 95 through ME was a complete shitshow where many, many things could go wrong, and the fastest path to fixing it was a fresh install.

Windows XP largely made that irrelevant, and Windows 7 made it almost completely irrelevant.

This only applies to Windows and I think you're referencing desktops.

Ten years ago, I think the rule of thumb was an uptime of no more than 6 months, but for different reasons (Windows Server...).

On Solaris, Linux, the BSDs, etc., a restart is only necessary for maintenance. Literally. I think my longest-uptime production system was a SPARC Postgres system under sustained high load, with an uptime of around 6 years.

With cloud infra, people have forgotten just how stable the Unixken are.

All software is ephemeral on a human timescale, isn't it?

https://successfulsoftware.net/2013/03/24/ephemeral/

Some classes of software, as the author rightly points out, yeah.

But there is much out there that still runs the exact same software that was written 20, 30 or even more years ago. Personally, it's most noticeable with the musical instruments I use, where most of the synthesizers are still running the exact same software they launched with, give or take some recreational hacks.

Once you control both the hardware and software, things become a lot easier (yet not problem-free). I'm sure there is more stuff out there than synthesizers that is similar, from signage to 3D printing firmware. I still come across random (important) devices running Windows XP out there in the wild.

>most of the synthesizers are still running the exact same software as they launched with

Is that software, or firmware though?

Another benefit is that you can provision ephemeral resources with an identity that has an expiration to match the resource’s lifecycle. Then, you don’t need to figure out rotation at all, just redeploy with a newly minted identity included.
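To illustrate the idea, here is a toy sketch of an identity whose expiry is set when the resource is provisioned. The HMAC-signed token format and the names `mint_identity` and `verify_identity` are assumptions made for this example; in practice you would use your platform's own short-lived credential mechanism rather than rolling your own.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def mint_identity(secret: bytes, resource_id: str, now: float, ttl_seconds: float) -> str:
    """Issue a signed token that expires when the resource is due to be replaced."""
    payload = json.dumps({"sub": resource_id, "exp": now + ttl_seconds}).encode()
    sig = hmac.new(secret, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(sig).decode()
    )


def verify_identity(secret: bytes, token: str, now: float) -> Optional[str]:
    """Return the resource id if the token is authentic and unexpired, else None."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(payload)
    if now >= claims["exp"]:
        return None  # expired: no rotation needed, just redeploy with a new identity
    return claims["sub"]
```

Setting `ttl_seconds` slightly above the instance's maximum lifetime means the credential dies with the resource, and the next deploy simply mints a fresh one.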

Nice post. One more thing to keep in mind with your StatefulSets is how long the service running in the pod takes to come back up. Many services scan their on-disk state for integrity and perform recovery tasks on startup. These can take a while, and the overall service is in a degraded state in the meantime.

Manage these things and any stateful distributed service can run easily in Kubernetes.
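One way to account for slow recovery, sketched here outside of any real Kubernetes API: restart replicas one at a time and refuse to touch the next one until the previous one reports ready. The `restart` and `is_ready` callbacks, the replica names, and the default timeout are hypothetical placeholders for whatever your tooling provides.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    replicas: Iterable[str],
    restart: Callable[[str], None],
    is_ready: Callable[[str], bool],
    timeout_s: float = 600.0,   # assumed budget for on-disk recovery
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Restart replicas sequentially, waiting for each to recover before moving on."""
    done: List[str] = []
    for name in replicas:
        restart(name)
        deadline = clock() + timeout_s
        while not is_ready(name):
            if clock() >= deadline:
                # Stop here so at most one replica is degraded at a time.
                raise TimeoutError(f"{name} did not recover within {timeout_s}s")
            sleep(poll_s)
        done.append(name)
    return done
```

The timeout should be sized from how long integrity scans and recovery actually take on your data, not from how long the process takes to start.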

Have been doing this in production for years now with Cluster-API + Talos.

When I update the Kubernetes or Talos version, new nodes are created, and once the existing pods have been rescheduled onto the new nodes, the old nodes are deleted.

Works pretty well.