We were wrong about GPUs

> The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs. System engineers may have smart, fussy opinions on how to get their models loaded with CUDA, and what the best GPU is. But software developers don’t care about any of that. When a software developer shipping an app comes looking for a way for their app to deliver prompts to an LLM, you can’t just give them a GPU.

I'm increasingly coming to the view that there is a big split among "software developers" and AI is exacerbating it. There's an (increasingly small) group of software developers who don't like "magic" and want to understand where their code is running and what it's doing. These developers gravitate toward open source solutions like Kubernetes, and often just want to rent a VPS or at most a managed K8s solution. The other group (increasingly large) just wants to `git push` and be done with it, and they're willing to spend a lot of (usually their employer's) money to have that experience. They don't want to have to understand DNS, linux, or anything else beyond whatever framework they are using.

A company like fly.io absolutely appeals to the latter. GPU instances at this point are very much appealing to the former. I think you have to treat these two markets very differently from a marketing and product perspective. Even though they both write code, they are otherwise radically different. You can sell the latter group a lot of abstractions and automations without them needing to know any details, but the former group will care very much about the details.

7 days agofreedomben

> There's an (increasingly small) group of software developers who don't like "magic" and want to understand where their code is running and what it's doing. These developers gravitate toward open source solutions like Kubernetes

Kubernetes is not the first thing that comes to mind when I think of "understanding where their code is running and what it's doing"...

7 days agonicoburns

Indeed, I have to wonder how many people actually understand Kubernetes. Not just as a “user”, but exactly what it is doing behind the scenes…

Just an “idle” Kubernetes system is a behemoth to comprehend…

7 days agoBobbyTables2

I keep seeing this opinion and I don't understand it. For various reasons, I recently transitioned from a dev role to running a 60+ node, 14+ PB bare metal cluster. 3 years in, and the only thing ever giving me trouble is Ceph.

Kubernetes is etcd, apiserver, and controllers. That's exactly as many components as your average MVC app. The control-loop thing is interesting, and there are a few "kinds" of resources to get used to, but why is it always presented as this insurmountable complexity?

I ran into a VXLAN checksum offload kernel bug once, but otherwise this thing is just solid. Sure it's a lot of YAML but I don't understand the rep.

7 days agoremram

“etcd, apiserver, and controllers.”

…and containerd and csi plugins and kubelet and cni plugins and kubectl and kube-proxy and ingresses and load balancers…

7 days agodocandrew

And system calls and filesystems and sockets and LVM and...

Sure, at some point there are too many layers to count, but I wouldn't say any of this is "Kubernetes". What people tend to be hung up on is the difficulty of Kubernetes compared to `docker run` or `docker compose up`. That is what I am surprised about.

I never had any issue with kubelet, or kube-proxy, or CSI plugins, or CNI plugins. That is after years of running a multi-tenant cluster in a research institution. I think about those about as much as I think about ext4, runc, or GRUB.

7 days agoremram

But you just said that you had issues with Ceph? How is that not a CSI problem?

And CNI problems are extremely normal. Pretty much anyone that didn't just use weavenet and call it a day has had to spend quite a bit of time to figure it out. If you already know networking by heart it's obviously going to be easier, but few devs do.

7 days agoffsm8

Never had a problem with the CSI plugin, I had problems with the Ceph cluster itself. No, I wouldn't call Ceph part of Kubernetes.

You definitely can run Kubernetes without running Ceph or any storage system, and you already rely on a distributed storage system if you use the cloud whether you use Kubernetes or not. So I wouldn't count this as added complexity from Kubernetes.

7 days agoremram

I'm not sure I can agree with that interpretation. CSI is basically an interface that has to be implemented.

If you discount issues like that, you can safely say that it's impossible to have any issues with CSI, because it's always going to be with one of its implementations.

That feels a little disingenuous, but maybe that's just me.

7 days agoffsm8

So if you run Kubernetes in the cloud, you consider the entire cloud provider's block storage implementation to be part of Kubernetes too?

For example you'd say AWS EBS is part of Kubernetes?

7 days agoremram

In the context of this discussion, which is about the complexity of the k8s stack: yes.

You're ultimately gonna have to use storage of some form unless you're just running a stateless service or keeping the services with state out of k8s. That's why I'd include it, and the fact that you can use multiple storage backends, each with their own challenges and pitfalls, makes k8s indeed quite complex.

You could argue that a multinode PaaS is always going to be complex, and frankly, I'd agree with that. But that was kinda the original point, at least as far as I interpreted it: k8s is not simple and you most likely don't need it either. But if you do need a distributed PaaS, then it's probably a good idea to use it. Doesn't change the fact that it's a complex system.

6 days agoffsm8

So you're comparing Kubernetes to what? Not running services at all? In that case I agree: you're going to have to set up Linux, find a storage solution, etc. as part of your setup. Then write your app. It's a lot of work.

But would I say that your entire Linux installation and the cloud it runs on is part of Kubernetes? No.

6 days agoremram

> So you're comparing Kubernetes to what? Not running services at all?

Surprisingly, there were hosted services on the internet prior to Kubernetes existing. Hell, I even have reason to believe that the internet may possibly predate Docker.

6 days agolovich

That is my point! If you think "just using SystemD services in a VM" is easy but "Kubernetes is hard", and you say Kubernetes is hard because of Linux, cgroups, cloud storage, mount namespaces, ... then I can't comprehend that argument, because those are things that exist in both solutions.

Let's be clear on what we're comparing or we can't argue at all. Kubernetes is hard if you have never seen a computer before, I will happily concede that.

6 days agoremram

ah I apologize for my snark then, I interpreted your sentence as _you_ believing that the only step simpler than using Kubernetes was to not have an application running

I see how you were asking the GP that question now

5 days agolovich

Next you’re going to claim the internet existed before Google too.

6 days agotechpression

There are various options around for simple alternatives; the simplest is probably just running a single node.

Maybe with failover for high availability.

Even that's fine for most deployments that aren't social media sites, aren't developed by multiple teams of devs and don't have any operations people on payroll.

6 days agoffsm8

Because CSI is just a way to connect a volume to a pod.

Ceph is its own cluster of kettles filled with fishes

7 days agop_l

Very fair, although with managed services which are increasingly available, you don't typically need to think about CSI or CNI.

7 days agofreedomben

Hence

> Kubernetes is not the first thing that comes to mind when I think of "understanding where their code is running and what it's doing"...

7 days agocuu508

CSI and CNI do about as much magic as `docker volume` and `docker network`.

People act like their web framework and SQL connection pooler and stuff are so simple, while Kubernetes is complex and totally inscrutable for mortals, and I don't get it. It has a couple of moving parts, but it is probably simpler overall than SystemD.

7 days agoremram

I was genuinely surprised that k8s turned out to actually be pretty straightforward and very sensible after years of never having anything to do with it and just hearing about it on the net. Turns out opinions are just that: opinions, after all.

That being said, what people tend to build on top of that foundation is a somewhat different story.

7 days agoformerly_proven

It’s not k8s. It’s distributed systems.

Unfortunately people (cough managers) think k8s is some magic that makes distributed systems problems go away, and automagically enables unlimited scalability

In reality it just makes the mechanics a little easier and centralized

Getting distributed systems right is usually difficult

7 days agoDrFalkyn

I asked chatgpt the other day to explain to me Kubernetes. I still don't understand it. Can you share with me what clicked with you, or resources that helped you?

7 days agoabustamam

Controller in charge of a specific type of object watches a database table representing the object type. Database table represents the desired state of things. When entries to the table are CRUD-ed, that represents a change to the desired state of things. Controller interacts with the larger system to bring the state of things into alignment with the new desired state of things.

"The larger system" is more controllers in charge of other object types, doing the same kind of work for its object types

There is an API implemented for CRUD-ing each object type. The API specification (model) represents something important to developers, like a group of containers (Pod), a load balancer with VIP (Service), a network volume (PersistentVolume), and so on.

Hand wave hand wave, Lego-style infrastructure.

None of the above is exactly correct (e.g. the DB is actually a k/v store), but it should be conceptually correct.
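A deliberately minimal sketch of that control loop in plain Go may help; the resource names are made up, a map stands in for the k/v store, and a dumb poll replaces the watches and work queues real controllers use:

```go
package main

import (
	"fmt"
	"time"
)

// desired is what users CRUD through the API server (here: name -> replica count).
// actual is what is really running in the cluster.
var desired = map[string]int{"web": 3, "worker": 2}
var actual = map[string]int{"web": 1}

// reconcile compares desired state with actual state and acts to close the gap.
func reconcile() {
	for name, want := range desired {
		have := actual[name]
		switch {
		case have < want:
			fmt.Printf("%s: have %d, want %d -> starting %d\n", name, have, want, want-have)
			actual[name] = want
		case have > want:
			fmt.Printf("%s: have %d, want %d -> stopping %d\n", name, have, want, have-want)
			actual[name] = want
		}
	}
	// anything still running that is no longer desired gets cleaned up
	for name := range actual {
		if _, ok := desired[name]; !ok {
			fmt.Printf("%s: no longer desired -> deleting\n", name)
			delete(actual, name)
		}
	}
}

func main() {
	// real controllers are event-driven with periodic resyncs;
	// a poll loop is enough to show the shape of the idea
	for i := 0; i < 3; i++ {
		reconcile()
		time.Sleep(100 * time.Millisecond)
	}
}
```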

7 days agoamazingman

Is there only a single controller? What happens if it goes down?

If there are multiple controllers, how do they coordinate?

6 days agoDrFalkyn

>Is there only a single controller ?

No, there are many controllers. Each is in charge of the object types it is in charge of.

>What happens if [it] goes down?

CRUD of the object types it manages has no effect until the controller returns to service.

>If multiple controllers, how do they coordinate ?

The database is the source of truth. If one controller needs to "coordinate" with another, it will CRUD entries of the object types those other controllers are responsible for. e.g. Deployments beget ReplicaSets beget Pods.
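A deliberately simplified sketch of that coordination pattern in plain Go (made-up types and names, nothing like the real controller code): the "deployment controller" only writes a replicaset record, the "replicaset controller" only reads that record and writes pod records, and neither ever calls the other.

```go
package main

import "fmt"

// Toy types, not the real API objects: controllers never call each other,
// they only read and write records in the shared store.
type store struct {
	deployments map[string]int // name -> desired replicas
	replicasets map[string]int // name -> desired replicas
	pods        []string
}

// The "deployment controller" fulfils a deployment by writing a replicaset record.
func deploymentController(s *store) {
	for name, replicas := range s.deployments {
		s.replicasets[name+"-rs"] = replicas
	}
}

// The "replicaset controller" never looks at deployments at all; it only
// reads replicaset records and creates pod records to match them
// (this toy version only handles a single replicaset).
func replicaSetController(s *store) {
	for name, replicas := range s.replicasets {
		for i := len(s.pods); i < replicas; i++ {
			s.pods = append(s.pods, fmt.Sprintf("%s-pod-%d", name, i))
		}
	}
}

func main() {
	s := &store{
		deployments: map[string]int{"web": 3},
		replicasets: map[string]int{},
	}
	deploymentController(s)
	replicaSetController(s)
	fmt.Println(s.pods) // [web-rs-pod-0 web-rs-pod-1 web-rs-pod-2]
}
```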

5 days agoamazingman

The k/v store offers primitives to make that happen, but for non-critical controllers you don't want to deal with things like that: they can go down and will be restarted (locally by kubelet/containerd) or rescheduled. Whatever resource they monitor will just not be touched until they get restarted.

6 days agochronid

What clicked with me is having ChatGPT go line by line through all of the YAML files generated for a simple web app—WordPress on Kubernetes. Doing that, I realized that Kubernetes basically takes a set of instructions on how to run your app and then follows them.

So, take an app like WordPress that you want to make “highly available.” Let’s imagine it’s a very popular blog or a newspaper website that needs to serve millions of pages a day. What would you do without Kubernetes?

Without Kubernetes, you would get yourself a cluster of, let’s say, four servers—one database server, two worker servers running PHP and Apache to handle the WordPress code, and finally, a front-end load balancer/static content host running Nginx (or similar) to take incoming traffic and route it to one of the two worker PHP servers. You would set up all of your servers, network them, install all dependencies, load your database with data, and you’d be ready to rock.

If all of a sudden an article goes viral and you get 10x your usual traffic, you may need to quickly bring online a few more worker PHP nodes. If this happens regularly, you might keep two extra nodes in reserve and spin them up when traffic hits certain limits or your worker nodes’ load exceeds a given threshold. You may even write some custom code to do that automatically. I’ve done all that in the pre-Kubernetes days. It’s not bad, honestly, but Kubernetes just solves a lot of these problems for you in an automated way. Think of it as a framework for your hosting infrastructure.

On Kubernetes, you would take the same WordPress app and split it into the same four functional blocks. Each would become a container. It can be a Docker container or a Containerd container—as long as it’s compatible with the Open Container Initiative, it doesn’t really matter. A container is just a set of files defining a lightweight Linux virtual machine. It’s lightweight because it shares its kernel with the underlying host it eventually runs on, so only the code you are actually running really loads into memory on the host server.

You don’t really care about the kernel your PHP runs on, do you? That’s the idea behind containers—each process runs in its own Linux virtual machine, but it’s relatively efficient because only the code you are actually running is loaded, while the rest is shared with the host. I called these things virtual machines, but in practice they are just jailed and isolated processes running on the host kernel. No actual hardware emulation takes place, which makes it very light on resources.

Just like you don’t care about the kernel your PHP runs on, you don’t really care about much else related to the Linux installation that surrounds your PHP interpreter and your code, as long as it’s secure and it works. To that end, the developer community has created a large set of container templates or images that you can use. For instance, there is a container specifically for running Apache and PHP—it only has those two things loaded and nothing else. So all you have to do is grab that container template, add your code and a few setting changes if needed, and you’re off to the races.

You can make those config changes and tell Kubernetes where to copy and place your code files using YAML files. And that’s really it. If you read the YAML files carefully, line by line, you’ll realize that they are nothing more than a highly specialized way of communicating the same type of instructions you would write to a deployment engineer in an email when telling them how to deploy your code.

It’s basically a set of instructions to take a specific container image, load code into it, apply given settings, spool it up, monitor the load on the cluster, and if the load is too high, add more nodes to the cluster using the same steps. If the load is too low, spool down some nodes to save money.
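To make the "it's just instructions" point concrete, here is a rough sketch of what one of those blocks (the PHP/Apache workers) boils down to, written with the official Kubernetes Go API types and printed as JSON, which is the same structure you would normally hand to kubectl as YAML. The names, replica count, and image tag are placeholders, and the snippet assumes a module with k8s.io/api and k8s.io/apimachinery as dependencies:

```go
package main

import (
	"encoding/json"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	replicas := int32(2) // the two PHP/Apache workers from the example above

	deploy := appsv1.Deployment{
		TypeMeta:   metav1.TypeMeta{APIVersion: "apps/v1", Kind: "Deployment"},
		ObjectMeta: metav1.ObjectMeta{Name: "wordpress-php"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "wordpress-php"}},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "wordpress-php"}},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "php",
						Image: "wordpress:php8.2-apache", // placeholder image tag
						Ports: []corev1.ContainerPort{{ContainerPort: 80}},
					}},
				},
			},
		},
	}

	// The YAML you hand to kubectl is this same structure, just serialized.
	out, _ := json.MarshalIndent(deploy, "", "  ")
	fmt.Println(string(out))
}
```

Strip the Go syntax away and it reads like the email you would have sent to a deployment engineer: run two copies of this image and expose port 80.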

So, in theory, Kubernetes was supposed to replace an expensive deployment engineer. In practice, it simply shifted the work to an expensive Kubernetes engineer instead. The benefit is automation and the ability to leverage community-standard Linux templates that are (supposedly) secure from the start. The downside is that you are now running several layers of abstraction—all because Unix/Linux in the past had a very unhealthy disdain for statically linked code. Kubernetes is the price we pay for those bad decisions of the 1980s. But isn’t that just how the world works in general? We’re all suffering the consequences of the utter tragedy of the 1980s—but that’s a story for another day.

6 days agox0054

> People act like their web framework and SQL connection pooler and stuff are so simple

I'm just sitting here wondering why we need 100 billion transistors to move a piece of tape left and right ;)

7 days agojrockway

Well, and the fact that in addition to Kubernetes itself, there are a gazillion adjacent products and options in the cloud-native space. Many/most of which a relatively simple setup may not need. But there's a lot of complexity.

But then there's always a lot of complexity and abstraction. Certainly, most software people don't need to know everything about what a CPU is doing at the lowest levels.

7 days agoghaff

These components are very different in complexity and scope. Let's be real: a seasoned developer is mostly familiar with load balancers and ingress controllers, so this will be mostly about naming and context. I agree, though, that once you learn about k8s it becomes less mysterious, but that also means the author hasn't pushed it to its limits. Outages in the control plane can be pretty nasty, and it is easy to have them by creating the illusion that everything is kind of free in k8s.

7 days agoigmor

A really simple setup for many smaller organisations wouldn't have a load balancer at all.

7 days agonicoburns

No load balancer means... entering one node only? Doing DNS RR over all the nodes? If you don't have a load balancer in front, why are you even using Kubernetes? Deploy a single VM and call it a day!

I mean, in my homelab I do have Kubernetes and no LB in front, but it's a homelab, for fun and to learn K8s internals. But in a professional environment...

7 days agodarkwater

No code at all even - just use excel

7 days agodilyevsky


typical how to program an owl:

step one: draw a circle

step two: import the rest of the owl

7 days agozeroq

... and kubernetes networking, service mesh, secrets management

6 days agodonutshop

You aren't forced to use a service mesh or complex secrets management schemes. If you add them to the cluster, it's because you value what they offer you. It's the same thing as Kubernetes itself; I'm not sure what people are complaining about: if you don't need what Kubernetes offers, just don't use it.

Go back to good ol' corosync/pacemaker clusters with XML and custom scripts to migrate IPs and set up firewall rules (and if you have someone writing them for you, why don't you have people managing your k8s clusters?).

Or buy something from a cloud provider that "just works" and eventually go down in flames with their Indian call centers doing their best but with limited access to engineering to understand why service X is misbehaving for you and trashing your customer's data. It's trade-offs all the way.

6 days agochronid

> …and containerd and csi plugins and kubelet and cni plugins (...)

Do you understand you're referring to optional components and add-ons?

> and kubectl

You mean the command line interface that you optionally use if you choose to do so?

> and kube-proxy and ingresses and load balancers…

Do you understand you're referring to whole classes of applications you run on top of Kubernetes?

I get it that you're trying to make a mountain out of a mole hill. Just understand that you can't argue that something is complex by giving as your best examples a bunch of things that aren't really tied to it.

It's like trying to claim Windows is hard, and then your best example is showing a screenshot of AutoCAD.

7 days agomotorest

How are kubelet and CNI “optional components”? What do you mean by that?

7 days agoallarm

CNI is optional, you can have workloads bind ports on the host rather than use an overlay network (though CNI plugins and kube-proxy are extremely simple and reliable in my experience, they use VXLAN and iptables which are built into the kernel and that you already use in any organization who might run a cluster, or the basic building blocks of your cloud provider).

CSI is optional, you can just not use persistent storage (use the S3 API or whatever) or declare persistentvolumes that are bound to a single or group of machines (shared NFS mount or whatever).

I don't know how GP thinks you could run without the other bits though. You do need kubelet and a container runtime.

7 days agoremram

kubelet isn't, but CNI technically is (or can be abstracted to minimum, I think old network support might have been removed from kubelet nowadays)

7 days agop_l

Because the root comment is mostly but not quite right: there is indeed a large subset of developers who aren't interested in thinking about infrastructure, but there are many subcategories of those people, and many of them aren't fly.io customers. A large number of people who are in that category aren't happy to let someone else handle their infra. They're not interested in infra in the sense that they don't believe it should be more complicated than "start process on Linux box and set up firewall and log rotation".

For some applications these people are absolutely right, but they've persuaded themselves that that means it's the best way to handle all use cases, which makes them see Kubernetes as way more complex than is necessary, rather than as a roll-your-own ECS for those who would otherwise truly need a cloud provider.

7 days agololinder

Feels like software engineers are talking past each other a lot about these topics.

I assume everyone wants to be in control of their environment. But with so many ways to compose your infra that means a lot of different things for different people.

7 days agoworldsayshi

I use k8s; I wouldn't call it simple, but there are ways to minimize the complexity of your setup. Mostly, what devs see as complexity is that k8s packages a lot of system fundamentals, like networking, storage, name resolution, distributed architectures, etc., and if you mainly spent your career in a single lane, k8s becomes impossible to grasp. Not saying those devs are wrong; not everyone needs to be a networking pro.

K8s is meant to be operated by some class of engineers, and used by another. Just like you have DBAs, sysadmins, etc, maybe your devops should have more system experience besides terraform.

7 days agofigassis

"Kubernetes is etcd, apiserver, and controllers....Sure it's a lot of YAML but I don't understand the rep."

Sir, I upvoted you for your wonderful sense of humour.

7 days agolenkite

I consider a '60+ node' Kubernetes cluster very small. Kubernetes at that scale is genuinely excellent! At 6000, 60000, and 600000 nodes it becomes very different and goes from 'Hey, this is pretty great' to 'What have I done?' The maintenance cost of running more than a hundred clusters is incredibly nontrivial, especially as a lot of folks end up taking something open-source and thinking they can definitely do a lot better (you can... there's a lot of "but"s there though).

7 days agochucky_z

OK, but if you think Kubernetes is too much magic and you want to operate hundreds of clusters with tens of thousands of nodes, the alternative is?

Some bash and Ansible and EC2? That is usually what Kubernetes haters suggest one does to simplify.

7 days agofmbb

At a certain scale, let's say 100k+ nodes, you magically run into 'it depends.' It can be kubernetes! It can be bash, ansible, and ec2! It can be a custom-built vm scheduler built on libvirt! It can be a monster fleet of Windows hyper-v hosts! Heck, you could even use Mesos, Docker Swarm, Hashicorp Nomad, et al.

The main pain point I personally see is that everyone goes 'just use Kubernetes' and this is an answer, however it is not the answer. Its steamrolling of all conversations leads to a lot of the frustration around it, in my view.

6 days agochucky_z

Hashicorp Nomad, Docker Swarm, Apache Mesos, AWS ECS?

I love that the Kubernetes lovers tend to forget that Kubernetes is just one tool, and they believe that the only possible alternative to this coolness is sweaty sysadmins writing bash scripts in a dark room.

6 days agozsoltkacsandi

I’m absolutely not a Kubernetes lover. Bash and Ansible etc. is just a very common suggestion from haters.

I thought Mesos was kinda dead nowadays, good to hear it’s still kicking. Last time I used it, the networking was a bit annoying: it wasn’t able to provide virtual network interfaces, only ports.

It seems like if you are going to operate these things, picking a solution with a huge community and in active development feels like the smart thing to do.

Nomad is very nice to use from a developer perspective, and it’s nice to hear infrastructure people preferring it. From the outside, the reason people pick Kubernetes seems to be the level of control infra and security teams want over things like networking and disk.

5 days agofmbb

Can you describe who a Kubernetes hater is? Or show me an example. It's easy to stigmatise someone as a Kubernetes lover or hater, and then use that to invalidate their arguments.

I would argue against Kubernetes in particular situations, and even recommend Ansible in some cases where it is a better fit in the given circumstances. Do you consider me a Kubernetes hater?

Point is, Kubernetes is a great tool. In particular situations. Ansible is a great tool. In particular situations. Even bash is a great tool. In particular situations. But Kubernetes even could be the worst tool if you choose unwisely. And Kubernetes is not the ultimate infrastructure tool. There are alternatives, and there will be new ones.

5 days agozsoltkacsandi

HashiCorp Nomad?

7 days agoukuina

The wheels fall off Kubernetes at around 10k nodes. From my experience, one of the main limitations is etcd; Google recently fixed this problem by making Spanner offer an etcd-compatible API: https://cloud.google.com/blog/products/containers-kubernetes...

Etcd is truly a horrible data store, even the creator thinks so.

6 days ago__turbobrew__

At that point you probably need a cluster of k8s clusters, no?

For anyone unfamiliar with this the "official limits" are here, and as of 1.32 it's 5000 nodes, max 300k containers, etc.

https://kubernetes.io/docs/setup/best-practices/cluster-larg...

6 days agopas

Yes, this is what I'm referring to. :)

Maintaining a lot of clusters is super different than maintaining one cluster.

Also please don't actually try to get near those limits, your etcd cluster will be very sad unless you're _very_ careful (think few deployments, few services, few namespaces, not using etcd events, etc).

6 days agochucky_z

Hey fellow k8s+ceph on bare metaler! We only have a 13 machine rack and 350tb of raw storage. No major issues with ceph after 16.x and all nvme storage though.

7 days agocullenking

Genuinely curious about what sort of business stores and processes 14 PB on a 60 node cluster.

7 days agospratzt

Research institution.

The department saw more need for storage than Kubernetes compute so that's what we're growing. Nowadays you can get storage machines with 1 PB in them.

7 days agoremram

Yeah, that's an interesting question, because it sounds like a ton of data vs not enough compute, but, aside from this all being in a SAN or large storage array:

The larger Supermicro or Quanta storage servers can easily handle 36 HDDs each, or even more.

So with just 16 of those with 36x24TB disks, that meets the ~14PB capacity mark, leaving 44 remaining nodes for other compute task, load balancing, NVME clusters, etc.

7 days agosuperq

We have boxes with up to 45 drives yes.

7 days agoremram

Yeah, I'm sure there are tricky details as in anything, but the core idea doesn't sound that complicated to me. I've been looking into it a bit after seeing this fun video a while ago where a DOS BBS is run on Kubernetes.

https://youtu.be/wLVHXn79l8M?si=U2FexAMKd3zQVA82

7 days agojimmaswell

I think "core" kubernetes is actually pretty easy to understand. You have the kubelet, which just cares about getting pods running, which it does by using pretty standard container tech. You bootstrap a cluster by reading the specs for the cluster control plane pods from disk, after which the kubelet will start polling the API it just started for more of the same. The control plane then takes care of scheduling more pods to the kubelets that have joined the cluster. Pods can run controllers that watch the API for other kinds of resources, but one way or another, most of those get eventually turned into Pod specs that get assigned to a kubelet to run.

Cluster networking can sometimes get pretty mind-bending, but honestly that's true of just containers on their own.

I think just that ability to schedule pods on its own requires about that level of complexity; you're not going to get a much simpler system if you try to implement things yourself. Most of the complexity in k8s comes from components layered on top of that core, but then again, once you start adding features, any custom solution will also grow more complex.

If there's one legitimate complaint when it comes to k8s complexity, it's the ad-hoc way annotations get used to control behaviour in a way that isn't discoverable or type-checked like API objects are, and you just have to be aware that they could exist and affect how things behave. A huge benefit of k8s for me is its built-in discoverability, and annotations hurt that quite a bit.

7 days agochousuke

Well, the point is you don't have to understand it all at the same time. Kubernetes really just codifies concepts that people were doing before. And it sits on the same foundations (Linux, IP, DNS etc). People writing apps didn't understand the whole thing before, just as they don't now. But at some level these boxes are plugged into each other. A bad system would be one where people writing business software have to care about what box is plugged into what. That's absolutely not the case with Kubernetes.

7 days agoglobular-toast

> Indeed, I have to wonder how many people actually understand Kubernetes. Not just as a “user”, but exactly what it is doing behind the scenes…

I would ask a different question. How many people actually need to understand implementation details of Kubernetes?

Look at any company. They pay engineers to maintain a web app/backend/mobile app. They want features to be rolled out, and they want their services to be up. At which point does anyone say "we need an expert who actually understands Kubernetes"?

7 days agomotorest

When they get paged three nights in a row and can’t figure out why.

7 days agobaq

> I have to wonder how many people actually understand Kubernetes.

I have to wonder how many people actually understand when to use K8s or docker. Docker is not a magic bullet, and can actually be a foot gun when it's not the right solution.

6 days agodylan604

I have been at this compute thing since 1986, with a focus mostly on distributed systems since 2000, and I keep my Kubernetes cheat sheet always close.

7 days agopjmlp

> Kubernetes is not the first thing that comes to mind when I think of "understanding where their code is running and what it's doing"...

In the end it's a scheduler for Docker containers on a bunch of virtual or bare metal machines. Once you get that in your head, life becomes much easier.

The only thing I'd really love to see from an ops perspective is a way to force-revive crashed containers for debugging. Yes, one shouldn't have to debug cattle, just haul the carcass off and get a new one... but I still prefer to know why the cattle died.

7 days agomschuster91

Yeah. In the whole cattle/pet discourse, the fact that you need to take some cattle to the vet for diagnosis got lost. Very operator-centric thinking; I get where it’s coming from, but it went a bit too far.

7 days agobaq

One may think Kubernetes is complex (I agree), but I haven't seen an alternative that simultaneously allows you to:

* Host hundreds or thousands of interacting containers across multiple teams in a sane manner

* Manage and understand how that is done, to the full extent

Of course there are tons of organizations that can (and should) easily give up one of these, but if you need both, there isn't a better choice right now.

7 days agodzikimarian

But how many orgs need that scale?

7 days agoapitman

Something I've discovered is that if you're a small team doing something new, off the shelf products/platforms are almost certainly not optimized to your use case.

What looks like absurd scale to one team is a regular Tuesday for another, because "scale" is completely meaningless without context. We don't balk at a single machine running dozens of processes for a single web browser, we shouldn't balk at something running dozens of containers to do something that creates value somehow. And scale that up by number of devs/customers and you can see how thousands/hundreds of thousands can happen easily.

Also the cloud vendors make it easy to have these problems because it's super profitable.

7 days agoduped

You can run single-node k3s on a VM with 512MB of RAM and deploy your app with a hundred lines of JSON, and it inherits a ton of useful features that are managed in one place and can grow with your app if/as needed. These discussions always go in circles between Haters and Advocates:

* H: "kubernetes [at planetary scale] is too complex"

* A: "you can run it on a toaster and it's simpler to reason about than systemd + pile of bash scripts"

* H: "what's the point of single node kubernetes? I'll just SSH in and paste my bash script and call it a day"

* A: "but how do you scale/maintain that?"

* H: "who needs that scale?"

7 days agothemgt

The sad thing is there probably is a toaster out there somewhere with 512MB of RAM.

6 days agoapitman

It's not sad until it becomes self-aware.

6 days agomkzetta

A very small percentage of orgs, a not-as-small percentage of developers, and at the higher end of the value scale, the percentage is not small at all.

7 days agotrashtester

I think the developers who care about knowing how their code works tend to not want hyperscale setups anyway.

If they understood their system, odds are they’d realize that horizontal scaling with fewer, larger services is plenty scalable.

At those large orgs, the individual developer doesn’t matter at all and the EMs will opt for faster release cycles and rely on internal platform teams to manage k8s and things like it.

7 days agodartos

Exact opposite - k8s allows developers to actually tailor containers/pods/deployments themselves, instead of opening tickets to have it configured on a VM by the platform team.

Of course there are simpler container runtimes, but they have issues with scale, cost, features or transparency of operation. Of course they can be a good fit if you're willing to give up one or more of these.

7 days agodzikimarian

> k8s allows developers to actually tailor containers/pods/deployments themselves

Yes, complex tools tend to be powerful.

But when I say “devs who care about knowing how their code works” I’m also referring to their tools.

K8s isn’t incomprehensible, but it is very complex, especially if you haven’t worked in devops before.

“Devs who care…”, I would assume, would opt for simpler tools.

I know I would.

7 days agodartos

We're almost 100 devs in a few teams - works well. There's a bunch of companies of our size even in the same city.

What's a bit different is we're creating our own products, not renting people out to others, so having a uniform hosting platform is an actual benefit.

7 days agodzikimarian

Most of the ones that are profitable for cloud providers.

7 days agoefitz

> Host hundreds or thousands of interacting containers across multiple teams in sane manner

I mean, if that's your starting point, then complexity is absolutely a given. When folks complain about the complexity of Kubernetes, they are usually complaining about the complexity relative to a project that runs a frontend, a backend, and a postgres instance...

6 days agoswiftcoder

In my last job we ran centralized clusters for all teams. They got X namespaces for their applications, and we made sure they could connect to the databases (handled by another team, though there were discussions of moving them onto dedicated clusters). We had a basic configuration setup for them and offered "internal consultants" to help them onboard. We handled maintenance, upgrades and, if needed, migrations between clusters.

We did not have a cluster just for a single application (with some exceptions, because those applications were incredibly massive in pod numbers and/or had patterns that required custom handling and pre-emptive autoscaling, which we wrote code for!).

Why are so many companies running a cluster for each application? That's madness.

6 days agochronid

I mean, a bunch of companies that have deployed Kubernetes only have 1 application :)

I migrated one such firm off Kubernetes last year, because for their use case it just wasn't worth it - keeping the cluster upgraded and patched, and their CI/CD pipelines working was taking as much IT effort as the rest of their development process

6 days agoswiftcoder

I agree with the blog post that using K8s + containers for GPU virtualization is a security disaster waiting to happen. Even if you configure your container right (which is extremely hard to do), you don't get seccomp-bpf.

People started using K8s for training, where you already had a network isolated cluster. Extending the K8s+container pattern to multi-tenant environments is scary at best.

I didn't understand the following part though.

> Instead, we burned months trying (and ultimately failing) to get Nvidia’s host drivers working to map virtualized GPUs into Intel Cloud Hypervisor.

Why was this part so hard? Doing PCI passthrough with the Cloud Hypervisor (CH) is relatively common. Was it the transition from Firecracker to CH that was tricky?

7 days agoozgune

This has actually brought up an interesting point. Kubernetes is nothing more than an API interface. Should someone be working on building a multi-tenant Kubernetes (so that customers don't need to manage nodes or clusters) which enforces VM-level security (obviously you cannot safely co-locate multiple tenants' containers on the same VM)?

7 days agohuntaub

Yeah, I think this really exemplifies the "everyone more specialized than me doesn't get the bigger picture, and everyone less specialized than me is wasting their time" trope. Developers who don't want to deal with the nitty gritty in one area are dealing with it in another area. Everyone has 24 hours in a day.

7 days agopost-it

The difference between a good developer and a bad one is understanding the stack. Not necessarily being an expert, but I spend a lot of time debugging random issues, and it could be DNS or a file locking issue or a network or an API or parsing EDI, whatever. Most recently I found a bug in software that had to do with how Windows runs 32 bit mode on 64 bit. I've never used Windows professionally and I have only had Unix machines since I got a free Ubuntu CD. Yet I figured it out in like 20 minutes exploring the differences between the paths when running in two scenarios. Idk, maybe I'm a genius, I don't think so, but I was able to solve the problem because I know just barely enough about enough things to poke shit and break them or make them light up. Compare that to a dev on my team who needed help writing a series of command line prompts to do a simple bit of textual adjustments and pipe some data around.

I'm not even a good developer. But I know enough to chime in on calls and provide useful and generally 'Wizarding' knowledge. Like a detective with a good hunch.

But yeah just autocomplete everything lol

7 days agocalvinmorrison

It's great that you were able to debug that. It may have come at an opportunity cost of being able to solve some more specialized problem within your domain.

In my job I develop a React Native app. I also need to have a decent understanding of iOS and Android native code. If I run into a bug related to how iOS runs 32 bit vs 64 bit software? Not my problem, we'll open a ticket with Apple and block the ticket in our system.

7 days agopost-it

I guess I never have enough leverage to order Apple to fix stuff. I'm like water and gravity. It's just a random example though, and I agree you do give up a lot by being a generalist. However, most of us don't do really new or hard problems. It's a lot of spaghetti.

7 days agocalvinmorrison

I don't think of it as spaghetti but as messy plumbing.

7 days agoiFire

I don't disagree with you, but I do think it's important to acknowledge that this approach requires someone else to do it. If you're at a big company where there are tons of specialists, then perhaps this is just fine because there is someone available to do it for you. If you find yourself in a different situation, however, where you don't have that other specialist, you could end up significantly blocked for a period of time. If whatever you're working on is not important and can afford to be blocked, then again no problem, but I've been in many situations where what I was doing absolutely had to work and had to work on a timetable. If I had to offload the work to someone else because I wasn't capable, it would have meant disaster.

7 days agofreedomben

> we'll open a ticket with Apple and block the ticket in our system.

Wouldn't it be annoying to be blocked on Apple rather than shipping on your schedule?

7 days agomwcampbell

If we're blocked on Apple, so is everyone else. A key consideration in shipping high-level software is to avoid using niche features that the vendor might ignore if they're broken.

6 days agopost-it

Great opportunity for someone ballsy to write a book about kubernetes internals for the general engineering population.

Bonus points for writing a basic implementation from first principles capturing the essence of the problem kubernetes really was meant to solve.

The hundred-page Kubernetes book, Andriy Burkov style.

7 days agoarijo

You might be interested in this:

https://github.com/kelseyhightower/kubernetes-the-hard-way

It probably won't answer the "why" (although any LLM can answer that nowadays), but it will definitely answer the "how".

7 days agolaggyluke

I actually took the time to read the tutorial and found it helpful.

Thanks for taking the time to share the walk through.

7 days agoarijo

That's nice but I was looking more for a simple implementation of the concept from first principles.

I mean an understanding from the view of the internals and not so much the user perspective.

7 days agoarijo

Kubernetes in Action book is very good.

7 days agoadhamsalama

I actually have the book and I agree it is very good.

7 days agoarijo

> Great opportunity for someone ballsy to write a book about kubernetes internals for the general engineering population.

What would be the interest of it? Think about it:

- kubernetes is an interface and not a specific implementation,

- the bulk of the industry standardized on managed services, which means you actually have no idea what are the actual internals driving your services,

- so you read up on the exact function call that handles a specific aspect of pod auto scaling. That was a nice read. How does that make you a better engineer than those who didn't?

7 days agomotorest

I don't really care about the standardized interface.

I just want to know how you'd implement something that would load your services and dependencies from a config file, bind them all together, distribute the load across several local VMs and still make it work if I kill the service or increase the load.

In less than 1000 lines.

7 days agoarijo

> I don't really care about the standardized interface.

Then you seem to be confused, because you're saying Kubernetes but what you're actually talking about is implementing a toy container orchestrator.

6 days agomotorest

This is one of the truest comments I have ever read on here

7 days agostevenfoster

I almost started laughing at the same comment. Kubernetes is the last place to know what your code is doing. A VM or bare metal is more practical for the persona that OP described. The git pushers might want the container on k8s

6 days agostogot

If you have a system that's actually big or complex enough to warrant using Kubernetes, which, to be frank, isn't really that much considering the realities of production, the only thing more complex than Kubernetes is implementing the same concepts but half-assed.

I really wonder why this opinion is so commonly accepted by everyone. I get that not everything needs most Kubernetes features, but it's useful. The Linux kernel is a dreadfully complex beast, full of winding subsystems and full of screaming demons all over: eBPF, namespaces, io_uring, cgroups, SELinux, so much more, all interacting with each other in sometimes surprising ways.

I suspect there is a decent likelihood that a lot of sysadmins have a more complete understanding of what's going on in Kubernetes than in Linux.

7 days agojchw

> If you have a system that's actually big or complex enough to warrant using Kubernetes (...)

I think there's a degree of confusion over your understanding of what Kubernetes is.

Kubernetes is a platform to run containerized applications. Originally it started as a way to simplify the work of putting together clusters of COTS hardware, but since then its popularity drove it to become the platform instead of an abstraction over other platforms.

What this means is that Kubernetes is now a standard way to deploy cloud applications, regardless of complexity or scale. Kubernetes is used to deploy apps to raspberry pis, one-box systems running under your desk, your own workstation, one or more VMs running on random cloud providers, and AWS. That's it.

7 days agomotorest

I'm not sure what your point is.

7 days agojchw

> I'm not sure what your point is.

My point is that the mere notion of "a system that's actually big or complex enough to warrant using Kubernetes" is completely absurd, and communicates a high degree of complete cluelessness over the whole topic.

Do you know what's a system big enough for Kubernetes? It's a single instance of a single container. That's it. Kubernetes is a container orchestration system. You tell it to run a container, and it runs it. That's it.

See how silly it all becomes once you realize these things?

6 days agomotorest

First of all, I don't really get the unnecessary condescension. I am not a beginner when it comes to Kubernetes and don't struggle to understand the concept at all. I first used Kubernetes at version 1.3 back in 2016, ran production workloads on it, contributed upstream to Kubernetes itself, and at one point even did a short bit of consulting for it. I am not trying to claim to be any kind of authority on Kubernetes or job scheduling as a topic, but when you talk down to people the way that you are doing to me, it doesn't make your point any better, it just makes you look like an insecure dick. I really tried to avoid escalating this on the last reply, but it has to be said.

Second of all, I don't really understand why you think I'd be blown away by the notion that you can use Kubernetes to run a single container. You can also open a can with a nuclear warhead, does not mean it makes any sense.

In production systems, Kubernetes and its ecosystem are very useful for providing the kinds of things that are table stakes, like zero-downtime deployments, metric collection and monitoring, resource provisioning, load balancing, distributed CRON, etc. which absolutely doesn't come for free either in terms of complexity or resource utilization.

But if all you need to do is run one container on a Raspberry Pi and don't care about any of that stuff, then even something stripped down like k3s is simply not necessary. You can use it if you want to, but it's overkill, and you'll be spending memory and CPU cycles on shit you are basically not using. Literally anything can schedule a single pod on a single node. A systemd Podman unit will certainly work, for example, and it will involve significantly less YAML as a bonus.

I don't think the point I'm making is particularly nuanced here. It's basically YAGNI but for infrastructure.

6 days agojchw

Kubernetes is an abstraction over VMs so that a single container can be deployed in the absence of a code package. The container is the binary in this circumstance. Unfortunately they lose the ability to shift blame if their deployment fails: it can no longer be the VM's fault. What is deployed in lower environments is physically identical to what is in prod, outside of configuration.

7 days agoSparkyte

It's the first thing that came to the mind of the person who wrote the comment, which is positively terrifying.

6 days agomkoubaa

Really? There are plenty of valid criticisms of kubernetes, but this doesn't strike me as one of them. It gives you tons of control over all of this. That's a big part of why it's so complex!

7 days agosanderjd

IMO, it's rather hard to fully know all of kubernetes and what it's doing, and the kind of person who demands elegance in solutions will hate it.

7 days agojldugger

This mainframe system from the 1990s was so much simpler

https://www.ibm.com/docs/en/cics-ts/6.x?topic=sysplex-parall...

even if it wasn't as scalable as Kube. On the other hand, a cluster of 32 CMOS mainframes could handle any commercial computing job that people were doing in the 1990s.

7 days agoPaulHoule

Seems the causality is going the wrong direction there. Commercial jobs were limited by mainframe constraints, so that's where job sizes topped out.

7 days agoclosewith

It's not simple but it's not opaque.

7 days agosanderjd

It gives you control via abstractions. That’s fine, and I like K8s personally, but if you don’t understand the underlying fundamentals that it’s controlling, you don’t understand what it’s doing.

7 days agosgarland

It's very easy to understand once you invest a little bit of time.

That's assuming you have a solid foundation in the nuts and bolts of how computers work to begin with.

If you just jumped into software development without that background, well, you're going to end up in the latter pool of developers as described by the parent comment.

7 days agobusterarm

Fly.io probably runs it on Kubernetes as well. It can be something in the middle, like RunPod. If you select 8 GPUs, you'll get a complete host to yourself. There is a lot of stuff lacking at RunPod too, though. But Fly.io... First of all, I've never heard of this one. Second, the variety of GPUs is lacking. There are only 3 types, and the L40S on Fly.io is 61.4% more expensive than on RunPod. So I would say it is about marketing, marketplace, long-term strategy, and pricing. But it seems at least they made themselves known to me (I bet there are others who heard about them for the first time today too).

7 days agopk-protect-ai

We do not use K8s.

7 days agotptacek

Yeah no I wouldn't touch Kubernetes with a 10' pole. Way too much abstraction.

7 days agoanacrolix

If my understanding is right, the gist seems to be that you create one or more Docker containers that your application can run on, describe the parameters they require, e.g. RAM size/CUDA capability/when you need more instances, and Kubernetes provisions them out to the machines available to it based on those parameters. It's abstract, but very tractably so IMO, and it seems like a sensible enough way to achieve load balancing if you keep it simple. I plan to try it out on some machines of mine just for fun/research soon.

7 days agojimmaswell

It's systemd but distributed across multiple nodes and with containers instead of applications. Instead of .service files telling the init process how to start and monitor executables, you have manifests telling the controller how to start and monitor containers.

7 days agocratermoon

It's worth noting that "container" and "process" are pretty similar abstractions. A lot of people don't realize this, but a container is sort of just a process with a different filesystem root (to oversimplify). That arguably is what a process should be on a server.

7 days agopclmulqdq

No, they are not. I'm not sure who started this whole container is just a process thing, but it's not a good analogy. Quite a lot of things you spin up containers for have multiple processes (databases, web servers, etc).

Containers are inherently difficult to sum up in a sentence. Perhaps the most reasonable comparison is to liken them to a "lightweight" vm, but the reasons people use them are so drastically different than vms at this point. The most common usecase for containers is having a decent toolchain for simple, somewhat reproducible software environments. Containers are mostly a hack to get around the mess we've made in software.

7 days agoepr

Having multiple processes under one user in an operating system is more akin to having multiple threads in one process than you think. The processes don't share a virtual memory space or kernel namespaces and they don't share PID namespaces, but that's pretty much all you get from process isolation (malware works because process isolation is relatively weak). The container adds a layer that goes around multiple processes (see cgroups), but the cgroup scheduling/isolation mechanism is very similar to the process isolation mechanism, just with a new root filesystem. Since everything Linux does happens through FDs, a new root filesystem is a very powerful thing to have. That new root filesystem can have a whole new set of libraries and programs in it compared to the host, but that's all you have to do to get a completely new looking computing environment (from the perspective of Python or Javascript).

A VM, in contrast, fakes the existence of an entire computer, hardware and all. That fake hardware comes with a fake disk on which you put a new root filesystem, but it also comes with a whole lot of other virtualization. In a VM, CPU instructions (eg CPUID) can get trapped and executed by the VM to fake the existence of a different processor, and things like network drivers are completely synthetic. None of that happens with containers. A VM, in turn, needs to run its own OS to manage all this fake hardware, while a container gets to piggyback on the management functions of the host and can then include a very minimal amount of stuff in its synthetic root.
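To make the "just a process with its own namespaces and root filesystem" point concrete, here is a rough Linux-only Go sketch in the spirit of the well-known "containers from scratch" demos. The rootfs path is a placeholder, it has to run as root, and real runtimes (runc, crun) add cgroups, user namespaces, capabilities, seccomp, and much more:

```go
//go:build linux

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	if len(os.Args) < 3 {
		fmt.Println("usage: minicontainer run <cmd> [args...]")
		return
	}
	switch os.Args[1] {
	case "run":
		parent()
	case "child":
		child()
	}
}

// parent re-executes this binary ("child" mode) inside new UTS, PID and mount namespaces.
func parent() {
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Println("error:", err)
	}
}

// child runs as PID 1 of the new PID namespace: it sets a hostname, switches
// to a new root filesystem, mounts a fresh /proc, and runs the command.
func child() {
	syscall.Sethostname([]byte("minicontainer"))
	syscall.Chroot("/tmp/rootfs") // placeholder: an unpacked image filesystem with a /proc dir
	os.Chdir("/")
	syscall.Mount("proc", "proc", "proc", 0, "")

	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Println("error:", err)
	}
	syscall.Unmount("proc", 0)
}
```

With an unpacked image filesystem at the placeholder path, something like `go run . run /bin/sh` drops you into a shell that sees its own hostname, its own mount table, and only the processes in its new PID namespace; no hardware is emulated anywhere.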

7 days agopclmulqdq

> Having multiple processes under one user in an operating system is more akin to having multiple threads in one process than you think.

Not than I think. I'm well aware of how "tasks" work in Linux specifically, and am pretty comfortable working directly with clone.

Your explanation is great, but I intentionally went out of my way to not explain it and instead give a simple analogy. The entire point was that it's difficult to summarize.

7 days agoepr

> I'm not sure who started this whole container is just a process thing, but it's not a good analogy. Quite a lot of things you spin up containers for have multiple processes (databases, web servers, etc).

It came from how Docker works: when you start a new container, it runs a single process in the container, as defined in the Dockerfile.

It's a simplification of what containers are capable of and how they do what they do, but that simplification is how it got popular.

7 days agoIzkata

If a container is "a process", then an entire linux/unix os (pid 1) is simply "a process"

7 days agoepr

Not just the kernel and PID 1, we also tend to refer to the rest of the system as "linux" as well, even though it's not technically correct. It's very close to the same simplification.

7 days agoIzkata

> Containers are inherently difficult to sum up in a sentence.

Super easy if we talk about Linux. It's a process tree being spawned inside its own set of kernel namespaces, security measures and a cgroup, to provide isolation from the rest of the system.

7 days agochupasaurus

If someone doesn't understand "container", I'm supposed to expect them to understand all the namespaces and their uses, cgroups, and the nitty gritty of the wimpy security isolation? You are proving my point that it's tough to summarize by using a bunch more terms that are difficult to summarize.

Once you recursively expand all the concepts, you will have multiple dense paragraphs, which don't "summarize" anything, but instead provide full explanations.

7 days agoepr

If you throw out the Linux tech from my explanation, it would become a general description which holds up even for Windows.

5 days agochupasaurus

Core Kubernetes (deployments, services, etc.) is fairly easy to understand. A lot of other stuff in the CNCF ecosystem is immature. I don't think most people need to use all the operators, admission controllers, otel, service mesh though.

If you're running one team with all services trusting each other, you don't have the problems solved by these things. Whenever you introduce a CNCF component outside core Kubernetes, invest time in understanding it and why it does what it does. Nothing is "deploy and forget"; it will need to be regularly checked and upgraded, and when issues come up you need some architecture-level understanding of the component to troubleshoot, because there are so many moving parts.

So if I can get away writing my own cronjob in 1000 lines rather than installing something from GitHub with a helm chart, I will go with the former option.

(Helm is crap, but you often won't have much choice.)

7 days agonever_inline

Having a team that runs Kubernetes for you and being on the receiving end is indeed super easy. Need another microservice? Just add another repository, add a short YAML file, push it to CI and bam, it's online.

But setting it up is not a trivial task, and it's often a recipe for disaster.

I've seen a fair share of startups who drank too much Kool-Aid and wanted to parrot FAANG stacks, only to discover they were burning tons of money just trying to deploy their first hello-world application.

7 days agozeroq

The irony is that the whole DevOps and cloud sales pitch was that developers can do all this themselves and you no longer need a sysadmin team. Turns out you still do, it's just called the DevOps/cloud team instead of the sysadmin team.

7 days agochasd00

Just yesterday a friend messaged me that their cutting-edge cloud system, based on Kubernetes and the latest achievements in microservice architecture, will have... two days of maintenance downtime.

2 days agozeroq

Maybe not Kubernetes, but what about Docker Compose or Docker Swarm? Having each app be separate from the rest of the server, with easily controllable storage, networking, resource limits, restarts, healthchecks, configuration and other things. It's honestly a step up from well crafted cgroups and systemd services etc. (also because it comes in a coherent package and a unified description of environments) while the caveats and shortcomings usually aren't great enough to be dealbreakers.

But yeah, the argument could have as well just said running code on a VPS directly, because that also gives you a good deal of control.

7 days agoKronisLV

Based on the following I think they also meant _how_ the code is running:

> The other group (increasingly large) just wants to `git push` and be done with it, and they're willing to spend a lot of (usually their employer's) money to have that experience. They don't want to have to understand DNS, linux, or anything else beyond whatever framework they are using.

I'm a "full full-stack" developer because I understand what happens when you type an address into the address bar and hit Enter - the DNS request that returns a CNAME record to object storage, how it returns an SPA, the subsequent XHR requests laden with and cookies and other goodies, the three reverse proxies they have to flow through to get to before they get to one of several containers running on a fleet of VMs, the environment variable being injected by the k8s control plane from a Secret that tells the app where the Postgres instance is, the security groups that allow tcp/5432 from the node server to that instance, et cetera ad infinitum. I'm not hooking debuggers up to V8 to examine optimizations or tweaking container runtimes but I can speak intelligently to and debug every major part of a modern web app stack because I feel strongly that it's my job to be able to do so (and because I've worked places where if I didn't develop that knowledge then nobody would have).

I can attest that this type of thinking is becoming increasingly rare as our industry continues to specialize. These considerations are now often handled by "DevOps Engineers" who crank out infra and seldom write code outside of Python and bash glue scripts (which is the antithesis to what DevOps is supposed to be, but I digress). I find this unfortunate because this results in teams throwing stuff over the wall to each other which only compounds the hand-wringing when things go wrong. Perhaps this is some weird psychopathology of mine but I sleep much better at night knowing that if I'm on the hook for something I can fix it once it's out in the wild, not just when I'm writing features and debugging it locally.

7 days agoalexjplant

> I can attest that this type of thinking is becoming increasingly rare as our industry continues to specialize.

This (and a few similar upthread comments) sum the problem up really concisely and nicely: pervasive, cross-stack understanding of how things actually work and why A in layer 3 has a ripple effect on B in layer 9 has become increasingly rare, and those who do know it are the true unicorns in the modern world.

A big part of the problem is the lack of succession/continuity at the university level. I have been closely working with very bright, fresh graduates/interns (data science, AI/ML, software engineering – a wide selection of very different specialisations) in the last few years, and I have even hired a few of them because they were that good.

Talking to them has given me interesting insights into what and how universities teach today. My own conclusion is that the reputable universities teach very well, but what they teach is highly compartmentalised, and typically there is little to no intersection across areas of study (unless the prospective student gets lucky and enrolls in electives that cut across areas of knowledge). For example, students who study game programming (yes, it is a thing) do not get taught CPU architectures or low-level programming in assembly; they have no idea what a pointer is. Freshly graduated software engineers have no idea what a netmask is and how it helps in reading a routing table; they do not know what a route is, either.

So modern ways of teaching are one problem. The second (and I think a big one) is that computing hardware has become heavily commoditised and appliance-like in general. Yes, there are a select few who still assemble their own racks of PC servers at home or tinker with Raspberry Pi and other trinkets, but it is no longer an en masse experience. Gone are the days when signing up with an ISP also required building your own network at home. That had the important side effect of imparting cross-stack knowledge, which today can only be gained by deliberately taking up a dedicated uni course.

With all of that disappearing into oblivion, the worrying question that I have is: who is going to support all this «low level» stuff in 20 years' time, without a clear plan for the cross-stack knowledge to be passed on from the current (and the last?) generation of unicorns?

So those who are drumming up the flexibility of k8s and the like miss one important aspect: with no succession of cross-stack knowledge, k8s is a risk for any mid- to large-sized organisation, because it relies heavily on the unicorns and rockstar DevOps engineers who are few and far between. It is much easier to palm the infrastructure off to a cloud platform, where supporting it becomes someone else's headache whenever there is a problem. But the cloud infrastructure usually just works.

6 days agoinkyoto

> For example, students who study game programming (yes, it is a thing) do not get taught the CPU architectures or low-level programming in assembly; they have no idea what a pointer is. Freshly graduated software engineers have no idea what a netmask is and how it helps in reading a routing table; they do not know what a route is, either.

> So modern ways of teaching are one problem.

IME school is for academic discovery and learning theory. 90% of what I actually do on the job comes from self-directed learning. From what I gather this is the case for lots of other fields too. That being said, I've now had multiple people tell me that they graduated with CS degrees without having to write anything except Python, so now I'm starting to question what's actually being taught in modern CS curricula. How can one claim to have a B.Sc. in our field without understanding how a microprocessor works? If it's in deference to more practical coursework like software design and such, then maybe it's a good thing...

6 days agoalexjplant

> […] self-directed learning.

And this is whom I ended up hiring – young engineers with curious minds, who are willing to self-learn and are continuously engaged in the self-learning process. I also continuously suggest interesting, promising, and relevant new things to look into, and they seem to be very happy to go away, pick the subject of study apart, and, if they find it useful, incorporate it into their daily work. We have also made a deal with each other that they can ask me absolutely any question, and I will explain and/or give them further directions of where to go next. So far, such an approach has worked very well – they get to learn arcane (it is arcane today, anyway) stuff from me, they get full autonomy, they learn how to make their own informed decisions, and I get a chance to share and disseminate the vast amount of knowledge I have accumulated over the years.

> How can one claim to have a B.Sc. in our field without understanding how […]

Because of how universities are run today. A modern uni is a commercial enterprise, with its own CEO, COO, C<whatever other letter>O. They rely on revenue streams (a previously unheard-of concept for a university), they rely on financial forecasts, and, most important of all, they have to turn profits. So, a modern university is basically a slot machine – the outcomes it yields depend entirely on how much cash one is willing to feed it. And, because of that, there is no incentive to teach across the areas of study, as doing so does not yield higher profits, or is even a net negative.

6 days agoinkyoto

Maybe in the US. In Europe, any self-titled engineer with no knowledge of CPUs, registers, stacks, concurrency, process management, scheduling, O-notation, dynamic systems, EE, and a bigass chunk of math from linear algebra to abstract algebra would be instantly shot down on the spot, with no degree at all.

Here in Spain, at the most basic uni, you come out almost able to write a Minix clone from scratch for some simple CPU (RISC-V, maybe) with all the knowledge you get.

I am no engineer (trade/vocational school, just a sysadmin) and I can at least write a small CHIP-8 emulator...

6 days agoanthk

I am not based in the US, and I currently work for one of the top 100 universities of the world (the lower 50 part, though).

6 days agoinkyoto

Lol. Yes. I scoffed.

7 days agobolognafairy

I enjoy the details, but I don’t get paid to tell my executives how we’re running things. I get paid to ship customer facing value.

Particularly at startups, it’s almost always more cost effective to hit that “scale up” button from our hosting provider than do any sort of actual system engineering.

Eventually, someone goes “hey we could save $$$$ by doing XYZ” so we send someone on a systems engineering journey for a week or two and cut our bill in half.

None of it really matters, though. We’re racing against competition and runway. A few days less runway isn’t going to break a startup. Not shipping as fast as reasonable will.

7 days agoSkyPuncher

I've been in similar situations, but details matter. If your "scale up" button means heavily abstracted services, your choices start to look very different, because the cost of reimplementing what the service does might be high enough that you end up in a no-win situation of your own making.

The closer your "scale up" button is to actual hardware, the less of a problem it is.

7 days agolayoric

That's the next problem startups should avoid at all costs. Don't build heavily abstracted services; just put it all in a monolith, which will make it faster and easier to iterate. Don't overthink it, just get the feature out of the door.

Chances are high that you won't get it right from the beginning; you can create these abstractions once you really understand the problem space with real-world data.

When you get to that point I have another pro tip: Don't refactor, just rewrite it and put all your learnings into the v2.

7 days agoaeyes

This is exactly, precisely what my experience has been.

7 days agotheoreticalmal

Why not refactor just the parts that need it instead of rewriting everything?

3 days agoalanmoraes

You can do small refactors here and there, but usually things get complex after some time. My first advice was to not overthink the architecture in the beginning, so it is inevitable that you will end up with something quite unorganized after a while. The assumption here is that in the beginning you won't know what the architecture will look like in the future, because you are in startup/discovery mode.

Refactoring such a codebase while keeping everything running can be a monumental effort. I found it very hard to keep people who work on such a project motivated. Analyzing the use cases, coming up with a new design incorporating your learnings and then seeing clear progress towards the goal of a cleaner codebase is much more motivating. Engineers get the chance to do something new instead of "moving code around".

I'm not saying rewrite everything. When you get to this point it usually makes sense to start thinking about these abstractions which I advised to avoid in the beginning. You can begin to separate parts of the system by responsibility and then you rewrite just one part and give it a new API which other parts of your system will consume. Usually by that time you'll also want to restructure your database.

2 days agoaeyes

Wild idea: maybe if more devs had good fundamental knowledge to begin with, the good systems engineering could be done along the way.

7 days agosgarland

And if more systems engineers had more design knowledge, navigating the AWS console wouldn't be like walking on hot coals. But it's still a $X0 billion/year business!

We're all good at different things, and it's usually better to lean into your strengths than it is to paper over your weaknesses.

We can wish everyone were good at everything, or we can try to actually get things done.

7 days agoketzo

> We can wish everyone were good at everything, or we can try to actually get things done.

False dichotomy. There's no reason we can't have both.

I want to be clear, there's no perfect code or perfect understanding or any of that. But the complaint here about not knowing /enough/ fundamentals is valid. There is some threshold which we should recognize as a minimum. The disagreement is about where this threshold is, and no one is calling for perfection. But certainly there are plenty who want the threshold to not exist, whether that's "AI will replace coders" or "coding bootcamps get you big tech jobs". Zero to hero in a few months is bull.

7 days agogodelski

It’s not a false dichotomy at all. You only have so many hours in a day. At a startup, it’s very unlikely (certainly not impossible!) that your differentiation will come from very cheap system orchestration - your time is likely better spent on building your product.

Minimum knowledge is one thing; minimum time to apply it is another.

7 days agoketzo

If you had to spend time / VC money learning all of this stuff before you could begin to apply it, I absolutely agree, it's a waste of time. That's not my point. My point is people (by people, I mean "someone interested in tech and is likely to pursue it as a career") can and should learn these things earlier in life such that it's trivial once they're in the workforce.

I could go from servers sitting on the ground to racked, imaged, and ready to serve traffic in a few hours, because I've spent the time learning how to do it, and have built scripts and playbooks to do so. Even if I hadn't done the latter, many others have also done so and published them, so as long as you knew what you were looking for, you could do the same.

7 days agosgarland

Yeah, this is what I meant as well. Though I'd also argue that on-the-job learning is essential too, and it should come through multiple avenues: mentorship from seniors to juniors, as well as allowing time to learn on the job. I can tell you from having been an aerospace engineer that you'd be given this time. And from what I hear, in the old days you'd naturally get this time any time you hit compile.

There's a bunch of sayings from tradesmen that I think are relevant here, usually said by people who take pride in their work and won't do shoddy craftsmanship:

  - measure twice, cut once 
  - there's never time to do it right, but there's always time to do it twice
  - if you don't have time to do it right when will you have time to do it again?
I think the advantage these guys have is that when they do a shit job it's more noticeable - not only to the builders but to anyone else. We work with high abstractions, but high skill is the main reason we get big bucks. Unfortunately I think this makes it harder for managers to differentiate high quality from shit. So they'd rather get shit fast than quality a tad slower, because all they can differentiate is time. But they don't see how this is so costly, since everything has to be done at least thrice.
6 days agogodelski

> False dichotomy. There's no reason we can't have both.

I'd kinda want to argue with that - it is true, but we don't live in a vacuum. Most programmers (me included, don't worry) aren't that skilled, and after work not everyone will want to study more. This is something that could be resolved by changing the cultural focus, but like other things involving people, it's easier to change the system/procedures than habits.

7 days agoStefanBatory

Are you wanting to argue or discuss? You can agree in part and disagree with another part. Doesn't have to be an argument.

To your point I agree. I would argue that employers should be giving time for employees to better themselves. It's the nature of any job like this where innovation takes place. It's common among engineers, physicists, chemists, biologists, lawyers, pilots, and others to have time to learn. Doctors seem to be in the same boat as us and it has obviously negative consequences. The job requires continuous learning. And you're right, that learning is work. So guess who's supposed to pay for work?

6 days agogodelski

Sorry, discuss, not argue.

I do agree with you, though. I just have fears about how much that can be a thing in reality - because I cannot disagree that this is the right approach.

6 days agoStefanBatory

Well, here's the choice: we do that and build good things, or we don't and build shit.

If you look around I think you'll notice it's mostly shit...

There's a flaw in markets though which allows shit to flourish. It's that before purchasing you can't tell the difference between products. So generally people then make the choice based on price. Of course, you get what you pay for. And in many markets people are screaming for something different that they aren't currently getting, but things are so entrenched that it's hard to even create that new market unless you're a huge player.

Here's a good example. Say you know your customers like fruit that is sweet. So all the farmers breed sweeter and sweeter strawberries. The customers are happy and sales go up. But at some point they don't want it any sweeter. But every farmer continues anyway and the customers have no choice but to buy too-sweet strawberries. So strawberry sales decline. The farmers, not having much signal from customers other than price and orders, what do they do? Well... they double down of course! It's what worked before.

The problem is that the people making decisions are so far removed from all this that they can't read the room. They don't know what the customer wants. Tbh, with tech, often the customer doesn't know what they want until they see it. (Which is why so much innovation comes from open source: people fix things to make their own lives better, and then a company goes "that's a good idea, let's scale this".)

5 days agogodelski

> there's no perfect code or a perfect understanding or any of that

I'm unsure what those terms mean. What are qualities that perfect code or perfect understanding would have?

Depending on your framing I may agree or disagree.

Just to lob a softball, I'm sure there are/were people who have a perfect understanding of an older CPU architecture; or an entire system architecture's worth of perfect understanding that gave us spacecraft with hardware and firmware that still works and can be updated (from beyond the solar system?), or Linux.

These are softballs for framing because they're just what I could type off the cuff.

7 days agogenewitch

I'm making a not-so-subtle reference to "don't let perfection be the enemy of good", a saying often quoted at people who say things need to be better.

To answer your softball: no, I doubt there was anyone who understood everything, except pretty early on. Very few people understand the whole OS, let alone a specialized domain like data analysis, HPC, programming languages, or encryption. But here's the thing: the extra knowledge never hurts. It almost always helps, though some knowledge is more generally useful than other knowledge - memory, caching, {S,M}I{S,M}D, some bash, and some assembly go A LONG way.
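
To make the caching point concrete, a sketch (N and the exact speedup are machine-dependent): both functions below do the same arithmetic, but only one walks memory in the order it's laid out.

  #include <stddef.h>

  #define N 2048
  static double a[N][N];            /* row-major in C: a[i][0..N-1] are contiguous */

  double sum_column_major(void) {   /* jumps N doubles per step: cache-hostile */
      double s = 0.0;
      for (size_t j = 0; j < N; j++)
          for (size_t i = 0; i < N; i++)
              s += a[i][j];
      return s;
  }

  double sum_row_major(void) {      /* walks contiguously: cache-friendly */
      double s = 0.0;
      for (size_t i = 0; i < N; i++)
          for (size_t j = 0; j < N; j++)
              s += a[i][j];
      return s;
  }

Same big-O, same instruction count, typically a several-fold difference in wall-clock time on large arrays - the kind of thing you only notice if you know the memory hierarchy exists.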

6 days agogodelski

I'm a big fan of fundamental knowledge, but I disagree somewhat with your statement. The thing that startups care most about is product-market fit. And finding that fit requires a lot of iteration and throw-away code. Once the dust settles and you have an initial user base, you can start looking into optimizations.

7 days agovaylian

> Once the dust settles and you have an initial user base, you can start looking into optimizations.

But people never do. Instead they just scale up, get more funding, rinse and repeat. It isn't until the bill gets silly that anyone bothers to consider it, and they usually then discover that no one knows how to optimize things other than code (maybe – I've worked with many devs who have no idea how to profile their code, which is horrifying).

7 days agosgarland

> But people never do. Instead they just scale up, get more funding, rinse and repeat. It isn't until the bill gets silly that anyone bothers to consider it,

Yes, because usually the other option is to focus on those things you advocate for up front, and then go out of business before getting a chance to have the problems you're arguing against.

7 days agoBoiledCabbage

Because that's the right approach for that situation. A core skill in engineering is understanding tradeoffs, and in that case you want speed.

Outside of eng, nobody cares if your company has the prettiest, leanest infrastructure in the world. They care about product.

6 days agohabinero

It all takes time, mental energy, etc.

Different environments require different tradeoffs. The vast majority of startups will die before their systems engineering becomes a problem.

7 days agoSkyPuncher

Constant firefighting because you engineered a pile of shit also takes time and mental energy.

7 days agosgarland

You are describing the worst of both worlds - systems engineering, done poorly. I find good abstractions to be almost maintenance-free.

7 days agodzhiurgis

Only if you let it. None of that is your problem - just consume the task queue.

Unless of course you're in a leadership role, in which case it's going to be priority #1,000 in 99.9% of cases.

7 days agoclosewith

There is such a variety of work environments, and realistically most people learn on the job. Everyone has different skills and knowledge bases.

When I was at <FAANG> we didn’t control our infrastructure, there were teams that did it for us. Those guys knew a lot more about the internals of Linux than your average HNer. Getting access to the SSD of the host wasn’t a sys-call away, it was a ticket to an SRE and a library import. It wasn’t about limited knowledge, it was an intentional engineering tradeoff made at a multi-billion dollar infra level.

When I worked at <startup>, we spent 1hr writing 50loc and throwing it at AWS Lambda just to see if it would work. No thought to long-term cost or scalability, because the company might not be there tomorrow, and this is the fastest way to prototype an API in the cloud. When it works, obviously management wants you to hit the "scale" button in that moment, and if it costs 50% more, well, that's probably only a few hundred dollars a month. It wasn't about limited knowledge, but instead an intentional engineering tradeoff when you're focused on speed and costs are small.

And there is a whole bunch of companies that exist in between.

7 days agovineyardmike

This is exactly my experience. Nearly every dev on my team can dive into the details and scale that service effectively, but it’s rarely worth it.

If an engineer costs $100/hour, scaling an extra $100/month (or even an extra $1k/month) is generally a no-brainer. That money is almost always better spent on shipping product.

6 days agoSkyPuncher

Premature optimization may hit them hard. Overengineering is IMO usually the bigger technical debt, and a huge upfront cost as well. Well-thought-out plans tend to become a sunk cost fallacy. Making room for changes is hard enough in XP-like ways of working. When you have to tell your manager that half a year of careful plans and engineering can be thrown away because of new requirements, which emerge from a late entry to market, you look like a clown. Plans and complexity usually introduce more risk than they remove.

7 days agochromanoid

Infra should not require much in the way of redoing if it's done correctly. Foundational software's configuration like RDBMS schema, maybe, but I wouldn't classify that as infra per se.

Seriously, I'm struggling to figure out how "we have servers that run containers / applications" would need to be redone just because the application changed.

7 days agosgarland

Some things that can happen: the product gets canned; customers want it on-premises in their data center; usage spikes are too extreme and serverless is simply the cheapest option.

I would always recommend a "serverless" monolith first, with the option to develop with mocks locally/offline. That's IMO the best risk/effort ratio.

6 days agochromanoid

This is a context-based dichotomy, not a person-based one.

In my personal life, I’m curiosity-oriented, so I put my blog, side projects and mom’s chocolate shop on fully self hosted VPSs.

At my job managing a team of 25 and servicing thousands of customers for millions in revenue, I’m very results-oriented. Anyone who tries to put a single line of code outside of a managed AWS service is going to be in a lot of trouble with me. In a results-oriented environment, I’m outsourcing a lot of devops work to AWS, and choosing to pay a premium because I need to use the people I hire to work on customer problems.

Trying to conflate the two orientations with mindsets / personality / experience levels is inaccurate. It’s all about context.

7 days agosudhirj

This is a false dichotomy. The truth is we are constantly moving further and further away from the silicon. New developers don't have as much need to understand these details because things just work; some do care because they work at a job where it's required, or because they're inherently interested (a small number).

Over time we will move further away. If the cost of an easily managed solution is low enough, why do the details matter?

7 days ago_dark_matter_

> The truth is we are constantly moving further and further away from the silicon.

Are we? We're constantly changing abstractions, but we don't keep adding them all that often. Operating systems and high-level programming languages emerged in the 1960s. Since then, the only fundamentally new layer of abstraction has been virtual machines (JVM, browser JS, hardware virtualization, etc). There's still plenty of hardware-specific APIs, you still debug assembly when something crashes, you still optimize databases for specific storage technologies and multimedia transcoders for specific CPU architectures...

7 days agoserviceberry

Maybe fundamentally is an extremely load bearing word here, but just in the hardware itself we see far more abstraction than we saw in the 60s. The difference between what we called microcode in an 8086 and what is running in any processor you buy in 2025 is an abyss. It almost seems like hardware emulation. I could argue that the layers of memory caching that modern hardware have are themselves another layer vs the days when we sent instructions to change which memory banks to read. The fact that some addresses are very cheap and others are not, and the complexity is handled in hardware is very different than stashing data in extra registers we didn't need this loop. The virtualization any OS does for us is much deeper than even a big mainframe that was really running a dozen things at once. It only doesn't look like additional layers if you look from a mile away.

The majority of software today is written without knowing even which architecture the processor is going to be, how much of the processor we are going to have, whether anything will ever fit in memory... hell, we can write code that doesn't know not just the virtual machine it's going to run in, but even the family of virtual machine. I have written code that had no idea if it was running in a JVM, LLVM or a browser!

So when I compare my code from the 80s to what I wrote this morning, the distance from the hardware doesn't seem even remotely similar. I bet someone is writing hardware specific bits somewhere, and that maybe someone's debugging assembly might actually resemble what the hardware runs, maybe. But the vast majority of code is completely detached from anything.

7 days agohibikir

At the company I work for, I routinely mock the software devs for solving every problem by adding yet another layer of abstraction. The piles of abstractions these people build up are mind-numbingly absurd. Half the things they are fixing, if not more, were created by the abstractions in the first place.

7 days agoy1n0

Yeah, I remember watching a video of (I think?) a European professor who helped with an issue devs were having in developing The Witness. Turns out they had a large algorithm they developed in high level code (~2000 lines of code? can't remember) to place flora in the game world, which took minutes to process, and it was hampering productivity. He looked at it all, and redid almost all of it in something like <20 lines of assembly code, and it achieved the same result in microseconds. Unfortunately, I can't seem to find that video anymore...

Frankly though, when I bring stuff like this up, it feels like I'm the one being mocked rather than the other way around - like we're the minority. And sadly, I'm not sure if anything can ultimately be done about it. People just don't know what they don't know. Some things you can't tell people despite trying; they just won't get it.

7 days agoRury

It's Casey Muratori, he's an American gamedev, not a professor. The video was a recorded guest lecture for a uni in the Netherlands though.

And it wasn't redone in assembly, it was C++ with SIMD intrinsics, which might as well just be assembly.

https://www.youtube.com/watch?v=Ge3aKEmZcqY&list=PLEMXAbCVnm...

7 days ago3836293648

That really sounds like no one bothered profiling the code. Which I'd say is underengineered, not over.

7 days agogodelski

what's your point exactly? what do you hope to achieve by "bringing it up" (I assume in your workplace)?

most programmers are not able to solve a problem like that in 20 lines of assembly or whatever, and no amount of education or awareness is going to change that. acting as if they can is just going to come across as arrogant.

7 days agoraziel2p

The point is exactly as the above post mentioned:

> Half the things they are fixing, if not more, are created by the abstractions in the first place

Unlike the above post though, in my experience, it's less often devs (at least the best ones) who want to keep moving away from the silicon, and more often management. Everywhere I have worked, management wants to avoid control over the lower-level workings of things and outsource or abstract it away. They then proceed to wonder why we struggle with the issues that we have, despite the people who deal with these things trying to explain it to them. They seem to automatically assume that higher-level abstractions are inherently better and will lead to productivity gains, simply because you don't have to deal with the underlying workings of things. But the example I gave is a reason why that isn't necessarily the case. Fact is, sometimes problems are better and more easily solved at a lower level of abstraction.

But as I said, in my experience, management often wants to go the opposite way and disallows us control over these things. So, as an engineer who wants to solve problems as much as management or customers want them solved, what I hope to achieve by "bringing it up" in cases which seem appropriate is a change which empowers us to actually solve such problems.

Don't get me wrong though, I'm not saying lower-level is always the way to go. It always depends on the circumstances.

7 days agoRury

> acting as if they can is just going to come across as arrogant.

Hold on there a sec: WHAT?!

Engineers tend to solve their problems differently, and the circumstances behind those differences are not always clear. I'm in this field because I want to learn as many different approaches as possible. Did you never experience a moment when you could share a simpler solution to a problem with someone and observe firsthand as they became one of today's lucky 10'000[0]? That's anything but arrogant in my book.

Sadly, what I increasingly observe is the complete opposite. Nobody wants to talk about their solutions, everyone wants to gatekeep and become indispensable, and criticism isn't seen as part of a productive environment because "we just need to ship that damn feature!". Team members should be aware when decisions have been made out of laziness, in good faith, out of experience, under pressure, etc.

[0]: https://xkcd.com/1053/

7 days agowhilenot-dev

> There's still plenty of hardware-specific APIs, you still debug assembly when something crashes, you still optimize databases for specific storage technologies and multimedia transcoders for specific CPU architectures...

You might, maybe, but an increasing proportion of developers:

- Don't have access to the assembly to debug it

- Don't even know what storage tech their database is sitting on

- Don't know or even control what CPU architecture their code is running on.

My job is debugging and performance profiling other people's code, but the vast majority of that is looking at query plans. If I'm really stumped, I'll look at the C++, but I've not yet once looked at assembly for it.

7 days agofphhotchips

This makes sense to me. When I optimize, the most significant gains I find are algorithmic. Whether it's an extra call, a data structure that needs to be tweaked, or just utilizing a library that operates closer to silicon. I rarely need to go to assembly or even a lower level language to get acceptable performance. The only exception is occasionally getting into architecture specifics of a GPU. At this point, optimizing compilers are excellent and probably have more architecture details baked into them than I will ever know. Thank you, compiler programmers!

7 days agodaveguy

> At this point, optimizing compilers are excellent

the only people that say this are people who don't work on compilers. ask anyone that actually does and they'll tell you most compilers are pretty mediocre (they tend to miss a lot of optimization opportunities), some compilers are horrendous, and a few are good in a small domain (matmul).

7 days agoalmostgotcaught

It's more that the God of Moore's Law has given us so many transistors that we are essentially always I/O blocked, so it effectively doesn't matter how good our assembly is for all but the most specialized of applications. Good assembly, bad assembly, whatever: the point is that your thread is almost always going to be blocked waiting for I/O (disk, network, human input) rather than on something that a fancy optimization of the loop that enables better branch prediction can fix.

7 days agomandevil

> It's more that the God of Moore's Law have given us so many transistors that we are essentially always I/O blocked

this is again just more brash confidence without experience. you're wrong. this is a post about GPUs and so i'll tell you that as a GPU compiler engineer i spend my entire day (work day) staring/thinking about asm in order to affect register pressure and ilp and load/store efficiency etc.

> rather than something that a fancy optimization of the loop

a fancy loop optimization (pipelining) can fix some problems (load/store efficiency) but create other problems (register pressure). the fundamental fact is the NFL theorem applies here fully: you cannot optimize for all programs uniformly.

https://en.wikipedia.org/wiki/No_free_lunch_theorem

7 days agoalmostgotcaught

I just want to second this. Some of my close friends are PL people working on compilers. I was in HPC before coming to ML, having written a fair number of CUDA kernels, a lot of parallel code, and plenty of I/O handling.

While yes, I/O is often the bottleneck, I'd be shy to really say that in a consumer space where we aren't installing flash buffers, performing in situ processing, or even pre-fetching. Hell, in many programs I barely even see any caching! TBH, most stuff can greatly benefit from asynchronous and/or parallel operations. Yeah, I/O is an issue, but I really would not call anything I/O bound until you've actually gotten into parallelism and optimizing code - and not even then, until you've applied that to your I/O operations! There is just so much optimization that a compiler can never do, and so much optimization that a compiler won't do unless you're giving it tons of hints (all that "inline", "const", and stuff you see in C, not to mention the hell that is template metaprogramming). Things you could never get out of a dynamically typed language like Python, no matter how much of the backend is written in C.

That said, GPU programming is fucking hard. Godspeed you madman, and thank you for your service.

7 days agogodelski

> At this point, optimizing compilers are excellent and probably have more architecture details baked into them than I will ever know.

While modern compilers are great, you'd be surprised by the seemingly obvious optimizations compilers can't do, either because of language semantics or because the code transformations would be infeasible to detect.

I type versions of functions into godbolt all the time and it's very interesting to see what code is/isn't equivalent after -O3 passes.
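
A small illustration of the language-semantics case, if anyone wants something to paste into godbolt (illustrative only - exact codegen differs by compiler and flags):

  /* Without restrict, the compiler must assume a[i] might alias *n, so it
     either reloads *n every iteration or emits a runtime aliasing check.
     With restrict, *n can simply be hoisted out of the loop. */
  void scale(float *a, const float *n, int len) {
      for (int i = 0; i < len; i++)
          a[i] *= *n;
  }

  void scale_restrict(float *restrict a, const float *restrict n, int len) {
      for (int i = 0; i < len; i++)
          a[i] *= *n;
  }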

7 days agodavemp

The need to expose SSE instructions to systems languages tells you that compilers are not good at translating straightforward code into optimal machine code. And using SSE properly often speeds the code up by several times.
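
For example (a rough sketch; assumes an x86 target with SSE and a length that's a multiple of 4): the scalar loop below normally stays scalar because strict FP semantics forbid reassociating the additions, while the intrinsics version does four at a time.

  #include <xmmintrin.h>

  float sum_scalar(const float *x, int len) {
      float s = 0.0f;
      for (int i = 0; i < len; i++)
          s += x[i];                                    /* serial chain of additions */
      return s;
  }

  float sum_sse(const float *x, int len) {
      __m128 acc = _mm_setzero_ps();
      for (int i = 0; i < len; i += 4)
          acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));   /* 4 floats per iteration */
      float tmp[4];
      _mm_storeu_ps(tmp, acc);
      return tmp[0] + tmp[1] + tmp[2] + tmp[3];
  }

(The two can return slightly different results precisely because the SSE version reassociates the sums - which is exactly why the compiler won't do it on its own without something like -ffast-math.)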

7 days agofpoling

I don’t understand how you could say something like HTTP or Cloud Functions or React aren’t abstractions that software developers take for granted.

7 days agoJambalayaJimbo

These days, even if one writes in machine code it will be quite far away from the real silicon, as that code has little to do with what the CPU is actually doing. I suspect that C source code from, say, the nineties was closer to the truth than modern machine code is.

7 days agofpoling

Could you elaborate? I may very well just be ignorant on the topic.

I understand that if you write machine code and run it in your operating system, your operating system actually handles its execution (at least, I _think_ I understand that), but in what way does it have little to do with what the CPU is doing?

For instance, couldn't you still run that same code on bare metal?

Again, sorry if I'm misunderstanding something fundamental here, I'm still learning lol

7 days agotitmouse

The quest for performance has turned modern CPUs into something that looks a lot more like a JITed bytecode interpreter rather than the straightforward “this opcode activates this bit of the hardware” model they once were. Things like branch prediction, speculative execution, out-of-order execution, hyperthreading, register renaming, L1/2/3 caches, µOp caches, TLB caches... all mean that even looking at the raw machine code tells you relatively little about what parts of the hardware the code will activate or how it will perform in practice.
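
One toy way to see this from plain C: the function below compiles to the same machine code regardless of the data, yet its wall-clock time can change dramatically depending only on whether the array is sorted, because of the branch predictor. (A sketch; numbers will vary, and at some optimization levels the compiler if-converts the branch away entirely, which rather proves the point that even the assembly doesn't tell you much on its own.)

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N (1 << 22)

  /* Count elements >= 128; the comparison is the branch the predictor sees. */
  static long count_big(const int *a, int n) {
      long c = 0;
      for (int i = 0; i < n; i++)
          if (a[i] >= 128)
              c++;
      return c;
  }

  static int cmp_int(const void *x, const void *y) {
      return *(const int *)x - *(const int *)y;
  }

  int main(void) {
      int *a = malloc(N * sizeof *a);
      for (int i = 0; i < N; i++)
          a[i] = rand() % 256;

      clock_t t0 = clock();
      long c1 = count_big(a, N);              /* random order: frequent mispredictions */
      clock_t t1 = clock();

      qsort(a, N, sizeof *a, cmp_int);        /* same values, now sorted */
      clock_t t2 = clock();
      long c2 = count_big(a, N);              /* sorted: near-perfect prediction */
      clock_t t3 = clock();

      printf("random: %ld (%ld ticks), sorted: %ld (%ld ticks)\n",
             c1, (long)(t1 - t0), c2, (long)(t3 - t2));
      free(a);
      return 0;
  }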

6 days agowolfgang42

That makes sense, thank you!

5 days agotitmouse

The abstractions manifest more at the language level: no memory management, simplified synchronization primitives, no need for compilation.

Not sure virtual machines are fundamentally different. In the end, if you have 3 virtual or 3 physical machines, the most important difference is how fast you can change their configuration. They still have all the other concepts (network, storage, etc.). The automation that comes with VMs is better than it was for physical machines (probably), but then automation for everything got better (not only for machines).

7 days agovladms

The details matter because someone has to understand the details, and it's quicker and more cost-effective if it's the developer.

At my job, a decade ago our developers understood how things worked, what was running on each server, where to look if there were problems, etc. Now the developers just put magic incantations given to them by the "DevOps team" into their config files. Most of them don't understand where the code is running, or even what much of it is doing. They're unable or unwilling to investigate problems on their own, even if they were the cause of the issue. Even getting them to find the error message in the logs can be like pulling teeth. They rely on this support team to do the investigation for them, but continually swiveling back-and-forth is never going to be as efficient as when the developer could do it all themselves. Not to mention it requires maintaining said support team, all those additional salaries, etc.

(I'm part of said support team, but I really wish we didn't exist. We started to take over Ops responsibilities from a different team, but we ended up taking on Dev ones too and we never should've done that.)

7 days agoUvix

http://www.antipope.org/charlie/blog-static/2014/10/not-a-ma...

This blog post has a brilliant insight that I still remember more than a decade later: we live in a fantasy setting, not a sci-fi one. Our modern computers are so unfathomably complex that they are demons, ancient magic that can be tamed and barely manipulated, but not engineered. Modern computing isn't Star Trek TNG, where Captain Picard and Geordi LaForge each have every layer of their starship in their heads with full understanding, and can manipulate each layer independently. We live in a world where the simple cell phone in our pocket contains so much complexity that it is beyond any 10 human minds combined to fully understand how the hardware, the device drivers, the OS, the app layer, and the internet all interact between each other.

7 days agomandevil

> We live in a world where the simple cell phone in our pocket contains so much complexity that it is beyond any 10 human minds combined to fully understand how the hardware, the device drivers, the OS, the app layer, and the internet all interact between each other.

Try tens of thousands of people. A mobile phone is immensely more complicated than people realize.

Thank you for writing it so eloquently. I will steal it.

5 days agodavid-gpu

> (I'm part of said support team, but I really wish we didn't exist. We started to take over Ops responsibilities from a different team, but we ended up taking on Dev ones too and we never should've done that.)

There will always be work for people like us. It's not so bad. We're not totally immune to layoffs but for us they come several rounds in.

7 days agobusterarm

> why do the details matter?

This statement encapsulates nearly everything that I think is wrong with software development today. Captured by MBA types trying to make a workforce that is as cheap and replaceable as possible. Details are simply friction in a machine that is obsessed with efficiency to the point of self-immolation. And yet that is the direction we are moving in.

Details matter, process matters, experience and veterancy matters. Now more than ever.

7 days agopdntspa

I used to think this, but it only works if the abstractions hold - it's like if we dropped random access memory and went back to tape drives, suddenly the abstractions would matter.

My comment elsewhere goes into a bit more detail, but basically silicon stopped being able to make single-threaded code faster in about 2012 - we've just been getting "more parallel cores" since. And now at wafer scale we see 900,000 cores on a "chip". When fully parallel code runs a million times faster than your competitors', when following one software engineering path leads to code that can run 1M X, then we will find ways to use that excess capacity - and the engineers who can do it get to win.

I’m not sure how LLMs face this problem.

7 days agolifeisstillgood

This.

As soon as the abstractions leak or you run into an underlying issue you suddenly need to understand everything about the underlying system or you're SOOL.

I'd rather have a simpler system where I already understand all the underlying abstractions.

The overhead of this is minimal when you keep things simple and avoid shiny things.

7 days agodbcjv7vhxj

That's why abstractions like PyTorch exist. You can write a few lines of Python and get good utilization of all those GPU cores.

7 days agojonas21

Tell that to the people who keep the network gear running at your office. You might not see the importance of knowing the details, but those details still matter and are still in plain use all around you every day. The aversion to learning the stack you're building with is frustrating to the people who keep that stack running.

I think that if the development side knew a little bit of the rest of the stack they'd write better applications overall.

7 days agozelon88

I think that's the same thing someone would have said about an IBM mainframe in 1990. And just as wrong.

I'll use my stupid hobby home-server stuff as an example. I tossed the old VMware box years ago. You know what I use now? Little HP t6x0 thin clients. They are crappy little x86 SoCs with M.2 slots, up to 32GB memory, and they can be purchased used for $40. They aren't fast, but perform better than the cheaper AWS and GCP instances.

Is that a trivial use case? Absolutely. Now move from $40 to about $2000. Buy a Mac Mini. It's a powerful ARM SoC with ridiculously fast storage and performance. Probably more compute than a small/mid-size company computer room a few years ago and more performant than a $1M SAN a decade ago.

6G will bring 10gig cellular.

Hyperscalers datacenters are the mainframe of 2025.

7 days agoSpooky23

A hyperscaler (or a cloud provider in general) does not only sell you compute in terms of a compute node, but rather compute as a service. There are some value-adds, like AWS's cloud services, but on a pure compute level you pay for elasticity and reliability. A comparison between a cloud provider and your homelab also needs to account for connectivity, which is likely strongly in favor (latency/reliability) of a cloud provider or DC compared to an office or home.

7 days agoc0balt

Server hardware is reliable. Price-wise, I think the sweet spot for connectivity presently is to host your own hardware in a data center and have a system administrator who lives not far away. I previously worked for a company that did things like that while having millions of active users. It cost them at least 5 times less than it would have with a cloud provider. And when they got a better deal with another data center, the migration was not much more complex than moving server boxes in a van and changing IP addresses on the load balancers.

7 days agofpoling

Absolutely — they add a ton of value. So did IBM… and companies migrated to NT solutions that were half baked because they were cheap.

When I can get the equivalent of a Mac Mini at a super cheap price point… you're going to have opportunities to attack those stratospheric cloud margins.

7 days agoSpooky23

Do you happen to have a link for your HP t6x0 reference? I tried https://www.ebay.com/sch/i.html?_nkw=hp+thin+client+32gb+-(4... and there seemed to be plenty with 32GB of storage but none that I could find with that much RAM

7 days agomdaniel

Sorry, I was imprecise. I typically buy the cheapest one I can find with a power supply. Last year t630s were the sweet spot. They typically ship with 4 or 8… the Windows models have a higher spec. I picked up a couple of t730s too. I add third-party memory and storage if needed.

You get a super capable, low-power device in the price footprint of a Raspberry Pi.

7 days agoSpooky23

> why do the details matter?

Have you ever had a plumber, HVAC tech, electrician, etc. come out to your house for something, and had them explain it to you? Have you had the unfortunate experience of that happening more than once (with separate people)? If so, you should know why this matters: because if you don’t understand the fundamentals, you can’t possibly understand the entire system.

It’s the same reason why the U.S. Navy Nuclear program still teaches Electronics Technicians incredibly low-level things like bus arbitration on a 386 (before that, it was the 68000). Not because they expect most to need to use that information (though if necessary, they carry everything down to logic analyzers), but because if you don’t understand the fundamentals, you cannot understand the abstractions. Actually, the CPU is an abstraction, I misspoke: they start by learning electron flow, then moving into PN junctions, then transistors, then digital logic, and then and only then do they finally learn how all of those can be put together to accomplish work.

Incidentally, former Navy Nukes were on the initial Google SRE team. If you read the book [0], especially Chapter 12, you’ll get an inkling about why this depth of knowledge matters.

Do most people need to understand how their NIC turns data into electrical signals? No, of course not. But occasionally, some weird bug emerges where that knowledge very much matters. At some point, most people will encounter a bug that they are incapable of reasoning about, because they do not possess the requisite knowledge to do so. When that happens, it should be a humbling experience, and ideally, you endeavor to learn more about the thing you are stuck on.

[0]: https://sre.google/sre-book/table-of-contents/

7 days agosgarland

Capture and product stickiness. If your product is all serverless, wired together with an event system from the same cloud provider, you are in a very weak position to argue that you will go elsewhere, leveraging the competitive market to your advantage.

The more the big cloud providers can abstract CPU cycles, memory, networking, storage, etc., the less they have to compete with others doing the same.

7 days agolayoric

> because things just work

If that were true, you might be right.

What happens in reality is that things are promised to work and (at best) fulfill that promise so long as no developers or deployers or underlying systems or users deviate from a narrow golden path, but fail in befuddling ways when any of those constraints introduce a deviation.

And so what we see, year over year, is continued enshittening, with everything continuously pushing the boundaries of unreliability and inefficiency, and fewer and fewer people qualified to actually dig into the details to understand how these systems work, how to diagnose their issues, how to repair them, or how to explain their costs.

> If the cost of an easily managed solution is low enough, why do the details matter?

Because the patience that users have for degraded quality, and the luxury that budgets have for inefficiency, will eventually be exhausted and we'll have collectively led ourselves into a dark forest nobody has the tools or knowledge to navigate out of anymore.

Leveraging abstractions and assembling things from components are good things that enable rapid exploration and growth, but they come with latent costs that eventually need to be revisited. If enough attention isn't paid to understanding, maintaining, refining, and innovating on the lowest levels, the contraptions built through high-level abstraction and assembly will eventually either collapse upon themselves or be flanked by competitors who struck a better balance and built on more refined and informed foundations.

As a software engineer who wants a long and satisfying career, you should be seeking to understand your systems to as much depth as you can, making informed, contextual choices about what abstractions you leverage, exactly what they abstract over, and what vulnerabilities and limitations are absorbed into your projects by using them. Just making naive use of the things you found a tutorial for, or that are trending, or that make things look easy today, is a poison to your career.

7 days agoswatcoder

> If the cost of an easily managed solution is low enough

Because vertical scaling is now large enough that I can run all of twitter/amazon on one single large server. And if I'm wrong now, in a decade I won't be.

Compute power grows exponentially, but business requirements do not.

7 days agollm_trw

You described two points in an spectrum in which:

One end is PaaS like Heroku, where you just git push. The other end is bare metal hosting.

Every option you mentioned (VPS, managed K8s, self-hosted K8s, etc.) falls somewhere between these two ends of the spectrum.

If a developer falls into any of these "groups", or has a fixed preference/position on any of these solutions, they are just called juniors.

Where you end up in this spectrum is a matter of cost benefit. Nothing else. And that calculation always changes.

The more self-managed options only make sense when the cost of having someone else manage it for you, for a small premium, gets higher than the opportunity/labor cost of doing it yourself.

So, as a business, you _should_ not have a preference to stick to. You should probably start with PaaS, and as you grow, if PaaS costs get too high, slowly graduate into more self-managed things.

A company like fly.io is a PaaS. Their audience has always been, and will always be application developers who prefer to do nothing low-level. How did they forget this?

7 days agoemilsedgh

> Where you end up in this spectrum is a matter of cost benefit. Nothing else. And that calculation always changes.

This is where I see things too. When you start out, all your value comes from working on your core problem.

eg: You'd be crazy to start a CRM software business by building your own physical datacenter. It makes sense to use a PaaS that abstracts as much away as possible for you so you can focus on the actual thing that generates value.

As you grow, the high abstraction PaaS gets increasingly expensive, and at some point bubbles up to where it's the most valuable thing to work on. This typically means moving down a layer or two. Then you go back to improving your actual software.

You go through this a bunch of times, and over time grow teams dedicated to this work. Given enough time and continuous growth, it should eventually make sense to run your own data centers, or even build your own silicon, but of course very few companies get to that level. Instead most settle somewhere in the vast spectrum of the middle, with a mix of different services/components all done at different levels of abstraction.

7 days agogregmac

You’re correct that it would be absurd to build a DC, but you left out the next-best thing, and the one that is VERY financially attractive: colo’ing. I can rent 1U for around $50-75/month, or if I want HA-ish (same rack in the same DC isn’t exactly HA, but it solves for hardware failure anyway), 5U would probably run $200-250/month or so, and that lets you run two nodes with HAProxy or what-have-you, sharing a virtual IP, fronting three worker nodes running K8s, or a Proxmox cluster, or whatever. The hardware is also stupidly cheap, because you don’t need anything remotely close to new, so for about $200/node, you’ll have more cores and memory than you know what to do with.

The DC will handle physical servicing for you if something breaks; you just pay for parts and labor.

All of this requires knowledge, of course, but it’s hardly an impossible task. Go look at what the more serious folk in r/homelab (or r/datacenter) are up to; it’ll surprise you.

7 days agosgarland

For $200/month, I can have 2 ALBs, 2 ECS services, 2 CloudWatch log groups, and 2 RDS instances on AWS (one each for dev and prod) and a GitHub Team account with enough included runner minutes to cover most deployments. A colo is going to be more hassle, and I'll have to monitor more things (like system upgrades and intrusion attempts). I'd also have to amortize parts and labor as part of the cost, which is going to push the price up. If I need all that capacity, then the colo is definitely the better bet. But if I don't, and a small shop usually doesn't, then managed infrastructure is going to be preferable.

7 days agokbolino

For $200/month and all that auxiliary infrastructure, those two RDS instances will be running on the equivalent compute power of an iPhone 5s…

7 days agochatmasta

Pretty much. I don't really see the problem though. You're also getting regular snapshots, which is yet another thing you have to build in that colo setup (and where are your backups going?). This is not for personal projects or shoestring-budget non-profits where you're willing to volunteer time, it's for businesses and decent-paying work, where two hundred dollars is one man-hour of fully loaded labor cost.

7 days agokbolino

A Postgres `db.m6g.large` (the cheapest non-burstable instance) runs $114/month for a single AZ, and that's not counting storage or bandwidth. A `db.t4g.medium` runs $47/month, again, not counting storage or bandwidth. An ALB that somehow only consumed a single LCU per month would run $22. The rest of the mentioned items will vary wildly depending on the application, but between those and bandwidth - not to mention GitHub's fees - I sincerely doubt you'd come in anywhere close to $200. $300 maybe, but as the sibling comment mentioned, the instances you'll have will be puny in comparison.

> I'll have to monitor more things (like system upgrades and intrusion attempts)

You very much should be monitoring / managing those things on AWS as well. For system upgrades, `unattended-upgrades` can keep security patches (or anything else if you'd like, but I wouldn't recommend that unless you have a canary instance) up to date for you. For kernel upgrades, historically it's reboots, though there have been a smattering of live update tools like kSplice, kGraft, and the latest addition from GEICO of all places, tuxtape [0].
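On Debian/Ubuntu, the unattended-upgrades setup mentioned above is roughly a two-file affair; a minimal sketch (stock package and paths, security-only origins by default):

    apt-get install -y unattended-upgrades
    # enable the periodic runs (same thing `dpkg-reconfigure -plow unattended-upgrades` writes)
    cat > /etc/apt/apt.conf.d/20auto-upgrades <<'EOF'
    APT::Periodic::Update-Package-Lists "1";
    APT::Periodic::Unattended-Upgrade "1";
    EOF
    # allowed origins and blacklists live in /etc/apt/apt.conf.d/50unattended-upgrades
    unattended-upgrade --dry-run --debug   # sanity-check what would be applied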

> I'd also have to amortize parts and labor as part of the cost, which is going to push the price up.

Given the prices you laid out for AWS, it's not multi-AZ, but even single-AZ can of course fail over, with some downtime. So I'll say you get 2U, with two individual servers, the DBs either doing logical replication w/ failover, or something like DRBD [1] to present the two servers' storage as a single block device (you'd still need a failover mechanism for the DBs). That's $400 up front for two 1U servers, and maybe $150/month at most for colo space. Even against the (IMO unrealistically low) $200/month quote for AWS, you're saving $50/month, so the hardware pays for itself in about eight months. Re: parts and labor, luckily, parts for old servers are incredibly cheap. PC3-12800R 16GiB sticks are $10-12. CPUs are also stupidly cheap. Assuming Ivy Bridge era (yes, this is old; yes, it's still plenty fast for nearly any web app), even the fastest available (E5-2697v2) is $50 for a matched pair.

I don't say all of this just guessing; I run 3x Dell R620s along with 2x Supermicros in my homelab. My uptime for services is better than most places I've worked at (of course, I'm the only one doing work, I get that). They run 24/7/365, and in the ~5 years or so I've had these, the only trouble the Dells have given me is one bad PSU (each server has redundant PSUs, so no big deal), and a couple of bad sticks of RAM. One Supermicro has been slightly less reliable but to be fair, a. it has a hodgepodge of parts b. I modded its BIOS to allow NVMe booting, so it's not entirely SM's fault.

EDIT: re: backups in your other comment, run ZFS as your filesystem (for a variety of reasons), periodically snapshot, and then send those off-site to any number of block storage providers. Keep the last few days, with increasing granularity as you approach today, on the servers as well. If you need to roll back, it's incredibly fast to do so.
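A rough sketch of that snapshot-and-ship flow, assuming a hypothetical `tank/apps` dataset and an off-site `backuppool` (tools like sanoid/syncoid or zrepl automate the same pattern):

    # nightly, e.g. from cron; dataset and host names are placeholders
    TODAY=$(date +%F)
    zfs snapshot -r tank/apps@"$TODAY"
    # incremental send of everything since yesterday's snapshot to the off-site box
    zfs send -R -i tank/apps@"$(date -d yesterday +%F)" tank/apps@"$TODAY" \
        | ssh backup.example.com zfs receive -u -d backuppool
    # keep roughly the last 30 snapshots locally
    zfs list -H -t snapshot -o name -s creation -d 1 tank/apps \
        | head -n -30 | xargs -r -n 1 zfs destroy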

[0]: https://github.com/geico/tuxtape

[1]: https://linbit.com/drbd/

7 days agosgarland

I upvoted for the details, and I agree that if you try to buy comparable capacity in AWS, it's going to be more expensive. Scaling up in AWS is definitely going to cost more over time too. I don't want to hide these facts.

But you don't need comparable capacity, at least not at first. And when you do, you click some buttons or run terraform plan/apply. Absolutely it's going to cost more measured only by tech specs. But you're not paying primarily for tech specs, you're paying for somebody else to do the work. That's where the cost comparability really needs to be assessed.

Security in AWS is a thorny topic, I'll agree, but the risks are a little different. You need to secure your accounts and users, and lock out unneeded services while monitoring for unexpected service utilization. Honestly, I think for what you're paying, AWS should be doing more for you here (and they are improving albeit slowly). Hence maybe the real point of comparison ought to be against PaaS because then all of that is out of scope too, and I think such offerings are already putting pressure on AWS to offer more value.

7 days agokbolino

> But you don't need comparable capacity, at least not at first.

Agreed.

> But you're not paying primarily for tech specs, you're paying for somebody else to do the work. ... Honestly, I think for what you're paying, AWS should be doing more for you here

Also agreed, and this is why I don't think the value proposition exists.

We can agree to disagree on which approach is better; I doubt there's an objective truth to be had.

7 days agosgarland

> A Postgres `db.m6g.large` (the cheapest non-burstable instance) runs $114/month for a single AZ, and that's not counting storage or bandwidth. A `db.t4g.medium` runs $47/month, again, not counting storage or bandwidth.

This is why numbers do not stack up in the calculations – the premise that the DB has to be provisioned is not the correct one to start off with.

The right way of cooking RDS in AWS is to go serverless from the start and configure the ACU range, e.g. 1 to N. That way it will be even cheaper than the originally quoted $200.

Generally speaking, there is absolutely no need for anything to be provisioned at a fixed compute capacity in AWS unless there is a very specific use case or an edge case that warrants a provisioned instance of something.
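For reference, "go serverless and set an ACU range" looks roughly like this with the AWS CLI; the identifiers and the 1–4 ACU range are placeholders, and Serverless v2 still wants a `db.serverless` instance inside the cluster (worth double-checking against current docs):

    aws rds create-db-cluster \
        --db-cluster-identifier app-db \
        --engine aurora-postgresql \
        --master-username appadmin \
        --manage-master-user-password \
        --serverless-v2-scaling-configuration MinCapacity=1,MaxCapacity=4
    # the cluster still needs an instance; db.serverless is what makes it scale by ACU
    aws rds create-db-instance \
        --db-instance-identifier app-db-1 \
        --db-cluster-identifier app-db \
        --db-instance-class db.serverless \
        --engine aurora-postgresql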

6 days agoinkyoto

> The right way of cooking RDS in AWS is to go serverless from the start

Nitpick, but there is no Serverless for RDS, only Aurora. The two are wildly different in their architecture and performance characteristics. Then there's RDS Multi-AZ Cluster, which is about as confusingly named as they could manage, but I digress.

Let's take your stated Minimum ACU of 1 as an example. That gives you 2 GiB of RAM, with "CPU and networking similar to what is available in provisioned Aurora instances." Since I can't find anything more specific, I'll compare it to a `t4g.small`, which has 2 vCPU (since it's ARM, it's actual cores, not threads), and 0.128 / 5.0 Gbps [0] baseline/burst network bandwidth, which is 8 / 625 MBps. That burst is best-effort, and also only lasts for 5 – 60 minutes [1] "depending on instance size." Since this is tiny, I'm going to assume the low end of that scale. Also, since this is Aurora, we have to account for both [2] client <--> DB and DB-compute (each node, if more than one) <--> DB-storage bandwidth. Aurora Serverless v2 is $0.12/hour, or $87.60/month, plus storage, bandwidth, and I/O costs.

So we have a Postgres-compatible DB with 2 CPUs, 2 GiB of RAM, and 64 Mbps of baseline network bandwidth that's shared between application queries and the cluster volume. Since Aurora doesn't use the OS page cache, its `shared_buffers` will be set to ~75% of RAM, or 1.5 GiB. Memory will also be consumed by the various processes, like the WAL writer, background writer, auto-vacuum daemon, and of course, each connection spawns a process. For the latter reason, unless you're operating at toy scale (single-digit connections at any given time), you need some kind of connection pooler with Postgres. Keeping in the spirit of letting AWS do everything, they have RDS Proxy, which despite the name, also works with Aurora. That's $0.015/ACU-hour, with a minimum 8 ACUs for Aurora Serverless, or $87.60/month.

Now, you could of course just let Aurora scale up in response to network utilization, and skip RDS Proxy. You'll eventually bottleneck / it won't make any financial sense, but you could. I have no idea how to model that pricing, since it depends on so many factors.

I went on about network bandwidth so much because it catches people by surprise, especially with Aurora, and doubly so with Postgres for many services. The reason is its WAL amplification from full page writes [3]. If you have a UUIDv4 (or anything else non-k-sortable) PK, the B+tree is getting thrashed constantly, leading to slower performance on reads and writes. Aurora doesn't suffer from the full page writes problem (that's still worth reading about and understanding), but it does still have the same problems with index thrashing, and it also has the same issues as Postgres with Heap-Only Tuple updates [4]. Unless you've carefully designed your schema around this, it's going to impact you, and you'll have more network traffic than you expected. Add to that dev's love of chucking everything into JSON[B] columns, and the tuples are going to be quite large.

Anyway, I threw together an estimate [5] with just Aurora (1 ACU, no RDS Proxy, modest I/O), 2x ALBs with an absurdly low consumption, and 2x ECS tasks. It came out to $232.52/month.

[0]: https://docs.aws.amazon.com/ec2/latest/instancetypes/gp.html...

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...

[2]: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...

[3]: https://www.rockdata.net/tutorial/tune-full-page-writes/

[4]: https://www.postgresql.org/docs/current/storage-hot.html

[5]: https://calculator.aws/#/estimate?id=8972061e6386602efdc2844...

6 days agosgarland

I don't know much about Aurora, there was/is a little too much magic there for my taste, but I feel like once we start with "Postgres-compatible DB", we can't necessarily reason about how things perform under the hood in terms of ordinary Postgres servers. Is there a detailed breakdown of Aurora and its performance/architecture out there? My experience is that AWS is cagey about the details to maintain competitive advantage.

6 days agokbolino

There are some details available[0], although they are scarce.

I procured my Aurora intel from a lengthy phone conversation with an exceptionally knowledgeable (and excessively talkative) AWS engineer – who had worked on Aurora – several years ago. The engineer provided detailed explanations of Aurora’s architecture, do's, and dont's as part of our engagement with AWS. The engineer was very proud of AWS’ accomplishments (and I concur that their «something serverless» products are remarkable engineering feats as well as significant cost-saving solutions for me and my clients). The engineer was willing to share many non-sensitive technical details. Generally speaking, a sound understanding of distributed architectures and networks should be sufficient to grasp Aurora Serverless. The actual secret sauce lies in the fine-tuning and optimisations.

[0] https://muratbuffalo.blogspot.com/2024/07/understanding-perf...

5 days agoinkyoto

You can find some re:Invent talks about it, but the level of depth may not be what you want.

The tl;dr is they built a distributed storage system that is split across 3 AZs, each with 2 storage nodes. Storage is allocated in 10 GiB chunks, called protection groups (perhaps borrowing from Ceph’s placement group terminology), with each of these being replicated 6x across the nodes in AZs as mentioned. 4/6 are required for quorum. Since readers are all reading from the same volume, replica lag is typically minimal. Finally, there are fewer / no (not positive honestly; I have more experience with MySQL-compatible Aurora) checkpoints and full page writes.

If you’ve used a networked file system with synchronous writes, you’ll know that it’s slow. This is of course exacerbated with a distributed system requiring 4/6 nodes to ack. To work around this, Aurora has “temporary local storage” on each node, which is a fixed size proportional to the instance size. This is used for sorts that spill to disk, and building secondary indices. This has the nasty side effect that if your table is too large for the local storage, you can’t build new indices, period. AWS will tell you “upsize the instance,” but IMO it’s extremely disingenuous to tout the ability for 128 TiB volumes without mentioning that if a single table gets too big, your schema becomes essentially fixed in place.

Similarly, MySQL normally has something called a change buffer that it uses for updating secondary indices during writes. Can’t have that with Aurora’s architecture, so Aurora MySQL has to write through to the cluster volume, which is slow.

AWS claims that Aurora is anywhere from 3-5x faster than the vanilla versions of the respective DBs, but I have never found this to be true. I’ve also had the goalposts shifted when arguing this point, with them saying “it’s faster under heavy write contention,” but again, I have not found this to be true in practice. You can’t get around data locality. EBS is already networked storage; requiring 4/6 quorum across 3 physically distant AZs makes it even worse.

The 64 TiB limit of RDS is completely arbitrary AFAIK, and is purely to differentiate Aurora. Also, if you have a DB where you need that, and you don’t have a DB expert on staff, you’re gonna have a bad time.

6 days agosgarland

Thanks for the correction, it is Aurora, indeed, although it can be found under the «RDS» console in AWS.

Aurora is actually not a database but is a scalable storage layer that operates over the network and is decoupled from the query engine (compute). The architecture has been used to implement vastly different query engines on top of it (PgSQL, MySQL, DocumentDB – a MongoDB alternative, and Neptune – a property graph database / triple store).

The closest abstraction I can think of to describe Aurora is a VAX/VMS cluster – where the consumer sees a single entity, regardless of size, whilst the scaling (out or back in) remains entirely opaque.

Aurora does not support RDS Proxy for PostgreSQL or its equivalents for other query engine types because it addresses cluster access through cluster endpoints. There are two types of endpoints: one for read-only queries («reader endpoints» in Aurora parlance) and one for read-mutate queries («writer endpoint»). Aurora supports up to 15 reader endpoints, but there can be only one writer endpoint.

Reader endpoints improve the performance of non-mutating queries by distributing the load across read replicas. The default Aurora cluster endpoint always points to the writer instance. Consumers can either default to the writer endpoint for all queries or segregate non-mutating queries to reader endpoints for faster execution.

This behaviour is consistent across all supported query engines, such as PostgreSQL, Neptune, and DocumentDB.

I do not think it is correct to state that Aurora does not use the OS page cache – it does, as there is still a server with an operating system somewhere, despite the «serverless» moniker. In fact, due to its layered distributed architecture, there is now more than one OS page cache, as described in [0].

Since Aurora is only accessible over the network, it introduces unique peculiarities where the standard provisions of storage being local do not apply.

Now, onto the subject of costs. A couple of years ago, an internal client who ran provisioned RDS clusters in three environments (dev, uat, and prod) reached out to me with a request to create infrastructure clones of all three clusters. After analysing their data access patterns, peak times, and other relevant performance metrics, I figured that they did not need provisioned RDS and would benefit from Aurora Serverless instead – which is exactly what they got (unbeknownst to them, which I consider another net positive for Aurora). The dev and uat environments were configured with lower upper ACU's, whilst production had a higher upper ACU configuration, as expected.

Switching to Aurora Serverless resulted in a 30% reduction in the monthly bill for the dev and uat environments right off the bat and nearly a 50% reduction in production costs compared to a provisioned RDS cluster of the same capacity (if we use the upper ACU value as the ceiling). No code changes were required, and the transition was seamless.

Ironically, I have discovered that the AWS cost calculator consistently overestimates the projected costs, and the actual monthly costs are consistently lower. The cost calculator provides a rough estimate, which is highly useful for presenting the solution cost estimate to FinOps or executives. Unintentionally, it also offers an opportunity to revisit the same individuals later and inform them that the actual costs are lower. It is quite amusing.

[0] https://muratbuffalo.blogspot.com/2024/07/understanding-perf...

5 days agoinkyoto

> Aurora is actually not a database but is a scalable storage layer that operates over the network and is decoupled from the query engine (compute).

They call it [0] a database engine, and go on to say "Aurora includes a high-performance storage subsystem.":

> "Amazon Aurora (Aurora) is a fully managed relational database engine that's compatible with MySQL and PostgreSQL."

To your point re: part of RDS, though, they do say that it's "part of RDS."

> The architecture has been used to implement vastly different query engines on top of it (PgSQL, MySQL, DocumentDB – a MongoDB alternative, and Neptune – a property graph database / triple store).

Do you have a source for this? That's new information to me.

> Aurora does not support RDS Proxy for PostgreSQL

Yes it does [1].

> I do not think it is correct to state that Aurora does not use the OS page cache – it does

It does not [2]:

> "Conversely, in Amazon Aurora PostgreSQL, the default value [for shared_buffers] is derived from the formula SUM(DBInstanceClassMemory/12038, -50003). This difference stems from the fact that Amazon Aurora PostgreSQL does not depend on the operating system for data caching." [emphasis mine]

Even without that explicit statement, you could infer it from the fact that the default value for `effective_cache_size` in Aurora Postgres is the same as that of `shared_buffers`, the formula given above.

> Switching to Aurora Serverless resulted in a 30% reduction in the monthly bill for the dev and uat environments right off the bat

Agreed, for lower-traffic clusters you can probably realize savings by doing this. However, it's also likely that for Dev/Stage/UAT environments, you could achieve the same or greater via an EventBridge rule that starts/stops the cluster such that it isn't running overnight (assuming the company doesn't have a globally distributed workforce).
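The stop/start half of that is just two RDS calls; a sketch of what the scheduled job could run for a hypothetical `dev-db` cluster (note AWS restarts a stopped Aurora cluster automatically after seven days):

    # evening (e.g. 20:00 on weekdays)
    aws rds stop-db-cluster --db-cluster-identifier dev-db
    # morning (e.g. 07:00 on weekdays)
    aws rds start-db-cluster --db-cluster-identifier dev-db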

What bothers me most about Aurora's pricing model is charging for I/O. And yes, I know they have an alternative pricing model that doesn't do so (but the baseline is of course higher); it's the principle of the thing. The amortized cost of wear to disks should be baked into the price. It would be difficult even for a skilled DBA with plenty of Linux experience to accurately estimate how many I/Os a given query might take. In a vacuum for a cold cache, it's not that bad: estimate or look up statistics for row sizes, determine if any predicates can use an index (and if so, the correlation of the column[s]), estimate index selectivity, if any, confirm expected disk block size vs. Postgres page size, and make an educated guess. If you add any concurrent queries that may be altering the tuples you're viewing, it's now much harder. If you then add a distributed storage layer, which I assume attempts to boxcar data blocks for transmission much like EBS does, it's nearly impossible. Now try doing that if you're a "cloud native" type who hasn't the faintest idea what blktrace [3] is.

[0]: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...

[1]: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...

[2]: https://aws.amazon.com/blogs/database/determining-the-optima...

[3]: https://linux.die.net/man/8/blktrace

5 days agosgarland

The heartburn I have is how this stifles innovation for some. The cost of experimentation is high if every person wanting to try a new idea goes down this road.

My personal AWS account is stuffed with globally distributed multi-region, multi-az, fault tolerant, hugely scalable things that rarely get used. By “rarely” I mean requests per hour or minute, not second.

The sum total CPU utilization would be negligible. And if I ran instances across the 30+ AZs I’d be broke.

The service based approach (aka event driven) has some real magic at the low end of usage where experimentation and innovation happens.

7 days agomlhpdx

This is news to us. Our primary DX is a CLI. One of our defining features is hardware isolation. To use us, you have to manage Dockerfiles. Have you had the experience of teaching hundreds of Heroku refugees how to maintain a Dockerfile? We have had that experience. Have you ever successfully explained the distinction between "automated" Postgres and "managed" Postgres? We have not.

You're not wrong that there's a PaaS/public-cloud dividing line, and that we're at an odd place between those two things. But I mean, no, it is not the case that our audience is strictly developers who do nothing low-level. I spent months of my life getting _UDP_ working for Fly apps!

7 days agotptacek

Ok, let me rephrase this:

> Their audience has always been, and will always be application developers who prefer to do nothing low-level. How did they forget this?

to this:

Their audience has always been, and will always be, application developers who prefer to do nothing except build their main product.

> Our primary DX is a CLI. One of our defining features is hardware isolation. To use us, you have to manage Dockerfiles. Have you had the experience of teaching hundreds of Heroku refugees how to maintain a Dockerfile? We have had that experience. Have you ever successfully explained the distinction between "automated" Postgres and "managed" Postgres? We have not.

I'm pretty sure an application developer in this day and age has to know all of that, yes. Just like git.

7 days agoemilsedgh

No. It is definitely not the case that the modal developer today needs to know Docker. If only! It's a huge pain point for us with the PaaS customer cohort.

7 days agotptacek

So why are you guys not becoming Heroku compatible?

https://github.com/gliderlabs/herokuish

Something like this + Procfile support should allow you to gobble up Heroku customers [like us] quickly, since they've been stagnating for so long, no?

7 days agoemilsedgh

Agreed. I have been amazed at how little most developers seem to know Docker now. It seemed like more people understood Docker 5 to 10 years ago than do now. I'm not sure why this regression happened, but it has definitely been my experience.

7 days agofreedomben

Aren’t we just continually moving up layers of abstractions? Most of the increasingly small group doesn’t concern itself with voltages, manually setting jumpers, hand-rolling assembly for performance-critical code, cache line alignment, raw disk sector manipulation, etc.

I agree it’s worthwhile to understand things more deeply but developers slowly moving up layers of abstractions seems like it’s been a long term trend.

7 days agovarenc

We certainly need abstractions for the first layer of the hardware. An abstraction of the abstraction can be useful if the first abstraction is very bad or very crude. But we are now at an abstraction of an abstraction x 8 or so. It's starting to get a bit over the top.

7 days agoMoru

I disagree with your sentiment. One thing that has been constant in my experience as a computer programmer: there are always "old" computer programmers complaining that there are too many abstractions.

7 days agothrowaway2037

You cannot see any way in which we run out of possible abstraction layers? I think in the past we have assumed that was natural language, but I think natural language is a pretty poor programming language. What people actually want when they say that, is for someone to read all the nuance out of their mind and codify it.

I don't think we have actually been abstracting new layers over the past 5-10 years anyway. Most of what I see is moving sideways, not up the stack: covering more breadth, not height or depth, of abstractions.

7 days agoehnto

> Most of what I see is moving sideways, not up the stack. Covering more breadth not height or depth, of abstractions.

I don't follow your logic. This comment is so vague. Do you have a specific example?

6 days agothrowaway2037

There are not many languages that, for example, abstract higher than C# and compile down to it. And whilst we are building abstractions in C# (etc.), those are one step up (height), and then there are many such frameworks, thus many steps sideways (breadth).

5 days agoehnto

We only run out of abstraction once there is stagnation and time to really bake.

As long as some new thing is being invented in our industry, a new abstraction will be needed because the old one just can’t quite flex enough while being backwards compatible.

7 days agoyazaddaruvala

One problem is that we're building abstractions on top of older abstractions when that isn't most efficient. Suppose we write an HTTP server that can call CGI programs. Then we implement PHP as a CGI program. Then we write a Lua interpreter in PHP. Then we write our website in Lua.

Some of those levels are useful. Some of them are redundant. We should embed a Lua interpreter in our webserver and delete two levels of abstraction.

(I'm not aware of any actual Lua interpreter written in PHP, but it's representative of the kinds of stacks that do exist out there)

7 days agoimmibis

Only really relevant if the efficiency is valuable but yeah totally. It’s not just about new abstractions on top of the stack but also new abstractions that replace the middle and bottom of the stacks.

6 days agoyazaddaruvala

> I'm increasingly coming to the view that there is a big split among "software developers" and AI is exacerbating it.

I don't think this split exists, at least in the way you framed it.

What does exist is workload, and problems that engineers are tasked with fixing. If you are tasked with fixing a problem or implementing a feature, you are not tasked with learning all the minute details or specifics of a technology. You are tasked with getting shit done, which might even turn out to not involve said technology. You are paid to be a problem-solver, not an academic expert on a specific module.

What you tried to describe as "magic" is actually the balance between broad knowledge and specialization, or being a generalist vs. a specialist. The bulk of the problems that your average engineer faces require generalists, not specialists. Moreover, the tasks that actually require a specialist are rare, and when those surface the question is always whether it's worth investing in a specialist. There are diminishing returns on investment, and throwing a generalist at the problem will already get some results. Give a generalist access to an LLM and they'll cut down on the research time to deliver something close to what a specialist would deliver. So why bother?

With this in mind, I would go as far as to say that the scenario backhandedly described as "wanting to understand where their code is running and what it's doing" (as if no engineer needs insight into how things work?), as opposed to the dismissive "just wants to `git push` and be done with it" scenario, can actually be classified as a form of incompetence. You, as an engineer, only have so many hours per day. Your day-to-day activities involve pushing new features and fixing new problems. To be effective, your main skill is learning the system in a JIT way, diving in, fixing it, and moving on. You care about system traits, not low-level implementation details that may change tomorrow on a technology you may not even use tomorrow. If, instead, you feel the need to spend time on topics that are irrelevant to the immediate needs of your role, you are failing to deliver value. I mean, if you frame yourself as a Kubernetes expert who even knows commit hashes by heart, does that matter if someone asks you, say, why a popup box is showing off-center?

7 days agomotorest

I'm not entirely certain. Or perhaps we're all part of both groups.

I want to understand LLMs. I want to understand my compiler, my gc, my type system, my distributed systems.

On the other hand, I don't really care about K8s or anything else, as long as I have something that works. Just let me `git push` and focus on making great things elsewhere.

7 days agoYoric

>Or perhaps we're all part of both groups.

This feels right to me. Application development and platform development are both software development tasks, and lots of software devs do both. I like working on platform-level stuff, and I like building applications. But I like there to be a good distinction between the two, and when I'm working on application-level stuff, I don't want to have to think about the platform.

Services like fly.io do a good job of hiding all the platform-level work and just giving you a place to deploy your application to, so when they start exposing tools like GPUs that are more about building platforms than building applications, it's messy.

7 days agonotatoad

I am the former. I also make cost benefit based decisions that involve time. Unless I have very specific configuration needs, the git push option lets me focus on what my users care about and gives me one less thing that I need to spend my time on.

Increasingly, Fly even lets you dip into more complex configurations too.

I’ve got no issue with using Tofu and Ansible to manage my own infrastructure but it takes time to get it right and it’s typically not worth the investment early on in the lifecycle.

7 days agobrightball

>who don't like "magic" and want to understand where their code is running and what it's doing.

I just made this point in a post on my substack. Especially in regulated industries, you NEED to be able to explain your AI to the regulator. You can't have a situation where a human says, "Well, gee, I don't know. The AI told me to do it."

7 days agojsemrau

"Enjoys doing linux sysadmin" is not the same as "Wants to understand how things work". It's weird to me that you group those two kinds of people in one bucket.

7 days agoskrebbel

I feel like fly.io prioritizes a great developer experience and I think that appeals to engineers who both do and don't like magic.

But the real reason I like fly.io is because it is a new thing that allows for new capabilities. It allows you to build your own Cloudflare by running full virtual machines colocated next to appliances in a global multicast network.

7 days agoconradev

> they're willing to spend a lot of (usually their employer's) money

May just be my naïveté, but I thought that something like ECS or EKS is much cheaper than an in-house k8s engineer.

7 days agodirtbag__dad

Move to EKS and you still need a k8s engineer, but one who also knows AWS, and you also pay the AWS premium for the hosting, egress, etc. It might make sense for your use case but I definitely wouldn’t consider it a cost-saving measure.

7 days agodocandrew

If you don’t have staff to do that, you probably aren’t at the scale when you need them, and you’re needlessly adding complexity.

It’s always baffling to me why people think that ECS or god forbid EKS is somehow easier than a few Linux boxes.

7 days agosgarland

Because the force multiplier of a good DX way outweighs the occasional nonsense from having to do k8s upgrades or troubleshooting

For example: how do you roll out a new release of your product? In sane setups, it's often $(helm upgrade --install ...), which is itself often run either in-cluster by watching a git managed descriptor, or in CI on merge to a release branch/tag
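For concreteness, a sketch of that CI step for a hypothetical chart, pinning the image tag to the commit being released (`CI_COMMIT_SHA` is whatever variable your CI exposes; chart path and namespace are placeholders):

    # run by CI on merge to the release branch
    helm upgrade --install myapp ./deploy/chart \
        --namespace prod --create-namespace \
        --set image.tag="${CI_COMMIT_SHA}" \
        --wait --atomic --timeout 10m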

How does your developer get logs? Maybe it's via Splunk/ELK/DataDog/whatever but I have never in my life seen a case where that's a replacement for viewing the logs

How do you jump into the execution environment for your workload, to do more advanced debugging? I'm sure you're going to say ssh, which leads to the next questions of "how do you audit what was done, to prevent config drift" followed by "how do you authenticate the right developer at the right time with access to the right machine without putting root's public key file in a spreadsheet somewhere"

7 days agomdaniel

> For example: how do you roll out a new release of your product?

It's pretty easy to accomplish that with docker compose if you have containers, but you can also use systemd and some bash scripts to accomplish the same thing. Admittedly this would only affect a single node, but it's also possible to manage multiple nodes without using K8s / Nomad.
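A sketch of that single-node rollout, assuming a hypothetical `/srv/myapp` checkout with a compose file (host and paths are placeholders):

    # deploy.sh, run from CI or a laptop
    ssh deploy@app.example.com '
      cd /srv/myapp &&
      git pull --ff-only &&
      docker compose pull &&
      docker compose up -d --remove-orphans &&
      docker image prune -f
    '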

> How does your developer get logs?

fluentd

> How do you jump into the execution environment for your workload, to do more advanced debugging?

ssh

> how do you audit what was done, to prevent config drift

Assuming you're pulling down releases from a git repo, git diff can be used to detect changes, and you can then opt to either generate a patch file and send it somewhere, or just reset to HEAD. For server settings, any config management tool, e.g. puppet.
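A sketch of that drift check for a host whose config lives in a git checkout (the path and patch destination are placeholders):

    cd /srv/myapp
    if ! git diff --quiet HEAD; then
        # capture what changed, ship it somewhere for review, then revert
        git diff HEAD > "/var/tmp/drift-$(hostname)-$(date +%F).patch"
        git checkout -- .    # or: git reset --hard HEAD
    fi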

> how do you authenticate the right developer at the right time with access to the right machine without putting root's public key file in a spreadsheet somewhere

freeipa

I'm not saying any of this is better than K8s. I'm saying that, IMO, the above can be simpler to reason about for small setups, and has a lot less resource overhead. Now, if you're already comfortable administering and troubleshooting K8s (which is quite a bit different than using it), and you have no background in any of the above, then sure, K8s is probably easier. But if you don't know this stuff, there's a good chance you don't have a solid background in Linux administration, which means when your app behaves in strange ways (i.e. not an application bug per se, but how it's interacting with Linux) or K8s breaks, you're going to struggle to figure out why.

7 days agosgarland

> Splunk/ELK/DataDog/whatever but I have never in my life seen a case where that's a replacement for viewing the logs

Uh, any time I run a distributed system and logs could appear on n nodes, I need a log aggregator or I am tailing in n terminals. I almost only use Splunk. I tail logs in dev. Prod needs an aggregator. This has been my experience at 4 of my last 6 companies. The shit companies who had all the issues? Logs on CloudWatch or only on the node.

7 days agosethammons

kubectl logs deployment my-multinode-deployment

7 days agomoondev

That's a (crappy) aggregator technically.

4 days agosethammons

I've set up and run my own physical Linux servers as well as cloud ones, and it may be easier to get a Linux box up and running with an application, but getting it into a state I consider production-ready is much harder. With ECS (or similar offerings; I agree Kubernetes can be overkill) you get logging, blue-green deployments, permissions, secret management, scaling, and more built in. You don't need to worry about upgrading your server, and there's a whole category of security issues you don't really need to worry about. I work in a space with some compliance requirements, and I do not think we could meet them at the size that we are without offerings like this.

7 days agomorsecodist

A few Linux boxes are great when you're a solo dev looking to save money and manage things yourself, but they're a poor place to start scaling up from. Not just technologically, but from an HR perspective.

Kubernetes is something you can hire for. A couple of Linux boxes running all your server code in the most efficient way possible might save you operational costs, but it resigns you to being the one who has to maintain it. I've learned this the hard way - moving things to ECS as we scale up has allowed me to give away responsibility for things. I understand that it's more complex, but I don't have to teach people now.

7 days agonotatoad

> Kubernetes is something you can hire for.

I massively distrust Ops-adjacent people's technical abilities if they don't know Linux. Multiple datapoints at multiple companies of varying scale have shown this to be true.

That said, you're correct, and I absolutely hate it. People want to do managed services for everything, and they stare at you like you're insane if you suggest running something yourself.

7 days agosgarland

> There's an (increasingly small) group of software developers who don't like "magic" and want to understand where their code is running and what it's doing.

That problem started so long ago and has gotten so bad that I would be hard-pressed to believe there is anyone on the planet who could take a modern consumer PC and explain exactly what is going on in the machine without relying on any abstractions to understand the actual physical process.

Given that, it’s only a matter of personal preference on where you draw the line for magic. As other commenters have pointed out, your line allowing for Kubernetes is already surprising to a lot of people

6 days agolovich

> I'm increasingly coming to the view that there is a big split among "software developers" and AI is exacerbating it

This is admittedly low effort, but the vast majority of devs are paid wages to "write CRUD, git push and magic" their way to the end of the month. The company does not afford them the time and privilege of sitting down and analyzing the code with a fine-tooth comb. An abstraction that works is good enough.

The seasoned seniors get paid much more and afforded leeway to care about what is happening in the stack, since they are largely responsible for keeping things running. I'm just pointing out it might merely be a function of economics.

7 days agovagrantJin

I call this difference being a developer who is on call vs being a developer who is not on call

7 days agochrismarlow9

I don't think this is entirely correct. I work for a company that does IT consulting, so I see many teams working on many different projects, and one thing I have learned the hard way is that the companies and teams that think they should do it all themselves are usually smaller companies, and they often have a lot of problems with that attitude.

Just an example I recently came across: a smaller company that uses Kubernetes and manages everything itself with a small team. The result: they get hacked regularly, and everything they run is constantly out of date, because they don't have the capacity to actually manage it themselves. And it's not even cheaper in the long run, because developer time is usually more expensive than just paying AWS to keep their EKS up to date.

To be fair, in my home lab I also run everything bare metal and keep it updated, but I run everything behind a VPN connection and run a security scanner every weekend that automatically kills any service with a CVE above Medium severity, and I fix it when I get the time.

As a small team I can only fix so much and keep so much up to date before I get overwhelmed, or the next customer project gets forced upon me by management as priority 0 ("who cares about security updates").

I'd strongly suggest using as many managed services as you can and focusing your effort as a team on what makes your software unique. Do you really need to hire 2-3 DevOps guys just to keep everything running when GCP Cloud Run "just werks"?

Everything we do these days runs on so many levels of abstraction anyway that there's no shame in sharing the cost of managing the lower levels with others (using managed services) and focusing on your product instead. Unless you are large enough to pay for whole teams that deal with nothing but infrastructure, to enable other teams to do application-level programming, you are, in my limited experience, just going to shoot yourself in the foot.

And again, just to emphasize: I like to do everything myself because, for privacy reasons, I use as few services that aren't under my control as possible, but I would not recommend this to a customer, because it's neither economical nor does it work well, in my (albeit limited) experience.

7 days agoIlikeKitties

I might be an outlier. I like to think I try for a deeper understanding of what I'm using. Like, Fly uses Firecracker VMs afaik. Sometimes, especially for quick projects or testing ideas, I just want to have it work without wrangling a bunch of AWS services. I'm typically evaluating whether this is the right tool or service, and what the price-to-convenience tradeoff is. For anything potentially long term, what's the amount of lock-in when or if I want to change providers?

7 days agomemhole

Also "They want LLMs" lol. I cant remember being asked. I dont want to use AI for coding.

7 days agozwnow

I agree that split exists, and that the former is rarer, but in my experience the split is less about avoiding magic and more about keeping control of your system.

Many, likely most, developers today don't care about controlling their system/network/hardware. There's nothing wrong with that necessarily, but it is a pretty fundamental difference.

One concern I've had with building LLM features is whether my customers would be okay with me giving their data over to the LLM vendor. Say I'm building a tool for data analysis: is it really okay with a customer for me to give their table schemas, or access to the data itself, to OpenAI, for example?

I rarely hear that concern raised though. Similarly, when I was doing consulting recently, I wouldn't use Copilot on client projects, as I didn't want Copilot servers accessing code that I don't actually own the rights to. Maybe it's overprotective, though; I have never heard anyone raise that concern, so maybe it's just me.

7 days ago_heimdall

I work for a major consulting firm and we’ve been threatened with fire and brimstone if any part of client info (code, docs, random email, anything) ever gets sent to an LLM. Even with permission from the client our attack lawyers prefer us not to use them. It’s a very sensitive topic. I still use LLMs from time to time but always starting with a blank prompt and the ask anonymized. (Heh I’m probably not even supposed to do that)

7 days agochasd00

It's not about wanting, it's about what the job asks for. As a self-employed engineer I am paid to solve business problems in an efficient way. Most of the time it just makes more business sense, for the client and for me, to pay to just have to `git push` when there are no performance challenges needing custom infrastructure.

7 days agotarsinge
[deleted]
7 days ago

I don't agree, I think you're just describing two sides of the same coin.

As a software developer I want strong abstractions without bloat.

LLMs are so successful in part because they are a really strong abstraction. You feed in text and you get back text. Depending on the model and other parameters your results may be better or worse, but changing from eg. Claude to ChatGPT is as simple as swapping out one request with another.
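To the "text in, text out" point: the two vendors' HTTP APIs aren't identical, but the shape is close enough that a swap is mostly a different endpoint, headers, and payload keys. A rough sketch (model names are just examples; check the current docs before relying on them):

    # OpenAI
    curl https://api.openai.com/v1/chat/completions \
      -H "Authorization: Bearer $OPENAI_API_KEY" -H "Content-Type: application/json" \
      -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'
    # Anthropic
    curl https://api.anthropic.com/v1/messages \
      -H "x-api-key: $ANTHROPIC_API_KEY" -H "anthropic-version: 2023-06-01" \
      -H "Content-Type: application/json" \
      -d '{"model": "claude-3-5-sonnet-latest", "max_tokens": 256, "messages": [{"role": "user", "content": "Hello"}]}'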

If what I want is to run AI tasks, then GPUs are a poor abstraction. It's very complicated (as Fly have discovered) to share them securely. The amount of GPU you need could vary dramatically. You need to worry about drivers. You need to worry about all kinds of things. There is very little bloat to the ChatGPT-style abstraction, because the network overhead is a negligible part of the overall cost.

If I say I don't want magic, what I really mean is that I don't trust the strength of the abstraction that is being offered. For example, when a distributed SQL database claims to be PostgreSQL compatible, it might just mean it's wire compatible, so none of my existing queries will actually work. It might have all the same functions but be missing support for stored procedures. The transaction isolation might be a lie. It's not that these databases are bad, it's that "PostgreSQL as a whole" cannot serve as a strong abstraction boundary - the API surface is simply too large and complex, and too many implementation details are exposed.

It's the same reason people like containers: running your application on an existing system is a very poor abstraction. The API surface of a modern linux distro is huge, and includes everything from what libraries come pre-installed to the file-system layout. On the other hand the kernel API is (in comparison) small and stable, and so you can swap out either side without too much fear.

K8S can be a very good abstraction if you deploy a lot of services to multiple VMs and need a lot of control over how they are scaled up and down. If you're deploying a single container to a VM, it's massively bloated.

TLDR: Abstractions can be good and bad, both inherently, and depending on your use-case. Make the right choice based on your needs. Fly are probably correct that their GPU offering is a bad abstraction for many of their customer's needs.

7 days agoDiggsey

I don't think you got this split right.

I prefer to either manage software directly with no wrappers on top, or use a fully automated solution.

K8S is something I'd rather avoid. Do you enjoy writing configuration for your automation layer?

7 days agokillerstorm

All professional developers want two things: to do their work as fast as possible, and to spend as little budget as possible to make things work. That's the core operating principle of most companies.

What's changing is that managed solutions are becoming increasingly easier to set up and increasingly cheaper on smaller scales.

While I do personally enjoy understanding the entire stack, I can't justify self-hosting and managing an LLM until we run so many prompts a day that it becomes cheaper for us to run our own GPUs compared to just running APIs like OpenAI/Anthropic/Deepseek/...

7 days agoblack3r

I was thinking about this just yesterday. I was advertised a device for aircraft that geo-assists taxiing (I have never flown, so I don't know why). The comments were the usual "old man shouts at cloud" anger that assistive devices make lives easier for people.

I feel this is similar to what you are pointing out. Why _shouldn't_ people be the "magic" users? When was the last time one of your average devs looked into how ESM loading works? Or the Python interpreter, or V8? Or how it communicates with the OS and the lower-level hardware interfaces?

This is the same thing. Only you are goalpost shifting.

7 days agopinoy420

> There's an (increasingly small) group of software developers who don't like "magic" and want to understand where their code is running and what it's doing. (...) The other group (increasingly large) just wants to `git push` and be done with it

I think we're approaching the point where software development becomes a low-skilled job, because the automatic tools are good enough to serve business needs, while the manual tools are too difficult for anyone but a chosen few to understand anyway.

7 days agoanal_reactor

I think it's true that engineers who want to understand every layer of everything in depth, or who want platform ownership, are not necessarily the same group as the more "product itself"-focused sort who want to write something and just push it. But I'm not sold at all that either of these groups, in a vacuum, has substantial demand for GPU compute, unless that's someone's area of interest for a pet project.

7 days agoGlyptodon

This. Personally, I’d want a GPU to self host whatever model, because I think that’s fun, plain and simple. Probably many people do too. But the business is not making money from people who are just thinking about fun.

7 days agoflufluflufluffy
[deleted]
7 days ago

> The other group (increasingly large) just wants to `git push` and be done with it, and they're willing to spend a lot of (usually their employer's) money to have that experience. They don't want to have to understand DNS, linux, or anything else beyond whatever framework they are using.

lol, even understanding git is hard for them. Increasingly, software engineers don't want to learn their craft.

7 days agoEarthIsHome

I think the root of it is most people coming into the software engineering industry just want a good paying job. They don’t have any real interest in computers or networks or anything else. Whatever keeps the direct deposits coming is what they’ll do. And in their defense, the web dev industry is so large in breadth and depth and the pay/benefits are so generous it’s an attractive career path no matter what your passion is.

7 days agochasd00

The way I think about it is this: any individual engineer (or any individual team) has a limited complexity budget (in other words, how much can you fit in your meat brain). How you spend it is a strategic decision. Depending on your project, you may not want to waste it on infra so you can fit a lot of business logic complexity.

7 days agoripped_britches

Increasingly small is right. I'm definitely part of that former group, but sadly, more and more these days I just feel dumb for being this way. It usually just means that I'm less productive than my colleagues in practice, as I'm spending time figuring out how things work while everybody else is pushing commits. Maybe if we were put in a hypothetical locked room with no internet access I'd have a slightly easier time than them, but that's not helpful to anybody.

Once upon a time I could have said that it's better this way and that everybody will be thankful when I'm the only person who can fix something, but at this point that isn't really true, when anybody can just get an LLM to walk them through it if they need to understand what's going on under the hood. Really, I'm just a nerd and I need to understand if I want to sleep at night lol.

7 days agoyungporko

You lost me at "Kubernetes".

6 days agohassleblad23

The former have the mentality of being independent, at the cost of their ability to produce a result as quickly. The latter are happy to be dependent, because the result is more important than the means. Obviously this is a spectrum.

7 days agoantihero

It depends on the product you're building. At my last job we hosted bespoke controlnet-guided diffusion models. That means k8s+GPUs was a necessity. But I would have loved to use something simpler than k8s.

7 days agoteaearlgraycold

I don’t think this comment does justice to fly.io.

They have incredible defaults that can make it as simple as just running ‘git push’ but there isn’t really any magic happening, it’s all documented and configurable.

7 days agodanstewart_

Where does this dichotomy between Kubernetes and superficial understanding come from? It is not consistent with my experience, and I don't have any speculation as to its origin.

7 days agothe__alchemist

It's been a while since I tried, but my experience trying to manually set up GPUs was atrocious, and with investigation generally ending at the closed-source NVidia drivers it's easy to feel disempowered pretty quickly. I think my biggest learning from trying to do DL on a manually set up computer was simply that GPU setup was awful and I never wanted to deal with it. It's not that I don't want to understand it, but with NVidia software you're essentially not allowed to understand it. If open source drivers or open GPU hardware were released, I would gladly learn how that works.

3 days agogarfieldnate

The latter group sounds like they're more managers than software developers.

6 days agohk1337

The view that developers just want LLMs is plain wrong. The age of AI is just starting.

7 days agostartupsfail

Somebody who doesn’t want to understand DNS, Linux, or anything beyond their framework is a hazard. They’re not able to do a competent code review on the vomit that LLMs produce. (Am I biased much?)

6 days agoianmcnaney

I'd be in the latter group if my budget were infinite. Alas!

7 days agoandai

> They don't want to have to understand DNS, linux, or anything else beyond whatever framework they are using.

Tell me whether there are many bricklayers who want to understand the chemical composition of their bricks.

7 days agochii

I’ve never laid bricks but in other trades I’ve worked in, well, a lot of people understood basics of the chemistry of the products we used. It’s useful to understand how they work together safely, if they can be exposed to different environments, if they’re heat-safe, cold-safe, do they off-gas, etc.

Paints, wood finishes, adhesives, oils, abrasives, you name it. You generally know at least a bit about what’s in it. I can’t say everyone I’ve worked with wanted to know, but it’s often intrinsic to what you’re doing and why. You don’t just pull a random product off a shelf and use it. You choose it, quite often, because of its chemical composition. I suspect it’s not always thought of this way, though.

This is the same with a lot of artistic mediums as well. Ceramicists often know a lot more than you’d expect about what’s in their clay and glazes. It’s really cool.

I’m not trying to be contrarian here. I know some people don’t care at all, and some people use products because it’s what they were told to do and they just go with it. But that wasn’t my experience most of the time. Maybe I got lucky, haha.

7 days agosteve_adams_86

In my country, I think the vocational/trade degree in car mechanics (which might lie between HS and uni level) includes basic physics, mechanics, and maybe some chemistry.

Ditto for the rest of technical voc degrees.

If you think you can do IT without at least a trade-degree-level understanding of how the low-level components interact (and I'm not talking about CS-level topics like concurrency with CSP, O-notation, or linear and discrete algebra, but basic stuff such as networking protocols, basic SQL database normalization, system administration, configuration, how the OS boots, and how processes work: idle, active, waiting...), you will be fired faster than anyone around.

6 days agoanthk

IaaS or PaaS?

Who owns and depreciates the logs, backups, GPUs, and the database(s)?

K8s docs > Scheduling GPUs: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus... :

> Once you have installed the plugin, your cluster exposes a custom schedulable resource such as amd.com/gpu or nvidia.com/gpu.

> You can consume these GPUs from your containers by requesting the custom GPU resource, the same way you request cpu or memory
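Concretely, once a device plugin is installed, consuming a GPU looks like this; the image and command are just examples, the `nvidia.com/gpu` limit is the documented part:

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-smoke-test
    spec:
      restartPolicy: OnFailure
      containers:
      - name: gpu-smoke-test
        image: nvidia/cuda:12.4.1-base-ubuntu22.04   # example image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1   # only schedulable on nodes exposing this resource
    EOF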

awesome-local-ai: Platforms / full solutions https://github.com/janhq/awesome-local-ai?platforms--full-so...

But what about TPUs (Tensor Processing Units) and QPUs (Quantum Processing Units)?

Quantum backends: https://github.com/tequilahub/tequila#quantum-backends

Kubernetes Device Plugin examples: https://kubernetes.io/docs/concepts/extend-kubernetes/comput...

Kubernetes Generic Device Plugin: https://github.com/squat/generic-device-plugin#kubernetes-ge...

K8s GPU Operator: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator...

Re: sunlight server and moonlight for 120 FPS 4K HDR access to GPU output over the Internet: https://github.com/kasmtech/KasmVNC/issues/305#issuecomment-... :

> Still hoping for SR-IOV in retail GPUs.

> Not sure about vCPU functionality in GPUs

Process isolation on vCPUs with or without SR-IOV is probably not as advanced as secure enclave approaches.

Intel SGX is a secure enclave capability, which is cancelled on everything but Xeon. FWIU there is no SGX for timeshared GPUs.

What executable loader reverifies the loaded executable in RAM after init time?

What LLM loader reverifies the in-RAM model? Can Merkle hashes reduce that cost of NN state verification?

Can it be proven that a [chat AI] model hosted by someone else is what is claimed; that it's truly a response from "model abc v2025.02"?

PaaS or IaaS

6 days agowesturner

Having worked with many of the latter and having had the displeasure of educating them on *nix systems fundamentals: ugh, oof, I hate this timeline, yet I also feel a sense of job security.

We used to joke about this a lot when Java devs would have memory issues and not know how to adjust the heap size in init scripts. So many “CS majors” who are completely oblivious to anything happening outside of the JVM, and plenty happening within it.

6 days agoseanp2k2

Eh, the way I see it the entire practice of computer science and software engineering is built on abstraction -- which can be described as the ability to not have to understand lower levels -- to only have to understand the API and not the implementation of the lowest levels you are concerned with, and to pay even less attention to the levels below that.

I want to understand every possible detail about my framework and language and libraries. Like I think I understand more than many do, and I want to understand more, and find it fulfilling to learn more. I don't, it's true, care to understand the implementation details of, say, the OS. I want to know the affordances it offers me and the APIs that matter to me, I don't care about how it's implemented. I don't care to understand more about DNS than I need. I definitely don't care to spend my time futzing with kubernetes -- I see it as a tool, and if I can use a different tool (say heroku or fly.io) that lets me not have to learn as much -- so I have more time to learn every possible detail of my language and framework, so I can do what I really came to do, develop solutions as efficiently and maintainably as possible.

You are apparently interested in lower levels of abstraction than I am. Which is fine! Perhaps you do ops/systems/sre and don't deal with the higher levels of abstraction as much as I do -- that is definitely lucrative these days, there are plenty of positions like that. Perhaps you deal with more levels of abstraction but don't go as deep as me -- or, and I totally know it's possible, you just have more brain space to go as deep or deeper on more levels of abstraction as me. But even you probably don't get into the implementation details of electrical engineering and CPU design? Or if you do, and also go deep on frameworks and languages, I think you belong to a very very small category!

But I also know developers who, to me, don't want to go too deep on any of the levels of abstraction. I admit I look down on them, as I think you do too; they seem like copy-paste coders who will never be as good at developing efficient, maintainable solutions.

I started this post saying I think that's a different axis than what layers of abstraction one specializes in or how far down one wants to know the details. But as I get here, while I still think that's likely, I'm willing to consider that these developers I have not been respecting -- are just going really deep in even higher levels of abstraction than me? Some of them maybe, but honestly I don't think most of them, but I could be wrong!

6 days agojrochkind1

> They don't want to have to understand DNS, linux, or anything else beyond whatever framework they are using.

This is baffling. What's the value proposition here? At some point the customer will just ask an AI agent to create an app for them, and it will take care of coding and deployment for them...

7 days agohypothesis

Some people became software developers because they like learning and knowing what they're doing, and why and how it works.

Some people became software developers because they wanted to make easy money back when the industry was still advertising bootcamps (in order to drive down the cost of developers).

Some people simply drifted into this profession by inertia.

And everything in-between.

From my experience there are a lot of developers who don't take pride in their work, and just do it because it pays the bills. I wouldn't want to be them but I get it. The thing is that by delegating all their knowledge to the tools they use, they are making themselves easy to replace, when the time comes. And if they have to fix something on their own, they can't. Because they don't understand why and how it works, and how and why it became what it is instead of something else.

So they call me and ask me how that thing works...

7 days agounification_fan

This is my experience as well. I answer many such calls from devs as part of my work.

I can usually tell at the end of a call which group they belong to. I've been wrong a few times too.

As long as they don't waste my time I'm fine with everyone, some people just have other priorities in life.

One thing I'd say is in my experience there are many competent and capable people in every group, but non-competent ones are extremely rare in the first group.

7 days agojval43

The value proposition is knowing how to fix something when it eventually breaks - which you can't do if you don't fundamentally understand it.

7 days agometaltyphoon

GPT usually writes code that eventually (often immediately) breaks. When given the failure mode, it usually fixes the issue (and often creates a new one, GOTO 10)

7 days agooarsinsync

My heart stopped for a moment when reading the title. I'm glad they haven't decided to axe GPUs, because fly GPU machines are FANTASTIC!

Extremely fast to start on-demand, reliable, and although a little bit pricey, not unreasonably so considering the alternatives.

And the DX is amazing! It's just like any other fly machine, no new set of commands to learn. Deploy, logs, metrics, everything just works out of the box.

Regarding the price: we've tried a well known cheaper alternative and every once in a while on restart inference performance was reduced by 90%. We never figured out why, but we never had any such problems on fly.

If I'm using a cheaper "Marketplace" to run our AI workloads, I'm also not really clear on who has access to our customer's data. No such issues with fly GPUs.

All that to say, fly GPUs are a game changer for us. I could wish only for lower prices and more regions, otherwise the product is already perfect.

7 days agoryuuseijin

I used the fly.io GPUs as development machines. For that, I generally launch a machine when I need it and scale it to 0 when I am finished. And this is what's really fantastic about fly.io - setting this up takes an hour... and the Dockerfile created in the process can also be used on any other machine. Here's a project where I used this setup: https://github.com/li-il-li/rl-enzyme-engineering

This is in stark contrast to all other options I tried (AWS, GCP, LambdaLabs). The fly.io config really felt like something worth being in every project of mine and I had a few occasions where I was able to tell people to sign up at fly.io and just run it right there (Btw. signing up for GPUs always included writing an email to them, which I think was a bit momentum-killing for some people).

In my experience, the only real minor flaw was the already-mentioned embedding of the whole CUDA stack into your container, which creates containers that easily approach 8GB. That then runs into some fly.io limits and also slows down builds.

7 days agobottega_boy

I just looked at their pricing and they don't list any GPUs at all that I could find.

7 days agoraylad

I have a timeline that I am still trying to work through but it goes like this :

2012 - Moore's law basically ends - NAND gates don't get smaller, just more cleverly wrapped. Single threaded execution more or less stops at 2 GHz and has remained there.

2012-2022 - no one notices single threaded is stalled because everything moves to VMs in the cloud - the excess parallel compute from each generation is just shared out in data centres

2022 - data centres realise there is no point buying the next generation of super chips with even more cores, because you make massive capital investments but cannot shovel in 10x or 100x the processes, because Amdahl's law means standard computing is not 100% parallel

2022 - but look, LLMs are 100% parallel, hence we can invest capital once again

2024 - this is the bit that makes my noodle - wafer scale silicon. 900,000 cores with GBs of SRAM - these monsters run Llama models 10x faster than A100s

We broke Moore's law and hardware just kept giving more parallel cores because that's all they can do.

And now software needs to find how to use that power - because dammit, someone can run their code 1 million times faster than a competitor - god knows what that means but it’s got to mean something - but AI surely cannot be the only way to use 1M cores?

7 days agolifeisstillgood

I’m surprised nobody has yet (as of this writing) pointed out that Moore’s Law never claimed anything about single threaded execution or clock rates. Moore’s Law is that the number of transistors doubles every two years, and that trend has continued since 2012.

It looks like maybe the slope changed slightly starting around 2006, but it’s funny because this comment ends complaining that Moore’s Law is too good after claiming it’s dead. Yes, software needs to deal with the transistor count. Yes, parallel architectures fit Moore’s law. The need to go to more parallel and more parallel because of Moore’s Law was predicted, even before 2006. It was a talking point in my undergrad classes in the 90s.

https://upload.wikimedia.org/wikipedia/commons/0/00/Moore%27...

7 days agodahart
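
To put a number on the doubling claim above, a back-of-the-envelope sketch; the 2024 endpoint is just an illustrative "today", not a measured transistor count.

```python
# What "transistor count doubles every two years" predicts over a span.
def predicted_factor(start_year: int, end_year: int, doubling_years: float = 2.0) -> float:
    return 2 ** ((end_year - start_year) / doubling_years)

print(predicted_factor(2012, 2024))  # 2**6 = 64x more transistors per chip
```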

> Single threaded execution more or less stops at 2 GHz and has remained there. 2012-2022 - no one notices single threaded is stalled because everything moves to VMs in the cloud

Single-thread execution - I assume you mean IPC, or maybe more accurately PPC (performance per clock) - has improved steadily if you account for ARM designs and not just x86. That is why the M1 was so surprising to everyone: most (all) thought Geekbench scores on phones didn't translate to desktops, and somehow the M1 went from nonsense to breakthrough.

Clock speed also went from 2GHz to 5GHz, and we are pushing 4GHz on mobile phones already.

And Moore's law, in terms of transistor density, ended when Intel couldn't deliver 10nm on time, so 2016/2017 give or take. But that doesn't mean transistor density is not improving.

7 days agoksec

The most surprising thing about M1 was the energy efficiency and price/performance point they hit. It had been known for a couple of years that the phone SOCs were getting really good, just that being passively cooled inside a phone case only allows them 1-2 seconds of max bursts.

7 days agojavier2

Apple's chips are dramatically faster than any other kind. If you are single thread perf constrained and have the money, running workloads on Apple silicon can actually make sense.

7 days agomike_hearn

> Apple's chips are dramatically faster than any other kind.

Any idea why? Is it because of some patent they hold?

6 days agomwcampbell

Nothing to do with that; they just had a head start in the right direction, along with enough money to fund many iterations before it landed on the desktop or became accepted by 95% of people.

Qualcomm's current Snapdragon Elite Oryon 2 on mobile, and the ARM Cortex X925 (previously known as the X5), are already close to Apple A17-level performance. So this is no longer something unique to Apple.

I just wish both designs were more widely available. And on x86, Intel and AMD still haven't quite caught up. At least not in the next 2 years.

6 days agoksec

Combination of using the latest TSMC processes, a very wide design, very deep speculation pipelines, a weaker memory model than Intel, and lots of clever tricks of the usual sort for fast CPUs, plus very high memory bandwidth.

I don't think it's anything to do with patents although I'm sure they have plenty.

6 days agomike_hearn

The single threading graph is already very flat now ...

https://www.man.com/technology/single-core-stagnation-and-th...

7 days agoamelius

The graph shows exactly what I said: Intel falling behind on 10nm (2017-2020), and it discounts the IPC improvements made in ARM.

But we may finally be hitting a plateau unless Apple can demonstrate improvement in the M5 and M6. They have pretty much squeezed out everything with the 8-wide design. Not sure if they could go any wider without some significant trade-off.

7 days agoksec

The unsung hero of early computing was Dennard scaling. Taking CPUs from 10MHz to 2GHz, alongside massive per-clock efficiency improvements, must have been a crazy time.

From a 50MHz 486 in 1990 to a 1.4GHz P3 in 2000 is a factor of 28 improvement in speed solely due to clock speed! Add on all the other multiplicative improvements from IPC...

7 days agoPanzer04

The greatest increase in clock frequency has been in the decade 1993-2003, when the clock frequency has increased 50 times (from a 66 MHz Pentium to a 3.2 GHz Pentium 4).

Since then, in more than 20 years, the clock frequency has increased only 2 times, while in the previous decade (1983-1993) it had increased only about 5 times, where a doubling of the clock frequency (33 to 66 MHz) had occurred between 1989 and 1993 (for cheap CPUs with MOS logic, because expensive CPUs using ECL had reached 80 MHz already during the seventies).

Also, Pentium III has reached 1.4 GHz only in early 2002, not in 2000, while 80486 has reached 50 GHz only in 1991, not in 1990.

7 days agoadrian_b

The P4 was nothing to celebrate. Clock for clock it was slower than its predecessor. The subsequent generation was based on the Pentium M, which was more energy efficient.

7 days agosuperjan

> while 80486 has reached 50 GHz only in 1991, not in 1990.

Your typo got me wondering — what would the performance of an actual 50GHz 486 look like compared to modern single-core performance?

The lack of speculative execution combined with atrocious memory latencies and next to no cache should be enough to annihilate most if not all of the advantage from the faster clock — CPU is just going to be idling waiting for data. Then there’s the amount of work you can get done per cycle, and SIMD, and…

7 days agopdpi

>2012 - Moore's law basically ends - NAND gates don't get smaller, just more cleverly wrapped. Single threaded execution more or less stops at 2 GHz and has remained there.

In 2012, 22nm architectures were new (https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchite...). Now we have 3nm architectures (https://en.wikipedia.org/wiki/3_nm_process). In what sense have nand gates "not gotten smaller"?

My computer was initially built in 2014 and the CPU runs up to 3 GHz (and I don't think it was particularly new on the market). CPUs made today overclock to 5+ GHz. In what sense did "single threaded execution more or less stop at 2 GHz and remain there"?

We might not be seeing exponential increases in e.g. transistor count with the same doubling period as before, but there has demonstrably been considerable improvement to traditional CPUs in the last 12+ years.

7 days agozahlman

So the size of NAND gates is still roughly 28nm. Think of measuring a car from above: its footprint is fixed; stand the car on its nose and, measured from above, it's "smaller" - that's FinFET; then fold down the roof and the wheels and that's roughly GAA. The car size stays the same - the parking density increases. It's more marketing than reality, but density is up …

As for clock speeds, yes and no: basically, thermal limits stop most CPUs from running at full speed all the time. The problem was obvious back in the day - I would build PCs and carefully apply thermal paste to the plastic casing of a chip, then rely on the heat of the transistors making it through the plastic so the waste heat could be carried away. Yes, they are working on thermal something-something directly on the layers of silicon.

7 days agolifeisstillgood

> In what sense have nand gates "not gotten smaller"?

Because 22nm was not actually 22nm, and 3nm is not actually 3nm.

7 days agomwpmaybe

I am still feeling my way through these ideas, but think perhaps of an alternative universe where, instead of getting cleverer with instruction pipelining (guessing what the CPU will ask for next and using the silicon to work that out), hardware had just added more parallel cores - so it did not need to guess the next instructions; it just went faster because the instructions ran in parallel, because we magically solved software and developers.

You could have a laptop with 1000 cores on it - simple 32/64 bit CPUs that just ran full pelt.

The lack of parallelism drove decisions to take silicon and make it do stuff other than running everything faster - to focus instead on getting one instruction stream through one core faster.

AI has arrived and found a world of silicon where, by coincidence, it can use every transistor going full pelt - and the CPUs we think of in our laptops are using only a fraction of their transistors for full pedal-to-the-metal processing, and the rest is … legacy??

7 days agolifeisstillgood

> We broke Moore's law and hardware just kept giving more parallel cores because that's all they can do.

You get more cores because transistor density didn't stop increasing, software devs/compiler engineers just can't think of anything better to do with the extra real estate!

> Single threaded execution more or less stops at 2 GHz and has remained there.

There are other semiconductor materials that do not have the heat limits of silicon-based FETs and have become shockingly cheap and small (for example, a 200W power supply the size of a wallet that doesn't catch on fire). We're using these materials for power electronics and RF/optics today, but they're nowhere close to FinFETs from a few years ago or what they're doing today. That's because all the fabrication technology and practices have yet to be churned out for these new materials (and it's not just UV lasers), but they're getting better, and there will one day be an MCU made from wide bandgap materials that cracks 10GHz in a consumer device.

Total aside, hardware junkies love talking cores and clock speeds, but the real bottlenecks for HPC are memory and i/o bandwidth/latency. That's why the future is optical, but the technology for even designing and experimenting with the hardware is in its infancy.

7 days agoduped

>someone can run their code 1 million times faster than a competitor

I’d bet most code we use every day spends most time just waiting for things like disk and network. Not to mention it’s probably inherently sequential.

7 days agobrap

Hell most code we use every day spends a huge portion of its time waiting for memory.

7 days agocmrdporcupine

Most code I use spends time waiting while I solve the captcha.

7 days agoEVa5I7bHFq9mnYK

Plenty of non-IT applications use lots of cores, e.g. physics simulations, constraint solving, network simulation used to plan roads or electrical distribution, etc.

7 days agoYoric

Yes - but the amount of code that plays nicely with Amdahl's law is tiny compared to the amount of code churned out each day that can never run parallel over 1M cores - no matter how clever a compiler gets.

I cannot work out if we lack enough parallel problems in the world or just lack a programming language to describe them

7 days agolifeisstillgood
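
A minimal sketch of the Amdahl's law ceiling being discussed; the serial fractions below are illustrative, not measurements of any particular codebase.

```python
# Amdahl's law: speedup = 1 / (s + (1 - s) / N) for serial fraction s and N cores.
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for serial in (0.5, 0.1, 0.01, 0.001):
    print(f"serial fraction {serial:>6}: "
          f"1M cores -> {amdahl_speedup(serial, 1_000_000):,.0f}x speedup")
# Even at 0.1% serial work, a million cores tops out near ~1000x, not 1,000,000x.
```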

Computers are still Von-Neumann machines, and other architectures lost out due to the great returns on investment for that architecture. However, in the AI world, this might not be the case. For instance, neuromorphic computing is one example, and there are others. Or back to analog again! Superposition is instant—no slow adders with carry bits to propagate! Who knows. Fun times!

7 days agosroussey

Most computers use modified Harvard architecture, funnily enough. There's a shared memory space like von Neumann, but separated caches for instructions and data.

It's the best of both worlds, because from the CPU's perspective it gets to have separate lanes for instructions and data, but from the programmer's perspective it's one memory.

7 days agojjmarr

Not always. Modern computers are like several computers networked into one, if you think about it. Since DMA, that simple picture hasn't really held. Today we have IOMMUs. The 9front/plan9 guys are trying to write a kernel (Nix) that wants to exploit every concurrent core of your CPU at crazy scaling levels.

6 days agoanthk

I don't get this. One of the worst computational problems holding back robotics is non linear model predictive control. You have 1-2 ms of time to build and solve a QP or a series of QP problems in the non linear case over a horizon of N time steps. 100% accurate MPC is inherently sequential. You must calculate the time step t_1 before t_2, because your joint positions are influenced by the control signal u_1. This means that the problem is intractable, since it needs backtracking via branch and bound.

However, since the problem is intractable, you don't actually have to solve it. What you can do instead is perform random shooting in 32 different directions, linearize and then solve 32 quadratic problems and find the minimum over those 32. That is phase one. However, a cold start from phase one sucks, so there is a second phase, where you take an initial guess and refine it. Again you do the same strategy, but this time you can use the initial guess to carefully choose search directions and solve the next batch of 32 QPs and take the minimum over them.

Now here is the thing. Even this in itself won't save you. At the extreme top end you are going to have 20k decision variables for which you're going to solve an extremely sparse linear system of equations.

SQP is iterative QP, and QP is an iterative interior point or active set algorithm, so we are two iterative algorithms deep. A linear system of equations can be solved iteratively, so let's make it a third. It turns out the biggest bottleneck in this nesting of sequential algorithms isn't necessarily the sequential nature. It's multiplying a giant sparse 20000x20000 matrix with a 20000-wide vector and doing this over and over and over again. That is what is fucking impossible to do in the 2 millisecond time budget you've been given, not the sequential parts.

So what does Boston Dynamics do for their impressive application of MPC? They don't even try. They just linearize the MPC and let the non sequential QP solver run as many iterations as it can until time is up, meaning that they don't even solve to optimality!

Now you might wonder why someone would want non linear MPC in the first place if it is so impractical. The reason is that MPC provides a general compute-scaling solution to many problems that would otherwise require a lot of human ingenuity. It is the bitter lesson. Back when QP solvers were too slow, people used the pseudo-inverse on unconstrained QP problems. It's time for faster parallel hardware to make QP obsolete and let SQP take over.

Yes, parallel hardware is the key to a sequential problem.

7 days agoimtringued
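
For a feel for the numbers in the comment above, a minimal sketch timing repeated sparse 20,000 x 20,000 matrix-vector products with SciPy; the density and iteration count are guesses, not taken from any real KKT system.

```python
# Time repeated sparse matvecs against a 2 ms control budget.
import time
import numpy as np
import scipy.sparse as sp

n = 20_000
A = sp.random(n, n, density=1e-3, format="csr", random_state=0)  # ~400k nonzeros (assumed density)
x = np.random.default_rng(0).standard_normal(n)

iters = 200  # rough stand-in for the inner iterations of a nested solver
t0 = time.perf_counter()
for _ in range(iters):
    y = A @ x
elapsed_ms = (time.perf_counter() - t0) * 1e3
print(f"{iters} matvecs: {elapsed_ms:.1f} ms total, "
      f"{elapsed_ms / iters * 1e3:.0f} us each (budget: 2 ms per control step)")
```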

Ok - wow please point at more reading across this :-)

6 days agolifeisstillgood

You can easily use a million cores if you have a million independent tasks. Which you often have, if you are dealing with data rather than building a big centralized service.

7 days agojltsiren

> 2012 - Moore's law basically ends - NAND gates don't get smaller, just more cleverly wrapped. Single threaded execution more or less stops at 2 GHz and has remained there.

A 2GHz core from a 2012 is extremely slow compared to a 2GHz core of a modern CPU. The difference could be an order of magnitude.

There is more to scaling CPUs than the clock speed. Modern CPUs process many more instructions per clock on average.

7 days agoAurornis

Edit: it's worth expanding on the data centre cost issue - if it's fair to say we have stalled on "free" speed-ups for standard software (i.e. C/Linux) - that is, clock speeds more or less stopped, and we get more 64-bit cores but each core is more or less no faster (hand wavy) - then the number of clients that can be packed into a data centre stays the same - you are renting out CPUs in a Docker VM - that's basically one per core(#). And while wafer scale gives you 900,000 cores in a 1U server, normal CPUs give you what, 64? 128?

Suddenly your cost for building a new data centre is something like twice the cost of the previous one (cooling gets more expensive etc) and yet you only sell the same amount of space. It's not an attractive business in the first place.

This was the push for lambda architecture etc - depending on usage you could have hundreds of people buying the same core. I would have put a lot of cash into making something that spins up docker instances so fast it’s like lambda - and guess what fly.io does?

I think fly.io’s obsession with engineering led them down a path of positive capital usage while AWS focused on rolling out new products on a tougher capital process.

Anyway - AI is the only thing that dense multi core data centres can use that packs in many many users compared to docker per core.

Unless we all learn how to think and code in parallel, we are leaving a huge amount of hardware gains on the table. And those gains are going to 100x in the next ten years and my bet is my janky-ass software will still not be able to use it - there will be a bifurcation of specialist software engineers who work in domains and with tooling that is embarrassingly parallel, and the rest of us will be on fly.io :-)

(#) OK, so maybe 3 or 4 Docker containers per core, with the hypervisor doling out time slots, but much more than that and performance is a dog, and so the number of "virtual CPUs" you can sell is limited and creeps up only slowly despite hardware leaping ahead … which is the point I am making

7 days agolifeisstillgood

Web applications and APIs for mobile apps are embarrassingly parallel, but many modern web languages and frameworks went back to the stone age on parallelism.

Ancient Java servlets back in the early 2000s were more suitable for performance on current gen hardware than modern NodeJS, Python etc...

7 days agot0mas88

There’s always Elixir!

7 days agote_chris

It’s worth thinking about how to architect some of this in the new hardware.

To go extreme, wafer scale silicon - 900,000 8-bit cores, 120GB SRAM. Hand-wave on the 8-bit for now and just think about how to handle Facebook or e-commerce. A static site is simple if it fits inside 120GB, but Facebook is low write, high read (I assume) - yet making the fetching from / writing to memory parallel means rewriting a lot of things .. and e-commerce - suddenly ACID does not parallelise easily.

I mean this all seems doable with fairly fundamental changes to architecture and memory concepts and … RDBMS is challenging.

But I think we are way past the idea that a web framework author can just fix it - this is deeper - a new OS a new compiler and so on - but I could be wrong.

7 days agolifeisstillgood

Parallelization of loading spinners?

7 days agorighthand

We had close to 4GHz in 2005 with some P4s.

7 days agoReptileMan

I shelled out for a 4090 when they came out thinking it would be the key factor for running local llms. It turns out that anything worth running takes way more than 24GB VRAM. I would have been better off with 2+ 3090s and a custom power supply. It’s a pity because I thought it would be a great solution for coding and a home assistant, but performance and quality isn’t there yet for small models (afaik). Perhaps DIGITS will scratch the itch for local LLM developers, but performant models really want big metal for now, not something I can afford to own or rent at my scale.

7 days agoreilly3000

There was a post on r/localLlama the other day about a presentation by the company building Digits hardware for Nvidia. The gist was that Digits is going to be aimed at academic AI research folks and as such don't expect them to be available in large numbers (at least not for this first version). It was disappointing. Now I'm awaiting the AMD Strix Halo based systems.

7 days agoUncleOxidant

Laptops with the same chip however...

7 days agowmf

I started buying Macs with more memory, no regrets. An M4 Max with 64GB (in a laptop, no less!) runs most small models comfortably (but get 96GB or more if you really intend to use 70B models regularly). And when I'm not running LLMs, the memory is useful for other stuff.

7 days agojwr

> And when I'm not running LLMs, the memory is useful for other stuff

Let's be honest, the other stuff is just Chrome: Tell me 96gb is enough?

7 days agoignoramous

Firefox and VS Code for me: I have 64GB and I can't run Llama 70B locally without closing a ton of windows and tabs first!

7 days agosimonw

Have you tried auto tab discard?

7 days agoLtdJorge

Not the person you replied to but auto discard only helps so much. Even with it on I often use 80-90% memory (albeit on Windows, not sure if macOS is any different in this regard).

7 days agouser_7832

Firefox here too, no such problems. I do however run Google services in a separate browser (Brave or Chromium) because most of them hog the Firefox browser. For example running earth.google.com on FF is... a very special experience. :-/

7 days agobornfreddy

Gosh. Good thing I haven't bought a GPU in almost a decade. With a little luck I'll catch this wave on the back end. I haven't had to learn web or mobile development thoroughly either

7 days ago01HNNWZ0MV43FF

The best bang for the buck on VRAM is a maxed out Mac Studio.

7 days agozozbot234

No one who claims this ever posts a benchmark

Prompt eval is slow, inference for large models at high context is slow, training is limited and slow.

It's better than not having anything, but we got rid of our M1 Max 192GBs after about a year.

7 days agowashadjeffmad

> No one who claims this ever posts a benchmark

I have a Mac with a lot of RAM for running models. I haven’t done it in a month because I can tell that it’s not only slow, but the output also doesn’t come close to what I can get from the latest from Claude or ChatGPT.

It’s actually amazing that I can run LLMs locally and get the quality of output that they give me, but it’s just a different level of experience than the state of the art.

I’m becoming convinced that the people who sing the praises of running locally are just operating differently. For them, slow and lower quality output aren’t a problem because they’re having fun doing it themselves. When I want to get work done, the hosted frontier models are barely fast enough and have hit or miss quality for me, so stepping down to the locally hosted options is even more frustrating.

7 days agoAurornis

I'm hoping to see some smaller MoE models released this year, trained with more recent recipes (higher quality data, much longer pretraining). Mixtral 8x7B was impressive when it came out, but the exact same architecture could be a lot more powerful today, and would run quite fast on Apple Silicon.

7 days agoanon373839

What will you pay me for the benchmarks, for the professional knowledge and analysis?

I can post benchmarks for these Mac machines and clusters of Studios and M4 Mac Minis (see my other HN posts last month, the largest Mac cluster I can benchmark for you has 4 TB of ultrafast unified memory and around 9216 M4 cores).

7 days agomorphle

I mean, I can't pay you anything, but that sounds interesting as hell. Are there any interesting use cases to massive amounts of memory outside of training?

7 days agodigdugdirk

> No one who claims this ever posts a benchmark

I meant to explain why no one ever posts a benchmark: it's expensive as hell to do a professional benchmark against accepted standards. It's several days' work, very expensive rental of several pieces of $10K hardware, etc. You don't often hand that over for free. With my benchmark results some companies can save millions if they take my advice.

>any interesting use cases to massive amounts of memory outside of training?

Dozens, hundreds. Almost anything you use databases, CPUs, GPUs or TPUs for. 90% of computing is done on the wrong hardware, not just datacenter hardware.

The interesting use case we discussed here on HN last week was running full DeepSeek-R1 LLMs on 778 GB fast-DRAM computers locally. I benchmarked getting hundreds of tokens per second on a cluster of M4 Mac minis or a cluster of M2 Mac Studio Ultras, where others reported 0.015 or 6 tokens per second on single machines.

I just heard of a Brazilian man who built a 256 Mac Mini cluster at double the cost that I would. He leaves $600K of value on the table because he won't reverse engineer the instruction set, rewrite his software, or even call Apple to negotiate a lower price.

HN votes me down for commenting that I, a supercomputer builder for 43 years, can build better, cheaper, faster, lower-power supercomputers from Mac Minis and FPGAs than from any Nvidia, AMD or Intel state-of-the-art hardware; it even beats the fastest supercomputer of the moment or the Cerebras wafer engine V3 (on energy, coding cost and performance per watt per dollar).

I design and build wafer scale 2 million core reconfigurable supercomputers for $30K a piece that cost $150-$300 million to mass produce. That's why I know how to benchmark M2 Ultra and M4 Macs, as they are the second best chip a.t.m. that we need to compete against.

As a consulting job I do benchmarks or build your on-prem hardware or datacenter. This job consists mainly of teaching the customer's programming staff how to program massively parallel software, or convincing the CEO not to rent cloud hardware but to buy on-prem hardware. OP at Fly.io should have hired me, then he wouldn't have needed to write his blog post.

I replied to your comment in hope of someone hiring me when they read this.

7 days agomorphle

Interesting! Fingers crossed someone who's looking for your skillset finds your post.

What is your process to turn Mac minis into a cluster? Is there any special hardware involved? And you can get 100x tok/s vs others on comparable hardware, what do you do differently - hardware, software, something else?

7 days agodigdugdirk

>What is your process to turn Mac minis into a cluster

1) Apply science. Benchmark everything until you understand if it's memory-bound, I/O-bound or compute-bound [1].

2) Rewrite software from scratch in a parallel form with message passing.

3) Reverse engineer native instruction sets of CPU, GPU and ANE or TPU. Same for NVIDIA (don't use CUDA).

No special hardware needed, but adding FPGAs for optimizing the network between machines might help.

So you analyse the software and hardware, then restructure it by reprogramming, rewiring and adaptive compilers. Then you benchmark again and you find what hardware runs the algorithm fastest for less $ using less energy, and weigh that against the extra cost of reprogramming.

[1] https://en.wikipedia.org/wiki/Roofline_model

7 days agomorphle
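
A minimal sketch of the roofline model cited in [1]; the peak-compute and bandwidth numbers are placeholders, not measured specs of any machine mentioned here.

```python
# Roofline: attainable throughput is capped by peak compute or by
# memory bandwidth times arithmetic intensity, whichever is lower.
def roofline_gflops(intensity_flops_per_byte: float,
                    peak_gflops: float,
                    bandwidth_gbps: float) -> float:
    return min(peak_gflops, bandwidth_gbps * intensity_flops_per_byte)

PEAK, BW = 3500.0, 120.0   # placeholder: ~3.5 TFLOP/s compute, 120 GB/s memory
for ai in (0.25, 1.0, 4.0, 16.0, 64.0):
    print(f"AI={ai:>5} FLOP/byte -> {roofline_gflops(ai, PEAK, BW):7.1f} GFLOP/s")
# Low-intensity kernels (e.g. batch-1 inference) sit on the bandwidth slope,
# which is why memory, not core count, often decides the winner.
```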

I discussed all the points you ask about in my HN postings last month, but never in enough detail, so you have to ask me to specify - and that's when people hire me.

As you can see from this comments thread, most people, especially programmers, lack the knowledge we computer scientists, parallel programmers and chip or hardware designers have.

>What is your process

Science. To measure is to know, my prof always said.

To answer your questions in detail, email me.

You first need to be specific. The problem is not how to turn Mac minis into a cluster, with or without custom hardware (I do both), on code X or Y. Or how to optimize software or rewrite it from scratch (which is often cheaper).

First find the problem. In this case the problem is finding the lowest OPEX and CAPEX for the stated compute load, versus changing the compute load. It turns out, in a simulation or a cruder spreadsheet calculation, that the energy cost dominates the hardware choice; it trumps the cost of programming, the cost of off-the-shelf hardware, and the difference if you add custom hardware. M4s are lower power, lower OPEX and lower CAPEX, especially if you rewrite your (Nvidia GPU) software. The problem is the ignorance of the managers and their employee programmers.

You can repurpose the 2 x 10 Gbps USB-C, the 10 Gbps Ethernet and the three 32 Gbps PCIe ports or Thunderbolts, but you have to use better drivers. You need to weigh whether doubling the 960 Gbps, 16 GB unified memory for 2 x $400 is faster than 2 Tbps memory at 1.23 times the cost, whether 3 x 4 x 32 Gbps PCIe 4.0 or 3 x 120 Gbps unidirectional is better for this particular algorithm, and what changes if using the 10 CPU cores, the 10 x 400 GPU cores and the 16 Neural Engine cores (at 38 trillion 16-bit OPS) together works better than just the CUDA cores. Usually the answer is: rewrite the algorithm and use an adaptive compiler, and then a cluster of smaller 'sweet spot' off-the-shelf hardware will outperform the fanciest high-end hardware if the network is balanced. This varies at runtime, so you'll only know if you know how to code. As Alan Kay said and Steve Jobs quoted: if you're serious about software, you should do your own hardware. If you can't, then you can approach the hardware with commodity components if that turns out to be cheaper. I estimate that for $42K of labour I can save you a few hundred $k.

7 days agomorphle

> it's expensive as hell to do a professional benchmark against accepted standards. Its several days work, very expensive rental of several pieces of $10K hardware, etc

When people casually ask for benchmarks in comments, they’re not looking for in-depth comparisons across all of the alternatives.

They just want to see “Running Model X with quantization Y I get Z tokens per second”.

> That's why I know how to benchmark M2 Ultra and M4 Macs, as they are the second best chip a.t.m. that we need to compete against.

Macs are great for being able to fit models into RAM within a budget and run them locally, but I don’t understand how you’re concluding that a Mac is the “second best option” to your $30K machine unless you’re deliberately excluding all of the systems that hobbyist commonly build under $30K which greatly outperform Mac hardware.

7 days agoAurornis

>They just want to see “Running Model X with quantization Y I get Z tokens per second”.

Influencers on YouTube will give them that [1] but it's meaningless. If a benchmark is not part of an in-depth comparison then it doesn't mean anything and can't inform you on what hardware will run this software best.

These shallow benchmarks influencers post on YouTube and Twitter are not just meaningless but also take days to browse through. And they are influencers; they are meant to influence you and are therefore not honest or reliable.

[1] https://www.youtube.com/watch?v=GBR6pHZ68Ho

>but I don’t understand how you’re concluding that a Mac is the “second best option” to your $30K machine

I conclude that if you can't afford to develop custom chips, then in certain cases a cluster of M4 Mac Minis will be the fastest, cheapest option. Cerebras wafers or NVIDIA GPUs have always been too expensive compared to custom chips or Mac Mini clusters, independent of the specific software workload.

I also meant to say that a cluster of $599 Mac Minis will outperform a $6500 M2 Ultra Mac Studio with 192GB, at half the price for higher performance and more DRAM, but only if you utilize the M4 Mac Minis' aggregated 100 Gbps networking.

7 days agomorphle

a million buckeroos

7 days agobeeflet

Absolutely! I have been playing with Ollama on a Macbook Pro with 192 GiB RAM and it is able to run most models, whereas my 3090 runs out of RAM.

7 days agofabiensanglard

Do you mean 128GB? Not aware of any variant of the Macbook Pro with that much RAM.

7 days agosepositus

192GB is available for the M2 Mac Studio.

7 days agosimonw

I was curious "how bad is it?" and it seems $5500-ish https://www.ebay.com/sch/i.html?_nkw=192gb+studio&_sop=15

7 days agomdaniel

$6500 depending on VAT. But 10-12 M4 Mac minis with 100 Gbps networking give you triple the cores and 160 GB with 2.5 times the memory bandwidth, if the sharding of the NN layers is done right.

7 days agofiberhood

$6500!! You may as well buy 5x 3090s for $1000 each for 120GB of VRAM and spend the extra $1500 on the sundries.

Like, I'm sure Nvidia is aware of Apple's "unified memory" as an alternative to their cards and yet...they aren't offering >24GB consumer cards yet, so clearly they don't feel threatened.

Don't get me wrong, I've always disliked Apple as a company, but the M series chips are brilliant, I'm writing this on one right now. But people seem to think that Apple will be able to get the same perf increases yoy when they're really stretching process limits by dumping everything onto the same die like that - where do they go from here?

That said Nvidia is using HBM so it does make me wonder why they aren't also doing memory on package with HBM, I think SK Hynix et al were looking at making this possible.

I'm glad we're headed in the direction of 3d silicon though, always seemed like we may as well scale in z, I imagine they can stack silicon/cooling/silicon/cooling etc. I'm sure they can use lithography to create cooling dies to sandwich between everything else. Then just pass connections/coolant through those.

2 days agofennecfoxy

Hoping that M4 Ultra Mac Pros will bump this again.

6 days agoqingcharles

I haven't tested programming tasks with a local LLM vs. say, Claude 3.5. But it is nice to be able to run 14-32B LLMs locally and get an instant response. I have a single 3090.

7 days agounethical_ban

Same here. I just built a pc with a 3090 for local llm and stable diffusion and have zero regrets.

7 days agoprettyblocks

Nothing stops you from getting a second 4090 as well. ^^

7 days agoSunlitCat

My thoughts, expenses and regrets exactly.

7 days agohelpfulclippy

I respect them for being public about this.

With that said, this seems quite obvious - the type of customer that chooses Fly seems like the last person to be spinning up dedicated GPU servers for extended periods of time. Seems much more likely they'll use something serverless, which requires a ton of DX work to get right (personally I think Modal is killing it here). To compete, they would have needed to bet the company on it. It's way too competitive otherwise.

7 days agoserjester

As someone who deploys a lot of models on rented GPU hardware, their pricing is not realistic for continuous usage.

They're charging hyperscaler rates, and anyone willing to pay that much won't go with Fly.

For serverless usage they're only mildly overpriced compared to, say, Runpod, but I don't think of serverless as anything more than an onramp to renting a dedicated machine, so it's not surprising to hear it's not taking off.

GPU workloads tend to have terrible cold-start performance by their nature, and without a lot of application-specific optimizations it rarely makes financial sense not to take a cheaper continuous option if you have an even mildly consistent workload (and if you don't, then you're not generating that much money for them).

7 days agoBoorishBears

My thing here is just: people self-hosting LLMs think about performance in tokens/sec, and we think about performance in terms of ms/rtt; they're just completely different scales. We don't really have a comparative advantage for developers who are comfortable with multisecond response times. And that's fine!

7 days agotptacek
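
A minimal sketch of that scale mismatch, converting self-hosted throughput into request-latency terms; the tokens/sec figures and the 500 ms time-to-first-token are illustrative assumptions, not measurements.

```python
# Convert tokens/sec into end-to-end response time for one reply.
def response_time_s(tokens: int, tokens_per_sec: float, ttft_ms: float = 500.0) -> float:
    return ttft_ms / 1000.0 + tokens / tokens_per_sec

for tps in (5, 30, 100):
    print(f"{tps:>4} tok/s: 500-token reply ~ {response_time_s(500, tps):.1f} s "
          f"(vs. a ~50-200 ms HTTP round-trip budget)")
```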

That reminds me of when Cloudflare launched their Workers GPU product; it was specifically aimed at running models and the pricing was abstracted and based on model output. Did you look at what they were doing when building GPU machines?

https://blog.cloudflare.com/workers-ai/

7 days agocmdtab

> GPU workloads tend to have terrible cold-start performance by their nature

My Fly machine loads from turned off to first inference complete in about 35 seconds.

If it’s already running, it’s 15 seconds to complete. I think that’s pretty decent.

7 days agoAeolun

As the sibling comment points out, usually cold starts are optimized on the order of milliseconds, so 20 seconds is a while for a user to be sitting around with nothing streamed.

And with the premium for per-second GPUs hovering around 2x that for hourly/monthly rentals, it gets even harder for products with scale to justify.

You'd want to have a lot of time where you're scaled to 0, but that in turn maps to a lot of cold starts.

7 days agoBoorishBears

It's really a shame GPU slices aren't a thing -- a monthly cost of $1k for "a GPU" is just so far outside of what I could justify. I guess it's not terrible if I can batch-schedule a mega-gpu for an hour a day to catch up on tasks, but then I'm basically still looking at nearly $50/month.

I don't know exactly what type of cloud offering would satisfy my needs, but what's funny is that attaching an AMD consumer GPU to a Raspberry Pi is probably the most economical approach for a lot of problems.

Maybe something like a system where I could hotplug a full GPU into a system for a reservation of a few minutes at a time and then unplug it and let it go back into a pool?

FWIW it's that there's a large number of ML-based workflows that I'd like to plug into progscrape.com, but it's been very difficult to find a model that works without breaking the hobby-project bank.

7 days agommastrac

GPU slices are absolutely a thing. Only supported on the enterprise GPUs tho and requires an additional paid Nvidia software license.

7 days agoeverfrustrated

This is what services like Vast.ai are for - super cheap GPUs you just use as long as you need etc etc.

7 days agobeebaween

Do you think that you can use those machines for confidential workflows for enterprise use? I'm currently struggling to balance running inference workloads on expensive AWS instances where I can trust that data remains private vs using more inexpensive platforms.

7 days agomontecarl

Of course you cannot use these machines "for confidential workflows for enterprise use" - at least with AWS you know whose computer you're working with. But also keep in mind that it's really hard to steal your data as long as it stays in memory and you use something like mTLS to actually get it in and out of memory via E2EE. You can figure out the rest of your security model along the way, but anything sensitive (i.e. confidential) would surely fall way outside this model.

7 days agotucnak

I read through the FAQ and the answer is "no", but they say it basically as "nobody really cares what your data is".

I wouldn't put anything confidential through it.

7 days agommastrac

What's your workload and timeline? I'm wondering how much of that workload could be handled in-house.

7 days agogopher_space

Just currently exploring how custom AI workflows (e.g. text to sql, custom report generation using private data) can help given the current SOTA. Looking to develop tooling over the next 3-6 months. I'd like to see what we can come up with before dropping $50-100k on hardware.

7 days agomontecarl

I threw together a toy project to see if it would help me understand the basic concepts and my takeaway was that, if you can shape your input into something a dedicated classification model (e.g. YOLO for document layout analysis) can work with, you can farm each class out to the most appropriate model.

It turns out that I can run most of the appropriate models on my ancient laptop if I don't mind waiting for the complicated ones to finish. If I do mind, I can just send that part to OpenAI or similar. If your workflow can scale horizontally like my OCR pipeline crap, every box in your shop with RAM >= 16GB might be useful.

Apologies if this is all stuff you're familiar with.

3 days agogopher_space

i use them a lot and constantly forget to turn mine off and it just drains my credits. i really need to write a job to turn them off when it's idle for longer than 20minutes

7 days agovolkk

Hmm, that looks interesting -- I might have to explore a bit. The low-end GPU pricing is pretty competitive.

7 days agommastrac

I think time-slicing for GPUs is likely the solution here.

If you could checkpoint a GPU quickly enough it would be possible to run multiple isolated workloads on the same GPUs without any issues.

7 days agoloopholelabs

Nvidia vGPUs are time-sliced; MIG isn't. Neither work in arbitrary hypervised VMs.

7 days agotptacek

Nvidia does not want slicing.

7 days agoandrewstuart

Eh, they have MIG.

7 days agowmf

Yeah, if you're running your own cluster.

7 days agotptacek

Which you are! What ever happened to the MIG implementation work that y’all were working on? Last I heard it was “cursed” and nearly made someone go insane, which is very normal for NVIDIA hardware :)

7 days agohatf0

No, I mean, if you're running your own cluster for yourself.

7 days agotptacek

vultr has fractional GPUs you can get as a VPS. I think I was paying about $55/month to test one out.

7 days agojacobyoder

> The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs.

Fly.io seems to attract similar developers as Cloudflare’s Workers platform. Mostly developers who want a PaaS like solution with good dev UX.

If that’s the case, this conclusion seems obvious in hindsight (hindsight is a bitch). Developers who are used to having infra managed for them so they can build applications don’t want to start building on raw infra. They want the dev velocity promise of a PaaS environment.

Cloudflare made a similar bet with GPUs I think, but instead stayed consistent with the PaaS approach by building Workers AI, which gives you a lot of open LLMs and other models out of the box that you can use on demand. It seems like Fly.io would be in a good position to do something similar with those GPUs.

7 days agojameslk

It also seems they got caught in the middle of the system integrator vs product company dilemma.

To me fly's offering reads like a system integrator's solution. They assemble components produced mainly by 3rd parties into an offered solution. The business model of a system integrator thrives on doing the least innovation/custom work possible to provide the offering. You position yourself to take maximal advantage of investments and innovations driven by your 3rd-party suppliers. You want to be squarely on their happy path.

Instead this article reads like fly, with good intentions, was trying to divert their tech suppliers' offer stream into niche edge cases outside of mainstream support.

This can be a valid strategy for products very late in their maturity lifecycle, where core innovation is stagnant, but for the current state of AI, with extremely rapid innovation waves coursing through the market, that strategy is doomed to fail.

7 days agoPeterStuer

Side note: "we were wrong" - are there any more noble and beautiful words in the English language?

7 days agochr15m

“I was wrong.”

Closely followed by, “I was right.” :)

7 days agosonofhans

"Told you so."

7 days agoxeonmc

That's personal pronouns.

I identify as a conspiracy theorist; my pronouns are: Told/You/So.

7 days agoinkyoto

Specifically, "told you so" as we all sink into the abyss.

7 days agomwpmaybe

It's great when people admit they were wrong, but I can't help finding these headlines clickbaity.

A bit like "stop doing this..." and we think: omg, am I making the same deadly mistake?

7 days agobrunoqc

I love the idea that Kurt needed to better couch a post saying he was wrong about something.

7 days agotptacek

I don't know.

7 days agofragmede

Yep, those are better. Almost always the most honest words.

7 days agochr15m

We wrote all sorts of stuff this week and this is what gets to the front page. :P

7 days agotptacek

> We burned months trying (and ultimately failing) to get Nvidia’s host drivers working to map virtualized GPUs into Intel Cloud Hypervisor... We think there’s probably a market for users doing lightweight ML work getting tiny GPUs. This is what Nvidia MIG does, slicing a big GPU into arbitrarily small virtual GPUs. But for fully-virtualized workloads, it’s not baked; we can’t use it. Near as we can tell, MIG gives you a UUID to talk to the host driver, not a PCI device.

Apparently this is technically possible, if you can find the right person at Nvidia to talk about vGPU licensing and magic incantations. Hopefully someone reading this HN front page story can make the introduction.

7 days agotranspute

We (I) spent a lot of time talking to several different teams at Nvidia about this. We were able to get VFIO vGPUs to the point where guest libraries would recognize them, but the process fell apart in the guest/host licensing dance, and we weren't really OK with the idea that there'd be a phone-home licensing dance every time a Fly Machine started. Unlike GPU enablement at GCP or AWS, the core DX of a Fly Machine is that it stops and starts very quickly; think of it as a midpoint in the design space between Lambda and Fargate. This is what we're talking about when we say it's hard to fit GPUs into our DX.

7 days agotptacek

> phone-home licensing dance every time a Fly Machine started

To userspace Nvidia license server (a) in each host, (b) for entire Fly cloud, or (c) over WAN to Nvidia cloud?

7 days agotranspute

IIRC, (a) and (c), which I think sort of implies (b)?

Really what we'd have wanted to do would have been to give Fly Machines MIG slices. But to the best of my understanding, MIG is paravirtualized; it doesn't give you SR-IOV-style PCI addresses for the slices, but rather a reference Nvidia's userland libraries pass to the kernel driver, which is a dance you can't do across VM boundaries unless your hypervisor does it deliberately.

7 days agotptacek

Hypothetical scenario for Nvidia licensing of fast-start microVMs:

1. Instead of blocking VM start for license validation, convert that step into non-blocking async submission of usage telemetry, allowing every VM to start instantly. For PoC purposes, Nvidia's existing stack could be binary patched to proxy the license request to a script that isn't blocking VM start, pending step 2 negotiation.

2. Reconcile aggregate vGPU usage telemetry from Nvidia Fly-wide license server (Step 1) with aggregate vGPU usage reports from Fly's orchestration/control plane, which already has that data for VM usage accounting. In theory, Fly orchestration has more awareness of vGPU guest workload context than Nvidia's VM-start gatekeeping license agent, so there might be mutual interest in trading instant VM start for async workload analytics.

7 days agotranspute

Sure. We also could have virtualized CUDA ourselves, used MIG on the host-side, and done a proxy PCI passthrough driver thing in Cloud Hypervisor. I think we could have gotten it to work. But it would have been a huge lift. I'm glad we didn't try.

7 days agotptacek

> proxy PCI passthrough driver thing

Do you mean vCS [1], which is already integrated and licensed by KVM/RedHat/Nutanix, Xen/Citrix and VMware?

It's distinct from SR-IOV, PCI passthrough, vGPU-for-VDI, and MIG.

[1] https://blogs.nvidia.com/blog/virtualcomputeserver-expands-v...

7 days agotranspute

I assume anything we did to make MIG work for us would have been custom.

7 days agotptacek

Going back to the blog post:

> Alternatively, we could have used a conventional hypervisor. Nvidia suggested VMware (heh). But they could have gotten things working had we used QEMU. We like QEMU fine, and could have talked ourselves into a security story for it, but the whole point of Fly Machines is that they take milliseconds to start.

Someone could implement virtio-cuda (there are PoCs on github [1] [2]), but it would be a huge maintenance burden. It should really be done by Nvidia, in lockstep with CUDA extensions.

Nvidia vCS makes use of licensed GPGPU emulation code in the VM device model, which is QEMU in the case of KVM and Xen. Cloud Hypervisor doesn't use QEMU, it has its own (Rust?) device model, https://github.com/cloud-hypervisor/cloud-hypervisor/blob/ma...

So the question is, how to reuse Nvidia's proprietary GPGPU emulation code from QEMU, with Cloud Hypervisor? C and Rust are not friends. Can a Firecracker or Cloud Hypervisor VM use QEMU only for GPGPU emulation, alongside the existing device model, without impacting millisecond launch speed? Could an emulated vGPGPU be hotplugged after VM launch?

There has been some design/PoC work for QEMU disaggregation [3][4] of emulation functions into separate processes. It might be possible to apply similar techniques so that Cloud Hypervisor's device model (in Rust) process could run alongside a QEMU GPGPU emulator (in C) process, with some coordination by KVM.

If this approach is feasible, the architecture and code changes should be broadly useful to upstream for long-term support and maintenance, rather than custom to Fly. The custom code would be the GPGPU emulator, which is already maintained by Nvidia and running within QEMU on RedHat, Nutanix, etc.

It would also advance the state of the art in security isolation and access control of emulated devices used by VMs.

[1] https://github.com/coldfunction/qCUDA

[2] https://github.com/juniorprincewang/virtio-cuda-module

[3] https://www.qemu.org/docs/master/devel/multi-process.html

[4] https://wiki.qemu.org/Features/MultiProcessQEMU

7 days agotranspute

> Someone could implement virtio-cuda (there are PoCs on github [1][2])

Wouldn't any company (let alone Fly) doing this run afoul of Nvidia's Enterprise T&Cs?

> how to reuse Nvidia's proprietary GPGPU emulation code from QEMU

If it has been contributed to QEMU, wouldn't it be GPL/LGPL?

> Could an emulated vGPGPU be hotplugged after VM launch

gVisor instead bounces ioctls back and forth between "guest" and host. Sounds like a nice, lightweight (even if limited & sandbox-busting) approach, too. Unsure if it mitigates the need for the "licensing dance" tptacek mentioned above, but I reckon the security posture of such a setup is unacceptable for Fly.

https://gvisor.dev/docs/user_guide/gpu/

> would also advance the state of the art in security isolation and access control of emulated devices used by VMs

I hope I'm not talking to DeepSeek / DeepResearch (:

7 days agoignoramous

> Wouldn't any company (let alone Fly) doing this run afoul of Nvidia's Enterprise T&Cs?

Good question for a lawyer. Even more reason (beyond maintenance cost) that it would be best done by Nvidia. qCUDA paper has a couple dozen references on API remoting research, https://www.cs.nthu.edu.tw/~ychung/conference/2019-CloudCom....

> If it has been contributed to QEMU, wouldn't it be GPL/LGPL?

Not contributed, but integrated with QEMU by commercial licensees. Since the GPGPU emulation code isn't public, presumably it's a binary blob.

> I hope I'm not talking to DeepSeek / DeepResearch (:

Will take that as a compliment :) Not yet tried DS/DR.

7 days agotranspute

NVIDIA support is not special as far as QEMU is concerned—the special parts are all in their proprietary device driver, and they talk to QEMU via the VFIO infrastructure for userspace drivers. They just reimplemented the same thing in Cloud Hypervisor.

Red Hat for one doesn't ship any functionality that isn't available upstream, much less proprietary, and they have large customers using virtual GPU.

7 days agobonzini

> NVIDIA.. just reimplemented the same thing in Cloud Hypervisor.

Was that recent, with MIG support for GPGPU partitioning? Is there a public mailing list thread or patch series for that work?

Nvidia has a 90-page deployment doc on vCS ("virtual compute server") for RedHat KVM, https://images.nvidia.com/content/Solutions/data-center/depl...

7 days agotranspute

Not NVIDIA; fly.io reimplemented the parts that CH didn't already have. I know that CH is developed on GitHub but I don't know whether the changes are public or in-house.

That said, the slowness of QEMU's startup is always exaggerated. Whatever they did with CH they could have done with QEMU.

7 days agobonzini
[deleted]
7 days ago

Is it GPUs, or is it Nvidia GPUs?

7 days agoAeolun

The distinction is irrelevant to our customers.

7 days agotptacek

Not to the people reading your article.

7 days agotuna74

It is quite possible; I'm a bit surprised a minor systems detour would burn months, but alas. This inspires me to figure out how to sell it. Sounds expensive...

7 days agointelVISA

FWIW, I'd consider this good publicity. I have no use for GPUs in the cloud (at least not at the prices they're currently available at, in general, not just on fly). So if fly is moving their effort towards things I actually need (like managed databases), then that's going to increase my confidence in them as a platform quite a bit.

7 days agonicoburns

Sounds like you might have been wrong in spending time on the other stuff. ;)

Who even knows what the customer is ever going to want? Pivot. Pivot. Pivot.

PS: And pouring one out for the engineering hours that went into shipping GPUs. Sometimes it's a fine product, but just doesn't fit.

7 days agoethbr1

You just wrote a blog post about what's needed for a top HN post. You should have known it would do well :)

https://fly.io/blog/a-blog-if-kept/

7 days agofuddle

I'm a little sore the FOIA thing didn't make it up here, but Kurt's post did. ;)

7 days agotptacek

Got a link? I flipped through the last dozen or so blog posts and none of them hit on a search for "FOIA".

7 days agocoldpie

This is all too painful for me to talk about.

7 days agotptacek

Empathy. Such a seemingly good idea. Maybe just a little ahead of its time.

7 days agonarag

Sign of the times.

7 days agoRzor

> The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs.

I don't want GPUs, but that's not quite the reason:

- The SOTA for most use cases for most classes of models with smallish inputs is fast enough and more cost efficient on a CPU.

- With medium inputs, the GPU often wins out, but costs are high enough that a 10x markup isn't worth it, especially since the costs are still often low compared to networking and whatnot. Factor in engineer hours and these higher-priced machines, and the total cost of a CPU solution is often still lower (always more debuggable).

- For large inputs/models, the GPU definitely wins, but now the costs are at a scale that a 10x markup is untenable. It's cheaper to build your own cluster or pay engineers to hack around the deficits of a larger, hosted LLM.

- For xlarge models™ (fuzzily defined to be anything substantially bigger than the current SOTA), GPUs are fundamentally the wrong abstraction. We _can_ keep pushing in the current directions (transformers requiring O(params * seq^2) work, pseudo-transformers requiring O(params * seq) work but with a hidden, always-activated state space buried in that `params` term which has to increase nearly linearly in size to attain the same accuracy with longer sequences, ...), but the cost of doing so is exorbitant. If you look at what's provably required to do those sorts of computations, the "chuck it in a big slice of vRAM and do everything in parallel" strategy gets more expensive compared to theoretical optimality as model size increases.

I've rented a lot of GPUs. I'll probably continue to do so in the future. It's a small fraction of my overall spending though. There aren't many products I can envision which could be built on rented GPUs more efficiently than rented CPUs or in-house GPUs.

7 days agohansvm

They were double wrong. I work at a GPU cloud provider and we can't bring on the machines fast enough. Demand has been overwhelming.

People aren't going to fly.io to rent GPUs. That's the actual reality here.

They thought they could sidecar it to their existing product offering for a decent revenue boost but they didn't win over the prospect's mind.

Fly has compelling product offerings and boring shovels don't belong in their catalog

7 days agokristopolous

Sure. If it sounds like we're saying "cloud GPUs are not a product anybody wants", absolutely not. They're just not a knockout hit for us.

7 days agotptacek

But why not add an option to rent them out without too many abstractions?

7 days agotempaccount420

Because that's not what we're in business to do.

7 days agotptacek

I spent a month last year setting up a serverless endpoint for a custom model with Runpod. It was expensive and unreliable, and cold boot times were long. The product was unusable even as a prototype; to cover the costs, I'd have to raise money first.

In a different product, I was given some Google Cloud credits, which let me put the product in front of customers. This one also needed a GPU, but not as expensive as the previous one. It works reliably and it's fast.

Personally, I had two use cases for GPU providers in the past 3 months.

I think there's definitely demand for reliability and better pricing. Not sure Fly will be able to touch that market though, as it's not known for either (stability or developer-friendly pricing).

P.S. If anyone is working on a serverless provider and wants me to test their product, reach out to me :)

7 days agoakoculu

Give https://modal.com a spin -- email me at deven [at] modal.com and happy to help you get set up

7 days agodnavani

Your fancy scrolling animation splash screen is very laggy on my M1 MBP, in case that is interesting to you.

7 days agocalmoo

thanks, I'll reach out!

7 days agoakoculu

Fwiw Runpod also has a startup program.

Ironically, GCP and AWS GPUs are so overpriced that getting even half the number of credits from Runpod is like a 4x increase in "GPU runway", especially with $0.44/hr A40s.

7 days agoBoorishBears

Yeah, but the quality on Runpod is not reliable enough to productionize on. Do you know of a product that works reliably and is built with Runpod serverless?

4 days agoakoculu

would love for you to test a serverless llm product i'm working on, zack [at] mixlayer.com

7 days agozackangelo

I feel like these guys are missing a pretty important point in their own analysis. I tried setting up an Ollama LLM on a fly.io GPU machine and it was near impossible because of fly.io limitations such as:

1. Their infrastructure doesn't support streaming responses well at all (which is an important part of the LLM experience in my view).

2. The LLM itself is massive and can't be part of the docker image I was building and uploading. Fly doesn't have a nice way around this, so I had to set up a whole heap of code to pull it in on the fly machine's first invocation, which doesn't work well if you start to run multiple machines.

It was messy and ended up with a long support ticket with them that didn't get it working any better, so I gave up.

7 days agojeffybefffy519

What kind of issues did you have with streaming? I also set up ollama on fly.io, and had no issues getting streaming to work.

For the LLM itself, I just used a custom startup script that downloaded the model once ollama was up. It's the same thing I'd do on a local cluster though. I'm not sure how fly could make it better unless they offered direct integration with ollama or some other inference server?
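
For anyone curious, a minimal sketch of that kind of startup script, assuming Ollama's documented REST endpoints (/api/tags to list local models, /api/pull to download one) and a placeholder model tag:

```python
import json
import time
import urllib.request

OLLAMA = "http://localhost:11434"   # default Ollama address (assumption)
MODEL = "llama3.2"                  # placeholder model tag

def wait_for_ollama(timeout=120):
    """Poll the Ollama API until it answers or we give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{OLLAMA}/api/tags", timeout=5) as resp:
                return json.load(resp)
        except OSError:
            time.sleep(2)
    raise RuntimeError("Ollama never came up")

def ensure_model(tags):
    """Pull the model only if it isn't already present on the volume."""
    present = {m["name"].split(":")[0] for m in tags.get("models", [])}
    if MODEL.split(":")[0] in present:
        return
    req = urllib.request.Request(
        f"{OLLAMA}/api/pull",
        data=json.dumps({"name": MODEL}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # pull streams status updates, one JSON object per line
            print(line.decode().strip())

if __name__ == "__main__":
    ensure_model(wait_for_ollama())
```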

7 days agojohntash

I mean, yes? Managing giant model weight files is a big problem with getting people on-demand access to Docker-based micro-VMs. I don't think we missed that point so much as that we acknowledged it, and found some clarity in the idea that we weren't going to break up our existing DX just to fix it. If there were lots and lots and lots of people trying to self-host LLMs running into this problem, it would have been a harder call.

7 days agotptacek

Did you consider other use cases in which people need custom models and inference other than just open source LLMs ?

7 days agoakoculu

Yes. Click through to the L40S post the article links to (the L40S's aren't going anywhere).

There are people doing GPU-enabled inference stuff on Fly.io. That particular slice of the market seems fine?

7 days agotptacek

Most developers avoid GPUs because of pricing. It’s simply too expensive to run 24/7, and there is the overhead of managing/bootstrapping instances and loading large models just to serve intermittent requests. That’s the gist of it, I think.

Unless you have constant load that justifies 24/7 deployments, most devs will just use an API. Or find solutions that don’t need you to pay > $1/hour.

7 days agozacksiri

It feels to me that instead of quitting on it, you should double down.

The reason we don't want GPUs is that renting isn't priced well enough and the technology isn't quite there yet for us to make consistently good use of it.

Removing the offer just exacerbates the current situation. It feels like both curves are about to meet.

In either case you'll have the experience to bring back the offer if you feel it's needed.

7 days agokeyle

I really liked playing around with fly gpus, but it's just too expensive for hobby-use. Same goes for the rest of fly.io honestly. The DX is great and I wish I could move all of my homelab stuff and public websites to it, but it'd be way too expensive :(

7 days agojohntash

This is near and dear to me, because I want people to run stuff like homelabs and side projects.

What part of the cost gets out of hand? Having to have a Machine for every process? Do you remember what napkin math pricing you were working with?

7 days agomrkurt

Hmm, having a machine for every process is part of it but I actually like that kind of isolation. Storage and bandwidth also add up fast.

For example, I could get a DigitalOcean VM with 2 GB RAM, 1 vCPU, 50 GB storage, and 2 TB bandwidth for $12/mo.

For the same specs at fly.io, it'd be ~$22/mo not including any bandwidth. It could be less if it scales to zero/auto stops.

I recently tried experimenting with two different projects at fly. One was an attic server to cache packages for NixOS. Only used by me and my own vms. Even with auto scaling to zero, I think it was still around $15-20/mo.

The other was a fly gpu machine with Ollama on it. The cold start time + downloading a model each time was kind of painful, so I opted for just adding a 100gb volume. I don't actually remember what I was paying for that, but probably another 20/mo? I used it heavily for a few days to play around and then not so much later. I do remember doing the math and thinking it wouldn't be sustainable if I wanted to use it for stuff like home-assistant voice assistant or going through pdfs/etc with paperless.

On their own, neither of these are super expensive. But if I want to run multiple home services, the cost is just going to skyrocket with every new app I run. If I can rent a decent dedicated server for $100-$200/mo, then I at least don't have to worry about the cost increasing on me if a machine never scales to zero due to a healthcheck I forgot about or something like that.

Sorry if it's a bit rambly, happy to answer questions!

7 days agojohntash

No problem at all!

I would be curious how the Attic server would have gone with a Tigris bucket and local caching. Not sure how hard that is to pull off, but Tigris should be substantially cheaper than our NVMes and if you don't really NEED the io performance you're not getting anything for that money. Which is a long winded way of saying "we aren't great at block storage for anything but OLTP workloads and caches".

One thing we've struggled to communicate is how _cheap_ autosuspend/autostop make things. If that Machine is alive for 8 hours per day you're probably below $8/mo for that config. And it's so fast that it's viable for it to start/stop 45 times per day.
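
A rough sketch of that math, taking the ~$22/mo always-on figure mentioned upthread as the baseline (an assumption; real billing is per-second and this ignores storage and bandwidth):

```python
# Effective monthly cost of an autostop/autosuspend Machine, assuming the
# always-on price from upthread for a 1 vCPU / 2 GB config.
always_on_monthly = 22.0
hours_per_month = 720

hourly = always_on_monthly / hours_per_month

for active_hours_per_day in (24, 8, 2):
    monthly = hourly * active_hours_per_day * 30
    print(f"{active_hours_per_day:2d} h/day active -> ~${monthly:5.2f}/mo")

# 8 h/day works out to roughly $7/mo at this rate, which is the "below $8/mo"
# figure above; 2 h/day lands under $2/mo.
```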

It's kind of hard to make the thing stay alive with health checks, unless you're meaning external ones?

We are suboptimal for things that make more sense as a bunch of containers on one host.

6 days agomrkurt

Tbh I haven't looked at Tigris at all. I still have my attic server deployed (just disabled/not in use) so I might give it a shot just to compare pricing. I do remember a decent portion of the cost being storage-related, so it's a good idea.

I'll have to look at autosuspend again too. I remember having autostop configured, but not autosuspend. I could see that helping with start times a lot for some stuff. It's not supported on GPU machines though, right? I thought I read that but don't see it in the docs at a quick glance.

> It's kind of hard to make the thing stay alive with health checks, unless you're meaning external ones?

Sorry, I did mean external healthchecks. Something like zabbix/uptimekuma. For something public facing, I'd want a health check just to make sure it's alive. With any type of serverless/functions, I'd probably want to reduce the healthcheck frequency to avoid the machine constantly running if it is normally low-traffic.

> We are suboptimal for things that make more sense as a bunch of containers on one host.

I think my ideal offering would be something where I could install a fly.io management/control plane on my own hardware for a small monthly fee and use that until it runs out of resources. I imagine it's a pretty niche case for enterprise unless you can get a bunch of customers with on-prem hardware, but homelabbers would probably be happy.

3 days agojohntash

That sounds kinda weird to me. I use fly.io exactly because the pricing works out for hobby use. I can enable my machine for a few hours, run a bunch of inference, and turn it off again. The whole auto start/stop thing makes it seamless too.

7 days agoAeolun

If you don't mind, what size models are you running and do you know around what you were paying?

fly.io was the first provider I tried any gpu offerings at, I probably should give it another shot now that I've used a few others.

7 days agojohntash

Vast has a datacenter H200 for less than what their A100 goes for.

7 days agoBoorishBears

Sure, but vast feels like renting a GPU from a rando.

7 days agoAeolun

Either you're familiar with GPU pricing and being willfully ignorant, or you're not familiar with the pricing in which case let someone who is point out:

- "Datacenter" means it's comparable to Runpod's secure cloud pricing.

- A spot instance of an H200 under someone's living room media console wouldn't go for A100 rates.

$3.50 will also get you an H100 at a laundry list of providers people build real businesses on.

Certainly all better track records than fly.io, especially on a post where they explain it's not working out for them as an offering and then promise they'll keep it shambling along.

7 days agoBoorishBears

You seem like you're familiar with vast. Have you used their autoscaler/serverless offering before? I haven't tried it yet, but it wasn't immediately obvious if I could have something like ollama running and scaled to zero instances when not in use.

6 days agojohntash

Who do you use instead for hobby projects?

7 days agoLyngbakr

Not them, but Runpod and Vast are my gotos. Runpod costs slightly more, but is in turn more reliable, so for "hobby pro" I'd go with them; otherwise Vast.

Salad Cloud is also very interesting if your models can fit on a consumer GPU, but it's a different model than typical GPU providers.

7 days agoBoorishBears

I have a decent sized homelab in my basement that I use for most stuff, and then a couple cheap-ish dedicated servers for public-facing things. Nothing has GPUs though, so I don't have a good solution for llm/ai projects yet.

I used to use cheap vms/vps from lowendtalk deals, but usually they're on over-subscribed hosts and can't do anything heavy.

Actual host recommendations: I like Racknerd and Hivelocity currently. OVH too, but I've read a lot of horror stories so I guess ymmv.

7 days agojohntash

GPUs don't fit the usual "start with a free tier, then upgrade when monetizing" approach most devs have with these kinds of platforms.

For simple inference, it's too expensive for a project that makes no money. Which is most projects.

7 days agohoppp

> The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs.

My current company has some finance products. There was machine learning used for things like fraud and risk before the recent AI excitement.

Our executives are extremely enthused with AI, and seemingly utterly uncaring that we were already using it. From what I can tell, they genuinely just want to see ChatGPT everywhere.

The fraud team announced they have a "new" AI-based solution. I assume they just added a call to OpenAI somewhere.

7 days agonitwit005

The emotion in this article hits home to me. There have been several points in my career where I worked hard for long hours to develop a strong, clever solution to a problem that ultimately was solved by something cheap because the ultimate consumer didn’t care about what we expected them to care about.

It sucks from a business perspective of course, but it also sucks from the perspective of someone who takes pride in their work! I like to call it “artisan’s regret”.

7 days agodevmor

[dead]

6 days agokhana

> But inference latency just doesn’t seem to matter yet, so the market doesn’t care.

This is a very strange statement to make. They are acting like inference today happens with freshly spun up VMs and model access over remote networks (and their local switching could save the day). It’s actually hitting clusters of hot machines with the model of choice already loaded into VRAM.

In real deployments, latency can be small (if implemented well), and speed comes down to having the right GPU config for the model (which fly doesn’t offer).

People have built better shared-resource inference systems for LoRAs (OpenAI, Fireworks, Lorax) - but it’s not VMs. It’s model-aware, with the right hardware for the base model, and optimized caching/swapping of the LoRAs.

I’m not sure the Fly/VM way will ever be the path for ML. Their VM cold start time doesn’t matter if the app startup requires loading 20GB+ of weights.

Companies like Fireworks are working on fast LoRA inference cold starts. Companies like Modal are working on fast serverless VM cold starts with a range of GPU configs (2xH100, A100, etc). These seem more like the two cloud primitives for AI.
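
A quick back-of-the-envelope on why weight loading dominates cold start, with assumed (not measured) bandwidth figures:

```python
# How long does it take just to read 20 GB of weights?
# The bandwidth figures below are assumptions for illustration only.
WEIGHTS_GB = 20

for source, gb_per_s in [
    ("local NVMe (~3 GB/s)", 3.0),
    ("object storage over 10 GbE (~1 GB/s)", 1.0),
    ("remote registry pull (~200 MB/s)", 0.2),
]:
    seconds = WEIGHTS_GB / gb_per_s
    print(f"{source:38s} ~{seconds:5.0f} s")

# Even the best case is seconds, not milliseconds -- so a millisecond-scale
# VM boot is a rounding error next to app startup for a model this size.
```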

7 days agoscosman

I think what they mean about latency not mattering is that latency to the LLM provider doesn’t matter. So why run it yourself when there are APIs you can hit that provide a better overall experience (and seem to be dropping in cost 90% year over year)?

7 days agoec109685

Oh that makes more sense. My bad.

6 days agoscosman

> developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs.

Is there not a market for the kind of data science stuff where GPUs help but you are not using an LLM? Like statistical models on large amounts of data and so on.

Maybe fly.io customer base isn't that sort of user. But I was pushing a previous company to get AWS GPUs because it would save us money vs CPU for the workload.

7 days agoaqueueaqueue

There is a market but it most likely requires a thick software layer to enter the competitive space. Modal Labs, Anyscale, and Outerbounds are examples of companies competing for "data science stuff" and have thick software layers over the VMs.

7 days agothundergolfer

not from fly.io, but my experience is that most data scientists will just prefer to lump it with the tools they know (pandas / R) on CPUs, rather than delving into things like rapids https://rapids.ai -- even if it makes things faster/cheaper.

I might have had a bad sample set so far. But the "doing statistics" bit seems to be the interesting thing for them. the tooling doesn't really factor into solutions/plans that often. and learning something new because "engineer say it shinier" doesn't really seem to motivate them much :/

7 days agodijksterhuis

Do many DS use Google Colab and click the GPU option? That made me think GPUs would be more popular (due to speed).

Also GPUs may be used when productionizing work done by DS but maybe I am in a tiny niche here of (Data Science) intersection (Scale up) minus (Deep learning LLM etc.)

6 days agoaqueueaqueue

There is no market for MIG in the cloud. People talk about it a lot, but in reality, nobody wants a partial GPU (at least not paying for it).

One interesting thing about all this is that 1 GPU / 1 VM doesn't work today with AMD GPUs like the MI300X. You can't do PCIe passthrough, but AMD is working on adding it to ROCm. We plan to be one of the first to offer this.

7 days agolatchkey

I think they’re too early for their core market. It’s taking indie and 0-1 devs a while to dig into ML because it’s a huge, complex space. But some of us are starting to put together interesting little pipelines with real, solid applications.

7 days agoburnto

I feel like one of the mistakes being made again and again in the virtualization space is not realizing there's a difference between a competitor possibly running containers on the same machine with your proprietary data, and Dave over in Customer Relations running a container on your same machine.

If Dave does something malicious, we know where Dave lives, and we can threaten his livelihood. If your competitor does it you have to prove it, and they are protected from snooping at least as much as you are so how are you going to do that? I insist that the mutually assured destruction of coworkers substantially changes the equation.

In a Kubernetes world you should be able to saturate a machine with pods from the same organization even across teams by default, and if you're worried that the NY office is fucking with the SF office in order to win a competition, well then there should be some non-default flags that change that but maybe cost you a bit more due to underprovisioning.

You got a machine where one pod needs all of the GPUs and 8 cores, great. We'll load up some 8 core low memory pods onto there until the machine is full.

6 days agohinkley

I was at another team making a similar bet. Felt off to me at the time, but I assumed I just didn't understand the market.

I also think the call that people want LLMs is slightly off. More correct to say people want a black box that gives answers. LLMs have the advantage that nobody really knows anything about tuning them. So, it is largely a raw power race.

Taking it back to ML, folks would love a high level interface that "just worked." Dealing with the GPUs is not that, though.

7 days agotaeric

They should invest and focus on making their platform more reliable. Without that they will continue to be just a hobby toy to play with and nothing more

7 days agohankchinaski

It’s nice seeing a major outage a day after this.

6 days agocschmatzler

> The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs.

No, I want GPU. BERT models are still useful.

The point is that your service is so expensive that one or two months of renting is enough to build a PC from scratch and place it somewhere in your workplace to run 24/7. For applications that need GPU power, downtime or latency usually doesn't really matter. And you can always add an extra server to be safe.

7 days agonpn

> Instead, we burned months trying (and ultimately failing) to get Nvidia’s host drivers working to map virtualized GPUs into Intel Cloud Hypervisor. At one point, we hex-edited the closed-source drivers to trick them into thinking our hypervisor was QEMU.

What do Nvidia’s lawyers think of this? There are some things that are best not mentioned in a blog post, and this is one of them.

7 days agoKwpolska

I imagine you can do this without any reverse engineering of Nvidia's drivers, but §2.3 of the NVIDIA Driver License Agreement makes it hard (not impossible).

7 days agozweifuss

Yeah I think we'll be fine.

7 days agotptacek

I get the impression that running LLMs is a pain in general. It always seems to need the right incantation of Nvidia drivers, Linux kernel, and a boatload of VRAM, along with making sure the Python ecosystem or whatever you are running for inference has the right set of libraries. And if you want multi-tenant processing across VMs: forget it, or pay $$$ to Nvidia.

The whole cloud computing world was built on hypervisors and CPU virtualization, I wonder if we'll see a similar set of innovations for GPUs at commodity level pricing. Maybe a completely different hardware platform will emerge to replace the GPU for these inference workloads. I remember reading about Google's TPU hardware and was thinking that would be the thing - but I've never seen anyone other than Google talk about it.

7 days agodjhworld

GPUs are all about performance. Nearly all the time, very high level languages have no place there.

The CPU part of high level user applications will probably be written in very high level languages/runtimes with, sometimes, some other parts being bare metal accelerated (GPU or CPU).

Devs wanting hardcore performance should write their stuff directly in GPU assembly (I think you can do that only with AMD) or at best with a SPIR-V assembler.

Not to mention, doing complex stuff around the closed-source Linux Nvidia driver is just asking for trouble. Namely, either you deploy hardware/software Nvidia has validated, or prepare to suffer... it means 'middle-men' deploying Nvidia-validated solutions have near-zero added value.

7 days agosylware

I'm admittedly a complete LLM noob, so my question might not even make sense. Or it might exist and I haven't found it quite yet.

But have they considered pivoting some of said compute to some 'private, secure LLM in a box' solution?

I've lately been toying with the idea of training from extensive docs and code, some open, some not, for both code generation and insights.

I went down the RAG rabbit hole, and frankly, the amount of competing ideas of 'this is how you should do it', from personal blogs to PaaS companies, overwhelmed me. Vector dbs, ollama, models, langchain, and various one off tools linking to git repos.
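
FWIW, the core retrieval flow is small enough to sketch without any of those tools. A toy example follows: hashed "embeddings" and cosine similarity stand in for a real embedding model and vector DB, and the final prompt would go to whatever LLM you choose.

```python
import hashlib
import math
from collections import Counter

def embed(text, dims=256):
    """Toy embedding: hashed bag-of-words. Stand-in for a real embedding model."""
    vec = [0.0] * dims
    for tok, count in Counter(text.lower().split()).items():
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dims
        vec[idx] += count
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    """Return the k docs most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "The deploy script reads credentials from the environment.",
    "Billing runs nightly and emails a summary to finance.",
    "GPU workers pull jobs from the queue and write results to S3.",
]

question = "how do deploys get their credentials?"
context = retrieve(question, docs)
prompt = "Answer using only this context:\n" + "\n".join(context) + f"\n\nQ: {question}"
print(prompt)  # in a real setup, this prompt goes to the LLM of your choice
```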

I feel there has to be a substantial market for whoever can completely simplify that flow for dummies like me, and not charge a fortune for the privilege.

7 days agosilisili

The problem is that currently all the "competing" ideas have a ton of tradeoffs, and rarely is any of them one-size-fits-all. Furthermore, it's not clear whether the idea you choose will become obsolete as the underlying model architecture gets better. On top of all that, you are essentially competing with Anthropic/OpenAI/Google, where your only advantage is "privacy". Anyone who deeply cares about privacy, and is willing to pay for it, will likely just do it on their own (especially if their CTO is pouring money into "investing" in AI). Anyone who doesn't will likely not want to be 6-7 months behind what you can get at OpenAI or Google.

7 days agonemothekid

This is well written. I appreciated the line, "startups are a race to learn stuff."

7 days agothe_king

It's just as difficult to grow as a serverless GPU provider as it was for serverless CPU providers before GPUs came along.

Many companies overinvest in fully-owned hardware rather than renting from clouds. Owning hardware means you underwrite the cost of unrented inventory, and it limits your ability to scale. H100 pricing is now lower than any self-hosted option, even before factoring in TCO and headcount.

(Disclaimer: I work at a GPU cloud Voltage Park -- with 24k H100s as low as $2.25/hr [0] -- but Fly.io is not the only one I've noticed purchase hardware when renting might have saved some $$$)

[0] https://dashboard.voltagepark.com/

7 days agojonathanlei

It feels like giving up on this a bit too soon? I mean, they identified the problem quite correctly: their offering doesn't entirely make sense for their audience when it comes to GPUs.

_But_ the demand for open source models is just beginning. If they really have a big inventory of under-utilized GPUs and users want particular solutions on demand... give it to them???

Like TTS, STT, video creation, real-time illustration enhancement, DeepSeek, and many others. You guys are great at devops; make useful offerings on demand, similar to what HuggingFace offers, no???

7 days agoflockonus

> A whole enterprise A100 is a compromise position for them; they want an SXM cluster of H100s.

For a lot of use-cases you need at least two A100s with a very fast interconnect, potentially many more. This isn't even about scaling with requests but about running one single LLM instance.

Sure, you will find all sorts of ways people have managed to run this or that on smaller platforms; the problem is that it quite often doesn't scale to what is needed in production, for a lot of subtle and less subtle reasons.

7 days agodathinab

Yes, devs want LLMs, but also the price of inference compute has plummeted 90% over the last 18 months, and that compute is primarily GPUs.

So it’s not just that the OpenAI and Anthropic APIs are good enough, they are also cheap enough, and still overpriced compared to the rest of the industry.

Your GPU investment won’t do as well as you thought, but you are also wasting time on security. If the end user and market don’t care, then you can consider not caring as well. Worst case you can pay for any settlement with …. more GPU credits.

7 days agoyieldcrv

Not sure about this:

> like with our portfolio of IPv4 addresses, I’m even more comfortable making bets backed by tradable assets with durable value.

Is that referencing the GPUs, the hardware? If yes, why should they have durable value? Historically, hardware like that depreciates fast and reaches a value of 0; energy efficiency alone kills e.g. old server hardware. Is something different here?

7 days agoonli

Historically GPUs didn't cost $30,000 a pop. Also, the end of Moore's law, etc.

7 days agofoota

I think it's referencing only the IPv4 block, but it is a bit confusing. It doesn't make sense to be ref'ing the GPUs because their value is definitely not durable.

7 days agothundergolfer

I don't know about "durable", but they're not written off. There is absolutely a market for all this hardware.

7 days agotptacek

There’s certainly more retained value in the physical stuff than in developer time.

7 days agoAeolun

Yeah fair enough. I think it's just the subjectivity of "durable" at play here. The value of the GPUs may halve in a single year (e.g. H100s), but they'll never* drop to zero in a month. That's at least some kind of durability, because you can get a transaction done in a month.

* never say never

7 days agothundergolfer

Absolutely - GPUs are definitely not a very liquid asset. As someone who works at a GPU neocloud provider (Voltage Park), I can say server assets at scale face huge slippage: you can buy for $1 and get quotes for $1.50, but only be able to sell for $0.60.

7 days agojonathanlei

I mean consumer GPUs, yes.

But server GPUs tend to depreciate slower.

We also see this with, e.g., the A100 80GiB: it's approaching 5 years of age, but is still sold and used widely and still costs ~$20k USD (and I remember a noticeably higher price before DeepSeek...).

The thing is, sure, the A100 80GiB is a much older arch than its successors, but the main bottleneck is the memory, of which it has 80GiB.

7 days agodathinab

> the A100 80GiB: it's approaching 5 years of age […] still costs ~$20k USD

What was the price at launch?

7 days agokgwgk

Hah. We're doing AI, but we're doing vision-based stuff and not LLMs. For us, the problem has been deploying models.

Google and AWS helpfully offered their managed LLM AI services, but they don't really have anything terribly more useful than just machines with GPUs. Which are expensive.

I'm going to check fly.io...

7 days agocyberax

What kind of models are you deploying and what type of problems are you having with deploying them?

7 days agonserrino

Aerial imagery analysis. It's a mix of classic computer vision and AI for some purposes.

7 days agocyberax

What a great blog post. Hope you figure it out, Fly.io. "If you will it, it is no dream."

4 days agoolibaw

> The biggest problem: developers don’t want GPUs. They don’t even want AI/ML models. They want LLMs.

I considered using a Fly GPU instance for a project and went with Hetzner instead. Fly.io’s GPU offering was just way too expensive to use for inference.

7 days agojonathanyc

Hetzner is more expensive by default though? It starts at $200/month. Which is fine if you are running for 720 hours every month, but you can run more cheaply on fly if it doesn’t get used more than 150ish hours in a month.

7 days agoAeolun

They might just be early.

The smaller models are getting more and more capable, for high-frequency use-cases it'll probably be worth using local quantized models vs paying for API inference.

7 days agosiliconc0w

What a good, open, and honest blog post. And I really liked the way it is interlinked with other interesting posts from that blog. I hope I'll have some time to read more articles from it.

7 days agoimcritic

> We were wrong about Javascript edge functions, and I think we were wrong about GPUs.

Actually, you're still wrong about JavaScript edge functions. CF Workers slap.

7 days agohamandcheese

They were wrong for us. Cloudflare is in a much different position than we were in 2019 trying to get people to write new Javascript. Clearly, for us, running people's existing applications natively was the better call. We're not dunking on Cloudflare's model.

6 days agotptacek

My issue is that I may or may not understand what's going on, but I simply, for the most part, do not want to spend time maintaining any more than I have to.

7 days agoapineda

Article about GPUs, comments arguing over the definition of complexity in Kubernetes. This is what you call “learned helplessness.”

6 days agobleemworks

In most cases developers don't want GPUs, they just want a way to express a computation graph, and let the system perform the computation.
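
A minimal sketch of that idea: the developer declares a graph of operations once, and a tiny evaluator decides how to execute it (a real system would dispatch these nodes to a GPU; this toy runs them in plain Python):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                               # "input", "add", or "mul"
    inputs: list = field(default_factory=list)
    value: float = 0.0                    # only meaningful for "input" nodes

def evaluate(node, cache=None):
    """Walk the graph once, memoizing results so shared subgraphs run only once."""
    cache = {} if cache is None else cache
    if id(node) in cache:
        return cache[id(node)]
    if node.op == "input":
        result = node.value
    else:
        args = [evaluate(n, cache) for n in node.inputs]
        result = sum(args) if node.op == "add" else args[0] * args[1]
    cache[id(node)] = result
    return result

# y = (a + b) * a, expressed as a graph instead of imperative GPU code
a = Node("input", value=2.0)
b = Node("input", value=3.0)
s = Node("add", inputs=[a, b])
y = Node("mul", inputs=[s, a])
print(evaluate(y))  # 10.0
```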

7 days agoamelius

Off topic, but the font in the article is hard on the eyes.

7 days agoa-r-t

> GPUs terrified our security team.

Ha ha, it didn't terrify Modal. It ships with all those security problems, and pretends it doesn't have them. Sorry Eric.

7 days agodoctorpangloss

if your security team is 0 people, do you terrify all or none of them?

7 days agoarccy
[deleted]
7 days ago

Has service reliability improved at all? I tried Fly at two different points in time and I’ve never had a worse experience with a service.

7 days agomrcwinn

You didn’t say at which points in time so it’s kind of hard to say yes but I will say “yes, reliability has improved”.

7 days agololoquwowndueo

Okay, I'll try a different question. How's reliability these days, lolo?

7 days agomrcwinn

You guys have all this juicy GPU hardware and infrastructure. Why not offer models as APIs?

I would pay to have APIs for:

SAM 2, Florence, BLIP, Flux 1.1, etc.

For whatever use case I would have reached for a Fly GPU, I can't justify _not_ using Replicate. Maybe Fly can do better and offer premium queues for that with their juicy infra?

You're right! As a software dev, I see dockerizing and foisting these models as a burden, not a necessity.

7 days agosergiotapia

Someone should do that! Doesn't need to be us, though.

7 days agotptacek

Maybe you can have claude build it on top of fly.io? ;)

7 days agoAeolun

I noticed quite a few spelling and grammar mistakes - could do with a bit of an edit pass?

7 days agoPhilpax

In the current days of AI, I think spelling and grammar mistakes are perhaps a great way to tell it is still written by a human...... (until AI copies this).

7 days agoksec

I'm sure AI is already capable of linking anonymous forum handles by writing style. Best to reinvent ourselves every month like one would with monthly password resets

7 days agoglouwbug

I do, in fact, instruct LLMs to make spelling and grammar mistakes when I have them reply to cold emails.

7 days agomrkurt

Do you run those LLMs on Fly? ;)

7 days agoignoramous

Nah, fly.io has a company culture that is all about having lots of bugs and issues, and that includes blog posts.

The idea that a cloud compute provider can’t make GPU compute into an profitable business is pretty laughable.

7 days agodangus

For what it’s worth I don’t think we entirely disagree: it has at times felt absurd that it didn’t make as much money as it maybe otherwise could. We made a bet that the type of cloud platform we wanted to build could be well served by GPUs. It wasn’t as good a bet as we thought. There is probably a different type of cloud product we could build that would be better set up to sell gpus but we are still committed to the primitives our machine product has to offer.

7 days agoDAlperin

I have to agree with this. Look at GPU utilization at AWS, Azure, .. they are running close to 100%.

For our p5 quota I had to talk to our TAM team at AWS, while most of our quota requests are usually instant.

7 days ago_zoltan_

But the people buying GPUs on AWS are not the same market as the ones on fly.io.

The whole thing is sorta antithetical.

7 days agoAeolun

Out of curiosity, how much runway does fly.io have (without raising new funding?)

7 days agoVectorLock

I'd guess significantly less after this debacle.

7 days agojeremyjh
[deleted]
7 days ago

Currently getting a 502 error when trying to access fly.io

7 days agosgt

If low-cost GPUs are not what they are offering, then what are they offering anymore that I wouldn't get from a big cloud vendor? This looks like a self-inflicted mortal wound.

7 days agoabraxas

"Mortal wound" lol.

7 days agotptacek

Kudos for owning up to your failed bet on GPUs, even if you are putting a lot of the blame on Nvidia for it. And to be fair, you're not wrong. Nvidia's artificial market segmentation is terrible and their drivers aren't that great either.

The real problem is the lack of security-isolated slicing of one or more GPUs for virtual machines. I want my consumer-grade GPU to be split up between the host machine and virtual machines, without worrying about resident-neighbor cross-talk! Gosh, that sounds like why I moved out of my apartment complex, actually.

The idea of having to assign a whole GPU via PCI passthrough is just asinine. I don't need to do that for my CPU, RAM, network, or storage. Why should I need to do it for my GPU?

7 days agoinetknght

What GPU services will you keep?

7 days agoiFire

Next week: fly introduces game streaming technology for indie game devs.

7 days agoanotherhue

"We started this company building a Javascript runtime for edge computing."

Wait... what?

I've been a Fly customer for years and it's the first time I hear about this.

7 days agopier25

Nvidia deliberately makes this hard.

Opportunity for Intel and AMD.

7 days agoandrewstuart

My takeaway from the article is that there's no real market for GPU VMs so it's pointless for Intel and AMD to make them work.

7 days agowmf

[flagged]

7 days agomartinsnow

[flagged]

7 days agokirillzubovsky

[flagged]