Stop Designing Your Web Application for Millions of Users When You Dont Have 100

At a previous job, there was an argument over a code review where I had done some SQL queries that fixed a problem but were not optimal. The other side were very much "this won't work for 1000 devices! we will not approve it!" whereas my stance was "we have a maximum of 25 devices deployed by our only customer who is going to leave us next week unless we fix this problem today". One of the most disheartening weeks of my software development life.

(that was also the place I had to have a multi-day argument over the precise way to define constants in Perl because they varied in performance except it was a long running mod_perl server process and the constants were only defined at startup and it made absolutely zero difference once it had been running for an hour or more.)

I actually like having room for optimization, especially when running my own infra servers included.

As an example, I can think of half a dozen things I can currently optimize just in the DB layer, but my time is being spent (sensibly!) in other areas that are customer facing and directly impacting them.

So fix what needs to be fixed, but if there was a major load spike due to onboarding of new clients/users, I could in a matter of hours have the DB handling 100x the traffic. That's a nice ace in the back of my pocket.

And yes, if I had endless time I'd have resolved all issues.

Usually a good trick is to run small deployments at high logging levels. Then, as soon as there are performance issues, you can dial down the logging and get the hours of respite needed to actually make a bigger algorithmic improvement.

If that optimization is mere hours of work, I would go for it outright. BTW when you have an overwhelming wave of signups, you are likely to have more pressing and unexpected issues, and badly lack spare hours.

Usually serious gains that are postponed would require days and weeks of effort. Maybe mere hours of coding proper, but much longer time testing, migrating data, etc.

I think that's the pitfall: there are infinite things a skilled developer can do within "mere hours of work".

The key is to find which ones are the most effective use of one's limited hours.

I developed a small daily game and it has now grown to over 10K DAU, so now I've started going back to refine the low hanging fruits which just didn't make sense to touch when I had just 10s of players a day.

We might all be operating under different ideas of what "matter of hours" means. Often times in a project, at least for me, the range of things that I can start and finish in mere hours is not actually infinite, but rather constrained. Only the simplest and smallest things can be done so fast or faster. So many other things just take longer, at least a day and a half, and can take extra long when other teams are involved. More experienced developers can learn how to break things down into increments, so while an entire big feature might take weeks to do, you're making clear chunks of progress in much smaller units. Those chunks are probably not mostly sized in mere hours pieces, though...

You're right that priority matters. Just beware of priority queue starvation. Still, if some newly discovered bug isn't urgent, even if I think it'd be rather easy (under an hour) to fully address, I'd rather not break my current flow, and just keep working on the thing I had earlier decided was highest priority. A lot of the time something will prevent direct progress and break the flow anyway, having smaller items available to quickly context switch to and finish is a good use of those gap times.

The "DB handling 100x the traffic" example above isn't quite well defined. I wonder if it's making queries return 100x faster? Or is it making sure the queries return at roughly their current speeds even if there's 100x more traffic? Either way, I can make arguments for doing the work proactively rather than reactively, but I'd at least write down the half a dozen things. Then maybe someone else can do them, and maybe those things can be done in around half a dozen tiny increments of 30 minutes or less each, instead of all at once in hours.

> Or is it making sure the queries return at roughly their current speeds even if there's 100x more traffic?

From what I remember, it was this. The DB was MySQL and a whole bunch of stuff would have been less efficient when there were 1000 devices instead of 25. But on the other hand the system was broken and the customer was threatening to cancel everything and fixing the DB stuff was going to take a fair amount of rearchitecting (not least dumping MySQL for something less insane) that we didn't have the time or resources to do in a hurry.

Exactly. Going for something that shaves 5% off your hosting bill or your build time only makes sense when you are already colossal. But something that halves your bill, or halves your build time, or halves your latency, is likely impactful if your project achieves the scale when the proceeds from it can support you.

And very often an optimization improves things not by factor of 1.1 but by factor of 10 or more.

OTOH it's worth to be mindful of your "tech debt" as a "tech credit" you took to be able to do more important things before it's time to repay. Cutting your hosting bill from $100 to $50 may be much less important than doubling your revenue from $500 to $1000.

> if there was a major load spike due to onboarding of new clients/users

This company was a hardware company with reasonably complex installation procedures - even going full whack, I doubt they could have added more than 20 new devices a week (and even then there'd be a hefty lead time to get the stuff manufactured and shipped etc.)

like != need

This is why everyone uses microservices and React. We don't know how these work, but we are netfligs/farcebook level companiez, so we must.

What? How does react fit into that? It's just a web framework that works for any scale basically. If you are a react dev and familiar with react, it's as easy as anything else to use. Now sure if you're not familiar with it you'd have to learn it but that's not an issue with react.

~90% of SPAs could be static HTML with a few kilobytes of vanilla JS sprinkled in for AJAX.

Why? The whole thread is about how keeping it simple is supposedly better. Using react when you know react is easier than static html+ajax. It handles a lot of stuff for you and you can then use the incredibly rich react ecosystem for everything else (authentification, forms). Even a pretty basic web app usually requires those, so again, it's simpler to just use react and it adds basically no overhead for a react dev.

Plus, the big issues with react usually only manifest themselves in larger web apps, so it's basically "perfect" for simple stuff. You could make a functional react page in a single file in a few hours.

And they could also be react apps. What's your point? For a lot of people, a suitable react app is easier to make than the equivalent static HTML page with vanilla JS. The "it should be static HTML with a sprinkle of JS so the filesize is extra small!" guy is the one overengineering, here. The SPA's 20 customers are all using modern web browsers on modern first-world internet connections and don't care or notice that the page is not Stallman-approved.

Stallman-approved pages will happily welcome any and all users visiting your site. Modern SPAs maybe not so much due to the overengineering and mostly unnecessary features.

[dead]

Ugggggggggggh. My favorite part of the job is scaling/optimization, and my least favorite part of the job is... this.

There are a LOT of "engineers" who understand that one thing might be faster than another thing, but lack the chops/wisdom to understand when it actually matters.

Best example I've seen is wanting to optimize a frontend JS for loop that gets triggered maximally only once per page load, runs max 100 times in a loop with complicated data structures that will confuse any reader of that code why it was done like this and not just with a for loop.

It's such a Dunning-Kruger effect. You get legitimately talented engineers who are talented enough to do a legitimately hard thing: optimizing a complex piece of code.

So, they don't realize that there's kind of another level of understanding beyond that (when to optimize vs. how to optimize)

There's also the fact that some people just like optimizing shit for selfish or even slightly nefarious reasons. Some just like doing it because it's fun. Others like it because it gives them visible and easily quantified "wins." On the rarer and more nefarious side of things, they may enjoy now having "ownership" over a piece of complex code that only they can understand....

I don't know how to exactly classify this, but based on my experience it feels like a certain type of engs who have only been exposed to this type of way of doing things from when they started, like at uni. It's to do with going through certain type of education, after that certain type of enterprisey environment following very strict rules of design patterns and applying these patterns to absolutely everything they do. They haven't done many quicker projects or side projects to realize that simpler things can work absolutely well enough to realize there can be flexiblity. I think they are good people and outside of work I get along well with them, but during work I do feel frustrated and when I try to explain it to them, I get kind of a West World like response "Doesn't look like anything to me". Like they can't get it, because they only have the uni and enterprise experience, and not experience to build things on their own volition.

Yeah. I think formal CS education often does a good job of teaching students to optimize in a myopic way - calculating big-O notation, understanding basic algorithms and data structures etc.

It's also rigid by design and (probably) necessity: students don't typically encounter unknown, unchanging, and evolving constraints and requirements throughout the project. Lastly they are not typically maintaining a medium/large codebase in conjunction with other engineers.

So you tend to get these smart kids don't know when to optimize. They can make a 15-line function "faster" but they don't have a sense of the impact this has on a larger project and they don't know how to weigh the impact of an optimization vs. the effort involved and future complexity it might incur.

The lack of knowledge would be fine on its own, but back to the Dunning-Kruger effect: they typically don't know that they don't know this stuff.

Not sure how you change that at the CS education level, honestly. At the very least you could stress to the kids that in the real world you have conditions and goals much different than in CS classrooms...

> I had to have a multi-day argument over the precise way to define constants in Perl

Couldn't you have used whatever the other person was suggesting even if the change was pointless?

These days, I would, yeah, but this was a long time ago and I was a lot more invested in not letting nonsense go past without a fight.

Yeah it's one of those things where it's like...

...okay, I can let this nonsense go today with very little impact, but if I do... will the team need to deal with it 100 times in the future? (And more broadly, by silence am I contributing to the evolution of a bad engineering culture ruled by poor understandings?)

It is very very difficult to know where to draw the line.

Three axes in my mind to take into account: Likelihood, impact, and complexity. The article seems to mostly be about likelihood and complexity. I'd place more importance on impact and complexity. It's worth saving a headache later if it's simple to do right now, however unlikely it is, as long as it doesn't create a maintenance burden before that point.

In my last company that was always the golden rule - manage the present challenge with an eye on the future - but let's get things done.

I'm stunned to realize that most developers just blindly follow whatever are the newest hottest "good practices" and completely ignore the actual goals that code should try to achieve. Literally yesterday I had an argument with a coworker because he wouldn't buy the argument "High-availability costs us $2000 per year extra, downtime costs us $100 per day, so if the database breaks once per year, there's no point having high-availability". Some time ago someone important from upper management announced that we need to pump out features faster to stay afloat as a business, even if they're a bit broken, and two hours later the most senior programmer in my team introduced us to new bureaucracy "because that's the best practice" and he wouldn't understand that what he's doing is directly at odds with the company's goals. Same guy also often accuses me of "not seeing the bigger picture". I gave up trying to reason with coworkers because it's just pain and changes nothing in the long run besides exhausting me.

On the other side of the argument, does $2000 a year seem a lot for high-availability?

Does downtime really cost only $100 per day? How was it calculated? How much does your business make? It would seem it should make more than 365 * $100 = $36,500 to be able to be in a position to hire people in the first place.

Database downtime would potentially:

1) Break trust with customers.

2) Take away focus from your engs for a day or more which is also a cost. If an eng costs $200 per day for example, and you have 4 engs involved with, it's already a cost of $800, not to mention increase of odds of stress and burn out. And in US engs would cost of course much more than that, I was just picking a more favourable sum.

3) If it's a database that breaks, it's likely to cause further damages due to issues with data which might take indefinite time to fix.

Overall in most businesses it would seem a no brainer to pay $2000 a year to have high availability and avoid any odds of a database breaking. It's very little compared to what you pay your engineers.

The whole thing is a side project with 5 customers total. If it dies it'll take a while before anyone notices.

It's a side project but you are calling them coworkers and a company with company's goals?

> I gave up trying to reason with coworkers because it's just pain and changes nothing in the long run besides exhausting me.

What are you even supposed to do in such situations?

It might be not possible to make them see things your way or vice versa.

You probably don’t want to get the person fired and going to look for a new place of employment just because of one or two difficult people doesn’t seem all that nice either.

You can get the management on your side (sometimes), but even then that won’t really change anything about similar future situations.

To me, it seems like there aren’t that many options, whenever it’s anything more abstract than being able to point to a sheet of test results and say “A is better than B, we are doing A”.

I’ve had fundamental disagreements with other people as well, nowadays I don’t really try to convince them of anything and try to limit how deep those discussions get because often they’re quite pointless, whenever they’re not purely data based. That doesn’t actually fix anything and the codebases do get less pleasant to work with at times for someone like me.

With SQL you really should support 10x your current user count, rounded up to the nearest power of 10. If you have 25 devices, this means you should support 250 rounded up to 1000. If you don't, it will come back to bite you really hard and really quickly. This is just something that you learn from experience.

Obviously I am not talking about an emergency situation, but about planned design.

totally agree with your view,

the big design upfront moment again, maybe because of the current economy, we need focus more on be profitable in short term, i think it's great and always focus on optimize for now, and test for specs (specs in sense of requirements of customer)

I really don't think that's the case. If you ask a CEO/CTO of a startup, he would fire the guy who did the latter instead of the middle approach. Longer term stability and development velocity are very important concerns in engineering management. This is pure inexperience - it wouldn't take too long to setup for an experienced engineer anyways, they probably did it 5 times last month and have a library of knowledge and templates. I can't call a guy who isn't capable of setting up a project this way swiftly without seeing it as a problem "senior".

What is the middle approach in this scenario? What specifically are the templates you refer to?

The middle approach would be implementing the infrastructure in a basic way that is simplistic, but provides benefit and can be expanded.

How does that relate to the scenario and what of the templates?

> The middle approach would be implementing the infrastructure

That's great if you're implementing but it doesn't really work when you're coming in to an existing infrastructure (or codebase) that other people manage.

To spend time on thinking about performance, and then not write the code.

I mean sure you would want to do that, but the above was very specific situation with existing feature set and an imminent novel problem.

People seriously underestimate the amount of clients you can serve from a single monolith + SQL database on a VPS or physical hardware. Pretty reliable as well, not many moving parts, simple to understand, fast to setup and keep up to date. Use something like Java, C#, Go or Rust. If you need to scale you can either scale vertically (bigger machines) or horizontally (load-balancer).

The SQL database is probably the hardest part to scale, but depending on your type of app there is a lot of room with optimizing indices or add caching.

In my last company we could easily develop and provide on-call support for multiple production critical deployments with only 3 engineers that way. Got so little calls that I had trouble to remember everything and had to look it up.

Scaling databases is easy. But you can really blow performance out of the water if you don't know SQL and use an ORM the naive way. Micro services and things like graphql make it worse. Now you are doing joins in memory in your FF-ing graphql endpoint instead of in the database.

A simple cheap database should be able to handle millions of rows and a handful of concurrent users. Meaning that if your users drop by a few times a week, you can have hundreds/thousands of those before your server gets busy. Run two cheap vms so you can do zero downtime deployments. Put a load balancer in front of it. Get a managed database. Grand total should be under 100$ per month. If you are really strapped for cash, you can get away with <40$.

>And a big, serious database server can handle insane number of rows and concurrent users. Stackoverflow famously runs on [1] a single database server (plus a second for failover, plus another pair for the rest of the stackexchange network).

[1] Or used to run, this factoid is from many years ago, at its peak popularity.

Vertical scaling is easy, but horizontal scaling is something that gets complex very fast for SQL databases. More tools, more setup, more things can go wrong and you have to know. If you have no shard like a tenant_id joins easily become something that involves the network.

Managed databases ofc take a lot of that work away from you, but some customers want or need on-premise solutions due to legal requirements or not wanting to get locked into proprietary offerings.

People also underestimate how financially predictable this setup is - you purchase a VPS or bare metal box on DigitalOcean/Hetzner/OVH for e.g. $50/mo, and that price will likely stay the same for the next 5 years. Try that with any of the cloud providers.

This part is often neglected when running a company, where owners usually hope infra costs will decrease over time or remain proportional to company income. However, I'm still waiting to see that.

Good point. Hardware resources are significantly more expensive in the cloud. It is not even close. For basic setups and no insane scaling requirements (most CRUD apps) I don't think you can beat the VPS/bare-metal offers from the mentioned providers. Support is often superb as well.

> The SQL database is probably the hardest part to scale

Use an ORM and SQLite, I bet you a beer that you won’t hit the perf ceiling before getting bored of your project.

I agree that SQLite is awesome for many use-cases. Running in the same process allows to have a very small latency. No additional server to manage is also a big plus. Concurrency can be an issue with writes though. Plus if you need to scale your backend you can't share the same database.

ORMs are very hit-and-miss for me. Had bad experiences with performance issues and colleagues who don't understand SQL itself, leading to many problems. Micro-ORMs that just do the mapping, reduce boilerplate and otherwise get out of your way are great though.

It really depends on the project. For a CRUD application? Sure. For an application that processes hundreds of events per second, stores them, and makes them wueryable using various filters? Can get tough real quick.

That's where SQLite really shines - used it in a massively scaled facial recognition system, with several hundred simultaneous cameras, a very large FR database with over 10K people, each having dozens of face entries, and all the events that generates from live video viewing large public venues. SQLite was never a bottleneck, not at all.

Single sqlite db or did you shard it ?

Single SQLite db embedded within the FR server, with some crazy huge database size of 75+ million faces too. This system is now deployed in a huge percentage of the western world's airports.

Kotlin probably deserves an honorary mention instead of Java.

This topic lacks nuance.

I agree in focusing on building things that people want, as well as iterating and shipping fast. But guess what? Shipping fast without breaking things requires a lot of infrastructure. Tests are infrastructure. CI and CD are infrastructure. Isolated QA environments are infrastructure. Monitoring and observability are infrastructure. Reproducible builds are infrastructure. Dev environments are infrastructure. If your team is very small, you cannot ship fast, safely, without these things. You will break things for customers, without knowing, and your progress will grind to a halt while you spend days trying to figure out what went wrong and how to fix it, instead of shipping, all while burning good will with your customers. (Source: have joined several startups and seen this first hand.)

There is a middle ground between "designing for millions of users" and "build for the extreme short term." Unfortunately, many non-technical people and inexperienced technical people choose the latter because it aligns with their limited view of what can go wrong in normal growth. The middle ground is orienting the pieces of your infrastructure in the right direction, and growing them as needed. All those things that I mentioned as infrastructure above can be implemented relatively simply, but sets the ground work for future secure growth.

Planning is not the enemy and should not be conflated with premature optimization.

Nowadays there are a lot of tools to setup this infra in a standard way very quickly though - in terms of CI/CD, tests, e2e tests etc.

There is no "standard way", only rough principles, and it all depends on the unique DNA of the company (cloud, stack, app deployment targets, etc). Yes there are a lot of tools, and experienced infrastructure engineers spend a lot of time integrating them. Often times they won't work without enormous effort, because of early "move fast & break things" design decisions made by the org. My experience has been that a startup is only using these tools from early on if they have experienced engineers and a management that understands the importance of building on a reliable foundation.

There are quick ways to setup a satisfying dev exp for small teams for full stack apps in terms of what will get you going for a while:

1. Git based CI/CD pipelines.

2. Multiple and arbitrarily configured amount of instances.

3. Easy rollbacks, no off time deployments.

4. Different set of environments. Branch/PR based preview links.

5. Out of the box E2E frameworks, etc.

I do agree however that you need someone in the first place to know that those things exist, but in a start up that would just be one experienced eng who is aware that they exist.

But they are standard in terms of that these are things that 99% of SaaS based Start Ups would definitely want.

Guilty! I spent so much time recently asking myself and trying to optimize my app and stack: - what if people upload files that big, what about the bandwidth cost? - what if they trigger thousands of events X and it costs 0.X$ per thousands?

Fast forward months, it's a non issue despite having paying customers. Not only I grossly exagerated the invidual user's resources consumption, but I also grossly exagerated the need for top notch k8s auto-scaling this and that. Turns out you can go a long way with something simpler...

Yeah, I have to keep fighting the feeling that I'm not doing things the "right" way as they are so unlike the $JOBs$ I had before, with availability zones, auto-scaling, CDNs, etc.

I have the most adhoc, dead simple and straightforward system and I can sleep peacefully at night, while knowing I will never pay more than $10 a month, unless I decide to upgrade it. Truly freeing (and much easier to debug!)

The ethos described in TFA extends a lot further than this vague idea of "scale". Be honest. We've all built things without a test framework and 99.9% coverage. Many all the way to production and beyond. Many of us skimp on accessibility. Works For Me™ means it Works For Client™ and that's all anyone will budget for. Even client optimisations (bundling/shaking/compressing/etc) get thrown out the window in a hurry.

The problems with accepting this way of thinking is you never budget for cut corners. After you build the MVP, all your client/boss wants is the next thing. They don't want to pay you again to do the thing you just did, but right.

And if you never get approval for testing, a11y, optimisations, the first time you hear about it is when it has lost you users. When somebody complains. When something breaks. Those numbers really matter when you're small. And it always looks bad for the devs. Your boss will dump on you.

So just be careful what corners you're cutting. Try to involve the whole team in the process of consciously not doing something so it's not a shock when you need to put things right later on.

Yes but...

Still make some effort to build as if this were a professional endeavor; use that proof of concept code to test ideas, but rewrite following reasonable code quality and architecture practices so you don't go into production with lack of ability to make those important scaling changes (for if/when you get lucky and get a lot of attention).

If your code is tightly coupled, functions are 50+ lines long, objects are mutated everywhere (and in places you don't even realize), then making those important scaling changes will be difficult and slow. Then you might be tempted to say, "We should have built for 1 million users." Instead, you should be saying, "We should have put a little effort into the software architecture."

There are two languages that start with "P" which seem to often end up in production like this.

Figure out your data structure definitions early, along with where those structures come from and where they’re going, and write disposable code around them that builds them and gets them where they need to be. Stable data definitions make it easy to replace bits and pieces of your application as you go. Especially if you view mutability not as the default, but as a performance optimization you can reach for if you need it. (You often do not)

> There are two languages that start with "P" which seem to often end up in production like this.

I think thats more due to their tendency to be grabbed by n00bs than any deficiencies in said languages.

[dead]

[caveat: not read the text because if you click the partners link onthe nag box, it looks like I need to object to each “legitimate interest” separately and I've not got time for that – congratulations darrenhorrocks.co.uk on keeping your user count lower, so you don't have to design for higher!]

The problem often comes from people solving the problem that they want to have, not the ones that they currently have. There is a pervasive view that if your site/app goes viral and you can't cope with the load, you lose the advantage of that brief glut of attention and might never get it again, if there is a next time some competing site/app might get the luck instead. There is some truth in this, so designing in a way that allows for scaling makes some sense, but perhaps many projects give this too much priority.

Also, designing with scaling in mind from the start makes it easier to implement later, if you didn't you might need a complete rewrite to efficiently scale. Of course keeping scaling in mind might mean that you intend a fairly complete redo at that point, if you consider the current project to be a proof of concept of other elements (i.e. the application's features that are directly useful to the end user), the difference being that in this state you are at least aware of the need rather than it being something you find out when it might already be too late to do a good job.

One thing that a lot of people overengineering for scale from day 1, with a complex mesh of containers running a service based design miss, when they say “with a monolith all you can do is throw hardware at the problem”, is that scaling your container count is essentially throwing (virtual) hardware at the problem, and that this is a valid short-term solution in both cases, and until you need to regularly run at the higher scale day-in-day-out the simpler monolith will likely be more efficient and reduce running costs.

You need to find the right balance of “designing with scalability in mind”, so it can be implemented quickly when you are ready, which is not easy to judge so people tend to err on the side of just going directly for the massively scalable option despite the potential costs of that.

>It looks like I need to object to each “legitimate interest” separately and I've not got time for that

I absolutely don't understand why some websites do this. Either don't show them or don't make them annoying to disable. Let me explain:

Legitimate interest is one of the lawful reasons for processing personal data. They don't have to ask for your permission. Usually adspam cookies are not in your legitimate interest, so they have to resort to another lawful basis, which is user consent. But they claim "legitimate interest" covers these cookies, so why even ask?

But on the other hand, I often stubbornly disable legitimate interest cookies, and not once I broke the website this way. This is suspicious - "legitimate interest" means that it's crucial to doing what you want to do on the website, for example a session cookie or language selection cookie. If the website works normally without a "legitimate interest" cookie, them the interest was not legitimate at all. I assume this is just some trick abused by advertisers to work around GDPR, and I wish them all 4% of global turnover fine.

> Legitimate interest is one of the lawful reasons for processing personal data.

Only if it is genuinely legitimate interest, in which case as you say they shouldn't even need to ask.

The reason they make them annoying to disable is the hope that people will stop bothering. In fact, they are usually hidden inside nested concertina UI elements so people don't even see the option to object. Dark pattern through and through.

The reason they ask even when only collecting/tracking strictly necessary data, is to try turn the public against the regulations by adding an unnecessary irritation. They want us to blame the legislators for causing us an inconvenience, when in fact the companies are doing so very deliberately because they are annoyed at the inconvenience they are caused by having to be transparent about the stalking that they do (or try to do).

> I assume this is just some trick abused by advertisers to work around GDPR

Sort of. And other similar regulations. Though it isn't actually GDPR-compliant by my reading, not by a long shot, the way most implement it.

When you have more micro-services than users.

This phenomenon needs a term, how about Premature Architecture?

I like it much. It goes well with Premature celebration from startupers. "We raised 100M USD, we made it!", when the company is on the verge of collapse every day and is losing 100M USD per year and has no business model rather than buying something 2 USD and selling it 1 USD.

Well, if that isn't a problem for a future me, I don't know what is.

Premature architectural optimization.

Or in other words, you're not Google or Facebook. You almost certainly don't need the level of performance and architectural design those companies need for their billion + user systems.

And it doesn't help that a lot of people seem to drastically underestimate the amount of performance you can get from a simpler setup too. Like, a simple VPS can often do fine with millions of users a day, and a mostly static informational site (like your average blog or news site) could just put it behind Cloudflare and call it a day in most cases.

Alas, the KISS principle seems to have gone the way of the dodo.

I dunno, I've lived the other side of this where people made boneheaded choices early on, the product suddenly got traction, and then we were locked into lousy designs. At my last company, there were loads of engineers dedicated to re-building an entire parallel application stack with a view to an eventual migration.

A relatively small amount of upfront planning could have saved the company millions, but I guess it would have meant less work for engineers so I suppose I should be glad that firms keep doing this.

More experienced software engineers can usually make the right design decisions that don't mean rebuilding when/if it comes time to scale things. This is usually about avoiding bad decisions because they know better.

In the abstract, it's impossible to judge decisions. What's worse than a company that spending millions on refactoring? Not being able to spend money on that because the company went out of business before launch because the devs were too busy engineering the most perfect, webscale architecture for the system that they weren't able to launch a product before running out of runway.

How is this done? And how is it not done?

I agree with this if you are talking fb/google scale; you will very likely not even get close to that, ever. But millions of users, depending on how it hits the servers over time, is really not very much. I run a site with well over 1m active users, but it runs on $50/mo worth of VPSs with an LB and devving for it doesn't take a millisecond longer than for a non scaling version.

Counter argument is that, you do not know whether your system will go big next week. If it indeed does, now you do not have the Google/Meta/Amazon engineer legion to immediately scale it up without losing users[1]. Also, a gambit of scalable system initially helps to solve the later death-march, when management wants to ship the MVP in global production and then come howling when users are bouncing because the system was not designed for scale.

[1] Unlike early days, users are very quick to dismiss or leave immediately if the thing is breaking down and of course they will go rant at all social media outlets about the bad experience further making life hell for "build things that don't scale".

Counter point - even if you spend more time on scale and system setup you will still be prone to failure without having this scale tested and stepped up progressively.

And I think ironically you are more likely to get higher scale if you spend less time on scaling, since you spend more time building other things that users care about.

And frequently if focusing on scale, you will run into bugs that you wouldn't if you would just use a simple one box monolith. Your incident resolving might take longer with a scalable microservices arch because debugging and everything becomes much more complex.

You have limited resources where you are assigning skill points to your character. But the thing is, if you do a more complex arch, it will keep taking away those skill points not only in the beginning but over time.

The other point is what if your Company's exit strategy is dependent on going big. Then it is pertinent that you plan for that 0.01% chance because there is no other exit.

”I better buy a smaller clothes now in case I suddenly lose weight next week”

Designing for low latency (even if only for a few clients) can be worth it though. Each action taking milliseconds vs. each action taking sections will lead to vastly different user experiences and will affect how users use the application.

No, I don't think I will let you share my personal data with 200 select partners (:

Focusing on customers and MVPs over complex architecture makes sense, but whether you have 10 or 1 million users, for any real business, you need to be ready to recover from outages.

Building for resilience early sets the foundation for scaling later. That’s why I’m not a fan of relying on "one big server." No matter how powerful, it can still fail.

By focusing on resilience, you're naturally one step closer to scaling across multiple servers. Sure, it’s easy to overcomplicate things, but investing in scalable infrastructure from the start has benefits, even with low traffic—it's all about finding the right balance.

The truth is in the middle. I've now had two startups where I came in to fix things where the system would fall over if there were more than 1 user. Literally; no transactional logic and messy interactions with the database combined with a front end engineers just making a mess of doing a backend.

In one case the database was "mongo realm", which was something our Android guy randomly picked. No transactions, no security, and 100% of the data was synced client side. Also there was no IOS and web UI. Easiest decision ever to scrap that because it was slow, broken, and there wasn't really a lot there to salvage. And I needed those other platforms supported. It's the combination of over and under engineering that is problematic. There were some tears but about six months later we had replaced 100% of the software with something that actually worked.

In both cases, I ended up just junking the backend system and replacing it with something boring but sane. In both cases getting that done was easy and fast. I love simple. I love monoliths. So no Kubernetes or any of that micro services nonsense. Because that's the opposite of simple. Which usually just means more work that doesn't really add any value.

In a small startup you should spend most of your time iterating on the UX and your product. Like really quickly. You shouldn't get too attached to anything you have. The questions that should be in the back of your mind is 1) how much time would it take a competent team to replicate what you have? and 2) would they end up with a better product?

Those questions should lead your decision making. Because if the answers are "not long" and "yes", you should just wipe out the technical debt you have built up and do things properly. Because otherwise somebody else will do it for you if it really is that good of an idea.

I've seen a lot of startups that get hung up on their own tech when it arguably isn't that great. They have the right ideas and vision but can't execute because they are stuck with whatever they have. That's usually when I get involved actually. The key characteristic of great UX is that things are simple. Which usually also means they are simple to realize if you know what you are doing.

Cumulative effort does not automatically add up to value; often it actually becomes the main obstacle to creating value. Often the most valuable outcome of building software is actually just proving the concept works. Use that to get funding, customer revenue, etc. A valid decision is to then do it properly and get a good team together to do it.

> In both cases, I ended up just junking the backend system and replacing it with something boring but sane. In both cases getting that done was easy and fast.

This kind of rewrite is usually quick and easy not because of the boring architecture (which can only carry the project from terrible velocity to decent velocity) but because the privilege of hindsight reduces work: the churn of accumulated changes and complications and requirements of the first implementation can be flattened into one initially well designed solution, with most time-consuming discussions and explorations already done and their outcomes ready to be copied faithfully.

This also applies when calculating losses from paying third party services.

Agreed, a simple index.php with jQuery storing data in an SQLite db file is sufficient for earning millions.

No I m not trolling. This is exactly what Peter Levis do

almost 20 years ago the exact same sentiment was expressed in the ground breaking classic "I'm going to scale my foot up your ass" by Ted Dziuba

http://widgetsandshit.com/teddziuba/2008/04/im-going-to-scal...

I've got 99 problems but scalability ain't one.

I've seen a bunch of things.

Sometimes you have people who try to build a system composed of a bunch of microservices but the team size means that you have more services than people, which is a recipe for failure because you probably also need to work with Kubernetes clusters, manage shared code libraries between some of the services, as well as are suddenly dealing with a hard to debug distributed system (especially if you don't have the needed tracing and APM).

Other times I've seen people develop a monolithic system for something that will need to scale, but develop it in a way where you can only ever have one instance running (some of the system state is stored in the memory) and suddenly when you need to introduce a key value store like Valkey or a message queue like RabbitMQ or scale out horizontally, it's difficult and you instead deal with HTTP thread exhaustion, DB thread pool exhaustion, issues where the occasional DB connection hangs for ~50 seconds and stops everything because a lot of the system is developed for sequential execution instead of eventual consistency.

Yet other times you have people who read about SOLID and DRY and make an enterprise architecture where the project itself doesn't have any tools or codegen to make your experience of writing code easier, but has guidelines and if you need to add a DB table and work with the data, suddenly you need: MyDataDto <--> MyDataResource <--> MyDataDtoMapper <--> MyDataResourceService <--> MyDataService <--> MyDataDao <--> MyDataMapper/Repository with additional logic for auditing, validation, some interfaces in the middle to "make things easier" which break IDE navigation because it goes to where the method is defined instead of the implementation that you care about and handlers for cleaning up related data, which might all be useful in some capacity but makes your velocity plummet. Even more so when the codebase is treated as a "platform" with a lot of bespoke logic due to the "not invented here" syndrome, instead of just using common validation libraries etc.

Other times people use the service layer pattern above liberally and end up with hundreds of DB calls (N+1 problem) instead of just selecting what they need from a DB view, because they want the code to be composable, yet before long you have to figure out how to untangle that structure of nested calls and just throw an in-memory cache in the middle to at least save on the 95% of duplicated calls, so that filling out a table in the UI wouldn't take 30 seconds.

At this point I'm just convinced that I'm cursed to run into all sorts of tricky to work with codebases (including numerous issues with DB drivers, DB pooling libraries causing connections to hang, even OpenJDK updates causing a 10x difference in performance, as well as other just plain weird technical issues), but on the bright side at the end of it all I might have a better idea of what to avoid myself.

Damned if you do, damned if you don't.

The sanest collection of vague architectural advice I've found is the 12 Factor Apps: https://12factor.net/ and maybe choosing the right tools for the job (Valkey, RabbitMQ, instead of just putting everything into your RDBMS, additional negative points for it being Oracle), as well as leaning in the direction of modular monoliths (one codebase initially, feature flags for enabling/disabling your API, scheduled processes, things like sending e-mails etc., which can be deployed as separate containers, or all run in the same one locally for development, or on your dev environments) with as many of the dependencies runnable locally

For the most part, you should optimize for developers, so that they can debug issues easily, change the existing code (loose coupling) while not drowning in a bunch of abstractions, as well as eventually scale, which in practice might mean adding more RAM to your DB server and adding more parallel API containers. KISS and YAGNI for the things that let you pretend that you're like Google. The most you should go in that direction is having your SPA (if you don't use SSR) and API as separate containers, instead of shipping everything together. That way routing traffic to them also becomes easier, since you can just use Caddy/Nginx/Apache/... for that.

> a system composed of a bunch of microservices but the team size means that you have more services than people

The thing I keep trying to get people to recognize in internet discussions of microservices is that they're a solution to the organizational problems of very large companies. The right size is one "service" per team but keeping the team size below the "two pizza limit" (about eight people, including the line manager and anyone else who has to be in all the meetings like scrum masters etc).

If your website needs to scale to hundreds of developers, then you need to split up services in order to get asynchronous deployment so that teams can make progress without deadlocking.

Scaling for a high number of users does not require microservices. It does as you say require multiple instances which is harder to retrofit.

> additional negative points for it being Oracle

Amen.

Attempting to please everyone pleases noone.

All infrastructure please.

I'm currently wrestling a stupid orchestration problem - DNS external from my domain controllers - because the architecture astronaut thought we'd need to innovate on a pillar of the fucking internet

[deleted]

The main thing is choosing the right technology that will allow scaling if it's part of the plan. It's not that you need to build Twitter-level traffic scale from Day 1, but you should of course choose a tech stack and platform that you can scale if you need.

(Twitter famously learned this the hard way)

If I loved Ruby, why would I choose Rails for a high-traffic social media app? If I loved Python, why choose Django for an API-first service that doesn't have an Admin dashboard? Yet I see it all the time - the developer only knows PHP/Laravel so everything gets built in Laravel. Point is... why lock yourself into a monolithic setup when other options are available even in those languages? You can definitely choose the "wrong stuff" and for bad reasons, even if it works today. "Doing things that don't scale" seems frankly stupid whenever scaling is a big part of the plan for the company, especially when there are plenty of options available to build things in a smart, prepared way.

But go ahead, install your fav monolith and deploy to Heroku for now, it will just be more work later when it has to be dismantled in order to scale parts independent from others (an API gateway that routes high traffic, a serverless PDF renderer, job scripts like notifications, horizontal scaling of instances, etc.).

It's smarter though to just choose a more future-proof language and framework setup. The Node setup on my rpi4 has been on GCP, AWS, Heroku, Render in various forms (as a monolith, as microservices, in between) - repo-wise, it's a "mono server" of 15+ APIs, apps, and websites that separate businesses in town rely on, yet I can work in it as one piece, even easily move it around providers if I want, horizontally scale as needed with no code changes, and because of essentially the folder structure of the app (I copied Vercel) and how Node require works, any export can be either imported to another file or deployed as a serverless function itself.

There's nothing about the codebase that forces me to choose "not scale" or "scale". I even have platform-agnostic code (libraries) that can run in either a browser or a server, talk about flexibility! No other language is as fast and scalable while also being this flexible.

[deleted]

Do you mean, don't consider accessibility, security and privacy because statistically speaking your 100 customers won't care about those things?

You must have read a different blog post because I don't see anything arguing agsinst accessibility, security or privacy in this one?

No, your only 100 customers. And here your customers are not huge corporations with thousands of actual users each, but just individual users.

But certainly, you should care about security and privacy even if you have just one customer.