How NASA built Artemis II’s fault-tolerant computer

The quote from the CMU guy about modern Agile and DevOps approaches challenging architectural discipline is a nice way of saying most of us have completely forgotten how to build deterministic systems. Time-triggered Ethernet with strict frame scheduling feels like it's from a parallel universe compared to how we ship software now.

During the time of the first Apollo missions, a dominant portion of computing research was funded by the defense department and related arms of government, making this type of deterministic and WCET (worst case execution time) a dominant computing paradigm. Now that we have a huge free market for things like online shopping and social media, this is a bit of a neglected field and suffers from poor investment and mindshare, but I think it's still a fascinating field with some really interesting algorithms -- check out the work of Frank Mueller or Johann Blieberger.
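To make the WCET idea concrete, here is a minimal sketch of one classic use of WCET figures: the response-time recurrence for fixed-priority preemptive scheduling. It assumes deadlines equal periods, and the task set is invented for illustration; it is not taken from any of the authors mentioned above.

```python
from typing import List, Optional, Tuple

# Each task is (wcet, period), listed from highest to lowest priority.
# Assumptions: fixed-priority preemptive scheduling, deadline == period.
Task = Tuple[int, int]


def response_time(tasks: List[Task], i: int) -> Optional[int]:
    """Worst-case response time of task i, or None if it can miss its deadline."""
    wcet, period = tasks[i]
    r = wcet
    while True:
        # Each higher-priority task j can preempt ceil(r / T_j) times,
        # costing its own WCET every time (integer ceiling division).
        interference = sum(((r + t_j - 1) // t_j) * c_j for c_j, t_j in tasks[:i])
        nxt = wcet + interference
        if nxt > period:
            return None  # deadline can be missed in the worst case
        if nxt == r:
            return r  # fixed point reached
        r = nxt


# Hypothetical task set, time units arbitrary.
tasks = [(1, 4), (2, 6), (3, 12)]
results = [response_time(tasks, i) for i in range(len(tasks))]
print(results)
```

The point is that the answer is a guaranteed bound rather than an average, which is exactly what general-purpose stacks do not try to give you.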

It still lives on as a bit of a hard skill in automotive/robotics. As someone who crosses the divide between enterprise web software, and hacking about with embedded automotive bits, I don't really lament that we're not using WCET and Real Time OSes in web applications!

I suppose the rough edges of RTOSes are mostly due to that mainstream neglect: they are specific tools for seasoned professionals whose own edges have been dented into shapes compatible with existing RTOSes.

ever use wordstar on Z80 system with a 5 MB hard drive?

responsive. everything dealing with user interaction is fast. sure, reading a 1 MB document took time, but 'up 4 lines' was bam!.

linux ought to be this good, but the I/O subsystem slows down responsiveness. it should be possible to copy a file to a USB drive, and not impact good response from typing, but it is not. real time patches used to improve it.

windows has always been terrible.

what is my point? well, i think a web stack ran under an RTOS (and sized appropriately) might be a much more pleasurable experience. Get rid of all those lags, and intermittent hangs and calls for more GB of memory.

QNX is also a good example of an RTOS that can be used as a desktop. Although an example with a lot of political and business problems.

Every single hardware subsystem adds lag. Double buffering adds a frame of lag; some do triple-buffering. USB adds ~8ms worst-case. LCD TVs add their own multi-frame lag-inducing processing, but even the ones that don't still have to load the entire frame before any of it shows, which can be a substantial fraction of the time between frames.

Those old systems were "racing the beam", generating every pixel as it was being displayed. Minimum lag was microseconds. With LCDs you can't get under milliseconds. Luckily human visual perception isn't /that/ great so single-digit milliseconds could be as instantaneous, if you run at 100 Hz without double-buffering (is that even possible anymore!?) and use a low-latency keyboard (IIRC you can schedule more frequent USB frames at higher speeds) and only debounce on key release.
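The stages above add up roughly like this. It is a back-of-the-envelope sketch: apart from the ~8 ms USB figure, the stage breakdown and the numbers are assumptions for illustration, not measurements.

```python
def frame_ms(refresh_hz: float) -> float:
    """Duration of one display frame in milliseconds."""
    return 1000.0 / refresh_hz


def worst_case_latency_ms(
    refresh_hz: float,
    usb_poll_ms: float,
    buffered_frames: int,
    scanout_fraction: float = 1.0,
) -> float:
    """Sum of input polling delay, queued frames, and display scan-out.

    buffered_frames: extra whole frames of delay from double/triple buffering.
    scanout_fraction: share of a frame spent loading pixels before they show.
    """
    frame = frame_ms(refresh_hz)
    return usb_poll_ms + buffered_frames * frame + scanout_fraction * frame


# Assumed "typical" setup: 60 Hz, 8 ms USB polling, one buffered frame.
typical = worst_case_latency_ms(60, 8.0, 1)
# Assumed "lean" setup: 100 Hz, 1 ms polling, no extra buffering.
lean = worst_case_latency_ms(100, 1.0, 0)
print(round(typical, 1), round(lean, 1))
```

None of the individual stages looks bad in isolation; it is the sum that you feel.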

8 kHz polling-rate mouse and keyboard, plus a 240 Hz 4K monitor (preferably OLED to reduce smearing, which otherwise becomes very noticeable), 360 Hz 1440p, or 480 Hz 1080p, is the current state of the art. You need a decent processor and GPU to run all this (especially for the high-refresh-rate monitors, since you're pushing a huge amount of data to your display and only the newest GPUs support the newest DisplayPort standard), but my Windows desktop is a joy to use because of all of this. Everything is super snappy. Alternatively, buying an iPad Pro is another excellent way to get very low latencies out of the box.

I really love this blog post from Dan Luu about latency. https://danluu.com/input-lag/

That's a good one. I probably should have brought up variance though. These cache-less systems had none. Windows might just decide to index a bunch of stuff and trash your cache, and it runs slow for a bit while loading gigabytes of crap back into memory. When I flip my lightswitch, it's always (perceptibly) the same amount of time until the light comes on. Click a button on the screen? Uh...

I believe this is a kind of survivorship bias. It's very rare that RTOSes have to handle allocating GBs of data or creating thousands of processes. I think if current RTOSes ran the same applications, there would be no noticeable difference compared to a mainstream OS (it could be even worse, because the OS is not designed for that kind of use case).

>what is my point? well, i think a web stack ran under an RTOS (and sized appropriately) might be a much more pleasurable experience. Get rid of all those lags, and intermittent hangs and calls for more GB of memory.

... it's not the OS that's the source of the majority of the lag

Click around in this demo https://tracy.nereid.pl/ Note how basically any lag added is just some fancy animations in places and most of everything changes near instantly on user interaction (with biggest "lag" being acting on mouse key release as is tradition, not click, for some stuff like buttons).

This is still just browser, but running code and displaying it directly instead of going thru all the JS and DOM mess

if you ever worked on automotive you know it's bs.

since CAN all reliability and predictive nature was out. we now have redundancy everywhere with everything just rebooting all the time.

install an aftermarket radio and your ecu will probably reboot every time you press play or something. and that's just "normal".

> making this type of deterministic and WCET (worst case execution time) a dominant computing paradigm.

Oh wow, really? I never knew that. huh.

I feel like as I grow older, the more I start to appreciate history. Curse my naive younger self! (Well, to be fair, I don't know if I would've learned history like that in school...)

Contrary to propaganda from the likes of Ludwig von Mises, the free market is not some kind of optimal solution to all of our problems. And it certainly does not produce excellent software.

Mises never claimed that the free market produced the most optimal solutions at a given moment. In fact Mises explicitly stated many times that the free market does indeed undergo semi-frequent self-corrections, speculation, and manipulation by its agents.

Mises's proposition was, in essence, that an autonomous market with enough agents participating in it will reach an optimal Nash equilibrium where both offer and demand are balanced. Only an external disruption (interventionism, new technologies, production methods, influx or efflux of agents in the market) can break the Nash equilibrium momentarily and that leads to either the offer or the demand being favored.

> optimal Nash equilibrium where both offer and demand are balanced

This roughly translates to "optimal utopian society which cannot be criticised in any way" right? Right??

It depends on by what metric you define what is optimal.

For the health system or public transport the nash equilibrium of offer and demand is not what feels optimal to most people.

For manufacturing s.th. like screws, nails or hammers; I really can't see what should be wrong with it.

Or, paper clips…

I don't know if you are being sarcastic. But no, it's not a "utopia" by any means and the free market still has many pitfalls and problems that I described. However, it is the best system we have to coordinate the production, distribution and purchasing of services and goods on a mass scale.

Somehow the last sentence of your comment caught me as if there's something wrong with it. I don't think it's wrong, but I think it should be generalised.

Free market is an approach to negotiation, analogous to ad-hoc model in computer science, as opposed to client-server model - which matches command economy. There are tons of nuances of course regarding incentives, optimal group sizes, resource constraints etc.

Free market is also like evolution - it creates things that work. Not perfect, not best, they just work. Fitting the situation, not much else (there is always a random chance of something else).

Also there's the, often, I suppose, intentional confusion of terms. The free market of the economic theory is not an unregulated market, it's a market of infinitesimal agents with infinitesimal influence of each individual agent upon the whole market, with no out-of-market mechanisms and not even in-market interaction between agents on the same side.

As a side note, I find it sadly amusing that this reasonable discussion is only possible because it's offtopic to the thread's topic. Had the topic been more attractive to more politically and economically agitated folk, the discussion would be more radicalised, I suppose.

Many intellectuals have this problem. They make interesting, precise statements under specific assumptions, but they get interpreted in all kinds of directions.

When they push back against certain narratives and extrapolations they usually don’t succeed, because the same mechanism applies here as well.

The only thing they can do about it, is throwing around ashtrays.

What a great visual. I haven't heard that phrase before.

It's a fun image, but it was not my idea. I was playfully referring to this:

https://en.wikipedia.org/wiki/The_Ashtray_(Or_the_Man_Who_De...

Although in this original case the image (of something that allegedly happened) was used to criticize the philosopher (Kuhn), so it's kind of the other side of the coin of what I said above.

So it will reach equilibrium unless literally anything disrupts that equilibrium. Got it.

The free market tends to equilibrium yes. That indeed is a novel realization.

An "autonomous market with enough agents" is carrying a lot of weight there, like "rational actors" and "as sample size goes to infinity".

It is not carrying a lot of weight. Macroeconomics are different from microeconomics. On a micro scale agents have enough weight on the system where a specific action might break a model. On a macro scale each individual agent's action carries less weight and therefore the system becomes predictable.

On a micro scale it is possible, and sometimes favorable, to intervene. On a macro scale to intervene economically becomes impossible due to the economic calculation problem. It is widely accepted in modern economics that the unit of maximum extent where economical intervention is possible is a business/company/enterprise. Or in sociological terms the maximum unit is the family. Anything broader than that and the compound effect of the economic calculation problem becomes apparent and inefficiencies accumulate. Autonomous decentralized mechanisms (like a free market) are the only solution to it, but not the most optimal.

The problem with this is that "breaking the Nash equilibrium momentarily" is a spherical cow.

"Momentarily" can mean years or even decades, and millions of people can suffer or die as a result.

Markets do not model that especially well. When it comes down to these situations, it's not about the rising price of food motivating producers to enter the market - it's about the people starving. During a war, no amount of money can cause more munitions to appear fast enough. Blast-resistant concrete can take weeks or months to cure, workforces take time to train. These "momentary" disruptions can swamp the whole.

I can't think of a time when I've found an absolutist position useful or intelligent, in any field. Free-market absolutism is as stupid as totalitarianism. The content of economics papers does not need to be evaluated to discard an extreme position, one need merely say "there are more things in earth and heaven than are dreamed of in your philosophies"

Great point, if the only constant is change, then philosophy should follow (or lead).

Propaganda is quite a strong term to describe the works of an economist. If one wants to debate the ideas of von Mises, it'd be useful to consider the Zeitgeist at that time. Von Mises preferred free markets in contrast to the planned economy of the communists. Partly because the latter has difficulties in proper resource allocation and pricing. Note that this was decades before we had working digital computers and digital communication systems, which, at least in theory, change the feasibility of a planned economy.

Also, the last time I checked, the US government produced its goods and services using the free market. The government contractors (private enterprises) are usually tasked with building stuff, compared with the government itself in a non-free, purely planned economy (if you refer to von Mises).

I assume that you originally meant to refer to the idea that without government intervention (funding for deep R&D), the free market itself would probably not have produced things like the internet or the moon landing (or at least not within the observed time span). That is, however, a rather interesting idea.

Government contracts are heavily restricted behind layers of certifications and authorizations.

For example, you can't freely produce missiles and put them on the shelf at Walmart, where "the government" purchases them at shelf price.

What a world that would be. Would change the game of 'deer hunting' for sure.

> The government contractors (private enterprises) are usually tasked with building stuff

Ah yes, the situation where the government makes a plan and then hands it to the one (1) qualified defense contractor whose facilities are built in swing states to benefit specific congressional campaigns is completely different from central planning.

There are some resemblances, which indicate that you might not have a fully functioning free market. But central planning in the context of von Mises refers to something else. It's about the organization of whole national economies, as in planned economies, a thing you find in communist states or Lenin's "war communism".

You should read up on Yanis Varoufakis' history and just how bad his solution for Greece went. That will explain the extreme amounts of anger on both his side, the side of Greeks and the side of the EU and worldwide financial community (and the EU itself used to be an industry cartel, so you can guess how much every government institution in the EU aligns with the worldwide financial community). This guy will never be allowed to do anything remotely serious in economics ever again, and he knows it very well. His DiEM25 project is failing, and he knows that too. He feels the ECB, specifically Mario Draghi, Jeroen Dijsselbloem and Christine Lagarde are responsible for this downfall and talks about them in a way that makes you say "he can't be allowed near them. Seriously. Call the police". But in the constant tragedy of his life: He's probably right they caused his downfall.

He caused a MAJOR issue for Greece that still affects everyone in his country today, after reassuring people for 2+ years it was never going to happen: https://en.wikipedia.org/wiki/Greek_government-debt_crisis

(He'd kill me for saying this but he was lying back then too. He was trying to pull a Thatcher (I could compare him to someone else that did the same a long time ago but ... let's just say if you know you know). He was trying to double Greece's public debt by lying to everyone about what he was doing. He failed, and then started threatening, and when his threats didn't work, he got fired by Greece's prime minister, his oldest friend. It ended the friendship. He lost. And he's not a good enough sport to accept that he lost, frankly he got caught and couldn't talk his way out of it. This, despite the fact that he was finance minister, and so will be paid, very well I might add, for the rest of his life despite what he did, and despite the fact that every Greek today is still paying the price for what he did)

Oh and he's pro-Russia. All Russia wants in Ukraine, according to Yanis, is to help the European poor. In more detail, he is of the opinion that the current course of action of the EU will lead to a war with Russia, in which a lot of European poor will be forced to fight in an actual war, facing bullets and bombs in trenches. This could be avoided by giving Ukraine and the Baltics to Russia. In the repeating tragedy of Yanis Varoufakis' life, I have to say, yet again: he may be right (I just strongly disagree that offering Ukraine and the Baltics up to Russia is an acceptable solution to this problem, and in any case, this is neither his, nor my choice to make)

He does not live in Greece, his own country, he lives in the UK, making the case for Russia.

https://www.yanisvaroufakis.eu/category/ukraine/

And I get it, his life has become this recurring tragedy. His father was a victim of a rightist dictatorship in Greece, and he was imprisoned and tortured for that, as well as losing his job, living in poverty for a very long time (yes, Greece was an extreme right dictatorship not that long ago, really, go look it up). Yanis Varoufakis himself became the victim of a cabal of laissez-faire very, very rich people who destroyed his career right at the peak of everything he achieved. He has been the victim of one or another form of extreme-right policy (in the sense of laissez-faire parties that capture governments) since he was 4 years old, right up to today. Over 60 years his life was sabotaged in 1000 different ways, some very direct. And, sadly, I agree with his "extreme-right" enemies: he can never be allowed near any position of power ever again because of this, which isn't even his fault. (extreme-right according to him, I would refer to his enemies as "the status quo", and point out it's working pretty well for everyone)

> He caused a MAJOR issue for Greece that still affects everyone in his country today, after reassuring people for 2+ years it was never going to happen:

care to explain what exactly he caused and how that still affects everyone in his country? in particular how he managed to jump several years backward in the timeline?

All I can say is "keep reading". Because it takes a BIG turn for the worse at one point, and that's where he's involved.

> He caused a MAJOR issue for Greece

That link goes to the Greek financial crisis which, according to the Wikipedia page, started in 2009. Varoufakis was elected minister of finance in early 2015 and resigned only half a year later. From the outside, it seems impossible that his half-year ministerial tenure could have caused a crisis half a decade earlier. At the time, Greece had already defaulted twice on its loans and was about to do it a third time.

Economics is propaganda. It’s not an empirical science, and its claims are mostly used to promote ideologies consistent with government policy or the ideology of powerful individuals with the surplus wealth available to pay someone to build a quantitative defense of said ideology. What else would you call it?

It's a social science? Economics is much broader and much less unified than you purport it to be. The (social) science of (in this case) Macroeconomics is just that, an observational science, a bunch of theories and observations (controlled experiments are not really feasible). The propaganda is caused by politicians, administrators, and policymakers, not really the scientists. There I agree with you, central bankers are a prime example of such propaganda. Ever wondered why almost everywhere the inflation target is 2%? Not 1%, not 3%, but exactly 2%? There is no real scientific reason behind it, that is just policy, or propaganda if you want to name it like that.

Social scientists carry out experiments / causal analysis on granular data. Macroeconomics (not micro, I should clarify) meets the definition of propaganda because its theories do not have solid backing from experimentation or data. It is primarily used by the state to manufacture consent for economic policies that implement incentive structures that benefit the most wealthy people in society. It’s not that complicated.

Are _you_ making software for the government?

Time triggered Ethernet is part of aircraft certified data bus and has a deep, decades long history. I believe INRIA did work on this, feeding Airbus maybe. It makes perfect sense when you can design for it. An aircraft is a bounded problem space of inputs and outputs which can have deterministic required minima and then you can build for it, and hopefully even have headroom for extras.
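As a rough illustration of "design for it", here is a minimal sketch of an offline, time-triggered schedule check. The frame names, durations, and cycle length are hypothetical, and this is not the actual Time-Triggered Ethernet scheduling procedure; it only shows the idea of fixing every transmission slot ahead of time and measuring the headroom left over.

```python
from typing import Dict, List, Tuple


def build_schedule(
    frames: List[Tuple[str, int]], cycle_us: int
) -> Tuple[Dict[str, int], int]:
    """Give each (name, duration_us) frame a fixed start offset in the cycle.

    Returns (offsets, headroom_us). Raises ValueError if the frames do not
    fit, which is something you find out at design time, not in flight.
    """
    offsets: Dict[str, int] = {}
    cursor = 0
    for name, duration_us in frames:
        offsets[name] = cursor  # same offset in every cycle: no jitter
        cursor += duration_us
    if cursor > cycle_us:
        raise ValueError(f"frames need {cursor} us but cycle is {cycle_us} us")
    return offsets, cycle_us - cursor


# Hypothetical traffic for a 1 ms cycle.
frames = [("flight_control", 200), ("navigation", 150), ("telemetry", 400)]
offsets, headroom = build_schedule(frames, cycle_us=1000)
print(offsets, headroom)
```

Because the set of inputs and outputs is bounded, the whole table can be computed and verified before anything flies, and the leftover time is the headroom for extras.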

Ethernet is such a misnomer for something which now is innately about a switching core ASIC or special purpose hardware, and direct (optical even) connects to a device.

I'm sure there are also buses, dual redundant, master/slave failover, you name it. And given it's air or space probably a clockwork backup with a squirrel.

A real squirrel would need acorns, I would assume it's a clockwork squirrel too.

Aircraft also have software and components that together form a declared "working" ecosystem kept in lockstep: a baseline. This is why there are paper "additions" on bug discovery until the bug is patched and the whole ecosystem of devices is lifted to the next "baseline".

Some of us still work on embedded systems with real-time guarantees.

Believe it or not, at least some of those modern practices (unit testing, CI, etc) do make a big (positive) difference there.

The depressing part is that these "modern practices" were essentially invented in the 1960s by defense and aerospace projects like the NTDS, LLRV/LLTV, and Digital Fly-by-Wire to produce safety-critical software, and the rest of the software industry simply ignored them until the last couple of decades.

Tesla’s Cybertruck uses that in its ethernet as well!

All the ADAS automotive systems use this, there are several startups in this space as well, such as Ethernovia.

All thanks to twisted pair https://de.wikipedia.org/wiki/BroadR-Reach

I think he refers to SpaceWire https://en.wikipedia.org/wiki/SpaceWire.

You could even say that part of the value of Artemis is that we're remembering how to do some very hard things, including the software side. This is something that you can't fake. In a world where one of the more plausible threats of AI is the atrophy of real human skills -- the goose that lays the golden eggs that trains the models -- this is a software feat where I'd claim you couldn't rely on vibe code, at least not fully.

That alone is worth my tax dollars.

Don’t count your chickens before they hatch.

I'm not sure you really understood my comment. A large portion of the kind of value I'm talking about comes from attempting the hard thing. If these chickens do not hatch that will be tragic, but we will still have learned something from it. In some ways, we will have learned even more, by getting taught about what we don't know.

Anyway, let's all hope for a safe landing tonight.

Agile is not meant to make solid, robust products. It’s so you can make product fragments/iterations quickly, with okay quality and out to the customer asap to maximize profits.

“Agile” doesn’t mean that you release the first iteration, it’s just a methodology that emphasizes short iteration loops. You can definitely develop reliable real-time systems with Agile.

I would differentiate between iterative development and incremental development.

Incremental development is like painting a picture line by line like a printer where you add new pieces to the final result without affecting old pieces.

Iterative is where you do the big brush strokes first and then add more and more detail depending on what you learn from each previous brush stroke. You can also stop at any time when you think that the final result is good enough.

If you are making a new type of system and don’t know what issues will come up and what customers will value (highly complex environment) iterative is the thing to do.

But if you have a very predictable environment and you are implementing a standard or a very well specified system (can be highly complicated yet not very complex), you might as well do incremental development.

Roughly speaking, though: there is of course no perfect specification short of the final implementation, so there are always learnings, and so there are always some iterative parts to it.

> “Agile” doesn’t mean that you release the first iteration

Someone needs to inform the management of the last three companies I worked for about this.

Management understand it less than anyone else does.

A physicist who worked on radiation-tolerant electronics here. Apart from the short iteration loops, agile also means that the SW/HW requirements are not fully defined during the first iterations, because they may also evolve over time. But this cannot be applied to projects where radiation/fault tolerance is the top priority. Most of the time, the requirements are 100% defined ahead of time, leading to a waterfall-like process or a mixed one, where the development is still agile but the requirements are never discussed again, except in negligible terms.

I think people mean so many different things when talking about agile. I'm pretty sure a small team of experts is a good fit for critical systems.

A fixed amount of meetings every day/week/month to appease management and rushing to pile features into buggy software will do more harm than good.

SCRUM methodology absolutely prioritizes a "Potentially Shippable Product Increment" as the output of every sprint.

It does but this is the idea that I think one has to bend or ignore the most since people always bend or ignore bits of agile.

i.e. being able to print "Hello World" and not crash might make something shippable but you wouldn't actually do it.

I think the right amount of "bend" of the concept is to try to keep the product in a testable state as much as possible and even if you're not doing TDD it's good to have some tests before the very end of a big feature. It's also productive to have reviews before completing. So there's value in checking something in even before a user can see any change.

If you don't do this then you end up with huge stories because you're trying to make a user-visible change in every sprint and that can be impossible to do.

You can absolutely build robust products using agile. Apart from some of the human benefits of any kind of incremental/iterative development, the big win with Agile is a realistic way to elicit requirements from normal people.

The generous way of seeing it is that you don't know what the customer wants, and the customer doesn't know all that well what they want either, and certainly not how to express it to you. So you try something, and improve it from there.

But for aerospace, the customer probably knows pretty well what they want.

You hopefully know that's not true. But it's a matter of quality goals. Need absolute robustness? Prioritize it and build it. Need speed and to be first to market? Prioritize and build it. You can do both in an agile way. Many would argue that you won't be as fast in a non-agile way. There is no bullet point in the Agile Manifesto saying to build unreliable software.

Yeah, I know it’s not true in the sense that that’s not what it’s meant to do, but I’m saying practically that’s what usually ends up happening.

The manifesto refers to “working software”. It does not say anything about “okay quality”.

... and it mechanically promotes planned obsolescence by its nature (likely to be of disastrous quality). The perfect mur... errr... the perfect fraud.

> “Modern Agile and DevOps approaches prioritize iteration, which can challenge architectural discipline,” Riley explained. “As a result, technical debt accumulates, and maintainability and system resiliency suffer.”

Not sure i agree with the premise that "doing agile" implies decision making at odds with architecture: you can still iterate on architecture. Terraform etc make that very easy. Sure, tech debt accumulates naturally as a byproduct, but every team i've been on regularly does dedicated tech debt sprints.

I don't think the average CRUD API or app needs "perfect determinism", as long as modifications are idempotent.
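For what "idempotent" buys you here, a small sketch (the in-memory store, the request IDs, and the function name are all made up for illustration): replaying the same request after a retry leaves the state unchanged, so the exact timing and number of deliveries matter much less.

```python
from typing import Dict, Set

# Hypothetical in-memory state standing in for a database.
balances: Dict[str, int] = {"alice": 100}
seen_requests: Set[str] = set()


def credit(account: str, amount: int, request_id: str) -> int:
    """Apply a credit at most once per request_id (an idempotency key)."""
    if request_id not in seen_requests:
        seen_requests.add(request_id)
        balances[account] = balances.get(account, 0) + amount
    return balances[account]


first = credit("alice", 25, "req-1")
retry = credit("alice", 25, "req-1")  # duplicate delivery: no second credit
print(first, retry)
```

This tolerates nondeterministic delivery rather than removing it, which is a reasonable trade for a CRUD app and not for a flight computer.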

In theory, yes you could iterate on architecture and potentially even come up with better one with agile approach.

In practice, so many aspects follow from it that it’s not practical to iterate with today’s tools.

Agile is like communism. Whenever something bad happens to people who practice agile, the explanation is that they did agile wrong; had they been doing the true agile, the problem would've been totally avoided.

In reality, agile doesn't mean anything. Anyone can claim to do agile. Anyone can be blamed for only pretending to do agile. There's no yardstick.

But it's also easy to understand what the author was trying to say, if we don't try to defend or blame a particular fashionable ideology. I've worked on projects that required high quality of code and product reliability and those that had no such requirement. There is, indeed, a very big difference in approach to the development process. Things that are often associated with agile and DevOps are bad for developing high-quality reliable programs. Here's why:

The development process before DevOps looked like this:

    1. Planning
    2. Programming
    3. QA
    4. If QA found problems, goto 2
    5. Release

The "smart" idea behind DevOps, or, as it used to be called at the time "shift left" was to start QA before the whole of programming was done, in parallel with the development process, so that the testers wouldn't be idling for a year waiting for the developers to deliver the product to testers and the developers would have faster feedback to the changes they make. Iterating on this idea was the concept of "continuous delivery" (and that's where DevOps came into play: they are the ones, fundamentally, responsible to make this happen). Continuous delivery observed that since developers are getting feedback sooner in the development process, the release, too, may be "shifted left", thus starting the marketing and sales earlier.

Back in those days, however, it was common to expect that testers will be conducting a kind of a double-blindfolded experiment. I.e. testers weren't supposed to know the ins and outs of the code intentionally, s.t. they don't, inadvertently, side with the developers on whatever issues they discover. Something that today, perhaps, would've been called "black-box testing". This became impossible with CD because testers would be incrementally exposed to the decisions governing the internal workings of the product.

Another aspect of the more rigorous testing is the "mileage". Critical systems, normally, aren't released w/o being run intensively for a very long time, typically orders of magnitude longer than the single QA cycle (let's say, the QA gets a day of computer time to run their tests, then the mileage needs to be a month or so). This is a very inconvenient time for development, as feature freeze and code freeze are still in effect, so the coding can only happen in the next version of the product (provided it's even planned). But, the incremental approach used by CD managed to sell a lie that says that "we've run the program for a substantial amount of time during all the increments we've made so far, therefore we don't need to collect more mileage". This, of course, overlooks the fact that changes in the program don't contribute proportionally to the program's quality or performance.

In other words, what I'm trying to say is that agile or DevOps practices made the development process cheaper by making it faster while still maintaining some degree of quality control; however, they are inadequate for products with high quality requirements because they don't address the worst-case scenarios.

As a '70s child who was there when the whole agile thing took over, and systems engineers got rebranded as DevOps, I fully agree with them.

Add TDD, XP and mob programming as well.

While in some ways better than pure waterfall, most companies never adopted them fully, while in some scenarios they are more fit to a Silicon Valley TV show than anything else.

It's not like the approach they took is any different. Just slapped 8x the number of computers on it for calculating the same thing and waited to see if they disagree. Not the pinnacle of engineering. The equivalent of throwing money at the problem.

>Just slapped 8x the number of computers on it

‘Just’ is not an appropriate word in this context. Much of the article is about the difficulty of synchronization, recovery from faults, and about the redundant backup and recovery systems.

What happens when they don't?

If you have a point to make, make it.

What my question is hinting at is that there's actually some really interesting engineering around resolving what happens when the systems disagree. Things like Paxos and Raft help make this much more tractable for mere mortals (like myself); the logic and reasoning behind them are cool and interesting.

Though here the consensus algorithm seems totally different from Paxos/Raft. Rather it's a binary tree, where every non-leaf node compares the (non-silent) inputs from the leaf, and if they're different, it falls silent, else propagates the (identical) results up. Or something something.
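A minimal sketch of the comparison tree as this comment describes it. This is the commenter's reading, not a confirmed description of the flight computer, and the nested-tuple representation and the `resolve` name are made up for illustration:

```python
from typing import Optional, Union

# Hypothetical representation: a leaf is a computed value (or None if that
# computer is silent); an internal node is a tuple of child nodes.
Node = Union[int, None, tuple]


def resolve(node: Node) -> Optional[int]:
    """Propagate a value up the tree; a node falls silent (None) on disagreement."""
    if not isinstance(node, tuple):
        return node  # leaf: a value, or None if silent
    # Ignore children that are already silent.
    results = [r for r in (resolve(child) for child in node) if r is not None]
    if not results:
        return None  # nothing left to propagate
    if any(r != results[0] for r in results):
        return None  # children disagree: fall silent
    return results[0]


print(resolve(((7, 7), (7, 7))))  # everyone agrees
print(resolve(((7, 9), (7, 7))))  # left pair disagrees and goes silent
print(resolve(((7, 7), (9, 9))))  # the two pairs disagree with each other
```

Unlike Paxos or Raft, nothing here is negotiated: a disagreeing branch simply drops out.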

Wasn't that way better? There's no need to drop bait. Thanks.

If you look at code as art, where its value is a measure of the effort it takes to make, sure.

Or if you're building something important, like a spaceship.

In that case, our test infrastructure belongs in the Louvre…

If your implication is that stencil art does not take effort then perhaps you may not fully appreciate Banksy. Works like Gaza Kitty or Flower Thrower don’t just appear haphazardly without effort.

I take the opposite message from that line - out of touch teams working on something so over budget and so overdue, and so bureaucratic, and with such an insanely poor history of success, and they talk as if they have cured cancer.

This is the equivalent of Altavista touting how amazing their custom server racks are when Google just starts up on a rack of naked motherboards and eats their lunch and then the world.

Let's at least wait till the capsule comes back safely before touting how much better they are than "DevOps" teams running websites, apparently a comparison that's somehow relevant here to stoke egos.

You mean like this?

"With limited funds, Google founders Larry Page and Sergey Brin initially deployed this system of inexpensive, interconnected PCs to process many thousands of search requests per second from Google users. This hardware system reflected the Google search algorithm itself, which is based on tolerating multiple computer failures and optimizing around them. This production server was one of about thirty such racks in the first Google data center. Even though many of the installed PCs never worked and were difficult to repair, these racks provided Google with its first large-scale computing system and allowed the company to grow quickly and at minimal cost."

https://blog.codinghorror.com/building-a-computer-the-google...

The biggest innovation from Google regarding hardware was understanding that the dropping memory prices had made it feasible to serve most data directly from memory. Even as memory was more expensive, you could serve requests faster, meaning less server capacity, meaning reduced cost. In addition to serving requests faster.

The problem they solved isn't easy. But it's not some insane technical breakthrough either. Literally add redundancy, that's the ask. They didn't invent quantum computing to solve the issue, did they? Why dunk on sprints?

Wow. What a hand wave away of the intrinsic challenge of writing fault tolerant distributed systems. It only seems easy because of decades of research and tools built since Google did it, but by no means was it something you could trivially add to a project as you can today.

> fault tolerant distributed systems

I mean there were mainframes which could be described as that. IBM just fixed it in hardware instead of software, so it's not like it was an unknown field.

Even if that were actually true (it’s not in important ways), Google showed you could do this cheaply in software instead of expensively in hardware.

You’re still hand waving away things like inventing a way to make map/reduce fault tolerant and automatic partitioning of data and automatic scheduling which didn’t exist before and made map/reduce accessible - mainframes weren’t doing this.

They pioneered how you durably store data on a bunch of commodity hardware through GFS - others were not doing this. And they showed how to do distributed systems at a scale not seen before because the field had bottlenecked on however big you could make a mainframe.

Google then had complete regret not doing this with ECC RAM: https://news.ycombinator.com/item?id=14206811

It got them to where they need to be to then worry about ECC. This is like the dudes who deploy their blog on kubernetes just in case it hits front page of new york times or something.

> then had complete regret not doing this with ECC RAM

Yeah, my takeaway is Google made the right choice going with non-ECC RAM so they could scale quickly and validate product-market fit. (This also works from a perspective of social organisation. You want your ECC RAM going where it's most needed. Not every college dropout's Hail Mary.)

A great version of this and how ex-DEC engineers saved Google and their choice of ECC RAM - inventing MapReduce and BigTable https://www.youtube.com/watch?v=IK0I4f8Rbis

No, space is just hard.

Everything is bespoke.

You need 10x cost to get every extra '9' in reliability and manned flight needs a lot of nines.

People died on the Apollo missions.

It just costs that much.
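As a rough illustration of that rule of thumb (the 10x-per-nine multiplier is this comment's estimate, not a measured figure, and the function names are made up):

```python
import math


def nines(reliability: float) -> float:
    """Number of 'nines' in a reliability figure, e.g. 0.999 -> about 3."""
    return -math.log10(1.0 - reliability)


def relative_cost(reliability: float, factor_per_nine: float = 10.0) -> float:
    """Cost relative to a zero-nines baseline, assuming a fixed factor per nine."""
    return factor_per_nine ** nines(reliability)


for r in (0.9, 0.99, 0.999, 0.9999):
    print(f"{r}: ~{nines(r):.0f} nines, ~{relative_cost(r):,.0f}x baseline cost")
```

Under that assumption the cost grows exponentially in the number of nines, which is why each further step is so painful.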

Please, this is hacker news. Nothing else is hard outside of our generic software jobs, and we could totally solve any other industry in an afternoon.

I mean I can just replace Dropbox with a shell script.

That's funny because you could! Dropbox started as a shell script :)

Funny though I would assume HN people would respect how hard real-time stuff and 'hardened' stuff is.

I think GP is referencing this somewhat [in]famous post/comment: https://news.ycombinator.com/item?id=8863#9224

HN's audience has shifted; there are fewer technically minded people and more hustlers and farmers from other social media waste spaces. But alas.

"No wireless. Less space than a Nomad. Lame."

No, wait, that was that other site.

Yep, spend 100 billion on what should have cost 1/50th of that, and send people up to the moon with rockets that we are still keeping our fingers crossed won't kill them tomorrow, and we have to congratulate them for dunking on some irrelevant career?

One simply does not [“provision” more hardware|(reboot systems)|(redeploy software)] in space.

Modern software development is a fucking joke. I’m sorry if that offends you. Somehow despite Moore’s law, the industry has figured out how to actually regress on quality.

Lately it strikes me there's a big gap between the value promised and the value actually delivered, compared to a simple home-grown solution (with a generic tool like a text editor or a spreadsheet, for example). If they'd just show how to fish, we wouldn't be buying; the magic would be gone.

In this sense all of the West is full of shit, and it's a requirement. The intent is not to help and make life better for everyone, cooperate, it is to deceive and impoverish those that need our help. Because we pity ourselves, and feed the coward within, that one that never took his first option and chose to do what was asked of him instead.

This is what our society deviates us from, in its wish to be the GOAT, and control. It results in the production of lives full of fake achievements, the constant highs which i see muslims actively opt out of. So they must be doing something right.

We have a lot more software developers than 50 years ago and intelligence is still normally distributed.

What’s your point?

The average coder in the 1970s was a lot smarter than today. Think about the people who would be interested to start a career in this field at that time.

Oh I see what you mean. I agree 100%

And overall performance in terms of visible UX.

What would you suggest? Vibe coding a react app that runs on a Mac mini to control trajectory? What happens when that Mac mini gets hit with an SEU or even a SEGR? Guess everyone just dies?

No, of course not! It would be far better to have an openClaw instance running on a Mac Mini. We would only need to vibe code a 15s cron job for assistant prompting...

USER: You are a HELPFUL ASSISTANT. You are a brilliant robot. You are a lunar orbiter flight computer. Your job is to calculate burn times and attitudes for a critical mission to orbit the moon. You never make a mistake. You are an EXPERT at calculating orbital trajectories and have a Jack Parsons level knowledge of rocket fuel and engines. You are a staff level engineer at SpaceX. You are incredible and brilliant and have a Stanley Kubrick level attention to detail. You will be fired if you make a mistake. Many people will DIE if you make any mistakes.

USER: Your job is to calculate the throttle for each of the 24 orientation thrusters of the spacecraft. The thrusters burn a hypergolic monopropellant and can provide up to 0.44kN of thrust with a 2.2 kN/s slew rate and an 8ms minimum burn time. Format your answer as JSON, like so:

    ```json
    {
      "x1": 0.18423,
      "x2": 0.43251,
      "x3": 0.00131,
      ...
    }
    ```

one value for each of the 24 independent monopropellant attitude thrusters on the spacecraft, x1, x2, x3, x4, y1, y2, y3, y4, z1, z2, z3, z4, u1, u2, u3, u4, v1, v2, v3, v4, w1, w2, w3, w4. You may reference the collection of markdown files stored in `/home/user/geoff/stuff/SPACECRAFT_GEOMETRY` to inform your analysis.

USER: Please provide the next 15 seconds of spacecraft thruster data to the USER. A puppy will be killed if you make a mistake so make sure the attitude is really good. ONLY respond in JSON.

All I'm suggesting is to be humble about your mediocre solutions. This is not the only solution and not necessarily that ingenious. Why do you need to bring up vibecoding here? Because people who criticize arrogant nasal engineers are also AI idiots by default?

Can't tell if "arrogant nasal engineers" is a typo or a hilarious attempt at an insult.

Nasal demons is a common reference to C and C++ Undefined Behaviour.

When an AI codes for you, you get Undefined Behaviour in every language.

Wild shit to be advising other people to be humble whilst talking directly out of your ass about technology you clearly do not understand and engineers you have no respect for.

Perhaps self-reflect.

How do you know that op doesn't know what he is talking about?

I have written code for real-time distributed systems in industrial applications. It has been running 24/7 for years and there has never been a failure in production.

I also think nasa is full of shit.

Well for one, if you follow their profile and a few more clicks, you get to their resume, and while it's an impressive one and I'm sure they know a lot of shit I don't, what's notably missing is anything even remotely close to Aerospace, rocketry, guidance systems, positioning, etc.

For another, if an engineer has an axe to grind with a public facing project, I would expect them to just grind the thing, not echo a bunch of the same lame and stale talking points every layperson does (bureaucracy bad, government bad, old tech, etc.). I'm not saying NASA in general and Artemis in particular are flawless, I'm just saying if you're going to criticize it, let's hear it. Otherwise you just sound like another contrarian trying to get attention, like a 14 year old boy saying Hitler had some good points.

> ...they talk as if they have cured cancer.

I'd chalk that up to the author of the article writing for a relatively nontechnical audience and asking for quotes at that level.

So the quote is right somewhat, right? If you are writing to non technical people and you use such high wording.

No, it's not right. When put in context, the quote claims that that manner of speaking is used because the speaker has an unwarranted belief that they've done something absolutely incredible and unprecedented. In actuality, the manner of speaking is being used because the intended audience of the article is likely to have little-to-no knowledge of the technical details of what the speaker is talking about.

For example, if the article was aimed at folks who were familiar with the underlying techniques, the last two paragraphs of the "Enforcing Determinism" section would be compressed into [0]

  Each FCM is time-synced and runs a realtime OS. Failures to meet processing deadlines (or excessive clock drift) reset the FCM. Each FCM uses triply-redundant RAM and NICs. *All* components use ECC RAM. Any failures of these components reset the FCM or other affected component.

But you can't assume that a fairly nontechnical audience will understand all that, so your explanation grows long because of all of the basic information it contains. People looking for an excuse to sneer at something will often misinterpret this as the speaker failing to recognize that the basic information they're providing is about things that are basic.

[0] I'm assuming that the time being wildly out of sync will indicate FCM failure and trigger a reset. [1] I'm also assuming that a sufficiently-large failure of a network switch results in the reset of that network switch. If the article was intended for a more technical audience, that level of detail might have been included, but it wasn't, so it isn't.

[1] If it didn't, why even bother syncing the time? I find it a little hard to believe that the FCMs care about anything other than elapsed time, so all you care about is if they're all ticking at the same rate. I expect the way you detect this is by checking for time sync across the FCMs, correcting minor drift, and resetting FCMs with major drift.
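A minimal sketch of the drift check assumed in footnote [1]: compare each FCM clock against a reference, correct minor drift, reset on major drift. The thresholds, the use of the median as the reference, and the `check_clocks` name are all assumptions for illustration; the article gives no such details.

```python
from statistics import median

# Hypothetical thresholds in microseconds; real values are not in the article.
MINOR_DRIFT_US = 1.0
MAJOR_DRIFT_US = 50.0


def check_clocks(clocks_us: dict[str, float]) -> dict[str, str]:
    """Classify each FCM clock against the median of all clocks."""
    reference = median(clocks_us.values())
    actions = {}
    for name, t in clocks_us.items():
        drift = abs(t - reference)
        if drift > MAJOR_DRIFT_US:
            actions[name] = "reset"    # wildly out of sync: treat as a failed FCM
        elif drift > MINOR_DRIFT_US:
            actions[name] = "correct"  # small drift: nudge the clock back
        else:
            actions[name] = "ok"
    return actions


print(check_clocks({"FCM1": 1000.0, "FCM2": 1001.0, "FCM3": 999.0, "FCM4": 1200.0}))
```

Using the median means a single runaway clock cannot drag the reference along with it.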

>Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a “fail-silent” design. The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.

>“A faulty computer will fail silent, rather than transmit the ‘wrong answer,’” Uitenbroek explained. This approach simplifies the complex task of the triplex “voting” mechanism that compares results.

>Instead of comparing three answers to find a majority, the system uses a priority-ordered source selection algorithm among healthy channels that haven’t failed-silent. It picks the output from the first available FCM in the priority list; if that module has gone silent due to a fault, it moves to the second, third, or fourth.

One part that seems omitted in the explanation is what happens if both CPUs in a pair, for whatever reason, perform an erroneous calculation and they both match: how will that source be silenced without comparing its results with other sources?

These CPUs are typically implemented as lockstep pairs on the same die. In a lockstep architecture, both CPUs execute the same operations simultaneously and their outputs are continuously compared. As a result, the failure rate associated with an undetected erroneous calculation is significantly lower than the FIT rate of an individual CPU.

Put another way, the FIT (Failure in Time) value for the condition in which both CPUs in a lockstep pair perform the same erroneous calculation and still produce matching results is extremely small. That is why we selected and accepted this lockstep CPU design.
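To make the scheme under discussion concrete, here is a toy sketch of self-checking pairs feeding a priority-ordered selector. The function names and the integer "computation" are invented for illustration; real lockstep comparison happens in hardware, not in code like this.

```python
from typing import Callable, Optional, Sequence

Cpu = Callable[[int], int]


def self_checking_pair(cpu_a: Cpu, cpu_b: Cpu, x: int) -> Optional[int]:
    """Run both CPUs on the same input; go silent (None) if they disagree."""
    a, b = cpu_a(x), cpu_b(x)
    return a if a == b else None


def select_output(channels: Sequence[Optional[int]]) -> Optional[int]:
    """Pick the first non-silent channel in priority order."""
    for out in channels:
        if out is not None:
            return out
    return None


def healthy(x: int) -> int:
    return x * 2


def upset(x: int) -> int:
    return x * 2 + 1  # stand-in for a radiation-induced wrong answer


# One CPU in the top-priority pair is upset: the pair goes silent.
channels = [self_checking_pair(healthy, upset, 21),
            self_checking_pair(healthy, healthy, 21)]
print(select_output(channels))

# The case asked about above: both CPUs in the top pair are wrong identically.
common_mode = [self_checking_pair(upset, upset, 21),
               self_checking_pair(healthy, healthy, 21)]
print(select_output(common_mode))
```

The second case shows the gap raised in the question: a matching wrong answer passes the pair check, so the design relies on that event being extremely improbable.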

the probability of simultaneous cosmic ray bit-flip in 2 CPUs, in the same bit, is ridiculously low, there might be more probability of them getting hit by a stray asteroid, propelled by a solar flare.

but still, murphy's law applies really well in space, so who knows.

I initially found this odd too. However, I think the catastrophic failure probability is the same as the prior system, and presumably this new design offers improvements elsewhere.

Under the 3-voting scheme, if 2 machines have the same identical failure -- catastrophe. Under the 4 distinct systems sampled from a priority queue, if the 2 machines in the sampled system have the same identical failure -- catastrophe. In either case the odds are roughly P(bit-flip) * P(exact same bit-flip).

The article only hints at the improvements of such a system with the phrasing "simplifies the complex task", and I'm guessing this may reduce synchronization overhead or improve parallelizability. But this is a pretty big guess, to be fair.

For errors due to radiation the probability is extremely low, since it would need to flip the same bit at the same time in two different places.

Then why 8 instead of 3?

They know their developers and engineers suck almost as hard as their management decisions so they added some more redundancy.

I wondered about this as well.

OTOH, consider that the "pick the majority from 3 CPUs" approach that seems to have been used in earlier missions (as mentioned in the article) would fail the same way if two CPUs compute the same erroneous result.

Indeed. It seems like system 1 and 2 could fail identically, 3, 4, 5, 6, 7, 8 are all correct, and as described the wrong answer from 1 and 2 would be chosen (with a "25% majority"??).

In the Shuttle they would use command averaging. All four computers would get access to an actuator which would tie into a manifold which delivered power to the flight control surface. If one disagreed then you'd get 25% less command authority to that element.
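A toy model of the averaging described here, assuming each of the four computers contributes an equal share of actuator authority (the function name and the normalized command values are made up):

```python
def surface_command(channel_commands: list[float]) -> float:
    """Each channel contributes an equal share of the total actuator authority."""
    return sum(channel_commands) / len(channel_commands)


nominal = surface_command([1.0, 1.0, 1.0, 1.0])
one_bad = surface_command([1.0, 1.0, 1.0, 0.0])  # one computer commands nothing
print(nominal, one_bad)
```

With one channel dissenting, the surface still moves in the right direction, just with three quarters of the commanded authority.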

> In the Shuttle they would use command averaging

I think the Shuttle, operating only in LEO, had more margin for error. Averaging a deep-space burn calculation is basically the same as killing the crew.

Sure, but these maneuvers aren't done realtime and aren't as time-sensitive; a burn is calculated and triple checked well in advance. If there was an error, there's always time to correct it.

In the case of moon landings, the only truly time-critical maneuvers are the ones right before landing... and unfortunately, a lot of fairly recent moon probes have failed due to incorrect calculations, sensor measurements, logic errors, etc.

The GNC loop runs several times per second. The desired output will consequently be increased by the working computers to achieve the target. The computer does not "dead reckon" anything.

Travelling through Max-Q in Earth atmosphere on ascent is far more dangerous.

> Travelling through Max-Q in Earth atmosphere on ascent is far more dangerous

Fair enough. I don't know enough about Orion's architecture to guess at propellant reserves, and how life-or-death each burn actually is.

When I was first starting out as a professional developer 25 years ago doing web development, I had a friend who had retired from NASA and had worked on Apollo.

I asked him “how did you deal with bugs”? He chuckled and said “we didn’t have them”.

The average modern AI-prompting, React-using web developer could not fathom making software that killed people if it failed. We’ve normalized things not working well.

there's a different level of 'good-enough' in each industry and that's normal. When your highest damage of a bad site is reduced revenue (or even just missed free user), you have lower motivation to do it right compared to a living human coming back in one piece.

Yes, of course, but a culture of “good enough” can go too far. One may work in a lower-risk context, but we can still learn a lot from robust architectural thinking. Edge cases, security, and more.

Low quality for a shopping cart feels fine until someone steals all the credit card numbers.

Likewise, perfectionism when it is unneeded can slow teams down to a halt for no reason. The balance in most cases is in the middle, and should shift towards 100% correctness as consequences get more dire.

This is not to say your code should be a buggy mess, but 98% bug free when you're a SaaS product and pushing features is certainly better than 100% bug free and losing ground to competitors.

True, though I'd say more bug impact than bug free-ness. If the 2% of bugs is in the most critical area of your app and causes users to abandon your product then you're losing ground.

That's one thing I think is good to learn from mission critical architecture: an awareness of the impact and risk tolerance of code and bugs, which means an awareness of how the software will be used and in what context by users.

Does anyone have pointers to some real information about this system? CPUs, RAM, storage, the networking, what OS, what language used for the software, etc etc?

I’d love to know how often one of the FCMs has “failed silent”, and where they were in the route and so on too, but it’s probably a little soon for that.

NASA cFS is written in plain C (trying to follow MISRA C, etc.). It's open on GitHub and used by many companies. It's typically run on FreeRTOS or RTEMS; not sure here.

Personally I find the project extremely messy, and kinda hate working with it.

Not sure about the primary FSW but the BFS uses cFS[0]. As the sibling comment mentions, you can check it out on GitHub. Sadly I believe NASA keeps most of their best code private, probably siloed into mission-specific codebases. Still, the cFS repo is an awesome crash course on old-school Flight Software techniques.

[0] https://youtu.be/4doI2iQe4Jk?si=ucMoIdw7x_QgZR32

Some related good books I have been studying for the past few years or so. The SPARK book is written by people who've worked on CubeSats:

  * Logical Foundations of Cyber-Physical Systems

  * Building High Integrity Applications with SPARK 

  * Analysable Real-Time Systems: Programmed in Ada

  * Control Systems Safety Evaluation and Reliability (William M. Goble)

I am developing a high-integrity controls system for a prototype hoist to be certified for overhead hoisting with the highest safety standards and targeting aerospace, construction, entertainment, and defense.

NASA didn't build this, Lockheed Martin and their subcontractors did. Articles and headlines like this make people think that NASA does a lot more than they actually do. This is like a CEO claiming credit for everything a company does.

Nice “well, actually”. I’m sure Lockheed were building this quad-redundant, radiation-hardened PowerPC that costs millions of dollars and communicates via Time-Triggered Ethernet anyway, whether NASA needed one or not.

Probably, if it already wasn’t developed for DoD.

For example, the OS it seems to be running is INTEGRITY-178.

https://www.ghs.com/products/safety_critical/integrity_178_s...

Aerospace tech is not entirely bespoke anymore, plenty of the foundational tech is off the shelf.

Historically, the main difference between ICBM tech and human spaceflight tech is the payload and reentry system.

This is the equivalent of prompt engineering.

True, but BFS was mainly done in-house. Source: my best friend and I worked on some parts of it.

Lockheed Martin and their subcontractors did the implementation.

We do not know how much of the high-level architecture of the system has been specified by NASA and how much by Lockheed Martin.

I do.

Are you interested in sharing more details to make your claim more believable?

Eh, in these kinds of subcontractor relationships there is a lot of work and communication on both sides of the table.

will nobody think of the megacorps!!!

I'm curious: In the current moon flyby, how often did some of these fallback methods get active? Was the BFS ever in control at any point? How many bitflips were there during the flight so far?

I sure wish they would talk about the hardware. I spent a few years developing a radiation hardened fault tolerant computer back in the day. Adding redundancy at multiple levels was the usual solution. But there is another clever check on transient errors during process execution that we implemented that didn't involve any redundancy. Doesn't seem like they did anything like that. But can't tell since they don't mention the processor(s) they used.

One of the things I loved about the Shuttle is that all five computers were mounted not only in different locations but in different orientations in the shuttle. Providing some additional hardening against radiation by providing different cross sections to any incident event.

I did VOS and database performance stuff at Stratus from 1989-95. Stratus was the hardware fault tolerant company. Tandem, our arch rivals, did software fault tolerance. Our architecture was “pair and spare”. Each board had redundant everything and was paired with a second board. Every pin out was compared on every tick. Boards that could not reset called home. The switch from Motorola 68K to Intel was a nightmare for the hardware group because some instructions had unused pins that could float.

"High-performance supercomputers are used for large-scale fault injection, emulating entire flight timelines where catastrophic hardware failures are introduced to see if the software can successfully ‘fail silent’ and recover."

I assume this means they are using a digital twin simulation inside the HPC?
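At toy scale, the fault-injection idea in that quote might look like the following: flip each bit of one CPU's result in turn and check that a self-checking pair catches it. Everything here (the `flight_step` stand-in, the 32-bit width, the function names) is hypothetical; it says nothing about NASA's actual test rigs.

```python
from typing import Optional


def flight_step(x: int) -> int:
    """Stand-in for one deterministic flight-software computation."""
    return (x * 3 + 7) & 0xFFFFFFFF


def with_bit_flip(value: int, bit: int) -> int:
    """Simulate a single-event upset by flipping one bit of a result."""
    return value ^ (1 << bit)


def pair_output(a: int, b: int) -> Optional[int]:
    """Self-checking pair: emit the value only if both CPUs agree."""
    return a if a == b else None


def run_campaign(x: int, width: int = 32) -> int:
    """Inject every single-bit upset into one CPU; count undetected faults."""
    clean = flight_step(x)
    undetected = 0
    for bit in range(width):
        faulty = with_bit_flip(flight_step(x), bit)
        if pair_output(clean, faulty) is not None:
            undetected += 1
    return undetected


print(run_campaign(12345))
```

A real campaign would inject faults into memory, buses, and timing across a whole simulated flight rather than into a single return value.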

I always wondered if the "radiation hardening" approaches of the challenges like this https://codegolf.stackexchange.com/questions/57257/radiation... (see the tag for more https://codegolf.stackexchange.com/questions/tagged/radiatio...) would be of any practical use... I assume not, as the problem is on too many levels, but still, seems at least tangentially relevant!

I wonder how often problems happen that the redundancy solves. Is radiation actually flipping bits, and at what frequency? Can a solar flare cause all the computers to go haywire?

Not a direct answer but probably as good information as you can get: https://static.googleusercontent.com/media/research.google.c...

Basically, yes, radiation does cause bit flips, more often than you might expect (but still a rare event in the grand scheme of things, but enough to matter).

And radiation in space is much “worse” (in quotes because that word is glossing over a huge number of different problems, both just intensity).

Typo: “both” ~ “not”

IEC 61508 estimates a soft error rate of about 700 to 1200 FIT (Failure in Time, i.e. 1E-9 failures/hour).

That was in the 2000s though, and for embedded memory above 65nm.

And obviously on earth.
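For a feel of what those FIT figures mean, here is the unit conversion worked out. The comment does not say what unit of memory the rate applies to, so this is per whatever unit that is; the function names are made up.

```python
HOURS_PER_YEAR = 24 * 365


def failures_per_hour(fit: float) -> float:
    """1 FIT is one failure per 1e9 hours of operation."""
    return fit * 1e-9


def mean_years_between_failures(fit: float) -> float:
    """Average time between failures for a single unit at the given FIT rate."""
    return 1.0 / failures_per_hour(fit) / HOURS_PER_YEAR


for fit in (700, 1200):
    years = mean_years_between_failures(fit)
    print(f"{fit} FIT -> about {years:.0f} years between soft errors")
```

Rates add across units, so a system with many such units sees errors proportionally more often.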

Some people are claiming it's the good old RAD750 variant. Is there anything that talks about the actual computer architecture? The linked article is desperately void of technical details.

It's a new (2002) variant of the same RAD750 architecture.

  CPUs: IBM PowerPC 750FX (single-core, 900 MHz, 32-bit, radiation hardened)
  RAM: 256 MB (per processor)
  OS: VxWorks (real-time OS)
  Network: TTEthernet (Time-Triggered Ethernet) at 1 Gbps
  Programming: MISRA C++, flight control laws from Simulink and MATLAB.

The part about triple-redundant voting systems genuinely blew my mind — it's such a different world from how most of us write software day to day, and honestly kind of humbling.

I wonder how the voting components are protected from integrity failures?

The Hyperia roller coaster ride at Thorpe Park uses triple-redundant voting. Which I thought was cool.

> It’s a complex machine. There’s three computers all talking to each other for a start, and they have to agree on everything.

Primary, Real-Time Secondary and Third for regulating votes.

https://www.bbc.co.uk/news/articles/ckkknz9zpzgo

Headline needs its how-dectomy reverted to make sense

(Off-topic:) Great word. Is that the usual word for it? Totally apt, and it should be the standard.

Does anyone know how this compares to Crew Dragon or HLS?

Multiple and dissimilar redundancy is nice and all that, but is there a manual override? Apollo could be flown manually (and on at least Apollo 11 and 13 it had to be), but is this still possible and feasible? I'd guess so, as it's still manned by (former) test pilots, much like Apollo.

> “Along with physically redundant wires, we have logically redundant network planes. We have redundant flight computers. All this is in place to cover for a hardware failure.”

It would be really cool to see a visualization of redundancy measures/utilization over the course of the trip to get a more tangible feel for its importance. I'm hoping a bunch of interesting data is made public after this mission!

I wonder how they made the voted-answer-picker fail-resistant

NASA describes some impressive work for runtime integrity, but the lack of mention of build-time security is surprising.

I would expect to see multi-party-signed deterministic builds etc. Anyone have any insight here?

What would the threat profile be here to require that? Regardless, I'd be a little surprised if they didn't have anything like that; provenance is very important in aerospace, with hardware tracked to the point that NTSB investigators looking at a crash can tell what ingot a bolt was made from

How big of a challenge are hardware faults and radiation for orbital data centers? It seems like you’d eat a lot of capacity if you need 4x redundancy for everything

Orbital datacenters are a hypothetical infinite money glitch that could exist between the times:

- after a general solution to the extra-terrestrial manufacturing bootstrap problem is found, and
- before the economy patches the exploit that a scalable commodity with near-zero cost and non-zero value can exist.

It'll also destroy the commercial launch market, because anything of any size you want can be made in space, leaving only tiny settler transports and government sovereign launches viable, so I'm not sure why commercial space people find it commercially lucrative. The time frame within which this IMG can exist can also be zero or negative.

The assumption is also, like, that they'll find a way to rent out some rocks for cash, so anyone with access to rocks will be doing it as it becomes viable, and so I'm not even sure the "space" part of space datacenters even matters. Earth is kinda space too in this context.

Orbital data centers are still nothing more than the current hyperloop.

Orbital data centres are a stupid concept.

You don't need 4x redundancy for everything. If no humans are aboard, you have 2x redundancy and immediately reboot if there is a disagreement.

They don't go into it here, but I thought that NASA also used something like 250nm chips in space for radiation resistance. Are there even any radiation-resistant GPUs out there?

Absolutely not, although the latest fabs with rad-tolerant processors are at ~20 nm. There are FDSOI processes in that generation that I assume can be made radiation-tolerant.

NOPE, RAD hardened space parts basically froze on mid 2000s tech: https://www.baesystems.com/en-us/product/radiation-hardened-...

It seems not; anti-interference primarily relies on using older manufacturing processes, including for military equipment, and then applying an anti-interference casing or hardware redundancy correction similar to ECC.

It would be nice to see some of the software source. I’m super interested and i think I helped pay for it

They should have also built a fault-tolerant toilet.

The ARINC scheduler, RTOS, and redundancy have been used in safety-critical systems for decades; ARINC goes back to the '90s. Most safety-critical microkernels, like INTEGRITY-178B and LynxOS-178B, came with a layer for that.

Their redundancy architecture is interesting. I'd be curious of what innovations went into rad-hard fabrication, too. Sandia Secure Processor (aka Score) was a neat example of rad-hard, secure processors.

Their simulation systems might be helpful for others, too. We've seen more interest in that from FoundationDB to TigerBeetle.

2 outlooks.

Two.

Probably same way they’ve built fault-tolerant toilet.

ctrl+f toilet, thank you for already commenting this

So honest and perhaps a bit stupid question.

Astronauts have actual phones with them - iPhone 17s, I think? And a regular Thinkpad that they use to upload photos from the cameras. How does all of that equipment work fine with all the cosmic radiation floating about? With the iPhone's CPU in particular, shouldn't random bit flips be causing constant crashes due to errors? Or is it simply that these errors happen but nothing really detects them so the execution continues unhindered?

They’re not mission-critical equipment. If they fail, nobody dies.

They’re not radiation hardened, so given enough time, they’d be expected to fail. Rebooting them might clear the issue or it might not (soft vs hard faults).

Also impossible to predict when a failure would happen, but NASA, ESA and others have data somewhere that makes them believe the risk is high enough that mission critical systems need this level of redundancy.

>>They’re not mission-critical equipment. If they fail, nobody dies.

Yes, for sure, but that's not my question - it's not a "why is this allowed" but "why isn't this causing more visible problems with the iphones themselves".

Like, do they need constant rebooting? Does this cause any noticeable problems with their operation? Realistically, when would you expect a consumer grade phone to fail in these conditions?

A lot of "space-rated" components come from the consumer space, with certification that they can work in space.

IIRC the helicopter on Mars used the same kind of Snapdragon CPU that's in your phone.

Also, a bit flip can happen without you knowing. A flip in free RAM, or in a temp file that is not needed anymore, won't manifest as any error, but then your system is not really deterministic anymore, since now you rely on chance.

Random bit flips due to radiation are infrequent - the stat is something like one bit flip per megabyte per 40,000 data centre RAM modules per year - ie extremely uncommon, but common enough to matter at scale.

Space is a harsher environment but they’re only up there for like a week. So, if there were an incident, it would be more likely to kill the devices, but it’s not very likely to happen during the short period of time (while still being more likely than on earth’s surface).

That said, part of the point of them taking these devices up is to find out how well they perform in practice. We just don’t really know how these consumer devices perform in space.

It will be interesting to see the results when they’re published!

Typo in the first sentence of the first paragraph is oddly comforting since AI wouldn't make such a typo, heh.

Typo in the first sentence of the second paragraph is sad though. C'mon, proofread a little.

I think everyone should now make mistakes so we ca distinguish human vs ai.

This can be optimised for no doubt, adversarial training is like that

if I remember correctly the space shuttle had four computers that all did the same processing and a fifth that decided what was the correct answer if they all didn't match or some went down

can't find a wikipedia article on it but the times had an article in 1981

https://www.nytimes.com/1981/04/10/us/computers-to-have-the-...

apparently the 5th was standby, not the decider

The Artemis computer handles way more flight functions than Apollo did. What are the practical benefits of that?

This electrify & integrate playbook has brought benefits to many industries, usually where better coordination unlocks efficiencies. Sometimes the smarts just add new failure modes and predatory vendor relationships. It’s showing up in space as more modular spacecraft, lower costs and more mission flexibility. But how is this playing out in manned spacecraft?

They run 2 Outlook instances. For redundancy. /s

[flagged]

Don't post generated comments or AI-edited comments. HN is for conversation between humans.

https://news.ycombinator.com/newsguidelines.html

> Dissimilar redundancy eliminates that risk. A completely different OS, different codebase, different development team.

Not entirely true. During my uni years I heard of a case where two independent teams used the same textbook to implement a feature. The textbook had an error, so both implementations ended up with the same failure mode.

Ha, very curious what the issue was and what textbook

and yet.. https://news.ycombinator.com/item?id=47615490

That was a laptop, not one of the Artemis computers.

It's kinda crazy how this mission didn't make it into mainstream media until recently.

The fail-silent design is the part worth paying attention to. The conventional approach to redundancy is to compare outputs and vote — three systems, majority wins. What NASA did here instead is make each unit responsible for detecting its own faults and shutting up if it can't guarantee correctness. Then the system-level logic just picks the first healthy source from a priority list.

That's a fundamentally different trust model. Voting systems assume every node will always produce output and the system needs to figure out which output is wrong. Fail-silent assumes nodes know when they're compromised and removes them from the decision set entirely. Way simpler consensus at the system level, but it pushes all the complexity into the self-checking pair.

The interesting question someone raised — what if both CPUs in a pair get the same wrong answer — is the right one. Lockstep on the same die makes correlated faults more likely than independent failures. The FIT numbers are presumably still low enough to be acceptable, but it's the kind of thing that only matters until it does.
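
To make the contrast concrete, here is a rough sketch of the two models as described above. Everything in it (`SelfCheckingPair`, `select_first_healthy`, `majority_vote`, the unit names) is invented for illustration and is not taken from NASA's design:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Hashable, Optional, Sequence, Tuple


@dataclass
class SelfCheckingPair:
    """Two lockstep lanes; the unit goes silent unless both lanes agree."""
    name: str
    lane_a: Optional[float]
    lane_b: Optional[float]

    def output(self) -> Optional[float]:
        if self.lane_a is None or self.lane_b is None:
            return None  # a lane produced nothing: stay silent
        if self.lane_a != self.lane_b:
            return None  # lanes disagree: stay silent
        return self.lane_a


def select_first_healthy(
    priority_list: Sequence[SelfCheckingPair],
) -> Optional[Tuple[str, float]]:
    """Fail-silent model: take the first unit that is still talking."""
    for unit in priority_list:
        value = unit.output()
        if value is not None:
            return unit.name, value
    return None


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Voting model: every node talks, a strict majority wins."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


units = [
    SelfCheckingPair("unit_1", 1.0, 1.5),  # lanes disagree, so it is silent
    SelfCheckingPair("unit_2", 2.0, 2.0),
]
assert select_first_healthy(units) == ("unit_2", 2.0)
```

Note that a pair whose lanes hold the same wrong value passes `output()` without complaint, which is exactly the correlated-fault hole mentioned above.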

This is similar to the difference between using error-correcting codes and using erasure codes combined with error-detecting codes.

The latter choice is frequently simpler and more reliable for preventing data corruption. (An erasure code can be as simple as having multiple copies and using the first good copy.)
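
A small sketch of that "multiple copies, use the first good one" idea, using CRC32 as the error-detecting code. The record layout and the function names are assumptions made up for the example:

```python
import zlib
from typing import List, Optional, Sequence


def store(payload: bytes, copies: int = 3) -> List[bytearray]:
    """Write `copies` replicas, each carrying its own CRC32 trailer."""
    record = payload + zlib.crc32(payload).to_bytes(4, "big")
    return [bytearray(record) for _ in range(copies)]


def read_first_good(replicas: Sequence[bytearray]) -> Optional[bytes]:
    """Return the payload of the first replica whose checksum verifies.

    A replica that fails the check is treated as erased and skipped;
    nothing ever tries to repair it.
    """
    for replica in replicas:
        if len(replica) < 4:
            continue
        payload = bytes(replica[:-4])
        stored_crc = int.from_bytes(replica[-4:], "big")
        if zlib.crc32(payload) == stored_crc:
            return payload
    return None


replicas = store(b"telemetry")
replicas[0][2] ^= 0x01  # corrupt one bit in the first copy
assert read_first_good(replicas) == b"telemetry"
```

The reader never has to work out which bit was wrong, only whether a copy can be trusted, which is the same shape as the fail-silent selection logic.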

Spoken like an LLM.

> make each unit responsible for detecting its own faults and shutting up if it can't guarantee correctness

Does this mean you have to trust the already compromised system?

How can you remove a component from the decision set if it is the only component in the whole decision set?