
Innovation With Life-critical Systems

When complex systems are life-critical, it can be extremely difficult to modernize and upgrade them safely. While many engineers will never work on such systems, there is much to be learned from those who do.

In this article, Toptal Solutions Architect Dr. Kyle Kotowick explains how to properly maintain and update systems that are too important to fail.



Kyle Kotowick, Ph.D.

Every enterprise has a “critical” infrastructure. Generally, this means that if the system goes offline, parts (or all) of the business will come to a grinding halt until the techs can get it running again. This often happens during large software, hardware, or network upgrades: newly deployed systems contain unexpected bugs that cause failures, or users make mistakes because they’re unfamiliar with the new system, and productivity stops until the techs can work through the deployment bugs or train the users. Most businesses can afford a certain amount of downtime without risking much revenue or antagonizing their clients, so a period of downtime or reduced productivity is usually an acceptable risk in exchange for the promise of improved performance and efficiency from new technology. That trade-off isn’t universal, though.

But what happens when the systems you need to modify are life-critical systems, where life safety is dependent on being able to use the system reliably?

This article steps away from the more traditional software development on which we spend most of our time and, instead, takes a look at the human element in life-critical systems. My thoughts on this topic stem from my experience as the Director of Information Technology for a 911 ambulance service, as a technology specialist for an international humanitarian disaster response organization, and from completing my Ph.D. in Human Systems Integration at the Massachusetts Institute of Technology.

Before we begin, I’d like to explain why this may be relevant to you. While it may not be obvious that your project involves a life-critical system, it’s much more likely than you might think. All of the following qualify, along with countless others:

  • Automotive. Working on a project with a vehicle manufacturer, or any of their suppliers? Recruited out of university by a self-driving car startup? Automatic braking, cruise control, lane control, computer vision, obstacle recognition, electronic engine control modules, etc. Every one of these is a life-critical system, where a failure can be fatal.
  • Aviation. When you’re 30,000’ in the air, almost any system failure can be life-critical. Considering recent events with the Boeing 737 MAX, there are the obvious life-critical systems of autopilot and computerized flight control, but there are also a lot of things you may not think about. At home, if the fan in your HVAC system fails and produces a bit of smoke, you open the window or step outside for some fresh air. Have you ever tried opening the window and stepping outside during a trans-Atlantic flight? Even the most basic of systems, from the power outlets in the galley to the brakes on the wheels of the drink cart, can be life-critical on aircraft.
  • Communications. Most developers or engineers will, at some point in their careers, work on a communications system project in one capacity or another. To many people, telecommunications don’t initially seem life-critical; after all, civilization fared just fine before telephones and the internet. As someone who has deployed to disaster zones where telecommunications infrastructure has been destroyed, let me tell you what actually happens: people become very ill or injured and can’t call 911; elderly residents can’t call their kids to ask them to pick up their prescriptions from the pharmacy; emergency responders can’t communicate with their dispatchers or with each other; and people who can’t contact their family members become concerned and put themselves in extremely dangerous situations to try to find them. Communications systems are absolutely life-critical.

In today’s world of heavy reliance on technology, projects you’ve never considered to even be semi-important could end up being a vital component of a life-critical system.

But If It Ain’t Broke, Don’t Fix It

Have you ever heard or used the word “heritage” in relation to a technological system? It’s a strong word, invoking thoughts of long-standing traditions, lineage, and difficult times of old. In the engineering world, it denotes designs that have been around for a long time and have proven to work reliably, and it is often seen as a desirable trait for reducing risk. In reality, it’s a euphemism for archaic technology that was never updated due to risk aversion, and it is pervasive in industries where system failures can lead to dire consequences.

There is good reason behind this affinity towards heritage software and hardware. It’s known to work, it’s unlikely that unknown bugs will arise, and its costs are stable and predictable. An excellent example is the spaceflight industry: the Russian Soyuz spacecraft has been launching astronauts into space for over 50 years with only minor revisions during that time, and it continues to be used because it is reliable and safe. Unfortunately, this means that it is also extremely inefficient compared to modern designs: while the Soyuz costs NASA $81 million USD per seat to fly astronauts to the space station using its heritage hardware, SpaceX and Boeing are expected to offer seats for $58 million USD each using their modern rocket designs.

It is understandable that few people want to upgrade old systems that still work; executives don’t want the risk, technicians don’t want to deal with the mysterious server in the closet with an uptime of 12 years, and clients don’t want to have to learn new designs. Unfortunately, there is a tipping point between risk minimization and cost savings: heritage designs will need to be upgraded eventually, even in high-risk environments.

The remainder of this article covers some of the more important steps in the process of doing so when your systems are life-critical, such as those used by first responders, military units, or aircraft.

Convincing the Brass

In my experience, possibly the hardest step of the process is convincing decision-makers and stakeholders that upgrades are needed. Systems that operate in life-critical environments are often governed by extensive legal regulation and organizational policy, meaning that you have to convince them not only to take the risk and spend the money, but also to engage in what could easily be several years of cutting through bureaucratic red tape. The strongest opposition you’ll face will likely come from the legal team, who will lay out in excruciating detail the potential liability you’ll be opening the organization up to by introducing new technology. They’re right: liability is a major issue, and if something breaks and someone gets hurt or dies, it could be a massive ethical, reputational, and financial burden.

Regardless of whether you’re working with a Fortune 500 corporation or with your local volunteer fire department, it always comes down to the same thing: money. Corporations don’t want to lose it, and non-profits don’t have much to work with in the first place. The only reliable way that I have found to convince an organization’s leadership to take the risk of changing a life-critical system is to demonstrate that, probabilistically, they are either more likely to make/save money than to lose it, or that they can reduce their liability by modernizing their technology and improving safety.

What that means is that you need to do the math. What is the likelihood that there will be extended downtime or future crashes due to bugs (based on previous deployments in your organization, or data from other organizations)? What is the expected impact if it does fail, and what would that cost? Conversely, what are the expected performance or reliability gains, and what would they be worth? If you can show that there’s a high probability you’ll come out ahead, there’s a good chance that you’ll get the green light.
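To make that concrete, here is a minimal sketch of the expected-value comparison in Python. Every number in it is a hypothetical placeholder; in practice, the probabilities and costs would come from your own outage history, deployment records, and finance team.

```python
# Hypothetical expected-value comparison for a proposed upgrade.
# All figures are illustrative placeholders; substitute estimates from
# your own incident history, vendor quotes, and finance department.

# Status quo: aging system with a known annual failure risk.
p_failure_legacy = 0.15     # estimated chance of a major outage per year
cost_per_outage = 250_000   # downtime, liability exposure, reputational cost

# Proposed upgrade: one-time project cost plus a smaller residual risk.
upgrade_cost = 400_000      # development, testing, training, rollout
p_failure_new = 0.03        # residual annual outage risk after modernization
annual_savings = 60_000     # efficiency and maintenance gains per year

years = 5                   # evaluation horizon

expected_cost_legacy = years * p_failure_legacy * cost_per_outage
expected_cost_upgrade = (
    upgrade_cost
    + years * p_failure_new * cost_per_outage
    - years * annual_savings
)

print(f"Expected 5-year cost, status quo: ${expected_cost_legacy:,.0f}")
print(f"Expected 5-year cost, upgrade:    ${expected_cost_upgrade:,.0f}")
```

If the expected cost of upgrading comes in below the expected cost of standing still, that comparison is the core of your pitch; if it doesn’t, the numbers are telling you something worth hearing, too.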

It’s Not About You

You may be familiar with the phrase “by engineers, for engineers,” an idiom suggesting that engineers built something to be used by people with qualifications similar to their own. It’s an extremely common occurrence and was one of the main precipitating factors for the rise of computer usability studies in the early 1990s. As engineers, we often have different mental models of how technological systems work than the average end-user does, which leads to a tendency to design systems on the assumption that the end-user already knows how they function. In conventional systems, this leads to errors and unhappy clients; in life-critical systems, it can lead to death.

Consider the case of Air France Flight 447. On June 1, 2009, the Airbus A330 was over the Atlantic Ocean en route from Rio de Janeiro to Paris. Ice crystals obstructed the pitot tubes, causing inconsistencies in the airspeed measurements. The flight computer disengaged the autopilot, recognizing that it could not reliably fly the plane itself with incorrect airspeed data. It then placed itself into an “extended flight envelope” mode, which allowed the pilots to perform maneuvers that the computer wouldn’t normally allow. You can imagine the engineers who built the system thinking, “If the autopilot disengages itself, it’s probably because there’s a situation where the pilots might need extra control!”

This is the natural train of thought for the engineers, who understand what kinds of things might cause the autopilot to disengage. For the pilots, it was not the case. They forced the aircraft into a steep upward climb, ignoring the “STALL” warnings, until it lost all airspeed and plummeted to the ocean. All 228 passengers and crew were killed. While there are multiple ideas as to the exact motivation for their actions, the prevailing theory is that the pilots assumed the flight computer would prevent control inputs that would stall the aircraft (which is true for the normal flight envelope), and did not realize that it had switched to the extended envelope mode because they did not share the engineers’ mental model of the logic that drove the computer’s decisions.

While a bit of a circuitous route, this leads us to my main point: the actions that you want users to take under specific conditions must be the actions that feel natural to the user.

Users can be instructed to follow specific procedures, but they’re simply not always going to remember the right thing to do or understand what is happening under high-stress conditions. It is your responsibility to design software, controls, and interfaces in an intuitive manner that makes users inherently want to do the things that they are supposed to.

What this means is that it is absolutely critical that end-users are engaged in every single stage of the design and development process. There can be no assumptions made about what users will probably do; it must all be evidence-based. When you design interfaces, you must conduct studies to determine the thought patterns that they induce in users and adjust as necessary. When you test your new systems, you must test them with real users in real environments under realistic conditions. And unfortunately, when you alter your designs, you must do these steps all over again.

Although every complex system is unique, I’d like to mention three design considerations, in particular, that should be applied universally:

  • Controls should be representative of the actions they invoke. Fortunately, this one is fairly common, generally seen in the form of selecting relevant icons for GUI buttons or relevant shapes for physical controls. “New File” buttons use a blank sheet of paper icon, and landing gear levers in aircraft have a knob in the shape of a wheel. However, it is easy to become complacent. Why do we still see 3.5” floppy disk icons for “Save” buttons? Does anyone younger than 25 even know what a floppy disk is? We continue to use it because we think it’s representative, but it really isn’t anymore. Keeping control representations meaningful to users takes constant effort, and that effort has to be balanced against continuity for the users who already know the old symbols.
  • If a warning system breaks, it must be detectable. We often use warning lights that activate when there’s a problem, such as a flashing red indicator. That’s great for getting a user’s attention, but what happens if the light burns out? The spacecraft used in the Apollo lunar missions had dozens of warning lights for all sorts of systems, but they did not function in a conventional manner. Instead of illuminating when there was a problem, they remained illuminated when everything was fine and turned off when there was a problem. This ensured that a burned-out warning light wouldn’t cause the astronauts to miss a potentially fatal system error. I’m not saying that you should use this design, since light bulbs have come a long way in reliability since the 1960s, but it gives an idea of how in-depth some of your planning has to be. How many times has a system crashed without you knowing, because the logging or notifications weren’t functioning properly? (A sketch of this “silence means alarm” pattern in software follows this list.)
  • Mode transitions must be obvious to the user. What happens if an Airbus A330 transitions from normal control mode to advanced control mode without the user noticing, and suddenly the aircraft does things the user didn’t think it could do? What happens if a self-driving car disengages its autopilot, leaving the user unexpectedly in full control? What happens when there is any major transition in mode or functionality that requires an immediate change in the user’s actions, but it takes the user a minute or two to figure out what just happened? While a variety of operating modes may be necessary in a complex system, modes cannot transition without adequate notice to, explanation for, and interaction with the user.
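Here is a minimal sketch of the “silence means alarm” idea mentioned above, applied to software monitoring: the monitored service must emit a periodic heartbeat, and it is the absence of that signal, not a message from the failing component, that raises the alarm. The class and threshold are my own illustration, not an API from any particular library.

```python
import time

class HeartbeatMonitor:
    """Raises an alarm when a service stops reporting, rather than
    relying on the failing service to announce its own failure."""

    def __init__(self, timeout_seconds: float = 10.0):
        self.timeout = timeout_seconds
        self.last_beat = time.monotonic()

    def beat(self) -> None:
        # Called by the monitored service on every successful cycle.
        self.last_beat = time.monotonic()

    def is_alarming(self) -> bool:
        # Silence covers crashes, hangs, and broken notification paths
        # alike: exactly the cases a "send an alert on error" design misses.
        return (time.monotonic() - self.last_beat) > self.timeout


# Hypothetical usage: a supervisor process polls the monitor.
monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.beat()                  # the service checks in
time.sleep(1)
print(monitor.is_alarming())    # False: the heartbeat is recent
```

Like the Apollo lights, the healthy state is the one that must be actively maintained; anything that stops the signal, including a failure of the warning path itself, shows up as a problem.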

Rolling Life-Critical Systems Out of the Shop

In line with industry best practices, a beta phase is crucial for deployments of new life-critical systems. Tests of your new technology may have helped you correct the majority of bugs and usability issues, but hidden dangers may surface when it has to be used together with other systems in real-world environments. In the systems engineering discipline, this is known as “emergence.” Emergent properties are “unexpected behaviors that stem from interaction between the components of an application and their environment,” and by their very nature are essentially impossible to detect in a lab setting. By inviting a representative group of end-users to trial your new system under careful supervision, many of these properties can be detected and evaluated for negative consequences that may appear in full-scale deployment.

Another topic that often arises in architecture discussions with my clients is that of a phased rollout. This is the choice between gradually replacing elements of the pre-existing system with elements of the new one until everything has been replaced, versus changing everything at once. There are pros and cons to each: a phased rollout allows for gradual acclimatization of users to the new design, ensuring that changes come in manageable amounts and users aren’t overwhelmed; on the other hand, it can drag the process out over extended periods and result in a constant state of transition. Immediate rollouts are beneficial in that they “rip the band-aid off” and speed things along, but the sudden, drastic changes can overwhelm users.

For life-critical systems, I’ve found that phased rollouts are generally safer, as they allow incremental evaluation of individual components in a production environment and allow smaller reversions if something needs to be rolled back. This is something that needs to be evaluated on a case-by-case basis, however.
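For a sense of what a phased rollout can look like in code, here is a minimal sketch of percentage-based gating between the legacy and new systems. The function and the `rollout_percentage` parameter are hypothetical; a real deployment would more likely use a feature-flag service and would schedule life-critical user groups explicitly rather than by hash.

```python
import hashlib

def use_new_system(user_id: str, rollout_percentage: int) -> bool:
    """Deterministically assign a user to the new or the legacy system.

    Hashing the user ID keeps each user on the same side of the split
    across sessions, so nobody flips back and forth between old and new
    behavior while the rollout percentage holds steady."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100      # stable bucket in the range 0-99
    return bucket < rollout_percentage

# Hypothetical usage: start small, expand as confidence grows, and drop
# the percentage back down to revert the cohort if problems surface.
for uid in ["medic-17", "dispatcher-04", "pilot-22"]:
    print(uid, use_new_system(uid, rollout_percentage=5))
```

Because the assignment is deterministic, rolling back is just lowering the percentage; no user is left stranded in a half-migrated state, which matters when the people on the new path are dispatchers or paramedics rather than beta enthusiasts.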

Normalization of Deviance

OK, so your user testing helped you design an intuitive system, your beta phase identified emergent issues, your phased rollout allowed users to become comfortable with the design, and everything is running smoothly. You’re done, right? Unfortunately not.

Issues with your system will inevitably arise, there’s no getting around that. When users come across these issues, they will often develop workarounds instead of reporting the problem to the tech team. The workarounds will become standard practice, shared as “tips” from veterans to rookies. Sociologist Diane Vaughan coined a phrase to describe this phenomenon: “normalization of deviance.” Users become so accustomed to deviant behavior that they fail to remember that it is, in fact, deviant.

The classic example is the Space Shuttle Challenger: a component in the solid rocket boosters was regularly observed to erode during launch, and although everyone knew that eroding rocket components was a bad thing, it happened so often that waivers were regularly issued and it was considered normal. On January 28, 1986, the problem that everyone originally knew shouldn’t be allowed, but that they had normalized, resulted in the explosion of Challenger and the deaths of seven astronauts.

As the administrator of a life-critical system, you are the one responsible for preventing such occurrences. Study how your users interact with the system. Shadow them for a couple of days and see if they’re using unexpected workarounds. Periodically send out surveys to ask for suggested changes and improvements. Dedicate time and effort to improving your system before your users find ways to work around the issues without you.

Training for Performance Under Pressure

It is often the case that failures in life-critical systems occur when users suffer from stress, adrenaline surges, and tunnel vision. It happened to the pilots on Air France 447, it’s happened to paramedics who can’t remember how to operate their cardiac monitor on their first cardiac arrest patient, and it’s happened to soldiers who can’t operate their radios properly while under fire.

Some of this risk is ameliorated by using intuitive designs as discussed earlier, but that alone is insufficient. Particularly in environments where high-stress scenarios occur only infrequently, it is essential to train your users not just in how to operate your system, but in how to think clearly under such conditions. Users who merely memorize sequences of actions can’t deal with unexpected failures, because the actions they learned may no longer be appropriate; users trained to think logically and clearly under stress can respond to changing circumstances and help your system stay alive when something breaks.

Conclusion

Obviously, developing, deploying, and maintaining life-critical software is a hell of a lot more complex than can be detailed in a single article. However, these areas of consideration may help give you a better idea of what to expect when you’re thinking about working on such a project. In summary:

  • Innovation is necessary, even when everything is working smoothly.
  • It’s hard to convince executives that the risk is worthwhile, but numbers don’t lie.
  • End-users must be involved in every step of the process.
  • Test and re-test with real users in real environments under realistic conditions.
  • Don’t assume that you got it right the first time; even though it’s working, there may be undetected problems that your users aren’t telling you about.
  • Train your users not just in how to use the system, but in how to think clearly and adapt under stress.

In closing, I’d like to note that in complex safety-critical systems, things will go wrong at some point no matter how much planning, testing, and maintenance you do. When those systems are life-critical, it’s quite possible that a failure will lead to loss of life or limb.

In the event that such a tragedy does occur with something you’re responsible for, it will be a crushing weight on your conscience and you will likely think that you could have prevented it if you’d paid more attention or worked harder. Maybe that’s true, maybe it’s not, but it is impossible to foresee every possible occurrence.

Work meticulously, don’t get cocky, and you’ll be making the world a better place.

Understanding the basics

  • What is a life-critical system in software engineering?

    A life-critical system is a system whose failure or malfunction may result in death or serious injury. It comprises all software and hardware necessary to perform a critical function.

  • What is dependability in software engineering?

    Dependability is a measure of a system’s availability, reliability, and maintainability. In general, it is a measure of the confidence that a system will perform as expected.

  • What is a safety-critical element?

    Safety-critical elements are systems or components that are designed to prevent, control, mitigate, or respond to system malfunctions or accidents that could lead to injury or death.

About the author

Kyle is a leader in solution architecture, holding a PhD in Human Systems Integration from the Massachusetts Institute of Technology.
