Kyle Kotowick, Ph.D.
Dr. Kotowick is a leader in solution architecture, holding a Ph.D. in Human Systems Integration from the Massachusetts Institute of Technology.
When complex systems are life-critical, it can be extremely difficult to modernize and upgrade them safely. While many engineers will never work on such systems, there is much to be learned from those who do.
In this article, Toptal Solutions Architect Dr. Kyle Kotowick explains how to properly maintain and update systems that are too important to fail.
Every enterprise has “critical” infrastructure. Generally, this means that if the system goes offline, parts (or all) of the business come to a grinding halt until the techs can get it running again. This often happens during large software, hardware, or network upgrades: newly deployed systems contain unexpected bugs that cause a failure, or users make mistakes because they’re unfamiliar with the new system, and productivity stops until the techs can work through the deployment bugs or train the users. Most businesses can afford a certain amount of downtime without risking much revenue or antagonizing their clients, so a period of downtime or reduced productivity is usually an acceptable risk in exchange for the promised performance and efficiency gains of new technology. But that isn’t universal.
But what happens when the systems you need to modify are life-critical, where safety depends on being able to use them reliably?
This article steps away from the more traditional software development on which we spend most of our time and, instead, takes a look at the human element in life-critical systems. My thoughts on this topic stem from my experience as the Director of Information Technology for a 911 ambulance service, as a technology specialist for an international humanitarian disaster response organization, and from my Ph.D. in Human Systems Integration at the Massachusetts Institute of Technology.
Before we begin, I’d like to explain why this may be relevant to you. It may not be obvious that your project involves a life-critical system, but it’s much more likely than you might think, and far more kinds of projects qualify than most engineers expect.
In today’s world of heavy reliance on technology, projects you’ve never considered even semi-important could end up being a vital component of a life-critical system.
Have you ever heard or used the word “heritage” in relation to a technological system? It’s a strong word, evoking thoughts of long-standing traditions, lineage, and difficult times of old. In the engineering world, it denotes designs that have been around for a long time and have proven to work reliably, and it is often seen as a desirable trait for reducing risk. In reality, it’s a euphemism for archaic technology that was never updated due to risk aversion, and it is pervasive in industries where system failures can lead to dire consequences.
There is good reason for this affinity toward heritage software and hardware: it’s known to work, it’s unlikely that unknown bugs will arise, and its costs are stable and predictable. An excellent example is the spaceflight industry: the Russian Soyuz spacecraft has been launching astronauts into space for over 50 years with only minor revisions in that time, and it continues to be used because it is reliable and safe. Unfortunately, it is also extremely inefficient compared to modern designs: while the Soyuz costs NASA $81 million USD per seat to fly astronauts to the International Space Station on its heritage hardware, SpaceX and Boeing are expected to offer seats for $58 million USD each using their modern rocket designs.
It is understandable that few people want to upgrade old systems that still work; executives don’t want the risk, technicians don’t want to deal with the mysterious server in the closet with an uptime of 12 years, and clients don’t want to have to learn new designs. Unfortunately, there is a tipping point between risk minimization and cost savings: heritage designs will need to be upgraded eventually, even in high-risk environments.
The remainder of this article covers some of the more important steps in the process of doing so when your systems are life-critical, such as those used by first responders, military units, or aircraft.
In my experience, possibly the hardest step of the process is convincing decision-makers and stakeholders that upgrades are needed. Systems that operate in life-critical environments are often governed by extensive legal regulation and organizational policy, meaning that you have to convince stakeholders not only to take the risk and spend the money, but also to engage in what could easily be several years of cutting through bureaucratic red tape. The strongest opposition you’ll face will likely come from the legal team, who will lay out in excruciating detail the potential liability you’ll be opening the organization up to by introducing new technology. They’re right: liability is a major issue, and if something breaks and someone gets hurt or dies, it could be a massive ethical, reputational, and financial burden.
Regardless of whether you’re working with a Fortune 500 corporation or with your local volunteer fire department, it always comes down to the same thing: money. Corporations don’t want to lose it, and non-profits don’t have much to work with in the first place. The only reliable way that I have found to convince an organization’s leadership to take the risk of changing a life-critical system is to demonstrate that, probabilistically, they are either more likely to make/save money than to lose it, or that they can reduce their liability by modernizing their technology and improving safety.
What that means is that you need to do the math. What is the likelihood that there will be extended downtime or future crashes due to bugs (based on previous deployments in your organization, or data from other organizations)? What is the expected impact if it does fail, and what would that cost? Conversely, what are the expected performance or reliability gains, and what would they be worth? If you can show that there’s a high probability you’ll come out ahead, there’s a good chance that you’ll get the green light.
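As a rough illustration, here is a minimal expected-value sketch in Python. Every figure in it is a hypothetical assumption; in practice, you would plug in your own incident history, downtime costs, and projected gains.

```python
# Hypothetical expected-value comparison for an upgrade decision.
# All figures are illustrative assumptions, not real data.

p_failure_legacy = 0.10           # estimated annual probability of a serious legacy failure
p_failure_new = 0.03              # estimated annual probability after the upgrade
cost_per_failure = 250_000        # downtime, overtime, liability exposure (USD)

upgrade_cost = 400_000            # licenses, development, training (one-time)
annual_efficiency_gain = 180_000  # staff hours saved, faster response times

years = 5                         # evaluation horizon

expected_cost_legacy = years * p_failure_legacy * cost_per_failure
expected_cost_upgrade = (upgrade_cost
                         + years * p_failure_new * cost_per_failure
                         - years * annual_efficiency_gain)

print(f"Expected {years}-year cost if we keep the legacy system: ${expected_cost_legacy:,.0f}")
print(f"Expected {years}-year net cost if we upgrade:            ${expected_cost_upgrade:,.0f}")
```

If the expected cost of standing still exceeds the expected net cost of upgrading over your evaluation horizon, you have the skeleton of a business case that leadership and the legal team can debate in concrete terms.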
You may be familiar with the phrase “by engineers, for engineers,” an idiom suggesting that engineers built something to be used by people with qualifications similar to their own. It’s an extremely common occurrence and was one of the main precipitating factors for the rise of computer usability studies in the early 1990s. As engineers, we often have different mental models of how technological systems work than the average end-user does, leading to a tendency to design systems with the assumption that end-users already know how they function. In conventional systems, this leads to errors and unhappy clients; in life-critical systems, it can lead to death.
Consider the case of Air France Flight 447. On June 1, 2009, the Airbus A330 was over the Atlantic Ocean en route from Rio de Janeiro to Paris. Ice crystals obstructed the pitot tubes, causing inconsistencies in the airspeed measurements. The flight computer disengaged the autopilot, recognizing that it could not reliably fly the plane itself with incorrect airspeed data. It then placed itself into an “extended flight envelope” mode, which allowed the pilots to perform maneuvers that the computer would not normally allow. You can imagine the engineers who built the system thinking, “If the autopilot disengages itself, it’s probably because there’s a situation where the pilots might need extra control!”
This is the natural train of thought for the engineers, who understand what kinds of things might cause the autopilot to disengage. For the pilots, that was not the case. They forced the aircraft into a steep upward climb, ignoring the “STALL” warnings, until it lost all airspeed and plummeted into the ocean. All 228 passengers and crew were killed. While there are multiple theories as to the exact motivation for their actions, the prevailing one is that the pilots assumed the flight computer would prevent control inputs that would stall the aircraft (which is true in the normal flight envelope) and did not realize that it had switched to the extended envelope mode, because they did not share the engineers’ mental model of the logic driving the computer’s decisions.
While a bit of a circuitous route, this leads us to my main point: the actions that you want users to take under specific conditions must be the actions that feel natural to the user.
Users can be instructed to follow specific procedures, but they’re simply not always going to remember the right thing to do or understand what is happening under high-stress conditions. It is your responsibility to design software, controls, and interfaces in an intuitive manner that makes users inherently want to do the things that they are supposed to.
What this means is that it is absolutely critical that end-users are engaged in every single stage of the design and development process. There can be no assumptions made about what users will probably do; it must all be evidence-based. When you design interfaces, you must conduct studies to determine the thought patterns that they induce in users and adjust as necessary. When you test your new systems, you must test them with real users in real environments under realistic conditions. And unfortunately, when you alter your designs, you must do these steps all over again.
Although every complex system is unique, I’d like to mention a few design and deployment considerations in particular that should be applied universally.
In line with industry best practices, a beta phase is crucial for deployments of new life-critical systems. Tests of your new technology may have helped you correct the majority of bugs and usability issues, but hidden dangers may surface when it has to be used together with other systems in real-world environments. In the systems engineering discipline, this is known as “emergence.” Emergent properties are “unexpected behaviors that stem from interaction between the components of an application and their environment,” and by their very nature are essentially impossible to detect in a lab setting. By inviting a representative group of end-users to trial your new system under careful supervision, many of these properties can be detected and evaluated for negative consequences that may appear in full-scale deployment.
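One lightweight complement to a supervised beta (not a substitute for it) is to run the new component in “shadow” mode: feed it the same real inputs the existing component handles and log any disagreement, without letting the new output affect operations. The sketch below is a hypothetical illustration of that idea; the component names and record format are invented.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow-comparison")

# Hypothetical stand-ins for the existing component and its replacement.
def legacy_triage(record: dict) -> str:
    return "urgent" if record["priority"] >= 2 else "routine"

def new_triage(record: dict) -> str:
    # New logic under evaluation; it runs in shadow and never drives real decisions.
    return "urgent" if record["priority"] >= 2 or record.get("flagged") else "routine"

def handle(record: dict) -> str:
    decision = legacy_triage(record)   # the production path stays authoritative
    shadow = new_triage(record)        # the new path is observed only
    if shadow != decision:
        log.info("Disagreement on %s: legacy=%s, new=%s", record["id"], decision, shadow)
    return decision

for rec in ({"id": "A1", "priority": 1}, {"id": "A2", "priority": 1, "flagged": True}):
    handle(rec)
```

Disagreements collected this way point you at exactly the real-world interactions a lab test would have missed.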
Another topic that often arises in architecture discussions with my clients is that of a phased rollout: gradually replacing elements of a pre-existing system with elements of a new one until everything has been replaced, versus changing everything at once. There are pros and cons to each: a phased rollout allows users to acclimatize gradually to the new design, ensuring that changes come in manageable amounts and users aren’t overwhelmed; on the other hand, it can drag the process out over an extended period and result in a constant state of transition. Immediate rollouts are beneficial in that they “rip the band-aid off” and speed things along, but the sudden, drastic changes can overwhelm users.
For life-critical systems, I’ve found that phased rollouts are generally safer, as they allow incremental evaluation of individual components in a production environment and allow smaller reversions if something needs to be rolled back. This is something that needs to be evaluated on a case-by-case basis, however.
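To make that concrete, here is a minimal sketch of one way a phased rollout can be wired up in code: a hypothetical feature flag deterministically routes a configurable fraction of users to the new component while everyone else stays on the proven path. The names and handlers are illustrative, not from any particular framework.

```python
import hashlib

# Hypothetical phased-rollout switch: deterministically route a configurable
# fraction of users to the new component so each user sees a consistent path
# and a rollback is a one-line configuration change.

ROLLOUT_PERCENTAGE = 10  # start small; raise it after each evaluation period


def assigned_to_new_system(user_id: str, percentage: int = ROLLOUT_PERCENTAGE) -> bool:
    """Deterministically bucket a user into the new or legacy path."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percentage


def handle_request(user_id: str, payload: dict) -> str:
    # Placeholder handlers standing in for the real components.
    if assigned_to_new_system(user_id):
        return f"new system handled {payload['type']} for {user_id}"
    return f"legacy system handled {payload['type']} for {user_id}"


if __name__ == "__main__":
    for uid in ("unit-12", "unit-47", "dispatch-3"):
        print(handle_request(uid, {"type": "status_update"}))
```

Setting the percentage back to zero reverts every user to the legacy path immediately, which is exactly the kind of small, fast reversion that matters when lives depend on the system.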
OK, so your user testing helped you design an intuitive system, your beta phase identified emergent issues, your phased rollout allowed users to become comfortable with the design, and everything is running smoothly. You’re done, right? Unfortunately not.
Issues with your system will inevitably arise, there’s no getting around that. When users come across these issues, they will often develop workarounds instead of reporting the problem to the tech team. The workarounds will become standard practice, shared as “tips” from veterans to rookies. Sociologist Diane Vaughan coined a phrase to describe this phenomenon: “normalization of deviance.” Users become so accustomed to deviant behavior that they fail to remember that it is, in fact, deviant.
The classic example is the Space Shuttle Challenger: a component in the solid rocket boosters was regularly observed to erode during launch, and although everyone knew that eroding rocket components was a bad thing, it happened so often that waivers were regularly issued and it was considered normal. On January 28, 1986, the problem that everyone originally knew shouldn’t be allowed, but that they had normalized, resulted in the explosion of Challenger and the deaths of seven astronauts.
As an administrator of a life-critical system, it is up to you to prevent such occurrences from happening. Study how your users interact with the system. Shadow them for a couple of days and see if they’re using unexpected workarounds. Periodically send out surveys to ask for suggested changes and improvements. Dedicate time and effort to improving your system before your users find ways to work around the issues without you.
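If your system emits usage telemetry, you can also look for workarounds in the data before you go shadowing anyone. The sketch below is a hypothetical illustration: it counts the action sequences users actually perform and flags frequent ones that don’t match any documented workflow, which is often how normalization of deviance first shows up in a log file.

```python
from collections import Counter

# Hypothetical workflows and event names, invented for illustration.
DOCUMENTED_WORKFLOWS = {
    ("open_patient_record", "attach_ecg", "submit_report"),
    ("open_patient_record", "submit_report"),
}


def find_candidate_workarounds(sessions, min_occurrences=25):
    """Return undocumented action sequences seen at least `min_occurrences` times."""
    counts = Counter(tuple(actions) for actions in sessions)
    return {
        seq: n
        for seq, n in counts.items()
        if seq not in DOCUMENTED_WORKFLOWS and n >= min_occurrences
    }


# Example: a frequent export/re-import sequence gets flagged, which is a cue to go
# talk to those users and find out what the tool is missing.
sessions = (
    [["open_patient_record", "export_pdf", "reimport_pdf", "submit_report"]] * 40
    + [["open_patient_record", "attach_ecg", "submit_report"]] * 200
)
print(find_candidate_workarounds(sessions))
```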
It is often the case that failures in life-critical systems occur when users suffer from stress, adrenaline surges, and tunnel vision. It happened to the pilots of Air France 447, it has happened to paramedics who couldn’t remember how to operate their cardiac monitor on their first cardiac arrest call, and it has happened to soldiers who couldn’t operate their radios properly while under fire.
Some of this risk is ameliorated by intuitive designs, as discussed earlier, but that alone is insufficient. Particularly in environments where high-stress scenarios do occur, but only infrequently, it is essential to train your users not just in how to operate your system, but in how to think clearly under such conditions. If users merely memorize the actions needed to operate equipment, they can’t deal with unexpected failures, because the actions they learned may no longer be appropriate; if they are trained to think logically and clearly under stress, they can respond to changing circumstances and help keep your system alive when something breaks.
Obviously, developing, deploying, and maintaining life-critical software is a hell of a lot more complex than can be detailed in a single article, but these areas of consideration should give you a better idea of what to expect when you’re thinking about working on such a project. In summary: do the math and show stakeholders that the expected benefits outweigh the risks; design for your users’ mental models rather than your own, and involve real users at every stage; use a supervised beta phase to surface emergent behaviors; favor phased rollouts that allow small, fast reversions; watch for normalization of deviance before workarounds become standard practice; and train your users to think clearly under stress rather than just memorize procedures.
In closing, I’d like to note that in complex safety-critical systems, things will go wrong at some point no matter how much planning, testing, and maintenance you do. When those systems are life-critical, it’s quite possible that a failure will lead to loss of life or limb.
In the event that such a tragedy does occur with something you’re responsible for, it will be a crushing weight on your conscience and you will likely think that you could have prevented it if you’d paid more attention or worked harder. Maybe that’s true, maybe it’s not, but it is impossible to foresee every possible occurrence.
Work meticulously, don’t get cocky, and you’ll be making the world a better place.
A life-critical system is a system whose failure or malfunction may result in death or serious injury. It comprises all software and hardware necessary to perform a critical function.
Dependability is a measure of a system's availability, reliability, and maintainability. In general, it is a measure of the confidence that a system will perform as expected.
Safety-critical elements are systems or components that are designed to prevent, control, mitigate, or respond to system malfunctions or accidents that could lead to injury or death.