Accountability: A Message To All Software Engineers

Published on 19 Nov 2010

This article is a part of a series of blog posts that served as an assignment for the course titled Social Implications of Computing (CIS*3000) during my undergraduate studies at the University of Guelph. It was originally published on a free Wordpress.com site I had created for the course.

When groups of professionals work together towards the production of a good or a service, a certain sense of responsibility must be expected from each and every one of them. Consider a firm's construction project: the architects and designers are responsible for drawing up blueprints of efficient, appealing, and practical structures; the engineers are expected to work according to these designs and to perform their tasks with the utmost attention to accuracy. Each of these individuals is responsible for his or her contribution towards the erection of the structure. When disaster strikes, however, it is the firm that is accountable for the damage caused. This might sound normal and fair at first; people are expected to be responsible for their actions, and when their actions prove to be harmful, they are penalized accordingly. What some might miss is that there is a distinct difference between accountability and responsibility. Not every person responsible for something is held accountable for it as well.

Responsibility vs. Accountability

As software engineers, it is important to understand this difference very clearly. Over the past few decades, there have been many disasters with causes ranging from incompetence to mere ignorance. Many companies have seen their downfall due to liabilities, and in most cases, rightfully so. However, not all those responsible for the damages were held accountable for them as well; software engineers tend to be given more than a second chance. This becomes evident when we look at a famous blunder of the past.

Therac-25: Background

The accidents of the Therac-25 are arguably among the biggest computer-related disasters ever to occur. The Therac-25 was a computer-controlled radiation therapy machine, manufactured by Atomic Energy of Canada Limited (AECL), that was used in the treatment of cancer during the 1980s. It is a linear particle accelerator (linac); electrons are accelerated to extremely high speeds in order to create high-energy radiation beams that can be used to destroy tumors while causing minimal damage to surrounding body tissue. Exposure to these high levels of radiation, even for fairly brief periods, is very dangerous, so operating the Therac-25 required extreme accuracy and alertness. The main design aspect of the machine was its use of a turntable to rotate equipment into the path of the beam in order to produce two therapy modes: electron mode and photon mode. A third position involved no particle acceleration at all; it used a light beam to guide the positioning of the patient. The electron mode delivered small doses of high-energy electrons over short periods of time, while the photon mode ran the electron beam at its maximum energy (25 MeV) and directed it onto a target to produce X-rays.

Therac-25: The Fatality

Disaster struck when, due to an issue with the software built into the machine, the high-power beam was activated instead of the intended mode, which resulted in the patient receiving a fatal dose of radiation. While many factors contributed to the fatal incidents, the biggest blunder was arguably the poor programming technique used to determine the status of the machine. The software revolved around the execution of subroutines and the tracking of states through variables. A flag variable called Tphase indicated which subroutine was to be executed. A procedure known as Set Up Test performed various checks to ensure the machine was in the correct position, and a separate flag variable recorded whether the machine's devices were in the correct position. Until clearance for all devices was given, the test was rescheduled again and again. The flag was meant to be 0 when all devices were in the correct position; whenever this was not so, it was incremented by 1. Because the test ran so many times, the flag climbed to a high value. The problem was that the flag was stored in a single byte, so on every 256th increment it overflowed and wrapped back around to 0. This effectively told the software that the machine’s devices were in the correct position when they were actually in another position. As a result, the patient received a high-energy dose of radiation while the turntable was still in the light position. Repeated occurrences of this bug caused severe injuries, which eventually led to the deaths of several patients.
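To make the wrap-around concrete, here is a minimal sketch of the failure in Python. It is not the Therac-25 code; the names `BYTE_MASK` and `run_setup_passes` are made up for illustration, and the sketch assumes the flag lives in a single 8-bit byte and that the device stays out of position for the whole run.

```python
BYTE_MASK = 0xFF  # assumption: the flag is stored in one 8-bit byte


def run_setup_passes(passes):
    """Return the pass numbers at which the flag wrongly reads 0.

    The device is assumed to be out of position on every pass, so the
    flag is incremented each time and should never signal clearance.
    """
    flag = 0
    false_clearances = []
    for pass_number in range(1, passes + 1):
        flag = (flag + 1) & BYTE_MASK  # the increment wraps from 255 to 0
        if flag == 0:  # a later check reads 0 as "all devices in position"
            false_clearances.append(pass_number)
    return false_clearances


print(run_setup_passes(600))  # [256, 512]
```

Under these assumptions, one pass in every 256 reports a machine that is ready when it is not. The window is tiny, which is part of why a fault like this is so hard to catch in casual testing.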

Small Mistake, Huge Consequences

This problem was not really a bug; it was just bad programming. The correct technique would be to use two fixed values to indicate status: 1 if the devices need checking, and 0 if the machine is in the right mode. This bad programming most definitely stemmed from irresponsibility on the part of the programmers. Such programming was inferior even for its time; the Motorola 68k, for example, was an excellent microprocessor, partly because of how well it lent itself to assembly language. Although it is possible that the programmers were low-skilled, this is highly unlikely. Researchers later found that AECL did not have its source code reviewed independently, that the software did not provide any meaningful error codes (which led operators to ignore them), and that the Therac-25 was not tested as a complete system (both software and hardware) until it was assembled at the hospital. Clearly, a proper software development methodology was barely followed, if not ignored outright.
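The fixed-value technique described above can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions; `NEEDS_CHECK`, `IN_POSITION`, `update_flag`, and `flag_after_passes` are hypothetical names, not anything taken from the original software.

```python
NEEDS_CHECK = 1  # devices still have to be checked
IN_POSITION = 0  # all devices are in the correct position


def update_flag(device_in_position):
    """Assign one of two fixed values instead of incrementing a counter."""
    return IN_POSITION if device_in_position else NEEDS_CHECK


def flag_after_passes(passes):
    """Run the check repeatedly while the device stays out of position."""
    flag = NEEDS_CHECK
    for _ in range(passes):
        flag = update_flag(False)  # the device never reaches its position
    return flag


print(flag_after_passes(256))  # 1
```

Because the flag is assigned rather than incremented, its value depends only on the latest check and not on how many times the check has run, so there is nothing left to overflow.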

Analysis: Need For Accountability (Not Always Speed)

Looking at all of this, it is clear that general irresponsibility and a lack of professionalism were major causes of the Therac-25 disasters. In the end, although the software engineers were partly responsible for the harm done to the six victims of the disaster, they were not held accountable for it. Instead, AECL was accountable for the damages. This gap between responsibility and accountability is known to be a huge problem for companies. Even though programmers are responsible for software bugs, they still get paid their salaries, while the company itself answers for the problems and issues. This is evident with Microsoft; even though there are thousands of bugs and errors within the Windows operating system, the tendency is to criticize the company as a whole rather than the individual engineers, as the widespread anti-Microsoft sentiment of past years shows. This is not acceptable, especially when the outcomes can be fatal. A lack of accountability among software engineers leads to increased irresponsibility, which in turn results in fatal situations like those of the Therac-25.

Final Words

As software engineers, it is critical that we approach every problem posed to us with the most profound sense of accountability. The irresponsible design of the Therac-25 must never be repeated. We must realize that our innovations can really change lives, for the better AND for the worse. We must be willing to take responsibility for the outcomes of our decisions, even if the decision is as simple as choosing the most reliable way to use flag variables. Only then can we be in the vicinity of perfection; after all, perfection is the absence of mistakes. There was a tradition in Ancient Rome: whenever an arch was being constructed, the engineer managing the project would stand under the arch while the capstone was being hoisted into place. While the engineer may not have been responsible for placing each and every block, he was willing to put his life at stake for the product of his team. Perhaps it is time for us to stand under our own arches.

For further reading: An updated case study of the incident, race condition