All software systems involve people, but people are fallible and make mistakes. What can be done to create safer systems and minimise human error?
Systems in Action: To engineer is human Part 1
The hotel lobby was crowded for the Friday evening tea dance. Some of the crowds were on the walkways above the lobby. The strain was too great for the third floor walkway which collapsed onto the walkway below it. That too gave way and tons of concrete and steel fell onto the revellers below. The victims didn’t have time to get away or according to eye witnesses just stood still, transfixed with terror as the collapsing structure fell on them. The rescue operation began immediately, the first task being to save the lives of people trapped in the rubble.
In America AT&T isn’t the only long-distance company but it’s the largest and claims to be a world leader in communications technology. Much to their embarrassment, when the system ground to a halt at least 50 million long distance calls were lost. For everyone involved it was time wasting and frustrating.
The government has ordered an enquiry into the London Ambulance Service after its computerised call-out system collapsed at the beginning of this week. The new system was supposed to allocate emergency calls more efficiently, instead it delayed dozens of calls for hours resulting, the unions say, in between 10 and 20 deaths.
Only a coroner’s court will decide if Christine Dance’s husband was one of the victims but she believes he was. Just before 10 o’clock yesterday morning she dialled 999 because her husband Roger was choking, but it took 5 or 6 calls before an ambulance finally arrived at twenty past twelve, by then her husband was dead.
All of these incidents resulted from failures of systems that were engineered. In this programme we’ll be looking first at why the systems failed and then at how software engineers try to ensure such failures are rare events.
...the first task being to save the lives of people trapped in the rubble. It was the worst disaster in the history of Kansas City and it happened in a hotel that was, architecturally speaking, the pride of the city. The Hyatt Regency...
But within months two crowded walkways, suspended across the vast hall collapsed. Why did this happen? What went wrong? The enquiry revealed just two contributory factors.
Architectural drawings reveal a novel design. The cross beams on which the two walkways rested were supported by just one long rod attached to the ceiling. Each cross beam supported one walkway. However, the rod was only 60% as strong as the Kansas City building code required, although engineering codes of practice normally allow much more than minimum strength. In implementing the design, the construction company found it hard to assemble the walkways with a single steel rod, so the connection system was modified. The single rod was replaced by two rods. Now the bottom rod supported the lower beam and the top rod supported the upper beam. The new arrangement meant the upper beam now bore most of the weight of both walkways, a weight that even unloaded it could barely take.
The additional weight of people on the walkways that night was too much and the structure collapsed. The building of a hotel is a clear example of engineering, in this case of civil engineering. But what has good engineering practice to do with the installation of a telephone system dependent upon software?
Initially the problem was described as an anomaly but the scale soon became clear.
So why did the system fail? In common with several other telephone companies, AT&T were moving over to a new computer controlled switching system, SS7. Investigators traced the failure to SS7. The problem began at just one switching centre in New York. A minor mechanical malfunction tripped the switch. The controlling computer sent messages to the 113 other centres nationwide to route no new calls to the New York centre until the switch was reset. After six seconds the reset was completed. The New York switching centre began sending out new long distance calls. As switching centre computers received the calls, they updated their software to renew their routing to New York.
Unfortunately the timing between the calls was spaced in a pattern which somehow triggered a software fault at the receiving centre. Each switching system had been designed with a back-up computer but the back-up computer reacted in an identical way. It too shut down and this sequence of faults snowballed. As each centre came back online, the long distance calls it sent triggered faults at receiving centres causing them to shut down to reset. The computer architecture had been designed to prevent catastrophic failures in the network by duplicating nearly every piece of equipment, but it couldn’t cope when a software error struck both primary and back-up computers at once.
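The weakness described here, a back-up that runs the same software as the primary, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the "less than 10 ms apart" condition is an invented stand-in for the timing pattern, which the programme does not specify.

```python
def handle_message(gap_ms: int) -> str:
    """Hypothetical call-routing handler containing a timing-dependent fault."""
    # Assumed fault (invented for illustration): messages arriving less
    # than 10 ms apart make the handler fail.
    if gap_ms < 10:
        raise RuntimeError("software fault: messages arrived too close together")
    return "routed"


def switching_centre(gap_ms: int) -> str:
    """Try the primary computer, then an identical back-up."""
    for computer in ("primary", "back-up"):
        try:
            return f"{computer}: {handle_message(gap_ms)}"
        except RuntimeError:
            # The back-up runs the same code on the same input,
            # so it hits the same fault as the primary.
            continue
    return "centre down: resetting"


print(switching_centre(50))  # primary: routed
print(switching_centre(4))   # centre down: resetting
```

Duplicating the hardware protects against a component breaking, but in this sketch both computers share one piece of logic, so a single design error takes out both.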
Systems in Action: To engineer is human Part 2
Typically telephone systems contain millions of lines of code. There were around ten million lines in the troublesome SS7 system, far too many to be able to check all of them, even for simple errors.
In the autumn of 1992 the London Ambulance Service installed a new computerised system to manage the despatch of vehicles in response to calls. Initially all went well, but within weeks the system failed with tragic consequences. Why did it fail so disastrously? The report identified two major reasons, both involving computers called file servers. The first was a minor programming error.
In carrying out some work on the system some three weeks previously, the programmer had inadvertently left in the system a piece of program code that caused a small amount of memory within the file server to be used up and not released every time a vehicle mobilisation was generated by the system. Over a three-week period these activities had gradually used up all available memory, thus causing the system to crash.
And the second reason? Well, the original specification had included the provision of a backup computer, that is another file server, to take over in the event of problems. But as the report states:
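The kind of fault the report describes, memory taken on every mobilisation and never given back, can be sketched as follows. This is a toy model, not the ambulance service's code: the class, the function names and the memory budget of 1000 units are all hypothetical.

```python
class FileServer:
    """Toy file server with a fixed memory budget (hypothetical numbers)."""

    def __init__(self, memory_units: int) -> None:
        self.free = memory_units

    def mobilise_vehicle(self) -> None:
        # Each mobilisation takes one small unit of memory for its work...
        if self.free == 0:
            raise MemoryError("file server out of memory")
        self.free -= 1
        # ...and the leftover code never releases it.
        # The missing step would be: self.free += 1


def mobilisations_until_crash(memory_units: int) -> int:
    """Count how many mobilisations succeed before memory runs out."""
    server = FileServer(memory_units)
    count = 0
    try:
        while True:
            server.mobilise_vehicle()
            count += 1
    except MemoryError:
        return count


print(mobilisations_until_crash(1000))  # 1000
```

Each individual call looks harmless, which is why a fault like this can pass short tests and only show itself after weeks of live use.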
The fall-back to the second server was never implemented. It was always specified and thus arguably would have activated had the system actually crashed on the 26th and 27th October 1992.
The report makes it clear that the programming error would not have been detected through conventional programmer or user testing and that it was caused by carelessness and lack of quality assurance. There was also considerable criticism of the way the entire software project had been managed. So can software be more reliable? Should we trust our lives to software systems? What are the costs in people and time?
I’m perfectly happy trusting my life to a system containing software. I am not sure that that makes me the same as most people in the country. I know that software can be made reliable enough to be as good as other engineering structures. I also know that quite often it isn’t built in those sorts of ways, and that alarms me. So although I have concerns about software and software reliability, my concerns are really about the claims that people make for it rather than necessarily about the end result.
4th May 1989, NASA launches the space shuttle Atlantis into orbit around earth for a routine mission. The software for NASA projects costs many millions of dollars. For many flights in space the computer system is the only life support system for the flight crew. It’s a safety critical system, it has to work. On this occasion aboard the space shuttle is the outcome of a project that began nearly eight years earlier. Once in orbit about earth, Atlantis releases the spacecraft Magellan, the first stage of a journey to Venus that was to last 15 months. What was the purpose of the Magellan project?
Magellan is a spacecraft, specified, designed, built and launched by NASA whose purpose was to travel to the planet Venus and map its surface by radar imaging, and you have to use radar because Venus is permanently covered with clouds and you’re not able to use conventional video cameras.
Systems in Action: To engineer is human Part 3
Most commercial software is large, complex and subject to deadlines. The amount of money committed to the development and implementation of any project is bounded and has to be kept within a budget. In the next part of this programme we’ll be looking at the production of commercial software.
The A320 series of Airbuses are famous because they use a completely electronic flight control system, the so-called fly-by-wire system. The idea of using software in aircraft flight controls is not a new one. Autopilots have used software for many years. But that’s not what the true fly-by-wire system involves.
So what is fly-by-wire? Well, from a pilot’s point of view the fly-by-wire aircraft is a much more pleasant aircraft to fly. For a start there are no highly cumbersome mechanical controls connecting the cockpit to the other parts of the aircraft. Instead of mechanical connections, each pilot has a side-stick which passes his instructions through computers to actuate the controls of the elevators on the tail plane, or the spoilers and ailerons on the wings. During normal operation there are no physical connections between the flight controls in the cockpit and those on the wings and tail, although some mechanical links are retained for the very improbable occurrence of total loss of the computers. This means that the cockpit controls are easy to use and that, whether flying a large aircraft or a smaller one, the handling is the same.
Large commercial software engineering applications can never be 100% error free. Checking for every possible error would take far too long and cost too much. The best that can be hoped for is that software which handles safety-critical parts of any system is thoroughly checked. Even then some errors are likely to remain. Airbuses are designed and built by Airbus Industrie, a European consortium of four aircraft manufacturers. Transport planes like this Super Guppy fly components of the Airbus made elsewhere in France, in Germany, in Spain or in the UK, to the Airbus Industrie headquarters in Toulouse for final assembly.
So what is involved in designing the computerised flight control system for an Airbus? How is the way that the software is structured reflected in the way it’s checked? Who does the checking and when? What is the iron bird? And what happens when modifications of the software may be needed in the future?
First of all, you start by specifying what you want the aeroplane to do in more and more detail. You then specify what you want an individual computer to do, in doing its individual job and you also specify what safety features or monitoring you need to be put into a computer so the computer itself can be properly designed before they start writing the software.
We have five computers for the flight control system, but in fact we have two sets of computers. One is based around one kind of microprocessor, the other is based on another kind of microprocessor. For example, on the Airbus A320, for one kind of computer we use a Motorola 16-bit microprocessor, and for the other kind of computer we use the Intel family of microprocessors. Inside one computer the two channels are in fact totally independent; they are to be considered as two elementary computers. We also force the dissimilarity between these two channels as far as possible, obviously by using different languages, but mainly by the fact that these two channels are asynchronous; there is no strict synchronisation between them.
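One way to picture why two independent, differently written channels are useful is to have each compute the same output and compare the results. The speaker does not describe how the channels are compared, so the comparison step, the function names, the 30 degree limit and the tolerance below are all assumptions made for this sketch in Python.

```python
def command_channel(stick: float) -> float:
    """First implementation: limit the stick input to -1..1, scale to degrees."""
    return max(-1.0, min(1.0, stick)) * 30.0


def monitor_channel(stick: float) -> float:
    """Second implementation of the same rule, deliberately written differently."""
    if stick > 1.0:
        stick = 1.0
    elif stick < -1.0:
        stick = -1.0
    return 30.0 * stick


def checked_output(stick, channels=(command_channel, monitor_channel), tolerance=0.5):
    """Return the agreed demand, or None if the two channels disagree."""
    first, second = (channel(stick) for channel in channels)
    if abs(first - second) > tolerance:
        return None  # disagreement: this computer should hand over to another
    return first


print(checked_output(0.5))  # 15.0
# A hypothetical faulty channel that always outputs zero is caught:
print(checked_output(0.5, channels=(command_channel, lambda s: 0.0)))  # None
```

Because the two channels are written separately, a mistake in one is unlikely to be repeated in exactly the same form in the other, so a disagreement exposes it.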
When you start writing the software in fact, you start from a very formal set of specifications and you write the software in relatively small chunks so that you don’t end up with one enormous mass of spaghetti.
Systems in Action: To engineer is human Part 4
Inside the laboratory we receive computers and elements of the flight control system from the suppliers of the system and we check them independently of the designer, of the design office. We build an assembly including all the parts of the system. As far as the flight control system is concerned, we have the iron bird, with the mechanical parts of the flight control system, and here we have a cockpit with a side-stick controller which is coupled with a computer of the flight control system, so we can validate the flight control system in its raw parts in order to check both the functional operation of the system and its operational aspects. Each problem which is found inside the laboratory is transmitted to the design people in order to correct the releases of the software or the computers.
You check the code that is in an individual module of software, which is relatively small and understandable, and then you check how that module of software connects up with other modules of software and that the information is properly transferred. When you’ve finished checking that all the modules are correctly interconnected with each other, you then go into checking whether the thing is doing what you intended in the first place, by building it and then making sure the system carries out its intended function. And after that you fly it in the aeroplane and you make sure that the aeroplane, with the system in it, is doing what is intended.
When we make a modification because the aeroplane’s been in service for a few years and needs an update, we go through part of the original design process in the same way that we did when we first designed the machine. This is why it is important to have small, understandable pieces of code instead of one big heap of spaghetti code.
Software doesn’t fail the way that hardware fails in general. Hardware, whether it’s mechanical systems, pulleys and levers, whether it’s physical structures like the wings of aeroplanes or whether it’s electronic circuits, tends to be relatively simple compared with the enormous design complexity of a computer software program. Consequently most hardware failures are failures when the components of the hardware break: the valve sticks, the hinge comes undone, you get metal fatigue in the wing, the electronic component burns out because it’s past its design lifetime, or it’s been hit by an alpha particle, or whatever it happens to be. Software doesn’t have those problems. All software failures are design or specification errors; they’re mistakes made by human beings who simply didn’t get the software right.
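The first three stages described here (check each small module, check how modules connect, then check the whole thing against its intended function) can be sketched in Python as below. The modules, their names and the 30 degree limit are hypothetical and far simpler than any real flight control software; the final stage, flying it in the aeroplane, has no equivalent in code.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Small module 1: keep a value within limits."""
    return max(low, min(high, value))


def stick_to_degrees(stick: float, max_deflection: float = 30.0) -> float:
    """Small module 2: scale a stick position (-1..1) to degrees."""
    return stick * max_deflection


def elevator_demand(raw_stick: float) -> float:
    """The two modules connected together."""
    return stick_to_degrees(clamp(raw_stick, -1.0, 1.0))


def run_checks() -> str:
    # Stage 1: check each module on its own.
    assert clamp(2.0, -1.0, 1.0) == 1.0
    assert clamp(-0.25, -1.0, 1.0) == -0.25
    assert stick_to_degrees(0.5) == 15.0
    # Stage 2: check that information passes correctly between modules.
    assert elevator_demand(2.0) == 30.0
    # Stage 3: check the intended function of the whole: the demand
    # must never exceed 30 degrees, whatever the stick input.
    assert all(abs(elevator_demand(s / 10)) <= 30.0 for s in range(-50, 51))
    return "all checks passed"


print(run_checks())  # all checks passed
```

Keeping each module this small is what makes the first stage practical, and it also means a later modification only needs the affected modules and their connections to be rechecked.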