Speaking Notes
System Safety Society - Canada Chapter's Springtime Symposium
Kathy Fox, Board Member
10 June 2010

Check against delivery

Slide 1: Title Page

Good morning, and thank you for that introduction.

It's a pleasure to be here and present on Systems to Manage Safety. My presentation will focus on the challenges in implementing a system to manage safety viewed from a management perspective.

Slide 2: Outline

First I will briefly review how my own thoughts about accident causation and prevention have evolved and why some of the traditional approaches to managing safety are changing.

Using examples from Transportation Safety Board investigations, I'll outline some of the lessons to be learned about organizational drift, employee adaptations, hazard identification, and incident reporting. Then I'll discuss how a Safety Management System (SMS) can help reduce risks, provided that it is effectively implemented.

Although I'll mainly be citing examples from the aviation sector, many of the principles, processes, and lessons learned are transferable to other industries.

Slide 3: Early Thoughts on Safety

My whole career has been about practicing safety: as an air traffic controller, as a commercial pilot, as a flight instructor and a pilot examiner. At the Transportation Safety Board (or TSB), I and the other Board members review accident investigations, analyze safety deficiencies, and make recommendations to help regulators, operators, and manufacturers reduce the risks.

Early in my career, though (and this may seem odd), I didn't often think about what the word "safety" really meant. I was taught, and I believed, that things would be "safe" as long as you followed standard operating procedures. Don't break any rules, make sure the equipment isn't going to fail, pay attention to what you're doing, and don't make any stupid mistakes.

This belief, that accidents only happened to those who didn't follow the necessary steps, persisted even after I became responsible for investigating Air Traffic Control incidents. When things went wrong, be it a loss of separation or some other incident, we often attributed causes merely to "not following procedures" or "loss of situational awareness". We didn't always look deeper, and search for the "whys".

Slide 4: Safety Does Not Equal Zero Risk

When I became Director of Safety and Quality, responsible for implementing a safety management system for the then-recently privatized Canadian Air Navigation Services provider, I started explicitly thinking, and speaking, about "risk management versus safety" and I realized that "safety" doesn't actually mean "zero risk".

Slide 5: Balancing Competing Priorities

Many companies claim that "safety is our first priority", but there's plenty of convincing evidence to suggest their top priorities are really "customer service" or "return on shareholder investment." That being said, there's no getting around the fact that products and services have to be "safe", especially if companies want to stay in business, avoid accidents and costly litigation, and maintain customer confidence and a competitive advantage.

Organizations, therefore, must balance competing priorities, and manage multiple risks: safety, economic, regulatory, financial, environmental, technological, to name just a few.

Some of my greatest insights came from studying under Dr. Sidney Dekker, Professor of Human Factors and Flight Safety and Director of Research at the School of Aviation at Lund University in Sweden.

Professor Dekker maintains that:

Safety is never the only goal: organizations exist to provide goods and services and to make money at it;

And because there's never just one goal, people have to reconcile these goals simultaneously: for example, efficiency versus safety;

Moreover, safety isn't automatic. People have to create it, through practice, at all levels of an organization.

However, the pressures of production influence people's decisions. For example, investigations and case studies often reveal how factors such as the emphasis on productivity and cost control can result in trade-offs, which inadvertently contribute to the circumstances that can lead to accidents.

Slide 6: Sidney Dekker: Understanding Human Error

Immediately following an accident, people on the street and news media want to know if it was caused by "mechanical failure" or "human error".

Dekker is also pretty clear about what we refer to as "human error".

It is not, he insists, a cause of failure. Rather, human error is the effect, or symptom, of deeper trouble. Human error is not random. It is systematically connected to features of people's tools, tasks and operating environment.

Slide 7: Why Focus on Management?

So, that leads into why we should focus on management.

Moreover, management decisions have a longer-term effect.

Mostly, though, it's because managers create the operating environment. They provide the tools, the training, and the resources required to produce goods and services. Managers also establish and communicate goals and priorities.

But here is the problem:

Given the constant need to reconcile all these competing goals and priorities, and given the uncertainties involved in assessing safety risks, how can managers recognize if, or when, they are drifting outside the boundaries of safe operation?

Slide 8: Drift

Because drift is normal. It may not be ideal, but it happens. Because we're human.

Drift is generated by normal processes of reconciling differential pressures on an organization (efficiency, capacity utilization, safety) against a background of uncertain technology and imperfect knowledge. (Dekker 2005:43)

Typically, drift is fuelled by the twin pressures of scarcity and competition. Without knowledge of where the boundaries actually are, people don't see the drift, and thus they don't do anything to stop it.

Drift into failure is about the slow and incremental movement of systems operations toward the boundary of acceptable performance. The question, though, is how far you can safely drift before...

Slide 9: Drifting Into Failure

Slide 10: Organizational Drift

Even with the best of intentions, it's very possible for organizations to develop policies and procedures aimed at mitigating known safety risks but which then subsequently erode under production pressures. The following case provides a concrete example:

In 2004, an MK Airlines Boeing 747 on an international cargo flight crashed during take-off from Halifax International Airport. Why? Because the crew inadvertently calculated performance data for take-off using a lower aircraft weight from a previous leg. This resulted in incorrect take-off speed and a thrust setting too low to enable the aircraft to take off safely, given its actual weight.

The TSB's conclusion? Crew fatigue likely increased the probability of errors in entering and calculating takeoff performance data. That same fatigue, in combination with the dark take-off environment, also degraded the flight crew's ability to detect these errors.

Slide 11: Organizational Drift (continued)

So, "human error," right? But why? How did this happen? Well, it started with good news: the company was experiencing significant growth. That combined with their recruitment strategy, however, led to a problem: a shortage of flight crews.

What was the company response? They gradually increased the maximum allowable duty period from 20 hours (with a maximum of 16 flight hours) to 24 hours (with a maximum of 18 flight hours) and reduced the number of pilots in the flight crew from 4 to 3. At the time of the accident, this particular flight crew had been on duty for almost 19 hours and, due to earlier delays experienced, would likely have been on duty for approximately 30 hours at their final destination had the remaining flights continued uneventfully.

The TSB investigation revealed that the company's crewing department routinely scheduled flights in excess of the 24-hour limit. This routine non-adherence to their own operations manual contributed to an environment where some employees and company management felt it was acceptable to deviate from company policy and/or procedures, so long as it was considered necessary to complete a flight or a series of flights.

Sociologist Diane Vaughan has described this as the "normalization of deviance." In other words, deviations from a standard become institutionalized so that they, in effect, become a new standard.

Slide 12: Organizational Drift (continued)

Drift is particularly hard to spot from inside an organization because... well, because incremental changes are always occurring. In fact, drift is usually only visible (to outsiders) after an adverse outcome (such as an accident), and primarily thanks to the benefits of hindsight.

So although nothing can prevent drift, you can find ways to try to identify it.

Slide 13: Safety Management Systems (SMS)

SMS is an internationally recognized management tool intended to help with this. Although individuals routinely make decisions about risk, SMS takes an organizational approach, by making the proactive identification of hazards and risks, and the management of those risks, part of everything a company does. A well-functioning safety management system integrates sound risk management policies, procedures, and practices into a company's day-to-day operations.

Put simply, SMS can help organizations find trouble before trouble finds them. SMS can help companies foresee what might go wrong so they can take pre-emptive action.

Slide 14: Evolution of SMS

SMS was first introduced in the 1980s, in the chemical industry, before it gradually migrated to other safety critical industries. In many cases, the motivation for adopting SMS was prompted by tragic, high-profile accidents and the realization that organizations needed to find a better way to prevent such occurrences. In terms of its "intellectual pedigree", SMS has evolved in part from research into high reliability organizations, strong safety culture, and organizational resilience.

Slide 15: Why change?

Some people are skeptical about whether SMS works. They see it as a form of deregulation or self-regulation, which it is not. And understandably, many companies may find it challenging to implement an SMS. But there's good reason to make the switch, starting with the fact that the old methods aren't particularly effective.

Let's look at the traditional approach which, as I've already mentioned, is very much the way I was taught early in my career: follow the SOPs, don't break the rules, pay attention, and don't make any stupid mistakes.

But just because you've complied with the regulations doesn't make things safe. Regulations don't always cover every situation, and they're certainly not flexible enough to help with rare or unforeseen circumstances.

And being reactive? Remember that old saying: "an ounce of prevention is worth a pound of cure". All too often, steps taken following an accident have only addressed the symptoms rather than the underlying safety deficiencies.

Then there's the philosophy of "blame and re-train," which is perhaps the most common approach: "Punish the person who messed up. They'll quickly learn not to do that again."

But here's the irony: criminalizing human error can actually have a detrimental effect on safety. Sidney Dekker's book, Just Culture, explores this. When a professional mistake is put on trial, he says, "rather than investing in safety improvements, people in the organization or profession invest in defensive posturing... Rather than increasing the flow of safety-related information, legal action has a way of cutting off that flow."

Slide 16: TSB Mandate

Before I share some further examples from TSB investigations, let me tell you about our mandate.

The goal of the TSB is to advance transportation safety in the air, marine, rail and pipeline modes. We are not a court, and we do not assign fault or determine civil or criminal liability. We aren't a regulator, and we don't have powers of enforcement.

During an investigation, TSB investigators identify safety issues by assessing the technical, operational, and human factors related to an occurrence. They then determine what the unsafe acts and conditions are, as well as any other underlying factors that might have an influence on safety. From there, they assess the risks and analyze the defences in place, as well as any other existing risk-control options.

The TSB notifies the industry and regulators as soon as possible when significant safety risks are found. The Board issues recommendations to handle more difficult, systemic issues.

Slide 17: TSB Reports

Keeping in mind our mandate, we have, over the course of numerous investigations, observed some recurring causes and contributing factors even in those organizations with an SMS. These include:

  • Employee adaptations
  • Inadequate risk analysis
  • Goal conflicts
  • Not heeding "weak signals"

Let me give you some examples.

Slide 18: Employee Adaptations

There is frequently a mismatch between written procedures specifying how work should be performed, and how work actually gets done. This difference can cause problems.

Why does it happen? Think about it in the context of limited resources: Faced with time pressures and multiple goals, workers may be tempted to create "locally efficient practices" for one very simple reason: to get the job done.

Accident investigation reports sometimes refer to these as "violations" or "deviations from SOPs". But let's look at this in a different light. Dekker says: "Emphasis on local efficiency or cost-effectiveness pushes operational people to achieve or prioritize one goal or a limited set of goals... (that are) easily measurable... whereas it is much more difficult to measure how much is borrowed from safety. Past success is taken as a guarantee of future safety. Each operational success achieved at incremental distances from the formal, original rules can establish a new norm...Departures from the routine become routine...violations become compliant behavior."

These kinds of departures, these "employee adaptations", can inadvertently sabotage safety.

Slide 19: Employee Adaptations

Let's look at a practical example of this, and what can happen as a result.

On November 11, 2007, a Global 5000 business jet touched down seven feet short of the runway in Fox Harbour, Nova Scotia. The subsequent TSB investigation revealed that the operator endorsed a practice whereby flight crews would "duck" under visual glide slope indicator systems to land as close as possible to the beginning of the relatively short runway. Previously, the crew had flown into this same airport in a Challenger 604, and they were still adjusting to the larger Global 5000. As such, they were unaware of two key factors: the Global's eye-to-wheel height, and the fact that the Visual Glide Slope Indicator in use at Fox Harbour was not suitable for that type of aircraft.

Slide 20: Aircraft Attitude at Threshold

In this occurrence, the manufacturer had recommended procedures for the Global 5000's flight profile and handling techniques. This included crossing the runway threshold at 50' above the runway. The crew, however, flew the same profile they had flown on previous flights, without taking into consideration that the Global 5000 aircraft was bigger than the Challenger 604.

In other words, they misjudged their height and did not recognize that they were too low.

But employee adaptation wasn't the only factor that led to trouble. The company had introduced a pretty substantial equipment change (a new and bigger aircraft) without first carrying out an effective risk analysis. As a result, they did not consider all the operational impacts of this change, or how the company-endorsed practice of "ducking under" (that is, the employee adaptation) could impact safety.

Slide 21: Goal Conflicts

As I mentioned earlier, a recurring factor in many accidents is goal conflicts. Getting something done cheaply, for example, versus getting it done as safely as possible.

Tension between any two goals can prompt deviations from sound risk management practices.

In 2006, a locomotive, pulling a loaded car of lumber, derailed while descending a steep grade near Lillooet, British Columbia, killing two crew members and seriously injuring a third. The locomotive was equipped with standard pneumatic air brakes, but was not equipped with the "dynamic brakes" commonly used in steep mountainous terrain.

Previously, locomotives running along this route had been equipped with both types of brakes. Prior to this occurrence, however, the railway company had re-assigned these locomotives for business reasons. Getting trains safely through that type of terrain, however, required locomotives equipped with dynamic brakes. In this case, a business decision, made without first conducting a thorough risk analysis, led to tragedy.

Slide 22: Weak Signals

Weak signals are another recurring factor.

In 2007, for example, a medical evacuation flight crashed in Sandy Bay, Saskatchewan, killing the pilot in command. The subsequent TSB investigation found that the crew of two pilots was unable to work effectively as a team to avoid, trap, or mitigate errors and safely manage the risks associated with the flight. As our lead investigator at the time put it, "This crew did not employ basic strategies that could have helped prevent the chain of events leading to this accident." This lack of coordination can be attributed in part to the fact that the crew had not received crew resource management (CRM) training.

Previously, there had been numerous "crew pairing issues" with respect to this crew. The company's management knew about this, although they were unaware of the extent to which these factors could impair effective crew coordination.

In fairness, it's sometimes hard to have as much context as you'd like, and it's almost impossible to know the full extent and implication of any single event. For instance, are one or two reported problems just that (isolated conflicts or hazards), or are they evidence, warning signs, of a dangerous trend?

The answer is we don't always know. By nature, "weak signals" may be insufficient to attract the attention of busy managers, who often suffer from information overload while juggling competing priorities under significant time pressures.

Slide 23: SMS: Incident Reporting

One way to amplify "weak signals" is by collecting and analyzing incident reports. Incident reporting is critical to a successful SMS. This, of course, leads to the obvious question, "What's a reportable incident?"

Traditionally these have been defined as "events that result in adverse outcomes."

But what if no one crashed? What if it was "only" an error, or a so-called near miss? Organizations that define a reportable incident too narrowly risk losing key information about events that could indicate potential system vulnerabilities.

In his book Forgive and Remember: Managing Medical Failure, Charles Bosk cautions that dangerous "near-misses" are, as a rule, only appreciated as harbingers of disaster after a disaster. Until then, as highlighted in the Sandy Bay occurrence, they remain weak or missed signals.

However, no matter how much information you have, it's only useful if you know how to analyze it. And many organizations simply have limited resources available to analyze reports, keep track of deficiencies, and identify patterns. Sometimes, much to our dismay, the one person who is best positioned to identify and act on a known hazard is stretched too thin, or focused on other priorities.

Slide 24: SMS: Incident Reporting (continued)

For argument's sake, however, let's assume that your company has enough people, and that you have enough of the right people. How much information, and what kind of information, should they have?

Merely counting errors doesn't necessarily generate any meaningful or relevant safety data. Furthermore, measuring performance based solely on error trends can be misleading, as the absence of errors and incidents does not imply an absence of risk.

Another key aspect is the kind of processes and structures that an organization will need to support incident reporting. Will it be voluntary or mandatory? Will the system identify reporters, be confidential, or anonymous? To whom will the reports be submitted? Are reporters susceptible to discipline?

The consensus, according to Dekker and Laursen, is that fear of retribution hampers people's willingness to report. Conversely, non-punitive systems generate more reports, and by extension more learning, because people feel free to tell about their troubles.

Dekker and Laursen also note something interesting: It turns out that the main reason operators make reports isn't the lack of retribution, but rather the realization they can "make a difference." Dekker and Laursen also note the importance of reporting to an operationally knowledgeable safety group, which in turn helps the reporter make sense of performance and context in which the incident occurred.

Slide 25: TSB Reports

Of course, there were other contributing factors noted in a review of TSB-investigated accidents involving companies that have, or are developing, an SMS. Some are noted on this slide. From a management perspective, note in particular the potential impact of organizational transitions, short-staffing, inadequate supervision, and lack of training.

Slide 26: Implementing SMS: What Works

Interviews with operators who have experience implementing SMS provide some insight into what worked for them.

Leadership. Not just on paper, but real commitment, from all levels of an organization.

Less paperwork. Informants emphasized the importance of creating simple but effective processes and tools that would really be workable in their organization. Once people started using these processes and tools and began seeing the benefits, this reinforced and tended to spread their use throughout the organization.

A sense of ownership by those involved in implementing and applying SMS. In other words, let employees see that they can make a difference, and they will.

And finally, individual awareness, by everyone, that safety isn't just an add-on, after the fact, but rather, that it must permeate all aspects of doing business. People at all levels need to approach their work thinking ahead about what might go wrong.

Slide 27: What Doesn't Work?

Here's a list. In particular, overly bureaucratic processes and documentation, which have been developed without consideration of, or input from, the end-user and with no perceived benefit, will not be used.

Slide 28: Lessons Learned

As TSB occurrence data has shown, goal conflicts, employee adaptations, and drift are naturally occurring phenomena in any complex organizational setting and regularly contribute to incidents and accidents. Organizations implementing SMS can and should learn from these occurrences since they also demonstrate patterns of accident precursors (e.g. not thinking ahead to what might go wrong, not having an effective means to track and highlight recurrent maintenance or other safety deficiencies, and insufficient training and/or resources to deal with unexpected events).

Slide 29: Benefits and Pitfalls

Implementing SMS is not a panacea against accidents. There is, however, a lot of evidence to show that SMS has a positive impact on the way organizations make decisions and manage risk. In particular, some organizations have adopted more formal, structured approaches to searching for and documenting hazards (i.e. a "mindful infrastructure") causing their decision-making criteria to shift conservatively. By implementing such process changes, thinking about what might go wrong becomes part "of the way people go about their work" and "part of (that) organization's culture".

SMS allows organizations to receive more reports from employees about 'near misses', events they would not have heard about previously, but only when employees feel "safe" to report and confident that their reports will be acted upon. These reports can help organizations to identify "drift" and the "boundaries of safe operation". Amplifying 'weak signals' and self-auditing their safety management processes continues to be a challenge but will likely improve as companies gain more experience and their SMS matures.

Slide 30: Conclusion

SMS is only as effective as the organizational safety culture in which it is embedded. SMS won't take hold unless there is a strong underlying commitment and buy-in to safety. But just wanting operations to be safe won't create safety, not unless this commitment is also supported by 'mindful' processes such as formal risk assessments, increased reporting, tracking of safety deficiencies and effective follow-up. While these process changes can stimulate changes in culture, they will only be sustainable in the long term if they are seen to add value.

This has important implications for the successful implementation of SMS:

- Organizations must recognize that it will take unrelenting commitment, time, resources and perseverance to implement an effective SMS.

- Regulators must be sensitive to the challenges companies face as they transition to SMS, and diligent in how they conduct compliance and safety oversight activities. Based on the experience to date, there is a risk that some short-sighted companies will take a minimalist, bureaucratic, or checklist approach to adopting SMS and yet believe they are 'safe' because they have a 'compliant' SMS.

- Inadequate policies covering the use of safety data (e.g. for litigation or enforcement purposes) and increased criminalization of human error will discourage open reporting, a key component of an effective SMS.

- And accident investigators should continually strive to uncover the contextual drivers that influence decision-making, goal conflicts, employee adaptations and 'non-compliance' with formally documented rules, procedures and safe practices to facilitate organizational learning and effective follow-up after an occurrence.

Slide 31: Watchlist

Back in March, around the same time the TSB turned 20, we issued our Watchlist.

We did it because time and again, our investigators have arrived at the scene of an accident to find the same old safety issues, unresolved issues, that remain in need of attention.

Of these issues, we singled out nine that pose the greatest risk to Canadians.

Two of these nine are what we call multi-modal: that is, they don't apply solely to air investigations, or to marine, or rail or pipelines. SMS is one of them.

The Board will continue to monitor the implementation and regulatory oversight of safety management systems in the transportation industry and share our findings with regulators, industry and Canadians.

Thank you very much for your attention. I'd be happy to take a few questions at this time.