Developed by David Jones (Chevron), Walt Frank (ABS Consulting), Karen Tancredi (DuPont), and Mike Broadribb (BP).
Overview
“In our view, the NASA organizational culture had as much to do with this accident as the foam.” CAIB Report, Vol. 1, p. 97
On February 1, 2003, the Space Shuttle Columbia disintegrated during re-entry into the Earth’s atmosphere, killing all seven crewmembers aboard. The direct chain of events leading to the disaster had begun 16 days earlier when the Shuttle was launched. During ascent, 81 seconds after liftoff, a large chunk of insulating foam broke off the external fuel tank, struck the Shuttle, and damaged its critical thermal protection system. The damaged thermal protection subsequently failed when exposed to the intense heat of re-entry during the return to Earth.
“Organizational culture refers to the basic values, norms, beliefs, and practices that characterize the functioning of a particular institution. At the most basic level, organizational culture defines the assumptions that employees make as they carry out their work; it defines “the way we do things here.” An organization’s culture is a powerful force that persists through reorganizations and the departure of key personnel.” CAIB Report, Vol. 1, p. 101
In pursuing the investigation beyond immediate causal contributors, the CAIB was trying to understand two issues in particular:
- Why was it that serious concerns about the integrity of Columbia, raised within one day of launch, were not acted upon in the two weeks available between launch and return? With little corroborating evidence, management had become convinced that a foam strike was not, and could not be, a concern.
- Which of the cultural patterns emerging from the Columbia accident were the same as those first identified after the Challenger tragedy (almost exactly 17 years earlier), and why were they still present?
Through its report, the CAIB has provided a service to all organizations that operate facilities handling hazardous materials or that engage in hazardous activities. Although NASA is a unique organization, with a focused mission, the organizational cultural failures that led to the Columbia disaster have counterparts in any operation with a potential for significant incidents. Key organizational cultural themes emerging from the CAIB report include:
- Maintaining a Sense of Vulnerability. Catastrophic incidents involving highly hazardous materials or activities occur so infrequently that most organizations never have the unfortunate, but educating, opportunity of experiencing one. Operating diligence and management effectiveness can be easily dulled by a false sense of security – leading to lapses in critical prevention systems. Eliminating serious incidents requires constant reminders of the vulnerabilities inherent in hazardous activities.
- Combating Normalization of Deviance. When pre-established engineering or operational constraints are consciously violated, with no resulting negative consequences, an organizational mindset is encouraged that more easily sanctions future violations. This can occur despite well-established technical evidence, or knowledge of operational history, that suggests such violations are more likely to lead to a serious incident.
- Establishing an Imperative for Safety. An organization that is focused on achieving its major goals can develop homogeneity of thought that often discourages critical input. In the case where valid safety concerns are ignored, the success of the enterprise can be put in jeopardy. The CAIB report makes a compelling argument for ensuring strong, independent “sanity” checks on the fundamental safety integrity of an operation.
- Performing Valid/Timely Hazard/Risk Assessments. Without a complete understanding of risks, and the options available to mitigate them, management is hampered in making effective decisions. Organizations that do not actively engage in qualitative and quantitative “what can go wrong?” exercises, or that fail to act on recommendations generated by the risk assessments that are done, miss the opportunity to identify and manage their risks.
- Ensuring Open and Frank Communications. A dysfunctional organizational culture can discourage honest communications, despite formal appearances to the contrary. This is done through established protocols, procedures, and norms that dictate the manner in which subordinates communicate with management, and the manner in which management receives and responds to the information. Barriers to lateral communications (e.g., between work groups) can also impede the free flow of safety-critical information.
- Learning and Advancing the Culture. Organizations that do not internalize and apply the lessons gained from their mistakes relegate themselves to static, or even declining, levels of performance. Safety excellence requires the curiosity and determination necessary to be a learning, advancing culture.
Maintaining a Sense of Vulnerability
Spoken by a NASA official following the Columbia launch, after a significant debris strike had been identified. CAIB Report, Vol. 1, p. 101
Advisory Panel “Kraft Report,” March 1995. CAIB Report, Vol. 1, p. 108
Maintaining a Sense of Vulnerability – Question-Sets for Self-Examination
- Could a serious incident occur today at one of our facilities, given the effectiveness of our current operating practices? When was the last serious close call or near miss? Do we believe that process safety management (PSM) or other compliance activities are guaranteed to prevent major incidents?
- Are lessons from related industry disasters routinely discussed at all levels in the organization, with action taken where similar deficiencies have been identified in our operations?
- Do risk analyses include an evaluation of credible major events? Are the frequencies of such events always determined to be “unlikely?” Have proposed safety improvements been rejected as “not necessary” because “nothing like this has ever happened?” Do risk analyses eliminate proposed safeguards under the banner of “double jeopardy?” (“Double jeopardy” refers to a mindset, too common among process hazard analysis teams, that scenarios requiring two independent errors or failures need not be considered since they are “so unlikely to occur.”)
- Are critical alarms treated as operating indicators, or as near miss events when they are activated? Do we believe that existing safety systems will prevent all incidents?
- Is the importance of preventive maintenance for safety critical equipment recognized, or is such work allowed to backlog? Are the consequences of failure of such equipment recognized and understood by all?
- Are there situations where the benefits of taking a risk are perceived to outweigh the potential negative consequences? Are there times when procedures are deviated from in the belief that no serious consequences will result? What are they? Are risk takers tacitly rewarded for “successful” risk taking?
Combating Normalization of Deviance
Taken From Shuttle Managers Handover Notes on January 17, 2003. CAIB Report, Vol. 1, p. 142
Minutes From Columbia Shuttle Management Meeting, January 24, 2003. CAIB Report, Vol. 1, p. 161
Having lost the sense of vulnerability, the organization succumbed to accepting events that were precluded in the original shuttle design basis. Over the 113 Shuttle missions flown, foam shedding and debris impacts had come to be accepted as routine and as maintenance concerns only. Limited or no additional technical analyses were performed to determine the actual risks associated with this fundamental deviation from the intended design. Each successful landing reinforced the organization’s belief to the point where foam shedding was “normalized.” As new evidence emerged suggesting that the Columbia foam strike was larger, and possibly more threatening, than earlier foam strikes, this information was quickly discounted by management. The “understanding” that foam strikes were insignificant was so ingrained in the organizational culture that even after the incident, the Space Shuttle Program Manager rejected the foam as a probable cause, stating that Shuttle managers were comfortable with their “previous risk assessments.”
Combating Normalization of Deviance – Question-Sets for Self-Examination
- Are there systems in operation where the documented engineering or operating design bases are knowingly exceeded, either episodically or on a “routine” basis? Examples might include flare systems with inputs added beyond the design capacity, process piping or equipment operating at or above the design limits, or systems operated in a significantly different manner than initially intended.
- Have the systems meeting the above criteria been subjected to thorough risk assessments? Did issues of concern emerge from the risk assessments? Were they addressed appropriately?
- Have there been operating situations in the past where problems were solved by not following established procedures, or by exceeding design conditions? Does the organizational culture encourage or discourage “creative” solutions to operating problems that involve circumventing procedures?
- Is it clear who is responsible for authorizing waivers from established procedures, policies, or design standards? Are the lines of authority for deviating from procedures clearly defined? Is there a formalized procedure for authorizing such deviations?
- What action is taken, and at what level, when a willful, conscious, violation of an established procedure occurs? Is there a system to monitor deviations from procedures where safety is concerned? Can staff be counted on to strictly follow procedures when supervision is not around to monitor compliance?
- Do we have management systems that are sufficiently discerning and robust to detect patterns of abnormal conditions or practices before they can become accepted as the norm?
- Are we knowingly accepting practices or conditions that we would have deemed unacceptable 12 months ago? … 24 months ago?
Establishing an Imperative for Safety
Daniel S. Goldin, NASA Administrator 1994. CAIB Report, Vol. 1, p. 106
Whether or not the budget cuts to which Mr. Goldin was referring in the above quote would have actually impacted safety is irrelevant. The impact of such a statement on an organizational culture is significant – especially when coming from a top official. People at all levels feel less compelled to bring up safety matters if they feel that top management is not interested. Others, at lower levels, begin to mimic the attitudes and opinions that they hear from above.
Establishing an Imperative for Safety – Question-Sets for Self-Examination
- Is there a system in place that ensures an independent review of major safety-related decisions? Are reporting relationships such that impartial opinions can be rendered? Is there a “shoot the messenger” mentality with respect to dissenting views?
- Who are the people independently monitoring important safety-related decisions? Are they technically qualified to make judgments on complex process system designs and operations? Are they able to credibly defend their judgments in the face of knowledgeable questioning? Do safety personnel find it intimidating to contradict the manager’s/leader’s strategy?
- Has the role of safety been relegated to approving major decisions as a fait accompli? Do production and protection compete on an equal footing when differences of opinion occur as to the safety of operations?
- Has the staffing of key catastrophic incident prevention positions (process safety management) been shifted, over the years, from senior levels to positions further down the organization? Are there key positions currently vacant?
- Does management encourage the development of safety and risk assessments? Are recommendations for safety improvements welcomed? Are costly recommendations, or those impacting schedule, seen as “career threatening” – if the person making the recommendations chooses to persistently advocate them?
- Is auditing regarded as a negative or punitive enterprise? Are audits conducted by technically competent people? How frequently do audits return only a few minor findings? Is it generally anticipated that there will be “pushback” during the audit closeout meetings?
Performing Valid/Timely Hazard/Risk Assessments
CAIB Report, Vol. 1, p. 188
Email Exchange Between Engineers, January 28, 2003. CAIB Report, Vol. 1, p. 165
Audits had repeatedly identified deficiencies in NASA’s problem and waiver tracking systems. Prior safety studies had identified 5396 hazards that could impact mission integrity. Of these, 4222 were ranked as “Criticality 1/1R,” meaning that they posed the potential for loss of crew and orbiter. However, the associated safety requirements had been waived for 3233 of these 1/1R hazards (more than three-quarters of the most critical items) and, at the time of the Columbia investigation, more than 36% of those waivers had not been reviewed in the previous 10 years.
The failure of the risk assessment process was ultimately manifested in the Columbia incident. By the time the Shuttle had launched, there was still no clear technical, risk-based understanding of foam debris impacts to the spacecraft. Management had no solid information upon which to base its decisions. In lieu of proper risk assessments, most of the identified concerns were simply labeled as “acceptable.”
Performing Valid/Timely Hazard/Risk Assessments – Question-Sets for Self-Examination
- Are risk assessments performed consistently for engineering or operating changes that potentially introduce additional risks? Who decides if a risk assessment should be performed? What is the basis for not performing a risk assessment?
- How are risks for low frequency – high consequence events judged? Is there a strong reliance on the observation that serious incidents have not occurred previously, so they are unlikely to occur in the future? What is the basis for deeming risks acceptable – particularly those associated with high consequence events?
- Are the appropriate resources applied to the risk assessment process? Are senior level personnel, with appropriate technical expertise, enlisted for the risk assessment? Are the recommendations emerging from the risk assessments meaningful?
- What are the bases for rejecting risk assessment recommendations?
  - Subjective judgment, based upon previous experience and observation?
  - Objective assessment, based upon technical analysis?
- Are the risk assessment tools appropriate for the risks being assessed? Are qualitative or quantitative tools used to assess risks associated with low frequency – high consequence events? Are the tools deemed appropriate by recognized risk assessment professionals?
- Do we have a system, with effective accountabilities, for ensuring that recommendations from risk assessments are implemented in a timely fashion, and that the actions taken achieve the intent of the original recommendation?
Ensuring Open and Frank Communications
The CAIB report describes several ways in which the NASA culture suppressed the open exchange of safety-critical information during the Columbia mission:
- Management had already settled on a uniform mindset that foam strikes were not a concern. Any communications to the contrary were either directly or subtly discouraged.
- An organizational culture had been established that did not encourage “bad news.” This was coupled with a NASA culture that emphasized “chain of command” communications. The overall effect was either to stifle communications completely or, when important issues were communicated, to soften the content and message as the reports and presentations were elevated through the management chain.
- Engineering analysis was continually required to “prove the system is unsafe” rather than “prove the system is safe” – without any hard data available to support either position.
- The organizational culture encouraged 100% consensus. (The CAIB observed that a healthy safety organization is suspicious if there are no dissenting views.) In this environment, dissension was tacitly discouraged. Participants felt intimidated.
Ensuring Open and Frank Communications – Question-Sets for Self-Examination
- How does management encourage communications that contradict pre-determined thoughts or direction? How are contradictory communications discouraged? Is the bearer of “bad news” viewed as a hero, or “not a team player?”
- Does the organizational culture require “chain of command” communications? Or is there a formalized process for communicating serious concerns directly to higher management? Is critical, safety-related news that circumvents official channels welcomed?
- Do communications get altered, with the message softened, as they move up the management chain? Why does this happen? Is there a “bad news filter” along the communications chain?
- Do management messages on the importance of safety get altered as they move down the management chain? Do management ideals get reinterpreted in the context of day-to-day production and schedule realities?
- Are those bearing negative safety-related news required to “prove it is unsafe?”
- Has the “intimidation” factor in communications been eliminated? Can anyone speak freely, to anyone else, about their honest safety concerns, without fear of career reprisals?
- Does the culture prompt a “can do” or “we cannot fail” attitude that overrides a common-sense awareness of what is truly achievable, and stifles opinions to the contrary?
Learning and Advancing the Culture
CAIB Report, Vol. 1, p. 195
The CAIB found that many of the organizational conditions that had contributed to the Challenger loss had re-emerged by the time of the Columbia accident:
- The integrity and potency of the safety oversight function had been allowed to erode again.
- An overly ambitious launch schedule (relative to the capabilities of the organization) was again imposing an undue influence on safety-related decision-making.
- NASA was once again relying on “past performance as a guarantee of future success.”
- Conditions and events totally inconsistent with NASA’s technical basis for mission safety were still being “normalized.”
- Rigid organizational and hierarchical policies were still preventing the free and effective communication of safety concerns. Rank and stature were once again trumping expertise.
NASA had not effectively internalized the lessons of the Challenger incident, and its safety culture had not sufficiently advanced in the intervening 17 years. Implementing cultural change can be slow, hard work. It begins with leaders consistently modeling and reinforcing the attitudes and behaviors expected in the new culture. The results would suggest that this had not happened at NASA.
Learning and Advancing the Culture – Question-Sets for Self-Examination
- Are corporate and site leaders aware of the essential features of a sound safety culture? Do they understand their personal responsibilities for fostering and sustaining the safety culture? Are they meeting these responsibilities?
- Do leaders consistently model and support the attitudes and behaviors we expect of our culture? Do the workers?
- Are we monitoring our operations closely enough to detect problems? How do we ensure the objectivity necessary to see those problems for what they are?
- Do we have systems for reliably learning from our mistakes? Do we willingly and enthusiastically accept those learnings and apply them to improve our systems and procedures?
- Where are we now versus where we hope to be with respect to our safety culture? Where do we want to be a year from now? … two years from now? How do we plan to get there?
- If we are comfortable with where we are now, how do we discriminate between comfort and complacency?
Intention and Limitations of the Question-Sets
The above question-sets summarize lessons from the Columbia incident only, and should not be automatically applied – as is – to other organizations. NASA’s experience may or may not be wholly or directly applicable to the unique features of each organizational culture. The question-sets are intended to serve only as a starting point in determining the relevancy of the Columbia experience. More importantly, by opening a dialog on this important issue, organizations may be able to further enhance ongoing cultural improvement efforts.