Several lessons can be learned from the February 2021 Texas power outages that are directly applicable to chemical processing facilities.
In February 2021, Texas was impacted by a severe winter storm that disrupted the electricity supply to more than 4.5 million people. The low temperatures brought by the storm disrupted primary electricity generating units and backup systems, resulting in the unavailability of over 35,000 MW of generation capacity (nearly 30% of the total installed capacity). When the additional equipment outages unrelated to the storm are included, almost half of the state’s generation capacity was unavailable at the peak of the crisis (1). The financial impact of the event is estimated to be between $80 billion and $130 billion (1).
The impact on people from this event was catastrophic. Of the 246 fatalities confirmed to be related to the winter storm, most were directly related to loss of electric power; 186 people died from hypothermia or cold-related exacerbation of pre-existing illnesses, and 29 people died from fires or carbon monoxide poisoning caused by alternative sources of heat such as space heaters and fuel-burning equipment (2).
If not for the swift and decisive actions of many involved in power generation, transmission, and distribution, the situation could have been much worse. At one point, the electrical system was less than five minutes away from a potential total collapse, which would have taken much longer to recover from (1).
After any incident or near-miss, it is critical to review the scenario to identify the lessons to be learned. This process can determine what went well and what could have been done better to reduce the likelihood of a repeat occurrence, or to mitigate the impacts of such an event. However, it is also important to look outside of one’s own industry to find lessons that can be learned in different, but closely related, industries and processes. Investigating other industries provides the benefit of a larger pool of information with a greater potential for identifying valuable lessons that can prevent catastrophic incidents, thereby saving lives and protecting production, assets, and the environment.
Electrical networks and chemical processing facilities have much in common. Both rely on the interconnection of a wide variety of complicated and bespoke equipment. Both systems are controlled with a mixture of complex automation and expert human input; those people are required to monitor the system for extended periods of time, and must be prepared to step in and take swift and decisive action at a moment’s notice if anything starts to go wrong. Both systems also have the potential for catastrophic events when the unexpected happens. This makes the electrical power industry an important source of lessons learned for the chemical process industries (CPI).
This article discusses some of the most important process safety lessons that can be learned from the 2021 Texas power outages and identifies the elements of CCPS’s Risk Based Process Safety (RBPS) (3) that apply to each lesson.
Overview of the electrical generation and distribution system
An electrical generation, transmission, and distribution system (i.e., a grid) includes high-voltage generators, transmission equipment, and transformers that generate, transfer, and supply electric power (1). Each grid comprises multiple companies and organizations involved in each stage of power generation and delivery, and so each grid is coordinated by an independent system operator (ISO).
The Electric Reliability Council of Texas (ERCOT) is the ISO for the grid that provides approximately 90% of the electric power to the State of Texas (1). The other two ISOs that were most impacted by the February 2021 winter storm are the Midcontinent Independent System Operator (MISO) and the Southwest Power Pool (SPP). Although the impacts on MISO and SPP were significant, to limit the scope, this article focuses specifically on ERCOT due to its relative isolation from wider electrical grids.
The many different entities that comprise the entire system play complex roles; however, this article simplifies these roles into generation, transmission, and the balancing, coordination, planning, communication, and scheduling functions of the ISO to improve clarity. Figure 1 shows the portion of the grid that ERCOT oversees (4).
The flow of electric power in the system is measured in megawatts. At the time of the winter storm, ERCOT had a total installed capacity — the amount of power that could theoretically be generated if all installed generating equipment was running simultaneously — of 123,057 MW (1). In reality, there are always generators out of service for planned maintenance or due to a failure, or which are not generating at their full nameplate capacity, and therefore peak generating capacity is significantly lower. For reference, the all-time peak demand for ERCOT prior to February 2021 was 74,820 MW in August 2019 (1), approximately 60% of the total installed capacity in February 2021.
In the ERCOT region, the peak summer power demand is usually greater than the peak winter demand. The previous all-time winter peak record was 65,750 MW, set in January 2018 (1). A new all-time winter peak record of 69,871 MW was set on Feb. 14, 2021, during the winter storm (1). The full potential peak demand is unknown due to the load shedding that was required shortly thereafter to prevent a full system collapse, but it was estimated that it would have been 76,819 MW (1).
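The proportions behind these figures can be checked with a few lines of arithmetic. The following Python sketch uses only values quoted in this article; the variable and function names are illustrative and do not come from the cited reports.

```python
# Figures quoted in the article (MW). The variable names are illustrative.
INSTALLED_CAPACITY_MW = 123_057
STORM_OUTAGES_MW = 35_000            # lower bound: "over 35,000 MW"
SUMMER_PEAK_AUG_2019_MW = 74_820     # all-time peak before February 2021
WINTER_PEAK_JAN_2018_MW = 65_750     # previous winter record
WINTER_PEAK_FEB_2021_MW = 69_871     # record set on Feb. 14, 2021
ESTIMATED_PEAK_NO_SHED_MW = 76_819   # estimated demand without load shed


def percent_of_installed(mw: float) -> float:
    """Express a power value as a percentage of total installed capacity."""
    return 100.0 * mw / INSTALLED_CAPACITY_MW


def percent_increase(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old`."""
    return 100.0 * (new - old) / old


print(f"Storm outages: {percent_of_installed(STORM_OUTAGES_MW):.1f}% of installed capacity")
print(f"2019 summer peak: {percent_of_installed(SUMMER_PEAK_AUG_2019_MW):.1f}% of installed capacity")
print(f"New winter record vs. old: +{percent_increase(WINTER_PEAK_FEB_2021_MW, WINTER_PEAK_JAN_2018_MW):.1f}%")
print(f"Estimated unserved peak vs. 2019 summer peak: "
      f"+{percent_increase(ESTIMATED_PEAK_NO_SHED_MW, SUMMER_PEAK_AUG_2019_MW):.1f}%")
```

On these numbers, the estimated winter demand without load shedding would have exceeded even the prior all-time summer peak.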
Due to the difficulty of storing large amounts of electrical energy, most electrical power is consumed at about the same time it is produced. The amount of electrical energy being consumed is called the load, and the generation and load must be balanced at all times for the grid to function. Among other roles, ERCOT manages the system and tells generating companies when they need to start and stop generating to keep generation balanced with the load, and manages scheduled outages of equipment to prevent too much generation capacity from being out of service at any given time. This balance was at the heart of the issues encountered during the winter storm.
Lessons to be learned
Conduct risk analyses and forecasts for a range of severity and likelihood values. ERCOT and the operators within the grid were well aware of the potential for cold weather to cause an increase in power demand and a decrease in available generation. To prepare for each winter, the balancing authorities and planning coordinators for a grid produce forecasts of the potential peak load in winter and the expected generation capacity that will be available. By comparing these numbers, an estimate can be made regarding any areas of concern or any actions that need to be taken to ensure that enough generation capacity will be available to meet the anticipated peak load (1).
Peak load is predicted through a 50/50 forecast or 90/10 forecast. 50/50 means that there is an anticipated 50% chance that the actual peak load will exceed the forecasted value, and 90/10 means that there is an anticipated 10% chance that the actual peak load will exceed the forecasted value (1). Although the actual peak load that would have occurred in February 2021 is unknown (due to the requirement to shed load during the event), it was estimated that the actual peak demand would have been approximately 14% above the 90/10 peak load forecast and approximately 33% above the 50/50 peak load forecast (1).
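In statistical terms, a 50/50 forecast is the median of the possible peak loads and a 90/10 forecast is the 90th percentile. The sketch below illustrates this with a small set of invented peak-load samples (they are hypothetical and are not ERCOT data), and then back-calculates the forecast levels implied by the rounded 14% and 33% exceedances quoted above. The implied values are approximations derived here, not the published forecasts.

```python
import statistics

# Hypothetical simulated winter peak loads (MW), e.g., one per weather
# scenario. These 11 numbers are invented for illustration only.
SIMULATED_PEAKS_MW = [52_000, 54_000, 55_000, 56_000, 57_000, 58_000,
                      59_000, 61_000, 63_000, 66_000, 70_000]


def exceedance_forecasts(peaks_mw):
    """Return (50/50, 90/10) forecasts from a sample of possible peak loads.

    The 50/50 forecast is the median; the 90/10 forecast is the 90th
    percentile, i.e., the level with roughly a 10% chance of being exceeded.
    """
    deciles = statistics.quantiles(peaks_mw, n=10, method="inclusive")
    return deciles[4], deciles[8]


forecast_50_50, forecast_90_10 = exceedance_forecasts(SIMULATED_PEAKS_MW)
print(f"Hypothetical 50/50 forecast: {forecast_50_50:,.0f} MW")
print(f"Hypothetical 90/10 forecast: {forecast_90_10:,.0f} MW")

# Back-calculation from the article's rounded percentages (approximate).
ESTIMATED_PEAK_MW = 76_819
implied_90_10 = ESTIMATED_PEAK_MW / 1.14   # peak was ~14% above 90/10
implied_50_50 = ESTIMATED_PEAK_MW / 1.33   # peak was ~33% above 50/50
print(f"Implied 90/10 forecast: about {implied_90_10:,.0f} MW")
print(f"Implied 50/50 forecast: about {implied_50_50:,.0f} MW")
```

Because the percentages in the text are rounded, the implied forecast values should be read as order-of-magnitude checks only.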
It is extremely difficult to anticipate future conditions, and care should be taken when assessing the merits of such predictions in hindsight after the event has occurred. However, it is essential to perform these forecasts for future conditions and include multiple scenarios with different likelihoods and potential severities. The RBPS element hazard identification and risk analysis (HIRA) (3) involves predicting which events may potentially negatively impact a process, how severe that scenario could be, and how likely it is to occur.
In risk analysis, the team will often review only one hazardous scenario for each event to maximize efficiency. Some approaches consider only the most likely scenario; however, this event shows that doing so is inadequate preparation for future events. If ERCOT had developed only a 50/50 forecast, its picture of the future would have been incomplete and the consequences of the event may have been significantly worse. Including a 90/10 forecast allowed the entities across the grid to prepare for a peak load that had only a 10% chance of being exceeded, but that was still quite possible. This shows the value of having a less likely, but higher-severity, forecast. For events with potentially severe consequences, it is valuable to develop a range of potential scenarios with individual severity and likelihood rankings for each, which can be used to determine the amount of time and resources needed to protect against each level of severity.
Identify and manage potential cascading failures. On the grid, a loss of generation capacity can cause more generators to shut down, creating a feedback loop that can lead to the loss of all generation across the system. Therefore, as catastrophic as this incident was, it could have been much worse. At one point, the system was only 4 min and 37 sec away from a potential total collapse (1), which would have caused a complete power outage across the entire grid and taken much longer to recover from.
Alternating current (AC) electricity operates in cycles. Its frequency, measured in hertz (Hz), is the number of full cycles the system completes per second. The electrical system within the U.S. is designed to operate at a frequency of, or very close to, 60 Hz (60 cycles per second). The frequency is maintained by balancing the electrical energy being generated with the electrical energy being consumed. Since storing significant amounts of electrical energy is difficult, the majority of electrical power is generated at almost the same time it is used, and so, to maintain the frequency at 60 Hz, the amount of electricity generated must be adjusted moment to moment to meet the real-time demand of users. If the system deviates from 60 Hz, it can cause severe damage to equipment throughout the grid, from generation to end users. It can also cause the entire system to fail (i.e., a blackout). Restarting from a blackout, known as a black start, is a long and complicated process, so system operators do everything they can to keep at least some part of the system operating even under upset conditions.
Ideally, the balance between electrical generation and demand is maintained by increasing or decreasing generation to match the demand. However, during the February 2021 event, there was insufficient generation capacity to meet demand. This resulted in reduced grid frequency, also known as underfrequency, and required additional generation capacity to be brought online. With no additional generation capacity available to bring online, the only option left was to reduce the demand on the system, which is accomplished by “shedding” load. By disconnecting the power supply to blocks of users, transmission operators can reduce the total load on the system and therefore keep the rest of the electrical grid operating within acceptable parameters. ERCOT has the authority to order transmission operators to shed a certain amount of load. The required load shed is split proportionally among the transmission operators. Each transmission operator must then enact its predetermined load-shed plan to remove that amount of load from the system.
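A proportional split of a load-shed order can be sketched as follows. The text does not state the basis for the proportions, so this example assumes each operator's share is proportional to the load it serves; the operator names and load values are hypothetical.

```python
def allocate_load_shed(total_shed_mw, operator_load_mw):
    """Split a load-shed order among operators in proportion to their load.

    Assumption (not from the source): each operator's share of the order
    equals its share of the total load served.
    """
    total_load = sum(operator_load_mw.values())
    if total_load <= 0:
        raise ValueError("total load must be positive")
    return {
        name: total_shed_mw * load / total_load
        for name, load in operator_load_mw.items()
    }


# Hypothetical operators and the load each one serves (MW).
served_load = {"Operator A": 50.0, "Operator B": 30.0, "Operator C": 20.0}
shares = allocate_load_shed(1_000.0, served_load)
for name, mw in shares.items():
    print(f"{name}: shed {mw:.0f} MW")
```

Fixing the allocation rule in advance means that, during an emergency, each operator only has to execute its share rather than negotiate it.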
Figure 2 shows the generation capacity available from Feb. 14 to Feb. 20, and the estimated load that would have been demanded from the system if the load-shed requirement had not been enacted (4).
During an underfrequency event, generators have to work harder to try to increase the frequency. If a generator continues to operate when the frequency is too low, it can overload and cause significant damage to the equipment. For this reason, underfrequency relays are installed on generators to automatically shut them down if the frequency is too low. ERCOT and North American Electric Reliability Corp. (NERC) protocols dictate the minimum allowable time delay for these automatic shutdowns; the further away from the ideal frequency the system is operating at, the faster equipment damage will occur and thus the faster the relays have to act. The ERCOT and NERC protocols allow for underfrequency relays to act to shut down the equipment after nine minutes if the frequency is at or below 59.4 Hz (1).
Early in the morning of Feb. 15, 2021, additional generating units failed or tripped, and the frequency of the system fell below 59.4 Hz despite two previous orders to shed load, triggering these 9-min underfrequency relay timers. ERCOT issued two more load-shed orders, which brought the system back above 59.4 Hz, but it took 4 min and 23 sec to do so. If the frequency had remained below 59.4 Hz for another 4 min and 37 sec, the underfrequency relays would have acted to shut down approximately 17,000 MW worth of generation, potentially causing a total system collapse and blackout (1). Figure 3 shows how the frequency fell and eventually recovered, along with the outages that caused the frequency decline and ERCOT’s load-shed orders to respond.
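The margin quoted above follows directly from the 9-min relay delay and the 4 min and 23 sec that the frequency spent below 59.4 Hz. The sketch below reproduces that arithmetic and adds a simplified timer check; the function names and the sample data are hypothetical, and real protection schemes use multiple thresholds and settings that are not modeled here.

```python
THRESHOLD_HZ = 59.4          # threshold quoted in the article
RELAY_DELAY_S = 9 * 60       # 9-min delay quoted in the article


def remaining_margin_s(seconds_below_threshold):
    """Seconds left before the relay timer would have expired."""
    return RELAY_DELAY_S - seconds_below_threshold


def relay_would_trip(samples, threshold_hz=THRESHOLD_HZ, delay_s=RELAY_DELAY_S):
    """Return True if frequency stays at or below the threshold for delay_s.

    `samples` is a time-ordered list of (time_s, frequency_hz) readings.
    The timer resets whenever a reading is above the threshold.
    """
    below_since = None
    for time_s, frequency_hz in samples:
        if frequency_hz <= threshold_hz:
            if below_since is None:
                below_since = time_s
            if time_s - below_since >= delay_s:
                return True
        else:
            below_since = None
    return False


margin = remaining_margin_s(4 * 60 + 23)
minutes, seconds = divmod(margin, 60)
print(f"Margin before relay action: {minutes} min {seconds} sec")

# Hypothetical readings: frequency dips for 263 sec, then recovers.
recovered = [(0, 59.9), (60, 59.3), (323, 59.3), (324, 59.5)]
print("Relay trips:", relay_would_trip(recovered))
```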
The risk of grid collapse is well-known and largely understood from previous such incidents. However, many processing facilities are unique systems that have not previously experienced potential cascading failure events that can occur within their process. As such, these scenarios may not have been identified or analyzed. Process operators must investigate these types of scenarios by performing a hazard analysis that considers the potential for “domino effect” cascading failures, so that safeguards or emergency response plans to prevent or mitigate these events can be developed and implemented. In the RBPS framework, this is accomplished through the incident investigation element for incidents that have occurred (or nearly occurred) before, and through the HIRA element for the discovery, analysis, and prevention of potential scenarios that could occur (3).
A well-known example of a domino effect that caused the escalation of a dangerous situation is the Piper Alpha disaster in July 1988, which led to 167 fatalities and complete loss of the oil platform. In this event, an initial explosion disabled the control room, main power supplies, and likely the fire-water system, thereby disabling or reducing the efficacy of the platform’s emergency response systems. The heat from the fires also caused a chain of events that resulted in the rupture of a high-pressure oil pipeline, further feeding the fire and leading to the loss of the entire platform (5). Although emergency systems were in place to protect against a fire, the initial explosion started the cascade of failures that ultimately increased the catastrophic magnitude of the incident.
Identify and understand potential common-cause impacts. Part of the reason that the impact of the February 2021 event was so pronounced was that, in addition to a loss of generation capacity, the same cold weather caused a significant increase in demand for electric energy. As the temperature fell and generating units started to have more cold-related issues, the demand for electricity to heat homes rose dramatically. This combination of unusually low generation capacity and unusually high demand resulted in the emergency measures that needed to be taken, and they were both caused by the same event.
Many of the homes in the affected area are heated using electricity. There is a clear correlation between the outside air temperature and the heating requirement for a home to cover the heat lost to the outside air, but there was an additional factor that caused a more significant increase in power demand than may be readily apparent. Many of the homes with electrical heating use electric heat pumps, which use the refrigeration cycle to transfer heat energy from the outside air to the home interior, and are effectively an air conditioning unit run in reverse (the same unit is used for air conditioning of the home during warm weather by reversing the heat transfer direction). These units are very efficient when the outside temperature is well above freezing and therefore do not need as much power to heat the home as a unit that relies purely on electric resistance heating. However, when the temperature falls near freezing, ice forms on the external coils and the heat pump becomes ineffective.
When this happens, the unit must switch to electric resistance heating to provide the required heat energy, which requires more electrical energy for the same amount of heating. Because of this, it can take almost four times as much power to heat a home at –10°C as it does at 0°C (1).
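A rough steady-state model shows why the jump is so large: the colder weather raises the heat the home loses, and the switch to resistance heating removes the efficiency advantage of the heat pump. The sketch below is illustrative only. The indoor temperature, heat-loss coefficient, and coefficient of performance (COP) are assumed values chosen for the example, not figures from the cited report.

```python
def heating_power_kw(outdoor_c, indoor_c=20.0, loss_kw_per_c=0.5, cop=1.0):
    """Electric power needed to hold the indoor temperature steady.

    Assumptions (hypothetical): heat loss is proportional to the
    indoor-outdoor temperature difference, and electric power equals the
    heat required divided by the COP. Resistance heating has a COP of 1.
    """
    heat_needed_kw = loss_kw_per_c * max(indoor_c - outdoor_c, 0.0)
    return heat_needed_kw / cop


# Heat pump mode at 0 deg C with an assumed COP of 2.5.
power_at_0c = heating_power_kw(0.0, cop=2.5)
# Resistance mode at -10 deg C (COP of 1).
power_at_minus_10c = heating_power_kw(-10.0, cop=1.0)

ratio = power_at_minus_10c / power_at_0c
print(f"0 deg C, heat pump: {power_at_0c:.1f} kW")
print(f"-10 deg C, resistance: {power_at_minus_10c:.1f} kW")
print(f"Ratio: {ratio:.2f}x")
```

With these assumed inputs, the 50% increase in heat loss and the 2.5-fold loss of efficiency compound, which is the mechanism the text describes. Different assumptions would give a different ratio.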
As the homes using electric heat pumps switched from heat pump mode to resistance heating, the demand rose sharply. This caused the power demand to increase rapidly at the same time that generation capacity was rapidly decreasing. These types of effects can be understood and revealed through effective process knowledge management combined with appropriate HIRA (3).
Provide redundancy between normally operating systems and backup systems. As generating systems went offline or were derated due to issues associated with the cold weather conditions, generating stations turned to their backup systems to help provide some of the missing generation. However, many of these backup systems were affected by the same issues that impacted the normally operating systems. Even though some generating facilities had additional generating capacity, not all of them had the fuel to run it.
During the event, almost 30% of all generating unit outages were due to natural gas supply issues (1). The cold weather caused issues with natural gas extraction, gathering, processing, and distribution. Figure 4 shows that natural gas processing declined by 82% during the event compared with early February 2021. Approximately 88% of this impact was related to the weather (1). This decline in natural gas availability impacted the ability to run and restart natural gas-powered generators. At its peak, 7,700 MW of generation was unavailable solely due to lack of natural gas supply.
When natural gas is extracted from the ground, it usually has some amount of water with it, which can freeze when the temperature is low enough. However, the water and natural gas together can also form hydrates, which are solids that can form even at temperatures above freezing in high-pressure processes such as these. This, in addition to the cold-weather impacts on the processing equipment, reduced natural gas production and processing rates. However, unavailability of power was also a key cause of reduced production; 18% of the production decline was caused by loss of power, and a further 18% was caused by a combination of failures such as loss of power combined with freezing (1).
A vicious cycle becomes apparent here: a reduction in natural gas production reduced the fuel available to electric generating stations, which in turn reduced the power available to natural gas infrastructure.
Another reason that so much of ERCOT’s generation capacity was affected was a lack of geographic diversity. The other electrical entities most affected by the storm, MISO and SPP, had larger geographic footprints (Figure 5), and could import greater amounts of electrical power from other geographic areas where generators were unaffected, or less affected, by the winter storm. ERCOT was able to import only just over 1,000 MW, compared with a total installed capacity of 123,057 MW (1). This left ERCOT unable to compensate for its lack of generation capacity.
In RBPS management, all HIRA activities and emergency management planning must consider the potential for common-cause failures between normal operations and safeguarding, backup, and emergency equipment or processes. Overlooking these interdependencies can lead to scenarios in which the event that causes failure of the system also causes failure of the backup systems in place, leaving the system exposed to much greater risk than anticipated.
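The effect of a shared cause on redundancy can be illustrated numerically. The sketch below uses the beta-factor model, a common simplification in reliability analysis that this article does not itself name; it is introduced here as an assumption. The failure probability and beta values are hypothetical.

```python
def both_fail_probability(p_fail, beta):
    """Approximate probability that a primary and an identical backup both fail.

    Beta-factor model (an assumption, not from the source): a fraction
    `beta` of each unit's failures comes from a shared cause that takes out
    both units at once; the remaining failures are independent.
    """
    if not 0.0 <= p_fail <= 1.0 or not 0.0 <= beta <= 1.0:
        raise ValueError("p_fail and beta must be between 0 and 1")
    common_cause_part = beta * p_fail
    independent_part = ((1.0 - beta) * p_fail) ** 2
    return common_cause_part + independent_part


P_FAIL = 0.01   # hypothetical failure probability of each unit on demand

independent = both_fail_probability(P_FAIL, beta=0.0)
with_common_cause = both_fail_probability(P_FAIL, beta=0.1)

print(f"Fully independent backup: {independent:.6f}")
print(f"10% common-cause share:   {with_common_cause:.6f}")
print(f"Risk is {with_common_cause / independent:.1f} times higher")
```

Even a modest common-cause fraction dominates the result, which is why a backup that shares a fuel supply, a location, or a weather exposure with the primary system provides far less protection than its nominal redundancy suggests.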
Another catastrophic incident in which the normal operation and safeguarding systems were affected by the same inciting event was the Fukushima Daiichi nuclear reactor explosion in Japan in March 2011, which forced the evacuation of more than 100,000 people. In this event, an undersea earthquake caused damage to the power generation equipment, requiring a shutdown. After shutdown, seawater is pumped through cooling systems to remove heat generated as the nuclear reaction slows down and eventually stops, thereby preventing overheating of the equipment. Since the site’s electric power had been disrupted by the earthquake, emergency diesel generators were used to power the cooling water pumps. However, a tsunami caused by the same earthquake overwhelmed the facility’s tsunami protection seawalls. The facility flooded, damaging the generators, thereby resulting in a loss of the cooling water circulation that would have otherwise prevented the overheating of the reactors and subsequent explosion that occurred (6). In this event, the normally operating systems and backup systems were both rendered inoperable by the same earthquake.
Plan and prepare for emergencies and extreme situations. As discussed in the cascading failures section, load shedding and load management are activities that are planned in advance of an emergency situation. The authority of the ISO to issue load-shed orders is established in advance, and each transmission operator is required to develop load-shed plans for this eventuality (1). These plans define how the transmission operator will shed load, and include consideration of critical circuits that will not be part of a load-shed action, such as circuits that include hospitals, water treatment facilities, police and fire stations, military facilities, and other facilities or services deemed crucial to public safety or to restoring the remaining electric system (1). Non-critical circuits are then designated in the load-shed plans, with the intention of cycling circuits on and off at different times to effect “rolling blackouts,” such that the integrity of the overall grid is maintained without cutting power to any circuit for more than a few hours at a time. In the February 2021 event, these plans became critical as ERCOT was forced to issue load-shed orders for 20,000 MW for nearly three consecutive days (1).
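The rotation logic behind a rolling blackout can be sketched in a few lines. This is a simplified illustration, not an actual operator's load-shed plan: the circuit names are hypothetical, every circuit is treated as carrying equal load, and critical circuits are simply excluded from the rotation.

```python
def rolling_blackout_schedule(circuits, critical, off_per_period, periods):
    """Return, for each period, the list of circuits switched off.

    Critical circuits are never shed. The remaining circuits are shed in a
    rotating window so that outages are shared rather than concentrated.
    """
    sheddable = [c for c in circuits if c not in critical]
    if off_per_period > len(sheddable):
        raise ValueError("not enough non-critical circuits to rotate")
    schedule = []
    for period in range(periods):
        start = (period * off_per_period) % len(sheddable)
        off = [sheddable[(start + i) % len(sheddable)]
               for i in range(off_per_period)]
        schedule.append(off)
    return schedule


# Hypothetical circuits; the hospital feeder is designated critical.
circuits = ["Hospital feeder", "Circuit 1", "Circuit 2", "Circuit 3", "Circuit 4"]
plan = rolling_blackout_schedule(circuits, {"Hospital feeder"},
                                 off_per_period=2, periods=3)
for period, off in enumerate(plan, start=1):
    print(f"Period {period}: off = {off}")
```

The guard in the function hints at a practical limit: as the required shed approaches the total non-critical load, there is little or nothing left to rotate, and the same circuits stay off.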
Having load-shed plans in place was essential for the activity to be successful. Without them, operating personnel would have had to make extremely quick and potentially under-informed decisions on which circuits to isolate, which could have led to critical circuits being disconnected, or a delayed response causing the nine-minute underfrequency relays to trip and potentially causing a system blackout. The established authority of ERCOT to issue mandatory load-shed orders to transmission operators was also essential to prevent disagreements between operators over which companies had to shed load and by how much (as load shedding impacts company revenue). Figure 3, showing the frequency drop, reveals how important it was that each operator responded quickly to the load-shed orders.
This type of forward planning and strategizing for emergency events is an important part of the emergency management element of RBPS (3). It also requires HIRA to identify what could happen and training and performance assurance to ensure that personnel know how and when to enact these plans (3).
It should be noted that the ERCOT system transmission operators’ plans were developed considering much smaller increments of load shed than the 20,000 MW required (1), which resulted in some last-minute planning and many circuits being kept turned off for multiple days, instead of the rolling blackouts that were planned.
To create and enact these plans, stakeholder outreach and workforce involvement are required (3). It is essential to identify which parties are required to participate in the planning process and to involve those parties who would be implementing the plans. It is also important to consider conduct of operations, to verify that the correct operational discipline is in place to provide the speed and reliability of response required.
Understand the potential hazards of backup systems. During the outage, many people turned to whatever heat sources they could find to try to stay warm. However, some of these sources came with additional risks that were not always identified or managed effectively. Of the 246 fatalities confirmed to be related to the winter storm, 29 people died as a result of the use of backup heat sources (2). Nineteen of these fatalities were caused by carbon monoxide poisoning due to running generators, grills, heaters, or vehicles in enclosed spaces with inadequate ventilation, or from ice blocking the vents of gas-powered heating equipment. A further ten fatalities were from injuries sustained in fires, some of which were caused by using space heaters close to flammable materials.
These additional and unfortunate human impacts highlight another key lesson, which is that during an emergency event, the people involved may have to improvise procedures, jury-rig physical solutions, or use equipment for unintended purposes. They may also be required to use backup processes or equipment that they are unfamiliar with. In each case, there may be unknown and unidentified risks, and in an emergency situation, it is usually impossible to take the time or resources to conduct a risk analysis. With the increased stress and time pressure of an emergency, decision-making becomes harder and more prone to error. A household may not have been aware of the risks of carbon monoxide poisoning if they had not needed to use a combustion-driven heat source in an enclosed space before. Furthermore, people may accept known risks that they would not normally tolerate because no other options are available in an emergency. In this case, a household may have been aware of the hazards of using a space heater in an unsafe manner, but may not have had any other options.
An essential part of emergency preparedness is to identify in advance what can go wrong, create backup plans for what to do in these situations, and acquire and maintain the resources required to enact them.
This human impact also reveals the requirement to include HIRA (3) as part of emergency management. First, there must be adequate process knowledge management (3) to be able to identify what potential hazards may be created or exacerbated by any proposed emergency management solutions. Then, a risk analysis must be conducted on these systems, processes, or procedures to find out what could go wrong when they are used or enacted.
A too-common example of an emergency response that creates an unintended additional hazard can arise in any kitchen: an oil fire. It is widely believed that the best way to extinguish any fire is to douse it in water, although this is not always true. In the case of a burning pan of oil, the water will sink to the bottom of the pan, quickly heat up on the hot pan base, and vaporize to steam. As the water rapidly expands to steam and exits the pan, it will carry the burning oil with it and result in an eruption of flaming oil from the pan. To avoid this scenario, the cook has to be aware of this hazard, know the correct course of action, and then remember these facts during the high-pressure scenario of a flaming pan of oil. Training in how to correctly manage the fire, and practice drills to reinforce the correct response, can prevent someone from making mistakes of this type.
Consider what went well and what could have gone worse. Much of what has been discussed so far is an analysis of what went wrong, or unexpected events that were not accounted for. The incident investigation element of RBPS emphasizes the importance of investigating these aspects of any incident, or near-miss, as an important source of lessons learned for preventing similar future incidents (3).
However, it is also important to consider what went right. Any protection system that worked as intended to prevent severe impacts must be analyzed and reviewed to understand how it was successful, and to determine what can be done to ensure it will act in the same way next time it is required to respond. Any “good luck” encountered needs to be reviewed to see what could have happened had a worse scenario occurred.
In this event, the load shedding enacted by the operators at the instruction of ERCOT just barely prevented the system from a possible full blackout (1). By having a plan and lines of communication in place with clear authority, ERCOT and the transmission operators were able to respond quickly and decisively to a dangerous scenario and prevent even worse catastrophes.
At the same time, it is essential to remember the potential for a worse outcome, and to investigate and follow up just as thoroughly as if the worse scenario had occurred. Near-misses tend to be overlooked and forgotten. When there has been a severe impact, it is even more likely that the subsequent review will focus on the scenario that occurred, and the true hazard potential will be overlooked. When conducting incident investigations, the investigators must look for what else could have happened, or any worse scenarios that could have been a credible outcome, which requires HIRA to identify the worst credible potential outcome.
In closing
Many lessons can be learned from the February 2021 Texas power outages that are directly applicable to processing facilities. Scenarios like this with a potentially high impact but a low likelihood of occurring are inherently difficult to plan for, and additional backup or protection systems that are rarely — if ever — used can be expensive and time-consuming to install and maintain. This is why it is important to seek lessons learned from as many different sources as possible. By analyzing the preparation and actions taken by the operators within the ERCOT system, processing organizations can learn how to better plan for and respond to scenarios with severe consequences.
Literature Cited
1. Federal Energy Regulatory Commission, “The February 2021 Cold Weather Outages in Texas and the South Central United States,” www.ferc.gov/media/february-2021-cold-weather-outages-texas-and-south-central-united-states-ferc-nerc-and (Nov. 2021).
2. Texas Dept. of State Health Services, “February 2021 Winter Storm-Related Deaths – Texas,” www.dshs.texas.gov/sites/default/files/news/updates/SMOC_FebWinterStorm_Mortality-SurvReport_12-30-21.pdf (Dec. 2021).
3. Center for Chemical Process Safety, “Guidelines for Risk Based Process Safety,” CCPS, American Institute of Chemical Engineers, New York, NY (2007).
4. Magness, B., “Review of February 2021 Extreme Cold Weather Event – ERCOT Presentation,” Electric Reliability Council of Texas, www.ercot.com/files/docs/2021/02/24/2.2_REVISED_ERCOT_Presentation.pdf (Feb. 2021).
5. Cullen, W., “The Public Inquiry into the Piper Alpha Disaster,” U.K. Health and Safety Executive, www.hse.gov.uk/offshore/assets/docs/piper-alpha-public-inquiry-volume1.pdf (Nov. 1990).
6. International Atomic Energy Agency, “The Fukushima Daiichi Accident, Report by the Director General,” IAEA, www-pub.iaea.org/mtcd/publications/pdf/pub1710-reportbythedg-web.pdf (2015).