Incident Investigation

RCA: 7 Ways to Avoid the Operator Error Trap

Learn how EHS managers can run RCA after serious incidents without stopping at operator error, using evidence, barriers, and leadership decisions.

7 min read
Key takeaways

  1. Diagnose operator error as the last visible action, then test which planning, supervision, barrier, and leadership conditions made that action likely.
  2. Preserve first-day evidence before hierarchy hardens the story, including scene data, permits, interviews, photographs, and digital records from the task.
  3. Separate active failures from latent conditions so corrective actions strengthen controls instead of defaulting to retraining, warnings, or extra signatures.
  4. Verify RCA closure at 30 and 90 days, because completed actions do not prove that recurrence risk changed in the work area.
  5. Use Andreza Araujo's safety culture diagnostics when repeated incidents show that reports close faster than controls, leadership routines, and speak-up behavior improve.

When a serious incident report ends with operator error, the organization may have named the last visible action while leaving the failure mechanism untouched. This article shows how an EHS manager can run root cause analysis in a way that reaches procedures, barriers, supervision, and leadership decisions without turning the investigation into a courtroom.

RCA quality also depends on what enters the system before harm occurs. A company that treats near-miss reporting as a volume metric will often miss the weak control pattern that later appears in the formal investigation.

Why operator error is usually an incomplete RCA

Operator error is a description of what happened at the sharp end, not a root cause by itself. James Reason's work on active and latent failures remains useful here, since the person closest to the injury often exposes weaknesses that were created much earlier in planning, design, procurement, training, or supervision.

Across 25+ years leading EHS at multinationals, Andreza Araujo has seen the same pattern in different countries: the faster a report reaches a guilty person, the slower the company becomes at fixing repeatable conditions. The investigation may satisfy a deadline, although it does not explain why the normal system allowed the unsafe choice to look reasonable at the time.

A stronger RCA asks what had to be true for the action to make sense. In a maintenance task, including one covered by a working-at-height rescue plan, that means looking at the permit, isolation method, time pressure, crew size, handover quality, and whether the supervisor had enough authority to stop a job whose schedule had already been promised to production.

1. Preserve the first 24 hours before the story hardens

The first 24 hours after an incident decide whether RCA starts with evidence or with a defensive narrative. Teams should secure the scene, photograph barrier positions, collect permits and checklists, identify witnesses, and capture machine data before memory, cleanup, and hierarchy reshape the facts.

What most investigations miss is the social pressure of the first meeting. If the plant manager asks, "Who made the mistake?", every witness begins to protect a position; if the opening question is about how work was actually organized, the team can still describe the system in which the event emerged.

Use a simple evidence map with four columns: physical evidence, documents, digital records, and interviews. In the interview column, separate what the person saw from what the person believes, because beliefs can guide hypotheses but should not be treated as facts.
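
The four-column evidence map can be kept as a small data structure. The sketch below is one possible layout, not a prescribed tool: the class names, the `observed` flag, and the sample entries are all illustrative assumptions. It keeps what a witness saw apart from what a witness believes and shows which columns are still empty.

```python
from dataclasses import dataclass, field

# The four columns named in the text.
COLUMNS = ("physical", "documents", "digital", "interviews")

@dataclass
class EvidenceItem:
    column: str
    description: str
    observed: bool = True  # False: the person believes it but did not see it

@dataclass
class EvidenceMap:
    items: list = field(default_factory=list)

    def add(self, column, description, observed=True):
        if column not in COLUMNS:
            raise ValueError(f"unknown column: {column}")
        self.items.append(EvidenceItem(column, description, observed))

    def facts(self):
        """Items that were directly seen or recorded."""
        return [i for i in self.items if i.observed]

    def hypotheses(self):
        """Beliefs that can guide the inquiry but still need verification."""
        return [i for i in self.items if not i.observed]

    def empty_columns(self):
        """Columns with no evidence collected yet."""
        used = {i.column for i in self.items}
        return [c for c in COLUMNS if c not in used]

# Hypothetical entries, for illustration only.
evidence = EvidenceMap()
evidence.add("physical", "Barrier position photographed before cleanup")
evidence.add("documents", "Permit and checklist for the task collected")
evidence.add("interviews", "Technician saw that the valve tag was missing")
evidence.add("interviews", "Technician believes the line was isolated", observed=False)

print(evidence.empty_columns())
print(len(evidence.facts()), len(evidence.hypotheses()))
```

An empty column after the first day is itself a finding: in this example no machine or digital record has been captured yet.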

2. Separate the active failure from the latent condition

An active failure is the action near the event, while a latent condition is the organizational weakness that sat in the system before the shift began. Reason's distinction matters because corrective actions aimed only at the active failure usually produce retraining, warnings, or a new signature field.

As Andreza Araujo argues in Safety Culture: From Theory to Practice, culture is revealed by repeated decisions, not by slogans. If a crew bypasses a verification step every Friday afternoon, the relevant question is not only why the worker did it, but why planning, supervision, and production routines made that shortcut ordinary.

A practical RCA should list at least three latent conditions for each active failure. For example, a missed lockout can point to unclear equipment identification, poor isolation drawings, and a supervisor whose span of control is too wide for simultaneous critical jobs.
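
The three-conditions rule is easy to check mechanically before a report is signed. The following is a minimal sketch under stated assumptions: the function name and the second entry are hypothetical, and only the missed-lockout example comes from the text above.

```python
MIN_LATENT_CONDITIONS = 3  # the minimum suggested in the text

def under_analyzed(findings, minimum=MIN_LATENT_CONDITIONS):
    """Return active failures listed with fewer latent conditions than the minimum."""
    return [failure for failure, latent in findings.items() if len(latent) < minimum]

findings = {
    "missed lockout": [
        "unclear equipment identification",
        "poor isolation drawings",
        "supervisor span of control too wide for simultaneous critical jobs",
    ],
    # Hypothetical incomplete entry, for illustration only.
    "wrong valve opened": ["ambiguous valve labeling"],
}

needs_more_work = under_analyzed(findings)
print(needs_more_work)
```

The check only counts entries; it cannot judge whether a listed condition is a real organizational weakness or a restated active failure, so the team still has to review the content.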

3. Test whether the control existed only on paper

A control that exists in a procedure but cannot be executed under real work conditions is not an effective barrier. In many serious events, the document is technically correct while the field arrangement makes compliance slow, ambiguous, or dependent on personal courage.

This is where the investigation should connect with the existing risk matrix. If the matrix rated the task as medium risk while the field depended on a single administrative check, the problem is not only behavior; it is a weak view of critical controls.

Ask the team to reconstruct the control as performed, not as written. The gap becomes visible when the procedure says two-person verification, while staffing records show one technician covering three work fronts during the same two-hour window.
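
One way to make that gap visible is to compare the number of people the procedure requires with what staffing records show for each time window. This is a rough sketch: the record layout, the field names, and the sample numbers are assumptions, apart from the two-person requirement and the one-technician, three-front case described above.

```python
REQUIRED_VERIFIERS = 2  # the procedure calls for two-person verification

def verification_gaps(staffing, required=REQUIRED_VERIFIERS):
    """Return windows where fewer technicians were present than the procedure requires."""
    return [
        record["window"]
        for record in staffing
        if record["technicians"] < required
    ]

# Hypothetical staffing records for the shift.
staffing = [
    {"window": "14:00-16:00", "technicians": 1, "work_fronts": 3},
    {"window": "16:00-18:00", "technicians": 2, "work_fronts": 1},
]

gaps = verification_gaps(staffing)
print(gaps)
```

The comparison is deliberately conservative: it flags only windows where the requirement could not be met at all, and it ignores how many work fronts the available technicians had to cover.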

The more than 250 cultural transformation projects supported by Andreza Araujo show why paper controls often overstate operational discipline: leaders see the document, while operators experience the workaround.

4. Use five whys carefully, or it becomes five accusations

The Five Whys method can help a team investigating a serious injury or fatality (SIF) move from event to condition, although it becomes dangerous when each answer names a person rather than a mechanism. The method should move from visible action to system design, where decisions can be changed and verified.

The common failure is linguistic. "Why did he open the valve?" already narrows the answer toward individual fault; "What conditions made opening the valve possible and plausible?" allows the group to examine labeling, isolation status, handover, supervision, and alarm logic.

Use five whys with evidence rules. Each why needs a document, observation, interview convergence, or data point; when the team cannot support an answer, mark it as a hypothesis and assign verification before the final report.
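
The evidence rule can be expressed as a simple status on each step of the chain. The sketch below is illustrative only: the `Why` class, its field names, and the sample chain are assumptions. The four accepted evidence types are the ones listed above.

```python
from dataclasses import dataclass
from typing import Optional

# Evidence types accepted by the rule in the text.
EVIDENCE_TYPES = {"document", "observation", "interview convergence", "data point"}

@dataclass
class Why:
    question: str
    answer: str
    evidence_type: Optional[str] = None  # None: nothing supports the answer yet
    verifier: Optional[str] = None       # who will verify an unsupported answer

    @property
    def status(self):
        return "supported" if self.evidence_type in EVIDENCE_TYPES else "hypothesis"

def open_hypotheses(chain):
    """Unsupported answers that nobody has been assigned to verify."""
    return [w for w in chain if w.status == "hypothesis" and not w.verifier]

# Hypothetical chain, phrased around conditions rather than people.
chain = [
    Why("What made opening the valve possible?",
        "The valve carried no isolation tag", evidence_type="observation"),
    Why("What allowed work to start without the tag?",
        "Handover did not cover isolation status", evidence_type="interview convergence"),
    Why("What made the handover incomplete?",
        "Shift overlap was shortened to meet the schedule"),
]

unverified = open_hypotheses(chain)
print([w.answer for w in unverified])
```

A final report would be held back until `open_hypotheses` returns an empty list, either because evidence was found or because a named person owns the verification.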

5. Choose corrective actions that strengthen barriers

A corrective action should increase barrier strength, not merely prove that the company reacted. Training has a place, but it is a weak answer when the investigation found missing engineering controls, confusing displays, impossible workload, or permit steps that supervisors cannot realistically audit.

During Andreza Araujo's PepsiCo South America tenure, where the accident ratio fell 50% in six months, the useful lesson was not that campaigns solve incidents. The lesson was that leadership cadence, operational discipline, and visible decisions must change together when a serious pattern appears.

Classify each action as elimination, substitution, engineering, administrative, or PPE, then identify who owns verification. If most actions sit in training and communication, the RCA probably found symptoms rather than controls.
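
A short script can run this classification over an action list. The sketch below rests on two assumptions that are not in the text: training and communication actions are recorded at the administrative level, and "most" is read as more than half. The action titles and owners are hypothetical.

```python
from collections import Counter

# Control levels named in the text, strongest first.
HIERARCHY = ("elimination", "substitution", "engineering", "administrative", "ppe")

def review_actions(actions):
    """Count actions per control level and flag weak or unowned action plans."""
    for action in actions:
        if action["level"] not in HIERARCHY:
            raise ValueError(f"unknown control level: {action['level']}")
    counts = Counter(action["level"] for action in actions)
    admin_share = counts["administrative"] / len(actions) if actions else 0.0
    return {
        "counts": dict(counts),
        # Assumption: more than half administrative means symptoms, not controls.
        "mostly_administrative": admin_share > 0.5,
        "missing_verifier": [a["title"] for a in actions if not a.get("verifier")],
    }

# Hypothetical action plan.
actions = [
    {"title": "Install interlock on isolation valve", "level": "engineering",
     "verifier": "maintenance lead"},
    {"title": "Retrain crew on lockout", "level": "administrative",
     "verifier": "EHS manager"},
    {"title": "Toolbox talk on permits", "level": "administrative",
     "verifier": None},
    {"title": "Add signature field to permit", "level": "administrative",
     "verifier": "supervisor"},
]

report = review_actions(actions)
print(report)
```

In this example three of four actions are administrative and one has no verification owner, so the plan would go back to the team before approval.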

6. Audit the culture signal inside the investigation

An investigation is also a culture test because people watch what the organization protects first: learning, reputation, production, or hierarchy. The first visible decision after an event tells the workforce whether reporting will expose risk or expose the reporter.

A safety culture diagnosis becomes especially relevant after a serious event, since interviews can reveal whether supervisors feel allowed to challenge schedule pressure. In Safety Culture Diagnosis: Learn how to do your own, Andreza Araujo treats perception gaps as operational data, not as soft opinion.

Review the investigation process itself. If witnesses avoid specifics, if contractors are interviewed last, or if the report removes management decisions from the timeline, the RCA is generating a culture signal that may discourage the next near-miss report.

Each week without correcting this signal makes the next investigation weaker, because the workforce learns which facts are safe to say and which facts should stay outside the official record.

7. Close the RCA only when recurrence risk changes

An RCA is closed when recurrence risk has changed, not when every action has a green status. The final review should test whether controls are stronger, responsibilities are clearer, and similar tasks would fail differently if the same pressure returned.

The trap is administrative closure. A spreadsheet can show 100 percent completion while the field still carries the same ambiguous permit, the same understaffed shift, and the same supervisor who is expected to audit three critical jobs at once.

Set a 30-day and 90-day verification point for serious incidents. The 30-day review confirms that actions were installed, while the 90-day review checks whether behavior, planning, and leading indicators changed in the work area where the event occurred.
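
The two review points can be scheduled the moment actions are reported complete. The text does not say which date the 30 and 90 days are counted from, so the sketch below assumes the date on which the actions were marked complete; the function and key names are illustrative.

```python
from datetime import date, timedelta

def verification_dates(actions_completed_on):
    """Return the 30-day and 90-day review dates, counted from action completion."""
    return {
        # 30 days: confirm that the actions were actually installed.
        "installation_review": actions_completed_on + timedelta(days=30),
        # 90 days: check whether behavior, planning, and leading indicators changed.
        "effectiveness_review": actions_completed_on + timedelta(days=90),
    }

reviews = verification_dates(date(2024, 1, 15))
print(reviews["installation_review"])   # 2024-02-14
print(reviews["effectiveness_review"])  # 2024-04-14
```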

Comparison: blame-centered RCA vs barrier-centered RCA

Decision point | Blame-centered RCA | Barrier-centered RCA
Opening question | Who failed to follow the rule? | What conditions made the rule fail in real work?
Primary evidence | Witness statements and rule deviation | Scene data, documents, interviews, and control verification
Typical action | Retrain, warn, add signature | Redesign the barrier, clarify ownership, verify field execution
Leadership role | Approve the report after the fact | Remove schedule, resource, and authority conflicts found by the team
Closure test | All actions marked complete | Recurrence risk reduced and checked after 30 and 90 days

Conclusion

RCA becomes useful when it explains why the operator's action was possible, plausible, and insufficiently blocked by the system around the task. The report should leave leaders with decisions to make, not only workers with rules to remember.

If your organization needs to turn serious incident findings into stronger controls, leadership routines, and a culture that reports risk earlier, connect with Andreza Araujo and build the next investigation around evidence rather than blame.

Frequently asked questions

Is operator error a valid root cause in RCA?
Operator error can describe the active failure near the event, but it should rarely stand alone as the root cause. A serious incident investigation should ask what conditions made the action possible and plausible, including procedure design, supervision, time pressure, training quality, equipment layout, and barrier strength. James Reason's active and latent failure distinction helps teams move from individual action to system conditions that leaders can change.
How do you avoid blame in incident investigation?
Avoiding blame starts with the first question. Instead of asking who failed to follow the rule, ask what conditions made the rule fail in real work. The team should secure evidence, separate facts from interpretations, interview across roles, and test controls as performed in the field. This does not remove accountability, but it keeps accountability tied to decisions, barriers, and risk ownership.
What is the difference between active failure and latent condition?
An active failure is the visible action near the incident, such as opening the wrong valve or missing a lockout step. A latent condition is the weakness that existed before the shift, such as unclear labeling, poor isolation drawings, excessive workload, weak supervision, or production pressure. RCA needs both, because fixing only the active failure often leaves recurrence risk unchanged.
When should RCA corrective actions be closed?
Corrective actions should be closed when they are installed and verified, but the RCA should not be considered effective until recurrence risk changes. For serious incidents, a 30-day review can confirm installation, while a 90-day review checks whether work practices, barrier verification, leading indicators, and supervisory routines changed in the affected area.
How does Andreza Araujo approach RCA and safety culture?
Andreza Araujo connects incident investigation with safety culture because reports reveal what an organization protects after failure. In Safety Culture Diagnosis: Learn how to do your own, she treats perception gaps and leadership routines as operational evidence. That approach helps EHS managers see whether an RCA is strengthening controls or simply documenting a familiar blame pattern.

About the author

Global Safety Culture Specialist

Andreza Araujo is an international reference in EHS, safety culture and safe behavior, with 25+ years leading cultural transformation programs in multinational companies and impacting employees in more than 30 countries. Recognized as a LinkedIn Top Voice, she contributes to the public conversation on leadership, safety culture and prevention for a global professional audience. Civil engineer and occupational safety engineer from Unicamp, with a master's degree in Environmental Diplomacy from the University of Geneva. Author of 16 books on safety culture, leadership and SIF prevention, and host of the Headline Podcast.

  • Civil Engineer (Unicamp)
  • Occupational Safety Engineer (Unicamp)
  • Master in Environmental Diplomacy (University of Geneva)