RCA: 7 Ways to Avoid the Operator Error Trap
Learn how EHS managers can run RCA after serious incidents without stopping at operator error, using evidence, barriers, and leadership decisions.
Key takeaways
1. Diagnose operator error as the last visible action, then test which planning, supervision, barrier, and leadership conditions made that action likely.
2. Preserve first-day evidence before hierarchy hardens the story, including scene data, permits, interviews, photographs, and digital records from the task.
3. Separate active failures from latent conditions so corrective actions strengthen controls instead of defaulting to retraining, warnings, or extra signatures.
4. Verify RCA closure at 30 and 90 days, because completed actions do not prove that recurrence risk changed in the work area.
5. Use Andreza Araujo's safety culture diagnostics when repeated incidents show that reports close faster than controls, leadership routines, and speak-up behavior improve.
When a serious incident report ends with operator error, the organization may have named the last visible action while leaving the failure mechanism untouched. This article shows how an EHS manager can run root cause analysis in a way that reaches procedures, barriers, supervision, and leadership decisions without turning the investigation into a courtroom.
RCA quality also depends on what enters the system before harm occurs. A company that treats near-miss reporting as a volume metric will often miss the weak control pattern that later appears in the formal investigation.
Why operator error is usually an incomplete RCA
Operator error is a description of what happened at the sharp end, not a root cause by itself. James Reason's work on active and latent failures remains useful here, since the person closest to the injury often exposes weaknesses that were created much earlier in planning, design, procurement, training, or supervision.
Across 25+ years leading EHS at multinationals, Andreza Araujo has seen the same pattern in different countries: the faster a report reaches a guilty person, the slower the company becomes at fixing repeatable conditions. The investigation may satisfy a deadline, although it does not explain why the normal system allowed the unsafe choice to look reasonable at the time.
A stronger RCA asks what had to be true for the action to make sense. In a maintenance task, including one governed by a working-at-height rescue plan, that means examining the permit, isolation method, time pressure, crew size, handover quality, and whether the supervisor had enough authority to stop a job whose schedule had already been promised to production.
1. Preserve the first 24 hours before the story hardens
The first 24 hours after an incident decide whether RCA starts with evidence or with a defensive narrative. Teams should secure the scene, photograph barrier positions, collect permits and checklists, identify witnesses, and capture machine data before memory, cleanup, and hierarchy reshape the facts.
What most investigations miss is the social pressure of the first meeting. If the plant manager asks, "Who made the mistake?", every witness begins to protect a position; if the opening question is about how work was actually organized, the team can still describe the system in which the event emerged.
Use a simple evidence map with four columns: physical evidence, documents, digital records, and interviews. In the interview column, separate what the person saw from what the person believes, because beliefs can guide hypotheses but should not be treated as facts.
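The four-column evidence map can be sketched as a small data structure. This is a minimal illustration, not a prescribed tool: the names `Source`, `EvidenceItem`, and `build_evidence_map`, the `observed` flag, and the sample entries are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    """The four columns of the evidence map described in the text."""
    PHYSICAL = "physical evidence"
    DOCUMENT = "documents"
    DIGITAL = "digital records"
    INTERVIEW = "interviews"


@dataclass
class EvidenceItem:
    source: Source
    description: str
    # Hypothetical flag: True when the witness saw it directly,
    # False when the statement is a belief or inference.
    observed: bool = True


def build_evidence_map(items):
    """Group items by column and set interview beliefs aside as hypotheses."""
    columns = {source: [] for source in Source}
    hypotheses = []
    for item in items:
        if item.source is Source.INTERVIEW and not item.observed:
            hypotheses.append(item)  # guides the inquiry, not treated as fact
        else:
            columns[item.source].append(item)
    return columns, hypotheses


# Illustrative entries only; they do not come from a real investigation.
items = [
    EvidenceItem(Source.PHYSICAL, "Guard found in the open position"),
    EvidenceItem(Source.INTERVIEW, "Saw the valve handle in the open position"),
    EvidenceItem(Source.INTERVIEW, "Believes the permit was rushed", observed=False),
]
columns, hypotheses = build_evidence_map(items)
```

Keeping beliefs in a separate list means they stay visible to the team without being counted as evidence in the interview column.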
2. Separate the active failure from the latent condition
An active failure is the action near the event, while a latent condition is the organizational weakness that sat in the system before the shift began. Reason's distinction matters because corrective actions aimed only at the active failure usually produce retraining, warnings, or a new signature field.
As Andreza Araujo argues in Safety Culture: From Theory to Practice, culture is revealed by repeated decisions, not by slogans. If a crew bypasses a verification step every Friday afternoon, the relevant question is not only why the worker did it, but why planning, supervision, and production routines made that shortcut ordinary.
A practical RCA should list at least three latent conditions for each active failure. For example, a missed lockout can point to unclear equipment identification, poor isolation drawings, and a supervisor whose span of control is too wide for simultaneous critical jobs.
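The three-latent-conditions rule can be checked mechanically before a report goes to review. The sketch below assumes the findings are kept as a simple mapping from each active failure to its latent conditions; the function name `under_analyzed` and the second sample finding are hypothetical.

```python
MIN_LATENT_CONDITIONS = 3  # threshold taken from the rule stated above


def under_analyzed(findings, minimum=MIN_LATENT_CONDITIONS):
    """Return the active failures that list fewer latent conditions than the minimum."""
    return [active for active, latent in findings.items() if len(latent) < minimum]


findings = {
    # Example from the text: one active failure, three latent conditions.
    "missed lockout": [
        "unclear equipment identification",
        "poor isolation drawings",
        "supervisor span of control too wide for simultaneous critical jobs",
    ],
    # Hypothetical finding that still needs more analysis.
    "valve opened during handover": ["incomplete shift handover note"],
}
flagged = under_analyzed(findings)
```

A flagged entry does not mean the finding is wrong, only that the team has not yet looked far enough upstream.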
3. Test whether the control existed only on paper
A control that exists in a procedure but cannot be executed under real work conditions is not an effective barrier. In many serious events, the document is technically correct while the field arrangement makes compliance slow, ambiguous, or dependent on personal courage.
This is where the investigation should connect with the existing risk matrix. If the matrix rated the task as medium risk while the field depended on a single administrative check, the problem is not only behavior; it is a weak view of critical controls.
Ask the team to reconstruct the control as performed, not as written. The gap becomes visible when the procedure says two-person verification, while staffing records show one technician covering three work fronts during the same two-hour window.
250+ cultural transformation projects supported by Andreza Araujo show why paper controls often overstate operational discipline: leaders see the document, while operators experience the workaround.
4. Use five whys carefully, or it becomes five accusations
The Five Whys method can help a serious injury and fatality (SIF) investigation move from event to condition, although it becomes dangerous when each answer names a person rather than a mechanism. The method should move from visible action to system design, where decisions can be changed and verified.
The common failure is linguistic. "Why did he open the valve?" already narrows the answer toward individual fault; "What conditions made opening the valve possible and plausible?" allows the group to examine labeling, isolation status, handover, supervision, and alarm logic.
Use five whys with evidence rules. Each why needs a document, observation, interview convergence, or data point; when the team cannot support an answer, mark it as a hypothesis and assign verification before the final report.
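The evidence rule can be expressed as a small check on each step of the chain. This is a sketch under assumptions: `WhyStep`, `open_hypotheses`, and the sample chain are illustrative names and data, and the four accepted evidence types are taken from the paragraph above.

```python
from dataclasses import dataclass
from typing import Optional

# Evidence types accepted by the rule described in the text.
EVIDENCE_TYPES = {"document", "observation", "interview convergence", "data point"}


@dataclass
class WhyStep:
    question: str
    answer: str
    evidence_type: Optional[str] = None  # one of EVIDENCE_TYPES, or None if unsupported

    @property
    def status(self):
        """A step without accepted evidence stays a hypothesis."""
        return "supported" if self.evidence_type in EVIDENCE_TYPES else "hypothesis"


def open_hypotheses(chain):
    """Return the steps that still need verification before the final report."""
    return [step for step in chain if step.status == "hypothesis"]


# Hypothetical chain for illustration only.
chain = [
    WhyStep(
        "What conditions made opening the valve possible?",
        "Isolation status was not shown at the valve",
        evidence_type="observation",
    ),
    WhyStep(
        "Why was the isolation status not shown?",
        "Tags were removed at shift handover",
    ),
]
pending = open_hypotheses(chain)
```

Each item in `pending` would be assigned an owner and a verification task instead of being written into the report as a cause.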
5. Link corrective actions to barrier strength
A corrective action should increase barrier strength, not merely prove that the company reacted. Training has a place, but it is a weak answer when the investigation found missing engineering controls, confusing displays, impossible workload, or permit steps that supervisors cannot realistically audit.
During Andreza Araujo's PepsiCo South America tenure, where the accident ratio fell 50% in six months, the useful lesson was not that campaigns solve incidents. The lesson was that leadership cadence, operational discipline, and visible decisions must change together when a serious pattern appears.
Classify each action as elimination, substitution, engineering, administrative, or PPE, then identify who owns verification. If most actions sit in training and communication, the RCA probably found symptoms rather than controls.
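The classification step can be sketched as a short review function. It assumes that training and communication actions are logged as administrative controls and that "most" means more than half; both are assumptions of this example, as are the names `Control` and `review_actions` and the sample action list.

```python
from collections import Counter
from enum import Enum


class Control(Enum):
    """Hierarchy of controls, in the order listed in the text."""
    ELIMINATION = 1
    SUBSTITUTION = 2
    ENGINEERING = 3
    ADMINISTRATIVE = 4  # assumed to include training and communication
    PPE = 5


def review_actions(actions):
    """actions: list of (description, Control, verification_owner or None)."""
    counts = Counter(control for _, control, _ in actions)
    missing_owner = [desc for desc, _, owner in actions if not owner]
    admin_share = counts[Control.ADMINISTRATIVE] / len(actions) if actions else 0.0
    return {
        "counts": counts,
        "missing_owner": missing_owner,
        # Assumption: "most" is read as more than half of all actions.
        "mostly_administrative": admin_share > 0.5,
    }


# Hypothetical action list for illustration only.
actions = [
    ("Retrain crew on lockout procedure", Control.ADMINISTRATIVE, "Training lead"),
    ("Issue a safety alert to all shifts", Control.ADMINISTRATIVE, None),
    ("Install an interlock on the isolation valve", Control.ENGINEERING, "Maintenance manager"),
]
review = review_actions(actions)
```

In this sample, two of three actions are administrative and one has no verification owner, which is the pattern the paragraph above warns about.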
6. Audit the culture signal inside the investigation
An investigation is also a culture test because people watch what the organization protects first: learning, reputation, production, or hierarchy. The first visible decision after an event tells the workforce whether reporting will expose risk or expose the reporter.
A safety culture diagnosis becomes especially relevant after a serious event, since interviews can reveal whether supervisors feel allowed to challenge schedule pressure. In Safety Culture Diagnosis: Learn how to do your own, Andreza Araujo treats perception gaps as operational data, not as soft opinion.
Review the investigation process itself. If witnesses avoid specifics, if contractors are interviewed last, or if the report removes management decisions from the timeline, the RCA is generating a culture signal that may discourage the next near-miss report.
Each week without correcting this signal makes the next investigation weaker, because the workforce learns which facts are safe to say and which facts should stay outside the official record.
7. Close the RCA only when recurrence risk changes
An RCA is closed when recurrence risk has changed, not when every action has a green status. The final review should test whether controls are stronger, responsibilities are clearer, and similar tasks would fail differently if the same pressure returned.
The trap is administrative closure. A spreadsheet can show 100 percent completion while the field still carries the same ambiguous permit, the same understaffed shift, and the same supervisor who is expected to audit three critical jobs at once.
Set a 30-day and 90-day verification point for serious incidents. The 30-day review confirms that actions were installed, while the 90-day review checks whether behavior, planning, and leading indicators changed in the work area where the event occurred.
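The two verification points and the closure test can be sketched as follows. The text does not say which date the 30 and 90 days are counted from, so the `reference_date` argument is an assumption the team would need to define; the function names are also hypothetical.

```python
from datetime import date, timedelta


def verification_schedule(reference_date):
    """Return the 30-day and 90-day review dates counted from reference_date."""
    return {
        "day_30_installed": reference_date + timedelta(days=30),
        "day_90_effective": reference_date + timedelta(days=90),
    }


def can_close(all_actions_complete, installed_confirmed, field_change_confirmed):
    """Closure needs both reviews to pass; a fully green action list is not enough."""
    return all_actions_complete and installed_confirmed and field_change_confirmed


schedule = verification_schedule(date(2024, 1, 10))  # hypothetical reference date
closable = can_close(True, True, False)  # actions done, but no field change confirmed yet
```

With these inputs the RCA stays open: every action is complete and installed, yet the 90-day review has not confirmed a change in the work area.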
Comparison: blame-centered RCA vs barrier-centered RCA
| Decision point | Blame-centered RCA | Barrier-centered RCA |
|---|---|---|
| Opening question | Who failed to follow the rule? | What conditions made the rule fail in real work? |
| Primary evidence | Witness statements and rule deviation | Scene data, documents, interviews, and control verification |
| Typical action | Retrain, warn, add signature | Redesign the barrier, clarify ownership, verify field execution |
| Leadership role | Approve the report after the fact | Remove schedule, resource, and authority conflicts found by the team |
| Closure test | All actions marked complete | Recurrence risk reduced and checked after 30 and 90 days |
Conclusion
RCA becomes useful when it explains why the operator's action was possible, plausible, and insufficiently blocked by the system around the task. The report should leave leaders with decisions to make, not only workers with rules to remember.
If your organization needs to turn serious incident findings into stronger controls, leadership routines, and a culture that reports risk earlier, connect with Andreza Araujo and build the next investigation around evidence rather than blame.
Frequently asked questions
Is operator error a valid root cause in RCA?
How do you avoid blame in incident investigation?
What is the difference between active failure and latent condition?
When should RCA corrective actions be closed?
How does Andreza Araujo approach RCA and safety culture?
About the author
Andreza Araujo
Global Safety Culture Specialist
Andreza Araujo is an international reference in EHS, safety culture and safe behavior, with 25+ years leading cultural transformation programs in multinational companies and impacting employees in more than 30 countries. Recognized as a LinkedIn Top Voice, she contributes to the public conversation on leadership, safety culture and prevention for a global professional audience. Civil engineer and occupational safety engineer from Unicamp, with a master's degree in Environmental Diplomacy from the University of Geneva. Author of 16 books on safety culture, leadership and SIF prevention, and host of the Headline Podcast.
- Civil Engineer (Unicamp)
- Occupational Safety Engineer (Unicamp)
- Master in Environmental Diplomacy (University of Geneva)