AI Didn’t Fail. Leadership Did.

Dr M Maruf Hossain, PhD, GAICD
Feb 27
10 min read

The Governance Gap Behind Woolworths, Commonwealth Bank, and the Next Headline You Haven’t Read Yet

For boards and executive teams navigating AI adoption in 2026.

Imagine calling Australia’s largest supermarket chain for help with your grocery order. The company has invested heavily in artificial intelligence. They’ve partnered with Google. They’ve announced their new AI assistant to the market with genuine fanfare. You type your question, and the chatbot tells you about its mother.

This is not a hypothetical. This is Woolworths in February 2026.

Olive, the retailer’s AI assistant powered by Google’s Gemini platform, made headlines for all the wrong reasons. Customers reported the system making personal claims—including unprompted references to its mother and details about its personal life. More damaging for a retail operation, the system simultaneously produced pricing errors on basic grocery items and delivered inconsistent information on stock and pickup fees. The warm, anthropomorphised persona the company had invested in made the failures land harder. When a system presents itself as a person, customers hold it to a person’s standard of accuracy. When it fails that standard on milk prices, the reputational damage compounds.

The technical post-mortem is instructive: Olive’s mother hallucinations were triggered by legacy decision-tree scripts from an older system that were never properly mapped to the new large language model backend. When certain customer inputs—including what the system interpreted as birthdate fields—were submitted, they activated fun fact scripts programmed years prior. Nobody caught this before deployment because, in practice, there was no QA discipline designed to catch it.

This is not a technology story. It is a governance story. And Woolworths is not alone.

The Commonwealth Bank Fiasco: A Pattern Becomes Undeniable

Six months before Olive went rogue, the Commonwealth Bank of Australia made a decision its executive team would deeply regret.

In August 2025, CBA announced the redundancy of 45 customer service employees and the replacement of those roles with an AI-powered voice bot. The bank projected the bot would reduce call volumes by 2,000 per week, improve efficiency, and demonstrate the bank’s positioning as a digital leader. The Finance Sector Union called it what it was: cost-cutting dressed in technology-speak.

Within weeks, the strategy had collapsed. Call volumes did not decrease. Wait times escalated. Existing staff, including team leaders, were forced to work significant overtime to manage the system’s failure, despite the bank’s having been so publicly confident about it. Following a workplace relations tribunal, CBA admitted its initial assessment “did not adequately consider all relevant business considerations” and offered to reinstate all 45 employees.

The reversal was total. The embarrassment was public. And the two incidents, placed side by side, reveal something more significant than two bad months for two major brands.

They reveal a pattern.

In both cases, a major Australian organisation made a significant AI commitment—financial, strategic, and reputational—without the foundational validation required to know whether the technology was ready for production. In both cases, the failure was not the AI itself. The failure was leadership applying governance designed for static, deterministic systems to probabilistic, non-deterministic AI—and then being surprised when the results were unpredictable.

The Diagnosis: A Cognitive Failure, Not a Technical One

It would be convenient to attribute these failures to technical complexity—to position AI as simply too novel, too opaque, too unpredictable for traditional oversight to manage. That framing is both incorrect and dangerous, because it removes accountability from the people who are actually responsible.

The root cause of both failures is a leadership mental model problem. In 2026, the dominant AI failure mode is no longer technical. It is cognitive.

Boards continue to evaluate AI primarily as a cost lever rather than as a probabilistic operational system requiring continuous validation. This single mispricing cascades into three forms of institutional debt that never appear on the balance sheet until it is too late:

Reputational debt: the erosion of customer and stakeholder trust when AI fails publicly, visibly, and in ways that feel careless. Woolworths built a warm persona designed to strengthen the brand relationship. That persona made the pricing errors feel like a betrayal rather than a glitch.

Operational debt: the hidden fragility that accumulates when AI is deployed into workflows before those workflows—and their edge cases, exception paths, and emotional complexity—are fully understood. CBA’s voice bot was provisioned to handle routine balance enquiries. In practice, customer service calls are rarely routine. The bot couldn’t cope. The humans it replaced were no longer there to catch it.

Innovation debt: the strategic cost of using AI as a labour arbitrage tool rather than as a signal detection system that surfaces unmet needs. This is the deepest failure. When AI is deployed primarily to reduce headcount rather than to augment capability, organisations are not innovating. They are substituting. And as CBA discovered, substitution without validation is simply risk transfer—from the AI team to the front page.

While 53% of Australian boards identify cybersecurity as a top priority, only 37% have conducted an audit of how employees are actually using AI within their organisations. The governance ambition exists. The governance architecture does not yet match it.

The Structural Gap: Reactive GRC vs. Proactive QA

To understand why these failures keep occurring, it is necessary to name the operating model failure clearly.

Traditional Governance, Risk, and Compliance (GRC) frameworks—the tools boards rely on to provide assurance—were built for a different operating environment. They excel at periodic audits, regulatory reporting, and post-incident containment. They were designed for a world where risks unfold at human speed, where a quarterly review cycle can meaningfully capture organisational exposure.

GRC is a right-of-boom discipline. It asks: Were we compliant last quarter? It produces post-incident reports, remediation plans, and briefings for regulators. It assumes that failures will present themselves gradually and that documentation will follow.

AI does not operate on that timeline. A language model can hallucinate, drift from its validated baseline, produce biased or harmful outputs, or fail catastrophically—between one audit cycle and the next, in front of millions of customers, with no warning in any backward-looking report.

Applying reactive GRC to non-deterministic AI systems does not provide governance. It provides the appearance of governance. And that distinction is precisely what Woolworths and CBA discovered at high cost.

Proactive Quality Assurance (QA) is a left-of-boom discipline. It asks: Are we safe right now, under live conditions, and will we remain so as the environment shifts? It identifies vulnerabilities before they manifest as customer harm. It operates continuously, not periodically. It treats AI systems not as software to be approved and released, but as probabilistic systems to be continuously observed, tested, and validated.

Olive failed not because large language models hallucinate—they do, and any competent QA function knows this. Olive failed because Woolworths deployed a probabilistic AI system into production with governance built for deterministic, static software. That is not a technology error. It is a governance category error.

The AI Posture Problem: Why Boards Underestimate Their Exposure

There is a board objection worth addressing directly, because it will be raised: Our AI use case is simpler. We’re not doing anything as complex as Woolworths or CBA.

This objection typically reflects a failure to distinguish between two fundamentally different risk classes—and the distinction matters enormously for the depth of QA required.

An Internal Transformer uses AI to optimise internal operations: automating document processing, improving internal search, and accelerating back-office workflows. The risk profile is meaningful but contained. Failures are largely internal. Recovery is faster.

A Business Pioneer deploys AI in direct contact with customers and staff: conversational agents, automated decisioning on credit or claims, and AI-driven workforce changes. The risk profile is exponentially higher. Failures are immediate, public, and reputational.

Woolworths’ Olive and CBA’s voice bot were Business Pioneer deployments governed with Internal Transformer controls. That category error is not a nuance. It is the mechanism of failure.

Before any board approves an AI deployment, the first governance question should be: Which posture does this represent, and are our QA standards calibrated to match? If that question is not being asked and documented, the organisation is not governing its AI. It is hoping its AI governs itself.

The Solution: Proactive QA as Strategic Infrastructure

The good news—and there is genuinely significant good news here—is that the discipline required to govern AI responsibly is not new. Quality Assurance has existed as a rigorous professional practice for decades. What is new is the imperative to elevate it from a back-office technical function to a board-level strategic commitment. Three practices make this concrete.

AI Red Teaming

Before any AI system meets a customer, it should undergo structured adversarial testing—deliberate attempts to induce hallucinations, misfires, biased outputs, or conflicts with organisational values and regulatory requirements. This is not standard software testing, which verifies that a system does what it is supposed to. AI red teaming attempts to discover what the system could do under stress, in edge cases, when legacy integrations behave unexpectedly. Had Woolworths red-teamed Olive against the full range of customer inputs—including birthday fields that could activate years-old scripts—February’s embarrassment is a pre-deployment finding, not a public headline.

Continuous Control Monitoring

Annual compliance audits are structurally inadequate for AI oversight. The technology changes continuously; its risk profile changes with it. Modern platforms enable automated, real-time monitoring that continuously validates AI behaviour against defined standards—detecting deviations as they emerge rather than weeks or months after the damage is done. For a large enterprise, transitioning from periodic audits to continuous monitoring can reduce the time to detect issues by up to 98%. The CBA tribunal could have been a dashboard escalation rather than a public admission of institutional error.

Human-in-the-Loop Governance Architecture

CBA’s fundamental mistake was treating its AI voice bot as a set and forget deployment. The most effective AI systems define explicit risk thresholds above which human judgment is mandatory—transaction complexity, emotional register, regulatory sensitivity—and they establish clear escalation paths that surface decisions to human oversight before they become crises. AI is not a replacement for human judgment in complex situations. It is an extension of human capacity in routine ones. The governance architecture must explicitly encode that distinction.

What Boards Must Require

Translating this into board-level practice requires four specific commitments.

Codify the AI posture. The board and executive team must explicitly agree on whether each AI deployment is an Internal Transformer or a Business Pioneer, and ensure QA standards are calibrated accordingly. This conversation needs to happen at the board table, not be delegated to the technology function. The posture determination sets the acceptable risk threshold, the required depth of pre-deployment validation, and the conditions under which deployment should be blocked or rolled back.
Establish a formal AI assurance capability with direct board visibility. A dedicated AI Ethics and Compliance Committee—with representation from technology, risk, legal, and human resources—provides the cross-functional perspective required for AI deployments. This committee should own the scaling rules: the validated QA gates that an AI system must pass before moving from pilot to production. Both Woolworths and CBA appeared to scale before those gates were established.
Operationalise with tooling, not just policy. Governance that exists only in written frameworks cannot detect a chatbot hallucinating about its mother in real time. Boards should require management to report on live AI governance indicators—the metrics that signal governance health before failures occur, not after.
Design human-AI collaboration as a system, not a trade. The instinct to position AI as a replacement for human labour is understandable as a cost-reduction strategy, but it is demonstrably fragile as an execution model. CBA’s subsequent pivot—investing in upskilling tens of thousands of employees for a human-AI collaborative model—is both the more defensible and the more strategically sophisticated approach. The question is not whether AI can perform a function. The question is whether the organisation has validated that it performs that function reliably, across the full distribution of real-world scenarios, before the humans who provide the safety net are removed.

The Board Metrics That Matter

Governance without measurement is aspiration. Boards should require the following leading indicators in regular AI reporting—and treat their absence as a governance gap requiring immediate remediation.

Override rate: How frequently are AI decisions being manually corrected by humans? A rising override rate is an early signal of model drift or deployment scope creep before it becomes a customer experience failure.
Red-team discovery ratio: What proportion of vulnerabilities are being found by structured pre-deployment testing versus discovered in production by customers? Any organisation where this ratio favours production discovery is governing reactively rather than proactively.
Mean time to detection: How long does it take the organisation to identify an AI behavioural anomaly from the moment it first occurs? This metric is the operational proxy for the maturity of continuous monitoring infrastructure.
Model drift indicators: How much has the AI system’s behaviour deviated from its validated baseline, and over what timeframe? Drift is the silent predecessor to the kind of failures that generate headlines.
Workforce readiness rate: What percentage of staff who interact with or supervise AI systems have completed structured AI-human collaboration training? This is the organisational resilience metric that CBA learned the hard way.

If these metrics are not visible at the board level, AI governance is performative. The board is not governing AI. It is approving it.

The Competitive Opportunity in Getting This Right

This analysis should not be read as a counsel of caution on AI adoption. It is precisely the opposite.

The organisations that will define Australian enterprise leadership in the next decade are not necessarily the ones that deployed AI first. They are the ones who built the assurance architecture that allowed them to deploy AI repeatedly, safely, and at scale—without reversals, without tribunal pressure, without public apologies for decisions made prematurely.

In a market increasingly weary of AI theatrics, reliability becomes a differentiator. The average cost of a regulatory violation in this space is estimated at $2.3 million—and that figure excludes the compounding damage of brand erosion, talent exodus, and the strategic credibility hit that follows a public reversal of the kind CBA experienced.

Australian regulators are moving toward mandatory AI governance requirements. The NAIC Guidance for AI Adoption, the National Framework for AI in Government, and the evolving Privacy Act compliance landscape are all pointing in the same direction. Organisations that have built proactive QA infrastructure will demonstrate compliance with relative ease. Organisations that have not will face the same reactive scramble at higher cost, under greater scrutiny, and with greater damage.

Boards that elevate QA from a technical afterthought to a strategic capability will compound their advantage. Boards that do not will continue to compound their headlines.

The organisations that win with AI will not be the ones that moved fastest. They will be the ones that moved with certainty—and certainty in AI is not a property of the model. It is a property of the governance surrounding it.

Concluding Remarks

GRC Provides the Structure. QA Provides the Certainty.

The most clarifying insight from the Woolworths and CBA failures is also the most actionable: both were survivable. Both surfaced at a scale that was painful but recoverable. Both offer a preview of what the next, larger failure will look like if the structural gap between reactive GRC and proactive QA remains open.

For Australian enterprises, AI ambition is no longer the differentiator. AI assurance is.

GRC provides the structural guardrails. QA provides the certainty that AI systems behave as intended in production under real-world conditions with real customers. Without both operating in concert—GRC setting the framework, proactive QA continuously validating within it—organisations are not governing their AI. They are hoping their AI governs itself.

It did not work for Woolworths. It did not work for Commonwealth Bank. The window to act before the next headline is open. The blueprint is clear. The question now belongs entirely to the boards and executives in the room.

The gap between AI ambition and AI assurance is where reputations are made or broken. Closing that gap is a board-level responsibility—and it is achievable today.