In 1968, Lt. Gen. William B. Bunker noted that when developing complex systems “there are two kinds of technical problems: there are the known unknowns, and the unknown unknowns.”
In other words, some problems point to recognizable gaps in our knowledge, but even more troublesome are the things we do not realize we don't know. This axiom was common currency in U.S. defense procurement in the early Space Age, yet it rings truer than ever today amid the rise of artificial intelligence.
The promises of Generative AI are remarkable: automation, scalability, personalization, speed. However, beneath these benefits lies a fundamental issue. As GenAI pervades the world of enterprise solutions, it brings hallucinations along with it. These hallucinations, as it turns out, are a result of unknown unknowns.
GenAI’s Achilles’ Heel: It’s Only as Good as Its Data
GenAI doesn’t “know” anything. It’s not thinking; it’s predicting. It stitches together plausible responses based on the data it has been fed and trained on. What happens when that data is Redundant, Obsolete, or Trivial (ROT)? The outcomes can range from mildly unhelpful to critically defective, potentially jeopardizing your business.
Many, if not most, hallucinations happen when a model cannot find an answer in its training data or connected knowledge sources. These models cannot recognize when they don’t know something; instead, they erroneously fill in the blanks. The system is incapable of saying “I don’t know” or flagging that a request falls outside the scope of the data it was trained on. Rex Booth, CISO of SailPoint, has posited that forcing LLMs to acknowledge such limitations would be a major step toward making their output more reliable.
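What that could look like in practice: the minimal sketch below instructs a model, through its system prompt, to answer only from supplied context and to say so when a question falls outside that scope. The OpenAI-style client, the model name, and the prompt wording are illustrative assumptions, not a documented cure for hallucinations.

```python
# Sketch: force the model to state its limits instead of guessing.
# Prompt wording and model choice are illustrative, not prescriptive.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ABSTAIN_PROMPT = (
    "Answer only from the provided context. If the context does not "
    "contain the answer, reply exactly: 'I don't know -- that is outside "
    "the scope of my data.' Never guess or invent details."
)

def answer(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model; named here only as an example
        messages=[
            {"role": "system", "content": ABSTAIN_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0,  # deterministic output makes refusals easier to test
    )
    return response.choices[0].message.content
```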
The Mayo Clinic is taking a cutting-edge approach to the problem. It has combined the clustering using representatives (CURE) algorithm with LLMs and vector databases to double-check its AI’s data retrieval. Pairing CURE with a reverse Retrieval-Augmented Generation (RAG) approach lets Mayo’s LLM partition the summaries it produces into individual facts and match each fact back to the source documents. A second LLM then scores how well the facts align with those sources, assessing whether a causal relationship exists between the two. One AI acts as a watchdog for the other.
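Mayo’s actual pipeline isn’t public, but the verification loop described above can be sketched in a few lines. In the hypothetical Python below, extract_facts, nearest_source, and score_alignment are stand-ins for the two LLMs and the vector database; toy stubs are included only so the sketch runs end to end.

```python
# Sketch of a "reverse RAG" verification loop: split a generated summary
# into claims, trace each claim to a source document, and score support.
# All helpers are hypothetical stand-ins, not Mayo Clinic's implementation.
from dataclasses import dataclass

@dataclass
class SourceDoc:
    doc_id: str
    text: str

def extract_facts(summary: str) -> list[str]:
    """Stub for LLM #1: partition a summary into atomic factual claims."""
    return [s.strip() for s in summary.split(".") if s.strip()]

def nearest_source(fact: str, corpus: list[SourceDoc]) -> SourceDoc:
    """Stub for a vector-database lookup; here, naive word overlap."""
    def overlap(doc: SourceDoc) -> int:
        return len(set(fact.lower().split()) & set(doc.text.lower().split()))
    return max(corpus, key=overlap)

def score_alignment(fact: str, source_text: str) -> float:
    """Stub for LLM #2: grade how well the source supports the claim (0-1)."""
    words = set(fact.lower().split())
    return len(words & set(source_text.lower().split())) / max(len(words), 1)

def verify_summary(summary: str, corpus: list[SourceDoc], threshold: float = 0.8):
    """Trace every claim back to a source; return the unsupported ones."""
    flagged = []
    for fact in extract_facts(summary):
        source = nearest_source(fact, corpus)
        if score_alignment(fact, source.text) < threshold:
            flagged.append((fact, source.doc_id))  # claim lacks source support
    return flagged  # an empty list means every claim traced back cleanly
```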
“AI Watching AI” Isn’t Enough
Like Mayo Clinic, other enterprises are experimenting with having one AI system check another in a sort of digital buddy system. Clever as this solution is, the LLMs involved are still working from the same flawed data that caused the issue in the first place. Other organizations are leaning on more human oversight to address the problem, but that approach erodes the efficiency gains GenAI promised. Many are turning to third parties to evaluate and try to improve their GenAI accuracy, an expensive proposition.
Instead of these reactive approaches to GenAI’s flawed outputs, organizations need to look upstream. Soumendra Mohanty, chief strategy officer at Tredence, asserts that “GenAI models hallucinate not just because they’re flawed, but because they’re being used in environments that were never built for machine decision-making.” Enterprises can’t expect AI to behave perfectly in a messy and outdated system, he says. To move past this, Mohanty suggests “CIOs need to stop managing the model and start managing the system around the model. This means rethinking how data flows” and “the ways in which AI is embedded in business processes.”
In other words, curate the well of data and instill trust in the entire system, not just the language model itself. This is where data governance comes into play.
Why GenAI Needs Data Governance
Recent research from IDC found that 88% of observed AI proofs of concept (POCs) didn’t make the cut to full-scale use. For every 33 AI POCs a company launched, only four advanced to production. The authors report that this low conversion rate “indicates the low level of organizational readiness in terms of data, processes and IT infrastructure,” listing insufficient AI-ready data as one of the main challenges to widespread AI deployment.
Think of GenAI as a sponge. It soaks up everything. This includes good data, sensitive data, and ROT. Without strong data governance in place, AI systems are drinking from a contaminated well.
Governance defines what data is authoritative. It sets policies around privacy and lifecycle. It tells the AI, “Trust this document, not that outdated email chain from 2017.” It finds redundancies, ensures current context, and enforces compliance. You can’t control what GenAI creates if you can’t control what it consumes.
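As a rough illustration of that consumption control, the sketch below filters a corpus by governance metadata before any of it reaches a model. The Record fields (an authoritative flag, a last-reviewed date, a retention status) are assumed for illustration; real governance schemas will vary by organization and tooling.

```python
# Sketch: governance-aware filtering of what an AI system may consume.
# The schema and thresholds are illustrative assumptions, not a product spec.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Record:
    doc_id: str
    text: str
    authoritative: bool      # flagged by governance as a source of truth
    last_reviewed: date      # when a data steward last validated the content
    retention_expired: bool  # past its lifecycle policy (candidate ROT)

def governed_corpus(records: list[Record], max_age_days: int = 730) -> list[Record]:
    """Keep only records the AI is allowed to consume."""
    cutoff = date.today() - timedelta(days=max_age_days)
    return [
        r for r in records
        if r.authoritative             # trust the designated source of truth...
        and not r.retention_expired    # ...drop expired, obsolete content...
        and r.last_reviewed >= cutoff  # ...and anything not validated recently
    ]
```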
GenAI Reflects Data Hygiene
The path to trustworthy AI starts with responsible data management. GenAI doesn’t supersede data governance but rather makes it more essential than ever. Whether it’s preventing hallucinations or ensuring compliant outputs, the common thread is always the same: clean, governed, reliable data.
Before you scale your next GenAI initiative, ask yourself if your data landscape is ready for it. Because in the age of AI, the data you ignore today becomes the liability you explain tomorrow.
----------
Interested in cleaning and governing your data for AI? Contact ZL Tech.