Unveiling the Illusion: Synthetic Data's Limitations in Unravelling the Unknown Unknowns

Dr M Maruf Hossain, PhD, GAICD
Mar 2

Synthetic data has become a buzzworthy topic in recent times, offering a glimmer of hope for addressing the challenge of limited high-quality data for training AI and ML models. The other day, an enthusiastic salesperson came to me with a pitch for a product that claimed to generate synthetic data. Now, don’t get me wrong, AI and ML models are undoubtedly going to shape the future of work. However, I have some reservations about relying solely on synthetic data to build these models. Let me explain why.

Originally published at LinkedIn Pulse on 12 June 2023.

Photo by Mikael Blomkvist from pexels.com

Based on a decade of building ML models, many of which now run in production, I can confidently say that relying exclusively on data generation algorithms such as SMOTE is unrealistic. One thing has become abundantly clear to me: synthetic data alone falls short when it comes to capturing the elusive realm of the unknown unknowns. These unknown unknowns are the uncharted territories of knowledge that lie beyond our current understanding.

Let me break this down in simpler terms. Picture a quadrant chart that represents our knowledge and the queries we use to gather it.

In the bottom left, the green quadrant, we have the known knowns. This is where both the question and the answer are already known to us. For example, the company you work for or the salary you receive fall into this category.

Moving on to the amber quadrants, we encounter the known unknowns and the unknown knowns. For this discussion, I treat the two together. In these quadrants, you either know the question but don’t know the answer, or you already know the answer and so don’t feel the need to ask the question. For instance, a telecommunications company might be aware that some customers will churn, but it doesn’t know exactly who will churn or when. Conversely, we know the sun will rise tomorrow, so we don’t bother asking that question repeatedly.

Now, let’s delve into the top right, the red quadrant, which is where things become more challenging. This is the realm of unknown unknowns. Here, you don’t even know what question to ask, let alone the answer. It’s a state of complete uncertainty. Only when these phenomena occur repeatedly and are captured in the data can we uncover them, and only by analysing that data. It’s like stumbling upon hidden knowledge.

Let’s touch upon synthetic data. We cannot embed knowledge about unknown unknowns in synthetic data unless we are already aware of that knowledge. Synthetic data is generated based on what we already know, so it cannot capture those elusive unknowns.
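To make this concrete, here is a minimal sketch of SMOTE-style interpolation in plain Python. It is a simplified illustration of the idea, not the reference SMOTE implementation: the function names, the sample points and the "unseen" case are all hypothetical. It shows that every generated point lies between points we have already observed, so a real case far outside that region is never produced.

```python
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create synthetic points by interpolating between an observed
    sample and one of its k nearest observed neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` (squared Euclidean distance),
        # excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, other)))
    return synthetic


def inside_observed_box(point, samples, eps=1e-9):
    """True if every coordinate lies within the min/max range of `samples`
    (with a small tolerance for floating-point rounding)."""
    return all(
        min(s[i] for s in samples) - eps <= point[i] <= max(s[i] for s in samples) + eps
        for i in range(len(point))
    )


# Hypothetical observed minority-class samples (two features each).
observed = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
synthetic = smote_like_samples(observed, n_new=100)

# A hypothetical "unknown unknown": a real case unlike anything observed.
unseen = (9.0, 0.1)

all_inside = all(inside_observed_box(p, observed) for p in synthetic)
print(all_inside)                             # synthetic points stay in the known region
print(inside_observed_box(unseen, observed))  # the unseen case does not
```

More sophisticated generators differ in the details, but they are still fitted to the data and assumptions we already have.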

The quadrant chart helps us understand the different categories of knowledge and the corresponding queries we make to expand our understanding. Known knowns are what we already know. Known unknowns and unknown knowns are questions without answers or answers without questions. Unknown unknowns present the most difficult challenge, because we don’t even know what questions to ask until we discover patterns in the data. In a nutshell, synthetic data can’t capture the unknown unknowns unless we already know about them beforehand.

Consequently, giving an AI or ML model the robustness it needs to handle real-world scenarios becomes an unattainable goal if we rely solely on synthetic data. While synthetic data can certainly help fill in some gaps and augment our existing knowledge, it cannot fully replace the richness and complexity of real-world data. Real-world data is where we encounter the unexpected, the outliers, and the nuances that synthetic data algorithms may struggle to capture.

Furthermore, models trained on one dataset often do not generalise well to another. IBM Watson Health serves as a prime example of such a multi-million-dollar failure. Its major setbacks stemmed from the revelation that its cancer diagnostics tool was reportedly not trained on real patient data but on hypothetical cases provided by a small group of doctors at a single hospital. Used on its own, synthetic data is generally a poor basis for ML: it is hard to match to the distribution of the real population, so models trained on it often fail to generalise across that population. This leads to blind spots in the cases the generator never covered. Even when using real-world data, there is no guarantee that the model will not encounter entirely new patterns in real-world scenarios, which is why a mechanism for retraining is necessary.
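One simple way to picture such a retraining mechanism is a drift check that compares incoming data against the data the model was trained on. The sketch below is only an illustration under stated assumptions: the function name, the thresholds and the sample values are hypothetical, and a production system would typically monitor many features with more robust statistics.

```python
import statistics


def needs_retraining(training_values, live_values,
                     z_threshold=3.0, max_outlier_share=0.05):
    """Flag retraining when too many live values sit far from the
    training distribution (more than z_threshold standard deviations
    from the training mean)."""
    mean = statistics.fmean(training_values)
    std = statistics.stdev(training_values)
    outliers = [v for v in live_values if abs(v - mean) > z_threshold * std]
    share = len(outliers) / len(live_values)
    return share > max_outlier_share, share


# Hypothetical values of a single feature seen during training.
training = [10.2, 9.8, 10.5, 10.1, 9.9, 10.0, 10.3, 9.7]

# Hypothetical live data: one batch resembles training, one has shifted.
live_similar = [10.1, 9.9, 10.4, 10.0]
live_shifted = [10.1, 14.0, 15.2, 9.9]

print(needs_retraining(training, live_similar))
print(needs_retraining(training, live_shifted))
```

A check like this only tells us that something new has appeared; understanding what it is still requires analysing the real data.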

To truly create ML models that navigate the complexities of the real world, we need a balanced approach that combines synthetic and real-world data. By leveraging a diverse range of data sources, we can increase the chances of uncovering those unknown unknowns and equipping our models with the necessary adaptability and resilience.

So, while synthetic data has its merits and can be a valuable tool in the AI and ML toolkit, it should not be seen as a silver bullet that can replace the need for real-world data. Embracing a holistic approach that integrates both synthetic and real data will enable us to build more robust, insightful models that better tackle the challenges of the ever-evolving world we live in.