Sesame Street Wisdom: Anomaly Detection Beyond Machine Learning
February 2, 2024
Art Morales, Ph.D.
We’ve all grown up watching Sesame Street, the colorful world filled with heartwarming characters, catchy songs, and lessons that last a lifetime. While it might seem like a leap, believe it or not, Sesame Street has been teaching us data science principles long before most of us knew what data science was. Remember the classic segment, “One of These Things Is Not Like the Others” As it turns out, this simple exercise in pattern recognition serves as an apt introduction to a crucial data science concept: Anomaly Detection.
In data science, anomaly detection identifies data points that do not conform to expected behavior. Such anomalies or outliers could be the result of errors, but they could also indicate important occurrences like fraud in financial transactions or tumors in medical imaging.
Machine Learning Isn’t the Only Way
With the advent of artificial intelligence and machine learning, algorithms like Isolation Forests and Autoencoders have gained popularity in detecting anomalies. These techniques often outperform traditional methods in terms of speed and accuracy, especially when dealing with large and complex datasets.
However, machine learning isn’t the only tool in our toolbox for anomaly detection. In certain fields like healthcare and life sciences, the ‘black box’ nature of machine learning models can be a drawback. Medical practitioners and scientists often require a level of transparency that these models can’t provide, which is where Statistical Process Control (SPC) comes in.
Statistical Process Control is a method used primarily in quality control and has a long history in industrial manufacturing. In the healthcare sector, SPC can be employed to monitor patient vitals, streamline operations, or even optimize clinical decision-making processes.
Advantages of SPC
· Transparency: Every calculation is clear and understandable, making it easy to explain to non-technical stakeholders.
· Low Complexity: The methods are relatively simple, requiring only basic statistical calculations.
· Contextual Understanding: SPC provides a more in-depth understanding of process behavior over time, useful for continual improvement.
For example, Control Charts, an SPC technique, are commonly used in monitoring the heart rate or blood sugar levels of patients. These charts help medical professionals understand what constitutes a ‘normal range’ for each individual patient and spot any drastic deviations that may require immediate action.
As much as SPC is a proven and powerful technique for outlier detection, it works better with single or low-dimensional metrics. As systems get more complex, SPC can start to decrease the signal-to-noise ratio and bury the analysts in a sea of false or mathematically significant but unimpactful outliers. This is where Machine Learning can come to the rescue.
Machine learning approaches excel at handling large multi-dimensional datasets, which is increasingly important in modern healthcare and many other industries. Other Deep Learning methods can also be used to identify anomalies in complex image data, like MRI scans or X-rays. These techniques are quite powerful but still have dependencies and risks that must be addressed. This is where domain expertise helps generate impactful insights in context.
Once we start relying on AI/ML approaches to identify outliers, we must bring additional considerations into play and these considerations must ride alongside the technology and scientific improvements:
· Explainability: For AI to be trusted, especially in critical fields like healthcare, the decision-making process must be transparent and tractable.
· Labeling and Verification: Confirming the ‘goodness’ or ‘badness’ of an outlier requires labeled data, which is often expensive or challenging to acquire in a medical context.
· Real-world Validation: To assess the accuracy and reliability, AI-driven methods must be compared with real-world outcomes.
· Impactfulness: Mathematical Significance is not enough, Outliers must also be evaluated for impact and significance, otherwise we just end up with too many distracting signals.
· FLAIRT Principles: As discussed in a previous blog, we must understand and trust our data. Building on the Principles of FAIR (Findable, Accessible, Interoperable, and Reusable), we must also understand their Lineage to establish Trust.
As Sesame Street taught us, spotting the outlier is often the first step toward understanding complex systems. Both SPC and Machine Learning have a role to play in this quest. While the straightforward and transparent methods of SPC make it invaluable in sensitive industries like healthcare, the advanced capabilities of machine learning models hold the promise of revolutionizing anomaly detection as we know it.
Just like the multifaceted lessons we learned on Sesame Street, a balanced approach combining traditional and modern methods can be both effective and enlightening. So, the next time you watch “One of These Things Is Not Like the Others,” remember, you’re not just singing along; you’re revisiting a foundational principle of data science… but don’t forget, context, explainability, impactfulness, and domain significance are still important!