Data Mastering

August 22, 2024

Art Morales, Ph.D.

Grad school at MIT involved an awful lot of running. Most of my lab shared this interest, and I, wanting a piece of this collective bonding, got roped in. The Charles River was our circuit most days, especially around Marathon Day, a time of palpable excitement. Now, I was not the right body type or weight, but there was something infectious about this atmosphere, and running a marathon became a subconscious, borrowed dream of mine.

A work email in 2011 detailing the search for volunteers to participate in the Dana-Farber Marathon Challenge nudged me to realize my aspiration. I mean, this email…in my inbox…for a good cause, it must be serendipity, right? Tell that to me a couple of months later and then again in 2012: a puddle of sweat somehow managing to cross the finish line. The dream had lost its allure.   

It just so happened that in 2013 I was vacationing far away from Boston, but I was on the receiving end of messages from concerned friends asking after me: “Just saw the news and thought of you…”

In years prior, my family had waited for me near the finish line with energy drinks in hand, close to where the bombs went off. This was true serendipity, but too many volunteers and their loved ones could not say the same. The proximity of this tragedy to my life, coupled with its inescapability on the news, led me down a rabbit hole of investigation.

My most disturbing finding, and the one that sticks with me today, is that one of the biggest opportunities to prevent the bombing was missed because the name on an alert wasn’t an exact match to the name in another database (Tsarnaev vs. Tsarnaeva). That a seemingly trivial discrepancy could have such catastrophic implications serves as a reminder for us to master our data meticulously.  


---  


Every organization, including Homeland Security, that has unmastered records or rudimentary databases needs to work on centralization to optimize business processes. Every organization must face this problem, and must do so internally, because an enterprise’s source of truth varies depending on how much of its data is confidential as opposed to externally licensed.

I have inherited data mastering projects where powerful and expensive platforms were in place, but the initial team’s approach to data mastering was overly simplistic. In such cases I have had to rebuild the data mastering process from scratch, returning to and reviewing every source.  

Where data mastering goes wrong is in relying solely on ‘identifiers’ such as license numbers, Social Security numbers, or National Provider Identifiers (NPI), the intuitive standard for tracking individuals across databases. On their own, however, identifiers fall short of ensuring data association: every data-entry error and every outdated record leaves another unresolved connection, and those gaps can cause critical misses with grave consequences, like the one that contributed to the events of April 15, 2013.
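To make the point concrete, here is a minimal Python sketch of how an exact comparison misses a near-identical name while a simple similarity score catches it. The function names and the 0.9 threshold are my own illustrative assumptions, and the standard library's `difflib` stands in for whatever matching logic a real mastering platform would use.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Naive comparison: any one-character difference breaks the link."""
    return a == b


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def fuzzy_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag two names as a likely match when similarity meets the threshold.

    The 0.9 threshold is an illustrative assumption, not a recommended value.
    """
    return name_similarity(a, b) >= threshold


if __name__ == "__main__":
    alert_name, database_name = "Tsarnaev", "Tsarnaeva"
    print("exact match:", exact_match(alert_name, database_name))
    print("similarity: ", round(name_similarity(alert_name, database_name), 3))
    print("fuzzy match:", fuzzy_match(alert_name, database_name))
```

A score like this is only a starting point: a threshold loose enough to catch a transliteration variant will also surface false positives, so in practice flagged pairs should be reviewed against other attributes rather than merged automatically.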

Some might look to data clusters as a balm for our data association problem, but traditional rule-based approaches to clustering do not scale; they become unwieldy because the rules need continual updates. This is where I would bring in more advanced mastering tools, such as Tamr, because artificial intelligence (AI) and machine learning (ML) can detect patterns and inconsistencies that human eyes easily overlook, significantly improving the accuracy and efficiency of data clustering. These capabilities prepare us for the future with quick onboarding of new sources and higher overall data quality.
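For readers who have not seen rule-based clustering up close, the sketch below shows the general shape of it in Python. Everything here is hypothetical: the records, the two rules, and the 0.9 threshold are invented for illustration, and this is not how Tamr or any particular product works. Records are linked whenever any rule fires, and linked records are merged into clusters with a small union-find.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical provider records; record 2 is missing its identifier.
RECORDS = [
    {"id": 0, "name": "Jon Smith", "npi": "1234567890"},
    {"id": 1, "name": "John Smith", "npi": "1234567890"},
    {"id": 2, "name": "John Smith", "npi": ""},
    {"id": 3, "name": "Maria Lopez", "npi": "9876543210"},
]


def same_npi(a, b):
    """Rule 1: identical, non-empty identifiers."""
    return bool(a["npi"]) and a["npi"] == b["npi"]


def similar_name(a, b, threshold=0.9):
    """Rule 2: names that are nearly identical (threshold is an assumption)."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold


RULES = [same_npi, similar_name]  # every new data quirk tends to add a rule


def cluster(records, rules):
    """Group record ids that are linked, directly or transitively, by any rule."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if any(rule(a, b) for rule in rules):
            parent[find(a["id"])] = find(b["id"])

    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(groups.values())


if __name__ == "__main__":
    print("identifier rule only:", cluster(RECORDS, [same_npi]))
    print("both rules:          ", cluster(RECORDS, RULES))
```

With the identifier rule alone, the record that lacks an NPI is stranded in its own cluster; adding the name rule pulls it in. The catch is that each new source brings new quirks (nicknames, transposed fields, changed addresses), and each quirk means another hand-written rule and another threshold to tune, which is the maintenance burden described above. The pairwise loop also compares every record with every other, which becomes costly as the data grows.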

 

---  

 

AI-enhanced data mastering truly shines in mission-critical applications, whether in tracking potential security threats or managing comprehensive healthcare records. It helps deliver accurate and timely information corroborated by multiple sources. In 2024, we continue to generate massive amounts of data and show no signs of stopping. Sophisticated, integrated technologies are no longer optional but necessary. All sectors can benefit from embracing AI and ML in their data mastering processes, from delivering effective care to patients in hospitals to keeping the public safe at marathons.