Knowledge Graphs Are Pretty Awesome Hammers, But Not Everything Is A Nail

May 13, 2024

Art Morales, Ph.D.

One of my favorite learnings/sayings from working as a computational biologist after my Ph.D. was that “When all you have is a hammer, everything looks like a nail.”  As a data scientist (before we were called that), I was always more interested in solving scientific problems than in how efficiently my code ran.  And I usually got away with it, because the problems were generally small enough that I could throw more computational power and memory at them.  For example, when I was writing sequence alignment algorithms using modified versions of the Needleman–Wunsch algorithm, I was forced to use C as a language, but I hated pointers with a passion… so I made everything a global variable (tsk tsk, I know).  It was not an issue because the sequences were not that long, and I could always add more swap space if needed…

I still remember when I was working as a computational biologist at one of the original genomics companies.  I developed an algorithm to compare the evolution of every protein in the yeast genome to find co-evolving pairs that would potentially indicate they were in the same pathway… it seems easy now, but it was early days.  I was comparing every protein against 85 other genomes and building a fingerprint of its evolution; the goal was then to do an all-vs-all comparison using Euclidean distances to find the nearest neighbors… in other words, doing a vector search by hand to build a graph (25 years ago!).
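For the curious, here is roughly what that “vector search by hand” looks like in modern terms.  This is a minimal Python/NumPy sketch, not the original code; the protein IDs, profile values, and neighbor count are made up purely for illustration.

```python
import numpy as np

# Hypothetical data: one "evolutionary fingerprint" per yeast protein,
# e.g. a similarity score against each of 85 reference genomes.
rng = np.random.default_rng(0)
proteins = [f"protein_{i:04d}" for i in range(200)]   # placeholder IDs
profiles = rng.random((len(proteins), 85))            # 200 x 85 fingerprint matrix

# All-vs-all Euclidean distances (the "vector search by hand").
diff = profiles[:, None, :] - profiles[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Link each protein to its nearest neighbors to form a candidate co-evolution graph.
k = 5
edges = set()
for i, name in enumerate(proteins):
    order = np.argsort(dist[i])
    for j in order[1 : k + 1]:                        # skip position 0, the protein itself
        edges.add(tuple(sorted((name, proteins[j]))))

print(f"{len(edges)} candidate co-evolution edges")
```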

Anyways, as a computational biologist, my language of choice was Perl… Python was not really an option yet.  For those of you old enough, you may remember that Perl is great at strings but stinks at math.  But it was my hammer.  As they said in the Frank’s RedHot commercial, “I put that @#$% on everything.”  I developed the algorithm with test data, and then the time came to run the first full dataset… I was excited about it and had a shiny Silicon Graphics Octane on my desk (I was lucky).  I kicked off the process and went about my other work… checking up on it daily… and it was running.  And running.  And running.

Sadly, we had a power outage about a month in… (UPS? What is that?).  As I reviewed the logs, I could tell I was about 9% done, giving me an expected run time of roughly 11 months… My, oh my.

As luck would have it, one of my cubemates was an awesome C/C++ developer, and he introduced me to the concept of “inline C.”  The idea was simple: convert the math routines to C (something C is awesome at) and call them from the Perl code.  The hope was that it would accelerate things… and that, it did.  We also took the opportunity to parallelize it, because “why not?” (this was one of those embarrassingly parallel problems, and we could take advantage of it since we were modifying the algorithms anyway).

Long story short, the new code finished in about 11 hours across 10 or so CPUs, and my research was enabled… I spent the Christmas break using LSF on every Alpha CPU I could find in the company (way before IBM bought Platform Computing, the makers of LSF) and was able to build and optimize my evolutionary graph and show cool results that January.

Those examples were foundational in my desire to architect use-case-driven solutions rather than technical ones.  Just because you could do something did not mean you should.  In my next job, the request came: “Let’s model genomic data in our shiny new RDBMS.”  Answer: NO! YOU SHOULDN’T!  (We had a tough fight, since we were building a new data warehouse and people just wanted *everything* in Oracle for simplicity, but it was a really bad idea.)

Well, it is 2024, and here we are again.  I am writing this blog while at the Knowledge Graph Conference.  Knowledge Graphs are amazing.  They enable queries and insights that are hard to achieve using normal SQL.  They are and will continue to be a major enabler for GenAI and will only improve with time.  But they are not always the full solution (or even “the” solution).   

Graphs also help you integrate data products better.  Just ask our friends at data.world.  Juan Sequeda and friends have built an amazing data marketplace that uses a knowledge graph as a backend to model data products, and I’m excited about how this will enable the rapid integration of products for user-driven use cases at the consumption layer, rather than relying on data producers and/or IT to prebuild those products by hand.

In theory, everything can be modeled in a graph, either as an entity, a relationship, or a property.  It is the way we think.  When modeling a graph, there is no real need for the concept of foreign keys (although we still need unique IDs), and we can model the logical relationships right into the schema.
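As a toy illustration of that point (a sketch using the networkx library, with made-up gene and pathway identifiers), the relationship lives directly on the edge; there is no foreign-key column to maintain, just unique IDs and explicit, named connections:

```python
import networkx as nx

g = nx.MultiDiGraph()

# Entities become nodes, with their attributes stored as properties.
g.add_node("gene:CDC28", type="Gene", symbol="CDC28")
g.add_node("pathway:cell_cycle", type="Pathway", name="Cell cycle")

# The relationship is modeled directly as an edge -- no foreign-key column,
# just a unique ID on each node and an explicit, named connection.
g.add_edge("gene:CDC28", "pathway:cell_cycle", key="PARTICIPATES_IN", evidence="curated")

# Traversal reads like the question itself: which pathways does CDC28 participate in?
print([t for _, t, k in g.out_edges("gene:CDC28", keys=True) if k == "PARTICIPATES_IN"])
```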

The problem is that graphs can grow exponentially (no XponentL pun intended). Depending on the graph type and semantic constraints, queries can also get complicated and take a while to write, debug, and execute (this is true for RDF graphs but also an issue with LPG graphs once your graph grows). 

Ask a devout graph architect and they will build you the biggest, “bestest” graph that models everything.  The better ones will talk about subgraphs and performance considerations, but they will still likely want to model everything as a graph.  When asked “why?” they’ll say that one can traverse the graph and get any answer they need, since everything is modeled.  But the reality is that no one needs that much freedom… just ask the guys who built the SGI 3D file-system navigator that was featured in Jurassic Park.  It was a cool graph (awesome for the time), but no one ever used it.  It was impractical.

The same goes for an all-encompassing graph.  It is cool.  It is awesome.  And yes, data should be modeled in a way that it can be easily integrated with other data products and where relationships are explicitly defined, but no one needs the whole thing at once.  Most use cases are going to focus on one part of the graph.  Related or follow-up questions can be answered through different queries that may bring up other parts of the graph.  This is the same approach that lets new areas of a video game load as the player gets close to them.

To be fair, part of the problem lies in the name.  The term “graph” is accurate in describing the architecture, but the word can have multiple meanings.  To architects, it means a network, but to others (lay people and “Luddites”), it implies a visual.  This duality of meaning causes confusion because the consumption pattern gets associated with the architecture, and that places artificial constraints on both the design and the use cases.  Seeing your data connections visually can be useful, but as the number of connections grows, so do the complexity of the query and the resources needed to generate the visuals.  Moreover, the actual use case often just needs data from the graph, not the visual, especially for data science applications.  But when talking to end users, mentioning the word graph instantly creates an often artificial need for a visual.

More importantly, graphs are awesome at connections, but they carry overhead due to their breadth and flexibility.  As an example: properties are just lookups on an entity, right?  And what is awesome at looking up properties given a unique ID?  You guessed it: relational databases.  Why forget Postgres, or even Oracle, when they were purpose-built for returning data given a key?  That is what SQL is for!
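To make that concrete, here is a minimal sketch of a keyed property lookup, using an in-memory SQLite table with a hypothetical schema and sample row.  The point is simply that this pattern is bread-and-butter work for an indexed relational store:

```python
import sqlite3

# A tiny, hypothetical property store keyed by entity ID.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entity_properties (entity_id TEXT PRIMARY KEY, name TEXT, description TEXT)")
db.execute("INSERT INTO entity_properties VALUES (?, ?, ?)",
           ("gene:CDC28", "CDC28", "Cyclin-dependent kinase"))

# Property lookup given a unique ID: exactly what an RDBMS index is built for.
row = db.execute("SELECT name, description FROM entity_properties WHERE entity_id = ?",
                 ("gene:CDC28",)).fetchone()
print(row)
```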

To be clear, I am not advocating against using graphs.  As I said, they are awesome.  However, it is important for data architecture to be driven by use cases.  Just because you can, it does not mean you should model everything as a graph.  The danger is in creating a technical solution without a real use case.  Solutions need to be architected and informed by the set of use cases.  Domain SMEs (subject matter experts) need to be involved in the design.  Logic cannot be thrown out the window because of shiny, awesome toys.  In some cases, a hybrid solution, where a foundational graph is queried first and the properties are then obtained from an RDBMS store and served as one unit via an API, is a much better answer.  To be fair, quite often a well-defined query on a comprehensive graph, visualized as a graph in its own right, is the right choice.  It is common for the actual use case to only need part of the graph, and limiting the data landscape to the relevant area makes both the query and the visual much more manageable.
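Here is a minimal sketch of that hybrid pattern, under the same illustrative assumptions as above: a small networkx graph answers the connection question, an in-memory SQLite table holds the properties, and a helper function stitches the two into the single payload an API might return.  The IDs, schema, and function name are hypothetical.

```python
import sqlite3
import networkx as nx

# Graph side: relationships only (a hypothetical mini-graph).
g = nx.Graph()
g.add_edge("gene:CDC28", "gene:CLN2", relation="CO_EVOLVES_WITH")

# Relational side: properties keyed by the same IDs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE props (entity_id TEXT PRIMARY KEY, description TEXT)")
db.executemany("INSERT INTO props VALUES (?, ?)", [
    ("gene:CDC28", "Cyclin-dependent kinase"),
    ("gene:CLN2", "G1 cyclin"),
])

def neighbors_with_properties(entity_id):
    """Query the graph for connections, then hydrate each hit from the RDBMS."""
    payload = []
    for neighbor in g.neighbors(entity_id):
        desc = db.execute("SELECT description FROM props WHERE entity_id = ?",
                          (neighbor,)).fetchone()
        payload.append({"id": neighbor, "description": desc[0] if desc else None})
    return payload

print(neighbors_with_properties("gene:CDC28"))  # the "one unit" an API would serve
```

The design choice being illustrated is the split itself: the graph stays lean and focused on relationships, while the relational store does what it is best at, returning properties for a key.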

Nevertheless, the point still stands: just because graphs are powerful does not mean they are always the sole solution, or even the best one.  But when they are, their power can approach Mjölnir level (Thor’s hammer).

Do you use graphs? Are you interested in using them? Let’s talk! Contact me at art.morales@xponentl.ai