Knowledge Graphs: Question Answering

Dan SelmanDistinguished Engineer, Smart Agreements

•Published Jul 5, 2024

Summary•6 min read

See how to combine structured and unstructured semantic queries and to use large language models to orchestrate question answering over a knowledge graph.

TLDR; Watch the video
Business problem
Research context
Architecture
Movie Graph demo
Summary
Additional resources

TLDR; Watch the video
Business problem
Research context
Architecture
Movie Graph demo
Summary
Additional resources

Over the past few weeks I’ve been researching and building a framework that combines the power of large language models (LLMs) for text parsing and transformation with the precision of structured data queries over knowledge graphs for explainable data retrieval.

In this third article of the series (one, two) I will show you how to combine structured and unstructured semantic queries, and to use LLMs to orchestrate question answering over a knowledge graph.

TLDR; Watch the video

Business problem

Enterprises contain vast amounts of diverse data. Much of it is structured, held in relational and non-relational databases, while there is often a largely untapped pile of unstructured data: emails, call logs, videos, customer support chat, images, and more that is typically correlated with some structured data (such as a customer record). The data is fragmented and scattered across systems of record, and building anything close to the infamous “360 degree view of the customer” is a very daunting proposition for most enterprises.

A Knowledge Graph is a potential integration fabric across those diverse data sources, allowing enterprise data (both structured and unstructured) to be aggregated in a manner that makes it accessible to upstream reporting, analytics, and, yes, chat-with-your data and Retrieval Augmented Generation (RAG) applications.

The next generation of DBMS products will support vector embedding of text, image and video data along with strong support for queries over structured data. We see advancements from all DBMS vendors along these dimensions. One of the leaders in this space is Neo4J—particularly suitable for this task, as it supports dynamic schema for nodes and relationships (edges) within the graph as well as full-text indexing and vector indexing of text content.

Research context

LLM, RAG and knowledge graphs are white hot research topics right now. Microsoft is promoting GraphRAG (open source soon?), while Neo4J has been releasing some powerful capabilities for building knowledge graphs from documents along with RAG and/or cypher generation.

One of my long-standing research interests is data models/ontologies, and via the Concerto Graph project (Apache 2 license) I aim to bring ease-of-use and domain adaptability to the debate.

Architecture

The high-level architecture of Concerto Graph applications is illustrated in the diagram below, with the domain model/ontology playing a key role in parameterising the framework for a given business domain.

Starting from the left, the components are:

User interface that allows the user to submit natural language queries. The user interface supports submitting explicit full-text search queries, semantic queries, and queries in natural language that are automatically converted to graph queries. The user interface can display the details for how a question was answered by displaying the graph query that was generated from the natural language.
LLM Conversation Context maintains a stateful chat completion context with an LLM. This allows the user to ask follow-up questions or to request additional details and refinements.
LLM Tools are automatically generated from the domain model, providing guidance to the LLM that, should they require answers to questions about a business domain, tools are available that can be called.
Text to Graph Query API is called by the LLM Tool when a natural language query about the domain model is submitted.
Text to Graph Query uses a secondary stateless LLM and few-shot prompting to convert the natural language query to a structured knowledge graph query. The LLM prompt is parameterised using the domain model so that graph queries are correctly expressed in terms of the domain model.
Run Query executes the queries against the knowledge graph, handles errors, and returns the results back up the call chain.
Knowledge Graph exposes APIs to run queries that can combine full-text search, vector similarity search, and structured queries.

Movie Graph demo

Movie Graph is a demo application that is built on the Concerto Graph framework. In just a couple hundred lines of code, it defines a domain model for IMDB data, loads sample data, and creates an interactive user interface based on the architecture above.

Once the Knowledge Graph is populated with Movies and People, the user can ask questions about movies, movie ratings, plot summaries, and the relationships between movies and people. For testing I populate the graph with 4000+ movies, some plot summaries and 60+ people related to those movies (approximately 5K nodes and 13K edges).

Let’s start with a simple query to illustrate the data flow:

Question: how many movies are there?

   Generated Cypher: MATCH (m:Movie)
   RETURN count(m) as totalMovies

Answer: There are a total of 4,598 movies in the database.

The user posed the simple question “how many movies are there?” First the LLM determined that this question would be best answered by the LLM Tool created by the Movies domain model. It then uses the tool to convert the question into the Cypher query MATCH (m:Movie) RETURN count(m) as totalMovie, executed the query and returned a JSON query result object back to the LLM. The LLM then converted that JSON object to the text “There are a total of 4,598 movies in the database.” for display to the end user.

If the user then follows up with the query:

Question: what is the capital of France?
Answer: The capital of France is Paris.

We have seen that the LLM Tool was not invoked, as this is not a question that the knowledge graph–backed domain model can help answer.

However, if the user next asks:

Question: what movies are there that are set in that place?

We then see the following query generated:

MATCH (l:Movie)
    CALL db.index.vector.queryNodes('movie_summary', 10, <EMBEDDINGS> )
    YIELD node AS similar, score
    MATCH (similar)
    RETURN similar.identifier as identifier, similar.summary as content, score limit 3 with embeddings for: 'movies set in Paris'

With the response being:

Answer: Here are some movies set in Paris:

1. **If Looks Could Kill (1991)** - Michael Corben is mistaken for a CIA agent 
in Paris and gets involved in a thrilling spy adventure.

2. **Timecode (2000)** - A film set around a film production company office, 
following several interweaving plot threads.

3. **Rush Hour 3 (2007)** - Chief Inspector Lee and Detective James Carter 
team up in Paris to fight a criminal organization.

These movies feature scenes and storylines set in the scenic and vibrant 
city of Paris.

The query generated creates a vector embedding for the string “movies set in Paris” and then queries the knowledge graph for three movie summaries that are conceptually similar (close to) that vector. The three movies’ titles in the result set are then displayed to the user in the response.

If we then follow up with the question:

what people are known for appearing in those movies?

The response is:

1. **If Looks Could Kill (1991)**
   - Richard Grieco
   - Robin Bartlett
   - David McIlwraith
   - Linda Hunt
   - Roger Rees
   - Gabrielle Anwar
   - Roger Daltrey

2. **Timecode (2000)**
   - Salma Hayek
   - Stellan Skarsgård
   - Jeanne Tripplehorn
   - Saffron Burrows
   - Glenne Headly
   - Xander Berkeley
   - Golden Brooks
   - Holly Hunter
   - Kyle MacLachlan

3. **Rush Hour 3 (2007)**
   - Jackie Chan
   - Chris Tucker
   - Hiroyuki Sanada
   - Max von Sydow
   - Yvan Attal
   - Tzi Ma
   - Noémie Lenoir

These actors and actresses played significant roles in the respective films.

However, no Cypher query was generated, meaning that the LLM used its training data to answer this question, not the knowledge graph. The answer is therefore susceptible to hallucination and may not incorporate the latest facts in the knowledge graph.

If we then follow up with:

Question: use the database please

The response is:

Generated Cypher: MATCH (m:Movie {title: 'If Looks Could Kill'})<-[:RELATED_TO]-(a:Person)
RETURN a.name as actorName

Generated Cypher: MATCH (m:Movie {title: 'Timecode'})<-[:RELATED_TO]-(a:Person)
RETURN a.identifier, a.name

Generated Cypher: MATCH (m:Movie {title: 'Rush Hour 3'})<-[:RELATED_TO]-(a:Person)
RETURN a.name

Here are some notable actors from the movies set in Paris:

1. **If Looks Could Kill (1991)**
   - Brad Dourif
   - Linda Hunt
  
2. **Timecode (2000)**
   - Stellan Skarsgård

3. **Rush Hour 3 (2007)**
   - Max von Sydow

The response shows that there are actually only four people in the Knowledge Graph that are related to the three movies in question.

Summary

The combination of LLM, LLM tools, natural language to graph query language transformation and a knowledge graph is a powerful combination! GPT-4o in auto mode is adept at determining when the knowledge fraph can be of assistance in answering questions, and it is very impressive at transforming natural language to graph queries—even when the domain model is provided as a parameter to the prompt. The ability to combine structured and unstructured data within a single query provides the LLM with an extremely powerful mechanism to interrogate the knowledge graph.

In the future I plan to test how well the framework generalises to other domain models as well as building a demo web interface for interacting with the system.

Additional resources

Dan SelmanDistinguished Engineer, Smart Agreements

Dan Selman has over 25 years experience in IT, creating software products for BEA Systems, ILOG, IBM, Docusign and more. He was a Co-founder and CTO of Clause, Inc., acquired by Docusign in June 2021.