Knowledge graphs: Chat with your data

Dan SelmanDistinguished Engineer, Smart Agreements

•Published Jun 5, 2024

Summary•3 min read

See how you can use the “chat with your data” paradigm to convert natural language to graph queries, including using semantics search over vector embeddings.

This is a continuation of my previous article on creating a Knowledge Graph in 100 lines of code. In this article I will show you how you can use the “chat with your data” paradigm to convert natural language to graph queries, including using semantics search over vector embeddings.

There are various approaches to “chatting with your data”, from pure structured query generation to pure Retrieval Augmented Generation (RAG). Here I will present a hybrid approach, powered by the Concerto Graph framework, which makes the approach largely declarative and easy to put into practice.

Before we proceed, you will need a NEO4J Aura account (free accounts are available) as well as an Open AI API Key.

We start by defining our data model/ontology. The concepts in the data model represent the nodes and relationships within our knowledge graph. For the purposes of a demo, we define a simple movie database data model (full code is on GitHub):

concept Person extends GraphNode {
  @label("ACTED_IN")
  --> Movie[] actedIn optional
  @label("DIRECTED")
  --> Movie[] directed optional
}

concept Actor extends Person {
}

concept Director extends Person {
}

concept User extends Person {
  o ContactDetails contactDetails
  o AddressBook addressBook
  @label("RATED")
  --> Movie[] ratedMovies optional
}

concept Genre extends GraphNode {
}

concept Movie extends GraphNode {
  o Double[] embedding optional
  @vector_index("embedding", 1536, "COSINE")
  @fulltext_index
  o String summary optional
  @label("IN_GENRE")
  --> Genre[] genres optional
}

We then use the GraphModel API to connect to the database and create indexes for all the concepts in the data model:

const options:GraphModelOptions = {
    NEO4J_USER: process.env.NEO4J_USER,
    NEO4J_PASS: process.env.NEO4J_PASS,
    NEO4J_URL: process.env.NEO4J_URL,
    logger: console,
    logQueries: false,
    embeddingFunction: process.env.OPENAI_API_KEY ? getOpenAiEmbedding : undefined
  }
  const graphModel = new GraphModel([MODEL], options);  
  await graphModel.connect();
  await graphModel.deleteGraph();
  await graphModel.dropIndexes();
  await graphModel.createConstraints();
  await graphModel.createVectorIndexes();
  await graphModel.createFullTextIndexes();

We can then use the mergeNode and mergeRelationships methods to update or insert nodes/relationships in the Knowledge Graph. The properties of the nodes and relationships are validated against the data model, ensuring that only well-structured data can be added to the graph:

await graphModel.mergeNode(transaction, 'Movie', {
   identifier: 'Fear and Loathing in Las Vegas', 
   summary: 'Duke, under the influence of mescaline, complains of a swarm of 
   giant bats, and inventories their drug stash. They pick up a young 
   hitchhiker and explain their mission: Duke has been assigned by a 
   magazine to cover the Mint 400 motorcycle race in Las Vegas. They bought 
   excessive drugs for the trip, and rented a red Chevrolet Impala 
   convertible.'} 
);

Once the graph is populated, you can run a full-text search over the Movie nodes. Movies are automatically indexed for full-text search because they have the property summary, which has the @fulltext_index decorator.

 const fullTextSearch = 'bats';
  console.log(`Full text search for movies with: '${fullTextSearch}'`);
  const fullTextResults = await graphModel.fullTextQuery('Movie', fullTextSearch, 2);

Returns the single result that contains the string 'bats':

[
  {
    summary: 'Duke, under the influence of mescaline, complains of a swarm of giant bats, and inventories their drug stash. They pick up a young hitchhiker and explain their mission: Duke has been assigned by a magazine to cover the Mint 400 motorcycle race in Las Vegas. They bought excessive drugs for the trip, and rented a red Chevrolet Impala convertible.',
    score: 0.4010826349258423,
    identifier: 'Fear and Loathing in Las Vegas'
  }
]

Next, let’s use a conceptual (semantic) search to find three movies that are about a given concept, rather than looking for specific text in the summary:

const search = 'working in a boring job and looking for love.';
    console.log(`Searching for movies related to: '${search}'`);
    const results = await graphModel.similarityQuery('Movie', 'summary', search, 3);
    console.log(results);

This returns three results, with the movie that is the most similar to the concept “working in a boring job and looking for love” being “Brazil”:

[
  {
    identifier: 'Brazil',
    content: 'The film centres on Sam Lowry, a low-ranking bureaucrat trying 
    to find a woman who appears in his dreams while he is working in a mind-
    numbing job and living in a small apartment, set in a dystopian world 
    in which there is an over-reliance on poorly maintained (and rather 
    whimsical) machines', score: 0.7013247609138489
  },
  {
    identifier: 'Fear and Loathing in Las Vegas',
    content: 'Duke, under the influence of mescaline, complains of a swarm of 
    giant bats, and inventories their drug stash. They pick up a young 
    hitchhiker and explain their mission: Duke has been assigned by a magazine 
    to cover the Mint 400 motorcycle race in Las Vegas. They bought excessive 
    drugs for the trip, and rented a red Chevrolet Impala convertible.', 
    score: 0.5629135966300964
  },
  {
    identifier: 'The Man Who Killed Don Quixote',
    content: `Instead of a literal adaptation, Gilliam's film was about "an 
    old, retired, and slightly kooky nobleman named Alonso Quixano".`,
    score: 0.5493587255477905
  }
]

Next, let’s use the ability to convert a natural language query into a structured Neo4J Cypher query. We use the query “Which director has directed both Johnny Depp and Jonathan Pryce, but not necessarily in the same movie?” which gets correctly converted to the Cypher query shown below, and the result “Terry Gilliam” is displayed. Bonus points if you got this far and know why the result appears twice!

const chat = 'Which director has directed both Johnny Depp and Jonathan Pryce, 
but not necessarily in the same movie?';
const chatResult = await graphModel.chatWithData(chat);
    
Chat with data: Which director has directed both Johnny Depp and Jonathan Pryce, 
but not necessarily in the same movie?
Converted to Cypher query: MATCH (d:Director)-[:DIRECTED]->(m:Movie)(m2:Movie)

<p>Finally, our coup de grâce! Let’s create a natural language query that has to exploit <strong>both</strong> the structured data in the graph and the unstructured text data that we’ve indexed using vector embeddings:</p>

<code data-uuid="syynPXt3">const search = 'working in a boring job and looking for love.';
const chat2 = `Which director has directed a movie that is about the concepts of 
${search}? Return a single movie.`;
const chatResult2 = await graphModel.chatWithData(chat2);

Calling tool: get_embeddings
Tool replacing embeddings: MATCH (d:Director)-[:DIRECTED]->(m:Movie)
CALL db.index.vector.queryNodes('movie_summary', 1, <embeddings>)
YIELD node AS similar, score
MATCH (similar)

<p>In the output trace we can see a couple of steps:</p>

<ol><li>Open AI determining that, to satisfy this query, it needs vector embeddings for the string <em>‘working in a boring job and looking for love.’</em> via an Open AI tool configuration.</li>
	<li>Generation of a Cypher query with a placeholder set of embeddings <code><embeddings></embeddings></code>;. If you are curious to see how this works, you can find the prompt we use for Open AI on <a href="https://github.com/accordproject/lab-concerto-graph/blob/0bf4f8dedc6aea04543181f9dc179312211e503e/src/graphmodel.ts#L70">GitHub</a>. It combines the Concerto domain model with the natural language query and some information about index naming to create (mostly) valid Cypher queries.</li>
	<li>Generation of the embeddings using the Open AI embedding model.</li>
	<li>Execution of the Cypher query, which correctly returns Terry Gilliam as the only (most likely) director of a movie about the concepts of <em>‘working in a boring job and looking for love.’</em></li>
</ol><p>Voilà! We are <strong>chatting with our data</strong> — and we’ve not written any application-specific code; we’ve just defined our data model and called some framework APIs. For a full-fledged movie database example, please refer to my <a href="https://github.com/dselman/movie-graph">movie-graph demo</a>.</p>

<p>To give you a hint of how powerful this can be, take a look at this transcript:</p>

<code data-uuid="vetnZMvA">Which actor famously starred in a film conceptually about journalism, hitchhiking and drugs in Las Vegas?
Calling tool: get_embeddings
Converting query with embeddings to Cypher...
Tool replacing embeddings: MATCH (m:Movie)
    CALL db.index.vector.queryNodes('movie_summary', 3, <embeddings> )
    YIELD node AS similar, score
    MATCH (similar)

<p>Have fun!</p></embeddings></code></embeddings></code>

Dan SelmanDistinguished Engineer, Smart Agreements

Dan Selman has over 25 years experience in IT, creating software products for BEA Systems, ILOG, IBM, Docusign and more. He was a Co-founder and CTO of Clause, Inc., acquired by Docusign in June 2021.