Neo4j LLM Knowledge Graph Builder: How to Create Knowledge Graphs for RAG
Knowledge graphs have become essential tools for data management and analysis in 2023. These powerful data structures offer a way to connect and visualize complex information. Neo4j, arguably the leading graph database platform, has recently introduced the LLM Knowledge Graph Builder, designed to simplify the creation of knowledge graphs for Retrieval-Augmented Generation (RAG) and to help you better understand knowledge graphs.
In this starter guide, we'll walk you through the step-by-step process of building knowledge graphs using Neo4j's innovative tool for knowledge graph creation and exploration. Whether you're new to graph databases or an experienced RAG professional, you'll learn how to:
- Set up Neo4j LLM Knowledge Graph Builder
- Create your first knowledge graph for RAG
- Visualize the created graph and find insights
Our goal is to show you how to effectively use Neo4j's tool to build knowledge graphs that can improve your data analysis and information retrieval capabilities. We'll start with the basics and work our way through the key features.
What are knowledge graphs in the context of RAG?
Knowledge graphs are structured representations of information that show relationships between different entities. They consist of nodes (representing concepts or entities) and edges (representing relationships between these entities). In essence, knowledge graphs organize data in a way that mimics how humans understand and connect information.
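To make the node-and-edge idea concrete, here is a minimal sketch in plain Python. It stores a graph as (subject, relationship, object) triples and looks up the edges leaving a node. The entity and relationship names are illustrative assumptions, not output of any particular tool.

```python
from collections import defaultdict

# A tiny knowledge graph as (subject, relationship, object) triples.
# Entities and relationship names are illustrative only.
triples = [
    ("Paris", "CAPITAL_OF", "France"),
    ("Eiffel Tower", "LOCATED_IN", "Paris"),
    ("France", "PART_OF", "Europe"),
]

# Index the outgoing edges of every node so neighbours can be looked up quickly.
outgoing = defaultdict(list)
for subject, relationship, obj in triples:
    outgoing[subject].append((relationship, obj))


def neighbours(node):
    """Return the (relationship, target) pairs leaving `node`."""
    return outgoing.get(node, [])


print(neighbours("Paris"))
```

A graph database stores essentially the same information, but adds indexing, persistence, and a query language on top.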
In the context of Retrieval-Augmented Generation (RAG), knowledge graphs serve as an advanced information storage and retrieval system. RAG is a technique that complements language models by providing them with relevant external knowledge during the generation process. Knowledge graphs complement more traditional, purely vector-based RAG by:
- Organizing Information: They structure data in a way that's easily queryable and navigable.
- Establishing Connections: They explicitly represent relationships between different pieces of information.
- Providing Context: They offer a broader view of how different facts or concepts relate to each other.
Simple knowledge graph
They promise to increase the quality of RAG systems:
- Improved Accuracy: By providing structured, relevant information, knowledge graphs help RAG systems generate more accurate and contextually appropriate responses.
- Enhanced Reasoning: The relational structure of knowledge graphs allows RAG systems to perform more complex reasoning tasks, connecting disparate pieces of information.
- Flexibility: They can be easily updated and expanded, allowing RAG systems to incorporate new information over time.
- Explainability: The explicit relationships in knowledge graphs make it easier to trace how a RAG system arrived at a particular output, enhancing transparency and interpretability.
- Reduced Hallucination: By grounding language models in a structured knowledge base, knowledge graphs can help reduce the likelihood of RAG systems generating false or inconsistent information.
While we understand knowledge graphs by now, at least on a high level, we need to further distinguish between two types of knowledge graphs:
- Lexical Graphs: These graphs focus on the connections between the structural elements of your data. Think of things like document -> chapter -> section -> paragraph -> chunk.
- Semantic Graphs: They focus on the "meaning" of your documents. For example, they might connect "Paris" with "France" and "Eiffel Tower", or information about "onboarding" with "employee" and "training".
Most of the time, you want to have both graphs integrated in your RAG system, because you need to find relationships both in terms of document structure and in terms of content meaning.
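The interplay of the two graph types can be sketched in a few lines of Python. The example below is a simplified illustration, not the data model of any specific tool: all node identifiers and relationship names (HAS_SECTION, HAS_CHUNK, MENTIONS, and so on) are made up for this sketch.

```python
# Lexical graph: structural containment (document -> section -> chunk).
lexical_edges = [
    ("doc:handbook", "HAS_SECTION", "sec:onboarding"),
    ("sec:onboarding", "HAS_CHUNK", "chunk:1"),
    ("sec:onboarding", "HAS_CHUNK", "chunk:2"),
]

# Semantic graph: entities and how they relate in meaning.
semantic_edges = [
    ("entity:onboarding", "INVOLVES", "entity:employee"),
    ("entity:onboarding", "REQUIRES", "entity:training"),
]

# Bridge edges tie both graphs together: which chunk mentions which entity.
mention_edges = [
    ("chunk:1", "MENTIONS", "entity:onboarding"),
    ("chunk:2", "MENTIONS", "entity:training"),
]


def chunks_for_entity(entity):
    """Find chunks that mention the entity or an entity directly related to it."""
    related = {entity}
    for source, _, target in semantic_edges:
        if source == entity:
            related.add(target)
        elif target == entity:
            related.add(source)
    return sorted(chunk for chunk, _, target in mention_edges if target in related)


def section_of(chunk):
    """Walk the lexical graph upwards from a chunk to its section."""
    for source, relationship, target in lexical_edges:
        if relationship == "HAS_CHUNK" and target == chunk:
            return source
    return None


print(chunks_for_entity("entity:employee"))  # semantic hop, then back to text
print(section_of("chunk:1"))                 # structural context of that text
```

The semantic graph answers "what is related to this concept?", while the lexical graph answers "where in my documents does that text live?".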
Lexical and Semantic graph
Vector-based RAG vs. Knowledge Graph-based RAG
I want to highlight that vector-based and knowledge graph-based RAG are not rivals. They are complementary, and you might use both methods in your application.
However, to better understand knowledge graphs, it makes sense to list both approaches side by side (as most readers might be more familiar with the vector-based approach).
- Data Representation:
  - Vector-based RAG: Represents documents or chunks of text as high-dimensional vectors in an embedding space.
  - Knowledge Graph RAG: Represents information as interconnected entities and relationships.
- Information Retrieval:
  - Vector-based RAG: Uses similarity measures (like cosine similarity) to find relevant vectors.
  - Knowledge Graph RAG: Uses graph traversal algorithms to find relevant information through relationships.
- Context Understanding:
  - Vector-based RAG: Implicit context based on vector proximity.
  - Knowledge Graph RAG: Explicit context through defined relationships between entities.
- Handling of Structured Data:
  - Vector-based RAG: Less effective with highly structured data.
  - Knowledge Graph RAG: Excels at representing and querying structured data. Can represent complex relationships and also link structured and unstructured data.
- Scalability:
  - Vector-based RAG: Can handle large amounts of unstructured text efficiently.
  - Knowledge Graph RAG: Efficient for querying complex relationships but can be more resource-intensive to build and maintain.
- Updating Information:
  - Vector-based RAG: Adding new chunks of text is straightforward. Just add the new embeddings.
  - Knowledge Graph RAG: Can be updated incrementally by adding new nodes and edges. Changing the schema, however, is more complex.
- Explainability:
  - Vector-based RAG: Less transparent, as relationships are implicit in vector space.
  - Knowledge Graph RAG: More transparent, with explicit relationships that can be traced.
- Handling Ambiguity:
  - Vector-based RAG: May struggle with disambiguating similar concepts.
  - Knowledge Graph RAG: Can explicitly represent different meanings or contexts for similar terms.
- Resource Requirements:
  - Vector-based RAG: Generally requires less upfront work to implement.
  - Knowledge Graph RAG: Often requires more initial effort to build and structure the knowledge base.
- Flexibility:
  - Vector-based RAG: More flexible with unstructured or previously unseen text.
  - Knowledge Graph RAG: More rigid but powerful for domains with well-defined relationships.
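The two retrieval styles from the comparison above can be sketched side by side. This is a toy illustration under simplifying assumptions: the three-dimensional "embeddings" and the small adjacency list are hand-written, whereas a real system would use an embedding model and a graph database.

```python
import math
from collections import deque


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_retrieve(query_vec, chunk_vecs, top_k=1):
    """Vector-based retrieval: rank chunks by similarity to the query vector."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:top_k]]


def graph_retrieve(start, edges, max_hops=2):
    """Graph-based retrieval: breadth-first traversal up to `max_hops` away."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                queue.append((neighbour, depth + 1))
    return found


# Hand-written toy data (assumptions for illustration only).
chunk_vecs = {"chunk:paris": [0.9, 0.1, 0.0], "chunk:hr": [0.0, 0.2, 0.9]}
edges = {"Eiffel Tower": ["Paris"], "Paris": ["France"], "France": ["Europe"]}

print(vector_retrieve([1.0, 0.0, 0.0], chunk_vecs))
print(graph_retrieve("Eiffel Tower", edges, max_hops=2))
```

The vector search returns whatever is closest in embedding space, while the traversal returns exactly what is connected within the hop limit. That difference is why the two approaches complement each other.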
What is Neo4j?
Neo4j is one of the most widely used graph database management systems, and its Community Edition is open source. It's designed to store, manage, and query highly connected data in an efficient manner. Unlike traditional relational databases, Neo4j uses a graph structure for semantic queries, making it particularly interesting for applications that involve complex relationships and interconnected data, with knowledge graphs being a prime example.
Key features of Neo4j:
- Native Graph Storage: Neo4j stores data in nodes and relationships, mirroring real-world connections more intuitively than table-based storage.
- Cypher Query Language: Neo4j uses Cypher, a declarative graph query language that allows for efficient querying of graph data.
- ACID Compliance: Neo4j ensures data integrity through ACID (Atomicity, Consistency, Isolation, Durability) transactions.
- Scalability: It offers horizontal scalability through its Causal Clustering architecture, allowing for read and write scaling.
- Performance: Neo4j is optimized for traversing relationships, making it significantly faster than relational databases for certain types of queries, especially those involving complex relationships.
- Flexibility: It allows for easy addition of new nodes, relationships, and properties without disrupting existing queries.
- Visualization: Neo4j provides built-in tools for visualizing graph data, making it easier to understand complex relationships.
In general, the Neo4j team provides a multitude of great tools for working with LLMs.
What is the Neo4j LLM Knowledge Graph Builder?
The Neo4j LLM Knowledge Graph Builder is an application designed to transform unstructured text into a structured knowledge graph. At its core, this tool processes a variety of input formats, including PDFs, documents, web pages, and even YouTube video transcripts, to generate a comprehensive graph representation stored in a Neo4j database.
Neo4j LLM Knowledge Graph Builder
The Knowledge Graph Builder's processing pipeline provides several features:
- Input Processing: The system can ingest multiple document types using LangChain loaders, including PDFs, web pages, and YouTube video transcripts.
- Text Chunking: After ingestion, the content is divided into manageable chunks. These chunks become the foundational nodes in the graph structure, linked to their source documents and to each other.
- Embedding Generation: The system computes embeddings for each chunk, storing them within the chunk nodes and in a vector index.
- Entity and Relationship Extraction: Using large language models such as those from OpenAI, Gemini, or Llama 3, the system extracts entities and relationships from the text. This process uses modules like llm-graph-transformer or diffbot-graph-transformer.
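To illustrate the chunking step, here is a minimal sketch of how text could be split into overlapping chunks and turned into a small lexical graph. It is not the Knowledge Graph Builder's actual implementation: the fixed-size character chunking, the node layout, and the relationship names PART_OF and NEXT_CHUNK are assumptions made for this example.

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into fixed-size character chunks that overlap slightly."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def build_lexical_graph(doc_id, chunks):
    """Create document/chunk nodes plus PART_OF and NEXT_CHUNK edges."""
    nodes = [{"id": doc_id, "label": "Document"}]
    edges = []
    previous = None
    for position, text in enumerate(chunks):
        chunk_id = f"{doc_id}#chunk-{position}"
        nodes.append({"id": chunk_id, "label": "Chunk", "text": text})
        edges.append((chunk_id, "PART_OF", doc_id))          # chunk -> source document
        if previous is not None:
            edges.append((previous, "NEXT_CHUNK", chunk_id))  # reading order
        previous = chunk_id
    return nodes, edges


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
nodes, edges = build_lexical_graph("doc", chunks)
print(chunks)
print(edges)
```

Linking each chunk both to its document and to its successor is what lets a retriever later pull in neighbouring text for additional context.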
Shoutout to LangChain for providing the elemental tools to build such a system.
The graph construction phase results in two main structures: a lexical graph of documents and chunks with embeddings, and an entity graph containing extracted entities and their relationships. To enhance connectivity, the system also builds a k-Nearest Neighbors (kNN) graph by linking chunks with similar embeddings to each other through similarity relationships.
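The kNN idea can be sketched as follows: for every chunk, find its most similar neighbours by cosine similarity and add an edge if the score passes a threshold. The relationship name SIMILAR, the threshold value, and the two-dimensional toy embeddings are assumptions for illustration and do not reflect the tool's internal settings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_edges(embeddings, k=2, min_score=0.8):
    """Link every chunk to its k most similar chunks above `min_score`."""
    edges = []
    for chunk_id, vector in embeddings.items():
        scored = [
            (cosine_similarity(vector, other_vector), other_id)
            for other_id, other_vector in embeddings.items()
            if other_id != chunk_id
        ]
        scored.sort(reverse=True)  # highest similarity first
        for score, other_id in scored[:k]:
            if score >= min_score:
                edges.append((chunk_id, "SIMILAR", other_id, round(score, 3)))
    return edges


# Hand-written toy embeddings (a real system would use an embedding model).
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
for edge in knn_edges(embeddings, k=1):
    print(edge)
```

Here chunks "a" and "b" end up linked to each other, while "c" stays unconnected because nothing is similar enough to it.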
One of the most interesting features for us RAG-folks is its support for multiple Retrieval-Augmented Generation (RAG) approaches:
- GraphRAG (the approach we are discussing here)
- Vector-based retrieval ("classical", vector-based RAG)
- Text2Cypher queries (a way to query the graph directly)
This not only allows us to combine several of these approaches and potentially get better results but, even more exciting, to compare the different methods against each other!
Setting up the Neo4j LLM Knowledge Graph Builder
The Neo4j LLM Knowledge Graph Builder connects to a Neo4j database to do its magic. Therefore, we first need to get ourselves a Neo4j database.
There are two ways to do this:
- Use the fully managed Neo4j AuraDB. This is the easiest way to get started.
- Set up your own Neo4j database on your local machine or a server.
Setting up Neo4j AuraDB
To use AuraDB, just head over to Neo4j AuraDB and create an account. On first login, you'll be shown a username and password. There is also a "Download" button which allows you to download a file containing all the required connection parameters.
Wait until your instance is created - and voila - you're done.
Neo4j AuraDB
Setting up a self-hosted Neo4j database
Running Neo4j with Docker is also quite simple.
Note: We need Neo4j with APOC enabled.
- Make sure you have Docker installed.
- Run the following command (replace your_password with the password you want to use and /path/to/your/data with the path to your actual data location on your local host):
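A minimal sketch of what such a command could look like, assembled in Python so the two placeholders are easy to swap. The image tag, the port mappings, and the NEO4J_PLUGINS setting used to enable APOC are assumptions based on the conventions of the official neo4j Docker image, not values taken from this guide, so check them against the image documentation for your version.

```python
import shlex


def neo4j_docker_command(password, data_dir, image="neo4j:5"):
    """Build a `docker run` argument list for a local Neo4j with APOC.

    All flags below are assumptions based on the official image's conventions.
    """
    return [
        "docker", "run",
        "-p", "7474:7474",                     # HTTP / Neo4j Browser
        "-p", "7687:7687",                     # Bolt protocol used by drivers
        "-v", f"{data_dir}:/data",             # persist the database on the host
        "-e", f"NEO4J_AUTH=neo4j/{password}",  # initial username/password
        "-e", 'NEO4J_PLUGINS=["apoc"]',        # ask the image to install APOC
        image,
    ]


command = neo4j_docker_command("your_password", "/path/to/your/data")
print(shlex.join(command))  # copy-pasteable shell command
```

Once Docker is available, the same list can be passed to subprocess.run(command, check=True) instead of printing it.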
Running the Neo4j LLM Knowledge Graph Builder
So, now that you have your database up and running, let's start the Knowledge Graph Builder.
Again, there are two ways to do this:
- Using the Neo4j LLM Knowledge Graph Builder web application
- Or using the provided Docker Compose file
Using the LLM Knowledge Graph Builder web application
Simply navigate to the Neo4j LLM Knowledge Graph Builder web app and enter your connection information.
If you used AuraDB, simply refer to the credentials file you downloaded above.
Neo4j LLM Knowledge Graph Builder connection settings
Note: If you self-host your Neo4j instance, you will most likely not be able to use the graph builder web application, as you would need to expose your database to the internet. In this case, refer to the next section.
Running the Neo4j LLM Knowledge Graph Builder locally, using docker compose
Again, make sure Docker is installed.
- Create a .env file in your project's root folder with the following variables:
  Get your OpenAI key from the OpenAI Platform and your Diffbot key from the Diffbot application.
- Run the following command:
Refer to the Neo4j LLM Knowledge Graph Builder deployment documentation for more configuration options and instructions.
Creating your first knowledge graph for RAG
So, now you should have access to the web application.
Neo4j LLM Knowledge Graph Builder
With that in hand, let's create our first knowledge graph to get a grasp of how this all works.
- Click on "Web Sources" in the left menu bar.
- Click on "Website Link" (the third icon) and enter a website you want to get data from. For example, you could use our latest blog post about using LLMs to chat with BigQuery.
- Click on "Submit".
The first time you submit a file, you are asked to create a schema for your knowledge graph. How to come up with a schema is a topic for another post. However, if we simply leave the settings empty, the LLM will be tasked with creating the schema for us, which is actually not a bad idea.
Graph schema settings
Wait for a second and the Knowledge Graph Builder will show the newly created file in the main area of the app. Select the checkbox to its left and click "Generate Graph". This will take a while, depending on the size of the document.
Once the graph is generated, click on "Preview Graph". You should be greeted with a graph visualization of your document, similar to the one below.
graph visualization with LLM-built entities
Use the dropdown on the top right to switch between the different types of graphs (lexical, entity, kNN) and explore the different relationships.
Trying RAG with our newly created knowledge graph
So, while the graph looks nice, let's check whether it is actually useful for us. Let's go back to the main area and use the chat area on the right-hand side.
Ask a question like "What is the system message in the llm api call?". This is quite an intricate question, as it refers to a code section within the blog post and might therefore be quite hard to find.
However, as you can see below, the system is easily able to find this information from our blog post.
Chat with LLM-built knowledge graph
And there you have it, we built our very first knowledge graph and used it in a small RAG example.
By the way: Where is our knowledge graph stored?
If you are wondering where the data resides: The Neo4j LLM Knowledge Graph Builder simply adds the graph data to the Neo4j database instance provided during app setup.
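Because the data lives in an ordinary Neo4j database, you can inspect it with any Neo4j client. The sketch below uses the official neo4j Python driver (installed separately, for example with pip install neo4j) to count nodes per label. The connection details are placeholders, and the query is deliberately generic so that it makes no assumptions about the labels the builder creates.

```python
# Generic Cypher query: count nodes per label combination.
LABEL_COUNT_QUERY = (
    "MATCH (n) "
    "RETURN labels(n) AS labels, count(*) AS total "
    "ORDER BY total DESC"
)


def count_nodes_by_label(uri, user, password):
    """Run the label count query against a Neo4j instance."""
    # Imported here so the rest of the sketch works without the driver installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            result = session.run(LABEL_COUNT_QUERY)
            return [(record["labels"], record["total"]) for record in result]
    finally:
        driver.close()


def format_counts(rows):
    """Turn (labels, total) rows into printable lines."""
    return [f"{'/'.join(labels) or '(no label)'}: {total}" for labels, total in rows]


# Example call (placeholders, requires a running database):
# rows = count_nodes_by_label("bolt://localhost:7687", "neo4j", "your_password")
# print("\n".join(format_counts(rows)))
```

The same query can also be pasted directly into the Neo4j Browser if you prefer to explore the stored graph interactively.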
Conclusion
In conclusion, the Neo4j LLM Knowledge Graph Builder provides a practical solution for creating knowledge graphs to enhance RAG systems. This guide has walked you through the basic setup and usage, demonstrating how to transform unstructured data into queryable graph structures.
The tool's ability to generate both lexical and semantic graphs offers a more complete view of your data, capturing both structural and conceptual relationships. As shown in our simple example, even with minimal setup, the system can provide accurate, context-aware responses to queries.
While this guide serves as a starting point, there's much more to explore in terms of schema design, entity extraction, and advanced querying techniques. As you become more familiar with the tool, you'll be able to create increasingly sophisticated knowledge graphs based on your very own data.
The integration with Neo4j's database technology ensures efficient querying and updating of your graphs, making this a powerful addition to our RAG toolkit.