Chat with your Confluence: A Step-by-Step Guide using Airbyte, Langchain and PGVector
The ability to quickly access and interpret data is crucial in today's business landscape. As organizations increasingly adopt tools like Confluence for documentation and knowledge management, the challenge often lies in navigating and using these vast repositories of information efficiently. This is where the concept of an AI-powered chatbot comes into play, particularly one that is driven by a Large Language Model (LLM) and integrated seamlessly with Confluence data.
The aim of this blog post is to guide you through the process of creating such a chatbot. We'll start by discussing the role of Airbyte, an open-source data integration platform, in syncing your Confluence data into a suitable database. Then, we'll delve into the utilization of PGVector, an extension for PostgreSQL, which allows for efficient vector operations - a key component in running LLM queries against large datasets. Finally, we'll introduce Langchain, a powerful tool for integrating various tools with each other, and demonstrate how it can be used to glue together the various components of our chatbot.
By the end of this guide, you will not only understand the theoretical underpinnings of this technology but also have the practical know-how to set up your own AI chatbot. This tool will be capable of interpreting complex queries and returning concise, relevant information from your Confluence data, enhancing the way your team interacts with and leverages this valuable resource.
Whether you are a software developer, a data scientist, or simply a tech enthusiast looking to experiment with AI and data integration, this post is designed to provide you with a clear, straightforward path to achieving a sophisticated and highly functional chatbot solution. Let's dive in and start our journey towards revolutionizing data accessibility and interaction within your organization.
What is an LLM and Why Use It for a Chatbot?
A Large Language Model (LLM) like GPT-4 is an advanced AI that understands and generates human-like text. LLMs are trained on vast datasets, enabling them to comprehend context, answer questions, and even mimic human conversation styles. For a chatbot, this means the ability to provide more natural, accurate, and contextually relevant responses to user queries.
Introduction to Airbyte, Langchain and PGVector
Airbyte is an open-source data integration tool that simplifies the process of syncing data from various sources to your databases, data lakes, or data warehouses. In our case, Airbyte will be instrumental in connecting and transferring data from Confluence to a PostgreSQL database.
PGVector is an extension for PostgreSQL designed for efficient vector operations, crucial for handling LLM-based queries. It allows for fast searching and comparison within large datasets, making it an ideal choice for our chatbot's backend. For a primer on what PGVector is and why it's such a great solution for chatbots, have a look at our blog post "What is PGVector".
Langchain is a powerful tool for integrating various tools with each other. It allows you to create a chain of tools, where the output of one tool becomes the input of the next. In our case, Langchain will be used to connect Airbyte with PGVector, enabling seamless data transfer and query optimization for our chatbot.
Prerequisites
To follow along with this guide, you'll need the following already set up:
- Confluence Account: Ensure you have access to a Confluence account with data you want to query.
- PostgreSQL Database: Set up a PostgreSQL database. This will be where your Confluence data resides for the chatbot to access.
- Docker and Docker Compose: Install Docker and Docker Compose on your local machine, following the official Docker installation instructions.
Setting Up Airbyte and Langchain for Confluence Data Integration
Airbyte has historically been a flexible and powerful UI-based tool for data integration. With its more than 300 connectors, you can easily connect a wide range of data sources and destinations. Historically, however, this also meant setting up the whole Airbyte platform on your system, or using their managed cloud offering.
With the release of Airbyte's Python library, PyAirbyte, you can now use Airbyte connectors from within a Python script - without running the Airbyte platform at all. This makes setting up Airbyte as easy as running an import statement.
Furthermore, Airbyte and Langchain are very well integrated: Langchain provides an AirbyteLoader that uses PyAirbyte under the hood, so any Airbyte connector can be plugged into the Langchain ecosystem.
I can't overstate the significance of this combination of Langchain and Airbyte. We finally have a set of tools that lets us connect virtually any data source to an LLM - and build chatbots on top of it.
Setting up Confluence as a Source in the Langchain AirbyteLoader
NOTE: The following section requires Python 3.10 or lower. The PyAirbyte library is not yet compatible with Python 3.11 or 3.12.
As the theory is out of the way, let's get our hands dirty and start setting up our chatbot.
1. Create a Confluence API token in your Confluence instance:
   - Navigate to your Confluence API token screen.
   - Click on "Create API Token" and give it a label, e.g. "Airbyte Integration".
   - Copy the generated token and store it in a safe place. You will need it later.
2. Install the Langchain AirbyteLoader.
3. Import the AirbyteLoader and set up the Confluence source.

That's basically all we need to read the Confluence pages. Let's print our result.
The output is provided as YAML and might look as follows.
While this is quite an achievement, we have two problems we should tackle:
- The sync is quite slow: the `loader.load()` method reads ALL documents synchronously.
- The output format contains a lot of fields we do not need.
Thankfully, Langchain has us covered.
First, we can tweak the `loader.load()` call as follows:
This creates a lazy-loading version of the AirbyteLoader, which only loads the Confluence pages when they are needed. Furthermore, it provides an async interface, allowing us to run the operation concurrently if we want to.
Note: We don't implement concurrency in this tutorial; however, it is highly advisable to make use of concurrency in a real-world scenario.
Furthermore, Langchain provides a very handy interface to select which fields should actually be returned - and how. This is done by specifying a template in the `AirbyteLoader` constructor. The `template` parameter accepts a template string defining which fields to select from the response. In our case we are interested in the body->storage->value field, which is selected with the template string `{body[storage][value]}`.
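Under the hood, this template is rendered with Python's ordinary format-string item access, so the nested field selection works exactly like `str.format` on a nested dict. A quick self-contained illustration (the sample record and the `PromptTemplate` line are assumptions for demonstration):

```python
# Each synced record is a nested dict; the template string picks out one
# nested field using standard str.format item access.
record = {"body": {"storage": {"value": "<p>Hello</p>"}}}
template = "{body[storage][value]}"
print(template.format(**record))  # → <p>Hello</p>

# In the loader this would be passed as a PromptTemplate, e.g.:
# from langchain_core.prompts import PromptTemplate
# AirbyteLoader(..., template=PromptTemplate.from_template("{body[storage][value]}"))
```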
However, we still have a little issue: the returned page content is cluttered with HTML tags that we don't need. To resolve this, we can use the beautifulsoup4 package. It is especially good at parsing HTML and converting it to text (we want plain text so we can feed it to the LLM later on).
Langchain also extracts a bunch of meta-information which we don't necessarily need, so we can remove it. Our final code to pre-process the data extracted from Confluence looks as follows.
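A sketch of this pre-processing, assuming beautifulsoup4 is installed (`pip install beautifulsoup4`); the helper name `clean_page` and the decision to drop all metadata are illustrative choices, not fixed API:

```python
from bs4 import BeautifulSoup

def clean_page(html: str) -> str:
    """Strip HTML tags and collapse whitespace, leaving plain text for the LLM."""
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    return " ".join(text.split())

# Applied to the loader output (sketch, assumes `loader` from earlier):
# cleaned_docs = []
# for doc in loader.lazy_load():
#     doc.page_content = clean_page(doc.page_content)
#     doc.metadata = {}  # drop the meta-information we don't need
#     cleaned_docs.append(doc)

print(clean_page("<h1>Title</h1><p>Hello <b>world</b></p>"))  # → Title Hello world
```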
That's it, we are done with preprocessing our data.
Note: It's not important that the extracted text doesn't "look" perfectly clean. We might still have some HTML tags or encoding artifacts left; however, the LLM will be able to handle these imperfections later on.
Storing the data in our PGVector vector store
Now that we have extracted our data, we can load it into our PGVector vector store.
If you haven't already, install PGVector on your Postgres database host - for example via your operating system's package manager, or by building it from source as described in the pgvector README.
Alternatively, you can run a PGVector-enabled Postgres directly from Docker using the `pgvector/pgvector:pg16` image.
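The Docker route can look like this; the container name, password, and port are placeholder choices:

```shell
# Start a Postgres 16 instance with pgvector preinstalled (placeholder credentials).
docker run -d --name pgvector-demo \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```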
Note: If you use a managed Postgres offering such as Supabase or Postgres on Microsoft Azure, have a look at their extension documentation. Most, if not all, providers offer support for PGVector.
Next, enable the PGVector extension. Connect to your Postgres database and run
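For example, via `psql` (the connection string is a placeholder for your own database):

```shell
# Enable the pgvector extension in the target database.
psql "postgresql://postgres:postgres@localhost:5432/postgres" \
  -c "CREATE EXTENSION IF NOT EXISTS vector;"
```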
That's all we need to initialize PGVector. Next we can already use Langchain to insert our data.
Creating embeddings from texts and inserting to PGVector
On a high level, we have two tasks:
- Create vector embeddings from the confluence texts
- Insert these embeddings alongside the texts to our PGVector vector database
While we could do this manually, Langchain provides convenient abstractions, reducing it to a few lines of code.
Note that Langchain will use OpenAI's text-embedding-ada-002 model by default to create the vector embeddings, so you need a working OpenAI API key.
That's it! We stored our embeddings in our PGVector database.
Using the Langchain chat interface to Chat with our Confluence data
Now that we have stored and prepared our Confluence data, we can finally use our beloved LLM - GPT-4 - to ask questions and get answers based on our Confluence data.
To run a prompt, simply call
That's all we need. Under the hood, Langchain does the following:
- Create vector embeddings from the search query.
- Use vector similarity search of PGVector to find documents in the vector store which are semantically similar to our search query.
- Send the question plus the semantically similar documents to the GPT-4 Turbo large language model.
- Retrieve the answer from the AI model.
Conclusion
In this blog post, we provided a step-by-step guide on how to build an AI chatbot that interfaces with Confluence data, leveraging tools like Airbyte for extremely easy data extraction, PGVector for storing text embeddings, and Langchain for gluing the components together.
Key points:

- Large Language Models (LLMs) like GPT-4 enable chatbots to provide more natural, accurate, and contextually relevant responses to user queries by understanding and generating human-like text.
- Airbyte simplifies syncing data from sources like Confluence to databases. PyAirbyte, Airbyte's Python library, allows using Airbyte connectors directly from Python without running the full platform.
- PGVector is a PostgreSQL extension for efficient vector operations, very useful and convenient for handling LLM-based queries on large datasets.
- Langchain integrates various tools into a chain. It connects Airbyte with PGVector for seamless data transfer and query optimization in the chatbot.
- The AirbyteLoader in Langchain utilizes PyAirbyte to load Confluence data. The data is pre-processed to extract relevant fields and clean HTML content.
- PGVector is used to store the cleaned Confluence data and their vector embeddings, generated using OpenAI's embedding model via Langchain.
- Finally, Langchain's RetrievalQA chain interfaces with the PGVector data store and the GPT-4 model to provide answers to user queries based on the stored Confluence knowledge.
The post demonstrated the power of combining Langchain, Airbyte, and PGVector to create an intelligent chatbot solution that seamlessly integrates with Confluence data, providing a realistic, state-of-the-art approach to data accessibility within organizations.
Interested in how to train your very own Large Language Model?
We prepared a well-researched guide on how to use the latest advancements in open-source technology to fine-tune your own LLM. This has many advantages, such as:
- Cost control
- Data privacy
- Excellent performance, adjusted specifically for your intended use case