Chat with your Confluence: A Step-by-Step Guide using Airbyte, Langchain and PGVector


The ability to quickly access and interpret data is crucial in today's business landscape. As organizations increasingly adopt tools like Confluence for documentation and knowledge management, the challenge often lies in navigating and using these vast repositories of information efficiently. This is where an AI-powered chatbot comes into play, particularly one that is driven by a Large Language Model (LLM) and integrated seamlessly with Confluence data.

The aim of this blog post is to guide you through the process of creating such a chatbot. We'll start by discussing the role of Airbyte, an open-source data integration platform, in syncing your Confluence data into a suitable database. Then, we'll delve into the utilization of PGVector, an extension for PostgreSQL, which allows for efficient vector operations - a key component in running LLM queries against large datasets. Finally, we'll introduce Langchain, a powerful tool for integrating various tools with each other, and demonstrate how it can be used to glue together the various components of our chatbot.

By the end of this guide, you will not only understand the theoretical underpinnings of this technology but also have the practical know-how to set up your own AI chatbot. This tool will be capable of interpreting complex queries and returning concise, relevant information from your Confluence data, enhancing the way your team interacts with and leverages this valuable resource.

Whether you are a software developer, a data scientist, or simply a tech enthusiast looking to experiment with AI and data integration, this post is designed to provide you with a clear, straightforward path to achieving a sophisticated and highly functional chatbot solution. Let's dive in and start our journey towards revolutionizing data accessibility and interaction within your organization.

What is an LLM and Why Use It for a Chatbot?

A Large Language Model (LLM) like GPT-4 is an advanced AI that understands and generates human-like text. LLMs are trained on vast datasets, enabling them to comprehend context, answer questions, and even mimic human conversation styles. For a chatbot, this means the ability to provide more natural, accurate, and contextually relevant responses to user queries.

Introduction to Airbyte, Langchain and PGVector

Airbyte is an open-source data integration tool that simplifies the process of syncing data from various sources to your databases, data lakes, or data warehouses. In our case, Airbyte will be instrumental in connecting and transferring data from Confluence to a PostgreSQL database.

PGVector is an extension for PostgreSQL designed for efficient vector operations, crucial for handling LLM-based queries. It allows for fast searching and comparison within large datasets, making it an ideal choice for our chatbot's backend. For a primer on what PGVector is and why it's such a great solution for chatbots, have a look at this blog post about "What is PGVector".

Langchain is a powerful tool for integrating various tools with each other. It allows you to create a chain of tools, where the output of one tool becomes the input of the next. In our case, Langchain will be used to connect Airbyte with PGVector, enabling seamless data transfer and query optimization for our chatbot.

Prerequisites

To follow along with this guide, you'll need the following already set up:

  • Confluence Account: Ensure you have access to a Confluence account with data you want to query.
  • PostgreSQL Database: Set up a PostgreSQL database. This will be where your Confluence data resides for the chatbot to access.
  • Docker and Docker Compose: Install Docker and Docker Compose on your local machine. Follow the official Docker instructions.

Setting Up Airbyte and Langchain for Confluence Data Integration

Historically, Airbyte has been a flexible and powerful UI-based tool for data integration. With more than 300 connectors, it lets you connect to a wide range of data sources and destinations. However, this also meant that you had to set up the whole Airbyte platform on your system or use the managed cloud offering.

With the release of PyAirbyte, Airbyte's Python library, you can now use the Airbyte connectors from within a Python script, without the need to run the Airbyte platform. This makes setting up Airbyte as easy as running an import command.

Furthermore, Airbyte and Langchain are very well integrated. Langchain provides an AirbyteLoader that uses PyAirbyte to run any Airbyte connector and integrate it into the Langchain ecosystem.

I can't overstate the significance of this combination of Langchain and Airbyte. We finally have a set of tools that allows us to connect virtually any data source with LLMs and to create chatbots on top of them.

Setting up Confluence as a Source in the Langchain AirbyteLoader

NOTE: The following section requires you to run Python 3.10 or lower. The PyAirbyte library is not yet compatible with Python 3.11 or 3.12.
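Because the note above hinges on the interpreter version, a small guard at the top of your script can fail early with a readable message. The following is only a sketch: the name `is_supported_python` is made up for illustration, and the 3.10 upper bound simply encodes the compatibility note above, which may no longer hold for newer PyAirbyte releases.

```python
import sys

# Upper bound taken from the compatibility note above (an assumption that
# should be re-checked against the PyAirbyte release you install).
MAX_SUPPORTED = (3, 10)


def is_supported_python(version_info=sys.version_info, max_supported=MAX_SUPPORTED):
    """Return True if the (major, minor) version is at or below the bound."""
    return tuple(version_info[:2]) <= max_supported
```

In a script you could call `is_supported_python()` before importing the loader and exit with a hint if it returns False.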

As the theory is out of the way, let's get our hands dirty and start setting up our chatbot.

  1. Create a Confluence API Token in your Confluence instance.

    • Navigate to your Confluence API token screen
    • Click on "Create API Token" and give it a label, e.g. "Airbyte Integration"
    • Copy the generated token and store it in a safe place. You will need it later.
  2. Install the Langchain AirbyteLoader

    pip install langchain
    pip install langchain-airbyte
    pip install langchain-openai
    pip install langchain-postgres
    pip install beautifulsoup4
    pip install psycopg
  3. Import the AirbyteLoader and set up the Confluence source

    from langchain_airbyte import AirbyteLoader

    loader = AirbyteLoader(
        source="source-confluence",
        stream="pages",
        config={
            "domain_name": "devopsandmore.atlassian.net",
            "email": "andreas.nigg@devopsandmore.com",
            "api_token": "<your-confluence-api-key>",
        },
    )
  4. That's basically all we need to read the confluence pages. Let's print our result.

docs = loader.load()
print(docs[0].page_content[:500])

The output is provided as yaml, and might look as follows.

_expandable:
  ancestors: ''
  childTypes: ''
  children: /rest/api/content/65610/child
  container: /rest/api/space/~625124fdf813eb00692f81e5
  metadata: ''
  operations: ''
  schedulePublishDate: ''
  schedulePublishInfo: ''
  space: /rest/api/space/~625124fdf813eb00692f81e5
_links:
  editui: /pages/resumedraft.action?draftId=65610
  self: https://devopsandmore.atlassian.net/wiki/rest/api/content/65610
  tinyui: /x/SgAB
  webui: /spaces/~625124fdf813eb00692f81e5/pages/65610/Sample+Pages
body:
  _expandable:
    anonymous_export_view: ''
    atlas_doc_format: ''
    dynamic: ''
    editor: ''
    editor2: ''
    export_view: ''
    styled_view: ''
  storage:
    _expandable:
      content: /rest/api/content/65610
    embeddedContent: []
    representation: storage
    value: "\n<ac:structured-macro ac:name=\"info\" ac:schema-version=\"1\" ac:macro-id=\"\
      e4a7c28f-0952-47e9-bf6c-a6edb999e9bf\"><ac:rich-text-body><p> We've created\
      \ a few sample pages that you can use to get started.</p></ac:rich-text-body></ac:structured-macro>\n\
      <p />\n<table data-layout=\"wide\">\n<tbody>\n <tr>\n <td><h3 style=\"\
      text-align: center;\"><ac:emoticon ac:name=\"tick\" /> <a href=\"/wiki/spaces/~625124fdf813eb00692f81e5/pages/65625/Product+requirements\"\

While this is quite an achievement, we have two problems we should tackle:

  1. The sync is quite slow. The loader.load() method reads ALL documents synchronously.
  2. The output format contains a lot of fields we do not need.

Thankfully, Langchain has us covered. First, we can tweak the loader.load() call as follows:

# Top-level "async for" works in a notebook; in a script, wrap it in a coroutine.
my_async_iterator = loader.alazy_load()
async for doc in my_async_iterator:
    print(doc.page_content)

This creates a lazy-loading version of the AirbyteLoader, which only loads the Confluence pages when they are needed. Furthermore, it provides an async interface, allowing us to run the operation concurrently if we wanted to.

Note: We don't implement concurrency in this tutorial, however it is highly advisable to make use of concurrency in a real-world scenario.
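To give an idea of what such concurrency could look like, here is a minimal sketch using only the standard library. It does not call Airbyte at all: `fake_pages` is a hypothetical stand-in for `loader.alazy_load()`, and `process` is a placeholder for whatever per-page work you need. It also shows how to drive an async iterator from a plain script via `asyncio.run`, since `async for` outside a notebook has to live inside a coroutine.

```python
import asyncio


async def fake_pages(n):
    """Hypothetical stand-in for loader.alazy_load(): yields n page strings."""
    for i in range(n):
        await asyncio.sleep(0)  # simulate waiting on the network
        yield f"page {i}"


async def process(page, semaphore):
    """Placeholder per-page work; the semaphore caps how many run at once."""
    async with semaphore:
        await asyncio.sleep(0)  # simulate I/O-bound work
        return page.upper()


async def load_and_process(pages, max_concurrency=5):
    """Consume an async iterator and process its items concurrently."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [asyncio.create_task(process(page, semaphore)) async for page in pages]
    # gather() returns results in the order the tasks were created
    return await asyncio.gather(*tasks)


results = asyncio.run(load_and_process(fake_pages(3)))
print(results)
```

The semaphore is a design choice to avoid flooding an API with requests; whether concurrency pays off depends on where the time is actually spent in your setup.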

Furthermore, Langchain provides a very handy interface to select which fields should actually be returned, and how. This is done by specifying a template in the AirbyteLoader constructor.

from langchain_core.prompts import PromptTemplate

loader = AirbyteLoader(
    source="source-confluence",
    stream="pages",
    config={
        "domain_name": "devopsandmore.atlassian.net",
        "email": "andreas.nigg@devopsandmore.com",
        "api_token": "<your-confluence-api-key>",
    },
    # Only the page body ends up in page_content
    template=PromptTemplate.from_template(
        '{body[storage][value]}'
    ),
)

The template parameter lets you define a template string that determines which fields to select from the response. In our case, we are interested in the body->storage->value field, which is selected with the template string {body[storage][value]}.
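The bracket notation in the template is ordinary Python format-string syntax, which Langchain prompt templates use by default. The following sketch shows the lookup in isolation with plain `str.format`; the `record` dictionary is a hypothetical, heavily shortened version of the page structure shown earlier.

```python
# Hypothetical, shortened record in the shape of the YAML output shown earlier.
record = {
    "title": "Sample Pages",
    "body": {"storage": {"representation": "storage", "value": "<p>Hello</p>"}},
}

template = "{body[storage][value]}"

# str.format resolves the bracketed names as nested dictionary lookups.
print(template.format(**record))  # prints: <p>Hello</p>
```

A template can combine several fields, for example `"{title}\n{body[storage][value]}"`.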

However, we still have a small issue: the page content is cluttered with HTML tags that we don't need. To resolve this, we can use the beautifulsoup4 package, which is especially good at parsing HTML and turning it into text. (We want plain text so that we can hand it to the LLM later on.)

Also, Langchain extracts a bunch of meta-information which we don't necessarily need. We can remove it. So, our final code to pre-process the extracted data from confluence looks as follows.

from bs4 import BeautifulSoup

my_async_iterator = loader.alazy_load()
docs = []
async for doc in my_async_iterator:
    # Keep only the metadata fields we need
    doc.metadata = {
        "title": doc.metadata["title"],
        "createdAt": doc.metadata["history"]["createdDate"],
        "createdBy": doc.metadata["history"]["createdBy"],
    }

    # Strip the HTML markup and keep the plain text
    html_content = doc.page_content
    soup = BeautifulSoup(html_content, 'html.parser')
    clean_html = soup.get_text()

    # Prepend the title to the page content, as we want to use the title in
    # our pgvector embedding similarity search
    doc.page_content = doc.metadata["title"] + "\n" + clean_html
    docs.append(doc)

That's it, we are done with preprocessing our data.
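One caveat about the loop above: the metadata lookup assumes that every record carries a `history` entry, and a missing key would raise a KeyError. If you are unsure about your data, a more forgiving variant could look like the sketch below. The helper name `slim_metadata` and the sample records are hypothetical.

```python
def slim_metadata(metadata):
    """Keep only title and creation info; tolerate missing keys."""
    history = metadata.get("history") or {}
    return {
        "title": metadata.get("title", ""),
        "createdAt": history.get("createdDate"),
        "createdBy": history.get("createdBy"),
    }


# Hypothetical records: one complete, one without a "history" entry.
full = {
    "id": "65610",
    "title": "Sample Pages",
    "history": {"createdDate": "2024-01-01", "createdBy": {"displayName": "Jane"}},
}
partial = {"title": "Draft"}

print(slim_metadata(full))
print(slim_metadata(partial))
```

Inside the loop you would then write `doc.metadata = slim_metadata(doc.metadata)`.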

Note: It does not matter if the extracted text does not "look" very clean. Some HTML or UTF remnants might be left over, but the LLM will be able to handle these imperfections later on.
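If you nevertheless want tidier text, for example to avoid embedding long runs of blank lines, a small normalization step is one option. This is an optional sketch using only the standard library; the helper name `normalize_whitespace` and the sample string are made up for illustration.

```python
import re


def normalize_whitespace(text):
    """Collapse horizontal whitespace and limit consecutive blank lines."""
    text = text.replace("\xa0", " ")        # non-breaking spaces left over from HTML
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line in a row
    return text.strip()


sample = "Sample Pages\n\n\n\n  We've created\xa0a few   sample pages. \n"
print(normalize_whitespace(sample))
```

You could apply it to `clean_html` before building `doc.page_content`.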

Storing the data in our PGVector vector store

Now that we have extracted our data, we can put it in our PGVector vector store.

If not already done, install PGVector in your Postgres database. On your Postgres database host, run

cd /tmp
git clone --branch v0.7.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo

Alternatively, you can also run a PGVector-enabled Postgres directly from docker with image pgvector/pgvector:pg16.

Note: If you use a managed Postgres installation like supabase or Postgres on Microsoft Azure, have a look at their extension documentation. Most, if not all providers offer support for PGVector.

Next, enable the PGVector extension. Connect to your Postgres database and run

CREATE EXTENSION vector;

That's all we need to initialize PGVector. Next we can already use Langchain to insert our data.
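If you prefer to enable the extension from Python instead of a SQL console, the same statement can be sent through a database connection. The sketch below assumes `conn` is a psycopg 3 connection (the `psycopg` package installed earlier). The helper name is hypothetical, and `IF NOT EXISTS` is added so that running it twice does not raise an error.

```python
def ensure_vector_extension(conn):
    """Enable pgvector on the connected database if it is not enabled yet.

    `conn` is assumed to be a psycopg 3 connection; any object exposing
    execute() and commit() works for this sketch.
    """
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.commit()


# Assumed usage with psycopg 3 (replace the placeholders first):
#   import psycopg
#   with psycopg.connect("postgresql://<user>:<password>@<host>:<port>/<database>") as conn:
#       ensure_vector_extension(conn)
```

Creating extensions usually requires elevated database privileges, so on managed services this may have to be done through the provider's console instead.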

Creating embeddings from texts and inserting to PGVector

On a high level, we have two tasks:

  1. Create vector embeddings from the confluence texts
  2. Insert these embeddings alongside the texts to our PGVector vector database

While we could do this manually, Langchain provides convenient abstractions, reducing it to a few lines of code.

from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector

# See the Docker image mentioned above to launch a Postgres instance with pgvector enabled.
connection = "postgresql+psycopg://<user>:<password>@<host>:<port>/<database>"  # Change to your Postgres instance
collection_name = "blog_docs"
embeddings = OpenAIEmbeddings(api_key="<your-openai-api-key>")

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

# Embeds every document and stores text, metadata, and vector in Postgres
vectorstore.add_documents(docs)

Note that, unless you specify a different model, Langchain's OpenAIEmbeddings class uses OpenAI's text-embedding-ada-002 model to create the vector embeddings. Therefore, you need a working OpenAI API key.

That's it! We stored our embeddings in our PGVector database.
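A practical detail about the connection string used above: if the database password contains characters such as "@" or "/", the URL has to be percent-encoded, otherwise it will not parse correctly. The sketch below assembles the URL from its parts using the standard library. The helper name and the credentials are hypothetical.

```python
from urllib.parse import quote_plus


def build_connection_url(user, password, host, port, database):
    """Assemble the SQLAlchemy-style URL used above, escaping special characters."""
    return (
        f"postgresql+psycopg://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials; note how "@" and "/" in the password get escaped.
print(build_connection_url("chatbot", "p@ss/word", "localhost", 5432, "confluence"))
```

In a real deployment you would likely read the individual parts from environment variables rather than writing them into the script.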

Using the Langchain chat interface to Chat with our Confluence data

Now that we have stored and prepared our confluence data, we can finally use our beloved LLM - GPT-4 - to ask questions and get answers based on our confluence data.

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-4-turbo", api_key="<your-openai-api-key>", streaming=True)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)

To run a prompt, simply call

qa.run("How to create invoices with SevDesk?")

That's all we need. Under the hood, Langchain does the following:

  1. Create vector embeddings from the search query.
  2. Use vector similarity search of PGVector to find documents in the vector store which are semantically similar to our search query.
  3. Send the question plus the semantically similar documents to the gpt-4-turbo large language model.
  4. Retrieve the answer from the AI model.
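To make these steps more tangible, here is a toy re-enactment in plain Python. It is not how Langchain or PGVector are implemented internally. The three-dimensional "embeddings", the sample texts, and the prompt wording are invented for illustration, and the final call to the language model is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def stuff_prompt(question, documents):
    """Idea behind the "stuff" chain type: put all documents into one prompt."""
    context = "\n\n".join(documents)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


# Toy 3-dimensional "embeddings"; real models produce far longer vectors.
store = [
    ("Invoices are created in the billing tool.", [0.9, 0.1, 0.0]),
    ("The office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("Invoice templates live in the finance space.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the question below

hits = top_k(query_vec, store)
prompt = stuff_prompt("How do I create invoices?", hits)
print(prompt)
```

The two invoice-related texts end up in the prompt, while the unrelated one is left out. In the real setup, the ranking is performed by PGVector inside Postgres, and the finished prompt is sent to the model.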

Conclusion

In this blog post, we provided a step-by-step guide on how to build an AI chatbot that interfaces with Confluence data, leveraging tools like Airbyte for extremely easy data extraction, PGVector for storing text embeddings, and Langchain for gluing the components together.

Key points:

  • Large Language Models (LLMs) like GPT-4 enable chatbots to provide more natural, accurate, and contextually relevant responses to user queries by understanding and generating human-like text.

  • Airbyte simplifies syncing data from sources like Confluence to databases. PyAirbyte, Airbyte's Python library, allows using Airbyte connectors directly from Python without running the full platform.

  • PGVector is a PostgreSQL extension for efficient vector operations, very useful and convenient for handling LLM-based queries on large datasets.

  • Langchain integrates various tools into a chain. It connects Airbyte with PGVector for seamless data transfer and query optimization in the chatbot.

  • The AirbyteLoader in Langchain utilizes PyAirbyte to load Confluence data. The data is pre-processed to extract relevant fields and clean HTML content.

  • PGVector is used to store the cleaned Confluence data and their vector embeddings generated using OpenAI's embedding model via Langchain.

  • Finally, Langchain's RetrievalQA chain interfaces with the PGVector data store and the GPT-4 model to provide answers to user queries based on the stored Confluence knowledge.

The post demonstrated the power of combining Langchain, Airbyte, and PGVector to create an intelligent chatbot solution that seamlessly integrates with Confluence data, providing a realistic, state-of-the-art approach to data accessibility within organizations.
