Using AI directly from your database - with PostgreSQL and pgai
Artificial Intelligence (AI) has become THE tool of choice for businesses looking to stay ahead of the curve. From personalized recommendations to predictive analytics, AI is revolutionizing the way we interact with technology. However, integrating AI into existing - or even new - applications can be a challenging task, especially for developers who mostly work in database environments.
That's where pgai comes in - a new extension just released by the team at Timescale. With pgai, you can seamlessly incorporate AI capabilities directly into your PostgreSQL database, eliminating the need for separate AI infrastructure and simplifying your workflow.
Imagine being able to perform advanced text analysis, sentiment classification, and embedding creation within your PostgreSQL queries. pgai makes this a reality, empowering database administrators and developers to build intelligent, responsive applications without leaving the familiar confines of their database environment.
In this guide, we will introduce the extension and demonstrate how to use AI directly from a PostgreSQL database, using nothing but SQL. We'll start with how to install pgai, how to provide API keys and, ultimately, how to use the extension.
We'll also discuss some very interesting use cases and provide showcases. So, let's get started.
What is pgai?
pgai is an innovative extension for PostgreSQL designed to integrate artificial intelligence capabilities directly into the database environment. Developed by the folks at Timescale, pgai aims to streamline AI workflows and enhance the capabilities of PostgreSQL databases for AI-native applications.
Key Features of pgai
-
AI Workflow Integration: pgai includes helper functions that facilitate the execution and management of AI workflows within PostgreSQL. This integration allows users to use AI without needing to switch between different tools or platforms.
-
Dependencies and Requirements: The extension requires certain components to function, such as vector data types and the PL/Python procedural language (plpython3u), which enables Python functions to be executed within the PostgreSQL environment.
-
Enhanced Capabilities: By embedding AI directly into the database, pgai allows for more sophisticated data analysis and processing tasks. This can include anything from predictive analytics to complex data transformations, all managed within the familiar PostgreSQL framework.
Use Cases for AI in PostgreSQL and pgai
So, why would you want to use pgai? Or put differently - why would you want to integrate AI capabilities directly into your PostgreSQL database?
The answer is both simple and manifold. Let's take a look at just five of the many use cases enabled by such an integration.
Automatically Tagging and Categorizing Data
Consider a scenario where you have a dataset of user-generated content. You want to automatically tag and categorize this data based on its content.
LLMs like GPT-4 are incredibly good at doing just that - labelling data based on content.
Traditionally, you would most probably create a small, separate service which:
- Continuously fetches data from your database
- Sends these data to an LLM
- Receives the labels from the AI model
- Updates the corresponding rows in the database
Besides the overhead of setting up a separate service, you also have to provide a runtime environment, create secure access to the database, provide monitoring, logging, and so on.
How about doing all of this directly from within your database? With pgai, the scenario above is a single SQL query! Even better, you can create an insert trigger to automatically tag and categorize new data as it is being inserted. No batch processing, no job scheduling, no separate runtime environment.
Content moderation
Similar to the above, you can use AI to automatically moderate content. Consider an application with a comment section: manually moderating user comments is certainly not the AI way.
Again, we could set up a dedicated service just for that - or we simply use a single SQL query once more. For moderating content, pgai explicitly implements the openai_moderate function, which interfaces with the well-known OpenAI content moderation API.
Automatically Summarizing Text
Summarizing text is yet another very common task, perfectly suited for AI. We can use the openai_chat_complete function, which provides the fully abstracted chat completion API from OpenAI. Once again, our summarization is a single SQL query away.
Creating Embeddings for Text
In addition to the chat completion and moderation APIs, pgai - as of the time of this writing - also provides an abstraction for the OpenAI embeddings API. Calling this function simply creates vector embeddings for any text, either provided directly to the function call or stored in a column of a table.
Bulk-Updating Embeddings, Tags, or Categories
From my point of view, one of the best use cases for this kind of extension is the ability to integrate these AI function calls into the set-based nature of relational SQL databases.
What do I mean by that?
Assume you have a list of thousands of texts and you need to create summaries for all of them at once (because you didn't automatically create summaries when inserting the data). This is a rather annoying task: again, setting up a Python script, loading data, batching, updating, etc.
With pgai, we can update potentially ALL rows of a table with just a single statement. Just imagine the delight of creating summaries of thousands of texts with a single SQL query.
Note: Please keep in mind that these operations will take some time, so make sure to have your statement_timeout set accordingly.
Another great example of this is bulk-updating or creating embeddings. With our product Pondhouse AI, we recently switched our embedding model to a more capable one. We had to manually re-index millions of rows of text. With pgai, this would've been much simpler: just a simple UPDATE statement and we're done.
(By the way, read more on how we use Matryoshka Embeddings and Adaptive Retrieval to not only achieve very high retrieval accuracy but also very low query latency.)
How to install pgai?
There are currently two supported ways of installing pgai:
-
Using the pre-built TimescaleDB Docker image.
-
Installing the extension from source.
This might answer one of your questions: Does one need to run TimescaleDB to run pgai? No, you don't. While it's easier to use the pre-built image, you can certainly install the extension on any vanilla Postgres server.
Installing from source
To install pgai from source:
- Clone the pgai GitHub repository
- Run
make install
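Roughly, that boils down to the following commands (a sketch; it assumes the usual build prerequisites such as the PostgreSQL development headers are already installed):

```bash
# Clone the official pgai repository and install the extension files
git clone https://github.com/timescale/pgai.git
cd pgai
make install
```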
Using the pre-built TimescaleDB Docker image
Simply run:
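For example - the image name and tag below are an assumption based on the TimescaleDB HA images, so check the pgai documentation for the currently recommended one:

```bash
# Start a TimescaleDB container that ships with the pgai extension pre-installed
docker run -d --name pgai-demo \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=password \
  timescale/timescaledb-ha:pg16
```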
After installing the extension using either of the two options above, we can create it on our server:
-
Connect to your server
-
Create the extension:
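A minimal sketch - the extension itself is named ai, and CASCADE pulls in its dependencies:

```sql
CREATE EXTENSION IF NOT EXISTS ai CASCADE;
```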
Note: The CASCADE keyword will also install the required dependencies plpython3u and pgvector.
Providing an OpenAI API Key
Most functions of pgai require an OpenAI API key to interface with any of the OpenAI LLM models.
While there are several options for handling API keys with pgai, we found the most straightforward method to be a session-level parameter: set the API key once when creating the session and never worry about it again.
-
Export your API key to your environment (first command in the snippet below):
-
When connecting to your database, use a session parameter as follows:
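Both steps as shell commands - a sketch in which the ai.openai_api_key session setting is taken from the pgai documentation and the connection string is just an example:

```bash
# 1) Export your OpenAI API key (placeholder value)
export OPENAI_API_KEY="sk-your-api-key"

# 2) Connect, passing the key as a session-level setting
PGOPTIONS="-c ai.openai_api_key=$OPENAI_API_KEY" \
  psql -d "postgres://postgres:password@localhost:5432/postgres"
```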
From now on, all pgai functions will use this API key and will therefore be authenticated against the OpenAI API.
One word of caution: while we discussed the great potential of bulk-updating huge amounts of data, each individual row will potentially result in its own OpenAI API call - which can add up to quite a significant bill. Be careful.
Hands-On: Example queries with pgai
Ok, enough talking. Let's see some examples.
Before we get started, one little remark: while we'll be able to achieve all our use cases with a single SQL query (which, again, is such a remarkable thing - a single SQL query to create summaries of thousands of rows of text), some of the queries themselves will look a little more complex, as we need to map our AI requests to a JSON API - which in itself requires some data meddling. But let's not get ahead of ourselves - we'll explain each step in detail.
First, let's create a demo table with some dummy data:
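The schema below is a reconstruction matching the columns described in the next paragraph; the vector dimension of 1536 assumes OpenAI's text-embedding-3-small model, and the sample rows are purely illustrative:

```sql
CREATE TABLE user_content (
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),  -- last change of the entry
    user_id     BIGINT      NOT NULL,                -- author of the entry
    content     TEXT        NOT NULL,                -- the user generated text
    embedding   VECTOR(1536),                        -- filled by pgai later on
    summary     TEXT,                                -- filled by pgai later on
    blocked     BOOLEAN     NOT NULL DEFAULT false   -- set by content moderation
);

INSERT INTO user_content (user_id, content) VALUES
    (1, 'The new coffee machine in the office is fantastic - best espresso I had in years.'),
    (2, 'I keep getting an error when uploading files larger than 10 MB. Please look into this.'),
    (3, 'Loved the last product update, dark mode is exactly what I was waiting for.');
```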
Arguably, this dummy data is quite lazy - but it should do the trick. The queries above create a table user_content, potentially holding user-generated entries. We have columns for the updated_at timestamp, the user_id of the author, the content of the entry, an embedding column for storing the vector embeddings of the text, a summary column for storing summaries and finally a blocked column, indicating whether the content is blocked or not - based on content moderation.
Creating Embeddings using pgai
Now that we have a nice and tidy demonstration table, let's create embeddings for each of the texts.
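A single UPDATE is enough. The sketch below assumes pgai's openai_embed(model, text) helper and the text-embedding-3-small model:

```sql
-- Create an embedding for every row in the table
UPDATE user_content
SET embedding = openai_embed('text-embedding-3-small', content);
```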
You might ask whether we forgot something, but - dear reader - we didn't. This is indeed all one needs to create embeddings for ALL the rows in our table.
Let's check the results:
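For example, with a quick query like this:

```sql
SELECT user_id,
       left(content, 40)     AS content,
       embedding IS NOT NULL AS has_embedding
FROM user_content;
```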
And indeed, all the rows are now equipped with embeddings.
As you might note, it's just simple, plain old SQL. You could add WHERE clauses, ON CONFLICT statements - basically the full range of PostgreSQL.
Invoking the text completion API using pgai to create summaries of data
Next, let's see how to invoke the text completion API to generate summaries for our texts.
The overall process is similar to creating embeddings, but we need to do some JSON manipulation to get the data in the right format.
The OpenAI Chat Completions API requires you to provide a JSON object containing the messages to be sent to the model. Furthermore, the results of the API are also returned as a JSON object.
To familiarize ourselves with what we need to pass to the function call, let's first look at how one uses the API with Python (as we assume that most of us are quite familiar with the Python API invocation):
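A minimal sketch using the official OpenAI Python SDK - model name and prompts are just illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that summarizes text in one sentence."},
        {"role": "user", "content": "Summarize the following text: PostgreSQL is a powerful, open source object-relational database system."},
    ],
)

# The generated answer lives in choices[0].message.content
print(response.choices[0].message.content)
```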
As you can see, we need to provide a list of messages, each containing a role and the content of the message.
The result object subsequently looks as follows:
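Abbreviated and with illustrative values, the response has roughly this shape:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1718000000,
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "PostgreSQL is a powerful open source object-relational database."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 14,
    "total_tokens": 56
  }
}
```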
The most relevant part for us is the content
field of the message
object.
Knowing how to invoke the API and what we get back, we can create the SQL equivalent:
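A sketch of the equivalent call in SQL - again, model name and prompts are only illustrative:

```sql
SELECT openai_chat_complete(
           'gpt-4o',
           jsonb_build_array(
               jsonb_build_object('role', 'system', 'content', 'You are a helpful assistant that summarizes text in one sentence.'),
               jsonb_build_object('role', 'user',   'content', 'Summarize the following text: PostgreSQL is a powerful, open source object-relational database system.')
           )
       ) -> 'choices' -> 0 -> 'message' ->> 'content' AS summary;
```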
- We call the openai_chat_complete method to get chat completion results.
- The first parameter is the OpenAI model to use.
- The second parameter needs to be a JSON array (jsonb_build_array), containing the messages we want to send to the LLM. Typically, you want to have a message with role system, setting the main context for the model, and a user message, providing the actual user question or text.
- As the OpenAI API returns JSON, but we only need the content field, we use the excellent Postgres JSON field select operator (->) to select the property in question.
Overall, the example above invokes the API and gets the message content of the returned LLM answer - as text.
Creating summaries for all our texts is straightforward. Let's simply apply the newly learned material:
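Something along these lines - the same chat completion call as before, applied to every row:

```sql
-- Create a one-sentence summary for every row in the table
UPDATE user_content
SET summary = openai_chat_complete(
                  'gpt-4o',
                  jsonb_build_array(
                      jsonb_build_object('role', 'system', 'content', 'You are a helpful assistant that summarizes text in one sentence.'),
                      jsonb_build_object('role', 'user',   'content', 'Summarize the following text: ' || content)
                  )
              ) -> 'choices' -> 0 -> 'message' ->> 'content';
```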
Note the content
column in the second message object. We send the same system
prompt as in our example from before, but append the content of each of the
rows of our user_content
table to the user message of the API call.
After getting used to the JSON syntax, this is as simple as it gets. It's quite astonishing if you think about what we just did: we created text summaries for each and every one of our database rows, with just an SQL query!
Using pgai to moderate our user generated content
Finally, let's have a look at the moderation API. The function to use is called
openai_moderate
. It returns the following JSON object:
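Abbreviated to a few categories and with illustrative values, it looks roughly like this (the flagged/categories/category_scores fields sit inside the results array of the moderation response):

```json
{
  "id": "modr-abc123",
  "model": "text-moderation-007",
  "results": [
    {
      "flagged": false,
      "categories": {
        "hate": false,
        "harassment": false,
        "self-harm": false,
        "violence": false
      },
      "category_scores": {
        "hate": 0.0001,
        "harassment": 0.0003,
        "self-harm": 0.00001,
        "violence": 0.0002
      }
    }
  ]
}
```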
It contains the following information:
- flagged: A summarizing property, indicating if any one of the moderation categories was found to be violated by the text to moderate.
- categories: Detailed flags for each individual moderation category.
- category_scores: Numeric scores for each of the categories. This is quite handy if you don't want to rely on the automatically generated flags, but want more control over how seriously each category should be treated.
We can again use the handy Postgres JSON operators to get the field we are most interested in. Let's say we simply want to use the flagged field to decide whether content should be blocked; our UPDATE statement to moderate all our texts at once is as follows:
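A sketch, assuming openai_moderate(model, text) and OpenAI's text-moderation-stable model, and reading the flag from the first entry of the results array:

```sql
-- Block every row that the moderation endpoint flags
UPDATE user_content
SET blocked = (
        openai_moderate('text-moderation-stable', content)
            -> 'results' -> 0 ->> 'flagged'
    )::boolean;
```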
Interested in how to train your very own Large Language Model?
We prepared a well-researched guide on how to use the latest advancements in Open Source technology to fine-tune your own LLM. This has many advantages, such as:
- Cost control
- Data privacy
- Excellent performance - adjusted specifically for your intended use
Showcase: Automatically creating AI embeddings when inserting text
Now that we've got the basics covered, let's discuss a more complex - and even more useful - example: instead of manually running these statements, we can create insert triggers to automatically create embeddings (or summaries, or moderation results, or all three of them) directly when our data is inserted. Let's look at how this works.
Postgres triggers come in a variety of flavors. In short, triggers can run before or after an INSERT, UPDATE, DELETE or TRUNCATE statement, and they can be run for each row or for each statement.
Let's quickly consider our use-case: We want to create embeddings for each
new text that is inserted into our user_content
table. We can use a BEFORE INSERT
trigger to achieve this. We might also add a BEFORE UPDATE
trigger
to update the embeddings if the text is changed. As this is a row-wise operation,
we want a row-trigger (FOR EACH ROW
).
-
Create the function which is called by the trigger (first snippet below):
-
Create the trigger:
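One possible implementation of both steps - the function and trigger names are our own choice, and the model matches the earlier embedding examples:

```sql
-- Trigger function: compute (or refresh) the embedding for the incoming row
CREATE OR REPLACE FUNCTION embed_user_content() RETURNS trigger AS $$
BEGIN
    NEW.embedding := openai_embed('text-embedding-3-small', NEW.content);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```

And the trigger itself, covering both inserts and updates of the content column:

```sql
CREATE TRIGGER user_content_embed
    BEFORE INSERT OR UPDATE OF content ON user_content
    FOR EACH ROW
    EXECUTE FUNCTION embed_user_content();
```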
In conclusion, we created a trigger that executes each time a new row is inserted or an existing row is updated. The trigger calls a function which creates new text embeddings for the new or updated text.
Let's see if it works by inserting two new rows:
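For example (the sample rows are, again, purely illustrative):

```sql
INSERT INTO user_content (user_id, content) VALUES
    (4, 'The checkout page keeps crashing on my phone since the last update.'),
    (5, 'Customer support resolved my issue within minutes - great experience!');
```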
Let's check if this worked:
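For instance:

```sql
SELECT user_id,
       left(content, 40)     AS content,
       embedding IS NOT NULL AS has_embedding
FROM user_content
WHERE user_id IN (4, 5);
```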
And indeed, both of our newly created pieces of content have embeddings stored
in the corresponding embedding
column.
Note: While we think this is a highly compelling use case, one needs
to consider the performance implications of using triggers in such a way. As
the trigger will call the embedding API on each insert, insert latency will
by definition go up.
Please make sure to carefully evaluate your use case and opt for batch
processing, if triggers are not a valid option.
As a matter of fact, TimescaleDB provides a powerful automation framework
called User Defined Actions.
These might be used to schedule pgai calls to automatically generate your
AI outputs. If you don't run TimescaleDB, consider using pg_cron
.
That being said, many use-cases don't require sub-second insert latency,
and therefore benefit from this incredibly simple and maintainable solution.
Conclusion
In this post we demonstrated the delightful features of Timescale's new pgai extension. The extension seamlessly integrates with Postgres and enables many exciting, automatable use cases. Some of those we discussed in this post are:
- Automatically moderating user-generated content
- Automatically creating vector embeddings for our texts
- Automatically creating summaries, texts and labels
and many more.
One of the most convincing features of the extension is that it abstracts away all the complexities of the OpenAI API and allows you to interact with LLMs through simple and straightforward PostgreSQL functions.
What's also quite remarkable is the fact that pgai wonderfully aligns with the SQL syntax of Postgres. No matter whether it's updates, inserts or triggers - it feels natural to incorporate pgai in any query.
This not only makes working with AI much easier, but it potentially also allows database professionals to use AI productively for the first time - professionals who perhaps didn't want to fight with Python, separate runtimes, builds, etc. "just" for AI. And rightly so. With pgai, these previous obstacles are cleared away.