Using AI directly from your database - with PostgreSQL and pgai


Artificial Intelligence (AI) has become THE tool of choice for businesses looking to stay ahead of the curve. From personalized recommendations to predictive analytics, AI is revolutionizing the way we interact with technology. However, integrating AI into existing - or even new - applications can be a challenging task, especially for developers who mostly work in database environments.

That's where pgai comes in: the team at Timescale just released a new extension that lets you seamlessly incorporate AI capabilities directly into your PostgreSQL database, eliminating the need for separate AI infrastructure and simplifying your workflow.

Imagine being able to perform advanced text analysis, sentiment classification, and embedding creation within your PostgreSQL queries. pgai makes this a reality, empowering database administrators and developers to build intelligent, responsive applications without leaving the familiar confines of their database environment.

In this guide, we will introduce the extension and demonstrate how to use AI directly from a PostgreSQL database, using nothing but SQL. We'll cover how to install pgai, how to provide API keys and, ultimately, how to use the extension.

We'll also discuss some very interesting use cases and provide showcases. So, let's get started.

What is pgai?

pgai is an innovative extension for PostgreSQL designed to integrate artificial intelligence capabilities directly into the database environment. Developed by the folks at Timescale, pgai aims to streamline AI workflows and enhance the capabilities of PostgreSQL databases for AI-native applications.

Key Features of pgai

  • AI Workflow Integration: pgai includes helper functions that facilitate the execution and management of AI workflows within PostgreSQL. This integration allows users to use AI without needing to switch between different tools or platforms.

  • Dependencies and Requirements: The extension requires certain components to function, such as vector data types and the PL/Python procedural language (plpython3u), which enables Python functions to be executed within the PostgreSQL environment.

  • Enhanced Capabilities: By embedding AI directly into the database, pgai allows for more sophisticated data analysis and processing tasks. This can include anything from predictive analytics to complex data transformations, all managed within the familiar PostgreSQL framework.

Use Cases for AI in PostgreSQL with pgai

So, why would you want to use pgai? Or put differently - why would you want to integrate AI capabilities directly into your PostgreSQL database?

The answer is both simple and manifold. Let's take a look at just 5 of the many use cases enabled by such an integration.

Automatically Tagging and Categorizing Data

Consider a scenario where you have a dataset of user-generated content. You want to automatically tag and categorize this data based on its content.

LLMs like GPT-4 are incredibly good at doing just that - labelling data based on content.

Traditionally, you would most probably create a small, separate service which:

  1. Continuously fetches data from your database
  2. Sends these data to an LLM
  3. Receives the labels from the AI model
  4. Updates the corresponding rows in the database

Besides the overhead of setting up a separate service, you also have to provide a runtime environment, create secure access to the database, provide monitoring, logging, and so on.

How about doing all of this directly from within your database? With pgai, the scenario above is a single SQL query - see the sketch below. Even better, you can create an insert trigger to automatically tag and categorize new data as it is being inserted. No batch processing, no job scheduling, no separate runtime environment.
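To give you an idea, such a tagging query could look roughly like the following sketch. The user_content table, its tag column and the label set are made-up names for this scenario, and the openai_chat_complete function used here is covered in detail further below:

-- Sketch only: assign a single label to every not-yet-tagged row
UPDATE user_content
SET tag = openai_chat_complete
( 'gpt-4o'
, jsonb_build_array
  ( jsonb_build_object('role', 'system', 'content', 'Classify the following text into exactly one of: news, opinion, question, spam. Answer with the label only.')
  , jsonb_build_object('role', 'user', 'content', content)
  )
)->'choices'->0->'message'->>'content'
WHERE tag IS NULL;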

Content moderation

Similarly to the above, you can use AI to automatically moderate content. Consider an application with a comment section: manually moderating user comments is certainly not the AI way.

Again, we could set up a dedicated service just for that - or we simply use a single SQL query once more. For moderating content, pgai explicitly implements the openai_moderate function, interfacing with the well-known OpenAI content moderation API.

Automatically Summarizing Text

Summarizing text is yet another very common task, perfectly suited for AI. We can use the openai_chat_complete function, which provides the fully abstracted chat completion API from OpenAI. Once again, our summarization is a single SQL query away.

Creating Embeddings for Text

In addition to the chat completion and moderation APIs, pgai as of the time of this writing also provides an abstraction for the OpenAI embeddings API. Calling this method simply creates vector embeddings for any texts, either directly provided to the function call or stored in a column of a table.

Bulk-Updating Embeddings, Tags, or Categories

From my point of view, one of the best use cases for this kind of extension is the ability to integrate these AI function calls into the set-based nature of relational SQL databases.

What do I mean by that?

Assume you have a list of thousands of texts and you need to create summaries for all of them at once (because you didn't automatically create summaries when inserting the data). This is a rather annoying task: again, setting up a Python script, loading data, batching, updating, etc.

With pgai, we can update potentially ALL rows of a table with just a single statement. Just imagine the delight of creating summaries of thousands of texts with a single SQL query.

Note: Please keep in mind that these operations will take some time, so make sure to have your statement_timeout set accordingly.
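For example, raising the limit for the current session could look like this (the concrete value is of course up to you):

-- Raise the statement timeout for this session only, e.g. to two hours
SET statement_timeout = '2h';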

Another great example of this is bulk-updating or creating embeddings. With our product Pondhouse AI, we recently switched our embedding model to a more capable one. We had to manually re-index millions of rows of text. With pgai, this would have been much simpler: just a single UPDATE statement and we're done.

(By the way, read more on how we use Matryoshka Embeddings and Adaptive Retrieval to not only have very high retrieval accuracy but also very low query latency)
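To make that re-embedding scenario a bit more tangible, here is a rough sketch. The documents table, its columns and the model name are made up for illustration; if the new model's output dimension differs from the old one, the vector column has to be adjusted first:

-- Hypothetical documents(content TEXT, embedding VECTOR) table.
-- Clear the stale embeddings, then adjust the dimension if the new model differs
-- (1536 matches e.g. text-embedding-3-small's default output size).
UPDATE documents SET embedding = NULL;
ALTER TABLE documents ALTER COLUMN embedding TYPE VECTOR(1536);

-- Re-embed everything with the new model in a single statement
UPDATE documents
SET embedding = openai_embed('text-embedding-3-small', content);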

How to install pgai?

There are currently two supported ways of installing pgai:

  1. Using the pre-built TimescaleDB Docker image.

  2. Installing the extension from source.

This might answer one of your questions: Does one need to run TimescaleDB to run pgai? No, you don't. While it's easier to use the pre-built image, you can certainly install the extension on any vanilla Postgres server.

Installing from source

To install pgai from source:

  1. Clone the pgai GitHub repository
  2. Run make install

Using the pre-built TimescaleDB Docker image

Simply run:

docker run -d --name timescaledb -p 5432:5432 -e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg16

After installing the extension using either of the two options above, we can create the extension on our server:

  1. Connect to your server

    psql -d "postgres://<username>:<password>@<host>:<port>/<database-name>"
  2. Create the extension:

    CREATE EXTENSION IF NOT EXISTS ai CASCADE;

    Note: The CASCADE keyword will also install the required dependencies plpython3u and pgvector. A quick way to verify the installation is shown below.
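Listing the installed extensions should now show ai together with its dependencies:

SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('ai', 'vector', 'plpython3u');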

Providing an OpenAI API Key

Most functions of pgai require an OpenAI API key to interface with any of the OpenAI LLM models.

While there are several options for handling API keys with pgai, we found the most straightforward method to be a session-level parameter: simply set the API key once when creating the session and never worry about it again.

  1. Export your API key to your environment:

    export OPENAI_API_KEY="this-is-my-super-secret-api-key-dont-tell"
  2. When connecting to your database, use a session parameter as follows:

    PGOPTIONS="-c ai.openai_api_key=$OPENAI_API_KEY" psql -d "postgres://<username>:<password>@<host>:<port>/<database-name>"

From now on all functions of pgai will use this API key and therefore be authenticated against the OpenAI API.
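Alternatively, if you are already connected and don't want to reconnect with PGOPTIONS, setting the same session parameter from within SQL should work just as well. Keep in mind, though, that the key then appears in plain text in your SQL history:

-- Session-level alternative to PGOPTIONS (placeholder key shown)
SET ai.openai_api_key = 'this-is-my-super-secret-api-key-dont-tell';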

One word of caution: while we discussed the great potential of bulk-updating huge amounts of data, each individual row will potentially be an OpenAI API call - which can add up to quite a significant bill. Be careful.

Hands-On: Example queries with pgai

Ok, enough talking. Let's see some examples.

Before we get started, one little remark: while we'll be able to achieve all our use cases with a single SQL query (which, again, is such a remarkable thing - a single SQL query to create summaries of 1000s of rows of text), some of the queries themselves will look a little more complex, as we need to map our AI requests to a JSON API - which in itself requires some data meddling. But let's not get ahead of ourselves - we'll explain each step in detail.

First, let's create a demo table with some dummy data:

CREATE TABLE user_content (
  updated_at TIMESTAMP NOT NULL,
  user_id INT NOT NULL,
  content TEXT NOT NULL,
  embedding VECTOR(1536),
  summary TEXT,
  blocked BOOLEAN DEFAULT FALSE
);

INSERT INTO user_content (updated_at, user_id, content)
VALUES
  ('2024-06-14 10:00:00', 1, 'The French Revolution, which began in 1789 and lasted until 1799, was a period of profound political and social upheaval in France that fundamentally transformed the nation and had far-reaching impacts across the globe'),
  ('2024-06-14 11:00:00', 2, 'Triggered by widespread discontent with the monarchy of Louis XVI, economic hardship, and the influence of Enlightenment ideas, the revolution sought to dismantle the old feudal structures and create a society based on principles of liberty, equality, and fraternity'),
  ('2024-06-14 12:00:00', 3, 'Key events included the storming of the Bastille, the Reign of Terror led by figures such as Robespierre, and the eventual rise of Napoleon Bonaparte');

Arguably, this data is quite lazy - but it should do the trick. The queries above create a table user_content, potentially holding user-generated entries. We have columns for the updated_at timestamp, the user_id of the author, the content of the entry, an embedding column for storing the vector embeddings of the text, a summary column for storing summaries and finally a blocked column, indicating whether the content is blocked or not - based on content moderation.

Creating Embeddings using pgai

Now that we have a nice and tidy demonstration table, let's create embeddings for each of the texts.

UPDATE user_content
SET embedding = openai_embed('text-embedding-ada-002', content);

You might ask whether we forgot something, but - dear reader - we didn't. This is indeed all one needs to create embeddings for ALL the rows in our table.

Let's check the results:

SELECT * FROM user_content;

And indeed, all the rows are now equipped with embeddings.

As you might note, it's just simple, plain old SQL. You could add WHERE clauses, ON CONFLICT statements - basically the full range of PostgreSQL.
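For example, to only touch rows that don't have an embedding yet - handy when re-running the statement after new data has arrived - a simple WHERE clause is all it takes:

UPDATE user_content
SET embedding = openai_embed('text-embedding-ada-002', content)
WHERE embedding IS NULL;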

Invoking the text completion API using pgai to create summaries of data

Next, let's see how to invoke the text completion API to generate summaries for our texts.

The overall process is similar to creating embeddings, but we need to do some JSON manipulation to get the data into the right format.

The OpenAI Chat Completions API requires us to provide a JSON object containing the messages to be sent to the model. Furthermore, the results of the API are also returned as a JSON object.

To familiarize ourselves with what we need to pass to the function call, let's first look at how one uses the API with Python (as we assume that most of us are quite familiar with the Python API invocation):

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

As you can see, we need to provide a list of messages, each containing a role and the content of the message.

The result object subsequently looks as follows:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The 2020 World Series was played in Texas at Globe Life Field in Arlington.",
        "role": "assistant"
      },
      "logprobs": null
    }
  ],
  "created": 1677664795,
  "id": "chatcmpl-7QyqpwdfhqwajicIEznoc6Q47XAyW",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": 57,
    "total_tokens": 74
  }
}

The most relevant part for us is the content field of the message object.

Knowing how to invoke the API and what we get back, we can create the SQL equivalent:

select openai_chat_complete
( 'gpt-4o'
, jsonb_build_array
  ( jsonb_build_object('role', 'system', 'content', 'You are tasked to summarize texts. Tone should be educational and neutral.')
  , jsonb_build_object('role', 'user', 'content', 'Text to summarize: ')
  )
)->'choices'->0->'message'->>'content'
;
  1. We call the openai_chat_complete method to get chat completion results
  2. The first parameter is the OpenAI model to use
  3. The second parameter needs to be a JSON array (jsonb_build_array), containing the messages we want to send to the LLM. Typically, you want to have a message with role system, setting the main context for the model and a user message, providing the actual user question or text.
  4. As the OpenAI API returns JSON, but we only need the content field, we use the excellent Postgres JSON field select operators (-> and ->>) to select the property in question.

Overall, the example above invokes the API and gets the message content of the returned LLM answer - as text.

Creating summaries for all our texts is straightforward. Let's simply apply what we've just learned:

UPDATE user_content
SET summary = openai_chat_complete
( 'gpt-4o'
, jsonb_build_array
  ( jsonb_build_object('role', 'system', 'content', 'You are tasked to summarize texts. Tone should be educational and neutral.')
  , jsonb_build_object('role', 'user', 'content', 'Text to summarize: ' || content)
  )
)->'choices'->0->'message'->>'content';

Note the content column in the second message object. We send the same system prompt as in our example from before, but append the content of each of the rows of our user_content table to the user message of the API call.

After getting used to the JSON syntax, this is as simple as it gets. It's quite astonishing if you think about what we just did: we created text summaries for each and every one of our database rows, with just an SQL query!

Using pgai to moderate our user generated content

Finally, let's have a look at the moderation API. The function to use is called openai_moderate. It returns the following JSON object:

{
  "id": "modr-9RsN6qZWoZYm1AK4mtrKuEjfOcMWp",
  "model": "text-moderation-007",
  "results": [
    {
      "flagged": true,
      "categories": {
        "hate": false,
        "sexual": false,
        "violence": true,
        "self-harm": false,
        "self_harm": false,
        "harassment": true,
        "sexual/minors": false,
        "sexual_minors": false,
        "hate/threatening": false,
        "hate_threatening": false,
        "self-harm/intent": false,
        "self_harm_intent": false,
        "violence/graphic": false,
        "violence_graphic": false,
        "harassment/threatening": true,
        "harassment_threatening": true,
        "self-harm/instructions": false,
        "self_harm_instructions": false
      },
      "category_scores": {
        "hate": 0.2324090600013733,
        "sexual": 0.00001205232911161147,
        "violence": 0.997192919254303,
        "self-harm": 0.0000023696395601291442,
        "self_harm": 0.0000023696395601291442,
        "harassment": 0.5278584957122803,
        "sexual/minors": 0.00000007506431387582779,
        "sexual_minors": 0.00000007506431387582779,
        "hate/threatening": 0.024183575063943863,
        "hate_threatening": 0.024183575063943863,
        "self-harm/intent": 0.0000017161115692942985,
        "self_harm_intent": 0.0000017161115692942985,
        "violence/graphic": 0.00003399916022317484,
        "violence_graphic": 0.00003399916022317484,
        "harassment/threatening": 0.5712487697601318,
        "harassment_threatening": 0.5712487697601318,
        "self-harm/instructions": 0.000000001132860139030356,
        "self_harm_instructions": 0.000000001132860139030356
      }
    }
  ]
}

It contains the following information:

  • flagged: A summarizing property, indicating if any one of the moderation categories was found to be violated by the text to moderate.
  • categories: Detailed flags for each individual moderation category.
  • category_scores: Numeric score for each of the categories. This is quite handy if you don't want to rely on the automatically generated flags, but want more control over which category should be treated how seriously.

We can again use the handy Postgres JSON operators to get the field we are most interested in. Let's say we simply want to use the flagged field to decide whether content should be blocked. Our UPDATE statement to moderate all our texts at once then looks as follows:

UPDATE user_content
SET blocked = (openai_moderate
  ( 'text-moderation-stable'
  , content
  )->'results'->0->>'flagged')::BOOLEAN;
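If you want more fine-grained control than the aggregated flagged field, the category_scores can be used just as easily. As a sketch, the following blocks content solely based on the harassment score, with a 0.8 threshold that is purely our own assumption:

-- Sketch: block content only when the harassment score exceeds our own threshold
UPDATE user_content
SET blocked = (openai_moderate
  ( 'text-moderation-stable'
  , content
  )->'results'->0->'category_scores'->>'harassment')::FLOAT > 0.8;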


Showcase: Automatically creating AI embeddings when inserting text

Now that we have the basics covered, let's discuss a more complex - and even more useful - example: instead of manually running these statements, we can create insert triggers to automatically create embeddings (or summaries, or moderation results, or all three of them) directly when our data is inserted. Let's look at how this works.

Postgres triggers come in a variety of flavors. In short, triggers can run before or after an INSERT, UPDATE, DELETE or TRUNCATE statement, and they can be run for each row or for each statement.

Let's quickly consider our use-case: We want to create embeddings for each new text that is inserted into our user_content table. We can use a BEFORE INSERT trigger to achieve this. We might also add a BEFORE UPDATE trigger to update the embeddings if the text is changed. As this is a row-wise operation, we want a row-trigger (FOR EACH ROW).

  1. Create the function which is called by the trigger:

    CREATE OR REPLACE FUNCTION create_embedding()
    RETURNS TRIGGER AS $$
    BEGIN
      NEW.embedding = openai_embed('text-embedding-ada-002', NEW.content);
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
  2. Create the trigger:

    CREATE TRIGGER create_embedding_trigger
    BEFORE INSERT OR UPDATE ON user_content
    FOR EACH ROW
    EXECUTE FUNCTION create_embedding();

In conclusion, we created a trigger that executes each time a new row is inserted or an existing row is updated. The trigger calls a function which creates new text embeddings for the new or updated text.

Let's see if it works by inserting 2 new rows:

INSERT INTO user_content (updated_at, user_id, content)
VALUES
  ('2024-06-14 13:00:00', 4, 'The quick brown fox jumps over the lazy dog'),
  ('2024-06-14 14:00:00', 5, 'The five boxing wizards jump quickly');

Let's check if this worked:

select content, embedding from user_content where user_id in (4, 5);

And indeed, both of our newly created pieces of content have embeddings stored in the corresponding embedding column.

Note: While we think this is a highly compelling use case, one needs to consider the performance implications of using triggers in such a way. As the trigger will call the embedding API on each insert, insert latency will by definition go up. Please make sure to carefully evaluate your use case and opt for batch processing if triggers are not a valid option. As a matter of fact, TimescaleDB provides a powerful automation framework called User Defined Actions, which can be used to schedule pgai calls to automatically generate your AI outputs. If you don't run TimescaleDB, consider using pg_cron (a rough sketch of this approach follows below). That being said, many use cases don't require sub-second insert latency, and therefore benefit from this incredibly simple and maintainable solution.
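As a sketch of what such a scheduled batch job could look like with pg_cron - the job name, schedule and WHERE condition are assumptions for illustration, not part of pgai itself:

-- Requires the pg_cron extension. Every 5 minutes, embed only the rows
-- that don't have an embedding yet.
SELECT cron.schedule(
  'embed-new-user-content',
  '*/5 * * * *',
  $$
    UPDATE user_content
    SET embedding = openai_embed('text-embedding-ada-002', content)
    WHERE embedding IS NULL
  $$
);

-- Note: the job runs in its own background session, so the OpenAI API key
-- has to be available there too, e.g. via
-- ALTER DATABASE <database-name> SET ai.openai_api_key = '<your-key>';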

Conclusion

In this post, we demonstrated the delightful features of Timescale's new pgai extension. The extension seamlessly integrates with Postgres and enables many exciting, automatable use cases. Some of those we discussed in this post are:

  • Automatically moderating user-generated content
  • Automatically creating vector embeddings for our texts
  • Automatically creating summaries, texts and labels

and many more.

One of the most convincing features of the extension is that it abstracts away all the complexities of the OpenAI API and allows us to interact with LLMs through simple and straightforward PostgreSQL functions.

What's also quite remarkable is the fact that pgai wonderfully aligns with the SQL syntax of Postgres. No matter whether it's updates, inserts or triggers - it feels natural to incorporate pgai in any query.

This not only makes working with AI much easier, but it also potentially allows database professionals to use AI productively for the first time - professionals who understandably didn't want to fight with Python, separate runtimes, builds, etc. "just" for AI. With pgai, these previous obstacles are cleared away.

Further Reading

  • More information on our managed RAG solution? To Pondhouse AI
  • More tips and tricks on how to work with AI? To our Blog