How to classify, describe and analyze images using GPT-4o vision


GPT-4o is - as of the time of this writing - one of the best multimodal models on the market.

Multimodal in this context means it can work with multiple different input formats: text, images, and soon also audio.

In this post we want to explore the vision part of GPT-4o. More specifically, we'll:

  • Look at how to actually use images with the OpenAI API and GPT-4o
  • How to label and categorize images, without first needing to train an image labeling model
  • How to simply describe an image
  • and finally, how to do some open-ended image analysis - with a little data analytics sprinkled on top of it

How to use the GPT-4o vision capabilities

Using GPT-4o's vision capabilities is quite straightforward - arguably very similar to using its well-known chat completion capabilities.

Similar to chat completion, vision requires us to create a messages array, containing an optional system message as well as a user message. Furthermore, you can also add assistant messages as part of the chat history or as part of your few-shot prompting strategy. The content field of the user message can then contain not only text, but also image data.

The image needs to be provided either as a downloadable URL or as base64-encoded image data. Let's look at an example in Python:

Using GPT-4o vision with image url

First, let's look at how to create our prompt message array using an image link. Just upload the image to a public web host and add its URL as shown below.

messages=[
    {
        "role": "system",
        "content": "You are an image classification system. Classify the following images to be either 'cat' or 'dog', in the following format {'type': 'cat/dog'}"
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Classify this image."},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/9/99/Brooks_Chase_Ranger_of_Jolly_Dogs_Jack_Russell.jpg",
                },
            },
        ],
    }
]

Using GPT-4o vision with base64-encoded image data

If your image resides on your own host, read the image file and encode it in base64 format. Again, use the image_url field, but this time with the base64-encoded image data.

Note: Make sure to change the jpeg media type in the example below to your actual file type.

import base64

with open(my_image, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

messages=[
    {
        "role": "system",
        "content": "You are an image classification system. Classify the following images to be either 'cat' or 'dog', in the following format {'type': 'cat/dog'}"
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Classify this image."},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                },
            },
        ],
    }
]
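
If you don't want to hard-code the media type, you can derive it from the file extension. The following is a small convenience sketch of our own (the helper name image_to_data_url is not part of the OpenAI SDK), using Python's standard mimetypes module:

import base64
import mimetypes

def image_to_data_url(path: str) -> str:
    # Guess the media type from the file extension (e.g. image/png, image/jpeg)
    media_type, _ = mimetypes.guess_type(path)
    if media_type is None:
        media_type = "image/jpeg"  # fall back to jpeg if the type cannot be guessed
    with open(path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{media_type};base64,{encoded}"

# Plug the result into the image_url field shown above:
# {"type": "image_url", "image_url": {"url": image_to_data_url("my_image.png")}}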

Using GPT-4o vision with multiple images

Just as a side note: GPT-4o can work not only with one image per prompt, but with multiple. Simply add an additional object of type image_url to your user message's content list. You can even mix base64- and URL-type images.

Note: Make sure to have a recent version of the OpenAI python package to follow this example. Version 1.40.0 or above is sufficient.

with open(my_image, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

messages=[
    {
        "role": "system",
        "content": "You are an image classification system. Classify the following images to be either 'cat' or 'dog', in the following format {'type': 'cat/dog'}"
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Classify these images."},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                },
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/9/99/Brooks_Chase_Ranger_of_Jolly_Dogs_Jack_Russell.jpg"
                },
            },
        ],
    }
]

After that, it's as simple as calling the model through the OpenAI SDK or the OpenAI API. Let's finish our simple Python example:

from openai import OpenAI

client = OpenAI(api_key="your api key")
# messages = ... see the above examples ...
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=messages,
    max_tokens=300,
)
print(response.choices[0].message.content)
# {'type': 'dog'}

As simple as that:

  1. Create the messages-array as shown above
  2. Use the OpenAI API or SDKs to send the image to GPT-4o - combined with a prompt of your choosing.
  3. Get LLM-created results back - according to your prompt

About the image format

In general, you can put any image with any resolution and any format into your prompt - and GPT-4o (as well as the image pipeline in front of GPT-4o) will make the best out of it. However, for cost and latency management purposes, OpenAI provides two different image detail modes - or actually three:

  • low: The low-res mode. The image will automatically be downsampled to 512x512 pixels and costs the equivalent of 85 tokens. This mode is fast and quite cost efficient.
  • high: High-resolution mode will also first downsample the image to 512x512 pixels and show the LLM this low-res version (so that it can get an overview of the image). Then, it additionally gets detailed crops (cut-outs) of the original image - with 512x512 pixels each. Each of these crops costs 170 tokens. Therefore, this mode takes significantly longer than the low mode.
  • auto: The default. This mode will look at the size of the original image and, depending on its dimensions, will use either high or low.

Note: If you need a very detailed analysis of your images, you might opt for the high-res option. However, keep in mind that the details are fed to the LLM as 512x512 pixel crops of your original image. So if the details you care about are larger than 512x512 pixels, they will not be presented as a whole. In such cases, make sure to either manually downsample the image so that your details are 'small enough', or simply use the low-res version.
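
To get a rough feeling for how the high-resolution mode adds up, here is a back-of-the-envelope sketch of the token cost, based on the 85-token base image plus 170 tokens per 512x512 crop described above and OpenAI's documented resizing rules (fit into 2048x2048, then scale the shorter side down to 768 pixels). Treat it as an estimate, not as the billing source of truth:

import math

def estimate_image_tokens(width: int, height: int, detail: str = "low") -> int:
    # Rough token estimate for a single image prompt
    if detail == "low":
        return 85  # flat cost for the downsampled version
    # high detail: fit the image into 2048x2048 ...
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # ... then scale the shorter side down to 768 pixels ...
    scale = min(1.0, 768 / min(width, height))
    width, height = int(width * scale), int(height * scale)
    # ... and charge 170 tokens per 512x512 crop, plus 85 tokens for the low-res overview
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(estimate_image_tokens(1920, 1080, detail="high"))  # 1105 tokens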

To set the appropriate mode, add an additional detail key to the image_url object in your messages array:

1"image_url": {
2 "url": "your-url-here",
3 "detail": "high"
4},

Image classification with GPT-4o

Image classification used to be quite a time-consuming process: First, you needed to select an image classification model, then you needed to create a dataset of at least 100 or so training images, containing the exact image categories you wanted to classify later on. Then you had to decide on a training strategy, execute the training, validate the results, host your model - and then, only then, could you use the image classifier.

This whole process is more or less obsolete with models like GPT-4o. These models are incredibly good at zero-shot classification - meaning you don't need to train them for a specific task; they can predict classes they never saw during training.

For image classification, this means we can simply send an image to the model, ask it to classify the image based on some categories we provide, and we'll most likely get a very good classification result - without ever needing to worry about model training.

We found that this works best if you briefly describe the classification task in the system message and also add the pool of categories you want to classify to the system message.

In the user message, simply add the image.

Note: There are some use cases where you still might want to create your own image classifier - namely when you want to classify very specific niches. E.g. if you want to classify different types of mushrooms, you are better off creating your own, specialized model. GPT-4o and the like are able to correctly identify more generalized things, like different animal types (rather than, e.g., breeds within one animal class).

Example time:

from openai import OpenAI

client = OpenAI(api_key="your api key")
messages=[
    {
        "role": "system",
        "content": """You are an image classification system, asked to classify animals.
        Classify the following images to be one of the following categories: ['dog', 'cat', 'cow', 'kangaroo', 'other'].
        In the following format {'type': '<type of animal>'}"""
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Classify this image."},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/9/99/Brooks_Chase_Ranger_of_Jolly_Dogs_Jack_Russell.jpg",
                },
            },
        ],
    }
]
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=messages,
    max_tokens=300,
)
print(response.choices[0].message.content)

Describing images using GPT-4o vision

Another very common task is to use the LLM to describe an image. The most important aspect here is to be precise about which aspects of the image you want described - as images can be described from many different angles. Again, it's best to use the system message to define what exactly you want to have described.

A practical example for describing images is to store them in a database and use the descriptions to search for them later.

from openai import OpenAI

client = OpenAI(api_key="your api key")
messages=[
    {
        "role": "system",
        "content": """You are a system to describe images. The images should be stored in a database and your descriptions are used to search and find these images.
        Therefore, describe the main content of the image which might be relevant for search/find operations.
        Just provide the description of the image, no other text.
        Example: House with beautiful garden in small town, sunny weather, crowd of people in front.
        """
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/9/99/Brooks_Chase_Ranger_of_Jolly_Dogs_Jack_Russell.jpg",
                },
            },
        ],
    }
]
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=messages,
    max_tokens=300,
)
print(response.choices[0].message.content)

The example output was:

A small, alert dog standing on green grass, with a predominantly white coat
and a brown patch on its head and a black spot on its side. The dog has a short
tail and a slightly raised head, looking toward the distance, with a
background of stacked wooden logs.

The incredible thing here is: If you find this description too detailed, or not detailed enough, or you want a different style of description, you can simply change the system prompt. That's the really powerful aspect here - you have a full-blown LLM with all its prompt-understanding capabilities. On top of that, it can now also read and understand images.

You might even use few-shot prompting or similar prompting techniques, as sketched below.
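
As a hedged sketch of what such a few-shot setup could look like: we show the model one example image together with the description style we expect (as an assistant message) before sending the image we actually want described. The URLs below are placeholders for your own images:

messages = [
    {
        "role": "system",
        "content": "You are a system to describe images for a searchable image database. "
                   "Describe only the main, search-relevant content of the image in one short sentence.",
    },
    # Few-shot example: an image together with the kind of answer we expect for it
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/example-house.jpg"}},
        ],
    },
    {
        "role": "assistant",
        "content": "House with beautiful garden in small town, sunny weather, crowd of people in front.",
    },
    # The image we actually want described, in the same style as the example above
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/new-image.jpg"}},
        ],
    },
]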

Light data analytics using GPT-4o vision

Ok, we've seen that we can use GPT-4o to understand images, create descriptions and labels, and that we can use simple prompting techniques to define how the response should look.

Now let's try something more elaborate: Many thousands of data analysts have created beautiful dashboards with potentially very helpful insights. However, time and time again it has been shown that not a single soul is looking at these poor, beautiful dashboards. (I can make fun of that fact as I was one of these data analysts...)

Now, as we have to acknowledge that most people are apparently too busy to look at these analytics results, what if we make AI look at our dashboards? (And yes, I'm aware that this all sounds a little desperate :-) ).

The strategy is simple: We create a screenshot of one of our dashboards, send it to GPT-4o to analyze, and ask specific questions about it. One real-world example could be to create a sort of open-ended alerting system: the LLM analyzes a dashboard with data over time and notifies you only when something out of the ordinary happens (we'll sketch this variant after the walkthrough below).

In our case, let's see whether GPT-4o makes a good dashboard analyst by asking for a written summary of a specific dashboard.

Dashboard we want to analyze

  1. Install the pytest-playwright package. It's Playwright for Python. (Playwright allows you to navigate to web pages and, among other things, create screenshots of them. After installing, run playwright install once to download the browser binaries.)

  2. Automatically create a screenshot of your dashboard.

    In this example we assume the dashboard is openly available. If your dashboard needs authentication, have a look at the playwright documentation.

    import asyncio
    import base64

    from playwright.async_api import async_playwright
    from openai import OpenAI

    client = OpenAI(api_key="your api key")

    # Wrap the async Playwright calls in a coroutine so we can run them with asyncio.run
    async def take_dashboard_screenshot():
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            await page.goto('http://playwright.dev')  # replace with the URL of your dashboard
            await page.screenshot(path='dashboard_gpt4o.png')
            await browser.close()

    asyncio.run(take_dashboard_screenshot())
  3. Use the screenshot for GPT-4o vision inference:

    # Read the screenshot created in the previous step
    with open("dashboard_gpt4o.png", "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode('utf-8')

    messages=[
        {
            "role": "system",
            "content": """You are a data analyst. You get an image of a dashboard.
            Create a written report based on this dashboard.
            The report should be short and concise and should contain the most important metrics.
            The report is aimed at high-level management, so make sure it only covers the most prevalent facts.
            Focus on actionable insights.
            """
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Create a written summary of this dashboard."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                        "detail": "high"
                    },
                },
            ],
        }
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=1024,
    )
    print(response.choices[0].message.content)

This was the response from the LLM:

### Dashboard Summary Report

#### Referring Domains (Last 14 Days)

- **Top Referring Domain:** www.google.com is the leading referrer with approximately 1,100+ referrals.
- **Direct Visits:** The second highest source, with around 400 direct visits.
- **Other Key Referrers:** Medium.com, linkedin.com, and pondhouse-data.com also contribute notably to the referral traffic, with each driving between 100 and 200 referrals.

#### Pageview Funnel (Last 7 Days)

- **First Page View:** 1,149 users (100%) completed the first page view.
- **Second Page View Drop-off:** 5.3% or 61 users continued to the second page view, indicating a 94.69% drop-off (1,088 users).
- **Third Page View Drop-off:** From the second page, only 1.9% or 22 users progressed to the third page view, with another substantial drop-off of 63.93% (39 users).

#### Pageview Count by Event's Path Name (Last 30 Days)

- **Daily Variation:** Pageview counts exhibit significant daily fluctuations, generally ranging between 50 to 300 pageviews.
- **Peak Traffic Day:** Notably, July 18, 2024, saw a peak in pageviews, surpassing 400 views, indicating a potential event or campaign spike.

### Actionable Insights

1. **Enhance Engagement Post First Page:** The significant drop-offs after the first page view indicate potential issues in user journey or content engagement beyond the first page. Further analysis and improvement of the user experience after the first page could help retain users.
2. **Focus on Key Referrers:** Investment in strengthening relationships, collaborations or advertising with top referrers, particularly Google, can help maintain or enhance traffic.
3. **Investigate High-traffic Days:** The spike in traffic on July 18, 2024, warrants closer inspection to understand what drove the increase. Replicating successful strategies from that day could boost future traffic.
4. **Continual Monitoring and Testing:** Regular monitoring is essential to understand trends better and identify smaller opportunities and issues in user traffic and engagement patterns.

Wow, this is actually quite amazing - considering that we simply threw a dashboard at the model and asked it to write a summary! I'm especially amazed at how well the model was able to read the charts and find the correct dates (e.g. the spike of 400 pageviews on July 18th).
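
Coming back to the open-ended alerting idea mentioned earlier: a minimal sketch of such a prompt could look like the following, reusing the client and the base64_image screenshot from the steps above. Note that the JSON 'alert'/'reason' format is our own convention, not something prescribed by the API:

# Sketch of an "alert only if something unusual happens" prompt
alerting_messages = [
    {
        "role": "system",
        "content": """You are a monitoring assistant. You get a screenshot of an analytics dashboard
        with data over time. Compare the shown metrics against the usual ranges and trends visible in the charts.
        Respond with JSON in the format {"alert": true/false, "reason": "<short explanation if alert is true>"}.
        Only set "alert" to true if something is clearly out of the ordinary."""
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this dashboard."},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{base64_image}",
                    "detail": "high"
                },
            },
        ],
    }
]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=alerting_messages,
    max_tokens=300,
)
print(response.choices[0].message.content)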

Costs and pricing of GPT-4o vision for image prompts

That's all nice and good, but how much does it cost? Well, it's quite cheap, we'd say. The following prices are for the newly released GPT-4o variant gpt-4o-2024-08-06:

  • Any image in low-resolution mode: 0.02 cents ($0.000213), or about $2 per 10,000 images.
  • A 1920 x 1080 image in high-resolution mode: 0.28 cents ($0.002763), or about $2.76 per 1,000 images.

Important note: This newly released gpt-4o-2024-08-06 model is the cheapest model of the GPT-4o series in terms of vision capabilities. Even the otherwise quite cost-efficient gpt-4o-mini model is approx. twice as expensive for images - with worse output quality. So stick with this model, at least for now.
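
For transparency, here is roughly how these numbers come together, assuming the launch price of $2.50 per million input tokens for gpt-4o-2024-08-06 (double-check the current price list before relying on it) and the token estimate from the sketch earlier in this post:

PRICE_PER_INPUT_TOKEN = 2.50 / 1_000_000  # assumed gpt-4o-2024-08-06 input pricing, in USD

low_res_tokens = 85      # flat cost in low-resolution mode
high_res_tokens = 1105   # estimate for a 1920x1080 image in high-resolution mode (see above)

low_res_cost = low_res_tokens * PRICE_PER_INPUT_TOKEN    # ~$0.000213 per image
high_res_cost = high_res_tokens * PRICE_PER_INPUT_TOKEN  # ~$0.002763 per image

print(f"low:  ${low_res_cost:.6f} per image, ${low_res_cost * 10_000:.2f} per 10,000 images")
print(f"high: ${high_res_cost:.6f} per image, ${high_res_cost * 1_000:.2f} per 1,000 images")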

Conclusion

In conclusion, the vision capabilities of GPT-4o open up an array of possibilities for image classification, description, and analysis. They eliminate the need for specialized training and complex setups, allowing users to combine AI and image data more easily and economically.

By integrating GPT-4o into your workflow, you can save time, reduce costs, and achieve high-quality outputs that were previously time-consuming and resource-intensive to produce. The versatility of GPT-4o, from handling single images to analyzing quite complex dashboards, demonstrates its potential to change how we interact with and interpret visual data. Especially this last part - analyzing visual dashboard data - is very intriguing.

With its user-friendly implementation and cost-effective pricing, GPT-4o is accessible to a broad range of users, from developers to data analysts. This guide has provided you with the foundational knowledge to start utilizing GPT-4o's vision capabilities. We encourage you to experiment with different prompts and applications to get the most out of it for your use case.

Further reading

------------------

Interested in how to train your very own Large Language Model?

We prepared a well-researched guide on how to use the latest advancements in open source technology to fine-tune your own LLM. This has many advantages, like:

  • Cost control
  • Data privacy
  • Excellent performance - adjusted specifically for your intended use