Building Better AI Applications with LLM Tracing using Opik

Introduction
If you're building AI applications with Large Language Models, you've likely encountered some common frustrations: unexpected outputs, high API costs, and the challenge of figuring out why your application isn't performing as expected. These issues become even more complex when your application moves towards agentic workflows, where the AI is expected to interact with other AI systems or humans in a more dynamic and context-aware manner. One such more complex example is demonstrated in our blog series about building advanced agents from scratch.
Opik is an open-source debugging tool created by Comet that helps developers tackle these practical challenges. At its core, it's a monitoring and evaluation platform that gives you visibility into how your LLM applications actually behave, both during development and in production.
Opik Dashboard
Common Development Challenges
When building LLM applications, developers typically face these issues:
- Not knowing why certain prompts fail while others succeed
- Difficulty tracking token usage and associated costs
- Uncertain whether retrieved context (in RAG systems) is actually relevant
- No clear way to measure application performance over time
- Limited visibility into production behavior
How Opik Helps
Opik provides practical solutions through three main features:
- Tracing: Track every interaction with your LLM, including prompts, responses, and metadata
- Evaluation: Test your application systematically with automated checks
- Monitoring: Watch your application's behavior in production through easy-to-read dashboards
The tool works with common frameworks like LangChain, OpenAI, and LlamaIndex, integrating directly into your existing development workflow.
In this guide, we'll walk through the concrete steps to use Opik for debugging and improving your LLM applications, with real examples you can implement today.
What is LLM Tracing?
Before we get our hands dirty, let's clarify what we mean by "tracing" in the context of LLM applications, as this is a heavily-used term in the industry. LLM tracing is essentially a way to record and analyze every interaction between your application and a language model. Think of it like logging for traditional applications, but specifically designed for LLM-based systems.
When your AI application isn't working as expected, you need to know:
- What prompt was sent to the model
- What response came back
- What context or history influenced the interaction
- How long the request took
- How many tokens were used (and the associated cost)
Without tracing, debugging these issues becomes guesswork. A trace - a log of each interaction - might look as follows:
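```python
# A purely illustrative trace record - real field names and values depend on the tool.
trace = {
    "input": "How do I reset my password?",
    "output": "Click 'Forgot password' on the login page and follow the email link.",
    "model": "gpt-4o-mini",
    "usage": {"prompt_tokens": 312, "completion_tokens": 58},
    "estimated_cost_usd": 0.0001,
    "latency_ms": 1240,
    "metadata": {"retrieved_documents": 3},
}
```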
So, in summary, LLM tracing is nothing more than fancy wording for good old application logging.
Setting up Opik for LLM Tracing
Okay, let's dive into the practical steps to set up Opik for your LLM applications. There are two ways to use Opik:
- Use the fully managed Comet LLM evaluation platform. It's a fully fledged platform for all your LLM logging and observability needs, and it provides a managed version of Opik.
- Use the open-source, self-hosted, Apache-2.0-licensed Opik project. This is the version we'll focus on in this guide.
Installing Opik
There are basically two ways to install Opik: using Docker Compose or Kubernetes.
Make sure you have Docker installed. If you want to use Kubernetes, you'll need a running Kubernetes cluster as well as Helm and kubectl.
Running Opik using Docker Compose
- Get yourself the Opik Docker Compose file. The easiest way is to clone their repo:
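```bash
# Clone the Opik repository (hosted at github.com/comet-ml/opik at the time of writing)
git clone https://github.com/comet-ml/opik.git
```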
- Change into the opik directory and start the services:
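```bash
cd opik
# Start all services in the background. The compose file lives in a subfolder of the
# repository; the exact path may differ between Opik versions, so check the repo.
docker compose -f deployment/docker-compose/docker-compose.yaml up --detach
```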
Your application should now be running on http://localhost:5173. Either use the tool on the same server as your application or expose the port to the outside world using a web server like nginx or Caddy.
Note: You might want to have a look at the docker-compose.yml file to adjust the configuration to your needs. In particular, take note of the volume mounts for ClickHouse, MySQL and Redis. These are Docker volumes; depending on your needs, you might want to change them to folder mounts.
Running Opik using Kubernetes
- Get yourself the Opik Helm chart and install it as follows:
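```bash
# Add the Opik Helm repository and install the chart into its own namespace.
# Repository URL, chart name and defaults are those documented at the time of
# writing - double-check the Opik docs if the install fails.
helm repo add opik https://comet-ml.github.io/opik/
helm repo update
helm upgrade --install opik opik/opik --namespace opik --create-namespace
```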
Your application will provide a service svc/opik-frontend on port 5173. If you need to forward the service to your local machine, you can use the following command:
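```bash
# Forward the frontend service to your local machine (adjust the namespace and
# service name if your Helm release uses different ones).
kubectl port-forward --namespace opik svc/opik-frontend 5173:5173
```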
Note: No matter which method you chose, Opik itself does not provide any means of authentication - the application is freely accessible. Make sure to add some form of authentication provider in front of the application, or run it in a secured network.
If everything went well, you should now be able to access the Opik frontend on http://localhost:5173 and be greeted by this screen:
Opik welcome screen
Now that we have the platform up and running, we need to install the Opik Python SDK to start tracing our LLM applications (the SDK needs to be installed in the context of the application you want to trace).
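The SDK is available on PyPI:

```bash
pip install opik
```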
Using Opik in Development
Enough preparation, let's see how we can use Opik in our development workflow to trace our LLM applications. Start by initializing the Opik SDK on your development machine:
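```python
import opik

# Point the SDK at the self-hosted deployment instead of the Comet cloud.
# Alternatively, run the interactive `opik configure` CLI command.
opik.configure(use_local=True)
```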
Tracing your LLM application with Opik
The simplest use case for Opik is tracing your LLM application: we want to see every interaction with an LLM, including the prompt, the response, and any metadata associated with the interaction.
For this guide we'll assume we use an OpenAI LLM. Adding Opik tracing in this case is as simple as wrapping the OpenAI client as follows:
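```python
from openai import OpenAI
from opik.integrations.openai import track_openai

# Wrap the regular OpenAI client - every call made through the wrapped
# client is logged as a trace in Opik.
openai_client = track_openai(OpenAI())
```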
That's it. From now on, every call to the OpenAI client will be traced by Opik. If we, for example, run this code:
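```python
# Model and prompt are just examples - any chat completion call works.
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Explain LLM tracing in one sentence."}
    ],
)
print(response.choices[0].message.content)
```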
We get this lovely, first trace in Opik:
Our first Opik trace
Basically, we could end this blog article here, as this is and will remain the most important application of any tracing endeavor. We see the question, the answer, the model used, the tokens used, the latency, and some metadata.
Furthermore, most frameworks allow a custom OpenAI client to be passed as a parameter. This means you can use Opik tracing with any framework that uses the OpenAI client.
For example, if you want to use Opik tracing with the new PydanticAI agent framework, simply use their custom OpenAI client support and pass the wrapped client to the agent.
If you have multiple function calls (like tools) that interact with the LLM, you can use the opik.track decorator to trace these calls. This will log the function names as well as model details and function timings.
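A minimal sketch of how this could look (the tool function below is made up for illustration):

```python
from opik import track

@track
def search_documentation(query: str) -> str:
    # A hypothetical tool - Opik logs its input, output and timing as a span.
    return f"Relevant docs for: {query}"

@track
def answer_question(question: str) -> str:
    # Nested decorated calls and the wrapped OpenAI client end up in the same trace.
    context = search_documentation(question)
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
    )
    return response.choices[0].message.content
```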
Opik Integrations
However, Opik makes this even more comfortable. They provide a ton of native integrations for many popular tools and frameworks.
For reference, let's see how we can integrate Opik with LangChain:
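```python
# Import paths are for recent LangChain versions and may vary slightly.
from langchain_openai import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from opik.integrations.langchain import OpikTracer

# The Opik-specific part: create a tracer and pass it as a callback below.
opik_tracer = OpikTracer()

llm = OpenAI()
prompt = PromptTemplate(
    input_variables=["topic"], template="Tell me a fun fact about {topic}."
)
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run("observability", callbacks=[opik_tracer]))
```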
As you can see, all we need is one additional line. We create an OpikTracer and pass it as a callback to the run method of the LLMChain. This will automatically trace the interaction with the LLM model via LangChain.
For a full list of integrations, please visit their outstanding documentation.
Manual tracing if none of the integrations fit
Let's say you have a framework that is not supported by Opik and that does not allow you to pass a custom LLM client, which might happen with all the new agent frameworks popping up. In this case you can always trace the interactions manually.
Let's consider a simple agent application, similar to the one from our previous blog post, and let's assume we use an agent framework that does not allow us to pass a custom LLM client. In this case we can add the opik.track function decorator to any method that interacts with the LLM.
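A sketch of what this could look like for a hypothetical agent (class and method names are made up for illustration):

```python
from opik import track

class SimpleAgent:
    """Hypothetical agent whose framework doesn't accept a custom LLM client."""

    @track
    def plan(self, task: str) -> list[str]:
        # Inputs (the task) and outputs (the steps) are logged by Opik.
        return [f"research {task}", f"summarize {task}"]

    @track
    def run_tool(self, step: str) -> str:
        # Each tool call shows up as its own span with timing information.
        return f"result of '{step}'"

    @track
    def run(self, task: str) -> str:
        results = [self.run_tool(step) for step in self.plan(task)]
        return " | ".join(results)
```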
While this will not log the exact prompts and responses or the costs incurred by the LLM, we at least get an idea of tool usage and latency. Note that in this case Opik traces the inputs, outputs and timings of the decorated functions.
Annotations
Another great feature of Opik is the ability to annotate LLM traces with feedback scores. In the UI, navigate to the trace you want to annotate and hit the Annotate button in the top-right corner.
If it's the first time you're annotating a trace, you'll be asked to create a feedback definition. Fill out the feedback definition form and click Create feedback definition.
Creating a feedback definition
Now you can annotate the trace with the feedback category you just created. Simply select the category from the dropdown.
Annotating a trace
Tracking LLM costs
For OpenAI and Gemini models, Opik automatically tracks the costs of each LLM call and provides a breakdown of the costs in the trace view as well as the project metrics overview.
Estimated costs overview
This is especially useful when you're working with a budget or need to optimize your LLM usage.
Note that while this cost tracking is not available for all LLM models, token usage and latency are always tracked.
Setting the project to use
In all our examples so far, we've been using the default project. However, you will most probably create multiple projects. Opik allows you to set the project you want to trace to via a simple function parameter, both on the client wrapper and on the track decorator.
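For example (the project name "my-chatbot" is just a placeholder):

```python
from openai import OpenAI
from opik import track
from opik.integrations.openai import track_openai

# Trace all calls made through this client into the "my-chatbot" project.
openai_client = track_openai(OpenAI(), project_name="my-chatbot")

# The track decorator accepts the same parameter.
@track(project_name="my-chatbot")
def answer(question: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```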
If you want to change the project for all traces, you can set the project name as an environment variable:
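```bash
# Applies to every trace emitted by processes that see this variable.
export OPIK_PROJECT_NAME="my-chatbot"
```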
LLM Monitoring and Tracing in Production with Opik
Now that we have hopefully created a well-performing LLM application, we need to make sure it also behaves well in production.
Trace Logging
Nothing new here. Opik is designed to be highly scalable, so simply use the same setup as you did for development.
Monitoring Dashboards
All trace metrics (number of traces, tokens and feedback scores) are available in a dashboard, giving you a high-level overview of your application. You can spot whether your LLM costs are running away or your application is not performing as expected.
Opik Dashboard (kindly taken from https://www.comet.com/docs/opik/production/production_monitoring)
Online Evaluation of LLM Performance
Talking about feedback scores: in production we need to create them for each trace. This can't be done manually, as you most probably have too many traces. So Opik provides a way to create feedback scores automatically, using an LLM itself as the judge (this is called 'LLM as a judge').
Let's assume you built a RAG system using Azure AI Search. You would then want to know whether your LLM generates relevant answers - and that it doesn't hallucinate.
Opik provides three predefined LLM-as-a-judge validation prompts:
- Hallucination: This metric checks if the LLM output contains any hallucinated information.
- Moderation: This metric checks if the LLM output contains any offensive content.
- Answer Relevance: This metric checks if the LLM output is relevant to the question asked.
Alternatively you can also create your own validation prompts.
- First, we need to configure an LLM provider (which is used as the judge). Navigate to the AI Provider configuration and add a new AI provider.
- Then navigate to your project, select the Rules tab and create a new rule.
Create rule screen
- Fill out the form by selecting the LLM judge model, selecting the prompt you want to use and - most importantly - defining the variable mappings. Text provided in double {{ brackets }} will be replaced by the actual values from the trace. Note: You can use any value in your trace as a variable. Feel free to experiment with the metadata parameter of the track decorator to add additional information to your traces. Last but not least, add a score definition.
Click on Create rule and you're done. From now on, every trace will be annotated with the feedback score.
Note: This automatic scoring feature is one of Opik's most powerful. It allows you to automatically validate your LLM output in production and detect potential issues early on.
Interested in how to train your very own Large Language Model?
We prepared a well-researched guide on how to use the latest advancements in open-source technology to fine-tune your own LLM. This has many advantages, like:
- Cost control
- Data privacy
- Excellent performance - adjusted specifically for your intended use