Langfuse: The Open Source Observability Platform for building better LLM Applications


Ever deployed an LLM application only to find yourself blindly guessing why it suddenly started hallucinating? Or spent hours reconstructing what happened during a complex agent execution? The reality of building AI applications today involves navigating a maze of black boxes - from the unpredictable nature of model outputs to the complex chains of reasoning that happen behind the scenes.

When applications fail, developers often lack the tools to understand why. When they succeed, replicating that success becomes equally challenging. Traditional debugging approaches fall short in a world where execution isn't deterministic and "correctness" exists on a spectrum rather than as a binary state.

This is where Langfuse enters the picture. As an open-source LLM engineering platform, it provides the missing observability layer that transforms LLM development from art to engineering. By capturing the full context of every execution - including prompts, completions, retrieval steps, and tool usage - Langfuse gives development teams much-needed visibility into their applications.

From tracing complex agent workflows to measuring output quality at scale, Langfuse brings structure to the chaos of LLM development. The result? Faster debugging, data-driven improvements, and the ability to build AI applications that get better over time rather than gradually degrading.

Let's explore what Langfuse is, how to use it, and what features it brings to the table. This is an interesting tool for us to look at, as we already covered a similar tool, Opik, in the past.

The Challenges of AI Application and Agent Development

Building AI applications - especially those involving autonomous agents - presents a unique set of challenges that traditional software engineering tools simply weren't designed to handle.

The Black Box Problem

Unlike conventional software where inputs reliably produce the same outputs, LLMs introduce inherent unpredictability. A minor change in phrasing can dramatically alter model behavior, and even identical prompts can result in different responses depending on sampling parameters. This creates a "black box" effect where understanding what's happening inside your application becomes nearly impossible without specialized tooling.

Untangling Complex Execution Paths

Modern AI applications rarely consist of simple prompt-response pairs. Instead, they involve intricate sequences of:

  • Multiple LLM calls with different contexts
  • Retrieval operations pulling data from various sources
  • Tool usage and API interactions
  • Conditional branching based on model outputs
  • Parallel processing and asynchronous operations

When something goes wrong - as it inevitably does - pinpointing exactly where the failure occurred becomes a daunting task. Did the retrieval system return irrelevant context? Did the agent choose the wrong tool? Did the core reasoning step contain a logical flaw? Without visibility into each step, debugging becomes an exercise in frustration.

The Quality Assessment Puzzle

Traditional software relies on unit tests with binary pass/fail results. LLM applications operate in a world of subjective quality where "correctness" spans a spectrum. Is a response accurate but not helpful? Helpful but not accurate? Both accurate and helpful but poorly formatted? Establishing metrics for success becomes a significant challenge, made worse by the lack of standardized evaluation frameworks.

The Moving Target of Production

Perhaps most challenging is that LLM applications often degrade in production in ways that development environments never reveal:

  • Users ask questions developers never anticipated
  • External APIs and tools change their behavior
  • The underlying models themselves receive updates
  • Edge cases emerge that weren't covered in testing

Without continuous monitoring and evaluation, applications that worked perfectly during development can quickly become unreliable in the real world.

The Agent Development Conundrum

For agent-based systems, these challenges compound exponentially. Agents make autonomous decisions about which actions to take, creating a vast space of possible execution paths. When an agent gets stuck in a loop, makes poor tool choices, or fails to achieve its objective, reconstructing what happened and why becomes extraordinarily difficult.

Traditional logging - capturing inputs and outputs - falls woefully short. What's needed is comprehensive tracing of the agent's decision-making process, including its reasoning steps, tool selections, intermediate results, and the full context informing each choice.

The Iteration Bottleneck

Perhaps most frustrating for teams is how these challenges combine to create an iteration bottleneck. Without clear visibility into what's happening, improvements become speculative rather than data-driven. Teams find themselves asking:

  • Which prompts need refinement?
  • Where should we focus our optimization efforts?
  • Are our model choices appropriate for each task?
  • How do we measure whether changes are actually improvements?

Without answers to these questions, the development cycle becomes slow, expensive, and frustrating - more art than engineering.

Traditional software engineering tools simply weren't built to address these challenges. Teams need an entirely new category of tooling designed specifically for the unique requirements of LLM application development, one that brings observability, evaluation, and systematic improvement to every stage of the development process.

What is Langfuse?

Langfuse is an open-source LLM engineering platform that provides developers with the essential infrastructure and tools needed to debug, analyze, and systematically improve their AI applications. At its core, Langfuse functions as an observability layer that captures the complete execution context of language model applications, from simple chatbots to complex autonomous agents.

Unlike general-purpose logging or monitoring tools, Langfuse is specifically built for the challenges of LLM development, as outlined above. It treats traces as first-class citizens, allowing developers to visualize and understand the hierarchical, often non-linear execution paths that characterize modern AI applications. Every LLM call, retrieval operation, tool usage, and intermediate reasoning step can be captured with full context, creating a very detailed record of what happened during execution.

The platform consists of four integrated components that work together to support the complete LLM development lifecycle:

  1. LLM Application Observability: Capture every aspect of your application's execution through lightweight SDKs and framework integrations that record prompts, completions, metadata, and relationships between execution steps.

  2. Prompt Management: Centrally store, version, and collaboratively iterate on prompts, decoupled from application code for faster experimentation without redeployment.

  3. Evaluations: Assess output quality through multiple methods, LLM-as-a-judge, user feedback collection, manual labeling, or custom evaluation pipelines.

  4. Datasets: Create benchmarks from production data, test sets, or synthetic examples to systematically evaluate application performance across versions.

What separates Langfuse from other tools is how these components are natively integrated to accelerate the development workflow. A problematic trace in production can be instantly added to a dataset, evaluated with an LLM-as-a-judge, addressed with prompt improvements, and verified before deployment, all within the same platform.

Langfuse is designed with flexibility in mind, supporting Python, JavaScript/TypeScript, and popular frameworks like LangChain, LlamaIndex, and the OpenAI SDK. It's fully open-source (MIT license) and can be self-hosted or used as a managed service through Langfuse Cloud, with both EU and US hosting options available for teams with specific compliance requirements.

How to run Langfuse

There are various documented methods for running Langfuse, ranging from a local Docker Compose setup over Kubernetes (Helm) deployments to the managed Langfuse Cloud.

Note: If you want to know more about the Langfuse architecture, you can check the Langfuse architecture overview page.

For the sake of having a test setup, we'll use the docker-compose configuration in this post.

  1. Clone the Langfuse repository:

    git clone https://github.com/langfuse/langfuse.git
    cd langfuse

  2. Open the docker-compose.yml file and set all the environment variables marked with # Change me to your own values. These are the secrets used to set up and connect to the various services; you can choose your own values on initial setup.

  3. Run the following command to start the services:

    docker compose up -d

As simple as that, we have ourselves a running Langfuse instance. We'll have the following services running:

  • langfuse worker
  • langfuse web
  • clickhouse
  • minio
  • redis
  • postgres

Nothing out of the ordinary, but we need to keep in mind that Langfuse requires quite a few services to run. If you want to run it in production, you'll need to set up a proper environment with all the services configured and running.

Langfuse Features

Now that we have Langfuse running, let's take a look at some of the features it provides.

Note: We'll stick to the features available in the open-source version of Langfuse. To get an idea of all the features available, see the Langfuse pricing page.

Tracing

The main feature of Langfuse - obviously - is tracing. In simple terms, it allows you to capture the execution of your LLM applications and visualize it.

Langfuse makes this rather easy:

  1. Install the required packages (in our case we want to use an OpenAI model):

    pip install langfuse openai
  2. In your Langfuse instance, create a new project and get the secret key and public key.

    In your organization overview, click on "New Project". Fill out the form. Creating a Project

    In the next screen, click on "Create API Keys". This will reveal the public and private key. Copy them. Getting the keys

  3. Create a .env file:

    LANGFUSE_SECRET_KEY="sk-lf-..." # private key
    LANGFUSE_PUBLIC_KEY="pk-lf-..." # public key
    LANGFUSE_HOST="https://<your-langfuse-instance>.com" # your langfuse instance
  4. In your project, make sure the environment variables are loaded. Then, simply import the Langfuse wrapper of your AI SDK instead of the SDK itself. For example, if you're using OpenAI, you can do the following:

    from langfuse.openai import openai # OpenAI integration

    completion = openai.chat.completions.create(
        name="my-completion",
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
    )

    Note: That's basically all it takes to trace your LLM calls. Each and every call to OpenAI will now be logged, including latencies and costs. However, each call will end up as a separate trace. Follow the next section to group and nest the traces the way you want.

Our first trace in Langfuse

Combining LLM calls into a single trace

Now that we log individual LLM calls, it would be nice to group them together when an application requires multiple calls. Let's say you have a chat agent: it's highly likely that you need multiple LLM calls to answer the user's question, e.g. one call to extract relevant information, another call to decide which tools to use, and finally a call to create the final answer. By grouping all these calls together we get a much better overview of a single user interaction.

For that, we have two options:

  1. Add a custom trace id to the openai LLM calls.

    import uuid

    my_trace_id = str(uuid.uuid4()) # create a custom trace id

    completion = openai.chat.completions.create(
        name="my-completion",
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
        trace_id=my_trace_id,
    )

    Just make sure to add the same trace id to all LLM calls you want to group. So in most cases you'll want to create a custom trace id at the beginning of the user interaction.

    While this is the easiest way of getting quite a good picture of how your application is working, it's a bit cumbersome to add the trace_id to every LLM call. You also don't get any information about which step or function the LLM call was made in.

  2. The second method solves both of the aforementioned issues: Langfuse provides a handy decorator to wrap the functions where you run your LLM calls.

    Simply put this decorator on top of your functions and all LLM calls made inside them will be grouped together.

    from langfuse.decorators import observe
    from langfuse.openai import openai # OpenAI integration

    @observe()
    def call_llm():
        return openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is the capital of France?"}
            ],
        ).choices[0].message.content

    @observe()
    def main():
        return call_llm()

    main()

    What you'll get is a single trace containing which functions were called and when, and, nested one level deeper, the LLM calls themselves.

    Note: It's important to @observe all the functions which you want to be traced. The same is true for the function call hierarchy: if you have functions main(), do_something() and call_llm(), where main calls do_something and do_something calls call_llm, you need to @observe all three functions if you want to get the complete picture.

    If you only traced, e.g., main() and call_llm(), you would get two different traces: one for main() and one for call_llm() (which would then have the LLM call as a child). So make sure @observe is added to all functions in your call hierarchy.
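    To make this concrete, here's a minimal sketch of that three-function hierarchy with all three functions decorated (do_something() is just a hypothetical intermediate step for illustration):

    from langfuse.decorators import observe
    from langfuse.openai import openai

    @observe()
    def call_llm(question):
        return openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content

    @observe()
    def do_something(question):
        # hypothetical intermediate step, e.g. rephrasing or retrieval
        return call_llm(question)

    @observe()
    def main():
        return do_something("What is the capital of France?")

    main()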

    A trace with multiple LLM calls

    Note how in this view it's very easy to see the number of tokens used, the latency, and the cost of each LLM call.

Trace timeline

By clicking the "Timeline" toggle button in the traces view, you can switch to a handy timing-based overview. This makes it easy to see what took the most time, where the bottlenecks are, where you have room for parallelization, and where you can optimize your code.

You even see the time to first token in streaming scenarios (this is a game changer...)

Trace timeline

Sessions

Let's say you have a chat agent. A user probably won't send just one message; they might ask multiple questions and follow-up questions in the same chat context.

While we could group all of them into one trace, that rarely makes sense, as most of the time you want to monitor and analyze one single question (which by definition then is a trace). However, you'd still like to see how the full session took place.

For this, Langfuse offers Sessions. A session in Langfuse is a way to group these traces together and see a simple session replay of the entire interaction.

Instrumenting Langfuse for sessions is again quite simple: use langfuse_context.update_current_trace to add the session id to the current trace. As simple as that. (Instead of chat-id below, use your actual chat identifier; most chat SDKs provide a chat id.)

from langfuse.decorators import langfuse_context, observe
from langfuse.openai import openai

@observe()
def call_llm():
    session_id = "chat-id"
    langfuse_context.update_current_trace(
        session_id=session_id,
    )

    completion = openai.chat.completions.create(
        name="my-completion",
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
    )

call_llm()

If you don't use the @observe decorator, you can also simply add the session id to the LLM call directly:

completion = openai.chat.completions.create(
    name="my-completion",
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    session_id="chat-id",
)

Users

Now that we have traces and sessions, what's missing? Correct, "user" entities.

Note: "Users" does not necessarily mean users in the sense of single identifiable persons. "Users" can also mean agents or organizations; a user is simply a collection of traces and sessions. In general, make sure to respect privacy and data protection laws when using Langfuse.

To add user identifiers to your traces, use the same pattern as for sessions, but with the user_id parameter.

from langfuse.decorators import langfuse_context, observe
from langfuse.openai import openai

@observe()
def call_llm():
    user_id = "user-id"
    langfuse_context.update_current_trace(
        user_id=user_id,
    )

    completion = openai.chat.completions.create(
        name="my-completion",
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
    )
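
Just like with sessions, if you don't use the @observe decorator you should be able to pass user_id directly to the Langfuse-wrapped OpenAI call - a minimal sketch, mirroring the session example above:

completion = openai.chat.completions.create(
    name="my-completion",
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    user_id="user-id", # attaches the user entity to this trace
)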

In Langfuse you can then see all the user entities, how many tokens they consumed, and how much they cost you.

User overview. Source: https://langfuse.com/docs/tracing-features/users

More Tracing features

  • Trace Metadata: Add custom metadata to traces for better organization and filtering.

    # Update trace metadata from anywhere inside the call stack
    langfuse_context.update_current_trace(
        metadata={"key": "value"}
    )

    # Update observation metadata for the current observation
    langfuse_context.update_current_observation(
        metadata={"key": "value"}
    )
  • Trace Tags: Use tags to categorize and filter traces. Makes for really great overviews and filtering experiences.

    langfuse_context.update_current_trace(
        tags=["tag-1", "tag-2"]
    )

    Tag filtering

  • Multi-Modal Tracing: Capture and visualize multi-modal data, such as images or audio, alongside text traces.

    Multi-Modal Trace. Source: https://langfuse.com/docs/tracing-features/multi-modality

  • Sampling: If you have a lot of users (and we hope you do), then you probably don't want to trace every single user interaction. Simply set the environment variable LANGFUSE_SAMPLE_RATE to a value between 0 and 1 to sample a percentage of traces. For example, 0.1 will sample 10% of all traces.

    os.environ["LANGFUSE_SAMPLE_RATE"] = '0.5'
  • Trace Sharing: Share traces with your team or external stakeholders for collaboration and debugging. Click the (admittedly a bit too small) "share" icon and hit "Share" to get a public, shareable link to the trace.

    Trace Sharing

  • Native integrations: While tracing LLM calls is mostly enough, Langfuse offers dozens of integrations with other tools and frameworks, such as LangChain, LlamaIndex, LiteLLM and Haystack.

    These integrations allow you to trace a framework's inner workings in more detail than the plain LLM call integration could - see the LangChain sketch below.

    (Just for completeness: Langfuse is also fully OpenTelemetry compatible.)
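
    For illustration, here's a minimal sketch of the LangChain integration, assuming the v2 Python SDK's CallbackHandler and the langchain-openai package (treat the exact package and class names as assumptions and check the integration docs for your versions):

    from langfuse.callback import CallbackHandler # Langfuse's LangChain callback
    from langchain_openai import ChatOpenAI

    # Reads LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY and LANGFUSE_HOST from the environment
    langfuse_handler = CallbackHandler()

    llm = ChatOpenAI(model="gpt-4o")

    # Every step of the run is captured in a single Langfuse trace
    response = llm.invoke(
        "What is the capital of France?",
        config={"callbacks": [langfuse_handler]},
    )
    print(response.content)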

Dashboard

The Langfuse dashboard is the heart of the platform, providing a centralized view of all traces, sessions, and user interactions. It allows you to easily get an overview of your application's usage and performance and to spot potential degradations over time.

A picture or two says more than a thousand words, so have a look:

Langfuse Dashboard

Langfuse Dashboard

Evaluations

Evaluations are a key feature of Langfuse, allowing you to assess the quality of your LLM outputs systematically.

In your project settings, click Scores/Evaluations and create a new score config. A score config is simply the definition of how you want to evaluate your traces (e.g. using numeric scores or labels).

Creating a new score config

Now in your trace view, click the Annotate button (the small chat bubble) and select the score config you want to use.

Annotate a trace: Select score config

Enter your score (or select the label, depending on your score config) and optionally add a comment. Click Save to save your score.

Annotate a trace: Add score

In the Tracing -> Scores menu, you'll find an overview of all your scores. With the filtering and sorting options, you have a good way of getting an idea of whether your scores are improving or degrading over time.

Scores overview
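
Manual annotation isn't the only way to record scores: you can also attach them programmatically, for example to capture user feedback from your application. A minimal sketch using the v2 decorator context (the score name and the feedback-to-value mapping are just assumptions for illustration):

from langfuse.decorators import langfuse_context, observe

@observe()
def answer_question(question, thumbs_up):
    # ... run your LLM calls here ...

    # Attach a score to the current trace, e.g. derived from user feedback
    langfuse_context.score_current_trace(
        name="user-feedback", # hypothetical score name
        value=1.0 if thumbs_up else 0.0,
        comment="thumbs up/down from the chat UI",
    )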

Datasets

Via Langfuse Datasets you can create test sets and benchmarks to evaluate the performance of your LLM application.

Use cases (according to the Langfuse documentation):

  • Continuous improvement: Create datasets from production edge cases to improve your application
  • Pre-deployment testing: Benchmark new releases before deploying to production
  • Structured testing: Run experiments on collections of inputs and expected outputs
  • Flexible evaluation: Add custom evaluation metrics or use llm-as-a-judge
  • Integrates well: Works with popular frameworks like LangChain and LlamaIndex

As datasets are a bit out of scope for this more monitoring-related post, I'll leave a link to the rather good Langfuse Dataset documentation here.

Prompt Management

The final feature I'm gonna talk about is prompt management, as it's a lovely addition to the Langfuse platform.

By using the Prompts menu in the Langfuse platform, you can create and version control prompts. This allows you to easily iterate on your prompts and keep track of changes over time.

Just click on Create prompt and fill out the prompt form - it should be self-explanatory.

Creating a new prompt

If you want to create a new version of the prompt, simply open the prompt and click "New" in the "Versions" tab. This will give you a nice git-like version history.

Prompt version history

If you click the Metrics tab, you can see the usage of the prompt over time and what it cost, and if you use the Evaluations feature described above, you'll also see the scores for the prompt and its versions.

To use these prompts in your code, follow these steps:

from langfuse import Langfuse

langfuse = Langfuse() # reads the LANGFUSE_* environment variables

chat_prompt = langfuse.get_prompt("rag-chat-prompt", type="chat")

# If you need to add variables to the prompt
compiled_chat_prompt = chat_prompt.compile(rag_question="What is...", rag_context="...")
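
Since compile() on a chat prompt returns the messages in OpenAI chat format, you can pass the result straight to the (Langfuse-wrapped) OpenAI client - a small sketch, reusing the compiled_chat_prompt from above:

from langfuse.openai import openai

completion = openai.chat.completions.create(
    model="gpt-4o",
    messages=compiled_chat_prompt, # list of {"role": ..., "content": ...} dicts
)
print(completion.choices[0].message.content)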

To link the prompt to a trace - you guessed it - add the prompt to the @observe decorated function:

from langfuse import Langfuse
from langfuse.decorators import langfuse_context, observe

langfuse = Langfuse()

@observe()
def main():
    prompt = langfuse.get_prompt("rag-chat-prompt", type="chat")

    langfuse_context.update_current_observation(
        prompt=prompt,
    )

main()

Note: Langfuse offers caching for the prompts, see here.
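
To control how long a fetched prompt is cached client-side before being re-fetched, get_prompt accepts a TTL parameter - a sketch reusing the langfuse client from above and assuming cache_ttl_seconds is the parameter name in your SDK version:

# Cache the fetched prompt locally for 5 minutes
chat_prompt = langfuse.get_prompt(
    "rag-chat-prompt",
    type="chat",
    cache_ttl_seconds=300, # assumed parameter name; check the caching docs for your SDK version
)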

Interested in how to train your very own Large Language Model?

We prepared a well-researched guide for how to use the latest advancements in Open Source technology to fine-tune your own LLM. This has many advantages like:

  • Cost control
  • Data privacy
  • Excellent performance - adjusted specifically for your intended use

Further reading

More information on our managed RAG solution?
To Pondhouse AI
More tips and tricks on how to work with AI?
To our Blog