Langfuse: The Open Source Observability Platform for Building Better LLM Applications

Ever deployed an LLM application only to find yourself blindly guessing why it suddenly started hallucinating? Or spent hours reconstructing what happened during a complex agent execution? The reality of building AI applications today involves navigating a maze of black boxes - from the unpredictable nature of model outputs to the complex chains of reasoning that happen behind the scenes.
When applications fail, developers often lack the tools to understand why. When they succeed, replicating that success becomes equally challenging. Traditional debugging approaches fall short in a world where execution isn't deterministic and "correctness" exists on a spectrum rather than as a binary state.
This is where Langfuse enters the picture. As an open-source LLM engineering platform, it provides the missing observability layer that transforms LLM development from art to engineering. By capturing the full context of every execution - including prompts, completions, retrieval steps, and tool usage - Langfuse gives development teams much-needed visibility into their applications.
From tracing complex agent workflows to measuring output quality at scale, Langfuse brings structure to the chaos of LLM development. The result? Faster debugging, data-driven improvements, and the ability to build AI applications that get better over time rather than gradually degrading.
Let's explore what Langfuse is, how to use it, and what features it brings to the table. This is an interesting tool for us to look at, as we already covered a similar tool, Opik, in the past.
The Challenges of AI Application and Agent Development
Building AI applications - especially those involving autonomous agents - presents a unique set of challenges that traditional software engineering tools simply weren't designed to handle.
The Black Box Problem
Unlike conventional software where inputs reliably produce the same outputs, LLMs introduce inherent unpredictability. A minor change in phrasing can dramatically alter model behavior, and even identical prompts can result in different responses depending on sampling parameters. This creates a "black box" effect where understanding what's happening inside your application becomes nearly impossible without specialized tooling.
Untangling Complex Execution Paths
Modern AI applications rarely consist of simple prompt-response pairs. Instead, they involve intricate sequences of:
- Multiple LLM calls with different contexts
- Retrieval operations pulling data from various sources
- Tool usage and API interactions
- Conditional branching based on model outputs
- Parallel processing and asynchronous operations
When something goes wrong - as it inevitably does - pinpointing exactly where the failure occurred becomes a daunting task. Did the retrieval system return irrelevant context? Did the agent choose the wrong tool? Did the core reasoning step contain a logical flaw? Without visibility into each step, debugging becomes an exercise in frustration.
The Quality Assessment Puzzle
Traditional software relies on unit tests with binary pass/fail results. LLM applications operate in a world of subjective quality where "correctness" spans a spectrum. Is a response accurate but not helpful? Helpful but not accurate? Both accurate and helpful but poorly formatted? Establishing metrics for success becomes a significant challenge, made worse by the lack of standardized evaluation frameworks.
The Moving Target of Production
Perhaps most challenging is that LLM applications often degrade in production in ways that development environments never reveal:
- Users ask questions developers never anticipated
- External APIs and tools change their behavior
- The underlying models themselves receive updates
- Edge cases emerge that weren't covered in testing
Without continuous monitoring and evaluation, applications that worked perfectly during development can quickly become unreliable in the real world.
The Agent Development Conundrum
For agent-based systems, these challenges compound exponentially. Agents make autonomous decisions about which actions to take, creating a vast space of possible execution paths. When an agent gets stuck in a loop, makes poor tool choices, or fails to achieve its objective, reconstructing what happened and why becomes extraordinarily difficult.
Traditional logging - capturing inputs and outputs - falls woefully short. What's needed is comprehensive tracing of the agent's decision-making process, including its reasoning steps, tool selections, intermediate results, and the full context informing each choice.
The Iteration Bottleneck
Perhaps most frustrating for teams is how these challenges combine to create an iteration bottleneck. Without clear visibility into what's happening, improvements become speculative rather than data-driven. Teams find themselves asking:
- Which prompts need refinement?
- Where should we focus our optimization efforts?
- Are our model choices appropriate for each task?
- How do we measure whether changes are actually improvements?
Without answers to these questions, the development cycle becomes slow, expensive, and frustrating - more art than engineering.
Traditional software engineering tools simply weren't built to address these challenges. Teams need an entirely new category of tooling designed specifically for the unique requirements of LLM application development, one that brings observability, evaluation, and systematic improvement to every stage of the development process.
What is Langfuse?
Langfuse is an open-source LLM engineering platform that provides developers with the essential infrastructure and tools needed to debug, analyze, and systematically improve their AI applications. At its core, Langfuse functions as an observability layer that captures the complete execution context of language model applications, from simple chatbots to complex autonomous agents.
Unlike general-purpose logging or monitoring tools, Langfuse is specifically built for the challenges of LLM development, as outlined above. It treats traces as first-class citizens, allowing developers to visualize and understand the hierarchical, often non-linear execution paths that characterize modern AI applications. Every LLM call, retrieval operation, tool usage, and intermediate reasoning step can be captured with full context, creating a very detailed record of what happened during execution.
The platform consists of four integrated components that work together to support the complete LLM development lifecycle:
- LLM Application Observability: Capture every aspect of your application's execution through lightweight SDKs and framework integrations that record prompts, completions, metadata, and relationships between execution steps.
- Prompt Management: Centrally store, version, and collaboratively iterate on prompts, decoupled from application code for faster experimentation without redeployment.
- Evaluations: Assess output quality through multiple methods: LLM-as-a-judge, user feedback collection, manual labeling, or custom evaluation pipelines.
- Datasets: Create benchmarks from production data, test sets, or synthetic examples to systematically evaluate application performance across versions.
What separates Langfuse from other tools is how these components are natively integrated to accelerate the development workflow. A problematic trace in production can be instantly added to a dataset, evaluated with an LLM-as-a-judge, addressed with prompt improvements, and verified before deployment, all within the same platform.
Langfuse is designed with flexibility in mind, supporting Python, JavaScript/TypeScript, and popular frameworks like LangChain, LlamaIndex, and the OpenAI SDK. It's fully open-source (MIT license) and can be self-hosted or used as a managed service through Langfuse Cloud, with both EU and US hosting options available for teams with specific compliance requirements.
How to run Langfuse
There are various documented methods for running Langfuse, including:
- Using Docker or Docker Compose
- Using Kubernetes
- On Railway
Note: If you want to know more about the Langfuse architecture, you can check the Langfuse architecture overview page.
For the sake of having a test setup, we'll use the docker-compose configuration in this post.
- Clone the Langfuse repository (the exact commands are sketched below, after this list).
- Open the `docker-compose.yml` file and set all the environment variables marked with `# Change me` to your own values. These are secrets used to set up and connect to the various services. You can choose your own values on initial setup.
- Run `docker compose up` to start the services (also shown in the sketch below).
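A minimal sketch of those two shell steps, assuming the official Langfuse repository on GitHub and the bundled docker-compose.yml:

```bash
# Clone the Langfuse repository and switch into it
git clone https://github.com/langfuse/langfuse.git
cd langfuse

# (edit docker-compose.yml and replace all "# Change me" values first)

# Start all services; add -d to run them in the background
docker compose up
```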
As simple as that, we have ourselves a running Langfuse instance. We'll have the following services running:
- langfuse worker
- langfuse web
- clickhouse
- minio
- redis
- postgres
Nothing out of the ordinary, but we need to keep in mind that Langfuse requires quite a few services to run. If you want to run it in production, you'll need to set up a proper environment with all the services configured and running.
Langfuse Features
Now that we have Langfuse running, let's take a look at some of the features it provides.
Note: We'll stick to the features available in the open-source version of Langfuse. To get an idea of all the features available, see the Langfuse pricing page.
Tracing
The main feature of Langfuse - obviously - is tracing. In simple terms, it allows you to capture the execution of your LLM applications and visualize it.
Langfuse makes this rather easy:
- Install the required packages (in our case we want to use an OpenAI model):
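A sketch of the install step, assuming pip and the official langfuse and openai packages:

```bash
pip install langfuse openai
```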
- In your Langfuse instance, create a new project and get the secret key and public key. In your organization overview, click on "New Project" and fill out the form.
Creating a Project
In the next screen, click on "Create API Keys". This will reveal the public and secret key. Copy them.
Getting the keys
- Create a `.env` file:
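A sketch of what the `.env` file could look like, assuming a self-hosted instance on the default port 3000 (the key values are placeholders you copy from your project settings):

```bash
# Langfuse project keys (from the "Create API Keys" step above)
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
# URL of your Langfuse instance (default port of the docker-compose setup)
LANGFUSE_HOST=http://localhost:3000

# Key for the OpenAI SDK
OPENAI_API_KEY=sk-...
```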
- In your project, make sure the environment variables are loaded. Then, simply import the Langfuse wrapper of your AI SDK instead of the SDK itself. For example, if you're using OpenAI, you can do the following:
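A minimal sketch of the drop-in replacement, assuming the OpenAI Python SDK v1 client, python-dotenv for loading the `.env` file, and a hypothetical model name:

```python
from dotenv import load_dotenv

load_dotenv()  # make sure the .env file created above is loaded

# Import the Langfuse wrapper instead of the OpenAI SDK directly
from langfuse.openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY and LANGFUSE_* from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any OpenAI model; this one is just an example
    messages=[{"role": "user", "content": "What is Langfuse?"}],
)
print(response.choices[0].message.content)
```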
Note: That's basically all it takes to trace all LLM calls. Each and every call to OpenAI will now be logged, including latencies and costs. However, each call will be a separate trace. Follow the next step to nest the traces where you want them to be.
Our first trace in Langfuse
Combining LLM calls into a single trace
Now that we log our individual LLM calls, it would be nice to group them together if we have an application which requires multiple LLM calls. Let's say you have a chat agent: it's highly likely that you need multiple LLM calls to answer the user's question. For example, you need one call to extract relevant information, another call to decide which tools to use, and finally a call to create the final answer. By grouping all these calls together we can get a better overview of a single user interaction.
For that, we have two options:
- Add a custom trace ID to the OpenAI LLM calls (see the sketch below).
Just make sure to add the same trace id to all LLM calls you want to group. So in most cases you'll want to create a custom trace id at the beginning of the user interaction.
While this is the easiest way of getting quite a good picture of how your application is working, it's a bit cumbersome to add the trace_id to every LLM call. You also don't get information about in which step or function the LLM call was made.
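A sketch of this approach, assuming the v2-style OpenAI integration, which accepts `trace_id` (and related properties) as extra keyword arguments on the create call:

```python
import uuid

from langfuse.openai import OpenAI

client = OpenAI()

# Create one trace id at the beginning of the user interaction ...
trace_id = str(uuid.uuid4())

# ... and pass the same id to every LLM call that belongs to it
extraction = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract the relevant information from the question."}],
    trace_id=trace_id,
)

answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write the final answer."}],
    trace_id=trace_id,  # same id, so both calls end up in one trace
)
```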
- The second method solves both of the aforementioned issues: Langfuse provides a handy decorator to wrap the functions where you run your LLM calls. Simply put this decorator on top of your functions and all LLM calls made within those functions will be grouped together.
What you'll get is a single trace, containing both the information about which functions were called and when, and, nested a bit deeper, the LLM calls.
Note: It's important to `@observe` all the functions which you want to be traced. The same is true for the function call hierarchy. If you have functions `main()`, `do_something()` and `call_llm()`, where `main` calls `do_something` and `do_something` calls `call_llm`, you need to `@observe` all three functions if you want to get the complete picture.
If you were to only trace, e.g., `main()` and `call_llm()`, you would get two different traces, one for `main()` and one for `call_llm()` (which would then have the LLM call as a child). So make sure `@observe` is added to all functions in your call hierarchy.
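A minimal sketch of the `main()` / `do_something()` / `call_llm()` hierarchy from the note above, with every function decorated (the model name is again just an example):

```python
from langfuse.decorators import observe
from langfuse.openai import OpenAI

client = OpenAI()

@observe()
def call_llm(question: str) -> str:
    # The OpenAI call shows up nested under this function's span
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

@observe()
def do_something(question: str) -> str:
    # Intermediate step, also captured as a span in the trace
    return call_llm(question)

@observe()
def main(question: str) -> str:
    # The outermost decorated function becomes the root of the trace
    return do_something(question)

main("What is Langfuse?")
```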
A trace with multiple LLM calls
Note how in this view it's very easy to see the number of tokens used, the latency and the cost of each LLM call.
Trace timeline
By clicking the "Timeline" toggle button in the traces view, you can switch to a handy timing-based overview. This allows you to clearly see what took the most time, where bottlenecks are, where you have room for parallelization and where you can optimize your code.
You even see the time to first token in streaming scenarios (this is a game changer...)
Trace timeline
Sessions
Let's say you have a chat agent. The user probably doesn't send just one message; they might ask multiple questions and follow-up questions in the same chat context.
While we could group all of them into one trace, that rarely makes sense, as most of the time you want to monitor and analyze one single question (which, by definition, is then a trace). However, you would still like to see how a full session took place.
For this, Langfuse offers Sessions. A session in Langfuse is a way to group these traces together and see a simple session replay of the entire interaction.
Instrumenting Langfuse for sessions is again quite simple. Use the `langfuse_context` helper to update the current trace with the session ID.
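A sketch of that, assuming a decorated entry point per user message (`chat-id` stands in for your real chat identifier, and the model name is just an example):

```python
from langfuse.decorators import observe, langfuse_context
from langfuse.openai import OpenAI

client = OpenAI()

@observe()
def answer_question(message: str) -> str:
    # Attach the current trace to the chat session it belongs to
    langfuse_context.update_current_trace(session_id="chat-id")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content
```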
As simple as that. (Instead of `chat-id`, use your actual chat identifier. Most chat SDKs provide a chat ID.)
If you don't use the `@observe` decorator, you can also simply add the session ID to the trace:
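A sketch assuming the OpenAI integration also accepts `session_id` as an extra keyword argument:

```python
from langfuse.openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "And a follow-up question?"}],
    session_id="chat-id",  # groups this trace into the same session
)
```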
Users
Now that we have traces and sessions, what's missing? Correct, "user" entities.
Note: "Users" does not necessarily mean users in the sense of single identifiable persons. "Users" can also mean agents or organizations. "Users" are simply a collection of traces and sessions. In general, make sure to respect privacy and data protection laws when using Langfuse.
To add user identifiers to your trace, use the same pattern as for the sessions, but use the `user_id` parameter.
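A sketch mirroring the session example above, with a hypothetical user ID:

```python
from langfuse.decorators import observe, langfuse_context
from langfuse.openai import OpenAI

client = OpenAI()

@observe()
def answer_question(message: str, user_id: str) -> str:
    # Same pattern as for sessions, but tagging the trace with the user
    langfuse_context.update_current_trace(user_id=user_id)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content

answer_question("What is Langfuse?", user_id="user-1234")
```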
In Langfuse you then see all the user entities, what they consumed and how much they cost you.
User overview. Source: https://langfuse.com/docs/tracing-features/users
More Tracing features
- Trace Metadata: Add custom metadata to traces for better organization and filtering (a combined sketch for metadata and tags follows after this list).
- Trace Tags: Use tags to categorize and filter traces. Makes for really great overviews and filtering experiences.
Tag filtering
- Multi-Modal Tracing: Capture and visualize multi-modal data, such as images or audio, alongside text traces.
Multi-Modal Trace. Source: https://langfuse.com/docs/tracing-features/multi-modality
- Sampling: If you have a lot of users (and we hope you do), then you probably don't want to trace every single user interaction. Simply set the environment variable `LANGFUSE_SAMPLE_RATE` to a value between 0 and 1 to sample a percentage of traces. For example, `0.1` will sample 10% of all traces.
- Trace Sharing: Share traces with your team or external stakeholders for collaboration and debugging. Click on the (a bit too small) "Share" icon and hit "Share" to get a public, shareable link to the trace.
Trace Sharing
- Native integrations: While tracing LLM calls is mostly enough, Langfuse offers dozens of integrations to other tools, such as LangChain and LlamaIndex. These integrations allow you to trace the frameworks' inner workings better than the plain LLM call integration can.
(Just for completeness: Langfuse is also fully OpenTelemetry compatible.)
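As referenced in the metadata item above, here is a minimal combined sketch for trace metadata and tags, assuming the v2 decorator API (the metadata keys and tag names are made up for illustration):

```python
from langfuse.decorators import observe, langfuse_context

@observe()
def handle_request(question: str) -> str:
    # Attach custom metadata and tags to the current trace
    langfuse_context.update_current_trace(
        metadata={"app_version": "1.2.3", "feature": "chat"},
        tags=["production", "chat"],
    )
    return question.upper()  # placeholder for the actual application logic
```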
Dashboard
The Langfuse dashboard is the heart of the platform, providing a centralized view of all traces, sessions, and user interactions. It allows you to easily get an overview of your application and its performance, and to spot potential degradations over time.
A picture or two says more than a thousand words, so have a look:
Langfuse Dashboard
Langfuse Dashboard
Evaluations
Evaluations are a key feature of Langfuse, allowing you to assess the quality of your LLM outputs systematically.
In your project settings, click "Scores/Evaluations" and create a new score config. A score config is simply the definition of how you want to evaluate your traces (e.g. by using numbers or labels).
Creating a new score config
Now, in your trace view, click the "Annotate" button (the small chat bubble) and select the score config you want to use.
Annotate a trace: Select score config
Enter your score (or select the label, depending on your score config) and optionally add a comment. Click "Save" to save your score.
Annotate a trace: Add score
In the Tracing -> Scores menu, you'll find an overview of all your scores. With the filtering and sorting options, you have a good way of getting an idea of whether your scores are degrading or improving over time.
Scores overview
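Scores don't have to come from manual annotation alone. If you want to attach them programmatically, for example from user feedback or a custom evaluation pipeline, a sketch using the low-level client could look like this (the trace ID and score name are placeholders):

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* keys from the environment

# Report a score for an existing trace
langfuse.score(
    trace_id="your-trace-id",  # id of the trace you want to score
    name="helpfulness",        # should match the score name/config you use
    value=1,
    comment="Thumbs up from the user",
)
```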
Datasets
Via Langfuse Datasets you can create test sets and benchmarks to evaluate the performance of your LLM application.
Use cases (according to the Langfuse documentation):
- Continuous improvement: Create datasets from production edge cases to improve your application
- Pre-deployment testing: Benchmark new releases before deploying to production
- Structured testing: Run experiments on collections of inputs and expected outputs
- Flexible evaluation: Add custom evaluation metrics or use LLM-as-a-judge
- Integrates well: Works with popular frameworks like LangChain and LlamaIndex
As datasets are a bit out of scope for this more monitoring-related post, I'll leave a link to the rather good Langfuse Dataset documentation here.
Prompt Management
The final feature I'm gonna talk about is prompt management, as it's a lovely addition to the Langfuse platform.
By using the Prompts menu in the Langfuse platform, you can create and version-control prompts. This allows you to easily iterate on your prompts and keep track of changes over time.
Just click on "Create prompt" and fill out the prompt form - it should be self-explanatory.
Creating a new prompt
If you want to create a new version of the prompt, simply open the prompt and, in the "Versions" tab, click on "New". This gives you a nice git-like version history.
Prompt version history
If you click the "Metrics" tab, you can see the usage of the prompt over time and what it cost. If you use the Evaluations feature as described above, you'll also see the scores of the prompt and its versions.
To use these prompts in your code, follow these steps:
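Roughly, this boils down to fetching the prompt by name and compiling its variables. A sketch, assuming a hypothetical prompt called "qa-prompt" with a {{question}} variable:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* keys from the environment

# Fetch the current production version of the prompt by its name
prompt = langfuse.get_prompt("qa-prompt")

# Fill in the prompt's variables
compiled_prompt = prompt.compile(question="What is Langfuse?")
print(compiled_prompt)
```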
To link the prompt to a trace - you guessed it - add the prompt to the `@observe` decorated function:
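A sketch of that, assuming the v2 decorator API where the current observation can be updated with the fetched prompt (prompt name and model are placeholders):

```python
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from langfuse.openai import OpenAI

langfuse = Langfuse()
client = OpenAI()

@observe()
def answer(question: str) -> str:
    prompt = langfuse.get_prompt("qa-prompt")
    # Link the fetched prompt (and its exact version) to this part of the trace
    langfuse_context.update_current_observation(prompt=prompt)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt.compile(question=question)}],
    )
    return response.choices[0].message.content
```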
Note: Langfuse offers caching for the prompts, see here.
Interested in how to train your very own Large Language Model?
We prepared a well-researched guide on how to use the latest advancements in open-source technology to fine-tune your own LLM. This has many advantages, like:
- Cost control
- Data privacy
- Excellent performance - adjusted specifically for your intended use