DSPy: Build Better AI Systems with Automated Prompt Optimization

We've all been there. After hours of crafting the perfect prompt, you finally get your LLM to generate exactly what you want. The output is clear, accurate, and perfectly formatted. You lean back in your chair, feeling victorious – only to watch your carefully engineered prompt produce completely different, often nonsensical results on the very next run.

This "works once, fails often" phenomenon is the daily reality of prompt engineering. It's like trying to nail jelly to a wall. What works brilliantly in one moment might generate garbage the next, leaving developers frustrated and systems unreliable. The random nature of Large Language Models means that even when you think you've cracked the code, you're often just experiencing a lucky hit rather than a reproducible solution.

Traditional prompt engineering feels more like alchemy than engineering – a mix of gut feeling, trial and error, and hoping for the best. It's a time-consuming process of tweaking words, adjusting contexts, and crossing fingers, only to discover that your prompts need to be completely reworked when you switch to a different model or slightly modify your use case.

This is where DSPy comes in. Instead of playing this endless game of prompt whack-a-mole, DSPy offers a systematic, programmatic approach to building reliable AI systems. In this tutorial, we'll explore how DSPy can transform your prompt engineering workflow from artistic guesswork into a robust, reproducible process.

Installing DSPy with pgvector extras

First things first, let's install DSPy. We'll install the base package along with the pgvector extras, which are required for optimizing prompts in combination with pgvector. As we are big fans of PostgreSQL and pgvector, we'll likely use it in one of our next DSPy tutorials.

pip install dspy-ai[pgvector]

Just so you know, there are plenty of other extras available; to install them, follow the official installation guide.

Working with DSPy - the 8 steps to prompt programming

On their excellent website, DSPy describes the 8 steps necessary to create great prompts. We'll look at them here in more detail to better understand the examples that follow.

We'll throw a bunch of new concepts at you - just bear with us. It's important to learn all of this; many things will become clearer in the next chapter when we run our first example.

1. Define Your Task

The first step in using DSPy is to clearly define the problem you're trying to solve. Yes, I know this sounds boring, but it is crucial - not only for DSPy but for any LLM project.

Consider the following aspects:

  • Expected Input/Output Behavior: What type of system are you building? (e.g., chatbot, code assistant, information extraction system)
  • Quality and Cost Specs: What are your budget constraints and performance requirements?
  • Language Model Selection: Which LM is most suitable for your task? (e.g., GPT-3.5, Mistral-7B, GPT-4-turbo). If you don't know yet, start with a state-of-the-art model - at the time of this writing, something like GPT-4o or Llama 3.1 70B. Downsizing comes later.

Pro tip: Start by creating 3-4 example inputs and outputs to help visualize your task (a small sketch follows below).

Note: Write these things down. We are not even kidding.
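
If it helps, here's a minimal sketch of what 'writing it down' can look like in code - a handful of hand-written input/output pairs for a hypothetical support-chatbot task (the questions and answers below are purely illustrative):

examples = [
    {
        "question": "How do I reset my password?",
        "answer": "Go to Settings -> Security -> 'Reset password' and follow the emailed link.",
    },
    {
        "question": "Can I export my data as CSV?",
        "answer": "Yes, open a report and click 'Export' in the top-right corner.",
    },
    {
        "question": "Which plans include single sign-on?",
        "answer": "Single sign-on is included in the Business and Enterprise plans.",
    },
]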

2. Define Your Pipeline

Next, outline the structure of your DSPy program:

  • Determine the necessary steps for solving your problem. Think about it in terms of a sequence of operations: what steps would you - as a fellow human - need to take? The same steps need to be taken by the LLM.
  • Consider if you need additional tools (e.g., retrieval, calculator, API integrations)
  • Start simple: Begin with a single dspy.ChainOfThought module and add complexity incrementally (see the minimal sketch after this list)

Again, write this down!
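
To make the 'start simple' advice concrete, here's a minimal sketch of a one-module pipeline - a single dspy.ChainOfThought driven by a string-form signature (the model is just an example and assumes an OpenAI API key is set in your environment):

import dspy

# Any configured LM works here; gpt-4o-mini is only an example.
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# "question -> answer" is a string-form signature: DSPy builds the prompt around it.
qa = dspy.ChainOfThought("question -> answer")
print(qa(question="What does pgvector add to PostgreSQL?").answer)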

3. Explore Examples

Run your initial examples through the pipeline:

  • Use a powerful LLM for exploration to understand the possibilities
  • Test with different LLMs to compare performance
  • Identify where the simple usage falls short. So, where is a simple prompt not enough?
  • Keep track of interesting examples, both easy and challenging ones
  • Repeat each of your simple examples 5 times (see the sketch after this list). Again, not kidding. Just because it works once doesn't mean it will always work. Note down the situations where it failed on repetition.
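
Here's a minimal sketch of that repetition exercise, assuming an OpenAI API key is set and using one of your own long-form texts (caching is disabled so the repeated runs can actually differ):

import dspy

dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini", cache=False))

summarize = dspy.ChainOfThought("text -> summary")
text = "..."  # paste one of your long-form example texts here

for i in range(5):
    print(f"--- run {i + 1} ---")
    print(summarize(text=text).summary)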

4. Define Your Data

Now that we know what to build, what to expect, and where simple prompts have their limitations, we need to think about optimization.

That's where a dataset comes into play. And actually that's where the strength and main value proposition of DSPy lies.

For our optimizations to work, we need examples. Good examples. What do we mean by examples? Questions and answers, or inputs and outputs of what your pipeline will produce. Let's say you build a simple chatbot. Then you'll need a set of questions and correct answers. The LLM will produce the answers in the real production pipeline, but for our optimization step, we need to 'show' what a good answer looks like.

If you build a classification system, you need to provide sample texts and correct classification results as part of your test data.

If you build a RAG system, you need to provide a question, relevant context and a correct answer for the user's questions.

I guess you get the idea.

Prepare your training and validation data:

  • Aim for 50-100 examples, with 300-500 being ideal
  • Utilize existing datasets if available (e.g., from HuggingFace)
  • Consider creating a small set of hand-crafted examples for unique tasks
  • Collect data through an initial system deployment if possible. Yes, deploy your system to production and let real users provide real-world use cases. You officially have permission to test in production!
  • Using strong LLMs to create synthetic data can also be an option; however, hand-crafted examples are preferable.

Note: When preparing the data, you might look a bit ahead to the metric section, as your validation data needs to be in the same format as your metric definition. E.g. if your metric produces a score from 0 to 10, your validation data also needs a column providing a score per row. Alternatively, you can use your metric program to calculate a score for your validation data. See more in our example below.
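
As a small preview of what such examples look like in DSPy (the field values below are placeholders, not real data): each row of your dataset becomes a dspy.Example, and with_inputs marks which fields are inputs - everything else is treated as a label.

import dspy

example = dspy.Example(
    text_section="Full text of a bill ...",  # placeholder input
    summary="A short gold summary ...",      # placeholder label
).with_inputs("text_section")

print(example.inputs())  # only the input fields
print(example.labels())  # only the label fields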

5. Define Your Metric

So, we're getting closer to the optimization step. But before we can run our optimizations, we need to define how to validate our results - we need to define a metric. Metrics in DSPy guide both evaluation and optimization of your language model pipeline. A well-defined metric helps you track progress and enables DSPy to enhance your program's effectiveness.

In DSPy, a metric is a function that takes two primary inputs:

  • An example from your dataset
  • The output (prediction) from your DSPy program

The metric function then returns a score that quantifies the quality of the output. This score can be a float, integer, or boolean value, depending on your task requirements.

There are three main types of metrics you can use in DSPy:

Simple Metrics

For straightforward tasks, you can use basic metrics such as:

  • Accuracy
  • Exact match
  • F1 score

Here's an example of a simple metric that compares the predicted answer to the correct answer:

def validate_answer(example, pred):
    return example.answer.lower() == pred.answer.lower()

DSPy also provides built-in utility metrics:

  • dspy.evaluate.metrics.answer_exact_match
  • dspy.evaluate.metrics.answer_passage_match

Complex Metrics

For more sophisticated applications, especially those involving long-form outputs, you may need to create a metric that checks multiple properties. Here's an example that evaluates both answer accuracy and context relevance:

def validate_context_and_answer(example, pred):
    # check that the gold label and the predicted answer are the same
    answer_match = example.answer.lower() == pred.answer.lower()
    # check that the predicted answer comes from one of the retrieved contexts
    context_match = any((pred.answer.lower() in c) for c in pred.context)
    return answer_match and context_match

Using AI Feedback for Metrics

For complex, long-form outputs, you can (and most probably want to) use AI feedback from language models to assess multiple dimensions of quality. Why? Because it's hard, if not impossible, to define the correctness of a natural-language answer by programmatic means. You mostly need an LLM to assess whether the answer is correct or not.

Here's an example using GPT-4 to evaluate generated tweets (shamelessly borrowed from DSPy's documentation).

class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""

    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Yes or No")


gpt4T = dspy.OpenAI(model='gpt-4-1106-preview', max_tokens=1000, model_type='chat')


def metric(gold, pred):
    question, answer, tweet = gold.question, gold.answer, pred.output

    engaging = "Does the assessed text make for a self-contained, engaging tweet?"
    correct = f"The text should answer `{question}` with `{answer}`. Does the assessed text contain this answer?"

    with dspy.context(lm=gpt4T):
        correct = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=correct)
        engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging)

    correct, engaging = [m.assessment_answer.lower() == 'yes' for m in [correct, engaging]]
    score = (correct + engaging) if correct and (len(tweet) <= 280) else 0

    return score / 2.0

This metric checks if the tweet:

  1. Correctly answers the given question
  2. Is engaging
  3. Adheres to Twitter's character limit

Note: The avid reader might object: if I need an LLM to judge my other LLM, who tells me that the second LLM is correct? That's a valid point. DSPy therefore even allows you to 'compile' the metric pipeline itself. Read more about it in the last section of DSPy's metrics documentation.

Iterative Refinement of DSPy metrics

Remember that defining an effective metric is an iterative process. Start with a simple metric, run evaluations, analyze the results, and refine your metric based on insights gained. As you iterate, you'll develop a more comprehensive and accurate way to assess your DSPy program's performance.

In summary, develop a way to measure the quality of your system's outputs:

  • Start with simple metrics for basic tasks (e.g., accuracy, F1 score)
  • For complex tasks, create a DSPy program to evaluate multiple output properties
  • Iterate on your metric definition as you refine your system

6. Collect Preliminary Evaluations

The steps of creating metrics and data are the most important but also the most time-consuming ones. So now we can relax a bit.

Get a sense of your pipeline's performance and note down the pre-optimization metric results - your baseline, so to speak.

  • Run evaluations on your pipeline using your defined data and metric
  • Analyze outputs and scores to identify major issues
  • Establish a baseline for measuring improvements (see the sketch below)
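
A minimal sketch of such a baseline run, assuming program, devset and metric are defined as in the full example further below:

import dspy

evaluate = dspy.Evaluate(devset=devset, metric=metric, display_progress=True)
baseline = evaluate(program)
print(baseline)  # note this down as your pre-optimization baseline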

7. Compile with a DSPy Optimizer

Finally, optimization. DSPy provides quite a few optimizers. Choose an appropriate DSPy optimizer based on your data availability (a minimal compile sketch follows the list below):

  • For very limited data (≈10 examples): Use BootstrapFewShot
  • For moderate data (≈50 examples): Try BootstrapFewShotWithRandomSearch
  • For larger datasets (300+ examples): Implement MIPRO
  • To optimize for efficiency with a smaller LLM: Apply BootstrapFinetune
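
For example, with only a handful of examples, a BootstrapFewShot compile run looks roughly like this (assuming program, trainset and metric from the previous steps; the demo counts are just starting points):

import dspy

optimizer = dspy.BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=4,  # demonstrations generated by the program itself
    max_labeled_demos=4,       # demonstrations taken directly from the trainset
)
optimized_program = optimizer.compile(program, trainset=trainset)
optimized_program.save("optimized_program.json")  # persist the compiled prompts/demos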

8. Iterate and Refine

After initial optimization:

  • Revisit previous steps to identify areas for improvement
  • Consider updating your task definition, data collection, metric, or program structure
  • Explore advanced features like DSPy Assertions
  • Experiment with multiple optimizers in sequence

Again, working with DSPy is an iterative process. Continuously refine your approach based on results and new insights to create the most effective LM pipeline for your specific task.

Note: We could generalize: Working with LLMs is an iterative process. This is not unique to DSPy.


Example: Optimizing text summarization with DSPy

So, to get started, let's think of a very simple pipeline: summarizing text. You might object that this is way too simple, but let's think about it more carefully. Is it really that simple? Summarizing text means finding the most important information in a text without being overly long, sticking to the truth, preserving the tone and target audience of the original text, and staying consistent in output quality across texts with different content, lengths, and complexity.

Text summarization is actually one of those examples where LLM projects are regularly classified as 'failures'. Early summarization attempts and prototypes produce great results. However, when running these programs in production, the many flaws of simple summarization systems surface. And all too often the AI is blamed for being 'dumb' when in reality the prompts are simply bad.

Let's now execute the 8 steps introduced above. First the theory, then the complete code sample.

  1. Define your task: Summarize text, keep the same tone and style as the original, maximum length of 300 characters.

  2. Define your pipeline: Use a simple dspy.ChainOfThought module with an LLM. No tools, retrievers or APIs required.

  3. Explore examples: Homework: run a few examples using plain OpenAI model calls. Use a single long-form text and run a summarization prompt 5 times. Experience the variety of the results.

  4. Defining the data: As per our task definition, we need a set of long-form texts and their corresponding summaries. We can use existing datasets like CNN/DailyMail or create our own. In this specific example we use the billsum dataset from Hugging Face. It's a dataset with the columns text, summary and title.

    The summary provided in this dataset will be treated as the 'gold' summary. Our own summarizer should produce a summary as close as possible to this one and will be judged against it.

  5. Define your metric: As it's nearly impossible to create a purely programmatic metric for judging whether our summary is close to the gold summary, we'll create an AI judge, which will score the similarity.

    Now, how do you create such a metric? Scoring summarization is surprisingly complex. The most common strategy is:

    • break down the text into its main ideas
    • score the importance of each idea
    • compare a created summary to the key ideas. Are they represented?
  6. Collect preliminary evaluations: Run the pipeline with the initial setup and note down the results.

  7. Compile with a DSPy Optimizer: Use the BootstrapFewShotWithRandomSearch optimizer to optimize the summarization prompt.

  8. Iterate and refine: After optimization, revisit the previous steps to identify areas for improvement. Consider updating your task definition, data collection, metric, or program structure. Experiment with multiple optimizers in sequence.

Code for text summarization with DSPy

Now we're finally ready for some code. Please refer to the inline comments for explanations.

import os

import dspy
import pandas as pd

os.environ["OPENAI_API_KEY"] = "sk-your-key"

# 4. Defining the data
splits = {
    "train": "data/train-00000-of-00001.parquet",
    "test": "data/test-00000-of-00001.parquet",
}
df = pd.read_parquet("hf://datasets/FiscalNote/billsum/" + splits["train"])

# Let's divide the dataset into train, dev and validation sets.
df_train = df[0:40]
df_dev = df[40:80]
df_val = df[80:120]
# Note: This dataset only has the columns 'text', 'summary' and 'title', but no 'score',
# which is required for our validation. We'll deal with that later.

dataset_train = df_train.apply(
    lambda row: dspy.Example(text_section=row["text"], summary=row["summary"]), axis=1
).tolist()

dataset_dev = df_dev.apply(
    lambda row: dspy.Example(text_section=row["text"], summary=row["summary"]), axis=1
).tolist()

dataset_val = df_val.apply(
    lambda row: dspy.Example(text_section=row["text"], summary=row["summary"]), axis=1
).tolist()

trainset = [x.with_inputs("text_section") for x in dataset_train]
valset = [x.with_inputs("text_section") for x in dataset_val]
devset = [x.with_inputs("text_section") for x in dataset_dev]


# 5. Define your metric
class KeyIdeas(dspy.Signature):
    """
    You'll get a long-form text section. Break it down into key ideas.
    Rate the importance of each key idea with High, Medium or Low.
    """

    text_section = dspy.InputField()
    key_ideas: str = dspy.OutputField(
        desc="list of key ideas, with one key idea per line. "
        + "Each key idea gets a number, "
        + "e.g. 1. <Idea here>: High"
    )
    importances: list[str] = dspy.OutputField(
        desc="list of importance levels for each key idea, "
        + 'e.g. ["High", "Medium", "Low"].'
    )


class SummaryRating(dspy.Signature):
    """
    You get an auto-generated summary. Compare it to the key ideas from
    the text section it was generated from.
    Create a binary score for each key idea: 1 if the key idea is present
    in the summary, 0 if not.
    Finally, create an overall score based on the binary scores.
    """

    key_ideas: str = dspy.InputField(
        desc="key ideas present in the text section to summarize"
    )
    summary: str = dspy.InputField()
    binary_scores: list[bool] = dspy.OutputField(
        desc="list of binary scores for each key idea, e.g. [1, 0, 1]"
    )
    overall_score: float = dspy.OutputField(
        desc="overall score for the summary out of 1.0"
    )


# We are still in step 5, defining the metric. Below is the
# final DSPy program used to calculate the metric.
class Metric(dspy.Module):
    """
    Compute a score for the correctness of a summary.
    """

    def __init__(self):
        super().__init__()
        self.extracted_key_ideas = dspy.ChainOfThought(KeyIdeas)
        self.rate = dspy.ChainOfThought(SummaryRating)

    def forward(self, example, pred, trace=None):
        extracted_key_ideas = self.extracted_key_ideas(
            text_section=example.text_section
        )
        key_ideas = extracted_key_ideas.key_ideas
        importances = extracted_key_ideas.importances

        scores = self.rate(
            key_ideas=key_ideas,
            summary=pred.summary,
        )

        try:
            # Weight each key idea by its importance; other labels (e.g. "Low") get 0.2.
            weight_map = {"High": 1.0, "Medium": 0.7}
            score = sum(
                weight_map.get(g, 0.2) * int(b)
                for g, b in zip(importances, scores.binary_scores)
            )
            score /= sum(weight_map.get(g, 0.2) for g in importances)
        except Exception:
            # Fall back to the LLM's own overall score if the weighted calculation fails.
            score = float(scores.overall_score)

        return score if trace is None else score >= 0.75


def metric(gold, pred, trace=None):
    metric_program = Metric()
    example = dspy.Example(text_section=gold.text_section)
    # pred is the prediction from our pipeline; extract just the summary text if present.
    predicted = dspy.Example(summary=getattr(pred, "summary", pred))
    pred_score = metric_program(example=example, pred=predicted)
    # In the next line we use our own metric program to generate a score for our
    # gold data. This ensures that both our gold data and our predicted data are
    # scored by the same program. However, it requires our metric program to be
    # quite good. Keep that in mind.
    gold_score = metric_program(example=example, pred=gold, trace=None)
    # check if they are almost equal
    return abs(float(gold_score) - float(pred_score)) < 0.2


# 6. Collect preliminary evaluations
# Prerequisite: create our pipeline program.
# Note: DSPy programs are always classes with an __init__ and a forward method.
class SummarizeSignature(dspy.Signature):
    """
    Given a text section, generate a summary.
    """

    text_section = dspy.InputField(desc="a text to summarize")
    summary: str = dspy.OutputField(desc="a concise summary of the text section")


class Summarize(dspy.Module):
    def __init__(self):
        super().__init__()
        self.summarize = dspy.ChainOfThought(SummarizeSignature)

    def forward(self, text_section: str):
        summary = self.summarize(text_section=text_section)
        # Note: You can add multiple dspy modules here for multi-step pipelines.
        # If you add multiple modules, DSPy will optimize all of them in one go.
        return summary


# Next, we define the LLMs we want to use: gpt-4o-mini for summarizing
# and gpt-4o as the teacher model - for optimizing our summary prompt.
lm = dspy.LM("openai/gpt-4o-mini", max_tokens=1000, cache=False)
gpt4T = dspy.LM("openai/gpt-4o", max_tokens=1000, cache=False)
dspy.settings.configure(lm=lm)

# And finally, let's create the program (the actual pipeline) we want to
# validate and optimize.
program = Summarize()

evaluate = dspy.Evaluate(
    devset=devset,
    metric=metric,
    display_progress=True,
    display_table=True,
    provide_traceback=True,
)

res = evaluate(program, devset=devset)
print(res)

This code snippet should have printed a percentage indicating how many of the generated summaries passed our metric. In my case I got 42 - meaning 42% of the summaries scored close enough to the gold summaries.

Internals: What prompt did we use to generate these summaries?

You might wonder what prompt was actually used to generate these summaries. The good news is: we don't need to think in prompts anymore - only in Signatures. Signatures define what your program should do by declaring its inputs and outputs; DSPy then creates a prompt for us.
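
For illustration, here are two equivalent ways to express the summarization task as a Signature - the inline string form and the class form used in the code above (both describe the same inputs and outputs):

import dspy

# Inline string signature: just name the inputs and outputs.
summarize_inline = dspy.ChainOfThought("text_section -> summary")


# Class-based signature: adds a task description and per-field descriptions.
class SummarizeSignature(dspy.Signature):
    """Given a text section, generate a summary."""

    text_section = dspy.InputField(desc="a text to summarize")
    summary: str = dspy.OutputField(desc="a concise summary of the text section")


summarize_class = dspy.ChainOfThought(SummarizeSignature)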

If you want to see the actual prompt used, run:

dspy.inspect_history(n=1)

This will output something like:

[2024-10-30T19:48:46.135956]

System message:

Your input fields are:
1. `key_ideas` (str): key ideas present in the text section to summarize
2. `summary` (str)

Your output fields are:
1. `reasoning` (str)
2. `binary_scores` (list[bool]): list of binary scores for each key idea, e.g. [1, 0, 1]
3. `overall_score` (float): overall score for the summary out of 1.0

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## key_ideas ## ]]
{key_ideas}

[[ ## summary ## ]]
{summary}

[[ ## reasoning ## ]]
{reasoning}

[[ ## binary_scores ## ]]
{binary_scores}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "array", "items": {"type": "boolean"}}

[[ ## overall_score ## ]]
{overall_score}        # note: the value you produce must be a single float value

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You get an auto-generated summary. Compare it to the key ideas from
        the text section it was generated from.
        Create a binary score for each key idea: 1 if the key idea is present
        in the summary, 0 if not.
        Finally, create an overall score based on the binary scores.


User message:

...

This output is quite helpful for understanding what DSPy is doing under the hood.

Ok, but 42% is not good enough, right? Let's try to improve this. For that, we use the BootstrapFewShotWithRandomSearch optimizer. It will automatically improve the prompt for us.

tp = dspy.BootstrapFewShotWithRandomSearch(
    metric=metric,
    num_threads=24,
    max_bootstrapped_demos=8,
    max_labeled_demos=8,
    num_candidate_programs=10,
    teacher_settings=dict(lm=gpt4T),
)

optimized_program = tp.compile(
    Summarize(),
    trainset=trainset,
    valset=valset,
)

result = evaluate(optimized_program)

In my example, the result was 68 - a significant improvement!

To compare the prompts used before and after the optimization, use:

dspy.inspect_history(n=2)

What's next

We've seen how DSPy works and how it automates prompt and LLM parameter optimization. We've also seen one very important aspect: by using DSPy, we moved away from hand-written prompts toward a more programmatic way of interacting with language models. By introducing Signatures, we define what our program should do, and by providing datasets we guide an optimizer to find the prompt for us.

This is not only a more scalable way of building AI applications (as the most time-consuming part - 'guessing' the right prompt - is automated), but it's also a golden path for adapting AI apps to different language models. Let's say you use GPT-4o as your default model, but then need to switch to a self-hosted open-source model like Llama. Without tools like DSPy, you'd need to carefully re-engineer, or at least validate, each and every prompt. By incorporating DSPy early in your development workflow, all you have to do is change the LLM in the DSPy config and run the optimizer again.
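
In code, such a model switch is roughly this small. A sketch assuming the summarization program from above; the Llama model identifier is only an example and depends on how you host the model:

import dspy

# Point DSPy at the new model - no prompts to rewrite.
lm = dspy.LM("ollama_chat/llama3.1", max_tokens=1000, cache=False)
dspy.settings.configure(lm=lm)

# Re-run the optimizer so prompts and demos are re-tuned for the new model.
tp = dspy.BootstrapFewShotWithRandomSearch(metric=metric, num_candidate_programs=10)
optimized_for_llama = tp.compile(Summarize(), trainset=trainset, valset=valset)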
