DSPy: Build Better AI Systems with Automated Prompt Optimization

We've all been there. After hours of crafting the perfect prompt, you finally get your LLM to generate exactly what you want. The output is clear, accurate, and perfectly formatted. You lean back in your chair, feeling victorious – only to watch your carefully engineered prompt produce completely different, often nonsensical results on the very next run.

This "works once, fails often" phenomenon is the daily reality of prompt engineering. It's like trying to nail jelly to a wall. What works brilliantly in one moment might generate garbage the next, leaving developers frustrated and systems unreliable. The random nature of Large Language Models means that even when you think you've cracked the code, you're often just experiencing a lucky hit rather than a reproducible solution.

Traditional prompt engineering feels more like alchemy than engineering – a mix of gut feeling, trial and error, and hoping for the best. It's a time-consuming process of tweaking words, adjusting contexts, and crossing fingers, only to discover that your prompts need to be completely reworked when you switch to a different model or slightly modify your use case.

This is where DSPy comes in. Instead of playing this endless game of prompt whack-a-mole, DSPy offers a systematic, programmatic approach to building reliable AI systems. In this tutorial, we'll explore how DSPy can transform your prompt engineering workflow from artistic guesswork into a robust, reproducible process.

Installing DSPy with pgvector extras

First things first, let's install DSPy. We'll install the base package along with the pgvector extras, which are required for optimizing prompts in combination with pgvector. As we are big fans of PostgreSQL and pgvector, we'll likely use it in one of our next DSPy tutorials.

pip install dspy-ai[pgvector]

Just so you know, there are plenty of other extras available; to install them, follow the official installation guide.

Working with DSPy - the 8 steps to prompt programming

On their excellent website, DSPy describes the 8 steps necessary to create great prompts. We'll look at them here in more detail to better understand the examples that follow.

We'll throw a bunch of new concepts at you - just bear with us. It's important to learn all of this; many things will become clearer in the next chapter when we run our first example.

1. Define Your Task

The first step in using DSPy is to clearly define the problem you're trying to solve. Yes, I know this sounds boring, but it is crucial - not only for DSPy but for any LLM project.

Consider the following aspects:

  • Expected Input/Output Behavior: What type of system are you building? (e.g., chatbot, code assistant, information extraction system)
  • Quality and Cost Specs: What are your budget constraints and performance requirements?
  • Language Model Selection: Which LM is most suitable for your task? (e.g., GPT-3.5, Mistral-7B, GPT-4-turbo). If you don't know yet, start with a state-of-the-art model - at the time of this writing, something like GPT-4o or Llama 3.1 70B. Downsizing comes later.

Pro tip: Start by creating 3-4 example inputs and outputs to help visualize your task (a small sketch follows below).

Note: Write these things down. We are not even kidding.
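
If it helps, here's a minimal sketch of what 'writing it down' can look like in code - a handful of hand-written input/output pairs for a hypothetical support-chatbot task (the questions and answers below are purely illustrative):

examples = [
    {
        "question": "How do I reset my password?",
        "answer": "Go to Settings -> Security -> 'Reset password' and follow the emailed link.",
    },
    {
        "question": "Can I export my data as CSV?",
        "answer": "Yes, open a report and click 'Export' in the top-right corner.",
    },
    {
        "question": "Which plans include single sign-on?",
        "answer": "Single sign-on is included in the Business and Enterprise plans.",
    },
]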

2. Define Your Pipeline

Next, outline the structure of your DSPy program:

  • Determine the necessary steps for solving your problem. Think about it in terms of a sequence of operations: what steps would you - as a fellow human - need to take? The same steps need to be taken by the LLM.
  • Consider if you need additional tools (e.g., retrieval, calculator, API integrations)
  • Start simple: Begin with a single dspy.ChainOfThought module and add complexity incrementally (see the minimal sketch after this list)

Again, write this down!
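
To make the 'start simple' advice concrete, here's a minimal sketch of a one-module pipeline - a single dspy.ChainOfThought driven by a string-form signature (the model is just an example and assumes an OpenAI API key is set in your environment):

import dspy

# Any configured LM works here; gpt-4o-mini is only an example.
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# "question -> answer" is a string-form signature: DSPy builds the prompt around it.
qa = dspy.ChainOfThought("question -> answer")
print(qa(question="What does pgvector add to PostgreSQL?").answer)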

3. Explore Examples

Run your initial examples through the pipeline:

  • Use a powerful LLM for exploration to understand the possibilities
  • Test with different LLMs to compare performance
  • Identify where the simple usage falls short. So, where is a simple prompt not enough?
  • Keep track of interesting examples, both easy and challenging ones
  • Repeat each of your simple examples 5 times (see the sketch after this list). Again, not kidding. Just because it works once doesn't mean it will always work. Note down the situations where it failed on repetition.
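
Here's a minimal sketch of that repetition exercise, assuming an OpenAI API key is set and using one of your own long-form texts (caching is disabled so the repeated runs can actually differ):

import dspy

dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini", cache=False))

summarize = dspy.ChainOfThought("text -> summary")
text = "..."  # paste one of your long-form example texts here

for i in range(5):
    print(f"--- run {i + 1} ---")
    print(summarize(text=text).summary)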

4. Define Your Data

Now that we know what to build, what to expect, and where simple prompts have their limitations, we need to think about optimization.

That's where a dataset comes into play. And actually that's where the strength and main value proposition of DSPy lies.

For our optimizations to work, we need examples. Good examples. What do we mean by examples? Questions and answers, or inputs and outputs of what your pipeline will produce. Let's say you build a simple chatbot. Then you'll need a set of questions and correct answers. The LLM will produce the answers in the real production pipeline, but for our optimization step, we need to 'show' what a good answer looks like.

If you build a classification system, you need to provide sample texts and correct classification results as part of your test data.

If you build a RAG system, you need to provide a question, relevant context and a correct answer for the user's questions.

I guess you get the idea.

Prepare your training and validation data:

  • Aim for 50-100 examples, with 300-500 being ideal
  • Utilize existing datasets if available (e.g., from HuggingFace)
  • Consider creating a small set of hand-crafted examples for unique tasks
  • Collect data through an initial system deployment if possible. Yes, deploy your system to production and let real users provide real-world use cases. You officially have permission to test in production!
  • Using strong LLMs to create synthetic data can also be an option; however, hand-crafted examples are preferable.

Note: When preparing the data, you might look a bit ahead to the metric section, as your validation data needs to be in the same format as your metric definition. E.g. if your metric produces a score from 0 to 10, your validation data also needs a column providing a score per row. Alternatively, you can use your metric program to calculate a score for your validation data. See more in our example below.
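
As a small preview of what such examples look like in DSPy (the field values below are placeholders, not real data): each row of your dataset becomes a dspy.Example, and with_inputs marks which fields are inputs - everything else is treated as a label.

import dspy

example = dspy.Example(
    text_section="Full text of a bill ...",  # placeholder input
    summary="A short gold summary ...",      # placeholder label
).with_inputs("text_section")

print(example.inputs())  # only the input fields
print(example.labels())  # only the label fields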

5. Define Your Metric

So, we're getting closer to the optimization step. But before we can run our optimizations, we need to define how to validate our results - we need to define a metric. Metrics in DSPy guide both evaluation and optimization of your language model pipeline. A well-defined metric helps you track progress and enables DSPy to enhance your program's effectiveness.

In DSPy, a metric is a function that takes two primary inputs:

  • An example from your dataset
  • The output (prediction) from your DSPy program

The metric function then returns a score that quantifies the quality of the output. This score can be a float, integer, or boolean value, depending on your task requirements.

There are three main types of metrics you can use in DSPy:

Simple Metrics

For straightforward tasks, you can use basic metrics such as:

  • Accuracy
  • Exact match
  • F1 score

Here's an example of a simple metric that compares the predicted answer to the correct answer:

def validate_answer(example, pred):
    return example.answer.lower() == pred.answer.lower()

DSPy also provides built-in utility metrics:

  • dspy.evaluate.metrics.answer_exact_match
  • dspy.evaluate.metrics.answer_passage_match

Complex Metrics

For more sophisticated applications, especially those involving long-form outputs, you may need to create a metric that checks multiple properties. Here's an example that evaluates both answer accuracy and context relevance:

def validate_context_and_answer(example, pred):
    # check that the gold label and the predicted answer are the same
    answer_match = example.answer.lower() == pred.answer.lower()
    # check that the predicted answer comes from one of the retrieved contexts
    context_match = any((pred.answer.lower() in c) for c in pred.context)
    return answer_match and context_match

Using AI Feedback for Metrics

For complex, long-form outputs, you can (and most probably want to) use AI feedback from language models to assess multiple dimensions of quality. Why? Because it's hard, if not impossible, to define the correctness of a natural-language answer by programmatic means. You mostly need an LLM to assess whether the answer is correct or not.

Here's an example using GPT-4 to evaluate generated tweets (shamelessly borrowed from DSPy's documentation).

class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""

    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Yes or No")


gpt4T = dspy.OpenAI(model='gpt-4-1106-preview', max_tokens=1000, model_type='chat')


def metric(gold, pred):
    question, answer, tweet = gold.question, gold.answer, pred.output

    engaging = "Does the assessed text make for a self-contained, engaging tweet?"
    correct = f"The text should answer `{question}` with `{answer}`. Does the assessed text contain this answer?"

    with dspy.context(lm=gpt4T):
        correct = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=correct)
        engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging)

    correct, engaging = [m.assessment_answer.lower() == 'yes' for m in [correct, engaging]]
    score = (correct + engaging) if correct and (len(tweet) <= 280) else 0

    return score / 2.0

This metric checks if the tweet:

  1. Correctly answers the given question
  2. Is engaging
  3. Adheres to Twitter's character limit

Note: The avid reader might object: if I need an LLM to judge my other LLM, who tells me that the second LLM is correct? That's a valid point. DSPy therefore even allows you to 'compile' the metric pipeline itself. Read more about it in the last section of DSPy's metrics documentation.

Iterative Refinement of DSPy metrics

Remember that defining an effective metric is an iterative process. Start with a simple metric, run evaluations, analyze the results, and refine your metric based on insights gained. As you iterate, you'll develop a more comprehensive and accurate way to assess your DSPy program's performance.

In summary, develop a way to measure the quality of your system's outputs:

  • Start with simple metrics for basic tasks (e.g., accuracy, F1 score)
  • For complex tasks, create a DSPy program to evaluate multiple output properties
  • Iterate on your metric definition as you refine your system

6. Collect Preliminary Evaluations

The steps of creating metrics and data are the most important but also the most time-consuming ones. So now we can relax a bit.

Get a sense of your pipeline's performance and note down the pre-optimization metric results - your baseline, so to speak.

  • Run evaluations on your pipeline using your defined data and metric
  • Analyze outputs and scores to identify major issues
  • Establish a baseline for measuring improvements (see the sketch below)
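
A minimal sketch of such a baseline run, assuming program, devset and metric are defined as in the full example further below:

import dspy

evaluate = dspy.Evaluate(devset=devset, metric=metric, display_progress=True)
baseline = evaluate(program)
print(baseline)  # note this down as your pre-optimization baseline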

7. Compile with a DSPy Optimizer

Finally, optimization. DSPy provides quite a few optimizers. Choose an appropriate DSPy optimizer based on your data availability (a minimal compile sketch follows the list below):

  • For very limited data (≈10 examples): Use BootstrapFewShot
  • For moderate data (≈50 examples): Try BootstrapFewShotWithRandomSearch
  • For larger datasets (300+ examples): Implement MIPRO
  • To optimize for efficiency with a smaller LLM: Apply BootstrapFinetune
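
For example, with only a handful of examples, a BootstrapFewShot compile run looks roughly like this (assuming program, trainset and metric from the previous steps; the demo counts are just starting points):

import dspy

optimizer = dspy.BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=4,  # demonstrations generated by the program itself
    max_labeled_demos=4,       # demonstrations taken directly from the trainset
)
optimized_program = optimizer.compile(program, trainset=trainset)
optimized_program.save("optimized_program.json")  # persist the compiled prompts/demos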

8. Iterate and Refine

After initial optimization:

  • Revisit previous steps to identify areas for improvement
  • Consider updating your task definition, data collection, metric, or program structure
  • Explore advanced features like DSPy Assertions
  • Experiment with multiple optimizers in sequence

Again, working with DSPy is an iterative process. Continuously refine your approach based on results and new insights to create the most effective LM pipeline for your specific task.

Note: We could generalize: Working with LLMs is an iterative process. This is not unique to DSPy.


Example: Optimizing text summarization with DSPy

So, to get started, let's think of a very simple pipeline: summarizing text. You might object that this is way too simple, but let's think about it more carefully. Is it really that simple? Summarizing text means finding the most important information in a text without being overly long, sticking to the truth, preserving the tone and target audience of the original text, and staying consistent in output quality across texts with different content, lengths, and complexity.

Text summarization is actually one of those examples where LLM projects are regularly classified as 'failures'. Early summarization attempts and prototypes produce great results. However, when running these programs in production, the many flaws of simple summarization systems surface. And all too often the AI is blamed for being 'dumb' when in reality the prompts are simply bad.

Let's now execute the 8 steps introduced above. First the theory, then the complete code sample.

  1. Define your task: Summarize text, keep the same tone and style as the original, maximum length of 300 characters.

  2. Define your pipeline: Use a simple dspy.ChainOfThought module with an LLM. No tools, retrievers or APIs required.

  3. Explore examples: Homework: run a few examples using plain OpenAI model calls. Use a single long-form text and run a summarization prompt 5 times. Experience the variety of the results.

  4. Defining the data: As per our task definition, we need a set of long-form texts and their corresponding summaries. We can use existing datasets like CNN/DailyMail or create our own. In this specific example we use the billsum dataset from Hugging Face. It's a dataset with the columns text, summary and title.

    The summary provided in this dataset will be treated as the 'gold' summary. Our own summarizer should produce a summary as close as possible to this one and will be judged against it.

  5. Define your metric: As it's nearly impossible to create a purely programmatic metric for judging whether our summary is close to the gold summary, we'll create an AI judge, which will score the similarity.

    Now, how do you create such a metric? Scoring summarization is surprisingly complex. The most common strategy is:

    • break down the text into its main ideas
    • score the importance of each idea
    • compare a created summary to the key ideas. Are they represented?
  6. Collect preliminary evaluations: Run the pipeline with the initial setup and note down the results.

  7. Compile with a DSPy Optimizer: Use the BootstrapFewShotWithRandomSearch optimizer to optimize the summarization prompt.

  8. Iterate and refine: After optimization, revisit the previous steps to identify areas for improvement. Consider updating your task definition, data collection, metric, or program structure. Experiment with multiple optimizers in sequence.

Code for text summarization with DSPy

Now we're finally ready for some code. Please refer to the inline comments for explanations.

import os

import dspy
import pandas as pd

os.environ["OPENAI_API_KEY"] = "sk-your-key"

# 4. Defining the data
splits = {
    "train": "data/train-00000-of-00001.parquet",
    "test": "data/test-00000-of-00001.parquet",
}
df = pd.read_parquet("hf://datasets/FiscalNote/billsum/" + splits["train"])

# Let's divide the dataset into train, dev and validation sets.
df_train = df[0:40]
df_dev = df[40:80]
df_val = df[80:120]
# Note: This dataset only has the columns 'text', 'summary' and 'title', but no 'score',
# which is required for our validation. We'll deal with that later.

dataset_train = df_train.apply(
    lambda row: dspy.Example(text_section=row["text"], summary=row["summary"]), axis=1
).tolist()

dataset_dev = df_dev.apply(
    lambda row: dspy.Example(text_section=row["text"], summary=row["summary"]), axis=1
).tolist()

dataset_val = df_val.apply(
    lambda row: dspy.Example(text_section=row["text"], summary=row["summary"]), axis=1
).tolist()

trainset = [x.with_inputs("text_section") for x in dataset_train]
valset = [x.with_inputs("text_section") for x in dataset_val]
devset = [x.with_inputs("text_section") for x in dataset_dev]


# 5. Define your metric
class KeyIdeas(dspy.Signature):
    """
    You'll get a long-form text section. Break it down into key ideas.
    Rate the importance of each key idea with High, Medium or Low.
    """

    text_section = dspy.InputField()
    key_ideas: str = dspy.OutputField(
        desc="list of key ideas, with one key idea per line. "
        + "Each key idea gets a number, "
        + "e.g. 1. <Idea here>: High"
    )
    importances: list[str] = dspy.OutputField(
        desc="list of importance levels for each key idea, "
        + 'e.g. ["High", "Medium", "Low"].'
    )


class SummaryRating(dspy.Signature):
    """
    You get an auto-generated summary. Compare it to the key ideas from
    the text section it was generated from.
    Create a binary score for each key idea: 1 if the key idea is present
    in the summary, 0 if not.
    Finally, create an overall score based on the binary scores.
    """

    key_ideas: str = dspy.InputField(
        desc="key ideas present in the text section to summarize"
    )
    summary: str = dspy.InputField()
    binary_scores: list[bool] = dspy.OutputField(
        desc="list of binary scores for each key idea, e.g. [1, 0, 1]"
    )
    overall_score: float = dspy.OutputField(
        desc="overall score for the summary out of 1.0"
    )


# We are still in step 5, defining the metric. Below is the
# final DSPy program used to calculate the metric.
class Metric(dspy.Module):
    """
    Compute a score for the correctness of a summary.
    """

    def __init__(self):
        super().__init__()
        self.extracted_key_ideas = dspy.ChainOfThought(KeyIdeas)
        self.rate = dspy.ChainOfThought(SummaryRating)

    def forward(self, example, pred, trace=None):
        extracted_key_ideas = self.extracted_key_ideas(
            text_section=example.text_section
        )
        key_ideas = extracted_key_ideas.key_ideas
        importances = extracted_key_ideas.importances

        scores = self.rate(
            key_ideas=key_ideas,
            summary=pred.summary,
        )

        try:
            # Weight each key idea by its importance; other labels (e.g. "Low") get 0.2.
            weight_map = {"High": 1.0, "Medium": 0.7}
            score = sum(
                weight_map.get(g, 0.2) * int(b)
                for g, b in zip(importances, scores.binary_scores)
            )
            score /= sum(weight_map.get(g, 0.2) for g in importances)
        except Exception:
            # Fall back to the LLM's own overall score if the weighted calculation fails.
            score = float(scores.overall_score)

        return score if trace is None else score >= 0.75


def metric(gold, pred, trace=None):
    metric_program = Metric()
    example = dspy.Example(text_section=gold.text_section)
    # pred is the prediction from our pipeline; extract just the summary text if present.
    predicted = dspy.Example(summary=getattr(pred, "summary", pred))
    pred_score = metric_program(example=example, pred=predicted)
    # In the next line we use our own metric program to generate a score for our
    # gold data. This ensures that both our gold data and our predicted data are
    # scored by the same program. However, it requires our metric program to be
    # quite good. Keep that in mind.
    gold_score = metric_program(example=example, pred=gold, trace=None)
    # check if they are almost equal
    return abs(float(gold_score) - float(pred_score)) < 0.2


# 6. Collect preliminary evaluations
# Prerequisite: create our pipeline program.
# Note: DSPy programs are always classes with an __init__ and a forward method.
class SummarizeSignature(dspy.Signature):
    """
    Given a text section, generate a summary.
    """

    text_section = dspy.InputField(desc="a text to summarize")
    summary: str = dspy.OutputField(desc="a concise summary of the text section")


class Summarize(dspy.Module):
    def __init__(self):
        super().__init__()
        self.summarize = dspy.ChainOfThought(SummarizeSignature)

    def forward(self, text_section: str):
        summary = self.summarize(text_section=text_section)
        # Note: You can add multiple dspy modules here for multi-step pipelines.
        # If you add multiple modules, DSPy will optimize all of them in one go.
        return summary


# Next, we define the LLMs we want to use: gpt-4o-mini for summarizing
# and gpt-4o as the teacher model - for optimizing our summary prompt.
lm = dspy.LM("openai/gpt-4o-mini", max_tokens=1000, cache=False)
gpt4T = dspy.LM("openai/gpt-4o", max_tokens=1000, cache=False)
dspy.settings.configure(lm=lm)

# And finally, let's create the program (the actual pipeline) we want to
# validate and optimize.
program = Summarize()

evaluate = dspy.Evaluate(
    devset=devset,
    metric=metric,
    display_progress=True,
    display_table=True,
    provide_traceback=True,
)

res = evaluate(program, devset=devset)
print(res)

This code snippet should have printed a percentage indicating how many of the generated summaries passed our metric. In my case I got 42 - meaning 42% of the summaries scored close enough to the gold summaries.

Internals: What prompt did we use to generate these summaries?

You might wonder what prompt was actually used to generate these summaries. The good news is: we don't need to think in prompts anymore - only in Signatures. Signatures define what your program should do by declaring its inputs and outputs; DSPy then creates a prompt for us.
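
For illustration, here are two equivalent ways to express the summarization task as a Signature - the inline string form and the class form used in the code above (both describe the same inputs and outputs):

import dspy

# Inline string signature: just name the inputs and outputs.
summarize_inline = dspy.ChainOfThought("text_section -> summary")


# Class-based signature: adds a task description and per-field descriptions.
class SummarizeSignature(dspy.Signature):
    """Given a text section, generate a summary."""

    text_section = dspy.InputField(desc="a text to summarize")
    summary: str = dspy.OutputField(desc="a concise summary of the text section")


summarize_class = dspy.ChainOfThought(SummarizeSignature)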

If you want to see the actual prompt used, run:

dspy.inspect_history(n=1)

This will output something like:

[2024-10-30T19:48:46.135956]

System message:

Your input fields are:
1. `key_ideas` (str): key ideas present in the text section to summarize
2. `summary` (str)

Your output fields are:
1. `reasoning` (str)
2. `binary_scores` (list[bool]): list of binary scores for each key idea, e.g. [1, 0, 1]
3. `overall_score` (float): overall score for the summary out of 1.0

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## key_ideas ## ]]
{key_ideas}

[[ ## summary ## ]]
{summary}

[[ ## reasoning ## ]]
{reasoning}

[[ ## binary_scores ## ]]
{binary_scores}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "array", "items": {"type": "boolean"}}

[[ ## overall_score ## ]]
{overall_score}        # note: the value you produce must be a single float value

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You get an auto-generated summary. Compare it to the key ideas from
        the text section it was generated from.
        Create a binary score for each key idea: 1 if the key idea is present
        in the summary, 0 if not.
        Finally, create an overall score based on the binary scores.


User message:

...

This output is quite helpful for understanding what DSPy is doing under the hood.

Ok, but 42% is not good enough, right? Let's try to improve this. For that, we use the BootstrapFewShotWithRandomSearch optimizer. It will automatically improve the prompt for us.

tp = dspy.BootstrapFewShotWithRandomSearch(
    metric=metric,
    num_threads=24,
    max_bootstrapped_demos=8,
    max_labeled_demos=8,
    num_candidate_programs=10,
    teacher_settings=dict(lm=gpt4T),
)

optimized_program = tp.compile(
    Summarize(),
    trainset=trainset,
    valset=valset,
)

result = evaluate(optimized_program)

In my example, the result was 68 - a significant improvement!

To compare the prompts used before and after the optimization, use:

dspy.inspect_history(n=2)

What's next

We've seen how DSPy works and how it automates prompt and LLM parameter optimization. We've also seen one very important aspect: by using DSPy, we moved away from hand-written prompts toward a more programmatic way of interacting with language models. By introducing Signatures, we define what our program should do, and by providing datasets we guide an optimizer to find the prompt for us.

This is not only a more scalable way of building AI applications (as the most time-consuming part - 'guessing' the right prompt - is automated), but it's also a golden path for adapting AI apps to different language models. Let's say you use GPT-4o as your default model, but then need to switch to a self-hosted open-source model like Llama. Without tools like DSPy, you'd need to carefully re-engineer, or at least validate, each and every prompt. By incorporating DSPy early in your development workflow, all you have to do is change the LLM in the DSPy config and run the optimizer again.
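
In code, such a model switch is roughly this small. A sketch assuming the summarization program from above; the Llama model identifier is only an example and depends on how you host the model:

import dspy

# Point DSPy at the new model - no prompts to rewrite.
lm = dspy.LM("ollama_chat/llama3.1", max_tokens=1000, cache=False)
dspy.settings.configure(lm=lm)

# Re-run the optimizer so prompts and demos are re-tuned for the new model.
tp = dspy.BootstrapFewShotWithRandomSearch(metric=metric, num_candidate_programs=10)
optimized_for_llama = tp.compile(Summarize(), trainset=trainset, valset=valset)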
