DSPy: Build Better AI Systems with Automated Prompt Optimization
We've all been there. After hours of crafting the perfect prompt, you finally get your LLM to generate exactly what you want. The output is clear, accurate, and perfectly formatted. You lean back in your chair, feeling victorious – only to watch your carefully engineered prompt produce completely different, often nonsensical results on the very next run.
This "works once, fails often" phenomenon is the daily reality of prompt engineering. It's like trying to nail jelly to a wall. What works brilliantly in one moment might generate garbage the next, leaving developers frustrated and systems unreliable. The random nature of Large Language Models means that even when you think you've cracked the code, you're often just experiencing a lucky hit rather than a reproducible solution.
Traditional prompt engineering feels more like alchemy than engineering – a mix of gut feeling, trial and error, and hoping for the best. It's a time-consuming process of tweaking words, adjusting contexts, and crossing fingers, only to discover that your prompts need to be completely reworked when you switch to a different model or slightly modify your use case.
This is where DSPy comes in. Instead of playing this endless game of prompt whack-a-mole, DSPy offers a systematic, programmatic approach to building reliable AI systems. In this tutorial, we'll explore how DSPy can transform your prompt engineering workflow from artistic guesswork into a robust, reproducible process.
Installing DSPy with pgvector extras
First things first, let's install the thing. We'll install the base DSPy package along with the pgvector extras, which are required for optimizing prompts in combination with pgvector. As we are big fans of PostgreSQL and pgvector, we'll probably use it in one of our next DSPy tutorials.
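At the time of writing, that looks roughly like this (check the official guide for the exact package name and extras for your DSPy version):

```bash
# Base DSPy plus the pgvector extra.
# The package name and available extras may differ between versions;
# see the official installation guide if this doesn't resolve.
pip install "dspy-ai[pgvector]"
```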
Just so you know, there are a ton of other extras available; just follow the official installation guide.
Working with DSPy - the 8 steps to prompt programming
On their great website, the DSPy team describes the 8 steps necessary to create great prompts. We'll look at them here in more detail to better understand the following examples.
We'll throw a bunch of new concepts at you - just bear with us. It's important to learn all of this, but many things will become clearer in the next chapter when we run our first example.
1. Define Your Task
The first step in using DSPy is to clearly define the problem you're trying to solve. Yes, I know this sounds boring, but it is crucial, not only for DSPy but for any LLM project.
Consider the following aspects:
- Expected Input/Output Behavior: What type of system are you building? (e.g., chatbot, code assistant, information extraction system)
- Quality and Cost Specs: What are your budget constraints and performance requirements?
- Language Model Selection: Which LM is most suitable for your task? (e.g., GPT-3.5, Mistral-7B, GPT-4-turbo). If you don't know yet, start with a state-of-the-art model like GPT-4o or Llama 3.1 70B at the time of this writing. Downsizing comes later.
Pro tip: Start by creating 3-4 example inputs and outputs to help visualize your task.
Note: Write these things down. We are not even kidding.
2. Define Your Pipeline
Next, outline the structure of your DSPy program:
- Determine the necessary steps for solving your problem. Think about it in terms of a sequence of operations. What steps would you - as a fellow human - need to take? The same steps need to be done by an LLM.
- Consider if you need additional tools (e.g., retrieval, calculator, API integrations)
- Start simple: Begin with a single dspy.ChainOfThought module and add complexity incrementally (see the short sketch below)
Again, write this down!
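To make "simple" concrete, here's a minimal sketch (the inline signature string is just an illustration):

```python
import dspy

# A one-line pipeline: question in, answer out, with chain-of-thought reasoning.
qa = dspy.ChainOfThought("question -> answer")

# Calling it returns a prediction with .reasoning and .answer fields.
# qa(question="What is pgvector?")
```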
3. Explore Examples
Run your initial examples through the pipeline:
- Use a powerful LLM for exploration to understand the possibilities
- Test with different LLMs to compare performance
- Identify where the simple usage falls short. So, where is a simple prompt not enough?
- Keep track of interesting examples, both easy and challenging ones
- Repeat each of your simple examples 5 times. Again, not kidding. Just because it works once doesn't mean it will always work. Note down the situations where it failed on repetition.
4. Define Your Data
Now that we know what to build, what to expect, and where simple prompts have their limitations, we need to think about optimizing.
That's where a dataset comes into play. And actually that's where the strength and main value proposition of DSPy lies.
For our optimizations to work, we need examples. Good examples. What do we mean by examples? Questions and answers, or inputs and outputs of what your pipeline will produce. Let's say you build a simple chatbot. Then you'll need a set of questions and correct answers. The LLM will produce the answers in the real production pipeline, but for our optimization step we need to 'show' it what a good answer looks like.
If you build a classification system, you need to provide sample texts and correct classification results as part of your test data.
If you build a RAG system, you need to provide a question, the relevant context, and the correct answer for the user's questions.
I guess you get the idea.
Prepare your training and validation data:
- Aim for 50-100 examples, with 300-500 being ideal
- Utilize existing datasets if available (e.g., from HuggingFace)
- Consider creating a small set of hand-crafted examples for unique tasks
- Collect data through an initial system deployment if possible. Yes, deploy your system to production. Let real users provide real-world use cases. You officially have permission to test in production!
- Using strong LLMs to create synthetic data can also be an option; however, hand-crafted examples are preferable.
Note: When preparing the data, you might want to look a bit ahead to the metric section, as your validation data needs to be in the same format as your metric definition. E.g., if your metric produces a score from 0 to 10, your validation data also needs a column providing a score per row. Alternatively, you can use your metric program to calculate a score for your validation data. See more in our example below.
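For instance, a minimal chatbot dataset expressed as DSPy examples might look like this (the field names and contents are purely illustrative):

```python
import dspy

# Each example holds the fields your pipeline works with.
# with_inputs() marks which fields are inputs; the rest are treated as labels.
trainset = [
    dspy.Example(
        question="What is pgvector?",
        answer="A PostgreSQL extension for storing and querying vector embeddings.",
    ).with_inputs("question"),
    dspy.Example(
        question="Which database does pgvector extend?",
        answer="PostgreSQL",
    ).with_inputs("question"),
]
```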
5. Define Your Metric
So, we're getting closer to the optimization step. But before we can run our optimizations, we need to define how to validate our results - we need to define a metric. Metrics in DSPy guide both evaluation and optimization of your language model pipeline. A well-defined metric helps you track progress and enables DSPy to enhance your program's effectiveness.
In DSPy, a metric is a function that takes two primary inputs:
- An example from your dataset
- The output (prediction) from your DSPy program
The metric function then returns a score that quantifies the quality of the output. This score can be a float, integer, or boolean value, depending on your task requirements.
There are three main types of metrics you can use in DSPy:
Simple Metrics
For straightforward tasks, you can use basic metrics such as:
- Accuracy
- Exact match
- F1 score
Here's an example of a simple metric that compares the predicted answer to the correct answer:
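(A minimal sketch; it assumes both the example and the prediction expose an answer field.)

```python
# Exact, case-insensitive match between the gold answer and the predicted answer.
def validate_answer(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()
```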
DSPy also provides built-in utility metrics:
- dspy.evaluate.metrics.answer_exact_match
- dspy.evaluate.metrics.answer_passage_match
Complex Metrics
For more sophisticated applications, especially those involving long-form outputs, you may need to create a metric that checks multiple properties. Here's an example that evaluates both answer accuracy and context relevance:
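(A sketch in the spirit of the DSPy docs; it assumes the prediction carries both an answer and the retrieved context.)

```python
def validate_context_and_answer(example, pred, trace=None):
    # Property 1: the predicted answer matches the gold answer.
    answer_match = example.answer.lower() == pred.answer.lower()
    # Property 2: the answer is actually grounded in the retrieved context.
    context_match = any(example.answer.lower() in c.lower() for c in pred.context)

    # During evaluation return a partial score; during optimization be strict.
    if trace is None:
        return (answer_match + context_match) / 2.0
    return answer_match and context_match
```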
Using AI Feedback for Metrics
For complex, long-form outputs, you can (and most probably want to) use AI feedback from language models to assess multiple dimensions of quality. Why? Because it's hard, if not impossible, to define the correctness of a natural-language answer by programmatic means. You mostly need an LLM to assess whether the answer is correct or not.
Here's an example using GPT-4 to evaluate generated tweets (shamelessly borrowed from DSPy's documentation).
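The sketch below is lightly adapted; the judge model name and the pred.output field are assumptions you'd adjust to your own program.

```python
import dspy

# A signature for the judge: it answers a yes/no question about a piece of text.
class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""
    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Yes or No")

# The judge LM (a GPT-4 class model). Older DSPy versions use dspy.OpenAI(model=...).
judge_lm = dspy.LM("openai/gpt-4o")

def tweet_metric(gold, pred, trace=None):
    # Assumes the gold example has question/answer fields and the program
    # returns its tweet in pred.output.
    question, answer, tweet = gold.question, gold.answer, pred.output

    correct_q = (f"The text should answer `{question}` with `{answer}`. "
                 "Does the assessed text contain this answer?")
    engaging_q = "Does the assessed text make for a self-contained, engaging tweet?"

    # Ask the judge model both questions.
    with dspy.context(lm=judge_lm):
        correct = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=correct_q)
        engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging_q)

    correct, engaging = [a.assessment_answer.lower() == "yes" for a in (correct, engaging)]
    score = (correct + engaging) if correct and (len(tweet) <= 280) else 0

    # During optimization require a perfect score; during evaluation return a fraction.
    if trace is not None:
        return score >= 2
    return score / 2.0
```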
This metric checks if the tweet:
- Correctly answers the given question
- Is engaging
- Adheres to Twitter's character limit
Note: The avid reader might object: If I need an LLM to judge my other LLM, who tells me that the second LLM is correct? Well, that's a valid point. DSPy therefore allows you to even 'compile' the metric pipeline. Read more about it in the last section here.
Iterative Refinement of DSPy metrics
Remember that defining an effective metric is an iterative process. Start with a simple metric, run evaluations, analyze the results, and refine your metric based on insights gained. As you iterate, you'll develop a more comprehensive and accurate way to assess your DSPy program's performance.
In summary, develop a way to measure the quality of your system's outputs:
- Start with simple metrics for basic tasks (e.g., accuracy, F1 score)
- For complex tasks, create a DSPy program to evaluate multiple output properties
- Iterate on your metric definition as you refine your system
6. Collect Preliminary Evaluations
The steps of creating metrics and data are the most important but also the most time-consuming ones. So we can relax a bit now.
Get a sense of your pipeline's performance and note down the pre-optimization metric results. Create your baseline, so to speak.
- Run evaluations on your pipeline using your defined data and metric
- Analyze outputs and scores to identify major issues
- Establish a baseline for measuring improvements
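A convenient way to do this is DSPy's built-in Evaluate helper; the sketch below assumes a devset, a metric, and a program from the previous steps (hypothetical names):

```python
from dspy.evaluate import Evaluate

# devset and my_metric come from steps 4 and 5; my_program is your DSPy module.
evaluate = Evaluate(devset=devset, metric=my_metric, num_threads=4,
                    display_progress=True, display_table=5)

# Note this score down as your pre-optimization baseline.
baseline_score = evaluate(my_program)
```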
7. Compile with a DSPy Optimizer
Finally, optimization. DSPy provides quite a few optimizers. Choose an appropriate DSPy optimizer based on your data availability:
- For very limited data (≈10 examples): Use BootstrapFewShot
- For moderate data (≈50 examples): Try BootstrapFewShotWithRandomSearch
- For larger datasets (300+ examples): Implement MIPRO
- To optimize for efficiency with a smaller LLM: Apply BootstrapFinetune
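Whichever optimizer you pick, the pattern is the same: instantiate it with your metric, then compile your program against your trainset. A sketch with hypothetical names:

```python
from dspy.teleprompt import BootstrapFewShot

# my_program, my_metric, and trainset come from the previous steps.
optimizer = BootstrapFewShot(metric=my_metric,
                             max_bootstrapped_demos=4,
                             max_labeled_demos=4)
compiled_program = optimizer.compile(my_program, trainset=trainset)
```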
8. Iterate and Refine
After initial optimization:
- Revisit previous steps to identify areas for improvement
- Consider updating your task definition, data collection, metric, or program structure
- Explore advanced features like DSPy Assertions
- Experiment with multiple optimizers in sequence
Again, working with DSPy is an iterative process. Continuously refine your approach based on results and new insights to create the most effective LM pipeline for your specific task.
Note: We could generalize: Working with LLMs is an iterative process. This is not unique to DSPy.
Example: Optimizing text summarization with DSPy
So, to get started, let's think of a very simple pipeline: summarizing text. You might object that this is way too simple, but let's think about it more carefully. Is it really that simple? Summarizing text means finding the most important information without being overly long, sticking to the truth, preserving the tone and target audience of the original, and staying consistent in output quality across texts with different content, lengths, and complexity.
Text summarization is actually one of those examples where LLM projects regularly get classified as 'failures'. Early summarization attempts and prototypes produce great results. But when these programs run in production, the many flaws of simple summarization systems surface. And all too often the AI is blamed for being 'dumb' when in reality the prompts are simply bad.
Following the 8 steps introduced above, let's execute these steps. First, let's define the theory and show the complete code sample after that.
- Define your task: Summarize text, keep the same tone and style as the original, maximum length of 300 characters.
- Define your pipeline: Use a simple dspy.ChainOfThought module with an LLM. No tools, retrievers or APIs required.
- Explore examples: Homework: run a few examples using plain OpenAI model calls. Use a single long-form text and run a summarization prompt 5 times. Experience the variety of the results.
- Define your data: As per our task definition, we need a set of long-form texts and their corresponding summaries. We can use existing datasets like CNN/DailyMail or create our own. In this specific example we use the billsum dataset from Hugging Face. It's a dataset with the columns text, summary and title. The summary provided in this dataset will be treated as the 'gold' summary: our own summarizer should produce a summary as close as possible to this one and will be judged by that.
- Define your metric: As it's quite impossible to create a purely programmatic metric for judging whether our summary is close to the gold summary, we'll create an AI judge, which will score the likeness. Scoring summarization is surprisingly complex; the most common strategy is to break the text down into its main ideas, score the importance of each idea, and check whether the key ideas are represented in the created summary.
- Collect preliminary evaluations: Run the pipeline with the initial setup and note down the results.
- Compile with a DSPy Optimizer: Use the BootstrapFewShotWithRandomSearch optimizer to optimize the summarization prompt.
- Iterate and refine: After optimization, revisit the previous steps to identify areas for improvement. Consider updating your task definition, data collection, metric, or program structure. Experiment with multiple optimizers in sequence.
Code for text summarization with DSPy
Now we're finally ready for some code. Please refer to the inline comments for explanations.
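The sketch below assumes an OpenAI-backed model and DSPy 2.5-style configuration; names like gold_summary and AssessSummary are our own choices, and the split sizes are kept small for a first run.

```python
import dspy
from dspy.evaluate import Evaluate
from datasets import load_dataset

# 1) Configure the language model.
# dspy.LM is the DSPy 2.5+ client; older versions use e.g. dspy.OpenAI(model=...).
lm = dspy.LM("openai/gpt-4o-mini")
dspy.settings.configure(lm=lm)

# 2) Load the billsum dataset (columns: text, summary, title) and turn the rows
#    into DSPy examples. The dataset's summary column is our 'gold' summary.
raw = load_dataset("billsum", split="ca_test")
examples = [
    dspy.Example(text=row["text"], gold_summary=row["summary"]).with_inputs("text")
    for row in raw.select(range(100))  # keep it small for a first run
]
trainset, devset = examples[:50], examples[50:]

# 3) The pipeline: a single ChainOfThought module driven by a Signature.
class Summarize(dspy.Signature):
    """Summarize the text. Keep the tone and style of the original. Max 300 characters."""
    text = dspy.InputField()
    summary = dspy.OutputField(desc="a faithful summary of at most 300 characters")

summarizer = dspy.ChainOfThought(Summarize)

# 4) The metric: an AI judge that checks whether the generated summary covers
#    the key ideas of the gold summary.
class AssessSummary(dspy.Signature):
    """Identify the main ideas of the gold summary and judge whether the
    generated summary covers them."""
    gold_summary = dspy.InputField()
    generated_summary = dspy.InputField()
    covers_key_ideas = dspy.OutputField(desc="Yes or No")

def summary_metric(example, pred, trace=None):
    verdict = dspy.ChainOfThought(AssessSummary)(
        gold_summary=example.gold_summary,
        generated_summary=pred.summary,
    )
    return verdict.covers_key_ideas.strip().lower().startswith("yes")

# 5) Baseline evaluation: the score is the percentage of dev examples where the
#    judge says the key ideas are covered.
evaluate = Evaluate(devset=devset, metric=summary_metric,
                    num_threads=4, display_progress=True)
evaluate(summarizer)
```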
This code snippet so far should have produced a percentage indicating how many summarizations were good. In my case I got 42, meaning 42% of the key ideas were correctly identified.
Internals: What prompt did we use to generate these summaries?
You might wonder what prompt was actually used to generate these summaries. The good news is: we do not need to think in prompts anymore, only in Signatures. Signatures are a way to define what your program should do by defining inputs and outputs. DSPy will then create a prompt for us.
If you want to see the actual prompt used, run:
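```python
# Assumes `lm` is the LM object configured earlier via dspy.settings.configure(lm=lm).
# Depending on your DSPy version, a global dspy.inspect_history(n=1) may also be available.
lm.inspect_history(n=1)
```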
This will output something roughly like the following (the exact layout depends on your DSPy version and adapter):
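```text
Summarize the text. Keep the tone and style of the original. Max 300 characters.

Text: <the input text>
Reasoning: Let's think step by step in order to produce the summary. ...
Summary: <the generated summary, at most 300 characters>
```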
This output is quite nice to recognize what DSPy is doing under the hood.
Ok, but 42% is not good enough, right? Let's try to improve this.
For that, we use the BootstrapFewShotWithRandomSearch optimizer. It will automatically improve the prompt for us.
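Roughly like this, reusing the names from the sketch above (the hyperparameters are just reasonable starting points):

```python
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# summarizer, summary_metric, trainset, and evaluate come from the sketch above.
optimizer = BootstrapFewShotWithRandomSearch(
    metric=summary_metric,
    max_bootstrapped_demos=2,
    num_candidate_programs=8,
)
optimized_summarizer = optimizer.compile(summarizer, trainset=trainset)

# Re-run the same evaluation to compare against the baseline score.
evaluate(optimized_summarizer)
```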
In my example, the result was 68, which is a significant improvement!
To compare the prompt used before the optimization with the one used after, use:
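```python
# One simple approach: run the unoptimized and the optimized summarizer
# on the same input ...
sample_text = devset[0].text
summarizer(text=sample_text)
optimized_summarizer(text=sample_text)

# ... then look at the last two prompts that were sent to the LM.
# (Again assumes the `lm` object from the configuration step.)
lm.inspect_history(n=2)
```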
What's next
We've seen how DSPy works and how it automates prompt and LLM parameter optimization. However, we've also seen one very important aspect: by using DSPy we moved away from hand-written prompts towards a more programmatic way of interacting with language models. By introducing Signatures, we can define what our program should do, and by providing datasets we can guide an optimizer to find the prompt for us.
Not only is this a more scalable way of building AI applications (as the most time-consuming part, 'guessing' the right prompt, is automated), but it's also a golden way to adjust AI apps to different language models. Let's say you use GPT-4o as your default model, but then you need to switch to a self-hosted open-source model like Llama. Without tools like DSPy, you'd need to carefully re-engineer, or at least validate, each and every prompt. By incorporating DSPy early in your development workflow, all you have to do is change the LLM in the DSPy config and run the optimizer again.