Azure AI Studio: How to evaluate and upgrade your models, using the Prompt Flow SDK


Let's face it: keeping up with AI model updates can feel like chasing a hyperactive squirrel. Just when you've gotten comfortable with one version, another pops up, promising to be faster, safer, and maybe even make you coffee in the morning.

But here's the thing – those shiny new models aren't just for show. They can seriously improve your AI application, potentially saving you time, money, and headaches. The flip side? Your trusty old models will eventually be put out to pasture.

This blog post will guide you through the process of evaluating new model versions and upgrading your deployments in the Azure OpenAI Service. We'll explore how to use Azure AI Studio Evaluations to compare different model versions, assess their performance, and make informed decisions about which version best suits your needs. And most importantly: find out whether a new model will really improve your application's performance or whether it will even introduce regressions.

More specifically, we'll cover:

  • Using Azure AI Studio Evaluations to assess new model versions
  • Comparing different models using both code-based and UI-friendly methods
  • Best practices for upgrading your deployments

What is the Azure AI Studio?

Azure AI Studio is a platform within Microsoft Azure designed to help developers create and deploy AI-powered applications. It offers a range of tools for building generative AI models, suitable for developers with varying levels of experience. The platform allows users to work with existing AI models or develop custom ones using their own data.

One of Azure AI Studio's main features is its adaptability. Users can choose between pre-built AI models or customize them with proprietary data, allowing organizations to tailor solutions to their specific needs. The platform also supports team collaboration, integrating with common development tools like GitHub and Visual Studio.

The major selling point at the moment is the integration of OpenAI's models. Most of the OpenAI models can be managed and provisioned through the Azure AI Studio. You can add content filters, deploy models, log metrics, and - as we are about to see - evaluate and upgrade models, thanks to its tight integration with the Prompt Flow SDK.

What is the Microsoft Prompt Flow SDK?

The Microsoft Prompt Flow SDK is a toolkit designed to assist with the development of applications based on large language models (LLMs). It provides a robust framework for developers to streamline the entire lifecycle of LLM app development, from initial prototyping to production deployment and ongoing monitoring.

Key Features and Capabilities

Prototyping and Experimentation:

  • Prompt Engineering: The SDK supports advanced prompt engineering, enabling developers to experiment with different prompt configurations to optimize model outputs.
  • Python Integration: Developers can integrate Python code directly into their workflows, allowing for complex logic and data manipulation alongside prompt engineering.

Testing and Evaluation:

  • Unit Testing: Prompt Flow provides tools to write unit tests for LLM prompts, ensuring that changes to prompts or underlying models do not introduce regressions.
  • Performance Metrics: The SDK includes performance evaluation features, allowing developers to measure the effectiveness of different prompts and configurations.

Deployment and Monitoring:

  • CI/CD Integration: The SDK integrates with Continuous Integration and Continuous Deployment (CI/CD) pipelines, enabling seamless deployment of LLM applications.
  • Monitoring and Logging: It provides robust monitoring tools to track the performance of deployed applications, capturing key metrics and logs for ongoing optimization.

Collaboration and Versioning:

  • Version Control: Prompt Flow SDK supports version control, making it easy to track changes and collaborate on LLM projects across teams.
  • Collaboration Tools: Developers can work together in a shared environment, making the development process more efficient and ensuring consistency across different versions of an application.

Extensibility:

  • Custom Plugins: The SDK allows developers to create custom plugins, extending its functionality to suit specific project needs.
  • Third-Party Integrations: It supports integrations with various third-party tools and platforms.

Benefits of Using Prompt Flow SDK

  • Efficiency: By offering a unified platform for all stages of LLM app development, the Prompt Flow SDK reduces the time and effort required to move from prototype to production.
  • Quality Assurance: The built-in testing and evaluation tools help maintain high standards for LLM applications, ensuring reliable and consistent performance.
  • Scalability: The SDK is designed to handle projects of any size, from small prototypes to large-scale, production-grade applications.

Why upgrade your models?

Now that we know the tools we're going to use in this tutorial, let's take a step back and ask the question: why are we doing this? Why should we upgrade our models? Why do we even need an automated process for validating model upgrades?

The answer is two-fold:

First, the obvious one: models tend to get better, very quickly. Looking just at the OpenAI model changelog, we can see the pace at which new models arrive:

  • GPT-3.5 was released in March 2022
  • GPT-3.5-turbo was released in November 2022
  • GPT-4 was released in March 2023
  • GPT-4-turbo was released in November 2023
  • GPT-4o was released in May 2024

As we can see, a new model arrives approximately every 6 months. However, this list does not even include the minor model versions like GPT-4o-mini, GPT-3.5-turbo-16k or GPT-3.5-turbo-0125. Each of these newer models offered significantly better performance or up to 10x lower cost and latency compared to its predecessor - meaning, not upgrading was almost negligent.

The second reason for the need to upgrade: Microsoft and other model providers are aggressively retiring older models as new ones arrive. Microsoft has quite transparent model deprecation policies, with one of the main statements being: models can be retired one year after their initial release. So, no matter how you feel about upgrading, a year after a model's release, you have to.

What to evaluate when upgrading your models?

What metrics should you consider when evaluating new model versions? The authors of the Prompt Flow SDK suggest the following evaluation criteria:

  1. Performance and Quality: Metrics such as groundedness, relevance, coherence, fluency, similarity, and F1 score.
  2. Risk and Safety: Metrics assessing violence, sexual content, self-harm, and hate/unfairness.
  3. Composite Metrics: Combined evaluations for question-answer pairs or chat messages, and content safety.

To give more context, these metrics are:

  • Groundedness: Measures how well the model's output aligns with factual information.

  • Relevance: Assesses how relevant the response is to the input prompt.

  • Coherence: Evaluates the logical flow of the response ('does the output make sense?').

  • Fluency: Measures the linguistic quality and naturalness of the response.

  • Similarity: Checks the similarity between model output and reference text.

  • F1 Score: Combines precision (the accuracy of the positive predictions) and recall (the ability to find all positive instances) into a single score - see the short sketch below.

  • Violence: Detects content that contains violent or aggressive language.

  • Sexual Content: Identifies content with sexual references or inappropriate language.

  • Self-Harm: Evaluates whether the content encourages or discusses self-harm.

  • Hate/Unfairness: Measures the presence of hateful, biased, or discriminatory language.

  • Question-Answer Pairs: Evaluates the model's responses for accuracy and completeness in a Q&A context.

  • Content Safety: Combines different risk and safety metrics to evaluate overall content safety.

As you can see, quite a lot to validate. Keep in mind that not all of these metrics are always relevant for your use case.
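
To make the F1 score more tangible: a common way to compute it for generated answers is token overlap between the model's answer and the ground truth. Here is a small illustrative sketch of that idea (not Prompt Flow's exact implementation):

from collections import Counter

def token_f1(answer: str, ground_truth: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer (illustrative only)."""
    answer_tokens = answer.lower().split()
    truth_tokens = ground_truth.lower().split()
    # Tokens appearing in both, respecting multiplicity
    overlap = sum((Counter(answer_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)  # how much of the answer is correct
    recall = overlap / len(truth_tokens)      # how much of the ground truth is covered
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital of France", "Paris"))  # ~0.29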

In the next section we finally get to the hands-on part of this tutorial, where we will evaluate a new model version using Azure AI Studio and the Prompt Flow SDK.

Hands on: Evaluating your models using Azure AI Studio and Prompt Flow

Prerequisites

Please prepare the following prerequisites before proceeding with the tutorial:

  1. Sign up for Microsoft Azure AI Studio and create a project, as outlined here.

    Azure AI Studio - create a project

    Note: Make sure to note step 9: "Select an existing Azure AI services resource (including Azure OpenAI) from the dropdown or create a new one." You need to connect an Azure OpenAI service to your project for this tutorial. Either select an existing one or create a new one.

    You'll also need the Azure CLI installed on your computer. Run az login to authenticate your Azure account. Select the subscription you want to use.

  2. Install the Prompt Flow evaluation and Azure packages:

    pip install promptflow-evals promptflow-azure
  3. Head over to the "Settings" page of your Azure AI Studio project and note down the following information:

    • project name
    • resource group name
    • subscription id

    Azure AI Studio - project settings

  4. Now we need to create some AI models to evaluate - and also an AI model which acts as the "judge" for the evaluation.

    This is a critical piece of information: Prompt Flow uses an LLM itself to judge the other LLMs. Therefore, it's important to use the best possible model for the evaluation in terms of reasoning quality. Otherwise, a weak judge model might distort the evaluation results.

    Click on "Model catalog" in the Azure AI Studio left-hand side menu. Then, select "GPT-4o".

    Azure AI Studio - model catalog

    In the next screen, click on "Deploy" - a modal will open. In this modal, set the options as you see fit. For GPT-4o, the following settings work well as of the time of this writing. In the "Connected Azure OpenAI service" dropdown, select the service you connected or created in step 1 (during project creation).

    Azure AI Studio - deploy model

    Hit "Deploy" to get the model provisioned.

    Note: This model will act as the judge for the evaluation. If you just want to follow along with this tutorial and don't have a 'real' model you want to validate yet, repeat the steps above and deploy another model - e.g. GPT-35-Turbo.

  5. Click on "Deployments" in the sidebar menu and select the model you just created. In the following screen, you get an overview of the deployed model. Most important for us at the moment is the "Endpoints" section. Note down both the endpoint URL and the endpoint key.

  6. Next, let's create our model configurations in Python. Create an object for the Azure AI project as follows, using the credentials from step 3.

    azure_ai_project = {
        "subscription_id": "<your subscription id>",
        "resource_group_name": "<name of your resource group>",
        "project_name": "<name of azure ai studio project>"
    }

    Then create a model configuration object for the judge model as follows. We need the azure_endpoint, the api_key, api_version and azure_deployment.

    We can get this information from step 5.

    The api_key is simply the endpoint key from the deployment screen.

    For the other values, we can deconstruct the endpoint URL, which is structured as follows: https://<resource_name>.openai.azure.com/openai/deployments/<deployment_name>/<operation>?api-version=<api_version> Example: https://openai-docusearch.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2023-03-15-preview

    For the example above, the values would be:

    • azure_endpoint: https://openai-docusearch.openai.azure.com
    • api_version: 2023-03-15-preview
    • azure_deployment: gpt-4o
    from promptflow.core import AzureOpenAIModelConfiguration

    judge_config = AzureOpenAIModelConfiguration(
        azure_endpoint="https://openai-docusearch.openai.azure.com",
        api_key="<api-key>",
        api_version="2023-03-15-preview",
        azure_deployment="gpt-4o",
    )
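
    If you'd rather not split the endpoint URL by hand, a small helper can derive these values for you - a minimal sketch using only the Python standard library, assuming the URL layout shown above:

    from urllib.parse import urlparse, parse_qs

    def split_azure_endpoint(endpoint_url: str) -> dict:
        """Split a full Azure OpenAI endpoint URL into the parts needed for the model configuration."""
        parsed = urlparse(endpoint_url)
        return {
            # e.g. https://openai-docusearch.openai.azure.com
            "azure_endpoint": f"{parsed.scheme}://{parsed.netloc}",
            # the path looks like /openai/deployments/<deployment_name>/chat/completions
            "azure_deployment": parsed.path.split("/deployments/")[1].split("/")[0],
            "api_version": parse_qs(parsed.query)["api-version"][0],
        }

    print(split_azure_endpoint(
        "https://openai-docusearch.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2023-03-15-preview"
    ))
    # {'azure_endpoint': 'https://openai-docusearch.openai.azure.com', 'azure_deployment': 'gpt-4o', 'api_version': '2023-03-15-preview'}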

    Finally, we need to define the endpoints of the models we want to evaluate. For each model, we need two pieces of information:

    1. The model API endpoint
    2. The API key

    If the models are deployed in Azure AI Studio, simply use the endpoint and API key from the deployment screen. For other API providers, please refer to the respective documentation (see the example after the following configs).

    gpt_4o = {
        "api_endpoint": "https://openai-docusearch.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2023-03-15-preview",
        "api_key": "<api-key>"
    }

    gpt_35_turbo = {
        "api_endpoint": "https://openai-docusearch.openai.azure.com/openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-03-15-preview",
        "api_key": "<api-key>"
    }
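
    For reference, an endpoint configuration for a non-Azure provider can follow the same shape. As a hedged example, the public OpenAI chat completions endpoint might be configured like this - note that the ModelRouter below would then need a dedicated call method that sends an Authorization: Bearer header instead of the Azure api-key header and adds a model field to the payload:

    # Hypothetical config for the public OpenAI API - not used in the rest of this tutorial
    openai_gpt_4o = {
        "api_endpoint": "https://api.openai.com/v1/chat/completions",
        "api_key": "<openai-api-key>"
    }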

Azure permissions for Prompt Flow

While it's not mentioned in the official documentation, Prompt Flow uploads your code, the sample data and the evaluation results to an Azure Blob Storage account which is automatically created along with the project.

Also, when provisioning the Azure AI Studio project, a service principal is created with the same name as the project - blog in our example.

This service principal is then used to access the blob storage for file uploads. By default, it seems to be missing a critical permission/role, resulting in the following error:

This request is not authorized to perform this operation using this permission.

If this error occurs, follow these steps:

  1. Navigate to "Settings" in the Azure AI Studio project and select the "workspaceartifactstore" resource.

    Azure AI Studio - workspaceartifactstore

  2. In the next screen, click on "View in Azure Portal". There, click on "Access control (IAM)", then "Role assignments" and then "Add".

    Azure AI Studio - view in Azure Portal

  3. In the "Role" tab, select "Storage Blob Data Contributor", then click on the "Members" tab.

    Azure AI Studio - role assignment

  4. Select "Managed Identity" and click on "Select members". In the sidebar that opens, select "Azure AI Project" in the managed identity dropdown. Then select the service principal with the same name as the project.

    Azure AI Studio - select members

  5. Click "Select" and then "Review + assign".

That's it, your permissions should now be set.

Creating the evaluation dataset for Prompt Flow

Now that we have all the necessary information, we can create the evaluation dataset. As always when evaluating AI systems, the most important part of the whole process is exactly this: the evaluation dataset.

For Prompt Flow, we need to create a JSONL file with the following information:

  • question: The reference question for the model
  • context: Context for the model to answer the question upon
  • ground_truth: The correct answer to the question

There is not really a shortcut here. For a good evaluation, you need to invest the time to first create good question/answer pairs.

For this example, we'll use a simple evaluation dataset kindly provided by the Prompt Flow SDK:

{"question":"What is the capital of France?","context":"France is the country in Europe.","ground_truth":"Paris"}
{"question": "Which tent is the most waterproof?", "context": "#TrailMaster X4 Tent, price $250,## BrandOutdoorLiving## CategoryTents## Features- Polyester material for durability- Spacious interior to accommodate multiple people- Easy setup with included instructions- Water-resistant construction to withstand light rain- Mesh panels for ventilation and insect protection- Rainfly included for added weather protection- Multiple doors for convenient entry and exit- Interior pockets for organizing small ite- Reflective guy lines for improved visibility at night- Freestanding design for easy setup and relocation- Carry bag included for convenient storage and transportatio## Technical Specs**Best Use**: Camping **Capacity**: 4-person **Season Rating**: 3-season **Setup**: Freestanding **Material**: Polyester **Waterproof**: Yes **Rainfly**: Included **Rainfly Waterproof Rating**: 2000mm", "ground_truth": "The TrailMaster X4 tent has a rainfly waterproof rating of 2000mm"}
{"question": "Which camping table is the lightest?", "context": "#BaseCamp Folding Table, price $60,## BrandCampBuddy## CategoryCamping Tables## FeaturesLightweight and durable aluminum constructionFoldable design with a compact size for easy storage and transport## Technical Specifications- **Weight**: 15 lbs- **Maximum Weight Capacity**: Up to a certain weight limit (specific weight limit not provided)", "ground_truth": "The BaseCamp Folding Table has a weight of 15 lbs"}
{"question": "How much does TrailWalker Hiking Shoes cost? ", "context": "#TrailWalker Hiking Shoes, price $110## BrandTrekReady## CategoryHiking Footwear", "ground_truth": "The TrailWalker Hiking Shoes are priced at $110"}

Create a file named data.jsonl and paste the content above into it.
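
If you'd rather generate the file programmatically - for example from an existing list of question/answer pairs - a minimal sketch could look like this:

import json

# Hypothetical in-memory dataset - replace with your own question/context/ground_truth triples
samples = [
    {
        "question": "What is the capital of France?",
        "context": "France is the country in Europe.",
        "ground_truth": "Paris",
    },
]

# Write one JSON object per line (JSONL), as expected by the evaluation run
with open("data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")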

Creating the model router

Prompt Flow allows you to define a target - a callable Python class which takes all the evaluation questions and routes them to the respective model endpoints. This is quite useful, as it lets you evaluate not only different models, but also full RAG pipelines.

The target class's call method needs to take a question and a context parameter and return a dictionary with question and answer keys.

Instead of creating a callable class, you can also create a Python function that takes the question and context parameters and returns a dictionary with question and answer keys. A class, however, might provide more flexibility.
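
For illustration, a minimal function-based target might look like the following sketch (the endpoint URL and key are placeholders for your own deployment):

import requests

# Placeholders - fill in the values from your own deployment
GPT_4O_ENDPOINT = "https://<your-resource>.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2023-03-15-preview"
GPT_4O_KEY = "<api-key>"

def answer_question(question: str, context: str) -> dict:
    """Function-based target: sends the question (plus context) to one deployment and returns question/answer."""
    prompt = f"{question}\n Context: {context}" if context else question
    payload = {"messages": [{"role": "user", "content": prompt}], "max_tokens": 1000}
    headers = {"Content-Type": "application/json", "api-key": GPT_4O_KEY}
    response = requests.post(GPT_4O_ENDPOINT, headers=headers, json=payload)
    answer = response.json()["choices"][0]["message"]["content"]
    return {"question": question, "answer": answer}

In this tutorial, however, we'll stick with the class-based approach.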

Add this class to a separate file named target.py:

import requests
from typing_extensions import Self
from typing import TypedDict
from promptflow.tracing import trace


class ModelRouter:
    def __init__(self: Self, model_type: str, model_config: dict):
        self.model_config = model_config
        self.model_type = model_type

    class Response(TypedDict):
        question: str
        answer: str

    @trace
    def __call__(self: Self, question: str, context: str) -> Response:
        if self.model_type == "gpt-4o":
            output = self.call_gpt4o_endpoint(question, context)
        elif self.model_type == "gpt-35-turbo":
            output = self.call_gpt35_turbo_endpoint(question, context)
        else:
            raise ValueError("Model type not supported")
        return output

    def query(self: Self, endpoint: str, headers: dict, payload: dict) -> dict:
        # Send the request to the model endpoint and return the parsed JSON response
        response = requests.post(url=endpoint, headers=headers, json=payload)
        return response.json()

    def call_gpt4o_endpoint(self: Self, question: str, context: str) -> Response:
        endpoint = self.model_config["api_endpoint"]
        key = self.model_config["api_key"]

        headers = {"Content-Type": "application/json", "api-key": key}

        # Append the context to the question, if one is provided
        question = question + "\n Context: " + context if context else question

        payload = {"messages": [{"role": "user", "content": question}], "max_tokens": 1000}

        output = self.query(endpoint=endpoint, headers=headers, payload=payload)
        answer = output["choices"][0]["message"]["content"]
        return {"question": question, "answer": answer}

    def call_gpt35_turbo_endpoint(self: Self, question: str, context: str) -> Response:
        endpoint = self.model_config["api_endpoint"]
        key = self.model_config["api_key"]

        headers = {"Content-Type": "application/json", "api-key": key}

        # Append the context to the question, if one is provided
        question = question + "\n Context: " + context if context else question

        payload = {"messages": [{"role": "user", "content": question}], "max_tokens": 1000}

        output = self.query(endpoint=endpoint, headers=headers, payload=payload)
        answer = output["choices"][0]["message"]["content"]
        return {"question": question, "answer": answer}

Some things to note here:

  1. Please note the @trace decorator. Traces in Prompt Flow record specific events or the state of an application during execution. They can include data about function calls, variable values, system events and more. See here for more information about tracing.

  2. Our call_gpt4o_endpoint and call_gpt35_turbo_endpoint methods are basically the same and could be implemented as a single function. I provided them separately as an example of how one could implement different API providers.

  3. The implementation of the call-endpoint functions can be further extended, depending on your specific application. You can even add a full RAG retrieval pipeline here, as sketched below.
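
To illustrate the last point, a retrieval step could be wired in front of the model call roughly like this. This is a hedged sketch: retrieve_documents is a hypothetical placeholder for your vector store or Azure AI Search query, and the function mirrors the call methods of ModelRouter above:

import requests

def retrieve_documents(question: str, top_k: int = 3) -> list:
    """Hypothetical retrieval helper - replace with your vector store or Azure AI Search query."""
    return ["<retrieved document 1>", "<retrieved document 2>", "<retrieved document 3>"][:top_k]

def call_rag_endpoint(question: str, context: str, model_config: dict) -> dict:
    """Sketch of a call method that augments the context with retrieved documents before asking the model."""
    retrieved = "\n".join(retrieve_documents(question))
    full_context = f"{context}\n{retrieved}" if context else retrieved

    headers = {"Content-Type": "application/json", "api-key": model_config["api_key"]}
    payload = {
        "messages": [{"role": "user", "content": f"{question}\n Context: {full_context}"}],
        "max_tokens": 1000,
    }
    response = requests.post(model_config["api_endpoint"], headers=headers, json=payload)
    return {"question": question, "answer": response.json()["choices"][0]["message"]["content"]}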

Running the evaluation

We are almost there. All that's left is to define the evaluations we'd like to run and then run them.

  1. Define the evaluators. For a list of available evaluators, see here. You can also create your own evaluators - a minimal custom-evaluator sketch follows after these steps.

    from promptflow.evals.evaluators import (
        ContentSafetyEvaluator,
        RelevanceEvaluator,
        CoherenceEvaluator,
        GroundednessEvaluator,
        FluencyEvaluator,
        SimilarityEvaluator,
    )

    content_safety_evaluator = ContentSafetyEvaluator(project_scope=azure_ai_project)
    relevance_evaluator = RelevanceEvaluator(model_config=judge_config)
    coherence_evaluator = CoherenceEvaluator(model_config=judge_config)
    groundedness_evaluator = GroundednessEvaluator(model_config=judge_config)
    fluency_evaluator = FluencyEvaluator(model_config=judge_config)
    similarity_evaluator = SimilarityEvaluator(model_config=judge_config)
  2. Next, create the evaluation configuration.

    import pathlib
    import random

    from promptflow.evals.evaluate import evaluate
    from target import ModelRouter

    path = str(pathlib.Path(pathlib.Path.cwd())) + "/data.jsonl"
    models = ["gpt-4o", "gpt-35-turbo"]

    for model in models:
        randomNum = random.randint(111, 999)
        results = evaluate(
            azure_ai_project=azure_ai_project,
            evaluation_name="Eval-Run-" + str(randomNum) + "-" + model.title(),
            data=path,
            # The model_config refers to the configuration objects we set up some steps before
            target=ModelRouter(model, model_config=gpt_4o if model == "gpt-4o" else gpt_35_turbo),
            evaluators={
                "content_safety": content_safety_evaluator,
                "coherence": coherence_evaluator,
                "relevance": relevance_evaluator,
                "groundedness": groundedness_evaluator,
                "fluency": fluency_evaluator,
                "similarity": similarity_evaluator,
            },
            evaluator_config={
                "content_safety": {
                    "question": "${data.question}",
                    "answer": "${target.answer}"
                },
                "coherence": {
                    "answer": "${target.answer}",
                    "question": "${data.question}"
                },
                "relevance": {
                    "answer": "${target.answer}",
                    "context": "${data.context}",
                    "question": "${data.question}"
                },
                "groundedness": {
                    "answer": "${target.answer}",
                    "context": "${data.context}",
                    "question": "${data.question}"
                },
                "fluency": {
                    "answer": "${target.answer}",
                    "context": "${data.context}",
                    "question": "${data.question}"
                },
                "similarity": {
                    "answer": "${target.answer}",
                    "context": "${data.context}",
                    "question": "${data.question}"
                }
            }
        )
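
As mentioned in step 1, you can also plug in your own evaluators. In promptflow-evals, a custom evaluator can be a plain Python callable (or a class with a __call__ method) that takes the mapped inputs as keyword arguments and returns a dictionary of scores. A minimal, hedged sketch of a length-based evaluator - the name and scoring logic are purely illustrative:

class AnswerLengthEvaluator:
    """Toy custom evaluator: rewards concise answers (1.0 = very short, approaching 0.0 for very long)."""

    def __call__(self, *, answer: str, **kwargs) -> dict:
        words = len(answer.split())
        return {"answer_length": words, "conciseness_score": max(0.0, 1.0 - words / 500)}


# It could then be plugged into the evaluate() call, e.g.:
# evaluators={"answer_length": AnswerLengthEvaluator(), ...}
# evaluator_config={"answer_length": {"answer": "${target.answer}"}, ...}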

To view the results, simply print them. Or - for better readability - transform the result rows into a pandas DataFrame:

import pandas as pd

pd.DataFrame(results["rows"])
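
If you want a quick per-run summary instead of row-level scores, you can, for example, average the numeric evaluator columns - a sketch, assuming the evaluator outputs show up as columns prefixed with outputs. in the result rows (the returned dictionary typically also contains pre-aggregated values under a metrics key):

df = pd.DataFrame(results["rows"])

# Average all numeric evaluator output columns to get one summary value per metric
metric_columns = [c for c in df.columns if c.startswith("outputs.")]
print(df[metric_columns].mean(numeric_only=True))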

Analyzing Prompt Flow results in Azure AI Studio

As we provided an (optional) Azure AI Studio project configuration, we can use it to view the results and some nice graphs for the Prompt Flow evaluations.

In Azure AI Studio, select "Tools -> Evaluations" from the left-hand side menu. All your model evaluation runs show up here if you've logged the results to your project via the SDK.
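
As a shortcut, the dictionary returned by evaluate() usually also includes a direct link to the logged run when an azure_ai_project was passed - hedged, as the exact key may differ between SDK versions:

# Print the Azure AI Studio link of the last evaluation run, if the SDK returned one
print(results.get("studio_url", "No studio URL returned - open the Evaluations page manually."))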

  1. After selecting the "Evaluations" menu, click on "Switch to Dashboard" in the top-right corner, as this gives a better overview for comparing runs.

  2. After that, you can select one or more runs to compare. Each model we defined in our code above shows up as a separate run. Let's select both our GPT-4o and GPT-35-Turbo runs. As you can see in the screenshot below, this gives a very nice comparison of all the metrics we defined. In our example, GPT-4o seems to be the better model in almost every regard (which is not too surprising).

    Azure AI Studio - evaluation dashboard

    If you scroll down, you get detailed comparisons of the individual answers of each run.

    Azure AI Studio - evaluation details

  3. In addition to comparing two or more runs, you can open an individual run and get detailed evaluation results for it. You'll get nice charts for each metric, as well as the detailed questions and answers and the grades per answer of that evaluation.

    Azure AI Studio - individual run evaluations

So, that's it. Quite a few steps, but in the end this is quite a powerful setup for evaluating models before upgrading. Simply add a new model to the model endpoints, and you can evaluate it against your existing baseline model, as sketched below.
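
For example, to bring a hypothetical new deployment (say gpt-4o-mini) into the comparison, you would add its endpoint config, extend the list of models, and teach the router about it - a sketch, assuming you also add a matching branch (or a generic call method) to ModelRouter:

# Hypothetical additional deployment to evaluate against the existing baseline
gpt_4o_mini = {
    "api_endpoint": "https://openai-docusearch.openai.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2023-03-15-preview",
    "api_key": "<api-key>"
}

model_configs = {"gpt-4o": gpt_4o, "gpt-35-turbo": gpt_35_turbo, "gpt-4o-mini": gpt_4o_mini}

for model, config in model_configs.items():
    # ModelRouter would need a branch (or a generic call method) for "gpt-4o-mini" as well
    target = ModelRouter(model, model_config=config)
    # ... then pass `target` to evaluate() exactly as shown above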

Using Prompt Flow without Azure AI Studio

If you want to use Prompt Flow without Azure AI Studio, you can do so.

Basically, the approach is exactly as described above, just without the lines of code that refer to Azure AI Studio. For brevity, find the full code for evaluating models without Azure AI Studio below.

from promptflow.core import AzureOpenAIModelConfiguration
import requests
import pathlib
import random
from typing_extensions import Self
from typing import TypedDict
from promptflow.tracing import trace
from promptflow.evals.evaluate import evaluate
from promptflow.evals.evaluators import (
    ContentSafetyEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    GroundednessEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
)

from target import ModelRouter

import pandas as pd

judge_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://openai-docusearch.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-02-01",
    azure_deployment="gpt-4o",
)


gpt_4o = {
    "api_endpoint": "https://openai-docusearch.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-02-01",
    "api_key": "<api-key>"
}

gpt_35_turbo = {
    "api_endpoint": "https://openai-docusearch.openai.azure.com/openai/deployments/gpt-35-turbo/chat/completions?api-version=2024-02-01",
    "api_key": "<api-key>"
}


relevance_evaluator = RelevanceEvaluator(model_config=judge_config)
coherence_evaluator = CoherenceEvaluator(model_config=judge_config)
groundedness_evaluator = GroundednessEvaluator(model_config=judge_config)
fluency_evaluator = FluencyEvaluator(model_config=judge_config)
similarity_evaluator = SimilarityEvaluator(model_config=judge_config)


path = str(pathlib.Path(pathlib.Path.cwd())) + "/data.jsonl"
models = ["gpt-4o", "gpt-35-turbo"]

try:
    for model in models:
        randomNum = random.randint(111, 999)
        results = evaluate(
            evaluation_name="Eval-Run-" + str(randomNum) + "-" + model.title(),
            data=path,
            # The model_config refers to the configuration objects we set up some steps before
            target=ModelRouter(model, model_config=gpt_4o if model == "gpt-4o" else gpt_35_turbo),
            evaluators={
                # The ContentSafetyEvaluator requires an Azure AI Studio project, so it is omitted here
                # "content_safety": content_safety_evaluator,
                "coherence": coherence_evaluator,
                "relevance": relevance_evaluator,
                "groundedness": groundedness_evaluator,
                "fluency": fluency_evaluator,
                "similarity": similarity_evaluator,
            },
            evaluator_config={
                "content_safety": {
                    "question": "${data.question}",
                    "answer": "${target.answer}"
                },
                "coherence": {
                    "answer": "${target.answer}",
                    "question": "${data.question}"
                },
                "relevance": {
                    "answer": "${target.answer}",
                    "context": "${data.context}",
                    "question": "${data.question}"
                },
                "groundedness": {
                    "answer": "${target.answer}",
                    "context": "${data.context}",
                    "question": "${data.question}"
                },
                "fluency": {
                    "answer": "${target.answer}",
                    "context": "${data.context}",
                    "question": "${data.question}"
                },
                "similarity": {
                    "answer": "${target.answer}",
                    "context": "${data.context}",
                    "question": "${data.question}"
                }
            }
        )
except Exception as e:
    print(e)

print(results)

pd.DataFrame(results["rows"])

Further reading

------------------

Interested in how to train your very own Large Language Model?

We prepared a well-researched guide for how to use the latest advancements in Open Source technology to fine-tune your own LLM. This has many advantages like:

  • Cost control
  • Data privacy
  • Excellent performance - adjusted specifically for your intended use