Saving costs with LLM Routing: The art of using the right model for the right task


With Large Language Models (LLMs) we mostly have two choices:

  1. Use a powerful one like GPT-4o or Claude 3.5 Sonnet, which produces all the good stuff: high-quality, highly intelligent (well, relatively speaking) outputs. But these models cost quite a lot. And they are slower, producing fewer tokens per second - potentially limiting the scalability and usefulness of your application.

  2. Use one of the smaller models, like GPT-4o mini or the remarkable Google Gemini 1.5 Flash. These models are enormously efficient and cheap, but, let's face it, often produce less impressive results when faced with more complex tasks.

As most of us need high output quality, we go for the more powerful models - often reducing our margins, or not even starting a project in the first place because the calculated costs are just too high.

However, there is a better way: there is no written rule that says you need to use one model for all your tasks. We can simply integrate multiple models into our processes and applications and select the right one for the task at hand. For simple, more preparatory tasks, let's use the cheap and fast one. For the tasks that produce the most value or involve the most complexity, let's use the powerful one.

There are multiple strategies we could use to make this work. One of the more brute-force ones is to use a rather strong LLM to "ask" whether a task is complex and then choose the right model based on that answer. (I've seen this implemented quite a few times.) The issue here is that this is an inherently slow process and one which can get quite costly over time (or at least adds costs which are not necessary at all). Another approach is to train a simple, very efficient, locally runnable model which takes over routing tasks to the right model.

The latter approach offers very low latency and low inference costs (as these specialized "routing models" can be very small), but requires quite some effort to create.
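To make the first, brute-force variant concrete, here is a minimal sketch of such an "LLM-as-router". The model names, the classification prompt and the routing logic are illustrative assumptions, not a recommendation:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_routing(user_prompt: str) -> str:
    # Step 1: ask a model to judge the complexity of the incoming task.
    judgement = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify the following task. Answer only 'simple' or 'complex'."},
            {"role": "user", "content": user_prompt},
        ],
    ).choices[0].message.content.strip().lower()

    # Step 2: route to the strong or the weak model based on that judgement.
    model = "gpt-4o" if "complex" in judgement else "gpt-4o-mini"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.choices[0].message.content

As described, every single request now pays for an extra round trip and extra tokens - which is exactly what pre-trained routers avoid.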

However, thanks to the fine people at LMSYS, this work is already done for us. They created 4 different, generic routers which route each request to the right LLM, depending on the task at hand.

The package in question is called RouteLLM.

What is RouteLLM and how does it work?

RouteLLM is a Python package and framework which allows you to serve and also evaluate LLM routers - with 4 pre-trained, well-performing routers already included.

More specifically, these routers route each prompt to either a "strong" model (like GPT-4o) or a "weak" model (like GPT-4o mini), based on the complexity of the task.

Furthermore, RouteLLM claims that the default routers they provide have learned some characteristics of models that are strong as of 2024 and of ones that are weak. This makes a lot of sense from a theoretical perspective as well: you can't route based on the complexity of the task alone - you also need to account for the intricacies of the LLMs themselves.

RouteLLM therefore basically optimizes cost vs. output quality. If we look at the chart below, we see that we would get the best possible output if we always routed to GPT-4. However, we'd also have the highest costs. With a good router, however, we are able to get 90% of the output quality at 10% of the costs (note the logarithmic scale of the x-axis). (The image is from the RouteLLM announcement blog.)

Cost vs. quality of output when using RouteLLM

The overall principle of RouteLLM is rather simple to explain: the framework provides an OpenAI-API-compatible interface and sits in front of the models to be routed to. An easy case would be to have your application, which already integrates with OpenAI, connect to RouteLLM instead; RouteLLM itself then connects to the OpenAI API and uses either GPT-4o or GPT-4o mini, depending on the prompt.

RouteLLM concept as part of an application

Potential cost savings when using RouteLLM

According to the benchmarks provided by LMSYS, RouteLLM can achieve cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K.

This is quite remarkable. One thing to note here: the quite different cost-saving results stem from the different nature of the benchmarks used. MT Bench is more general and less complex than GSM8K, so it's expected to see much higher cost savings on the less complex benchmarks.

  • MT Bench: A general-purpose benchmark comprised of high-quality, real-world question/answer pairs.
  • MMLU: MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This is a balanced benchmark, containing "easier" general-knowledge questions as well as more complex ones involving few-shot techniques.
  • GSM8K: A benchmark using a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. Math is quite complicated for LLMs, so you'd expect that more often than not the strong model is used here.

All in all, the cost savings potential seems to be between 30 and 80%, depending on your use case!

How to use RouteLLM to significantly reduce LLM costs

Ok, enough talk, let's find out how to bring this to life.

First, we need to install RouteLLM. We also want to install the server dependencies - more on that later.

pip install "routellm[serve]" pandarallel

Then we can import the required modules and set up our API key. As we only use OpenAI models in this first example, we just need an OpenAI API key.

import os
from routellm.controller import Controller

os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"

And finally we can create a client instance to use RouteLLM:

client = Controller(
    routers=["mf"],  # What router type to use
    strong_model="gpt-4o",
    weak_model="gpt-4o-mini",
)
response = client.chat.completions.create(
    # the 0.116 is the cost/quality threshold for the router. more on that
    # in the section below.
    model="router-mf-0.116",
    messages=[
        {"role": "user", "content": "How to maintain my machine?"}
    ]
)

That's actually all you need to use RouteLLM as part of your application! RouteLLM will now take the user prompt, use the mf router to determine the complexity of the task and then route to the right model.
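The returned object follows the usual OpenAI chat completion shape, so reading the answer works as you'd expect. A small sketch, assuming the response exposes the standard fields:

# Print the generated answer.
print(response.choices[0].message.content)

# The response object typically also reports which underlying model answered.
print(getattr(response, "model", "model not reported"))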

Now to clarify what I mean by "What router type to use": there are 4 different router types provided by RouteLLM:

  • mf: Uses a matrix factorization model trained on the preference data. This is the best router for most use cases.
  • sw_ranking: Uses a weighted Elo calculation for routing, where each vote is weighted according to how similar it is to the user's prompt.
  • bert: Uses a BERT classifier trained on the preference data.
  • causal_llm: Uses an LLM-based classifier tuned on the preference data.

Long story short, from our experience (and what the authors of RouteLLM suggest), just use the mf router.

For an explanation about where we got the number 0.116 from, see the next section.

How to calibrate the RouteLLM model routing?

The different routers can be tuned for how many prompts - based on a certain calibration dataset - they should send to the strong model and how many to the weak one.

As suggested by the authors, RouteLLM should be calibrated using the freely available Chatbot Arena dataset. This dataset includes over 55,000 real-world user and LLM conversations and user preferences across over 70 state-of-the-art LLMs, such as GPT-4, Claude 2, Llama 2, Gemini, and Mistral models. Each sample represents a battle consisting of 2 LLMs which answer the same question, with a user label of either prefer model A, prefer model B, tie, or tie (both bad).

Now to run the calibration, execute:

OPENAI_API_KEY=sk-XXXXX python -m routellm.calibrate_threshold --task calibrate --routers mf --strong-model-pct 0.5

The output will be something along the lines of:

For 50.0% strong model calls for mf, threshold = 0.11593

This value (0.11593) is the threshold for the router. It's the value you want to set in the client chat completion request as shown in the example above. (Like model="router-mf-0.11593").

Note: Setting the percentage to 50% during calibration does not mean that 50% of your actual queries are routed to the strong model - but that 50% of the prompts in the calibration dataset would be sent to the strong one. Depending on whether your average prompts are more or less complex than the calibration prompts, your actual percentage might differ.

So, the rule of thumb here is: start by calibrating for 50%. If you want to favour the strong model more, set it higher; if you want to favour the weak model more, set it lower. And test your results!
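For example, to calibrate so that roughly 30% of the calibration prompts would go to the strong model, just change the percentage (the printed threshold will then differ from the 0.11593 above):

OPENAI_API_KEY=sk-XXXXX python -m routellm.calibrate_threshold --task calibrate --routers mf --strong-model-pct 0.3

Then use whatever threshold gets printed in your model string, e.g. model="router-mf-<your-threshold>".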

How to use different providers (like Google Gemini) for LLM routing?

Let's say we want to route between GPT-4o and Google Gemini 1.5 Flash. This is in general a very good idea, as GPT-4o is still a better "strong" model than anything Gemini offers, while Gemini Flash is by far the best "weak" model out there, with amazing context length, speed and costs.

All we need to do is set the models in the client instantiation and provide API keys for both providers:

1os.environ["GEMINI_API_KEY"] = ""
2os.environ["OPENAI_API_KEY"] = ""
3
4client = Controller(
5 routers=["mf"],
6 strong_model="gpt-4o",
7 weak_model="gemini/gemini-1.5-flash",
8)

The rest stays the same.

response = client.chat.completions.create(
    # the 0.116 is the cost/quality threshold for the router. more on that
    # in the calibration section above.
    model="router-mf-0.116",
    messages=[
        {"role": "user", "content": "How to maintain my machine?"}
    ]
)

RouteLLM uses LiteLLM under the hood. LiteLLM is a package providing OpenAI-compatible SDK/API endpoints for almost any modern LLM. If you want to use other models with RouteLLM, simply head over to the "supported models" section of the LiteLLM docs and find the name of the API key environment variable and the model name to use.
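As an example, routing between an Anthropic Claude model and Gemini 1.5 Flash could look like the sketch below. The exact model identifier and API key variable should be verified against the LiteLLM docs - the names used here are assumptions:

os.environ["ANTHROPIC_API_KEY"] = ""
os.environ["GEMINI_API_KEY"] = ""

client = Controller(
    routers=["mf"],
    strong_model="anthropic/claude-3-5-sonnet-20240620",  # verify the exact name in the LiteLLM docs
    weak_model="gemini/gemini-1.5-flash",
)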

You can even use local models with Ollama. Run Ollama, then set the Controller as follows:

client = Controller(
    routers=["mf"],
    strong_model="gpt-4o",
    weak_model="ollama/llama2",
    api_base="http://localhost:11434"  # set to your ollama server instance
)

Using the RouteLLM server as drop-in-replacement for OpenAI

In the examples above, we used the RouteLLM Python SDK to integrate it into our Python application. However, RouteLLM can also run as a standalone, OpenAI-API-compatible server. This is probably the way you want to go for production applications.

Run the following command to start the server, using the mf router:

python -m routellm.openai_server --routers mf --strong-model gpt-4o --weak-model gpt-4o-mini
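The server accepts the same LiteLLM model names as the SDK. For example, to route between GPT-4o and Gemini 1.5 Flash instead (assuming OPENAI_API_KEY and GEMINI_API_KEY are set in the environment):

python -m routellm.openai_server --routers mf --strong-model gpt-4o --weak-model gemini/gemini-1.5-flash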

To use the server, simply replace the model name and API URL in your default OpenAI SDK setup, as follows:

from openai import OpenAI  # Use the standard OpenAI SDK

client = OpenAI(
    base_url="http://localhost:6060/v1",  # set to your routellm server instance
    api_key="no_api_key"
)

response = client.chat.completions.create(
    model="router-mf-0.11593",  # Set this to the threshold you calibrated
    messages=[
        {"role": "user", "content": "How to maintain my machine?"}
    ]
)

That's quite a convenient way of potentially saving hundreds of dollars a month by changing just two lines of code in your existing application!

Conclusion

In this post we had a look at what RouteLLM is and how we can use it to save up to 80% on costs when using LLMs - by routing prompts to the right model at the right time.

We've seen how to calibrate the router and how to use different providers like OpenAI and Google Gemini. And finally, we've seen how to use the RouteLLM server as a drop-in-replacement for the OpenAI SDK.

We think RouteLLM is a great extension for any AI toolkit as - with very little effort - it provides almost free cost savings.

Further reading

------------------

Interested in how to train your very own Large Language Model?

We prepared a well-researched guide on how to use the latest advancements in Open Source technology to fine-tune your own LLM. This has many advantages like:

  • Cost control
  • Data privacy
  • Excellent performance - adjusted specifically for your intended use