How to Set Up a Secure, Self-Hosted Large Language Model with vLLM & Caddy


In the rapidly advancing field of artificial intelligence, Large Language Models (LLMs) stand as key components for a wide range of applications, from processing natural language to generating text that mimics human writing. With the increasing sophistication and utility of these models, the need for privacy, security, and the ability to customize becomes essential.

Trusted platforms like Microsoft Azure offer LLMs as a service - under their generally well-regarded security and privacy policies. However, for some applications, the need for a self-hosted LLM is paramount. Self-hosting your own LLM offers the advantage of complete control over your data and the flexibility to adapt the model to meet specific requirements.

The combination of vLLM and the Caddy web server emerges as a practical solution for setting up a secure, self-hosted LLM environment. vLLM is recognized for its excellent inference performance, enabling users to apply cutting-edge AI research on their own systems while offering high throughput at reasonable cost. Furthermore, vLLM is, as we'll see, very easy to set up: running state-of-the-art open-source AI models is a single command away.

However, there is one aspect that vLLM does not cover: HTTPS encryption. This is also an aspect many people simply ignore. They are happy to run their own models - but don't consider encrypting the data channel, which ironically makes their data less secure than when using off-the-shelf AI services. This is where Caddy comes into play. Caddy is a web server that is designed to be easy to use and secure by default. It handles automatic SSL certificate generation and renewal, and it is very easy to set up.

In this guide, we will walk you through the process of setting up a secure, self-hosted LLM environment using vLLM and Caddy. We will cover the installation of vLLM, the configuration of Caddy, and the integration of the two to enable HTTPS encryption. This guide assumes that you have a basic understanding of the command line and are comfortable with running commands in a terminal.

Prerequisites

Before we begin, you will need to have the following:

  • A server with a public IP address
  • Optional, but strongly recommended for performance: a GPU (this guide assumes an NVIDIA GPU)
  • A domain name that points to the public IP address of the server
  • Docker installed on the server
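
If you want to verify these prerequisites up front, a few quick checks on the server can save debugging time later. The following is a minimal sketch: yourdomain.com is a placeholder for your actual domain, dig may need to be installed separately, and nvidia-smi only applies if the server has an NVIDIA GPU.

# Check that Docker is installed and the daemon is running
docker --version
sudo systemctl status docker --no-pager

# Check that your domain resolves to this server's public IP
dig +short yourdomain.com
curl -s https://ifconfig.me

# Optional: check that the NVIDIA driver sees the GPU
nvidia-smi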

General strategy for running your LLM with HTTPS

Running your own LLM in general requires three main components:

  1. The model itself
  2. An inference server that handles requests to the model. While you can invoke the model directly, an inference server not only provides a web API but - more importantly - applies optimizations that allow for parallel requests and faster response times.
  3. A web server that handles the HTTPS encryption and forwards requests to the inference server. While you can access the inference server directly, inference servers typically cannot terminate HTTPS connections themselves and therefore don't provide secure, encrypted connections on their own.

In this guide, we will use vLLM as the inference server (serving an open-source model from Hugging Face) and Caddy as the web server. We will set up vLLM in a Docker container and Caddy as a reverse proxy in front of vLLM. This way, Caddy handles the HTTPS encryption and forwards requests to vLLM.

What is vLLM?

vLLM is a high-performance, open-source inference server for large language models. It is designed to be easy to use and to provide high throughput - according to its own benchmarks, up to 24x higher throughput than standard Hugging Face Transformers.

vLLM inference speed compared

More specifically, vLLM offers the following advantages:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • Support for NVIDIA and AMD GPUs
  • OpenAI-compatible API server

Especially the last point is quite intriguing: vLLM exposes the same API as OpenAI's models, which means you can use the same client libraries and tools you already use with OpenAI's GPT models.
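
To illustrate this, here is what a chat request against a self-hosted vLLM server could look like. This is a sketch: it assumes the server and model we set up in Step 1 below, and it assumes the served model ships a chat template - otherwise, use the /v1/completions endpoint shown later in this guide.

# Chat-style request against the OpenAI-compatible endpoint (assumes the setup from Step 1)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Mistral-7B-OpenOrca-GPTQ",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'

Likewise, most OpenAI client libraries let you point their base URL at your own server instead of api.openai.com, so existing integrations keep working.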

What is Caddy?

Caddy is a web server that is designed to be easy to use and secure by default. It handles automatic SSL certificate generation and renewal, and it does not require any external dependencies. Caddy is also known for its stability and performance.

Preparation

vLLM requires a GPU to run. As we are going to deploy via Docker, we need to install a GPU-aware Docker runtime, such as the NVIDIA Container Toolkit.

To install the NVIDIA container toolkit, you can use the following commands:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
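
After restarting Docker, it is worth checking that containers can actually see the GPU. A simple smoke test is to run nvidia-smi inside a throwaway container, as suggested in NVIDIA's own documentation:

# The runtime injects nvidia-smi from the host driver into the container
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

If this prints the familiar GPU table, the runtime is configured correctly.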

Step 1: Run vLLM in a Docker container

Note: Deploying vLLM on Arm-based Mac devices is currently not supported, as vLLM focuses on high-performance inference and therefore mostly targets professional-grade GPUs. That being said, vLLM works well on consumer-grade GPUs like an NVIDIA RTX 3080 or 3090.

Deploying vLLM is almost too easy to require its own blog entry. Simply launch the vLLM Docker container, define the model that should be served - done.

An up-to-date list of supported models can be found in the vLLM documentation: vLLM model support

We are going to create a new Docker network to run the vLLM server in. This is so that we can run the Caddy server in the same network and have it communicate with the vLLM server.

docker network create vllmnetwork
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  --network=vllmnetwork \
  --name vllm-server \
  -d \
  vllm/vllm-openai:latest \
  --model TheBloke/Mistral-7B-OpenOrca-GPTQ

You can use either the --ipc=host flag or the --shm-size flag to allow the container to access the host's shared memory. vLLM uses PyTorch, which uses shared memory to share data between processes under the hood, particularly for tensor parallel inference.
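
If you prefer not to share the host's IPC namespace, the same deployment can be started with a dedicated shared memory segment instead. The size below is an assumption - choose a value that fits your model and GPU setup:

# Alternative to --ipc=host: allocate a fixed shared memory segment (10g is an example value)
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --shm-size=10g \
  --network=vllmnetwork \
  --name vllm-server \
  -d \
  vllm/vllm-openai:latest \
  --model TheBloke/Mistral-7B-OpenOrca-GPTQ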

This command will start the vLLM server and expose it on port 8000. It will automatically download and serve the model specified in the model parameter. You can now access the server by sending a POST request to http://localhost:8000. You can test the server by sending a simple request to it:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Mistral-7B-OpenOrca-GPTQ",
    "prompt": "San Francisco is a",
    "max_tokens": 20,
    "temperature": 0.1
  }'

Important: As of the time of this writing, vLLM serves only one model at a time. Therefore, the request needs to specify the same model that was used to start the server.
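
You can also ask the server which model it is currently serving - handy if you don't remember the exact model identifier. vLLM's OpenAI-compatible server exposes the standard model listing endpoint:

# Returns the model(s) the server was started with
curl http://localhost:8000/v1/models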

Step 2: Install Caddy

Now that we have a running inference server for our LLM, we need to make sure that we can connect to it securely. This is where Caddy comes into play. It will handle the HTTPS encryption and forward requests to the inference server.

More specifically, we are going to run a Caddy web server using Docker and forward all requests arriving on port 443 to port 8000, with the connection on port 443 secured by HTTPS certificates. This involves creating a Caddyfile, which contains the configuration for Caddy, and running the Caddy server as a Docker container.

Create a Caddyfile

In our Caddyfile, we specify how Caddy should handle incoming requests. For our requirement, the Caddyfile will forward all requests from port 443 to our internal vLLM service running on port 8000. Caddy automatically handles HTTPS certificates for you, making the process seamless.

Here's the example Caddyfile configuration:

https://yourdomain.com {
    reverse_proxy vllm-server:8000
}

Replace yourdomain.com with your actual domain name. This configuration tells Caddy to listen for HTTPS connections for yourdomain.com and forward those requests to the vllm-server container on port 8000 - the container name we assigned when starting vLLM, reachable because both containers share the vllmnetwork Docker network.
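
Before starting the server, you can optionally let Caddy check the Caddyfile for syntax errors. One way to do this with the official Docker image is sketched below; the host path is a placeholder, just like in the run command that follows:

# Validate the Caddyfile without starting the server
docker run --rm \
  -v /path/to/Caddyfile:/etc/caddy/Caddyfile \
  caddy caddy validate --config /etc/caddy/Caddyfile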

Run Caddy

Now that we have our Caddyfile, we can run the Caddy server as a Docker container. We will mount the Caddyfile into the container and expose the container on port 443. To persist the automatically created SSL certificates, we create two volumes, caddy_data and caddy_config, on our host machine, which are mapped to the /data and /config directories of the Caddy container.

docker run -d \
  -p 443:443 \
  -v /path/to/Caddyfile:/etc/caddy/Caddyfile \
  -v caddy_data:/data \
  -v caddy_config:/config \
  --network vllmnetwork \
  caddy

Additional notes

  • Caddy will automatically obtain and renew SSL certificates for your domain using Let's Encrypt, provided your domain is correctly pointed to the server running Caddy.
  • Ensure your server's firewall and security groups (if any) allow traffic on port 443.
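
Once DNS, the firewall, and the certificate are in place, you can repeat the earlier test request against your public domain to confirm the whole chain - Caddy terminating TLS and proxying to vLLM - works end to end. Replace yourdomain.com with your domain and keep the model name you started the server with:

curl https://yourdomain.com/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Mistral-7B-OpenOrca-GPTQ",
    "prompt": "San Francisco is a",
    "max_tokens": 20,
    "temperature": 0.1
  }'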

Conclusion

In this guide, we have walked you through the process of setting up a secure, self-hosted LLM environment using vLLM and Caddy. We have covered the installation of vLLM, the configuration of Caddy, and the integration of the two to enable HTTPS encryption. By following these steps, you can now run your own LLM with HTTPS encryption, providing a robust, private AI environment.

Further reading

Interested in how to train your very own Large Language Model?

We prepared a well-researched guide on how to use the latest advancements in open-source technology to fine-tune your own LLM. Fine-tuning has many advantages, such as:

  • Cost control
  • Data privacy
  • Excellent performance - adjusted specifically for your intended use