DeepSeek R1 On-Prem Setup: Run Advanced AI Models on Your Hardware with SGLang


The landscape of local LLM deployment has shifted in recent months. What was once the exclusive domain of enterprise data centers with expensive hardware setups can now run efficiently on consumer-grade GPUs. This change comes thanks to two key developments: the release of high-quality distilled models (and strong smaller LLMs in general), and advancements in how models are served, such as the SGLang framework.

Today, we'll explore how to run DeepSeek R1, one of the most capable language models currently available, on affordable hardware. We'll be using DeepSeek R1 Distill, a compressed version that maintains impressive performance while significantly reducing hardware requirements, along with SGLang, one of the most advanced model serving frameworks available.

This combination allows us to achieve something remarkable: running a model that competes with GPT-4o on mathematical reasoning, coding, and general knowledge tasks, right on your local machine. For context, DeepSeek R1 Distill Qwen-32B outperforms GPT-4o on several benchmarks, including a Codeforces rating of 1691 (compared to GPT-4o's 759), while the smaller 14B version still delivers exceptional performance for its size.

In this guide, we'll walk through:

  • Understanding what distilled models are and why we like them so much
  • Installing and setting up SGLang for optimal model serving
  • Specific instructions for running DeepSeek R1 Distill on your hardware

While this guide is focused on DeepSeek R1, the principles and techniques we'll cover can be applied to other models as well. Let's get started!

Understanding Distilled Models: Making AI More Accessible

Before we dive into the technical setup, let's understand what makes this deployment possible: model distillation. This concept is crucial for anyone looking to run powerful LLMs locally.

What Are Distilled Models?

Model distillation is a technique where a larger, more complex model (the "teacher") transfers its knowledge to a smaller, more efficient model (the "student"). Think of it as creating a concentrated essence of the original model's capabilities. The DeepSeek team has used this process to create several smaller versions of their powerful R1 model, ranging from 1.5B to 70B parameters.
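
To make the teacher/student idea concrete, here is a minimal sketch of the classic knowledge-distillation loss (soft targets with a temperature), written in PyTorch. This is only an illustration of the general technique, not DeepSeek's exact recipe (their distilled models were reportedly created by fine-tuning smaller open models on data generated by the full R1); the function and tensor names below are purely illustrative.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Classic distillation: train the student to match the teacher's
    softened output distribution via KL divergence on temperature-scaled logits."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2

# Toy example: a batch of 4 "tokens" over a vocabulary of 10
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())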

To learn how these distilled models (or any LLM for that matter) fit into broader AI workflows, check out our post on Integrating Knowledge and LLMs.

The DeepSeek R1 Distill Family

DeepSeek offers several distilled versions of R1:

  • DeepSeek-R1-Distill-Qwen-1.5B
  • DeepSeek-R1-Distill-Qwen-7B
  • DeepSeek-R1-Distill-Qwen-14B
  • DeepSeek-R1-Distill-Qwen-32B
  • DeepSeek-R1-Distill-Llama-70B

For this guide, we'll focus on the Qwen-32B model, which represents the sweet spot between the original R1's capabilities and practical hardware requirements. This model delivers exceptional performance:

  • Achieves 72.6% accuracy on AIME 2024 (compared to GPT-4o's 9.3%)
  • Scores a Codeforces rating of 1691 (GPT-4o: 759)
  • Maintains 94.3% accuracy on MATH-500 benchmark
  • Outperforms OpenAI's o1-mini across various tasks

Why Distilled Models Matter

The significance of these distilled models cannot be overstated. Previously, running state-of-the-art LLMs required one of the following:

  1. Enterprise-grade hardware with massive VRAM (often 80GB+ per GPU)
  2. Complex setups with model parallelism across multiple GPUs
  3. Significant compromises in model quality

With the 32B distilled model, you can now run near-original R1 performance on consumer-grade hardware (or at least hardware with consumer-grade pricing). This democratizes access to advanced AI capabilities, making them available to developers, researchers, and enthusiasts working on standard hardware.

In the next section, we'll look at SGLang, the serving framework that helps us squeeze maximum performance from these models, making the 32B version run efficiently even on consumer hardware.

SGLang: Modern Model Serving for DeepSeek R1

SGLang has emerged as one of the most efficient serving frameworks for large language models, making it our tool of choice for running DeepSeek R1 Distill locally. While alternatives like vLLM exist, SGLang's optimizations make it well-suited for running models like the 32B R1 Distill on consumer hardware.

Key Features for Local Deployment

Several SGLang features are relevant for our setup:

  • Tensor Parallelism support for efficient multi-GPU utilization
  • RadixAttention for optimized prefix caching
  • Continuous batching for improved throughput
  • FlashInfer kernels for faster inference
  • BF16 and FP8 support for reduced memory footprint
  • Torch compile integration

Hardware Requirements

For running the 32B model, you'll need:

  • 2x NVIDIA GPUs with 24GB VRAM each (NVIDIA RTX 3090 or better)
  • 32GB system RAM
  • NVMe SSD with at least 100GB free space

(For additional tips on keeping your setup budget-friendly, see How to Save on LLM Costs).
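
As a rough sanity check on these requirements: the weights alone take roughly parameter count times bytes per parameter, so a 32B-parameter model needs about 60 GiB in BF16 but only about 30 GiB in FP8, before counting the KV cache and runtime overhead. That is why two 24GB cards, together with the FP8 support mentioned above (or an already-quantized checkpoint), are what makes this model practical. A quick back-of-the-envelope calculation:

# Back-of-the-envelope VRAM estimate for the weights alone
# (KV cache, activations, and CUDA overhead come on top of this).
params = 32e9  # ~32B parameters

for name, bytes_per_param in [("FP16/BF16", 2), ("FP8/INT8", 1)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gib:.0f} GiB of weights")

# FP16/BF16: ~60 GiB of weights -> does not fit in 2x24 GB
# FP8/INT8:  ~30 GiB of weights -> fits in 2x24 GB with room for the KV cache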

Basic Installation

SGLang requires CUDA toolkit 12.4 or higher (the FlashInfer wheel referenced below is built for CUDA 12.4 and PyTorch 2.5). If you haven't installed it yet, please follow these steps: Nvidia Cuda Installation

pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]>=0.4.3.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
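
Before moving on, it's worth sanity-checking the installation. A small snippet like the following should print the SGLang version you just installed and confirm that PyTorch can see your GPUs (this assumes a CUDA-capable environment; the attribute names are standard but worth verifying against your installed versions):

import sglang
import torch

# Quick sanity check: the package imports and CUDA is visible to PyTorch
print("sglang version:", sglang.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())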

Model Setup

First, let's download the model:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    local_dir="./models/deepseek-r1-32b"
)

Then launch the server:

python3 -m sglang.launch_server \
    --model-path ./models/deepseek-r1-32b \
    --trust-remote-code
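
On the 2x24GB setup described above, you'll also want to spread the model across both GPUs with tensor parallelism and, if the weights don't fit, lean on SGLang's FP8 support or a pre-quantized checkpoint. The sketch below shows the same launch wrapped in a Python subprocess call with a few of the features from the previous section enabled; the flag names reflect SGLang's CLI as we understand it, so confirm them with `python3 -m sglang.launch_server --help` for your installed version.

import subprocess

# Hedged sketch: launch the server across two GPUs with a few extra optimizations.
# Equivalent to running the same command in a shell; Popen returns immediately
# and leaves the server running in the background.
server = subprocess.Popen([
    "python3", "-m", "sglang.launch_server",
    "--model-path", "./models/deepseek-r1-32b",
    "--trust-remote-code",
    "--tp", "2",                      # tensor parallelism across 2 GPUs
    "--enable-torch-compile",         # torch compile integration
    "--mem-fraction-static", "0.85",  # fraction of each GPU's memory SGLang may use
    "--host", "0.0.0.0",
    "--port", "30000",
])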

Running SGLang with Docker

If you prefer Docker, you can use the following command:

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --host 0.0.0.0 --port 30000
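
Whichever way you launch it, loading 32B parameters from disk takes a while. A small polling helper saves you from firing requests too early; the sketch below assumes the OpenAI-compatible /v1/models route that SGLang exposes (adjust the URL if you changed host or port):

import time
import requests

def wait_for_server(base_url="http://localhost:30000", timeout=600):
    """Poll the server until the OpenAI-compatible /v1/models endpoint answers."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            r = requests.get(f"{base_url}/v1/models", timeout=5)
            if r.status_code == 200:
                print("Server is ready:", r.json())
                return
        except requests.exceptions.RequestException:
            pass  # server not up yet, keep polling
        time.sleep(5)
    raise TimeoutError("SGLang server did not become ready in time")

wait_for_server()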

In the next section, we'll cover the client setup and basic inference patterns.

Running Your First Inferences with DeepSeek R1 and SGLang

Basic Inference Using OpenAI Client

The nice thing about SGLang is that it exposes an OpenAI-compatible API, so you can use the OpenAI Python client to interact with your local server, which makes the transition from cloud to local deployment very easy.

import openai

client = openai.Client(
    base_url="http://localhost:30000/v1",  # Default SGLang port
    api_key="none"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[
        {"role": "user", "content": "Explain how neural networks work"}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
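
One DeepSeek-R1-specific detail: the distilled models are reasoning models and typically emit their chain of thought wrapped in <think>...</think> tags before the final answer. If your application only needs the final answer, you can strip that block yourself; a simple hedged approach (exact output formatting can vary with prompts and serving configuration):

import re

def strip_reasoning(text: str) -> str:
    """Remove a leading <think>...</think> block that R1-style models emit."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

# `response` is the object returned by the previous snippet
answer = strip_reasoning(response.choices[0].message.content)
print(answer)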

Structured Output

For software-related problems or any task requiring structured output, we can use SGLang's JSON schema support:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[
        {"role": "user", "content": "Solve this calculus problem: Find the derivative of x³ + 2x² - 5x + 3"}
    ],
    temperature=0.1,
    max_tokens=512,
    response_format={
        "type": "json_schema",
        "json_schema": {
            # The OpenAI-style format expects an arbitrary schema name
            # plus the actual JSON schema under "schema"
            "name": "calculus_solution",
            "schema": {
                "type": "object",
                "properties": {
                    "steps": {"type": "array", "items": {"type": "string"}},
                    "final_answer": {"type": "string"}
                },
                "required": ["steps", "final_answer"]
            }
        }
    }
)
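
Because the output is constrained to the schema, the message content comes back as a JSON string that you can parse directly, for example:

import json

# Parse the constrained output into a regular Python dict
result = json.loads(response.choices[0].message.content)
print("Steps:")
for step in result["steps"]:
    print("-", step)
print("Final answer:", result["final_answer"])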

Direct API Requests

You can also use Python's requests library for more direct control:

import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "messages": [
            {"role": "user", "content": "Explain quantum computing"}
        ],
        "temperature": 0.7,
        "max_tokens": 512
    }
)

print(response.json()['choices'][0]['message']['content'])
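
Besides the OpenAI-compatible routes, SGLang also exposes its own native /generate endpoint, which takes raw text plus a sampling_params object. Note that this bypasses the chat template, so any prompt formatting is up to you. A minimal sketch following SGLang's documented native API (verify the payload shape against the version you are running):

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)

# The native endpoint returns the generated text plus metadata (token counts etc.)
print(response.json())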

In the next section, we'll explore advanced features like streaming responses and batch processing.

Streaming Responses

Streaming responses is often useful for GUI applications: users get feedback earlier, so the model's answers feel snappier. To stream with the OpenAI client, set stream=True:

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

# Use stream=True for streaming responses
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=640,
    stream=True,
)

# Handle the streaming output chunk by chunk
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
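
Earlier we also mentioned batch processing. Because SGLang batches concurrent requests on the server side (the continuous batching feature from the SGLang section), the simplest client-side pattern is just to send several requests in parallel, for example with a thread pool. A minimal sketch; the prompts are illustrative:

from concurrent.futures import ThreadPoolExecutor

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

prompts = [
    "Summarize the theory of relativity in one sentence.",
    "Write a haiku about GPUs.",
    "What is the derivative of sin(x)?",
]

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=256,
    )
    return response.choices[0].message.content

# The requests run concurrently; SGLang's continuous batching merges them on
# the GPU, so total latency is far lower than processing them one by one.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, answer in zip(prompts, pool.map(ask, prompts)):
        print(f"### {prompt}\n{answer}\n")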

Interested in how to train your very own Large Language Model?

We prepared a well-researched guide on how to use the latest advancements in open-source technology to fine-tune your own LLM. This approach has many advantages, such as:

  • Cost control
  • Data privacy
  • Excellent performance, tuned specifically for your intended use

Further Reading

  • More information on our managed RAG solution? To Pondhouse AI
  • More tips and tricks on how to work with AI? To our Blog