Overview
This document describes how to deploy the DeepSeek-R1 model with the SGLang framework on NVIDIA H100 GPUs and how to benchmark the resulting inference service.
Summary: How SGLang Enhances DeepSeek-R1's Deployment
SGLang enhances performance and efficiency in DeepSeek-R1's deployment through advanced optimizations:
- 7x Throughput Improvement: RadixAttention enables KV cache reuse across requests, significantly boosting inference speed for DeepSeek-R1.
- 3.8x Memory Efficiency: FP8/INT4 mixed quantization and FlashInfer kernels optimize DeepSeek-R1's GPU memory usage, allowing for larger batch sizes.
- Dynamic Load Balancing: Cache-aware load balancing dynamically allocates requests, ensuring efficient multi-node scaling for DeepSeek-R1.
- Optimized Model Execution: Supports DeepSeek-R1's MoE (Mixture of Experts) architecture, ensuring high computational efficiency with expert routing.
- Seamless Distributed Deployment: Enables multi-GPU, multi-node expansion for DeepSeek-R1, making large-scale AI workloads viable.
- Improved Logical Reasoning: Through optimized reinforcement learning strategies, SGLang enhances DeepSeek-R1’s structured reasoning and output coherence.
By leveraging these optimizations, SGLang transforms DeepSeek-R1’s deployment into a high-performance, scalable, and cost-effective AI inference system.
SGLang Framework Technical Highlights:
- Full-stack Optimized Architecture
- Backend Runtime Innovation: Uses RadixAttention for cross-request reuse of prefix KV caches, combined with zero-overhead scheduling (overlapping CPU scheduling with GPU computation) and chunked prefill, achieving a 7x throughput improvement compared to vLLM. Supports TP/DP/EP parallel strategies and integrates FlashInfer kernels with FP8/INT4 mixed quantization, yielding a 3.8x improvement in memory efficiency.
- Frontend Interaction Paradigm Upgrade: Uses a declarative DSL to support chained programming over multi-modal inputs, providing control flow, structured generation (xGrammar, up to 10x faster), and external API interaction, replacing the traditional LangChain development model.
- Hardware Ecosystem Compatibility: Covers NVIDIA/AMD GPU clusters and is deeply adapted to the MLA (Multi-head Latent Attention) mechanism of the DeepSeek model series. Data-parallel attention eliminates KV cache redundancy, supporting multi-node scaling and enterprise-level PB-scale data processing.
- Performance Benchmark Comparison

DeepSeek-R1 Model Technical Breakthroughs:
- Architecture Design
- Built on the DeepSeek-V3 MoE architecture, DeepSeek-R1 adds a dynamic routing system that allocates simple tasks to fast paths and complex inference to expert networks, enabling explicit thinking paths and logical loop verification.
- Training Paradigm Innovation
- Three-phase Reinforcement Learning:
- R1-Zero Phase: A purely rule-driven reward system drives self-evolution of the reasoning chain, generating zero-supervision training data that incorporates self-reflection and chain-of-thought (CoT) reasoning.
- R1 Optimization Phase: Introduces high-quality cold-start data to build a feedback loop of "capability enhancement → data generation → model reinforcement," raising mathematical reasoning accuracy from 90.2% to 93.7% and surpassing GPT-4.
- Distillation Phase: Outputs models ranging from 1.5B to 70B parameters, verifying that data quality determines the theoretical upper limit of model performance.
- Algorithm Engineering Breakthroughs
- GRPO Training Framework: Replaces independent Critic evaluation with group response sampling, directly optimizing the advantage function in the policy network, resulting in a 300% improvement in training efficiency. The dual reward mechanism (correctness of answers + chain of thought standardization) ensures that the generation results are interpretable.
- Industry Application Value
- Achieves expert-level performance in code generation (HumanEval 86.3%) and financial analysis, supporting structured output such as <analysis> and <answer> tags.
- The 7B distilled version enables deployment on consumer-grade GPUs, transitioning LLMs from simple generative tools to intelligent partners with logical loop validation capabilities.
Experimental Environment:
For hardware, two servers were prepared, each equipped with 8 NVIDIA H100 GPUs, for a total of 16 GPUs. On the software side, the SGLang framework serves as the inference engine, and everything runs in Docker containers. The specific setup is as follows:
- Machine: 2 x 8-GPU H100 servers with IB devices
- OS: Ubuntu 22.04.5 LTS
- Docker: Community 28.0.0-rc.2
- SGLang: docker.io/lmsysorg/sglang:v0.4.3.post2-cu125-srt
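Before launching anything, a quick sanity check of each server helps catch missing drivers or unconfigured IB links early. The commands below are a rough sketch; device names and counts depend on your hardware, and ibstat comes from the infiniband-diags package:
Bash
# Expect 8 H100 GPUs per server:
nvidia-smi --query-gpu=name --format=csv,noheader | sort | uniq -c
# InfiniBand adapters and port state:
ibstat | grep -E "^CA '|State:"
# Container runtime version:
docker --version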
Launch Containers:
The DeepSeek-R1 model has 671B parameters, while a single NVIDIA H100 GPU has 80 GB of memory; even in FP8 (roughly one byte per parameter), the weights alone exceed the 640 GB of aggregate HBM on one 8-GPU node. Therefore, two machines need to collaborate to provide the memory and compute for a distributed inference service.
One machine takes the MASTER role for the distributed service, while the other is designated as the WORKER.
Start Master:
Use the following script to start the MASTER container. The parameters in the script define the role and other configurations for starting the container:

Bash
#!/bin/bash
MODEL_PATH=/data/huggingface
mkdir -p $MODEL_PATH
IPV4_IFNAME=$(ip route get 8.8.8.8 | awk '{print $5; exit}')
IPV4_ADDR=$(ip addr show "${IPV4_IFNAME}" | grep 'inet ' | awk '{print $2}' | cut -d/ -f1)
IB_IFNAME=$(ip -j a | jq -r '.[] | select(.link_type == "infiniband") | .ifname' | head -n1)
export NCCL_SOCKET_IFNAME=${IB_IFNAME}
export GLOO_SOCKET_IFNAME=${IPV4_IFNAME}
export NCCL_IB_HCA="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1"
export NCCL_IB_GID_INDEX="0"
export NCCL_IB_DISABLE=0
export NCCL_DEBUG=INFO
export NNODES=2
export NODE_RANK=0
export TP=16
export IMAGE="docker.io/lmsysorg/sglang:v0.4.3.post2-cu125-srt"
export MASTER_ADDR=${IPV4_ADDR}:5000
docker run \
-d \
--gpus all \
--network host \
--shm-size 32g \
--name sglang \
--privileged \
--entrypoint python3 \
-v ${MODEL_PATH}:/root/.cache/huggingface \
--ipc=host \
-e NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME} \
-e GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME} \
-e NCCL_IB_HCA=${NCCL_IB_HCA} \
-e NCCL_IB_GID_INDEX=${NCCL_IB_GID_INDEX} \
-e NCCL_IB_DISABLE=${NCCL_IB_DISABLE} \
-e NCCL_DEBUG=${NCCL_DEBUG} \
${IMAGE} \
-m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--tp ${TP} \
--dist-init-addr ${MASTER_ADDR} \
--nnodes ${NNODES} \
--node-rank ${NODE_RANK} \
--trust-remote-code \
--host 0.0.0.0 \
--port 40000
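After this script runs, the master container typically blocks while loading weights and waiting for the second node to join the tensor-parallel group. Progress can be followed in the container logs (this assumes the container name sglang used in the script above):
Bash
docker logs -f sglang   # Ctrl-C detaches without stopping the container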
Start Worker:
When starting the WORKER, most parameters are the same as on the MASTER; the differences are that NODE_RANK is set to 1 and the environment variable MASTER_IPV4_ADDR must be set to the MASTER node's IPV4_ADDR.
Bash
#!/bin/bash
MODEL_PATH=/data/huggingface
mkdir -p $MODEL_PATH
IPV4_IFNAME=$(ip route get 8.8.8.8 | awk '{print $5; exit}')
IB_IFNAME=$(ip -j a | jq -r '.[] | select(.link_type == "infiniband") | .ifname' | head -n1)
export NCCL_SOCKET_IFNAME=${IB_IFNAME}
export GLOO_SOCKET_IFNAME=${IPV4_IFNAME}
export NCCL_IB_HCA="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1"
export NCCL_IB_GID_INDEX="0"
export NCCL_IB_DISABLE=0
export NCCL_DEBUG=INFO
export NNODES=2
export NODE_RANK=1
export TP=16
export IMAGE="docker.io/lmsysorg/sglang:v0.4.3.post2-cu125-srt"
export MASTER_IPV4_ADDR=   # fill in the MASTER node's IPV4_ADDR before running
export MASTER_ADDR=${MASTER_IPV4_ADDR}:5000
docker run \
-d \
--gpus all \
--network host \
--shm-size 32g \
--name sglang \
--privileged \
--entrypoint python3 \
-v ${MODEL_PATH}:/root/.cache/huggingface \
--ipc=host \
-e NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME} \
-e GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME} \
-e NCCL_IB_HCA=${NCCL_IB_HCA} \
-e NCCL_IB_GID_INDEX=${NCCL_IB_GID_INDEX} \
-e NCCL_IB_DISABLE=${NCCL_IB_DISABLE} \
-e NCCL_DEBUG=${NCCL_DEBUG} \
${IMAGE} \
-m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--tp ${TP} \
--dist-init-addr ${MASTER_ADDR} \
--nnodes ${NNODES} \
--node-rank ${NODE_RANK} \
--trust-remote-code \
--host 0.0.0.0 \
--port 40000
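Once both containers are up and the weights have finished loading, the service can be verified from the MASTER node. The sketch below assumes the standard HTTP endpoints exposed by sglang.launch_server (/health and the OpenAI-compatible /v1/chat/completions); substitute the MASTER node's IP address if ${IPV4_ADDR} is not set in your shell:
Bash
# Liveness check:
curl -s http://${IPV4_ADDR}:40000/health && echo "server healthy"
# Minimal end-to-end request:
curl -s http://${IPV4_ADDR}:40000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'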
Stress Testing:
The SGLang code repository includes a set of stress-testing tools. Next, we start an SGLang container and run the stress test inside it.
Start Test Container:
To persist the test results, a results directory is created in the working directory and mounted into the container. In addition, the environment variable MASTER_IPV4_ADDR needs to be set to the IPV4_ADDR of the MASTER node.
Bash
#!/bin/bash
mkdir -p ./results
export IMAGE="docker.io/lmsysorg/sglang:v0.4.3.post2-cu125-srt"
export MASTER_IPV4_ADDR=   # fill in the MASTER node's IPV4_ADDR before running
docker run \
-it \
--rm \
--network host \
--name sglang-shell \
--entrypoint /bin/bash \
-v "$(pwd)/results:/opt/results" \
-e OUTPUT_DIR=/opt/results \
-e BASE_URL=${MASTER_IPV4_ADDR}:40000 \
${IMAGE}
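Before starting the sweep, it is worth confirming from inside the container that the inference service is reachable. This is a minimal check under the assumption that the server exposes the /health and /get_model_info endpoints; http:// is prepended because the BASE_URL set above carries no scheme:
Bash
# Run inside the sglang-shell container:
curl -s "http://${BASE_URL}/health" && echo "server reachable"
curl -s "http://${BASE_URL}/get_model_info"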
Run Test:
In the stress testing script, the three main parameters are:
- BATCH_SIZES: Number of concurrent requests
- INPUT_LEN: Number of input tokens
- OUTPUT_LEN: Number of output tokens
The test script will combine these three parameters and evaluate the inference performance for each combination. The test results will be output to the results directory.
Bash
#!/bin/bash
# Define the parameter combination pool
BATCH_SIZES=(1 4 8 16 32)
INPUT_LENS=(128 256 512)
OUTPUT_LENS=(256 512 1024 2048 4096 8192)
# Iterate over all parameter combinations
for BATCH_SIZE in "${BATCH_SIZES[@]}"; do
  for INPUT_LEN in "${INPUT_LENS[@]}"; do
    for OUTPUT_LEN in "${OUTPUT_LENS[@]}"; do
      echo "Testing batch=$BATCH_SIZE input=$INPUT_LEN output=$OUTPUT_LEN"
      RESULT_FILE="${OUTPUT_DIR}/${BATCH_SIZE}-${INPUT_LEN}-${OUTPUT_LEN}.jsonl"
      # Check if the result file exists; skip combinations already tested
      if [[ -f "$RESULT_FILE" ]]; then
        echo "Skip current iteration: batch=${BATCH_SIZE} input=${INPUT_LEN} output=${OUTPUT_LEN}"
        continue
      fi
      # Execute a single test
      python3 -m sglang.bench_one_batch_server \
        --model None \
        --base-url "${BASE_URL}" \
        --batch-size "${BATCH_SIZE}" \
        --input-len "${INPUT_LEN}" \
        --output-len "${OUTPUT_LEN}" \
        --result-filename "${RESULT_FILE}"
    done
  done
done
Test Data:
After the stress tests complete, each run's results are recorded in the following format:
- Latency: wall-clock time of the run, in seconds
- Output Throughput: batch_size * output_len / latency (tokens/s)
- Overall Throughput: batch_size * (input_len + output_len) / latency (tokens/s)
For the sample record below, 1 * 256 / 8.6279 ≈ 29.67 tokens/s of output throughput and 1 * (128 + 256) / 8.6279 ≈ 44.51 tokens/s overall.
JSON
{
  "run_name": "default",
  "batch_size": 1,
  "input_len": 128,
  "output_len": 256,
  "latency": 8.6279,
  "output_throughput": 29.67,
  "overall_throughput": 44.51
}
After all test cases complete, 90 result files (5 batch sizes × 3 input lengths × 6 output lengths) have been generated, and the data is then analyzed.
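As a starting point for that analysis, the per-run records can be flattened into one CSV for plotting. This is a sketch assuming jq is installed and that each result file holds one JSON record per line, as in the sample above:
Bash
echo "batch_size,input_len,output_len,latency,output_throughput,overall_throughput" > summary.csv
cat ./results/*.jsonl \
  | jq -r '[.batch_size, .input_len, .output_len, .latency,
            .output_throughput, .overall_throughput] | @csv' \
  >> summary.csv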
Test Results:
Throughput vs Batch Size

This chart clearly shows the impact of batch size on the throughput of language model inference. We can interpret the data from the following technical dimensions:
Key Observations
a. Batch Size and Throughput Show a Strong Positive Correlation
As the batch size increases from 1 to 32, throughput jumps from about 50 tokens/s to nearly 600 tokens/s, an increase of nearly 12x. This indicates that GPU utilization in the test environment improves markedly with larger batches, consistent with the speedup expected from increased parallelism (with Amdahl's Law bounding the eventual gains).
b. Diminishing Returns
As the batch size grows from 8 to 16 to 32, the throughput growth rate gradually flattens. For example:
- From batch size 1 to 8, throughput roughly doubles at each step.
- From batch size 16 to 32, throughput increases by only about 25%. This suggests that hardware resources (such as memory bandwidth and SM compute units) are gradually becoming the bottleneck.
Boxplot Technical Interpretation
Each boxplot contains the following statistical information:
- Median Line: Represents the typical throughput level (e.g., for batch size 32, the median is about 580 tokens/s).
- Box Range (25%-75% Percentile): Reflects throughput stability (larger batch sizes lead to smaller box heights, indicating more concentrated results).
- Whiskers (Upper and Lower Extending Lines): Show extreme fluctuations in performance (for batch size 1, the whiskers are longer, indicating more performance variation with small batch sizes).
Engineering Practice Guidelines
Batch Size Strategy
- Offline Processing: Prioritize using BatchSize=32 to maximize hardware utilization.
- Real-time Inference: Consider BatchSize=16 to balance throughput and latency, or keep the offline setting (BatchSize=32) when latency requirements are loose.
Throughput vs Parameters Correlation

Conclusions from the Chart
Core Conclusions
- Batch Size and throughput show a very strong positive correlation (correlation coefficient of 0.993).
- Batch size is the dominant factor influencing throughput; increasing batch_size improves throughput almost linearly over the tested range.
- Input Length/Output Length has no significant correlation with throughput.
- Changes in input and output lengths have a negligible impact on throughput (correlation coefficient < 0.01).
- This indicates that the framework is well-optimized for handling variable-length sequences.
Key Insights
- Compute-bound Behavior: Throughput is determined primarily by parallel computation efficiency rather than by sequence-length-related memory bandwidth limits.
- Hardware Resource Utilization Optimization:
- Increase batch_size to improve GPU streaming multiprocessor (SM) utilization.
- Input and output lengths do not need to be constrained at the business-logic level (e.g., in long-text generation scenarios).