vllm.entrypoints.cli.benchmark
Modules:
Name         Description
base
latency
main
serve
throughput