Modules:
| Name | Description | 
|---|---|
| adapters |  | 
| aimv2 |  | 
| apertus | Inference-only Apertus model compatible with HuggingFace weights. | 
| arcee |  | 
| arctic | Inference-only Snowflake Arctic model. | 
| aria |  | 
| aya_vision |  | 
| baichuan | Inference-only BaiChuan model compatible with HuggingFace weights. | 
| bailing_moe | Inference-only BailingMoE model compatible with HuggingFace weights. | 
| bamba | Inference-only Bamba model. | 
| bert |  | 
| bert_with_rope |  | 
| blip | Minimal implementation of BlipVisionModel intended to be only used | 
| blip2 |  | 
| bloom | Inference-only BLOOM model compatible with HuggingFace weights. | 
| chameleon |  | 
| chatglm | Inference-only ChatGLM model compatible with THUDM weights. | 
| clip | Minimal implementation of CLIPVisionModel intended to be only used | 
| cohere2_vision | Command-A-Vision (Cohere2Vision) multimodal model implementation for vLLM. | 
| commandr | PyTorch Cohere model. | 
| config |  | 
| dbrx |  | 
| deepseek | Inference-only Deepseek model. | 
| deepseek_eagle |  | 
| deepseek_mtp |  | 
| deepseek_v2 | Inference-only DeepseekV2/DeepseekV3 model. | 
| deepseek_vl2 | Inference-only Deepseek-VL2 model compatible with HuggingFace weights. | 
| dots1 | Inference-only dots1 model. | 
| dots_ocr |  | 
| ernie45 | Inference-only Ernie model compatible with HuggingFace weights. | 
| ernie45_moe | Inference-only ErnieMoE model compatible with HuggingFace weights. | 
| ernie45_vl | Inference-only Ernie VL model compatible with HuggingFace weights. | 
| ernie45_vl_moe | Inference-only Ernie VL model compatible with HuggingFace weights. | 
| ernie_mtp | Inference-only Ernie-MTP model. | 
| exaone | Inference-only Exaone model compatible with HuggingFace weights. | 
| exaone4 | Inference-only Exaone model compatible with HuggingFace weights. | 
| fairseq2_llama | Llama model for fairseq2 weights. | 
| falcon | PyTorch Falcon model. | 
| falcon_h1 | Inference-only FalconH1 model. | 
| fuyu | PyTorch Fuyu model. | 
| gemma | Inference-only Gemma model compatible with HuggingFace weights. | 
| gemma2 |  | 
| gemma3 |  | 
| gemma3_mm |  | 
| gemma3n |  | 
| gemma3n_mm |  | 
| glm | Inference-only HF format GLM-4 model compatible with THUDM weights. | 
| glm4 | Inference-only GLM-4-0414 model compatible with HuggingFace weights. | 
| glm4_1v | Inference-only GLM-4V model compatible with HuggingFace weights. | 
| glm4_moe | Inference-only GLM-4.5, GLM-4.6 model compatible with HuggingFace weights. | 
| glm4_moe_mtp | Inference-only GLM-4.5 MTP model compatible with HuggingFace weights. | 
| glm4v | Inference-only CogAgent model compatible with THUDM weights. | 
| gpt2 | Inference-only GPT-2 model compatible with HuggingFace weights. | 
| gpt_bigcode | Inference-only GPTBigCode model compatible with HuggingFace weights. | 
| gpt_j | Inference-only GPT-J model compatible with HuggingFace weights. | 
| gpt_neox | Inference-only GPT-NeoX model compatible with HuggingFace weights. | 
| gpt_oss |  | 
| granite | Inference-only IBM Granite model compatible with HuggingFace weights. | 
| granite_speech | Inference-only IBM Granite speech model. | 
| granitemoe | Inference-only GraniteMoe model. | 
| granitemoehybrid | Inference-only GraniteMoeHybrid model. | 
| granitemoeshared | Inference-only GraniteMoeShared model. | 
| gritlm |  | 
| grok1 | Inference-only Grok1 model. | 
| h2ovl |  | 
| hunyuan_v1 | Inference-only HunYuan model compatible with HuggingFace weights. | 
| hyperclovax_vision |  | 
| idefics2_vision_model | PyTorch Idefics2 model. | 
| idefics3 | Inference-only Idefics3 model compatible with HuggingFace weights. | 
| interfaces |  | 
| interfaces_base |  | 
| intern_vit |  | 
| internlm2 |  | 
| internlm2_ve |  | 
| interns1 |  | 
| interns1_vit |  | 
| internvl |  | 
| jais | Inference-only Jais model compatible with HuggingFace weights. | 
| jamba | Inference-only Jamba model. | 
| jina_vl |  | 
| keye |  | 
| keye_vl1_5 |  | 
| kimi_vl |  | 
| lfm2 |  | 
| llama | Inference-only LLaMA model compatible with HuggingFace weights. | 
| llama4 | Inference-only LLaMA model compatible with HuggingFace weights. | 
| llama4_eagle |  | 
| llama_eagle |  | 
| llama_eagle3 |  | 
| llava |  | 
| llava_next |  | 
| llava_next_video |  | 
| llava_onevision |  | 
| longcat_flash | Inference-only Flash model compatible with HuggingFace weights. | 
| longcat_flash_mtp |  | 
| mamba | PyTorch MAMBA model. | 
| mamba2 | PyTorch MAMBA2 model. | 
| medusa |  | 
| midashenglm | Inference-only MiDashengLM model compatible with HuggingFace weights. | 
| mimo | Inference-only MiMo model compatible with HuggingFace weights. | 
| mimo_mtp | Inference-only MiMo-MTP model. | 
| minicpm | Inference-only MiniCPM model compatible with HuggingFace weights. | 
| minicpm3 | Inference-only MiniCPM3 model compatible with HuggingFace weights. | 
| minicpm_eagle | Inference-only EagleMiniCPM model compatible with HuggingFace weights. | 
| minicpmo | Inference-only MiniCPM-O model compatible with HuggingFace weights. | 
| minicpmv | Inference-only MiniCPM-V model compatible with HuggingFace weights. | 
| minimax_text_01 | Inference-only MiniMaxText01 model. | 
| minimax_vl_01 |  | 
| mistral3 |  | 
| mixtral | Inference-only Mixtral model. | 
| mllama4 |  | 
| mlp_speculator |  | 
| modernbert |  | 
| module_mapping |  | 
| molmo |  | 
| moonvit |  | 
| mpt |  | 
| nano_nemotron_vl |  | 
| nemotron | Inference-only Nemotron model compatible with HuggingFace weights. | 
| nemotron_h | Inference-only NemotronH model. | 
| nemotron_nas | Inference-only deci model compatible with HuggingFace weights. | 
| nemotron_vl |  | 
| nvlm_d |  | 
| olmo | Inference-only OLMo model compatible with HuggingFace weights. | 
| olmo2 | Inference-only OLMo2 model compatible with HuggingFace weights. | 
| olmoe | Inference-only OLMoE model compatible with HuggingFace weights. | 
| opt | Inference-only OPT model compatible with HuggingFace weights. | 
| orion | Inference-only Orion-14B model compatible with HuggingFace weights. | 
| ovis | PyTorch Ovis model. | 
| ovis2_5 | PyTorch Ovis model. | 
| paligemma |  | 
| persimmon | Inference-only persimmon model compatible with HuggingFace weights. | 
| phi | Inference-only Phi-1.5 model compatible with HuggingFace weights. | 
| phi3 | Inference-only Phi3 model; the code inherits from Llama.py. | 
| phi3v |  | 
| phi4_multimodal |  | 
| phi4mm |  | 
| phi4mm_audio |  | 
| phi4mm_utils |  | 
| phimoe | Inference-only PhiMoE model. | 
| pixtral |  | 
| plamo2 | Inference-only PLaMo2 model. | 
| qwen | Inference-only QWen model compatible with HuggingFace weights. | 
| qwen2 | Inference-only Qwen2 model compatible with HuggingFace weights. | 
| qwen2_5_omni_thinker | Inference-only Qwen2.5-Omni model (thinker part). | 
| qwen2_5_vl | Inference-only Qwen2.5-VL model compatible with HuggingFace weights. | 
| qwen2_audio | Inference-only Qwen2-Audio model compatible with HuggingFace weights. | 
| qwen2_moe | Inference-only Qwen2MoE model compatible with HuggingFace weights. | 
| qwen2_rm | Inference-only Qwen2-RM model compatible with HuggingFace weights. | 
| qwen2_vl | Inference-only Qwen2-VL model compatible with HuggingFace weights. | 
| qwen3 | Inference-only Qwen3 model compatible with HuggingFace weights. | 
| qwen3_moe | Inference-only Qwen3MoE model compatible with HuggingFace weights. | 
| qwen3_next | Inference-only Qwen3Next model. | 
| qwen3_next_mtp | Inference-only Qwen3Next MTP model. | 
| qwen3_vl | Inference-only Qwen3VL model compatible with HuggingFace weights. | 
| qwen3_vl_moe | Inference-only Qwen3-VL-MoE model compatible with HuggingFace weights. | 
| qwen_vl | Inference-only Qwen-VL model compatible with HuggingFace weights. | 
| radio |  | 
| registry | Whenever you add an architecture to this page, please also update | 
| roberta |  | 
| rvl |  | 
| seed_oss | Inference-only SeedOss model compatible with HuggingFace weights. | 
| siglip | Implementation of SiglipVisionModel intended to be only used | 
| siglip2navit | Implementation of SiglipVisionModel intended to be only used | 
| skyworkr1v |  | 
| smolvlm |  | 
| solar | Inference-only Solar model compatible with HuggingFace weights. | 
| stablelm | Inference-only StableLM (https://github.com/Stability-AI/StableLM) | 
| starcoder2 | PyTorch Starcoder2 model. | 
| step3_text | Inference-only Jurassic model. | 
| step3_vl |  | 
| swin |  | 
| tarsier |  | 
| telechat2 |  | 
| teleflm |  | 
| terratorch | Wrapper around  | 
| transformers | Wrapper around  | 
| transformers_moe | Wrapper around  | 
| transformers_pooling | Wrapper around  | 
| ultravox | PyTorch Ultravox model. | 
| utils |  | 
| vision |  | 
| voxtral |  | 
| whisper |  | 
| zamba2 | PyTorch Zamba2 model implementation for vLLM. | 
 module-attribute  ¶
 ModelRegistry = _ModelRegistry(
    {
        model_arch: (
            _LazyRegisteredModel(
                module_name=f"vllm.model_executor.models.{mod_relname}",
                class_name=cls_name,
            )
        )
        for (model_arch, (mod_relname, cls_name)) in (
            items()
        )
    }
)
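The comprehension above maps each architecture name to an entry that records where the class lives without importing it. A minimal sketch of that lazy-lookup idea follows; `LazyEntry`, `TOY_MODELS`, and `toy_registry` are hypothetical names for illustration, not vLLM APIs, and standard-library classes stand in for model classes so the snippet runs anywhere.

```python
import importlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyEntry:
    """Hypothetical stand-in for a lazily registered model class."""

    module_name: str
    class_name: str

    def load(self) -> type:
        # The import happens only when the class is actually requested.
        module = importlib.import_module(self.module_name)
        return getattr(module, self.class_name)


# Hypothetical mapping: architecture name -> (module path, class name).
TOY_MODELS = {
    "OrderedDictArch": ("collections", "OrderedDict"),
    "FractionArch": ("fractions", "Fraction"),
}

toy_registry = {
    arch: LazyEntry(module_name=mod, class_name=cls)
    for arch, (mod, cls) in TOY_MODELS.items()
}

fraction_cls = toy_registry["FractionArch"].load()
print(fraction_cls(1, 2))  # prints 1/2
```

Deferring the import keeps registry construction cheap even when many architectures are listed, since only the requested module is loaded.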
 module-attribute  ¶
 __all__ = [
    "ModelRegistry",
    "VllmModelForPooling",
    "is_pooling_model",
    "VllmModelForTextGeneration",
    "is_text_generation_model",
    "HasInnerState",
    "has_inner_state",
    "SupportsLoRA",
    "supports_lora",
    "SupportsMultiModal",
    "supports_multimodal",
    "SupportsMRoPE",
    "supports_mrope",
    "SupportsPP",
    "supports_pp",
    "SupportsTranscription",
    "supports_transcription",
    "SupportsV0Only",
    "supports_v0_only",
]
 
  Bases: Protocol
The interface required for all models that have inner state.
Source code in vllm/model_executor/models/interfaces.py
  
  Bases: Protocol
The interface required for all models that support LoRA.
Source code in vllm/model_executor/models/interfaces.py
  
  Bases: Protocol
The interface required for all models that support M-RoPE.
Source code in vllm/model_executor/models/interfaces.py
  class-attribute  ¶
 supports_mrope: Literal[True] = True
A flag that indicates this model supports M-RoPE.
Note
There is no need to redefine this flag if this class is in the MRO of your model class.
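The note about the MRO can be illustrated with a small, self-contained sketch. `ToySupportsFeature`, `FeatureModel`, and `toy_supports_feature` are hypothetical names that only mirror the flag pattern described here; they are not the library's own classes or checks.

```python
from typing import ClassVar, Literal, Protocol


class ToySupportsFeature(Protocol):
    """Hypothetical marker interface modelled on the flag pattern above."""

    supports_feature: ClassVar[Literal[True]] = True


class PlainModel:
    pass


class FeatureModel(PlainModel, ToySupportsFeature):
    # The flag is not redefined: it is inherited through the MRO.
    pass


def toy_supports_feature(model: object) -> bool:
    # The flag is a class attribute, so classes and instances both expose it.
    return getattr(model, "supports_feature", False) is True


print(toy_supports_feature(FeatureModel))    # True
print(toy_supports_feature(FeatureModel()))  # True
print(toy_supports_feature(PlainModel()))    # False
```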
 
 get_mrope_input_positions(
    input_tokens: list[int],
    hf_config: PretrainedConfig,
    image_grid_thw: Optional[
        Union[list[list[int]], Tensor]
    ],
    video_grid_thw: Optional[
        Union[list[list[int]], Tensor]
    ],
    second_per_grid_ts: Optional[list[float]] = None,
    context_len: int = 0,
    seq_len: Optional[int] = None,
    audio_feature_lengths: Optional[Tensor] = None,
    use_audio_in_video: bool = False,
) -> tuple[Tensor, int]
Get M-RoPE input positions and delta value for this specific model.
This method should be implemented by each model that supports M-RoPE to provide model-specific logic for computing input positions.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| input_tokens | list[int] | List of input token IDs | required | 
| hf_config | PretrainedConfig | HuggingFace model configuration | required | 
| image_grid_thw | Optional[Union[list[list[int]], Tensor]] | Image grid dimensions (t, h, w) | required | 
| video_grid_thw | Optional[Union[list[list[int]], Tensor]] | Video grid dimensions (t, h, w) | required | 
| second_per_grid_ts | Optional[list[float]] | Seconds per grid timestep for videos | None | 
| context_len | int | Context length | 0 | 
| seq_len | Optional[int] | Sequence length | None | 
| audio_feature_lengths | Optional[Tensor] | Audio feature lengths for multimodal models | None | 
| use_audio_in_video | bool | Whether to use audio in video for interleaving | False | 
Returns:
| Type | Description | 
|---|---|
| tuple[Tensor, int] | Tuple of (llm_positions, mrope_position_delta) | 
Source code in vllm/model_executor/models/interfaces.py
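To make the shape of the return value concrete, here is a toy sketch in plain Python lists rather than tensors. It assumes one possible layout: text tokens advance the temporal, height, and width axes together, image tokens take their grid coordinates offset by the image start, and the delta is the largest position plus one minus the sequence length. `toy_mrope_positions` is a hypothetical helper; real models apply their own rules (for example spatial merging), which this sketch ignores.

```python
def toy_mrope_positions(
    num_text_before: int,
    grid_thw: tuple[int, int, int],
    num_text_after: int,
) -> tuple[list[list[int]], int]:
    """Toy (t, h, w) position layout for text + one image + text."""
    t, h, w = grid_thw
    positions: list[list[int]] = [[], [], []]  # rows: temporal, height, width

    # Text tokens advance all three axes together.
    for i in range(num_text_before):
        for axis in positions:
            axis.append(i)

    # Image tokens keep their grid coordinates, offset by the image start.
    start = num_text_before
    for ti in range(t):
        for hi in range(h):
            for wi in range(w):
                positions[0].append(start + ti)
                positions[1].append(start + hi)
                positions[2].append(start + wi)

    # Trailing text resumes after the largest position used so far.
    next_pos = max(max(axis, default=-1) for axis in positions) + 1
    for i in range(num_text_after):
        for axis in positions:
            axis.append(next_pos + i)

    seq_len = len(positions[0])
    if seq_len == 0:
        return positions, 0
    max_pos = max(max(axis) for axis in positions)
    return positions, max_pos + 1 - seq_len


positions, delta = toy_mrope_positions(2, (1, 2, 2), 1)
print(positions[0])  # [0, 1, 2, 2, 2, 2, 4]
print(delta)         # -2
```

The negative delta reflects that the image's four tokens consume only two position steps in this layout, so later positions lag behind the raw token index.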
  
  Bases: Protocol
The interface required for all multi-modal models.
Source code in vllm/model_executor/models/interfaces.py
 class-attribute  ¶
 merge_by_field_config: bool = False
A flag that indicates which implementation of vllm.multimodal.utils.group_mm_kwargs_by_modality to use.
 class-attribute  ¶
 supports_encoder_tp_data: bool = False
A flag that indicates whether this model supports multimodal_config.mm_encoder_tp_mode="data".
 class-attribute  ¶
 supports_multimodal: Literal[True] = True
A flag that indicates this model supports multi-modal inputs.
Note
There is no need to redefine this flag if this class is in the MRO of your model class.
 class-attribute  ¶
 supports_multimodal_raw_input_only: bool = False
A flag that indicates this model supports multi-modal inputs and processes them in their raw form and not embeddings.
 
 _get_text_embeddings(
    input_ids: Tensor,
    get_input_embeddings: Callable[[Tensor], Tensor],
    *,
    is_multimodal: Optional[Tensor],
    handle_oov_mm_token: bool,
) -> Tensor
Source code in vllm/model_executor/models/interfaces.py
  
 get_input_embeddings(
    input_ids: Tensor,
    multimodal_embeddings: MultiModalEmbeddings,
    *,
    is_multimodal: Tensor,
    handle_oov_mm_token: bool = False,
) -> Tensor
get_input_embeddings(
    input_ids: Tensor,
    multimodal_embeddings: Optional[
        MultiModalEmbeddings
    ] = None,
    *,
    is_multimodal: Optional[Tensor] = None,
    handle_oov_mm_token: bool = False,
) -> Tensor
Apply token embeddings to input_ids.
If multimodal_embeddings is passed, scatter them into input_ids according to the mask is_multimodal.
If the multi-modal token IDs exceed the vocabulary size of the language model, you can set handle_oov_mm_token=True to avoid calling the language model's get_input_embeddings method on those tokens. Note, however, that doing so increases memory usage, because an additional buffer is needed to hold the input embeddings.
Source code in vllm/model_executor/models/interfaces.py
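The scatter step described above can be sketched with plain Python lists. This is an illustration under stated assumptions, not the library's implementation: `toy_merge_embeddings` is a hypothetical helper, embeddings are lists of floats rather than tensors, and placeholder positions skip the text embedding lookup entirely, which is the case the `handle_oov_mm_token` paragraph is concerned with.

```python
from typing import Callable


def toy_merge_embeddings(
    input_ids: list[int],
    embed_text: Callable[[int], list[float]],
    multimodal_embeddings: list[list[float]],
    is_multimodal: list[bool],
) -> list[list[float]]:
    """Embed text tokens and fill masked slots with multimodal embeddings."""
    if sum(is_multimodal) != len(multimodal_embeddings):
        raise ValueError("mask and multimodal embeddings disagree in length")

    mm_iter = iter(multimodal_embeddings)
    merged = []
    for token_id, is_mm in zip(input_ids, is_multimodal):
        # Placeholder tokens never reach embed_text, so placeholder IDs
        # outside the vocabulary are harmless in this sketch.
        merged.append(next(mm_iter) if is_mm else embed_text(token_id))
    return merged


vocab = {0: [0.0, 0.0], 1: [1.0, 1.0]}
merged = toy_merge_embeddings(
    input_ids=[0, 999, 999, 1],  # 999 is outside the toy vocabulary
    embed_text=vocab.__getitem__,
    multimodal_embeddings=[[5.0, 5.0], [6.0, 6.0]],
    is_multimodal=[False, True, True, False],
)
print(merged)  # [[0.0, 0.0], [5.0, 5.0], [6.0, 6.0], [1.0, 1.0]]
```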
  
 get_language_model() -> VllmModel
Returns the underlying language model used for text generation.
This is typically the torch.nn.Module instance responsible for processing the merged multimodal embeddings and producing hidden states.
Returns:
| Type | Description | 
|---|---|
| VllmModel | torch.nn.Module: The core language model component. | 
Source code in vllm/model_executor/models/interfaces.py
  
 get_multimodal_embeddings(
    **kwargs: object,
) -> MultiModalEmbeddings
Returns multimodal embeddings generated from multimodal kwargs to be merged with text embeddings.
Note
The returned multimodal embeddings must be in the same order as the appearances of their corresponding multimodal data item in the input prompt.
Source code in vllm/model_executor/models/interfaces.py
  classmethod  ¶
  Get the placeholder text for the ith modality item in the prompt.
 
  Bases: Protocol
The interface required for all models that support pipeline parallel.
Source code in vllm/model_executor/models/interfaces.py
  class-attribute  ¶
 supports_pp: Literal[True] = True
A flag that indicates this model supports pipeline parallel.
Note
There is no need to redefine this flag if this class is in the MRO of your model class.
 
 forward(
    *, intermediate_tensors: Optional[IntermediateTensors]
) -> Union[Tensor, IntermediateTensors]
Accept IntermediateTensors when PP rank > 0.
Return IntermediateTensors on every PP rank except the last; the last PP rank returns the final Tensor.
Source code in vllm/model_executor/models/interfaces.py
  
 make_empty_intermediate_tensors(
    batch_size: int, dtype: dtype, device: device
) -> IntermediateTensors
Called when PP rank > 0 for profiling purposes.
 
  Bases: Protocol
The interface required for all models that support transcription.
Source code in vllm/model_executor/models/interfaces.py
 class-attribute  ¶
 supports_transcription_only: bool = False
Transcription models can opt out of text generation by setting this to True.
 
  Source code in vllm/model_executor/models/interfaces.py
  classmethod  ¶
 get_generation_prompt(
    audio: ndarray,
    stt_config: SpeechToTextConfig,
    model_config: ModelConfig,
    language: Optional[str],
    task_type: Literal["transcribe", "translate"],
    request_prompt: str,
    to_language: Optional[str],
) -> PromptType
Get the prompt for the ASR model. The model has control over the construction, as long as it returns a valid PromptType.
Source code in vllm/model_executor/models/interfaces.py
  classmethod  ¶
 get_num_audio_tokens(
    audio_duration_s: float,
    stt_config: SpeechToTextConfig,
    model_config: ModelConfig,
) -> Optional[int]
Map from audio duration to number of audio tokens produced by the ASR model, without running a forward pass. This is used for estimating the amount of processing for this audio.
Source code in vllm/model_executor/models/interfaces.py
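A duration-to-token mapping of this kind is usually simple arithmetic. The sketch below is a hypothetical formula, not any particular model's: the sampling rate, hop length, and downsampling factor are assumed values chosen for illustration, and `toy_num_audio_tokens` is not a vLLM function.

```python
import math
from typing import Optional


def toy_num_audio_tokens(
    audio_duration_s: float,
    sample_rate: int = 16_000,   # assumed sampling rate in Hz
    hop_length: int = 160,       # assumed samples per feature frame
    frames_per_token: int = 2,   # assumed encoder downsampling factor
) -> Optional[int]:
    """Estimate the audio token count from duration alone (toy formula)."""
    if audio_duration_s < 0:
        return None  # no sensible estimate for a negative duration
    frames = math.ceil(audio_duration_s * sample_rate / hop_length)
    return math.ceil(frames / frames_per_token)


print(toy_num_audio_tokens(30.0))  # 1500
```

Because the estimate needs no forward pass, it can be computed up front for every request.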
  classmethod  ¶
    classmethod  ¶
 get_speech_to_text_config(
    model_config: ModelConfig,
    task_type: Literal["transcribe", "translate"],
) -> SpeechToTextConfig
Get the speech to text config for the ASR model.
 classmethod  ¶
  Ensure the language specified in the transcription request is a valid ISO 639-1 language code. If the request language is valid, but not natively supported by the model, trigger a warning (but not an exception).
Source code in vllm/model_executor/models/interfaces.py
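The two-tier check described above (reject invalid codes, warn on valid but unsupported ones) can be sketched as follows. `TOY_ISO639_1`, `TOY_SUPPORTED`, and `toy_validate_language` are hypothetical; the tables hold only a handful of codes for illustration, and the handling of a missing language is an assumption.

```python
import warnings
from typing import Optional

# Hypothetical tables: a few ISO 639-1 codes and a model's native subset.
TOY_ISO639_1 = {"en": "English", "de": "German", "fr": "French", "ja": "Japanese"}
TOY_SUPPORTED = {"en", "de"}


def toy_validate_language(language: Optional[str]) -> Optional[str]:
    """Reject unknown codes; warn on known but unsupported ones."""
    if language is None:
        return None  # assumption: nothing to validate when no language is given
    if language not in TOY_ISO639_1:
        raise ValueError(f"Invalid language code: {language!r}")
    if language not in TOY_SUPPORTED:
        # Valid code, but outside the model's native set: warn, do not fail.
        warnings.warn(
            f"{TOY_ISO639_1[language]} ({language!r}) is not natively "
            "supported; results may be less accurate."
        )
    return language


print(toy_validate_language("en"))  # en
```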
  
  Bases: Protocol
Models with this interface are not compatible with V1 vLLM.
Source code in vllm/model_executor/models/interfaces.py
   
  Bases: VllmModel[T_co], Protocol[T_co]
The interface required for all pooling models in vLLM.
Source code in vllm/model_executor/models/interfaces_base.py
  class-attribute  ¶
 default_pooling_type: str = 'LAST'
Indicates the vllm.model_executor.layers.pooler.PoolerConfig.pooling_type to use by default.
You can use the vllm.model_executor.models.interfaces_base.default_pooling_type decorator to conveniently set this field.
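As a rough illustration of how a class decorator can set this field, consider the sketch below. `toy_default_pooling_type` is a hypothetical re-implementation of the idea, not the library's decorator, and the value "CLS" is used only as an example.

```python
from typing import Callable, TypeVar

T = TypeVar("T", bound=type)


def toy_default_pooling_type(pooling_type: str) -> Callable[[T], T]:
    """Hypothetical class decorator that sets default_pooling_type."""

    def decorate(cls: T) -> T:
        # Set the attribute on the decorated class only; the base is untouched.
        cls.default_pooling_type = pooling_type
        return cls

    return decorate


class ToyPoolingModel:
    default_pooling_type: str = "LAST"


@toy_default_pooling_type("CLS")
class ToyClsModel(ToyPoolingModel):
    pass


print(ToyPoolingModel.default_pooling_type)  # LAST
print(ToyClsModel.default_pooling_type)      # CLS
```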
 
   
 has_inner_state(model: object) -> TypeIs[HasInnerState]
has_inner_state(
    model: type[object],
) -> TypeIs[type[HasInnerState]]
has_inner_state(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[HasInnerState]], TypeIs[HasInnerState]
]
 
 is_pooling_model(
    model: type[object],
) -> TypeIs[type[VllmModelForPooling]]
is_pooling_model(
    model: object,
) -> TypeIs[VllmModelForPooling]
is_pooling_model(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[VllmModelForPooling]],
    TypeIs[VllmModelForPooling],
]
Source code in vllm/model_executor/models/interfaces_base.py
   
 is_text_generation_model(
    model: type[object],
) -> TypeIs[type[VllmModelForTextGeneration]]
is_text_generation_model(
    model: object,
) -> TypeIs[VllmModelForTextGeneration]
is_text_generation_model(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[VllmModelForTextGeneration]],
    TypeIs[VllmModelForTextGeneration],
]
Source code in vllm/model_executor/models/interfaces_base.py
  
 supports_lora(
    model: type[object],
) -> TypeIs[type[SupportsLoRA]]
supports_lora(model: object) -> TypeIs[SupportsLoRA]
supports_lora(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsLoRA]], TypeIs[SupportsLoRA]
]
Source code in vllm/model_executor/models/interfaces.py
  
 supports_mrope(
    model: type[object],
) -> TypeIs[type[SupportsMRoPE]]
supports_mrope(model: object) -> TypeIs[SupportsMRoPE]
supports_mrope(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsMRoPE]], TypeIs[SupportsMRoPE]
]
 
 supports_multimodal(
    model: type[object],
) -> TypeIs[type[SupportsMultiModal]]
supports_multimodal(
    model: object,
) -> TypeIs[SupportsMultiModal]
supports_multimodal(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsMultiModal]],
    TypeIs[SupportsMultiModal],
]
 
 supports_pp(
    model: type[object],
) -> TypeIs[type[SupportsPP]]
supports_pp(model: object) -> TypeIs[SupportsPP]
supports_pp(
    model: Union[type[object], object],
) -> Union[
    bool, TypeIs[type[SupportsPP]], TypeIs[SupportsPP]
]
Source code in vllm/model_executor/models/interfaces.py
  
 supports_transcription(
    model: type[object],
) -> TypeIs[type[SupportsTranscription]]
supports_transcription(
    model: object,
) -> TypeIs[SupportsTranscription]
supports_transcription(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsTranscription]],
    TypeIs[SupportsTranscription],
]
 
 supports_v0_only(
    model: type[object],
) -> TypeIs[type[SupportsV0Only]]
supports_v0_only(model: object) -> TypeIs[SupportsV0Only]
supports_v0_only(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsV0Only]], TypeIs[SupportsV0Only]
]