Qwen3-VL-30B-A3B-Thinking is one of the most advanced multimodal reasoning models in the Qwen3 series, designed to fuse text, image, and video understanding with large-scale reasoning. Built on a Mixture-of-Experts (MoE) architecture with 30B total parameters and roughly 3B activated per token, this Thinking variant is tuned for deep multimodal reasoning across STEM, math, and complex real-world scenarios.
Key Strengths Include
- Visual Agent Capabilities – Can perceive GUI elements, invoke tools, and complete tasks on PC/mobile interfaces.
- Visual Coding Boost – Converts diagrams, screenshots, and videos into structured code artifacts (e.g., HTML, CSS, JavaScript, Draw.io).
- Advanced Spatial & Video Perception – Supports 3D grounding, object occlusion reasoning, timestamp alignment, and long-horizon video comprehension.
- Massive Context Handling – Native 256K tokens, expandable up to 1M, enabling book-level comprehension or hours-long video indexing.
- Robust OCR & Recognition – Trained on broad visual corpora, supports 32 languages, rare/ancient scripts, and noisy/tilted text scenarios.
- Unified Text-Vision Understanding – Matches pure LLMs in text reasoning while tightly aligning vision inputs for lossless multimodal comprehension.
Overall, Qwen3-VL-30B-A3B-Thinking is positioned as a research-grade, enterprise-ready model that excels at multimodal STEM reasoning, video understanding, GUI interaction, and code-generation from vision inputs.
Qwen3-VL-30B-A3B-Thinking Benchmark Results
Category | Benchmark | Qwen3-VL-30B-A3B-Thinking | GPT5-Mini (High) | Claude4-Sonnet (Thinking) | Other Best Open-source |
---|---|---|---|---|---|
STEM & Puzzle | MMMUVal | 76.0 | 79.0 | – | 75.6 (InternVL3.5-30A3) |
| MMMUPro_full | 63.0 | 67.3 | 61.6 | 57.1 (GLM-4.1V-9B) |
| MathVista_mini | 81.9 | 79.1 | – | 81.8 (MiMoVL-7B) |
| MathVision | 65.7 | 71.9 | 62.1 | 60.4 (MiMoVL-7B) |
| MathVerse_mini | 79.6 | 78.8 | 71.5 | 71.5 (MiMoVL-7B) |
General VQA | MMBenchDev_EN_V1.1 | 88.9 | 86.8 | 82.2 | 85.8 (GLM-4.1V-9B) |
| RealWorldQA | 77.4 | 79.0 | – | 72.3 (InternVL3.5-30A3) |
| MMStar | 75.5 | 74.1 | 69.4 | 72.9 (GLM-4.1V-9B) |
| SimpleVQA | 54.3 | 56.8 | 53.2 | – |
Subjective Experience & Instruction Following | HallusionBench | 66.0 | 63.2 | 59.2 | 53.8 (InternVL3.5-30A3) |
| MM-MT-Bench | 7.9 | 7.7 | 7.9 | – |
| AIBench | 91.6 | 92.0 | 92.0 | 95.7 (MiMoVL-7B) |
| DocVQA_test | 95.0 | 90.0 | 92.0 | 88.0 (MiMoVL-7B) |
| InfoVQA_test | 86.0 | 78.0 | 88.2 | 87.9 (GLM-4.1V-9B) |
| AI2D_test | 86.9 | 86.0 | 87.8 | 88.0 (InternVL3.5-30A3) |
Text Recognition / Chart & Document Understanding | OCRBench | 839.0 | 821.0 | 739.0 | 880.0 (InternVL3.5-30A3) |
| OCRBenchV2_en/zh | 62.6 / 60.4 | 52.6 / 45.1 | 44.9 / 39.4 | – |
| CCOCR-Bench_overall | 77.8 | 70.8 | 66.9 | 87.0 (MiMoVL-7B) |
| CharXiv(DA) | 86.9 | 89.4 | 89.5 | 87.0 (MiMoVL-7B) |
| CharXiv(RA) | 56.6 | 68.6 | 63.3 | 56.5 (MiMoVL-7B) |
| CountBench | 90.0 | 91.0 | 91.0 | 90.4 (MiMoVL-7B) |
2D / 3D Grounding | ODinW13 | 42.3 | – | – | 41.5 (InternVL3.5-30A3) |
| ARKitScenes | 55.6 | – | – | 63.7 (InternVL3.5-30A3) |
| Hypersim | 11.4 | – | – | 78.6 (RoboBrain 2.0) |
| SUNRGBD | 34.6 | – | – | 72.4 (RoboBrain 2.0) |
Multi-Image | BLINK | 65.4 | – | 60.4 | 65.1 (GLM-4.1V-9B) |
| MUIRBench | 77.6 | – | – | 74.7 (GLM-4.1V-9B) |
Embodied & Spatial Understanding | ERQA | 45.3 | 54.0 | 46.0 | 41.5 (InternVL3.5-30A3) |
| VSI-Bench | 56.1 | 31.5 | 33.3 | 63.7 (InternVL3.5-30A3) |
| EmbSpatialBench | 80.6 | – | 80.7 | 78.6 (RoboBrain 2.0) |
| RefSpatialBench | 54.2 | – | 9.0 | 54.0 (RoboBrain 2.0) |
| RoboSpatialHome | 65.5 | 54.3 | 69.7 | 72.4 (RoboBrain 2.0) |
Video | MVBench | 72.0 | – | – | 72.1 (InternVL3.5-30A3) |
| VideoMME | 73.3 | 78.9 | 72.3 | 68.7 (InternVL3.5-30A3) |
| MLVU-MCQ | 78.9 | 83.3 | 68.8 | 73.0 (InternVL3.5-30A3) |
| LVBench | 59.2 | – | – | 45.1 (GLM-4.1V-9B) |
| CharadesSTA | 62.7 | – | – | 50.0 (MiMoVL-7B) |
| VideoMMMU | 75.0 | 82.5 | 72.7 | 68.7 (MiMoVL-7B) |
Agent | ScreenSpot | 94.7 | – | – | 87.3 (MiMoVL-7B) |
| ScreenSpot Pro | 57.3 | – | – | 52.8 (Kimi-1.4A3B) |
| OSWorldG | 59.6 | – | – | 56.1 (MiMoVL-7B) |
| AndroidWorld | 55.0 | – | – | 41.7 (GLM-4.1V-9B) |
| OSWorld | 30.6 | – | – | 14.9 (GLM-4.1V-9B) |
Fine-grained Perception | V* | 81.2 | 78.6 | 45.0 | 81.7 (MiMoVL-7B) |
| HRBench4K | 77.8 | 78.6 | 58.5 | – |
| HRBench8K | 71.3 | 74.4 | 49.8 | – |
Pure Text Performance
Category | Benchmark | Qwen3-VL-30B-A3B Instruct | Qwen3-30B-A3B Instruct-2507 | Qwen3-VL-30B-A3B Thinking | Qwen3-30B-A3B Thinking-2507 |
---|---|---|---|---|---|
Knowledge | MMLU | 85.0 | 85.0 | 87.6 | 87.3 |
| MMLU-Pro | 77.8 | 78.4 | 80.5 | 80.9 |
| MMLU-Redux | 88.4 | 89.3 | 90.9 | 91.4 |
| GPQA | 70.4 | 70.4 | 74.4 | 73.4 |
| SuperGPQA | 53.1 | 53.4 | 56.4 | 56.8 |
| SimpleQA | 27.0 | 22.2 | 23.9 | 19.2 |
Reasoning | AIME25 | 69.3 | 61.3 | 83.1 | 85.0 |
| HMMT25 | 50.6 | 43.0 | 67.6 | 71.4 |
| LiveBench1125 | 65.4 | 69.0 | 72.1 | 76.8 |
Code | LCBv6 (25.02–25.05) | 42.6 | 43.2 | 64.2 | 66.0 |
Instruction Following | SIFO | 50.1 | 46.8 | 66.9 | 66.9 |
| SIFO-multiturn | 35.1 | 36.4 | 60.3 | 59.3 |
| IFEval | 85.8 | 84.7 | 81.7 | 88.9 |
Subjective Evaluation | Arena-Hard v2 | 58.5 | 69.0 | 56.7 | 56.0 |
| Creative Writing v3 | 84.6 | 86.0 | 82.5 | 84.4 |
| WritingBench | 82.6 | 85.5 | 85.2 | 85.0 |
Agent | BFCL-v3 | 66.3 | 65.1 | 68.6 | 72.4 |
Multilingual | MultiIF | 66.1 | 67.9 | 73.0 | 76.4 |
| MMLU-ProX | 70.9 | 72.0 | 76.1 | 76.4 |
| INCLUDE | 71.6 | 71.9 | 74.5 | 74.4 |
| PolyMATH | 44.3 | 43.1 | 51.7 | 52.6 |
GPU Configuration (Inference & Training, Rule-of-Thumb)
Scenario | Precision / Mode | Min VRAM (works) | Comfortable VRAM | Example GPU(s) | Notes |
---|---|---|---|---|---|
Single-GPU, Quantized (INT4/INT8) | INT4 / INT8 | 40–48 GB | 80 GB | 1× A100 80GB / H100 80GB | Suitable for cost-efficient inference; use bitsandbytes or GGUF quantization. |
Single-GPU, Half Precision (BF16/FP16) | BF16 / FP16 | 80 GB | 96–120 GB | 1× H100 80GB (SXM/PCIe) | Full-fidelity reasoning, best for smaller batch sizes and single-image/video tasks. |
Multi-GPU (Tensor Parallelism) | BF16 / FP16 | 4× 40 GB = 160 GB | 4× 80 GB = 320 GB | 4× A100 40GB / L40S | Splits weights across GPUs; needed for high-batch inference and long-context workloads. |
MoE Training Setup | FP16 / BF16 | 512–640 GB | 768 GB+ | 8× H100 80GB SXM | Required for fine-tuning or multi-video reasoning; benefits from FlashAttention-2. |
Long Context + Video (1M tokens) | FP16 w/ FlashAttention-2 | 160 GB | 320 GB+ | 4× H100 80GB | Large memory headroom needed for KV cache during ultra-long context or multi-hour video processing (rough estimator below the table).
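The long-context row above is dominated by KV-cache memory, which grows linearly with context length. Below is a rough back-of-the-envelope estimator you can adapt; note that the layer count, KV-head count, and head dimension used as defaults are placeholders for illustration, not Qwen3-VL-30B-A3B's actual configuration, so read the real values from the checkpoint's config.json before relying on the numbers.

def kv_cache_gib(context_tokens: int,
                 num_layers: int = 48,       # placeholder, not the real config
                 num_kv_heads: int = 4,      # placeholder (GQA models keep this small)
                 head_dim: int = 128,        # placeholder
                 bytes_per_value: int = 2):  # 2 bytes per entry for FP16/BF16 caches
    """Approximate KV-cache size in GiB for one sequence:
    2 (K and V) x layers x kv_heads x head_dim x tokens x bytes."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / (1024 ** 3)

# Example: one 256K-token sequence with the placeholder dimensions above
print(f"~{kv_cache_gib(256_000):.1f} GiB of KV cache per sequence")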
Tips:
- Enable FlashAttention-2 for both inference and training; it reduces VRAM spikes and improves throughput (a hedged loading sketch follows this list).
- For edge deployment, quantized INT4 versions (via GGUF + llama.cpp or vLLM) make the model usable on single 48GB GPUs.
- For video + multimodal workloads, always keep extra VRAM buffer (~20–30%) for caching and activations.
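To make the first two tips concrete, here is a minimal loading sketch. The class name and model ID match the inference script later in this guide; the assumptions are that flash-attn built successfully (Step 12) and that bitsandbytes 4-bit quantization behaves well with this MoE checkpoint, which is worth verifying on your own hardware. Pick one of the two options, not both.

import torch
from transformers import Qwen3VLMoeForConditionalGeneration, BitsAndBytesConfig

MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Thinking"

# Option A: half precision + FlashAttention-2 (assumes flash-attn is installed; ~80 GB VRAM class)
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Option B: 4-bit NF4 quantization via bitsandbytes for smaller GPUs
# (assumption: 4-bit loading works cleanly for this MoE checkpoint)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_4bit = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=bnb,
    device_map="auto",
)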
Resources
Link: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking
Step-by-Step Process to Install & Run Qwen3-VL-30B-A3B-Thinking Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button on the Dashboard to deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image (Use the Jupyter Template)
We’ll use the Jupyter image from NodeShift’s gallery so you don’t have to install Jupyter Notebook/Lab manually. This image is GPU-ready and comes with a preconfigured Python + Jupyter environment—perfect for testing and serving Qwen3-VL-30B-A3B-Thinking.
What you’ll do
- pick the Jupyter template,
- (optionally) pick a CUDA/PyTorch variant if the UI offers it,
- open JupyterLab in your browser,
- install the few project-specific Python packages inside that environment.
How to select it
- In the Create VM flow, go to Choose an Image → Templates.
- Click Jupyter (see screenshot). You’ll see a short description like “A web-based interactive computing platform for data science.”
- If a version/stack dropdown appears, choose the latest CUDA 12.x / PyTorch variant (or “GPU-enabled” build).
- Click Create (or Next) to proceed to sizing and networking.
Why this image
- JupyterLab is already installed and enabled as a service, so the VM boots straight into a working notebook server.
- GPU drivers + CUDA runtime are aligned with the template, so PyTorch will detect your GPU out of the box.
- You can manage everything (terminals, notebooks, file browser) from the Jupyter UI—no extra desktop or VNC needed.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Access Your Deployment
Once your GPU VM is in the RUNNING state, you’ll see a control menu (three dots on the right side of the deployment card). This menu gives you multiple ways to access and manage your deployment.
Available Options
- Edit Name
Rename your deployment for easier identification (e.g., “Qwen3-VL-30B-A3B-Thinking”).
- Open Jupyter Notebook
- Click this to launch the pre-installed Jupyter environment directly in your browser.
- You’ll be taken to JupyterLab, where you can open notebooks, create terminals, and run code cells to set up Qwen3-VL-30B-A3B-Thinking.
- This is the most user-friendly way to start working immediately without additional setup.
- Connect with SSH
- Choose this if you prefer command-line access.
- You’ll get the SSH connection string (e.g., ssh -i <your-key> user@<vm-ip>).
- Use this method for advanced management, server setups (like vLLM/SGLang), or installing additional system packages.
- Show Logs
- View system/service logs for debugging (useful if something isn’t starting correctly).
- Helps verify GPU initialization or catch errors during startup.
- Update Tags
- Add labels or tags to organize multiple deployments.
- Example: tag by project, model type, or experiment.
- Destroy Unit
- This permanently shuts down and deletes your VM.
- Use only when you are done, as this action cannot be undone.
Recommended Path for Qwen3-VL-30B-A3B-Thinking
- For beginners / testing: Use Open Jupyter Notebook → open a Terminal inside JupyterLab → install the required Python packages → run quick inference tests.
- For production / serving APIs: Use Connect with SSH → start vLLM or SGLang on the VM → expose ports (8000/30000) → connect via API clients (a hedged example follows below).
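For the production path, the rest of this guide focuses on Transformers inside Jupyter, but a common serving pattern is to expose an OpenAI-compatible endpoint with vLLM and call it from any client. The serve command, port, and message schema below are assumptions based on vLLM's standard OpenAI-compatible server; check the vLLM documentation for the flags supported by your version.

# On the VM (over SSH), start the server with something like:
#   vllm serve Qwen/Qwen3-VL-30B-A3B-Thinking --port 8000
# Then call it from Python (pip install openai):
from openai import OpenAI

client = OpenAI(base_url="http://<vm-ip>:8000/v1", api_key="EMPTY")  # vLLM ignores the key
resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-30B-A3B-Thinking",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
    max_tokens=256,
)
print(resp.choices[0].message.content)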
Step 8: Open Jupyter Notebook
Once your VM is running, you can directly access the Jupyter Notebook environment provided by NodeShift. This will be your main workspace for running Qwen3-VL-30B-A3B-Thinking.
1. Click Open Jupyter Notebook
- From the My GPU Deployments panel, click the three-dot menu on your deployment card.
- Select Open Jupyter Notebook.
This will open a new browser tab pointing to your VM’s Jupyter instance.
2. Handle the Browser Security Warning
Since the Jupyter server is running with a self-signed SSL certificate, your browser may show a “Your connection is not private” warning.
- Click Advanced.
- Then, click Proceed to <your-vm-ip> (unsafe).
Don’t worry — this is expected. You’re connecting directly to your VM’s Jupyter server, not a public website.
3. JupyterLab Interface Opens
Once you proceed, you’ll land inside JupyterLab. Here you’ll see:
- Notebook options (Python 3, Python 3.10, etc.)
- Console options (interactive shells)
- Other tools like a Terminal, Text File, and Markdown File.
You can now use the Terminal inside JupyterLab to install dependencies and start working with Qwen3-VL-30B-A3B-Thinking.
Step 9: Open Python 3.10 Notebook and Rename
Now that JupyterLab is running, let’s create a notebook where we will set up and run Qwen3-VL-30B-A3B-Thinking.
1. Open a Python 3.10 Notebook
- In the Launcher screen, under Notebook, click on Python3.10 (python_310).
- This will open a new notebook editor with an empty code cell where you can type commands.
2. Rename the Notebook
- By default, the notebook will open as something like Untitled.ipynb.
- To rename:
- Right-click on the notebook tab name at the top.
- Select Rename Notebook….
- Enter a meaningful name such as:
qwen3vl.ipynb
Press Enter to confirm.
3. Verify the Editor
- You should now see an empty notebook named qwen3vl.ipynb with a code cell ready.
- This is where you’ll run all the setup commands (installing dependencies, loading the model, and running inference).
Step 10: Verify GPU Availability
Before installing and running Qwen3-VL-30B-A3B-Thinking, it’s important to confirm that your VM has successfully attached the GPU and that CUDA is working.
1. Run nvidia-smi
In your Jupyter Notebook cell, type:
!nvidia-smi
2. Check the Output
You should see information about your GPU, similar to the screenshot:
- GPU Name → NVIDIA H200
- Driver Version → 565.xx or similar
- CUDA Version → 12.x (here it shows 12.7)
- Memory Usage → confirms available VRAM
- Temperature / Power → current GPU status
3. Why This Step Matters
- Confirms that the GPU drivers are properly installed.
- Ensures the CUDA runtime matches your environment.
- Prevents wasted time later if the model fails to load due to GPU issues.
With GPU verified, you’re ready to proceed to the next step: installing the required Python libraries (Transformers, vLLM, SGLang, etc.) inside the notebook.
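If you would rather capture the same check programmatically (for example, to fail fast at the top of a setup script), a small helper like the one below works; the query flags are standard nvidia-smi options.

import subprocess

# Ask nvidia-smi for just the fields we care about, in machine-readable CSV form
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total,driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # e.g. "NVIDIA H200, <total MiB>, <driver version>"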
Step 11: Install PyTorch with CUDA 12.4 Support
Use the following command to install the latest stable PyTorch, TorchVision, and TorchAudio built specifically for CUDA 12.4:
!pip install --upgrade --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio
This ensures that your environment has GPU acceleration enabled and is fully compatible with CUDA 12.4 for running large-scale models like Qwen3-VL-30B-A3B-Thinking.
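After the install finishes, a quick check in a new notebook cell confirms that you received the CUDA builds and that PyTorch can see the GPU:

import torch

print(torch.__version__)           # CUDA 12.4 wheels end in +cu124
print(torch.version.cuda)          # should report 12.4
print(torch.cuda.is_available())   # should be True on the GPU VM
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))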
Step 12: Install FlashAttention-2
Install the required build tools and flash-attn compiled against your current PyTorch/CUDA stack:
!pip install setuptools wheel
!pip install --no-build-isolation flash-attn
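The flash-attn build can take several minutes. Once it completes, a one-line import check confirms the wheel is usable; if it imports cleanly, you can later load the model with attn_implementation="flash_attention_2" instead of the SDPA fallback used in the script below.

import flash_attn

# If this prints a version without errors, the FlashAttention-2 kernels are available
print(flash_attn.__version__)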
Step 13: Install Transformers (Latest) + Vision/Video Deps
Install the latest Transformers from source plus all runtime libraries for image/video IO and faster HF downloads.
!pip install "git+https://github.com/huggingface/transformers"
!pip install accelerate safetensors sentencepiece
!pip install pillow opencv-python timm
!pip install decord av imageio[ffmpeg] # for video
!pip install huggingface_hub[hf_transfer]
Why these:
- transformers (latest main) → freshest Qwen3-VL support
- accelerate, safetensors, sentencepiece → inference + tokenizer basics
- pillow, opencv-python, timm → image handling & vision utilities
- decord, av, imageio[ffmpeg] → video reading & frame sampling
- huggingface_hub[hf_transfer] → faster model downloads (enable via HF_HUB_ENABLE_HF_TRANSFER=1)
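Optionally, you can pre-fetch the model weights (tens of GB) before running the inference script so the first generation is not blocked on the download. This is a small sketch using the standard huggingface_hub API; files land in the default HF cache unless you override it.

import os

# Must be set before importing huggingface_hub for hf_transfer to be picked up
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

local_path = snapshot_download("Qwen/Qwen3-VL-30B-A3B-Thinking")
print("Model cached at:", local_path)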
Step 14: Run the Image Inference Script
Execute your script to generate a caption from the demo image.
import os
import sys
from io import BytesIO

import torch
import requests
from PIL import Image
from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor

MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Thinking"
IMG_URL = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"


def fetch_image(url: str) -> Image.Image:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    return Image.open(BytesIO(r.content)).convert("RGB")


def main():
    # --- Sanity prints ---
    print("Python :", sys.version)
    print("Torch  :", torch.__version__)
    print("CUDA?  :", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU    :", torch.cuda.get_device_name(0))

    # --- Load model (NO flash-attn; we force SDPA) ---
    # Prefer bf16 on GPU; fallback to fp16 as needed.
    dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
    model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
        MODEL_ID,
        torch_dtype=dtype,
        device_map="auto",            # shards across multiple GPUs if present
        attn_implementation="sdpa",   # <- avoids FlashAttention entirely
    )
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    # --- Prepare message with image ---
    image = fetch_image(IMG_URL)
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in 3 concise bullet points."},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    )

    # Move tensors to the model's device
    device = next(model.parameters()).device
    for k, v in list(inputs.items()):
        if hasattr(v, "to"):
            inputs[k] = v.to(device)

    # --- Generate ---
    gen_kwargs = dict(
        max_new_tokens=256,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    with torch.inference_mode():
        out = model.generate(**inputs, **gen_kwargs)

    # Trim prompt tokens so only the newly generated text is decoded
    trimmed = [o[len(i):] for i, o in zip(inputs["input_ids"], out)]
    text = processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    print("\n=== MODEL RESPONSE ===\n")
    print(text)


if __name__ == "__main__":
    os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")  # faster downloads
    main()
What This Script Does
- Loads the Qwen3-VL-30B-A3B-Thinking vision-language model (SDPA; no FlashAttention).
- Downloads a demo image and builds a chat-style message (image + prompt).
- Tokenizes with AutoProcessor using Qwen’s chat template.
- Runs GPU inference (device_map="auto", bf16/fp16) to generate up to 256 tokens.
- Prints the model’s concise description of the image to the console.
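Because this is the Thinking variant, the decoded output usually contains the model's reasoning trace before the final answer. Assuming it delimits that trace with a closing </think> tag, as the text-only Qwen3 thinking models do, a small helper like this separates the two (a sketch to adapt, not an official parser):

def split_thinking(decoded: str):
    """Split decoded output into (reasoning, answer), assuming a </think> delimiter."""
    marker = "</think>"
    if marker in decoded:
        reasoning, answer = decoded.split(marker, 1)
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", decoded.strip()  # no marker found: treat everything as the answer

# Example: `text` is the string produced by batch_decode in the script above
reasoning, answer = split_thinking(text)
print(answer)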
Conclusion
Qwen3-VL-30B-A3B-Thinking stands out as one of the most capable open multimodal reasoning models available today.
With its fusion of text, vision, and video understanding, it pushes the boundaries of large-scale reasoning across STEM, code generation, and real-world perception.
Running it on a NodeShift GPU VM offers a strong balance of performance and accessibility, letting you explore advanced image, document, and video comprehension directly from a Jupyter environment.
Whether you’re a researcher, developer, or enterprise user, this guide enables you to deploy Qwen3-VL locally, experience its multimodal depth, and build the next generation of intelligent applications powered by unified reasoning.