Qwen3-VL-30B-A3B-Thinking is one of the most advanced multimodal reasoning models in the Qwen3 series, designed to fuse text, image, and video understanding with large-scale reasoning. Built on a Mixture-of-Experts (MoE) architecture with 30B total parameters and roughly 3B activated per token, this Thinking variant is tuned for deep multimodal reasoning across STEM, math, and complex real-world scenarios.
Key Strengths Include
- Visual Agent Capabilities – Can perceive GUI elements, invoke tools, and complete tasks on PC/mobile interfaces.
- Visual Coding Boost – Converts diagrams, screenshots, and videos into structured code artifacts (e.g., HTML, CSS, JavaScript, Draw.io).
- Advanced Spatial & Video Perception – Supports 3D grounding, object occlusion reasoning, timestamp alignment, and long-horizon video comprehension.
- Massive Context Handling – Native 256K tokens, expandable up to 1M, enabling book-level comprehension or hours-long video indexing.
- Robust OCR & Recognition – Trained on broad visual corpora, supports 32 languages, rare/ancient scripts, and noisy/tilted text scenarios.
- Unified Text-Vision Understanding – Matches pure LLMs in text reasoning while tightly aligning vision inputs for lossless multimodal comprehension.
Overall, Qwen3-VL-30B-A3B-Thinking is positioned as a research-grade, enterprise-ready model that excels at multimodal STEM reasoning, video understanding, GUI interaction, and code-generation from vision inputs.
Qwen3-VL-30B-A3B-Thinking Benchmark Results
Category | Benchmark | Qwen3-VL-30B-A3B-Thinking | GPT5-Mini (High) | Claude4-Sonnet (Thinking) | Other Best Open-source |
---|---|---|---|---|---|
STEM & Puzzle | MMMUVal | 76.0 | 79.0 | – | 75.6 (InternVL3.5-30A3) |
| MMMUPro_full | 63.0 | 67.3 | 61.6 | 57.1 (GLM-4.1V-9B) |
| MathVista_mini | 81.9 | 79.1 | – | 81.8 (MiMoVL-7B) |
| MathVision | 65.7 | 71.9 | 62.1 | 60.4 (MiMoVL-7B) |
| MathVerse_mini | 79.6 | 78.8 | 71.5 | 71.5 (MiMoVL-7B) |
General VQA | MMBenchDev_EN_V1.1 | 88.9 | 86.8 | 82.2 | 85.8 (GLM-4.1V-9B) |
| RealWorldQA | 77.4 | 79.0 | – | 72.3 (InternVL3.5-30A3) |
| MMStar | 75.5 | 74.1 | 69.4 | 72.9 (GLM-4.1V-9B) |
| SimpleVQA | 54.3 | 56.8 | 53.2 | – |
Subjective Experience & Instruction Following | HallusionBench | 66.0 | 63.2 | 59.2 | 53.8 (InternVL3.5-30A3) |
| MM-MT-Bench | 7.9 | 7.7 | 7.9 | – |
| AIBench | 91.6 | 92.0 | 92.0 | 95.7 (MiMoVL-7B) |
| DocVQA_test | 95.0 | 90.0 | 92.0 | 88.0 (MiMoVL-7B) |
| InfoVQA_test | 86.0 | 78.0 | 88.2 | 87.9 (GLM-4.1V-9B) |
| AI2D_test | 86.9 | 86.0 | 87.8 | 88.0 (InternVL3.5-30A3) |
Text Recognition / Chart & Document Understanding | OCRBench | 839.0 | 821.0 | 739.0 | 880.0 (InternVL3.5-30A3) |
| OCRBenchV2_en/zh | 62.6 / 60.4 | 52.6 / 45.1 | 44.9 / 39.4 | – |
| CCOCR-Bench_overall | 77.8 | 70.8 | 66.9 | 87.0 (MiMoVL-7B) |
| CharXiv(DA) | 86.9 | 89.4 | 89.5 | 87.0 (MiMoVL-7B) |
| CharXiv(RA) | 56.6 | 68.6 | 63.3 | 56.5 (MiMoVL-7B) |
| CountBench | 90.0 | 91.0 | 91.0 | 90.4 (MiMoVL-7B) |
2D / 3D Grounding | ODinW13 | 42.3 | – | – | 41.5 (InternVL3.5-30A3) |
| ARKitScenes | 55.6 | – | – | 63.7 (InternVL3.5-30A3) |
| Hypersim | 11.4 | – | – | 78.6 (RoboBrain 2.0) |
| SUNRGBD | 34.6 | – | – | 72.4 (RoboBrain 2.0) |
Multi-Image | BLINK | 65.4 | – | 60.4 | 65.1 (GLM-4.1V-9B) |
| MUIRBench | 77.6 | – | – | 74.7 (GLM-4.1V-9B) |
Embodied & Spatial Understanding | ERQA | 45.3 | 54.0 | 46.0 | 41.5 (InternVL3.5-30A3) |
| VSI-Bench | 56.1 | 31.5 | 33.3 | 63.7 (InternVL3.5-30A3) |
| EmbSpatialBench | 80.6 | – | 80.7 | 78.6 (RoboBrain 2.0) |
| RefSpatialBench | 54.2 | – | 9.0 | 54.0 (RoboBrain 2.0) |
| RoboSpatialHome | 65.5 | 54.3 | 69.7 | 72.4 (RoboBrain 2.0) |
Video | MVBench | 72.0 | – | – | 72.1 (InternVL3.5-30A3) |
| VideoMME | 73.3 | 78.9 | 72.3 | 68.7 (InternVL3.5-30A3) |
| MLVU-MCQ | 78.9 | 83.3 | 68.8 | 73.0 (InternVL3.5-30A3) |
| LVBench | 59.2 | – | – | 45.1 (GLM-4.1V-9B) |
| CharadesSTA | 62.7 | – | – | 50.0 (MiMoVL-7B) |
| VideoMMMU | 75.0 | 82.5 | 72.7 | 68.7 (MiMoVL-7B) |
Agent | ScreenSpot | 94.7 | – | – | 87.3 (MiMoVL-7B) |
| ScreenSpot Pro | 57.3 | – | – | 52.8 (Kimi-1.4A3B) |
| OSWorldG | 59.6 | – | – | 56.1 (MiMoVL-7B) |
| AndroidWorld | 55.0 | – | – | 41.7 (GLM-4.1V-9B) |
| OSWorld | 30.6 | – | – | 14.9 (GLM-4.1V-9B) |
Fine-grained Perception | V* | 81.2 | 78.6 | 45.0 | 81.7 (MiMoVL-7B) |
| HRBench4K | 77.8 | 78.6 | 58.5 | – |
| HRBench8K | 71.3 | 74.4 | 49.8 | – |
Pure Text Performance
Category | Benchmark | Qwen3-VL-30B-A3B Instruct | Qwen3-30B-A3B Instruct-2507 | Qwen3-VL-30B-A3B Thinking | Qwen3-30B-A3B Thinking-2507 |
---|---|---|---|---|---|
Knowledge | MMLU | 85.0 | 85.0 | 87.6 | 87.3 |
| MMLU-Pro | 77.8 | 78.4 | 80.5 | 80.9 |
| MMLU-Redux | 88.4 | 89.3 | 90.9 | 91.4 |
| GPQA | 70.4 | 70.4 | 74.4 | 73.4 |
| SuperGPQA | 53.1 | 53.4 | 56.4 | 56.8 |
| SimpleQA | 27.0 | 22.2 | 23.9 | 19.2 |
Reasoning | AIME25 | 69.3 | 61.3 | 83.1 | 85.0 |
| HMMT25 | 50.6 | 43.0 | 67.6 | 71.4 |
| LiveBench1125 | 65.4 | 69.0 | 72.1 | 76.8 |
Code | LCBv6 (25.02–25.05) | 42.6 | 43.2 | 64.2 | 66.0 |
Instruction Following | SIFO | 50.1 | 46.8 | 66.9 | 66.9 |
| SIFO-multiturn | 35.1 | 36.4 | 60.3 | 59.3 |
| IFEval | 85.8 | 84.7 | 81.7 | 88.9 |
Subjective Evaluation | Arena-Hard v2 | 58.5 | 69.0 | 56.7 | 56.0 |
| Creative Writing v3 | 84.6 | 86.0 | 82.5 | 84.4 |
| WritingBench | 82.6 | 85.5 | 85.2 | 85.0 |
Agent | BFCL-v3 | 66.3 | 65.1 | 68.6 | 72.4 |
Multilingual | MultiIF | 66.1 | 67.9 | 73.0 | 76.4 |
| MMLU-ProX | 70.9 | 72.0 | 76.1 | 76.4 |
| INCLUDE | 71.6 | 71.9 | 74.5 | 74.4 |
| PolyMATH | 44.3 | 43.1 | 51.7 | 52.6 |
GPU Configuration (Inference & Training, Rule-of-Thumb)
Scenario | Precision / Mode | Min VRAM (works) | Comfortable VRAM | Example GPU(s) | Notes |
---|---|---|---|---|---|
Single-GPU, Quantized (INT4/INT8) | INT4 / INT8 | 40–48 GB | 80 GB | 1× A100 80GB / H100 80GB | Suitable for cost-efficient inference; use bitsandbytes or GGUF quantization. |
Single-GPU, Half Precision (BF16/FP16) | BF16 / FP16 | 80 GB | 96–120 GB | 1× H100 80GB (SXM/PCIe) | Full-fidelity reasoning, best for smaller batch sizes and single-image/video tasks. |
Multi-GPU (Tensor Parallelism) | BF16 / FP16 | 4× 40 GB = 160 GB | 4× 80 GB = 320 GB | 4× A100 40GB / L40S | Splits weights across GPUs; needed for high-batch inference and long-context workloads. |
MoE Training Setup | FP16 / BF16 | 512–640 GB | 768 GB+ | 8× H100 80GB SXM | Required for fine-tuning or multi-video reasoning; benefits from FlashAttention-2. |
Long Context + Video (1M tokens) | FP16 w/ FlashAttention-2 | 160 GB | 320 GB+ | 4× H100 80GB | Large memory headroom needed for KV cache during ultra-long context or multi-hour video processing (rough estimator below the table).
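The long-context row above is dominated by KV-cache memory, which grows linearly with context length. Below is a rough back-of-the-envelope estimator you can adapt; note that the layer count, KV-head count, and head dimension used as defaults are placeholders for illustration, not Qwen3-VL-30B-A3B's actual configuration, so read the real values from the checkpoint's config.json before relying on the numbers.

def kv_cache_gib(context_tokens: int,
                 num_layers: int = 48,       # placeholder, not the real config
                 num_kv_heads: int = 4,      # placeholder (GQA models keep this small)
                 head_dim: int = 128,        # placeholder
                 bytes_per_value: int = 2):  # 2 bytes per entry for FP16/BF16 caches
    """Approximate KV-cache size in GiB for one sequence:
    2 (K and V) x layers x kv_heads x head_dim x tokens x bytes."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / (1024 ** 3)

# Example: one 256K-token sequence with the placeholder dimensions above
print(f"~{kv_cache_gib(256_000):.1f} GiB of KV cache per sequence")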
Tips:
- Enable FlashAttention-2 for both inference and training; it reduces VRAM spikes and improves throughput (a hedged loading sketch follows this list).
- For edge deployment, quantized INT4 versions (via GGUF + llama.cpp or vLLM) make the model usable on single 48GB GPUs.
- For video + multimodal workloads, always keep extra VRAM buffer (~20–30%) for caching and activations.
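To make the first two tips concrete, here is a minimal loading sketch. The class name and model ID match the inference script later in this guide; the assumptions are that flash-attn built successfully (Step 12) and that bitsandbytes 4-bit quantization behaves well with this MoE checkpoint, which is worth verifying on your own hardware. Pick one of the two options, not both.

import torch
from transformers import Qwen3VLMoeForConditionalGeneration, BitsAndBytesConfig

MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Thinking"

# Option A: half precision + FlashAttention-2 (assumes flash-attn is installed; ~80 GB VRAM class)
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Option B: 4-bit NF4 quantization via bitsandbytes for smaller GPUs
# (assumption: 4-bit loading works cleanly for this MoE checkpoint)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_4bit = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=bnb,
    device_map="auto",
)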
Resources
Link: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking
Step-by-Step Process to Install & Run Qwen3-VL-30B-A3B-Thinking Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button on the Dashboard to deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image (Use the Jupyter Template)
We’ll use the Jupyter image from NodeShift’s gallery so you don’t have to install Jupyter Notebook/Lab manually. This image is GPU-ready and comes with a preconfigured Python + Jupyter environment—perfect for testing and serving Qwen3-VL-30B-A3B-Thinking.
What you’ll do
- pick the Jupyter template,
- (optionally) pick a CUDA/PyTorch variant if the UI offers it,
- open JupyterLab in your browser,
- install the few project-specific Python packages inside that environment.
How to select it
- In the Create VM flow, go to Choose an Image → Templates.
- Click Jupyter (see screenshot). You’ll see a short description like “A web-based interactive computing platform for data science.”
- If a version/stack dropdown appears, choose the latest CUDA 12.x / PyTorch variant (or “GPU-enabled” build).
- Click Create (or Next) to proceed to sizing and networking.
Why this image
- JupyterLab is already installed and enabled as a service, so the VM boots straight into a working notebook server.
- GPU drivers + CUDA runtime are aligned with the template, so PyTorch will detect your GPU out of the box.
- You can manage everything (terminals, notebooks, file browser) from the Jupyter UI—no extra desktop or VNC needed.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Access Your Deployment
Once your GPU VM is in the RUNNING state, you’ll see a control menu (three dots on the right side of the deployment card). This menu gives you multiple ways to access and manage your deployment.
Available Options
- Edit Name
Rename your deployment for easier identification (e.g., “Qwen3-VL-30B-A3B-Thinking”).
- Open Jupyter Notebook
- Click this to launch the pre-installed Jupyter environment directly in your browser.
- You’ll be taken to JupyterLab, where you can open notebooks, create terminals, and run code cells to set up Qwen3-VL-30B-A3B-Thinking.
- This is the most user-friendly way to start working immediately without additional setup.
- Connect with SSH
- Choose this if you prefer command-line access.
- You’ll get the SSH connection string (e.g., ssh -i <your-key> user@<vm-ip>).
- Use this method for advanced management, server setups (like vLLM/SGLang), or installing additional system packages.
- Show Logs
- View system/service logs for debugging (useful if something isn’t starting correctly).
- Helps verify GPU initialization or catch errors during startup.
- Update Tags
- Add labels or tags to organize multiple deployments.
- Example: tag by project, model type, or experiment.
- Destroy Unit
- This permanently shuts down and deletes your VM.
- Use only when you are done, as this action cannot be undone.
Recommended Path for Qwen3-VL-30B-A3B-Thinking
- For beginners / testing: Use Open Jupyter Notebook → open a Terminal inside JupyterLab → install the required Python packages → run quick inference tests.
- For production / serving APIs: Use Connect with SSH → start vLLM or SGLang on the VM → expose ports (8000/30000) → connect via API clients (a hedged example follows below).
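For the production path, the rest of this guide focuses on Transformers inside Jupyter, but a common serving pattern is to expose an OpenAI-compatible endpoint with vLLM and call it from any client. The serve command, port, and message schema below are assumptions based on vLLM's standard OpenAI-compatible server; check the vLLM documentation for the flags supported by your version.

# On the VM (over SSH), start the server with something like:
#   vllm serve Qwen/Qwen3-VL-30B-A3B-Thinking --port 8000
# Then call it from Python (pip install openai):
from openai import OpenAI

client = OpenAI(base_url="http://<vm-ip>:8000/v1", api_key="EMPTY")  # vLLM ignores the key
resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-30B-A3B-Thinking",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
    max_tokens=256,
)
print(resp.choices[0].message.content)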
Step 8: Open Jupyter Notebook
Once your VM is running, you can directly access the Jupyter Notebook environment provided by NodeShift. This will be your main workspace for running Qwen3-VL-30B-A3B-Thinking.
1. Click Open Jupyter Notebook
- From the My GPU Deployments panel, click the three-dot menu on your deployment card.
- Select Open Jupyter Notebook.
This will open a new browser tab pointing to your VM’s Jupyter instance.
2. Handle the Browser Security Warning
Since the Jupyter server is running with a self-signed SSL certificate, your browser may show a “Your connection is not private” warning.
- Click Advanced.
- Then, click Proceed to <your-vm-ip> (unsafe).
Don’t worry — this is expected. You’re connecting directly to your VM’s Jupyter server, not a public website.
3. JupyterLab Interface Opens
Once you proceed, you’ll land inside JupyterLab. Here you’ll see:
- Notebook options (Python 3, Python 3.10, etc.)
- Console options (interactive shells)
- Other tools like a Terminal, Text File, and Markdown File.
You can now use the Terminal inside JupyterLab to install dependencies and start working with Qwen3-VL-30B-A3B-Thinking.
Step 9: Open Python 3.10 Notebook and Rename
Now that JupyterLab is running, let’s create a notebook where we will set up and run Qwen3-VL-30B-A3B-Thinking.
1. Open a Python 3.10 Notebook
- In the Launcher screen, under Notebook, click on Python3.10 (python_310).
- This will open a new notebook editor with an empty code cell where you can type commands.
2. Rename the Notebook
- By default, the notebook will open as something like Untitled.ipynb.
- To rename:
- Right-click on the notebook tab name at the top.
- Select Rename Notebook….
- Enter a meaningful name such as:
qwen3vl.ipynb
Press Enter to confirm.
3. Verify the Editor
- You should now see an empty notebook named qwen3vl.ipynb with a code cell ready.
- This is where you’ll run all the setup commands (installing dependencies, loading the model, and running inference).
Step 10: Verify GPU Availability
Before installing and running Qwen3-VL-30B-A3B-Thinking, it’s important to confirm that your VM has successfully attached the GPU and that CUDA is working.
1. Run nvidia-smi
In your Jupyter Notebook cell, type:
!nvidia-smi
2. Check the Output
You should see information about your GPU, similar to the screenshot:
- GPU Name → NVIDIA H200
- Driver Version → 565.xx or similar
- CUDA Version → 12.x (here it shows 12.7)
- Memory Usage → confirms available VRAM
- Temperature / Power → current GPU status
3. Why This Step Matters
- Confirms that the GPU drivers are properly installed.
- Ensures the CUDA runtime matches your environment.
- Prevents wasted time later if the model fails to load due to GPU issues.
With GPU verified, you’re ready to proceed to the next step: installing the required Python libraries (Transformers, vLLM, SGLang, etc.) inside the notebook.
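If you would rather capture the same check programmatically (for example, to fail fast at the top of a setup script), a small helper like the one below works; the query flags are standard nvidia-smi options.

import subprocess

# Ask nvidia-smi for just the fields we care about, in machine-readable CSV form
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total,driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # e.g. "NVIDIA H200, <total MiB>, <driver version>"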
Step 11: Install PyTorch with CUDA 12.4 Support
Use the following command to install the latest stable PyTorch, TorchVision, and TorchAudio built specifically for CUDA 12.4:
!pip install --upgrade --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio
This ensures that your environment has GPU acceleration enabled and is fully compatible with CUDA 12.4 for running large-scale models like Qwen3-VL-30B-A3B-Thinking.
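After the install finishes, a quick check in a new notebook cell confirms that you received the CUDA builds and that PyTorch can see the GPU:

import torch

print(torch.__version__)           # CUDA 12.4 wheels end in +cu124
print(torch.version.cuda)          # should report 12.4
print(torch.cuda.is_available())   # should be True on the GPU VM
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))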
Step 12: Install FlashAttention-2
Install the required build tools and flash-attn compiled against your current PyTorch/CUDA stack:
!pip install setuptools wheel
!pip install --no-build-isolation flash-attn
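The flash-attn build can take several minutes. Once it completes, a one-line import check confirms the wheel is usable; if it imports cleanly, you can later load the model with attn_implementation="flash_attention_2" instead of the SDPA fallback used in the script below.

import flash_attn

# If this prints a version without errors, the FlashAttention-2 kernels are available
print(flash_attn.__version__)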
Step 13: Install Transformers (Latest) + Vision/Video Deps
Install the latest Transformers from source plus all runtime libraries for image/video IO and faster HF downloads.
!pip install "git+https://github.com/huggingface/transformers"
!pip install accelerate safetensors sentencepiece
!pip install pillow opencv-python timm
!pip install decord av imageio[ffmpeg] # for video
!pip install huggingface_hub[hf_transfer]
Why these:
- transformers (latest main) → freshest Qwen3-VL support
- accelerate, safetensors, sentencepiece → inference + tokenizer basics
- pillow, opencv-python, timm → image handling & vision utilities
- decord, av, imageio[ffmpeg] → video reading & frame sampling
- huggingface_hub[hf_transfer] → faster model downloads (enable via HF_HUB_ENABLE_HF_TRANSFER=1)
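Optionally, you can pre-fetch the model weights (tens of GB) before running the inference script so the first generation is not blocked on the download. This is a small sketch using the standard huggingface_hub API; files land in the default HF cache unless you override it.

import os

# Must be set before importing huggingface_hub for hf_transfer to be picked up
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

local_path = snapshot_download("Qwen/Qwen3-VL-30B-A3B-Thinking")
print("Model cached at:", local_path)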
Step 14: Run the Image Inference Script
Execute your script to generate a caption from the demo image.
import os
import sys
from io import BytesIO

import torch
import requests
from PIL import Image
from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor

MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Thinking"
IMG_URL = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"


def fetch_image(url: str) -> Image.Image:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    return Image.open(BytesIO(r.content)).convert("RGB")


def main():
    # --- Sanity prints ---
    print("Python :", sys.version)
    print("Torch  :", torch.__version__)
    print("CUDA?  :", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU    :", torch.cuda.get_device_name(0))

    # --- Load model (NO flash-attn; we force SDPA) ---
    # Prefer bf16 on GPU; fallback to fp16 as needed.
    dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
    model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
        MODEL_ID,
        torch_dtype=dtype,
        device_map="auto",            # shards across multiple GPUs if present
        attn_implementation="sdpa",   # <- avoids FlashAttention entirely
    )
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    # --- Prepare message with image ---
    image = fetch_image(IMG_URL)
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in 3 concise bullet points."},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    )

    # Move tensors to the model's device
    device = next(model.parameters()).device
    for k, v in list(inputs.items()):
        if hasattr(v, "to"):
            inputs[k] = v.to(device)

    # --- Generate ---
    gen_kwargs = dict(
        max_new_tokens=256,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    with torch.inference_mode():
        out = model.generate(**inputs, **gen_kwargs)

    # Trim prompt tokens so only the newly generated text is decoded
    trimmed = [o[len(i):] for i, o in zip(inputs["input_ids"], out)]
    text = processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    print("\n=== MODEL RESPONSE ===\n")
    print(text)


if __name__ == "__main__":
    os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")  # faster downloads
    main()
What This Script Does
- Loads the Qwen3-VL-30B-A3B-Thinking vision-language model (SDPA; no FlashAttention).
- Downloads a demo image and builds a chat-style message (image + prompt).
- Tokenizes with AutoProcessor using Qwen’s chat template.
- Runs GPU inference (device_map="auto", bf16/fp16) to generate up to 256 tokens.
- Prints the model’s concise description of the image to the console.
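Because this is the Thinking variant, the decoded output usually contains the model's reasoning trace before the final answer. Assuming it delimits that trace with a closing </think> tag, as the text-only Qwen3 thinking models do, a small helper like this separates the two (a sketch to adapt, not an official parser):

def split_thinking(decoded: str):
    """Split decoded output into (reasoning, answer), assuming a </think> delimiter."""
    marker = "</think>"
    if marker in decoded:
        reasoning, answer = decoded.split(marker, 1)
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", decoded.strip()  # no marker found: treat everything as the answer

# Example: `text` is the string produced by batch_decode in the script above
reasoning, answer = split_thinking(text)
print(answer)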
Conclusion
Qwen3-VL-30B-A3B-Thinking stands out as one of the most capable open multimodal reasoning models available today.
With its fusion of text, vision, and video understanding, it pushes the boundaries of large-scale reasoning across STEM, code generation, and real-world perception.
Running it on a NodeShift GPU VM offers a strong balance of performance and accessibility, letting you explore advanced image, document, and video comprehension directly from a Jupyter environment.
Whether you’re a researcher, developer, or enterprise user, this guide enables you to deploy Qwen3-VL locally, experience its multimodal depth, and build the next generation of intelligent applications powered by unified reasoning.