Qwen3-4B-Thinking-2507 is a compact yet highly capable reasoning-focused language model designed for tasks that demand clarity of thought and multi-step problem solving. Despite having only 4 billion parameters, it delivers strong performance across logical reasoning, mathematics, scientific analysis, coding challenges, and other domains that require precision and depth.
What makes this version stand out is its “thinking mode” — it produces a visible reasoning trace before giving the final answer, allowing you to see how it arrives at conclusions. This is particularly valuable for debugging model outputs, teaching, or verifying reasoning in high-stakes scenarios.
Another key strength is its long-context capability — up to 262,144 tokens natively — enabling it to work with extremely large documents, multi-turn conversations, or complex datasets without losing context. Whether you’re feeding it an entire research paper, a big block of code, or a chain of connected instructions, it can keep track of details and maintain coherent reasoning throughout.
Although designed for complex reasoning tasks, it’s also well-tuned for general-purpose usage such as instruction following, structured output generation, and creative writing. It supports tool usage through agent frameworks like Qwen-Agent, making it easier to integrate with APIs, code execution environments, and other workflows.
Performance
| Benchmark | Qwen3-30B-A3B Thinking | Qwen3-4B Thinking | Qwen3-4B-Thinking-2507 |
| --- | --- | --- | --- |
| **Knowledge** | | | |
| MMLU-Pro | 78.5 | 70.4 | 74.0 |
| MMLU-Redux | 89.5 | 83.7 | 86.1 |
| GPQA | 65.8 | 55.9 | 65.8 |
| SuperGPQA | 51.8 | 42.7 | 47.8 |
| **Reasoning** | | | |
| AIME25 | 70.9 | 65.6 | 81.3 |
| HMMT25 | 49.8 | 42.1 | 55.5 |
| LiveBench 20241125 | 74.3 | 63.6 | 71.8 |
| **Coding** | | | |
| LiveCodeBench v6 (25.02–25.05) | 57.4 | 48.4 | 55.2 |
| CFEval | 1940 | 1671 | 1852 |
| OJBench | 20.7 | 16.1 | 17.9 |
| **Alignment** | | | |
| IFEval | 86.5 | 81.9 | 87.4 |
| Arena-Hard v2 | 36.3 | 13.7 | 34.9 |
| Creative Writing v3 | 79.1 | 61.1 | 75.6 |
| WritingBench | 77.0 | 73.5 | 83.3 |
| **Agent** | | | |
| BFCL-v3 | 69.1 | 65.9 | 71.2 |
| TAU1-Retail | 61.7 | 33.9 | 66.1 |
| TAU1-Airline | 32.0 | 32.0 | 48.0 |
| TAU2-Retail | 34.2 | 38.6 | 53.5 |
| TAU2-Airline | 36.0 | 28.0 | 58.0 |
| TAU2-Telecom | 22.8 | 17.5 | 27.2 |
| **Multilingualism** | | | |
| MultiIF | 72.2 | 66.3 | 77.3 |
| MMLU-ProX | 73.1 | 61.0 | 64.2 |
| INCLUDE | 71.9 | 61.8 | 64.4 |
| PolyMATH | 46.1 | 40.0 | 46.2 |
Recommended GPU Setups
| Tier | Example GPU | VRAM | Precision | Good for (context / output) | Notes |
| --- | --- | --- | --- | --- | --- |
| Minimum | RTX 3060 8GB | 8 GB | FP16/BF16 | ~16k–32k ctx, ≤512–1024 new tokens | Keep prompts short; lower max_new_tokens if you hit OOM. |
| Sweet spot (budget) | RTX 3060 12GB / RTX 4070 12GB | 12 GB | FP16/BF16 | ~32k–64k ctx, ≤1k–2k new tokens | Solid single-user chat; enable TF32/BF16 when available. |
| Sweet spot (creator) | RTX 3080 10GB / 4070 Ti 16GB | 10–16 GB | FP16/BF16 | ~64k–96k ctx, ≤2k–4k new tokens | Great balance; watch KV-cache growth on very long prompts. |
| Prosumer | RTX 3090 / 4090 (24GB) | 24 GB | FP16/BF16 | ~131k ctx, ≤4k–8k new tokens | Comfortable long-form runs; small-batch serving possible. |
| Datacenter (mid) | A5000 24GB / L40 24GB | 24 GB | FP16/BF16 | ~131k ctx, higher throughput | Good for small teams; add swap/KV offload if needed. |
| Datacenter (strong) | A100 40GB | 40 GB | FP16/BF16 | ~200k ctx, ≤8k–16k new tokens | Reliable long context + larger batches. |
| Datacenter (best) | A100 80GB / H100 80GB | 80 GB | FP16/BF16 | Full 262k ctx, very long outputs | Headroom for 32k+ new tokens and multi-user serving. |
Resources
Link: https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507
Step-by-Step Process to Install & Run Qwen3-4B-Thinking-2507 Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and configure your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Qwen3-4B-Thinking-2507, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Qwen3-4B-Thinking-2507
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like Qwen3-4B-Thinking-2507.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that Qwen3-4B-Thinking-2507 runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a Newer Version
Run the following command to check the currently available Python version:
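python3 --version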
By default, the system has Python 3.8.1 available. To install a higher version of Python, you’ll need to use the deadsnakes PPA. Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following commands to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv qwen3
source qwen3/bin/activate
python -m pip install --upgrade pip wheel
Step 13: Install CUDA-enabled PyTorch
Run the following command to install CUDA-enabled PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
Step 14: Install Python Dependencies
Run the following command to install the Python dependencies:
pip install "transformers>=4.51.0" accelerate sentencepiece
Step 15: Connect to Your GPU VM with a Code Editor
Before you start running Python scripts with the Qwen3-4B-Thinking-2507 models and Transformers, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
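If your editor connects over SSH (e.g., VS Code or Cursor Remote-SSH), one common approach is to add a host entry to your local ~/.ssh/config file. A minimal sketch (the alias, username, IP, and key path below are placeholders to replace with your own VM details):
Host nodeshift-qwen3
    HostName <your_VM_public_IP>
    User <your_VM_username>
    IdentityFile ~/.ssh/<your_private_key>
You can then pick this host from the editor’s Remote-SSH host list and open a folder on the VM.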
Step 16: Create Python Script to Load the Model & Generate a Response
In this step, you will create a Python script that:
- Loads Qwen/Qwen3-4B-Thinking-2507 from Hugging Face.
- Automatically downloads the model weights on first run (cached locally for future use).
- Generates a text response and prints it in the terminal.
Create the script file
Create a new file named app.py, add the following code to it, and save it:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen3-4B-Thinking-2507"

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok([text], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=2048)  # push higher if you have VRAM
gen_ids = out[0][len(inputs.input_ids[0]):].tolist()

# Parse thinking vs final answer
END_THINK_ID = 151668  # </think>
try:
    # position right after the last </think> token
    idx = len(gen_ids) - gen_ids[::-1].index(END_THINK_ID)
except ValueError:
    # no </think> found (e.g. output truncated): treat everything as the final answer
    idx = 0

thinking = tok.decode(gen_ids[:idx], skip_special_tokens=True).strip()
final = tok.decode(gen_ids[idx:], skip_special_tokens=True).strip()

print("\n--- THINKING ---\n", thinking[:2000])
print("\n--- FINAL ---\n", final)
This script will:
- Load the tokenizer and model from Hugging Face.
- Download model weights only once (cached in ~/.cache/huggingface).
- Generate a response to the given prompt.
- Show both the model’s hidden reasoning and the final clean answer.
Step 17: Run the Script
Execute the following command to run the script:
python3 app.py
On the first run, Hugging Face will download:
- Tokenizer files
- Model config
- 3 sharded .safetensors weight files (~8 GB total)
- Generation config
On subsequent runs, it will load instantly from cache.
Once loaded, the script will:
- Parse and display the model’s thinking (hidden reasoning steps)
- Show the final answer cleanly
Example output you should see:
--- THINKING ---
Okay, the user asked for a short introduction to large language models...
(internal reasoning continues...)
--- FINAL ---
Large language models (LLMs) are powerful AI systems trained on massive text data
to understand and generate human-like language. They can answer questions, write
content, translate languages, and more.
Up to this point, we’ve successfully set up our environment, loaded the Qwen3-4B-Thinking-2507 model, and generated responses directly in the terminal — letting us verify that everything is working end-to-end. Now that the model runs locally without issues, it’s time to take the next step: running the model in a way that allows us to interact with it through a browser-based interface. This will give us a more user-friendly experience, complete with a clean chat UI, adjustable parameters, and the ability to send and receive messages without relying solely on the command line.
Step 18: Install Required Libraries for Browser-Based Interaction
Before we can run the Qwen3-4B-Thinking-2507 model in a browser interface, we need to install Streamlit along with the required dependencies for model loading and inference.
Run the following command in your terminal:
pip install streamlit "transformers>=4.51.0" accelerate sentencepiece
Explanation of packages:
- streamlit → Builds the browser-based chat UI.
- transformers>=4.51.0 → Ensures compatibility with Qwen3 model architecture.
- accelerate → Optimizes model loading and GPU/CPU usage.
- sentencepiece → Required tokenizer library for Qwen models.
Once the installation finishes, you’ll be ready to create a Streamlit app that connects to the model and lets you chat through your browser.
Step 19: Create the Streamlit App to Chat with Qwen3 in Your Browser
Now, let’s write the web.py script so we can interact with the model in a chat-style browser UI. It defines the model loading function, a strip_think helper that hides the reasoning trace, and the chat interface itself.
Create a new file named web.py and add the following code:
import os, re, threading
import streamlit as st
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
import torch

MODEL_NAME = "Qwen/Qwen3-4B-Thinking-2507"

@st.cache_resource(show_spinner=True)
def load_model_and_tokenizer():
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype="auto",
        device_map="auto",
    )
    model.eval()
    # Speed hint (Ampere+ GPUs)
    torch.backends.cuda.matmul.allow_tf32 = True
    return tok, model

def strip_think(text: str) -> str:
    # hide <think>...</think> and stray </think>
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return text.replace("</think>", "").strip()

st.set_page_config(page_title="Qwen3-4B-Thinking-2507 UI", layout="centered")
st.title("Qwen3-4B-Thinking-2507")

with st.sidebar:
    st.markdown("### Generation Settings")
    temperature = st.slider("temperature", 0.0, 1.5, 0.6, 0.05)
    topp = st.slider("top_p", 0.1, 1.0, 0.95, 0.01)
    max_new = st.slider("max_new_tokens", 32, 8192, 1024, 32)
    show_thoughts = st.checkbox("Show reasoning (raw <think>)", value=False,
                                help="Hidden by default; can be very long.")

tok, model = load_model_and_tokenizer()

if "history" not in st.session_state:
    st.session_state.history = []

# Chat history UI
for role, content in st.session_state.history:
    with st.chat_message(role):
        st.markdown(content)

user_msg = st.chat_input("Ask me anything...")

if user_msg:
    st.session_state.history.append(("user", user_msg))
    with st.chat_message("user"):
        st.markdown(user_msg)

    with st.chat_message("assistant"):
        placeholder = st.empty()

        # Build chat template (no thinking history per best practices)
        messages = [{"role": "user", "content": user_msg}]
        text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tok([text], return_tensors="pt").to(model.device)

        # Stream tokens from a background generation thread
        streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
        gen_kwargs = dict(
            **inputs,
            max_new_tokens=max_new,
            temperature=temperature,
            top_p=topp,
            streamer=streamer,
        )
        thread = threading.Thread(target=model.generate, kwargs=gen_kwargs)
        thread.start()

        collected = ""
        for token in streamer:
            collected += token
            show = collected if show_thoughts else strip_think(collected)
            # Live update
            placeholder.markdown(show)

    final_text = collected if show_thoughts else strip_think(collected)
    st.session_state.history.append(("assistant", final_text))
Step 20: Run the Streamlit App in Your Browser
With the web.py file ready, you can now launch the Streamlit interface to interact with the Qwen3-4B-Thinking-2507 model directly from your browser.
Start the app
In your terminal, run:
streamlit run web.py
Check the URLs
Once launched, Streamlit will display three URLs:
- Local URL → http://localhost:8501 (works only inside the VM)
- Network URL → http://172.17.0.5:8501 (internal network)
- External URL → http://<your_VM_public_IP>:8501 (use this to access from your own browser)
Visit localhost:8501 in your browser to check your app; from your own machine, this works once port 8501 is forwarded, as shown below.
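One way to reach the UI from your local machine is to forward port 8501 over SSH (a minimal sketch; replace the key path, username, and IP with your own connection details):
ssh -L 8501:localhost:8501 -i ~/.ssh/<your_private_key> <your_VM_username>@<your_VM_public_IP>
With the tunnel open, http://localhost:8501 on your local machine maps to the Streamlit server on the VM. Alternatively, open port 8501 in the VM’s firewall and use the External URL directly.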
Step 21: Test the App in Your Browser
- With the model running in Streamlit and port 8501 open, you can now test the app directly from your browser.
- Type your question or task in the chat input box and press Enter.
- Adjust temperature, top_p, and max_new_tokens in the left sidebar to control creativity and output length.
- Use the Show reasoning (raw <think>) checkbox to toggle the display of the model’s hidden reasoning process.
You should now see both the model’s reasoning (if enabled) and the final clean answer — just like in the example screenshots.
Conclusion
Qwen3-4B-Thinking-2507 combines powerful multi-step reasoning with an accessible hardware footprint, making it an excellent choice for anyone looking to explore advanced model thinking capabilities without the heavy resource demands of larger LLMs. In this guide, we walked through setting up a GPU-powered environment on NodeShift, installing dependencies, running the model in the terminal, and then building a browser-based interface with Streamlit for a more interactive experience. Whether you’re tackling complex logic problems, analyzing large documents, or experimenting with tool-augmented workflows, this model offers both performance and flexibility — all while keeping the reasoning process transparent.