GPT-OSS is a two-model, open-weight lineup built for real work: 120B for high-reasoning, production use that fits on a single H100, and 20B for fast local runs, fine-tuning, and lower-latency apps. Both ship under Apache-2.0, support function calling/structured outputs, and use the Harmony chat format for consistent responses. Run them your way—Transformers/vLLM in the cloud or GGUF via llama.cpp/Ollama—with Unsloth’s quants for speed or F16 for maximum fidelity (120B uses MXFP4 MoE; 20B can run in ~16 GB). This guide covers the clean path to set up and deploy both.
Recommended GPU Configuration
| Model | Minimum GPU Needed | VRAM Needed | GPU Count | Typical Hardware Example | Runs on Consumer GPU? | Notes |
|---|---|---|---|---|---|---|
| gpt-oss-20b | 1x high-end GPU | 16 GB+ | 1 | NVIDIA RTX 4090, A6000, H100 | Yes | Runs comfortably on modern consumer GPUs. Easy for local use. |
| gpt-oss-120b | 1x server-grade GPU | 80 GB+ | 1 | NVIDIA H100 (80 GB), A100 (80 GB) | No (server only) | Needs powerful server hardware, usually a cloud or on-prem GPU server. |
Resources
Link 1: https://huggingface.co/unsloth/gpt-oss-20b-GGUF
Link 2: https://huggingface.co/unsloth/gpt-oss-120b-GGUF
Step-by-Step Process to Install & Run Unsloth GPT-OSS 20b and 120b GGUF Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option in the Dashboard, click the Create GPU Node button, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Unsloth GPT-OSS 20b and 120b GGUF, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Unsloth GPT-OSS 20b and 120b GGUF
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like Unsloth GPT-OSS 20b and 120b GGUF.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the Unsloth GPT-OSS 20b and 120b GGUF models run in a GPU-enabled environment with proper CUDA access and high compute performance.
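If you want to reproduce this environment outside NodeShift, a minimal sketch of launching the same image locally with Docker is shown below; it assumes Docker and the NVIDIA Container Toolkit are already installed on the host and is not part of the NodeShift flow itself:
docker run --gpus all -it --rm nvidia/cuda:12.1.1-devel-ubuntu22.04 bash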
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
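A direct SSH connection typically looks like the example below; the key path and IP are placeholders for your own values, and the proxy SSH command NodeShift gives you may also include a custom port:
ssh -i ~/.ssh/<your-key> root@<node-ip>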
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a Newer Version
Run the following command to check the available Python version:
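python3 --version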
When you check the Python version, you'll see that the system has Python 3.8.1 available by default. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following commands to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Build llama.cpp (with CUDA enabled)
Run the following commands to build llama.cpp:
apt-get update
apt-get install -y pciutils build-essential cmake curl libcurl4-openssl-dev git
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
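Optionally, you can sanity-check the freshly built binaries before downloading any models; printing the version and build info is a quick way to confirm the build succeeded (flag assumed from llama.cpp's standard CLI options):
./llama.cpp/llama-cli --version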
Step 13: Install huggingface_hub and Download the 20b Model
Run the following commands to install huggingface_hub and download the model:
pip install --upgrade huggingface_hub
python3 - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="unsloth/gpt-oss-20b-GGUF",
    local_dir="unsloth/gpt-oss-20b-GGUF",
    allow_patterns=["*Q4_K_M.gguf"],
)
PY
ls -lh unsloth/gpt-oss-20b-GGUF/
Step 14: Run the 20b Model
Execute the following command to run the model:
./llama.cpp/llama-cli \
--model unsloth/gpt-oss-20b-GGUF/gpt-oss-20b-Q4_K_M.gguf \
--threads -1 \
--ctx-size 8192 \
--n-gpu-layers 99
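If you'd prefer an API endpoint instead of the interactive CLI, a minimal llama-server launch with the same model and settings might look like the sketch below; the host and port are arbitrary example values, not something this guide prescribes. llama-server exposes an OpenAI-compatible HTTP API:
./llama.cpp/llama-server \
  --model unsloth/gpt-oss-20b-GGUF/gpt-oss-20b-Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 \
  --port 8080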
Step 15: Install huggingface_hub and Download the 120b Model
Run the following commands to install huggingface_hub and download the model:
pip install -U huggingface_hub
python3 - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="unsloth/gpt-oss-120b-GGUF",
    local_dir="unsloth/gpt-oss-120b-GGUF",
    allow_patterns=["*F16.gguf"],
)
PY
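As with the 20b model, you can verify that the download landed where expected before launching:
ls -lh unsloth/gpt-oss-120b-GGUF/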
Step 16: Run the 120b Model
Execute the following command to run the model:
./llama.cpp/llama-cli \
--model unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads -1 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 1.0 --min-p 0.0 --top-p 1.0 --top-k 0
Conclusion
You’ve got both gpt-oss-20B and gpt-oss-120B running cleanly in a CUDA-ready environment: you spun up a GPU VM, built llama.cpp with CUDA + curl, pulled the GGUFs, and launched inference (20B with Q4_K_M for speed; 120B F16 with MoE experts offloaded to CPU for fit and throughput). From here, it’s just choices:
- Speed vs. fidelity: stay on Q-series quants for snappy tokens, switch to F16 when you need maximum quality.
- Context & layers: raise --ctx-size for long docs; nudge --n-gpu-layers up or down based on VRAM; keep the -ot ".ffn_.*_exps.=CPU" trick for 120B stability.
- Serve it: use llama-server for an OpenAI-compatible endpoint (see the example after this list), or jump to Transformers/vLLM if you want a managed API with batching.
- Prompts: stick to the Harmony chat pattern for consistent structure and tool use.
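For instance, once llama-server is running (as sketched in Step 14), you can hit its OpenAI-compatible chat endpoint with a standard request; the port below assumes the example server configuration above, not a value from this guide:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'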
If something misbehaves: check nvidia-smi, lower --n-gpu-layers, confirm you’re on the latest llama.cpp, and verify disk space for the GGUFs.
That’s it: production-grade 120B when you need brains, lean 20B when you need speed. If this helped, share it with a teammate, and ping me if you want a one-click script that sets up the VM, builds llama.cpp, downloads the right GGUF, and starts a server automatically.