GPT-OSS is a two-model, open-weight lineup built for real work: 120B for high-reasoning, production use that fits on a single H100, and 20B for fast local runs, fine-tuning, and lower-latency apps. Both ship under Apache-2.0, support function calling/structured outputs, and use the Harmony chat format for consistent responses. Run them your way—Transformers/vLLM in the cloud or GGUF via llama.cpp/Ollama—with Unsloth’s quants for speed or F16 for maximum fidelity (120B uses MXFP4 MoE; 20B can run in ~16 GB). This guide covers the clean path to set up and deploy both.
Recommended GPU Configuration
| Model | Minimum GPU Needed | VRAM Needed | GPU Count | Typical Hardware Example | Runs on Consumer GPU? | Notes |
|---|---|---|---|---|---|---|
| gpt-oss-20b | 1x high-end GPU | 16 GB+ | 1 | NVIDIA RTX 4090, A6000, H100 | Yes | Runs comfortably on modern consumer GPUs. Easy for local use. |
| gpt-oss-120b | 1x server-grade GPU | 80 GB+ | 1 | NVIDIA H100 (80 GB), A100 (80 GB) | No (server only) | Needs powerful server hardware, usually a cloud or on-prem GPU server. |
Resources
Link 1: https://huggingface.co/unsloth/gpt-oss-20b-GGUF
Link 2: https://huggingface.co/unsloth/gpt-oss-120b-GGUF
Step-by-Step Process to Install & Run Unsloth GPT-OSS 20b and 120b GGUF Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option in the Dashboard, click the Create GPU Node button, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Unsloth GPT-OSS 20b and 120b GGUF, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Unsloth GPT-OSS 20b and 120b GGUF
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like Unsloth GPT-OSS 20b and 120b GGUF.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the Unsloth GPT-OSS 20b and 120b GGUF models run in a GPU-enabled environment with proper CUDA access and high compute performance.
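If you want to reproduce this environment outside NodeShift, a minimal sketch of launching the same image locally with Docker is shown below; it assumes Docker and the NVIDIA Container Toolkit are already installed on the host and is not part of the NodeShift flow itself:
docker run --gpus all -it --rm nvidia/cuda:12.1.1-devel-ubuntu22.04 bash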
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
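A direct SSH connection typically looks like the example below; the key path and IP are placeholders for your own values, and the proxy SSH command NodeShift gives you may also include a custom port:
ssh -i ~/.ssh/<your-key> root@<node-ip>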
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a Newer Version
Run the following command to check the available Python version:
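python3 --version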
When you check the Python version, you'll see that the system has Python 3.8.1 available by default. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following commands to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Build llama.cpp (with CUDA enabled)
Run the following commands to build llama.cpp:
apt-get update
apt-get install -y pciutils build-essential cmake curl libcurl4-openssl-dev git
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
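Optionally, you can sanity-check the freshly built binaries before downloading any models; printing the version and build info is a quick way to confirm the build succeeded (flag assumed from llama.cpp's standard CLI options):
./llama.cpp/llama-cli --version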
Step 13: Install huggingface_hub and Download the 20b Model
Run the following commands to install huggingface_hub and download the model:
pip install --upgrade huggingface_hub
python3 - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="unsloth/gpt-oss-20b-GGUF",
    local_dir="unsloth/gpt-oss-20b-GGUF",
    allow_patterns=["*Q4_K_M.gguf"],
)
PY
ls -lh unsloth/gpt-oss-20b-GGUF/
Step 14: Run the 20b Model
Execute the following command to run the model:
./llama.cpp/llama-cli \
--model unsloth/gpt-oss-20b-GGUF/gpt-oss-20b-Q4_K_M.gguf \
--threads -1 \
--ctx-size 8192 \
--n-gpu-layers 99
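If you'd prefer an API endpoint instead of the interactive CLI, a minimal llama-server launch with the same model and settings might look like the sketch below; the host and port are arbitrary example values, not something this guide prescribes. llama-server exposes an OpenAI-compatible HTTP API:
./llama.cpp/llama-server \
  --model unsloth/gpt-oss-20b-GGUF/gpt-oss-20b-Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 \
  --port 8080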
Step 15: Install huggingface_hub and Download the 120b Model
Run the following commands to install huggingface_hub and download the model:
pip install -U huggingface_hub
python3 - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="unsloth/gpt-oss-120b-GGUF",
    local_dir="unsloth/gpt-oss-120b-GGUF",
    allow_patterns=["*F16.gguf"],
)
PY
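As with the 20b model, you can verify that the download landed where expected before launching:
ls -lh unsloth/gpt-oss-120b-GGUF/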
Step 16: Run the 120b Model
Execute the following command to run the model:
./llama.cpp/llama-cli \
--model unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads -1 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 1.0 --min-p 0.0 --top-p 1.0 --top-k 0
Conclusion
You’ve got both gpt-oss-20B and gpt-oss-120B running cleanly in a CUDA-ready environment: you spun up a GPU VM, built llama.cpp with CUDA + curl, pulled the GGUFs, and launched inference (20B with Q4_K_M for speed; 120B F16 with MoE experts offloaded to CPU for fit and throughput). From here, it’s just choices:
- Speed vs. fidelity: stay on Q-series quants for snappy tokens, switch to F16 when you need maximum quality.
- Context & layers: raise --ctx-size for long docs; nudge --n-gpu-layers up or down based on VRAM; keep the -ot ".ffn_.*_exps.=CPU" trick for 120B stability.
- Serve it: use llama-server for an OpenAI-compatible endpoint (see the example after this list), or jump to Transformers/vLLM if you want a managed API with batching.
- Prompts: stick to the Harmony chat pattern for consistent structure and tool use.
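For instance, once llama-server is running (as sketched in Step 14), you can hit its OpenAI-compatible chat endpoint with a standard request; the port below assumes the example server configuration above, not a value from this guide:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'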
If something misbehaves: check nvidia-smi, lower --n-gpu-layers, confirm you’re on the latest llama.cpp, and verify disk space for the GGUFs.
That’s it: production-grade 120B when you need brains, lean 20B when you need speed. If this helped, share it with a teammate, and ping me if you want a one-click script that sets up the VM, builds llama.cpp, downloads the right GGUF, and starts a server automatically.