gpt-oss-safeguard is a pair of open-weight safety-reasoning models built on the gpt-oss family. They are trained to interpret your own policy text, explain decisions with auditable reasoning, and let you dial the reasoning effort up or down (low/medium/high). The 20B variant targets 16 GB-class GPUs for low-latency filters and offline labeling, while the 120B variant is tuned for the highest quality yet still runs on a single 80 GB H100 thanks to its MoE architecture and native MXFP4 quantization. Both follow the harmony response format (use it, or outputs will degrade) and ship under Apache-2.0 for flexible commercial use.
GPU Configuration Table
| Tier / Use case | Model & quant | Min VRAM (approx.) | Suggested single-GPU options | Notes & tips |
|---|---|---|---|---|
| Entry – lightweight policy trials, fast I/O filters | 20B • MXFP4/8-bit | 16 GB | RTX 4080 16G / A5000 24G / L4 24G | 20B is designed to fit 16 GB; keep sequence lengths modest for maximal throughput. Harmony format required. (Hugging Face) |
| Standard – batch offline labeling, low-latency moderation | 20B • MXFP4 | 20–24 GB | RTX 4090 24G / A5000 24G / L4 24G | Headroom helps for longer contexts & larger batches; use vLLM --gpu-memory-utilization 0.9. (Hugging Face) |
| Pro – high-quality safety reasoning, single-GPU | 120B • MXFP4 | 80 GB | H100 80G (SXM/PCIe) | Official guidance: 120B fits on a single 80 GB H100 via MXFP4 + MoE routing. Use harmony template and set reasoning_effort. (Hugging Face) |
| Max – bigger batches / longer ctx on 120B | 120B • MXFP4 | 96–120 GB effective | H200 141G (with tensor parallel = 1), or 2× A100 80G/H100 80G with TP=2 | For more headroom, either step up to larger VRAM or shard across 2 GPUs (TP=2). Keep page-size defaults; pin memory on. (Sharding is an inference-stack capability, not a model requirement.) (Hugging Face) |
| Mac / CPU prototyping (debug, not prod) | 20B • 4-bit MLX/GGUF | 16–24 GB unified | Apple M2/M3 Max (64–96 GB RAM) | Community MLX/GGUF builds exist for 20B; useful for pipelines/tests, not high-throughput labeling. (Hugging Face) |
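If you prefer to serve these models with vLLM instead of Ollama, here is a minimal launch sketch for the tiers above; the flags are real vLLM options, but the exact values (memory utilization, parallel size) are illustrative and should be tuned to your hardware:
# 20B on a single 24 GB GPU; let vLLM use up to 90% of VRAM for weights + KV cache
vllm serve openai/gpt-oss-safeguard-20b --gpu-memory-utilization 0.9
# 120B sharded across two 80 GB GPUs (TP=2); on a single H100/H200, drop the tensor-parallel flag
vllm serve openai/gpt-oss-safeguard-120b --tensor-parallel-size 2 --gpu-memory-utilization 0.9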
Quick Facts You’ll Likely Need
- Params / active params per token: 20B (~21B total, ~3.6B active); 120B (~117B total, ~5.1B active).
- License: Apache-2.0 (commercial-friendly).
- Format: Use the provided harmony chat template; responses are tuned for it.
- Where to get them: Hugging Face model cards (OpenAI org), plus OpenAI’s intro post and prompt guide.
- Community quantizations: MLX/GGUF 4-bit variants available for the 20B and 120B (experimental).
Resources
Link 1: https://huggingface.co/openai/gpt-oss-safeguard-20b
Link 2: https://huggingface.co/openai/gpt-oss-safeguard-120b
Link 3: https://ollama.com/library/gpt-oss-safeguard
Step-by-Step Process to Install & Run OpenAI GPT-OSS-Safeguard 20B and 120B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option in the Dashboard, and click the Create GPU Node button to deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running GPT-OSS-Safeguard 20B and 120B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like GPT-OSS-Safeguard 20B and 120B
- Compatibility with CUDA 12.1.1 required by certain model operations
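If you are replicating this setup outside NodeShift, a roughly equivalent way to get the same environment on your own machine (assuming Docker and the NVIDIA Container Toolkit are already installed) is to start the image with GPU access yourself:
# Pull the CUDA 12.1.1 devel image and open an interactive shell with all GPUs attached
docker run --gpus all -it --rm nvidia/cuda:12.1.1-devel-ubuntu22.04 bash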
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like GPT-OSS-Safeguard 20B and 120B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.
This setup ensures that GPT-OSS-Safeguard 20B and 120B run in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
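For a quick memory-focused view (handy for confirming you have enough free VRAM for the 120B model), you can also query specific fields; these are standard nvidia-smi options:
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv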
Step 8: Install Ollama
After connecting to the terminal via SSH, it’s now time to install Ollama from the official Ollama website.
Website Link: https://ollama.com/
Run the following command to install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
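Once the installer finishes, a quick sanity check confirms the binary is on your PATH:
ollama --version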
Step 9: Serve Ollama
Run the following command to start the Ollama server so that models can be served and accessed:
ollama serve
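Note that ollama serve keeps the foreground terminal busy. If you prefer to keep working in the same SSH session, one simple option (a quick sketch; a systemd service is the more robust choice on long-lived machines) is to run it in the background and log output to a file:
# Start the Ollama server in the background on its default port 11434
nohup ollama serve > ollama.log 2>&1 &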
Step 10: Explore Ollama CLI Commands
After starting the Ollama server, you can explore all available commands and get help right from the terminal.
To see the list of all commands that Ollama supports, run:
ollama
You’ll see an output like this:
Usage:
ollama [flags]
ollama [command]
Available Commands:
serve Start ollama
create Create a model
show Show information for a model
run Run a model
stop Stop a running model
pull Pull a model from a registry
push Push a model to a registry
list List models
ps List running models
cp Copy a model
rm Remove a model
help Help about any command
Flags:
-h, --help help for ollama
-v, --version Show version information
Use "ollama [command] --help" for more information about a command.
This command helps you quickly understand what you can do with Ollama—such as running, pulling, stopping models, and more.
Step 11: Pull Both GPT-OSS-Safeguard 20B and 120B Models
GPT-OSS-Safeguard comes in two versions: 20B and 120B.
You’ll need to pull each model separately using Ollama’s CLI.
Let’s do it one by one:
Pull the 20B Version
Run this command to pull the 20B model:
ollama pull gpt-oss-safeguard:20b
You’ll see progress bars as the model and its components download.
When it finishes, you should see a success message.
Pull the 120B Version
Now, pull the larger 120B model:
ollama pull gpt-oss-safeguard:120b
Again, wait for the download and extraction to finish until you see a success message.
Step 12: Verify Downloaded Models
After pulling the GPT-OSS-Safeguard 20B and 120B models, you can check that they’ve been successfully downloaded and are available on your system.
Just run:
ollama list
You should see output like this:
NAME ID SIZE MODIFIED
gpt-oss-safeguard:120b 45be44f7918a 65 GB 22 minutes ago
gpt-oss-safeguard:20b f2e795d0099c 13 GB 25 minutes ago
This confirms both the 20B and 120B GPT-OSS models are now installed and ready to use.
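If you want more detail than the list view, the show subcommand prints a model's architecture, parameter count, context length, quantization, and chat template:
ollama show gpt-oss-safeguard:20b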
Step 13: Run the GPT-OSS-Safeguard 20B Model for Inference
Now that your models are installed, you can start running them and interacting directly from the terminal.
To run the 20B version of GPT-OSS-Safeguard, use:
ollama run gpt-oss-safeguard:20b
You’ll be prompted to enter your message or prompt. For example, you can try:
Label this text as safe or unsafe per my harassment policy: "You’re useless and stupid."
The model will process your prompt, display “Thinking…”, and then generate a detailed response.
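For automation (batch labeling, CI checks, moderation pipelines), you can call the same model through Ollama's local HTTP API instead of the interactive prompt. A minimal sketch, assuming the server from Step 9 is still running on its default port; the one-line policy in the system message is purely illustrative:
curl http://localhost:11434/api/chat -d '{
  "model": "gpt-oss-safeguard:20b",
  "messages": [
    {"role": "system", "content": "Policy: label content UNSAFE if it insults or demeans a person, otherwise SAFE. Reply with the label and a one-sentence reason."},
    {"role": "user", "content": "You are useless and stupid."}
  ],
  "stream": false
}'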
Try Different Prompts
Step 14: Run the 120B GPT-OSS-Safeguard Model
After testing the 20B model, let’s now run the larger, more powerful 120B version.
To start an interactive session with the 120B model, run:
ollama run gpt-oss-safeguard:120b
You’ll see the prompt:
>>>
Type your question, prompt, or creative request—just like with the 20B model. For example:
Summarize why this text breaks the anti-hate rule in one sentence.
The model will process your request and generate a detailed, creative answer.
Now you’ve successfully run and interacted with the GPT-OSS-Safeguard 20B and 120B models directly in your terminal using Ollama! This command-line approach is fast and powerful for quick experiments or automation. However, sometimes you want a more visually appealing and user-friendly interface for chatting with models, exploring outputs, or showcasing demos. For those moments, it’s great to use an interface like Open WebUI, which makes running prompts and interacting with models both simple and enjoyable. In the next steps, we’ll see how to run the same models with Open WebUI and experience an upgraded, interactive chat environment.
Step 15: Check the Available Python Version and Install a Newer One
Run the following command to check the Python version currently available:
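python3 --version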
In our image, the system ships with Python 3.8.1 by default. To install a higher version of Python, you’ll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 16: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 17: Update the Default Python3 Version
Now, run the following command to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 18: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 19: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv openwebui
source openwebui/bin/activate
Step 20: Install Open-WebUI
Run the following command to install open-webui:
pip install open-webui
Step 21: Serve Open-WebUI
In your activated Python environment, start the Open-WebUI server by running:
open-webui serve
- Wait for the server to complete all database migrations and set up initial files. You’ll see a series of INFO logs and a large “OPEN WEBUI” banner in the terminal.
- When setup is complete, the WebUI will be available and ready for you to access via your browser.
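Open-WebUI listens on port 8080 by default, which is what we forward in the next step. If that port is already taken on your VM, the serve command accepts a port override (check open-webui serve --help on your install to confirm the exact flag):
open-webui serve --port 8080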
Step 22: Set up SSH port forwarding from your local machine
On your local machine (Mac/Windows/Linux), open a terminal and run:
ssh -L 8080:localhost:8080 -p 18685 root@Your_VM_IP
This forwards:
Local localhost:8080 → Remote VM localhost:8080 (replace 18685 and Your_VM_IP with your own VM’s SSH port and address).
Step 23: Access Open-WebUI in Your Browser
Go to:
http://localhost:8080
- You should see the Open-WebUI login or setup page.
- Log in or create a new account if this is your first time.
- You’re now ready to use Open-WebUI to interact with your models!
Step 24: Select and Use Your Model in Open-WebUI
Once you’ve logged into Open-WebUI in your browser, you can easily choose between any models you have installed on your system.
- Click on the model selection dropdown at the top left (where you see the model name, e.g., gpt-oss-safeguard:120b).
- You’ll see a list of all available models, such as gpt-oss-safeguard:120b, gpt-oss-safeguard:20b, and any other models you’ve installed.
- Simply click on the model you want to use (for example, gpt-oss-safeguard:120b for the largest, most powerful model).
- Once selected, you can start chatting or sending prompts to that model in the Open-WebUI chat window below.
Step 25: Start Chatting with Your Model in Open-WebUI
With your model selected in Open-WebUI, you can now start sending prompts and receive rich, detailed responses—just like chatting with a modern AI assistant.
- Type your question or prompt in the chat input box at the bottom of the screen.
- Press Enter to send your message.
- The model will process your request and respond in the chat window, showing its full reasoning and answer.
As shown in the screenshot, you can ask advanced questions, get structured explanations, and even see responses formatted with tables and bullet points.
Step 26: Explore Advanced Reasoning and Creativity with Large Models
With the gpt-oss-safeguard:120b model loaded in Open-WebUI, you can take full advantage of its advanced reasoning, problem-solving, and creativity. Try giving the model complex, multi-step challenges—such as designing unique puzzles, solving technical problems, or explaining advanced topics in depth.
- Ask open-ended or multi-part questions to see the model’s full reasoning process.
- The model can generate diagrams, ASCII art, tables, and well-structured explanations, as shown in the screenshot.
- You can save, copy, or collapse responses for easy reference.
Conclusion
With GPT-OSS-Safeguard 20B and 120B, you now have fully open, safety-reasoning models that let you enforce your own policies, audit model decisions, and control reasoning depth with unmatched flexibility. Whether you’re building low-latency moderation pipelines or large-scale labeling systems, these models deliver reliable safety insights with transparent logic.
Deployed on NodeShift GPU VMs, they combine enterprise-grade infrastructure with affordable, high-performance compute, making open-source safety not just possible—but practical and scalable for any team or product.