There’s a new duo in the world of open-source models, and they’re here to make life a whole lot easier for developers, builders, and tinkerers everywhere. Whether you need raw horsepower for serious projects or something nimble for local experimentation, the gpt-oss lineup has you covered.
On one side, you’ve got the gpt-oss-120b—a heavyweight, purpose-built for tasks where deep reasoning, clear thinking, and wide-ranging skills really matter. It’s ready for the big leagues, built to handle complex requests without breaking a sweat. Perfect if you want the confidence that comes from working with something built for scale and reliability.
On the other side is gpt-oss-20b, the lighter and more agile sibling. It’s all about speed and versatility, ideal for those moments when you want answers fast, want to run things on your own machine, or just need a model that’s easy to fine-tune and shape to your unique needs.
Model Benchmark Scores
| Benchmark | gpt-oss-120b | gpt-oss-20b | OpenAI o3 | OpenAI o4-mini |
|---|---|---|---|---|
| **Reasoning & Knowledge** | | | | |
| MMLU | 90.0 | 85.3 | 93.4 | 93.0 |
| GPQA Diamond | 80.1 | 71.5 | 83.3 | 81.4 |
| Humanity’s Last Exam | 19.0 | 17.3 | 24.9 | 17.7 |
| **Competition Math** | | | | |
| AIME 2024 | 96.6 | 96.0 | 95.2 | 98.7 |
| AIME 2025 | 97.9 | 98.7 | 98.4 | 99.5 |
Recommended GPU Configuration
| Model | Minimum GPU Needed | VRAM Needed | GPU Count | Typical Hardware Example | Runs on Consumer GPU? | Notes |
|---|---|---|---|---|---|---|
| gpt-oss-20b | 1x high-end GPU | 16 GB+ | 1 | NVIDIA RTX 4090, A6000, H100 | Yes | Runs comfortably on modern consumer GPUs. Easy for local use. |
| gpt-oss-120b | 1x server-grade GPU | 80 GB+ | 1 | NVIDIA H100 (80 GB), A100 (80 GB) | No (server only) | Needs powerful server hardware, usually a cloud or on-prem GPU server. |
Resources
- gpt-oss-20b on Hugging Face: https://huggingface.co/openai/gpt-oss-20b
- gpt-oss-120b on Hugging Face: https://huggingface.co/openai/gpt-oss-120b
- gpt-oss GitHub repository: https://github.com/openai/gpt-oss
- gpt-oss on Ollama: https://ollama.com/library/gpt-oss
Step-by-Step Process to Install & Run OpenAI GPT-OSS Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button on the Dashboard to deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 2x H100 SXM GPUs for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
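If you don’t already have a key pair on your local machine, a typical way to generate one (a generic example, not a NodeShift-specific command) is:
ssh-keygen -t ed25519 -C "your_email@example.com"
The public key (the generated .pub file) is what you add in NodeShift; the private key stays on your machine.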
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running OpenAI GPT-OSS, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like OpenAI GPT-OSS
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like OpenAI GPT-OSS.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.
This setup ensures that the OpenAI GPT-OSS runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
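Because we chose the devel variant of the CUDA image, you can also optionally confirm that the CUDA compiler is available:
nvcc --version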
Step 8: Install Ollama
After connecting to the terminal via SSH, it’s now time to install Ollama from the official Ollama website.
Website Link: https://ollama.com/
Run the following command to install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Step 9: Serve Ollama
Run the following command to start the Ollama server so that models can be served and accessed:
ollama serve
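Note that ollama serve keeps running in the foreground and occupies the terminal. To continue working in the same session, you can either open a second SSH connection or push the server to the background, for example:
nohup ollama serve > ollama.log 2>&1 &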
Step 10: Explore Ollama CLI Commands
After starting the Ollama server, you can explore all available commands and get help right from the terminal.
To see the list of all commands that Ollama supports, run:
ollama
You’ll see an output like this:
Usage:
ollama [flags]
ollama [command]
Available Commands:
serve Start ollama
create Create a model
show Show information for a model
run Run a model
stop Stop a running model
pull Pull a model from a registry
push Push a model to a registry
list List models
ps List running models
cp Copy a model
rm Remove a model
help Help about any command
Flags:
-h, --help help for ollama
-v, --version Show version information
Use "ollama [command] --help" for more information about a command.
This command helps you quickly understand what you can do with Ollama—such as running, pulling, stopping models, and more.
Step 11: Pull Both GPT-OSS Models
GPT-OSS comes in two main versions—20B and 120B.
You’ll need to pull each model separately using Ollama’s CLI.
Let’s do it one by one:
Pull the 20B Version
Run this command to pull the 20B model:
ollama pull gpt-oss:20b
You’ll see progress bars as the model and its components download.
When finished, you should see a success message.
Pull the 120B Version
Now, pull the larger 120B model:
ollama pull gpt-oss:120b
Again, wait for the download and extraction to finish until you see a success message.
Step 12: Verify Downloaded Models
After pulling the GPT-OSS models, you can check that they’ve been successfully downloaded and are available on your system.
Just run:
ollama list
You should see output like this:
NAME ID SIZE MODIFIED
gpt-oss:120b 735371f916a9 65 GB 43 seconds ago
gpt-oss:20b f2b8351c629c 13 GB 4 minutes ago
This confirms both the 20B and 120B GPT-OSS models are now installed and ready to use.
Step 13: Run the GPT-OSS Model for Inference
Now that your models are installed, you can start running them and interacting directly from the terminal.
To run the 20B version of GPT-OSS, use:
ollama run gpt-oss:20b
You’ll be prompted to enter your message or prompt. For example, you can try:
Imagine you’re a detective solving a mystery in a city where gravity randomly changes direction every hour. Walk through your entire reasoning as you solve a crime—don’t skip a step or assumption.
The model will process your prompt, display “Thinking…”, and then generate a detailed response.
Try Different Prompts
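Besides the interactive session, ollama run also accepts a prompt directly on the command line, which is handy for quick one-shot tests (the prompt text below is just an example):
ollama run gpt-oss:20b "Summarize the rules of chess in three sentences."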
Step 14: Run the 120B GPT-OSS Model
After testing the 20B model, let’s now run the larger, more powerful 120B version.
To start an interactive session with the 120B model, run:
ollama run gpt-oss:120b
You’ll see the prompt:
>>>
Type your question, prompt, or creative request—just like with the 20B model. For example:
You have access to a web browser, Python, and function-calling. Your task: find the current weather in Reykjavik, predict tomorrow’s using code, and format it as a poetic haiku.
The model will process your request and generate a detailed, creative answer.
Try Different Prompts
Now you’ve successfully run and interacted with the GPT-OSS models directly in your terminal using Ollama! This command-line approach is fast and powerful for quick experiments or automation. However, sometimes you want a more visually appealing and user-friendly interface for chatting with models, exploring outputs, or showcasing demos. For those moments, it’s great to use an interface like Open WebUI, which makes running prompts and interacting with models both simple and enjoyable. In the next steps, we’ll see how to run the same models with Open WebUI and experience an upgraded, interactive chat environment.
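Before moving on, it’s also worth knowing that the Ollama server you started earlier exposes a local REST API (by default on port 11434), which is useful for scripting and automation. A minimal sketch with curl (the prompt text is just an example):
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Explain the difference between a process and a thread in two sentences.",
  "stream": false
}'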
Step 15: Check the Available Python Version and Install a Newer One
Run the following command to check the currently available Python version:
python3 --version
The system has Python 3.8.1 available by default. To install a higher version of Python, you’ll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 16: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 17: Update the Default python3 Version
Now, run the following commands to link the new Python version and select it as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 18: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 19: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv openwebui
source openwebui/bin/activate
Step 20: Install Open-WebUI
Run the following command to install open-webui:
pip install open-webui
Step 21: Serve Open-WebUI
In your activated Python environment, start the Open-WebUI server by running:
open-webui serve
- Wait for the server to complete all database migrations and set up initial files. You’ll see a series of INFO logs and a large “OPEN WEBUI” banner in the terminal.
- When setup is complete, the WebUI will be available and ready for you to access via your browser.
Step 22: Set up SSH port forwarding from your local machine
On your local machine (Mac/Windows/Linux), open a terminal and run:
ssh -L 8080:localhost:8080 -p 18685 root@Your_VM_IP
This forwards local localhost:8080 to localhost:8080 on the remote VM, so the Open-WebUI server becomes reachable from your browser. (Replace 18685 and Your_VM_IP with your own VM’s SSH port and IP.)
Step 23: Access Open-WebUI in Your Browser
Go to:
http://localhost:8080
- You should see the Open-WebUI login or setup page.
- Log in or create a new account if this is your first time.
- You’re now ready to use Open-WebUI to interact with your models!
Step 24: Select and Use Your Model in Open-WebUI
Once you’ve logged into Open-WebUI in your browser, you can easily choose between any models you have installed on your system.
- Click on the model selection dropdown at the top left (where you see the model name, e.g., gpt-oss:120b).
- You’ll see a list of all available models, such as gpt-oss:120b, gpt-oss:20b, and any other models you’ve installed.
- Simply click on the model you want to use (for example, gpt-oss:120b for the largest, most powerful model).
- Once selected, you can start chatting or sending prompts to that model in the Open-WebUI chat window below.
Step 25: Start Chatting with Your Model in Open-WebUI
With your model selected in Open-WebUI, you can now start sending prompts and receive rich, detailed responses—just like chatting with a modern AI assistant.
- Type your question or prompt in the chat input box at the bottom of the screen.
- Press Enter to send your message.
- The model will process your request and respond in the chat window, showing its full reasoning and answer.
As shown in the screenshot, you can ask advanced questions, get structured explanations, and even see responses formatted with tables and bullet points.
Step 26: Explore Advanced Reasoning and Creativity with Large Models
With the gpt-oss:120b model loaded in Open-WebUI, you can take full advantage of its advanced reasoning, problem-solving, and creativity. Try giving the model complex, multi-step challenges—such as designing unique puzzles, solving technical problems, or explaining advanced topics in depth.
- Ask open-ended or multi-part questions to see the model’s full reasoning process.
- The model can generate diagrams, ASCII art, tables, and well-structured explanations, as shown in the screenshot.
- You can save, copy, or collapse responses for easy reference.
Up to this point, you’ve learned how to interact with GPT-OSS models both from the terminal using Ollama and from a user-friendly interface with Open-WebUI. This gives you the best of both worlds: the speed and flexibility of the command line, and the visual, intuitive experience of a web-based chat interface. But that’s not all! You can also integrate these models directly into your Python code using the Transformers library. With Transformers, you can run gpt-oss-120b and gpt-oss-20b programmatically for everything from chatbots to automated pipelines. If you use the Transformers chat template, harmony response formatting is handled for you automatically. If you call model.generate directly, you’ll need to apply the harmony format manually, either with the chat template or with the official openai-harmony package. This opens up a world of advanced integrations, so let’s explore how to use GPT-OSS models with Transformers next!
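For reference, here’s a rough sketch (not part of the step-by-step setup below, and assuming the smaller 20B model to keep memory requirements lower) of what calling model.generate directly with the chat template could look like:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Explain what a Merkle tree is in two sentences."},
]

# The chat template renders the conversation into the prompt format the model expects
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

We’ll set up the actual environment for running code like this in the next steps.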
Step 27: Connect to Your GPU VM with a Code Editor
Before you start running Python scripts with the GPT-OSS models and Transformers, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 28: Set Up a Python Virtual Environment
Before you install any dependencies or write code, it’s best practice to create a Python virtual environment. This keeps your project’s packages isolated and prevents conflicts with system-wide Python libraries.
Run the following commands on your GPU VM:
python3 -m venv ~/gptoss-venv
source ~/gptoss-venv/bin/activate
Step 29: Install Python Dependencies
Run the following commands to install the Python dependencies:
pip install -U transformers kernels torch
pip install accelerate
Step 30: Run GPT-OSS Models with Transformers in Python
Now you’re ready to interact with GPT-OSS directly in your own Python scripts using the Transformers library.
Here’s an example script (run_gptoss_transformers.py) you can use:
from transformers import pipeline
import torch

# Load the model and build a text-generation pipeline
# (device_map="auto" spreads the weights across the available GPUs)
model_id = "openai/gpt-oss-120b"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Chat-style input: the pipeline applies the chat template for us
messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)

# The last message in generated_text is the assistant's reply
print(outputs[0]["generated_text"][-1])
- This script loads the GPT-OSS 120B model and sends a prompt for completion.
- You can edit the messages variable to ask anything you want!
Tip
- Make sure you’re in your virtual environment and have installed all dependencies (transformers, torch).
- You can swap "openai/gpt-oss-120b" with "openai/gpt-oss-20b" to use the smaller model.
Step 31: Run the Python Script and Generate Model Output
Now it’s time to run your script and see the GPT-OSS model in action!
Simply execute your Python file in the terminal:
python3 run_gptoss_transformers.py
You’ll see the model’s output directly in your terminal, with a detailed, well-formatted response generated by GPT-OSS.
You can try any prompt you want: just change the text in the content field of your Python code and run the script again!
Up to this point, you’ve seen how easy it is to run the GPT-OSS models programmatically using Python scripts and the Transformers library. This lets you automate tasks, process data, and build custom workflows entirely in code. But what if you want to serve your model as an API, making it accessible to any app or client just like the OpenAI API? That’s where Transformers Serve comes in. With just a few commands, you can spin up an OpenAI-compatible webserver around your model—enabling easy integration with tools, chatbots, or anything that speaks the OpenAI API format!
Step 32: Install Pillow
Run the following command to install pillow:
pip install pillow
Step 33: Install Transformers[Serving]
Run the following command to install “transformers[serving]”:
pip install "transformers[serving]"
Step 34: Install Rich
Run the following command to install rich:
pip install rich
Step 35: Install Aiohttp
Run the following command to install aiohttp:
pip install aiohttp
Step 36: Launch an OpenAI-Compatible API Server with Transformers Serve
With everything set up, you can now serve your GPT-OSS model as an OpenAI-compatible API using Transformers Serve.
Simply run:
transformers serve
You’ll see log messages confirming that the server has started, the application is ready, and Uvicorn is running on http://localhost:8000.
Now your model is accessible as an API—ready to accept requests just like the official OpenAI endpoint!
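Since the server speaks the OpenAI Chat Completions format, you can also test it with a plain HTTP request. A minimal sketch with curl (the payload fields follow the OpenAI API; the prompt is just an example):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'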
Step 37: Chat with Your Model via the Transformers Serve API
Now that your OpenAI-compatible API server is running, you can chat with your model using the transformers CLI.
Run this command to start a chat session with the gpt-oss-120b model through your local API server:
transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-120b
You’ll be able to send prompts and receive responses, just like chatting with a hosted API—except everything runs locally on your GPU VM!
Step 38: Use the Transformers Chat Interface Commands
Once inside the Transformers Chat Interface, you’ll see a prompt (<root>:) where you can interact with your model.
Besides chatting, you have access to a few handy commands:
- !help – Shows all available commands, including how to set generation settings and save your chat.
- !status – Displays the current status of your model and generation settings.
- !clear – Clears the current conversation and starts fresh.
- !exit – Closes the chat interface.
Simply type your messages or commands at the <root>: prompt and press Enter.
You can now experiment, tweak settings, or start new chats—all directly in your terminal!
Step 39: Interact with the Model and Get Structured Answers
Now you can use the chat interface to ask your model any question you want!
Simply type your prompt at the <root>: prompt and press Enter.
- The model will respond in a clear, well-formatted way—sometimes even including tables, bullet points, or side-by-side comparisons, just like in the example above.
- You can ask for explanations, definitions, comparisons, or step-by-step solutions to technical and non-technical questions alike.
Conclusion
With everything set up, you’re ready to make these open-source models work the way you want—locally, in the cloud, or anywhere in between. Choose what fits your project, experiment freely, and build without limits or hidden costs. It’s all about flexibility, control, and putting real power in your hands. Now go ahead—try new ideas, solve real problems, and see what you can create next!