How to Install & Run AMD Nitro-E Locally?

by Ayush Kumar | November 3, 2025

Ready to build cheaper?

Custom CPU plans from as little as $0.012/hour.

Nitro-E is AMD’s ultra-light text-to-image diffusion family built on E-MMDiT (~304M params). It’s designed for fast, low-cost training/inference: the base 512px model gives strong quality in ~20 steps, while the distilled 512px variant can generate usable images in as few as 4 steps. There’s also a GRPO-tuned checkpoint for post-training quality/behavior tweaks. Code is plain PyTorch/Diffusers, so it runs on both NVIDIA (CUDA) and AMD (ROCm).

GPU Configuration Table

Tier / Use case	Model/Steps	Precision	Min VRAM (approx)	Suggested GPUs (NVIDIA)	Suggested GPUs (AMD)	Notes
Entry – fastest 512px (single images, prototyping)	512px-dist, 4 steps, guidance 0.0	FP16 / BF16	6–8 GB	RTX 2060 6G (borderline), RTX 3060 12G, T4 16G	Radeon VII (borderline), MI210 64G (overkill)	Use distilled checkpoint for max speed and lowest memory. Batch=1. If OOM, drop to FP16 or reduce width/height a bit.
Standard – higher quality 512px	512px (full), ~20 steps, guidance ~4.5	BF16 (pref) / FP16	10–12 GB	RTX 3060 12G, RTX A4000 16G, A10 24G	MI210 64G, MI250 128G	Best balance of speed/quality. Batch=1–2 on 12–16 GB. Enable FlashAttention if your stack supports it (optional).
Throughput – 512px batches	512px-dist (4–8 steps) or 512px (12–20 steps)	BF16 / FP16	16–24 GB (Batch 4–8)	A4000 16G, A5000 24G, A6000 24G, L4 24G	MI210/MI250	Scale batch size for data/gen pipelines. Use distilled for best samples/sec. Mixed precision recommended.
High-res / 1024px (quality or upscales)	1024px (if provided) or 512→upscale	BF16 / FP16	20–24 GB (Batch 1–2)	A5000/A6000 24G, A100 40/80G, H100	MI250 128G, MI300X 192G	True 1024px needs more VRAM; otherwise generate at 512px and use an upscaler.
CPU-only (dev/test)	512px-dist, 4 steps	FP32/FP16 autocast	—	—	—	Works for functional checks; generation will be slow.

Resources

Link: https://huggingface.co/amd/Nitro-E

Step-by-Step Process to Install & Run AMD Nitro-E Locally

For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.

Step 1: Sign Up and Set Up a NodeShift Cloud Account

Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.

Follow the account setup process and provide the necessary details and information.

Step 2: Create a GPU Node (Virtual Machine)

GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.

Navigate to the menu on the left side. Select the GPU Nodes option, create a GPU Node in the Dashboard, click the Create GPU Node button, and create your first Virtual Machine deploy

Step 3: Select a Model, Region, and Storage

In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.

We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.

Step 4: Select Authentication Method

There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.

Step 5: Choose an Image

In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running AMD Nitro-E, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.

We chose the following image:

nvidia/cuda:12.1.1-devel-ubuntu22.04

This image is essential because it includes:

Full CUDA toolkit (including nvcc)
Proper support for building and running GPU-based models like AMD Nitro-E.
Compatibility with CUDA 12.1.1 required by certain model operations

Launch Mode

We selected:

Interactive shell server

This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like AMD Nitro-E.

Docker Repository Authentication

We left all fields empty here.

Since the Docker image is publicly available on Docker Hub, no login credentials are required.

Identification

Template Name:

nvidia/cuda:12.1.1-devel-ubuntu22.04

CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.

This setup ensures that the JanusCoder runs in a GPU-enabled environment with proper CUDA access and high compute performance.

After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.

Step 6: Virtual Machine Successfully Deployed

You will get visual confirmation that your node is up and running.

Step 7: Connect to GPUs using SSH

NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.

Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.

Now open your terminal and paste the proxy SSH IP or direct SSH IP.

Next, If you want to check the GPU details, run the command below:

nvidia-smi

Step 8: Install Python 3.11 and Pip (VM already has Python 3.10; We Update It)

Run the following commands to check the available Python version.

If you check the version of the python, system has Python 3.10.12 available by default. To install a higher version of Python, you’ll need to use the deadsnakes PPA.

Run the following commands to add the deadsnakes PPA:

apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update

Now, run the following commands to install Python 3.11, Pip and Wheel:

apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version

Step 9: Created and Activated Python 3.11 Virtual Environment

Run the following commands to created and activated Python 3.11 virtual environment:

python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version

Step 10: Upgrade `Pip`, `Wheel`, and `Setuptools` (Recommended Before Installs)

Run the following command to ensure your Python environment has the latest package tools:

pip install --upgrade pip wheel setuptools

Step 11: Install PyTorch for CUDA

Run the following command to install PyTorch:

pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio

Step 12: Install Core Dependencies for Nitro-E

Once you’ve upgraded pip, wheel, and setuptools and activated your virtual environment, install all required Python packages for Nitro-E in one go:

pip install diffusers==0.32.2 transformers==4.49.0 accelerate==1.7.0 \
    wandb torchmetrics "torchmetrics[image]" mosaicml-streaming==0.11.0 \
    beautifulsoup4 tabulate timm==0.9.1 pyarrow einops omegaconf \
    sentencepiece==0.2.0 pandas==2.2.3 alive-progress

Step 13: Install FlashAttention For Speed Boost

Nitro-E can run perfectly without FlashAttention, but installing it gives a noticeable performance boost during image generation — especially on modern GPUs such as A100, H100, or RTX A6000.

Run this command inside your active virtual environment:

pip install "flash-attn>=2.6.0" --no-build-isolation

Step 14: Upgrade Hugging Face Hub (to enable model + checkpoint downloads)

The Hugging Face Hub library manages model downloads, caching, and authentication for repositories such as amd/Nitro-E and meta-llama/Llama-3.2-1B.
Keeping it updated ensures compatibility with your current transformers, diffusers, and authentication flow.

Run this command inside your virtual environment:

pip install -U "huggingface_hub"

Purpose

Updates Hugging Face’s client to the latest stable version.
Ensures faster, more reliable downloads of large safetensors and tokenizers.
Fixes issues with gated repos and new model metadata formats.

Step 15: Clone the Official Nitro-E Repository

Now that your Python environment is ready, clone the official Nitro-E source code from AMD’s GitHub and move into the project directory:

git clone https://github.com/AMD-AGI/Nitro-E.git
cd Nitro-E

Step 16: Connect to Your GPU VM with a Code Editor

Before you start running model script with the AMD Nitro-E model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.

You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
In this example, we’re using cursor code editor.
Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.

Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.

Step 17: Create the Script

Create a file (ex: #run_nitro_e.py) and add the following code:

import torch
from core.tools.inference_pipe import init_pipe
from PIL import Image

device = torch.device("cuda:0")
dtype = torch.bfloat16
repo_name = "amd/Nitro-E"

# ----- choose ONE of these -----
resolution = 512
ckpt_name = "Nitro-E-512px.safetensors"         # full model, ~20 steps good quality
# ckpt_name = "Nitro-E-512px-dist.safetensors"  # distilled, ~4 steps super fast
# use_grpo = True                               # optional: GRPO post-training

# Initialize pipeline
# GRPO variant:
# pipe = init_pipe(device, dtype, resolution, repo_name=repo_name,
#                  ckpt_name="Nitro-E-512px.safetensors", ckpt_path_grpo="ckpt_grpo_512px")
pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name)

prompt = "A hot air balloon in the shape of a heart, Grand Canyon"
num_steps = 20 if "512px.safetensors" in ckpt_name else 4
guidance = 4.5 if "512px.safetensors" in ckpt_name else 0.0

images = pipe(prompt=prompt, width=resolution, height=resolution,
              num_inference_steps=num_steps, guidance_scale=guidance).images

# Save first image
img: Image.Image = images[0]
img.save("nitro_e_sample.png")
print("Saved nitro_e_sample.png")

What This Script Does

Sets up GPU inference with PyTorch (cuda:0) and bfloat16 precision.
Chooses Nitro-E 512px base checkpoint (or you can switch to the distilled/GRPO options in the comments).
Initializes the Nitro-E pipeline via init_pipe(...) from the repo’s core code.
Defines a prompt and auto-selects steps/guidance (20 & 4.5 for base; 4 & 0.0 for distilled), then generates an image at 512×512.
Saves the first generated image to nitro_e_sample.png and prints a confirmation.

Step 18: Run the Script

Run the script from the following command:

python run_nitro_e.py

This will load the model and generate the response on terminal.

Step 20: Run the Final Inference Script (`run_nitro_e.py`)

Now that all dependencies are installed and the environment is configured, it’s time to generate images using the Nitro-E diffusion model.

# run_nitro_e.py
# Drop this file into the Nitro-E repo root and run:  python run_nitro_e.py
import os, sys, argparse
from pathlib import Path

# Ensure we can import the local package "core.*"
repo_root = Path(__file__).resolve().parent
sys.path.insert(0, str(repo_root))

import torch
from PIL import Image

# --- Patch: Make tokenizer robust against gated HF repos (falls back to open TinyLlama) ---
from transformers import AutoTokenizer
try:
    from huggingface_hub.errors import GatedRepoError
except Exception:
    class GatedRepoError(Exception): pass  # fallback if hub is older

_AutoTokenizer_orig = AutoTokenizer.from_pretrained
def _safe_from_pretrained(model_name_or_path, *args, **kwargs):
    try:
        return _AutoTokenizer_orig(model_name_or_path, *args, **kwargs)
    except Exception as e:
        msg = str(e).lower()
        gated = isinstance(e, GatedRepoError) or "gated repo" in msg or "401 client error" in msg or "access to model" in msg
        if gated:
            # Fallback to an open LLaMA-compatible tokenizer
            print("[INFO] Falling back to open tokenizer: TinyLlama/TinyLlama-1.1B-Chat-v1.0")
            return _AutoTokenizer_orig("TinyLlama/TinyLlama-1.1B-Chat-v1.0", *args, **kwargs)
        raise
AutoTokenizer.from_pretrained = _safe_from_pretrained  # monkey-patch

# Now import the Nitro-E pipeline (will call AutoTokenizer.from_pretrained inside)
from core.tools.inference_pipe import init_pipe

def pick_device_and_dtype(force_fp16=False):
    if not torch.cuda.is_available():
        raise SystemExit("No CUDA device found. Use a GPU VM.")
    dev = torch.device("cuda:0")
    if force_fp16:
        return dev, torch.float16
    # Prefer BF16 if supported; otherwise FP16
    try:
        if hasattr(torch.cuda, "is_bf16_supported") and torch.cuda.is_bf16_supported():
            return dev, torch.bfloat16
    except Exception:
        pass
    return dev, torch.float16

def main():
    p = argparse.ArgumentParser(description="Nitro-E one-shot inference (robust).")
    p.add_argument("--variant", choices=["base", "dist", "grpo"], default="dist",
                   help="base=512px (20 steps), dist=512px-dist (4 steps), grpo=512px + GRPO weights")
    p.add_argument("--prompt", default="A hot air balloon in the shape of a heart, Grand Canyon")
    p.add_argument("--res", type=int, default=512)
    p.add_argument("--steps", type=int, default=None, help="Override steps (default auto by variant)")
    p.add_argument("--guidance", type=float, default=None, help="Override guidance (default auto by variant)")
    p.add_argument("--out", default="nitro_e_sample.png")
    p.add_argument("--fp16", action="store_true", help="Force FP16 if BF16 gives issues")
    args = p.parse_args()

    device, dtype = pick_device_and_dtype(force_fp16=args.fp16)

    repo_name = "amd/Nitro-E"
    if args.variant == "base":
        ckpt_name = "Nitro-E-512px.safetensors"
        steps = 20 if args.steps is None else args.steps
        guidance = 4.5 if args.guidance is None else args.guidance
        pipe = init_pipe(device, dtype, args.res, repo_name=repo_name, ckpt_name=ckpt_name)
    elif args.variant == "dist":
        ckpt_name = "Nitro-E-512px-dist.safetensors"
        steps = 4 if args.steps is None else args.steps
        guidance = 0.0 if args.guidance is None else args.guidance
        pipe = init_pipe(device, dtype, args.res, repo_name=repo_name, ckpt_name=ckpt_name)
    else:  # grpo
        ckpt_name = "Nitro-E-512px.safetensors"
        steps = 20 if args.steps is None else args.steps
        guidance = 4.5 if args.guidance is None else args.guidance
        pipe = init_pipe(
            device, dtype, args.res,
            repo_name=repo_name,
            ckpt_name=ckpt_name,
            ckpt_path_grpo="ckpt_grpo_512px"  # relies on HF to fetch subfolder
        )

    # Generate
    out_images = pipe(
        prompt=args.prompt,
        width=args.res,
        height=args.res,
        num_inference_steps=steps,
        guidance_scale=guidance
    ).images

    img: Image.Image = out_images[0]
    img.save(args.out)
    print(f"[OK] Saved {args.out} | variant={args.variant} steps={steps} guidance={guidance} dtype={dtype}")

if __name__ == "__main__":
    # Optional: speed up HF downloads if enabled
    os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")
    main()

From inside the cloned Nitro-E directory, run:

python run_nitro_e.py --variant dist --prompt "A hot air balloon in the shape of a heart, Grand Canyon" --out nitro_e_sample.png

Explanation:

--variant dist → uses the distilled 512px model, optimized for speed (only 4 inference steps).
--prompt → your text prompt that describes what image you want to generate.
--out → name of the output image file that will be saved locally.

Optional examples:

# Higher-quality full model (20 steps)
python run_nitro_e.py --variant base --prompt "A golden retriever running in a meadow" --out dog.png

# GRPO fine-tuned variant
python run_nitro_e.py --variant grpo --prompt "studio portrait of a samurai cat" --out cat.png

# Force FP16 if BF16 isn’t supported
python run_nitro_e.py --variant dist --fp16 --prompt "cyberpunk neon cityscape at night" --out city.png

Expected output:

[OK] Saved nitro_e_sample.png | variant=dist steps=4 guidance=0.0 dtype=torch.bfloat16

Your generated image will be located in the same Nitro-E directory.

Conclusion

AMD’s Nitro-E proves that text-to-image diffusion doesn’t have to be heavy or resource-hungry. With just 304 million parameters, it delivers high-quality results while running efficiently on both NVIDIA (CUDA) and AMD (ROCm) GPUs.
Whether you’re exploring the distilled 4-step variant for speed, the base model for quality, or the GRPO-tuned version for refined results, Nitro-E offers flexibility for every use case.

By following the above step-by-step guide, you can easily deploy and run Nitro-E on a NodeShift GPU VM (or any GPU environment) and start generating stunning visuals from text prompts in minutes — lightweight, fast, and open-source.

Relevant blog posts

November 1, 2025

How to Install & Run JanusCoderV-8B Locally?

JanusCoderV-8B is an 8B multimodal code-intelligence model from InternLM’s JanusCoder suite, built on InternVL-3.5-8B. Trained on JANUSCODE-800K, it unifies visual + programmatic inputs to generate and edit code for charts, interactive web UIs, and animation logic. It supports image-conditioned code generation, visual-grounded edits, and long outputs (demo shows max_new_tokens up to 32K) using standard Transformers (≥ 4.55.0) with AutoProcessor + AutoModelForCausalLM and remote code enabled.

October 31, 2025

A Step-By-Step Guide to Install & Run Kimi Linear

In an era where attention mechanisms are redefining efficiency in large language models, Kimi Linear emerges as a breakthrough innovation designed for extreme scalability without compromise. Built upon the novel Kimi Delta Attention (KDA) architecture, it reimagines how models process information across both short and million-token-long contexts. Unlike conventional full attention systems that buckle under long sequences, Kimi Linear offers a hybrid linear attention framework combining the precision of global attention with the blazing speed and memory efficiency of KDA. The results speak for themselves, it achieves 51.0 on MMLU-Pro (4k) while maintaining the same speed as full attention, and delivers Pareto-optimal 84.3 on RULER (128k) with a 3.98× speedup. Even more impressively, Kimi Linear pushes decoding throughput up to 6× faster and cuts KV cache requirements by 75%, making it one of the most efficient architectures for high-throughput, long-context reasoning.

October 29, 2025

How to Install & Run Chandra-OCR Locally?

Chandra is Datalab’s next-generation OCR model built for precise document understanding. It goes beyond simple text extraction — converting images and PDFs into structured Markdown, HTML, or JSON while preserving original layout details like tables, forms, and diagrams. With strong support for handwriting, math equations, and multi-column layouts across 40+ languages, Chandra achieves an overall accuracy of 83.1% on the olmOCR benchmark, outperforming most open and commercial OCR systems. It can be used easily via CLI, VLLM, Hugging Face, or a Streamlit app, making it versatile for developers, researchers, and document intelligence workflows.

See all posts

Ready to build
with us?

The ideal way for organizations young and old to ease their way into the distributed and affordable cloud at their own pace.

Stay Tuned!

Stay up to date with the latest updates, news, and hotfixes for our product.

NodeShift creates a vital link between developers and affordable cloud.

Switch theme

English (EN)
Arabic (AR)
Chinese (ZH-CN)
German (DE)
Korean (KO)
Russian (RU)
French (FR)
Spanish (ES)
Portuguese (PT)
Japanese (JA)

JavaScript is disabled in your browser. For a better experience, please enable JavaScript.Learn how to enable JavaScript.