Chroma1-HD is an 8.9B text-to-image base model built on FLUX.1-schnell. It’s released under Apache-2.0, making it ideal for research and downstream finetuning. As a neutral, high-quality foundation, it focuses on clean generation, stable training behavior, and easy customization—while staying friendly to common open-source tooling (Diffusers, ComfyUI).
GPU Configuration (Rule-of-Thumb)
| Scenario | Min VRAM | Comfortable VRAM | Example GPUs | Precision | Typical Settings | Notes |
|---|---|---|---|---|---|---|
| Entry (single image) | 24 GB | 24–32 GB | RTX 4090 24G, L4 24G | bf16 / fp16 | 1024×1024, steps 25–40, batch=1, enable_model_cpu_offload() | Fits reliably with offload (slower due to PCIe). If OOM, drop to 896–768 px or steps 20–30. |
| Standard | 40 GB | 40–48 GB | A100 40G, L40S 48G | bf16 | 1024×1024, steps 30–40, batch 2–4, offload off | Best balance of speed and batching; good for small queues. |
| Pro / High Throughput | 80 GB | 80 GB+ | A100 80G, H100 80G, H200 141G | bf16 | 1024–1280 px, steps 30–40, batch 4–8, offload off | Highest throughput; room for larger resolutions and more concurrent jobs. |
| CPU-only (not recommended) | — | — | — | fp32 | 512–768 px, steps ≤20 | Very slow; only for functional tests. |
| Quantized (optional) | 20–32 GB | 24–40 GB | 4090, L40S, A100 | fp16 + low-bit linears | Same as the tiers above | GemLite dynamic 8-bit linears can trim VRAM and latency; gains vary by GPU. |
Tips
- Prefer dtype=torch.bfloat16 on modern GPUs; switch to fp16 if your card favors it.
- On 24 GB, keep batch=1 and use pipe.enable_model_cpu_offload(); turn it off on ≥40 GB (a VRAM-aware sketch follows this list).
- For speed on ≥40 GB, consider torch.compile(pipe.transformer.forward, fullgraph=True) (falls back safely if unsupported).
- If you push 1280×1280 or higher, expect a VRAM jump; scale steps/batch accordingly.
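To make the tiers concrete, here is a minimal sketch of ours (thresholds mirror the table above; not from the model card) that picks offload and batch size from the detected VRAM:

import torch
from diffusers import ChromaPipeline

# Total VRAM on GPU 0, in GiB, used to pick a tier from the table above.
vram_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3

pipe = ChromaPipeline.from_pretrained(
    "lodestones/Chroma1-HD",
    torch_dtype=torch.bfloat16,
)

if vram_gib < 40:
    pipe.enable_model_cpu_offload()  # Entry tier: trade PCIe latency for headroom
    batch = 1
else:
    pipe.to("cuda")  # Standard/Pro tiers: keep the weights resident
    batch = 2 if vram_gib < 80 else 4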
Resources
Link: https://huggingface.co/lodestones/Chroma1-HD
Step-by-Step Process to Install & Run Chroma1-HD Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Chroma1-HD, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like Chroma1-HD
- Compatibility with CUDA 12.1.1, which certain model operations require
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Chroma1-HD.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.
This setup ensures that Chroma1-HD runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
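A typical invocation looks like this (placeholder key path, IP, and port; use the exact command shown on your deployment page):

ssh -i ~/.ssh/your_key root@<proxy-or-direct-ssh-ip> -p <port>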
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available. Check whether pip exists:
pip3 --version
If you get an error like "command not found", install pip manually via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This will download and install pip on your system.
You may see a warning about running as root — that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv chroma
source chroma/bin/activate
Step 10: Install PyTorch
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
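Optionally, confirm that the CUDA build of PyTorch sees the GPU (a quick one-liner of ours, not part of the original steps):

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"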
Step 11: Install libraries required by ChromaPipeline
Run the following command to install the libraries required by ChromaPipeline:
pip install "diffusers>=0.35.1" transformers sentencepiece accelerate safetensors
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running model script with the Chroma1-HD model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Create the Script
Create a file (e.g., app.py) and add the following code:
import torch
from diffusers import ChromaPipeline

pipe = ChromaPipeline.from_pretrained(
    "lodestones/Chroma1-HD",
    torch_dtype=torch.bfloat16,  # bf16 is recommended
)
pipe.enable_model_cpu_offload()  # trims peak VRAM; a bit slower

prompt = ["A high-fashion close-up portrait... (your prompt)"]
neg = ["low quality, ugly, out of focus, deformed, disfigure, blurry"]

img = pipe(
    prompt=prompt,
    negative_prompt=neg,
    generator=torch.Generator("cpu").manual_seed(433),
    num_inference_steps=40,
    guidance_scale=3.0,
    num_images_per_prompt=1,
).images[0]
img.save("chroma.png")
What the Script Does:
- Imports PyTorch and ChromaPipeline from Diffusers.
- Loads lodestones/Chroma1-HD with bfloat16 precision for speed/stability.
- Enables CPU offload to reduce peak VRAM on 24 GB-class GPUs (slower but safer; a GPU-resident variant follows this list).
- Defines a prompt and a negative prompt to steer image quality.
- Sets a deterministic seed (433) via a CPU generator for reproducible results.
- Runs inference with 40 steps, CFG=3.0, and 1 image per prompt.
- Retrieves the generated image from the pipeline output.
- Saves the image locally as chroma.png.
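If you are running on a 40 GB-plus GPU, a small variant of ours (same API, our adjustment rather than the source's) skips offload and keeps the whole pipeline resident on the GPU for speed:

import torch
from diffusers import ChromaPipeline

pipe = ChromaPipeline.from_pretrained(
    "lodestones/Chroma1-HD",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")  # replaces enable_model_cpu_offload(); needs roughly 40 GB+ of VRAM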
Step 14: Run the Script
Run the script with the following command:
python3 app.py
This will generate the image and save it in the project directory.
Step 15: Check the Generated Image
Check the generated image in the project directory.
We validated Chroma1-HD end-to-end using the quick script: it imports PyTorch and ChromaPipeline, loads lodestones/Chroma1-HD in bf16, enables CPU offload to keep peak VRAM in check, sets a fixed seed for reproducibility, runs 40 denoise steps with CFG=3.0 for one image, then writes the output to chroma.png. It is a fast sanity test that the weights, tokenizer/encoders, and scheduler are wired correctly.
Why GemLite + Triton?
- Lower VRAM & faster matmuls: GemLite replaces PyTorch Linear layers with low-bit dynamic kernels (e.g., 8-bit activations/weights), cutting memory traffic and often improving latency, which is especially helpful on 24–40 GB GPUs (a conceptual sketch follows this list).
- JIT kernels via Triton: Triton compiles small GPU kernels on the fly for your hardware. That's why you need the Python C headers (python3.10-dev provides Python.h) and a working CUDA driver inside the container/VM.
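To build intuition for the Linear-swapping idea before the GemLite version in Step 20, here is a minimal conceptual sketch using PyTorch's built-in dynamic quantization (a CPU-oriented stand-in for illustration, not GemLite itself):

import torch

# A toy module with one Linear layer, standing in for a transformer block.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU())

# Dynamic quantization: weights stored as int8, activations quantized on the fly,
# conceptually similar to GemLite's A8W8 dynamic kernels.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
print(qmodel(x).shape)  # same interface, smaller weight memory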
Step 16: Install Python 3.10 Toolchain + Headers
Run the following commands to install the Python 3.10 toolchain and headers:
sudo apt update
sudo apt install -y \
build-essential \
python3.10 python3.10-venv python3.10-dev python3-pip \
python3-dev
Step 17: Install the PyTorch trio that matches your setup (you already have torch 2.8.0+cu128)
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu128 \
torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0
Step 18: Install GemLite + Triton
Run the following command to install GemLite + Triton:
pip install "gemlite>=0.5.0" triton
Step 19: Sanity check (imports + CUDA ops)
Run the following command to sanity-check the imports and CUDA ops:
python - <<'PY'
import torch, torchvision, torchaudio, transformers, diffusers, triton
from torchvision.ops import nms
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)
print("transformers:", transformers.__version__)
print("diffusers:", diffusers.__version__)
print("triton:", triton.__version__)
print("nms ok")
PY
Step 20: Create the Script
Create a file (e.g., app2.py) and add the following code:
import torch, gemlite
from diffusers import ChromaPipeline

# Load the pipeline in fp16 (GemLite's dynamic 8-bit kernels operate on fp16 activations).
pipe = ChromaPipeline.from_pretrained("lodestones/Chroma1-HD", torch_dtype=torch.float16)
device = "cuda:0"
processor = gemlite.helper.A8W8_INT8_dynamic

# Tag every submodule with its qualified name (used by GemLite logging/conversion).
for name, module in pipe.transformer.named_modules():
    module.name = name

# Walk a module tree and replace every torch.nn.Linear via the constructor hook.
def patch_linears(m, ctor):
    for n, layer in m.named_children():
        if isinstance(layer, torch.nn.Linear):
            setattr(m, n, ctor(layer, n))
        else:
            patch_linears(layer, ctor)

# Move a layer to the GPU and try to convert it to a GemLite low-bit linear;
# on failure, keep the original layer in place.
def to_gemlite(layer, name):
    layer = layer.to(device, non_blocking=True)
    try:
        return processor(device=device).from_linear(layer)
    except Exception:
        return layer

patch_linears(pipe.transformer, to_gemlite)  # only the transformer's Linears are swapped
pipe.to(device)

# JIT-compile the hot paths for extra speed.
pipe.transformer.forward = torch.compile(pipe.transformer.forward, fullgraph=True)
pipe.vae.forward = torch.compile(pipe.vae.forward, fullgraph=True)
What the Script Does:
- Imports PyTorch, GemLite, and ChromaPipeline (Diffusers).
- Loads Chroma1-HD with FP16 weights (torch_dtype=torch.float16).
- Chooses GPU device cuda:0 and selects the GemLite processor A8W8_INT8_dynamic (dynamic 8-bit linears).
- Adds a .name attribute to every submodule in pipe.transformer (used by GemLite logging/conversion).
- Defines patch_linears(...) to walk the transformer recursively and replace every torch.nn.Linear layer via a constructor hook.
- Defines to_gemlite(...), which moves a layer to the GPU and tries to convert it to a GemLite low-bit linear (processor.from_linear); if conversion fails, it leaves the original layer in place.
- Runs patch_linears(pipe.transformer, to_gemlite) so only the transformer's Linear layers are swapped to GemLite kernels (the VAE stays unchanged).
- Moves the whole pipeline to the GPU with pipe.to(device).
- Uses torch.compile(..., fullgraph=True) on both the transformer and VAE forward passes to JIT-optimize execution for extra speed.
- After this prep, the pipeline is ready for generation with reduced VRAM pressure and potential latency gains, though this snippet itself does not call pipe(...) to generate an image (a sketch of that call follows below).
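Since app2.py stops before generation, here is a minimal follow-on sketch of ours (same call pattern as the earlier scripts; output filename is illustrative) that you could append after the torch.compile lines to render an image with the patched pipeline:

img = pipe(
    prompt=["A cinematic sunrise over a glassy mountain lake."],
    negative_prompt=["low quality, blurry, deformed"],
    num_inference_steps=30,
    guidance_scale=3.0,
    generator=torch.Generator("cpu").manual_seed(433),
).images[0]
img.save("chroma_gemlite.png")  # hypothetical output name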
Step 21: Run the Script
Run the script with the following command:
python3 app2.py
This will download the model weights and prepare the GemLite-patched pipeline.
Step 22: Create the Script
Create a file (e.g., test.py) and add the following code:
import time, torch
from diffusers import ChromaPipeline

pipe = ChromaPipeline.from_pretrained(
    "lodestones/Chroma1-HD",
    dtype=torch.bfloat16,  # use `dtype`, not the deprecated `torch_dtype`
)
pipe.enable_model_cpu_offload()  # turn OFF if you have ≥40GB VRAM

t0 = time.time()
out = pipe(
    prompt=["A cinematic sunrise over a glassy mountain lake, ultra-detailed, 50mm film look."],
    negative_prompt=["low quality, blurry, deformed"],
    num_inference_steps=30,
    guidance_scale=3.0,
    num_images_per_prompt=1,
    width=1024,
    height=1024,
    generator=torch.Generator("cpu").manual_seed(433),
)
t1 = time.time()

img = out.images[0]
img.save("chroma_ok.png")
print(f"Saved chroma_ok.png in {t1 - t0:.2f}s")
What the Script Does:
- Imports time, PyTorch, and ChromaPipeline (Diffusers).
- Loads lodestones/Chroma1-HD with bfloat16 precision via the newer dtype= argument.
- Enables CPU offload to lower peak VRAM usage (disable this on ≥40 GB GPUs for speed; a batched, offload-off variant follows this list).
- Starts a timer to measure end-to-end generation latency.
- Calls the pipeline to generate 1 image at 1024×1024, using:
  - Prompt: cinematic sunrise over a glassy mountain lake (50mm look)
  - Negative prompt: low quality / blurry / deformed
  - 30 steps, CFG = 3.0, seeded with 433 for reproducibility
- Retrieves the first image from out.images.
- Saves the result to chroma_ok.png.
- Prints the total runtime in seconds.
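On a 40 GB-plus card, a variant of ours (not from the source) keeps the pipeline resident on the GPU and batches several images per call. In test.py, replace the pipe.enable_model_cpu_offload() line with pipe.to("cuda"), then raise the batch:

pipe.to("cuda")  # replaces enable_model_cpu_offload(); keeps everything on the GPU

out = pipe(
    prompt=["A cinematic sunrise over a glassy mountain lake, ultra-detailed, 50mm film look."],
    negative_prompt=["low quality, blurry, deformed"],
    num_inference_steps=30,
    guidance_scale=3.0,
    num_images_per_prompt=4,  # a batch of 4 suits the 40-48 GB tier in the table above
    width=1024, height=1024,
    generator=torch.Generator("cpu").manual_seed(433),
)
for i, img in enumerate(out.images):
    img.save(f"chroma_ok_{i}.png")  # hypothetical output names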
Step 23: Run the Script
Run the script with the following command:
python3 test.py
This will generate the image and save it in the project directory.
Step 24: Check the Generated Image
Check the generated image in the project directory.
Conclusion
Chroma1-HD is a clean, dependable base you can put to work fast. In this guide, you spun up a GPU VM on NodeShift, used a CUDA-ready image, installed the core stack, and verified end-to-end generation with simple Diffusers scripts. You also learned practical VRAM tiers (24/40/80 GB), when to use CPU offload, and how optional GemLite+Triton trims memory and speeds up matmuls—handy on 24–40 GB cards.
From here, experiment with prompts, steps, and 1024→1280 resolutions, then scale batching on bigger GPUs. When you’re comfortable, wire it into ComfyUI or explore lightweight LoRA finetunes. Ship your first render, iterate, and share what you build.