Chroma1-HD is an 8.9B text-to-image base model built on FLUX.1-schnell. It’s released under Apache-2.0, making it ideal for research and downstream finetuning. As a neutral, high-quality foundation, it focuses on clean generation, stable training behavior, and easy customization—while staying friendly to common open-source tooling (Diffusers, ComfyUI).
GPU Configuration (Rule-of-Thumb)
| Scenario | Min VRAM | Comfortable VRAM | Example GPUs | Precision | Typical Settings | Notes |
|---|---|---|---|---|---|---|
| Entry (single image) | 24 GB | 24–32 GB | RTX 4090 24G, L4 24G | bf16 / fp16 | 1024×1024, steps 25–40, batch=1, enable_model_cpu_offload() | Fits reliably with offload (slower due to PCIe). If OOM, drop to 896–768 px or steps 20–30. |
| Standard | 40 GB | 40–48 GB | A100 40G, L40S 48G | bf16 | 1024×1024, steps 30–40, batch 2–4, offload off | Best balance of speed and batching; good for small queues. |
| Pro / High Throughput | 80 GB | 80 GB+ | A100 80G, H100 80G, H200 141G | bf16 | 1024–1280 px, steps 30–40, batch 4–8, offload off | Highest throughput; room for larger resolutions and more concurrent jobs. |
| CPU-only (not recommended) | — | — | — | fp32 | 512–768 px, steps ≤20 | Very slow; only for functional tests. |
| Quantized (optional) | 20–32 GB | 24–40 GB | 4090, L40S, A100 | fp16 + low-bit linears | Same as the tiers above | GemLite dynamic 8-bit linears can trim VRAM and latency; gains vary by GPU. |
Tips
- Prefer dtype=torch.bfloat16 on modern GPUs; switch to fp16 if your card favors it.
- On 24 GB, keep batch=1 and use pipe.enable_model_cpu_offload(); turn it off on ≥40 GB (a VRAM-aware sketch follows this list).
- For speed on ≥40 GB, consider torch.compile(pipe.transformer.forward, fullgraph=True) (falls back safely if unsupported).
- If you push 1280×1280 or higher, expect a VRAM jump; scale steps/batch accordingly.
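To make the tiers concrete, here is a minimal sketch of ours (thresholds mirror the table above; not from the model card) that picks offload and batch size from the detected VRAM:

import torch
from diffusers import ChromaPipeline

# Total VRAM on GPU 0, in GiB, used to pick a tier from the table above.
vram_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3

pipe = ChromaPipeline.from_pretrained(
    "lodestones/Chroma1-HD",
    torch_dtype=torch.bfloat16,
)

if vram_gib < 40:
    pipe.enable_model_cpu_offload()  # Entry tier: trade PCIe latency for headroom
    batch = 1
else:
    pipe.to("cuda")  # Standard/Pro tiers: keep the weights resident
    batch = 2 if vram_gib < 80 else 4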
Resources
Link: https://huggingface.co/lodestones/Chroma1-HD
Step-by-Step Process to Install & Run Chroma1-HD Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Chroma1-HD, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like Chroma1-HD
- Compatibility with CUDA 12.1.1, which certain model operations require
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Chroma1-HD.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.
This setup ensures that Chroma1-HD runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
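A typical invocation looks like this (placeholder key path, IP, and port; use the exact command shown on your deployment page):

ssh -i ~/.ssh/your_key root@<proxy-or-direct-ssh-ip> -p <port>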
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available. Check whether pip exists:
pip3 --version
If you get an error like "command not found", install pip manually via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This will download and install pip on your system.
You may see a warning about running as root — that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv chroma
source chroma/bin/activate
Step 10: Install PyTorch
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
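Optionally, confirm that the CUDA build of PyTorch sees the GPU (a quick one-liner of ours, not part of the original steps):

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"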
Step 11: Install libraries required by ChromaPipeline
Run the following command to install the libraries required by ChromaPipeline:
pip install "diffusers>=0.35.1" transformers sentencepiece accelerate safetensors
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running model script with the Chroma1-HD model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Create the Script
Create a file (e.g., app.py) and add the following code:
import torch
from diffusers import ChromaPipeline

pipe = ChromaPipeline.from_pretrained(
    "lodestones/Chroma1-HD",
    torch_dtype=torch.bfloat16,  # bf16 is recommended
)
pipe.enable_model_cpu_offload()  # trims peak VRAM; a bit slower

prompt = ["A high-fashion close-up portrait... (your prompt)"]
neg = ["low quality, ugly, out of focus, deformed, disfigure, blurry"]

img = pipe(
    prompt=prompt,
    negative_prompt=neg,
    generator=torch.Generator("cpu").manual_seed(433),
    num_inference_steps=40,
    guidance_scale=3.0,
    num_images_per_prompt=1,
).images[0]
img.save("chroma.png")
What the Script Does:
- Imports PyTorch and ChromaPipeline from Diffusers.
- Loads lodestones/Chroma1-HD with bfloat16 precision for speed/stability.
- Enables CPU offload to reduce peak VRAM on 24 GB-class GPUs (slower but safer; a GPU-resident variant follows this list).
- Defines a prompt and a negative prompt to steer image quality.
- Sets a deterministic seed (433) via a CPU generator for reproducible results.
- Runs inference with 40 steps, CFG=3.0, and 1 image per prompt.
- Retrieves the generated image from the pipeline output.
- Saves the image locally as chroma.png.
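If you are running on a 40 GB-plus GPU, a small variant of ours (same API, our adjustment rather than the source's) skips offload and keeps the whole pipeline resident on the GPU for speed:

import torch
from diffusers import ChromaPipeline

pipe = ChromaPipeline.from_pretrained(
    "lodestones/Chroma1-HD",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")  # replaces enable_model_cpu_offload(); needs roughly 40 GB+ of VRAM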
Step 14: Run the Script
Run the script with the following command:
python3 app.py
This will generate the image and save it in the project directory.
Step 15: Check the Generated Image
Check the generated image in the project directory.
We validated Chroma1-HD end-to-end using the quick script: it imports PyTorch and ChromaPipeline, loads lodestones/Chroma1-HD in bf16, enables CPU offload to keep peak VRAM in check, sets a fixed seed for reproducibility, runs 40 denoise steps with CFG=3.0 for one image, then writes the output to chroma.png. It is a fast sanity test that the weights, tokenizer/encoders, and scheduler are wired correctly.
Why GemLite + Triton?
- Lower VRAM & faster matmuls: GemLite replaces PyTorch Linear layers with low-bit dynamic kernels (e.g., 8-bit activations/weights), cutting memory traffic and often improving latency, which is especially helpful on 24–40 GB GPUs (a conceptual sketch follows this list).
- JIT kernels via Triton: Triton compiles small GPU kernels on the fly for your hardware. That's why you need the Python C headers (python3.10-dev provides Python.h) and a working CUDA driver inside the container/VM.
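To build intuition for the Linear-swapping idea before the GemLite version in Step 20, here is a minimal conceptual sketch using PyTorch's built-in dynamic quantization (a CPU-oriented stand-in for illustration, not GemLite itself):

import torch

# A toy module with one Linear layer, standing in for a transformer block.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU())

# Dynamic quantization: weights stored as int8, activations quantized on the fly,
# conceptually similar to GemLite's A8W8 dynamic kernels.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
print(qmodel(x).shape)  # same interface, smaller weight memory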
Step 16: Install Python 3.10 Toolchain + Headers
Run the following commands to install the Python 3.10 toolchain and headers:
sudo apt update
sudo apt install -y \
build-essential \
python3.10 python3.10-venv python3.10-dev python3-pip \
python3-dev
Step 17: Install the PyTorch trio that matches your setup (you already have torch 2.8.0+cu128)
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu128 \
torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0
Step 18: Install GemLite + Triton
Run the following command to install GemLite + Triton:
pip install "gemlite>=0.5.0" triton
Step 19: Sanity check (imports + CUDA ops)
Run the following command to sanity-check the imports and CUDA ops:
python - <<'PY'
import torch, torchvision, torchaudio, transformers, diffusers, triton
from torchvision.ops import nms
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)
print("transformers:", transformers.__version__)
print("diffusers:", diffusers.__version__)
print("triton:", triton.__version__)
print("nms ok")
PY
Step 20: Create the Script
Create a file (e.g., app2.py) and add the following code:
import torch, gemlite
from diffusers import ChromaPipeline

# Load the pipeline in fp16 (GemLite's dynamic 8-bit kernels operate on fp16 activations).
pipe = ChromaPipeline.from_pretrained("lodestones/Chroma1-HD", torch_dtype=torch.float16)
device = "cuda:0"
processor = gemlite.helper.A8W8_INT8_dynamic

# Tag every submodule with its qualified name (used by GemLite logging/conversion).
for name, module in pipe.transformer.named_modules():
    module.name = name

# Walk a module tree and replace every torch.nn.Linear via the constructor hook.
def patch_linears(m, ctor):
    for n, layer in m.named_children():
        if isinstance(layer, torch.nn.Linear):
            setattr(m, n, ctor(layer, n))
        else:
            patch_linears(layer, ctor)

# Move a layer to the GPU and try to convert it to a GemLite low-bit linear;
# on failure, keep the original layer in place.
def to_gemlite(layer, name):
    layer = layer.to(device, non_blocking=True)
    try:
        return processor(device=device).from_linear(layer)
    except Exception:
        return layer

patch_linears(pipe.transformer, to_gemlite)  # only the transformer's Linears are swapped
pipe.to(device)

# JIT-compile the hot paths for extra speed.
pipe.transformer.forward = torch.compile(pipe.transformer.forward, fullgraph=True)
pipe.vae.forward = torch.compile(pipe.vae.forward, fullgraph=True)
What the Script Does:
- Imports PyTorch, GemLite, and ChromaPipeline (Diffusers).
- Loads Chroma1-HD with FP16 weights (torch_dtype=torch.float16).
- Chooses GPU device cuda:0 and selects the GemLite processor A8W8_INT8_dynamic (dynamic 8-bit linears).
- Adds a .name attribute to every submodule in pipe.transformer (used by GemLite logging/conversion).
- Defines patch_linears(...) to walk the transformer recursively and replace every torch.nn.Linear layer via a constructor hook.
- Defines to_gemlite(...), which moves a layer to the GPU and tries to convert it to a GemLite low-bit linear (processor.from_linear); if conversion fails, it leaves the original layer in place.
- Runs patch_linears(pipe.transformer, to_gemlite) so only the transformer's Linear layers are swapped to GemLite kernels (the VAE stays unchanged).
- Moves the whole pipeline to the GPU with pipe.to(device).
- Uses torch.compile(..., fullgraph=True) on both the transformer and VAE forward passes to JIT-optimize execution for extra speed.
- After this prep, the pipeline is ready for generation with reduced VRAM pressure and potential latency gains, though this snippet itself does not call pipe(...) to generate an image (a sketch of that call follows below).
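Since app2.py stops before generation, here is a minimal follow-on sketch of ours (same call pattern as the earlier scripts; output filename is illustrative) that you could append after the torch.compile lines to render an image with the patched pipeline:

img = pipe(
    prompt=["A cinematic sunrise over a glassy mountain lake."],
    negative_prompt=["low quality, blurry, deformed"],
    num_inference_steps=30,
    guidance_scale=3.0,
    generator=torch.Generator("cpu").manual_seed(433),
).images[0]
img.save("chroma_gemlite.png")  # hypothetical output name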
Step 21: Run the Script
Run the script with the following command:
python3 app2.py
This will download the model weights and prepare the GemLite-patched pipeline.
Step 22: Create the Script
Create a file (e.g., test.py) and add the following code:
import time, torch
from diffusers import ChromaPipeline

pipe = ChromaPipeline.from_pretrained(
    "lodestones/Chroma1-HD",
    dtype=torch.bfloat16,  # use `dtype`, not the deprecated `torch_dtype`
)
pipe.enable_model_cpu_offload()  # turn OFF if you have ≥40GB VRAM

t0 = time.time()
out = pipe(
    prompt=["A cinematic sunrise over a glassy mountain lake, ultra-detailed, 50mm film look."],
    negative_prompt=["low quality, blurry, deformed"],
    num_inference_steps=30,
    guidance_scale=3.0,
    num_images_per_prompt=1,
    width=1024,
    height=1024,
    generator=torch.Generator("cpu").manual_seed(433),
)
t1 = time.time()

img = out.images[0]
img.save("chroma_ok.png")
print(f"Saved chroma_ok.png in {t1 - t0:.2f}s")
What the Script Does:
- Imports time, PyTorch, and ChromaPipeline (Diffusers).
- Loads lodestones/Chroma1-HD with bfloat16 precision via the newer dtype= argument.
- Enables CPU offload to lower peak VRAM usage (disable this on ≥40 GB GPUs for speed; a batched, offload-off variant follows this list).
- Starts a timer to measure end-to-end generation latency.
- Calls the pipeline to generate 1 image at 1024×1024, using:
  - Prompt: cinematic sunrise over a glassy mountain lake (50mm look)
  - Negative prompt: low quality / blurry / deformed
  - 30 steps, CFG = 3.0, seeded with 433 for reproducibility
- Retrieves the first image from out.images.
- Saves the result to chroma_ok.png.
- Prints the total runtime in seconds.
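On a 40 GB-plus card, a variant of ours (not from the source) keeps the pipeline resident on the GPU and batches several images per call. In test.py, replace the pipe.enable_model_cpu_offload() line with pipe.to("cuda"), then raise the batch:

pipe.to("cuda")  # replaces enable_model_cpu_offload(); keeps everything on the GPU

out = pipe(
    prompt=["A cinematic sunrise over a glassy mountain lake, ultra-detailed, 50mm film look."],
    negative_prompt=["low quality, blurry, deformed"],
    num_inference_steps=30,
    guidance_scale=3.0,
    num_images_per_prompt=4,  # a batch of 4 suits the 40-48 GB tier in the table above
    width=1024, height=1024,
    generator=torch.Generator("cpu").manual_seed(433),
)
for i, img in enumerate(out.images):
    img.save(f"chroma_ok_{i}.png")  # hypothetical output names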
Step 23: Run the Script
Run the script with the following command:
python3 test.py
This will generate the image and save it in the project directory.
Step 24: Check the Generated Image
Check the generated image in the project directory.
Conclusion
Chroma1-HD is a clean, dependable base you can put to work fast. In this guide, you spun up a GPU VM on NodeShift, used a CUDA-ready image, installed the core stack, and verified end-to-end generation with simple Diffusers scripts. You also learned practical VRAM tiers (24/40/80 GB), when to use CPU offload, and how optional GemLite+Triton trims memory and speeds up matmuls—handy on 24–40 GB cards.
From here, experiment with prompts, steps, and 1024→1280 resolutions, then scale batching on bigger GPUs. When you’re comfortable, wire it into ComfyUI or explore lightweight LoRA finetunes. Ship your first render, iterate, and share what you build.