When we talk about open language models, most discussions revolve around performance and scale. But what if the conversation centered on privacy first? That’s where VaultGemma comes in.
Developed by Google, VaultGemma is a unique variant of the Gemma family, built entirely from the ground up with Differential Privacy (DP) at its core. Using DP-SGD (Differentially Private Stochastic Gradient Descent), it provides strong, mathematically-backed guarantees that no single training example can be extracted from its parameters. In plain words: VaultGemma remembers patterns, not people.
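To make the mechanism concrete, here is a minimal, illustrative sketch of a single DP-SGD step in PyTorch. It is not VaultGemma's training code; the tiny linear model, clip norm C, and noise multiplier sigma are placeholder values chosen for the example. The key ideas are per-example gradient clipping (bounding any one example's influence) and Gaussian noise added before the update.
# Illustrative DP-SGD step (toy sketch, not VaultGemma's actual training code)
import torch

model = torch.nn.Linear(10, 1)                      # placeholder toy model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
C, sigma = 1.0, 1.0                                 # assumed clip norm and noise multiplier

xs, ys = torch.randn(8, 10), torch.randn(8, 1)      # one microbatch of 8 examples

summed = [torch.zeros_like(p) for p in model.parameters()]
for x, y in zip(xs, ys):
    model.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                  # this example's gradient
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = min(1.0, C / (norm.item() + 1e-6))       # clip the example's gradient to norm C
    for s, g in zip(summed, grads):
        s.add_(g, alpha=scale)

with torch.no_grad():                                # add calibrated noise, then apply the averaged update
    for p, s in zip(model.parameters(), summed):
        noise = torch.randn_like(s) * sigma * C
        p.grad = (s + noise) / len(xs)
opt.step()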
Despite being lightweight (under 1B parameters), the model shows solid performance on reasoning, code, and natural language tasks, while ensuring that the privacy of its training data is never compromised. That makes it a rare model suitable for healthcare, finance, and sensitive communication systems—where both performance and privacy matter.
VaultGemma might not top the leaderboards compared to non-private models, but it represents a paradigm shift: proving that you don’t have to choose between utility and privacy—you can build responsibly from the start.
Benchmark Results
The model was evaluated on a range of standard academic benchmarks. As expected, there is a utility trade-off for the strong privacy guarantees offered by the model. The table below shows the performance of the 1B pre-trained (PT) VaultGemma model.
GPU Configuration Table for VaultGemma (1B)
VaultGemma is relatively small compared to today’s giants (like 70B+ models). Its <1B parameter size means it can run comfortably on consumer GPUs, laptops with good VRAM, and lightweight cloud setups. Here’s a breakdown:
| Scenario | Min VRAM | Comfortable VRAM | Example GPUs | Precision | Notes |
|---|---|---|---|---|---|
| Local experimentation (single prompt/chat) | 4 GB | 6–8 GB | RTX 3050, T4, Mac M2/M3 | FP16/BF16 | Works fine with the Transformers pipeline or vLLM CPU+GPU hybrid offload. |
| Development & prototyping (small apps/chatbots) | 8 GB | 12–16 GB | RTX 3060, RTX 4060 Ti, A10G | BF16 | Smooth real-time interaction, batch size 1–2. |
| Scalable cloud deployment (chatbots, API serving) | 16 GB | 24 GB | RTX 4090, L4, A100 40G | BF16 | Best balance of throughput and low latency; easily serves multiple requests. |
| High-throughput workloads (batch inference, fine-tuning with DP) | 24 GB+ | 40 GB+ | A100 40G, H100, TPU v5e | FP32/BF16 | Needed for training/fine-tuning with differential privacy overhead. |
Pro Tip: If you’re just running VaultGemma for testing, even a mid-range consumer GPU (or CPU with enough RAM) will do. For fine-tuning with DP enabled, lean toward A100/H100 or TPU hardware, since DP-SGD adds extra compute overhead.
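For a quick back-of-the-envelope check of where your hardware falls in the table above, you can estimate weight memory as parameters × bytes per parameter, plus some headroom for the KV cache, activations, and CUDA context. This is an illustrative sketch, not a measurement; the 1.5 GB overhead figure is an assumed placeholder.
# Rough VRAM estimate for VaultGemma-1B inference (illustrative only)
params = 1e9                   # ~1B parameters
bytes_per_param = 2            # BF16/FP16 = 2 bytes per weight; FP32 would be 4
weights_gb = params * bytes_per_param / 1024**3
overhead_gb = 1.5              # assumed headroom for KV cache, activations, CUDA context
print(f"Weights: ~{weights_gb:.1f} GB, suggested VRAM: ~{weights_gb + overhead_gb:.1f} GB")
# Prints roughly: Weights: ~1.9 GB, suggested VRAM: ~3.4 GB, consistent with the 4 GB minimum above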
Resources
Link: https://huggingface.co/google/vaultgemma-1b
Step-by-Step Process to Install & Run Google VaultGemma-1B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, open the Dashboard, click the Create GPU Node button, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Google VaultGemma-1B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- The full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like Google VaultGemma-1B
- Compatibility with CUDA 12.1.1, which certain model operations require
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Google VaultGemma-1B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the Google VaultGemma-1B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python
Run the following command to install Python (with venv support) and the required build tools:
sudo apt update && sudo apt -y install git curl build-essential python3.10-venv
Step 9: Install Pip and Wheel, and Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment, then upgrade pip and install wheel:
python3.10 -m venv vaultgemma && source vaultgemma/bin/activate
python -m pip install --upgrade pip wheel
Step 10: Install PyTorch
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
Step 11: Install Model Dependencies
Run the following command to install model dependencies:
pip install accelerate sentencepiece bitsandbytes
Step 12: Install Transformers
Why this step exists
- VaultGemma is new and declares a custom model_type: "vaultgemma". Stable transformers releases (e.g., 4.56.x) don't recognize it yet, so you get KeyError: 'vaultgemma'.
- The latest dev build of transformers includes the auto-mapping hooks and remote-code loading needed to resolve that model type. That's why we install from GitHub @main.
- We keep this step separate from the rest of the stack so you can upgrade or downgrade transformers independently (without touching Torch, vLLM, etc.). If something breaks, you roll back just this package: clean and low-risk.
What we install and why
transformers (dev, from GitHub):
- Adds support for brand-new architectures like vaultgemma.
- Ensures AutoConfig/AutoModel can resolve the right classes when you pass trust_remote_code=True.
- Ships bug fixes ahead of the next PyPI release.
- We do not rebuild heavy deps here:
  - Torch/CUDA are already installed and working; no need to touch them.
  - tokenizers stays on a prebuilt wheel (fast install, no Rust toolchain needed) that satisfies the version range required by the dev transformers.
  - huggingface_hub and accelerate are already present; no need to pin them unless you see specific errors.
Exact commands
Run the following command to install transformers:
# Install bleeding-edge Transformers (scoped to this venv)
pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@main"
# Verify
python - <<'PY'
import transformers, platform
print("Transformers:", transformers.__version__, "| Python:", platform.python_version())
PY
# Expect something like: 4.57.0.dev0
How this fixes your error
- With the dev build installed, AutoConfig.from_pretrained(..., trust_remote_code=True) knows how to interpret model_type="vaultgemma" and import the repo's custom modeling code. That removes the "Transformers does not recognize this architecture" failure.
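If you want a quick sanity check that the dev build resolves the architecture (assuming you have already accepted the model's license and authenticated to Hugging Face, covered in Step 15), you can load just the config:
# Sanity check: the dev transformers build should resolve the custom model type
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/vaultgemma-1b", trust_remote_code=True)
print(cfg.model_type)  # expected: "vaultgemma", with no KeyError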
Why it’s a dedicated step
- Isolation: Upgrading only transformers avoids accidentally changing Torch/cuDNN/vLLM versions that affect GPU kernels or scheduling.
- Reproducibility: You can pin the exact git commit for blog readers or CI (@<commit_sha>).
- Rollback-friendly: If a later upstream change causes a regression, you just reinstall a known-good commit of transformers; no full env surgery.
Step 13: Install vLLM
Run the following command to install vLLM:
pip install "vllm>=0.5.5"
We install vLLM to serve VaultGemma as a fast, scalable, OpenAI-compatible API after first smoke-testing it with a simple Transformers script. Transformers is perfect for quick local runs (sanity checks, notebooks), but it isn’t optimized for high-throughput serving. vLLM adds production features: PagedAttention for efficient KV-cache memory use, better request scheduling for many concurrent users, streaming responses, easy /v1/chat/completions compatibility, simple flags for max context/batch size, and clean multi-GPU/TP scaling when you need it. In short—Option 1 (Transformers) proves the model works; Option 2 (vLLM) turns it into a reliable, low-latency API your apps can hit just like OpenAI.
Step 14: Install HuggingFace Hub CLI
Run the following command to install huggingface_hub[cli]:
pip install "huggingface_hub[cli]"
Step 15: Authenticate to Hugging Face Hub (paste your token)
- Create an access token in your Hugging Face account settings
- Log in from the VM (interactive – recommended)
# New command (the old `huggingface-cli login` is deprecated)
hf auth login
# paste your token when asked
hf whoami # quick sanity check
Step 16: Connect to Your GPU VM with a Code Editor
Before you start running model script with the Google VaultGemma-1B model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we're using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 17: Create the Script
Create a file (e.g., app.py) and add the following code:
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, TextStreamer
import torch
model_id = "google/vaultgemma-1b"
# Tokenizer (remote code)
tok = AutoTokenizer.from_pretrained(
model_id,
use_fast=True,
trust_remote_code=True,
)
# Config first (forces remote mapping for model_type="vaultgemma")
cfg = AutoConfig.from_pretrained(
model_id,
trust_remote_code=True,
)
# Model
model = AutoModelForCausalLM.from_pretrained(
model_id,
config=cfg,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
prompt = "In two lines, explain what DP-SGD guarantees."
inputs = tok(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
**inputs,
max_new_tokens=128,
temperature=0.7,
do_sample=True,
streamer=streamer
)
Facts to remember:
- The context window is 1,024 tokens; keep prompt + output within that.
- If you see a "model type … not supported" error, update transformers to the latest dev build; reports like this on the Hugging Face discussions usually mean the library is out of date.
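As a small, optional check that your prompt plus the requested output fits inside that 1,024-token window, you can count tokens with the same tokenizer the script uses (a sketch, assuming you are already authenticated):
# Optional: verify prompt + max_new_tokens stays within the 1,024-token context
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/vaultgemma-1b", trust_remote_code=True)
prompt = "In two lines, explain what DP-SGD guarantees."
max_new_tokens = 128
prompt_tokens = len(tok(prompt)["input_ids"])
budget = 1024 - max_new_tokens
print(f"Prompt uses {prompt_tokens} of {budget} tokens available before generation")
if prompt_tokens > budget:
    print("Shorten the prompt or reduce max_new_tokens.")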
What the script does:
- Import core Hugging Face classes (tokenizer, config, model) plus Torch and a live text streamer.
- Set model_id to the gated repo google/vaultgemma-1b you authenticated for.
- Download & load the tokenizer from HF with trust_remote_code=True so custom code in the repo is allowed.
- Download & load the model config first (also with trust_remote_code=True) so the custom vaultgemma architecture is recognized.
- Download & load the actual weights as a causal LM, using BF16 for efficiency and device_map="auto" to place them on your GPU.
- Define a short natural-language prompt asking for a two-line summary of DP-SGD.
- Tokenize the prompt into tensors and move them to the same device as the model.
- Create a TextStreamer that prints generated tokens to stdout in real time (no prompt or special tokens).
- Call model.generate(...) to produce up to 128 new tokens with sampling enabled (temperature=0.7, do_sample=True).
- Stream the model's output to your terminal as it's generated (no separate print needed).
- Use the model's default max context (VaultGemma: 1,024 tokens) implicitly for prompt + output.
- Rely on your HF credentials/cache to fetch and reuse model files automatically.
- Keep everything self-contained in your Python venv so it doesn't affect system packages.
Step 18: Run the Script
Run the script from the following command:
python3 app.py
This will download the model and generate a response in the terminal.
Step 19: Launch the OpenAI-compatible server (vLLM) and smoke-test it
- Start the server (keep this terminal open):
python3 -m vllm.entrypoints.openai.api_server \
--model google/vaultgemma-1b \
--dtype bfloat16 \
--max-model-len 1024 \
--gpu-memory-utilization 0.9 \
--port 8000
You're ready when you see routes like /v1/completions and /v1/models printed in the log.
- Verify the server is up (new terminal):
curl http://127.0.0.1:8000/v1/models
# expect JSON listing: id = "google/vaultgemma-1b"
Do a quick completion (base model → use /v1/completions):
curl http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/vaultgemma-1b",
"prompt": "Task: Write exactly two bullet lines about Differentially Private SGD (DP-SGD).\nOutput:\n- ",
"max_tokens": 80,
"temperature": 0.1,
"top_p": 0.9,
"stop": ["\n\n","END"]
}'
(Optional) Use it from apps
Set the OpenAI-compatible envs and point your clients at vLLM:
export OPENAI_BASE_URL="http://127.0.0.1:8000/v1/"
export OPENAI_API_KEY="sk-not-needed"
- Python client: use the openai package with the /v1/completions endpoint (see the sketch after the note below).
- Streamlit UI: run the provided vg_ui.py (or your app) and paste the same Base URL.
Note: vaultgemma-1b is pretrained (non-instruct). Prefer /v1/completions, or launch vLLM with a --chat-template if you want /v1/chat/completions. Keep temperature low and constrain the output format for best results.
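As a minimal sketch of the Python-client route mentioned above (assuming the openai package is installed and the vLLM server from Step 19 is running locally):
# Minimal client for the local vLLM server (assumes `pip install openai` and Step 19 is running)
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1/", api_key="sk-not-needed")

resp = client.completions.create(
    model="google/vaultgemma-1b",
    prompt="Task: Write exactly two bullet lines about Differentially Private SGD (DP-SGD).\nOutput:\n- ",
    max_tokens=80,
    temperature=0.1,
    top_p=0.9,
    stop=["\n\n", "END"],
)
print(resp.choices[0].text)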
Conclusion
This guide walked through end-to-end setup: spinning up a GPU VM, installing deps, authenticating HF, smoke-testing with Transformers, then serving an OpenAI-compatible API via vLLM (plus optional Streamlit UI) and sizing GPUs. The takeaway: you can deploy a private-by-design LLM quickly and responsibly, then scale it or extend it with DP fine-tuning and safety guardrails as your needs grow.