EmbeddingGemma-300M is Google DeepMind’s lightweight, multilingual (100+ languages) embedding model built on Gemma 3/T5Gemma foundations. It outputs 768-dim vectors (with Matryoshka down-projections to 512/256/128) optimized for retrieval, classification, clustering, semantic similarity, QA, and code retrieval. It’s designed for low-resource / on-device use, loads via SentenceTransformers, and does not support float16—use FP32 or bfloat16.
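As a quick illustration, the Matryoshka down-projection can be selected directly when loading the model in SentenceTransformers; the 256-dim setting below is just an example (weights stay in the default FP32, since float16 is not supported):
from sentence_transformers import SentenceTransformer

# Load EmbeddingGemma-300M with 256-dim Matryoshka truncation.
model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=256)

emb = model.encode("EmbeddingGemma runs on-device.")
print(emb.shape)  # (256,)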
Evaluation
Benchmark Results
The model was evaluated against a large collection of different datasets and metrics to cover different aspects of text understanding.
Full Precision Checkpoint
MTEB (Multilingual, v2)

Dimensionality | Mean (Task) | Mean (TaskType)
---|---|---
768d | 61.15 | 54.31
512d | 60.71 | 53.89
256d | 59.68 | 53.01
128d | 58.23 | 51.77

MTEB (English, v2)

Dimensionality | Mean (Task) | Mean (TaskType)
---|---|---
768d | 68.36 | 64.15
512d | 67.80 | 63.59
256d | 66.89 | 62.94
128d | 65.09 | 61.56

MTEB (Code, v1)

Dimensionality | Mean (Task) | Mean (TaskType)
---|---|---
768d | 68.76 | 68.76
512d | 68.48 | 68.48
256d | 66.74 | 66.74
128d | 62.96 | 62.96
QAT Checkpoints
MTEB (Multilingual, v2)

Quant config (dimensionality) | Mean (Task) | Mean (TaskType)
---|---|---
Q4_0 (768d) | 60.62 | 53.61
Q8_0 (768d) | 60.93 | 53.95
Mixed Precision* (768d) | 60.69 | 53.82

MTEB (English, v2)

Quant config (dimensionality) | Mean (Task) | Mean (TaskType)
---|---|---
Q4_0 (768d) | 67.91 | 63.64
Q8_0 (768d) | 68.13 | 63.85
Mixed Precision* (768d) | 67.95 | 63.83

MTEB (Code, v1)

Quant config (dimensionality) | Mean (Task) | Mean (TaskType)
---|---|---
Q4_0 (768d) | 67.99 | 67.99
Q8_0 (768d) | 68.70 | 68.70
Mixed Precision* (768d) | 68.03 | 68.03

Note: QAT models are evaluated after quantization.
* Mixed Precision refers to per-channel quantization with int4 for embeddings, feedforward, and projection layers, and int8 for attention (e4_a8_f4_p4).
GPU/CPU Configuration Table
Scenario | Precision | Min VRAM / RAM | Recommended | Example Hardware | Context Length | Practical Notes
---|---|---|---|---|---|---
Laptop / Desktop (CPU only) | FP32 | 2–3 GB RAM | 4–8 GB RAM | 8-core CPU (Ryzen 7 / i7) | 2,048 | Easiest setup; batch 16–64 for docs; use BLAS/OMP; great for small/medium corpora.
Entry GPU | BF16 / FP32 | 4 GB VRAM | 6–8 GB VRAM | RTX 3050/3060 6–8 GB, T4 16 GB | 2,048 | Good price/perf; batch 128–256 docs; pin memory + DataLoader prefetch.
Standard single-GPU | BF16 / FP32 | 8 GB VRAM | 12–24 GB VRAM | RTX 3080/3090, L4 24 GB, A4000 | 2,048 | High-throughput ingestion; batch 512–1,024; keep outputs in FP32; enable half-precision weights if BF16 is supported.
Throughput server | BF16 / FP32 | 16 GB VRAM | 24–48 GB VRAM | L4 24 GB, A5000/A6000 | 2,048 | Production indexing; use multi-worker loaders; stream to a vector DB (FAISS/HNSW).
Multi-GPU pipeline* | BF16 / FP32 | 2×8 GB | 2×16 GB+ | 2×L4 / 2×A4000 | 2,048 | *The model itself doesn't need multi-GPU; shard batches across GPUs for line-rate ingestion.
On-device / Edge | FP32 | 2 GB RAM | 3–4 GB RAM | Apple M-series / modern ARM | 2,048 | Great for on-device similarity; smaller MRL dims (128–512) reduce memory + I/O.
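The batch sizes and precision notes in this table map directly onto the SentenceTransformers API. Below is a minimal sketch of high-throughput GPU encoding under the assumptions above; the batch size, bfloat16 weights, and the placeholder corpus are illustrative choices, not requirements:
import torch
from sentence_transformers import SentenceTransformer

# Load weights in bfloat16 on the GPU (float16 is not supported by this model).
model = SentenceTransformer(
    "google/embeddinggemma-300m",
    device="cuda" if torch.cuda.is_available() else "cpu",
    model_kwargs={"torch_dtype": torch.bfloat16},
)

docs = ["example document"] * 1024  # stand-in corpus

# Larger batches amortize per-call overhead; tune per the table above.
embeddings = model.encode_document(docs, batch_size=256, show_progress_bar=True)
print(embeddings.shape)  # numpy array, one row per document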
Use the following prompts based on your use case and input data type. These may already be available in the EmbeddingGemma configuration in your modeling framework of choice.
Use Case (task type enum) | Description | Recommended Prompt
---|---|---
Retrieval (Query) | Used to generate embeddings that are optimized for document search or information retrieval | task: search result \| query: {content}
Retrieval (Document) | | title: {title \| “none”} \| text: {content}
Question Answering | | task: question answering \| query: {content}
Fact Verification | | task: fact checking \| query: {content}
Classification | Used to generate embeddings that are optimized to classify texts according to preset labels | task: classification \| query: {content}
Clustering | Used to generate embeddings that are optimized to cluster texts based on their similarities | task: clustering \| query: {content}
Semantic Similarity | Used to generate embeddings that are optimized to assess text similarity. This is not intended for retrieval use cases. | task: sentence similarity \| query: {content}
Code Retrieval | Used to retrieve a code block based on a natural language query, such as sort an array or reverse a linked list. Embeddings of the code blocks are computed using retrieval_document. | task: code retrieval \| query: {content}
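If your framework does not apply these prompts automatically, you can prepend them to the input text yourself before encoding. Below is a minimal sketch for the clustering prompt; the retrieval prompts are applied for you by encode_query and encode_document, which are used later in this guide:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

texts = [
    "The stock market fell today.",
    "Tech shares dropped sharply.",
    "I love hiking in the Alps.",
]

# Prepend the clustering prompt from the table above to each input.
prompted = [f"task: clustering | query: {t}" for t in texts]
embeddings = model.encode(prompted)
print(embeddings.shape)  # (3, 768)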
Resources
Link: https://huggingface.co/google/embeddinggemma-300m
Step-by-Step Process to Install & Run EmbeddingGemma-300m Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running EmbeddingGemma-300m, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like EmbeddingGemma-300m
- Compatibility with CUDA 12.1.1, required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like EmbeddingGemma-300m.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that EmbeddingGemma-300m runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available.
Check if pip exists:
pip3 --version
If you get an error like command not found, then install pip manually.
Install pip via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This will download and install pip into your system.
You may see a warning about running as root — that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv gemma
source gemma/bin/activate
Step 10: Install Dependencies
Run the following command to install dependencies:
pip install -U sentence-transformers faiss-cpu
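Optionally, before moving on, you can confirm that PyTorch (installed as a dependency of sentence-transformers) can see the GPU. A minimal check:
import torch

# Should print True followed by the GPU name (e.g. the RTX A6000 on this VM).
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))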
Step 11: Install Hugging Face Hub
Run the following command to install huggingface_hub:
pip install -U huggingface_hub
Step 12: Log in to Hugging Face (CLI)
Run the following command to log in to Hugging Face:
huggingface-cli login
- When prompted, paste your HF token (from https://huggingface.co/settings/tokens).
- For “Add token as git credential? (Y/n)”:
  - Y if you plan to git clone models/repos.
  - n if you only use huggingface_hub downloads.
You should see: “Token is valid… saved to /root/.cache/huggingface/stored_tokens”.
The red line “Cannot authenticate through git-credential…” just means no Git credential helper is set. It’s safe to ignore.
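If you prefer not to paste the token interactively, for example in a provisioning script, the huggingface_hub Python API offers an equivalent. A minimal sketch, assuming your token is exported as the HF_TOKEN environment variable:
import os
from huggingface_hub import login

# Reads the token from the environment instead of an interactive prompt.
login(token=os.environ["HF_TOKEN"])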
Step 13: Connect to Your GPU VM with a Code Editor
Before you start running scripts with the EmbeddingGemma-300m model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 14: Create app.py and Add the Following Code
Create the file. From your VM terminal:
nano app.py
Or in VS Code (as in the screenshot), click New File → name it app.py.
Paste this code
from sentence_transformers import SentenceTransformer
import numpy as np
# Load the EmbeddingGemma-300M model (Google’s open embedding model)
model = SentenceTransformer("google/embeddinggemma-300m") # auto device (CPU/GPU)
# A sample query
query = "Which planet is known as the Red Planet?"
# A small list of candidate documents
docs = [
    "Venus is often called Earth's twin.",
    "Mars, with its reddish hue, is the Red Planet.",
    "Jupiter is the largest planet.",
    "Saturn has iconic rings."
]
# Encode the query → vector representation optimized for search
q = model.encode_query(query)
# Encode the documents → vector representations optimized for retrieval
D = model.encode_document(docs)
# Compute similarity between the query vector and each document vector
scores = model.similarity(q, D).squeeze().tolist()
# Pair each score with its document and sort (highest similarity first)
ranked = sorted(zip(scores, docs), reverse=True)
# Print top 3 results
print(ranked[:3])
What this file does (detailed)
- Imports: SentenceTransformer loads the EmbeddingGemma-300M model; numpy is imported for optional vector math (it isn’t strictly needed in this minimal demo).
- Model load: Loads the Google EmbeddingGemma-300M embedding model, which converts text into vectors (embeddings).
- Query + documents: Defines one query ("Which planet is known as the Red Planet?") and a small set of candidate sentences (our mini “document corpus”).
- Encoding: model.encode_query(query) creates a vector representation of the query; model.encode_document(docs) creates vector representations of the candidate docs. Using separate methods ensures query/document embeddings are tuned for retrieval.
- Similarity: model.similarity(q, D) computes how close each doc is to the query in vector space.
- Ranking: Sorts docs by similarity score (highest first). The result shows which document best answers the query.
- Output: Prints the top 3 results. You should see “Mars…” ranked highest, since it matches the Red Planet question.
In short: app.py is a minimal semantic search demo using EmbeddingGemma. It shows how to encode queries & docs, compute similarity, and rank results — the basic workflow behind search engines, chatbots, and RAG systems.
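For intuition, model.similarity computes, by default, cosine similarity between the vectors, which you can reproduce by hand with numpy (this is where the numpy import becomes useful). A minimal sketch, reusing q and D from app.py, which are numpy arrays by default:
import numpy as np

# Manual cosine similarity: normalize each vector to unit length, then take dot products.
q_norm = q / np.linalg.norm(q)
D_norm = D / np.linalg.norm(D, axis=1, keepdims=True)
manual_scores = D_norm @ q_norm
print(manual_scores)  # should closely match the scores computed via model.similarity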
Step 15: Run the Script
Run the script with the following command:
python3 app.py
This will download the model (on first run) and print the ranked results to the terminal.
Step 16: Create build_index.py and Add the Following Code
Create the file:
nano build_index.py
Or in VS Code → New File → name it build_index.py.
Paste the full code:
import os, json, argparse, numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer
import faiss

def read_corpus(folder):
    paths = []
    texts = []
    for p in Path(folder).rglob("*"):
        if p.suffix.lower() in {".txt", ".md"} and p.stat().st_size > 0:
            paths.append(str(p))
            texts.append(p.read_text(encoding="utf-8", errors="ignore"))
    return paths, texts

def mrl_truncate_and_norm(X, k):
    X = X[:, :k]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X.astype("float32")

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--data_dir", required=True, help="Folder with .txt/.md")
    ap.add_argument("--dim", type=int, default=768, choices=[768, 512, 256, 128])
    ap.add_argument("--out_dir", default="index")
    args = ap.parse_args()

    os.makedirs(args.out_dir, exist_ok=True)

    print("Loading model…")
    model = SentenceTransformer("google/embeddinggemma-300m")  # fp32/bf16 only

    print("Reading corpus…")
    paths, texts = read_corpus(args.data_dir)
    assert texts, "No .txt/.md files found"

    print(f"Encoding {len(texts)} docs…")
    D = model.encode_document(texts, batch_size=64, convert_to_numpy=True)

    # L2-normalize (cosine sim via inner product)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)

    if args.dim < 768:
        print(f"Applying Matryoshka truncation to {args.dim}…")
        D = mrl_truncate_and_norm(D, args.dim)

    index = faiss.IndexFlatIP(D.shape[1])
    index.add(D)

    faiss.write_index(index, f"{args.out_dir}/faiss_{args.dim}.index")
    np.save(f"{args.out_dir}/embeddings_{args.dim}.npy", D)
    with open(f"{args.out_dir}/mapping.json", "w") as f:
        json.dump(paths, f, indent=2)

    print(f"Saved index to {args.out_dir} (dim={args.dim}, N={len(texts)})")

if __name__ == "__main__":
    main()
What this script does
read_corpus(folder):
Reads all .txt and .md files in the given folder. Returns two lists:
- paths → file paths
- texts → file contents
mrl_truncate_and_norm(X, k):
Implements Matryoshka Representation Learning.
- Takes embeddings of size 768.
- Truncates them to a smaller dimension (512, 256, or 128).
- Re-normalizes them for cosine similarity search.
main():
- Parses arguments: --data_dir → where your text files are; --dim → embedding size (default 768); --out_dir → where to save the index (default index/).
- Loads the EmbeddingGemma-300M model.
- Reads all docs from your folder.
- Encodes them with model.encode_document().
- Normalizes vectors.
- Optionally shrinks them with MRL.
- Creates a FAISS index (cosine similarity using IndexFlatIP).
- Saves: faiss_<dim>.index → the FAISS index file; embeddings_<dim>.npy → numpy array of embeddings; mapping.json → file-path mapping to docs.
How to run it
Create some docs (if you don’t have any yet):
mkdir docs
echo "Mars is the Red Planet." > docs/mars.txt
echo "Venus is Earth's twin." > docs/venus.txt
echo "Jupiter is the largest planet." > docs/jupiter.txt
Run the script:
python3 build_index.py --data_dir ./docs
This will:
- Read your .txt files in docs/
- Encode them with EmbeddingGemma-300M
- Save an index under ./index/
Output example:
Loading model…
Reading corpus…
Encoding 3 docs…
Saved index to index (dim=768, N=3)
What you get after running
Inside the index/ folder:
- faiss_768.index → FAISS index file
- embeddings_768.npy → stored embeddings
- mapping.json → JSON mapping of file paths
In short: build_index.py prepares your text files into a searchable embedding index using EmbeddingGemma + FAISS.
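To actually search the index you just built, you can load the FAISS file and mapping back and embed a query the same way. The script below is a minimal sketch (the file name query_index.py and the DIM constant are illustrative, not part of the tutorial above); it assumes the index was built with the default 768 dimensions under ./index/:
# query_index.py — minimal search sketch against the index built above
import json
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

DIM = 768  # must match the --dim used when building the index

# Load the FAISS index and the file-path mapping saved by build_index.py
index = faiss.read_index(f"index/faiss_{DIM}.index")
with open("index/mapping.json") as f:
    paths = json.load(f)

model = SentenceTransformer("google/embeddinggemma-300m")

query = "Which planet is known as the Red Planet?"
q = model.encode_query(query)[:DIM]          # truncate if you indexed a smaller MRL dim
q = q / np.linalg.norm(q)                    # L2-normalize, as in build_index.py
q = q.astype("float32").reshape(1, -1)

scores, ids = index.search(q, 3)             # top-3 nearest documents by inner product
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.4f}  {paths[i]}")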
Conclusion
EmbeddingGemma-300M is a powerful yet lightweight open embedding model from Google DeepMind, designed for retrieval, semantic similarity, classification, clustering, and more — all while being efficient enough to run on laptops, desktops, or modest GPUs. In this guide, we walked through setting up a NodeShift GPU VM, installing dependencies, and building two core scripts:
- app.py for a quick semantic search demo using queries and documents.
- build_index.py for preparing and indexing your own text corpus with FAISS, ready for scalable search.
With these steps, you now have everything you need to integrate EmbeddingGemma into search pipelines, recommendation systems, or retrieval-augmented applications. Whether on-device or in the cloud, EmbeddingGemma-300M provides a practical and cost-effective foundation for embedding-based workflows.