EmbeddingGemma-300M is Google DeepMind’s lightweight, multilingual (100+ languages) embedding model built on Gemma 3/T5Gemma foundations. It outputs 768-dim vectors (with Matryoshka down-projections to 512/256/128) optimized for retrieval, classification, clustering, semantic similarity, QA, and code retrieval. It’s designed for low-resource / on-device use, loads via SentenceTransformers, and does not support float16—use FP32 or bfloat16.
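As a quick illustration, the Matryoshka down-projection can be selected directly when loading the model in SentenceTransformers; the 256-dim setting below is just an example (weights stay in the default FP32, since float16 is not supported):
from sentence_transformers import SentenceTransformer

# Load EmbeddingGemma-300M with 256-dim Matryoshka truncation.
model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=256)

emb = model.encode("EmbeddingGemma runs on-device.")
print(emb.shape)  # (256,)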
Evaluation
Benchmark Results
The model was evaluated against a large collection of different datasets and metrics to cover different aspects of text understanding.
Full Precision Checkpoint
MTEB (Multilingual, v2)

Dimensionality | Mean (Task) | Mean (TaskType)
---|---|---
768d | 61.15 | 54.31
512d | 60.71 | 53.89
256d | 59.68 | 53.01
128d | 58.23 | 51.77

MTEB (English, v2)

Dimensionality | Mean (Task) | Mean (TaskType)
---|---|---
768d | 68.36 | 64.15
512d | 67.80 | 63.59
256d | 66.89 | 62.94
128d | 65.09 | 61.56

MTEB (Code, v1)

Dimensionality | Mean (Task) | Mean (TaskType)
---|---|---
768d | 68.76 | 68.76
512d | 68.48 | 68.48
256d | 66.74 | 66.74
128d | 62.96 | 62.96
QAT Checkpoints
MTEB (Multilingual, v2)

Quant config (dimensionality) | Mean (Task) | Mean (TaskType)
---|---|---
Q4_0 (768d) | 60.62 | 53.61
Q8_0 (768d) | 60.93 | 53.95
Mixed Precision* (768d) | 60.69 | 53.82

MTEB (English, v2)

Quant config (dimensionality) | Mean (Task) | Mean (TaskType)
---|---|---
Q4_0 (768d) | 67.91 | 63.64
Q8_0 (768d) | 68.13 | 63.85
Mixed Precision* (768d) | 67.95 | 63.83

MTEB (Code, v1)

Quant config (dimensionality) | Mean (Task) | Mean (TaskType)
---|---|---
Q4_0 (768d) | 67.99 | 67.99
Q8_0 (768d) | 68.70 | 68.70
Mixed Precision* (768d) | 68.03 | 68.03

Note: QAT models are evaluated after quantization.
* Mixed Precision refers to per-channel quantization with int4 for embeddings, feedforward, and projection layers, and int8 for attention (e4_a8_f4_p4).
GPU/CPU Configuration Table
Scenario | Precision | Min VRAM / RAM | Recommended | Example Hardware | Context Length | Practical Notes
---|---|---|---|---|---|---
Laptop / Desktop (CPU only) | FP32 | 2–3 GB RAM | 4–8 GB RAM | 8-core CPU (Ryzen 7 / i7) | 2,048 | Easiest setup; batch 16–64 for docs; use BLAS/OMP; great for small/medium corpora.
Entry GPU | BF16 / FP32 | 4 GB VRAM | 6–8 GB VRAM | RTX 3050/3060 6–8 GB, T4 16 GB | 2,048 | Good price/perf; batch 128–256 docs; pin memory + DataLoader prefetch.
Standard single-GPU | BF16 / FP32 | 8 GB VRAM | 12–24 GB VRAM | RTX 3080/3090, L4 24 GB, A4000 | 2,048 | High-throughput ingestion; batch 512–1,024; keep outputs in FP32; enable half-precision weights if BF16 is supported.
Throughput server | BF16 / FP32 | 16 GB VRAM | 24–48 GB VRAM | L4 24 GB, A5000/A6000 | 2,048 | Production indexing; use multi-worker loaders; stream to a vector DB (FAISS/HNSW).
Multi-GPU pipeline* | BF16 / FP32 | 2×8 GB | 2×16 GB+ | 2×L4 / 2×A4000 | 2,048 | *The model itself doesn't need multi-GPU; shard batches across GPUs for line-rate ingestion.
On-device / Edge | FP32 | 2 GB RAM | 3–4 GB RAM | Apple M-series / modern ARM | 2,048 | Great for on-device similarity; smaller MRL dims (128–512) reduce memory + I/O.
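The batch sizes and precision notes in this table map directly onto the SentenceTransformers API. Below is a minimal sketch of high-throughput GPU encoding under the assumptions above; the batch size, bfloat16 weights, and the placeholder corpus are illustrative choices, not requirements:
import torch
from sentence_transformers import SentenceTransformer

# Load weights in bfloat16 on the GPU (float16 is not supported by this model).
model = SentenceTransformer(
    "google/embeddinggemma-300m",
    device="cuda" if torch.cuda.is_available() else "cpu",
    model_kwargs={"torch_dtype": torch.bfloat16},
)

docs = ["example document"] * 1024  # stand-in corpus

# Larger batches amortize per-call overhead; tune per the table above.
embeddings = model.encode_document(docs, batch_size=256, show_progress_bar=True)
print(embeddings.shape)  # numpy array, one row per document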
Use the following prompts based on your use case and input data type. These may already be available in the EmbeddingGemma configuration in your modeling framework of choice.
Use Case (task type enum) | Description | Recommended Prompt
---|---|---
Retrieval (Query) | Used to generate embeddings that are optimized for document search or information retrieval | task: search result \| query: {content}
Retrieval (Document) | | title: {title \| “none”} \| text: {content}
Question Answering | | task: question answering \| query: {content}
Fact Verification | | task: fact checking \| query: {content}
Classification | Used to generate embeddings that are optimized to classify texts according to preset labels | task: classification \| query: {content}
Clustering | Used to generate embeddings that are optimized to cluster texts based on their similarities | task: clustering \| query: {content}
Semantic Similarity | Used to generate embeddings that are optimized to assess text similarity. This is not intended for retrieval use cases. | task: sentence similarity \| query: {content}
Code Retrieval | Used to retrieve a code block based on a natural language query, such as sort an array or reverse a linked list. Embeddings of the code blocks are computed using retrieval_document. | task: code retrieval \| query: {content}
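If your framework does not apply these prompts automatically, you can prepend them to the input text yourself before encoding. Below is a minimal sketch for the clustering prompt; the retrieval prompts are applied for you by encode_query and encode_document, which are used later in this guide:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

texts = [
    "The stock market fell today.",
    "Tech shares dropped sharply.",
    "I love hiking in the Alps.",
]

# Prepend the clustering prompt from the table above to each input.
prompted = [f"task: clustering | query: {t}" for t in texts]
embeddings = model.encode(prompted)
print(embeddings.shape)  # (3, 768)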
Resources
Link: https://huggingface.co/google/embeddinggemma-300m
Step-by-Step Process to Install & Run EmbeddingGemma-300m Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running EmbeddingGemma-300m, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like EmbeddingGemma-300m
- Compatibility with CUDA 12.1.1, required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like EmbeddingGemma-300m.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that EmbeddingGemma-300m runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available.
Check if pip exists:
pip3 --version
If you get an error like command not found, then install pip manually.
Install pip via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This will download and install pip into your system.
You may see a warning about running as root — that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv gemma
source gemma/bin/activate
Step 10: Install Dependencies
Run the following command to install dependencies:
pip install -U sentence-transformers faiss-cpu
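Optionally, before moving on, you can confirm that PyTorch (installed as a dependency of sentence-transformers) can see the GPU. A minimal check:
import torch

# Should print True followed by the GPU name (e.g. the RTX A6000 on this VM).
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))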
Step 11: Install Hugging Face Hub
Run the following command to install huggingface_hub:
pip install -U huggingface_hub
Step 12: Log in to Hugging Face (CLI)
Run the following command to log in to Hugging Face:
huggingface-cli login
- When prompted, paste your HF token (from https://huggingface.co/settings/tokens).
- For “Add token as git credential? (Y/n)”:
  - Y if you plan to git clone models/repos.
  - n if you only use huggingface_hub downloads.
You should see: “Token is valid… saved to /root/.cache/huggingface/stored_tokens”.
The red line “Cannot authenticate through git-credential…” just means no Git credential helper is set. It’s safe to ignore.
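If you prefer not to paste the token interactively, for example in a provisioning script, the huggingface_hub Python API offers an equivalent. A minimal sketch, assuming your token is exported as the HF_TOKEN environment variable:
import os
from huggingface_hub import login

# Reads the token from the environment instead of an interactive prompt.
login(token=os.environ["HF_TOKEN"])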
Step 13: Connect to Your GPU VM with a Code Editor
Before you start running scripts with the EmbeddingGemma-300m model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 14: Create app.py and Add the Following Code
Create the file. From your VM terminal:
nano app.py
Or in VS Code (as in the screenshot), click New File → name it app.py.
Paste this code
from sentence_transformers import SentenceTransformer
import numpy as np
# Load the EmbeddingGemma-300M model (Google’s open embedding model)
model = SentenceTransformer("google/embeddinggemma-300m") # auto device (CPU/GPU)
# A sample query
query = "Which planet is known as the Red Planet?"
# A small list of candidate documents
docs = [
    "Venus is often called Earth's twin.",
    "Mars, with its reddish hue, is the Red Planet.",
    "Jupiter is the largest planet.",
    "Saturn has iconic rings."
]
# Encode the query → vector representation optimized for search
q = model.encode_query(query)
# Encode the documents → vector representations optimized for retrieval
D = model.encode_document(docs)
# Compute similarity between the query vector and each document vector
scores = model.similarity(q, D).squeeze().tolist()
# Pair each score with its document and sort (highest similarity first)
ranked = sorted(zip(scores, docs), reverse=True)
# Print top 3 results
print(ranked[:3])
What this file does (detailed)
- Imports: SentenceTransformer loads the EmbeddingGemma-300M model; numpy is imported for optional vector math (it isn’t strictly needed in this minimal demo).
- Model load: Loads the Google EmbeddingGemma-300M embedding model, which converts text into vectors (embeddings).
- Query + documents: Defines one query ("Which planet is known as the Red Planet?") and a small set of candidate sentences (our mini “document corpus”).
- Encoding: model.encode_query(query) creates a vector representation of the query; model.encode_document(docs) creates vector representations of the candidate docs. Using separate methods ensures query/document embeddings are tuned for retrieval.
- Similarity: model.similarity(q, D) computes how close each doc is to the query in vector space.
- Ranking: Sorts docs by similarity score (highest first). The result shows which document best answers the query.
- Output: Prints the top 3 results. You should see “Mars…” ranked highest, since it matches the Red Planet question.
In short: app.py is a minimal semantic search demo using EmbeddingGemma. It shows how to encode queries & docs, compute similarity, and rank results — the basic workflow behind search engines, chatbots, and RAG systems.
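For intuition, model.similarity computes, by default, cosine similarity between the vectors, which you can reproduce by hand with numpy (this is where the numpy import becomes useful). A minimal sketch, reusing q and D from app.py, which are numpy arrays by default:
import numpy as np

# Manual cosine similarity: normalize each vector to unit length, then take dot products.
q_norm = q / np.linalg.norm(q)
D_norm = D / np.linalg.norm(D, axis=1, keepdims=True)
manual_scores = D_norm @ q_norm
print(manual_scores)  # should closely match the scores computed via model.similarity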
Step 15: Run the Script
Run the script with the following command:
python3 app.py
This will download the model (on first run) and print the ranked results to the terminal.
Step 16: Create build_index.py and Add the Following Code
Create the file:
nano build_index.py
Or in VS Code → New File → name it build_index.py.
Paste the full code:
import os, json, argparse, numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer
import faiss

def read_corpus(folder):
    paths = []
    texts = []
    for p in Path(folder).rglob("*"):
        if p.suffix.lower() in {".txt", ".md"} and p.stat().st_size > 0:
            paths.append(str(p))
            texts.append(p.read_text(encoding="utf-8", errors="ignore"))
    return paths, texts

def mrl_truncate_and_norm(X, k):
    X = X[:, :k]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X.astype("float32")

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--data_dir", required=True, help="Folder with .txt/.md")
    ap.add_argument("--dim", type=int, default=768, choices=[768, 512, 256, 128])
    ap.add_argument("--out_dir", default="index")
    args = ap.parse_args()

    os.makedirs(args.out_dir, exist_ok=True)

    print("Loading model…")
    model = SentenceTransformer("google/embeddinggemma-300m")  # fp32/bf16 only

    print("Reading corpus…")
    paths, texts = read_corpus(args.data_dir)
    assert texts, "No .txt/.md files found"

    print(f"Encoding {len(texts)} docs…")
    D = model.encode_document(texts, batch_size=64, convert_to_numpy=True)

    # L2-normalize (cosine sim via inner product)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)

    if args.dim < 768:
        print(f"Applying Matryoshka truncation to {args.dim}…")
        D = mrl_truncate_and_norm(D, args.dim)

    index = faiss.IndexFlatIP(D.shape[1])
    index.add(D)

    faiss.write_index(index, f"{args.out_dir}/faiss_{args.dim}.index")
    np.save(f"{args.out_dir}/embeddings_{args.dim}.npy", D)
    with open(f"{args.out_dir}/mapping.json", "w") as f:
        json.dump(paths, f, indent=2)

    print(f"Saved index to {args.out_dir} (dim={args.dim}, N={len(texts)})")

if __name__ == "__main__":
    main()
What this script does
read_corpus(folder):
Reads all .txt and .md files in the given folder. Returns two lists:
- paths → file paths
- texts → file contents
mrl_truncate_and_norm(X, k):
Implements Matryoshka Representation Learning.
- Takes embeddings of size 768.
- Truncates them to a smaller dimension (512, 256, or 128).
- Re-normalizes them for cosine similarity search.
main():
- Parses arguments: --data_dir → where your text files are; --dim → embedding size (default 768); --out_dir → where to save the index (default index/).
- Loads the EmbeddingGemma-300M model.
- Reads all docs from your folder.
- Encodes them with model.encode_document().
- Normalizes vectors.
- Optionally shrinks them with MRL.
- Creates a FAISS index (cosine similarity using IndexFlatIP).
- Saves: faiss_<dim>.index → the FAISS index file; embeddings_<dim>.npy → numpy array of embeddings; mapping.json → file-path mapping to docs.
How to run it
Create some docs (if you don’t have any yet):
mkdir docs
echo "Mars is the Red Planet." > docs/mars.txt
echo "Venus is Earth's twin." > docs/venus.txt
echo "Jupiter is the largest planet." > docs/jupiter.txt
Run the script:
python3 build_index.py --data_dir ./docs
This will:
- Read your .txt files in docs/
- Encode them with EmbeddingGemma-300M
- Save an index under ./index/
Output example:
Loading model…
Reading corpus…
Encoding 3 docs…
Saved index to index (dim=768, N=3)
What you get after running
Inside the index/ folder:
- faiss_768.index → FAISS index file
- embeddings_768.npy → stored embeddings
- mapping.json → JSON mapping of file paths
In short: build_index.py prepares your text files into a searchable embedding index using EmbeddingGemma + FAISS.
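To actually search the index you just built, you can load the FAISS file and mapping back and embed a query the same way. The script below is a minimal sketch (the file name query_index.py and the DIM constant are illustrative, not part of the tutorial above); it assumes the index was built with the default 768 dimensions under ./index/:
# query_index.py — minimal search sketch against the index built above
import json
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

DIM = 768  # must match the --dim used when building the index

# Load the FAISS index and the file-path mapping saved by build_index.py
index = faiss.read_index(f"index/faiss_{DIM}.index")
with open("index/mapping.json") as f:
    paths = json.load(f)

model = SentenceTransformer("google/embeddinggemma-300m")

query = "Which planet is known as the Red Planet?"
q = model.encode_query(query)[:DIM]          # truncate if you indexed a smaller MRL dim
q = q / np.linalg.norm(q)                    # L2-normalize, as in build_index.py
q = q.astype("float32").reshape(1, -1)

scores, ids = index.search(q, 3)             # top-3 nearest documents by inner product
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.4f}  {paths[i]}")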
Conclusion
EmbeddingGemma-300M is a powerful yet lightweight open embedding model from Google DeepMind, designed for retrieval, semantic similarity, classification, clustering, and more — all while being efficient enough to run on laptops, desktops, or modest GPUs. In this guide, we walked through setting up a NodeShift GPU VM, installing dependencies, and building two core scripts:
- app.py for a quick semantic search demo using queries and documents.
- build_index.py for preparing and indexing your own text corpus with FAISS, ready for scalable search.
With these steps, you now have everything you need to integrate EmbeddingGemma into search pipelines, recommendation systems, or retrieval-augmented applications. Whether on-device or in the cloud, EmbeddingGemma-300M provides a practical and cost-effective foundation for embedding-based workflows.