mmBERT (by JHU CLSP) is a modern multilingual encoder (≈307M params) trained on 3T+ tokens across 1,800+ languages. Built on the ModernBERT family, it brings fast inference (FlashAttention-2/unpadding in the official recipe), 8K context, and state-of-the-art cross-lingual performance on classification, embeddings, retrieval, and reranking. It also introduces training tricks like inverse mask scheduling, inverse temperature sampling, and progressive language addition, which especially help low-resource languages in the decay phase. Use it as:
- a Masked-LM (fill-mask) for language understanding,
- a feature extractor for multilingual embeddings & retrieval,
- a backbone for classification/reranking fine-tuning.
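For a quick first taste of the first two modes before the full walkthrough below, here is a minimal sketch using the Hugging Face transformers pipeline API (it assumes torch and transformers are already installed; the model downloads on first use):
# Minimal sketch: fill-mask and feature extraction with mmBERT-base.
from transformers import pipeline

MODEL_ID = "jhu-clsp/mmBERT-base"

# 1) Masked-LM: predict the masked word (use the tokenizer's own mask token).
fill = pipeline("fill-mask", model=MODEL_ID)
masked = f"The capital of France is {fill.tokenizer.mask_token}."
print(fill(masked)[:3])  # top guesses

# 2) Embeddings: token features you can mean-pool yourself (shown in full later).
extract = pipeline("feature-extraction", model=MODEL_ID)
feats = extract("Artificial intelligence is transforming technology")
print(len(feats[0]), len(feats[0][0]))  # (num_tokens, hidden_size=768)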
Model Architecture
| Parameter | mmBERT-small | mmBERT-base |
|---|---|---|
| Layers | 22 | 22 |
| Hidden Size | 384 | 768 |
| Intermediate Size | 1152 | 1152 |
| Attention Heads | 6 | 12 |
| Total Parameters | 140M | 307M |
| Non-embedding Parameters | 42M | 110M |
| Max Sequence Length | 8192 | 8192 |
| Vocabulary Size | 256,000 | 256,000 |
| Tokenizer | Gemma 2 | Gemma 2 |
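If you want to double-check these numbers yourself, a small sketch like the following (assuming transformers and torch are installed) loads the config and counts parameters:
# Sanity-check the architecture numbers in the table above.
from transformers import AutoConfig, AutoModel

cfg = AutoConfig.from_pretrained("jhu-clsp/mmBERT-base")
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads,
      cfg.intermediate_size, cfg.max_position_embeddings, cfg.vocab_size)

model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base")
total = sum(p.numel() for p in model.parameters())
print(f"total parameters ≈ {total / 1e6:.0f}M")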
Training Data
mmBERT was trained on a carefully curated 3T+ token multilingual dataset, and the data for each training phase is publicly available.
GPU configuration (mmBERT-base, ~307M params, max seq 8192)
| Scenario | Precision | Seq Len | Typical Batch | Min VRAM | Comfortable VRAM | Example GPUs | Notes |
|---|---|---|---|---|---|---|---|
| Embeddings / feature extraction (short text) | bf16 / fp16 | 256–512 | 64–256 | 6–8 GB | 12–16 GB | T4-16G, RTX 3060-12G, L4-24G | Turn on torch_dtype="auto", use padding/truncation; increase batch until near the VRAM limit. |
| Fill-mask / MLM inference (short) | bf16 / fp16 | 128–512 | 32–128 | 6–8 GB | 12–16 GB | T4-16G, 3060-12G, L4-24G | For multiple <mask> tokens, keep the batch modest. |
| Long-doc embeddings | bf16 / fp16 | 2048–4096 | 8–32 | 12–16 GB | 24–40 GB | L4-24G, 4090-24G, A100-40G | Activations scale with sequence length; drop the batch if you hit OOM. |
| 8K-context inference | bf16 | 8192 | 1–8 | 24 GB | 40–48 GB | 4090-24G (batch=1–2), L40S-48G, A100-40G | If VRAM is tight, run gradient-free inference and reduce the batch. |
| Sentence-Transformers training (DPR/CL, short pairs) | bf16 | 128–512 | 256–512 | 24 GB | 40–48 GB | 4090-24G, L40S-48G, A100-40G | Use bf16, gradient accumulation, and mixed precision; increase the batch until stable. |
| Cross-encoder reranking training (pairs) | bf16 | 512–1024 | 16–64 | 16–24 GB | 32–40 GB | 3090-24G, 4090-24G, A100-40G | A cross-encoder doubles the token count (query+doc); start small and use gradient accumulation. |
| Sequence classification fine-tuning | bf16 | 512–2048 | 32–128 | 12–16 GB | 24–40 GB | T4-16G (smaller batch), L4-24G, A100-40G | Increase the batch gradually; enable mixed precision and gradient checkpointing if needed. |
| 8K-context fine-tuning (specialized) | bf16 | 8192 | 2–8 | 40 GB | 80 GB+ | A100-80G, H100-80G | Use gradient checkpointing + accumulation; consider multi-GPU DDP. |
| Multi-GPU DDP (any of the above) | bf16 | varies | scales | per-GPU as above | scales | 2×L4-24G, 2×A100-40G, 2×H100-80G | Scale the batch linearly across GPUs; set gradient_accumulation_steps to keep the global batch constant. |
Rules of Thumb
- Memory math (very rough): params (~307M) in bf16 ≈ 0.6–0.7 GB; activations dominate with longer seq_len × batch.
- If OOM: lower seq_len first, then batch, then try fp16; for training add grad checkpointing and grad accumulation.
- Throughput: enable torch.backends.cuda.matmul.allow_tf32 = True on Ampere or newer; pin memory and use multiple DataLoader workers.
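To make the rough memory math and throughput settings above concrete, here is a minimal sketch (the ~0.6 GB figure counts weights only; activations come on top and scale with seq_len × batch):
# Rough memory math + throughput settings from the rules of thumb above.
import torch
from torch.utils.data import DataLoader, TensorDataset

# ~307M parameters at 2 bytes each (bf16/fp16) -> roughly 0.6 GB just for the weights.
params = 307e6
print(f"bf16 weights ≈ {params * 2 / 1024**3:.2f} GB")

# TF32 matmuls help on Ampere (SM 8.0) and newer GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Illustrative DataLoader with pinned memory and workers (dummy dataset, never iterated here).
ds = TensorDataset(torch.zeros(128, 512, dtype=torch.long))
dl = DataLoader(ds, batch_size=32, pin_memory=True, num_workers=4)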
Resources
Link: https://huggingface.co/jhu-clsp/mmBERT-base
Note
Start by validating the model with the simple script, then graduate to an API, and finally a UI:
- Local script: create and activate a venv and install the dependencies (pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio, then pip install transformers sentencepiece), save app.py (the SDPA-only script from Step 13), and run python3 app.py to confirm the GPU/dtype printout, the top-5 <mask> predictions, and the cosine-similarity matrix. This proves the weights, tokenizer, and CUDA are all good.
- API service: install pip install fastapi uvicorn[standard], save server.py (the embed + mlm endpoints), launch uvicorn server:app --host 0.0.0.0 --port 7860, and smoke-test with curl against /healthz, /embed, and /mlm; you should get JSON with embeddings or token guesses.
- Web UI: once the API is solid, install pip install streamlit numpy pandas, save streamlit_mmbert.py (the version that casts embeddings to float32 to avoid bf16/NumPy issues), and run streamlit run streamlit_mmbert.py to interactively paste multilingual text, set max tokens, compute embeddings with a similarity table, and try masked-LM.
All three steps (script → FastAPI → Streamlit) give you a clean progression from local correctness, to programmable endpoints, to a friendly front end.
Step-by-Step Process to Install & Run mmBERT-base Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running mmBERT-base, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like mmBERT-base.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like mmBERT-base.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the mmBERT-base runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Prepare the System and Create a Python 3.10 Virtual Environment
Run the following commands to update the system, then create and activate a Python 3.10 virtual environment:
sudo apt-get update -y
sudo apt-get install -y python3-venv git build-essential
python3 -m venv mmbert
source mmbert/bin/activate
python -m pip install --upgrade pip
Step 9: Install PyTorch
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio
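Before moving on, you can optionally verify the install with a quick check (a minimal snippet, run inside the activated venv):
# Verify that PyTorch can see the GPU before continuing.
import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))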
Step 10: Install Dependencies
Run the following commands to install dependencies:
pip install "transformers>=4.44" accelerate sentencepiece datasets scipy scikit-learn
Step 11: Install Faiss
Run the following command to install faiss:
pip install faiss-cpu
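FAISS isn't used again until you build a retrieval index on top of the embeddings (a fuller sketch follows the Conclusion); as a quick, hypothetical smoke test that the install works, you can index a few random vectors:
# Smoke test: faiss-cpu imports and can index/search vectors.
import faiss
import numpy as np

d = 768                       # mmBERT-base hidden size
xb = np.random.rand(100, d).astype("float32")
faiss.normalize_L2(xb)        # cosine similarity via inner product on unit vectors

index = faiss.IndexFlatIP(d)  # exact inner-product index
index.add(xb)
scores, ids = index.search(xb[:1], 5)
print(ids[0], scores[0])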
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running the model scripts with mmBERT-base, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Create the Script (app.py) and Paste the Code
Create the file and paste the code:
# app.py
# Run: python3 app.py
# Notes: This version intentionally avoids FlashAttention-2 and uses SDPA.
#        Works on CUDA and CPU. On CUDA, it auto-picks a safe dtype.

from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM
import torch

MODEL_ID = "jhu-clsp/mmBERT-base"

# -----------------------
# Device & dtype selection
# -----------------------
def pick_device_and_dtype():
    if torch.cuda.is_available():
        cap = torch.cuda.get_device_capability()
        # Ampere (8.0) or newer -> prefer bfloat16; else fp16
        if cap[0] >= 8:
            return "cuda", torch.bfloat16
        else:
            return "cuda", torch.float16
    return "cpu", torch.float32

device, dtype = pick_device_and_dtype()

# Enable TF32 where it helps (CUDA only)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

print(f"[init] device={device}, dtype={dtype}")

# -----------------------
# Load tokenizer & models
# -----------------------
tok = AutoTokenizer.from_pretrained(MODEL_ID)

# Force SDPA to avoid FlashAttention-2 entirely
common_kwargs = dict(dtype=dtype, attn_implementation="sdpa")

# Masked LM head (for <mask> predictions)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL_ID, **common_kwargs).to(device)

# Plain encoder (for embeddings)
enc = AutoModel.from_pretrained(MODEL_ID, **common_kwargs).to(device)

# -----------------------
# Utils
# -----------------------
def ensure_mask_token(tokenizer):
    """
    Ensure we have a mask token. mmBERT uses the Gemma 2 tokenizer, which includes <mask>.
    If not found for some reason, we add one (rare).
    """
    if tokenizer.mask_token_id is None:
        tokenizer.add_special_tokens({"mask_token": "<mask>"})
        # Resize embeddings if we ever modify the tokenizer (shouldn't be needed normally)
        mlm.resize_token_embeddings(len(tokenizer))
        enc.resize_token_embeddings(len(tokenizer))
    return tokenizer.mask_token, tokenizer.mask_token_id

def mean_pool(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# -----------------------
# Demos
# -----------------------
def demo_mlm():
    mask_token, _ = ensure_mask_token(tok)
    samples = [
        f"The capital of France is {mask_token}.",
        f"La capital de España es {mask_token}.",
        f"Die Hauptstadt von Deutschland ist {mask_token}.",
    ]
    print("\n[MLM] Top-5 predictions per sample:")
    for text in samples:
        ids = tok(text, return_tensors="pt").to(device)
        with torch.inference_mode():
            logits = mlm(**ids).logits
        mask_positions = (ids.input_ids == tok.mask_token_id).nonzero(as_tuple=True)
        # Handle possible multiple masks; take first position for display
        if mask_positions[0].numel() == 0:
            print(f" (no mask found in: {text})")
            continue
        pred = logits[mask_positions].softmax(-1)
        topk = torch.topk(pred, k=5, dim=-1).indices[0].tolist()
        tokens = [tok.decode(i).strip() for i in topk]
        print(f" {text} -> {tokens}")

def demo_embeddings():
    texts = [
        "Artificial intelligence is transforming technology",
        "La inteligencia artificial está transformando la tecnología",
        "L'intelligence artificielle transforme la technologie",
        "人工智能正在改变技术",
    ]
    ids = tok(texts, padding=True, truncation=True, max_length=512, return_tensors="pt").to(device)
    with torch.inference_mode():
        last = enc(**ids).last_hidden_state
        emb = mean_pool(last, ids.attention_mask)
        emb = torch.nn.functional.normalize(emb, p=2, dim=1)
    # Cosine similarity (4x4)
    sim = emb @ emb.T
    print("\n[Embeddings] Cosine similarity matrix:")
    for row in sim.tolist():
        print(" " + " ".join(f"{v:6.3f}" for v in row))

if __name__ == "__main__":
    demo_mlm()
    demo_embeddings()
    print("\n[done] Model, tokenizer, and CUDA all look good. Next step: wrap this in a FastAPI service (server.py below).")
What This Script Does
- Auto-selects device & precision
  - Uses GPU if available; otherwise CPU.
  - On Ampere/Ada/Hopper GPUs it uses bfloat16; on older CUDA GPUs it uses float16; CPU uses float32.
  - Enables TF32 matmuls for extra speed on NVIDIA GPUs.
- Loads mmBERT-base twice
  - As a Masked-Language-Model (MLM) to predict the word for <mask>.
  - As a plain encoder to produce sentence embeddings.
- Forces a safe attention backend
  - Sets attn_implementation="sdpa" so it does not require FlashAttention-2 (no custom CUDA kernels).
- Utility helpers
  - ensure_mask_token guarantees the tokenizer has a <mask> token.
  - mean_pool turns token features into a fixed-size sentence embedding by attention-mask-weighted averaging.
- Two quick demos (printed to your terminal)
  - MLM demo: For three sample sentences (EN/ES/DE) containing <mask>, prints the top-5 predicted tokens (e.g., “Paris”, “Madrid”, “Berlin”…).
  - Embeddings demo: Encodes four multilingual sentences, L2-normalizes them, then prints a 4×4 cosine-similarity matrix showing cross-lingual closeness.
- End message
  - Prints a final [done] line so you know it finished successfully.
Tip: Edit the samples and texts lists to try your own sentences, or increase max_length (e.g., 2048) if you need longer inputs and your GPU has enough VRAM.
Step 14: Install Python 3.10 Toolchain + Headers
Run the following commands to install the Python 3.10 toolchain and headers:
sudo apt update
sudo apt install -y \
build-essential \
python3.10 python3.10-venv python3.10-dev python3-pip \
python3-dev
Step 15: Run the Script
Run the script with the following command:
python3 app.py
This will download the model and print the demo output in the terminal.
Step 16: Install the Server Dependencies
Run the following command to install server dependencies:
pip install fastapi uvicorn[standard]
Step 17: Create the Script (server.py) and Paste the Code
Create the file and paste the code:
# server.py
# Run: uvicorn server:app --host 0.0.0.0 --port 7860

from fastapi import FastAPI
from pydantic import BaseModel, Field
from typing import List, Optional
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

MODEL_ID = "jhu-clsp/mmBERT-base"

def pick_device_and_dtype():
    if torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        return "cuda", (torch.bfloat16 if major >= 8 else torch.float16)
    return "cpu", torch.float32

device, dtype = pick_device_and_dtype()
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

tok = AutoTokenizer.from_pretrained(MODEL_ID)
common_kwargs = dict(dtype=dtype, attn_implementation="sdpa")
enc = AutoModel.from_pretrained(MODEL_ID, **common_kwargs).to(device)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL_ID, **common_kwargs).to(device)

def mean_pool(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

class EmbedReq(BaseModel):
    texts: List[str] = Field(..., description="Texts to embed")
    max_length: int = Field(512, ge=8, le=8192)
    normalize: bool = True

class EmbedResp(BaseModel):
    embeddings: List[List[float]]

class MLMReq(BaseModel):
    text: str = Field(..., description="Text containing one <mask> token")
    k: int = Field(5, ge=1, le=50)

class MLMResp(BaseModel):
    predictions: List[str]

app = FastAPI(title="mmBERT Service", version="1.0")

@app.get("/healthz")
def healthz():
    return {"ok": True, "device": device, "dtype": str(dtype)}

@app.post("/embed", response_model=EmbedResp)
@torch.inference_mode()
def embed(req: EmbedReq):
    ids = tok(
        req.texts, padding=True, truncation=True,
        max_length=req.max_length, return_tensors="pt"
    ).to(device)
    last = enc(**ids).last_hidden_state
    emb = mean_pool(last, ids.attention_mask)
    if req.normalize:
        emb = torch.nn.functional.normalize(emb, p=2, dim=1)
    return {"embeddings": emb.detach().cpu().tolist()}

@app.post("/mlm", response_model=MLMResp)
@torch.inference_mode()
def mlm_predict(req: MLMReq):
    if tok.mask_token_id is None:
        tok.add_special_tokens({"mask_token": "<mask>"})
        mlm.resize_token_embeddings(len(tok))
        enc.resize_token_embeddings(len(tok))
    ids = tok(req.text, return_tensors="pt").to(device)
    logits = mlm(**ids).logits
    mask_pos = (ids.input_ids == tok.mask_token_id).nonzero(as_tuple=True)
    if mask_pos[0].numel() == 0:
        return {"predictions": []}
    probs = logits[mask_pos].softmax(-1)
    top_ids = torch.topk(probs, k=req.k, dim=-1).indices[0].tolist()
    toks = [tok.decode(t).strip() for t in top_ids]
    return {"predictions": toks}
What This Script Does
Boots a FastAPI service for mmBERT-base. On start, it loads the tokenizer and two models: the encoder (for embeddings) and the masked-LM head (for <mask> predictions). It forces SDPA attention (no FlashAttention-2 needed).
Auto-selects compute: uses GPU if available, with bfloat16 on Ampere/Ada/Hopper (otherwise fp16); CPU falls back to fp32. Enables TF32 matmuls for extra GPU speed.
Endpoints:
- GET /healthz — quick health check returning {"ok": true, "device": "...", "dtype": "..."}.
- POST /embed — input: {"texts":[...], "max_length":512, "normalize":true}. It tokenizes, runs the encoder, mean-pools tokens using the attention mask to get sentence embeddings, optionally L2-normalizes, and returns {"embeddings": [[...], ...]}.
- POST /mlm — input: {"text":"The capital of France is <mask>.", "k":5}. It ensures a <mask> token exists, runs the masked-LM, and returns the top-k token predictions as strings.
Utility: a small mean_pool helper for clean sentence embeddings; the <mask> safety logic will resize embeddings if a mask token must be added.
Step 18: Run it
Run it with the following command:
uvicorn server:app --host 0.0.0.0 --port 7860
Step 19: Quick Tests
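As a minimal sketch (assuming the server from Step 18 is running on localhost:7860), you can smoke-test the three endpoints with Python's standard library; equivalent curl calls against /healthz, /embed, and /mlm work just as well.
# quick_tests.py -- smoke-test the FastAPI endpoints (assumes uvicorn is serving on port 7860).
import json
import urllib.request

BASE = "http://localhost:7860"

def get(path):
    with urllib.request.urlopen(BASE + path) as r:
        return json.loads(r.read())

def post(path, payload):
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())

print(get("/healthz"))  # {"ok": true, "device": "...", "dtype": "..."}

emb = post("/embed", {"texts": ["Hello world", "Hola mundo"], "max_length": 512, "normalize": True})
print(len(emb["embeddings"]), len(emb["embeddings"][0]))  # 2 vectors of size 768

mlm = post("/mlm", {"text": "The capital of France is <mask>.", "k": 5})
print(mlm["predictions"])  # token guesses, e.g. "Paris" near the top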
Step 20: Install the Streamlit Dependencies
Run the following command to install the Streamlit dependencies:
pip install streamlit numpy pandas
Step 21: Create the Streamlit Script (streamlit_mmbert.py)
Create the file streamlit_mmbert.py and add the following code:
import streamlit as st
import torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

MODEL_ID = "jhu-clsp/mmBERT-base"

# -----------------------
# Device & dtype helpers
# -----------------------
def pick_device_and_dtype():
    if torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        return "cuda", (torch.bfloat16 if major >= 8 else torch.float16)
    return "cpu", torch.float32

device, dtype = pick_device_and_dtype()
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# -----------------------
# Cache & load once
# -----------------------
@st.cache_resource(show_spinner=True)
def load_models():
    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    common = dict(dtype=dtype, attn_implementation="sdpa")  # no FlashAttention-2
    enc = AutoModel.from_pretrained(MODEL_ID, **common).to(device).eval()
    mlm = AutoModelForMaskedLM.from_pretrained(MODEL_ID, **common).to(device).eval()
    # Ensure <mask> exists (should already be present; this is a safety net)
    if tok.mask_token_id is None:
        tok.add_special_tokens({"mask_token": "<mask>"})
        enc.resize_token_embeddings(len(tok))
        mlm.resize_token_embeddings(len(tok))
    return tok, enc, mlm

tok, enc, mlm = load_models()

# -----------------------
# Utils
# -----------------------
def mean_pool(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

def compute_embeddings(texts, max_len=512, normalize=True):
    ids = tok(texts, padding=True, truncation=True, max_length=max_len, return_tensors="pt").to(device)
    with torch.inference_mode():
        last = enc(**ids).last_hidden_state
        emb = mean_pool(last, ids.attention_mask)
        if normalize:
            emb = torch.nn.functional.normalize(emb, p=2, dim=1)
    # Cast to float32 so NumPy/Streamlit can handle it (avoids BF16 error)
    return emb.detach().to(torch.float32).cpu().numpy()

def mlm_topk(text, k=5):
    ids = tok(text, return_tensors="pt").to(device)
    with torch.inference_mode():
        logits = mlm(**ids).logits
    mask_pos = (ids.input_ids == tok.mask_token_id).nonzero(as_tuple=True)
    if mask_pos[0].numel() == 0:
        return []
    probs = torch.softmax(logits[mask_pos], dim=-1)
    top_ids = torch.topk(probs, k=k, dim=-1).indices[0].tolist()
    return [tok.decode(t).strip() for t in top_ids]

# -----------------------
# UI
# -----------------------
st.set_page_config(page_title="mmBERT Streamlit", page_icon="🧠", layout="wide")
st.title("🧠 mmBERT-base (ModernBERT) — Streamlit")
st.caption(f"Device: **{device}**, dtype: **{dtype}**, attention: **SDPA** (FlashAttention disabled)")

tab_embed, tab_mlm, tab_info = st.tabs(["🔎 Embeddings", "🧩 Masked-LM", "ℹ️ Info"])

with tab_embed:
    st.subheader("Compute sentence embeddings")
    default_texts = (
        "Artificial intelligence is transforming technology\n"
        "La inteligencia artificial está transformando la tecnología\n"
        "L'intelligence artificielle transforme la technologie\n"
        "人工智能正在改变技术"
    )
    texts_str = st.text_area("One text per line:", value=default_texts, height=200)
    max_len = st.slider("Max tokens", 32, 8192, 512, step=32)
    normalize = st.checkbox("L2-normalize embeddings", value=True)
    if st.button("Compute embeddings"):
        texts = [t.strip() for t in texts_str.splitlines() if t.strip()]
        if not texts:
            st.warning("Please enter at least one line of text.")
        else:
            with st.spinner("Embedding…"):
                embs = compute_embeddings(texts, max_len=max_len, normalize=normalize)
            st.success(f"Done! Shape: {embs.shape} (rows=texts, cols=features)")
            # Cosine similarity
            sims = (embs @ embs.T) / (
                np.linalg.norm(embs, axis=1, keepdims=True) *
                np.linalg.norm(embs, axis=1, keepdims=True).T + 1e-9
            )
            df = pd.DataFrame(np.round(sims, 3),
                              index=[f"t{i+1}" for i in range(len(texts))],
                              columns=[f"t{i+1}" for i in range(len(texts))])
            st.write("Cosine similarity matrix:")
            st.dataframe(df, use_container_width=True)
            # Downloads
            emb_df = pd.DataFrame(embs)
            st.download_button(
                "⬇️ Download embeddings (CSV)",
                emb_df.to_csv(index=False).encode("utf-8"),
                file_name="embeddings.csv",
                mime="text/csv",
            )

with tab_mlm:
    st.subheader("Predict masked tokens")
    st.caption("Use the `<mask>` token in your text. Example: `The capital of France is <mask>.`")
    text = st.text_input("Text with one <mask>:", value="The capital of France is <mask>.")
    k = st.slider("Top-k", 1, 50, 5)
    if st.button("Predict"):
        if "<mask>" not in text:
            st.error("Please include the `<mask>` token in your text.")
        else:
            with st.spinner("Running masked-LM…"):
                preds = mlm_topk(text, k=k)
            if preds:
                st.success("Top-k predictions:")
                st.write(preds)
            else:
                st.warning("No `<mask>` token found after tokenization or unexpected input.")

with tab_info:
    st.markdown("""
**Notes**
- Uses `attn_implementation="sdpa"` so FlashAttention-2 isn’t required.
- Precision is auto-selected: **bfloat16** on Ampere/Ada/Hopper GPUs, **float16** on older CUDA, **float32** on CPU.
- We cast embeddings to **float32** before returning to avoid NumPy’s lack of bfloat16 support.
- Longer `max_len` increases memory/time; start with 512–2048 for speed.
""")
Step 22: Launch Streamlit
Run the following command to launch Streamlit:
streamlit run streamlit_mmbert.py
Step 23: Access the Web UI in Your Browser
Once Streamlit is running, it will display three links:
- Local URL → http://localhost:8501 (works if you’re running on your own machine).
- Network URL → http://<internal-ip>:8501 (for internal access inside your VM network).
- External URL → http://<your-vm-public-ip>:8501 (use this to open from your laptop/PC browser).
Open the External URL in your browser.
Example:
http://38.29.145.10:8501
Step 24: Test the App (Embeddings + Masked-LM)
Test embeddings
- In the Embeddings tab, keep the four sample multilingual lines.
- Set Max tokens to 512 and keep L2-normalize embeddings checked.
- Click Compute embeddings.
Expected: green toast “Done! Shape: (4, 768) (rows=texts, cols=features)” and a cosine similarity matrix where cross-lingual pairs score around ~0.8–0.9. Use Download embeddings (CSV) if you want the vectors.
Test masked language modeling
- Switch to Masked-LM tab.
- Use: The capital of France is <mask>. and set Top-k = 5.
- Click Predict.
Expected: Top predictions include “Paris” (usually rank #1), followed by cities like Strasbourg, Nice, Lyon, Brussels.
Notes (Info tab mirrors this)
- Uses attn_implementation="sdpa" → no FlashAttention needed.
- Precision auto-selects (bf16 on Ampere/Ada/Hopper; fp16 on older CUDA; fp32 on CPU).
- Embeddings are cast to float32 before displaying/downloading to avoid NumPy’s bf16 limitation.
- Longer Max tokens increases memory/time—512–2048 is a good starting range.
Conclusion
You’ve installed and validated mmBERT-base, exposed it via FastAPI, and built a simple Streamlit UI—so it’s ready for multilingual embeddings and masked-LM out of the box. From here, plug the embeddings into your retrieval stack (FAISS/pgvector) or fine-tune for your domain.
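As a starting point, here is a minimal, hypothetical sketch of that retrieval wiring: it mean-pools and L2-normalizes mmBERT embeddings (the same recipe as the scripts above) and indexes them with the faiss-cpu package from Step 11; docs and query are placeholder data.
# retrieval_sketch.py -- index mmBERT embeddings with FAISS (illustrative, not part of the tutorial scripts).
import faiss
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "jhu-clsp/mmBERT-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
enc = AutoModel.from_pretrained(MODEL_ID).to(device).eval()

def embed(texts, max_length=512):
    # Mean-pool token states with the attention mask, then L2-normalize.
    ids = tok(texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(device)
    with torch.inference_mode():
        last = enc(**ids).last_hidden_state
    mask = ids.attention_mask.unsqueeze(-1).to(last.dtype)
    emb = (last * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    emb = torch.nn.functional.normalize(emb, p=2, dim=1)
    return emb.to(torch.float32).cpu().numpy()

# Placeholder corpus and query -- swap in your own documents.
docs = [
    "The Eiffel Tower is in Paris.",
    "La Torre Eiffel está en París.",
    "Mount Fuji is the highest mountain in Japan.",
]
query = ["Where is the Eiffel Tower?"]

index = faiss.IndexFlatIP(768)        # inner product == cosine on unit vectors
index.add(embed(docs))
scores, idxs = index.search(embed(query), 2)
print(idxs[0], scores[0])             # indices and scores of the two closest documents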