mmBERT (by JHU CLSP) is a modern multilingual encoder (≈307M params) trained on 3T+ tokens across 1,800+ languages. Built on the ModernBERT family, it brings fast inference (FlashAttention-2/unpadding in the official recipe), 8K context, and state-of-the-art cross-lingual performance on classification, embeddings, retrieval, and reranking. It also introduces training tricks like inverse mask scheduling, inverse temperature sampling, and progressive language addition, which especially help low-resource languages in the decay phase. Use it as:
- a Masked-LM (fill-mask) for language understanding,
- a feature extractor for multilingual embeddings & retrieval,
- a backbone for classification/reranking fine-tuning.
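For a quick first taste of the first two modes before the full walkthrough below, here is a minimal sketch using the Hugging Face transformers pipeline API (it assumes torch and transformers are already installed; the model downloads on first use):
# Minimal sketch: fill-mask and feature extraction with mmBERT-base.
from transformers import pipeline

MODEL_ID = "jhu-clsp/mmBERT-base"

# 1) Masked-LM: predict the masked word (use the tokenizer's own mask token).
fill = pipeline("fill-mask", model=MODEL_ID)
masked = f"The capital of France is {fill.tokenizer.mask_token}."
print(fill(masked)[:3])  # top guesses

# 2) Embeddings: token features you can mean-pool yourself (shown in full later).
extract = pipeline("feature-extraction", model=MODEL_ID)
feats = extract("Artificial intelligence is transforming technology")
print(len(feats[0]), len(feats[0][0]))  # (num_tokens, hidden_size=768)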
Model Architecture
| Parameter | mmBERT-small | mmBERT-base |
|---|---|---|
| Layers | 22 | 22 |
| Hidden Size | 384 | 768 |
| Intermediate Size | 1152 | 1152 |
| Attention Heads | 6 | 12 |
| Total Parameters | 140M | 307M |
| Non-embedding Parameters | 42M | 110M |
| Max Sequence Length | 8192 | 8192 |
| Vocabulary Size | 256,000 | 256,000 |
| Tokenizer | Gemma 2 | Gemma 2 |
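If you want to double-check these numbers yourself, a small sketch like the following (assuming transformers and torch are installed) loads the config and counts parameters:
# Sanity-check the architecture numbers in the table above.
from transformers import AutoConfig, AutoModel

cfg = AutoConfig.from_pretrained("jhu-clsp/mmBERT-base")
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads,
      cfg.intermediate_size, cfg.max_position_embeddings, cfg.vocab_size)

model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base")
total = sum(p.numel() for p in model.parameters())
print(f"total parameters ≈ {total / 1e6:.0f}M")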
Training Data
mmBERT was trained on a carefully curated 3T+ token multilingual dataset, and the data for each training phase is publicly available.
GPU configuration (mmBERT-base, ~307M params, max seq 8192)
| Scenario | Precision | Seq Len | Typical Batch | Min VRAM | Comfortable VRAM | Example GPUs | Notes |
|---|---|---|---|---|---|---|---|
| Embeddings / feature extraction (short text) | bf16 / fp16 | 256–512 | 64–256 | 6–8 GB | 12–16 GB | T4-16G, RTX 3060-12G, L4-24G | Turn on torch_dtype="auto", use padding/truncation; increase batch until near the VRAM limit. |
| Fill-mask / MLM inference (short) | bf16 / fp16 | 128–512 | 32–128 | 6–8 GB | 12–16 GB | T4-16G, 3060-12G, L4-24G | For multiple <mask> tokens, keep the batch modest. |
| Long-doc embeddings | bf16 / fp16 | 2048–4096 | 8–32 | 12–16 GB | 24–40 GB | L4-24G, 4090-24G, A100-40G | Activations scale with sequence length; drop the batch if you hit OOM. |
| 8K-context inference | bf16 | 8192 | 1–8 | 24 GB | 40–48 GB | 4090-24G (batch=1–2), L40S-48G, A100-40G | If VRAM is tight, run gradient-free inference and reduce the batch. |
| Sentence-Transformers training (DPR/CL, short pairs) | bf16 | 128–512 | 256–512 | 24 GB | 40–48 GB | 4090-24G, L40S-48G, A100-40G | Use bf16, gradient accumulation, and mixed precision; increase the batch until stable. |
| Cross-encoder reranking training (pairs) | bf16 | 512–1024 | 16–64 | 16–24 GB | 32–40 GB | 3090-24G, 4090-24G, A100-40G | A cross-encoder doubles the token count (query+doc); start small and use gradient accumulation. |
| Sequence classification fine-tuning | bf16 | 512–2048 | 32–128 | 12–16 GB | 24–40 GB | T4-16G (smaller batch), L4-24G, A100-40G | Increase the batch gradually; enable mixed precision and gradient checkpointing if needed. |
| 8K-context fine-tuning (specialized) | bf16 | 8192 | 2–8 | 40 GB | 80 GB+ | A100-80G, H100-80G | Use gradient checkpointing + accumulation; consider multi-GPU DDP. |
| Multi-GPU DDP (any of the above) | bf16 | varies | scales | per-GPU as above | scales | 2×L4-24G, 2×A100-40G, 2×H100-80G | Scale the batch linearly across GPUs; set gradient_accumulation_steps to keep the global batch constant. |
Rules of Thumb
- Memory math (very rough): params (~307M) in bf16 ≈ 0.6–0.7 GB; activations dominate with longer seq_len × batch.
- If OOM: lower seq_len first, then batch, then try fp16; for training add grad checkpointing and grad accumulation.
- Throughput: enable torch.backends.cuda.matmul.allow_tf32 = True on Ampere or newer; pin memory and use multiple DataLoader workers.
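To make the rough memory math and throughput settings above concrete, here is a minimal sketch (the ~0.6 GB figure counts weights only; activations come on top and scale with seq_len × batch):
# Rough memory math + throughput settings from the rules of thumb above.
import torch
from torch.utils.data import DataLoader, TensorDataset

# ~307M parameters at 2 bytes each (bf16/fp16) -> roughly 0.6 GB just for the weights.
params = 307e6
print(f"bf16 weights ≈ {params * 2 / 1024**3:.2f} GB")

# TF32 matmuls help on Ampere (SM 8.0) and newer GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Illustrative DataLoader with pinned memory and workers (dummy dataset, never iterated here).
ds = TensorDataset(torch.zeros(128, 512, dtype=torch.long))
dl = DataLoader(ds, batch_size=32, pin_memory=True, num_workers=4)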
Resources
Link: https://huggingface.co/jhu-clsp/mmBERT-base
Note
Start by validating the model with the simple script, then graduate to an API, and finally a UI:
- Local script: create and activate a venv and install the dependencies (pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio, then pip install transformers sentencepiece), save app.py (the SDPA-only script from Step 13), and run python3 app.py to confirm the GPU/dtype printout, the top-5 <mask> predictions, and the cosine-similarity matrix. This proves the weights, tokenizer, and CUDA are all good.
- API service: install pip install fastapi uvicorn[standard], save server.py (the embed + mlm endpoints), launch uvicorn server:app --host 0.0.0.0 --port 7860, and smoke-test with curl against /healthz, /embed, and /mlm; you should get JSON with embeddings or token guesses.
- Web UI: once the API is solid, install pip install streamlit numpy pandas, save streamlit_mmbert.py (the version that casts embeddings to float32 to avoid bf16/NumPy issues), and run streamlit run streamlit_mmbert.py to interactively paste multilingual text, set max tokens, compute embeddings with a similarity table, and try masked-LM.
All three steps (script → FastAPI → Streamlit) give you a clean progression from local correctness, to programmable endpoints, to a friendly front end.
Step-by-Step Process to Install & Run mmBERT-base Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running mmBERT-base, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like mmBERT-base.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like mmBERT-base.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the mmBERT-base runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Prepare the System and Create a Python 3.10 Virtual Environment
Run the following commands to update the system, then create and activate a Python 3.10 virtual environment:
sudo apt-get update -y
sudo apt-get install -y python3-venv git build-essential
python3 -m venv mmbert
source mmbert/bin/activate
python -m pip install --upgrade pip
Step 9: Install PyTorch
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio
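Before moving on, you can optionally verify the install with a quick check (a minimal snippet, run inside the activated venv):
# Verify that PyTorch can see the GPU before continuing.
import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))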
Step 10: Install Dependencies
Run the following commands to install dependencies:
pip install "transformers>=4.44" accelerate sentencepiece datasets scipy scikit-learn
Step 11: Install Faiss
Run the following command to install faiss:
pip install faiss-cpu
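FAISS isn't used again until you build a retrieval index on top of the embeddings (a fuller sketch follows the Conclusion); as a quick, hypothetical smoke test that the install works, you can index a few random vectors:
# Smoke test: faiss-cpu imports and can index/search vectors.
import faiss
import numpy as np

d = 768                       # mmBERT-base hidden size
xb = np.random.rand(100, d).astype("float32")
faiss.normalize_L2(xb)        # cosine similarity via inner product on unit vectors

index = faiss.IndexFlatIP(d)  # exact inner-product index
index.add(xb)
scores, ids = index.search(xb[:1], 5)
print(ids[0], scores[0])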
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running the model scripts with mmBERT-base, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Create the Script (app.py) and Paste the Code
Create the file and paste the code:
# app.py
# Run: python3 app.py
# Notes: This version intentionally avoids FlashAttention-2 and uses SDPA.
#        Works on CUDA and CPU. On CUDA, it auto-picks a safe dtype.

from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM
import torch

MODEL_ID = "jhu-clsp/mmBERT-base"

# -----------------------
# Device & dtype selection
# -----------------------
def pick_device_and_dtype():
    if torch.cuda.is_available():
        cap = torch.cuda.get_device_capability()
        # Ampere (8.0) or newer -> prefer bfloat16; else fp16
        if cap[0] >= 8:
            return "cuda", torch.bfloat16
        else:
            return "cuda", torch.float16
    return "cpu", torch.float32

device, dtype = pick_device_and_dtype()

# Enable TF32 where it helps (CUDA only)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

print(f"[init] device={device}, dtype={dtype}")

# -----------------------
# Load tokenizer & models
# -----------------------
tok = AutoTokenizer.from_pretrained(MODEL_ID)

# Force SDPA to avoid FlashAttention-2 entirely
common_kwargs = dict(dtype=dtype, attn_implementation="sdpa")

# Masked LM head (for <mask> predictions)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL_ID, **common_kwargs).to(device)

# Plain encoder (for embeddings)
enc = AutoModel.from_pretrained(MODEL_ID, **common_kwargs).to(device)

# -----------------------
# Utils
# -----------------------
def ensure_mask_token(tokenizer):
    """
    Ensure we have a mask token. mmBERT uses the Gemma 2 tokenizer, which includes <mask>.
    If not found for some reason, we add one (rare).
    """
    if tokenizer.mask_token_id is None:
        tokenizer.add_special_tokens({"mask_token": "<mask>"})
        # Resize embeddings if we ever modify the tokenizer (shouldn't be needed normally)
        mlm.resize_token_embeddings(len(tokenizer))
        enc.resize_token_embeddings(len(tokenizer))
    return tokenizer.mask_token, tokenizer.mask_token_id

def mean_pool(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# -----------------------
# Demos
# -----------------------
def demo_mlm():
    mask_token, _ = ensure_mask_token(tok)
    samples = [
        f"The capital of France is {mask_token}.",
        f"La capital de España es {mask_token}.",
        f"Die Hauptstadt von Deutschland ist {mask_token}.",
    ]
    print("\n[MLM] Top-5 predictions per sample:")
    for text in samples:
        ids = tok(text, return_tensors="pt").to(device)
        with torch.inference_mode():
            logits = mlm(**ids).logits
        mask_positions = (ids.input_ids == tok.mask_token_id).nonzero(as_tuple=True)
        # Handle possible multiple masks; take first position for display
        if mask_positions[0].numel() == 0:
            print(f" (no mask found in: {text})")
            continue
        pred = logits[mask_positions].softmax(-1)
        topk = torch.topk(pred, k=5, dim=-1).indices[0].tolist()
        tokens = [tok.decode(i).strip() for i in topk]
        print(f" {text} -> {tokens}")

def demo_embeddings():
    texts = [
        "Artificial intelligence is transforming technology",
        "La inteligencia artificial está transformando la tecnología",
        "L'intelligence artificielle transforme la technologie",
        "人工智能正在改变技术",
    ]
    ids = tok(texts, padding=True, truncation=True, max_length=512, return_tensors="pt").to(device)
    with torch.inference_mode():
        last = enc(**ids).last_hidden_state
        emb = mean_pool(last, ids.attention_mask)
        emb = torch.nn.functional.normalize(emb, p=2, dim=1)
    # Cosine similarity (4x4)
    sim = emb @ emb.T
    print("\n[Embeddings] Cosine similarity matrix:")
    for row in sim.tolist():
        print(" " + " ".join(f"{v:6.3f}" for v in row))

if __name__ == "__main__":
    demo_mlm()
    demo_embeddings()
    print("\n[done] Model, tokenizer, and CUDA all look good. Next step: wrap this in a FastAPI service (server.py below).")
What This Script Does
- Auto-selects device & precision
  - Uses GPU if available; otherwise CPU.
  - On Ampere/Ada/Hopper GPUs it uses bfloat16; on older CUDA GPUs it uses float16; CPU uses float32.
  - Enables TF32 matmuls for extra speed on NVIDIA GPUs.
- Loads mmBERT-base twice
  - As a Masked-Language-Model (MLM) to predict the word for <mask>.
  - As a plain encoder to produce sentence embeddings.
- Forces a safe attention backend
  - Sets attn_implementation="sdpa" so it does not require FlashAttention-2 (no custom CUDA kernels).
- Utility helpers
  - ensure_mask_token guarantees the tokenizer has a <mask> token.
  - mean_pool turns token features into a fixed-size sentence embedding by attention-mask-weighted averaging.
- Two quick demos (printed to your terminal)
  - MLM demo: For three sample sentences (EN/ES/DE) containing <mask>, prints the top-5 predicted tokens (e.g., “Paris”, “Madrid”, “Berlin”…).
  - Embeddings demo: Encodes four multilingual sentences, L2-normalizes them, then prints a 4×4 cosine-similarity matrix showing cross-lingual closeness.
- End message
  - Prints a final [done] line so you know it finished successfully.
Tip: Edit the samples and texts lists to try your own sentences, or increase max_length (e.g., 2048) if you need longer inputs and your GPU has enough VRAM.
Step 14: Install Python 3.10 Toolchain + Headers
Run the following commands to install the Python 3.10 toolchain and headers:
sudo apt update
sudo apt install -y \
build-essential \
python3.10 python3.10-venv python3.10-dev python3-pip \
python3-dev
Step 15: Run the Script
Run the script with the following command:
python3 app.py
This will download the model and print the demo output in the terminal.
Step 16: Install the Server Dependencies
Run the following command to install server dependencies:
pip install fastapi uvicorn[standard]
Step 17: Create the Script (server.py) and Paste the Code
Create the file and paste the code:
# server.py
# Run: uvicorn server:app --host 0.0.0.0 --port 7860

from fastapi import FastAPI
from pydantic import BaseModel, Field
from typing import List, Optional
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

MODEL_ID = "jhu-clsp/mmBERT-base"

def pick_device_and_dtype():
    if torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        return "cuda", (torch.bfloat16 if major >= 8 else torch.float16)
    return "cpu", torch.float32

device, dtype = pick_device_and_dtype()
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

tok = AutoTokenizer.from_pretrained(MODEL_ID)
common_kwargs = dict(dtype=dtype, attn_implementation="sdpa")
enc = AutoModel.from_pretrained(MODEL_ID, **common_kwargs).to(device)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL_ID, **common_kwargs).to(device)

def mean_pool(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

class EmbedReq(BaseModel):
    texts: List[str] = Field(..., description="Texts to embed")
    max_length: int = Field(512, ge=8, le=8192)
    normalize: bool = True

class EmbedResp(BaseModel):
    embeddings: List[List[float]]

class MLMReq(BaseModel):
    text: str = Field(..., description="Text containing one <mask> token")
    k: int = Field(5, ge=1, le=50)

class MLMResp(BaseModel):
    predictions: List[str]

app = FastAPI(title="mmBERT Service", version="1.0")

@app.get("/healthz")
def healthz():
    return {"ok": True, "device": device, "dtype": str(dtype)}

@app.post("/embed", response_model=EmbedResp)
@torch.inference_mode()
def embed(req: EmbedReq):
    ids = tok(
        req.texts, padding=True, truncation=True,
        max_length=req.max_length, return_tensors="pt"
    ).to(device)
    last = enc(**ids).last_hidden_state
    emb = mean_pool(last, ids.attention_mask)
    if req.normalize:
        emb = torch.nn.functional.normalize(emb, p=2, dim=1)
    return {"embeddings": emb.detach().cpu().tolist()}

@app.post("/mlm", response_model=MLMResp)
@torch.inference_mode()
def mlm_predict(req: MLMReq):
    if tok.mask_token_id is None:
        tok.add_special_tokens({"mask_token": "<mask>"})
        mlm.resize_token_embeddings(len(tok))
        enc.resize_token_embeddings(len(tok))
    ids = tok(req.text, return_tensors="pt").to(device)
    logits = mlm(**ids).logits
    mask_pos = (ids.input_ids == tok.mask_token_id).nonzero(as_tuple=True)
    if mask_pos[0].numel() == 0:
        return {"predictions": []}
    probs = logits[mask_pos].softmax(-1)
    top_ids = torch.topk(probs, k=req.k, dim=-1).indices[0].tolist()
    toks = [tok.decode(t).strip() for t in top_ids]
    return {"predictions": toks}
What This Script Does
Boots a FastAPI service for mmBERT-base. On start, it loads the tokenizer and two models: the encoder (for embeddings) and the masked-LM head (for <mask> predictions). It forces SDPA attention (no FlashAttention-2 needed).
Auto-selects compute: uses GPU if available, with bfloat16 on Ampere/Ada/Hopper (otherwise fp16); CPU falls back to fp32. Enables TF32 matmuls for extra GPU speed.
Endpoints:
- GET /healthz — quick health check returning {"ok": true, "device": "...", "dtype": "..."}.
- POST /embed — input: {"texts":[...], "max_length":512, "normalize":true}. It tokenizes, runs the encoder, mean-pools tokens using the attention mask to get sentence embeddings, optionally L2-normalizes, and returns {"embeddings": [[...], ...]}.
- POST /mlm — input: {"text":"The capital of France is <mask>.", "k":5}. It ensures a <mask> token exists, runs the masked-LM, and returns the top-k token predictions as strings.
Utility: a small mean_pool helper for clean sentence embeddings; the <mask> safety logic will resize embeddings if a mask token must be added.
Step 18: Run it
Run it with the following command:
uvicorn server:app --host 0.0.0.0 --port 7860
Step 19: Quick Tests
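As a minimal sketch (assuming the server from Step 18 is running on localhost:7860), you can smoke-test the three endpoints with Python's standard library; equivalent curl calls against /healthz, /embed, and /mlm work just as well.
# quick_tests.py -- smoke-test the FastAPI endpoints (assumes uvicorn is serving on port 7860).
import json
import urllib.request

BASE = "http://localhost:7860"

def get(path):
    with urllib.request.urlopen(BASE + path) as r:
        return json.loads(r.read())

def post(path, payload):
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())

print(get("/healthz"))  # {"ok": true, "device": "...", "dtype": "..."}

emb = post("/embed", {"texts": ["Hello world", "Hola mundo"], "max_length": 512, "normalize": True})
print(len(emb["embeddings"]), len(emb["embeddings"][0]))  # 2 vectors of size 768

mlm = post("/mlm", {"text": "The capital of France is <mask>.", "k": 5})
print(mlm["predictions"])  # token guesses, e.g. "Paris" near the top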
Step 20: Install the Streamlit Dependencies
Run the following command to install the Streamlit dependencies:
pip install streamlit numpy pandas
Step 21: Create the Streamlit Script (streamlit_mmbert.py)
Create the file streamlit_mmbert.py and add the following code:
import streamlit as st
import torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

MODEL_ID = "jhu-clsp/mmBERT-base"

# -----------------------
# Device & dtype helpers
# -----------------------
def pick_device_and_dtype():
    if torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        return "cuda", (torch.bfloat16 if major >= 8 else torch.float16)
    return "cpu", torch.float32

device, dtype = pick_device_and_dtype()
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# -----------------------
# Cache & load once
# -----------------------
@st.cache_resource(show_spinner=True)
def load_models():
    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    common = dict(dtype=dtype, attn_implementation="sdpa")  # no FlashAttention-2
    enc = AutoModel.from_pretrained(MODEL_ID, **common).to(device).eval()
    mlm = AutoModelForMaskedLM.from_pretrained(MODEL_ID, **common).to(device).eval()
    # Ensure <mask> exists (should already be present; this is a safety net)
    if tok.mask_token_id is None:
        tok.add_special_tokens({"mask_token": "<mask>"})
        enc.resize_token_embeddings(len(tok))
        mlm.resize_token_embeddings(len(tok))
    return tok, enc, mlm

tok, enc, mlm = load_models()

# -----------------------
# Utils
# -----------------------
def mean_pool(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

def compute_embeddings(texts, max_len=512, normalize=True):
    ids = tok(texts, padding=True, truncation=True, max_length=max_len, return_tensors="pt").to(device)
    with torch.inference_mode():
        last = enc(**ids).last_hidden_state
        emb = mean_pool(last, ids.attention_mask)
        if normalize:
            emb = torch.nn.functional.normalize(emb, p=2, dim=1)
    # Cast to float32 so NumPy/Streamlit can handle it (avoids BF16 error)
    return emb.detach().to(torch.float32).cpu().numpy()

def mlm_topk(text, k=5):
    ids = tok(text, return_tensors="pt").to(device)
    with torch.inference_mode():
        logits = mlm(**ids).logits
    mask_pos = (ids.input_ids == tok.mask_token_id).nonzero(as_tuple=True)
    if mask_pos[0].numel() == 0:
        return []
    probs = torch.softmax(logits[mask_pos], dim=-1)
    top_ids = torch.topk(probs, k=k, dim=-1).indices[0].tolist()
    return [tok.decode(t).strip() for t in top_ids]

# -----------------------
# UI
# -----------------------
st.set_page_config(page_title="mmBERT Streamlit", page_icon="🧠", layout="wide")
st.title("🧠 mmBERT-base (ModernBERT) — Streamlit")
st.caption(f"Device: **{device}**, dtype: **{dtype}**, attention: **SDPA** (FlashAttention disabled)")

tab_embed, tab_mlm, tab_info = st.tabs(["🔎 Embeddings", "🧩 Masked-LM", "ℹ️ Info"])

with tab_embed:
    st.subheader("Compute sentence embeddings")
    default_texts = (
        "Artificial intelligence is transforming technology\n"
        "La inteligencia artificial está transformando la tecnología\n"
        "L'intelligence artificielle transforme la technologie\n"
        "人工智能正在改变技术"
    )
    texts_str = st.text_area("One text per line:", value=default_texts, height=200)
    max_len = st.slider("Max tokens", 32, 8192, 512, step=32)
    normalize = st.checkbox("L2-normalize embeddings", value=True)
    if st.button("Compute embeddings"):
        texts = [t.strip() for t in texts_str.splitlines() if t.strip()]
        if not texts:
            st.warning("Please enter at least one line of text.")
        else:
            with st.spinner("Embedding…"):
                embs = compute_embeddings(texts, max_len=max_len, normalize=normalize)
            st.success(f"Done! Shape: {embs.shape} (rows=texts, cols=features)")
            # Cosine similarity
            sims = (embs @ embs.T) / (
                np.linalg.norm(embs, axis=1, keepdims=True) *
                np.linalg.norm(embs, axis=1, keepdims=True).T + 1e-9
            )
            df = pd.DataFrame(np.round(sims, 3),
                              index=[f"t{i+1}" for i in range(len(texts))],
                              columns=[f"t{i+1}" for i in range(len(texts))])
            st.write("Cosine similarity matrix:")
            st.dataframe(df, use_container_width=True)
            # Downloads
            emb_df = pd.DataFrame(embs)
            st.download_button(
                "⬇️ Download embeddings (CSV)",
                emb_df.to_csv(index=False).encode("utf-8"),
                file_name="embeddings.csv",
                mime="text/csv",
            )

with tab_mlm:
    st.subheader("Predict masked tokens")
    st.caption("Use the `<mask>` token in your text. Example: `The capital of France is <mask>.`")
    text = st.text_input("Text with one <mask>:", value="The capital of France is <mask>.")
    k = st.slider("Top-k", 1, 50, 5)
    if st.button("Predict"):
        if "<mask>" not in text:
            st.error("Please include the `<mask>` token in your text.")
        else:
            with st.spinner("Running masked-LM…"):
                preds = mlm_topk(text, k=k)
            if preds:
                st.success("Top-k predictions:")
                st.write(preds)
            else:
                st.warning("No `<mask>` token found after tokenization or unexpected input.")

with tab_info:
    st.markdown("""
**Notes**
- Uses `attn_implementation="sdpa"` so FlashAttention-2 isn’t required.
- Precision is auto-selected: **bfloat16** on Ampere/Ada/Hopper GPUs, **float16** on older CUDA, **float32** on CPU.
- We cast embeddings to **float32** before returning to avoid NumPy’s lack of bfloat16 support.
- Longer `max_len` increases memory/time; start with 512–2048 for speed.
""")
Step 22: Launch Streamlit
Run the following command to launch Streamlit:
streamlit run streamlit_mmbert.py
Step 23: Access the Web UI in Your Browser
Once Streamlit is running, it will display three links:
- Local URL → http://localhost:8501 (works if you’re running on your own machine).
- Network URL → http://<internal-ip>:8501 (for internal access inside your VM network).
- External URL → http://<your-vm-public-ip>:8501 (use this to open from your laptop/PC browser).
Open the External URL in your browser.
Example:
http://38.29.145.10:8501
Step 24: Test the App (Embeddings + Masked-LM)
Test embeddings
- In the Embeddings tab, keep the four sample multilingual lines.
- Set Max tokens to 512 and keep L2-normalize embeddings checked.
- Click Compute embeddings.
Expected: green toast “Done! Shape: (4, 768) (rows=texts, cols=features)” and a cosine similarity matrix where cross-lingual pairs score around ~0.8–0.9. Use Download embeddings (CSV) if you want the vectors.
Test masked language modeling
- Switch to Masked-LM tab.
- Use: The capital of France is <mask>. and set Top-k = 5.
- Click Predict.
Expected: Top predictions include “Paris” (usually rank #1), followed by cities like Strasbourg, Nice, Lyon, Brussels.
Notes (Info tab mirrors this)
- Uses attn_implementation="sdpa" → no FlashAttention needed.
- Precision auto-selects (bf16 on Ampere/Ada/Hopper; fp16 on older CUDA; fp32 on CPU).
- Embeddings are cast to float32 before displaying/downloading to avoid NumPy’s bf16 limitation.
- Longer Max tokens increases memory/time—512–2048 is a good starting range.
Conclusion
You’ve installed and validated mmBERT-base, exposed it via FastAPI, and built a simple Streamlit UI—so it’s ready for multilingual embeddings and masked-LM out of the box. From here, plug the embeddings into your retrieval stack (FAISS/pgvector) or fine-tune for your domain.
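As a starting point, here is a minimal, hypothetical sketch of that retrieval wiring: it mean-pools and L2-normalizes mmBERT embeddings (the same recipe as the scripts above) and indexes them with the faiss-cpu package from Step 11; docs and query are placeholder data.
# retrieval_sketch.py -- index mmBERT embeddings with FAISS (illustrative, not part of the tutorial scripts).
import faiss
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "jhu-clsp/mmBERT-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
enc = AutoModel.from_pretrained(MODEL_ID).to(device).eval()

def embed(texts, max_length=512):
    # Mean-pool token states with the attention mask, then L2-normalize.
    ids = tok(texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(device)
    with torch.inference_mode():
        last = enc(**ids).last_hidden_state
    mask = ids.attention_mask.unsqueeze(-1).to(last.dtype)
    emb = (last * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    emb = torch.nn.functional.normalize(emb, p=2, dim=1)
    return emb.to(torch.float32).cpu().numpy()

# Placeholder corpus and query -- swap in your own documents.
docs = [
    "The Eiffel Tower is in Paris.",
    "La Torre Eiffel está en París.",
    "Mount Fuji is the highest mountain in Japan.",
]
query = ["Where is the Eiffel Tower?"]

index = faiss.IndexFlatIP(768)        # inner product == cosine on unit vectors
index.add(embed(docs))
scores, idxs = index.search(embed(query), 2)
print(idxs[0], scores[0])             # indices and scores of the two closest documents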