Omnilingual ASR is Meta’s groundbreaking open-source speech recognition system built to support over 1,600 languages, including hundreds never before covered by any ASR model.
It’s designed for inclusivity — allowing new languages to be added with just a few paired examples — and combines scalable zero-shot learning with flexible model architectures (Wav2Vec2, CTC, and LLM-based).
The flagship omniASR_LLM_7B model achieves state-of-the-art transcription accuracy, with character error rates (CER) below 10% for nearly 80% of supported languages, making it the most linguistically comprehensive ASR system released to date.
Each model is fully compatible with PyTorch, Fairseq2, and Hugging Face datasets, making it easy for developers and researchers to build multilingual transcription systems at scale.
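As a quick taste of the API before we set anything up, here is a minimal sketch based on the inference pipeline used in the full script later in this tutorial (the audio path and language code are placeholders; pick a model card that fits your GPU using the table below):

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Load a model card once (smaller cards need less VRAM; see the GPU table below)
pipeline = ASRInferencePipeline(model_card="omniASR_LLM_1B")

# Transcribe a short (<= 40s) audio clip, conditioning on a language code
texts = pipeline.transcribe(["sample.wav"], lang=["eng_Latn"], batch_size=1)
print(texts[0])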
GPU Configuration Guide
| Tier / Use Case | Model Name | Precision | Min VRAM (Approx.) | Suggested GPUs | Notes / Recommendations |
|---|---|---|---|---|---|
| Entry – Lightweight testing / fine-tuning | omniASR_CTC_300M | FP16 / BF16 | 2 GB | T4 16G, RTX 3050 6G, L4 24G | Fast inference; ideal for quick multilingual demos. |
| Standard – Medium-scale multilingual ASR | omniASR_CTC_1B | FP16 / BF16 | 3–4 GB | RTX 3060 12G, RTX 4060 8–16G | Balanced accuracy and efficiency; supports most languages. |
| Advanced – High-quality transcription | omniASR_CTC_3B | FP16 / BF16 | 8 GB | RTX 4070 12G, A10 24G | Strong performance on medium-length clips (≤40s). |
| Pro – Large-scale multilingual decoding | omniASR_CTC_7B | FP16 / BF16 | 15 GB | RTX 4090, L40S 48G, A5000 24G | Best accuracy in CTC family; supports dense decoding. |
| LLM-Powered – Context-aware multilingual ASR | omniASR_LLM_1B | FP16 / BF16 | 6 GB | RTX 3060 12G, A10 24G | LLM-based model with language conditioning; robust output. |
| LLM-Powered – Extended multilingual ASR | omniASR_LLM_3B | FP16 / BF16 | 10 GB | RTX 4070, L4 24G, A5000 24G | High performance on noisy audio; great balance. |
| Flagship – Full-scale multilingual accuracy | omniASR_LLM_7B | FP16 / BF16 | 17 GB | RTX 4090, L40S 48G, A6000, A100 40G | SOTA performance; used for all 1600+ languages. |
| Zero-Shot – Unknown or low-resource languages | omniASR_LLM_7B_ZS | FP16 / BF16 | 20 GB | RTX 4090, A6000, H100, A100 40G/80G | Best for unseen or underrepresented languages; zero-shot inference. |
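If you are unsure which tier your GPU falls into, a quick VRAM check helps. Here is a minimal sketch, assuming PyTorch is available in your environment:

import torch

# Report the name and total memory of the first CUDA device
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected")

Compare the reported VRAM against the “Min VRAM” column above to choose a model.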
Resources
Link: https://github.com/facebookresearch/omnilingual-asr
Step-by-Step Process to Install & Run Omnilingual ASR Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Omnilingual ASR, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc).
- Proper support for building and running GPU-based models like Omnilingual ASR.
- Compatibility with CUDA 12.1.1, required by certain model operations.
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Omnilingual ASR.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the Omnilingual ASR runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python 3.11 and Pip (the VM ships with Python 3.10, so we upgrade it)
Run the following commands to check the available Python version.
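For example (the exact output depends on your base image):

python3 --version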
By default, the system has Python 3.10.12 installed. To install a newer Python version, you’ll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.11, Pip and Wheel:
apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version
Step 9: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version
Step 10: Install Omnilingual-ASR
Run the following command to install omnilingual-asr:
pip install omnilingual-asr
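Optionally, confirm the installed version with:

pip show omnilingual-asr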
Step 11: Install libsndfile1 (Required for Fairseq2 / Omnilingual ASR Audio Support)
Run the following command to install libsndfile1:
sudo apt update
sudo apt install -y libsndfile1
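As an optional sanity check, you can confirm the shared library is visible to the system:

ldconfig -p | grep libsndfile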
Step 12: Install and Verify the Data Extras and Datasets Package
Run the following command to install the omnilingual-asr data extras together with the Hugging Face datasets library:
pip install "omnilingual-asr[data]" datasets
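To verify that both packages import cleanly (this only imports the modules; it does not download any model weights), you can run:

python -c "from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline; print('omnilingual_asr OK')"
python -c "import datasets; print('datasets', datasets.__version__)"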
Step 13: Install and Verify Gradio (Web Interface Framework)
Run the following command to install Gradio:
pip install gradio
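You can confirm the installation and version with:

python -c "import gradio; print('gradio', gradio.__version__)"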
Step 14: Connect to Your GPU VM with a Code Editor
Before you start running the model script with the Omnilingual ASR model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
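If your editor connects over standard OpenSSH configuration (as VS Code and Cursor do), an entry like the following in ~/.ssh/config makes the connection one click away. The host alias, IP, user, and key path below are placeholders for your own VM details:

Host nodeshift-gpu
    HostName <your-vm-ip>
    User <your-vm-user>
    IdentityFile ~/.ssh/<your-private-key>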
Step 15: Create the Script
Create a file (e.g., app.py) and add the following code:
import os
import uuid
import subprocess

import gradio as gr

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# ----------------------------
# Config
# ----------------------------
MODEL_CARD = "omniASR_LLM_7B"  # or "omniASR_LLM_7B_ZS"
TMP_DIR = "/tmp/omnilingual_asr"  # for converted audio
os.makedirs(TMP_DIR, exist_ok=True)

# ----------------------------
# Init pipeline once
# ----------------------------
pipeline = ASRInferencePipeline(model_card=MODEL_CARD)


def _convert_to_wav_16k_mono(input_path: str) -> str:
    """
    Convert any input (mp3, wav, etc.) to 16kHz mono WAV using ffmpeg.
    Returns path to the converted file.
    """
    if not os.path.exists(input_path):
        raise FileNotFoundError(f"Uploaded file not found: {input_path}")

    # Unique target filename
    out_path = os.path.join(TMP_DIR, f"{uuid.uuid4().hex}.wav")

    cmd = [
        "ffmpeg",
        "-y",
        "-i",
        input_path,
        "-ar", "16000",
        "-ac", "1",
        out_path,
    ]

    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"ffmpeg failed to convert {input_path}: {e}") from e

    if not os.path.exists(out_path):
        raise RuntimeError(f"Converted file missing for {input_path}")

    return out_path


def transcribe_audio(files, languages):
    """
    Gradio callback:
      - files: list of filepaths (type='filepath')
      - languages: optional comma-separated codes:
        eng_Latn, hin_Deva, deu_Latn, ...
    """
    if not files:
        return "Please upload at least one audio file."

    # Normalize to concrete paths
    raw_paths = []
    for f in files:
        path = f if isinstance(f, str) else getattr(f, "name", None)
        if not path or not os.path.exists(path):
            return f"Could not find uploaded file on server: {f}"
        raw_paths.append(path)

    # Convert all to 16k mono WAV (robust vs MP3 issues)
    converted_paths = []
    try:
        for p in raw_paths:
            converted_paths.append(_convert_to_wav_16k_mono(p))
    except Exception as e:
        return f"Error while preparing audio: {e}"

    # Language handling
    lang_list = None
    if languages:
        tokens = [x.strip() for x in languages.split(",") if x.strip()]
        if len(tokens) == 1 and len(converted_paths) > 1:
            # single language -> apply to all files
            lang_list = tokens * len(converted_paths)
        elif len(tokens) == len(converted_paths):
            lang_list = tokens
        else:
            return (
                "Language codes must either:\n"
                "- Be a single code (used for all files), or\n"
                "- Match the number of uploaded files."
            )
    else:
        # None => allow model / pipeline defaults (for supported configs)
        lang_list = None

    # Run transcription
    try:
        transcriptions = pipeline.transcribe(
            converted_paths,
            lang=lang_list,
            batch_size=min(len(converted_paths), 4),
        )
    except Exception as e:
        return f"Error during transcription: {e}"

    # Format results nicely
    blocks = []
    for original, converted, text in zip(raw_paths, converted_paths, transcriptions):
        name = os.path.basename(original)
        blocks.append(f"### {name}\n{text}")

    # (Optionally) clean up converted files; comment out if you prefer caching
    for cp in converted_paths:
        try:
            os.remove(cp)
        except OSError:
            pass

    return "\n\n".join(blocks)


# ----------------------------
# Gradio UI
# ----------------------------
with gr.Blocks() as demo:
    gr.Markdown(
        """
# Omnilingual ASR – Gradio Demo

- Upload one or more **audio files** (≤ 40s each).
- Any common format is accepted (mp3, wav, flac); backend converts to 16k WAV.
- Optionally specify language codes, e.g. `eng_Latn`, `hin_Deva`, `deu_Latn`.
- Leave languages empty to rely on model behavior / auto-handling.
"""
    )

    files_input = gr.Files(
        label="Upload Audio Files",
        file_count="multiple",
        type="filepath",  # we use server-side paths
    )

    languages_input = gr.Textbox(
        label="Languages (optional, comma-separated)",
        placeholder="Example: eng_Latn (or eng_Latn, deu_Latn)",
    )

    transcribe_btn = gr.Button("Transcribe")
    output_box = gr.Markdown(label="Transcriptions")

    transcribe_btn.click(
        fn=transcribe_audio,
        inputs=[files_input, languages_input],
        outputs=output_box,
    )


if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
What This Script Does
- Initializes Omnilingual ASR: Loads the omniASR_LLM_7B speech recognition model once for fast inference.
- Handles any audio format: Automatically converts uploaded files (MP3, WAV, FLAC, etc.) to 16 kHz mono WAV using ffmpeg for compatibility.
- Processes multiple languages: Accepts optional comma-separated language codes (e.g., eng_Latn, hin_Deva) or runs without them for auto-detection.
- Runs transcription: Sends the preprocessed audio to the ASR pipeline and returns text transcriptions for each file.
- Provides a web UI: Uses Gradio to create a browser interface where users upload audio and instantly see transcribed text results.
Step 16: Run the Gradio WebUI
python app.py
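If you want the WebUI to keep running after you close your SSH session, you can optionally start it in the background instead:

nohup python app.py > gradio.log 2>&1 &
tail -f gradio.log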
Step 17: Access Gradio WebUI in Your Browser
Go to:
http://<your-vm-ip>:7860/ (or http://localhost:7860/ after forwarding the port over SSH, as shown below)
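Since the app runs on a remote VM, the simplest way to reach it from your local browser is to forward port 7860 over SSH and then open http://localhost:7860/. The user, IP, and key path below are placeholders for your own connection details:

ssh -L 7860:localhost:7860 -i ~/.ssh/<your-private-key> <your-vm-user>@<your-vm-ip>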
Step 18: Play with the Model
Conclusion
Omnilingual ASR marks a major milestone in open-source speech technology — bringing accurate, large-scale multilingual transcription to over 1,600 languages, many of which were never supported before. Its combination of Wav2Vec2, CTC, and LLM-based architectures enables both precision and adaptability, while the LLM-powered 7B model delivers state-of-the-art performance even on low-resource or unseen languages.
With simple installation, lightweight inference, and a ready-to-use Gradio WebUI, developers, linguists, and researchers can now easily build inclusive, real-time, multilingual speech recognition systems — from global enterprise applications to community-driven language preservation projects.