Omnilingual ASR is Meta’s groundbreaking open-source speech recognition system built to support over 1,600 languages, including hundreds never before covered by any ASR model.
It’s designed for inclusivity — allowing new languages to be added with just a few paired examples — and combines scalable zero-shot learning with flexible model architectures (Wav2Vec2, CTC, and LLM-based).
The flagship omniASR_LLM_7B model achieves state-of-the-art transcription accuracy, with character error rates (CER) below 10% for nearly 80% of supported languages, making it the most linguistically comprehensive ASR system released to date.
Each model is fully compatible with PyTorch, Fairseq2, and Hugging Face datasets, making it easy for developers and researchers to build multilingual transcription systems at scale.
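As a quick taste of the API before we set anything up, here is a minimal sketch based on the inference pipeline used in the full script later in this tutorial (the audio path and language code are placeholders; pick a model card that fits your GPU using the table below):

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Load a model card once (smaller cards need less VRAM; see the GPU table below)
pipeline = ASRInferencePipeline(model_card="omniASR_LLM_1B")

# Transcribe a short (<= 40s) audio clip, conditioning on a language code
texts = pipeline.transcribe(["sample.wav"], lang=["eng_Latn"], batch_size=1)
print(texts[0])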
GPU Configuration Guide
| Tier / Use Case | Model Name | Precision | Min VRAM (Approx.) | Suggested GPUs | Notes / Recommendations |
|---|---|---|---|---|---|
| Entry – Lightweight testing / fine-tuning | omniASR_CTC_300M | FP16 / BF16 | 2 GB | T4 16G, RTX 3050 6G, L4 24G | Fast inference; ideal for quick multilingual demos. |
| Standard – Medium-scale multilingual ASR | omniASR_CTC_1B | FP16 / BF16 | 3–4 GB | RTX 3060 12G, RTX 4060 8–16G | Balanced accuracy and efficiency; supports most languages. |
| Advanced – High-quality transcription | omniASR_CTC_3B | FP16 / BF16 | 8 GB | RTX 4070 12G, A10 24G | Strong performance on medium-length clips (≤40s). |
| Pro – Large-scale multilingual decoding | omniASR_CTC_7B | FP16 / BF16 | 15 GB | RTX 4090, L40S 48G, A5000 24G | Best accuracy in CTC family; supports dense decoding. |
| LLM-Powered – Context-aware multilingual ASR | omniASR_LLM_1B | FP16 / BF16 | 6 GB | RTX 3060 12G, A10 24G | LLM-based model with language conditioning; robust output. |
| LLM-Powered – Extended multilingual ASR | omniASR_LLM_3B | FP16 / BF16 | 10 GB | RTX 4070, L4 24G, A5000 24G | High performance on noisy audio; great balance. |
| Flagship – Full-scale multilingual accuracy | omniASR_LLM_7B | FP16 / BF16 | 17 GB | RTX 4090, L40S 48G, A6000, A100 40G | SOTA performance; used for all 1600+ languages. |
| Zero-Shot – Unknown or low-resource languages | omniASR_LLM_7B_ZS | FP16 / BF16 | 20 GB | RTX 4090, A6000, H100, A100 40G/80G | Best for unseen or underrepresented languages; zero-shot inference. |
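If you are unsure which tier your GPU falls into, a quick VRAM check helps. Here is a minimal sketch, assuming PyTorch is available in your environment:

import torch

# Report the name and total memory of the first CUDA device
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected")

Compare the reported VRAM against the “Min VRAM” column above to choose a model.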
Resources
Link: https://github.com/facebookresearch/omnilingual-asr
Step-by-Step Process to Install & Run Omnilingual ASR Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Omnilingual ASR, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc).
- Proper support for building and running GPU-based models like Omnilingual ASR.
- Compatibility with CUDA 12.1.1, required by certain model operations.
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Omnilingual ASR.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the Omnilingual ASR runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python 3.11 and Pip (the VM ships with Python 3.10, so we upgrade it)
Run the following commands to check the available Python version.
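For example (the exact output depends on your base image):

python3 --version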
By default, the system has Python 3.10.12 installed. To install a newer Python version, you’ll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.11, Pip and Wheel:
apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version
Step 9: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version
Step 10: Install Omnilingual-ASR
Run the following command to install omnilingual-asr:
pip install omnilingual-asr
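Optionally, confirm the installed version with:

pip show omnilingual-asr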
Step 11: Install libsndfile1 (Required for Fairseq2 / Omnilingual ASR Audio Support)
Run the following command to install libsndfile1:
sudo apt update
sudo apt install -y libsndfile1
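As an optional sanity check, you can confirm the shared library is visible to the system:

ldconfig -p | grep libsndfile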
Step 12: Install and Verify the Data Extras and Datasets Package
Run the following command to install the omnilingual-asr data extras together with the Hugging Face datasets library:
pip install "omnilingual-asr[data]" datasets
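To verify that both packages import cleanly (this only imports the modules; it does not download any model weights), you can run:

python -c "from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline; print('omnilingual_asr OK')"
python -c "import datasets; print('datasets', datasets.__version__)"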
Step 13: Install and Verify Gradio (Web Interface Framework)
Run the following command to install Gradio:
pip install gradio
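You can confirm the installation and version with:

python -c "import gradio; print('gradio', gradio.__version__)"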
Step 14: Connect to Your GPU VM with a Code Editor
Before you start running the model script with the Omnilingual ASR model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
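If your editor connects over standard OpenSSH configuration (as VS Code and Cursor do), an entry like the following in ~/.ssh/config makes the connection one click away. The host alias, IP, user, and key path below are placeholders for your own VM details:

Host nodeshift-gpu
    HostName <your-vm-ip>
    User <your-vm-user>
    IdentityFile ~/.ssh/<your-private-key>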
Step 15: Create the Script
Create a file (e.g., app.py) and add the following code:
import os
import uuid
import subprocess

import gradio as gr

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# ----------------------------
# Config
# ----------------------------
MODEL_CARD = "omniASR_LLM_7B"  # or "omniASR_LLM_7B_ZS"
TMP_DIR = "/tmp/omnilingual_asr"  # for converted audio
os.makedirs(TMP_DIR, exist_ok=True)

# ----------------------------
# Init pipeline once
# ----------------------------
pipeline = ASRInferencePipeline(model_card=MODEL_CARD)


def _convert_to_wav_16k_mono(input_path: str) -> str:
    """
    Convert any input (mp3, wav, etc.) to 16kHz mono WAV using ffmpeg.
    Returns path to the converted file.
    """
    if not os.path.exists(input_path):
        raise FileNotFoundError(f"Uploaded file not found: {input_path}")

    # Unique target filename
    out_path = os.path.join(TMP_DIR, f"{uuid.uuid4().hex}.wav")

    cmd = [
        "ffmpeg",
        "-y",
        "-i",
        input_path,
        "-ar", "16000",
        "-ac", "1",
        out_path,
    ]

    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"ffmpeg failed to convert {input_path}: {e}") from e

    if not os.path.exists(out_path):
        raise RuntimeError(f"Converted file missing for {input_path}")

    return out_path


def transcribe_audio(files, languages):
    """
    Gradio callback:
      - files: list of filepaths (type='filepath')
      - languages: optional comma-separated codes:
        eng_Latn, hin_Deva, deu_Latn, ...
    """
    if not files:
        return "Please upload at least one audio file."

    # Normalize to concrete paths
    raw_paths = []
    for f in files:
        path = f if isinstance(f, str) else getattr(f, "name", None)
        if not path or not os.path.exists(path):
            return f"Could not find uploaded file on server: {f}"
        raw_paths.append(path)

    # Convert all to 16k mono WAV (robust vs MP3 issues)
    converted_paths = []
    try:
        for p in raw_paths:
            converted_paths.append(_convert_to_wav_16k_mono(p))
    except Exception as e:
        return f"Error while preparing audio: {e}"

    # Language handling
    lang_list = None
    if languages:
        tokens = [x.strip() for x in languages.split(",") if x.strip()]
        if len(tokens) == 1 and len(converted_paths) > 1:
            # single language -> apply to all files
            lang_list = tokens * len(converted_paths)
        elif len(tokens) == len(converted_paths):
            lang_list = tokens
        else:
            return (
                "Language codes must either:\n"
                "- Be a single code (used for all files), or\n"
                "- Match the number of uploaded files."
            )
    else:
        # None => allow model / pipeline defaults (for supported configs)
        lang_list = None

    # Run transcription
    try:
        transcriptions = pipeline.transcribe(
            converted_paths,
            lang=lang_list,
            batch_size=min(len(converted_paths), 4),
        )
    except Exception as e:
        return f"Error during transcription: {e}"

    # Format results nicely
    blocks = []
    for original, converted, text in zip(raw_paths, converted_paths, transcriptions):
        name = os.path.basename(original)
        blocks.append(f"### {name}\n{text}")

    # (Optionally) clean up converted files; comment out if you prefer caching
    for cp in converted_paths:
        try:
            os.remove(cp)
        except OSError:
            pass

    return "\n\n".join(blocks)


# ----------------------------
# Gradio UI
# ----------------------------
with gr.Blocks() as demo:
    gr.Markdown(
        """
# Omnilingual ASR – Gradio Demo

- Upload one or more **audio files** (≤ 40s each).
- Any common format is accepted (mp3, wav, flac); backend converts to 16k WAV.
- Optionally specify language codes, e.g. `eng_Latn`, `hin_Deva`, `deu_Latn`.
- Leave languages empty to rely on model behavior / auto-handling.
"""
    )

    files_input = gr.Files(
        label="Upload Audio Files",
        file_count="multiple",
        type="filepath",  # we use server-side paths
    )

    languages_input = gr.Textbox(
        label="Languages (optional, comma-separated)",
        placeholder="Example: eng_Latn (or eng_Latn, deu_Latn)",
    )

    transcribe_btn = gr.Button("Transcribe")
    output_box = gr.Markdown(label="Transcriptions")

    transcribe_btn.click(
        fn=transcribe_audio,
        inputs=[files_input, languages_input],
        outputs=output_box,
    )


if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
What This Script Does
- Initializes Omnilingual ASR: Loads the omniASR_LLM_7B speech recognition model once for fast inference.
- Handles any audio format: Automatically converts uploaded files (MP3, WAV, FLAC, etc.) to 16 kHz mono WAV using ffmpeg for compatibility.
- Processes multiple languages: Accepts optional comma-separated language codes (e.g., eng_Latn, hin_Deva) or runs without them for auto-detection.
- Runs transcription: Sends the preprocessed audio to the ASR pipeline and returns text transcriptions for each file.
- Provides a web UI: Uses Gradio to create a browser interface where users upload audio and instantly see transcribed text results.
Step 16: Run the Gradio WebUI
python app.py
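If you want the WebUI to keep running after you close your SSH session, you can optionally start it in the background instead:

nohup python app.py > gradio.log 2>&1 &
tail -f gradio.log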
Step 17: Access Gradio WebUI in Your Browser
Go to:
http://<your-vm-ip>:7860/ (or http://localhost:7860/ after forwarding the port over SSH, as shown below)
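Since the app runs on a remote VM, the simplest way to reach it from your local browser is to forward port 7860 over SSH and then open http://localhost:7860/. The user, IP, and key path below are placeholders for your own connection details:

ssh -L 7860:localhost:7860 -i ~/.ssh/<your-private-key> <your-vm-user>@<your-vm-ip>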
Step 18: Play with the Model
Conclusion
Omnilingual ASR marks a major milestone in open-source speech technology — bringing accurate, large-scale multilingual transcription to over 1,600 languages, many of which were never supported before. Its combination of Wav2Vec2, CTC, and LLM-based architectures enables both precision and adaptability, while the LLM-powered 7B model delivers state-of-the-art performance even on low-resource or unseen languages.
With simple installation, lightweight inference, and a ready-to-use Gradio WebUI, developers, linguists, and researchers can now easily build inclusive, real-time, multilingual speech recognition systems — from global enterprise applications to community-driven language preservation projects.