Parakeet-TDT-0.6B-v3 is NVIDIA’s multilingual automatic speech recognition (ASR) model with 600M parameters, built on the FastConformer-TDT architecture. It supports 25 European languages, automatically detects the input language, and delivers accurate transcriptions with punctuation and capitalization. Optimized for NVIDIA GPUs via the NeMo toolkit, it handles both short clips and long-form audio (up to 3 hours with local attention). Trained on a mix of the Granary dataset (660K hours) and NeMo ASR Set 3.0 (10K hours), it achieves strong performance across multilingual benchmarks while remaining lightweight enough for production deployment.
Multilingual ASR
The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):
Language | Fleurs | MLS | CoVoST |
---|---|---|---|
Average WER ↓ | 11.97% | 7.83% | 11.98% |
bg | 12.64% | – | – |
cs | 11.01% | – | – |
da | 18.41% | – | – |
de | 5.04% | – | 4.84% |
el | 20.70% | – | – |
en | 4.85% | – | 6.80% |
es | 3.45% | 4.39% | 3.41% |
et | 17.73% | – | 22.04% |
fi | 13.21% | – | – |
fr | 5.15% | 4.97% | 6.05% |
hr | 12.46% | – | – |
hu | 15.72% | – | – |
it | 3.00% | 10.08% | 3.69% |
lt | 20.35% | – | – |
lv | 22.84% | – | 38.36% |
mt | 20.46% | – | – |
nl | 7.48% | 12.78% | 6.50% |
pl | 7.31% | 7.28% | – |
pt | 4.76% | 7.50% | 3.96% |
ro | 12.44% | – | – |
ru | 5.51% | – | 3.00% |
sk | 8.82% | – | – |
sl | 24.03% | – | 31.80% |
sv | 15.08% | – | 20.16% |
uk | 6.79% | – | 5.10% |
Hugging Face Open ASR Leaderboard
Model | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli |
---|---|---|---|---|---|---|---|---|---|
parakeet-tdt-0.6b-v3 | 6.34% | 11.31% | 11.42% | 9.59% | 1.93% | 3.59% | 3.97% | 2.75% | 6.14% |
Noise Robustness
Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples [14]:
SNR Level | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
---|---|---|---|---|---|---|---|---|---|---|
Clean | 6.34% | 11.31% | 11.42% | 9.59% | 1.93% | 3.59% | 3.97% | 2.75% | 6.14% | – |
SNR 10 | 7.12% | 13.99% | 11.79% | 9.96% | 2.15% | 4.55% | 4.45% | 3.05% | 6.99% | -12.28% |
SNR 5 | 8.23% | 17.59% | 13.01% | 10.69% | 2.62% | 6.05% | 5.23% | 3.33% | 7.31% | -29.81% |
SNR 0 | 11.66% | 24.44% | 17.34% | 13.60% | 4.82% | 10.38% | 8.41% | 5.39% | 8.91% | -83.97% |
SNR -5 | 19.88% | 34.91% | 26.92% | 21.41% | 12.21% | 19.98% | 16.96% | 11.36% | 15.30% | -213.64% |
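The Relative Change column appears to be the relative increase in average WER versus the clean condition; for example, at SNR 10 it is (6.34 - 7.12) / 6.34 ≈ -12.3%, which matches the reported -12.28% up to rounding.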
GPU Configuration Table for Parakeet-TDT-0.6B-v3
Scenario | GPU(s) | VRAM / GPU | Total VRAM | Max Audio Length | Precision | System RAM (rec.) | Notes |
---|---|---|---|---|---|---|---|
Minimum Inference (short clips) | 1× T4 / L4 | 16 GB | 16 GB | ~10–15 min | FP16 | 8–16 GB | Suitable for smaller apps or demos |
Standard Inference (long audio) | 1× A30 / V100 | 24–32 GB | 24–32 GB | ~1 hr | FP16 / BF16 | 32 GB | Balance of cost and throughput |
High-Throughput Production | 1× A100 40 GB | 40 GB | 40 GB | ~2 hr | FP16 / BF16 | 64 GB | Handles batch transcription and noise-robust inputs |
Maximum Long-Form Audio | 1× A100 80 GB or H100 80 GB | 80 GB | 80 GB | Up to 24 min (full attention) / 3 hr (local attention) | FP16 / BF16 | 128 GB | Recommended for research, streaming, or multi-language workloads |
Massive Training / Fine-tuning | 4–128× A100 80 GB | 80 GB each | 320 GB – 10 TB | 24 min – 3 hr | Mixed Precision | 256 GB+ | Matches NVIDIA training setup (128 A100s, 150K steps) |
Resources
Link: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
Step-by-Step Process to Install & Run NVIDIA Parakeet TDT 0.6B V3 Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTXA6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running NVIDIA Parakeet TDT 0.6B V3, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like NVIDIA Parakeet TDT 0.6B V3
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations, which is perfect for installing dependencies, running benchmarks, and launching tools like NVIDIA Parakeet TDT 0.6B V3.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the NVIDIA Parakeet TDT 0.6B V3 runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a Newer Version
Run the following commands to check the available Python version.
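A typical check, assuming python3 is already on the PATH:
python3 --version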
If you check the Python version, the system has Python 3.8.1 available by default. To install a higher version of Python, you’ll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.10
Now, run the following command to install Python 3.10 or another desired version:
sudo apt-get update
sudo apt-get install -y python3.10 python3.10-venv python3.10-dev python3-pip
Step 10: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
python3.10 -m venv ~/.venvs/parakeet310
source ~/.venvs/parakeet310/bin/activate
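To confirm the virtual environment is active, check that the interpreter now reports Python 3.10 and points to the venv path:
python --version
which python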
Step 11: Update and Install Basic Dependencies
Run the following command to update and install basic dependencies:
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential git wget curl ffmpeg
Step 12: Install PyTorch with CUDA Support
Run the following command to install PyTorch with CUDA support:
pip install torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 \
--index-url https://download.pytorch.org/whl/cu121
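Before moving on, it’s worth verifying that PyTorch can see the GPU. A quick check from the activated environment; if CUDA is set up correctly, the second value should be True:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"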
Step 13: Install NeMo Toolkit for ASR
Run the following command to install NeMo toolkit for ASR:
pip install "nemo_toolkit[asr]==2.4.0"
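Optionally, confirm that the NeMo ASR collection imports cleanly before downloading the model (the first import can take a little while):
python -c "import nemo.collections.asr as nemo_asr; print('NeMo ASR import OK')"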
Step 14: Download a Sample Audio File
Run the following command to download a sample audio file:
wget -q https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
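The model expects 16 kHz mono audio. If you’d prefer to test with your own recording, ffmpeg (installed in Step 11) can convert it; input.mp3 below is just a placeholder for your file:
ffmpeg -i input.mp3 -ac 1 -ar 16000 sample_16k.wav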
Step 15: Connect to Your GPU VM with a Code Editor
Before you start running model script and streamlit scripts with the NVIDIA Parakeet TDT 0.6B V3 model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 16: Create the Python Script and Download the Model (e.g., quick_asr.py)
We’ll write a full Python script that downloads the model and generates a transcription in the terminal.
Create quick_asr.py in your VM (inside your project folder) and add the following code:
# save as quick_asr.py
import nemo.collections.asr as nemo_asr
asr = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v3"
)
# sample wav (16 kHz mono)
# wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
out = asr.transcribe(["2086-149220-0033.wav"])
print(out[0].text)
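If you also want timestamps, transcribe() accepts a timestamps flag. Here’s a minimal, optional extension of quick_asr.py that prints segment times, using the same output structure the Streamlit app relies on later in this guide:
out = asr.transcribe(["2086-149220-0033.wav"], timestamps=True)
print(out[0].text)
# each segment entry carries start/end times in seconds
for seg in out[0].timestamp["segment"]:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s: {seg['segment']}")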
Step 17: Run the Script
Run the script with the following command:
python3 quick_asr.py
This will download the model and print the transcription in the terminal.
Up to this point, we’ve been interacting with the model entirely through the terminal, running commands and viewing outputs directly in the shell. While this works for quick tests, it’s not the most user-friendly way to explore or demo results. Now, we’ll take things a step further by setting up a simple Streamlit interface so we can generate and view transcriptions right in the browser. This gives us a clean, interactive web app where we can upload audio, run the model, and instantly see the transcription rendered in real time.
Step 18: Install Streamlit
Run the following command to install Streamlit:
pip install streamlit==1.37.1
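The Streamlit app in the next step also imports soundfile, numpy, and librosa for audio loading and resampling. These are normally pulled in as dependencies of nemo_toolkit[asr], but if any of them are missing in your environment, install them explicitly:
pip install soundfile librosa numpy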
Step 19: Create the Streamlit App Script (app.py)
We’ll write a full Streamlit UI that lets you generate transcriptions from the model in the browser.
Create app.py in your VM (inside your project folder) and add the following code:
import os
import io
import tempfile

import streamlit as st
import soundfile as sf
import numpy as np
# Optional: resampling without ffmpeg
import librosa


@st.cache_resource(show_spinner=True)
def load_model():
    import nemo.collections.asr as nemo_asr
    asr = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt-0.6b-v3"
    )
    return asr


def ensure_16k_mono_wav(raw_bytes: bytes, target_sr=16000):
    """Load any uploaded audio, convert to 16k mono WAV on disk, return path."""
    # Try to read directly
    try:
        data, sr = sf.read(io.BytesIO(raw_bytes), always_2d=False)
    except Exception:
        # Fallback via librosa; transpose to (samples, channels) to match soundfile
        y, sr = librosa.load(io.BytesIO(raw_bytes), sr=None, mono=False)
        data = y.T if y.ndim == 2 else y
    # To mono
    if data.ndim == 2:
        data = np.mean(data, axis=1)
    # Resample if needed
    if sr != target_sr:
        data = librosa.resample(y=data, orig_sr=sr, target_sr=target_sr)
        sr = target_sr
    # Write to a temp WAV
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    sf.write(tmp.name, data, sr, subtype="PCM_16")
    return tmp.name


st.set_page_config(page_title="Parakeet-TDT-0.6B-v3 ASR", layout="centered")
st.title("🎙️ Parakeet-TDT-0.6B-v3 — Multilingual ASR (Streamlit)")
st.write(
    "Upload an audio file (WAV/FLAC/MP3/etc). The app converts it to **16 kHz mono WAV**, "
    "then runs NVIDIA Parakeet for transcription. Supports timestamps and long-audio mode."
)

col1, col2 = st.columns(2)
with col1:
    use_timestamps = st.checkbox("Return timestamps (word & segment)", value=True)
with col2:
    long_audio = st.checkbox("Long-audio mode (local attention)", value=False)

uploaded = st.file_uploader("Upload audio file", type=["wav", "flac", "mp3", "m4a", "ogg"])
transcribe_btn = st.button("Transcribe", type="primary", disabled=uploaded is None)

# Lazy-load the model
asr = load_model()

if long_audio:
    # Switch to local attention for very long recordings
    # (handles up to hours by chunking context windows)
    try:
        asr.change_attention_model(
            self_attention_model="rel_pos_local_attn",
            att_context_size=[256, 256]
        )
    except Exception as e:
        st.warning(f"Could not enable long-audio local attention: {e}")

if transcribe_btn and uploaded:
    with st.spinner("Converting & transcribing..."):
        try:
            wav_path = ensure_16k_mono_wav(uploaded.getvalue(), target_sr=16000)
            outputs = asr.transcribe([wav_path], timestamps=use_timestamps)
            text = outputs[0].text

            st.subheader("Transcript")
            st.text_area("Text", value=text, height=200)

            if use_timestamps:
                ts = outputs[0].timestamp
                # Segment table
                segs = ts.get("segment", [])
                if segs:
                    st.markdown("**Segment timestamps**")
                    st.dataframe(
                        [{"start_s": f"{s['start']:.2f}", "end_s": f"{s['end']:.2f}", "segment": s["segment"]}
                         for s in segs],
                        use_container_width=True
                    )
                # Word table
                words = ts.get("word", [])
                if words:
                    st.markdown("**Word timestamps**")
                    st.dataframe(
                        [{"start_s": f"{w['start']:.2f}", "end_s": f"{w['end']:.2f}", "word": w["word"]}
                         for w in words],
                        use_container_width=True
                    )

            # Cleanup tmp file
            try:
                os.remove(wav_path)
            except Exception:
                pass

        except Exception as e:
            st.error(f"Transcription failed: {e}")
            st.stop()

st.markdown("---")
with st.expander("⚙️ Tips & GPU notes"):
    st.write(
        "- For long recordings, enable **Long-audio mode** to reduce memory by using local attention.\n"
        "- If you hit GPU OOM, try shorter clips or disable timestamps first, then re-enable.\n"
        "- Best throughput: A100/H100 80GB; balanced: L40S/A6000 48GB; budget: L4 24GB/T4 16GB."
    )
Step 20: Launch the Streamlit App
Now that we’ve written our app.py Streamlit script, the next step is to launch the app from the terminal.
Run the following command inside your VM:
streamlit run app.py --server.address 0.0.0.0 --server.port 8501
Once executed, Streamlit will start the web server and you’ll see a message:
You can now view your Streamlit app in your browser.
URL: http://0.0.0.0:8501
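Because Streamlit is running on the remote GPU VM, you may need to forward port 8501 to your local machine before opening the app in a browser (replace user and <vm-ip> with your own SSH details):
ssh -L 8501:localhost:8501 user@<vm-ip>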
Step 21: Access the Streamlit App in Browser
After launching the app, you’ll see the interface in your browser.
Go to:
http://localhost:8501
Upload the audio and generate text.
Conclusion
Parakeet-TDT-0.6B-v3 hits a rare sweet spot: it’s compact enough to run economically, yet strong enough to deliver production-grade multilingual transcription with punctuation, capitalization, and precise timestamps. With automatic language detection across 25 European languages and a FastConformer-TDT backbone, it handles everything from short clips to multi-hour recordings (via local attention) without the operational headaches of heavyweight models. The NeMo toolchain makes setup, streaming, and fine-tuning straightforward, and our Streamlit UI shows how quickly you can move from terminal tests to a clean, browser-based experience. If your workload spans European languages and you want reliable throughput on modern NVIDIA GPUs (L4/A6000 for balanced cost; A100/H100 when you need maximum headroom), Parakeet-TDT-0.6B-v3 is an easy default. For best results, keep audio at 16 kHz mono, consider domain-specific fine-tuning, and pair it with diarization/VAD if you need speaker-aware transcripts. In short: fast to deploy, affordable to scale, and accurate enough to trust—ready for real apps today.