Khmer_ASR_CodeSwitching

Khmer_ASR_CodeSwitching is an automatic speech recognition (ASR) model designed to transcribe Khmer speech that contains English code-switching. In everyday Cambodian conversation, speakers often mix Khmer and English within the same sentence, which ASR systems built for a single language tend to transcribe poorly. This project addresses that gap by targeting robust transcription of bilingual and mixed-language audio.

The model is intended for research, education, and practical applications such as voice assistants, transcription tools, and multilingual AI systems in Cambodia and Southeast Asia.

Model Overview

The Khmer_ASR_CodeSwitching model is built on a deep learning–based ASR pipeline that processes raw audio input and outputs text containing both Khmer and English tokens.

Key characteristics:

- Supports bilingual transcription (Khmer + English)
- Optimized for conversational speech
- Designed for noisy, real-world audio

Architecture

The model follows a standard ASR pipeline:

Audio Preprocessing
- Input audio is resampled to a fixed sampling rate
- Noise normalization and feature extraction are applied
Acoustic Model
- Learns speech-to-text mappings
- Encodes phonetic patterns from Khmer and English speech
Language Modeling
- Enables code-switching recognition
- Maintains correct word boundaries across languages
Decoder
- Converts model outputs into readable text
- Produces mixed Khmer–English transcriptions
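
The audio preprocessing stage listed above can be illustrated with a small, dependency-free sketch. It is not the project's actual preprocessing code: the function names `resample_linear` and `peak_normalize` are hypothetical, linear interpolation stands in for a proper resampler, and simple peak normalization stands in for the noise normalization step, whose exact method the source does not describe. Feature extraction is left out, since in practice it is handled by the model's processor.

```python
def resample_linear(samples, orig_sr, target_sr=16000):
    """Resample a mono signal by linear interpolation (illustrative only)."""
    if orig_sr == target_sr or len(samples) < 2:
        return list(samples)
    n_out = int(round(len(samples) * target_sr / orig_sr))
    ratio = orig_sr / target_sr  # input samples per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * ratio
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Weighted average of the two nearest input samples
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def peak_normalize(samples, peak=1.0):
    """Scale the signal so its largest absolute value equals `peak`."""
    max_abs = max((abs(s) for s in samples), default=0.0)
    if max_abs == 0.0:
        return list(samples)  # silent input: nothing to scale
    scale = peak / max_abs
    return [s * scale for s in samples]


# Toy input: four samples recorded at 8 kHz, upsampled to 16 kHz
signal_8k = [0.0, 0.5, -0.25, 0.1]
signal_16k = peak_normalize(resample_linear(signal_8k, orig_sr=8000))
print(len(signal_16k))
```

For real audio, a library resampler such as the one behind `librosa.load(path, sr=16000)` is the better choice, because linear interpolation does not apply an anti-aliasing filter.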

Code-Switching Support

Code-switching is a core feature of this model. It supports:

- Khmer sentences with embedded English words
- English technical terms inside Khmer speech
- Natural switching without forcing a single language output

Example:

Audio: “ថ្ងៃនេះ meeting មាន importance ខ្លាំង”
Output: “ថ្ងៃនេះ meeting មាន importance ខ្លាំង”
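
To inspect where a transcription switches language, one option is to tag each whitespace-separated token by script. The sketch below is an illustration and not part of the model: `tag_tokens` and `count_switches` are hypothetical helpers, and the check relies only on the Unicode Khmer block (U+1780 to U+17FF). Because Khmer is normally written without spaces between words, a "token" here is a whitespace-delimited chunk, not necessarily a single word.

```python
KHMER_BLOCK = (0x1780, 0x17FF)  # Unicode "Khmer" block


def is_khmer(ch):
    """True if the character falls inside the Unicode Khmer block."""
    return KHMER_BLOCK[0] <= ord(ch) <= KHMER_BLOCK[1]


def tag_tokens(text):
    """Label each whitespace-separated token as 'km', 'en', or 'other'."""
    tagged = []
    for token in text.split():
        if any(is_khmer(ch) for ch in token):
            tagged.append(("km", token))
        elif any(ch.isascii() and ch.isalpha() for ch in token):
            tagged.append(("en", token))
        else:
            tagged.append(("other", token))  # digits, punctuation, etc.
    return tagged


def count_switches(tagged):
    """Count the points where the language label changes between tokens."""
    langs = [lang for lang, _ in tagged if lang != "other"]
    return sum(1 for a, b in zip(langs, langs[1:]) if a != b)


transcription = "ថ្ងៃនេះ meeting មាន importance ខ្លាំង"
tagged = tag_tokens(transcription)
for lang, token in tagged:
    print(lang, token)
print("switch points:", count_switches(tagged))
```

A script-based check like this cannot distinguish English from other Latin-script languages, so it is only a rough diagnostic.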

Create a virtual environment

python3 -m venv venv
source venv/bin/activate

Install required packages

pip install torch torchaudio transformers librosa soundfile
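
After installation, a quick check that the packages are importable from the active environment can save debugging time later. The following is a minimal sketch; `missing_packages` is a hypothetical helper, and it assumes each package's import name matches its pip name, which holds for the five packages listed above.

```python
import importlib.util

# Packages named in the install command above
REQUIRED = ["torch", "torchaudio", "transformers", "librosa", "soundfile"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

`find_spec` only locates a package without importing it, so the check stays fast even for large libraries such as torch.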

Load the Model

import torch
import librosa
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load the processor and model from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("1morecupofhottea/Whisper-Code-Switching-Kh-En")
model = AutoModelForSpeechSeq2Seq.from_pretrained("1morecupofhottea/Whisper-Code-Switching-Kh-En")

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Audio Preprocessing

# Load the audio file and resample it to 16 kHz
audio_path = "example_audio.wav"
speech, sr = librosa.load(audio_path, sr=16000)

# Convert the waveform to model inputs
# (a Whisper-style processor returns `input_features`, not `input_values`)
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
input_features = inputs.input_features.to(device)

Model Inference

# Generate token ids; a sequence-to-sequence model decodes with generate()
# instead of taking an argmax over CTC logits
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode the predicted ids to text, dropping special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print("Transcription:", transcription)
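
Whisper-style processors pad or truncate each input to a 30-second window, so longer recordings need to be split before inference. Assuming this checkpoint follows that convention, the sketch below computes chunk boundaries in samples; `split_into_chunks` is a hypothetical helper, and the optional overlap is an illustrative choice, not something the source specifies.

```python
def split_into_chunks(num_samples, sr=16000, chunk_s=30.0, overlap_s=0.0):
    """Return (start, end) sample index pairs covering the whole signal."""
    chunk = int(chunk_s * sr)
    step = chunk - int(overlap_s * sr)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the final chunk reached the end of the signal
        start += step
    return bounds


# A 70-second recording at 16 kHz yields two full chunks and one short one
for start, end in split_into_chunks(16000 * 70):
    print(start, end)
```

Each `(start, end)` pair can be used to slice the waveform, for example `speech[start:end]`, before passing it to the processor and `model.generate`. Joining the per-chunk transcriptions naively can split a word at a boundary, which is why a small overlap is sometimes used.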
