Clarispeech: LLM-Enhanced Speech Recognition Post-Correction
Abstract
Recent advances in Automatic Speech Recognition (ASR) have made these systems widely applicable, including in virtual assistants and web-based interfaces. However, even cutting-edge ASR models often produce errors, particularly when adapting to new speech domains. Conventional solutions involve fine-tuning ASR models on target-domain data or integrating language models (LMs) to rescore predictions. Joint fine-tuning of ASR and LM models, however, can be unstable, demand substantial training data, and suffer from alignment issues. Using more sophisticated language models for shallow fusion, especially large language models (LLMs), is impractical, as it incurs significant computational overhead. In this paper, we address these challenges by focusing on post-transcription correction, using parameter-efficient fine-tuning of external language models while leaving the ASR system frozen. Our experiments show that this approach significantly improves accuracy and computational efficiency. Compared to the baseline ASR system, the ASR+LLM configuration reduces the word error rate from 12% to 10%, while increasing computational cost by less than 50%, despite an eightfold rise in the number of parameters.
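The pipeline described in the abstract can be illustrated with a minimal sketch: a frozen, off-the-shelf ASR model produces a hypothesis transcript, and an external LM adapted with a parameter-efficient method (LoRA here) rewrites it. This is not the paper's implementation; the model names (openai/whisper-small, google/flan-t5-base), the prompt format, and the LoRA settings are illustrative assumptions.

```python
# Sketch of post-transcription correction: frozen ASR + LoRA-adapted LM.
# Models, prompt, and hyperparameters are assumptions, not the paper's setup.
import torch
from transformers import (
    WhisperProcessor, WhisperForConditionalGeneration,
    AutoTokenizer, AutoModelForSeq2SeqLM,
)
from peft import LoraConfig, get_peft_model

# 1) ASR front end, kept frozen throughout training and inference.
asr_processor = WhisperProcessor.from_pretrained("openai/whisper-small")
asr_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
asr_model.eval()
for p in asr_model.parameters():
    p.requires_grad = False

# 2) External LM for correction; only small LoRA adapter weights are trainable.
corrector_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
corrector = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q", "v"],
                      task_type="SEQ_2_SEQ_LM")
corrector = get_peft_model(corrector, lora_cfg)

def transcribe_and_correct(waveform, sampling_rate=16_000):
    """Run the frozen ASR model, then rewrite its hypothesis with the LM."""
    inputs = asr_processor(waveform, sampling_rate=sampling_rate,
                           return_tensors="pt")
    with torch.no_grad():
        hyp_ids = asr_model.generate(inputs.input_features)
    hypothesis = asr_processor.batch_decode(hyp_ids, skip_special_tokens=True)[0]

    prompt = f"Correct the ASR transcript: {hypothesis}"
    enc = corrector_tokenizer(prompt, return_tensors="pt")
    out = corrector.generate(**enc, max_new_tokens=128)
    return corrector_tokenizer.decode(out[0], skip_special_tokens=True)
```

Because only the adapter weights are updated, the corrector can be fine-tuned on pairs of ASR hypotheses and reference transcripts from the target domain without ever touching the ASR system itself.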