Multilingual speech recognition initiative for African languages

2 minute read

Published:

Download the Paper

KEY INSIGHT

The paper discusses the need to address low-resource language gaps in speech recognition for African languages and proposes methods to improve language representation using modern machine learning techniques. It introduces the DVoice initiative, a community-driven initiative to create datasets and models for various African languages, focusing on accessibility and collaboration. The paper also presents various approaches for developing multilingual speech recognition systems, including innovative model training techniques, focusing on data quality and collection, and discussing the potential social and economic benefits of ASR systems for communities in Africa, particularly in improving access to technology for populations with high illiteracy rates.

The authors trained monolingual models for seven African languages, using a naïve multilingual approach that led to grapheme overlapping. They then developed a one-hot encoder vector for speech features during training to address this issue. The final model used language-specific tokens to predict spoken language, eliminating the need for a language identification model during inference. Self-supervised learning techniques were used to generalize across languages without extensive labeled datasets. Data quality and collection initiatives, such as Mozilla CommonVoice, were also emphasized for improving ASR system performance.

The study focuses on automatic speech recognition (ASR) systems developed for seven African languages, aiming to represent the continent's linguistic diversity. The training environment was a high-performance computing cluster with two GPU nodes, allowing for multilingual experiments. The datasets were divided into three sets: 60% for training, 20% for validation, and 20% for testing. The Word Error Rate (WER) metrics were calculated to assess the accuracy of the ASR systems. The study found that removing diacritics from the data improved model performance in tonal languages, suggesting that addressing lexical ambiguity could enhance understanding and recognition in these languages. Data augmentation techniques were used to enhance training datasets and improve ASR performance. The paper presents three different multilingual approaches, each with varying complexity and effectiveness, indicating that these approaches could significantly improve the use of large-scale speech technologies for low-resource languages.

The paper presents automatic speech recognition (ASR) systems designed for African languages, focusing on seven languages representing the continent's linguistic diversity. The authors believe that similar projects can be easily reproduced for other African languages within the same linguistic groups and families. Lexical ambiguity, introduced by removing diacritics, significantly enhanced model performance in tonal languages, suggesting future research should explore its effects on language understanding. Data augmentation techniques were used to address data scarcity in African languages. Future work with language models could enhance transcriptions, leading to more accurate and reliable ASR systems for African languages. A cross-lingual Wav2Vec model is also proposed for further research.