African language Speech Recognition — Speech-to-Text

Euel Fantaye
Aug 18, 2021

Abstract

Speech recognition is, simply, the ability of devices to respond to spoken commands. It enables hands-free control of various devices and equipment (a particular boon to many disabled persons), provides input for automatic translation, and creates print-ready dictation.

In this project, African language Speech Recognition — Speech-to-Text is implemented end to end, from data loading and metadata generation through model design and prediction to deployment of the model.

My GitHub link

Motivation

The World Food Program wants to collect nutritional information of food bought and sold at markets in two different countries in Africa — Ethiopia and Kenya.

The specific need of this project is that the Tenacious data science consultancy has an agreement with the World Food Program (WFP) to deliver speech-to-text technology for two languages, Amharic and Swahili, through an app that lets users register the list of items they just bought by voice, in their own language.

Why not use an already available ASR system?

There are many mature speech recognition systems available, such as Google Assistant, Amazon Alexa, and Apple’s Siri. However, all of those voice assistants support only a limited set of languages.

Review of previous works

1. Development of Isolated Numeric Speech Corpus for Swahili Language

A speech corpus is the basic requirement for developing an automatic speech recognition (ASR) system, so it should be collected with great accuracy in order to enhance the performance of the system. This paper describes the proposed procedure to follow while collecting a Swahili speech corpus from native and non-native speakers for the development of an ASR system for the Swahili language.

2. Deep Learning for Amharic speech recognition article by Tilaye:

In this article, the core idea is to have a network of interconnected nodes (also known as a Neural Network), where each node computes a function and passes information to the nodes next to it. Every node computes its output using the function it was configured to use and some parameters. The process of learning updates these parameters; the function itself does not change. Initially, the parameters are randomly initialized. Then training data is passed through the network, making small improvements to the parameters at every step. If it all looks like magic, you are not alone! In fact, even leading researchers in the field experiment with different functions and network layouts to see which one works best, although there are some well-understood functions and network architectures. In the case of speech recognition, Recurrent Neural Networks (RNNs) are used, as their output depends not just on the current set of inputs but on previous inputs too. This is crucial in speech recognition, because predicting what has been said in a particular window of time becomes much easier if what has been said before is known.

Data and preprocessing techniques

Data preprocessing prepares the audio data, step by step, for the deep learning model. The techniques used are listed below, followed by a short code sketch of the pipeline.

1. Load audio files: the input data are audio recordings of spoken speech in the “.wav” format.

2. Resample the audio files so as to have uniform sample rates for each item.

3. Convert all the items to have the same number of channels. The channels could be mono (one channel) or stereo (2 channels). Our data has mono channels.

4. Convert all items to have the same duration, which involves padding the shorter audio files and truncating the longer ones.

5. Time-shift the audio left or right randomly by a small percentage, or change the pitch or the speed of the audio by a small amount, to augment the audio data.

6. Convert the raw audio files to Mel Spectrograms which capture the nature of the audio files as images by decomposing them into sets of frequencies.

7. The Mel Spectrograms are converted to Mel Frequency Cepstral Coefficients (MFCCs), which are important when dealing with human speech because MFCCs correspond to the frequency ranges at which humans speak.
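To make the steps above concrete, here is a minimal sketch of the pipeline using librosa and NumPy. The target sample rate, clip duration, and feature sizes are illustrative assumptions, not values taken from the project.

```python
import librosa
import numpy as np

TARGET_SR = 22050        # assumed target sample rate (librosa's default)
TARGET_SECONDS = 5.0     # assumed fixed clip duration
N_MELS = 128
N_MFCC = 13

def preprocess(path: str) -> np.ndarray:
    """Load a .wav file and return MFCC features of fixed length."""
    # Steps 1-3: load, resample to TARGET_SR, and down-mix to mono in one call.
    audio, sr = librosa.load(path, sr=TARGET_SR, mono=True)

    # Step 4: pad short clips with zeros / truncate long clips to a fixed length.
    target_len = int(TARGET_SR * TARGET_SECONDS)
    if len(audio) < target_len:
        audio = np.pad(audio, (0, target_len - len(audio)))
    else:
        audio = audio[:target_len]

    # Step 5: simple augmentation, a random time shift of up to +/- 10%.
    shift = np.random.randint(-target_len // 10, target_len // 10)
    audio = np.roll(audio, shift)

    # Step 6: Mel spectrogram (power), converted to decibels.
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=N_MELS)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Step 7: MFCCs computed from the dB-scaled Mel spectrogram.
    mfcc = librosa.feature.mfcc(S=mel_db, sr=sr, n_mfcc=N_MFCC)
    return mfcc
```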

Preparing the metadata for the transcriptions

We used the librosa package to preprocess our audio files. By default, librosa converts all audio files to mono and resamples them to 22050 Hz when loading, unless other arguments are passed to the load function. Now that we have our input features (the preprocessed audio files) and our target labels (the preprocessed transcriptions), we can build a deep neural network that accepts the preprocessed data as input and returns a predicted transcription of the spoken language as output.
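As an illustration, the sketch below builds a simple metadata table that pairs each audio file with its transcription and duration. The file layout (a transcription file with one “<utterance_id> <transcription>” line per utterance and matching .wav files in a wav/ directory) is a hypothetical example, not necessarily the project’s actual structure.

```python
import os
import librosa
import pandas as pd

def build_metadata(transcript_path: str, wav_dir: str) -> pd.DataFrame:
    """Pair each transcription line with its .wav file and record the duration."""
    rows = []
    with open(transcript_path, encoding="utf-8") as f:
        for line in f:
            utt_id, text = line.strip().split(" ", 1)
            wav_path = os.path.join(wav_dir, f"{utt_id}.wav")
            # librosa resamples to 22050 Hz and down-mixes to mono by default.
            audio, sr = librosa.load(wav_path)
            rows.append({"path": wav_path,
                         "transcription": text,
                         "duration": len(audio) / sr})
    return pd.DataFrame(rows)

# Example usage (hypothetical paths):
# meta = build_metadata("data/trsTrain.txt", "data/wav/train")
# print(meta.head())
```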

The deep learning architecture

Convolutional Neural Network (CNN) plus Recurrent Neural Network (RNN): the CNN encodes the information in the voice features. This data is passed through convolution layers whose kernels also act like fully connected layers over the feature frames, and the network finally gives softmax probabilities for each character that can be placed in the transcription.

An RNN model consists of a Bidirectional LSTM that processes the feature maps as a series of distinct timesteps or ‘frames’ that correspond to our desired sequence of output characters. In other words, it takes the feature maps which are a continuous representation of the audio and converts them into a discrete representation.

Feature maps -> Bidirectional LSTM -> Linear layer -> softmax -> character probabilities for each timestep of the spectrogram. A linear layer with softmax uses the LSTM outputs to produce character probabilities for each timestep of the output. Linear layers also sit between the convolutional (CNN) and recurrent (RNN) parts of the network; they reshape the outputs of one network into the inputs of the other.

In short, the CNN encodes the information from the spectrogram “image” and sends that data to the RNN-based decoder, which decodes it and outputs the corresponding text.
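The article does not include the model code, so the following is only a minimal Keras sketch of the CNN + Bidirectional LSTM + linear/softmax layout described above, assuming MFCC frames as input; the layer sizes and character-set size are illustrative assumptions, and the output is shaped for a CTC-style loss.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_asr_model(n_features: int = 13, n_chars: int = 100) -> tf.keras.Model:
    """CNN front end -> Bidirectional LSTM -> linear + softmax over characters."""
    # Input: (time_steps, n_features) MFCC frames; the time dimension is variable.
    inputs = layers.Input(shape=(None, n_features), name="mfcc")

    # CNN encoder: 1-D convolution over the time axis downsamples and encodes frames.
    x = layers.Conv1D(filters=256, kernel_size=11, strides=2,
                      padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)

    # Bidirectional LSTM decoder over the (downsampled) frames.
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)

    # Linear layer + softmax: per-frame character probabilities (+1 for the CTC blank).
    outputs = layers.Dense(n_chars + 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs, name="cnn_bilstm_asr")
```

Training such a model typically uses the CTC loss (for example tf.nn.ctc_loss), which aligns the per-frame character probabilities with the target transcription without requiring frame-level labels.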

MLOps setup

MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops). Practicing MLOps means that you advocate for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management.

Modeling results & Conclusion

1- True transcription: karibu katika matangazo yetu ya asubuhi ya idhaa ya kiswahili

1- Predicted transcription: karibu kaia natanazo yeti ya subuia idha ya kiswahili

2- True transcription: kwenda na kuondoka jarkata Indonesia

2- Predicted transcription: kuwenda kuna kundoka jijya kata indoneziaoanfuyuhyhya b

As you can see, the model is far from perfect, but it is getting there. Words in the predicted transcriptions are garbled or misplaced here and there. A generally acceptable speech recognition system needs around 3,000 hours of recorded audio, so for this project, with its limited data, the model’s predictions are a fair success. In addition, to make it more powerful, the model could be given more hidden layers.

More data is needed, with different speakers and texts. Data can also be synthesized: for example, audio can be duplicated and background noise added to the duplicates, which helps the model become robust to noisy audio.

Also, to make the model more robust, we can apply more augmentation to the data; adding noise in particular will improve robustness in the outdoor settings where the system will most likely be used for the objective of this project: a food marketplace, where people will register the food they bought using voice commands. A small noise-mixing sketch follows.
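As a simple illustration of this noise-based augmentation, the sketch below mixes a background-noise clip into a speech clip at a chosen signal-to-noise ratio; the SNR value and the idea of tiling the noise to match the speech length are illustrative choices, not details from the project.

```python
import numpy as np

def add_background_noise(audio: np.ndarray, noise: np.ndarray,
                         snr_db: float = 10.0) -> np.ndarray:
    """Mix a background-noise clip into a speech clip at a target SNR (in dB)."""
    # Tile or trim the noise so it matches the speech length.
    if len(noise) < len(audio):
        noise = np.tile(noise, int(np.ceil(len(audio) / len(noise))))
    noise = noise[:len(audio)]

    # Scale the noise so that the mixture reaches the requested SNR.
    speech_power = np.mean(audio ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return audio + scale * noise

# Example usage (hypothetical paths):
# speech, sr = librosa.load("clip.wav")
# noise, _ = librosa.load("market_noise.wav", sr=sr)
# noisy = add_background_noise(speech, noise, snr_db=10.0)
```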

The Responsible Machine Learning Principles

Bias, fairness, and what is done to mitigate discovered issues

When building systems that have to make non-trivial decisions, we will always face the computational and societal bias that is inherent in data, which is impossible to avoid but is possible to document and/or mitigate.

However, we should take a step back from only trying to embed ethics directly into the algorithms themselves. Instead, technologists should focus on building processes & methods to identify & document the inherent bias in the data, features, and inference results, and subsequently the implications of this bias.

Sample bias: Sample bias is a problem with training data that occurs when the data used to train the model does not accurately represent the environment that the model will operate in. There is virtually no situation where an algorithm can be trained on the entire universe of data it could interact with. But there’s a science to choosing a subset of that universe that is both large enough and representative enough to mitigate sample bias.

Evaluation bias: Evaluation bias occurs during the model iteration and evaluation. A model is optimized using training data, but its quality is often measured against certain benchmarks. Bias can arise when these benchmarks do not represent the general population or are not appropriate for the way the model will be used.

Aggregation bias: Aggregation bias arises during model construction when distinct populations are inappropriately combined. There are many AI applications where the population of interest is heterogeneous, and a single model is unlikely to suit all groups. One example is in health care. For diagnosing and monitoring diabetes, models have historically used levels of Hemoglobin A1c (HbA1c) to make their predictions. However, a 2019 paper showed that these levels differ in complicated ways across ethnicities, and a single model for all populations is bound to exhibit bias.

Fairness: Helps to ensure that biases in the data and model inaccuracies do not lead to models that treat individuals unfavorably on the basis of characteristics such as race, gender, disability, or sexual or political orientation.

Data risk awareness: We commit to develop and improve reasonable processes and infrastructure to ensure data and model security are being taken into consideration during the development of machine learning systems.

Future work

What can be improved with more time and resources?

Data & Model

3k hours of speech used to be considered sufficient for training. Big tech companies now use up to 100k hours and beefy hardware, while only limited hours of data are available for Swahili and Amharic. More data is needed, with different speakers and texts; as noted above, data can also be synthesized by duplicating audio and adding background noise, which helps the model become robust to noisy audio.

We have proved to ourselves that the model is capable of fitting the training data given to it. The next step is to look into regularization to achieve good validation accuracy. This problem can benefit from having more data as well, and different network architectures can be investigated too. A longer-term goal is end-to-end training, so that we would not even need any pre-processing; Baidu’s Deep Speech 2 goes a step in this direction by using spectrogram features. There is also some research on using raw audio signals, although it is still in its infancy.

REFERENCES

1. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.5170&rep=rep1&type=pdf

2. https://towardsdatascience.com/audio-deep-learning-made-simple-automatic-speech-recognition-asr-how-it-works-716cfce4c706

3. http://ainsightful.com/index.php/2018/11/27/deep-learning-for-amharic-speechrecognition/

4. https://ethical.institute/principles.html

Thank you for reading 😊

Euel Fantaye, August 2021
