Learn how to Build your own Speech-to-Text Model (using Python)

Sampling converts an analog audio signal into a discrete signal so that it can be stored and processed efficiently in memory.

I really like the below illustration.

It depicts how the analog audio signal is discretized and stored in memory. The key thing to take away from the figure is that we can reconstruct a nearly identical audio wave even after sampling the analog signal, because a high sampling rate was chosen.

The sampling rate or sampling frequency is defined as the number of samples selected per second.
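To make the definition concrete, here is a minimal sketch (not part of the original article) that samples a sine wave for one second at two different sampling rates using NumPy. The 5 Hz tone, the two rates, and the helper name sample_sine are arbitrary choices made for illustration.

```python
import numpy as np

def sample_sine(freq_hz, sample_rate, duration_s=1.0):
    """Return (times, amplitudes) of a sine wave sampled at sample_rate."""
    n_samples = int(sample_rate * duration_s)
    times = np.arange(n_samples) / sample_rate   # time stamp of each sample
    return times, np.sin(2 * np.pi * freq_hz * times)

# Same 5 Hz tone, captured with few and with many samples per second
t_low, x_low = sample_sine(freq_hz=5, sample_rate=20)
t_high, x_high = sample_sine(freq_hz=5, sample_rate=1000)
print(len(x_low), len(x_high))
```

The higher rate stores far more points per second, which is why the reconstructed wave follows the analog signal more closely.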

Different Feature Extraction Techniques for an Audio Signal

The first step in speech recognition is to extract the features from an audio signal which we will input to our model later.

So now, I will walk you through the different ways of extracting features from the audio signal.

Time-domain

Here, the audio signal is represented by the amplitude as a function of time.

In simple words, it is a plot between amplitude and time.

The features are the amplitudes which are recorded at different time intervals.

The limitation of the time-domain analysis is that it completely ignores the frequency content of the signal (how quickly it oscillates), which is addressed by the frequency domain analysis.

So let’s discuss that in the next section.

Frequency domain

In the frequency domain, the audio signal is represented by amplitude as a function of frequency.

Simply put – it is a plot between frequency and amplitude.

The features are the amplitudes recorded at different frequencies.

The limitation of this frequency domain analysis is that it completely ignores the order or sequence of the signal which is addressed by time-domain analysis.

Remember: Time-domain analysis completely ignores the frequency component whereas frequency domain analysis pays no attention to the time component.
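As a rough illustration of the two views (a sketch, not the article's own code), the snippet below builds a synthetic signal from two tones and moves it from the time domain to the frequency domain with NumPy's FFT. The sampling rate and the tone frequencies are assumed values chosen only for the example.

```python
import numpy as np

sample_rate = 8000                          # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate    # one second of time stamps

# Time-domain view: amplitude at each time step (440 Hz tone plus a weaker 1000 Hz tone)
wave = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

# Frequency-domain view: amplitude at each frequency, with no time information left
spectrum = np.abs(np.fft.rfft(wave)) / (len(wave) / 2)
freqs = np.fft.rfftfreq(len(wave), d=1 / sample_rate)

# The two strongest frequency bins should sit at the two tones
top_two = sorted(freqs[np.argsort(spectrum)[-2:]])
print(top_two)
```

Note that the spectrum tells us which frequencies are present but not when they occur.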

We can get the time-dependent frequencies with the help of a spectrogram.

Spectrogram

Ever heard of a spectrogram? It’s a 2D plot between time and frequency where each point in the plot represents the amplitude of a particular frequency at a particular time in terms of intensity of color.

In simple terms, the spectrogram is a spectrum (broad range of colors) of frequencies as it varies with time.
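Here is a small sketch of how a spectrogram can be computed, assuming SciPy's scipy.signal.spectrogram rather than whatever the original notebook used. The synthetic signal (a 500 Hz tone followed by a 2000 Hz tone) and the window length are illustrative assumptions.

```python
import numpy as np
from scipy import signal

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate

# First half second: 500 Hz tone; second half second: 2000 Hz tone
wave = np.where(t < 0.5,
                np.sin(2 * np.pi * 500 * t),
                np.sin(2 * np.pi * 2000 * t))

# power has shape (n_frequencies, n_time_frames)
freqs, times, power = signal.spectrogram(wave, fs=sample_rate, nperseg=256)

# Strongest frequency in each time frame
dominant = freqs[np.argmax(power, axis=0)]
print(power.shape, dominant[0], dominant[-1])
```

Unlike the plain frequency-domain view, the dominant frequency here changes from the first time frame to the last, so the order of events is preserved.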

The right features to extract from audio depend on the use case we are working with.

It’s finally time to get our hands dirty and fire up our Jupyter Notebook!

Understanding the Problem Statement for our Speech-to-Text Project

Let’s understand the problem statement of our project before we move into the implementation part.

We might be on the verge of having too many screens around us.

It seems like every day, new versions of common objects are “re-invented” with built-in wifi and bright touchscreens.

A promising antidote to our screen addiction is voice interfaces.

TensorFlow recently released the Speech Commands Dataset.

It includes 65,000 one-second long utterances of 30 short words, by thousands of different people.

We’ll build a speech recognition system that understands simple spoken commands.

You can download the dataset from here.

Implementing the Speech-to-Text Model in Python

The wait is over! It’s time to build our own Speech-to-Text model from scratch.

Import the libraries

First, import all the necessary libraries into our notebook.

LibROSA and SciPy are the Python libraries used for processing audio signals.

View the code on Gist.

Data Exploration and Visualization

Data Exploration and Visualization helps us to understand the data as well as pre-processing steps in a better way.

Visualization of Audio signal in time series domain

Now, we’ll visualize the audio signal in the time series domain: View the code on Gist.

Sampling rate

Let us now look at the sampling rate of the audio signals:

    ipd.Audio(samples, rate=sample_rate)
    print(sample_rate)

Resampling

From the above, we can understand that the sampling rate of the signal is 16,000 Hz.

Let us re-sample it to 8000 Hz since most speech-related frequencies lie below 4000 Hz, which an 8000 Hz sampling rate is enough to capture:

    # keyword arguments keep the call valid across librosa versions
    samples = librosa.resample(samples, orig_sr=sample_rate, target_sr=8000)
    ipd.Audio(samples, rate=8000)

Now, let’s understand the number of recordings for each voice command: View the code on Gist.
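Since the gist is not reproduced here, the following is only a sketch of how such a count could be done. It assumes the recordings are stored with one sub-folder per command word, each holding .wav files; the function name count_recordings and the example path are hypothetical.

```python
import os
from collections import Counter

def count_recordings(root_dir):
    """Count the .wav files inside each sub-folder (one folder per command word)."""
    counts = Counter()
    for label in sorted(os.listdir(root_dir)):
        label_dir = os.path.join(root_dir, label)
        if os.path.isdir(label_dir):
            counts[label] = sum(
                1 for name in os.listdir(label_dir) if name.endswith(".wav")
            )
    return counts

# Hypothetical usage; adjust the path to wherever the dataset was extracted:
# print(count_recordings("train/audio"))
```

Any extra folder in the root (for example one holding background noise) would be counted as a label too, so it may need to be filtered out before plotting.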

Duration of recordings

What’s next? A look at the distribution of the duration of recordings: View the code on Gist.
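As a sketch of the idea (assuming SciPy's WAV reader instead of the article's original code), the duration of a clip is simply its number of samples divided by its sampling rate. The helper name clip_duration and the synthetic demo file are illustrative.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

def clip_duration(path):
    """Return the length of a WAV file in seconds."""
    sample_rate, samples = wavfile.read(path)
    return len(samples) / float(sample_rate)

# Demo on a synthetic clip: 8000 samples at 16,000 Hz written to a temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    wavfile.write(demo_path, 16000, np.zeros(8000, dtype=np.int16))
    demo_duration = clip_duration(demo_path)
print(demo_duration)
```

Collecting this value for every file and passing the list to a histogram plot would give the distribution of durations.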

Preprocessing the audio waves

In the data exploration part earlier, we have seen that the duration of a few recordings is less than 1 second and the sampling rate is too high.

So, let us read the audio waves and use the preprocessing steps below to deal with this.

Here are the two steps we’ll follow:

1. Resampling
2. Removing shorter commands of less than 1 second

Let us define these preprocessing steps in the below code snippet: View the code on Gist.
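The two steps can be sketched roughly as follows. This is not the article's gist: it swaps in scipy.signal.resample (a Fourier-based resampler) for librosa.resample, and the function name preprocess is hypothetical.

```python
import numpy as np
from scipy import signal

TARGET_RATE = 8000  # Hz, the target sampling rate used in the article

def preprocess(samples, sample_rate, target_rate=TARGET_RATE):
    """Resample a clip to target_rate; return None if it is shorter than 1 second."""
    n_target = int(round(len(samples) * target_rate / sample_rate))
    resampled = signal.resample(samples, n_target)
    if len(resampled) < target_rate:    # fewer than one second of samples
        return None
    return resampled

full = preprocess(np.zeros(16000), 16000)    # exactly 1 second at 16,000 Hz
short = preprocess(np.zeros(12000), 16000)   # only 0.75 seconds
print(None if full is None else full.shape, short)
```

Clips for which the function returns None would simply be skipped when building the training arrays, so that every remaining clip has the same length of 8000 samples.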

Convert the output labels to integer encoded: View the code on Gist.
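One way to sketch integer encoding without the original gist is NumPy's np.unique, which returns the sorted class names together with an integer index for every label. The example labels below are made up for illustration.

```python
import numpy as np

# Hypothetical labels, one per recording
raw_labels = ["yes", "no", "up", "yes", "down", "no"]

# classes: sorted unique words; y_int: index of each label within classes
classes, y_int = np.unique(raw_labels, return_inverse=True)
print(classes, y_int)

# The mapping can be reversed to recover the words
recovered = classes[y_int]
```

Keeping the classes array around matters later, because it is what turns a predicted index back into a word.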

Now, convert the integer encoded labels to a one-hot vector since it is a multi-classification problem:

    # to_categorical is imported directly; the np_utils module is missing from recent Keras releases
    from keras.utils import to_categorical
    y = to_categorical(y, num_classes=len(labels))

Reshape the 2D array to 3D since the input to the conv1d must be a 3D array:

    all_wave = np.array(all_wave).reshape(-1, 8000, 1)

Split into train and validation set

Next, we will train the model on 80% of the data and validate on the remaining 20%:

    from sklearn.model_selection import train_test_split
    x_tr, x_val, y_tr, y_val = train_test_split(np.array(all_wave), np.array(y), stratify=y, test_size=0.2, random_state=777, shuffle=True)

Model Architecture for this problem

We will build the speech-to-text model using conv1d.

Conv1d is a convolutional neural network which performs the convolution along only one dimension.
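To show what "along only one dimension" means, here is a minimal NumPy sketch of a single 1D convolution filter with no padding. It is an illustration only, not the Keras layer itself; like deep learning layers, it slides the kernel without flipping it. The function name conv1d_valid and the example values are made up.

```python
import numpy as np

def conv1d_valid(values, kernel, stride=1):
    """Slide kernel along a 1-D sequence (no padding) and return the dot products."""
    values = np.asarray(values, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    n_out = (len(values) - len(kernel)) // stride + 1
    return np.array([
        np.dot(values[i * stride : i * stride + len(kernel)], kernel)
        for i in range(n_out)
    ])

# A kernel of length 3 moving over 5 samples produces 3 outputs
out = conv1d_valid([1, 2, 3, 4, 5], [1, 0, -1])
print(out)
```

For audio, the single dimension is time: each filter scans the waveform from start to end and responds wherever the local pattern of samples matches its weights.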

Here is the model architecture:

Model building

Let us implement the model using the Keras functional API.

View the code on Gist.

Define the loss function to be categorical cross-entropy since it is a multi-classification problem:

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Early stopping and model checkpoints are the callbacks to stop training the neural network at the right time and to save the best model after every epoch:

    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10, min_delta=0.0001)
    # Note: recent Keras versions report this metric as 'val_accuracy' rather than 'val_acc'
    mc = ModelCheckpoint('best_model.hdf5', monitor='val_acc', verbose=1, save_best_only=True, mode='max')

Let us train the model on a batch size of 32 and evaluate the performance on the holdout set:

    history = model.fit(x_tr, y_tr, epochs=100, callbacks=[es, mc], batch_size=32, validation_data=(x_val, y_val))

Diagnostic plot

I’m going to lean on visualization again to understand the performance of the model over a period of time: View the code on Gist.

Load the best model saved by the checkpoint (best_model.hdf5).

Define the function that predicts text for the given audio: View the code on Gist.
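The gist itself is not shown here, so the following is only a sketch of the final step such a function needs: picking the class with the highest predicted probability and mapping it back to a word. The class order and the probability vector are hypothetical stand-ins; in the article's pipeline the probabilities would presumably come from the trained model's prediction on an audio array reshaped to (1, 8000, 1).

```python
import numpy as np

def predict_label(probabilities, classes):
    """Return the class name with the highest predicted probability."""
    return classes[int(np.argmax(probabilities))]

classes = ["down", "no", "up", "yes"]                  # hypothetical label order
probabilities = np.array([0.05, 0.10, 0.15, 0.70])     # stand-in for a softmax output
print(predict_label(probabilities, classes))
```

The class order must match the one used when the labels were integer encoded, otherwise the predicted index points at the wrong word.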

Prediction time! Make predictions on the validation data: View the code on Gist.

The best part is yet to come! Here is a script that prompts a user to record voice commands.

Record your own voice commands and test them on the model: View the code on Gist.

Let us now read the saved voice command and convert it to text: View the code on Gist.

Here is an awesome video of the model being tested on one of my colleague’s voice commands: https://s3-ap-south-1.

amazonaws.

mp4

Congratulations! You have just built your very own speech-to-text model!

Code

Find the notebook here

End Notes

Got to love the power of deep learning and NLP.

This is a microcosm of the things we can do with deep learning.

I encourage you to try it out and share the results with our community.

In this article, we covered all the concepts and implemented our own speech recognition system from scratch in Python.

I hope you have learned something new today.

I will see you in the next article.