ML Understanding Speech—Using a Convolutional Neural Network (CNN) to Classify Audio Clips

Kristina Arezina
8 min readJan 31, 2021

“Don’t worry about a thing, cause every little thing gonna be all right. Singin’: Don’t worry about a thing, ’Cause every little thing gonna be all right!” — Bob Marley

If you heard this legendary lyric out loud, I’m sure you would be able to understand and classify the words being said.

Well, a Convolutional Neural Network (CNN) can also classify audio clips into distinct words! In this article, I will:

  1. Explain how a CNN can do this.
  2. Explain why classifying audio clips with machine learning (ML) can help with efficiency, health, and well-being.

What is a Convolutional Neural Network (CNN)?

A CNN is a class of deep learning networks used in Computer Vision — a subdomain of ML. Before we dive into how CNNs work, let’s make sure we understand what these words mean.

  • Deep learning networks: a subfield of ML concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks.
  • Computer Vision: the ability to give computers a high-level understanding of digital images or videos so that they can “see” and understand their content.

A CNN can take in an input image and then assign importance (learnable weights and biases) to parts of the image. Then, the CNN can classify the image based on the patterns of weights and biases it learned.

How Do CNNs Work?

When you look at an image, your brain can identify objects in the picture quickly and seemingly effortlessly. The goal of using a CNN to classify images is to get computers to recognize objects in the same manner. However, there is a huge difference between how a human brain and a computer see and interpret images.

For computers, an image is just an array of numbers. However, this array of numbers allows computers to identify patterns, and these patterns are what the computer will use to classify an object in an image.

Graphic to help understand kernels

For a CNN to extract the important features and patterns in an image from arrays of numbers, the CNN uses kernels to scan over an image.

Rather than looking at an entire image all at once to find certain features, the CNN uses kernels to look at smaller portions of the image. The kernels look to find patterns between what they see at that point in the image to what they’re looking for. Then, when a feature matches, it is recorded and stored in a feature map.

A feature map is a refined version of the original image: it keeps the important features of the image and ignores the rest. After several different kernels scan the original image and extract different important features, the resulting feature maps are combined to form the final convolved output.

The role of the CNN is to reduce the images into a form that is easier to process, without losing features that are critical for getting a good prediction.
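
To make the idea of a kernel scanning an image more concrete, here is a minimal toy sketch of my own (not part of the tutorial) that slides a 3x3 edge-detection kernel over a tiny image with TensorFlow:

import tensorflow as tf

# Toy 6x6 grayscale "image": the left half is dark (0) and the right half is bright (1)
image = tf.constant([[0., 0., 0., 1., 1., 1.]] * 6)

# A 3x3 kernel that responds to vertical dark-to-bright edges
kernel = tf.constant([[-1., 0., 1.],
                      [-1., 0., 1.],
                      [-1., 0., 1.]])

# conv2d expects [batch, height, width, channels] and [height, width, in, out] shapes
image = tf.reshape(image, [1, 6, 6, 1])
kernel = tf.reshape(kernel, [3, 3, 1, 1])

# Slide the kernel over the image; the resulting feature map "lights up"
# only where the vertical edge sits
feature_map = tf.nn.conv2d(image, kernel, strides=1, padding='VALID')
print(tf.squeeze(feature_map))

The printed feature map is zero everywhere except at the columns where the dark-to-bright edge occurs, which is exactly the kind of pattern a CNN learns to record.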

What Can CNNs Be Used For?

CNNs are a go-to method for any prediction problem involving image data as an input. Personally, I think CNNs can help in many fields and situations. Below are just a few examples:

🚗 Transportation

CNNs can identify road signs and categorize them correctly. This helps autonomous vehicles advance by teaching them how to navigate the road while abiding by laws and regulations.

📱 Social Media

Face recognition streamlines the often tedious process of tagging people in photos, and the face recognition used for this feature can be powered by a CNN. So, next time you tag people in a couple of hundred images from a conference, thank CNNs.

👩‍⚕️ Health Care

Image processing with CNNs can help radiologists identify medical issues from patient X-rays. CNNs can also be used to detect when elderly people fall; coupled with other technology, caregivers can be notified of the fall and get them the help they need.

Personally, I am excited about this use case as falls are the leading cause of fatal injury and the most common cause of nonfatal trauma-related hospital admissions among older adults.

🗣 Speech Recognition

CNNs can also be used for speech recognition! After following a TensorFlow tutorial, I was able to make a model that classifies audio clips as “down”, “go”, “left”, “no”, “right”, “stop”, “up”, or “yes”.

The model uses a simple convolutional neural network (CNN) since the audio files are converted into spectrogram images.

Diving Deeper Into How a TensorFlow Model Was Able to Classify Audio Clips

Please note that real speech and audio recognition systems are much more complex; however, this project allowed me to get a basic understanding of the techniques involved in classifying audio clips.

Imports & Data

This tutorial uses TensorFlow, Keras, and the Speech Commands dataset, which consists of over 105,000 audio files of people saying thirty different words.

This project classifies 8 of those words, with 1,000 examples per label. The data is then split into a training set (80% of the data), a validation set (10%), and a test set (10%).
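
As a rough sketch of that split (assuming filenames holds the shuffled list of the 8,000 audio file paths, as in the tutorial):

# Assumes `filenames` is a shuffled list of the 8,000 audio file paths
train_files = filenames[:6400]    # 80% for training
val_files = filenames[6400:7200]  # 10% for validation
test_files = filenames[7200:]     # 10% for testing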

Reading Audio Files and Their Labels

When first imported, each audio file is read as a binary file. To use it with the model, we want to convert it into a numerical tensor.

To load an audio file, we use tf.audio.decode_wav, which returns the WAV-encoded audio as a Tensor and the sample rate—which defines how many times per second a sound is sampled.

Each sample represents the amplitude of the audio signal at that specific time. In a 16-bit system, the values range from -32768 to 32767. However, tf.audio.decode_wav normalizes the values to the range [-1.0, 1.0].
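
As a minimal sketch of this decoding step (following the tutorial’s approach; the file path below is just a placeholder):

import tensorflow as tf

def decode_audio(audio_binary):
  # decode_wav returns the samples (already normalized to [-1.0, 1.0]) and the sample rate
  audio, _ = tf.audio.decode_wav(audio_binary)
  # The clips are mono, so drop the channels axis to get a 1-D waveform
  return tf.squeeze(audio, axis=-1)

# Usage: read the raw bytes of a WAV file, then decode them
audio_binary = tf.io.read_file('path/to/clip.wav')  # placeholder path
waveform = decode_audio(audio_binary)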

Data & Spectrograms

Left: Waveform of word. Right: Spectrogram of word.

Next, we have to convert the waveform into a spectrogram, which shows frequency changes over time when saying a word and represents these changes as a 2D image. This can be done by applying the short-time Fourier transform (STFT) to convert the audio files into the time-frequency domain.

A plain Fourier transform (tf.signal.fft) converts a signal to its component frequencies, but it loses all time-related information about how the word was said.

Luckily, we can use the STFT (tf.signal.stft) to split the signal into windows of time and run a Fourier transform on each window. This preserves some time information and returns a 2D tensor that we can run standard convolutions on.

The STFT produces an array of complex numbers representing the magnitude and phase of the word being said. However, we only need the magnitude for this project, which can be derived by applying tf.abs to the output of tf.signal.stft.

We want the generated spectrogram “image” to be almost square. To do that, we need to choose the frame_length and frame_step parameters accordingly.

We also want the waveforms to have the same length so that when we convert them to a spectrogram image, the results will have similar dimensions. We can do this by simply zero padding the audio clips that are shorter than one second.

def get_spectrogram(waveform):
  # Padding for files with less than 16000 samples
  zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)

  # Concatenate audio with padding so that all audio clips will be of the
  # same length
  waveform = tf.cast(waveform, tf.float32)
  equal_length = tf.concat([waveform, zero_padding], 0)
  spectrogram = tf.signal.stft(
      equal_length, frame_length=255, frame_step=128)

  spectrogram = tf.abs(spectrogram)

  return spectrogram

Explore the Data

Now let’s compare the waveform, the spectrogram and the actual audio of one example from the dataset.

for waveform, label in waveform_ds.take(1):
  label = label.numpy().decode('utf-8')
  spectrogram = get_spectrogram(waveform)

print('Label:', label)
print('Waveform shape:', waveform.shape)
print('Spectrogram shape:', spectrogram.shape)
print('Audio playback')
display.display(display.Audio(waveform, rate=16000))

Doing so yields the following results:

Label: yes
Waveform shape: (13375,)
Spectrogram shape: (124, 129)
Plot of the waveform & spectrogram for yes
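
As a sanity check (my own back-of-the-envelope arithmetic, not from the tutorial), the spectrogram shape follows directly from the STFT parameters: after zero padding to 16,000 samples, frame_length=255 and frame_step=128 give 1 + ⌊(16000 − 255)/128⌋ = 124 time frames, and the default FFT length of 256 gives 256/2 + 1 = 129 frequency bins, which is exactly the (124, 129) shape above.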

Build and Train the Model

For the model, we will use a simple convolutional neural network (CNN), since we have transformed the audio files into spectrogram images. The model also has the following additional preprocessing layers:

  • A Resizing layer to downsample the input to enable the model to train faster.
  • A Normalization layer to normalize each pixel in the image based on its mean and standard deviation.

For the Normalization layer, its adapt method first needs to be called on the training data to compute aggregate statistics, namely the mean and standard deviation of the spectrograms. The model is then trained for 10 epochs (EPOCHS = 10).

# Assumes the tutorial's imports, e.g.:
#   from tensorflow.keras import layers, models
#   from tensorflow.keras.layers.experimental import preprocessing
for spectrogram, _ in spectrogram_ds.take(1):
  input_shape = spectrogram.shape
print('Input shape:', input_shape)
num_labels = len(commands)

norm_layer = preprocessing.Normalization()
norm_layer.adapt(spectrogram_ds.map(lambda x, _: x))

model = models.Sequential([
    layers.Input(shape=input_shape),
    preprocessing.Resizing(32, 32),
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])

model.summary()
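
For completeness, the training step can look roughly like this (a sketch based on the tutorial; train_ds and val_ds are assumed to be the batched spectrogram datasets):

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    # from_logits=True because the final Dense layer has no softmax
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

EPOCHS = 10
history = model.fit(
    train_ds,                 # batched training spectrogram/label pairs (assumed)
    validation_data=val_ds,   # batched validation set (assumed)
    epochs=EPOCHS,
)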

Results

After training, the model achieves a test set accuracy of about 84%. You can check out all the code here.

Results graph of correctly classifying audio
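
The test set accuracy itself can be computed along these lines (a sketch assuming test_audio and test_labels are NumPy arrays holding the test spectrograms and their numeric labels):

import numpy as np

# Predict a class for every test spectrogram and compare against the true labels
y_pred = np.argmax(model.predict(test_audio), axis=1)
test_acc = np.mean(y_pred == test_labels)
print(f'Test set accuracy: {test_acc:.0%}')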

Applications of Classifying Audio Clips

The reason I got interested in how CNNs can classify audio clips centers around exploring the question: can you tell if someone is depressed just from their voice?

After reading Detecting Depression Severity from Vocal Prosody, it was clear that depression can be identified to some degree from one’s voice in a clinical interview. Together, participants’ and interviewers’ vocal prosody allowed for detection of the ordinal range of depression severity (low, mild, and moderate-to-severe) in 69% of cases.

Although we are far from the day when doctors can be assisted by ML to diagnose depression, I am excited to keep exploring this idea.
