ML Understanding Speech: Using a Convolutional Neural Network (CNN) to Classify Audio Clips

What is a Convolutional Neural Network (CNN)?

How Do CNNs Work?

Graphic to help understand kernels
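In place of a graphic, the same idea can be shown numerically: a kernel is a small weight matrix that slides across the image, and at each position the output is the weighted sum of the pixels underneath it. The sketch below is a minimal pure-Python illustration; the 4×4 "image" and the vertical-edge kernel are made-up example values, not part of the tutorial.

```python
# Minimal sketch of how a convolutional kernel works: slide a small
# weight matrix over an image and take a weighted sum at each position.

def convolve2d(image, kernel):
    """'Valid'-mode 2D convolution (no padding, stride 1) on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the kernel with the patch under it.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        output.append(row)
    return output

# A vertical-edge-detecting kernel applied to an image with a hard
# dark-to-bright edge between its left and right halves.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
print(convolve2d(image, kernel))  # strong response wherever the edge sits
```

The kernel responds strongly wherever bright pixels sit to its right and dark pixels to its left, which is exactly how learned kernels pick out edges, textures, and (in deeper layers) larger patterns.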

What Can CNNs Be Used For?

Diving Deeper Into How a TensorFlow Model Was Able to Classify Audio Clips

Imports & Data

Reading Audio Files and Their Labels
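The tutorial itself decodes WAV files with TensorFlow ops, but the step is easy to picture with only the standard-library wave module: read the raw PCM bytes into samples, and take the label from the clip's parent directory. The helper names and the data/yes/... layout below are assumptions for illustration (the Speech Commands dataset organizes clips into one folder per word).

```python
import os
import wave

def read_wav(path):
    """Read a 16-bit mono WAV file into a list of integer samples."""
    with wave.open(path, 'rb') as wav:
        raw = wav.readframes(wav.getnframes())
    # 16-bit PCM: every 2 bytes is one little-endian signed sample.
    return [
        int.from_bytes(raw[i:i + 2], 'little', signed=True)
        for i in range(0, len(raw), 2)
    ]

def label_from_path(path):
    """In the Speech Commands layout, the parent directory is the label."""
    return os.path.basename(os.path.dirname(path))

# e.g. label_from_path('data/yes/0a1b2c_nohash_0.wav') -> 'yes'
```

Taking the label from the folder name means no separate annotation file is needed; the dataset's directory structure is the ground truth.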

Data & Spectrograms

Left: Waveform of word. Right: Spectrogram of word.
def get_spectrogram(waveform):
  # Padding for files with less than 16000 samples
  zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)

  # Concatenate audio with padding so that all audio clips will be of the
  # same length
  waveform = tf.cast(waveform, tf.float32)
  equal_length = tf.concat([waveform, zero_padding], 0)
  spectrogram = tf.signal.stft(
      equal_length, frame_length=255, frame_step=128)

  spectrogram = tf.abs(spectrogram)

  return spectrogram

Explore the Data

for waveform, label in waveform_ds.take(1):
  label = label.numpy().decode('utf-8')
  spectrogram = get_spectrogram(waveform)

print('Label:', label)
print('Waveform shape:', waveform.shape)
print('Spectrogram shape:', spectrogram.shape)
print('Audio playback')
display.display(display.Audio(waveform, rate=16000))
Label: yes
Waveform shape: (13375,)
Spectrogram shape: (124, 129)
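Those shapes follow directly from the STFT parameters. Every clip is padded to 16,000 samples (one second at 16 kHz), the window is 255 samples long, and it advances 128 samples at a time; tf.signal.stft also rounds the FFT size up to the next power of two (256) by default. The sliding-window arithmetic works out to exactly (124, 129):

```python
# Why the spectrogram is (124, 129) for a 16000-sample clip.
n_samples = 16000   # every clip is padded to one second at 16 kHz
frame_length = 255  # window size passed to tf.signal.stft
frame_step = 128    # hop between consecutive windows

# One frame fits at the start, plus one more per full step after that.
time_frames = 1 + (n_samples - frame_length) // frame_step

# tf.signal.stft defaults fft_length to the smallest power of two
# enclosing frame_length (256); a real-valued FFT then keeps
# fft_length // 2 + 1 frequency bins.
fft_length = 256
freq_bins = fft_length // 2 + 1

print(time_frames, freq_bins)  # 124 129
```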
Plot of the waveform & spectrogram for yes

Build and Train the Model

for spectrogram, _ in spectrogram_ds.take(1):
  input_shape = spectrogram.shape
print('Input shape:', input_shape)
num_labels = len(commands)

norm_layer = preprocessing.Normalization()
norm_layer.adapt(spectrogram_ds.map(lambda x, _: x))

model = models.Sequential([
    layers.Input(shape=input_shape),
    preprocessing.Resizing(32, 32),
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])
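A quick way to sanity-check the convolutional layers: Conv2D with a 3×3 kernel and the default 'valid' padding trims one pixel from every edge, so the 32×32 resized spectrogram shrinks by 2 per conv layer. A small sketch of that arithmetic:

```python
# Spatial size after each 'valid' 3x3 convolution:
# an n x n input becomes (n - 2) x (n - 2).
size = 32  # the spectrograms are resized to 32x32 before the conv layers
for name in ['first Conv2D', 'second Conv2D']:
    size = size - 3 + 1
    print(name, '->', f'{size}x{size}')
```

So the feature maps are 30×30 after the first conv layer and 28×28 after the second, with 32 and 64 channels respectively.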



Results graph of correctly classifying audio
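The headline number behind a results graph like this is plain accuracy: the fraction of test clips whose predicted label matches the true one. A tiny sketch of that calculation, with invented labels and predictions (not the tutorial's actual results):

```python
# Accuracy = correct predictions / total predictions.
# These label sequences are made up for illustration.
y_true = ['yes', 'no', 'stop', 'go', 'yes', 'no', 'up', 'down']
y_pred = ['yes', 'no', 'stop', 'no', 'yes', 'no', 'up', 'go']

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f'{accuracy:.2%}')  # 75.00%
```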

Applications of Classifying Audio Clips

I am passionate about health and technology, especially when the two are used together to help others.