Audio Processing and Remove Silence using Python

Audio Processing Techniques like Play an Audio, Plot the Audio Signals, Merge and Split Audio, Change the Frame Rate, Sample Width and Channel, Silence Remove in Audio, Slow down and Speed up audio

Bala Murugan N G
8 min read · Jul 14, 2020
Image Source: https://gsshawnee.org/soundtraining

Why this tutorial?

Many people work on projects like Speech-to-Text conversion, and they often need audio processing techniques such as:

  • Play an audio file
  • Plot the Audio Signals
  • Merge and Split Audio Contents
  • Slow down and Speed up the Audio — Speed Changer
  • Change the Frame Rate, Channels and Sample Width
  • Silence Remove

Even after exploring many articles on silence removal and audio processing, I couldn’t find one that explained the steps in detail, which is why I am writing this article. I hope it helps you with tasks like data collection and similar work.

Why Python?

Python is a general-purpose programming language. Hence, you can use it for developing both desktop and web applications, as well as complex scientific and numeric applications.

Python is designed with features that facilitate data analysis and visualization. You can take advantage of its data analysis features to create custom big data solutions without putting in extra time and effort. At the same time, the data visualization libraries and APIs provided by Python help you visualize and present data in a more appealing and effective way.

Many Python developers even use Python for Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL), Computer Vision (CV) and Natural Language Processing (NLP) tasks.

Requirements and Installation

  • Of course, we need Python 3.5 or above
  • Install the pydub, wave, simpleaudio, webrtcvad, numpy and matplotlib packages:
pip install webrtcvad==2.0.10 wave pydub simpleaudio numpy matplotlib
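If the install succeeded, a quick import check like the sketch below should run without errors. Keep in mind that pydub also needs ffmpeg available on your PATH to read and write most non-WAV formats.

# Quick sanity check that all the packages import correctly
import wave

import matplotlib
import numpy
import pydub
import simpleaudio
import webrtcvad

print("All audio packages are available")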

Let’s Start the Audio Manipulation . . . . . .

Listen Audio

For people who want to listen to their audio and play it without using tools like VLC or Windows Media Player:

Create a file named “listenaudio.py” and paste the below contents in that file

# Import packages
from pydub import AudioSegment
from pydub.playback import play
# Play
playaudio = AudioSegment.from_file("<Paste your File Name Here>", format="<File format Eg. WAV>")
play(playaudio)
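pydub’s play() will use simpleaudio under the hood when it is installed. For plain WAV files you can also call simpleaudio directly; here is a minimal sketch (the file name "sample.wav" is a placeholder):

# Alternative: play a WAV file directly with simpleaudio
import simpleaudio as sa

wave_obj = sa.WaveObject.from_wave_file("sample.wav")
play_obj = wave_obj.play()
play_obj.wait_done()  # block until playback finishes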


Plot Audio Signal

Plotting the audio signal lets you visualize the audio amplitude over time. This will help you decide where to cut the audio and where the silences are in the signal.

# Loading the Libraries
from scipy.io.wavfile import read
import numpy as np
import matplotlib.pyplot as plt

# Read the Audiofile
samplerate, data = read('6TU5302374.wav')
# Frame rate for the Audio
print(samplerate)

# Duration of the audio in Seconds
duration = len(data)/samplerate
print("Duration of Audio in Seconds", duration)
print("Duration of Audio in Minutes", duration/60)

# One time value per sample (linspace avoids the floating-point
# off-by-one issues that np.arange can produce here)
time = np.linspace(0., duration, len(data))

# Plotting the Graph using Matplotlib
plt.plot(time,data)
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
plt.title('6TU5302374.wav')
plt.show()
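If your file is stereo, scipy’s read() returns a 2-D array (samples × channels) and both channels get plotted on top of each other. A minimal sketch, which you could drop right after the read() call, to keep only the first channel:

# If the file is stereo, keep only the first channel for a cleaner plot
if data.ndim > 1:
    data = data[:, 0]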

Plot Audio Signal — Image by Author

In the graph, the flat horizontal segments are the silences in the audio.

Split Audio Files

This helps you split audio files based on a duration that you set.

The threshold value is in milliseconds (1 second = 1000 milliseconds). By adjusting the threshold value in the code, you can split the audio as you wish.

Here I am splitting the audio into 10-second chunks.

from pydub import AudioSegment
import os

if not os.path.isdir("splitaudio"):
    os.mkdir("splitaudio")

audio = AudioSegment.from_file("<filenamewithextension>")
lengthaudio = len(audio)
print("Length of Audio File (in milliseconds)", lengthaudio)

start = 0
# In milliseconds, this will cut 10 seconds of audio at a time
threshold = 10000
end = 0
counter = 0

while start < len(audio):
    end += threshold
    print(start, end)
    chunk = audio[start:end]
    filename = f'splitaudio/chunk{counter}.wav'
    chunk.export(filename, format="wav")
    counter += 1
    start += threshold
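pydub also ships a helper that does the same thing. A minimal sketch using pydub.utils.make_chunks (the placeholder file name is from the code above):

# Alternative: let pydub compute the chunks for you
from pydub import AudioSegment
from pydub.utils import make_chunks

audio = AudioSegment.from_file("<filenamewithextension>")
chunks = make_chunks(audio, 10000)  # chunk length in milliseconds
for counter, chunk in enumerate(chunks):
    chunk.export(f'splitaudio/chunk{counter}.wav', format="wav")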


You can find the audio chunks in the “splitaudio” folder.

Merge Audio File

This helps you to merge audio from different audio files . . .

import os
from pydub import AudioSegment
import glob

# Create the "audio" folder if it does not exist
if not os.path.isdir("audio"):
    os.mkdir("audio")

# Grab the audio files in the "audio" folder
wavfiles = glob.glob("./audio/*.wav")
print(wavfiles)

# Loop over each file and load it into an AudioSegment
wavs = [AudioSegment.from_wav(wav) for wav in wavfiles]

combined = wavs[0]

# Appending all the audio files (crossfade=0 concatenates them as-is)
for wav in wavs[1:]:
    combined = combined.append(wav, crossfade=0)

# Export the merged audio file
combined.export("Mergedaudio.wav", format="wav")
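Note that pydub’s append() applies a 100 ms crossfade by default, which can fail on very short clips; passing crossfade=0 (as above) joins the segments verbatim. The + operator (combined = combined + wav) also concatenates with no crossfade.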


You can listen to the merged audio in the “Mergedaudio.wav” file.

Speed Changer: Slow Down and Speed Up

Change the Speed of the Audio — Slow down or Speed Up

Create a file named “speedchangeaudio.py” and copy the below content

from pydub import AudioSegment

sound = AudioSegment.from_file("chunk.wav")

def speed_change(sound, speed):
    # Override the frame rate to make the audio play faster or slower
    sound_with_altered_frame_rate = sound._spawn(sound.raw_data, overrides={
        "frame_rate": int(sound.frame_rate * speed)
    })

    filename = 'changed_speed.wav'
    sound_with_altered_frame_rate.export(filename, format="wav")
    return sound_with_altered_frame_rate

# To slow down the audio
slow_sound = speed_change(sound, 0.8)
# To speed up the audio
#fast_sound = speed_change(sound, 1.2)

The normal speed of any audio is 1.0. To slow down the audio, pass a value below 1.0; to speed it up, pass a value above 1.0.

Adjust the speed as much as you want through the speed parameter of the “speed_change” function.
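Note that this frame-rate trick also shifts the pitch, like playing a tape faster or slower. If you want the exported file to carry a standard frame rate, one option (a sketch, building on the names inside speed_change above) is to resample before exporting:

# Inside speed_change, before exporting: resample back to the original
# rate so the file uses a standard frame rate (the speed change is kept)
resampled = sound_with_altered_frame_rate.set_frame_rate(sound.frame_rate)
resampled.export('changed_speed.wav', format="wav")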


You can listen to the speed-changed audio in “changed_speed.wav”.

Adjust the Frame Rate, Channels and Sample Width in Audio

This helps you preprocess the audio file while doing data preparation for “Speech to Text” projects, etc.

from pydub import AudioSegment

sound = AudioSegment.from_file("chunk.wav")

print("----------Before Conversion--------")
print("Frame Rate", sound.frame_rate)
print("Channel", sound.channels)
print("Sample Width", sound.sample_width)

# Change Frame Rate
sound = sound.set_frame_rate(16000)
# Change Channel
sound = sound.set_channels(1)
# Change Sample Width
sound = sound.set_sample_width(2)

# Export the audio to get the changed content
sound.export("convertedrate.wav", format="wav")

Set the frame rate: 8 kHz as 8000, 16 kHz as 16000, 44.1 kHz as 44100.

Set the channels: 1 is mono and 2 is stereo.

Set the sample width (bytes per sample):

1 : 8-bit signed integer PCM
2 : 16-bit signed integer PCM
3 : 24-bit signed integer PCM
4 : 32-bit signed integer PCM


You can check the new frame rate, channels and sample width of the audio in “convertedrate.wav”.

Silence Remove

Here we will remove the silence using a Voice Activity Detector (VAD) algorithm.

Basically, the silence-removal code reads the audio file and converts it into frames, then runs the VAD over each set of frames using a sliding-window technique. Frames containing voice are collected in a separate list, and the non-voiced frames (silences) are discarded. Finally, all the voiced frames in the list are joined back into an audio file.

Create a file named “silenceremove.py” and copy the below contents

import collections
import contextlib
import sys
import wave
import webrtcvad


def read_wave(path):
    """Reads a .wav file.
    Takes the path, and returns (PCM audio data, sample rate).
    """
    with contextlib.closing(wave.open(path, 'rb')) as wf:
        num_channels = wf.getnchannels()
        assert num_channels == 1
        sample_width = wf.getsampwidth()
        assert sample_width == 2
        sample_rate = wf.getframerate()
        assert sample_rate in (8000, 16000, 32000, 48000)
        pcm_data = wf.readframes(wf.getnframes())
        return pcm_data, sample_rate


def write_wave(path, audio, sample_rate):
    """Writes a .wav file.
    Takes path, PCM audio data, and sample rate.
    """
    with contextlib.closing(wave.open(path, 'wb')) as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(audio)


class Frame(object):
    """Represents a "frame" of audio data."""
    def __init__(self, bytes, timestamp, duration):
        self.bytes = bytes
        self.timestamp = timestamp
        self.duration = duration


def frame_generator(frame_duration_ms, audio, sample_rate):
    """Generates audio frames from PCM audio data.
    Takes the desired frame duration in milliseconds, the PCM data, and
    the sample rate.
    Yields Frames of the requested duration.
    """
    n = int(sample_rate * (frame_duration_ms / 1000.0) * 2)
    offset = 0
    timestamp = 0.0
    duration = (float(n) / sample_rate) / 2.0
    while offset + n < len(audio):
        yield Frame(audio[offset:offset + n], timestamp, duration)
        timestamp += duration
        offset += n


def vad_collector(sample_rate, frame_duration_ms,
                  padding_duration_ms, vad, frames):
    """Filters out non-voiced audio frames.
    Given a webrtcvad.Vad and a source of audio frames, yields only
    the voiced audio.
    Uses a padded, sliding window algorithm over the audio frames.
    When more than 90% of the frames in the window are voiced (as
    reported by the VAD), the collector triggers and begins yielding
    audio frames. Then the collector waits until 90% of the frames in
    the window are unvoiced to detrigger.
    The window is padded at the front and back to provide a small
    amount of silence or the beginnings/endings of speech around the
    voiced frames.
    Arguments:
    sample_rate - The audio sample rate, in Hz.
    frame_duration_ms - The frame duration in milliseconds.
    padding_duration_ms - The amount to pad the window, in milliseconds.
    vad - An instance of webrtcvad.Vad.
    frames - a source of audio frames (sequence or generator).
    Returns: A generator that yields PCM audio data.
    """
    num_padding_frames = int(padding_duration_ms / frame_duration_ms)
    # We use a deque for our sliding window/ring buffer.
    ring_buffer = collections.deque(maxlen=num_padding_frames)
    # We have two states: TRIGGERED and NOTTRIGGERED. We start in the
    # NOTTRIGGERED state.
    triggered = False

    voiced_frames = []
    for frame in frames:
        is_speech = vad.is_speech(frame.bytes, sample_rate)

        sys.stdout.write('1' if is_speech else '0')
        if not triggered:
            ring_buffer.append((frame, is_speech))
            num_voiced = len([f for f, speech in ring_buffer if speech])
            # If we're NOTTRIGGERED and more than 90% of the frames in
            # the ring buffer are voiced frames, then enter the
            # TRIGGERED state.
            if num_voiced > 0.9 * ring_buffer.maxlen:
                triggered = True
                sys.stdout.write('+(%s)' % (ring_buffer[0][0].timestamp,))
                # We want to yield all the audio we see from now until
                # we are NOTTRIGGERED, but we have to start with the
                # audio that's already in the ring buffer.
                for f, s in ring_buffer:
                    voiced_frames.append(f)
                ring_buffer.clear()
        else:
            # We're in the TRIGGERED state, so collect the audio data
            # and add it to the ring buffer.
            voiced_frames.append(frame)
            ring_buffer.append((frame, is_speech))
            num_unvoiced = len([f for f, speech in ring_buffer if not speech])
            # If more than 90% of the frames in the ring buffer are
            # unvoiced, then enter NOTTRIGGERED and yield whatever
            # audio we've collected.
            if num_unvoiced > 0.9 * ring_buffer.maxlen:
                sys.stdout.write('-(%s)' % (frame.timestamp + frame.duration))
                triggered = False
                yield b''.join([f.bytes for f in voiced_frames])
                ring_buffer.clear()
                voiced_frames = []
    if triggered:
        sys.stdout.write('-(%s)' % (frame.timestamp + frame.duration))
    sys.stdout.write('\n')
    # If we have any leftover voiced audio when we run out of input,
    # yield it.
    if voiced_frames:
        yield b''.join([f.bytes for f in voiced_frames])


def main(args):
    if len(args) != 2:
        sys.stderr.write(
            'Usage: silenceremove.py <aggressiveness> <path to wav file>\n')
        sys.exit(1)
    audio, sample_rate = read_wave(args[1])
    vad = webrtcvad.Vad(int(args[0]))
    frames = frame_generator(30, audio, sample_rate)
    frames = list(frames)
    segments = vad_collector(sample_rate, 30, 300, vad, frames)

    # Collect the voiced audio segments as bytes in a list
    concataudio = [segment for segment in segments]

    joinedaudio = b"".join(concataudio)

    write_wave("Non-Silenced-Audio.wav", joinedaudio, sample_rate)


if __name__ == '__main__':
    main(sys.argv[1:])

Set the aggressiveness mode, which is an integer between 0 and 3. 0 is the least aggressive about filtering out non-speech, 3 is the most aggressive.
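A minimal sketch of how the aggressiveness setting is used on a single frame (the 30 ms frame of zero bytes here is just illustrative silence):

import webrtcvad

vad = webrtcvad.Vad(3)  # 3 = most aggressive filtering

# 30 ms of 16 kHz, 16-bit mono PCM: 16000 * 0.030 * 2 bytes = 960 bytes
silent_frame = b'\x00' * 960
print(vad.is_speech(silent_frame, 16000))  # expected: False for pure silence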

Run “python silenceremove.py <aggressiveness> <inputfile.wav>” in a command prompt (for example, “python silenceremove.py 3 abc.wav”).


You will get non-silenced audio as “Non-Silenced-Audio.wav”.

If you want to split the audio using silence, check this.
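For example, here is a minimal sketch using pydub’s split_on_silence helper (the file name and threshold values are illustrative and should be tuned for your audio):

# Split on silence instead of removing it
from pydub import AudioSegment
from pydub.silence import split_on_silence

sound = AudioSegment.from_wav("abc.wav")
chunks = split_on_silence(
    sound,
    min_silence_len=500,             # silence longer than 500 ms splits
    silence_thresh=sound.dBFS - 16,  # anything 16 dB below average counts as silence
    keep_silence=100,                # keep 100 ms of padding on each side
)
for i, chunk in enumerate(chunks):
    chunk.export(f"silence_chunk{i}.wav", format="wav")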

Conclusion

This article summarized how to remove silence from an audio file, along with some common audio processing techniques in Python.

Thanks,

Bala Murugan N G

References

[1] webrtcvad: https://github.com/wiseman/py-webrtcvad

[2] pydub: https://github.com/jiaaro/pydub
