
voiceActivityDetector System object

Detect presence of speech in audio signal

Description

The voiceActivityDetector System object™ detects the presence of speech in an audio segment. You can also use the voiceActivityDetector System object to output an estimate of the noise variance per frequency bin.

To detect the presence of speech:

  1. Create the voiceActivityDetector object and set its properties.

  2. Call the object with arguments, as if it were a function.

To learn more about how System objects work, see What Are System Objects? (MATLAB).

Creation

Syntax

VAD = voiceActivityDetector
VAD = voiceActivityDetector(Name,Value)

Description

VAD = voiceActivityDetector creates a System object, VAD, that detects the presence of speech independently across each input channel.

VAD = voiceActivityDetector(Name,Value) sets each property Name to the specified Value. Unspecified properties have default values.

Example: VAD = voiceActivityDetector('InputDomain','Frequency') creates a System object, VAD, that accepts frequency-domain input.

Properties


Unless otherwise indicated, properties are nontunable, which means you cannot change their values after calling the object. Objects lock when you call them, and the release function unlocks them.

If a property is tunable, you can change its value at any time.

For more information on changing property values, see System Design in MATLAB Using System Objects (MATLAB).
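For example, a nontunable property such as InputDomain can be changed only after calling release on a locked object. A minimal sketch (the random input stands in for real audio):

```matlab
% Create a detector and lock it by processing one frame of audio.
VAD = voiceActivityDetector;     % InputDomain defaults to 'Time'
p = VAD(randn(441,1));           % first call locks the object

% InputDomain is nontunable: release the object before changing it.
release(VAD)
VAD.InputDomain = 'Frequency';
```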

InputDomain — Domain of the input signal, specified as 'Time' or 'Frequency'.

Tunable: No

Data Types: char | string

FFTLength — FFT length, specified as a positive scalar. The default is [], which means that the FFT length is equal to the number of rows of the input.

Tunable: No

Dependencies

To enable this property, set InputDomain to 'Time'.

Data Types: single | double

Window — Time-domain window function applied before calculating the discrete-time Fourier transform (DTFT), specified as 'Hann', 'Rectangular', 'Flat Top', 'Hamming', 'Chebyshev', or 'Kaiser'.

The window function is designed using the algorithms of the hann, rectwin, flattopwin, hamming, chebwin, and kaiser functions.

Tunable: No

Dependencies

To enable this property, set InputDomain to 'Time'.

Data Types: char | string

SidelobeAttenuation — Sidelobe attenuation of the window in dB, specified as a real positive scalar.

Tunable: No

Dependencies

To enable this property, set InputDomain to 'Time' and Window to 'Chebyshev' or 'Kaiser'.

Data Types: single | double

SilenceToSpeechProbability — Probability of transition from a frame of silence to a frame of speech, specified as a scalar in the range [0,1].

Tunable: Yes

Data Types: single | double

SpeechToSilenceProbability — Probability of transition from a frame of speech to a frame of silence, specified as a scalar in the range [0,1].

Tunable: Yes

Data Types: single | double
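Because the two transition probabilities are tunable, you can adjust them while the object is locked, without calling release. A minimal sketch (random input stands in for real audio):

```matlab
VAD = voiceActivityDetector;
p1 = VAD(randn(441,1));                 % first call locks the object

% Tunable properties can be changed even while the object is locked.
VAD.SpeechToSilenceProbability = 0.05;  % make speech-to-silence transitions rarer
p2 = VAD(randn(441,1));                 % next call uses the updated value
```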

Usage

Syntax

[probability,noiseEstimate] = VAD(audioIn)

Description


[probability,noiseEstimate] = VAD(audioIn) applies a voice activity detector to the input, audioIn, and returns the probability that speech is present. It also returns the estimated noise variance per frequency bin.

Input Arguments


audioIn — Audio input to the voice activity detector, specified as a scalar, vector, or matrix. If audioIn is a matrix, the columns are treated as independent audio channels.

The size of the audio input is locked after the first call to the voiceActivityDetector object. To change the size of audioIn, call release on the object.

If InputDomain is set to 'Time', audioIn must be real-valued. If InputDomain is set to 'Frequency', audioIn can be real-valued or complex-valued.

Data Types: single | double
Complex Number Support: Yes

Output Arguments


probability — Probability that speech is present, returned as a scalar or row vector with the same number of columns as audioIn.

Data Types: single | double

noiseEstimate — Estimate of the noise variance per frequency bin, returned as a column vector or matrix with the same number of columns as audioIn.

Data Types: single | double
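The output sizes follow from the descriptions above. A minimal sketch with a two-channel frame (random input stands in for real audio; with the default FFTLength of [], the number of frequency bins equals the frame length):

```matlab
VAD = voiceActivityDetector;
audioIn = randn(441,2);                          % 10 ms, two channels at 44.1 kHz
[probability,noiseEstimate] = VAD(audioIn);

% probability is 1-by-2: one speech-presence probability per channel.
% noiseEstimate is 441-by-2: one noise-variance estimate per frequency
% bin per channel, since the FFT length defaults to the frame length.
```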

Object Functions

To use an object function, specify the System object as the first input argument. For example, to release system resources of a System object named obj, use this syntax:

release(obj)


clone — Create duplicate System object
isLocked — Determine if System object is in use
release — Release resources and allow changes to System object property values and input characteristics
reset — Reset internal states of System object
step — Run System object algorithm

Examples


Use the default voiceActivityDetector System object™ to detect the presence of speech in a streaming audio signal.

Create an audio file reader to stream an audio file for processing. Define parameters to chunk the audio signal into 10 ms non-overlapping frames.

fileReader = dsp.AudioFileReader('Counting-16-44p1-mono-15secs.wav');
fs = fileReader.SampleRate;
fileReader.SamplesPerFrame = ceil(10e-3*fs);

Create a default voiceActivityDetector System object to detect the presence of speech in the audio file.

VAD = voiceActivityDetector;

Create a scope to plot the audio signal and corresponding probability of speech presence as detected by the voice activity detector. Create an audio device writer to play the audio through your sound card.

scope = dsp.TimeScope( ...
    'NumInputPorts',2, ...
    'SampleRate',fs, ...
    'TimeSpan',3, ...
    'BufferLength',3*fs, ...
    'YLimits',[-1.5 1.5], ...
    'TimeSpanOverrunAction','Scroll', ...
    'ShowLegend',true, ...
    'ChannelNames',{'Audio','Probability of speech presence'});
deviceWriter = audioDeviceWriter('SampleRate',fs);

In an audio stream loop:

  1. Read from the audio file.

  2. Calculate the probability of speech presence.

  3. Visualize the audio signal and speech presence probability.

  4. Play the audio signal through your sound card.

while ~isDone(fileReader)
    audioIn = fileReader();
    probability = VAD(audioIn);
    scope(audioIn,probability*ones(fileReader.SamplesPerFrame,1))
    deviceWriter(audioIn);
end

Use a voice activity detector to detect the presence of speech in an audio signal. Plot the probability of speech presence along with the audio samples.

Create a dsp.AudioFileReader System object™ to read a speech file.

afr = dsp.AudioFileReader('Counting-16-44p1-mono-15secs.wav');
fs = afr.SampleRate;

Chunk the audio into 20 ms frames with 75% overlap between successive frames. Convert the frame time in seconds to samples. Determine the hop size (the increment of new samples). In the audio file reader, set the samples per frame to the hop size. Create a default dsp.AsyncBuffer object to manage overlapping between audio frames.

frameSize = ceil(20e-3*fs);
overlapSize = ceil(0.75*frameSize);
hopSize = frameSize - overlapSize;
afr.SamplesPerFrame = hopSize;

inputBuffer = dsp.AsyncBuffer('Capacity',frameSize);

Create a voiceActivityDetector System object. Specify an FFT length of 1024.

VAD = voiceActivityDetector('FFTLength',1024);

Create a scope to plot the audio signal and corresponding probability of speech presence as detected by the voice activity detector. Create an audioDeviceWriter System object to play audio through your sound card.

scope = dsp.TimeScope('NumInputPorts',2, ...
    'SampleRate',fs, ...
    'TimeSpan',3, ...
    'BufferLength',3*fs, ...
    'YLimits',[-1.5,1.5], ...
    'TimeSpanOverrunAction','Scroll', ...
    'ShowLegend',true, ...
    'ChannelNames',{'Audio','Probability of speech presence'});

player = audioDeviceWriter('SampleRate',fs);

Initialize a vector to hold the probability values.

pHold = ones(hopSize,1);

In an audio stream loop:

  1. Read one hop's worth of samples from the audio file and save the samples into the buffer.

  2. Read a frame from the buffer with specified overlap from the previous frame.

  3. Call the voice activity detector to get the probability of speech for the frame under analysis.

  4. Set the last element of the probability vector to the new probability decision. Visualize the audio and speech presence probability using the time scope.

  5. Play the audio through your sound card.

  6. Set the probability vector to the most recent result for plotting in the next loop.

while ~isDone(afr)
    x = afr();
    n = write(inputBuffer,x);

    overlappedInput = read(inputBuffer,frameSize,overlapSize);

    p = VAD(overlappedInput);

    pHold(end) = p;
    scope(x,pHold)

    player(x);

    pHold(:) = p;
end

Release the player once the audio finishes playing.

release(player)

Many feature extraction techniques operate on the frequency domain. Converting an audio signal to the frequency domain only once is efficient. In this example, you convert a streaming audio signal to the frequency domain and feed that signal into a voice activity detector. If speech is present, mel-frequency cepstral coefficients (MFCC) features are extracted from the frequency-domain signal using the cepstralFeatureExtractor System object™.

Create a dsp.AudioFileReader System object to read from an audio file.

fileReader = dsp.AudioFileReader('Counting-16-44p1-mono-15secs.wav');
fs = fileReader.SampleRate;

Process the audio in 30 ms frames with a 10 ms hop. Create a default dsp.AsyncBuffer object to manage overlap between audio frames.

samplesPerFrame = ceil(0.03*fs);
samplesPerHop = ceil(0.01*fs);
samplesPerOverlap = samplesPerFrame - samplesPerHop;

fileReader.SamplesPerFrame = samplesPerHop;
buffer = dsp.AsyncBuffer;

Create a voiceActivityDetector System object and a cepstralFeatureExtractor System object. Specify that they operate in the frequency domain. Create a dsp.SignalSink to log the extracted cepstral features.

VAD = voiceActivityDetector('InputDomain','Frequency');
cepFeatures = cepstralFeatureExtractor('InputDomain','Frequency','SampleRate',fs,'LogEnergy','Replace');
sink = dsp.SignalSink;

In an audio stream loop:

  1. Read one hop's worth of samples from the audio file and save the samples into the buffer.

  2. Read a frame from the buffer with specified overlap from the previous frame.

  3. Call the voice activity detector to get the probability of speech for the frame under analysis.

  4. If the frame under analysis has a probability of speech greater than 0.75, extract cepstral features and log the features using the signal sink. If the frame under analysis has a probability of speech less than 0.75, write a vector of NaNs to the sink.

threshold = 0.75;
nanVector = nan(1,13);
while ~isDone(fileReader)
    audioIn = fileReader();
    write(buffer,audioIn);
    
    overlappedAudio = read(buffer,samplesPerFrame,samplesPerOverlap);
    X = fft(overlappedAudio,2048);
    
    probabilityOfSpeech = VAD(X);
    if probabilityOfSpeech > threshold
        xFeatures = cepFeatures(X);
        sink(xFeatures')
    else
        sink(nanVector)
    end
end

Visualize the cepstral coefficients over time.

timeVector = linspace(0,15,size(sink.Buffer,1));
plot(timeVector,sink.Buffer)
xlabel('Time (s)')
ylabel('MFCC Amplitude')
legend('Log-Energy','c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12')

Read in an entire speech file and determine the fundamental frequency of the audio using the pitch function. Then use the voiceActivityDetector to remove irrelevant pitch information that does not correspond to the speaker.

Read in the audio file and associated sample rate.

[audio,fs] = audioread('Counting-16-44p1-mono-15secs.wav');

Specify pitch detection using a 50 ms window length and 40 ms overlap (10 ms hop). Specify that the pitch function searches for the fundamental frequency over the range 50-150 Hz and postprocesses the results with a median filter. Plot the results.

windowLength = round(0.05*fs);
overlapLength = round(0.04*fs);
hopLength = windowLength - overlapLength;

[f0,loc] = pitch(audio,fs, ...
    'WindowLength',windowLength, ...
    'OverlapLength',overlapLength, ...
    'Range',[50 150], ...
    'MedianFilterLength',3);

plot(loc/fs,f0)
ylabel('Fundamental Frequency (Hz)')
xlabel('Time (s)')

Create a dsp.AsyncBuffer System object™ to chunk the audio signal into overlapped frames. Also create a voiceActivityDetector System object™ to determine if the frames contain speech.

buffer = dsp.AsyncBuffer(numel(audio));
write(buffer,audio);
VAD = voiceActivityDetector;

While there are enough samples to hop, read from the buffer and determine the probability that the frame contains speech. To mimic the decision spacing in time of the pitch function, the first frame read from the buffer has no overlap.

n = 1;
probabilityVector = zeros(numel(loc),1);
while buffer.NumUnreadSamples >= hopLength
    if n==1
        x = read(buffer,windowLength);
    else
        x = read(buffer,windowLength,overlapLength);
    end
    probabilityVector(n) = VAD(x);
    n = n+1;
end

Use the probability vector determined by the voiceActivityDetector to plot a pitch contour for the speech file that corresponds to regions of speech.

validIdx = probabilityVector>0.99;
loc(~validIdx) = nan;
f0(~validIdx) = nan;
plot(loc/fs,f0)
ylabel('Fundamental Frequency (Hz)')
xlabel('Time (s)')

Algorithms

The voiceActivityDetector object implements the algorithm described in [1].

If InputDomain is specified as 'Time', the input signal is windowed and then converted to the frequency domain according to the Window, SidelobeAttenuation, and FFTLength properties. If InputDomain is specified as 'Frequency', the input is assumed to be a windowed discrete-time Fourier transform (DTFT) of an audio signal. The signal is then converted to the power domain. The noise variance is estimated according to [2]. The posterior and prior SNR are estimated according to the minimum mean-square error (MMSE) formula described in [3]. A log likelihood ratio test and a Hidden Markov Model (HMM)-based hang-over scheme determine the probability that the current frame contains speech, according to [1].
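The two input domains can be related in code: calling a time-domain detector should behave like windowing and transforming the frame yourself before calling a frequency-domain detector. This sketch illustrates the idea; the exact window variant applied internally (for example, periodic versus symmetric Hann) is an assumption here, so the two probabilities may differ slightly:

```matlab
x = randn(441,1);   % stand-in for one frame of audio

% Time-domain detector with a Hann window and a 1024-point FFT.
vadTime = voiceActivityDetector('Window','Hann','FFTLength',1024);
pTime = vadTime(x);

% Roughly equivalent: window and transform the frame manually,
% then feed the DTFT to a frequency-domain detector.
vadFreq = voiceActivityDetector('InputDomain','Frequency');
w = hann(numel(x),'periodic');        % assumed window variant
pFreq = vadFreq(fft(x.*w,1024));
```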

References

[1] Sohn, Jongseo, Nam Soo Kim, and Wonyong Sung. "A Statistical Model-Based Voice Activity Detection." IEEE Signal Processing Letters. Vol. 6, No. 1, 1999, pp. 1–3.

[2] Martin, R. "Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics." IEEE Transactions on Speech and Audio Processing. Vol. 9, No. 5, 2001, pp. 504–512.

[3] Ephraim, Y., and D. Malah. "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator." IEEE Transactions on Acoustics, Speech, and Signal Processing. Vol. 32, No. 6, 1984, pp. 1109–1121.


Introduced in R2018a