vadnet

(Not recommended) Voice activity detection (VAD) neural network

Since R2023a

    vadnet is not recommended. Use the audioPretrainedNetwork function instead.
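    For example, replace

    net = vadnet;

    with

    net = audioPretrainedNetwork("vadnet");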

    Description

    net = vadnet() returns a pretrained VAD model.

    This function requires both Audio Toolbox™ and Deep Learning Toolbox™.

    Examples

    Read in an audio signal containing speech and music and listen to the sound.

    [audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
    sound(audioIn,fs)

    Use vadnetPreprocess to preprocess the audio by computing a mel spectrogram.

    features = vadnetPreprocess(audioIn,fs);

    Call audioPretrainedNetwork to obtain a pretrained VAD neural network.

    net = audioPretrainedNetwork("vadnet");

    Pass the preprocessed audio through the network to obtain the probability of speech in each frame.

    probs = predict(net,features);

    Use vadnetPostprocess to postprocess the network output and determine the boundaries of the speech regions in the signal.

    roi = vadnetPostprocess(audioIn,fs,probs)
    roi = 2×2
    
               1       63120
           83600      150000
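
    The regions are returned in samples. For example, you can listen to the first detected speech region by indexing the audio signal with the region limits in roi:

    firstRegion = audioIn(roi(1,1):roi(1,2));
    sound(firstRegion,fs)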
    
    

    Plot the audio with the detected speech regions.

    vadnetPostprocess(audioIn,fs,probs)

    Figure: Detected Speech. The plot shows the audio amplitude over time (s) with the detected speech regions marked.

    To use the VAD network in a streaming scenario, create a dsp.AudioFileReader object to stream an audio file for processing. Set the SamplesPerFrame property to read 100 ms nonoverlapping chunks from the signal.

    afr = dsp.AudioFileReader("MaleVolumeUp-16-mono-6secs.ogg");
    analysisDuration = 0.1; % seconds
    afr.SamplesPerFrame = floor(analysisDuration*afr.SampleRate);

    The vadnet architecture does not retain state between calls, and it performs best when analyzing larger chunks of audio signals. When you use vadnet in a streaming scenario, your application's requirements for accuracy, computational efficiency, and latency determine the analysis duration and whether to overlap analysis chunks. One way to overlap chunks is sketched after the streaming loop below.

    Create a timescope object to plot the audio signal and the corresponding speech probabilities. Create an audioDeviceWriter to play the audio as you stream it.

    scope = timescope(NumInputPorts=2, ...
        SampleRate=afr.SampleRate, ...
        TimeSpanSource="property",TimeSpan=5, ...
        YLimits=[-1.2,1.2], ...
        ShowLegend=true,ChannelNames=["Audio","Speech Probability"]);
    adw = audioDeviceWriter(afr.SampleRate);

    Call audioPretrainedNetwork to obtain a pretrained VAD neural network.

    net = audioPretrainedNetwork("vadnet");

    In a streaming loop:

    1. Read in a 100 ms chunk from the audio file.

    2. Preprocess the audio into a mel spectrogram using vadnetPreprocess.

    3. Use the VAD network to predict the probability of speech in each frame of the spectrogram. Replicate the probabilities to correspond to each sample in the audio signal.

    4. Plot the audio signal and the probabilities of speech.

    5. Play the audio with the device writer.

    hop = 0.01 * afr.SampleRate;
    while ~isDone(afr)
        audioIn = afr();
    
        features = vadnetPreprocess(audioIn,afr.SampleRate);
        probs = predict(net,features);
        % Replicate each frame probability across the samples of one hop,
        % then trim half a hop from each end so that the probabilities
        % align with the samples in audioIn.
        probs = repelem(probs,hop)';
        probs = probs((hop/2)+1:end-hop/2);
    
        scope(audioIn,probs)
        adw(audioIn);
    end
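
    The 100 ms chunks in this loop do not overlap. If your application calls for overlapping analysis chunks, one option is to buffer the streamed audio with a dsp.AsyncBuffer object (DSP System Toolbox) and reread an overlap region on each analysis call. The following standalone sketch assumes a 0.5 s chunk length and a 0.25 s overlap; these values are illustrative, not recommendations.

    afr = dsp.AudioFileReader("MaleVolumeUp-16-mono-6secs.ogg");
    net = audioPretrainedNetwork("vadnet");
    buff = dsp.AsyncBuffer;

    chunkLength = round(0.5*afr.SampleRate);    % samples analyzed per call
    overlapLength = round(0.25*afr.SampleRate); % samples shared between consecutive calls

    while ~isDone(afr)
        write(buff,afr());
        while buff.NumUnreadSamples >= chunkLength - overlapLength
            % Reread overlapLength previously read samples followed by new samples.
            chunk = read(buff,chunkLength,overlapLength);
            features = vadnetPreprocess(chunk,afr.SampleRate);
            probs = predict(net,features);
            % Combine or smooth the per-chunk probabilities as the application requires.
        end
    end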

    Output Arguments

    net is the pretrained VAD neural network, returned as a DAGNetwork (Deep Learning Toolbox) object.

    Algorithms

    The neural network is a ported version of the vad-crdnn-libriparty pretrained model provided by SpeechBrain[1], which combines convolutional, recurrent, and fully connected layers.
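
    To see these layers yourself, you can display the Layers property of the pretrained network, for example:

    net = audioPretrainedNetwork("vadnet");
    net.Layers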

    References

    [1] Ravanelli, Mirco, et al. "SpeechBrain: A General-Purpose Speech Toolkit." arXiv, June 8, 2021. http://arxiv.org/abs/2106.04624.

    Extended Capabilities

    Version History

    Introduced in R2023a