gtcc

Extract gammatone cepstral coefficients, log-energy, delta, and delta-delta

Description

coeffs = gtcc(audioIn,fs) returns the gammatone cepstral coefficients (GTCCs) for the audio input, sampled at a frequency of fs Hz.

coeffs = gtcc(___,Name,Value) specifies options using one or more Name,Value pair arguments.

[coeffs,delta,deltaDelta,loc] = gtcc(___) returns the delta, delta-delta, and location in samples corresponding to each window of data. This output syntax can be used with any of the previous input syntaxes.

Examples

Get the gammatone cepstral coefficients for an audio file using default settings. Plot the results.

[audioIn,fs] = audioread('Counting-16-44p1-mono-15secs.wav');

[coeffs,~,~,loc] = gtcc(audioIn,fs);

t = loc./fs;

plot(t,coeffs)
xlabel('Time (s)')
title('Gammatone Cepstral Coefficients')
legend('logE','0','1','2','3','4','5','6','7','8','9','10','11','12', ...
    'Location','northeastoutside')

Read in an audio file.

[audioIn,fs] = audioread('Turbine-16-44p1-mono-22secs.wav');

Calculate 20 GTCCs using filters equally spaced on the ERB scale between hz2erb(62.5) and hz2erb(12000). Calculate the coefficients using 50 ms windows with 25 ms overlap. Replace the 0th coefficient with the log-energy. Use time-domain filtering.

[coeffs,~,~,loc] = gtcc(audioIn,fs, ...
                       'NumCoeffs',20, ...
                       'FrequencyRange',[62.5,12000], ...
                       'WindowLength',round(0.05*fs), ...
                       'OverlapLength',round(0.025*fs), ...
                       'LogEnergy','Replace', ...
                       'FilterDomain','Time');

Plot the results.

t = loc./fs;

plot(t,coeffs)
xlabel('Time (s)')
title('Gammatone Cepstral Coefficients')
legend('logE','1','2','3','4','5','6','7','8','9','10','11','12','13', ...
    '14','15','16','17','18','19','Location','northeastoutside');

Read in an audio file and convert it to a frequency representation.

[audioIn,fs] = audioread("Rainbow-16-8-mono-114secs.wav");

win = hann(1024,"periodic");
S = stft(audioIn,"Window",win,"OverlapLength",512,"Centered",false);

To extract the gammatone cepstral coefficients, call gtcc with the frequency-domain audio. Ignore the log-energy.

coeffs = gtcc(S,fs,"LogEnergy","Ignore");

In many applications, GTCC observations are converted to summary statistics for use in classification tasks. Plot probability density functions of each of the gammatone cepstral coefficients to observe their distributions.

nbins = 60;
for i = 1:size(coeffs,2)
    figure
    histogram(coeffs(:,i),nbins,'Normalization','pdf')
    title(sprintf("Coefficient %d",i-1))
end
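As a rough illustration of the summary-statistics step mentioned above, the following sketch collapses a coefficient matrix over time into one fixed-length feature vector. It assumes that the per-coefficient mean and standard deviation are the statistics of interest, and it uses random placeholder data rather than actual gtcc output. The variable names mu, sigma, and featureVector are illustrative only.

```matlab
% Placeholder data standing in for gtcc output:
% rows are analysis windows, columns are coefficients.
rng(0)
coeffs = randn(200,13);

% Collapse the time dimension into per-coefficient statistics.
mu    = mean(coeffs,1);      % 1-by-13 mean of each coefficient
sigma = std(coeffs,0,1);     % 1-by-13 standard deviation of each coefficient

% Concatenate into a single fixed-length feature vector (1-by-26).
featureVector = [mu,sigma];
```

A vector like this has the same length regardless of the audio duration, which is typically what a classifier expects.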

Input Arguments

Input signal, specified as a vector, matrix, or 3-D array.

If 'FilterDomain' is set to 'Frequency' (default), then audioIn can be real or complex.

  • If audioIn is real, it is interpreted as a time-domain signal and must be a column vector or a matrix. Columns of the matrix are treated as independent audio channels.

  • If audioIn is complex, it is interpreted as a frequency-domain signal. In this case, audioIn must be an L-by-M-by-N array, where L is the number of DFT points, M is the number of individual spectra, and N is the number of individual channels.

If 'FilterDomain' is set to 'Time', then audioIn must be a real column vector or matrix. Columns of the matrix are treated as independent audio channels.

Data Types: single | double
Complex Number Support: Yes

Sample rate of the input signal (fs) in Hz, specified as a positive scalar.

Data Types: single | double

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: coeffs = gtcc(audioIn,fs,'LogEnergy','Replace') returns gammatone cepstral coefficients for the audio input signal sampled at fs Hz. For each analysis window, the first coefficient in the coeffs vector is replaced with the log energy of the input signal.

Number of samples in analysis window used to calculate the coefficients, specified as the comma-separated pair consisting of 'WindowLength' and an integer in the range [2, size(audioIn,1)]. If unspecified, WindowLength defaults to round(0.03*fs).

Data Types: single | double

Number of samples overlapped between adjacent windows, specified as the comma-separated pair consisting of 'OverlapLength' and an integer in the range [0, WindowLength). If unspecified, OverlapLength defaults to round(0.02*fs).

Data Types: single | double
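WindowLength and OverlapLength are given in samples, so durations in seconds have to be converted using the sample rate. The sketch below reproduces the documented defaults for a 44.1 kHz signal. The name hopLength is not a gtcc parameter; it is introduced here only to show how far consecutive windows are shifted.

```matlab
fs = 44100;                               % sample rate in Hz

% Documented defaults: 30 ms windows with 20 ms overlap.
windowLength  = round(0.03*fs);           % 1323 samples
overlapLength = round(0.02*fs);           % 882 samples

% Shift between the starts of adjacent windows (illustrative name).
hopLength = windowLength - overlapLength; % 441 samples

% OverlapLength must lie in the range [0, WindowLength).
isValidOverlap = overlapLength >= 0 && overlapLength < windowLength;
```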

Number of coefficients returned for each window of data, specified as the comma-separated pair consisting of 'NumCoeffs' and an integer in the range [2, v]. v is the number of valid passbands. If unspecified, NumCoeffs defaults to 13.

The number of valid passbands is defined as the number of ERB steps (ERBN) in the frequency range of the filter bank. The frequency range of the filter bank is specified by FrequencyRange.

Data Types: single | double
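To get a feel for how many passbands a given frequency range offers, the range can be measured on the ERB scale with hz2erb. This is only a sketch: it reports the raw ERB width and compares a requested NumCoeffs against it. How gtcc rounds this width to an integer count is not stated here, so treat the comparison as approximate.

```matlab
% Frequency range in Hz (values taken from the example on this page).
frequencyRange = [62.5 12000];

% Width of the range on the ERB scale. hz2erb is an Audio Toolbox function.
erbSteps = hz2erb(frequencyRange(2)) - hz2erb(frequencyRange(1));

% Compare a requested NumCoeffs against the raw ERB width.
% Assumption: any rounding applied by gtcc is ignored in this check.
numCoeffs = 20;
isPlausible = numCoeffs >= 2 && numCoeffs <= erbSteps;
```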

Domain in which to apply filtering, specified as the comma-separated pair consisting of 'FilterDomain' and 'Frequency' or 'Time'. If unspecified, FilterDomain defaults to 'Frequency'.

Data Types: string | char

Frequency range of gammatone filter bank in Hz, specified as the comma-separated pair consisting of 'FrequencyRange' and a two-element row vector of increasing values in the range [0, fs/2]. If unspecified, FrequencyRange defaults to [50, fs/2].

Data Types: single | double

Number of bins used to calculate the DFT of windowed input samples, specified as the comma-separated pair consisting of 'FFTLength' and a positive scalar integer. If unspecified, FFTLength defaults to WindowLength.

Data Types: single | double

Number of coefficients used to calculate the delta and the delta-delta values, specified as the comma-separated pair consisting of 'DeltaWindowLength' and either 2 or an odd integer greater than 2. If unspecified, DeltaWindowLength defaults to 2.

If DeltaWindowLength is set to 2, the delta is given by the difference between the current coefficients and the previous coefficients.

If DeltaWindowLength is set to an odd integer greater than 2, the delta is the least-squares slope of the coefficients over the neighboring analysis windows:

$$\mathrm{delta}(t) = \frac{\sum_{k=-M}^{M} k \,\mathrm{coeffs}(t+k)}{\sum_{k=-M}^{M} k^{2}}$$

Here t is the index of the current analysis window and M = (DeltaWindowLength − 1)/2.

The function uses a least-squares approximation of the local slope over a region around the coefficients of the current analysis window. The delta cepstral values are computed by fitting the cepstral coefficients of neighboring analysis windows (M analysis windows before the current analysis window and M analysis windows after the current analysis window) to a straight line. For details, see [3].

Data Types: single | double
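The two delta modes can be sketched in a few lines of base MATLAB. This is an illustration of the idea, not the gtcc implementation: the handling of the first and last analysis windows is not described here, so the sketch simply zeros the first row in the difference case and leaves edge rows as NaN in the least-squares case. The test matrix is constructed so that each column is a straight line with a known slope.

```matlab
% Hypothetical coefficients: rows are analysis windows, columns are
% coefficients. Column 1 has slope 1 per window, column 2 has slope 2.
coeffs = [0 1; 1 3; 2 5; 3 7; 4 9];

% Case 1: DeltaWindowLength = 2, difference from the previous window.
% Assumption: the first window has no predecessor, so its delta is zero.
delta2 = [zeros(1,size(coeffs,2)); diff(coeffs,1,1)];

% Case 2: odd DeltaWindowLength = 2*M + 1, least-squares slope of a
% straight line fitted to M windows on either side of the current one.
deltaWindowLength = 3;
M = (deltaWindowLength - 1)/2;
k = (-M:M).';
numWindows = size(coeffs,1);
deltaLS = nan(size(coeffs));   % edge rows stay NaN (assumption)
for t = M+1:numWindows-M
    deltaLS(t,:) = (k.' * coeffs(t-M:t+M,:)) / sum(k.^2);
end
```

For straight-line input both modes recover the slope exactly on the interior rows; they differ when the coefficients are noisy, where the least-squares fit smooths over the neighboring windows.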

Log energy usage, specified as the comma-separated pair consisting of 'LogEnergy' and 'Append', 'Replace', or 'Ignore'. If unspecified, LogEnergy defaults to 'Append'.

  • 'Append' –– The function prepends the log energy to the coefficients vector. The length of the coefficients vector is 1 + NumCoeffs.

  • 'Replace' –– The function replaces the first coefficient with the log energy of the signal. The length of the coefficients vector is NumCoeffs.

  • 'Ignore' –– The function does not calculate or return the log energy.

Data Types: char | string

Output Arguments

Gammatone cepstral coefficients, returned as an L-by-M matrix or an L-by-M-by-N array, where:

  • L –– Number of analysis windows the audio signal is partitioned into. The input size, WindowLength, and OverlapLength control this dimension: L = floor((size(audioIn,1) − WindowLength)/(WindowLength − OverlapLength)) + 1.

  • M –– Number of coefficients returned per frame. This value is determined by NumCoeffs and LogEnergy.

    When LogEnergy is set to:

    • 'Append' –– The function prepends the log energy value to the coefficients vector. The length of the coefficients vector is 1 + NumCoeffs.

    • 'Replace' –– The function replaces the first coefficient with the log energy of the signal. The length of the coefficients vector is NumCoeffs.

    • 'Ignore' –– The function does not calculate or return the log energy. The length of the coefficients vector is NumCoeffs.

  • N –– Number of input channels (columns). This value is size(audioIn,2).

Data Types: single | double
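The size of coeffs can be predicted before calling gtcc. The following sketch works through the three dimensions for one second of hypothetical stereo audio at 44.1 kHz with the default settings. The input size and channel count are made-up values for illustration.

```matlab
fs = 44100;
numSamples  = 44100;                 % one second of audio (hypothetical)
numChannels = 2;                     % hypothetical stereo input

windowLength  = round(0.03*fs);      % default WindowLength
overlapLength = round(0.02*fs);      % default OverlapLength
numCoeffs     = 13;                  % default NumCoeffs
logEnergy     = 'Append';            % default LogEnergy

% Number of complete analysis windows.
L = floor((numSamples - windowLength)/(windowLength - overlapLength)) + 1;

% Coefficients per window: one extra column only when log energy is appended.
M = numCoeffs + strcmp(logEnergy,'Append');

% One page per input channel.
N = numChannels;

expectedSize = [L M N];
```

With these values the expected size is 98-by-14-by-2. Trailing samples that do not fill a complete window do not contribute a row.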

Change in coefficients from one analysis window to another, returned as an L-by-M matrix or an L-by-M-by-N array. The delta array is the same size and data type as the coeffs array. See coeffs for the definitions of L, M, and N.

The function uses a least-squares approximation of the local slope over a region around the current time sample. For details, see [3].

Data Types: single | double

Change in delta values, returned as an L-by-M matrix or an L-by-M-by-N array. The deltaDelta array is the same size and data type as the coeffs and delta arrays. See coeffs for the definitions of L, M, and N.

The function uses a least-squares approximation of the local slope over a region around the current time sample. For details, see [3].

Data Types: single | double

Location of the last sample in each analysis window (loc), returned as a column vector with the same number of rows as coeffs.

Data Types: single | double
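Because loc marks the last sample of each window, dividing it by the sample rate gives a time axis for plotting, as the examples on this page do with t = loc./fs. The sketch below reconstructs what loc would look like under the assumption that the first analysis window starts at sample 1. The name hopLength is illustrative and not a gtcc parameter.

```matlab
fs = 44100;
numSamples   = 44100;                          % hypothetical one-second input
windowLength = round(0.03*fs);                 % default WindowLength
hopLength    = windowLength - round(0.02*fs);  % WindowLength - OverlapLength
numWindows   = floor((numSamples - windowLength)/hopLength) + 1;

% Assumption: the first window covers samples 1 through windowLength.
loc = windowLength + (0:numWindows-1).'*hopLength;

% Convert sample indices to seconds for use as a plot axis.
t = loc./fs;
```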

Algorithms

The gtcc function splits the input signal into overlapping analysis windows. The length of each analysis window is determined by WindowLength. The length of overlap between analysis windows is determined by OverlapLength. The algorithm used to determine the gammatone cepstral coefficients depends on the filter domain, specified by FilterDomain. The default filter domain is frequency.

Frequency-Domain Filtering

gtcc computes the gammatone cepstral coefficients, log energy values, delta, and delta-delta values for each analysis window as per the algorithm described in cepstralFeatureExtractor.

Time-Domain Filtering

If FilterDomain is specified as 'Time', the gtcc function uses the gammatoneFilterBank to apply time-domain filtering. The basic steps of the algorithm are described in the following paragraphs.

The FrequencyRange and sample rate (fs) parameters of the filter bank are set from the corresponding inputs to the gtcc function. The number of filters in the gammatone filter bank is defined as hz2erb(FrequencyRange(2)) − hz2erb(FrequencyRange(1)). This roughly corresponds to placing a gammatone filter every 0.9 mm in the cochlea.

The output from the gammatone filter bank is a multichannel signal. Each channel output from the gammatone filter bank is buffered into overlapped analysis windows, as specified by WindowLength and OverlapLength. Then a periodic Hamming window is applied to each analysis window. The short-term energy (STE) of each analysis window is calculated. The STE values of the channels are concatenated. The concatenated signal is then passed through a logarithm function and transformed to the cepstral domain using a discrete cosine transform (DCT).

The log-energy is calculated on the original audio signal using the same buffering scheme applied to the gammatone filter bank output.
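The buffering, windowing, energy, logarithm, and DCT steps described above can be sketched as follows. This is a simplified illustration under several assumptions, not the gtcc implementation: the filter bank output is replaced by random placeholder data with four channels, and the logarithm base and DCT normalization are guesses because they are not specified here. The hamming and dct functions come from Signal Processing Toolbox.

```matlab
fs = 16000;
rng(0)
filtered = randn(fs,4);    % placeholder for a 4-channel filter bank output

windowLength  = round(0.03*fs);
overlapLength = round(0.02*fs);
hopLength     = windowLength - overlapLength;
numWindows    = floor((size(filtered,1) - windowLength)/hopLength) + 1;

% Periodic Hamming window applied to every analysis window.
win = hamming(windowLength,'periodic');

% Short-term energy of each window, one column per filter channel.
energy = zeros(numWindows,size(filtered,2));
for w = 1:numWindows
    idx   = (w-1)*hopLength + (1:windowLength);
    frame = filtered(idx,:) .* win;      % window all channels at once
    energy(w,:) = sum(frame.^2,1);
end

% Logarithm, then DCT across the filter channels.
% dct works down columns, so transpose before and after.
% Assumptions: base-10 logarithm and the default dct normalization.
cepstra = dct(log10(energy).').';
```

Each row of cepstra corresponds to one analysis window, and each column to one cepstral coefficient derived from the channel energies.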

References

[1] Shao, Yang, Zhaozhang Jin, Deliang Wang, and Soundararajan Srinivasan. "An Auditory-Based Feature for Robust Speech Recognition." IEEE International Conference on Acoustics, Speech and Signal Processing. 2009.

[2] Valero, X., and F. Alias. "Gammatone Cepstral Coefficients: Biologically Inspired Features for Non-Speech Audio Classification." IEEE Transactions on Multimedia. Vol. 14, Issue 6, 2012, pp. 1684–1689.

[3] Rabiner, Lawrence R., and Ronald W. Schafer. Theory and Applications of Digital Speech Processing. Upper Saddle River, NJ: Pearson, 2010.

Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

Introduced in R2019a