

YAMNet sound classification network

Since R2021b

  • YAMNet block

Libraries: Audio Toolbox / Deep Learning


The YAMNet block uses a pretrained sound classification network, trained on the AudioSet data set, to predict audio events from the AudioSet ontology.



Input Ports

Mel spectrograms, specified as a 96-by-64 matrix or a 96-by-64-by-1-by-N array, where:

  • 96 –– Number of 25 ms frames in each mel spectrogram

  • 64 –– Number of mel bands, spanning 125 Hz to 7.5 kHz

  • N –– Number of channels

You can use the YAMNet Preprocess block to generate 96-by-64 mel spectrograms in the format this block requires.

Data Types: single | double
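At the MATLAB command line, you can generate input in this format with the yamnetPreprocess function from Audio Toolbox. A minimal sketch (the audio file name is hypothetical):

```matlab
% Read an audio signal (hypothetical file name) and generate YAMNet
% input features: a 96-by-64-by-1-by-N array of mel spectrograms.
[audioIn,fs] = audioread("speech.wav");
features = yamnetPreprocess(audioIn,fs);
size(features)   % 96-by-64-by-1-by-N; N depends on the signal length
```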


Output Ports

Predicted sound label, returned as an enumerated scalar.

Data Types: enumerated

Predicted activation or score values for each supported sound label, returned as a 1-by-521 vector, where 521 is the number of classes in YAMNet.

Data Types: single

Class labels for predicted scores, returned as a 1-by-521 vector.

Data Types: enumerated
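For reference, the same scores and class labels are available at the MATLAB command line from the pretrained network returned by the yamnet function. A minimal sketch (the random matrix is a placeholder standing in for real yamnetPreprocess output):

```matlab
net = yamnet;                        % pretrained YAMNet network (Audio Toolbox)
classes = net.Layers(end).Classes;   % the 521 class labels
spec = rand(96,64,1,1,"single");     % placeholder mel spectrogram
scores = predict(net,spec);          % 1-by-521 score vector
[~,idx] = max(scores);               % index of the highest-scoring class
predictedLabel = classes(idx);
```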


Parameters

Size of mini-batches to use for prediction, specified as a positive integer. Larger mini-batch sizes require more memory, but can lead to faster predictions.

Data Types: int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Enable the sound output port, which outputs the predicted sound label.

Enable the scores and labels output ports, which output the predicted scores for all supported classes and the associated class labels.

Block Characteristics

  • Data Types –– double | single

  • Direct Feedthrough

  • Multidimensional Signals

  • Variable-Size Signals

  • Zero-Crossing Detection

References


[1] Gemmeke, Jort F., Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–780. doi:10.1109/ICASSP.2017.7952261.

[2] Hershey, Shawn, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, et al. “CNN Architectures for Large-Scale Audio Classification.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 131–135. doi:10.1109/ICASSP.2017.7952132.

Extended Capabilities

Version History

Introduced in R2021b