How is the number of parameters calculated when a multi-head self-attention layer is used in a CNN model?

I have run the example in the following link in two cases:
  • Case 1: NumHeads = 4, NumKeyChannels = 784
  • Case 2: NumHeads = 8, NumKeyChannels = 392
Note that 4×784 = 8×392 = 3136 (the size of the input feature vector to the attention layer). I calculated the number of model parameters in the two cases and got 9.8 M for the first case and 4.9 M for the second.
I expected the number of learnable parameters to be the same. However, MATLAB reports different parameter counts.
My understanding from research papers is that the total parameter count should not depend on how the input is split across heads: as long as the input feature size is fixed and the number of heads times the per-head channel count equals that input size, the number of parameters should be the same.
Why does MATLAB’s selfAttentionLayer produce different parameter counts for these two configurations? Am I misinterpreting how the layer is implemented in this toolbox?
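For concreteness, here is the rough arithmetic (ignoring biases) under the two possible readings of NumKeyChannels; only the second reading reproduces the counts I observed:

inputSize = 3136;                    % channels entering the attention layer

% Reading A (as in the papers): NumKeyChannels is the per-head size, so the
% Q, K, V, and output projections use 4 * inputSize * (heads * channels) weights.
paramsA = @(h, k) 4 * inputSize * h * k;
paramsA(4, 784)                      % 39.3 M
paramsA(8, 392)                      % 39.3 M (identical, as I expected)

% Reading B: NumKeyChannels is the total size shared across all heads.
paramsB = @(h, k) 4 * inputSize * k;
paramsB(4, 784)                      % ~9.8 M (matches what MATLAB reports)
paramsB(8, 392)                      % ~4.9 M (matches what MATLAB reports)

If reading B is what the layer implements, that would explain the difference, but I would like this confirmed.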
  6 Comments
Hana Ahmed on 3 Sep 2025 at 5:24
Edited: Hana Ahmed on 3 Sep 2025 at 5:25
I have implemented the pseudocode you provided in MATLAB. I would be grateful if you could check it.
I have run the example in the following link using the new implementation and the previous implementation already found in MATLAB (which I believe is incorrect), with 8 heads and 64 key channels. Using the new implementation, the number of parameters is 6.4 M; using the previous implementation, it is 0.8 M. Could you please review and confirm which implementation is correct?
I achieved a high classification accuracy of approximately 99.5% using a self-attention-based network, following the example in the MATLAB Central post linked below, which employs selfAttentionLayer(8, 64) on the Digit Dataset. This result is excellent, and my goal is to preserve it while ensuring the correct interpretation of the layer's parameters in my technical report. I would like to clarify how the numKeyChannels parameter is interpreted in MATLAB's implementation.
Can I use the built-in MATLAB implementation to report the system accuracy (since it is validated and achieves ~99.5%), while using my custom implementation to report the number of parameters, to reflect a more transparent and explicitly defined architecture?
If so, how can I communicate this approach clearly and accurately in a technical report, so the reader understands that the accuracy and the parameter count come from different but compatible implementations (one for performance evaluation, the other for architectural analysis) without being misled?
https://www.mathworks.com/matlabcentral/answers/1932550-example-of-using-self-attention-layer-in-matlab-r2023a
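For cross-checking, here is a minimal sketch that counts the built-in layer's learnables directly (assuming Deep Learning Toolbox R2023a or later; the sequence-input layer is just scaffolding for the count):

% Count the learnable parameters of the built-in selfAttentionLayer.
layers = [sequenceInputLayer(3136) selfAttentionLayer(8, 64)];
net = dlnetwork(layers);
numLearnables = sum(cellfun(@numel, net.Learnables.Value))
% ~0.8 M, consistent with the built-in layer treating 64 as the total
% number of key channels rather than the per-head size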
Umar on 3 Sep 2025 at 5:53
Edited: Umar on 3 Sep 2025 at 5:53

Hi @Hana Ahmed,

I checked your code — the “Not enough input arguments” error happens because the function needs the input X when you call it, e.g.,

X = randn(128, 512);                            % example input: 128 rows, 512 channels
[Y, numParams] = multiHeadAttention(X, 8, 64);  % 8 heads, 64 key channels
  • Please see attached

Once X is provided, it should run and give the parameter count. It’s normal if your custom count is higher than MATLAB’s built-in selfAttentionLayer, since the built-in layer may share or optimize weights internally.

For reporting, you can use the built-in layer for accuracy (~99.5%) and your custom implementation for parameter counts — just make it clear that they come from compatible but distinct implementations. Also, replacing squeeze with reshape(..., batchSize, d_k) avoids issues if batchSize = 1.
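To illustrate the squeeze pitfall (the dimensions here are hypothetical, chosen to match your d_k = 64):

batchSize = 1;  d_k = 64;
A = randn(1, batchSize, d_k);             % a per-head slice with a leading singleton dim

size(squeeze(A))                          % 64x1: the singleton batch dim is collapsed
size(reshape(A, batchSize, d_k))          % 1x64: the batch dim is preserved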


Answers (0)
