Grad-CAM Reveals the Why Behind Deep Learning Decisions

This example shows how to use the Grad-CAM approach to understand why a deep learning network makes its classification decisions. Grad-CAM, invented by Selvaraju and coauthors [1], uses the gradient of the classification score with respect to the final convolutional layer of a network to identify which parts of an image are most important for the classification. The example uses the pretrained GoogLeNet image classification network.

Grad-CAM is a generalization of the class activation mapping (CAM) technique. This example uses the dlgradient automatic differentiation function to perform the required Grad-CAM computations easily. For activation mapping techniques on live webcam data, see Investigate Network Predictions Using Class Activation Mapping.

Load Pretrained Network

Load the GoogLeNet network.

net = googlenet;

Classify Image

Extract the image input size from the first layer of the network.

inputSize = net.Layers(1).InputSize(1:2);

Read sherlock.jpg, an image of a golden retriever included with this example.

img = imread("sherlock.jpg");

Resize the image to the network input size.

img = imresize(img,inputSize);

Classify the image and display it, along with its classification and classification score.

[classfn,score] = classify(net,img);
imshow(img);
title(sprintf("%s (%.2f)", classfn, score(classfn)));

GoogLeNet correctly classifies the image as a golden retriever. But why? What characteristics of the image cause the network to make this classification?

Grad-CAM Explains Why

The idea behind Grad-CAM [1] is to calculate the gradient of the final classification score with respect to the final convolutional layer in a network. The places where this gradient is large are exactly the places where the final score depends most on the data. The gradcam helper function computes the Grad-CAM map for a dlnetwork, taking the derivative of the softmax layer score for a given class with respect to a convolutional feature map. For automatic differentiation, the input image dlImg must be a dlarray.
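In the notation of [1], the Grad-CAM map for class c weights each feature map A^k of the chosen convolutional layer by its spatially pooled gradient:

```latex
\alpha_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A_{ij}^k},
\qquad
L_{\mathrm{Grad\text{-}CAM}}^c = \mathrm{ReLU}\Bigl(\sum_k \alpha_k^c A^k\Bigr),
```

where y^c is the score for class c, A^k_{ij} is the activation at spatial location (i,j) of feature map k, and Z is the number of spatial locations. The gradcam helper function below computes the two ingredients of this formula: the feature maps and their gradients.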

type gradcam.m
function [convMap,dScoresdMap] = gradcam(dlnet, dlImg, softmaxName, convLayerName, classfn)
[scores,convMap] = predict(dlnet, dlImg, 'Outputs', {softmaxName, convLayerName});
classScore = scores(classfn);
dScoresdMap = dlgradient(classScore,convMap);
end

The first line of the gradcam function obtains the softmax scores and the feature map of the final convolutional layer. The second line selects the score for the chosen classification (golden retriever, in this case); dlgradient calculates gradients only for scalar-valued functions, so gradcam must reduce the scores to the single selected class. The third line uses automatic differentiation to calculate the gradient of that score with respect to the activations of the final convolutional layer.

To use Grad-CAM, convert the GoogLeNet network to a dlnetwork. First, create a layer graph from the network.

lgraph = layerGraph(net);

To access the data that GoogLeNet uses for classification, remove its final classification layer.

lgraph = removeLayers(lgraph, lgraph.Layers(end).Name);

Create a dlnetwork from the layer graph.

dlnet = dlnetwork(lgraph);

Specify the name of the softmax layer, 'prob'. Specify the name of the final convolutional layer, 'inception_5b-output'.

softmaxName = 'prob';
convLayerName = 'inception_5b-output';

To use automatic differentiation, convert the sherlock image to a dlarray.

dlImg = dlarray(single(img),'SSC');

Compute the Grad-CAM gradient for the image by calling dlfeval on the gradcam function.

[convMap, dScoresdMap] = dlfeval(@gradcam, dlnet, dlImg, softmaxName, convLayerName, classfn);

Resize the gradient map to the GoogLeNet image size, and scale the scores to the appropriate levels for display.

gradcamMap = sum(convMap .* sum(dScoresdMap, [1 2]), 3);
gradcamMap = extractdata(gradcamMap);
gradcamMap = rescale(gradcamMap);
gradcamMap = imresize(gradcamMap, inputSize, 'Method', 'bicubic');
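The map computed above sums the gradients spatially and relies on rescale for normalization. For comparison, a sketch of the formulation in [1], which averages the gradients (global average pooling) and applies a ReLU to keep only positively contributing features, might look like the following; the differences affect only the scaling and the clipping of negative values, using the same variables as above:

```matlab
% Per-channel weights: global average pooling of the gradients (alpha_k in [1]).
weights = mean(dScoresdMap, [1 2]);
% Weighted combination of feature maps, followed by a ReLU as in [1].
paperMap = extractdata(max(sum(convMap .* weights, 3), 0));
paperMap = imresize(rescale(paperMap), inputSize, 'Method', 'bicubic');
```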

Show the Grad-CAM levels on top of the image by using an 'AlphaData' value of 0.5. The 'jet' colormap has deep blue as the lowest value and deep red as the highest.

imshow(img);
hold on;
imagesc(gradcamMap,'AlphaData',0.5);
colormap jet
hold off;
title("Grad-CAM");

Clearly, the upper face and ear of the dog have the greatest impact on the classification.

For a different approach to investigating the reasons for deep network classifications, see occlusionSensitivity.

References

[1] Selvaraju, R. R., M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. "Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization." In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626. Available on the Computer Vision Foundation Open Access website.
