Revealing the differences between deep neural network models based on outcome explanation methods

Different machine learning models can predict the same class for an instance based on previously learned internal concepts. The goal of this work is to find and compare the concepts that models use to make their classification decisions, and to utilize these concepts to train new models.

One way to reveal concepts is to use the explanations a model gives for specific input instances (outcome explanations). These explanations can then be used to compare the concepts that different models apply to specific predictions. Such a comparison could also be used to train a new model so that it explains every instance in the same way as the reference model it was trained from; in other words, the saliency maps of both models are identical. Another task that comparing two models by their internally used concepts could solve is finding the concepts of one model that differ most from those of another model.
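As a minimal illustration of this kind of comparison, the sketch below computes gradient-based saliency maps for two toy linear classifiers and scores their agreement with cosine similarity. Everything here is a hypothetical stand-in: the linear "models", the random weights, and the `saliency` helper are assumptions for illustration only, not part of the proposed method, which targets deep neural networks.

```python
import numpy as np

def saliency(W, b, x):
    # For a linear model with scores = W @ x + b, the gradient of the
    # predicted-class score w.r.t. the input is exactly that class's
    # weight row -- a closed-form stand-in for a saliency map.
    scores = W @ x + b
    pred = int(np.argmax(scores))
    return pred, np.abs(W[pred])  # absolute gradient magnitudes

def cosine(a, b):
    # Cosine similarity between two (non-negative) saliency vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Two hypothetical models: the second is a slightly perturbed copy
# of the first, so their explanations should largely agree.
W1 = rng.normal(size=(3, 5))
W2 = W1 + 0.1 * rng.normal(size=(3, 5))
b = np.zeros(3)
x = rng.normal(size=5)

p1, s1 = saliency(W1, b, x)
p2, s2 = saliency(W2, b, x)
agreement = cosine(s1, s2)
print("predictions:", p1, p2, "saliency agreement:", round(agreement, 3))
```

For deep models, the same comparison would replace the closed-form gradient with saliency maps obtained via backpropagation (or any other outcome explanation method), while the agreement score could likewise be swapped for a different distance between maps.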