Opening the Black Box: Analyzing Attention Weights and Hidden States in Pre-trained Language Models for Non-language Tasks

Abstract

Investigating deep learning language models has long been a significant research area due to the "black box" nature of most advanced models. With the recent advancements in pre-trained language models based on transformers and their increasing integration into daily life, addressing this issue has become more pressing. To achieve an explainable AI model, it is essential to comprehend the procedural steps involved and compare them with human thought processes. Thus, in this paper, we use simple, well-understood non-language tasks to explore these models' inner workings. Specifically, we apply a pre-trained language model to constrained arithmetic problems with hierarchical structure and analyze its attention weight scores and hidden states. The investigation reveals promising results, with the model addressing hierarchical problems in a moderately structured manner, similar to human problem-solving strategies. Additionally, by inspecting the attention weights layer by layer, we uncover an unconventional finding: layer 10, rather than the model's final layer, is the optimal layer to unfreeze for the least parameter-intensive approach to fine-tuning the model. We support these findings with entropy analysis and token embedding similarity analysis. The attention analysis allows us to hypothesize that the model can generalize to longer sequences in the ListOps dataset, a conclusion later confirmed through testing on sequences longer than those in the training set. Lastly, by utilizing a straightforward task in which the model predicts the winner of a Tic Tac Toe game, we identify limitations in attention analysis, particularly its inability to capture 2D patterns.
