Egocentric activity recognition is one of the most challenging tasks in video analysis. It requires fine-grained discrimination of small objects and their manipulation. While some methods rely on strong supervision and attention mechanisms, they are either annotation-intensive or do not take spatio-temporal patterns into account. In this paper we propose LSTA, a mechanism that focuses on features from spatially relevant parts while tracking attention smoothly across the video sequence. We demonstrate the effectiveness of LSTA on egocentric activity recognition with an end-to-end trainable two-stream architecture, achieving state-of-the-art performance on four standard benchmarks.