Moments capture a huge part of our lives. Accurate recognition of these moments is challenging due to their diverse and complex interpretation. Action recognition refers to the task of classifying the action or activity present in a given video. In this work, we perform experiments on the Moments in Time dataset to accurately recognize activities occurring in 3-second clips. We use state-of-the-art techniques for visual, auditory, and spatio-temporal localization and develop a method to accurately classify the activities in the Moments in Time dataset. Our novel approach of using visual-based textual features and fusion techniques performs well, providing an overall Top-5 accuracy of 89.23% on the 20 classes, a significant improvement over the baseline TRN model.
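
For readers unfamiliar with the metric, the following is a minimal sketch of how a Top-5 accuracy over fused per-modality scores could be computed. The fusion rule is not specified here, so the weighted averaging of class probabilities shown below is an assumption for illustration only, and the names `late_fusion` and `top_k_accuracy` are hypothetical helpers, not part of the method described in this work.

```python
import numpy as np


def late_fusion(prob_list, weights=None):
    """Weighted average of per-modality class probabilities.

    Assumption: simple late fusion by averaging; the fusion used in
    this work may differ. Each array has shape (samples, classes).
    """
    probs = np.stack(prob_list, axis=0)  # (modalities, samples, classes)
    if weights is None:
        weights = np.ones(len(prob_list))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so weights sum to 1
    # Contract the modality axis: result has shape (samples, classes).
    return np.tensordot(weights, probs, axes=1)


def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k largest scores
    hits = (top_k == np.asarray(labels)[:, None]).any(axis=1)
    return float(hits.mean())
```

Under this definition, a prediction counts as correct whenever the ground-truth class appears anywhere among the five highest-scoring classes, which is why Top-5 accuracy is always at least as high as Top-1 accuracy.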