Video captioning is considered one of the most challenging problems in the field of computer vision. It involves combining different deep learning models to perform object detection, action detection, and localization by processing a sequence of image frames. To generate a meaningful description of the overall action event, it is crucial to consider the sequence of actions in a video. A reliable, accurate, and real-time video captioning method can be used in many applications. This paper, however, focuses on one application: video captioning for fostering and facilitating physical activities. In broad terms, the work can be considered assistive technology. Lack of physical activity appears to be increasingly widespread in many nations due to many factors, the most important being the convenience that technology has provided in workplaces. The resulting sedentary lifestyle is becoming a significant public health issue, so it is essential to incorporate more physical movement into our daily lives. Tracking one's daily physical activities would offer a baseline for comparison with activities performed on subsequent days. With the above in mind, this paper proposes a video captioning framework that aims to describe the activities in a video and estimate a person's daily physical activity level. This framework could potentially help people trace their daily movements to reduce the health risks of an inactive lifestyle. The work presented in this paper is still in its infancy, and only the initial steps of the application are outlined here. Based on our preliminary research, this project has great merit.