Recent achievements in language models have showcased their extraordinary capabilities in bridging visual information with semantic language understanding. This leads us to a novel question: can language models connect textual semantics with IoT sensory signals to perform recognition tasks, e.g., Human Activity Recognition (HAR)? If so, an intelligent HAR system with human-like cognition can be built, capable of adapting to new environments and unseen categories. This paper explores the feasibility of such a system with an innovative approach, IoT-sEnsors-language alignmEnt pre-Training (TENT), which jointly aligns textual embeddings with IoT sensor signals, including camera video, LiDAR, and mmWave. Through IoT-language contrastive learning, we derive a unified semantic feature space that aligns multi-modal features with language embeddings, so that the IoT data corresponds to the specific words that describe it. To enhance the connection between textual categories and their IoT data, we propose supplementary descriptions and learnable prompts that bring more semantic information into the joint feature space. TENT can not only recognize actions that have been seen but also ``guess'' an unseen action by finding the closest textual words in the feature space. We demonstrate that TENT achieves state-of-the-art performance on zero-shot HAR tasks using different modalities, improving on the best vision-language models by over 12%.
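
The alignment and zero-shot steps described above can be illustrated with a minimal sketch. It assumes a CLIP-style symmetric contrastive (InfoNCE) objective over paired sensor and text embeddings, which the abstract does not spell out; the function names, the temperature value of 0.07, and the use of plain NumPy arrays in place of real encoders are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss (assumed form): row i of sensor_emb is paired
    with row i of text_emb; all other rows in the batch act as negatives."""
    s = l2_normalize(sensor_emb)
    t = l2_normalize(text_emb)
    logits = s @ t.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(s))
    loss_s2t = -log_softmax(logits)[idx, idx].mean()    # sensor -> text
    loss_t2s = -log_softmax(logits.T)[idx, idx].mean()  # text -> sensor
    return 0.5 * (loss_s2t + loss_t2s)


def zero_shot_predict(sensor_emb, class_text_emb, class_names):
    """Assign each sensor embedding the class whose text embedding is closest
    in cosine similarity; the classes need not have been seen in training."""
    sims = l2_normalize(sensor_emb) @ l2_normalize(class_text_emb).T
    return [class_names[i] for i in sims.argmax(axis=1)]
```

In this sketch, the embeddings of each modality (video, LiDAR, or mmWave) would come from that modality's encoder, and the class text embeddings from a language encoder applied to the category names, optionally extended with the supplementary descriptions mentioned above.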