Abstract
Combining Transformer-based encoders with contrastive learning is the current mainstream paradigm for sentence representation learning. This paradigm typically relies on the hidden states of the encoder's last Transformer block. However, different blocks within a Transformer-based encoder exhibit varying degrees of semantic perception ability. From the perspective of interpretability, the semantic perception potential of knowledge neurons is modulated by stimuli, so principled cross-block representation fusion is a direction worth optimizing. To balance semantic redundancy against semantic loss in cross-block fusion, we propose S\textsuperscript{2}Sent, a sentence representation selection mechanism that attaches a parameterized nested selector downstream of the Transformer-based encoder. The selector performs spatial selection (SS) and nested frequency selection (FS) from a modular perspective. SS employs a self-gating mechanism based on spatial squeeze to obtain adaptive weights, which both fuses blocks with low information redundancy and captures the dependencies between embedding features. The nested FS replaces global average pooling (GAP) with different DCT basis functions to achieve spatial squeeze with low semantic loss. Extensive experiments demonstrate that S\textsuperscript{2}Sent achieves significant improvements over baseline methods with negligible additional parameters and inference latency, and that it integrates easily and scales well.
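As a rough illustration of the selector described above, the following NumPy sketch squeezes each block's hidden states along the token axis with DCT basis functions and then derives per-block, per-feature weights through a small gating bottleneck. It is not the authors' implementation. The tensor layout (blocks, tokens, features), the softmax gating over blocks, the bottleneck matrices `w1` and `w2`, and all function names are assumptions made for illustration.

```python
import numpy as np


def dct_basis(length, freq):
    """DCT-II basis function of frequency `freq` over `length` positions."""
    n = np.arange(length)
    return np.cos(np.pi * freq * (n + 0.5) / length)


def squeeze(hidden, freqs):
    """Squeeze the token axis of (T, D) hidden states into a D-vector.

    The feature dimensions are split into len(freqs) groups (an assumption),
    and each group is projected onto its own DCT basis function.
    With freqs=[0] the basis is constant, so this reduces to plain GAP.
    """
    T, D = hidden.shape
    groups = np.array_split(np.arange(D), len(freqs))
    out = np.empty(D)
    for idx, f in zip(groups, freqs):
        out[idx] = dct_basis(T, f) @ hidden[:, idx] / T
    return out


def select_and_fuse(blocks, w1, w2, freqs):
    """Fuse (L, T, D) block outputs with self-gated adaptive weights.

    w1: (D, r) and w2: (r, L * D) are hypothetical gating parameters.
    Returns the fused (T, D) states and the (L, D) selection weights.
    """
    L, T, D = blocks.shape
    z = np.stack([squeeze(b, freqs) for b in blocks])   # (L, D) descriptors
    s = np.maximum(z.sum(axis=0) @ w1, 0.0)             # ReLU bottleneck, (r,)
    logits = (s @ w2).reshape(L, D)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over blocks
    fused = (weights[:, None, :] * blocks).sum(axis=0)
    return fused, weights


# Example call with random stand-ins for encoder outputs and gating parameters.
rng = np.random.default_rng(0)
blocks = rng.normal(size=(3, 6, 8))      # 3 blocks, 6 tokens, 8 features
w1 = rng.normal(size=(8, 4))
w2 = rng.normal(size=(4, 3 * 8))
fused, weights = select_and_fuse(blocks, w1, w2, freqs=[0, 1])
```

Because the weights sum to one across blocks for every feature, the fused output is a convex combination of the block outputs. Passing `freqs=[0]` recovers GAP-based squeezing, which makes the frequency-0 case a convenient baseline for comparison.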