Visual Entailment with natural language explanations aims to infer the relationship between a text-image pair and generate a sentence to explain the decision-making process. Previous methods rely mainly on a pre-trained vision-language model to perform the relation inference and a language model to generate the corresponding explanation. However, pre-trained vision-language models mainly build token-level alignment between text and image, yet ignore the high-level semantic alignment between phrases (chunks) and visual contents, which is critical for vision-language reasoning. Moreover, an explanation generator based only on the encoded joint representation does not explicitly consider the critical decision-making points of relation inference. The generated explanations are therefore less faithful to the vision-language reasoning. To mitigate these problems, we propose a unified Chunk-aware Alignment and Lexical Constraint based method, dubbed CALeC. It contains a Chunk-aware Semantic Interactor (abbr. CSI), a relation inferrer, and a Lexical Constraint-aware Generator (abbr. LeCG). Specifically, CSI exploits the sentence structure inherent in language and various image regions to build chunk-aware semantic alignment. The relation inferrer uses an attention-based reasoning network to incorporate the token-level and chunk-level vision-language representations. LeCG utilizes lexical constraints to expressly incorporate the words or chunks focused on by the relation inferrer into explanation generation, improving the faithfulness and informativeness of the explanations. We conduct extensive experiments on three datasets, and the experimental results indicate that CALeC significantly outperforms competing models on inference accuracy and quality of generated explanations.
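The idea of turning the inferrer's focus into lexical constraints can be illustrated with a minimal sketch. It is not the CALeC implementation: the top-k selection rule, the function names (`select_constraints`, `satisfies_constraints`), and the toy attention scores are all assumptions made for illustration. The sketch picks the words with the highest attention weight as constraints and checks whether a candidate explanation contains them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_constraints(tokens, scores, k=2):
    """Hypothetical rule: keep the k tokens with the highest attention
    weight, returned in their original sentence order."""
    weights = softmax(scores)
    top = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

def satisfies_constraints(explanation, constraints):
    """Check that every constraint word appears in the explanation
    (case-insensitive, whitespace tokenization)."""
    words = set(explanation.lower().split())
    return all(c.lower() in words for c in constraints)

# Toy hypothesis tokens and made-up attention scores (assumed values).
tokens = ["a", "man", "is", "riding", "a", "horse"]
scores = [0.1, 1.5, 0.2, 2.0, 0.1, 2.5]

constraints = select_constraints(tokens, scores, k=2)
print(constraints)
print(satisfies_constraints("The man is riding a horse", constraints))
print(satisfies_constraints("A man stands near a car", constraints))
```

In a full system the constraints would be enforced during decoding rather than checked afterwards; the post-hoc check here only makes the notion of a satisfied lexical constraint concrete.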