Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distancing metric and performs a fast pruning among candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task.
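
The prune-then-re-rank control flow described above can be illustrated with a minimal sketch. It is not the model itself: the function names, the use of cosine similarity in the first stage, and the `score_fn` callback are assumptions made for illustration. In particular, `score_fn` is a hypothetical stand-in for the second-stage network that scores a reference-text-candidate triplet, and the embeddings are assumed to have been computed elsewhere.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def stage_one_prune(query_emb, corpus_embs, k):
    """Stage 1 (assumed form): keep the k corpus items closest to the query.

    query_emb:   (d,) text-modified reference embedding, computed elsewhere.
    corpus_embs: (n, d) pre-computed candidate embeddings.
    Returns the indices of the k most similar candidates, best first.
    """
    sims = l2_normalize(corpus_embs) @ l2_normalize(query_emb)
    k = min(k, sims.shape[0])
    # argpartition selects the top k without sorting the whole corpus.
    top = np.argpartition(-sims, k - 1)[:k]
    return top[np.argsort(-sims[top])]


def stage_two_rerank(candidate_ids, score_fn):
    """Stage 2 (assumed form): re-score only the shortlisted candidates.

    score_fn(candidate_id) is a hypothetical callback standing in for a
    network that jointly attends to the reference, text, and candidate.
    """
    scores = np.array([score_fn(int(i)) for i in candidate_ids])
    order = np.argsort(-scores, kind="stable")
    return candidate_ids[order]


def two_stage_retrieve(query_emb, corpus_embs, score_fn, k=50):
    """Cheap pruning over the full corpus, then expensive re-ranking of k items."""
    shortlist = stage_one_prune(query_emb, corpus_embs, k)
    return stage_two_rerank(shortlist, score_fn)
```

In this sketch the expensive scorer runs only k times per query rather than once per corpus item, which is the cost argument for the two-stage design; the shortlist size k=50 is an arbitrary placeholder.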