Abstract
Sentence encoders map sentences to real-valued vectors for use in downstream applications. To peek into these representations (e.g., to increase the interpretability of their results), probing tasks have been designed which query them for linguistic knowledge. However, designing probing tasks for lesser-resourced languages is tricky, because these languages often lack the large-scale annotated data or (high-quality) dependency parsers that are a prerequisite of probing task design in English. To determine how to probe sentence embeddings in such cases, we investigate the sensitivity of probing task results to structural design choices, conducting the first such large-scale study. We show that design choices like the size of the annotated probing dataset and the type of classifier used for evaluation do (sometimes substantially) influence probing outcomes. We then probe embeddings in a multilingual setup with design choices that lie in a 'stable region', as we identify for English, and find that results on English do not transfer to other languages. Fairer and more comprehensive sentence-level probing evaluation should thus be carried out on multiple languages in the future.