Abstract
Is it possible to use natural language to intervene in a model's behavior and alter its prediction in a desired way? We investigate the effectiveness of natural language interventions for reading-comprehension systems, studying this in the context of social stereotypes. Specifically, we propose a new language understanding task, Linguistic Ethical Interventions (LEI), where the goal is to amend a question-answering (QA) model's unethical behavior by communicating context-specific principles of ethics and equity to it. To this end, we build upon recent methods for quantifying a system's social stereotypes, augmenting them with different kinds of ethical interventions and the desired model behavior under such interventions. Our zero-shot evaluation finds that even today's powerful neural language models are extremely poor ethical-advice takers, that is, they respond surprisingly little to ethical interventions even though these interventions are stated as simple sentences. Few-shot learning improves model behavior but remains far from the desired outcome, especially when evaluated for various types of generalization. Our new task thus poses a novel language understanding challenge for the community.