Counterfactual Learning from Bandit Feedback under Deterministic Logging: A Case Study in Statistical Machine Translation

Abstract

The goal of counterfactual learning for statistical machine translation (SMT) is to optimize a target SMT system from logged data that consist of user feedback to translations that were predicted by another, historic SMT system. A challenge arises from the fact that risk-averse commercial SMT systems deterministically log the most probable translation. The lack of sufficient exploration of the SMT output space seemingly contradicts the theoretical requirements for counterfactual learning. We show that counterfactual learning from deterministic bandit logs is nevertheless possible by smoothing out deterministic components in learning. This can be achieved by additive and multiplicative control variates that avoid degenerate behavior in empirical risk minimization. Our simulation experiments show improvements of up to 2 BLEU points by counterfactual learning from deterministic bandit feedback.
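
To make the role of the control variates more concrete, the following is a minimal sketch, not the estimators as defined in the paper. It assumes that each logged example carries a scalar reward and that the target system assigns a probability to the logged translation. The function names, the toy numbers, and the constant baseline (used here as a simplified stand-in for an additive control variate) are all hypothetical. Because logging is deterministic, no propensity scores appear; the sketch only illustrates how a naive empirical objective can be inflated by raising all probabilities, and how normalization or a baseline counteracts this.

```python
import numpy as np


def plain_estimate(rewards, probs):
    """Naive objective: mean of reward times target-system probability.

    With non-negative rewards this value grows whenever any probability
    grows, regardless of how good the logged translation was.
    """
    return float(np.mean(rewards * probs))


def reweighted_estimate(rewards, probs):
    """Multiplicative control: normalize by the sum of probabilities.

    The result is a weighted average of the rewards, so uniformly
    scaling all probabilities no longer changes the objective.
    """
    return float(np.sum(rewards * probs) / np.sum(probs))


def baseline_estimate(rewards, probs, baseline):
    """Additive control (simplified): subtract a constant baseline.

    Examples with below-baseline reward receive a negative weight, so
    raising their probability lowers the objective.
    """
    return float(np.mean((rewards - baseline) * probs))


# Hypothetical toy log: three logged translations with rewards in [0, 1]
# and the probabilities the target system assigns to them.
rewards = np.array([0.2, 0.8, 0.5])
probs = np.array([0.1, 0.3, 0.2])

plain = plain_estimate(rewards, probs)
plain_scaled = plain_estimate(rewards, 2.0 * probs)
reweighted = reweighted_estimate(rewards, probs)
reweighted_scaled = reweighted_estimate(rewards, 2.0 * probs)
centered = baseline_estimate(rewards, probs, baseline=float(np.mean(rewards)))
```

In this toy setting, doubling every probability doubles the naive objective but leaves the reweighted objective unchanged, which is one way to read "avoiding degenerate behavior" in empirical risk minimization.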
