Authors: Luton Zou, Ziping Xu, Daiqi Gao, Susan Murphy
Abstract
It is well known that in reinforcement learning (RL) different reward functions may lead to the same optimal policy, yet some reward functions can be substantially easier to learn from. In this paper, we propose a framework for reward design that constructs surrogate rewards from mediators informed by causal directed acyclic graphs (DAGs), which are often available in real-world applications through domain knowledge. We show that, under the surrogacy assumption, the proposed reward is unbiased and has lower variance than the primary reward. We then introduce an online reward design agent that adaptively learns the target surrogate reward in an unknown environment. Feeding these surrogate rewards to standard online learning oracles, we show that the regret bound can be improved. Even without the surrogacy assumption, our framework provides a theoretical improvement when the total number of decision times is small relative to the surrogacy error. We complement the theoretical analysis with simulation studies showing that the proposed framework can lead to significant performance improvements.
Access preprint: http://people.seas.harvard.edu/~samurphy/papers/CausalDAGInformedRewardDesign.pdf
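A rough sketch of the idea described in the abstract, using illustrative notation that need not match the paper's exact construction: let S denote the state, A the action, M a mediator identified from the causal DAG, and R the primary reward, and suppose the surrogacy assumption holds in the form E[R | M, S, A] = E[R | M]. Then for the surrogate reward \tilde{R} = \mathbb{E}[R \mid M],

  % Unbiasedness, by the tower property together with surrogacy:
  \mathbb{E}[\tilde{R} \mid S, A]
    = \mathbb{E}\bigl[\mathbb{E}[R \mid M, S, A] \mid S, A\bigr]
    = \mathbb{E}[R \mid S, A],

  % Variance reduction, by the law of total variance:
  \operatorname{Var}(\tilde{R} \mid S, A)
    = \operatorname{Var}\bigl(\mathbb{E}[R \mid M, S, A] \mid S, A\bigr)
    \;\le\; \operatorname{Var}(R \mid S, A).

Under these (assumed) conditions, an online learner fed \tilde{R} targets the same optimal policy as one fed R, but with less reward noise, which is the mechanism behind the improved regret bound mentioned above.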