Iterative Policy Learning for Constrained RL via Dissipative Gradient Descent-Ascent
In constrained reinforcement learning (C-RL), an agent aims to learn a policy that maximizes an expected cumulative reward (return) subject to additional requirements on secondary objectives. Recently, several algorithms—primarily rooted in stochastic gradient descent-ascent (SGDA) methods—have been proposed to solve this problem. While such algorithms can provide rigorous regret bounds, the policy iterates they compute fail to converge to the optimal policy. Although this behavior is often (informally) attributed to a fundamental limitation of constrained reinforcement learning, in this talk we show that it instead stems from the inability of SGDA algorithms to guarantee convergence for non-strongly monotone problems. To address this issue, we propose a novel dissipative (stochastic) gradient descent-ascent algorithm (D-SGDA) for min-max problems that guarantees convergence under mild assumptions on the convexity-concavity of the saddle function. We then show how to apply our D-SGDA algorithm to find optimal policies for tabular C-RL problems.
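As a minimal illustration of the non-convergence issue described above (and not of the authors' D-SGDA method, whose details are not given here), the sketch below runs plain deterministic gradient descent-ascent on the bilinear saddle function f(x, y) = x·y, a convex-concave but non-strongly-monotone problem; the names `gda`, `step`, and `iters` are illustrative choices. The last iterates spiral away from the saddle point (0, 0), which is the failure mode that motivates the dissipative modification.

```python
# Minimal sketch: plain gradient descent-ascent on f(x, y) = x * y.
# This is convex-concave but NOT strongly monotone, and the last iterates
# of GDA spiral outward instead of converging to the saddle point (0, 0).
import numpy as np

def gda(x0, y0, step=0.1, iters=200):
    """Run plain (simultaneous) GDA on f(x, y) = x * y and record distances to (0, 0)."""
    x, y = x0, y0
    dists = []
    for _ in range(iters):
        gx = y  # df/dx = y
        gy = x  # df/dy = x
        # simultaneous update: descent in x, ascent in y
        x, y = x - step * gx, y + step * gy
        dists.append(np.hypot(x, y))
    return np.array(dists)

dists = gda(1.0, 1.0)
print(f"distance to saddle after   1 step : {dists[0]:.3f}")
print(f"distance to saddle after 200 steps: {dists[-1]:.3f}")  # grows, i.e. diverges
```

Each simultaneous GDA step multiplies the distance to the saddle point by a factor of sqrt(1 + step^2), so the last iterates diverge geometrically even though the problem is convex-concave; this is the non-strongly-monotone failure mode the abstract refers to.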