
(543f) Offset-Free Deep Deterministic Policy Gradient with Lyapunov Learning Penalties

Authors 

Hedrick, E. - Presenter, West Virginia University
Hedrick, K., West Virginia University
Bhattacharyya, D., West Virginia University
Zitney, S., National Energy Technology Laboratory
Omell, B. P., National Energy Technology Laboratory
Recent advances in deep learning have expanded the range of applications and the efficacy of actor-critic reinforcement learning (RL) algorithms [1]–[3]. However, significant challenges remain in applying these approaches to automatic process control, most notably sample inefficiency and the lack of performance guarantees. While deep networks can approximate generic functions very well, the large number of parameters in these networks (especially where network architectures are relatively limited) requires many samples to achieve satisfactory performance. Further, the structure of the actor network can be limiting in that, while “good” performance may be achieved, neither stability nor elimination of offset is guaranteed. This work proposes methods to address both issues.

To address the problem of offset-free control, a two-policy approach is proposed in which it is assumed that, close to the origin, a linear state-feedback controller exists that will drive the states to zero. Farther from the origin, it is assumed that a fully parameterized control policy (i.e., a neural network generating input moves) will drive the states near enough to the origin that offset can then be eliminated. To retain the model-free nature of the approach, it is assumed only that such a feedback policy exists; its gains are unknown and are learned by a second RL agent. This approach is applied to linear and nonlinear examples, where learning is carried out in episodes starting from random states. The learning method is deep deterministic policy gradient (DDPG), in which deep networks approximate both the action-value function and the optimal policy [4]; two such sets of networks are used in the proposed approach. After learning, it is shown that the proposed approach eliminates offset in both systems.
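As an illustration of how the two policies could be combined in a single actor, the following Python sketch blends a learned linear state-feedback gain with a neural-network policy according to the distance of the state from the origin. The switching radius EPS, the network sizes, and the class name TwoPolicyActor are illustrative assumptions and are not specified in the abstract.

```python
# Minimal sketch of a two-policy actor for offset-free control.
# Assumptions (not from the abstract): a switching radius EPS on the
# state norm and a tanh MLP for the outer (nonlinear) policy.
import torch
import torch.nn as nn

EPS = 0.1  # assumed switching radius around the origin


class TwoPolicyActor(nn.Module):
    def __init__(self, n_states, n_inputs, hidden=64):
        super().__init__()
        # Outer policy: fully parameterized network, intended to drive
        # the state into a neighborhood of the origin.
        self.outer = nn.Sequential(
            nn.Linear(n_states, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_inputs), nn.Tanh(),
        )
        # Inner policy: linear state feedback u = -K x, with K treated
        # as a learnable parameter (learned by a second RL agent in the
        # proposed approach rather than computed from a model).
        self.K = nn.Parameter(torch.zeros(n_inputs, n_states))

    def forward(self, x):
        inner = -(x @ self.K.T)           # linear feedback near the origin
        outer = self.outer(x)             # nonlinear policy elsewhere
        near = (x.norm(dim=-1, keepdim=True) < EPS).float()
        return near * inner + (1.0 - near) * outer


# Example: evaluate the combined policy on a batch of random states.
actor = TwoPolicyActor(n_states=4, n_inputs=2)
u = actor(torch.randn(8, 4))
print(u.shape)  # torch.Size([8, 2])
```

In the proposed approach the gain matrix and the outer network would each be updated by its own DDPG agent against the learned action-value function; the module above only illustrates how the two policies can be evaluated together.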

To address the problem of sample efficiency, relevant work exists in the inverse (I-)RL and apprenticeship learning literature [5], [6]. However, most work in this area also aims to generate parameterized reward functions, which introduces significant complications. This level of complexity may not be necessary where well-posed value functions (e.g., quadratic penalties) are already defined. Further, it is assumed for the purposes of this work that an appropriate controller for the plant, whether PID or another simple controller, exists or can easily be generated. In this way, the value function can be trained on the reward profile of the existing controller, while the policy, rather than being trained to maximize reward, can be trained to approximate the current control policy. These networks can then be used for initialization when learning is initiated on the true plant, potentially allowing for less exploration, faster convergence, or some combination of the two. Results are presented for the application of this approach to several energy and chemical systems and compared with naïve initialization of the same network structures.
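A minimal sketch of this pretraining step is given below, assuming closed-loop data (states, actions, and discounted quadratic returns) have already been collected under the existing PID or other simple controller. The function name pretrain, the optimizer, and the training details are illustrative assumptions, not the authors' implementation.

```python
# Sketch of pretraining DDPG networks from closed-loop data collected
# under an existing simple controller, before learning on the true plant.
import torch
import torch.nn as nn


def pretrain(actor, critic, states, actions, returns, epochs=200, lr=1e-3):
    """states: (N, n_x) and actions: (N, n_u) logged under the existing
    controller; returns: (N, 1) discounted quadratic rewards observed
    along the closed-loop trajectories."""
    opt_a = torch.optim.Adam(actor.parameters(), lr=lr)
    opt_c = torch.optim.Adam(critic.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        # Policy: imitate the existing controller (behavior cloning)
        # rather than maximizing the critic at this stage.
        opt_a.zero_grad()
        mse(actor(states), actions).backward()
        opt_a.step()
        # Value function: regress Q(s, a) onto the reward profile
        # generated by the existing controller.
        opt_c.zero_grad()
        mse(critic(torch.cat([states, actions], dim=-1)), returns).backward()
        opt_c.step()
    return actor, critic


# Example usage with simple MLPs and synthetic data standing in for
# logged PID closed-loop trajectories.
n_x, n_u, N = 4, 2, 256
actor = nn.Sequential(nn.Linear(n_x, 64), nn.ReLU(),
                      nn.Linear(64, n_u), nn.Tanh())
critic = nn.Sequential(nn.Linear(n_x + n_u, 64), nn.ReLU(),
                       nn.Linear(64, 1))
states, actions = torch.randn(N, n_x), torch.randn(N, n_u)
returns = -states.pow(2).sum(dim=-1, keepdim=True)  # stand-in quadratic penalty
pretrain(actor, critic, states, actions, returns)
```

The pretrained actor and critic would then serve as the initial networks for DDPG learning on the true plant, in place of random initialization.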

Bibliography

[1] H. Yoo, B. Kim, J. W. Kim, and J. H. Lee, “Reinforcement learning based optimal control of batch processes using Monte-Carlo deep deterministic policy gradient with phase segmentation,” Comput. Chem. Eng., vol. 144, Jan. 2021, doi: 10.1016/j.compchemeng.2020.107133.

[2] M. Zanon and S. Gros, “Safe Reinforcement Learning Using Robust MPC,” IEEE Trans. Automat. Contr., vol. 66, no. 8, pp. 3638–3652, Aug. 2021, doi: 10.1109/TAC.2020.3024161.

[3] D. Silver et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, Oct. 2017, doi: 10.1038/nature24270.

[4] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” arXiv, Sep. 2015.

[5] P. Abbeel, “Inverse Reinforcement Learning,” SpringerReference, 2012, doi: 10.1007/springerreference_179129.

[6] M. Mowbray, R. Smith, E. A. Del Rio-Chanona, and D. Zhang, “Using process data to generate an optimal control policy via apprenticeship and reinforcement learning,” AIChE J., pp. 1–15, 2021, doi: 10.1002/aic.17306.