(457c) Continuous Learning of the Value Function Utilizing Deep Reinforcement Learning to be Used As the Objective in Model Predictive Control

Authors 

Hedrick, E., West Virginia University
Bhattacharyya, D., West Virginia University
Reinforcement learning (RL), a machine learning technique, and model predictive control (MPC) possess an inherent synergy in the manner in which they function. There are many examples in open literature of the integration of RL and MPC. The most common methods evaluate high-level or meta RL in conjunction with MPC as opposed to implementation at the level of control execution [1]. It is often the case that for this high-level implementation, the focus is on tuning of internal weights and hyperparameters of the controller structure itself [2], [3]. RL can also be utilized to modify the objective function of the MPC. In this way, an approximation of an infinite horizon MPC may be derived with the value function (a linear approximation as opposed to an ANN) of the RL algorithm acting as a terminal cost [4], [5]. This approach can allow for a multi-step return to be used in the update procedure for the RL algorithm, taking advantage of the similar structure of the MPC prediction horizon and the return of the RL’s value function. This generally points to a problem (even in simple cases) in RL sample inefficiency, which is discussed widely in the literature [6]

The work presented here investigates implementing RL within existing MPC frameworks. The selection of MPC for combination with RL is not arbitrary: two specific aspects of MPC are advantageous for such a combination, the use of a value function and the use of a model. Direct use of a value function as the MPC objective (VF-MPC) maintains the structure of the MPC policy in terms of constraint formulation, but it may limit the choice of optimization solver, depending on the form of the value function approximation. The use of a model in MPC is also useful in that solving for the optimal trajectory provides a projected view of the expected reward. While this projection can be inaccurate under the current value function, it can accelerate learning.
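
The sketch below shows, under the same illustrative assumptions, how the model-predicted trajectory could supply a bootstrapped multi-step target for learning the value function. The predicted rewards and terminal state would come from the MPC solution; the helper name n_step_target is hypothetical.

    def n_step_target(pred_rewards, x_terminal, value_fn, gamma=0.99):
        """Bootstrapped n-step return built from the MPC's predicted trajectory.

        pred_rewards : predicted rewards r_0, ..., r_{n-1} along the optimized
                       trajectory from the controller model
        x_terminal   : predicted state after n steps
        value_fn     : current value estimate used to bootstrap the tail
        """
        g = sum((gamma ** i) * r for i, r in enumerate(pred_rewards))
        return g + (gamma ** len(pred_rewards)) * value_fn(x_terminal)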

Current RL algorithms for continuous systems generally use neural networks (NNs) as the basis function. However, NN value functions present potential issues for the second-order optimization methods used in the VF-MPC algorithm. Most continuous RL algorithms circumvent this issue by adopting an actor-critic structure, such as DDPG [7] or TD3 [8]. This work develops an approach for adapting the actor-critic structure to the value function while maintaining the advantages of MPC. In addition, a common shortcoming of most RL algorithms is their sensitivity to their own learning parameters. For the VF-MPC algorithm, the most significant parameter is the length of the learned projection. This work develops an approach for meta-learning the optimal value of this parameter for a given case study while also deriving an effective value function. The algorithm is evaluated on the classic double integrator and on an industrial selective catalytic reduction (SCR) unit, a time-varying, time-delay system for which traditional MPC performs poorly [2].
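
For context, a simplified TD-style update of a neural value function with a slowly tracking target network, in the spirit of DDPG/TD3, is sketched below in PyTorch. It is illustrative only and does not reproduce the actor-critic adaptation or the meta-learning of the projection length developed in this work.

    import torch
    import torch.nn.functional as F

    def critic_update(value_net, target_net, optimizer, x, target, tau=0.005):
        """One gradient step on the value network toward a bootstrapped target,
        followed by a soft (Polyak) update of the target network."""
        loss = F.mse_loss(value_net(x), target.detach())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            for p, p_targ in zip(value_net.parameters(), target_net.parameters()):
                # Slowly track the online network, as in DDPG/TD3-style methods.
                p_targ.mul_(1.0 - tau).add_(tau * p)
        return loss.item()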

In line with these points, the main contributions of this work are:

  • The proposal of a combination of RL and MPC in which the learned value function is used as the cost function of the MPC. This formulation yields an MPC that is optimal with respect to the reward function.
  • The proposal of using the optimized trajectory from the MPC to accelerate learning, along with an analysis of how the search depth along the trajectory affects the rate of learning.
  • The proposal of two algorithms employing these concepts: VFMPC(0), which uses the one-step return to learn the cost function, and VFMPC(n), which learns on the n-step return along the optimal trajectory, subject to the dynamics of the controller model (a simplified sketch of this distinction follows the list).
  • Demonstration of these algorithms and their performance on two process control examples.
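
A simplified sketch of one receding-horizon iteration, distinguishing a one-step target (in the spirit of VFMPC(0)) from an n-step target built from the model-predicted trajectory (in the spirit of VFMPC(n)), is given below. The interfaces solve_mpc, plant, value_fn, and value_update are hypothetical placeholders, the sign convention between cost and reward is glossed over, and the sketch does not reproduce the exact algorithms proposed here.

    def vfmpc_step(x, solve_mpc, plant, value_fn, value_update, n=1, gamma=0.99):
        """One control/learning iteration of a VF-MPC-style loop (illustrative).

        solve_mpc(x) -> (u_seq, x_pred, r_pred): optimized inputs plus the
                        states and rewards predicted along the horizon
        plant(x, u)  -> (x_next, r): apply the first input to the true system
        value_fn(x)  -> current value estimate
        value_update(x, target): fit the value estimate at x toward the target
        """
        u_seq, x_pred, r_pred = solve_mpc(x)   # MPC with the value function as objective
        x_next, r = plant(x, u_seq[0])         # receding-horizon implementation

        if n <= 1:
            # One-step return from the measured transition.
            target = r + gamma * value_fn(x_next)
        else:
            # n-step return along the model-predicted trajectory, bootstrapped
            # with the current value estimate at the predicted state x_pred[n-1].
            target = sum((gamma ** i) * ri for i, ri in enumerate(r_pred[:n]))
            target += (gamma ** n) * value_fn(x_pred[n - 1])
        value_update(x, target)
        return x_next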


[1] O. Dogru et al., “Reinforcement Learning in Process Industries: Review and Perspective,” IEEE/CAA J. Autom. Sin., vol. 11, no. 2, pp. 1–19, 2024, doi: 10.1109/JAS.2024.124227.

[2] E. Hedrick, K. Hedrick, D. Bhattacharyya, S. E. Zitney, and B. Omell, “Reinforcement learning for online adaptation of model predictive controllers: Application to a selective catalytic reduction unit,” Comput. Chem. Eng., vol. 160, p. 107727, 2022, doi: 10.1016/j.compchemeng.2022.107727.

[3] S. Gros and M. Zanon, “Data-driven economic NMPC using reinforcement learning,” IEEE Trans. Automat. Contr., vol. 65, no. 2, pp. 636–648, Feb. 2020, doi: 10.1109/TAC.2019.2913768.

[4] Y. Yang and S. Lucia, “Multi-step greedy reinforcement learning based on model predictive control,” IFAC-PapersOnLine, vol. 54, no. 3, pp. 699–705, 2021, doi: 10.1016/j.ifacol.2021.08.323.

[5] X. Pan, X. Chen, Q. Zhang, and N. Li, “Model Predictive Control: A Reinforcement Learning-based Approach,” J. Phys. Conf. Ser., vol. 2203, no. 1, p. 012058, 2022, doi: 10.1088/1742-6596/2203/1/012058.

[6] R. Nian, J. Liu, and B. Huang, “A review on reinforcement learning: Introduction and applications in industrial process control,” Comput. Chem. Eng., vol. 139, p. 106886, Aug. 2020, doi: 10.1016/j.compchemeng.2020.106886.

[7] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” in 4th Int. Conf. Learn. Represent. (ICLR 2016), 2016. [Online]. Available: http://arxiv.org/abs/1509.02971

[8] S. Fujimoto, H. Van Hoof, and D. Meger, “Addressing Function Approximation Error in Actor-Critic Methods,” in 35th International Conference on Machine Learning, ICML 2018, Feb. 2018, pp. 2587–2601. [Online]. Available: http://arxiv.org/abs/1802.09477