(108b) Fast-Convergence of Deep Reinforcement Learning Controller: Application to a Continuous Stirred Tank Reactor

Authors 

Bangi, M. S. F. - Presenter, Texas A&M University
Kwon, J., Texas A&M University
Reinforcement Learning (RL) originated several decades ago in computer science and operations research to solve complex sequential decision-making problems, but its application to process control has been recent and limited. RL involves an agent that learns the optimal policy by interacting with the environment in real time [1]. Many approaches have been proposed to solve the RL problem, but recent advancements in deep learning have made it possible to combine deep neural networks (DNNs) with RL. This combination has already delivered tremendous success in video games such as the Atari suite [2]. More recently, in the context of process control, a Deep RL (DRL) controller, a model-free, off-policy actor-critic algorithm, was proposed based on temporal difference (TD) learning [1] and the Deterministic Policy Gradient (DPG) algorithm [3] for controlling discrete-time nonlinear processes [4]. The DRL controller utilizes two DNNs to generalize the actor and the critic to continuous state and action spaces, and two more as target networks for their learning. Ideas such as replay memory and gradient clipping were incorporated into the DRL controller to make learning suitable for process control applications. The DRL controller was able to solve set-point tracking problems for single-input-single-output (SISO) and multi-input-multi-output (MIMO) systems, as well as a nonlinear system with external disturbances [4].
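
The actor-critic structure described above can be illustrated with a minimal PyTorch-style sketch: two networks (actor and critic), a target copy of each, a replay memory, and gradient clipping. The layer sizes, state/action dimensions, and class names below are illustrative assumptions, not the implementation of [4].

```python
import copy
from collections import deque

import torch
import torch.nn as nn


class Actor(nn.Module):
    """Policy network: maps a continuous state to a continuous control action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)


class Critic(nn.Module):
    """Action-value network: approximates Q(s, a) for continuous s and a."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


# Four networks in total: actor, critic, and a target copy of each.
actor, critic = Actor(state_dim=4, action_dim=2), Critic(state_dim=4, action_dim=2)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)

# Replay memory of (state, action, reward, next_state) transitions.
replay_memory = deque(maxlen=100_000)

# Gradient clipping would be applied before each optimizer step, e.g.
# torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=1.0)
```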

Despite its success, the DRL controller has a few limitations, including the need for a large amount of data, high computational loads, and careful selection and initialization of hyperparameters for fast convergence. Additionally, one glaring limitation of the DRL controller, as with many other RL methods, is the long training time required before it can deliver satisfactory control performance [5]. To overcome this challenge, we propose to train the actor and the critic offline using historical process data before deploying them for online control. For the actor network, which approximates the policy function, we use past states and control actions to train it offline until convergence within the training region is achieved. For the critic network, which approximates the action-value function, we use the reward gained, calculated from a pre-defined reward function, to train it offline until convergence is achieved. Once trained offline, the learned actor-critic pair serves as the starting point for the DRL controller. This pre-trained DRL controller is implemented to track concentration and temperature set-points for a continuous stirred tank reactor (CSTR) process, and we demonstrate its ability to adapt and learn to track set-points outside the training region faster than a randomly initialized DRL controller. We also compare the control performance of the pre-trained DRL controller against a model predictive controller in tracking a set-point.
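
The offline pre-training step can be sketched as follows, assuming the Actor and Critic classes from the previous sketch and a historical data set of (state, action, next state) tensors. The quadratic set-point tracking reward and the function names are assumptions for illustration; the abstract only states that a pre-defined reward function is used.

```python
import torch
import torch.nn.functional as F


def tracking_reward(next_state, setpoint):
    """Illustrative reward: negative squared deviation from the set-point."""
    return -((next_state - setpoint) ** 2).sum(dim=-1, keepdim=True)


def pretrain_offline(actor, critic, states, actions, next_states, setpoint,
                     gamma=0.99, epochs=200, lr=1e-3):
    """Fit the actor to historical control actions and the critic to a
    one-step TD target built from the pre-defined reward function."""
    actor_opt = torch.optim.Adam(actor.parameters(), lr=lr)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=lr)
    for _ in range(epochs):
        # Actor: supervised regression onto the recorded control actions.
        actor_loss = F.mse_loss(actor(states), actions)
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Critic: regress Q(s, a) onto r + gamma * Q(s', pi(s')).
        rewards = tracking_reward(next_states, setpoint)
        with torch.no_grad():
            td_target = rewards + gamma * critic(next_states, actor(next_states))
        critic_loss = F.mse_loss(critic(states, actions), td_target)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()
    return actor, critic  # used to initialize the online DRL controller
```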

Literature cited:

[1] Sutton, R.S., Barto, A.G. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998.

[2] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv preprint, arXiv:1312.5602, 2013.

[3] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.

[4] Spielberg, S., Tulsyan, A., Lawrence, N.P., Loewen, P.D., Bhushan Gopaluni, R. Toward self‐driving processes: A deep reinforcement learning approach to control. AIChE J, 65(10), 2019.

[5] Shin, J., Badgwell, T.A., Liu, K.H., Lee, J.H. Reinforcement Learning – Overview of recent progress and implications for process control. Computer Aided Chemical Engineering, 44:71-85, 2018.