(207b) Entropy-Maximizing TD3-Based Reinforcement Learning for Controlling and Optimizing Complex Dynamical Systems

Authors 

Lu, Q. - Presenter, Texas Tech University
Chowdhury, M. A., Texas Tech University
Deep reinforcement learning (DRL) has become increasingly popular for the automatic control and optimization of complex dynamical systems in recent years [1, 2]. Model-free DRL is a powerful data-driven approach that can learn from experience without explicit knowledge of the environment, handle large and continuous state and action spaces, adapt to changing environments, and learn optimal policies. Among DRL algorithms, actor-critic methods offer low variance in the Q-value estimate, high stability, and fast convergence [3]. However, stochastic actor-critic methods, such as proximal policy optimization, often suffer from poor sample efficiency because the policy gradient requires integration over large state and action spaces [4-6]. In contrast, deterministic actor-critic methods, such as the twin-delayed deep deterministic policy gradient (TD3) algorithm, are sample-efficient but prone to under-exploration because the policy carries no uncertainty. This inherent under-exploration of TD3 and other deterministic methods can lead to sub-optimal solutions.

In this work, we present an entropy-maximizing TD3 method (EMTD3) to address the challenges associated with the stochastic and deterministic actor-critic methods discussed above [7]. In the proposed method, a stochastic actor with an entropy-maximizing term in its objective function is deployed at the beginning of training to ensure sufficient exploration. This entropy-maximizing term injects uncertainty into the policy and drives systematic exploration of the action space, leading to better learning performance than purely deterministic methods. Afterward, a deterministic actor is employed to focus on local exploitation and discover the optimal solution. The proposed method thus combines the exploration advantages of stochastic actor-critic methods with the fast convergence of deterministic methods. As a result, the proposed EMTD3 method can outperform existing TD3 and other DRL approaches in terms of sample efficiency and speed of convergence to the global optimum in continuous state-action spaces.
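For reference, the entropy-regularized objective optimized by the initial stochastic actor takes the standard maximum-entropy form used, for example, in soft actor-critic [6]; the exact formulation adopted in EMTD3 is detailed in [7]:

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],

where \mathcal{H} denotes the policy entropy and the temperature parameter \alpha trades off exploration against reward; setting \alpha = 0 recovers the ordinary expected-return objective pursued by the deterministic actor in the second phase.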

Finally, the effectiveness of the proposed EMTD3 method is verified through two case studies. In the first case study, the proposed method is employed to tune a proportional-integral-derivative (PID) controller regulating the temperature of a nonlinear continuous stirred tank reactor (CSTR). Simulation results show that our approach improves sample efficiency by almost 45% compared with other DRL methods (e.g., TD3 and the deep deterministic policy gradient, DDPG) in discovering the global solution. In the second case study, the proposed EMTD3 method is applied to design superior fast-charging protocols for lithium-ion batteries. Results show that, with our method, the optimal charging strategy can be discovered in far fewer episodes than with other DRL methods.
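To make the first case study concrete, the sketch below illustrates one plausible way to frame PID tuning as an RL environment, with the agent's action interpreted as the gain vector (Kp, Ki, Kd) and the reward defined as the negative integral absolute tracking error. This is a hypothetical illustration with a toy first-order plant, not the authors' implementation or the nonlinear CSTR model used in [7].

# Hypothetical sketch (not the authors' code): mapping an RL agent's action to
# PID gains for temperature regulation. All names, dynamics, and reward shaping
# below are illustrative assumptions.
import numpy as np

class PIDTuningEnv:
    """Each episode, the agent proposes (Kp, Ki, Kd); the reward is the negative
    integral absolute tracking error over a short closed-loop simulation."""

    def __init__(self, setpoint=350.0, dt=0.1, horizon=200):
        self.setpoint = setpoint      # target reactor temperature [K]
        self.dt = dt                  # integration step [s]
        self.horizon = horizon        # simulation steps per episode

    def _plant(self, T, u):
        # Toy first-order surrogate for the temperature dynamics;
        # the actual case study uses a nonlinear CSTR model.
        return T + self.dt * (-0.5 * (T - 300.0) + 2.0 * u)

    def evaluate(self, action):
        Kp, Ki, Kd = np.clip(action, 0.0, 10.0)   # action = PID gains
        T, integral, prev_err, cost = 300.0, 0.0, 0.0, 0.0
        for _ in range(self.horizon):
            err = self.setpoint - T
            integral += err * self.dt
            deriv = (err - prev_err) / self.dt
            u = Kp * err + Ki * integral + Kd * deriv   # PID control law
            T = self._plant(T, u)
            prev_err = err
            cost += abs(err) * self.dt
        return -cost   # reward: negative integral absolute error (IAE)

if __name__ == "__main__":
    env = PIDTuningEnv()
    print(env.evaluate(np.array([2.0, 0.5, 0.1])))   # reward for one candidate gain set

In such a setup, the RL agent (EMTD3, TD3, DDPG, etc.) would repeatedly query evaluate() with candidate gain vectors and learn which gains maximize the reward, so sample efficiency directly translates into fewer closed-loop simulations.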

References:

[1] Spielberg, S., Tulsyan, A., Lawrence, N. P., Loewen, P., and Gopaluni, B., "Deep reinforcement learning for process control: A primer for beginners," arXiv preprint arXiv:2004.05490, 2020.
[2] Paternina-Arboleda, C. D., Montoya-Torres, J. R., and Fabregas-Ariza, A., "Simulation-optimization using a reinforcement learning approach," in 2008 Winter Simulation Conference, IEEE, 2008.
[3] Grondman, I., et al., "A survey of actor-critic reinforcement learning: Standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1291-1307, 2012.
[4] Fujimoto, S., van Hoof, H., and Meger, D., "Addressing function approximation error in actor-critic methods," in International Conference on Machine Learning, PMLR, 2018.
[5] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M., "Deterministic policy gradient algorithms," in International Conference on Machine Learning, 2014.
[6] Haarnoja, T., et al., "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in International Conference on Machine Learning, PMLR, 2018.
[7] Chowdhury, M. A. and Lu, Q., "A novel entropy-maximizing TD3-based reinforcement learning for automatic PID tuning," arXiv preprint arXiv:2210.02381, 2022.