(625b) Distributional Reinforcement Learning for Chemical Production Scheduling and Supply Chain Optimization

Reinforcement Learning (RL) has received significant attention from the process systems engineering community in the context of sequential decision making under uncertainty. A major benefit of RL methods is that they can identify approximately optimal policies for Markov decision processes (MDPs) without requiring exact expressions of the uncertain system dynamics, which are typically unavailable for industrial process systems. Instead, RL methods learn function approximations of decision policies via generalized policy iteration algorithms and rely on offline simulation of an approximate process model. Once identified offline, RL policies can be transferred to optimize the real system online.
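For illustration only, the minimal Python sketch below shows the general pattern described above: a parameterized policy is improved offline by repeatedly simulating an approximate, uncertain process model, and the identified parameters are then available for online deployment. The toy environment, linear policy, and random-search improvement step are hypothetical placeholders, not the formulation used in this work.

```python
# Illustrative sketch only: offline policy identification on an approximate
# process model, followed by transfer of the trained policy to online use.
# The environment, policy parameterization, and search routine are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def simulate_episode(theta, horizon=25):
    """Roll out a linear decision policy on a toy stochastic process model."""
    state = rng.normal(size=3)                      # uncertain initial condition
    total_reward = 0.0
    for _ in range(horizon):
        action = np.clip(theta @ state, -1.0, 1.0)  # parameterized policy pi_theta(s)
        noise = rng.normal(scale=0.1, size=3)       # uncertain dynamics
        state = 0.9 * state - 0.1 * action + noise
        total_reward += -np.sum(state**2)           # stage cost as negative reward
    return total_reward

# Offline policy improvement via simple random search over policy parameters.
theta_best = rng.normal(size=3)
best_value = np.mean([simulate_episode(theta_best) for _ in range(20)])
for _ in range(200):
    candidate = theta_best + 0.1 * rng.normal(size=3)
    value = np.mean([simulate_episode(candidate) for _ in range(20)])
    if value > best_value:
        theta_best, best_value = candidate, value

# theta_best would then be applied to the real system online.
print("estimated expected return of identified policy:", best_value)
```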

Recent works have investigated the application of RL to identify optimal production scheduling and supply chain management decisions under uncertainty. Here, we explore the development of a zero-order RL methodology for scheduling in uncertain online production environments and for supply chain optimization. We consider common scheduling constraints, including precedence and disjunctive constraints, which are not naturally accounted for within the MDP framework. Further, the framework enables the optimization of risk-sensitive measures, such as the conditional value-at-risk (CVaR), which are essential to consider in industrial practice. The strategy is investigated in a parallel, sequential production scheduling environment and in a multi-echelon supply chain inventory management problem.
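For reference, one common definition of CVaR at level alpha for a policy's (random) return R is the expected return over the worst alpha-fraction of outcomes; this standard form is stated here as an assumption, since the abstract does not fix a specific formulation:

```latex
% Value-at-risk and conditional value-at-risk of the return R at level \alpha \in (0,1],
% under the reward-maximization (lower-tail) convention.
\mathrm{VaR}_{\alpha}(R)  = \inf \{\, r \in \mathbb{R} : \Pr(R \le r) \ge \alpha \,\}, \qquad
\mathrm{CVaR}_{\alpha}(R) = \mathbb{E}\!\left[\, R \;\middle|\; R \le \mathrm{VaR}_{\alpha}(R) \,\right].
```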

In the scheduling problem, both batch processing time and due date uncertainties are considered. Objective performance is benchmarked against an online mixed-integer linear programming (MILP) formulation. The RL approach is competitive with the MILP in expected performance, but demonstrates improved worst-case performance when risk-sensitive measures are incentivized. Across the problem instances investigated, an average improvement of 2.5% is observed in the CVaR of the identified policy relative to the MILP. In addition, RL-based constraint satisfaction is assessed probabilistically via a Monte Carlo method, demonstrating that constraints are satisfied with high probability (i.e., greater than or equal to 0.95) when the policy is applied to the system online.
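As an illustrative sketch of such a Monte Carlo assessment (not the exact procedure used here), the satisfaction probability can be estimated by sampling realizations of the uncertainty and recording the fraction of feasible outcomes; the scenario sampler and feasibility check below are hypothetical placeholders.

```python
# Illustrative sketch only: Monte Carlo estimate of constraint satisfaction
# probability for a fixed policy. 'sample_scenario' and 'schedule_is_feasible'
# are hypothetical stand-ins for the scheduling environment and its constraints.
import numpy as np

rng = np.random.default_rng(1)

def sample_scenario():
    """Draw one realization of the uncertain processing times and due dates."""
    processing_times = rng.lognormal(mean=1.0, sigma=0.2, size=5)
    due_dates = np.cumsum(rng.uniform(3.0, 5.0, size=5))
    return processing_times, due_dates

def schedule_is_feasible(processing_times, due_dates):
    """Placeholder feasibility check: completion times meet due dates."""
    completion = np.cumsum(processing_times)
    return bool(np.all(completion <= due_dates))

n_samples = 10_000
satisfied = sum(schedule_is_feasible(*sample_scenario()) for _ in range(n_samples))
p_hat = satisfied / n_samples
stderr = np.sqrt(p_hat * (1.0 - p_hat) / n_samples)  # standard error of the estimate
print(f"estimated satisfaction probability: {p_hat:.3f} (+/- {1.96 * stderr:.3f})")
```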

In the multi-echelon supply chain inventory management problem, we seek to coordinate the reorder policies of each stage within the supply chain. The proposed RL method is benchmarked against a policy gradient RL algorithm and mathematical programming formulations. Again, we show that the policies identified by our method account for uncertainties, with expected performance comparable to the mathematical programming methods and superior to the policy gradient RL. Additionally, the sample efficiency of the zero-order RL approach is benchmarked against the policy gradient method, proximal policy optimization (PPO), and exhibits reduced sample complexity in offline policy identification. Specifically, the proposed method demonstrates an average improvement of 11% in the performance of the policy identified for a given computational budget. The optimization of risk-sensitive formulations is also explored, together with a thorough analysis of the algorithm's sensitivities.
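To illustrate what is meant by a zero-order update, the sketch below shows one perturbation-based (evolution-strategies-style) policy improvement step, which estimates a search direction from return evaluations alone rather than from likelihood gradients as in PPO. The return evaluator and parameter settings are hypothetical placeholders, not the algorithm proposed in this work.

```python
# Illustrative sketch only: one zero-order (perturbation-based) policy update.
# 'evaluate_return' is a hypothetical stand-in for simulating the supply chain
# environment under reorder-policy parameters theta.
import numpy as np

rng = np.random.default_rng(2)

def evaluate_return(theta):
    """Placeholder noisy estimate of expected return for parameters theta."""
    return -np.sum((theta - 1.0) ** 2) + rng.normal(scale=0.05)

def zero_order_update(theta, sigma=0.1, lr=0.05, n_perturbations=32):
    """Antithetic Gaussian-smoothing gradient estimate (evolution-strategies style)."""
    grad = np.zeros_like(theta)
    for _ in range(n_perturbations):
        eps = rng.normal(size=theta.shape)
        grad += (evaluate_return(theta + sigma * eps)
                 - evaluate_return(theta - sigma * eps)) * eps
    grad /= (2.0 * sigma * n_perturbations)
    return theta + lr * grad                      # ascend the estimated gradient

theta = np.zeros(4)
for step in range(200):
    theta = zero_order_update(theta)
print("identified policy parameters:", theta)     # approaches the optimum at 1.0
```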

Further, the framework identifies online decisions orders of magnitude faster than the most efficient optimization methods. As a result, the methodology offers a practical means of handling online decision making under uncertainty in production environments and supply chains.