
(365f) A Reinforcement Learning Approach for Optimal Scheduling of Multiproduct Batch Plants

Authors 

Gupta, P., Indian Institute of Technology Tirupati
Batch chemical plants are quite popular due to their flexibility in producing multiple products in a single plant. The flexibility in the batch plant is due to the use of the same equipment in the production of different products. However, this poses a profound challenge in how and when we allocate particular equipment for a particular task. Multiproduct batch plants are often represented using a recipe network or state task network. In this work, we represent the multiproduct batch plant using a State-Task Network (STN) in which state nodes represent various feeds, intermediates, and products in the process, whereas task nodes represent various unit operations that convert input states to output states.

In general, several approaches exist to determine the optimal schedule of batch operations, viz., mathematical programming, heuristic methods, and artificial intelligence. In this study, we present an optimization formulation of the batch scheduling problem based on a discrete-time representation. The optimization problem involves both binary and continuous decision variables. The binary variables signify the allocation of a piece of equipment to a specific task at a given time, whereas the continuous variables signify the quantity of material being processed in a piece of equipment for a specific task at a given time and the amount of material stored in a particular state at a given time. The resulting optimization problem is a Mixed Integer Linear Programming (MILP) problem. However, the number of binary decision variables grows with the number of products to be processed, the number of processing stages, the availability of multiple pieces of equipment for a task, and the length of the scheduling horizon. For large-scale batch plants, the challenge with this formulation is therefore that the computation time increases sharply with the number of binary variables. The focus of this work is to optimally schedule a multiproduct batch plant using reinforcement learning.
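
To make the formulation concrete, the following sketch builds a small discrete-time STN scheduling MILP in PuLP, continuing from the toy network above. The variable names (W for allocation binaries, B for batch sizes, S for inventories), the six-slot horizon, and the uniform maximum batch size are assumptions chosen for illustration, not the formulation used in the case study.

```python
# Sketch of a discrete-time STN scheduling MILP in PuLP, using the toy
# `states` and `tasks` defined earlier. Horizon length and batch-size bound
# are assumed values for illustration.
import pulp

T = range(6)  # discrete time slots in the scheduling horizon
prob = pulp.LpProblem("stn_schedule", pulp.LpMaximize)

# Binary: unit j performs task i starting at time slot t
W = {(i, j, t): pulp.LpVariable(f"W_{i}_{j}_{t}", cat="Binary")
     for i, task in tasks.items() for j in task.units for t in T}
# Continuous: batch size processed, and inventory held in each state
B = {key: pulp.LpVariable(f"B_{key[0]}_{key[1]}_{key[2]}", lowBound=0)
     for key in W}
S = {(s, t): pulp.LpVariable(
         f"S_{s}_{t}", lowBound=0,
         upBound=None if states[s].capacity == float("inf") else states[s].capacity)
     for s in states for t in T}

V_MAX = 40.0  # assumed maximum batch size of every unit

# A batch can only be processed when the unit is allocated to the task
for key in W:
    prob += B[key] <= V_MAX * W[key]

# Each unit handles at most one task per time slot (simplified allocation rule)
units = {j for task in tasks.values() for j in task.units}
for j in units:
    for t in T:
        prob += pulp.lpSum(W[i, jj, tt] for (i, jj, tt) in W
                           if jj == j and tt == t) <= 1

# Simplified material balance: inventory at t equals the previous inventory
# plus production finishing at t minus consumption starting at t
for s in states:
    for t in T:
        prev = S[s, t - 1] if t > 0 else states[s].initial
        produced = pulp.lpSum(
            task.outputs[s] * B[i, j, t - task.duration]
            for i, task in tasks.items() if s in task.outputs
            for j in task.units if t - task.duration >= 0)
        consumed = pulp.lpSum(
            task.inputs[s] * B[i, j, t]
            for i, task in tasks.items() if s in task.inputs
            for j in task.units)
        prob += S[s, t] == prev + produced - consumed

# Objective: maximize the product inventory at the end of the horizon
prob += S["Product", T[-1]]
prob.solve(pulp.PULP_CBC_CMD(msg=False))
```

Even in this toy instance, the number of binary variables is the product of the numbers of tasks, eligible units, and time slots, which illustrates why the variable count grows quickly for realistic plants and horizons.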

Batch scheduling problems formulated as an MILP can be solved using branch-and-bound, cutting-plane, or branch-and-cut methods. In this work, we restrict our discussion to a cutting-plane-based solution methodology. In the cutting-plane method, the MILP is relaxed to a linear programming (LP) problem by dropping the integrality constraints on the variables, for which efficient solvers are available. A linear constraint (known as a cut) is then added to the LP iteratively such that part of the feasible region of the previous LP is removed while every integer-feasible solution of the MILP remains within the feasible region. However, at each iteration of the method, a number of candidate cuts are possible, from which a single cut is added to the set of previously available constraints. The decisions on cut addition (such as the number of cuts to add, the type of cut, etc.) at any particular iteration are often based on heuristic rules. Since we are dealing with an MILP, we use Gomory's mixed-integer (GMI) cuts, which ensure convergence in a finite number of iterations. These heuristics have a significant impact on the computational cost. Therefore, in this work, we develop a reinforcement learning (RL) framework that aids in the selection of a suitable GMI cut so that an optimal batch schedule is obtained at low computational cost.
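
For reference, a GMI cut can be derived from any row of the optimal simplex tableau whose basic integer variable takes a fractional value. The sketch below implements the standard textbook GMI formula for a single tableau row; obtaining the tableau rows from an actual LP solver is assumed to be handled elsewhere.

```python
# Sketch of generating a Gomory mixed-integer (GMI) cut from one row of the
# optimal simplex tableau of the LP relaxation. The tableau itself is assumed
# to come from the LP solver; here the row is just a plain array.
import numpy as np

def frac(x):
    """Fractional part in [0, 1)."""
    return x - np.floor(x)

def gmi_cut(row, rhs, integer_cols):
    """Build a GMI cut  pi @ x_nonbasic >= 1  from a tableau row
        x_basic + sum_j row[j] * x_j = rhs
    whose basic variable is integer but whose right-hand side is fractional.
    `integer_cols` flags which nonbasic columns are integer variables."""
    f0 = frac(rhs)
    if f0 < 1e-6 or f0 > 1 - 1e-6:
        return None  # rhs is (numerically) integral: no cut from this row
    pi = np.zeros_like(row, dtype=float)
    for j, a in enumerate(row):
        if integer_cols[j]:
            fj = frac(a)
            pi[j] = fj / f0 if fj <= f0 else (1 - fj) / (1 - f0)
        else:
            pi[j] = a / f0 if a >= 0 else -a / (1 - f0)
    return pi

# Toy example: one tableau row with a fractional right-hand side
row = np.array([0.4, -0.3, 2.0])  # coefficients of the nonbasic variables
pi = gmi_cut(row, rhs=3.7, integer_cols=[True, True, False])
print(pi)  # each candidate row with a fractional rhs yields one such cut
```

Every fractional row yields one candidate cut, so several cuts are typically available at each iteration; the heuristic (or learned) choice among them is exactly the decision discussed above.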

To this end, we reformulate the cutting-plane method as a Markov decision process (MDP) in which the solver acts as the environment and a deep neural network acts as the agent. For the MDP, we define the following: state space, action space, reward, and state-action policy. The state space is defined by the set of constraints of the current LP relaxation, the optimal solution of the relaxed LP problem, the objective function, and all possible GMI cuts that can be generated from the current simplex tableau. The action space is the set of all GMI cuts that can be generated from the current simplex tableau. The reward is the difference in objective function values between two successive iterates. The state-action policy is a deep neural network that maps each candidate cut to a quantifiable score. Using the policy, the agent selects the most appropriate cut for the next iteration and sends it to the environment, which then computes the new state and the reward. This helps the agent select cuts that reduce the gap so that the optimal solution is reached faster. The agent is based on a neural network architecture known as an attention network, which ensures that the selection of the cutting plane does not depend on the order of the cuts in the set. The network is trained over a number of unlabeled MILP problems. We demonstrate the computational efficacy of the proposed RL-based solution methodology on a benchmark case study of a hydrolubes batch plant.
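
A minimal sketch of how such an attention-based cut-selection policy might look is given below, using PyTorch. The feature construction, layer sizes, and sampling scheme are assumptions for illustration and do not reproduce the exact architecture or training procedure described above.

```python
# Sketch of an attention-based policy that scores candidate GMI cuts.
# Feature construction and layer sizes are assumptions, not the paper's
# architecture. Self-attention over the set of cuts makes the scores
# independent of the order in which the cuts are listed.
import torch
import torch.nn as nn

class CutScorer(nn.Module):
    def __init__(self, cut_dim, embed_dim=64, num_heads=4):
        super().__init__()
        self.embed = nn.Linear(cut_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, cuts):
        # cuts: (batch, num_cuts, cut_dim) features of each candidate GMI cut,
        # e.g. cut coefficients concatenated with LP-state features
        h = torch.relu(self.embed(cuts))
        h, _ = self.attn(h, h, h)            # set-to-set self-attention
        return self.score(h).squeeze(-1)     # (batch, num_cuts) scores

# One policy step: sample a cut index from the scores (softmax policy)
policy = CutScorer(cut_dim=16)
cut_features = torch.randn(1, 10, 16)        # 10 candidate cuts, toy features
logits = policy(cut_features)
action = torch.distributions.Categorical(logits=logits).sample()

# The environment (the cutting-plane solver) then adds the chosen cut,
# re-solves the LP relaxation, and returns the reward as the change in the
# LP objective value between successive iterations.
```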