(712b) Gradient Information Bayesian Optimization for Composite Functions (GIBO-CF): Local Physics-Based Grey-Box Optimization in High Dimensions | AIChE

Authors 

Makrygiorgos, G., UC Berkeley
Paulson, J., The Ohio State University
Mesbah, A., University of California, Berkeley
Bayesian optimization (BO) [1] has emerged as an efficient method for learning
and optimizing expensive black-box functions in numerous applications, such as
hyperparameter tuning for deep neural networks [2] and experimental design [3].
This is in part due to its ability to locate global optima with a small number of
noisy objective queries. BO constructs a surrogate, typically a Gaussian process
(GP) [4], as a model of the true black-box objective, and updates the posterior
as new observations are made. An acquisition function (e.g., expected
improvement [5] or the upper confidence bound [6]) balances exploration against
exploitation and determines where to query next.
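The loop described above can be made concrete with a minimal sketch. The code below is purely illustrative (it is not the GIBO-CF implementation): it uses a squared-exponential kernel with an arbitrarily chosen lengthscale, a 1-D search domain, and a grid search to maximize the UCB acquisition, all of which are simplifying assumptions.

```python
import numpy as np

def rbf_kernel(A, B, length=0.3):
    """Squared-exponential kernel between two sets of 1-D points."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(X, y, Xs, noise=1e-5):
    """GP posterior mean and variance at test points Xs given data (X, y)."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha                       # posterior mean
    v = np.linalg.solve(L, Ks)
    var = 1.0 - np.sum(v ** 2, axis=0)      # posterior variance (prior var = 1)
    return mu, np.maximum(var, 0.0)

def bayes_opt(f, bounds=(0.0, 1.0), n_init=3, n_iter=12, beta=2.0, seed=0):
    """Minimal BO loop maximizing f with a UCB acquisition on a grid."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(*bounds, size=n_init)
    y = f(X)
    grid = np.linspace(*bounds, 201)
    for _ in range(n_iter):
        mu, var = gp_posterior(X, y, grid)
        # UCB trades off exploitation (high mean) and exploration (high variance)
        ucb = mu + beta * np.sqrt(var)
        x_next = grid[np.argmax(ucb)]
        X = np.append(X, x_next)
        y = np.append(y, f(np.array([x_next])))
    return X[np.argmax(y)], y.max()
```

Running this on a simple concave test objective such as `f(x) = -(x - 0.6)**2` locates the maximizer near 0.6 within a handful of queries.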


BO is generally a derivative-free (zeroth-order) optimization algorithm. However,
leveraging gradient information is advantageous [7]: for an input of dimension
d, each query yields d + 1 pieces of information, giving a more complete view of
the function's behavior, and derivatives are relatively cheap when computed via
adjoint methods [8] or approximated via finite differences. There is currently
significant interest in using gradients in BO; [9] first suggested doing so
by modifying the expected improvement and probability of improvement acquisition
functions under the assumption of full gradient information. Gradients have
also been obtained explicitly and used as local-optimality constraints for global
optimization [10]. In [7], the property that the derivative of a GP is itself a GP
is exploited to create a joint kernel function that correlates all possible
combinations of zeroth-order values and gradients. Unfortunately, this approach
is prohibitively expensive: the Cholesky decomposition is the computational
bottleneck, and the resulting complexity is cubic in both the number of data
points and d, so the method does not scale. Furthermore, gradients are not
always readily accessible, and the curse of dimensionality prevents BO from
providing satisfactory empirical results for problems with 10 or more
dimensions [11]. Gradient information BO (GIBO) circumvents these issues by
focusing only on locating local optima and using zeroth-order queries to infer
gradients [12]. The strategy forgoes learning the zeroth-order function itself
and instead learns its Jacobian; the associated acquisition function, GI,
maximizes the expected reduction in Jacobian variance from before to after the
query. Based on the data in the vicinity of the current iterate, the approach
constructs a local approximation of the Jacobian and performs gradient descent
to find a local optimum. In the same vein, [13] queries points so as to maximize
the probability of descent when approximating the Jacobian for gradient descent.
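The key fact underlying the GI acquisition is that a GP's posterior covariance, and hence the Jacobian variance, depends only on query *locations*, not on observed values, so the effect of a candidate query can be evaluated before making it. The sketch below illustrates this for an RBF kernel with a hand-picked lengthscale; it is a simplified illustration of the idea, not the GIBO implementation of [12].

```python
import numpy as np

ELL = 0.5  # RBF lengthscale (illustrative choice)

def k(a, b):
    """RBF kernel k(a, b) = exp(-||a - b||^2 / (2 ELL^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * ELL ** 2))

def dk_dx(a, b):
    """Gradient of k(a, b) with respect to its first argument a."""
    return -(a - b) / ELL ** 2 * k(a, b)

def jacobian_variance(X, x_star, noise=1e-6):
    """Trace of the posterior covariance of grad f(x_star) given queries X.

    The posterior covariance does not depend on the observed values y, so
    this quantity can be computed for *hypothetical* query sets -- which is
    what makes a variance-reduction acquisition tractable."""
    n, d = X.shape
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    K += noise * np.eye(n)
    # Cross-covariance between grad f(x_star) and f(X): shape d x n.
    C = np.stack([dk_dx(x_star, X[j]) for j in range(n)], axis=1)
    # Prior covariance of the gradient at any point for an RBF kernel: I / ELL^2.
    post = np.eye(d) / ELL ** 2 - C @ np.linalg.solve(K, C.T)
    return np.trace(post)

def gi_acquisition(X, x_star, z):
    """Expected reduction in Jacobian variance from adding candidate query z."""
    return jacobian_variance(X, x_star) - jacobian_variance(np.vstack([X, z]), x_star)
```

As expected, a candidate close to the current iterate reduces the gradient uncertainty far more than a distant one, which is why the resulting queries cluster locally around the descent path.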
Additionally, recent research has shown promising performance by exploiting
the composite nature of the objective function [14]. By taking advantage of the
partially known structure of the objective, the limit on the convergence rate
imposed by a black-box, no-prior-information assumption can be overcome [15].
We therefore extend GIBO by integrating partial information in the form of a
grey-box composite function [16], which we term GIBO for composite functions
(GIBO-CF). In the computation of the Jacobian variance for the acquisition
function, the derivatives of the known portion of the function can be extracted
as constants at every iteration of gradient descent, reducing complexity and
greatly benefiting higher-dimensional applications. We demonstrate that, owing
to the combination of known and inferred gradients, GIBO-CF outperforms
traditional BO as well as other algorithms such as augmented random search [17]
and CMA-ES [18]. Furthermore, the composite structure of the proposed method
requires fewer objective calls to locate the optimum than GIBO.
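The role of the known derivatives can be seen from the chain rule. For a composite objective f(x) = g(h(x)) with g known in closed form and h the expensive black-box part, grad f(x) = J_h(x)^T grad g(h(x)), so at each descent iteration grad g enters as a constant vector and only the Jacobian of h must be inferred from queries. The sketch below illustrates this decomposition; the particular g, h, and the exact-Jacobian stand-in are illustrative assumptions, not the paper's test problems.

```python
import numpy as np

# Grey-box composite objective f(x) = g(h(x)):
#   h : R^d -> R^m is the expensive black-box part (its Jacobian is inferred),
#   g : R^m -> R   is known in closed form, so its gradient is exact and cheap.

def g(y):
    """Known outer function (illustrative choice)."""
    return -np.sum(y ** 2)

def grad_g(y):
    """Exact derivative of the known portion."""
    return -2.0 * y

def composite_gradient(x, h, jac_h_estimate):
    """Chain rule for the grey-box gradient:
        grad f(x) = J_h(x)^T grad g(h(x)).
    grad_g(h(x)) is a constant vector at each descent iteration, so only the
    m x d Jacobian of h needs to be estimated (e.g., from a GP model of h)."""
    y = h(x)                    # one evaluation of the expensive black box
    J = jac_h_estimate(x)       # m x d Jacobian estimate of the black box
    return J.T @ grad_g(y)
```

With a linear black box h(x) = A x and the exact Jacobian A supplied as the estimate, the composite gradient matches the analytic gradient -2 A^T A x, confirming that all of the "known" differentiation cost has been moved outside the inference problem.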

[1] D. R. Jones, M. Schonlau, and W. J. Welch, “Efficient global optimization
of expensive black-box functions,” J. Glob. Optim., vol. 13, no. 4,
pp. 455–492, 1998.
[2] H. Cho, Y. Kim, E. Lee, D. Choi, Y. Lee, and W. Rhee, “Basic enhancement
strategies when using bayesian optimization for hyperparameter tuning of
deep neural networks,” IEEE Access, vol. 8, pp. 52 588–52 608, 2020.
[3] S. Greenhill, S. Rana, S. Gupta, P. Vellanki, and S. Venkatesh, “Bayesian
optimization for adaptive experimental design: A review,” IEEE Access,
vol. 8, pp. 13 937–13 948, 2020.
[4] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine
learning., ser. Adaptive computation and machine learning. MIT Press,
2006.
[5] D. Huang, T. Allen, W. Notz, and N. Zeng, “Global Optimization
of Stochastic Black-Box Systems via Sequential Kriging Meta-Models,”
Journal of Global Optimization, vol. 34, no. 3, pp. 441–466, March 2006.
[Online]. Available: https://ideas.repec.org/a/spr/jglopt/v34y2006i3p441-
466.html

[6] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger,
“Information-theoretic regret bounds for gaussian process optimiza-
tion in the bandit setting,” IEEE Transactions on Information
Theory, vol. 58, no. 5, p. 3250–3265, May 2012. [Online]. Available:
http://dx.doi.org/10.1109/TIT.2011.2182033
[7] J. Wu, M. Poloczek, A. G. Wilson, and P. Frazier, “Bayesian optimiza-
tion with gradients,” in Advances in Neural Information Processing Sys-
tems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
[8] R.-E. Plessix, “A review of the adjoint-state method for computing
the gradient of a functional with geophysical applications,” Geophysical
Journal International, vol. 167, no. 2, pp. 495–503, 11 2006. [Online].
Available: https://doi.org/10.1111/j.1365-246X.2006.02978.x
[9] D. J. Lizotte, “Practical bayesian optimization,” 2008.
[10] G. Makrygiorgos, J. A. Paulson, and A. Mesbah, “No-regret bayesian op-
timization with gradients using local optimality-based constraints: Appli-
cation to closed-loop policy search,” in 2023 62nd IEEE Conference on
Decision and Control (CDC), 2023, pp. 20–25.
[11] W.-T. Tang and J. A. Paulson, “Cages: Cost-aware gradient entropy search
for efficient local multi-fidelity bayesian optimization,” in 2024 63rd IEEE
Conference on Decision and Control (CDC), 2024.
[12] S. Müller, A. von Rohr, and S. Trimpe, “Local policy search with bayesian
optimization,” in Neural Information Processing Systems, 2021. [Online].
Available: https://api.semanticscholar.org/CorpusID:235593254
[13] Q. Nguyen, K. Wu, J. R. Gardner, and R. Garnett, “Local bayesian opti-
mization via maximizing probability of descent,” 2023.
[14] A. Kudva, F. Sorouifar, and J. A. Paulson, “Efficient robust global op-
timization for simulation-based problems using decomposed gaussian pro-
cesses: Application to mpc calibration,” in 2022 American Control Con-
ference (ACC), 2022, pp. 2091–2097.
[15] H. Chen, “Lower rate of convergence for locating a maximum of a
function,” The Annals of Statistics, vol. 16, no. 3, pp. 1330–1334, 1988.
[Online]. Available: http://www.jstor.org/stable/2241636
[16] J. A. Paulson and C. Lu, “Cobalt: Constrained bayesian optimization of
computationally expensive grey-box models exploiting derivative informa-
tion,” 2022.
[17] H. Mania, A. Guy, and B. Recht, “Simple random search provides a com-
petitive approach to reinforcement learning,” 2018.
[18] N. Hansen, “The cma evolution strategy: A tutorial,” 2023.