Biomass is a diverse feedstock composed of several types of polysaccharides, proteins, lignin, and lipids. These components also interact with one another, altering the composition of the resulting oil. As a result, developing compositional models that generalize across a wide variety of feedstocks has proven challenging. Moreover, because HTL is a high-temperature, high-pressure process, high-throughput experimentation is difficult. We therefore use data mined from the HTL literature together with statistical methods (multivariate linear regression, regression tree, and random forest) to predict oil compositions.
The composition of HTL oil is mostly characterized using GC/MS. In the literature, this information is predominantly grouped into product classes (e.g., esters, long-chain fatty acids, hydrocarbons) and reported as bar charts. We developed an in-house ChartReader model that automatically parses data from bar charts in the literature. Using the ChartReader model together with human inspection, we prepared a data set of 682 data points. Each data point contains the oil composition (by functional class), the feedstock composition (biochemical and elemental), and the process variables (temperature and time). The products formed in the oil phase are lumped into nine product classes: esters, oxygenated single-ring aromatics, furans, long-chain fatty acids, long-chain alcohols, aldehydes and ketones, N-containing compounds, aliphatics, and polycyclic aromatics. The percentages of all product classes sum to 100%.
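As a minimal sketch of how one such data point could be represented and checked against the 100% closure constraint, consider the following. The record layout, field names, and the helper `is_closed` are hypothetical and chosen only for illustration; the abstract does not specify how the data set is stored.

```python
from dataclasses import dataclass
from math import isclose

# The nine lumped product classes named in the text.
PRODUCT_CLASSES = (
    "esters",
    "oxygenated single-ring aromatics",
    "furans",
    "long-chain fatty acids",
    "long-chain alcohols",
    "aldehydes and ketones",
    "N-containing compounds",
    "aliphatics",
    "polycyclic aromatics",
)


@dataclass
class HTLDataPoint:
    """Hypothetical record layout; field names are assumptions."""
    temperature: float  # process temperature (units as reported in the source)
    time: float         # reaction time (units as reported in the source)
    feedstock: dict     # biochemical and/or elemental composition
    oil: dict           # percentage of each product class in the oil


def is_closed(point, tol=1e-6):
    """Return True if all nine classes are present and sum to 100%."""
    if set(point.oil) != set(PRODUCT_CLASSES):
        return False
    return isclose(sum(point.oil.values()), 100.0, abs_tol=tol)
```

A check of this kind is useful after chart parsing, since values read from bar charts may not sum exactly to 100% and may need manual inspection.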
To model the oil composition, the full set of predictors comprises the biochemical/elemental compositions (with all of their two-way interactions), temperature, and time. Where either the biochemical or the elemental composition was missing, the expectation-maximization (EM) algorithm was used to impute it from the other. The response (oil composition) was represented by the percentages of the nine product classes. Important predictors were identified using MANOVA in the context of multivariate multiple linear regression (MLR). The same predictors were then used to fit regression tree (RT) and random forest (RF) models. The MLR, RT, and RF models were subsequently compared using several measures of prediction error.
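The predictor set described above (main effects, all two-way composition interactions, temperature, and time) can be sketched as follows. This is an illustrative construction only: the function names, the `a:b` naming of interaction terms, and the use of simple products for interactions are assumptions, not the authors' implementation.

```python
from itertools import combinations


def add_two_way_interactions(composition):
    """Return the main effects plus one product term per pair of components."""
    features = dict(composition)
    # Sort the names so each pair gets a stable, order-independent label.
    for a, b in combinations(sorted(composition), 2):
        features[f"{a}:{b}"] = composition[a] * composition[b]
    return features


def build_predictors(composition, temperature, time):
    """Assemble the full predictor set for one data point."""
    features = add_two_way_interactions(composition)
    features["temperature"] = temperature
    features["time"] = time
    return features
```

For a feedstock described by four components, this yields 4 main effects, 6 interaction terms, and the 2 process variables, or 12 candidate predictors before any selection step.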
Our initial analysis indicates that the MLR model predicts all product classes with R² > 0.8 (computed from the sum of squared residuals between experimental and predicted values). The mean absolute error (MAE) and root-mean-square error (RMSE) across all product classes are 11.6 and 15.9, respectively. All two-way interaction effects except the protein-lipid interaction significantly influence oil composition. The MAE and RMSE values follow the trend MLR > RT > RF; hence, RF gives the lowest prediction error among the models tested.
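For reference, the error measures mentioned above can be computed as in the sketch below, which uses their standard textbook definitions. The function names are illustrative, and the abstract does not state exactly how the values were aggregated across product classes, so this shows the calculation for a single class only.

```python
from math import sqrt


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def root_mean_squared_error(observed, predicted):
    """Square root of the average squared residual."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return sqrt(squared / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Because RMSE squares the residuals before averaging, it penalizes large errors more heavily than MAE, which is why the two measures are often reported together.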