Generating Polynomial Models with PySMO

The pysmo.polynomial_regression method learns polynomial models from data. Presented with a small number of samples generated from experiments or computer simulations, the approach determines the most accurate polynomial approximation by comparing the accuracy and performance of polynomials of different orders and basis function forms.

pysmo.polynomial_regression considers three types of basis functions

  • univariate polynomials,
  • second-degree bivariate pilynomials, and
  • user-specified basis functions.

Thus, for a problem with \(m\) sample points and \(n\) input variables, the resulting polynomial is of the form

\[\begin{equation} y_{k}={\displaystyle \sum_{i=1}^{n}\beta_{i}x_{ik}^{\alpha}}+\sum_{i,j>i}^{n}\beta_{ij}x_{ik}x_{jk}+\beta_{\Phi}\Phi\left(x_{ik}\right)\qquad i,j=1,\ldots,n;i\neq j;k=1,\ldots,m;\alpha \leq 10\qquad\quad\label{eq:poly_eq} \end{equation}\]

Basic Usage

To generate a polynomial model with PySMO, the pysmo.polynomial_regression class is first initialized, and then the method poly_training is called on the initialized object:

# Required imports
>>> from idaes.surrogates.pysmo import polynomial_regression
>>> import pandas as pd

# Load dataset from a csv file
>>> xy_data = pd.read_csv('data.csv', header=None, index_col=0)

# Initialize the PolynomialRegression class, extract the list of features and train the model
>>> pr_init = polynomial_regression.PolynomialRegression(xy_data, xy_data, maximum_polynomial_order=3, *kwargs)
>>> features = pr_init.get_feature_vector()
>>> polyfit = pr_init.poly_training()
  • xy_data is a two-dimensional python data structure containing the input and output training data. The output values MUST be in the last column.
  • maximum_polynomial_order refers to the maximum polynomial order to be considered when training the surrogate.

Optional Arguments

  • multinomials - boolean option which determines whether bivariate terms are considered in polynomial generation.
  • training_split - option which determines fraction of training data to be used for training (the rest will be for testing). Default is 0.8.
  • number_of_crossvalidations - Number of cross-validations during training. Default number is 3.

pysmo.polynomial_regression Output

The result of the pysmo.polynomial_regression method (polyfit in above example) is a python object containing information about the optimal polynomial order, the polynomial coefficients and different error and quality-of-fit metrics such as the mean-squared-error (MSE) and the \(R^{2}\) coefficient-of-fit. A Pyomo expression can be generated from the object simply passing a list of variables into the function generate_expression:

# Create a python list from the headers of the dataset supplied for training
>>> list_vars = []
>>> for i in features.keys():
>>>     list_vars.append(features[i])
# Pass list to generate_expression function to obtain a Pyomo expression as output
>>> print(polyfit.generate_expression(list_vars))

Prediction with pysmo.polynomial_regression models

Once a polynomial model has been trained, predictions for values at previously unsampled points :math:x_unsampled can be evaluated by calling the poly_predict_output() method on the resulting model object and the unsampled points:

# Create a python list from the headers of the dataset supplied for training
>>> y_unsampled = pr_init.poly_predict_output(rbf_fit, x_unsampled)

Flowsheet Integration

The result of the polynomial training process can be passed directly into a process flowsheet as an objective or a constraint. The following code snippet demonstrates how an output polynomial model may be integrated directly into a Pyomo flowsheet as an objective:

# Required imports
>>> import pyomo.environ as pyo
>>> from idaes.surrogates.pysmo import polynomial_regression
>>> import pandas as pd

# Create a Pyomo model
>>> m = pyo.ConcreteModel()
>>> i = pyo.Set(initialize=[1, 2])

# Create a Pyomo variable with indexed by the 2D-set i with initial values {0, 0}
>>> init_x = {1: 0, 2: 0}
>>> def x_init(m, i):
>>>     return (init_x[i])
>>> m.x = pyo.Var(i, initialize=x_init)

# Train a simple polynomial model on data available in csv format, resulting in the Python object polyfit
>>> xy_data = pd.read_csv('data.csv', header=None, index_col=0)
>>> pr_init = polynomial_regression.PolynomialRegression(xy_data, xy_data, maximum_polynomial_order=3)
>>> features = pr_init.get_feature_vector()
>>> polyfit = pr_init.poly_training()

# Use the resulting polynomial as an objective, passing in the Pyomo variable x
>>> m.obj = pyo.Objective(expr=polyfit.generate_expression([m.x[1], m.x[2]]))

# Solve the model
>>> instance = m
>>> opt = pyo.SolverFactory("ipopt")
>>> result = opt.solve(instance, tee=True)

Further details about pysmo.polynomial_regression may be found by consulting the examples or reading the paper […]

Available Methods

class idaes.surrogate.pysmo.polynomial_regression.FeatureScaling[source]

A class for scaling and unscaling input and output data. The class contains two main methods: data_scaling and data_unscaling

static data_scaling(data)[source]

data_scaling performs column-wise minimax scaling on the input dataset.

Parameters:data – The input data set to be scaled. Must be a numpy array or dataframe.
Returns:
tuple containing:
  • scaled_data : A 2-D Numpy Array containing the scaled data. All array values will be between [0, 1].
  • data_minimum : A 2-D row vector containing the column-wise minimums of the input data.
  • data_maximum : A 2-D row vector containing the column-wise maximums of the input data.
Return type:(tuple)
Raises:TypeError – Raised when the input data is not a numpy array or dataframe
static data_unscaling(x_scaled, x_min, x_max)[source]

data_unscaling performs column-wise un-scaling on the a minmax-scaled input dataset.

Parameters:
  • x_scaled (NumPy Array) – Data to be un-scaled. Data values should be between 0 and 1.
  • x_min (NumPy vector) – \(n \times 1\) vector containing the actual minimum value for each column. Must contain same number of elements as the number of columns in x_scaled.
  • x_max (NumPy vector) – \(n \times 1\) vector vector containing the actual minimum value for each column. Must contain same number of elements as the number of columns in x_scaled.
Returns:

A 2-D numpy array containing the scaled data, \(x_{min} + x_{scaled} * (x_{max} - x_{min})\)

Return type:

NumPy Array

Raises:

IndexError – Raised when the dimensions of the arrays are inconsistent.

class idaes.surrogate.pysmo.polynomial_regression.PolynomialRegression(original_data_input, regression_data_input, maximum_polynomial_order, number_of_crossvalidations=None, no_adaptive_samples=None, training_split=None, max_fraction_training_samples=None, max_iter=None, solution_method=None, multinomials=None, fname=None, overwrite=False)[source]

The PolynomialRegression class performs polynomial regression on a training data set.

The class must first be initialized by calling PolynomialRegression. Regression is then carried out by calling poly_training.

For a given dataset with \(n\) features \(x_{1},x_{2},\ldots,x_{n}\), Polyregression is able to consider three types of basis functions:
  1. Mononomial terms (\(x_{i}^{p},p \leq 10\)) for all individual features. The maximum degree to be considered can be set by the user (maximum_polynomial_order)
  2. All first order interaction terms \(x_{1}x_{2}\), \(x_{1}x_{3}\) etc. This can be turned on or off by the user (set multinomials)
  3. User defined input features, e.g. \(\sin(x_{1})\). These must be Pyomo functions and should be provided as a list by the user calling set_additional_terms method before the polynomial training is done.

Example:

# Initialize the class and set additional terms
>>> d = PolynomialRegression(full_data, training_data, maximum_polynomial_order=2, max_iter=20, multinomials=1, solution_method='pyomo')
>>> p = d.get_feature_vector()
>>> d.set_additional_terms([...extra terms...])

# Train polynomial model and predict output for an test data x_test
>>> results = d.poly_training()
>>> predictions = d.poly_predict_output(results, x_test)
Parameters:
  • regression_data_input (NumPy Array of Pandas Dataframe) – The dataset for regression training. It is expected to contain the features (X) and output (Y) data, with the output values (Y) in the last column.
  • original_data_input (NumPy Array of Pandas Dataframe) – If regression_data_input was drawn from a larger dataset by some sampling approach, the larger dataset may be provided here. When additional data is not available, the same data supplied for training_data can be supplied - this tells the algorithm not to carry out adaptive sampling.
  • maximum_polynomial_order (int) – The maximum polynomial order to be considered.

Further details about the optional inputs may be found under the __init__ method.

__init__(original_data_input, regression_data_input, maximum_polynomial_order, number_of_crossvalidations=None, no_adaptive_samples=None, training_split=None, max_fraction_training_samples=None, max_iter=None, solution_method=None, multinomials=None, fname=None, overwrite=False)[source]

Initialization of PolynomialRegression class.

Parameters:
  • regression_data_input (NumPy Array of Pandas Dataframe) – The dataset for regression training. It is expected to contain features and output data, with the output values (Y) in the last column.
  • original_data_input (NumPy Array of Pandas Dataframe) – If regression_data_input was drawn from a larger dataset by some sampling approach, the larger dataset may be provided here.
  • maximum_polynomial_order (int) – The maximum polynomial order to be considered.
Keyword Arguments:
 
  • number_of_crossvalidations (int) – The number of polynomial fittings and cross-validations to be carried out for each polynomial function/expression. Must be a positive, non-zero integer. Default=3.
  • training_split (float) – The training/test split to be used for regression_data_input. Must be between 0 and 1. Default = 0.75
  • solution_method (str) –

    The method to be used for solving the least squares optimization problem for polynomial regression. Three options are available:

    1. ”MLE” : The mle (maximum likelihood estimate) method solves the least squares problem using linear algebra. Details of the method may be found in Forrester et al.
    2. ”BFGS” : This approach solves the least squares problem using scipy’s BFGS algorithm.
    3. ”pyomo”: This option solves the optimization problem in pyomo with IPOPT as solver. This is the default option.
  • multinomials (bool) – This option determines whether or not multinomial terms are considered during polynomial fitting. Takes 0 for No and 1 for Yes. Default = 1.
Returns:

self object containing all the input information.

Raises:
  • ValueError – - The input datasets (original_data_input or regression_data_input) are of the wrong type (not Numpy arrays or Pandas Dataframes)
  • Exception – * maximum_polynomial_order is not a positive, non-zero integer or maximum_polynomial_order is higher than the number of training samples available
  • Exception – * solution_method is not ‘mle’, ‘pyomo’ or ‘bfgs
  • Exception – - multinomials is not binary (0 or 1)
  • Exception – - training_split is not between 0 and 1
  • Exception – - number_of_crossvalidations is not a positive, non-zero integer
  • Exception – - max_fraction_training_samples is not between 0 and 1
  • Exception – - no_adaptive_samples is not a positive, non-zero integer
  • Exception – - max_iter is not a positive, non-zero integer
  • warnings.warn – - When the number of cross-validations is too high, i.e. number_of_crossvalidations > 10
get_feature_vector()[source]

The get_feature_vector method generates the list of regression features from the column headers of the input dataset.

Returns:An indexed parameter list of the variables supplied in the original data
Return type:Pyomo IndexedParam

Example:

# Create a small dataframe with three columns ('one', 'two', 'three') and two rows (A, B)
>>> xy_data = pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])], orient='index', columns=['one', 'two', 'three'])

# Initialize the **PolynomialRegression** class and print the column headers for the variables
>>> f = PolynomialRegression(xy_data, xy_data, maximum_polynomial_order=1, multinomials=True, training_split=0.8)
>>> p = f.get_feature_vector()
>>> for i in p.keys():
>>>     print(i)
one
two
poly_predict_output(results_vector, x_data)[source]

The poly_predict_output method generates output predictions for input data x_data based a previously generated polynomial fitting.

Parameters:
  • results_vector – Python object containing results of polynomial fit generated by calling the poly_training function.
  • x_data – Numpy array of designs for which the output is to be evaluated/predicted.
Returns:

Output variable predictions based on the polynomial fit.

Return type:

Numpy Array

poly_training()[source]

The poly_training method trains a polynomial model to an input dataset. It calls the core method which is called in the PolynomialRegression class (polynomial_regression_fitting). It accepts no user input, inheriting the information passed in class initialization.

Returns:
Python Object (results) containing the results of the polynomial regression process including:
  • the polynomial order (self.polynomial_order)
  • polynomial coefficients (self.optimal_weights_array), and
  • MAE and MSE errors as well as the \(R^{2}\) (results.errors).
Return type:tuple
set_additional_terms(term_list)[source]

set_additional_terms accepts additional user-defined features for consideration during regression.

Parameters:term_list (list) – List of additional terms to be considered as regression features. Each term in the list must be a Pyomo-supported intrinsic function.

Example:

# To add the sine and cosine of a variable with header 'X1' in the dataset as additional regression features:
>>> xy_data = pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])], orient='index', columns=['X1', 'X2', 'Y'])
>>> A = PolynomialRegression(xy_data, xy_data, maximum_polynomial_order=5)
>>> p = A.get_feature_vector()
>>> A.set_additional_terms([ pyo.sin(p['X1']) , pyo.cos(p['X1']) ])
class idaes.surrogate.pysmo.polynomial_regression.ResultReport(optimal_weight_vector, polynomial_order, multinomials, mae_error, mse_error, R2, adjusted_R2, number_of_iterations, results_vector, additional_features_array, final_regression_data, df_coefficients, extra_terms_coeffs, extra_terms_feature_vector, extra_terms_expressions)[source]
generate_expression(variable_list)[source]

The generate_expression method returns the Pyomo expression for the polynomial model trained.

The expression is constructed based on a supplied list of variables variable_list and the output of poly_training.

Parameters:variable_list (list) – List of input variables to be used in generating expression. This can be the a list generated from the results of get_feature_vector. The user can also choose to supply a new list of the appropriate length.
Returns:Pyomo expression of the polynomial model based on the variables provided in variable_list.
Return type:Pyomo Expression

References:

[1] Forrester et al.’s book “Engineering Design via Surrogate Modelling: A Practical Guide”, https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470770801