Generating Polynomial Models with PySMO

The pysmo.polynomial_regression method learns polynomial models from data. Presented with a small number of samples generated from experiments or computer simulations, the approach determines the most accurate polynomial approximation by comparing the accuracy and performance of polynomials of different orders and basis function forms.

pysmo.polynomial_regression considers three types of basis functions

  • univariate polynomials,

  • second-degree bivariate polynomials, and

  • user-specified basis functions.

Thus, for a problem with \(m\) sample points and \(n\) input variables, the resulting polynomial is of the form

\[\begin{equation} y_{k}={\displaystyle \sum_{i=1}^{n}\beta_{i}x_{ik}^{\alpha}}+\sum_{i,j>i}^{n}\beta_{ij}x_{ik}x_{jk}+\beta_{\Phi}\Phi\left(x_{ik}\right)\qquad i,j=1,\ldots,n;i\neq j;k=1,\ldots,m;\alpha \leq 10\qquad\quad\label{eq:poly_eq} \end{equation}\]

Basic Usage

To generate a polynomial model with PySMO, the pysmo.polynomial_regression class is first initialized, and then the method training is called on the initialized object:

# Required imports
>>> from idaes.surrogate.pysmo import polynomial_regression
>>> import pandas as pd

# Load dataset from a csv file
>>> xy_data = pd.read_csv('data.csv', header=None, index_col=0)

# Initialize the PolynomialRegression class, extract the list of features and train the model
>>> pr_init = polynomial_regression.PolynomialRegression(xy_data, xy_data, maximum_polynomial_order=3, *kwargs)
>>> features = pr_init.get_feature_vector()
>>> pr_init.training()
  • xy_data is a two-dimensional python data structure containing the input and output training data. The output values MUST be in the last column.

  • maximum_polynomial_order refers to the maximum polynomial order to be considered when training the surrogate.

Optional Arguments

  • multinomials - boolean option which determines whether bivariate terms are considered in polynomial generation.

  • training_split - option which determines fraction of training data to be used for training (the rest will be for testing). Default is 0.8.

  • number_of_crossvalidations - Number of cross-validations during training. Default number is 3.

pysmo.polynomial_regression Output

The result of the pysmo.polynomial_regression method is a python object containing information about the problem set-up, the final optimal polynomial order, the polynomial coefficients and different error and quality-of-fit metrics such as the mean-squared-error (MSE) and the \(R^{2}\) coefficient-of-fit. A Pyomo expression can be generated from the object simply passing a list of variables into the function generate_expression:

# Create a python list from the headers of the dataset supplied for training
>>> list_vars = []
>>> for i in features.keys():
>>>     list_vars.append(features[i])
# Pass list to generate_expression function to obtain a Pyomo expression as output
>>> print(pr_init.generate_expression(list_vars))

Prediction with pysmo.polynomial_regression models

Once a polynomial model has been trained, predictions for values at previously unsampled points :math:x_unsampled can be evaluated by calling the predict_output() method on the unsampled points:

# Create a python list from the headers of the dataset supplied for training
>>> y_unsampled = pr_init.predict_output(x_unsampled)

The confidence intervals for the regression paramaters may be viewed using the method confint_regression.

Flowsheet Integration

The result of the polynomial training process can be passed directly into a process flowsheet as an objective or a constraint. The following code snippet demonstrates how an output polynomial model may be integrated directly into a Pyomo flowsheet as an objective:

# Required imports
>>> import pyomo.environ as pyo
>>> from idaes.surrogate.pysmo import polynomial_regression
>>> import pandas as pd

# Create a Pyomo model
>>> m = pyo.ConcreteModel()
>>> i = pyo.Set(initialize=[1, 2])

# Create a Pyomo variable with indexed by the 2D-set i with initial values {0, 0}
>>> init_x = {1: 0, 2: 0}
>>> def x_init(m, i):
>>>     return (init_x[i])
>>> m.x = pyo.Var(i, initialize=x_init)

# Train a simple polynomial model on data available in csv format, resulting in the Python object polyfit
>>> xy_data = pd.read_csv('data.csv', header=None, index_col=0)
>>> pr_init = polynomial_regression.PolynomialRegression(xy_data, xy_data, maximum_polynomial_order=3)
>>> features = pr_init.get_feature_vector()
>>> polyfit = pr_init.training()

# Use the resulting polynomial as an objective, passing in the Pyomo variable x
>>> m.obj = pyo.Objective(expr=polyfit.generate_expression([m.x[1], m.x[2]]))

# Solve the model
>>> instance = m
>>> opt = pyo.SolverFactory("ipopt")
>>> result = opt.solve(instance, tee=True)

Further details about pysmo.polynomial_regression may be found by consulting the examples or reading the paper […]

Available Methods

class idaes.surrogate.pysmo.polynomial_regression.FeatureScaling[source]

A class for scaling and unscaling input and output data. The class contains two main methods: data_scaling and data_unscaling

static data_scaling(data)[source]

data_scaling performs column-wise minimax scaling on the input dataset.

Parameters

data – The input data set to be scaled. Must be a numpy array or dataframe.

Returns

tuple containing:
  • scaled_data : A 2-D Numpy Array containing the scaled data. All array values will be between [0, 1].

  • data_minimum : A 2-D row vector containing the column-wise minimums of the input data.

  • data_maximum : A 2-D row vector containing the column-wise maximums of the input data.

Return type

(tuple)

Raises

TypeError – Raised when the input data is not a numpy array or dataframe

static data_unscaling(x_scaled, x_min, x_max)[source]

data_unscaling performs column-wise un-scaling on the a minmax-scaled input dataset.

Parameters
  • x_scaled (NumPy Array) – Data to be un-scaled. Data values should be between 0 and 1.

  • x_min (NumPy vector) – \(n \times 1\) vector containing the actual minimum value for each column. Must contain same number of elements as the number of columns in x_scaled.

  • x_max (NumPy vector) – \(n \times 1\) vector vector containing the actual minimum value for each column. Must contain same number of elements as the number of columns in x_scaled.

Returns

A 2-D numpy array containing the scaled data, \(x_{min} + x_{scaled} * (x_{max} - x_{min})\)

Return type

NumPy Array

Raises

IndexError – Raised when the dimensions of the arrays are inconsistent.

class idaes.surrogate.pysmo.polynomial_regression.PolynomialRegression(original_data_input, regression_data_input, maximum_polynomial_order, number_of_crossvalidations=None, no_adaptive_samples=None, training_split=None, max_fraction_training_samples=None, max_iter=None, solution_method=None, multinomials=None, fname=None, overwrite=False)[source]

The PolynomialRegression class performs polynomial regression on a training data set.

The class must first be initialized by calling PolynomialRegression. Regression is then carried out by calling training.

For a given dataset with \(n\) features \(x_{1},x_{2},\ldots,x_{n}\), Polyregression is able to consider three types of basis functions:
  1. Mononomial terms (\(x_{i}^{p},p \leq 10\)) for all individual features. The maximum degree to be considered can be set by the user (maximum_polynomial_order)

  2. All first order interaction terms \(x_{1}x_{2}\), \(x_{1}x_{3}\) etc. This can be turned on or off by the user (set multinomials)

  3. User defined input features, e.g. \(\sin(x_{1})\). These must be Pyomo functions and should be provided as a list by the user calling set_additional_terms method before the polynomial training is done.

Example:

# Initialize the class and set additional terms
>>> d = PolynomialRegression(full_data, training_data, maximum_polynomial_order=2, max_iter=20, multinomials=1, solution_method='pyomo')
>>> p = d.get_feature_vector()
>>> d.set_additional_terms([...extra terms...])

# Train polynomial model and predict output for an test data x_test
>>> d.training()
>>> predictions = d.predict_output(x_test)
Parameters
  • regression_data_input (NumPy Array of Pandas Dataframe) – The dataset for regression training. It is expected to contain the features (X) and output (Y) data, with the output values (Y) in the last column.

  • original_data_input (NumPy Array of Pandas Dataframe) – If regression_data_input was drawn from a larger dataset by some sampling approach, the larger dataset may be provided here. When additional data is not available, the same data supplied for training_data can be supplied - this tells the algorithm not to carry out adaptive sampling.

  • maximum_polynomial_order (int) – The maximum polynomial order to be considered.

Further details about the optional inputs may be found under the __init__ method.

__init__(original_data_input, regression_data_input, maximum_polynomial_order, number_of_crossvalidations=None, no_adaptive_samples=None, training_split=None, max_fraction_training_samples=None, max_iter=None, solution_method=None, multinomials=None, fname=None, overwrite=False)[source]

Initialization of PolynomialRegression class.

Parameters
  • regression_data_input (NumPy Array of Pandas Dataframe) – The dataset for regression training. It is expected to contain features and output data, with the output values (Y) in the last column.

  • original_data_input (NumPy Array of Pandas Dataframe) – If regression_data_input was drawn from a larger dataset by some sampling approach, the larger dataset may be provided here.

  • maximum_polynomial_order (int) – The maximum polynomial order to be considered.

Keyword Arguments
  • number_of_crossvalidations (int) – The number of polynomial fittings and cross-validations to be carried out for each polynomial function/expression. Must be a positive, non-zero integer. Default=3.

  • training_split (float) – The training/test split to be used for regression_data_input. Must be between 0 and 1. Default = 0.75

  • solution_method (str) –

    The method to be used for solving the least squares optimization problem for polynomial regression. Three options are available:

    1. ”MLE” : The mle (maximum likelihood estimate) method solves the least squares problem using linear algebra. Details of the method may be found in Forrester et al.

    2. ”BFGS” : This approach solves the least squares problem using scipy’s BFGS algorithm.

    3. ”pyomo”: This option solves the optimization problem in pyomo with IPOPT as solver. This is the default option.

  • multinomials (bool) – This option determines whether or not multinomial terms are considered during polynomial fitting. Takes 0 for No and 1 for Yes. Default = 1.

Returns

self object containing all the input information.

Raises
  • ValueError

    • The input datasets (original_data_input or regression_data_input) are of the wrong type (not Numpy arrays or Pandas Dataframes)

  • Exception

    • maximum_polynomial_order is not a positive, non-zero integer or maximum_polynomial_order is higher than the number of training samples available

  • Exception

    • solution_method is not ‘mle’, ‘pyomo’ or ‘bfgs

  • Exception

    • multinomials is not binary (0 or 1)

  • Exception

    • training_split is not between 0 and 1

  • Exception

    • number_of_crossvalidations is not a positive, non-zero integer

  • Exception

    • max_fraction_training_samples is not between 0 and 1

  • Exception

    • no_adaptive_samples is not a positive, non-zero integer

  • Exception

    • max_iter is not a positive, non-zero integer

  • warnings.warn

    • When the number of cross-validations is too high, i.e. number_of_crossvalidations > 10

confint_regression(confidence=0.95)[source]

The confint_regression method prints the confidence intervals for the regression patamaters.

Parameters

confidence – Required confidence interval level, default = 0.95 (95%)

generate_expression(variable_list)[source]

The generate_expression method returns the Pyomo expression for the polynomial model trained.

The expression is constructed based on a supplied list of variables variable_list and the output of training.

Parameters

variable_list (list) – List of input variables to be used in generating expression. This can be the a list generated from the results of get_feature_vector. The user can also choose to supply a new list of the appropriate length.

Returns

Pyomo expression of the polynomial model based on the variables provided in variable_list.

Return type

Pyomo Expression

get_feature_vector()[source]

The get_feature_vector method generates the list of regression features from the column headers of the input dataset.

Returns

An indexed parameter list of the variables supplied in the original data

Return type

Pyomo IndexedParam

Example:

# Create a small dataframe with three columns ('one', 'two', 'three') and two rows (A, B)
>>> xy_data = pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])], orient='index', columns=['one', 'two', 'three'])

# Initialize the **PolynomialRegression** class and print the column headers for the variables
>>> f = PolynomialRegression(xy_data, xy_data, maximum_polynomial_order=1, multinomials=True, training_split=0.8)
>>> p = f.get_feature_vector()
>>> for i in p.keys():
>>>     print(i)
one
two
predict_output(x_data)[source]

The predict_output method generates output predictions for input data x_data based a previously generated polynomial fitting.

Parameters

x_data – Numpy array of designs for which the output is to be evaluated/predicted.

Returns

Output variable predictions based on the polynomial fit.

Return type

Numpy Array

set_additional_terms(term_list)[source]

set_additional_terms accepts additional user-defined features for consideration during regression.

Parameters

term_list (list) – List of additional terms to be considered as regression features. Each term in the list must be a Pyomo-supported intrinsic function.

Example:

# To add the sine and cosine of a variable with header 'X1' in the dataset as additional regression features:
>>> xy_data = pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])], orient='index', columns=['X1', 'X2', 'Y'])
>>> A = PolynomialRegression(xy_data, xy_data, maximum_polynomial_order=5)
>>> p = A.get_feature_vector()
>>> A.set_additional_terms([ pyo.sin(p['X1']) , pyo.cos(p['X1']) ])
training()[source]

The training method trains a polynomial model to an input dataset. It calls the core method which is called in the PolynomialRegression class (polynomial_regression_fitting). It accepts no user input, inheriting the information passed in class initialization.

Returns

Python Object (results) containing the results of the polynomial regression process including:
  • the polynomial order (self.final_polynomial_order)

  • polynomial coefficients (self.optimal_weights_array), and

  • MAE and MSE errors as well as the \(R^{2}\) (results.errors).

Return type

tuple

References:

[1] Forrester et al.’s book “Engineering Design via Surrogate Modelling: A Practical Guide”, https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470770801