Generating Polynomial Models with PySMO¶
The pysmo.polynomial_regression method learns polynomial models from data. Presented with a small number of samples generated from experiments or computer simulations, the approach determines the most accurate polynomial approximation by comparing the accuracy and performance of polynomials of different orders and basis function forms.
pysmo.polynomial_regression considers three types of basis functions
univariate polynomials,
second-degree bivariate polynomials, and
user-specified basis functions.
Thus, for a problem with \(m\) sample points and \(n\) input variables, the resulting polynomial is of the form
Basic Usage¶
To generate a polynomial model with PySMO, the pysmo.polynomial_regression class is first initialized,
and then the method training
is called on the initialized object:
# Required imports
>>> from idaes.surrogate.pysmo import polynomial_regression
>>> import pandas as pd
# Load dataset from a csv file
>>> xy_data = pd.read_csv('data.csv', header=None, index_col=0)
# Initialize the PolynomialRegression class, extract the list of features and train the model
>>> pr_init = polynomial_regression.PolynomialRegression(xy_data, xy_data, maximum_polynomial_order=3, *kwargs)
>>> features = pr_init.get_feature_vector()
>>> pr_init.training()
xy_data is a two-dimensional python data structure containing the input and output training data. The output values MUST be in the last column.
maximum_polynomial_order refers to the maximum polynomial order to be considered when training the surrogate.
Optional Arguments
multinomials - boolean option which determines whether bivariate terms are considered in polynomial generation.
training_split - option which determines fraction of training data to be used for training (the rest will be for testing). Default is 0.8.
number_of_crossvalidations - Number of cross-validations during training. Default number is 3.
pysmo.polynomial_regression Output¶
The result of the pysmo.polynomial_regression
method is a python object containing information
about the problem set-up, the final optimal polynomial order, the polynomial coefficients and different error and quality-of-fit metrics such as
the mean-squared-error (MSE) and the \(R^{2}\) coefficient-of-fit. A Pyomo expression can be generated from the
object simply passing a list of variables into the function generate_expression:
# Create a python list from the headers of the dataset supplied for training
>>> list_vars = []
>>> for i in features.keys():
>>> list_vars.append(features[i])
# Pass list to generate_expression function to obtain a Pyomo expression as output
>>> print(pr_init.generate_expression(list_vars))
Prediction with pysmo.polynomial_regression models¶
Once a polynomial model has been trained, predictions for values at previously unsampled points :math:x_unsampled can be evaluated by calling the
predict_output()
method on the unsampled points:
# Create a python list from the headers of the dataset supplied for training
>>> y_unsampled = pr_init.predict_output(x_unsampled)
The confidence intervals for the regression paramaters may be viewed using the method confint_regression
.
Flowsheet Integration¶
The result of the polynomial training process can be passed directly into a process flowsheet as an objective or a constraint. The following code snippet demonstrates how an output polynomial model may be integrated directly into a Pyomo flowsheet as an objective:
# Required imports
>>> import pyomo.environ as pyo
>>> from idaes.surrogate.pysmo import polynomial_regression
>>> import pandas as pd
# Create a Pyomo model
>>> m = pyo.ConcreteModel()
>>> i = pyo.Set(initialize=[1, 2])
# Create a Pyomo variable with indexed by the 2D-set i with initial values {0, 0}
>>> init_x = {1: 0, 2: 0}
>>> def x_init(m, i):
>>> return (init_x[i])
>>> m.x = pyo.Var(i, initialize=x_init)
# Train a simple polynomial model on data available in csv format, resulting in the Python object polyfit
>>> xy_data = pd.read_csv('data.csv', header=None, index_col=0)
>>> pr_init = polynomial_regression.PolynomialRegression(xy_data, xy_data, maximum_polynomial_order=3)
>>> features = pr_init.get_feature_vector()
>>> polyfit = pr_init.training()
# Use the resulting polynomial as an objective, passing in the Pyomo variable x
>>> m.obj = pyo.Objective(expr=polyfit.generate_expression([m.x[1], m.x[2]]))
# Solve the model
>>> instance = m
>>> opt = pyo.SolverFactory("ipopt")
>>> result = opt.solve(instance, tee=True)
Further details about pysmo.polynomial_regression may be found by consulting the examples or reading the paper […]
Available Methods¶
- class idaes.surrogate.pysmo.polynomial_regression.FeatureScaling[source]¶
A class for scaling and unscaling input and output data. The class contains two main methods:
data_scaling
anddata_unscaling
- static data_scaling(data)[source]¶
data_scaling
performs column-wise minimax scaling on the input dataset.- Parameters
data – The input data set to be scaled. Must be a numpy array or dataframe.
- Returns
- tuple containing:
scaled_data : A 2-D Numpy Array containing the scaled data. All array values will be between [0, 1].
data_minimum : A 2-D row vector containing the column-wise minimums of the input data.
data_maximum : A 2-D row vector containing the column-wise maximums of the input data.
- Return type
(tuple)
- Raises
TypeError – Raised when the input data is not a numpy array or dataframe
- static data_unscaling(x_scaled, x_min, x_max)[source]¶
data_unscaling
performs column-wise un-scaling on the a minmax-scaled input dataset.- Parameters
x_scaled (NumPy Array) – Data to be un-scaled. Data values should be between 0 and 1.
x_min (NumPy vector) – \(n \times 1\) vector containing the actual minimum value for each column. Must contain same number of elements as the number of columns in x_scaled.
x_max (NumPy vector) – \(n \times 1\) vector vector containing the actual minimum value for each column. Must contain same number of elements as the number of columns in x_scaled.
- Returns
A 2-D numpy array containing the scaled data, \(x_{min} + x_{scaled} * (x_{max} - x_{min})\)
- Return type
NumPy Array
- Raises
IndexError – Raised when the dimensions of the arrays are inconsistent.
- class idaes.surrogate.pysmo.polynomial_regression.PolynomialRegression(original_data_input, regression_data_input, maximum_polynomial_order, number_of_crossvalidations=None, no_adaptive_samples=None, training_split=None, max_fraction_training_samples=None, max_iter=None, solution_method=None, multinomials=None, fname=None, overwrite=False)[source]¶
The PolynomialRegression class performs polynomial regression on a training data set.
The class must first be initialized by calling PolynomialRegression. Regression is then carried out by calling
training
.- For a given dataset with \(n\) features \(x_{1},x_{2},\ldots,x_{n}\), Polyregression is able to consider three types of basis functions:
Mononomial terms (\(x_{i}^{p},p \leq 10\)) for all individual features. The maximum degree to be considered can be set by the user (maximum_polynomial_order)
All first order interaction terms \(x_{1}x_{2}\), \(x_{1}x_{3}\) etc. This can be turned on or off by the user (set multinomials)
User defined input features, e.g. \(\sin(x_{1})\). These must be Pyomo functions and should be provided as a list by the user calling
set_additional_terms
method before the polynomial training is done.
Example:
# Initialize the class and set additional terms >>> d = PolynomialRegression(full_data, training_data, maximum_polynomial_order=2, max_iter=20, multinomials=1, solution_method='pyomo') >>> p = d.get_feature_vector() >>> d.set_additional_terms([...extra terms...]) # Train polynomial model and predict output for an test data x_test >>> d.training() >>> predictions = d.predict_output(x_test)
- Parameters
regression_data_input (NumPy Array of Pandas Dataframe) – The dataset for regression training. It is expected to contain the features (X) and output (Y) data, with the output values (Y) in the last column.
original_data_input (NumPy Array of Pandas Dataframe) – If regression_data_input was drawn from a larger dataset by some sampling approach, the larger dataset may be provided here. When additional data is not available, the same data supplied for training_data can be supplied - this tells the algorithm not to carry out adaptive sampling.
maximum_polynomial_order (int) – The maximum polynomial order to be considered.
Further details about the optional inputs may be found under the
__init__
method.- __init__(original_data_input, regression_data_input, maximum_polynomial_order, number_of_crossvalidations=None, no_adaptive_samples=None, training_split=None, max_fraction_training_samples=None, max_iter=None, solution_method=None, multinomials=None, fname=None, overwrite=False)[source]¶
Initialization of PolynomialRegression class.
- Parameters
regression_data_input (NumPy Array of Pandas Dataframe) – The dataset for regression training. It is expected to contain features and output data, with the output values (Y) in the last column.
original_data_input (NumPy Array of Pandas Dataframe) – If regression_data_input was drawn from a larger dataset by some sampling approach, the larger dataset may be provided here.
maximum_polynomial_order (int) – The maximum polynomial order to be considered.
- Keyword Arguments
number_of_crossvalidations (int) – The number of polynomial fittings and cross-validations to be carried out for each polynomial function/expression. Must be a positive, non-zero integer. Default=3.
training_split (float) – The training/test split to be used for regression_data_input. Must be between 0 and 1. Default = 0.75
solution_method (str) –
The method to be used for solving the least squares optimization problem for polynomial regression. Three options are available:
”MLE” : The mle (maximum likelihood estimate) method solves the least squares problem using linear algebra. Details of the method may be found in Forrester et al.
”BFGS” : This approach solves the least squares problem using scipy’s BFGS algorithm.
”pyomo”: This option solves the optimization problem in pyomo with IPOPT as solver. This is the default option.
multinomials (bool) – This option determines whether or not multinomial terms are considered during polynomial fitting. Takes 0 for No and 1 for Yes. Default = 1.
- Returns
self object containing all the input information.
- Raises
The input datasets (original_data_input or regression_data_input) are of the wrong type (not Numpy arrays or Pandas Dataframes)
maximum_polynomial_order is not a positive, non-zero integer or maximum_polynomial_order is higher than the number of training samples available
solution_method is not ‘mle’, ‘pyomo’ or ‘bfgs
multinomials is not binary (0 or 1)
training_split is not between 0 and 1
number_of_crossvalidations is not a positive, non-zero integer
max_fraction_training_samples is not between 0 and 1
no_adaptive_samples is not a positive, non-zero integer
max_iter is not a positive, non-zero integer
warnings.warn –
When the number of cross-validations is too high, i.e. number_of_crossvalidations > 10
- confint_regression(confidence=0.95)[source]¶
The
confint_regression
method prints the confidence intervals for the regression patamaters.- Parameters
confidence – Required confidence interval level, default = 0.95 (95%)
- generate_expression(variable_list)[source]¶
The
generate_expression
method returns the Pyomo expression for the polynomial model trained.The expression is constructed based on a supplied list of variables variable_list and the output of
training
.- Parameters
variable_list (list) – List of input variables to be used in generating expression. This can be the a list generated from the results of
get_feature_vector
. The user can also choose to supply a new list of the appropriate length.- Returns
Pyomo expression of the polynomial model based on the variables provided in variable_list.
- Return type
Pyomo Expression
- get_feature_vector()[source]¶
The
get_feature_vector
method generates the list of regression features from the column headers of the input dataset.- Returns
An indexed parameter list of the variables supplied in the original data
- Return type
Pyomo IndexedParam
Example:
# Create a small dataframe with three columns ('one', 'two', 'three') and two rows (A, B) >>> xy_data = pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])], orient='index', columns=['one', 'two', 'three']) # Initialize the **PolynomialRegression** class and print the column headers for the variables >>> f = PolynomialRegression(xy_data, xy_data, maximum_polynomial_order=1, multinomials=True, training_split=0.8) >>> p = f.get_feature_vector() >>> for i in p.keys(): >>> print(i) one two
- predict_output(x_data)[source]¶
The
predict_output
method generates output predictions for input data x_data based a previously generated polynomial fitting.- Parameters
x_data – Numpy array of designs for which the output is to be evaluated/predicted.
- Returns
Output variable predictions based on the polynomial fit.
- Return type
Numpy Array
- set_additional_terms(term_list)[source]¶
set_additional_terms
accepts additional user-defined features for consideration during regression.- Parameters
term_list (list) – List of additional terms to be considered as regression features. Each term in the list must be a Pyomo-supported intrinsic function.
Example:
# To add the sine and cosine of a variable with header 'X1' in the dataset as additional regression features: >>> xy_data = pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])], orient='index', columns=['X1', 'X2', 'Y']) >>> A = PolynomialRegression(xy_data, xy_data, maximum_polynomial_order=5) >>> p = A.get_feature_vector() >>> A.set_additional_terms([ pyo.sin(p['X1']) , pyo.cos(p['X1']) ])
- training()[source]¶
The
training
method trains a polynomial model to an input dataset. It calls the core method which is called in the PolynomialRegression class (polynomial_regression_fitting). It accepts no user input, inheriting the information passed in class initialization.- Returns
- Python Object (results) containing the results of the polynomial regression process including:
the polynomial order (self.final_polynomial_order)
polynomial coefficients (self.optimal_weights_array), and
MAE and MSE errors as well as the \(R^{2}\) (results.errors).
- Return type
References:¶
[1] Forrester et al.’s book “Engineering Design via Surrogate Modelling: A Practical Guide”, https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470770801