Custom Sampling

With this method, users can explicitly define the distribution used to sample each input variable.

The pysmo.sampling.CustomSampling method carries out the user-defined sampling strategy. This can be done in two modes:

  • The samples can be selected from a user-provided dataset, or

  • The samples can be generated from a set of provided bounds.

We currently support three distribution options for sampling:

  • “random”, for sampling from a random distribution.

  • “uniform”, for sampling from a uniform distribution.

  • “normal”, for sampling from a normal (i.e. Gaussian) distribution.
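As a rough illustration of what per-variable sampling from bounds involves, the sketch below draws each column from its own distribution using NumPy. It is not PySMO's implementation: the function name sample_from_bounds is hypothetical, and treating “random” and “uniform” as the same uniform draw between the bounds is an assumption made here for simplicity. For “normal”, the bounds are read as the mean minus and plus three standard deviations, as described in the note on Gaussian-based sampling.

```python
import numpy as np


def sample_from_bounds(lower, upper, distributions, n_samples, seed=None):
    """Hypothetical helper: draw n_samples points, one column per variable."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if not (len(lower) == len(upper) == len(distributions)):
        raise ValueError("lower, upper and distributions must have equal lengths")

    rng = np.random.default_rng(seed)
    columns = []
    for lb, ub, dist in zip(lower, upper, distributions):
        if dist in ("random", "uniform"):
            # Assumption: both options are treated as uniform draws on [lb, ub).
            col = rng.uniform(lb, ub, size=n_samples)
        elif dist == "normal":
            # Bounds are interpreted as mean -/+ 3 sigma, so recover both.
            mean = 0.5 * (lb + ub)
            sigma = (ub - lb) / 6.0
            col = rng.normal(mean, sigma, size=n_samples)
        else:
            raise ValueError(f"Unsupported distribution: {dist!r}")
        columns.append(col)
    # Stack the per-variable columns into an (n_samples, n_variables) array.
    return np.column_stack(columns)


points = sample_from_bounds(
    [0.0, 10.0], [1.0, 40.0], ["uniform", "normal"], 2000, seed=0
)
```

Each column of points follows the distribution named at the same position in the list, which mirrors the one-distribution-per-variable convention described above.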

Warning

A note on Gaussian-based sampling

To remain consistent with the other sampling methods and distributions, normal distributions are specified by their bounds rather than by the mean (\(\bar{x}\)) and standard deviation (\(\sigma\)). For a normal distribution, about 99.7% of the samples fall within three standard deviations of the mean. Thus, the bounds of the distribution may be computed as:

\[\begin{equation} LB = \bar{x} - 3\sigma \end{equation}\]
\[\begin{equation} UB = \bar{x} + 3\sigma \end{equation}\]

While almost all of the points generated will typically fall between LB and UB, a few points may be generated outside the bounds (as should be expected from a normal distribution). Users can choose to enforce the bounds as hard constraints by setting the boolean option strictly_enforce_gaussian_bounds to True during initialization. In that case, values exceeding the bounds are replaced by new values generated from the distribution. Note that this may affect the underlying distribution.
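The relationship between the bounds and the mean and standard deviation can be checked numerically. The sketch below computes LB and UB from an assumed mean of 25 and standard deviation of 5, measures the fraction of unconstrained normal samples that fall inside the bounds, and then illustrates one way of enforcing the bounds by redrawing out-of-bounds values. The helper names are hypothetical, and the redraw loop is only an illustration of the idea described above, not PySMO's actual code.

```python
import numpy as np


def gaussian_bounds(mean, sigma):
    """Return (LB, UB) = (mean - 3*sigma, mean + 3*sigma)."""
    return mean - 3.0 * sigma, mean + 3.0 * sigma


def sample_normal_within_bounds(lb, ub, n_samples, seed=None):
    """Hypothetical helper: normal samples with out-of-bounds values redrawn."""
    rng = np.random.default_rng(seed)
    mean = 0.5 * (lb + ub)
    sigma = (ub - lb) / 6.0
    values = rng.normal(mean, sigma, size=n_samples)
    outside = (values < lb) | (values > ub)
    # Redraw only the offending entries until every value lies in [lb, ub].
    while outside.any():
        values[outside] = rng.normal(mean, sigma, size=int(outside.sum()))
        outside = (values < lb) | (values > ub)
    return values


lb, ub = gaussian_bounds(mean=25.0, sigma=5.0)

# Without enforcement, a small fraction of samples lies outside the bounds.
unconstrained = np.random.default_rng(1).normal(25.0, 5.0, size=100_000)
fraction_inside = np.mean((unconstrained >= lb) & (unconstrained <= ub))

# With enforcement, every sample lies inside the bounds.
strict = sample_normal_within_bounds(lb, ub, 100_000, seed=1)
```

Redrawing truncates the tails, so the enforced samples follow a truncated normal distribution with a slightly smaller spread than the original one.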

Available Methods

class idaes.core.surrogate.pysmo.sampling.CustomSampling(data_input, number_of_samples=None, list_of_distributions=None, sampling_type=None, xlabels=None, ylabels=None, strictly_enforce_gaussian_bounds=False, rand_seed=None)

A class that performs custom sampling per dimension as specified by the user. The distribution to be applied per dimension must be specified by the user.

  • The distribution to be used per variable needs to be specified in a list.

  • Users are urged to visit the documentation for more information about normal distribution-based sampling.

To use: instantiate the class with the inputs, then call the sample_points method.

Example:

# To select 50 samples from a dataset:
>>> b = rbf.CustomSampling(data, 50, list_of_distributions=['normal', 'uniform'], sampling_type="selection")
>>> samples = b.sample_points()

Note

To remain consistent with the other sampling methods and distributions, bounds are required for specifying normal distributions, rather than the mean and standard deviation.

Given the mean (\(\bar{x}\)) and standard deviation (\(\sigma\)), the bounds of the normal distribution may be computed as:

Lower bound = \(\bar{x} - 3\sigma\) ; Upper bound = \(\bar{x} + 3\sigma\)

Users should visit the documentation for more information.

__init__(data_input, number_of_samples=None, list_of_distributions=None, sampling_type=None, xlabels=None, ylabels=None, strictly_enforce_gaussian_bounds=False, rand_seed=None)

Initialization of CustomSampling class. Four inputs are required.

Parameters:
  • data_input (NumPy Array, Pandas Dataframe or list) –

    The input data set or range to be sampled.

    • When the aim is to select a set of samples from an existing dataset, the dataset must be a NumPy Array or a Pandas Dataframe and the sampling_type option must be set to “selection”. A single output variable (y) is assumed to be supplied in the last column if xlabels and ylabels are not supplied.

    • When the aim is to generate a set of samples from a data range, the dataset must be a list containing two lists of equal lengths which contain the variable bounds, and the sampling_type option must be set to “creation”. It is assumed that the range contains no output variable information in this case.

  • number_of_samples (int) – The number of samples to be generated. Should be a positive integer less than or equal to the number of entries (rows) in data_input.

  • list_of_distributions (list) – The list containing the probability distribution for each variable. The length of the list must match the number of input (i.e. independent) variables to be sampled. We currently support random, uniform and normal (i.e. Gaussian) distributions.

  • sampling_type (str) – Option which determines whether the algorithm selects samples from an existing dataset (“selection”) or attempts to generate samples from a supplied range (“creation”). Default is “creation”.

Keyword Arguments:
  • xlabels (list) – List of column names (if data_input is a dataframe) or column numbers (if data_input is an array) for the independent/input variables. Only used in “selection” mode. Default is None.

  • ylabels (list) – List of column names (if data_input is a dataframe) or column numbers (if data_input is an array) for the dependent/output variables. Only used in “selection” mode. Default is None.

  • rand_seed (int) – Option that allows users to fix the numpy random seed generator for reproducibility (if required).

  • strictly_enforce_gaussian_bounds (bool) – Boolean specifying whether the provided bounds for normal distributions should be strictly enforced. Note that selecting this option may affect the underlying distribution. Default is False.

Returns:

The initialized object (self) containing the input information

Raises:
  • ValueError – When the input data (data_input) is of the wrong type or dimension, number_of_samples is invalid (too large, zero, or negative), list_of_distributions is the wrong length, or an unsupported distribution is supplied in list_of_distributions.

  • TypeError – When number_of_samples is not an integer, list_of_distributions is not a list, or the sampling_type entry is not a string.

  • IndexError – When invalid column names are supplied in xlabels or ylabels
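For “creation” mode, the parameters above can be put together roughly as follows. This is a minimal sketch rather than an official example: the helper create_samples is hypothetical, the ordering of data_input as [lower bounds, upper bounds] is an assumption, and the mean and standard deviation values are purely illustrative. The import path and keyword names are taken from the class signature above. The import is placed inside the function so that the bound calculation can be run without IDAES installed.

```python
# First variable: normal, with an illustrative mean of 25 and sigma of 5,
# converted to bounds as mean -/+ 3 sigma. Second variable: uniform on [0, 1].
mean, sigma = 25.0, 5.0
lower_bounds = [mean - 3.0 * sigma, 0.0]
upper_bounds = [mean + 3.0 * sigma, 1.0]

# Assumption: data_input is ordered as [lower bounds, upper bounds].
data_input = [lower_bounds, upper_bounds]
distributions = ["normal", "uniform"]


def create_samples(data_input, n_samples, distributions, seed=None):
    """Hypothetical wrapper around CustomSampling in 'creation' mode."""
    # Imported lazily; requires an IDAES installation.
    from idaes.core.surrogate.pysmo.sampling import CustomSampling

    sampler = CustomSampling(
        data_input,
        number_of_samples=n_samples,
        list_of_distributions=distributions,
        sampling_type="creation",
        strictly_enforce_gaussian_bounds=True,
        rand_seed=seed,
    )
    return sampler.sample_points()
```

With IDAES installed, calling create_samples(data_input, 50, distributions, seed=42) would be expected to return 50 samples with one column per variable.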