Centroidal Voronoi tessellation (CVT) sampling

In a CVT, the generating point of each Voronoi cell coincides with the cell's center of mass; CVT sampling places the design samples at the centroids of the Voronoi cells in the input space. CVT sampling is a geometric, space-filling sampling method which, in its simplest form, is similar to k-means clustering.
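The condition can be written compactly as below. This is the standard textbook statement rather than anything specific to PySMO; the domain \(\Omega\), the generators \(z_i\), the cells \(V_i\) and the density \(\rho\) are notation introduced here for illustration (for uniform space-filling sampling, \(\rho \equiv 1\)).

```latex
V_i = \{\, x \in \Omega : \|x - z_i\| \le \|x - z_j\| \ \text{for all } j \ne i \,\},
\qquad
z_i = \frac{\int_{V_i} x \, \rho(x) \, dx}{\int_{V_i} \rho(x) \, dx},
\quad i = 1, \dots, k.
```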

The pysmo.sampling.CVTSampling class carries out CVT sampling in one of two modes:

  • The samples can be selected from a user-provided dataset, or

  • The samples can be generated from a set of provided bounds.

The CVT sampling algorithm implemented here is based on MacQueen’s method, which involves a series of random sampling and averaging steps; see http://kmh-lanl.hansonhub.com/uncertainty/meetings/gunz03vgr.pdf.
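The random sampling and averaging steps can be sketched in a few lines of standard-library Python. This is a minimal illustration of the textbook running-mean update, not the PySMO implementation; the function name `macqueen_cvt` and the parameters `n_updates` and `seed` are assumptions made for this example.

```python
import math
import random


def macqueen_cvt(lower, upper, k, n_updates=20000, seed=0):
    """Approximate CVT generators in a box using running-mean updates.

    Illustrative sketch only; names and defaults are not part of PySMO.
    """
    rng = random.Random(seed)
    dim = len(lower)

    def draw():
        # Uniform random point inside the box [lower, upper]
        return [rng.uniform(lower[d], upper[d]) for d in range(dim)]

    generators = [draw() for _ in range(k)]
    counts = [1] * k  # number of points averaged into each generator so far

    for _ in range(n_updates):
        y = draw()
        # Random sampling step: find the generator closest to the new point
        i = min(range(k), key=lambda j: math.dist(generators[j], y))
        # Averaging step: update generator i as the running mean of its points
        generators[i] = [
            (counts[i] * g + yd) / (counts[i] + 1)
            for g, yd in zip(generators[i], y)
        ]
        counts[i] += 1
    return generators


centres = macqueen_cvt([0.0, 0.0], [1.0, 1.0], k=10)
```

Each generator is always an average of points drawn inside the box, so the result stays within the bounds; a fixed `seed` makes the run repeatable.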

Available Methods

class idaes.core.surrogate.pysmo.sampling.CVTSampling(data_input, number_of_samples=None, tolerance=None, sampling_type=None, xlabels=None, ylabels=None, rand_seed=None)

A class that constructs Centroidal Voronoi Tessellation (CVT) samples.

CVT sampling is based on the generation of samples in which the generators of the Voronoi tessellations and the mass centroids coincide.

To use, instantiate the class with the required inputs, then call its sample_points method.

Example:

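# For the first 10 CVT samples in a 2-D space:
>>> from idaes.core.surrogate.pysmo import sampling
>>> data_bounds = [[0, 0], [1, 1]]  # illustrative bounds, assumed order: [lower bounds, upper bounds]
>>> b = sampling.CVTSampling(data_bounds, 10, tolerance=1e-5, sampling_type="creation")
>>> samples = b.sample_points()
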
__init__(data_input, number_of_samples=None, tolerance=None, sampling_type=None, xlabels=None, ylabels=None, rand_seed=None)

Initialization of the CVTSampling class. Two inputs are required (data_input and number_of_samples); the optional tolerance argument controls the solution accuracy.

Parameters:
  • data_input (NumPy Array, Pandas Dataframe or list) – The input data set or range to be sampled.

    • When the aim is to select a set of samples from an existing dataset, the dataset must be a NumPy Array or a Pandas Dataframe and the sampling_type option must be set to “selection”. A single output variable (y) is assumed to be supplied in the last column if xlabels and ylabels are not supplied.

    • When the aim is to generate a set of samples from a data range, the dataset must be a list containing two lists of equal lengths which contain the variable bounds, and the sampling_type option must be set to “creation”. It is assumed that the range contains no output variable information in this case.

  • number_of_samples (int) – The number of samples to be generated. Should be a positive integer less than or equal to the number of entries (rows) in data_input.

  • sampling_type (str) – Option which determines whether the algorithm selects samples from an existing dataset (“selection”) or generates samples from a supplied range (“creation”). Default is “creation”.

Keyword Arguments:
  • xlabels (list) – List of column names (if data_input is a dataframe) or column numbers (if data_input is an array) for the independent/input variables. Only used in “selection” mode. Default is None.

  • ylabels (list) – List of column names (if data_input is a dataframe) or column numbers (if data_input is an array) for the dependent/output variables. Only used in “selection” mode. Default is None.

  • rand_seed (int) – Option that allows users to fix the seed of the NumPy random number generator for reproducibility (if required).

  • tolerance (float) – Maximum allowable Euclidean distance between centres from consecutive iterations of the algorithm; this is the termination condition for the algorithm. The smaller the tolerance, the better the solution, but the longer the algorithm takes to converge. Default value is \(10^{-7}\).
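The termination condition can be illustrated with a short sketch. It assumes one plausible reading of the description, namely that the run stops once no centre has moved farther than the tolerance between consecutive iterations; the helper names `max_centre_shift` and `has_converged` are hypothetical and are not part of PySMO.

```python
import math


def max_centre_shift(old_centres, new_centres):
    """Largest Euclidean distance moved by any centre between two iterations."""
    return max(math.dist(old, new) for old, new in zip(old_centres, new_centres))


def has_converged(old_centres, new_centres, tolerance=1e-7):
    # Assumed stopping rule: no centre moved farther than the tolerance
    return max_centre_shift(old_centres, new_centres) <= tolerance


previous = [[0.25, 0.25], [0.75, 0.75]]
current = [[0.25, 0.25 + 1e-8], [0.75, 0.75]]

loose = has_converged(previous, current)                   # default tolerance 1e-7
tight = has_converged(previous, current, tolerance=1e-9)   # much stricter tolerance
```

The same pair of iterates passes the default tolerance but fails the tighter one, which is why a smaller tolerance costs more iterations.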

Returns:

self object containing the input information.

Raises:
  • ValueError – The input data (data_input) is the wrong type/dimension, or number_of_samples is invalid (too large, zero, or negative)

  • ValueError – When the tolerance specified is too loose (tolerance > 0.1)

  • TypeError – When number_of_samples is not the right type, or sampling_type entry is not a string

  • IndexError – When invalid column names are supplied in xlabels or ylabels

  • Exception – When the tolerance specified is invalid

  • warnings.warn – When the tolerance specified by the user is too tight (tolerance < \(10^{-9}\))
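The documented error conditions can be summarised as a small validation routine. This is a hypothetical sketch written from the list above, not the checks PySMO actually runs: the function name `check_cvt_arguments`, the `n_rows` parameter, the rejection of unknown sampling_type strings, and the restriction of the row-count check to “selection” mode are all assumptions.

```python
import warnings


def check_cvt_arguments(number_of_samples, n_rows, tolerance=None, sampling_type=None):
    """Hypothetical checks mirroring the documented error conditions."""
    if sampling_type is None:
        sampling_type = "creation"  # documented default
    elif not isinstance(sampling_type, str):
        raise TypeError("sampling_type must be a string")
    if sampling_type not in ("creation", "selection"):
        # Assumption: unknown mode names are rejected
        raise ValueError('sampling_type must be "creation" or "selection"')

    # bool is a subclass of int, so reject it explicitly
    if isinstance(number_of_samples, bool) or not isinstance(number_of_samples, int):
        raise TypeError("number_of_samples must be an integer")
    if number_of_samples <= 0:
        raise ValueError("number_of_samples must be positive")
    # Assumption: the row limit is only enforced when selecting from a dataset
    if sampling_type == "selection" and number_of_samples > n_rows:
        raise ValueError("number_of_samples exceeds the number of rows")

    if tolerance is None:
        tolerance = 1e-7  # documented default
    elif tolerance > 0.1:
        raise ValueError("tolerance is too loose (> 0.1)")
    elif tolerance < 1e-9:
        warnings.warn("tolerance is very tight (< 1e-9); convergence may be slow")
    return number_of_samples, tolerance, sampling_type


checked = check_cvt_arguments(10, n_rows=2, tolerance=1e-5, sampling_type="creation")
```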

sample_points()

The sample_points method determines the optimal centre points (centroids) for a data set by minimizing the total distance between points and centres.

The procedure is based on MacQueen’s algorithm: it iteratively minimizes the point-to-centre distances and re-positions the centroids. Each centre is re-calculated as the mean of the data cluster around it.
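The assign-and-average loop described here can be sketched as follows. It is a simplified batch version over a fixed random point cloud, written only to show the two alternating steps; the function name `cvt_by_averaging` and the parameters `max_iter` and `seed` are assumptions, and the actual PySMO implementation may differ in detail.

```python
import math
import random


def cvt_by_averaging(points, k, tolerance=1e-7, max_iter=500, seed=0):
    """Alternate nearest-centre assignment and mean updates until centres settle."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]

    for _ in range(max_iter):
        # Assignment step: group every point with its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(centres[j], p))
            clusters[nearest].append(p)

        # Update step: move each centre to the mean of its cluster
        new_centres = []
        for centre, cluster in zip(centres, clusters):
            if cluster:
                new_centres.append([sum(col) / len(cluster) for col in zip(*cluster)])
            else:
                new_centres.append(centre)  # keep a centre that attracted no points

        # Stop once no centre has moved farther than the tolerance
        shift = max(math.dist(c, n) for c, n in zip(centres, new_centres))
        centres = new_centres
        if shift <= tolerance:
            break
    return centres


rng = random.Random(1)
cloud = [[rng.random(), rng.random()] for _ in range(2000)]
centres = cvt_by_averaging(cloud, k=5)
```

On a well-separated toy set such as the 1-D points 0, 1, 10 and 11 with k=2, the loop settles on the two cluster means, 0.5 and 10.5.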

Returns:

A numpy array or Pandas dataframe containing the final number_of_samples centroids obtained by the CVT algorithm.

Return type:

NumPy Array or Pandas Dataframe

References

[1] Loeven et al., “A Probabilistic Radial Basis Function Approach for Uncertainty Quantification” https://pdfs.semanticscholar.org/48a0/d3797e482e37f73e077893594e01e1c667a2.pdf

[2] Q. Du, V. Faber, M. Gunzburger, “Centroidal Voronoi Tessellations: Applications and Algorithms” https://doi.org/10.1137/S0036144599352836

[3] D. G. Loyola, M. Pedergnana, S. G. García, “Smart sampling and incremental function learning for very large high dimensional data” https://www.sciencedirect.com/science/article/pii/S0893608015001768?via%3Dihub