Centroidal voronoi tessellation (CVT) sampling

In CVT, the generating point of each Voronoi cell coincides with its center of mass; CVT sampling locates the design samples at the centroids of each Voronoi cell in the input space. CVT sampling is a geometric, space-filling sampling method which is similar to k-means clustering in its simplest form.

The pysmo.sampling.CVTSampling method carries out CVT sampling. This can be done in two modes:

  • The samples can be selected from a user-provided dataset, or

  • The samples can be generated from a set of provided bounds.

The CVT sampling algorithm implemented here is based on McQueen’s method which involves a series of random sampling and averaging steps, see http://kmh-lanl.hansonhub.com/uncertainty/meetings/gunz03vgr.pdf.

Available Methods

class idaes.core.surrogate.pysmo.sampling.CVTSampling(data_input, number_of_samples=None, tolerance=None, sampling_type=None, xlabels=None, ylabels=None)[source]

A class that constructs Centroidal Voronoi Tessellation (CVT) samples.

CVT sampling is based on the generation of samples in which the generators of the Voronoi tessellations and the mass centroids coincide.

To use: call class with inputs, and then sample_points function.

Example:

# For the first 10 CVT samples in a 2-D space:
>>> b = rbf.CVTSampling(data_bounds, 10, tolerance = 1e-5, sampling_type="creation")
>>> samples = b.sample_points()
__init__(data_input, number_of_samples=None, tolerance=None, sampling_type=None, xlabels=None, ylabels=None)[source]

Initialization of CVTSampling class. Two inputs are required, while an optional option to control the solution accuracy may be specified.

Parameters
  • data_input (NumPy Array, Pandas Dataframe or list) – The input data set or range to be sampled. - When the aim is to select a set of samples from an existing dataset, the dataset must be a NumPy Array or a Pandas Dataframe and sampling_type option must be set to “selection”. A single output variable (y) is assumed to be supplied in the last column if xlabels and ylabels are not supplied. - When the aim is to generate a set of samples from a data range, the dataset must be a list containing two lists of equal lengths which contain the variable bounds and sampling_type option must be set to “creation”. It is assumed that the range contains no output variable information in this case.

  • number_of_samples (int) – The number of samples to be generated. Should be a positive integer less than or equal to the number of entries (rows) in data_input.

  • sampling_type (str) – Option which determines whether the algorithm selects samples from an existing dataset (“selection”) or attempts to generate sample from a supplied range (“creation”). Default is “creation”.

Keyword Arguments
  • xlabels (list) – List of column names (if data_input is a dataframe) or column numbers (if data_input is an array) for the independent/input variables. Only used in “selection” mode. Default is None.

  • ylabels (list) – List of column names (if data_input is a dataframe) or column numbers (if data_input is an array) for the dependent/output variables. Only used in “selection” mode. Default is None.

  • tolerance (float) –

    Maximum allowable Euclidean distance between centres from consectutive iterations of the algorithm. Termination condition for algorithm.

    • The smaller the value of tolerance, the better the solution but the longer the algorithm requires to converge. Default value is \(10^{-7}\).

Returns

self function containing the input information.

Raises
  • ValueError – When data_input is the wrong type.

  • IndexError – When invalid column names are supplied in xlabels or ylabels

  • Exception – When the number_of_samples is invalid (not an integer, too large, zero, negative)

  • Exception – When the tolerance specified is too loose (tolerance > 0.1) or invalid

  • warnings.warn – when the tolerance specified by the user is too tight (tolerance < \(10^{-9}\))

sample_points()[source]

The sample_points method determines the best/optimal centre points (centroids) for a data set based on the minimization of the total distance between points and centres.

Procedure based on McQueen’s algorithm: iteratively minimize distance, and re-position centroids. Centre re-calculation done as the mean of each data cluster around each centre.

Returns

A numpy array or Pandas dataframe containing the final number_of_samples centroids obtained by the CVT algorithm.

Return type

NumPy Array or Pandas Dataframe

References

[1] Loeven et al paper titled “A Probabilistic Radial Basis Function Approach for Uncertainty Quantification” https://pdfs.semanticscholar.org/48a0/d3797e482e37f73e077893594e01e1c667a2.pdf

[2] Centroidal Voronoi Tessellations: Applications and Algorithms by Qiang Du, Vance Faber, and Max Gunzburger https://doi.org/10.1137/S0036144599352836

[3] D. G. Loyola, M. Pedergnana, S. G. García, “Smart sampling and incremental function learning for very large high dimensional data” https://www.sciencedirect.com/science/article/pii/S0893608015001768?via%3Dihub