exarl.candlelib.uq_utils

Module Contents

Functions

generate_index_distribution(numTrain, numTest, numValidation, params)

Generates a vector of indices to partition the data for training.

generate_index_distribution_from_fraction(numTrain, numTest, numValidation, params)

Generates a vector of indices to partition the data for training.

generate_index_distribution_from_blocks(numTrain, numTest, numValidation, params)

Generates a vector of indices to partition the data for training.

generate_index_distribution_from_block_list(numTrain, numTest, numValidation, params)

Generates a vector of indices to partition the data for training.

compute_limits(numdata, numblocks, blocksize, blockn)

Generates the limit of indices corresponding to a specific block.

fill_array(blocklist, maxsize, numdata, numblocks, blocksize)

Fills a new array of integers with the indices corresponding to the specified block structure.

compute_statistics_homoscedastic(df_data, col_true=0, col_pred=6, col_std_pred=7)

Extracts ground truth, mean prediction, error and standard deviation of prediction from inference data frame.

compute_statistics_homoscedastic_all(df_data, col_true=4, col_pred_start=6)

Extracts ground truth, mean prediction, error and standard deviation of prediction from inference data frame.

compute_statistics_heteroscedastic(df_data, col_true=4, col_pred_start=6, col_std_pred_start=7)

Extracts ground truth, mean prediction, error, standard deviation of prediction and predicted (learned) standard deviation from inference data frame.

compute_statistics_quantile(df_data, sigma_divisor=2.56, col_true=4, col_pred_start=6)

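Extracts ground truth, 50th percentile mean prediction, low and high percentile mean predictions, error, and standard deviations from the inference data frame.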

split_data_for_empirical_calibration(Ytrue, Ypred, sigma, cal_split=0.8)

Extracts a portion of the arrays provided for the computation of the calibration and reserves the remainder for testing.

compute_empirical_calibration(pSigma_cal, pPred_cal, true_cal, bins, coverage_percentile)

Use the arrays provided to estimate an empirical mapping between standard deviation and absolute value of error.

bining_for_calibration(pSigma_cal_ordered_, minL_sigma, maxL_sigma, Er_vect_cal_orderedSigma_, bins, coverage_percentile)

Bin the values of the standard deviations observed during inference and estimate a specified coverage percentile in the absolute error.

computation_of_valid_calibration_interval(error_thresholds, error_thresholds_smooth, err_err)

Function that estimates the empirical range in which a monotonic relation is observed between standard deviation and coverage of absolute value of error.

applying_calibration(pSigma_test, pPred_test, true_test, s_interpolate, minL_sigma_auto, maxL_sigma_auto)

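Use the empirical mapping between standard deviation and absolute value of error estimated during calibration to estimate the error for the portion reserved for testing.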

overprediction_check(yp_test, eabs_red)

Compute the percentage of overestimated absolute error predictions for the arrays reserved for calibration testing.

exarl.candlelib.uq_utils.generate_index_distribution(numTrain, numTest, numValidation, params)

Generates a vector of indices to partition the data for training. NO CHECKING IS DONE: it is assumed that the data could be partitioned in the specified blocks and that the block indices describe a coherent partition.

Parameters
  • numTrain (int) – Number of training data points

  • numTest (int) – Number of testing data points

  • numValidation (int) – Number of validation data points (may be zero)

  • params (dictionary with parameters) – Contains the keywords that control the behavior of the function (uq_train_fr, uq_valid_fr, uq_test_fr for fraction specification, uq_train_vec, uq_valid_vec, uq_test_vec for block list specification, and uq_train_bks, uq_valid_bks, uq_test_bks for block number specification)

Returns

  • indexTrain (int numpy array) – Indices for data in training

  • indexValidation (int numpy array) – Indices for data in validation (if any)

  • indexTest (int numpy array) – Indices for data in testing (if merging)
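The three keyword families accepted in params can be illustrated with a small sketch. The helper `pick_partition_mode` is hypothetical and not part of `uq_utils`; how the real function chooses between the families, and the exact value types it expects, are assumptions here.

```python
def pick_partition_mode(params):
    """Hypothetical dispatcher: report which keyword family is fully present."""
    families = {
        "fraction": ("uq_train_fr", "uq_valid_fr", "uq_test_fr"),
        "block_list": ("uq_train_vec", "uq_valid_vec", "uq_test_vec"),
        "block_count": ("uq_train_bks", "uq_valid_bks", "uq_test_bks"),
    }
    for mode, keys in families.items():
        if all(key in params for key in keys):
            return mode
    raise KeyError("no complete set of uq_* partition keywords found")


# Illustrative values only: fractions, lists of block indices, block counts.
by_fraction = {"uq_train_fr": 0.8, "uq_valid_fr": 0.1, "uq_test_fr": 0.1}
by_block_list = {"uq_train_vec": [0, 1, 2], "uq_valid_vec": [3], "uq_test_vec": [4]}
by_block_count = {"uq_train_bks": 3, "uq_valid_bks": 1, "uq_test_bks": 1}
modes = [pick_partition_mode(p) for p in (by_fraction, by_block_list, by_block_count)]
```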

exarl.candlelib.uq_utils.generate_index_distribution_from_fraction(numTrain, numTest, numValidation, params)

Generates a vector of indices to partition the data for training. It checks that the fractions provided lie in the interval (0, 1) and add up to 1.

Parameters
  • numTrain (int) – Number of training data points

  • numTest (int) – Number of testing data points

  • numValidation (int) – Number of validation data points (may be zero)

  • params (dictionary with parameters) – Contains the keywords that control the behavior of the function (uq_train_fr, uq_valid_fr, uq_test_fr)

Returns

  • indexTrain (int numpy array) – Indices for data in training

  • indexValidation (int numpy array) – Indices for data in validation (if any)

  • indexTest (int numpy array) – Indices for data in testing (if merging)
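A fraction-based partition can be pictured with the minimal sketch below. The function `split_by_fraction` is hypothetical and not the library code; it assumes that the pool is the sum of the three counts, that indices are shuffled before being cut, and that a fraction of exactly 0 (for example, no validation set) is accepted.

```python
import numpy as np


def split_by_fraction(num_train, num_test, num_validation, params, seed=0):
    """Hypothetical helper: split shuffled indices by the requested fractions."""
    fr_train = params.get("uq_train_fr", 0.0)
    fr_valid = params.get("uq_valid_fr", 0.0)
    fr_test = params.get("uq_test_fr", 0.0)
    fractions = (fr_train, fr_valid, fr_test)

    # Assumed validation: every fraction lies in [0, 1] and they sum to 1.
    if any(f < 0.0 or f > 1.0 for f in fractions):
        raise ValueError("fractions must lie between 0 and 1")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must add up to 1")

    total = num_train + num_test + num_validation
    size_train = int(round(fr_train * total))
    size_valid = int(round(fr_valid * total))

    # Shuffle once, then cut the permutation into three consecutive pieces.
    perm = np.random.default_rng(seed).permutation(total)
    index_train = perm[:size_train]
    index_valid = perm[size_train:size_train + size_valid]
    index_test = perm[size_train + size_valid:]
    return index_train, index_valid, index_test


params = {"uq_train_fr": 0.7, "uq_valid_fr": 0.1, "uq_test_fr": 0.2}
idx_train, idx_valid, idx_test = split_by_fraction(60, 30, 10, params)
```

Because the three pieces come from one permutation, they are disjoint and together cover every index exactly once.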

exarl.candlelib.uq_utils.generate_index_distribution_from_blocks(numTrain, numTest, numValidation, params)

Generates a vector of indices to partition the data for training. NO CHECKING IS DONE: it is assumed that the data could be partitioned in the specified block quantities and that the block quantities describe a coherent partition.

Parameters
  • numTrain (int) – Number of training data points

  • numTest (int) – Number of testing data points

  • numValidation (int) – Number of validation data points (may be zero)

  • params (dictionary with parameters) – Contains the keywords that control the behavior of the function (uq_train_bks, uq_valid_bks, uq_test_bks)

Returns

  • indexTrain (int numpy array) – Indices for data in training

  • indexValidation (int numpy array) – Indices for data in validation (if any)

  • indexTest (int numpy array) – Indices for data in testing (if merging)

exarl.candlelib.uq_utils.generate_index_distribution_from_block_list(numTrain, numTest, numValidation, params)

Generates a vector of indices to partition the data for training. NO CHECKING IS DONE: it is assumed that the data could be partitioned in the specified list of blocks and that the block indices describe a coherent partition.

Parameters
  • numTrain (int) – Number of training data points

  • numTest (int) – Number of testing data points

  • numValidation (int) – Number of validation data points (may be zero)

  • params (dictionary with parameters) – Contains the keywords that control the behavior of the function (uq_train_vec, uq_valid_vec, uq_test_vec)

Returns

  • indexTrain (int numpy array) – Indices for data in training

  • indexValidation (int numpy array) – Indices for data in validation (if any)

  • indexTest (int numpy array) – Indices for data in testing (if merging)

exarl.candlelib.uq_utils.compute_limits(numdata, numblocks, blocksize, blockn)

Generates the limits of the indices corresponding to a specific block. When numdata is not exactly divisible by numblocks, the last block takes the remainder.

Parameters
  • numdata (int) – Total number of data points to distribute

  • numblocks (int) – Total number of blocks to distribute into

  • blocksize (int) – Size of data per block

  • blockn (int) – Index of block, from 0 to numblocks-1

Returns

  • start (int) – Position to start assigning indices

  • end (int) – One beyond position to stop assigning indices
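The rule described above can be restated as a short sketch. The function `block_limits` is a hypothetical re-implementation for illustration, and taking blocksize as the integer quotient of numdata by numblocks is an assumption.

```python
def block_limits(numdata, numblocks, blocksize, blockn):
    """Hypothetical restatement of the limits rule described above."""
    start = blockn * blocksize
    end = start + blocksize
    # The last block absorbs the remainder when numdata is not an exact
    # multiple of numblocks.
    if blockn == numblocks - 1:
        end = numdata
    return start, end


numdata, numblocks = 10, 3
blocksize = numdata // numblocks  # 3
limits = [block_limits(numdata, numblocks, blocksize, b) for b in range(numblocks)]
print(limits)  # [(0, 3), (3, 6), (6, 10)]
```

Since end is one beyond the last index, each pair can be passed directly to `range(start, end)` or used as a slice.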

exarl.candlelib.uq_utils.fill_array(blocklist, maxsize, numdata, numblocks, blocksize)

Fills a new array of integers with the indices corresponding to the specified block structure.

Parameters
  • blocklist (list) – List of integers describing the block indices that go into the array

  • maxsize (int) – Maximum possible length for the partition (the size of the common block size plus the remainder, if any).

  • numdata (int) – Total number of data points to distribute

  • numblocks (int) – Total number of blocks to distribute into

  • blocksize (int) – Size of data per block

Returns

indexArray (int numpy array) – Indices for specific data partition. Resizes the array to the correct length.
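As a rough illustration of the fill-then-resize behavior, the sketch below preallocates maxsize entries, writes the indices of each listed block, and trims the result. The function `fill_block_indices` is hypothetical, and the last-block rule it uses is assumed to match the one described for compute_limits.

```python
import numpy as np


def fill_block_indices(blocklist, maxsize, numdata, numblocks, blocksize):
    """Hypothetical sketch: gather the indices of the listed blocks."""
    index_array = np.zeros(maxsize, dtype=int)
    filled = 0
    for blockn in blocklist:
        start = blockn * blocksize
        # Assumed rule: the last block extends to the end of the data.
        end = numdata if blockn == numblocks - 1 else start + blocksize
        length = end - start
        index_array[filled:filled + length] = np.arange(start, end)
        filled += length
    # Trim the preallocated array to the number of indices actually written.
    return index_array[:filled]


# 10 points in 3 blocks of size 3; the last block holds 4 points.
indices = fill_block_indices([0, 2], maxsize=7, numdata=10, numblocks=3, blocksize=3)
print(indices.tolist())  # [0, 1, 2, 6, 7, 8, 9]
```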

exarl.candlelib.uq_utils.compute_statistics_homoscedastic(df_data, col_true=0, col_pred=6, col_std_pred=7)

Extracts ground truth, mean prediction, error and standard deviation of prediction from inference data frame. The latter includes the statistics over all the inference realizations.

Parameters
  • df_data (pandas data frame) – Data frame generated by current CANDLE inference experiments. Indices are hard coded to agree with current CANDLE version. (The inference file usually has the name: <model>_pred.tsv).

  • col_true (integer) – Index of the column in the data frame where the true value is stored (Default: 0, index in current CANDLE format).

  • col_pred (integer) – Index of the column in the data frame where the predicted value is stored (Default: 6, index in current CANDLE format).

  • col_std_pred (integer) – Index of the column in the data frame where the standard deviation of the predicted values is stored (Default: 7, index in current CANDLE format).

Returns

  • Ytrue (numpy array) – Array with true (observed) values

  • Ypred (numpy array) – Array with predicted values.

  • yerror (numpy array) – Array with errors computed (observed - predicted).

  • sigma (numpy array) – Array with standard deviations learned with deep learning model. For homoscedastic inference this corresponds to the std value computed from prediction (and is equal to the following returned variable).

  • Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.

  • pred_name (string) – Name of data column or quantity predicted (as extracted from the data frame using the col_true index).

exarl.candlelib.uq_utils.compute_statistics_homoscedastic_all(df_data, col_true=4, col_pred_start=6)

Extracts ground truth, mean prediction, error and standard deviation of prediction from inference data frame. The latter includes all the individual inference realizations.

Parameters
  • df_data (pandas data frame) – Data frame generated by current CANDLE inference experiments. Indices are hard coded to agree with current CANDLE version. (The inference file usually has the name: <model>.predicted_INFER.tsv).

  • col_true (integer) – Index of the column in the data frame where the true value is stored (Default: 4, index in current HOM format).

  • col_pred_start (integer) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored (Default: 6 index, in current HOM format).

Returns

  • Ytrue (numpy array) – Array with true (observed) values

  • Ypred (numpy array) – Array with predicted values.

  • yerror (numpy array) – Array with errors computed (observed - predicted).

  • sigma (numpy array) – Array with standard deviations learned with deep learning model. For homoscedastic inference this corresponds to the std value computed from prediction (and is equal to the following returned variable).

  • Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.

  • pred_name (string) – Name of data column or quantity predicted (as extracted from the data frame using the col_true index).

exarl.candlelib.uq_utils.compute_statistics_heteroscedastic(df_data, col_true=4, col_pred_start=6, col_std_pred_start=7)

Extracts ground truth, mean prediction, error, standard deviation of prediction and predicted (learned) standard deviation from inference data frame. The latter includes all the individual inference realizations.

Parameters
  • df_data (pandas data frame) – Data frame generated by current heteroscedastic inference experiments. Indices are hard coded to agree with current version. (The inference file usually has the name: <model>.predicted_INFER_HET.tsv).

  • col_true (integer) – Index of the column in the data frame where the true value is stored (Default: 4, index in current HET format).

  • col_pred_start (integer) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored and are interspaced with standard deviation predictions (Default: 6 index, step 2, in current HET format).

  • col_std_pred_start (integer) – Index of the column in the data frame where the first predicted standard deviation value is stored. All the predicted values during inference are stored and are interspaced with predictions (Default: 7 index, step 2, in current HET format).

Returns

  • Ytrue (numpy array) – Array with true (observed) values

  • Ypred (numpy array) – Array with predicted values.

  • yerror (numpy array) – Array with errors computed (observed - predicted).

  • sigma (numpy array) – Array with standard deviations learned with deep learning model. For homoscedastic inference this corresponds to the std value computed from prediction (and is equal to the following returned variable).

  • Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.

  • pred_name (string) – Name of data column or quantity predicted (as extracted from the data frame using the col_true index).
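The interleaved column layout (predictions and standard deviations alternating with step 2) can be sketched on a plain NumPy array. The toy table and variable names are illustrative assumptions; the real function operates on a pandas data frame, and how it turns the learned standard deviation columns into sigma is not stated here, so the sketch leaves them untouched.

```python
import numpy as np

# Toy table: columns 0-5 are metadata (column 4 is the true value), followed by
# two realizations stored as (prediction, std) pairs in columns 6-9.
table = np.array([
    [0, 0, 0, 0, 1.0, 0, 0.9, 0.2, 1.1, 0.4],
    [0, 0, 0, 0, 2.0, 0, 2.2, 0.3, 2.4, 0.5],
])
col_true, col_pred_start, col_std_pred_start = 4, 6, 7

y_true = table[:, col_true]
preds = table[:, col_pred_start::2]       # columns 6 and 8
stds = table[:, col_std_pred_start::2]    # columns 7 and 9

y_pred = preds.mean(axis=1)       # mean prediction per sample
y_pred_std = preds.std(axis=1)    # spread across realizations
y_error = y_true - y_pred         # observed - predicted
```

With a data frame, the same selection would typically be written with positional slicing such as `df.iloc[:, 6::2]`.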

exarl.candlelib.uq_utils.compute_statistics_quantile(df_data, sigma_divisor=2.56, col_true=4, col_pred_start=6)

Extracts ground truth, 50th percentile mean prediction, low percentile and high percentile mean prediction (usually 10th percentile and 90th percentile respectively), error (using 50th percentile), standard deviation of prediction (using 50th percentile) and predicted (learned) standard deviation from interdecile range in inference data frame. The latter includes all the individual inference realizations.

Parameters
  • df_data (pandas data frame) – Data frame generated by current quantile inference experiments. Indices are hard coded to agree with current version. (The inference file usually has the name: <model>.predicted_INFER_QTL.tsv).

  • sigma_divisor (float) – Divisor to convert from the interdecile range to the corresponding standard deviation for a Gaussian distribution. (Default: 2.56, consistent with an interdecile range computed from the difference between the 90th and 10th percentiles).

  • col_true (integer) – Index of the column in the data frame where the true value is stored (Default: 4, index in current QTL format).

  • col_pred_start (integer) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored and are interspaced with other percentile predictions (Default: 6 index, step 3, in current QTL format).

Returns

  • Ytrue (numpy array) – Array with true (observed) values

  • Ypred (numpy array) – Array with predicted values (based on the 50th percentile).

  • yerror (numpy array) – Array with errors computed (observed - predicted).

  • sigma (numpy array) – Array with standard deviations learned with deep learning model. This corresponds to the interdecile range divided by the sigma divisor.

  • Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.

  • pred_name (string) – Name of data column or quantity predicted (as extracted from the data frame using the col_true index).

  • Ypred_Lp_mean (numpy array) – Array with predicted values of the lower percentile (usually the 10th percentile).

  • Ypred_Hp_mean (numpy array) – Array with predicted values of the higher percentile (usually the 90th percentile).
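The default sigma_divisor of 2.56 can be checked against the standard normal distribution: the distance between its 10th and 90th percentiles is about 2.56 standard deviations. The sketch below uses only the Python standard library; the percentile values `p10` and `p90` are made-up numbers for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)
# Width of the 10th-to-90th percentile interval, in units of sigma.
interdecile_width = standard_normal.inv_cdf(0.9) - standard_normal.inv_cdf(0.1)
print(round(interdecile_width, 2))  # 2.56

# Converting a predicted interdecile range back to a standard deviation.
p10, p90 = 1.2, 3.76
sigma = (p90 - p10) / 2.56
```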

exarl.candlelib.uq_utils.split_data_for_empirical_calibration(Ytrue, Ypred, sigma, cal_split=0.8)

Extracts a portion of the arrays provided for the computation of the calibration and reserves the remainder for testing.

Parameters
  • Ytrue (numpy array) – Array with true (observed) values

  • Ypred (numpy array) – Array with predicted values.

  • sigma (numpy array) – Array with standard deviations learned with deep learning model (or std value computed from prediction if homoscedastic inference).

  • cal_split (float) – Split of data to use for estimating the calibration relationship. It is assumed that it will be a value in (0, 1). (Default: use 80% of predictions to generate empirical calibration).

Returns

  • index_perm_total (numpy array) – Random permutation of the array indices. The first ‘num_cal’ of the indices correspond to the samples that are used for calibration, while the remainder are the samples reserved for calibration testing.

  • pSigma_cal (numpy array) – Part of the input sigma array to use for calibration.

  • pSigma_test (numpy array) – Part of the input sigma array to reserve for testing.

  • pPred_cal (numpy array) – Part of the input Ypred array to use for calibration.

  • pPred_test (numpy array) – Part of the input Ypred array to reserve for testing.

  • true_cal (numpy array) – Part of the input Ytrue array to use for calibration.

  • true_test (numpy array) – Part of the input Ytrue array to reserve for testing.
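A random calibration/test split of this kind might look like the sketch below. The function `split_for_calibration` is hypothetical; truncating `num_total * cal_split` to an integer and the use of a seeded generator are assumptions made so the example is reproducible.

```python
import numpy as np


def split_for_calibration(y_true, y_pred, sigma, cal_split=0.8, seed=0):
    """Hypothetical sketch of a random calibration/test split."""
    num_total = sigma.shape[0]
    num_cal = int(num_total * cal_split)
    index_perm_total = np.random.default_rng(seed).permutation(num_total)
    cal_idx = index_perm_total[:num_cal]     # first num_cal indices: calibration
    test_idx = index_perm_total[num_cal:]    # remainder: calibration testing
    return (index_perm_total,
            sigma[cal_idx], sigma[test_idx],
            y_pred[cal_idx], y_pred[test_idx],
            y_true[cal_idx], y_true[test_idx])


y_true = np.linspace(0.0, 1.0, 50)
y_pred = y_true + 0.05
sigma = np.full(50, 0.1)
perm, s_cal, s_test, p_cal, p_test, t_cal, t_test = split_for_calibration(
    y_true, y_pred, sigma
)
```

The same permutation indexes all three arrays, so each sample keeps its true value, prediction and standard deviation together.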

exarl.candlelib.uq_utils.compute_empirical_calibration(pSigma_cal, pPred_cal, true_cal, bins, coverage_percentile)

Use the arrays provided to estimate an empirical mapping between standard deviation and absolute value of error, both of which have been observed during inference. Because the raw per-bin statistics are usually very noisy, a smoothing step (based on SciPy's Savitzky-Golay filter, savgol_filter) is performed.

Parameters
  • pSigma_cal (numpy array) – Part of the standard deviations array to use for calibration.

  • pPred_cal (numpy array) – Part of the predictions array to use for calibration.

  • true_cal (numpy array) – Part of the true (observed) values array to use for calibration.

  • bins (int) – Number of bins to split the range of standard deviations included in pSigma_cal array.

  • coverage_percentile (float) – Value to use for estimating coverage when evaluating the percentiles of the observed absolute value of errors.

Returns

  • mean_sigma (numpy array) – Array with the mean standard deviations computed per bin.

  • min_sigma (numpy array) – Array with the minimum standard deviations computed per bin.

  • max_sigma (numpy array) – Array with the maximum standard deviations computed per bin.

  • error_thresholds (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin.

  • err_err (numpy array) – Error bars in errors (one standard deviation for a binomial distribution estimated by bin vs. the other bins) for the calibration error.

  • error_thresholds_smooth (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin after a smoothing operation is applied to the frequently noisy bin-based estimations.

  • sigma_start_index (non-negative integer) – Index in the mean_sigma array that defines the start of the valid empirical calibration interval (i.e. index to the smallest std for which a meaningful error mapping is obtained).

  • sigma_end_index (non-negative integer) – Index in the mean_sigma array that defines the end of the valid empirical calibration interval (i.e. index to the largest std for which a meaningful error mapping is obtained).

  • s_interpolate (scipy.interpolate python object) – A python object from scipy.interpolate that computes a univariate spline (InterpolatedUnivariateSpline) constructed to express the mapping from standard deviation to error. This spline is generated during the computational empirical calibration procedure.
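The smoothing and spline steps can be sketched with SciPy as follows. The per-bin values, the window length, the polynomial order and the treatment of sigma_end_index as inclusive are all illustrative assumptions, not the settings used by the library.

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline
from scipy.signal import savgol_filter

# Hypothetical per-bin statistics: mean sigma per bin and a noisy error threshold.
mean_sigma = np.linspace(0.1, 1.0, 10)
noise = 0.02 * np.array([1, -1, 1, -1, 1, -1, 1, -1, 1, -1])
error_thresholds = 1.5 * mean_sigma + noise

# Smooth the noisy thresholds; window length and polynomial order are
# illustrative choices, not the values used by the library.
error_thresholds_smooth = savgol_filter(error_thresholds, window_length=5, polyorder=1)

# Fit an interpolating spline over the bins taken as the valid interval
# (the end index is treated as inclusive here, which is an assumption).
sigma_start_index, sigma_end_index = 0, 9
s_interpolate = InterpolatedUnivariateSpline(
    mean_sigma[sigma_start_index:sigma_end_index + 1],
    error_thresholds_smooth[sigma_start_index:sigma_end_index + 1],
)
predicted_error = float(s_interpolate(0.55))
```

The resulting spline passes through the smoothed thresholds and can be evaluated at any standard deviation inside the valid interval.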

exarl.candlelib.uq_utils.bining_for_calibration(pSigma_cal_ordered_, minL_sigma, maxL_sigma, Er_vect_cal_orderedSigma_, bins, coverage_percentile)

Bin the values of the standard deviations observed during inference and estimate a specified coverage percentile in the absolute error (observed during inference as well). Bins that have fewer than 50 samples are merged until they surpass this threshold.

Parameters
  • pSigma_cal_ordered_ (numpy array) – Array of standard deviations sorted in ascending order.

  • minL_sigma (float) – Minimum value of standard deviations included in pSigma_cal_ordered_ array.

  • maxL_sigma (numpy array) – Maximum value of standard deviations included in pSigma_cal_ordered_ array.

  • Er_vect_cal_orderedSigma_ (numpy array) – Array of absolute values of errors corresponding to the array of ordered standard deviations.

  • bins (int) – Number of bins to split the range of standard deviations included in pSigma_cal_ordered_ array.

  • coverage_percentile (float) – Value to use for estimating coverage when evaluating the percentiles of the observed absolute value of errors.

Returns

  • mean_sigma (numpy array) – Array with the mean standard deviations computed per bin.

  • min_sigma (numpy array) – Array with the minimum standard deviations computed per bin.

  • max_sigma (numpy array) – Array with the maximum standard deviations computed per bin.

  • error_thresholds (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin.

  • err_err (numpy array) – Error bars in errors (one standard deviation for a binomial distribution estimated by bin vs. the other bins) for the calibration error.
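A simplified version of the binning idea is sketched below. The function `bin_sigma` is hypothetical: it uses equal-width bins, merges a bin into the next one while it holds fewer than `min_count` samples, and returns only the mean sigma and the error percentile per merged bin. It omits the min/max sigma and error-bar outputs, and the final bin may still end up below the threshold.

```python
import numpy as np


def bin_sigma(sigma_sorted, abs_err_sorted, bins, coverage_percentile, min_count=50):
    """Hypothetical sketch: equal-width sigma bins, merged until each holds
    at least min_count samples, with an error percentile per merged bin."""
    edges = np.linspace(sigma_sorted[0], sigma_sorted[-1], bins + 1)
    # Position of each interior edge in the sorted sigma array.
    cuts = np.searchsorted(sigma_sorted, edges[1:-1], side="right")
    cuts = np.concatenate([cuts, [len(sigma_sorted)]])

    mean_sigma, error_thresholds = [], []
    start = 0
    for cut in cuts:
        is_last = cut == len(sigma_sorted)
        if cut - start < min_count and not is_last:
            continue  # too few samples: extend into the next bin
        if cut == start:
            continue  # nothing left to close out
        mean_sigma.append(sigma_sorted[start:cut].mean())
        error_thresholds.append(
            np.percentile(abs_err_sorted[start:cut], coverage_percentile)
        )
        start = cut
    return np.array(mean_sigma), np.array(error_thresholds)


rng = np.random.default_rng(0)
sigma = np.sort(rng.uniform(0.1, 1.0, 1000))
abs_err = np.abs(rng.normal(0.0, sigma))  # synthetic errors that scale with sigma
mean_sigma, thresholds = bin_sigma(sigma, abs_err, bins=10, coverage_percentile=95)
```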

exarl.candlelib.uq_utils.computation_of_valid_calibration_interval(error_thresholds, error_thresholds_smooth, err_err)

Function that estimates the empirical range in which a monotonic relation is observed between standard deviation and coverage of absolute value of error. Since the statistics computed per bin are relatively noisy, the application of a greedy criterion (e.g. guaranteeing a monotonically increasing relationship) does not yield good results. Therefore, a softer version is constructed based on the satisfaction of certain criteria depending on: the values of the error coverage computed per bin, a smoothed version of them and the associated error estimated (based on one standard deviation for a binomial distribution estimated by bin vs. the other bins). A minimal validation requiring the end index to be larger than the starting index is performed before the function returns.

Current criteria:

  • the smoothed errors are inside the error bars AND they are almost increasing (a small tolerance is allowed, so a slight wobble in the smoothed values is permitted); OR

  • the raw values for both bins are increasing (with a small tolerance) AND the smoothed value is greater than the raw value; OR

  • the current smoothed value is greater than the previous one AND the smoothed value for the next bin is inside the error bars.

Parameters
  • error_thresholds (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin.

  • error_thresholds_smooth (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin after a smoothing operation is applied to the frequently noisy bin-based estimations.

  • err_err (numpy array) – Error bars in errors (one standard deviation for a binomial distribution estimated by bin vs. the other bins) for the calibration error.

Returns

  • sigma_start_index (non-negative integer) – Index estimated in the mean_sigma array corresponding to the value that defines the start of the valid empirical calibration interval (i.e. index to the smallest std for which a meaningful error mapping is obtained, according to the criteria explained before).

  • sigma_end_index (non-negative integer) – Index estimated in the mean_sigma array corresponding to the value that defines the end of the valid empirical calibration interval (i.e. index to the largest std for which a meaningful error mapping is obtained, according to the criteria explained before).

exarl.candlelib.uq_utils.applying_calibration(pSigma_test, pPred_test, true_test, s_interpolate, minL_sigma_auto, maxL_sigma_auto)

Use the empirical mapping between standard deviation and absolute value of error estimated during calibration (i.e. apply the univariate spline computed) to estimate the error for the part of the standard deviation array that was reserved for testing the empirical calibration. The resulting error array (yp_test) should overestimate the true observed error (eabs_red). All the computations are restricted to the valid calibration interval: [minL_sigma_auto, maxL_sigma_auto].

Parameters
  • pSigma_test (numpy array) – Part of the standard deviations array to use for calibration testing.

  • pPred_test (numpy array) – Part of the predictions array to use for calibration testing.

  • true_test (numpy array) – Part of the true (observed) values array to use for calibration testing.

  • s_interpolate (scipy.interpolate python object) – A python object from scipy.interpolate that computes a univariate spline (InterpolatedUnivariateSpline) expressing the mapping from standard deviation to error. This spline is generated during the computational empirical calibration procedure.

  • minL_sigma_auto (float) – Starting value of the valid empirical calibration interval (i.e. smallest std for which a meaningful error mapping is obtained).

  • maxL_sigma_auto (float) – Ending value of the valid empirical calibration interval (i.e. largest std for which a meaningful error mapping is obtained).

Returns

  • index_sigma_range_test (numpy array) – Indices of the pSigma_test array that are included in the valid calibration interval, given by: [minL_sigma_auto, maxL_sigma_auto].

  • xp_test (numpy array) – Array with the mean standard deviations in the calibration testing array.

  • yp_test (numpy array) – Mapping of the given standard deviation to error computed from the interpolation spline constructed by empirical calibration.

  • eabs_red (numpy array) – Array with the observed absolute errors in the part of the testing array for which the observed standard deviations are in the valid interval of calibration.
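The restriction to the valid interval and the evaluation of the mapping can be sketched as follows. The function `apply_calibration` is hypothetical, and a simple linear lambda stands in for the fitted spline; the sample values are made up.

```python
import numpy as np


def apply_calibration(sigma_test, pred_test, true_test, mapping, min_sigma, max_sigma):
    """Hypothetical sketch: evaluate a sigma-to-error mapping inside the valid interval."""
    in_range = np.where((sigma_test >= min_sigma) & (sigma_test <= max_sigma))[0]
    xp_test = sigma_test[in_range]
    yp_test = mapping(xp_test)  # predicted absolute error
    # Observed absolute error for the same samples.
    eabs_red = np.abs(true_test[in_range] - pred_test[in_range])
    return in_range, xp_test, yp_test, eabs_red


sigma_test = np.array([0.05, 0.2, 0.5, 0.9, 1.5])
pred_test = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
true_test = np.array([1.1, 2.1, 2.5, 4.4, 5.0])
idx, xp, yp, eabs = apply_calibration(
    sigma_test, pred_test, true_test,
    mapping=lambda s: 2.0 * s,  # stand-in for the fitted spline
    min_sigma=0.1, max_sigma=1.0,
)
print(idx.tolist())  # [1, 2, 3]
```

Samples whose standard deviation falls outside [min_sigma, max_sigma] are dropped rather than extrapolated.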

exarl.candlelib.uq_utils.overprediction_check(yp_test, eabs_red)

Compute the percentage of overestimated absolute error predictions for the arrays reserved for calibration testing and whose corresponding standard deviations are included in the valid calibration interval.

Parameters
  • yp_test (numpy array) – Mapping of the standard deviation to error computed from the interpolation spline constructed by empirical calibration.

  • eabs_red (numpy array) – Array with the observed absolute errors in the part of the testing array for which the observed standard deviations are in the valid interval of calibration.
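The check amounts to counting how often the calibrated error prediction is at least as large as the observed absolute error. The sketch below uses made-up arrays; the use of `>=` rather than `>`, and whether the library function prints or returns the value, are assumptions.

```python
import numpy as np

yp_test = np.array([0.4, 1.0, 1.8, 0.2])   # errors predicted by the calibration
eabs_red = np.array([0.1, 0.5, 0.4, 0.3])  # observed absolute errors

# Fraction of samples whose predicted error is at least the observed error.
over_predicted = yp_test >= eabs_red
percentage = 100.0 * over_predicted.mean()
print(percentage)  # 75.0
```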