exarl.candlelib.candle
Package Contents
Classes
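ArgumentStruct – Class that converts a python dictionary into an object with named entries given by the dictionary keys.
Benchmark – Class that implements an interface to handle configuration options for the different CANDLE benchmarks.
Progbar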
Functions
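load_csv_data – Load data from the files specified.
load_Xy_one_hot_data2 – Load training and testing data from the files specified, with a column indicated to use as label.
load_Xy_data_noheader – Load training and testing data from the files specified, with the first column to use as label.
drop_impute_and_scale_dataframe – Impute missing values with mean and scale data included in pandas dataframe.
discretize_dataframe – Discretize values of given column in pandas dataframe.
discretize_array – Discretize values of given array.
lookup – Dataframe lookup.
get_file – Downloads a file from a URL if it is not already in the cache.
str2bool – This is taken from: https://stackoverflow.com/questions/15008758/parsing-boolean-values-with-argparse
finalize_parameters – Utility to parse parameters in common as well as parameters particular to each benchmark.
fetch_file – Convert URL to file path and download the file if it is not already present in the specified cache.
verify_path – Verify if a directory path exists locally. If the path does not exist, but is a valid path, it recursively creates the specified directory path structure.
keras_default_config – Defines parameters that intervene in different functions using the keras defaults.
set_up_logger – Set up the event logging system. Two handlers are created.
plot_history
plot_scatter
plot_density_observed_vs_predicted – Functionality to plot a 2D histogram of the distribution of observed (ground truth) values vs. predicted values.
plot_2d_density_sigma_vs_error – Functionality to plot a 2D histogram of the distribution of the standard deviations computed for the predictions vs. the computed errors.
plot_histogram_error_per_sigma – Functionality to plot a 1D histogram of the distribution of computed errors observed for specific values of standard deviations computed.
plot_calibration_and_errors – Functionality to plot empirical calibration curves estimated by binning the statistics of computed standard deviations and errors.
plot_percentile_predictions – Functionality to plot the mean of the percentiles predicted.
compute_statistics_homoscedastic – Extracts ground truth, mean prediction, error and standard deviation of prediction from inference data frame.
compute_statistics_homoscedastic_all – Extracts ground truth, mean prediction, error and standard deviation of prediction from inference data frame.
compute_statistics_heteroscedastic – Extracts ground truth, mean prediction, error, standard deviation of prediction and predicted (learned) standard deviation from inference data frame.
compute_statistics_quantile – Extracts ground truth, 50th percentile mean prediction, low percentile and high percentile mean prediction, error, standard deviation of prediction and predicted (learned) standard deviation from interdecile range in inference data frame.
split_data_for_empirical_calibration – Extracts a portion of the arrays provided for the computation of the calibration and reserves the remainder portion for testing.
Use the arrays provided to estimate an empirical mapping
Bin the values of the standard deviations observed during
Function that estimates the empirical range in which a
Use the empirical mapping between standard deviation and
Compute the percentage of overestimated absolute error
Generates a vector of indices to partition the data for training.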
- exarl.candlelib.candle.load_csv_data(train_path, test_path=None, sep=',', nrows=None, x_cols=None, y_cols=None, drop_cols=None, onehot_cols=None, n_cols=None, random_cols=False, shuffle=False, scaling=None, dtype=None, validation_split=None, return_dataframe=True, return_header=False, seed=DEFAULT_SEED)
Load data from the files specified. Columns corresponding to data features and labels can be specified. A one-hot encoding can be used for either features or labels. If validation_split is specified, training data is further split into training and validation partitions. pandas DataFrames are used to load and pre-process the data. If specified, those DataFrames are returned. Otherwise just values are returned. Labels to output can be integer labels (for classification) or continuous labels (for regression). Columns to load can be specified, randomly selected or a subset can be dropped. Order of rows can be shuffled. Data can be rescaled. This function assumes that the files contain a header with column names.
- Parameters
train_path (filename) – Name of the file to load the training data.
test_path (filename) – Name of the file to load the testing data. (Optional).
sep (character) – Character used as column separator. (Default: ‘,’, comma separated values).
nrows (integer) – Number of rows to load from the files. (Default: None, all the rows are used).
x_cols (list) – List of columns to use as features. (Default: None).
y_cols (list) – List of columns to use as labels. (Default: None).
drop_cols (list) – List of columns to drop from the files being loaded. (Default: None, all the columns are used).
onehot_cols (list) – List of columns to one-hot encode. (Default: None).
n_cols (integer) – Number of columns to load from the files. (Default: None).
random_cols (boolean) – Boolean flag to indicate random selection of columns. If True a number of n_cols columns is randomly selected, if False the specified columns are used. (Default: False).
shuffle (boolean) – Boolean flag to indicate row shuffling. If True the rows are re-ordered, if False the order in which rows are read is preserved. (Default: False, no permutation of the loading row order).
scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1 to 1]. ‘minmax’ : scales data to range [0 to 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).
dtype (data type) – Data type to use for the output pandas DataFrames. (Default: None).
validation_split (float) – Fraction of training data to set aside for validation. (Default: None, no validation partition is constructed).
return_dataframe (boolean) – Boolean flag to indicate that the pandas DataFrames used for data pre-processing are to be returned. (Default: True, pandas DataFrames are returned).
return_header (boolean) – Boolean flag to indicate if the column headers are to be returned. (Default: False, no column headers are separately returned).
seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils).
- Returns
Tuples of data features and labels are returned, for train, validation and testing partitions, together with the column names (headers). The specific objects to return depend on the options selected.
- exarl.candlelib.candle.load_Xy_one_hot_data2(train_file, test_file, class_col=None, drop_cols=None, n_cols=None, shuffle=False, scaling=None, validation_split=0.1, dtype=DEFAULT_DATATYPE, seed=DEFAULT_SEED)
Load training and testing data from the files specified, with a column indicated to use as label. Further split training data into training and validation partitions, and construct corresponding training, validation and testing pandas DataFrames, separated into data (i.e. features) and labels. Labels to output are one-hot encoded (categorical). Columns to load can be selected or dropped. Order of rows can be shuffled. Data can be rescaled. Training and testing partitions (coming from the respective files) are preserved, but training is split into training and validation partitions. This function assumes that the files contain a header with column names.
- Parameters
train_file (filename) – Name of the file to load the training data.
test_file (filename) – Name of the file to load the testing data.
class_col (integer) – Index of the column to use as the label. (Default: None. The function fails with this value, so a label column has to be indicated in the call).
drop_cols (list) – List of column names to drop from the files being loaded. (Default: None, all the columns are used).
n_cols (integer) – Number of columns to load from the files. (Default: None, all the columns are used).
shuffle (boolean) – Boolean flag to indicate row shuffling. If True the rows are re-ordered, if False the order in which rows are loaded is preserved. (Default: False, no permutation of the loading row order).
scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1 to 1]. ‘minmax’ : scales data to range [0 to 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).
validation_split (float) – Fraction of training data to set aside for validation. (Default: 0.1, ten percent of the training data is used for the validation partition).
dtype (data type) – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils).
seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils).
- Returns
X_train (pandas DataFrame) – Data features for training loaded in a pandas DataFrame and pre-processed as specified.
y_train (pandas DataFrame) – Data labels for training loaded in a pandas DataFrame. One-hot encoding (categorical) is used.
X_val (pandas DataFrame) – Data features for validation loaded in a pandas DataFrame and pre-processed as specified.
y_val (pandas DataFrame) – Data labels for validation loaded in a pandas DataFrame. One-hot encoding (categorical) is used.
X_test (pandas DataFrame) – Data features for testing loaded in a pandas DataFrame and pre-processed as specified.
y_test (pandas DataFrame) – Data labels for testing loaded in a pandas DataFrame. One-hot encoding (categorical) is used.
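The validation_split, shuffle and seed options described above can be pictured with a minimal, standard-library sketch. The helper name split_train_validation is hypothetical, the seed value is arbitrary, and the choice to take the validation rows from the end of the (optionally shuffled) training rows is an assumption made for illustration. It is not a description of how the library partitions the data.

```python
import random


def split_train_validation(rows, validation_split=0.1, shuffle=False, seed=0):
    """Hypothetical helper: set aside a fraction of the rows for validation."""
    rows = list(rows)
    if shuffle:
        # A seeded generator makes the row permutation reproducible.
        random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * validation_split)
    # Assumption: the last n_val rows form the validation partition.
    split_at = len(rows) - n_val
    return rows[:split_at], rows[split_at:]


train_rows, val_rows = split_train_validation(range(20), validation_split=0.1)
print(len(train_rows), len(val_rows))
```

With shuffle left at False the original row order is preserved, which matches the documented default.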
- exarl.candlelib.candle.load_Xy_data_noheader(train_file, test_file, classes, usecols=None, scaling=None, dtype=DEFAULT_DATATYPE)
Load training and testing data from the files specified, with the first column to use as label. Construct corresponding training and testing pandas DataFrames, separated into data (i.e. features) and labels. Labels to output are one-hot encoded (categorical). Columns to load can be selected. Data can be rescaled. Training and testing partitions (coming from the respective files) are preserved. This function assumes that the files do not contain a header.
- Parameters
train_file (filename) – Name of the file to load the training data.
test_file (filename) – Name of the file to load the testing data.
classes (integer) – Number of total classes to consider when building the categorical (one-hot) label encoding.
usecols (list) – List of column indices to load from the files. (Default: None, all the columns are used).
scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1 to 1]. ‘minmax’ : scales data to range [0 to 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).
dtype (data type) – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils).
- Returns
X_train (pandas DataFrame) – Data features for training loaded in a pandas DataFrame and pre-processed as specified.
Y_train (pandas DataFrame) – Data labels for training loaded in a pandas DataFrame. One-hot encoding (categorical) is used.
X_test (pandas DataFrame) – Data features for testing loaded in a pandas DataFrame and pre-processed as specified.
Y_test (pandas DataFrame) – Data labels for testing loaded in a pandas DataFrame. One-hot encoding (categorical) is used.
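To make the role of the classes argument concrete, here is a small sketch of one-hot (categorical) label encoding in plain Python. The helper name one_hot is hypothetical, and the sketch assumes integer labels in the range 0 to classes - 1. The library itself returns pandas DataFrames, which this example does not attempt to reproduce.

```python
def one_hot(labels, classes):
    """Hypothetical helper: encode integer labels as one-hot rows."""
    encoded = []
    for label in labels:
        if not 0 <= label < classes:
            raise ValueError(f"label {label} is outside [0, {classes})")
        row = [0] * classes
        row[label] = 1  # exactly one position is set per label
        encoded.append(row)
    return encoded


print(one_hot([0, 2, 1], classes=3))
```

Passing the total number of classes explicitly keeps the encoding width fixed even when a partition happens not to contain every class.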
- exarl.candlelib.candle.drop_impute_and_scale_dataframe(df, scaling='std', imputing='mean', dropna='all')
Impute missing values with mean and scale data included in pandas dataframe.
- Parameters
df (pandas dataframe) – dataframe to process
scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’ (range [-1, 1]), ‘minmax’ (range [0, 1]), ‘std’, or None. (Default: ‘std’).
imputing (string) – String describing type of imputation to apply. ‘mean’ replaces missing values with the mean value along the column, ‘median’ replaces missing values with the median value along the column, ‘most_frequent’ replaces missing values with the most frequent value along the column. (Default: ‘mean’).
dropna (string) – String describing strategy for handling missing values. ‘all’: if all values are NA, drop that column. ‘any’: if any NA values are present, drop that column. (Default: ‘all’).
- Returns
Returns the data frame after handling missing values and scaling.
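The imputing and scaling options can be illustrated on a single column with the standard library. This is only a sketch of the arithmetic behind ‘mean’ imputation and the ‘maxabs’, ‘minmax’ and ‘std’ options. The helper names impute_mean and scale are hypothetical, missing values are represented by None, and the column is assumed not to be constant (otherwise the divisions below would fail).

```python
import statistics


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in column]


def scale(column, scaling="std"):
    """Scale one column following the 'maxabs', 'minmax' and 'std' options."""
    if scaling is None:
        return list(column)
    if scaling == "maxabs":
        peak = max(abs(v) for v in column)
        return [v / peak for v in column]  # values end up in [-1, 1]
    if scaling == "minmax":
        lo, hi = min(column), max(column)
        return [(v - lo) / (hi - lo) for v in column]  # values end up in [0, 1]
    if scaling == "std":
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)  # population standard deviation
        return [(v - mean) / std for v in column]  # mean 0, std 1
    raise ValueError(f"unknown scaling: {scaling}")


col = impute_mean([2.0, None, 4.0, 6.0])
print(col)
print(scale(col, "minmax"))
```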
- exarl.candlelib.candle.discretize_dataframe(df, col, bins=2, cutoffs=None)
Discretize values of given column in pandas dataframe.
- Parameters
df (pandas dataframe) – dataframe to process.
col (int) – Index of column to bin.
bins (int) – Number of bins for distributing column values.
cutoffs (list) – List of bin limits. If None, the limits are computed as percentiles. (Default: None).
- Returns
Returns the data frame with the values of the specified column binned, i.e. the values are replaced by the associated bin number.
- exarl.candlelib.candle.discretize_array(y, bins=5)
Discretize values of given array.
- Parameters
y (numpy array) – array to discretize.
bins (int) – Number of bins for distributing column values.
- Returns
Returns an array with the bin number associated to the values in the original array.
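A rough sketch of percentile-based binning follows. It assumes, as stated for discretize_dataframe when no cutoffs are given, that the bin limits are equally spaced percentiles of the data. Whether discretize_array uses the same rule is an assumption here. The helper name discretize is hypothetical, and the interpolation rule used for the percentiles (method="inclusive") is a choice made for this example.

```python
import bisect
import statistics


def discretize(values, bins=5):
    """Hypothetical helper: map each value to a percentile-based bin index."""
    # bins - 1 interior cut points at equally spaced percentiles.
    thresholds = statistics.quantiles(values, n=bins, method="inclusive")
    # bisect_right sends a value equal to a threshold to the upper bin.
    return [bisect.bisect_right(thresholds, v) for v in values]


binned = discretize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], bins=5)
print(binned)
```

Each output entry is a bin number between 0 and bins - 1, in the same order as the input values.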
- exarl.candlelib.candle.lookup(df, query, ret, keys, match='match')
Dataframe lookup.
- Parameters
df (pandas dataframe) – dataframe for retrieving values.
query (string) – String for searching.
ret (int/string or list) – Names or indices of columns to be returned.
keys (list) – List of strings or integers specifying the names or indices of columns to look into.
match (string) – String describing strategy for matching keys to query.
- Returns
Returns a list of the values in the dataframe whose columns match the specified query and have been selected to be returned.
- exarl.candlelib.candle.get_file(fname, origin, untar=False, md5_hash=None, cache_subdir='common', datadir=None)
Downloads a file from a URL if it is not already in the cache. If an MD5 hash is passed, the file is verified after download, and also when it is already present in the cache.
- Parameters
fname (string) – name of the file
origin (string) – original URL of the file
untar (boolean) – whether the file should be decompressed
md5_hash (string) – MD5 hash of the file for verification
cache_subdir (string) – directory being used as the cache
datadir (string) – if set, this value (which could be e.g. an absolute path) is used as the data directory and cache_subdir no longer matters
- Returns
Path to the downloaded file
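The MD5 verification step can be sketched with hashlib. This example does not download anything. It writes a small file into a temporary directory standing in for the cache and checks its digest. The helper name md5_matches and the chunk size are assumptions made for illustration only.

```python
import hashlib
import os
import tempfile


def md5_matches(path, md5_hash, chunk_size=65536):
    """Hypothetical helper: True if the file's MD5 digest equals md5_hash."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == md5_hash


with tempfile.TemporaryDirectory() as cache_dir:
    cached = os.path.join(cache_dir, "example.txt")
    with open(cached, "wb") as handle:
        handle.write(b"hello")
    expected = hashlib.md5(b"hello").hexdigest()
    ok = md5_matches(cached, expected)
    bad = md5_matches(cached, "0" * 32)

print(ok, bad)
```

A mismatch would typically be treated as a corrupted or stale cached file that needs to be fetched again.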
- class exarl.candlelib.candle.ArgumentStruct(**entries)
Class that converts a python dictionary into an object with named entries given by the dictionary keys. This structure simplifies the calling convention for accessing the dictionary values (corresponding to problem parameters). After the object instantiation both modes of access (dictionary or object entries) can be used.
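The idea behind this class can be shown in a few lines. The class below is a hypothetical stand-in named ParamStruct, written only to illustrate attribute-style access to dictionary keys, and the parameter names are invented for the example.

```python
class ParamStruct:
    """Hypothetical stand-in: expose dictionary keys as attributes."""

    def __init__(self, **entries):
        # Copy the key/value pairs into the instance namespace.
        self.__dict__.update(entries)


params = {"batch_size": 32, "learning_rate": 0.001}
obj = ParamStruct(**params)

# Both modes of access refer to the same values.
print(obj.batch_size, params["batch_size"])
```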
- class exarl.candlelib.candle.Benchmark(filepath, defmodel, framework, prog=None, desc=None, parser=None)
Class that implements an interface to handle configuration options for the different CANDLE benchmarks. It provides access to all the common configuration options and configuration options particular to each individual benchmark. It describes what minimum requirements should be specified to instantiate the corresponding benchmark. It interacts with the argparser to extract command-line options and arguments from the benchmark’s configuration files.
Initialize Benchmark object.
- Parameters
filepath (./) – os.path.dirname where the benchmark is located. Necessary to locate utils and establish input/output paths
defmodel ('p*b*_default_model.txt') – string corresponding to the default model of the benchmark
framework ('keras', 'neon', 'mxnet', 'pytorch') – framework used to run the benchmark
prog ('p*b*_baseline_*') – string for program name (usually associated to benchmark and framework)
desc (' ') – string describing benchmark (usually a description of the neural network model built)
parser (argparser (default None)) – if ‘neon’ framework a NeonArgparser is passed. Otherwise an argparser is constructed.
- parse_from_common(self)
Functionality to parse options common for all benchmarks. This functionality is based on methods ‘get_default_neon_parser’ and ‘get_common_parser’ which are defined previously (above). If the order changes or they are moved, the calling has to be updated.
- parse_from_benchmark(self)
Functionality to parse options specific for each benchmark.
- format_benchmark_config_arguments(self, dictfileparam)
Functionality to format the particular parameters of the benchmark.
- Parameters
dictfileparam (python dictionary) – parameters read from configuration file
args (python dictionary) – parameters read from command-line. Most of the time the command-line overwrites the configuration file, except when the command-line is using default values and the config file defines those values.
- read_config_file(self, file)
Functionality to read the configuration file specific for each benchmark.
- set_locals(self)
Functionality to set variables specific for the benchmark:
required – set of required parameters for the benchmark.
additional_definitions – list of dictionaries describing the additional parameters for the benchmark.
- check_required_exists(self, gparam)
Functionality to verify that the required model parameters have been specified.
- exarl.candlelib.candle.str2bool(v)
Interprets a string as a boolean value (taken from: https://stackoverflow.com/questions/15008758/parsing-boolean-values-with-argparse ). This is needed because type=bool is not interpreted as a bool and action=’store_true’ cannot be undone.
- Parameters
v (string) – String to interpret
- Returns
Boolean value. It raises an exception if the provided string cannot be interpreted as a boolean type.
Strings recognized as boolean True – ‘yes’, ‘true’, ‘t’, ‘y’, ‘1’ and uppercase versions (where applicable).
Strings recognized as boolean False – ‘no’, ‘false’, ‘f’, ‘n’, ‘0’ and uppercase versions (where applicable).
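The following sketch shows how such a converter is typically plugged into argparse. The function str2bool_sketch is an illustrative reimplementation based only on the strings listed above, and the --shuffle option is a hypothetical argument. Note that lowercasing the input also accepts mixed-case spellings, which is slightly more permissive than the description.

```python
import argparse


def str2bool_sketch(v):
    """Illustrative converter for the strings listed above."""
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "y", "1"):
        return True
    if v.lower() in ("no", "false", "f", "n", "0"):
        return False
    # argparse reports this as a usage error for the offending option.
    raise argparse.ArgumentTypeError("Boolean value expected.")


parser = argparse.ArgumentParser()
# --shuffle is a hypothetical option used only for this example.
parser.add_argument("--shuffle", type=str2bool_sketch, default=False)
args = parser.parse_args(["--shuffle", "Yes"])
print(args.shuffle)
```

Unlike action='store_true', an option declared this way can be switched off explicitly, e.g. with --shuffle false.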
- exarl.candlelib.candle.finalize_parameters(bmk)
Utility to parse parameters in common as well as parameters particular to each benchmark.
- Parameters
bmk (benchmark object) – Object that has benchmark filepaths and specifications
- Returns
gParameters (python dictionary) – Dictionary with all the parameters necessary to run the benchmark. Command line overwrites config file specifications
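The precedence between command line and configuration file can be sketched as a dictionary merge. This is a simplified illustration of the stated rule (the command line wins, except when it only carries a default value and the configuration file defines that parameter). The helper name merge_parameters and the parameter names are hypothetical, and the real function may detect defaults differently.

```python
def merge_parameters(defaults, file_params, cli_params):
    """Hypothetical merge: command line wins unless it only holds a default."""
    merged = dict(defaults)
    merged.update(file_params)
    for key, value in cli_params.items():
        cli_is_default = key in defaults and value == defaults[key]
        if cli_is_default and key in file_params:
            # The command line did not really set this option,
            # so the configuration file value is kept.
            continue
        merged[key] = value
    return merged


defaults = {"epochs": 10, "batch_size": 32}
file_params = {"epochs": 50, "batch_size": 64}
cli_params = {"epochs": 10, "batch_size": 128}  # epochs left at its default
g_parameters = merge_parameters(defaults, file_params, cli_params)
print(g_parameters)
```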
- exarl.candlelib.candle.fetch_file(link, subdir, untar=False, md5_hash=None)
Convert URL to file path and download the file if it is not already present in the specified cache.
- Parameters
link (link path) – URL of the file to download
subdir (directory path) – Local path to check for cached file.
untar (boolean) – Flag to specify if the file to download should be decompressed too. (default: False, no decompression)
md5_hash (MD5 hash) – Hash used as a checksum to verify data integrity. Verification is carried out if a hash is provided. (default: None, no verification)
- Returns
local path to the downloaded, or cached, file.
- exarl.candlelib.candle.verify_path(path)
Verify if a directory path exists locally. If the path does not exist, but is a valid path, it recursively creates the specified directory path structure.
- Parameters
path (directory path) – Description of local directory path
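Recursive directory creation of this kind is commonly done with os.makedirs. The sketch below uses a hypothetical helper name, ensure_directory, and a temporary directory as the base. It illustrates the behaviour described above and is not the library's code.

```python
import os
import tempfile


def ensure_directory(path):
    """Hypothetical helper: create path, including parents, if it is missing."""
    # exist_ok=True makes the call a no-op when the directory already exists.
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


with tempfile.TemporaryDirectory() as base:
    nested = os.path.join(base, "runs", "exp1", "logs")
    created = ensure_directory(nested)
    again = ensure_directory(nested)  # second call finds the existing path

print(created, again)
```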
- exarl.candlelib.candle.keras_default_config()
Defines parameters that intervene in different functions using the keras defaults. This helps to keep consistency in parameters between frameworks.
- exarl.candlelib.candle.set_up_logger(logfile, logger, verbose)
Set up the event logging system. Two handlers are created. One to send log records to a specified file and one to send log records to the (default) sys.stderr stream. The logger and the file handler are set to DEBUG logging level. The stream handler is set to INFO logging level, or to DEBUG logging level if the verbose flag is specified. Logging messages which are less severe than the level set will be ignored.
- Parameters
logfile (filename) – File to store the log records
logger (logger object) – Python object for the logging interface
verbose (boolean) – Flag to increase the logging level from INFO to DEBUG. It only applies to the stream handler.
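The two-handler arrangement can be reproduced with the standard logging module. The function configure_logger below is a hypothetical sketch that follows the levels described above. The logger name and the log file location are invented for the example.

```python
import logging
import os
import sys
import tempfile


def configure_logger(logfile, logger, verbose=False):
    """Hypothetical setup with one file handler and one stderr handler."""
    logger.setLevel(logging.DEBUG)

    file_handler = logging.FileHandler(logfile)
    file_handler.setLevel(logging.DEBUG)
    logger.addHandler(file_handler)

    stream_handler = logging.StreamHandler(sys.stderr)
    # verbose lowers the stream threshold from INFO to DEBUG.
    stream_handler.setLevel(logging.DEBUG if verbose else logging.INFO)
    logger.addHandler(stream_handler)
    return file_handler, stream_handler


with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = os.path.join(tmp_dir, "run.log")
    log = logging.getLogger("candle_sketch")
    fh, sh = configure_logger(log_path, log, verbose=False)
    log.debug("written to the file only")
    log.info("written to the file and to stderr")
    # Detach and close the file handler before the directory is removed.
    log.removeHandler(fh)
    fh.close()
    with open(log_path) as handle:
        logged_lines = handle.read().splitlines()

print(logged_lines)
```

With verbose=False the DEBUG record reaches only the file, since the stream handler filters out anything below INFO.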
- class exarl.candlelib.candle.Progbar(target, width=30, verbose=1, interval=0.01)
Bases: object
- Parameters
target (int) – total number of steps expected
interval (float) – minimum visual progress update interval (in seconds)
- update(self, current, values=[], force=False)
- Parameters
current (int) – index of current step
values (list of tuples (name, value_for_last_step).) – The progress bar will display averages for these values.
force (boolean) – force visual progress update
- add(self, n, values=[])
- exarl.candlelib.candle.plot_history(out, history, metric='loss', title=None, width=8, height=6)
- exarl.candlelib.candle.plot_scatter(data, classes, out, width=10, height=8)
- exarl.candlelib.candle.plot_density_observed_vs_predicted(Ytest, Ypred, pred_name=None, figprefix=None)
Functionality to plot a 2D histogram of the distribution of observed (ground truth) values vs. predicted values. The plot generated is stored in a png file.
- Parameters
Ytest (numpy array) – Array with (true) observed values
Ypred (numpy array) – Array with predicted values.
pred_name (string) – Name of data column or quantity predicted (e.g. growth, AUC, etc.)
figprefix (string) – String to prefix the filename to store the figure generated. A ‘_density_predictions.png’ string will be appended to the figprefix given.
- exarl.candlelib.candle.plot_2d_density_sigma_vs_error(sigma, yerror, method=None, figprefix=None)
Functionality to plot a 2D histogram of the distribution of the standard deviations computed for the predictions vs. the computed errors (i.e. values of observed - predicted). The plot generated is stored in a png file.
- Parameters
sigma (numpy array) – Array with standard deviations computed.
yerror (numpy array) – Array with errors computed (observed - predicted).
method (string) – Method used to compute the standard deviations (i.e. dropout, heteroscedastic, etc.).
figprefix (string) – String to prefix the filename to store the figure generated. A ‘_density_sigma_error.png’ string will be appended to the figprefix given.
- exarl.candlelib.candle.plot_histogram_error_per_sigma(sigma, yerror, method=None, figprefix=None)
Functionality to plot a 1D histogram of the distribution of computed errors (i.e. values of observed - predicted) observed for specific values of standard deviations computed. The range of standard deviations computed is split in xbins values and the 1D histograms of error distributions for the smallest six standard deviations are plotted. The plot generated is stored in a png file.
- Parameters
sigma (numpy array) – Array with standard deviations computed.
yerror (numpy array) – Array with errors computed (observed - predicted).
method (string) – Method used to compute the standard deviations (i.e. dropout, heteroscedastic, etc.).
figprefix (string) – String to prefix the filename to store the figure generated. A ‘_histogram_error_per_sigma.png’ string will be appended to the figprefix given.
- exarl.candlelib.candle.plot_calibration_and_errors(mean_sigma, sigma_start_index, sigma_end_index, min_sigma, max_sigma, error_thresholds, error_thresholds_smooth, err_err, s_interpolate, coverage_percentile, method=None, figprefix=None, steps=False)
Functionality to plot empirical calibration curves estimated by binning the statistics of computed standard deviations and errors.
- Parameters
mean_sigma (numpy array) – Array with the mean standard deviations computed per bin.
sigma_start_index (non-negative integer) – Index of the mean_sigma array that defines the start of the valid empirical calibration interval (i.e. index to the smallest std for which a meaningful error is obtained).
sigma_end_index (non-negative integer) – Index of the mean_sigma array that defines the end of the valid empirical calibration interval (i.e. index to the largest std for which a meaningful error is obtained).
min_sigma (numpy array) – Array with the minimum standard deviations computed per bin.
max_sigma (numpy array) – Array with the maximum standard deviations computed per bin.
error_thresholds (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin.
error_thresholds_smooth (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin after a smoothed operation is applied to the frequently noisy bin-based estimations.
err_err (numpy array) – Vertical error bars (usually one standard deviation for a binomial distribution estimated by bin) for the error calibration computed empirically.
s_interpolate (scipy.interpolate python object) – A python object from scipy.interpolate that computes a univariate spline (InterpolatedUnivariateSpline) constructed to express the mapping from standard deviation to error. This spline is generated during the computational empirical calibration procedure.
coverage_percentile (float) – Value used for the coverage in the percentile estimation of the observed error.
method (string) – Method used to compute the standard deviations (i.e. dropout, heteroscedastic, etc.).
figprefix (string) – String to prefix the filename to store the figure generated. A ‘_empirical_calibration.png’ string will be appended to the figprefix given.
steps (boolean) – Besides the complete empirical calibration (including raw statistics, error bars and smoothing), also generates partial plots with only the raw bin statistics (step1) and with only the raw bin statistics and the smoothing interpolation (step2).
- exarl.candlelib.candle.plot_percentile_predictions(Ypred, Ypred_Lp, Ypred_Hp, percentile_list, pred_name=None, figprefix=None)
Functionality to plot the mean of the percentiles predicted. The plot generated is stored in a png file.
- Parameters
Ypred (numpy array) – Array with mid percentile predicted values.
Ypred_Lp (numpy array) – Array with low percentile predicted values.
Ypred_Hp (numpy array) – Array with high percentile predicted values.
percentile_list (string list) – List of percentiles predicted (e.g. ‘10p’, ‘90p’, etc.)
pred_name (string) – Name of data column or quantity predicted (e.g. growth, AUC, etc.)
figprefix (string) – String to prefix the filename to store the figure generated. A ‘_density_predictions.png’ string will be appended to the figprefix given.
- exarl.candlelib.candle.compute_statistics_homoscedastic(df_data, col_true=0, col_pred=6, col_std_pred=7)
Extracts ground truth, mean prediction, error and standard deviation of prediction from inference data frame. The latter includes the statistics over all the inference realizations.
- Parameters
df_data (pandas data frame) – Data frame generated by current CANDLE inference experiments. Indices are hard coded to agree with current CANDLE version. (The inference file usually has the name: <model>_pred.tsv).
col_true (integer) – Index of the column in the data frame where the true value is stored (Default: 0, index in current CANDLE format).
col_pred (integer) – Index of the column in the data frame where the predicted value is stored (Default: 6, index in current CANDLE format).
col_std_pred (integer) – Index of the column in the data frame where the standard deviation of the predicted values is stored (Default: 7, index in current CANDLE format).
- Returns
Ytrue (numpy array) – Array with true (observed) values
Ypred (numpy array) – Array with predicted values.
yerror (numpy array) – Array with errors computed (observed - predicted).
sigma (numpy array) – Array with standard deviations learned with deep learning model. For homoscedastic inference this corresponds to the std value computed from prediction (and is equal to the following returned variable).
Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.
pred_name (string) – Name of data column or quantity predicted (as extracted from the data frame using the col_true index).
- exarl.candlelib.candle.compute_statistics_homoscedastic_all(df_data, col_true=4, col_pred_start=6)
Extracts ground truth, mean prediction, error and standard deviation of prediction from inference data frame. The latter includes all the individual inference realizations.
- Parameters
df_data (pandas data frame) – Data frame generated by current CANDLE inference experiments. Indices are hard coded to agree with current CANDLE version. (The inference file usually has the name: <model>.predicted_INFER.tsv).
col_true (integer) – Index of the column in the data frame where the true value is stored (Default: 4, index in current HOM format).
col_pred_start (integer) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored (Default: 6 index, in current HOM format).
- Returns
Ytrue (numpy array) – Array with true (observed) values
Ypred (numpy array) – Array with predicted values.
yerror (numpy array) – Array with errors computed (observed - predicted).
sigma (numpy array) – Array with standard deviations learned with deep learning model. For homoscedastic inference this corresponds to the std value computed from prediction (and is equal to the following returned variable).
Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.
pred_name (string) – Name of data column or quantity predicted (as extracted from the data frame using the col_true index).
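As a rough illustration of how the returned quantities relate to each other, the sketch below reduces several inference realizations per sample to a mean prediction, an error (observed - predicted) and a standard deviation. The helper name summarize_realizations is hypothetical, the data are made up, and the use of the population standard deviation is an assumption. The library operates on a pandas data frame and numpy arrays rather than on lists.

```python
import statistics


def summarize_realizations(y_true, realizations):
    """Hypothetical helper: per-sample mean, error and spread of predictions."""
    y_pred = [statistics.fmean(r) for r in realizations]
    # Error follows the convention used here: observed - predicted.
    y_error = [t - p for t, p in zip(y_true, y_pred)]
    # Assumption: population standard deviation over the realizations.
    y_std = [statistics.pstdev(r) for r in realizations]
    return y_pred, y_error, y_std


y_true = [1.0, 2.0]
realizations = [[0.5, 1.5], [2.0, 4.0]]  # two realizations per sample
y_pred, y_error, y_std = summarize_realizations(y_true, realizations)
print(y_pred, y_error, y_std)
```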
- exarl.candlelib.candle.compute_statistics_heteroscedastic(df_data, col_true=4, col_pred_start=6, col_std_pred_start=7)
Extracts ground truth, mean prediction, error, standard deviation of prediction and predicted (learned) standard deviation from inference data frame. The latter includes all the individual inference realizations.
- Parameters
df_data (pandas data frame) – Data frame generated by current heteroscedastic inference experiments. Indices are hard coded to agree with current version. (The inference file usually has the name: <model>.predicted_INFER_HET.tsv).
col_true (integer) – Index of the column in the data frame where the true value is stored (Default: 4, index in current HET format).
col_pred_start (integer) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored and are interspaced with standard deviation predictions (Default: 6 index, step 2, in current HET format).
col_std_pred_start (integer) – Index of the column in the data frame where the first predicted standard deviation value is stored. All the predicted values during inference are stored and are interspaced with predictions (Default: 7 index, step 2, in current HET format).
- Returns
Ytrue (numpy array) – Array with true (observed) values
Ypred (numpy array) – Array with predicted values.
yerror (numpy array) – Array with errors computed (observed - predicted).
sigma (numpy array) – Array with standard deviations learned with deep learning model. For homoscedastic inference this corresponds to the std value computed from prediction (and is equal to the following returned variable).
Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.
pred_name (string) – Name of data column or quantity predicted (as extracted from the data frame using the col_true index).
- exarl.candlelib.candle.compute_statistics_quantile(df_data, sigma_divisor=2.56, col_true=4, col_pred_start=6)
Extracts ground truth, 50th percentile mean prediction, low percentile and high percentile mean prediction (usually 10th percentile and 90th percentile respectively), error (using 50th percentile), standard deviation of prediction (using 50th percentile) and predicted (learned) standard deviation from interdecile range in inference data frame. The latter includes all the individual inference realizations.
- Parameters
df_data (pandas data frame) – Data frame generated by current quantile inference experiments. Indices are hard coded to agree with current version. (The inference file usually has the name: <model>.predicted_INFER_QTL.tsv).
sigma_divisor (float) – Divisor to convert from the interdecile range to the corresponding standard deviation for a Gaussian distribution. (Default: 2.56, consistent with an interdecile range computed from the difference between the 90th and 10th percentiles).
col_true (integer) – Index of the column in the data frame where the true value is stored (Default: 4, index in current QTL format).
col_pred_start (integer) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored and are interspaced with other percentile predictions (Default: 6 index, step 3, in current QTL format).
- Returns
Ytrue (numpy array) – Array with true (observed) values
Ypred (numpy array) – Array with predicted values (based on the 50th percentile).
yerror (numpy array) – Array with errors computed (observed - predicted).
sigma (numpy array) – Array with standard deviations learned with deep learning model. This corresponds to the interdecile range divided by the sigma divisor.
Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.
pred_name (string) – Name of data column or quantity predicted (as extracted from the data frame using the col_true index).
Ypred_Lp_mean (numpy array) – Array with predicted values of the lower percentile (usually the 10th percentile).
Ypred_Hp_mean (numpy array) – Array with predicted values of the higher percentile (usually the 90th percentile).
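The role of sigma_divisor can be illustrated with a minimal sketch. This is not the library's implementation; the arrays p10 and p90 are hypothetical stand-ins for the 10th and 90th percentile predictions. For a Gaussian, the 90th percentile lies about 1.28 standard deviations above the mean, so the interdecile range spans roughly 2.56 standard deviations.

```python
import numpy as np

# Hypothetical 10th and 90th percentile predictions for three samples.
p10 = np.array([0.8, 1.5, 2.0])
p90 = np.array([1.2, 2.5, 4.56])

# Default divisor documented above for a 90th - 10th percentile range.
sigma_divisor = 2.56

# Gaussian-equivalent standard deviation from the interdecile range.
sigma = (p90 - p10) / sigma_divisor
print(sigma)
```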
- exarl.candlelib.candle.split_data_for_empirical_calibration(Ytrue, Ypred, sigma, cal_split=0.8)
Extracts a portion of the arrays provided for the computation of the calibration and reserves the remainder portion for testing.
- Parameters
Ytrue (numpy array) – Array with true (observed) values
Ypred (numpy array) – Array with predicted values.
sigma (numpy array) – Array with standard deviations learned with deep learning model (or std value computed from prediction if homoscedastic inference).
cal_split (float) – Split of data to use for estimating the calibration relationship. It is assumed that it will be a value in (0, 1). (Default: use 80% of predictions to generate empirical calibration).
- Returns
index_perm_total (numpy array) – Random permutation of the array indices. The first ‘num_cal’ of the indices correspond to the samples that are used for calibration, while the remainder are the samples reserved for calibration testing.
pSigma_cal (numpy array) – Part of the input sigma array to use for calibration.
pSigma_test (numpy array) – Part of the input sigma array to reserve for testing.
pPred_cal (numpy array) – Part of the input Ypred array to use for calibration.
pPred_test (numpy array) – Part of the input Ypred array to reserve for testing.
true_cal (numpy array) – Part of the input Ytrue array to use for calibration.
true_test (numpy array) – Part of the input Ytrue array to reserve for testing.
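The split described above can be sketched as follows. This is a hedged illustration rather than the library's code: the function name split_sketch and the seed argument are assumptions introduced here, and the rounding used to obtain num_cal is a guess.

```python
import numpy as np

def split_sketch(Ytrue, Ypred, sigma, cal_split=0.8, seed=0):
    """Hypothetical sketch of a random calibration/testing split."""
    rng = np.random.default_rng(seed)  # seed is an assumption, for repeatability
    num_total = len(Ytrue)
    num_cal = int(cal_split * num_total)  # assumed truncation toward zero
    index_perm_total = rng.permutation(num_total)
    cal = index_perm_total[:num_cal]    # first num_cal indices: calibration
    test = index_perm_total[num_cal:]   # remainder: calibration testing
    return (index_perm_total,
            sigma[cal], sigma[test],
            Ypred[cal], Ypred[test],
            Ytrue[cal], Ytrue[test])

# Usage with ten synthetic samples.
y = np.arange(10, dtype=float)
out = split_sketch(y, y + 0.1, np.full(10, 0.5))
print(len(out[1]), len(out[2]))
```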
- exarl.candlelib.candle.compute_empirical_calibration(pSigma_cal, pPred_cal, true_cal, bins, coverage_percentile)
Use the arrays provided to estimate an empirical mapping between standard deviation and absolute value of error, both of which have been observed during inference. Since the raw statistics per bin are usually very noisy, a smoothing step (based on scipy’s savgol filter) is performed.
- Parameters
pSigma_cal (numpy array) – Part of the standard deviations array to use for calibration.
pPred_cal (numpy array) – Part of the predictions array to use for calibration.
true_cal (numpy array) – Part of the true (observed) values array to use for calibration.
bins (int) – Number of bins to split the range of standard deviations included in pSigma_cal array.
coverage_percentile (float) – Value to use for estimating coverage when evaluating the percentiles of the observed absolute value of errors.
- Returns
mean_sigma (numpy array) – Array with the mean standard deviations computed per bin.
min_sigma (numpy array) – Array with the minimum standard deviations computed per bin.
max_sigma (numpy array) – Array with the maximum standard deviations computed per bin.
error_thresholds (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin.
err_err (numpy array) – Error bars in errors (one standard deviation for a binomial distribution estimated by bin vs. the other bins) for the calibration error.
error_thresholds_smooth (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin after a smoothing operation is applied to the frequently noisy bin-based estimations.
sigma_start_index (non-negative integer) – Index in the mean_sigma array that defines the start of the valid empirical calibration interval (i.e. index to the smallest std for which a meaningful error mapping is obtained).
sigma_end_index (non-negative integer) – Index in the mean_sigma array that defines the end of the valid empirical calibration interval (i.e. index to the largest std for which a meaningful error mapping is obtained).
s_interpolate (scipy.interpolate python object) – A python object from scipy.interpolate that computes a univariate spline (InterpolatedUnivariateSpline) constructed to express the mapping from standard deviation to error. This spline is generated during the computational empirical calibration procedure.
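The smoothing and spline steps mentioned above can be sketched with SciPy directly. The data are synthetic, and the window_length and polyorder values are assumptions chosen for illustration; the settings used inside the library are not stated in this reference.

```python
import numpy as np
from scipy.signal import savgol_filter
from scipy.interpolate import InterpolatedUnivariateSpline

# Synthetic per-bin statistics: error threshold grows with sigma, plus noise.
rng = np.random.default_rng(0)
mean_sigma = np.linspace(0.1, 1.0, 10)
error_thresholds = 2.0 * mean_sigma + rng.normal(0.0, 0.05, size=10)

# Savitzky-Golay smoothing of the noisy thresholds (assumed settings).
error_thresholds_smooth = savgol_filter(error_thresholds, window_length=5, polyorder=1)

# Interpolating spline mapping standard deviation to error threshold.
s_interpolate = InterpolatedUnivariateSpline(mean_sigma, error_thresholds_smooth)

# Estimated error threshold for an unseen standard deviation.
estimate = float(s_interpolate(0.55))
print(estimate)
```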
- exarl.candlelib.candle.bining_for_calibration(pSigma_cal_ordered_, minL_sigma, maxL_sigma, Er_vect_cal_orderedSigma_, bins, coverage_percentile)
Bin the values of the standard deviations observed during inference and estimate a specified coverage percentile in the absolute error (observed during inference as well). Bins that have fewer than 50 samples are merged until they surpass this threshold.
- Parameters
pSigma_cal_ordered_ (numpy array) – Array of standard deviations sorted in ascending order.
minL_sigma (float) – Minimum value of standard deviations included in pSigma_cal_ordered_ array.
maxL_sigma (numpy array) – Maximum value of standard deviations included in pSigma_cal_ordered_ array.
Er_vect_cal_orderedSigma_ (numpy array) – Array of absolute values of errors corresponding to the array of ordered standard deviations.
bins (int) – Number of bins to split the range of standard deviations included in pSigma_cal_ordered_ array.
coverage_percentile (float) – Value to use for estimating coverage when evaluating the percentiles of the observed absolute value of errors.
- Returns
mean_sigma (numpy array) – Array with the mean standard deviations computed per bin.
min_sigma (numpy array) – Array with the minimum standard deviations computed per bin.
max_sigma (numpy array) – Array with the maximum standard deviations computed per bin.
error_thresholds (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin.
err_err (numpy array) – Error bars in errors (one standard deviation for a binomial distribution estimated by bin vs. the other bins) for the calibration error.
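A simplified sketch of the binning idea is shown below. It uses equal-width bins and omits both the merging of sparsely populated bins and the err_err computation, so it is not equivalent to the library function. Treating coverage_percentile as a percentage in [0, 100] is an assumption, and the data are synthetic.

```python
import numpy as np

# Synthetic data: sorted standard deviations and matching absolute errors.
rng = np.random.default_rng(0)
sigma = np.sort(rng.uniform(0.1, 1.0, size=1000))
abs_err = np.abs(rng.normal(0.0, sigma))

bins = 5
coverage_percentile = 95  # assumed to be a percentage

# Equal-width bin edges over the observed sigma range.
edges = np.linspace(sigma[0], sigma[-1], bins + 1)
# Assign each sample to a bin; clip so the maximum lands in the last bin.
bin_id = np.clip(np.digitize(sigma, edges) - 1, 0, bins - 1)

mean_sigma = np.array([sigma[bin_id == b].mean() for b in range(bins)])
min_sigma = np.array([sigma[bin_id == b].min() for b in range(bins)])
max_sigma = np.array([sigma[bin_id == b].max() for b in range(bins)])
# Error value below which coverage_percentile percent of each bin's errors fall.
error_thresholds = np.array(
    [np.percentile(abs_err[bin_id == b], coverage_percentile) for b in range(bins)]
)
print(mean_sigma, error_thresholds)
```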
- exarl.candlelib.candle.computation_of_valid_calibration_interval(error_thresholds, error_thresholds_smooth, err_err)
Function that estimates the empirical range in which a monotonic relation is observed between standard deviation and coverage of absolute value of error. Since the statistics computed per bin are relatively noisy, the application of a greedy criterion (e.g. guarantee a monotonically increasing relationship) does not yield good results. Therefore, a softer version is constructed based on the satisfaction of certain criteria depending on: the values of the error coverage computed per bin, a smoothed version of them and the associated error estimated (based on one standard deviation for a binomial distribution estimated by bin vs. the other bins). A minimal validation requiring the end index to be larger than the starting index is performed before the function returns.
Current criteria:
(1) the smoothed errors are inside the error bars AND they are almost increasing (a small tolerance is allowed, so a slight wobble in the smoothed values is permitted); OR
(2) both the raw values for the bins (with a small tolerance) are increasing, AND the smoothed value is greater than the raw value; OR
(3) the current smoothed value is greater than the previous AND the smoothed values for the next bin are inside the error bars.
- Parameters
error_thresholds (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin.
error_thresholds_smooth (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin after a smoothing operation is applied to the frequently noisy bin-based estimations.
err_err (numpy array) – Error bars in errors (one standard deviation for a binomial distribution estimated by bin vs. the other bins) for the calibration error.
- Returns
sigma_start_index (non-negative integer) – Index estimated in the mean_sigma array corresponding to the value that defines the start of the valid empirical calibration interval (i.e. index to the smallest std for which a meaningful error mapping is obtained, according to the criteria explained before).
sigma_end_index (non-negative integer) – Index estimated in the mean_sigma array corresponding to the value that defines the end of the valid empirical calibration interval (i.e. index to the largest std for which a meaningful error mapping is obtained, according to the criteria explained before).
- exarl.candlelib.candle.applying_calibration(pSigma_test, pPred_test, true_test, s_interpolate, minL_sigma_auto, maxL_sigma_auto)
Use the empirical mapping between standard deviation and absolute value of error estimated during calibration (i.e. apply the univariate spline computed) to estimate the error for the part of the standard deviation array that was reserved for testing the empirical calibration. The resulting error array (yp_test) should overestimate the true observed error (eabs_red). All the computations are restricted to the valid calibration interval: [minL_sigma_auto, maxL_sigma_auto].
- Parameters
pSigma_test (numpy array) – Part of the standard deviations array to use for calibration testing.
pPred_test (numpy array) – Part of the predictions array to use for calibration testing.
true_test (numpy array) – Part of the true (observed) values array to use for calibration testing.
s_interpolate (scipy.interpolate python object) – A python object from scipy.interpolate that computes a univariate spline (InterpolatedUnivariateSpline) expressing the mapping from standard deviation to error. This spline is generated during the computational empirical calibration procedure.
minL_sigma_auto (float) – Starting value of the valid empirical calibration interval (i.e. smallest std for which a meaningful error mapping is obtained).
maxL_sigma_auto (float) – Ending value of the valid empirical calibration interval (i.e. largest std for which a meaningful error mapping is obtained).
- Returns
index_sigma_range_test (numpy array) – Indices of the pSigma_test array that are included in the valid calibration interval, given by: [minL_sigma_auto, maxL_sigma_auto].
xp_test (numpy array) – Array with the mean standard deviations in the calibration testing array.
yp_test (numpy array) – Mapping of the given standard deviation to error computed from the interpolation spline constructed by empirical calibration.
eabs_red (numpy array) – Array with the observed absolute errors in the part of the testing array for which the observed standard deviations are in the valid interval of calibration.
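The following minimal sketch shows how a spline of this kind could be applied to the testing portion. It is an illustration under stated assumptions, not the library's code: the spline here is a hypothetical linear mapping built from four made-up points, and treating the valid interval as closed at both ends follows the bracket notation above.

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# Hypothetical calibration mapping: error threshold equals 2 * sigma.
s_interpolate = InterpolatedUnivariateSpline(
    [0.1, 0.4, 0.7, 1.0], [0.2, 0.8, 1.4, 2.0], k=1
)

# Made-up testing arrays.
pSigma_test = np.array([0.05, 0.2, 0.5, 0.9, 1.5])
pPred_test = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
true_test = np.array([1.1, 2.3, 2.5, 4.5, 9.0])
minL_sigma_auto, maxL_sigma_auto = 0.1, 1.0

# Keep only samples whose sigma lies in the valid calibration interval.
in_range = (pSigma_test >= minL_sigma_auto) & (pSigma_test <= maxL_sigma_auto)
index_sigma_range_test = np.flatnonzero(in_range)

xp_test = pSigma_test[index_sigma_range_test]
yp_test = s_interpolate(xp_test)  # calibrated error estimate
eabs_red = np.abs(true_test - pPred_test)[index_sigma_range_test]  # observed error
print(index_sigma_range_test, yp_test, eabs_red)
```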
- exarl.candlelib.candle.overprediction_check(yp_test, eabs_red)
Compute the percentage of overestimated absolute error predictions for the arrays reserved for calibration testing and whose corresponding standard deviations are included in the valid calibration interval.
- Parameters
yp_test (numpy array) – Mapping of the standard deviation to error computed from the interpolation spline constructed by empirical calibration.
eabs_red (numpy array) – Array with the observed absolute errors in the part of the testing array for which the observed standard deviations are in the valid interval of calibration.
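As a rough illustration, the percentage of overestimated errors can be computed as below. The arrays are made up, and counting ties (estimate equal to observed error) as overestimates is an assumption; the reference above does not say how the library treats them.

```python
import numpy as np

# Made-up calibrated error estimates and observed absolute errors.
yp_test = np.array([0.4, 1.0, 1.8, 0.2])
eabs_red = np.array([0.3, 0.5, 0.5, 0.6])

# Percentage of samples whose estimated error covers the observed error.
over_pct = 100.0 * np.mean(yp_test >= eabs_red)
print(over_pct)
```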
- exarl.candlelib.candle.generate_index_distribution(numTrain, numTest, numValidation, params)
Generates a vector of indices to partition the data for training. NO CHECKING IS DONE: it is assumed that the data could be partitioned in the specified blocks and that the block indices describe a coherent partition.
- Parameters
numTrain (int) – Number of training data points
numTest (int) – Number of testing data points
numValidation (int) – Number of validation data points (may be zero)
params (dictionary with parameters) – Contains the keywords that control the behavior of the function (uq_train_fr, uq_valid_fr, uq_test_fr for fraction specification, uq_train_vec, uq_valid_vec, uq_test_vec for block list specification, and uq_train_bks, uq_valid_bks, uq_test_bks for block number specification)
- Returns
indexTrain (int numpy array) – Indices for data in training
indexValidation (int numpy array) – Indices for data in validation (if any)
indexTest (int numpy array) – Indices for data in testing (if merging)
- exarl.candlelib.candle.start_profiling(do_prof)
- exarl.candlelib.candle.stop_profiling(do_prof)
- exarl.candlelib.candle.get_default_exarl_parser(parser)
- exarl.candlelib.candle.search(dictionary, substr)