exarl.candlelib.candle_keras

Package Contents

Classes

ArgumentStruct

Class that converts a python dictionary into an object with

Benchmark

Class that implements an interface to handle configuration options for the

Progbar

Parameter target: total number of steps expected.

PermanentDropout

LoggingCallback

CandleRemoteMonitor

Capture Run level output and store/send for monitoring

TerminateOnTimeOut

This class implements timeout on model training. When the script reaches timeout,

Functions

load_csv_data(train_path, test_path=None, sep=',', nrows=None, x_cols=None, y_cols=None, drop_cols=None, onehot_cols=None, n_cols=None, random_cols=False, shuffle=False, scaling=None, dtype=None, validation_split=None, return_dataframe=True, return_header=False, seed=DEFAULT_SEED)

Load data from the files specified.

load_Xy_one_hot_data2(train_file, test_file, class_col=None, drop_cols=None, n_cols=None, shuffle=False, scaling=None, validation_split=0.1, dtype=DEFAULT_DATATYPE, seed=DEFAULT_SEED)

Load training and testing data from the files specified, with a column indicated to use as label.

load_Xy_data_noheader(train_file, test_file, classes, usecols=None, scaling=None, dtype=DEFAULT_DATATYPE)

Load training and testing data from the files specified, with the first column to use as label.

drop_impute_and_scale_dataframe(df, scaling='std', imputing='mean', dropna='all')

Impute missing values with mean and scale data included in pandas dataframe.

discretize_dataframe(df, col, bins=2, cutoffs=None)

Discretize values of given column in pandas dataframe.

discretize_array(y, bins=5)

Discretize values of given array.

lookup(df, query, ret, keys, match='match')

Dataframe lookup.

get_file(fname, origin, untar=False, md5_hash=None, cache_subdir='common', datadir=None)

Downloads a file from a URL if it is not already in the cache.

str2bool(v)

This is taken from:

finalize_parameters(bmk)

Utility to parse parameters in common as well as parameters

fetch_file(link, subdir, untar=False, md5_hash=None)

Convert URL to file path and download the file

verify_path(path)

Verify if a directory path exists locally. If the path

keras_default_config()

Defines parameters that intervene in different functions using the Keras defaults.

set_up_logger(logfile, logger, verbose)

Set up the event logging system. Two handlers are created.

plot_history(out, history, metric='loss', title=None, width=8, height=6)

plot_scatter(data, classes, out, width=10, height=8)

plot_density_observed_vs_predicted(Ytest, Ypred, pred_name=None, figprefix=None)

Functionality to plot a 2D histogram of the distribution of observed (ground truth)

plot_2d_density_sigma_vs_error(sigma, yerror, method=None, figprefix=None)

Functionality to plot a 2D histogram of the distribution of

plot_histogram_error_per_sigma(sigma, yerror, method=None, figprefix=None)

Functionality to plot a 1D histogram of the distribution of

plot_calibration_and_errors(mean_sigma, sigma_start_index, sigma_end_index, min_sigma, max_sigma, error_thresholds, error_thresholds_smooth, err_err, s_interpolate, coverage_percentile, method=None, figprefix=None, steps=False)

Functionality to plot empirical calibration curves

plot_percentile_predictions(Ypred, Ypred_Lp, Ypred_Hp, percentile_list, pred_name=None, figprefix=None)

Functionality to plot the mean of the percentiles predicted.

compute_statistics_homoscedastic(df_data, col_true=0, col_pred=6, col_std_pred=7)

Extracts ground truth, mean prediction, error and

compute_statistics_homoscedastic_all(df_data, col_true=4, col_pred_start=6)

Extracts ground truth, mean prediction, error and

compute_statistics_heteroscedastic(df_data, col_true=4, col_pred_start=6, col_std_pred_start=7)

Extracts ground truth, mean prediction, error, standard

compute_statistics_quantile(df_data, sigma_divisor=2.56, col_true=4, col_pred_start=6)

Extracts ground truth, 50th percentile mean prediction,

split_data_for_empirical_calibration(Ytrue, Ypred, sigma, cal_split=0.8)

Extracts a portion of the arrays provided for the computation

compute_empirical_calibration(pSigma_cal, pPred_cal, true_cal, bins, coverage_percentile)

Use the arrays provided to estimate an empirical mapping

bining_for_calibration(pSigma_cal_ordered_, minL_sigma, maxL_sigma, Er_vect_cal_orderedSigma_, bins, coverage_percentile)

Bin the values of the standard deviations observed during

computation_of_valid_calibration_interval(error_thresholds, error_thresholds_smooth, err_err)

Function that estimates the empirical range in which a

applying_calibration(pSigma_test, pPred_test, true_test, s_interpolate, minL_sigma_auto, maxL_sigma_auto)

Use the empirical mapping between standard deviation and

overprediction_check(yp_test, eabs_red)

Compute the percentage of overestimated absolute error

build_initializer(type, kerasDefaults, seed=None, constant=0.0)

Set the initializer to the appropriate Keras initializer function

build_optimizer(type, lr, kerasDefaults)

Set the optimizer to the appropriate Keras optimizer function

set_seed(seed)

Set the random number seed to the desired value

set_parallelism_threads()

Set the number of parallel threads according to the number available on the hardware

register_permanent_dropout()

r2(y_true, y_pred)

mae(y_true, y_pred)

mse(y_true, y_pred)

compute_trainable_params(model)

Extract number of parameters from the given Keras model

exarl.candlelib.candle_keras.load_csv_data(train_path, test_path=None, sep=',', nrows=None, x_cols=None, y_cols=None, drop_cols=None, onehot_cols=None, n_cols=None, random_cols=False, shuffle=False, scaling=None, dtype=None, validation_split=None, return_dataframe=True, return_header=False, seed=DEFAULT_SEED)

Load data from the files specified. Columns corresponding to data features and labels can be specified. A one-hot encoding can be used for either features or labels. If validation_split is specified, training data is further split into training and validation partitions. pandas DataFrames are used to load and pre-process the data. If specified, those DataFrames are returned. Otherwise just values are returned. Labels to output can be integer labels (for classification) or continuous labels (for regression). Columns to load can be specified, randomly selected or a subset can be dropped. Order of rows can be shuffled. Data can be rescaled. This function assumes that the files contain a header with column names.

Parameters
  • train_path (filename) – Name of the file to load the training data.

  • test_path (filename) – Name of the file to load the testing data. (Optional).

  • sep (character) – Character used as column separator. (Default: ‘,’, comma separated values).

  • nrows (integer) – Number of rows to load from the files. (Default: None, all the rows are used).

  • x_cols (list) – List of columns to use as features. (Default: None).

  • y_cols (list) – List of columns to use as labels. (Default: None).

  • drop_cols (list) – List of columns to drop from the files being loaded. (Default: None, all the columns are used).

  • onehot_cols (list) – List of columns to one-hot encode. (Default: None).

  • n_cols (integer) – Number of columns to load from the files. (Default: None).

  • random_cols (boolean) – Boolean flag to indicate random selection of columns. If True a number of n_cols columns is randomly selected, if False the specified columns are used. (Default: False).

  • shuffle (boolean) – Boolean flag to indicate row shuffling. If True the rows are re-ordered, if False the order in which rows are read is preserved. (Default: False, no permutation of the loading row order).

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1 to 1]. ‘minmax’ : scales data to range [0 to 1]. ‘std’ : standardizes data to mean 0 and standard deviation 1. (Default: None, no scaling).

  • dtype (data type) – Data type to use for the output pandas DataFrames. (Default: None).

  • validation_split (float) – Fraction of training data to set aside for validation. (Default: None, no validation partition is constructed).

  • return_dataframe (boolean) – Boolean flag to indicate that the pandas DataFrames used for data pre-processing are to be returned. (Default: True, pandas DataFrames are returned).

  • return_header (boolean) – Boolean flag to indicate if the column headers are to be returned. (Default: False, no column headers are separately returned).

  • seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils).

Returns

Tuples of data features and labels are returned, for train, validation and testing partitions, together with the column names (headers). The specific objects to return depend on the options selected.
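The interaction of shuffle, seed and validation_split can be illustrated with a small standalone sketch. It is not the library's code: the helper name `split_train_validation`, the seed value 0 and the choice to take the validation rows from the end of the (optionally shuffled) training rows are all assumptions made for illustration.

```python
import random


def split_train_validation(rows, validation_split=None, shuffle=False, seed=0):
    """Hypothetical helper: optionally shuffle rows, then hold out a fraction."""
    rows = list(rows)
    if shuffle:
        # A dedicated generator seeded with `seed` makes the order reproducible.
        random.Random(seed).shuffle(rows)
    if validation_split is None:
        # No validation partition is constructed.
        return rows, []
    n_val = int(len(rows) * validation_split)
    cut = len(rows) - n_val
    # Assumption: the last n_val rows form the validation partition.
    return rows[:cut], rows[cut:]


train, val = split_train_validation(range(10), validation_split=0.2,
                                    shuffle=True, seed=0)
print(len(train), len(val))  # 8 2
```

Calling the helper twice with the same seed yields the same partitions, which is the purpose of passing a seed.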

exarl.candlelib.candle_keras.load_Xy_one_hot_data2(train_file, test_file, class_col=None, drop_cols=None, n_cols=None, shuffle=False, scaling=None, validation_split=0.1, dtype=DEFAULT_DATATYPE, seed=DEFAULT_SEED)

Load training and testing data from the files specified, with a column indicated to use as label. Further split training data into training and validation partitions, and construct corresponding training, validation and testing pandas DataFrames, separated into data (i.e. features) and labels. Labels to output are one-hot encoded (categorical). Columns to load can be selected or dropped. Order of rows can be shuffled. Data can be rescaled. Training and testing partitions (coming from the respective files) are preserved, but training is split into training and validation partitions. This function assumes that the files contain a header with column names.

Parameters
  • train_file (filename) – Name of the file to load the training data.

  • test_file (filename) – Name of the file to load the testing data.

  • class_col (integer) – Index of the column to use as the label. (Default: None, this would cause the function to fail, a label has to be indicated at calling).

  • drop_cols (list) – List of column names to drop from the files being loaded. (Default: None, all the columns are used).

  • n_cols (integer) – Number of columns to load from the files. (Default: None, all the columns are used).

  • shuffle (boolean) – Boolean flag to indicate row shuffling. If True the rows are re-ordered, if False the order in which rows are loaded is preserved. (Default: False, no permutation of the loading row order).

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1 to 1]. ‘minmax’ : scales data to range [0 to 1]. ‘std’ : standardizes data to mean 0 and standard deviation 1. (Default: None, no scaling).

  • validation_split (float) – Fraction of training data to set aside for validation. (Default: 0.1, ten percent of the training data is used for the validation partition).

  • dtype (data type) – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils).

  • seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils).

Returns

  • X_train (pandas DataFrame) – Data features for training loaded in a pandas DataFrame and pre-processed as specified.

  • y_train (pandas DataFrame) – Data labels for training loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

  • X_val (pandas DataFrame) – Data features for validation loaded in a pandas DataFrame and pre-processed as specified.

  • y_val (pandas DataFrame) – Data labels for validation loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

  • X_test (pandas DataFrame) – Data features for testing loaded in a pandas DataFrame and pre-processed as specified.

  • y_test (pandas DataFrame) – Data labels for testing loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

exarl.candlelib.candle_keras.load_Xy_data_noheader(train_file, test_file, classes, usecols=None, scaling=None, dtype=DEFAULT_DATATYPE)

Load training and testing data from the files specified, with the first column to use as label. Construct corresponding training and testing pandas DataFrames, separated into data (i.e. features) and labels. Labels to output are one-hot encoded (categorical). Columns to load can be selected. Data can be rescaled. Training and testing partitions (coming from the respective files) are preserved. This function assumes that the files do not contain a header.

Parameters
  • train_file (filename) – Name of the file to load the training data.

  • test_file (filename) – Name of the file to load the testing data.

  • classes (integer) – Number of total classes to consider when building the categorical (one-hot) label encoding.

  • usecols (list) – List of column indices to load from the files. (Default: None, all the columns are used).

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1 to 1]. ‘minmax’ : scales data to range [0 to 1]. ‘std’ : standardizes data to mean 0 and standard deviation 1. (Default: None, no scaling).

  • dtype (data type) – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils).

Returns

  • X_train (pandas DataFrame) – Data features for training loaded in a pandas DataFrame and pre-processed as specified.

  • Y_train (pandas DataFrame) – Data labels for training loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

  • X_test (pandas DataFrame) – Data features for testing loaded in a pandas DataFrame and pre-processed as specified.

  • Y_test (pandas DataFrame) – Data labels for testing loaded in a pandas DataFrame. One-hot encoding (categorical) is used.
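The role of the classes argument in the one-hot (categorical) label encoding can be sketched in plain Python. The helper `one_hot` below is hypothetical and assumes integer labels in the range 0 to classes - 1; the library itself may rely on a framework utility for this step.

```python
def one_hot(labels, classes):
    """Hypothetical helper: encode integer labels as rows of 0/1 flags."""
    encoded = []
    for label in labels:
        row = [0] * classes       # one slot per class
        row[int(label)] = 1       # flag the slot of the observed class
        encoded.append(row)
    return encoded


# Three samples, four possible classes (class 3 never occurs in the data,
# which is why the total number of classes is passed explicitly).
print(one_hot([0, 2, 1], classes=4))
# [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
```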

exarl.candlelib.candle_keras.drop_impute_and_scale_dataframe(df, scaling='std', imputing='mean', dropna='all')

Impute missing values with mean and scale data included in pandas dataframe.

Parameters
  • df (pandas dataframe) – dataframe to process

  • scaling (string) – String describing type of scaling to apply. ‘maxabs’ [-1,1], ‘minmax’ [0,1], ‘std’, or None, optional (Default ‘std’)

  • imputing (string) – String describing type of imputation to apply. ‘mean’ replace missing values with mean value along the column, ‘median’ replace missing values with median value along the column, ‘most_frequent’ replace missing values with most frequent value along column (Default: ‘mean’).

  • dropna (string) – String describing strategy for handling missing values. ‘all’ if all values are NA, drop that column. ‘any’ if any NA values are present, drop that column. (Default: ‘all’).

Returns

Returns the data frame after handling missing values and scaling.
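A standalone sketch of mean imputation followed by the three scaling options, applied to a single column. It uses only the standard library rather than pandas, the helper names are hypothetical, and the use of the population standard deviation for ‘std’ is an assumption. Constant columns (which would cause a division by zero) are not handled here.

```python
import statistics


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in column]


def scale(column, scaling="std"):
    """Scale one column; option names follow the parameter list above."""
    if scaling is None:
        return list(column)
    if scaling == "maxabs":            # result lies in [-1, 1]
        m = max(abs(v) for v in column)
        return [v / m for v in column]
    if scaling == "minmax":            # result lies in [0, 1]
        lo, hi = min(column), max(column)
        return [(v - lo) / (hi - lo) for v in column]
    # 'std': zero mean, unit standard deviation (population std assumed)
    mu = statistics.mean(column)
    sd = statistics.pstdev(column)
    return [(v - mu) / sd for v in column]


raw = [2.0, None, 4.0, 6.0]
filled = impute_mean(raw)
print(filled)                     # [2.0, 4.0, 4.0, 6.0]
print(scale(filled, "minmax"))    # [0.0, 0.5, 0.5, 1.0]
```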

exarl.candlelib.candle_keras.discretize_dataframe(df, col, bins=2, cutoffs=None)

Discretize values of given column in pandas dataframe.

Parameters
  • df (pandas dataframe) – dataframe to process.

  • col (int) – Index of column to bin.

  • bins (int) – Number of bins for distributing column values.

  • cutoffs (list) – List of bin limits. If None, the limits are computed as percentiles. (Default: None).

Returns

Returns the data frame with the values of the specified column binned, i.e. the values are replaced by the associated bin number.
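The binning can be sketched without pandas as follows. This is an illustration under assumptions, not the library's implementation: the helper name is hypothetical, the default cutoffs are taken at evenly spaced percentiles, and a value equal to a cutoff is assigned to the upper bin.

```python
import bisect
import statistics


def discretize_sketch(values, bins=2, cutoffs=None):
    """Hypothetical helper: map each value to the index of its bin."""
    if cutoffs is None:
        # bins - 1 interior cut points at evenly spaced percentiles.
        cutoffs = statistics.quantiles(values, n=bins, method="inclusive")
    # bisect_right counts how many cutoffs are <= the value.
    return [bisect.bisect_right(cutoffs, v) for v in values]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(discretize_sketch(data))                       # [0, 0, 0, 1, 1, 1]
print(discretize_sketch(data, cutoffs=[2.5, 4.5]))   # [0, 0, 1, 1, 2, 2]
```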

exarl.candlelib.candle_keras.discretize_array(y, bins=5)

Discretize values of given array.

Parameters
  • y (numpy array) – array to discretize.

  • bins (int) – Number of bins for distributing column values.

Returns

Returns an array with the bin number associated to the values in the original array.

exarl.candlelib.candle_keras.lookup(df, query, ret, keys, match='match')

Dataframe lookup.

Parameters
  • df (pandas dataframe) – dataframe for retrieving values.

  • query (string) – String for searching.

  • ret (int/string or list) – Names or indices of columns to be returned.

  • keys (list) – List of strings or integers specifying the names or indices of columns to look into.

  • match (string) – String describing strategy for matching keys to query.

Returns

Returns a list of the values in the dataframe whose columns match the specified query and have been selected to be returned.

exarl.candlelib.candle_keras.get_file(fname, origin, untar=False, md5_hash=None, cache_subdir='common', datadir=None)

Downloads a file from a URL if it is not already in the cache. If an MD5 hash is passed, the file is verified after download, and also when it is already present in the cache.

Parameters
  • fname (string) – name of the file

  • origin (string) – original URL of the file

  • untar (boolean) – whether the file should be decompressed

  • md5_hash (string) – MD5 hash of the file for verification

  • cache_subdir (string) – directory being used as the cache

  • datadir (string) – if set, datadir is used as the cache location (it may be, e.g., an absolute path) and cache_subdir is ignored

Returns

Path to the downloaded file
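The MD5 check on a cached file can be sketched with the standard library. The helper names `md5_of` and `needs_download` are hypothetical, and treating a hash mismatch as a reason to download again is an assumption about the behaviour, not a statement of what get_file does internally.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, md5_hash=None):
    """Hypothetical cache check: fetch if missing or if the hash differs."""
    if not os.path.exists(path):
        return True
    if md5_hash is not None and md5_of(path) != md5_hash:
        return True
    return False


with tempfile.TemporaryDirectory() as cache:
    cached = os.path.join(cache, "example.csv")
    with open(cached, "wb") as f:
        f.write(b"a,b\n1,2\n")
    good_hash = md5_of(cached)

    hit = needs_download(cached, good_hash)      # present, hash matches
    stale = needs_download(cached, "0" * 32)     # present, hash differs
    missing = needs_download(os.path.join(cache, "other.csv"))

print(hit, stale, missing)  # False True True
```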

class exarl.candlelib.candle_keras.ArgumentStruct(**entries)

Class that converts a python dictionary into an object with named entries given by the dictionary keys. This structure simplifies the calling convention for accessing the dictionary values (corresponding to problem parameters). After the object instantiation both modes of access (dictionary or object entries) can be used.
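The two modes of access can be illustrated with a minimal stand-in. The class `ArgumentStructSketch` and the parameter names are hypothetical, and the sketch assumes the entries are simply copied onto the instance.

```python
class ArgumentStructSketch:
    """Hypothetical stand-in: exposes dictionary keys as attributes."""

    def __init__(self, **entries):
        # Copy every key/value pair onto the instance namespace.
        self.__dict__.update(entries)


# Illustrative problem parameters (names made up for this example).
params = {"learning_rate": 0.01, "epochs": 10}
args = ArgumentStructSketch(**params)

print(args.learning_rate)   # attribute-style access: 0.01
print(params["epochs"])     # the original dictionary is still usable: 10
```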

class exarl.candlelib.candle_keras.Benchmark(filepath, defmodel, framework, prog=None, desc=None, parser=None)

Class that implements an interface to handle configuration options for the different CANDLE benchmarks. It provides access to all the common configuration options and configuration options particular to each individual benchmark. It describes what minimum requirements should be specified to instantiate the corresponding benchmark. It interacts with the argparser to extract command-line options and arguments from the benchmark’s configuration files.

Initialize Benchmark object.

Parameters
  • filepath (./) – os.path.dirname where the benchmark is located. Necessary to locate utils and establish input/output paths

  • defmodel ('p*b*_default_model.txt') – string corresponding to the default model of the benchmark

  • framework ('keras', 'neon', 'mxnet', 'pytorch') – framework used to run the benchmark

  • prog ('p*b*_baseline_*') – string for program name (usually associated to benchmark and framework)

  • desc (' ') – string describing benchmark (usually a description of the neural network model built)

  • parser (argparser (default None)) – if ‘neon’ framework a NeonArgparser is passed. Otherwise an argparser is constructed.

parse_from_common(self)

Functionality to parse options common for all benchmarks. This functionality is based on methods ‘get_default_neon_parser’ and ‘get_common_parser’ which are defined previously (above). If the order changes or they are moved, the calls have to be updated.

parse_from_benchmark(self)

Functionality to parse options specific for each benchmark.

format_benchmark_config_arguments(self, dictfileparam)

Functionality to format the particular parameters of the benchmark.

Parameters
  • dictfileparam (python dictionary) – parameters read from configuration file

  • args (python dictionary) – parameters read from the command line. Most of the time the command line overrides the configuration file, except when the command line is using default values and the config file defines those values.
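The precedence rule between configuration file and command line can be sketched as a dictionary merge. The function `merge_parameters`, the parameter names and the way defaults are detected (by comparing against a separate dictionary of defaults) are assumptions for illustration only.

```python
def merge_parameters(config_params, cli_params, cli_defaults):
    """Hypothetical merge: the command line wins unless it only holds a default."""
    merged = dict(config_params)
    for key, value in cli_params.items():
        is_default = key in cli_defaults and value == cli_defaults[key]
        if is_default and key in config_params:
            # The command line was left at its default and the config
            # file defines the value, so the config file value is kept.
            continue
        merged[key] = value
    return merged


config = {"epochs": 50, "batch_size": 32}
defaults = {"epochs": 10, "batch_size": 20, "verbose": False}
cli = {"epochs": 10, "batch_size": 64, "verbose": False}

print(merge_parameters(config, cli, defaults))
# {'epochs': 50, 'batch_size': 64, 'verbose': False}
```

Here epochs keeps the config file value because the command line only carries its default, while batch_size was set explicitly on the command line and therefore overrides the file.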

read_config_file(self, file)

Functionality to read the configuration file specific for each benchmark.

set_locals(self)

Functionality to set variables specific for the benchmark:

  • required – set of required parameters for the benchmark.

  • additional_definitions – list of dictionaries describing the additional parameters for the benchmark.

check_required_exists(self, gparam)

Functionality to verify that the required model parameters have been specified.
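A check of this kind can be sketched as a comparison between the set of required names and the keys of the parameter dictionary. The function name, the parameter names and the choice of `KeyError` are hypothetical; the actual method may report missing parameters differently.

```python
def check_required_sketch(required, gparam):
    """Hypothetical check: raise if any required parameter is absent."""
    missing = sorted(key for key in required if key not in gparam)
    if missing:
        raise KeyError("Missing required parameters: " + ", ".join(missing))


required = {"epochs", "batch_size"}          # illustrative required set
check_required_sketch(required, {"epochs": 5, "batch_size": 32})  # passes

try:
    check_required_sketch(required, {"epochs": 5})
except KeyError as err:
    message = str(err)
    print(message)
```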

exarl.candlelib.candle_keras.str2bool(v)

This is taken from: https://stackoverflow.com/questions/15008758/parsing-boolean-values-with-argparse, because type=bool is not interpreted as a bool and action=’store_true’ cannot be undone.

Parameters

v (string) – String to interpret

Returns

  • Boolean value. It raises an exception if the provided string cannot be interpreted as a boolean type.

  • Strings recognized as boolean True – ‘yes’, ‘true’, ‘t’, ‘y’, ‘1’ and uppercase versions (where applicable).

  • Strings recognized as boolean False – ‘no’, ‘false’, ‘f’, ‘n’, ‘0’ and uppercase versions (where applicable).
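A converter of this kind, used as an argparse `type`, can be sketched as below. The function `str2bool_sketch` is a re-implementation written for illustration from the recognized strings listed above, and the flag name `--shuffle` is made up.

```python
import argparse


def str2bool_sketch(v):
    """Interpret common yes/no strings as booleans (illustrative version)."""
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "y", "1"):
        return True
    if v.lower() in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("Boolean value expected.")


parser = argparse.ArgumentParser()
# Unlike action='store_true', this flag can be switched on or off explicitly.
parser.add_argument("--shuffle", type=str2bool_sketch, default=False)

print(parser.parse_args(["--shuffle", "Yes"]).shuffle)   # True
print(parser.parse_args(["--shuffle", "0"]).shuffle)     # False
```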

exarl.candlelib.candle_keras.finalize_parameters(bmk)

Utility to parse parameters in common as well as parameters particular to each benchmark.

Parameters

bmk (benchmark object) – Object that has benchmark filepaths and specifications

Returns

gParameters (python dictionary) – Dictionary with all the parameters necessary to run the benchmark. Command line overwrites config file specifications

exarl.candlelib.candle_keras.fetch_file(link, subdir, untar=False, md5_hash=None)

Convert URL to file path and download the file if it is not already present in the specified cache.

Parameters
  • link (link path) – URL of the file to download

  • subdir (directory path) – Local path to check for cached file.

  • untar (boolean) – Flag to specify if the file to download should be decompressed too. (default: False, no decompression)

  • md5_hash (MD5 hash) – Hash used as a checksum to verify data integrity. Verification is carried out if a hash is provided. (default: None, no verification)

Returns

local path to the downloaded, or cached, file.
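The conversion from URL to local path can be sketched as follows. The helper `url_to_cache_path`, the `cache_root` directory name and the example URL are all hypothetical; the sketch only shows the idea of reusing the file name at the end of the URL inside a cache subdirectory, and it performs no download.

```python
import os
from urllib.parse import urlparse


def url_to_cache_path(link, subdir, cache_root="cache"):
    """Hypothetical helper: map a URL to a path inside a local cache."""
    # Take the last component of the URL path as the file name.
    fname = os.path.basename(urlparse(link).path)
    return os.path.join(cache_root, subdir, fname)


local = url_to_cache_path("https://example.com/datasets/train.csv", "Pilot1")
print(local)
```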

exarl.candlelib.candle_keras.verify_path(path)

Verify if a directory path exists locally. If the path does not exist, but is a valid path, it recursively creates the specified directory path structure.

Parameters

path (directory path) – Description of local directory path
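The recursive creation can be sketched with `os.makedirs`. The helper name is hypothetical, and the sketch assumes that `path` names the directory itself rather than a file inside it.

```python
import os
import tempfile


def verify_path_sketch(path):
    """Hypothetical helper: create `path` and any missing parents."""
    if not os.path.isdir(path):
        # makedirs creates every missing intermediate directory.
        os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "data", "cache", "run1")
    verify_path_sketch(nested)
    created = os.path.isdir(nested)

print(created)  # True
```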

exarl.candlelib.candle_keras.keras_default_config()

Defines parameters that intervene in different functions using the Keras defaults. This helps to keep consistency in parameters between frameworks.

exarl.candlelib.candle_keras.set_up_logger(logfile, logger, verbose)

Set up the event logging system. Two handlers are created. One to send log records to a specified file and one to send log records to the (default) sys.stderr stream. The logger and the file handler are set to DEBUG logging level. The stream handler is set to INFO logging level, or to DEBUG logging level if the verbose flag is specified. Logging messages which are less severe than the level set will be ignored.

Parameters
  • logfile (filename) – File to store the log records

  • logger (logger object) – Python object for the logging interface

  • verbose (boolean) – Flag to increase the logging level from INFO to DEBUG. It only applies to the stream handler.
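The two-handler arrangement can be reproduced with the standard `logging` module. The sketch below follows the levels described above; the function name, the logger name and the message formats are assumptions.

```python
import logging
import os
import sys
import tempfile


def set_up_logger_sketch(logfile, logger, verbose):
    """Illustrative re-creation of the two-handler setup."""
    # File handler: records everything from DEBUG upward.
    fh = logging.FileHandler(logfile)
    fh.setFormatter(logging.Formatter("[%(asctime)s %(process)d] %(message)s"))
    fh.setLevel(logging.DEBUG)

    # Stream handler: INFO by default, DEBUG when verbose is set.
    sh = logging.StreamHandler(sys.stderr)
    sh.setFormatter(logging.Formatter("%(message)s"))
    sh.setLevel(logging.DEBUG if verbose else logging.INFO)

    logger.setLevel(logging.DEBUG)
    for handler in (fh, sh):
        logger.addHandler(handler)
    return fh, sh


log_path = os.path.join(tempfile.mkdtemp(), "run.log")
logger = logging.getLogger("candle_sketch")
file_handler, stream_handler = set_up_logger_sketch(log_path, logger, verbose=False)

logger.debug("written to the file only")
logger.info("written to the file and to stderr")

file_handler.close()
with open(log_path) as f:
    logged = f.read()
```

With verbose=False the DEBUG record reaches only the file, because the stream handler filters out anything below INFO.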

class exarl.candlelib.candle_keras.Progbar(target, width=30, verbose=1, interval=0.01)

Bases: object

Parameters
  • target (int) – total number of steps expected

  • interval (float) – minimum visual progress update interval (in seconds)

update(self, current, values=[], force=False)
Parameters
  • current (int) – index of current step

  • values (list of tuples (name, value_for_last_step).) – The progress bar will display averages for these values.

  • force (boolean) – force visual progress update

add(self, n, values=[])
exarl.candlelib.candle_keras.plot_history(out, history, metric='loss', title=None, width=8, height=6)
exarl.candlelib.candle_keras.plot_scatter(data, classes, out, width=10, height=8)
exarl.candlelib.candle_keras.plot_density_observed_vs_predicted(Ytest, Ypred, pred_name=None, figprefix=None)

Functionality to plot a 2D histogram of the distribution of observed (ground truth) values vs. predicted values. The plot generated is stored in a png file.

Parameters
  • Ytest (numpy array) – Array with (true) observed values

  • Ypred (numpy array) – Array with predicted values.

  • pred_name (string) – Name of data column or quantity predicted (e.g. growth, AUC, etc.)

  • figprefix (string) – String to prefix the filename to store the figure generated. A ‘_density_predictions.png’ string will be appended to the figprefix given.

exarl.candlelib.candle_keras.plot_2d_density_sigma_vs_error(sigma, yerror, method=None, figprefix=None)

Functionality to plot a 2D histogram of the distribution of the standard deviations computed for the predictions vs. the computed errors (i.e. values of observed - predicted). The plot generated is stored in a png file.

Parameters
  • sigma (numpy array) – Array with standard deviations computed.

  • yerror (numpy array) – Array with errors computed (observed - predicted).

  • method (string) – Method used to compute the standard deviations (e.g. dropout, heteroscedastic, etc.).

  • figprefix (string) – String to prefix the filename to store the figure generated. A ‘_density_sigma_error.png’ string will be appended to the figprefix given.

exarl.candlelib.candle_keras.plot_histogram_error_per_sigma(sigma, yerror, method=None, figprefix=None)

Functionality to plot a 1D histogram of the distribution of computed errors (i.e. values of observed - predicted) observed for specific values of standard deviations computed. The range of standard deviations computed is split in xbins values and the 1D histograms of error distributions for the smallest six standard deviations are plotted. The plot generated is stored in a png file.

Parameters
  • sigma (numpy array) – Array with standard deviations computed.

  • yerror (numpy array) – Array with errors computed (observed - predicted).

  • method (string) – Method used to compute the standard deviations (e.g. dropout, heteroscedastic, etc.).

  • figprefix (string) – String to prefix the filename to store the figure generated. A ‘_histogram_error_per_sigma.png’ string will be appended to the figprefix given.

exarl.candlelib.candle_keras.plot_calibration_and_errors(mean_sigma, sigma_start_index, sigma_end_index, min_sigma, max_sigma, error_thresholds, error_thresholds_smooth, err_err, s_interpolate, coverage_percentile, method=None, figprefix=None, steps=False)

Functionality to plot empirical calibration curves estimated by binning the statistics of computed standard deviations and errors.

Parameters
  • mean_sigma (numpy array) – Array with the mean standard deviations computed per bin.

  • sigma_start_index (non-negative integer) – Index of the mean_sigma array that defines the start of the valid empirical calibration interval (i.e. index to the smallest std for which a meaningful error is obtained).

  • sigma_end_index (non-negative integer) – Index of the mean_sigma array that defines the end of the valid empirical calibration interval (i.e. index to the largest std for which a meaningful error is obtained).

  • min_sigma (numpy array) – Array with the minimum standard deviations computed per bin.

  • max_sigma (numpy array) – Array with the maximum standard deviations computed per bin.

  • error_thresholds (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin.

  • error_thresholds_smooth (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin after a smoothing operation is applied to the frequently noisy bin-based estimations.

  • err_err (numpy array) – Vertical error bars (usually one standard deviation for a binomial distribution estimated by bin) for the error calibration computed empirically.

  • s_interpolate (scipy.interpolate python object) – A python object from scipy.interpolate that computes a univariate spline (InterpolatedUnivariateSpline) constructed to express the mapping from standard deviation to error. This spline is generated during the computational empirical calibration procedure.

  • coverage_percentile (float) – Value used for the coverage in the percentile estimation of the observed error.

  • method (string) – Method used to compute the standard deviations (e.g. dropout, heteroscedastic, etc.).

  • figprefix (string) – String to prefix the filename to store the figure generated. A ‘_empirical_calibration.png’ string will be appended to the figprefix given.

  • steps (boolean) – Besides the complete empirical calibration (including raw statistics, error bars and smoothing), also generates partial plots with only the raw bin statistics (step1) and with only the raw bin statistics and the smoothing interpolation (step2).

exarl.candlelib.candle_keras.plot_percentile_predictions(Ypred, Ypred_Lp, Ypred_Hp, percentile_list, pred_name=None, figprefix=None)

Functionality to plot the mean of the percentiles predicted. The plot generated is stored in a png file.

Parameters
  • Ypred (numpy array) – Array with mid percentile predicted values.

  • Ypred_Lp (numpy array) – Array with low percentile predicted values.

  • Ypred_Hp (numpy array) – Array with high percentile predicted values.

  • percentile_list (string list) – List of percentiles predicted (e.g. ‘10p’, ‘90p’, etc.)

  • pred_name (string) – Name of data column or quantity predicted (e.g. growth, AUC, etc.)

  • figprefix (string) – String to prefix the filename to store the figure generated. A ‘_density_predictions.png’ string will be appended to the figprefix given.

exarl.candlelib.candle_keras.compute_statistics_homoscedastic(df_data, col_true=0, col_pred=6, col_std_pred=7)

Extracts ground truth, mean prediction, error and standard deviation of prediction from inference data frame. The latter includes the statistics over all the inference realizations.

Parameters
  • df_data (pandas data frame) – Data frame generated by current CANDLE inference experiments. Indices are hard coded to agree with current CANDLE version. (The inference file usually has the name: <model>_pred.tsv).

  • col_true (integer) – Index of the column in the data frame where the true value is stored (Default: 0, index in current CANDLE format).

  • col_pred (integer) – Index of the column in the data frame where the predicted value is stored (Default: 6, index in current CANDLE format).

  • col_std_pred (integer) – Index of the column in the data frame where the standard deviation of the predicted values is stored (Default: 7, index in current CANDLE format).

Returns

  • Ytrue (numpy array) – Array with true (observed) values

  • Ypred (numpy array) – Array with predicted values.

  • yerror (numpy array) – Array with errors computed (observed - predicted).

  • sigma (numpy array) – Array with standard deviations learned with deep learning model. For homoscedastic inference this corresponds to the std value computed from prediction (and is equal to the following returned variable).

  • Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.

  • pred_name (string) – Name of data column or quantity predicted (as extracted from the data frame using the col_true index).

exarl.candlelib.candle_keras.compute_statistics_homoscedastic_all(df_data, col_true=4, col_pred_start=6)

Extracts ground truth, mean prediction, error and standard deviation of prediction from inference data frame. The latter includes all the individual inference realizations.

Parameters
  • df_data (pandas data frame) – Data frame generated by current CANDLE inference experiments. Indices are hard coded to agree with current CANDLE version. (The inference file usually has the name: <model>.predicted_INFER.tsv).

  • col_true (integer) – Index of the column in the data frame where the true value is stored (Default: 4, index in current HOM format).

  • col_pred_start (integer) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored (Default: 6 index, in current HOM format).

Returns

  • Ytrue (numpy array) – Array with true (observed) values

  • Ypred (numpy array) – Array with predicted values.

  • yerror (numpy array) – Array with errors computed (observed - predicted).

  • sigma (numpy array) – Array with standard deviations learned with deep learning model. For homoscedastic inference this corresponds to the std value computed from prediction (and is equal to the following returned variable).

  • Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.

  • pred_name (string) – Name of data column or quantity predicted (as extracted from the data frame using the col_true index).
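
The column extraction described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the helper name sketch_homoscedastic_all and the toy data frame are hypothetical, and every column from col_pred_start onward is assumed to hold one inference realization.

```python
import numpy as np
import pandas as pd


def sketch_homoscedastic_all(df_data, col_true=4, col_pred_start=6):
    """Illustrative only: mirrors the documented outputs, not the library code."""
    Ytrue = df_data.iloc[:, col_true].values
    pred_name = df_data.columns[col_true]
    # Assumption: all columns from col_pred_start onward are realizations.
    realizations = df_data.iloc[:, col_pred_start:].values
    Ypred = realizations.mean(axis=1)
    Ypred_std = realizations.std(axis=1)
    yerror = Ytrue - Ypred   # observed - predicted
    sigma = Ypred_std        # homoscedastic case: sigma equals Ypred_std
    return Ytrue, Ypred, yerror, sigma, Ypred_std, pred_name


# Toy frame: true value at index 4, three realizations starting at index 6.
df = pd.DataFrame({
    "a": [0, 0], "b": [0, 0], "c": [0, 0], "d": [0, 0],
    "growth": [1.0, 2.0],
    "e": [0, 0],
    "r0": [1.0, 2.5], "r1": [1.5, 2.5], "r2": [2.0, 2.5],
})
Ytrue, Ypred, yerror, sigma, Ypred_std, pred_name = sketch_homoscedastic_all(df)
```
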

exarl.candlelib.candle_keras.compute_statistics_heteroscedastic(df_data, col_true=4, col_pred_start=6, col_std_pred_start=7)

Extracts ground truth, mean prediction, error, standard deviation of prediction and predicted (learned) standard deviation from inference data frame. The latter includes all the individual inference realizations.

Parameters
  • df_data (pandas data frame) – Data frame generated by current heteroscedastic inference experiments. Indices are hard coded to agree with current version. (The inference file usually has the name: <model>.predicted_INFER_HET.tsv).

  • col_true (integer) – Index of the column in the data frame where the true value is stored (Default: 4, index in current HET format).

  • col_pred_start (integer) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored and are interleaved with standard deviation predictions (Default: 6, with step 2, index in current HET format).

  • col_std_pred_start (integer) – Index of the column in the data frame where the first predicted standard deviation value is stored. All the predicted standard deviation values during inference are stored and are interleaved with predictions (Default: 7, with step 2, index in current HET format).

Returns

  • Ytrue (numpy array) – Array with true (observed) values

  • Ypred (numpy array) – Array with predicted values.

  • yerror (numpy array) – Array with errors computed (observed - predicted).

  • sigma (numpy array) – Array with standard deviations learned with deep learning model. For homoscedastic inference this corresponds to the std value computed from prediction (and is equal to the following returned variable).

  • Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.

  • pred_name (string) – Name of data column or quantity predicted (as extracted from the data frame using the col_true index).
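
The interleaved column layout (predictions at indices 6, 8, ... and standard deviations at 7, 9, ...) can be separated with a step-2 slice. The sketch below uses a hypothetical one-row frame and assumes the standard deviation columns hold standard deviations directly; the actual HET file may store a transformed quantity, so treat the averaging step as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical HET-style frame: six leading columns, then (prediction, std) pairs.
columns = ["c0", "c1", "c2", "c3", "target", "c5",
           "pred_0", "std_0", "pred_1", "std_1"]
df = pd.DataFrame([[0, 0, 0, 0, 1.0, 0, 0.8, 0.1, 1.2, 0.3]], columns=columns)

pred_cols = df.iloc[:, 6::2]  # col_pred_start=6, step 2
std_cols = df.iloc[:, 7::2]   # col_std_pred_start=7, step 2

Ypred = pred_cols.values.mean(axis=1)
# Assumption: the std columns are plain standard deviations and are averaged.
sigma = std_cols.values.mean(axis=1)
```
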

exarl.candlelib.candle_keras.compute_statistics_quantile(df_data, sigma_divisor=2.56, col_true=4, col_pred_start=6)

Extracts ground truth, 50th percentile mean prediction, low percentile and high percentile mean prediction (usually 10th percentile and 90th percentile respectively), error (using 50th percentile), standard deviation of prediction (using 50th percentile) and predicted (learned) standard deviation from interdecile range in inference data frame. The latter includes all the individual inference realizations.

Parameters
  • df_data (pandas data frame) – Data frame generated by current quantile inference experiments. Indices are hard coded to agree with current version. (The inference file usually has the name: <model>.predicted_INFER_QTL.tsv).

  • sigma_divisor (float) – Divisor to convert from the interdecile range to the corresponding standard deviation for a Gaussian distribution. (Default: 2.56, consistent with an interdecile range computed from the difference between the 90th and 10th percentiles).

  • col_true (integer) – Index of the column in the data frame where the true value is stored (Default: 4, index in current QTL format).

  • col_pred_start (integer) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored and are interleaved with other percentile predictions (Default: 6, with step 3, index in current QTL format).

Returns

  • Ytrue (numpy array) – Array with true (observed) values

  • Ypred (numpy array) – Array with predicted values (based on the 50th percentile).

  • yerror (numpy array) – Array with errors computed (observed - predicted).

  • sigma (numpy array) – Array with standard deviations learned with deep learning model. This corresponds to the interdecile range divided by the sigma divisor.

  • Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.

  • pred_name (string) – Name of data column or quantity predicted (as extracted from the data frame using the col_true index).

  • Ypred_Lp_mean (numpy array) – Array with predicted values of the lower percentile (usually the 10th percentile).

  • Ypred_Hp_mean (numpy array) – Array with predicted values of the higher percentile (usually the 90th percentile).
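
The default sigma_divisor of 2.56 follows from the Gaussian quantile function: the 90th percentile lies about 1.28 standard deviations above the mean, so the interdecile range spans about 2.56 standard deviations. A short numerical check, using synthetic samples that are purely illustrative:

```python
import numpy as np
from scipy.stats import norm

# Width of the interdecile range of a standard normal, in standard deviations.
divisor = norm.ppf(0.9) - norm.ppf(0.1)

# Recover the standard deviation of synthetic Gaussian samples from p10 and p90.
rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=2.0, size=200_000)
p10, p90 = np.percentile(samples, [10, 90])
sigma_from_idr = (p90 - p10) / 2.56  # should be close to the true scale of 2.0
```
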

exarl.candlelib.candle_keras.split_data_for_empirical_calibration(Ytrue, Ypred, sigma, cal_split=0.8)

Extracts a portion of the arrays provided for the computation of the calibration and reserves the remainder portion for testing.

Parameters
  • Ytrue (numpy array) – Array with true (observed) values

  • Ypred (numpy array) – Array with predicted values.

  • sigma (numpy array) – Array with standard deviations learned with deep learning model (or std value computed from prediction if homoscedastic inference).

  • cal_split (float) – Split of data to use for estimating the calibration relationship. It is assumed that it will be a value in (0, 1). (Default: use 80% of predictions to generate empirical calibration).

Returns

  • index_perm_total (numpy array) – Random permutation of the array indices. The first ‘num_cal’ of the indices correspond to the samples that are used for calibration, while the remainder are the samples reserved for calibration testing.

  • pSigma_cal (numpy array) – Part of the input sigma array to use for calibration.

  • pSigma_test (numpy array) – Part of the input sigma array to reserve for testing.

  • pPred_cal (numpy array) – Part of the input Ypred array to use for calibration.

  • pPred_test (numpy array) – Part of the input Ypred array to reserve for testing.

  • true_cal (numpy array) – Part of the input Ytrue array to use for calibration.

  • true_test (numpy array) – Part of the input Ytrue array to reserve for testing.
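
A minimal sketch of a permutation-based split that produces outputs in the documented order. The helper name sketch_split and the seed argument are hypothetical additions for reproducibility; the library function does not take a seed.

```python
import numpy as np


def sketch_split(Ytrue, Ypred, sigma, cal_split=0.8, seed=0):
    """Illustrative split; `seed` is a hypothetical argument."""
    num_cal = int(len(Ytrue) * cal_split)
    index_perm_total = np.random.default_rng(seed).permutation(len(Ytrue))
    cal = index_perm_total[:num_cal]    # samples used for calibration
    test = index_perm_total[num_cal:]   # samples reserved for testing
    return (index_perm_total,
            sigma[cal], sigma[test],
            Ypred[cal], Ypred[test],
            Ytrue[cal], Ytrue[test])


y = np.arange(10.0)
outputs = sketch_split(y, y + 0.1, np.full(10, 0.5))
index_perm_total, pSigma_cal, pSigma_test = outputs[:3]
```
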

exarl.candlelib.candle_keras.compute_empirical_calibration(pSigma_cal, pPred_cal, true_cal, bins, coverage_percentile)

Use the arrays provided to estimate an empirical mapping between standard deviation and absolute value of error, both of which have been observed during inference. Since the raw statistics per bin are usually very noisy, a smoothing step (based on scipy’s savgol filter) is performed.

Parameters
  • pSigma_cal (numpy array) – Part of the standard deviations array to use for calibration.

  • pPred_cal (numpy array) – Part of the predictions array to use for calibration.

  • true_cal (numpy array) – Part of the true (observed) values array to use for calibration.

  • bins (int) – Number of bins to split the range of standard deviations included in pSigma_cal array.

  • coverage_percentile (float) – Value to use for estimating coverage when evaluating the percentiles of the observed absolute value of errors.

Returns

  • mean_sigma (numpy array) – Array with the mean standard deviations computed per bin.

  • min_sigma (numpy array) – Array with the minimum standard deviations computed per bin.

  • max_sigma (numpy array) – Array with the maximum standard deviations computed per bin.

  • error_thresholds (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin.

  • err_err (numpy array) – Error bars in errors (one standard deviation for a binomial distribution estimated by bin vs. the other bins) for the calibration error.

  • error_thresholds_smooth (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin after a smoothing operation is applied to the frequently noisy bin-based estimations.

  • sigma_start_index (non-negative integer) – Index in the mean_sigma array that defines the start of the valid empirical calibration interval (i.e. index to the smallest std for which a meaningful error mapping is obtained).

  • sigma_end_index (non-negative integer) – Index in the mean_sigma array that defines the end of the valid empirical calibration interval (i.e. index to the largest std for which a meaningful error mapping is obtained).

  • s_interpolate (scipy.interpolate python object) – A python object from scipy.interpolate that computes a univariate spline (InterpolatedUnivariateSpline) constructed to express the mapping from standard deviation to error. This spline is generated during the computational empirical calibration procedure.
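
The smoothing and spline steps can be sketched with scipy as follows. The per-bin values are synthetic, and window_length=5 and polyorder=1 are illustrative choices, not the settings used by the library.

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline
from scipy.signal import savgol_filter

# Synthetic per-bin statistics: errors grow with sigma, plus noise.
mean_sigma = np.linspace(0.1, 1.0, 10)
rng = np.random.default_rng(0)
error_thresholds = 2.0 * mean_sigma + rng.normal(scale=0.05, size=mean_sigma.size)

# Smooth the noisy thresholds (filter settings are assumptions).
error_thresholds_smooth = savgol_filter(error_thresholds, window_length=5, polyorder=1)

# Interpolating spline mapping standard deviation to error threshold.
s_interpolate = InterpolatedUnivariateSpline(mean_sigma, error_thresholds_smooth)
predicted_error = float(s_interpolate(0.55))
```
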

exarl.candlelib.candle_keras.bining_for_calibration(pSigma_cal_ordered_, minL_sigma, maxL_sigma, Er_vect_cal_orderedSigma_, bins, coverage_percentile)

Bin the values of the standard deviations observed during inference and estimate a specified coverage percentile in the absolute error (observed during inference as well). Bins that have fewer than 50 samples are merged until they surpass this threshold.

Parameters
  • pSigma_cal_ordered_ (numpy array) – Array of standard deviations sorted in ascending order.

  • minL_sigma (float) – Minimum value of standard deviations included in pSigma_cal_ordered_ array.

  • maxL_sigma (numpy array) – Maximum value of standard deviations included in pSigma_cal_ordered_ array.

  • Er_vect_cal_orderedSigma_ (numpy array) – Array of absolute values of errors corresponding to the array of ordered standard deviations.

  • bins (int) – Number of bins to split the range of standard deviations included in pSigma_cal_ordered_ array.

  • coverage_percentile (float) – Value to use for estimating coverage when evaluating the percentiles of the observed absolute value of errors.

Returns

  • mean_sigma (numpy array) – Array with the mean standard deviations computed per bin.

  • min_sigma (numpy array) – Array with the minimum standard deviations computed per bin.

  • max_sigma (numpy array) – Array with the maximum standard deviations computed per bin.

  • error_thresholds (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin.

  • err_err (numpy array) – Error bars in errors (one standard deviation for a binomial distribution estimated by bin vs. the other bins) for the calibration error.
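
A simplified sketch of the per-bin statistics. It assumes equal-width bins and a coverage_percentile expressed on a 0 to 100 scale, and it skips empty bins instead of merging bins with fewer than 50 samples; the err_err output is omitted. The helper name sketch_binning is hypothetical.

```python
import numpy as np


def sketch_binning(sigma_sorted, abs_err_sorted, bins, coverage_percentile):
    """Illustrative per-bin statistics; no merging of small bins."""
    edges = np.linspace(sigma_sorted[0], sigma_sorted[-1], bins + 1)
    # Assign each sample to a bin; clip so the maximum falls in the last bin.
    which = np.clip(np.digitize(sigma_sorted, edges) - 1, 0, bins - 1)
    mean_sigma, min_sigma, max_sigma, error_thresholds = [], [], [], []
    for b in range(bins):
        in_bin = which == b
        if not in_bin.any():
            continue  # simplification: empty bins are dropped
        s = sigma_sorted[in_bin]
        mean_sigma.append(s.mean())
        min_sigma.append(s.min())
        max_sigma.append(s.max())
        error_thresholds.append(
            np.percentile(abs_err_sorted[in_bin], coverage_percentile))
    return (np.array(mean_sigma), np.array(min_sigma),
            np.array(max_sigma), np.array(error_thresholds))


sigma = np.linspace(0.1, 1.0, 100)
mean_sigma, min_sigma, max_sigma, error_thresholds = sketch_binning(
    sigma, 2.0 * sigma, bins=4, coverage_percentile=95)
```
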

exarl.candlelib.candle_keras.computation_of_valid_calibration_interval(error_thresholds, error_thresholds_smooth, err_err)

Function that estimates the empirical range in which a monotonic relation is observed between standard deviation and coverage of absolute value of error. Since the statistics computed per bin are relatively noisy, the application of a greedy criterion (e.g. guarantee a monotonically increasing relationship) does not yield good results. Therefore, a softer version is constructed based on the satisfaction of certain criteria depending on: the values of the error coverage computed per bin, a smoothed version of them and the associated error estimated (based on one standard deviation for a binomial distribution estimated by bin vs. the other bins). A minimal validation requiring the end index to be larger than the starting index is performed before the function returns.

Current criteria (any one of the following is sufficient):

  • the smoothed errors are inside the error bars AND they are almost increasing (a small tolerance is allowed, so slight wobbliness in the smoothed values is permitted); OR

  • both the raw values for the bins (with a small tolerance) are increasing AND the smoothed value is greater than the raw value; OR

  • the current smoothed value is greater than the previous one AND the smoothed values for the next bin are inside the error bars.

Parameters
  • error_thresholds (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin.

  • error_thresholds_smooth (numpy array) – Thresholds of the errors computed to attain a certain error coverage per bin after a smoothing operation is applied to the frequently noisy bin-based estimations.

  • err_err (numpy array) – Error bars in errors (one standard deviation for a binomial distribution estimated by bin vs. the other bins) for the calibration error.

Returns

  • sigma_start_index (non-negative integer) – Index estimated in the mean_sigma array corresponding to the value that defines the start of the valid empirical calibration interval (i.e. index to the smallest std for which a meaningful error mapping is obtained, according to the criteria explained before).

  • sigma_end_index (non-negative integer) – Index estimated in the mean_sigma array corresponding to the value that defines the end of the valid empirical calibration interval (i.e. index to the largest std for which a meaningful error mapping is obtained, according to the criteria explained before).

exarl.candlelib.candle_keras.applying_calibration(pSigma_test, pPred_test, true_test, s_interpolate, minL_sigma_auto, maxL_sigma_auto)

Use the empirical mapping between standard deviation and absolute value of error estimated during calibration (i.e. apply the univariate spline computed) to estimate the error for the part of the standard deviation array that was reserved for testing the empirical calibration. The resulting error array (yp_test) should overestimate the true observed error (eabs_red). All the computations are restricted to the valid calibration interval: [minL_sigma_auto, maxL_sigma_auto].

Parameters
  • pSigma_test (numpy array) – Part of the standard deviations array to use for calibration testing.

  • pPred_test (numpy array) – Part of the predictions array to use for calibration testing.

  • true_test (numpy array) – Part of the true (observed) values array to use for calibration testing.

  • s_interpolate (scipy.interpolate python object) – A python object from scipy.interpolate that computes a univariate spline (InterpolatedUnivariateSpline) expressing the mapping from standard deviation to error. This spline is generated during the computational empirical calibration procedure.

  • minL_sigma_auto (float) – Starting value of the valid empirical calibration interval (i.e. smallest std for which a meaningful error mapping is obtained).

  • maxL_sigma_auto (float) – Ending value of the valid empirical calibration interval (i.e. largest std for which a meaningful error mapping is obtained).

Returns

  • index_sigma_range_test (numpy array) – Indices of the pSigma_test array that are included in the valid calibration interval, given by: [minL_sigma_auto, maxL_sigma_auto].

  • xp_test (numpy array) – Array with the mean standard deviations in the calibration testing array.

  • yp_test (numpy array) – Mapping of the given standard deviation to error computed from the interpolation spline constructed by empirical calibration.

  • eabs_red (numpy array) – Array with the observed absolute errors in the part of the testing array for which the observed standard deviations are in the valid interval of calibration.
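
A minimal sketch of applying a calibration spline to the test portion. The helper name sketch_apply is hypothetical, the interval is treated as closed on both ends, and integer indices are returned; the library may represent the selection differently (for example as a boolean mask).

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline


def sketch_apply(pSigma_test, pPred_test, true_test, s_interpolate, min_s, max_s):
    """Illustrative only: restrict to [min_s, max_s] and evaluate the spline."""
    in_range = (pSigma_test >= min_s) & (pSigma_test <= max_s)
    index_sigma_range_test = np.flatnonzero(in_range)
    xp_test = pSigma_test[index_sigma_range_test]
    yp_test = s_interpolate(xp_test)  # error estimated from sigma
    eabs_red = np.abs(true_test[index_sigma_range_test]
                      - pPred_test[index_sigma_range_test])
    return index_sigma_range_test, xp_test, yp_test, eabs_red


# Toy spline: estimated error is twice the standard deviation.
knots = np.linspace(0.1, 1.0, 5)
spline = InterpolatedUnivariateSpline(knots, 2.0 * knots)

idx, xp_test, yp_test, eabs_red = sketch_apply(
    np.array([0.05, 0.2, 0.5, 1.5]),   # pSigma_test
    np.array([1.0, 1.0, 1.0, 1.0]),    # pPred_test
    np.array([1.1, 1.3, 0.5, 2.0]),    # true_test
    spline, 0.1, 1.0)
```
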

exarl.candlelib.candle_keras.overprediction_check(yp_test, eabs_red)

Compute the percentage of overestimated absolute error predictions for the arrays reserved for calibration testing and whose corresponding standard deviations are included in the valid calibration interval.

Parameters
  • yp_test (numpy array) – Mapping of the standard deviation to error computed from the interpolation spline constructed by empirical calibration.

  • eabs_red (numpy array) – Array with the observed absolute errors in the part of the testing array for which the observed standard deviations are in the valid interval of calibration.
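
The check amounts to counting how often the calibrated estimate is at least as large as the observed absolute error. A sketch, assuming the result is expressed on a 0 to 100 scale and that ties count as overestimates; the helper name is hypothetical.

```python
import numpy as np


def sketch_overprediction_percentage(yp_test, eabs_red):
    """Illustrative: share of samples where the estimate covers the error."""
    over = yp_test >= eabs_red
    return 100.0 * over.sum() / over.size


pct = sketch_overprediction_percentage(
    np.array([1.0, 2.0, 3.0, 4.0]),   # estimated errors
    np.array([0.5, 2.5, 1.0, 5.0]))   # observed absolute errors
```
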

exarl.candlelib.candle_keras.build_initializer(type, kerasDefaults, seed=None, constant=0.0)

Set the initializer to the appropriate Keras initializer function based on the input string. Other required values are set to the Keras default values.

Parameters
  • type (string) –

    String to choose the initializer

    Options recognized: ‘constant’, ‘uniform’, ‘normal’, ‘glorot_uniform’, ‘lecun_uniform’, ‘he_normal’

    See the Keras documentation for a full description of the options

  • kerasDefaults (list) – List of default parameter values to ensure consistency between frameworks

  • seed (integer) – Random number seed

  • constant (float) – Constant value (for the constant initializer only)

Returns

The appropriate Keras initializer function

exarl.candlelib.candle_keras.build_optimizer(type, lr, kerasDefaults)

Set the optimizer to the appropriate Keras optimizer function based on the input string and learning rate. Other required values are set to the Keras default values

Parameters
  • type (string) –

    String to choose the optimizer

    Options recognized: ‘sgd’, ‘rmsprop’, ‘adagrad’, ‘adadelta’, ‘adam’

    See the Keras documentation for a full description of the options

  • lr (float) – Learning rate

  • kerasDefaults (list) – List of default parameter values to ensure consistency between frameworks

Returns

The appropriate Keras optimizer function

exarl.candlelib.candle_keras.set_seed(seed)

Set the random number seed to the desired value

Parameters

seed (integer) – Random number seed.
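
A framework-free sketch of what seeding typically involves. It covers only the Python and NumPy generators; a Keras-backed version would presumably also seed TensorFlow, which is omitted here. The helper name sketch_set_seed is hypothetical.

```python
import random

import numpy as np


def sketch_set_seed(seed):
    """Illustrative: seed the Python and NumPy global generators."""
    random.seed(seed)
    np.random.seed(seed)


sketch_set_seed(42)
first = (random.random(), np.random.rand())
sketch_set_seed(42)
second = (random.random(), np.random.rand())  # same draws as `first`
```
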

exarl.candlelib.candle_keras.set_parallelism_threads()

Set the number of parallel threads according to the number available on the hardware

class exarl.candlelib.candle_keras.PermanentDropout(rate, **kwargs)

Bases: tensorflow.keras.layers.Dropout

call(self, x, mask=None)

exarl.candlelib.candle_keras.register_permanent_dropout()

class exarl.candlelib.candle_keras.LoggingCallback(print_fcn=print)

Bases: tensorflow.keras.callbacks.Callback

on_epoch_end(self, epoch, logs={})

exarl.candlelib.candle_keras.r2(y_true, y_pred)

exarl.candlelib.candle_keras.mae(y_true, y_pred)

exarl.candlelib.candle_keras.mse(y_true, y_pred)

class exarl.candlelib.candle_keras.CandleRemoteMonitor(params=None)

Bases: tensorflow.keras.callbacks.Callback

Capture Run level output and store/send for monitoring

on_train_begin(self, logs=None)
on_epoch_begin(self, epoch, logs=None)
on_epoch_end(self, epoch, logs=None)
on_train_end(self, logs=None)
submit(self, send)

Send json to solr

Parameters

send (json object) – Object to send

save(self)

Save log_messages to file

exarl.candlelib.candle_keras.compute_trainable_params(model)

Extract number of parameters from the given Keras model

Parameters

model (Keras model) –

Returns

python dictionary that contains trainable_params, non_trainable_params and total_params

class exarl.candlelib.candle_keras.TerminateOnTimeOut(timeout_in_sec=10)

Bases: tensorflow.keras.callbacks.Callback

This class implements timeout on model training. When the script reaches timeout, this class sets model.stop_training = True

Initialize TerminateOnTimeOut class.

Parameters

timeout_in_sec (int) – seconds to timeout

on_train_begin(self, logs={})

Start clock to calculate timeout

on_epoch_end(self, epoch, logs={})

On every epoch end, check whether it exceeded timeout and terminate training if necessary
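
The timeout logic can be sketched without Keras using a stand-in class that exposes the same two hooks. TimeoutSketch and the SimpleNamespace model are hypothetical; a real callback would subclass tensorflow.keras.callbacks.Callback, and Keras would set the model attribute itself.

```python
import time
from types import SimpleNamespace


class TimeoutSketch:
    """Stand-in for a Keras callback; not the library class."""

    def __init__(self, timeout_in_sec=10):
        self.timeout_in_sec = timeout_in_sec
        self.start_time = None
        self.model = None  # Keras assigns this on real callbacks

    def on_train_begin(self, logs=None):
        # Start the clock when training begins.
        self.start_time = time.monotonic()

    def on_epoch_end(self, epoch, logs=None):
        # Request a stop once the elapsed time reaches the timeout.
        if time.monotonic() - self.start_time >= self.timeout_in_sec:
            self.model.stop_training = True


expired = TimeoutSketch(timeout_in_sec=0)
expired.model = SimpleNamespace(stop_training=False)
expired.on_train_begin()
expired.on_epoch_end(epoch=0)

pending = TimeoutSketch(timeout_in_sec=3600)
pending.model = SimpleNamespace(stop_training=False)
pending.on_train_begin()
pending.on_epoch_end(epoch=0)
```

Because the check runs only at epoch boundaries, training stops at the end of the epoch in which the timeout is reached, not at the exact deadline.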