smx.predicates.bagging#

PredicateBagger: bootstrap/subsample predicates across multiple bags for robust metric estimation.

Classes#

PredicateBagger

Perform predicate bagging with granular control over sampling strategy.

Module Contents#

class smx.predicates.bagging.PredicateBagger(random_seed, n_bags: int = 10, n_predicates_per_bag: int = 20, n_samples_fraction: float = 0.8, replace: bool = False, sample_bagging: bool = True, predicate_bagging: bool = False)[source]#

Perform predicate bagging with granular control over sampling strategy.

Bagging creates repeated random subsets of samples and/or predicates, evaluating each predicate on the subset. This yields a distribution of predicate coverage that is used downstream to compute robust association metrics (see smx.predicates.metrics).
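The sample-bagging idea described above can be sketched in a few lines. This is an illustrative sketch using only the standard library, not the PredicateBagger implementation: the helper names `make_sample_bags` and `coverage`, the per-bag seed derivation, and the toy data are all assumptions made for the example.

```python
import random


def make_sample_bags(n_samples, n_bags=10, fraction=0.8, replace=False, seed=42):
    """Return one sorted list of sample indices per bag (illustrative only)."""
    k = max(1, int(round(fraction * n_samples)))
    bags = []
    for b in range(n_bags):
        # Assumption: each bag gets its own seed derived from the base seed.
        rng = random.Random(seed + b)
        if replace:
            # Bootstrap: the same index may be drawn more than once.
            idx = [rng.randrange(n_samples) for _ in range(k)]
        else:
            # Subsample: every index appears at most once.
            idx = rng.sample(range(n_samples), k)
        bags.append(sorted(idx))
    return bags


def coverage(values, threshold, idx):
    """Fraction of the bagged samples whose value exceeds the threshold."""
    return sum(values[i] > threshold for i in idx) / len(idx)


# Toy data standing in for one zone-score column (hypothetical values).
values = [0.1 * i for i in range(50)]
bags = make_sample_bags(len(values), n_bags=5)

# One coverage value per bag: the distribution that bagging produces.
coverages = [coverage(values, 2.0, idx) for idx in bags]
```

Evaluating the same predicate on every bag gives a list of coverage values rather than a single number, and the spread of that list is what makes downstream estimates more robust.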

Parameters#

n_bags : int, default 10

Number of bags (iterations) to create.

n_predicates_per_bag : int, default 20

Number of predicates to draw per bag (ignored when predicate_bagging=False).

n_samples_fraction : float, default 0.8

Fraction of samples to draw per bag (ignored when sample_bagging=False). The minimum number of samples per predicate is hardcoded to 20% of the dataset.

replace : bool, default False

Whether to sample with replacement (bootstrap). Ignored when sample_bagging=False.

random_seed : int

Base random seed for reproducibility.

sample_bagging : bool, default True

If False, all samples are used in every bag.

predicate_bagging : bool, default False

If False, all predicates are used in every bag.
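The effect of the predicate_bagging switch can be illustrated with a small standalone sketch. It is not taken from the library: `draw_predicates` is a hypothetical helper, the rule names are placeholders, and capping the draw at the catalogue size is an assumption about how an undersized catalogue might be handled.

```python
import random


def draw_predicates(rules, n_per_bag=20, predicate_bagging=False, seed=42):
    """Pick the predicates evaluated in one bag (illustrative only)."""
    if not predicate_bagging:
        # Switch off: every predicate is used and n_per_bag is ignored.
        return list(rules)
    rng = random.Random(seed)
    # Assumption: never request more predicates than the catalogue holds.
    k = min(n_per_bag, len(rules))
    return rng.sample(list(rules), k)


rules = [f"rule_{i}" for i in range(30)]  # placeholder rule names

all_rules = draw_predicates(rules)
subset = draw_predicates(rules, n_per_bag=20, predicate_bagging=True)
```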

n_bags = 10#
n_predicates_per_bag = 20#
n_samples_fraction = 0.8#
replace = False#
random_seed#
sample_bagging = True#
predicate_bagging = False#
run(zone_scores_df: pandas.DataFrame, y_predicted_numeric: pandas.Series | numpy.ndarray, predicates_df: pandas.DataFrame) → Dict[str, Dict[str, pandas.DataFrame]][source]#

Create bags by sampling samples and/or predicates.

Parameters#

zone_scores_df : pd.DataFrame

Aggregated zone scores (samples × zones).

y_predicted_numeric : pd.Series or np.ndarray

Continuous model predictions aligned with zone_scores_df.

predicates_df : pd.DataFrame

Predicate catalogue with columns rule, zone, thresholds, operator.

Returns#

dict

{'Bag_1': {rule: DataFrame(['Zone_Sum', 'Predicted_Y', 'Sample_Index']), ...}, 'Bag_2': ...}
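The nested return value can be flattened into one summary table for inspection. The sketch below builds a mock dictionary with the documented shape instead of calling run(); the rule string, the numbers, and the summary column names are invented for illustration, and what the rows of each per-rule frame represent is not stated in the source.

```python
import pandas as pd

# Mock result with the documented shape: bag name -> rule -> DataFrame.
# The rule string and all values are made up for this example.
bags = {
    "Bag_1": {
        "Zone_A > 0.5": pd.DataFrame(
            {"Zone_Sum": [0.7, 0.9], "Predicted_Y": [1.2, 1.5], "Sample_Index": [0, 3]}
        ),
    },
    "Bag_2": {
        "Zone_A > 0.5": pd.DataFrame(
            {"Zone_Sum": [0.6], "Predicted_Y": [1.1], "Sample_Index": [2]}
        ),
    },
}

# Walk both dictionary levels and record one summary row per (bag, rule).
rows = [
    {
        "bag": bag_name,
        "rule": rule,
        "n_rows": len(df),
        "mean_predicted_y": df["Predicted_Y"].mean(),
    }
    for bag_name, per_rule in bags.items()
    for rule, df in per_rule.items()
]
summary = pd.DataFrame(rows)
```

Grouping `summary` by `rule` then gives per-rule statistics across bags, for example with `summary.groupby("rule")["n_rows"].describe()`.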