smx.predicates.bagging#

PredicateBagger: bootstrap or subsample samples and/or predicates across multiple bags for robust metric estimation.

Classes#

PredicateBagger

Perform predicate bagging with granular control over sampling strategy.

Module Contents#

class smx.predicates.bagging.PredicateBagger(random_seed, n_bags: int = 10, n_predicates_per_bag: int = 20, n_samples_fraction: float = 0.8, replace: bool = False, sample_bagging: bool = True, predicate_bagging: bool = False)[source]#

Perform predicate bagging with granular control over sampling strategy.

Bagging creates repeated random subsets of samples and/or predicates, evaluating each predicate on the subset. This yields a distribution of predicate coverage that is used downstream to compute robust association metrics (see smx.predicates.metrics).
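The sampling idea described above can be illustrated with a minimal sketch. It is not the library's implementation: `draw_bags` is a hypothetical helper, the rounding of the bag size is an assumption, and the threshold predicate `scores > 0.5` is made up for illustration.

```python
import numpy as np


def draw_bags(n_samples, n_bags=10, n_samples_fraction=0.8,
              replace=False, sample_bagging=True, random_seed=0):
    """Return one array of sample positions per bag (hypothetical helper)."""
    rng = np.random.default_rng(random_seed)
    if not sample_bagging:
        # Every bag sees the full dataset.
        return [np.arange(n_samples) for _ in range(n_bags)]
    # Assumption: bag size is the rounded fraction of the dataset.
    bag_size = max(1, int(round(n_samples * n_samples_fraction)))
    return [rng.choice(n_samples, size=bag_size, replace=replace)
            for _ in range(n_bags)]


# Stand-in for one zone's scores; the predicate "score > 0.5" is illustrative.
scores = np.linspace(0.0, 1.0, 100)
bags = draw_bags(n_samples=len(scores), n_bags=5)

# Fraction of each bag's samples that satisfy the predicate: one value per bag,
# which together form the coverage distribution.
coverage = [float(np.mean(scores[idx] > 0.5)) for idx in bags]
```

With `replace=False` each bag is a subsample without duplicates; with `replace=True` the same draw becomes a bootstrap in which a sample may appear more than once.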

Parameters#

n_bags (int, default 10): Number of bags (iterations) to create.
n_predicates_per_bag (int, default 20): Number of predicates to draw per bag (ignored when predicate_bagging=False, which is the default).
n_samples_fraction (float, default 0.8): Fraction of samples to draw per bag (ignored when sample_bagging=False). The minimum number of samples per predicate is hardcoded to 20% of the dataset.
replace (bool, default False): Whether to sample with replacement (bootstrap). Ignored when sample_bagging=False.
random_seed (int, required): Base random seed for reproducibility.
sample_bagging (bool, default True): If False, all samples are used in every bag.
predicate_bagging (bool, default False): If False, all predicates are used in every bag.
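How the predicate_bagging flag gates n_predicates_per_bag can be sketched as follows. This is an illustration under assumptions, not the library's code: `select_predicates` is a hypothetical helper, the rule names are invented, and capping the draw at the number of available rules is an assumed behaviour.

```python
import numpy as np


def select_predicates(rules, n_predicates_per_bag=20,
                      predicate_bagging=False, rng=None):
    """Pick the rules used in one bag (hypothetical helper)."""
    if not predicate_bagging:
        # Flag off: n_predicates_per_bag is ignored and every rule is kept.
        return list(rules)
    if rng is None:
        rng = np.random.default_rng()
    # Assumption: never ask for more rules than exist.
    k = min(n_predicates_per_bag, len(rules))
    picked = rng.choice(len(rules), size=k, replace=False)
    # Keep the original catalogue order for readability.
    return [rules[i] for i in sorted(picked)]


rules = [f"rule_{i}" for i in range(30)]  # invented rule names
all_rules = select_predicates(rules)
subset = select_predicates(rules, n_predicates_per_bag=20,
                           predicate_bagging=True,
                           rng=np.random.default_rng(0))
```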

n_bags = 10#

n_predicates_per_bag = 20#

n_samples_fraction = 0.8#

replace = False#

random_seed#

sample_bagging = True#

predicate_bagging = False#

run(zone_scores_df: pandas.DataFrame, y_predicted_numeric: pandas.Series | numpy.ndarray, predicates_df: pandas.DataFrame) → Dict[str, Dict[str, pandas.DataFrame]][source]#

Create bags by sampling samples and/or predicates.

Parameters#

zone_scores_df (pd.DataFrame): Aggregated zone scores (samples × zones).
y_predicted_numeric (pd.Series or np.ndarray): Continuous model predictions aligned with zone_scores_df.
predicates_df (pd.DataFrame): Predicate catalogue with columns rule, zone, thresholds, operator.

Returns#

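dict: Nested mapping from bag name to rule to DataFrame, where each DataFrame has the columns Zone_Sum, Predicted_Y and Sample_Index: {'Bag_1': {rule: DataFrame(columns=['Zone_Sum', 'Predicted_Y', 'Sample_Index']), ...}, 'Bag_2': ...}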
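A short sketch of walking a result with this shape may help. The dictionary below is built by hand as a stand-in for what run() returns; the rule strings and all values are invented, and the summary columns are only one possible choice.

```python
import pandas as pd


def frame(zone_sum, predicted_y, sample_index):
    """Build a DataFrame with the three documented columns."""
    return pd.DataFrame({
        "Zone_Sum": zone_sum,
        "Predicted_Y": predicted_y,
        "Sample_Index": sample_index,
    })


# Hand-built stand-in for the return value (values are invented).
result = {
    "Bag_1": {
        "zone_a > 0.5": frame([0.7, 0.9], [1.2, 1.5], [3, 8]),
        "zone_b <= 0.2": frame([0.1], [0.4], [5]),
    },
    "Bag_2": {
        "zone_a > 0.5": frame([0.6, 0.8, 0.9], [1.0, 1.4, 1.6], [1, 3, 9]),
    },
}

# Flatten the nested dict into one row per (bag, rule) pair.
rows = [
    {
        "bag": bag_name,
        "rule": rule,
        "n_rows": len(df),
        "mean_predicted_y": df["Predicted_Y"].mean(),
    }
    for bag_name, per_rule in result.items()
    for rule, df in per_rule.items()
]
summary = pd.DataFrame(rows)

# Spread of the row count for each rule across bags.
per_rule = summary.groupby("rule")["n_rows"].agg(["mean", "min", "max"])
```

Flattening to one row per bag and rule keeps the per-bag variation visible, which is the information a downstream robustness metric would need.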