smx.predicates.bagging#
PredicateBagger: bootstrap/subsample predicates across multiple bags for robust metric estimation.
Classes#
PredicateBagger: Perform predicate bagging with granular control over sampling strategy.
Module Contents#
- class smx.predicates.bagging.PredicateBagger(random_seed, n_bags: int = 10, n_predicates_per_bag: int = 20, n_samples_fraction: float = 0.8, replace: bool = False, sample_bagging: bool = True, predicate_bagging: bool = False)[source]#
Perform predicate bagging with granular control over sampling strategy.
Bagging creates repeated random subsets of samples and/or predicates, evaluating each predicate on the subset. This yields a distribution of predicate coverage that is used downstream to compute robust association metrics (see smx.predicates.metrics).
Parameters#
- n_bags : int, default 10
Number of bags (iterations) to create.
- n_predicates_per_bag : int, default 20
Number of predicates to draw per bag (ignored when predicate_bagging=False).
- n_samples_fraction : float, default 0.8
Fraction of samples to draw per bag (ignored when sample_bagging=False). The minimum number of samples per predicate is hard-coded to 20% of the dataset.
- replace : bool, default False
Whether to sample with replacement (bootstrap). Ignored when sample_bagging=False.
- random_seed : int
Base random seed for reproducibility. Required; the class signature declares no default value.
- sample_bagging : bool, default True
If False, all samples are used in every bag.
- predicate_bagging : bool, default False
If False, all predicates are used in every bag.
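To make the interaction of n_bags, n_samples_fraction, replace, sample_bagging and random_seed concrete, here is a minimal sketch of per-bag sample selection. It is not the library's implementation: the function name draw_bag_indices, the use of the standard-library random module, and the seeding scheme (base seed plus bag number) are all assumptions made for illustration, and the 20% minimum-samples rule is not modelled.

```python
import random


def draw_bag_indices(n_samples, random_seed, n_bags=10,
                     n_samples_fraction=0.8, replace=False,
                     sample_bagging=True):
    """Return one list of sample positions per bag (illustrative only)."""
    all_indices = list(range(n_samples))
    bag_size = max(1, int(n_samples * n_samples_fraction))
    bags = []
    for bag_number in range(n_bags):
        if not sample_bagging:
            # sample_bagging=False: every bag sees every sample.
            bags.append(list(all_indices))
            continue
        # Hypothetical seeding scheme: offset the base seed by the bag number
        # so bags differ from each other but the whole run is reproducible.
        rng = random.Random(random_seed + bag_number)
        if replace:
            # Bootstrap: the same sample may be drawn more than once.
            chosen = rng.choices(all_indices, k=bag_size)
        else:
            # Subsample: every drawn sample is distinct.
            chosen = rng.sample(all_indices, k=bag_size)
        bags.append(chosen)
    return bags


bags = draw_bag_indices(n_samples=100, random_seed=0, n_bags=3)
print([len(bag) for bag in bags])
```

With replace=False each bag holds distinct samples, so the variation between bags comes only from which samples are left out. With replace=True duplicates are possible and each bag is a bootstrap resample.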
- n_bags = 10#
- n_predicates_per_bag = 20#
- n_samples_fraction = 0.8#
- replace = False#
- random_seed#
- sample_bagging = True#
- predicate_bagging = False#
- run(zone_scores_df: pandas.DataFrame, y_predicted_numeric: pandas.Series | numpy.ndarray, predicates_df: pandas.DataFrame) → Dict[str, Dict[str, pandas.DataFrame]][source]#
Create bags by drawing random subsets of samples and/or predicates.
Parameters#
- zone_scores_df : pd.DataFrame
Aggregated zone scores (samples × zones).
- y_predicted_numeric : pd.Series or np.ndarray
Continuous model predictions aligned with zone_scores_df.
- predicates_df : pd.DataFrame
Predicate catalogue with columns rule, zone, thresholds, operator.
Returns#
- dict
{'Bag_1': {rule: DataFrame(['Zone_Sum', 'Predicted_Y', 'Sample_Index']), ...}, 'Bag_2': ...}
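The nested return value can be walked with two loops, one over bags and one over rules. The sketch below collects the number of rows each rule has in every bag. The helper rows_per_rule and the rule strings are hypothetical, and plain lists of row dicts stand in for the DataFrames so that the snippet runs without pandas; since only len() is used, the same helper should work on real frames. Whether the row count equals predicate coverage depends on what run() stores in each frame, which the reference above does not state.

```python
from statistics import mean


def rows_per_rule(bags):
    """Map each rule to the list of its row counts, one entry per bag."""
    counts = {}
    for frames_by_rule in bags.values():
        for rule, frame in frames_by_rule.items():
            # len() works for both a pandas DataFrame and the list stand-in.
            counts.setdefault(rule, []).append(len(frame))
    return counts


# Stand-in data shaped like the documented return value (hypothetical rules).
row = {"Zone_Sum": 1.0, "Predicted_Y": 0.5, "Sample_Index": 0}
result = {
    "Bag_1": {"zone_a > 0.5": [row] * 12, "zone_b <= 0.1": [row] * 7},
    "Bag_2": {"zone_a > 0.5": [row] * 10, "zone_b <= 0.1": [row] * 9},
}

counts = rows_per_rule(result)
average_rows = {rule: mean(values) for rule, values in counts.items()}
print(average_rows)
```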