smx.pipeline#
SMX: high-level facade for the full SMX explanation pipeline.
This class internalises the seed-loop orchestration that every caller would otherwise have to rewrite manually (zone extraction → predicate generation → bagging → metric → graph → LRC → natural-scale mapping across N seeds).
Individual component classes (ZoneAggregator, PredicateGenerator, etc.)
remain available for power users who need fine-grained control.
Attributes#
Classes#
SMX | Full SMX explanation pipeline as a single fit/transform object.
Module Contents#
- smx.pipeline.logger#
- smx.pipeline.SpectralCuts#
- class smx.pipeline.SMX(spectral_cuts: SpectralCuts, quantiles: List[float], n_repetitions: int = 4, n_bags: int = 10, n_samples_fraction: float = 0.8, replace: bool = False, metric: Literal['covariance', 'perturbation'] = 'perturbation', estimator: Any | None = None, perturbation_mode: str = 'median', perturbation_value: float = 0, perturbation_metric: str = 'probability_shift', perturbation_stats_source: str = 'full', normalize_by_zone_size: bool = True, zone_size_exponent: float = 1.0, covariance_threshold: float = 0.01, var_exp: bool = True, show_graph_details: bool = False, class_threshold: float = 0.5)[source]#
Full SMX explanation pipeline as a single fit/transform object.
Runs zone extraction → PCA aggregation → predicate generation → seed-loop (bagging → metric → graph → LRC) → cross-seed aggregation → natural-scale threshold mapping.
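The bagging stage of the seed loop can be pictured with the short sketch below. It is not the library's implementation: `draw_bags` is a hypothetical helper, and the use of one `random.Random` per seed and the rounding of the bag size are assumptions. It only mirrors the documented parameters `n_repetitions`, `n_bags`, `n_samples_fraction`, and `replace`.

```python
import random


def draw_bags(n_samples, n_repetitions=4, n_bags=10,
              n_samples_fraction=0.8, replace=False):
    """Return {seed: [bag, ...]} where each bag is a list of sample indices.

    Hypothetical helper: illustrates the documented parameters only.
    """
    # Assumption: the bag size is the rounded fraction of the calibration set.
    bag_size = max(1, round(n_samples * n_samples_fraction))
    bags_by_seed = {}
    for seed in range(n_repetitions):      # seeds are 0 .. n_repetitions-1
        rng = random.Random(seed)          # one independent generator per seed
        bags = []
        for _ in range(n_bags):
            if replace:
                # Sampling with replacement: indices may repeat within a bag.
                bag = [rng.randrange(n_samples) for _ in range(bag_size)]
            else:
                # Sampling without replacement: every index is unique.
                bag = rng.sample(range(n_samples), bag_size)
            bags.append(bag)
        bags_by_seed[seed] = bags
    return bags_by_seed


bags = draw_bags(n_samples=50)
```

With the defaults this yields 4 seeds of 10 bags each, and every bag holds 40 of the 50 sample indices. Re-running with the same arguments reproduces the same bags because each seed fixes its own generator.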
Parameters#
- spectral_cuts : list of (name, start, end) tuples
Zone definitions, e.g. [("Low", 1.0, 4.0), ("High", 4.0, 10.0)].
- quantiles : list of float
Quantile fractions for predicate generation, e.g. [0.25, 0.5, 0.75].
- n_repetitions : int, default 4
Number of independent bagging repetitions. Seeds are generated as [0, 1, …, n_repetitions-1].
- n_bags : int, default 10
Number of bags per seed.
- n_samples_fraction : float, default 0.8
Fraction of calibration samples drawn per bag. The minimum number of samples per predicate is hardcoded to 20 % of the dataset.
- replace : bool, default False
Whether to sample bags with replacement.
- metric : {‘covariance’, ‘perturbation’}, default ‘perturbation’
Importance metric to use.
- estimator : sklearn-compatible estimator, optional
Trained model required when metric='perturbation'.
- perturbation_mode : str, default ‘median’
Replacement strategy for perturbation ('constant', 'mean', 'median', 'min', 'max').
- perturbation_value : float, default 0
Constant replacement value used when perturbation_mode='constant'.
- perturbation_metric : str, default ‘probability_shift’
Perturbation importance measure. Determines how the impact of spectral zone perturbation is quantified. The choice depends on the estimator type and the desired sensitivity.
Classification estimators (with predict_proba):
  - 'probability_shift': Mean total variation distance between pre- and post-perturbation class probabilities. Sensitive to confidence changes across all classes. Requires predict_proba.
  - 'accuracy_drop': Drop in accuracy when perturbed predictions are compared to original predictions.
  - 'f1_drop': Weighted F1-score decrease after perturbation.
  - 'decision_function_shift': Mean absolute change in decision function values (e.g. signed distances from the hyperplane for SVC). Requires decision_function().
Regression estimators (with predict returning continuous values):
  - 'mean_abs_diff': Mean absolute difference between original and perturbed predictions.
  - 'mean_diff': Mean signed difference (bias direction). Positive values indicate that perturbation increases predictions, negative values that it decreases them.
  - 'mean_relative_dev': Mean relative deviation, normalized by original prediction magnitude. Treats zero predictions as NaN.
- normalize_by_zone_size : bool, default True
Divide raw perturbation importance by zone width.
- zone_size_exponent : float, default 1.0
Exponent applied to zone size during normalisation.
- covariance_threshold : float, default 0.01
Minimum covariance value to keep a predicate (covariance metric only).
- var_exp : bool, default True
Weight graph edges by PC1 explained variance of the source zone.
- show_graph_details : bool, default False
Print bidirectional-edge details during graph construction.
- class_threshold : float, default 0.5
Decision boundary for Class_Predicted annotation on bags.
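To make the perturbation_metric options more concrete, here is a minimal sketch of three of them in plain Python. The function bodies are assumptions derived from the descriptions above, not the library's code. In particular, the total variation distance is taken in its usual form (half the summed absolute difference between two probability vectors), and the signed difference is taken as perturbed minus original.

```python
def probability_shift(proba_before, proba_after):
    """Mean total variation distance between rows of class probabilities."""
    distances = [
        0.5 * sum(abs(a - b) for a, b in zip(row_before, row_after))
        for row_before, row_after in zip(proba_before, proba_after)
    ]
    return sum(distances) / len(distances)


def mean_abs_diff(y_before, y_after):
    """Mean absolute difference between original and perturbed predictions."""
    return sum(abs(b - a) for a, b in zip(y_before, y_after)) / len(y_before)


def mean_diff(y_before, y_after):
    """Mean signed difference; positive means perturbation raised predictions."""
    return sum(b - a for a, b in zip(y_before, y_after)) / len(y_before)


# Classification: only the first sample changes its class probabilities.
shift = probability_shift([[0.9, 0.1], [0.5, 0.5]], [[0.6, 0.4], [0.5, 0.5]])

# Regression: one prediction goes up by 0.5, the other down by 0.5.
abs_change = mean_abs_diff([1.0, 2.0], [1.5, 1.5])
signed_change = mean_diff([1.0, 2.0], [1.5, 1.5])
```

The regression example shows why the two measures differ: opposite shifts cancel in the signed mean (0.0) but not in the absolute mean (0.5).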
Attributes (set after fit())#
- lrc_natural_ : pd.DataFrame or None
LRC with natural-scale thresholds (available only when X_cal_natural is provided to fit()). Columns: Node, Local_Reaching_Centrality, Zone, Threshold, Operator, Threshold_Natural.
- lrc_summed_ : pd.DataFrame
Mean-aggregated LRC across seeds, preprocessed-scale thresholds.
- lrc_summed_unique_ : pd.DataFrame
Zone-deduplicated version of lrc_summed_ (one row per zone).
- zone_scores_ : pd.DataFrame
PCA zone scores on the preprocessed calibration data.
- predicates_df_ : pd.DataFrame
Full predicate catalogue (generated from zone_scores_).
- pca_info_ : dict
PCA info for the preprocessed zones.
- pca_info_natural_ : dict or None
PCA info for the natural (unpreprocessed) zones (only when X_cal_natural is provided to fit()).
- zones_natural_ : dict or None
Raw spectral zone DataFrames from the unpreprocessed data (only when X_cal_natural is provided to fit()).
- graphs_by_seed_ : dict[int, nx.DiGraph]
Per-seed directed predicate graphs (useful for debugging).
- valid_seeds_ : list[int]
Seeds that produced a non-empty graph (subset of seeds).
- faithfulness_ : dict
Progressive top-k masking evaluation summary produced by evaluate_faithfulness().
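As an illustration of what "mean-aggregated LRC across seeds" could look like, the sketch below averages per-seed scores node by node. It uses plain dictionaries rather than DataFrames. The helper name, the node labels, and the choice to average only over the seeds in which a node appears are all assumptions. The source does not state how nodes missing from a seed are handled.

```python
from collections import defaultdict


def mean_lrc_across_seeds(lrc_by_seed):
    """Average {seed: {node: lrc}} into {node: mean lrc}, highest first.

    Hypothetical helper; assumes a node is averaged over the seeds
    in which it actually appears.
    """
    values = defaultdict(list)
    for node_scores in lrc_by_seed.values():
        for node, score in node_scores.items():
            values[node].append(score)
    means = {node: sum(scores) / len(scores) for node, scores in values.items()}
    # Sort by descending mean so the most central predicate comes first.
    return dict(sorted(means.items(), key=lambda item: item[1], reverse=True))


# Illustrative predicate labels; the real Node format may differ.
lrc_by_seed = {
    0: {"Low <= 0.2": 0.50, "High > 1.1": 0.25},
    1: {"Low <= 0.2": 0.75, "High > 1.1": 0.25},
}
summary = mean_lrc_across_seeds(lrc_by_seed)
```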
- spectral_cuts#
- quantiles#
- n_repetitions = 4#
- seeds#
- n_bags = 10#
- n_samples_fraction = 0.8#
- replace = False#
- metric = 'perturbation'#
- estimator = None#
- perturbation_mode = 'median'#
- perturbation_value = 0#
- perturbation_metric = 'probability_shift'#
- perturbation_stats_source = 'full'#
- normalize_by_zone_size = True#
- zone_size_exponent = 1.0#
- covariance_threshold = 0.01#
- var_exp = True#
- show_graph_details = False#
- class_threshold = 0.5#
- lrc_natural_: pandas.DataFrame | None = None#
- lrc_summed_: pandas.DataFrame | None = None#
- lrc_summed_unique_: pandas.DataFrame | None = None#
- zone_scores_: pandas.DataFrame | None = None#
- predicates_df_: pandas.DataFrame | None = None#
- graphs_by_seed_: Dict[int, networkx.DiGraph]#
- fit(X_cal_prep: pandas.DataFrame, y_pred_cal: pandas.Series | numpy.ndarray, X_cal_natural: pandas.DataFrame | None = None) SMX[source]#
Run the full SMX explanation pipeline.
Parameters#
- X_cal_prep : pd.DataFrame
Pre-processed calibration spectra (samples × features).
- y_pred_cal : pd.Series or np.ndarray
Continuous model predictions for the calibration set (aligned with X_cal_prep).
- X_cal_natural : pd.DataFrame, optional
Unpreprocessed calibration spectra with the same shape as X_cal_prep. Required for lrc_natural_ threshold mapping. When None, the natural-scale mapping step is skipped and lrc_natural_, zones_natural_, and pca_info_natural_ remain None.
Returns#
self
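A typical call sequence, based only on the signatures documented on this page, might look like the sketch below. The wrapper function `explain_calibration_set` is hypothetical, and the zone definitions and quantiles are the illustrative values from the parameter descriptions. The import sits inside the function so that the snippet can be loaded without the package installed.

```python
def explain_calibration_set(estimator, X_cal_prep, y_pred_cal, X_cal_natural=None):
    """Fit an SMX explainer and return the one-row-per-zone ranking.

    Hypothetical wrapper: `estimator` is assumed to be an already trained
    sklearn-compatible model, and the inputs are assumed to be aligned.
    """
    from smx.pipeline import SMX  # lazy import, requires the smx package

    explainer = SMX(
        spectral_cuts=[("Low", 1.0, 4.0), ("High", 4.0, 10.0)],
        quantiles=[0.25, 0.5, 0.75],
        metric="perturbation",
        estimator=estimator,
    )
    # fit() returns self, so construction and fitting could also be chained.
    explainer.fit(X_cal_prep, y_pred_cal, X_cal_natural=X_cal_natural)
    # lrc_natural_ stays None unless X_cal_natural was supplied.
    return explainer.lrc_summed_unique_
```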
- evaluate_faithfulness(X_eval: pandas.DataFrame, *, ranking: Literal['unique', 'summed', 'natural'] = 'unique', X_reference: pandas.DataFrame | None = None, metric: Literal['auto', 'probability_shift', 'mean_abs_diff', 'decision_function_shift'] = 'auto', masking_strategy: Literal['zero', 'constant', 'mean', 'median', 'min', 'max'] = 'zero', constant_value: float = 0.0, max_k: int | None = None, n_random_rankings: int = 100, random_state: int | None = 42, output_path: str | pathlib.Path | None = None, plot_title: str | None = None, plot_width: int = 1100, plot_height: int = 560) Dict[str, Any][source]#
Evaluate SMX faithfulness via progressive top-k zone masking.
The ranked spectral zones are progressively masked on X_eval following the selected SMX ranking, and the resulting prediction shift is summarized by the area under the masking curve (AUC).
Parameters#
- X_eval : pd.DataFrame
Evaluation spectra to be masked progressively.
- ranking : {‘unique’, ‘summed’, ‘natural’}, default ‘unique’
Ranking table used to derive the ordered list of spectral zones. 'unique' uses the one-zone-per-row ranking in lrc_summed_unique_. 'summed' and 'natural' are deduplicated internally to one row per zone before masking.
- X_reference : pd.DataFrame, optional
Reference spectra used to compute replacement values for non-zero masking strategies. Defaults to X_eval.
- metric : {‘auto’, ‘probability_shift’, ‘mean_abs_diff’, ‘decision_function_shift’}, default ‘auto’
Prediction-shift metric to evaluate. 'auto' chooses 'probability_shift' when the estimator exposes predict_proba(), 'decision_function_shift' when it exposes decision_function(), otherwise 'mean_abs_diff'.
- masking_strategy : {‘zero’, ‘constant’, ‘mean’, ‘median’, ‘min’, ‘max’}, default ‘zero’
How masked spectral variables are replaced.
- constant_value : float, default 0.0
Replacement value used when masking_strategy='constant'.
- max_k : int, optional
Maximum number of ranked zones to mask. Defaults to all ranked zones available in X_eval.
- n_random_rankings : int, default 100
Number of random rankings used to contextualize the observed AUC.
- random_state : int, optional
Seed controlling the random baseline.
- output_path : str or Path, optional
If provided, also export a faithfulness plot to this path. The extension determines the format (.html or a static image).
- plot_title : str, optional
Title override used when output_path is provided.
- plot_width : int, default 1100
Plot width in pixels. Used only when output_path is provided.
- plot_height : int, default 560
Plot height in pixels. Used only when output_path is provided.
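The 'auto' rule for metric described above amounts to a simple capability check. The sketch below shows one way to express it. `resolve_metric` and the three stub classes are hypothetical names used only for illustration.

```python
def resolve_metric(estimator, metric="auto"):
    """Pick a prediction-shift metric following the documented 'auto' rule."""
    if metric != "auto":
        return metric                       # explicit choice wins
    if hasattr(estimator, "predict_proba"):
        return "probability_shift"
    if hasattr(estimator, "decision_function"):
        return "decision_function_shift"
    return "mean_abs_diff"                  # fallback, e.g. for regressors


# Minimal stand-ins for the three kinds of estimator.
class ProbabilisticModel:
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


class MarginModel:
    def decision_function(self, X):
        return [0.0 for _ in X]


class Regressor:
    def predict(self, X):
        return [0.0 for _ in X]


choices = [resolve_metric(m) for m in (ProbabilisticModel(), MarginModel(), Regressor())]
```

Here `choices` holds 'probability_shift', 'decision_function_shift', and 'mean_abs_diff', in that order. An estimator exposing both methods resolves to 'probability_shift', since predict_proba is checked first.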
Returns#
- dict
Faithfulness summary including curve_df, auc, auc_normalized, level, and null-baseline statistics.
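The idea behind progressive top-k masking can be sketched with a toy linear model. Everything here is an assumption made for illustration: the helper names, the `(start, end)` column ranges standing in for zones, the trapezoidal AUC over k, and the shuffled-ranking baseline. The library's exact AUC definition and normalisation are not specified on this page.

```python
import random


def predict(rows, weights):
    """Toy linear model standing in for the trained estimator."""
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]


def masking_curve(rows, weights, ranked_zones, max_k=None):
    """Mean absolute prediction shift after zeroing the top-k zones, k = 1..K."""
    baseline = predict(rows, weights)
    masked = [list(row) for row in rows]          # copy, masks accumulate
    curve = []
    for start, end in ranked_zones[:max_k]:
        for row in masked:
            row[start:end] = [0.0] * (end - start)   # 'zero' masking strategy
        shifted = predict(masked, weights)
        curve.append(sum(abs(a - b) for a, b in zip(baseline, shifted)) / len(rows))
    return curve


def trapezoid_auc(curve):
    """Area under the curve, with k = 0 (nothing masked) contributing 0 shift."""
    points = [0.0] + curve
    return sum((a + b) / 2 for a, b in zip(points, points[1:]))


weights = [3.0, 3.0, 1.0, 1.0, 0.0, 0.0]          # first zone matters most
rows = [[1.0] * 6, [1.0] * 6]
ranking = [(0, 2), (2, 4), (4, 6)]                # zones ordered by importance

curve = masking_curve(rows, weights, ranking)     # [6.0, 8.0, 8.0]
observed_auc = trapezoid_auc(curve)               # 18.0

# Random baseline: AUCs obtained from shuffled rankings.
rng = random.Random(42)
random_aucs = []
for _ in range(100):
    shuffled = ranking[:]
    rng.shuffle(shuffled)
    random_aucs.append(trapezoid_auc(masking_curve(rows, weights, shuffled)))
```

A faithful ranking masks the most influential zones first, so its curve rises early and its AUC sits at or above the AUCs of the shuffled rankings.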
- plot_zone_ranking_over_spectrum(output_path: str | pathlib.Path, *, ranking: Literal['unique', 'natural'] = 'unique', aggregation: Literal['mean', 'median'] = 'mean', title: str | None = None, X_natural: pandas.DataFrame | None = None, y_labels: pandas.Series | None = None, class_colors: dict | None = None, width: int | None = 1200, height: int | None = 500) pandas.DataFrame[source]#
Plot ranked spectral zones over a reference spectrum and save to file.
The output format is inferred from output_path: .html for an interactive figure, or .png/.svg/.pdf for a static image (requires kaleido).
Parameters#
- output_path : str or Path
Destination file. The extension selects the output format as described above.
- ranking : {‘unique’, ‘natural’}, default ‘unique’
Ranking source. 'unique' uses lrc_summed_unique_ (one row per zone). 'natural' uses lrc_natural_ and collapses multiple predicates per zone to the strongest LRC value.
- aggregation : {‘mean’, ‘median’}, default ‘mean’
Aggregation used to build the reference spectrum from zones_natural_.
- title : str, optional
Figure title override.
- X_natural : pd.DataFrame, optional
Full calibration dataset in natural (unpreprocessed) units. When provided together with y_labels, a mean spectrum is drawn for each class on top of the overall reference spectrum.
- y_labels : pd.Series, optional
Class labels aligned with the rows of X_natural. Required when X_natural is given.
- class_colors : dict[str, str], optional
Mapping from class label to hex/CSS color string. Missing labels fall back to a built-in palette.
- width : int, default 1200
Figure width in pixels. Used only for static image exports.
- height : int, default 500
Figure height in pixels. Used only for static image exports.
Returns#
- pd.DataFrame
Normalized ranking table used in the figure.
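Two of the steps described for this method lend themselves to a small illustration: collapsing several predicates per zone to the strongest LRC value (the 'natural' ranking), and building a reference spectrum with a mean or median. The sketch below uses plain lists and dicts instead of DataFrames. Both helper names and the sample values are hypothetical.

```python
from statistics import mean, median


def strongest_per_zone(rows):
    """Keep, for each Zone, the row with the highest Local_Reaching_Centrality."""
    best = {}
    for row in rows:
        zone = row["Zone"]
        current = best.get(zone)
        if current is None or row["Local_Reaching_Centrality"] > current["Local_Reaching_Centrality"]:
            best[zone] = row
    # Return zones ordered from most to least central.
    return sorted(best.values(),
                  key=lambda r: r["Local_Reaching_Centrality"], reverse=True)


def reference_spectrum(spectra, aggregation="mean"):
    """Aggregate sample spectra column by column into one reference spectrum."""
    aggregate = {"mean": mean, "median": median}[aggregation]
    return [aggregate(column) for column in zip(*spectra)]


predicates = [
    {"Zone": "Low", "Local_Reaching_Centrality": 0.25},
    {"Zone": "High", "Local_Reaching_Centrality": 0.5},
    {"Zone": "Low", "Local_Reaching_Centrality": 0.75},
]
ranking = strongest_per_zone(predicates)

spectra = [[1.0, 2.0], [3.0, 2.0], [8.0, 5.0]]
mean_spectrum = reference_spectrum(spectra)
median_spectrum = reference_spectrum(spectra, aggregation="median")
```

The median is less affected by the outlying third sample than the mean, which is one reason to expose both aggregation choices.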
- plot_faithfulness(output_path: str | pathlib.Path, *, title: str | None = None, width: int = 1100, height: int = 560) pandas.DataFrame[source]#
Plot the progressive masking faithfulness curve saved in faithfulness_.
Parameters#
- output_path : str or Path
Destination file. Use .html for an interactive figure or an image extension for static export.
- title : str, optional
Figure title override.
- width : int, default 1100
Figure width in pixels. Used only for static image exports.
- height : int, default 560
Figure height in pixels. Used only for static image exports.
Returns#
- pd.DataFrame
Faithfulness masking curve used in the figure.