smx.pipeline#

SMX: high-level facade for the full SMX explanation pipeline.

This class internalises the seed-loop orchestration that every caller would otherwise have to rewrite manually (zone extraction → predicate generation → bagging → metric → graph → LRC → natural-scale mapping across N seeds).

Individual component classes (ZoneAggregator, PredicateGenerator, etc.) remain available for power users who need fine-grained control.

Attributes#

Classes#

SMX

Full SMX explanation pipeline as a single fit/transform object.

Module Contents#

smx.pipeline.logger#
smx.pipeline.SpectralCuts#
class smx.pipeline.SMX(spectral_cuts: SpectralCuts, quantiles: List[float], n_repetitions: int = 4, n_bags: int = 10, n_samples_fraction: float = 0.8, replace: bool = False, metric: Literal['covariance', 'perturbation'] = 'perturbation', estimator: Any | None = None, perturbation_mode: str = 'median', perturbation_value: float = 0, perturbation_metric: str = 'probability_shift', perturbation_stats_source: str = 'full', normalize_by_zone_size: bool = True, zone_size_exponent: float = 1.0, covariance_threshold: float = 0.01, var_exp: bool = True, show_graph_details: bool = False, class_threshold: float = 0.5)[source]#

Full SMX explanation pipeline as a single fit/transform object.

Runs zone extraction → PCA aggregation → predicate generation → seed-loop (bagging → metric → graph → LRC) → cross-seed aggregation → natural-scale threshold mapping.

Parameters#

spectral_cuts : list of (name, start, end) tuples

Zone definitions, e.g. [("Low", 1.0, 4.0), ("High", 4.0, 10.0)].

quantiles : list of float

Quantile fractions for predicate generation, e.g. [0.25, 0.5, 0.75].
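As an illustration of how quantile fractions can turn zone scores into threshold predicates, here is a minimal sketch. It assumes one threshold per zone and quantile, with a "<=" and a ">" predicate for each; the helper names, the operator set, and the linear-interpolation quantile are assumptions, not the library's confirmed behaviour.

```python
from typing import Dict, List, Tuple


def quantile(values: List[float], q: float) -> float:
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def make_predicates(
    zone_scores: Dict[str, List[float]], quantiles: List[float]
) -> List[Tuple[str, str, float]]:
    """Build (zone, operator, threshold) triples (hypothetical layout)."""
    predicates = []
    for zone, scores in zone_scores.items():
        for q in quantiles:
            threshold = quantile(scores, q)
            predicates.append((zone, "<=", threshold))
            predicates.append((zone, ">", threshold))
    return predicates


# Toy zone scores standing in for the per-zone PCA scores.
zone_scores = {
    "Low": [0.0, 1.0, 2.0, 3.0, 4.0],
    "High": [10.0, 20.0, 30.0, 40.0, 50.0],
}
predicates = make_predicates(zone_scores, [0.25, 0.5, 0.75])
```

With two zones and three quantiles, this sketch yields twelve candidate predicates.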

n_repetitions : int, default 4

Number of independent bagging repetitions. Seeds are generated as [0, 1, …, n_repetitions-1].

n_bags : int, default 10

Number of bags per seed.

n_samples_fraction : float, default 0.8

Fraction of calibration samples drawn per bag. The minimum number of samples per predicate is hardcoded to 20% of the dataset.

replace : bool, default False

Whether to sample bags with replacement.
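The interaction of n_repetitions, n_bags, n_samples_fraction and replace can be sketched as follows. This is an illustration using only the standard library; the function name draw_bags and the rounding of the bag size are assumptions, and the actual sampler inside SMX may differ.

```python
import random
from typing import Dict, List


def draw_bags(
    n_samples: int,
    n_bags: int = 10,
    n_samples_fraction: float = 0.8,
    replace: bool = False,
    seed: int = 0,
) -> List[List[int]]:
    """Draw n_bags index subsets for one seed (hypothetical helper)."""
    rng = random.Random(seed)
    bag_size = int(round(n_samples_fraction * n_samples))
    indices = range(n_samples)
    bags = []
    for _ in range(n_bags):
        if replace:
            bag = rng.choices(indices, k=bag_size)  # duplicates allowed
        else:
            bag = rng.sample(indices, k=bag_size)  # all indices distinct
        bags.append(sorted(bag))
    return bags


n_repetitions = 4
seeds = list(range(n_repetitions))  # [0, 1, ..., n_repetitions - 1]
bags_by_seed: Dict[int, List[List[int]]] = {
    seed: draw_bags(50, seed=seed) for seed in seeds
}
```

Because each repetition uses a fixed seed, rerunning the sketch reproduces the same bags.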

metric : {'covariance', 'perturbation'}, default 'perturbation'

Importance metric to use.

estimator : sklearn-compatible estimator, optional

Trained model required when metric='perturbation'.

perturbation_mode : str, default 'median'

Replacement strategy for perturbation ('constant', 'mean', 'median', 'min', 'max').

perturbation_value : float, default 0

Constant replacement value used when perturbation_mode='constant'.
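The replacement strategies can be pictured with a small sketch. It assumes that each spectral variable in the perturbed zone is overwritten with a per-column statistic (or with the constant); the helper names and the per-column choice are assumptions for illustration only.

```python
import statistics
from typing import List


def replacement_value(column: List[float], mode: str = "median", value: float = 0.0) -> float:
    """Value used to overwrite one column under the given mode."""
    if mode == "constant":
        return value
    if mode == "mean":
        return statistics.fmean(column)
    if mode == "median":
        return statistics.median(column)
    if mode == "min":
        return min(column)
    if mode == "max":
        return max(column)
    raise ValueError(f"unknown perturbation mode: {mode!r}")


def perturb_zone(
    X: List[List[float]], zone_columns: List[int], mode: str = "median", value: float = 0.0
) -> List[List[float]]:
    """Return a copy of X with the zone's columns replaced (hypothetical helper)."""
    X_pert = [row[:] for row in X]
    for j in zone_columns:
        fill = replacement_value([row[j] for row in X], mode, value)
        for row in X_pert:
            row[j] = fill
    return X_pert


X = [[1.0, 10.0, 5.0], [2.0, 20.0, 6.0], [6.0, 60.0, 7.0]]
X_median = perturb_zone(X, zone_columns=[0, 1], mode="median")
X_const = perturb_zone(X, zone_columns=[2], mode="constant", value=0.0)
```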

perturbation_metric : str, default 'probability_shift'

Perturbation importance measure. Determines how the impact of perturbing a spectral zone is quantified. The choice depends on the estimator type and the desired sensitivity:

Classification estimators (with predict_proba):

  • 'probability_shift' — Mean total variation distance between pre- and post-perturbation class probabilities. Sensitive to confidence changes across all classes. Requires predict_proba.

  • 'accuracy_drop' — Drop in accuracy when perturbed predictions are compared to original predictions.

  • 'f1_drop' — Weighted F1-score decrease after perturbation.

  • 'decision_function_shift' — Mean absolute change in decision function values (e.g. signed distances from hyperplane for SVC). Requires decision_function().
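To make two of the classification measures concrete, the sketch below computes a mean total variation distance and an accuracy drop on toy inputs. The function names mirror the option strings, but the bodies are an illustrative reading of the descriptions above (accuracy drop is taken as one minus the agreement with the original predictions), not the library's code.

```python
from typing import List, Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def probability_shift(P_orig: List[List[float]], P_pert: List[List[float]]) -> float:
    """Mean total variation distance over samples."""
    distances = [total_variation(p, q) for p, q in zip(P_orig, P_pert)]
    return sum(distances) / len(distances)


def accuracy_drop(y_orig: List[int], y_pert: List[int]) -> float:
    """1 - agreement, treating the original predictions as the reference."""
    agreement = sum(a == b for a, b in zip(y_orig, y_pert)) / len(y_orig)
    return 1.0 - agreement


# Toy class probabilities before and after perturbing one zone.
P_orig = [[0.9, 0.1], [0.2, 0.8]]
P_pert = [[0.6, 0.4], [0.2, 0.8]]
shift = probability_shift(P_orig, P_pert)  # approximately 0.15

drop = accuracy_drop([1, 0, 1, 1], [1, 1, 1, 0])  # 0.5
```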

Regression estimators (with predict returning continuous values):

  • 'mean_abs_diff' — Mean absolute difference between original and perturbed predictions.

  • 'mean_diff' — Mean signed difference (bias direction). Positive values indicate that perturbation increases predictions; negative values indicate that it decreases them.

  • 'mean_relative_dev' — Mean relative deviation, normalized by original prediction magnitude. Treats zero predictions as NaN.
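The three regression measures can be sketched in plain Python as below. Two points are assumptions rather than documented behaviour: the relative deviation is taken as an absolute value, and NaN entries (from zero predictions) are skipped when averaging.

```python
import math
from typing import List


def mean_abs_diff(y_orig: List[float], y_pert: List[float]) -> float:
    return sum(abs(p - o) for o, p in zip(y_orig, y_pert)) / len(y_orig)


def mean_diff(y_orig: List[float], y_pert: List[float]) -> float:
    # Positive when the perturbation increases predictions on average.
    return sum(p - o for o, p in zip(y_orig, y_pert)) / len(y_orig)


def mean_relative_dev(y_orig: List[float], y_pert: List[float]) -> float:
    # Zero original predictions produce NaN, which is skipped here (assumption).
    ratios = [
        abs(p - o) / abs(o) if o != 0 else math.nan
        for o, p in zip(y_orig, y_pert)
    ]
    valid = [r for r in ratios if not math.isnan(r)]
    return sum(valid) / len(valid) if valid else math.nan


y_orig = [2.0, 4.0, 0.0]
y_pert = [3.0, 2.0, 1.0]
mad = mean_abs_diff(y_orig, y_pert)      # (1 + 2 + 1) / 3
md = mean_diff(y_orig, y_pert)           # (1 - 2 + 1) / 3
mrd = mean_relative_dev(y_orig, y_pert)  # (0.5 + 0.5) / 2, third entry skipped
```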

normalize_by_zone_size : bool, default True

Divide raw perturbation importance by zone width.

zone_size_exponent : float, default 1.0

Exponent applied to zone size during normalisation.
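One plausible reading of these two options is that the raw importance is divided by the zone size raised to the exponent. The sketch below assumes that formula and takes zone size to be the number of spectral variables in the zone; both are assumptions for illustration.

```python
def normalize_importance(raw: float, zone_size: int, exponent: float = 1.0) -> float:
    """Divide raw importance by zone_size ** exponent (assumed formula)."""
    return raw / (zone_size ** exponent)


full = normalize_importance(0.6, zone_size=30)                    # approximately 0.02
damped = normalize_importance(0.6, zone_size=25, exponent=0.5)    # approximately 0.12
unchanged = normalize_importance(0.6, zone_size=30, exponent=0.0)  # 0.6
```

Under this reading, an exponent below 1 penalises wide zones less strongly, and an exponent of 0 disables the normalisation.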

covariance_threshold : float, default 0.01

Minimum covariance value to keep a predicate (covariance metric only).

var_exp : bool, default True

Weight graph edges by PC1 explained variance of the source zone.

show_graph_details : bool, default False

Print bidirectional-edge details during graph construction.

class_threshold : float, default 0.5

Decision boundary for Class_Predicted annotation on bags.

Attributes (set after fit())#

lrc_natural_ : pd.DataFrame or None

LRC with natural-scale thresholds (available only when X_cal_natural is provided to fit()). Columns: Node, Local_Reaching_Centrality, Zone, Threshold, Operator, Threshold_Natural.

lrc_summed_ : pd.DataFrame

Mean-aggregated LRC across seeds, preprocessed-scale thresholds.
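Cross-seed mean aggregation can be pictured as below. The sketch averages each node's LRC over the seeds in which the node appears; how SMX treats nodes that are missing from some seeds is not stated here, so that choice is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def mean_lrc_across_seeds(lrc_by_seed: Dict[int, Dict[str, float]]) -> Dict[str, float]:
    """Average per-node LRC over the seeds where the node appears."""
    values: Dict[str, List[float]] = defaultdict(list)
    for seed_scores in lrc_by_seed.values():
        for node, score in seed_scores.items():
            values[node].append(score)
    return {node: sum(scores) / len(scores) for node, scores in values.items()}


# Toy per-seed LRC values keyed by predicate label (labels are illustrative).
lrc_by_seed = {
    0: {"Low <= 1.0": 0.6, "High > 30.0": 0.2},
    1: {"Low <= 1.0": 0.4},
}
lrc_mean = mean_lrc_across_seeds(lrc_by_seed)
```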

lrc_summed_unique_ : pd.DataFrame

Zone-deduplicated version of lrc_summed_ (one row per zone).

zone_scores_ : pd.DataFrame

PCA zone scores on the preprocessed calibration data.

predicates_df_ : pd.DataFrame

Full predicate catalogue (generated from zone_scores_).

pca_info_ : dict

PCA info for the preprocessed zones.

pca_info_natural_ : dict or None

PCA info for the natural (unpreprocessed) zones (only when X_cal_natural is provided to fit()).

zones_natural_ : dict or None

Raw spectral zone DataFrames from the unpreprocessed data (only when X_cal_natural is provided to fit()).

graphs_by_seed_ : dict[int, nx.DiGraph]

Per-seed directed predicate graphs (useful for debugging).
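For intuition about what is computed on each of these graphs, the sketch below evaluates the unweighted local reaching centrality of a node in a small directed graph, that is, the proportion of other nodes reachable from it. The adjacency-dict representation and the node names are illustrative; SMX may use weighted edges (see var_exp), in which case the values differ.

```python
from collections import deque
from typing import Dict, List


def local_reaching_centrality(graph: Dict[str, List[str]], source: str) -> float:
    """Fraction of the other nodes reachable from source (unweighted, directed)."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    seen = {source}
    queue = deque([source])
    while queue:  # breadth-first search along outgoing edges
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return (len(seen) - 1) / (len(nodes) - 1)


# Toy directed graph as an adjacency dict: D -> A -> {B, C}, B -> C.
graph = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["A"]}
lrc = {node: local_reaching_centrality(graph, node) for node in graph}
```

In this toy graph, D reaches every other node and therefore ranks highest, while C reaches none.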

valid_seeds_ : list[int]

Seeds that produced a non-empty graph (subset of seeds).

faithfulness_ : dict

Progressive top-k masking evaluation summary produced by evaluate_faithfulness().

spectral_cuts#
quantiles#
n_repetitions = 4#
seeds#
n_bags = 10#
n_samples_fraction = 0.8#
replace = False#
metric = 'perturbation'#
estimator = None#
perturbation_mode = 'median'#
perturbation_value = 0#
perturbation_metric = 'probability_shift'#
perturbation_stats_source = 'full'#
normalize_by_zone_size = True#
zone_size_exponent = 1.0#
covariance_threshold = 0.01#
var_exp = True#
show_graph_details = False#
class_threshold = 0.5#
lrc_natural_: pandas.DataFrame | None = None#
lrc_summed_: pandas.DataFrame | None = None#
lrc_summed_unique_: pandas.DataFrame | None = None#
zone_scores_: pandas.DataFrame | None = None#
predicates_df_: pandas.DataFrame | None = None#
pca_info_: Dict | None = None#
pca_info_natural_: Dict | None = None#
zones_natural_: Dict | None = None#
graphs_by_seed_: Dict[int, networkx.DiGraph]#
valid_seeds_: List[int] = []#
faithfulness_: Dict[str, Any] | None = None#
fit(X_cal_prep: pandas.DataFrame, y_pred_cal: pandas.Series | numpy.ndarray, X_cal_natural: pandas.DataFrame | None = None) SMX[source]#

Run the full SMX explanation pipeline.

Parameters#

X_cal_prep : pd.DataFrame

Pre-processed calibration spectra (samples × features).

y_pred_cal : pd.Series or np.ndarray

Continuous model predictions for the calibration set (aligned with X_cal_prep).

X_cal_natural : pd.DataFrame, optional

Unpreprocessed calibration spectra with the same shape as X_cal_prep. Required for lrc_natural_ threshold mapping. When None, the natural-scale mapping step is skipped and lrc_natural_, zones_natural_, and pca_info_natural_ remain None.

Returns#

self
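A minimal usage sketch, based only on the constructor and fit() signatures documented on this page. The zone boundaries and the helper name explain_with_smx are illustrative assumptions, and the function has not been checked against a real dataset.

```python
from typing import Any, List, Optional, Tuple

# Hypothetical zone definitions, in the (name, start, end) form documented above.
SPECTRAL_CUTS: List[Tuple[str, float, float]] = [("Low", 1.0, 4.0), ("High", 4.0, 10.0)]
QUANTILES: List[float] = [0.25, 0.5, 0.75]


def explain_with_smx(
    estimator: Any,
    X_cal_prep: Any,
    y_pred_cal: Any,
    X_cal_natural: Optional[Any] = None,
) -> Any:
    """Fit SMX and return the fitted object (hypothetical helper)."""
    # Imported lazily so this sketch can be loaded without smx installed.
    from smx.pipeline import SMX

    smx = SMX(
        spectral_cuts=SPECTRAL_CUTS,
        quantiles=QUANTILES,
        n_repetitions=4,
        n_bags=10,
        metric="perturbation",
        estimator=estimator,  # trained model, required for metric='perturbation'
    )
    # fit() returns self; lrc_natural_ stays None unless X_cal_natural is given.
    return smx.fit(X_cal_prep, y_pred_cal, X_cal_natural=X_cal_natural)
```

After fitting, the ranking tables are read from the attributes listed above, for example lrc_summed_unique_.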

evaluate_faithfulness(X_eval: pandas.DataFrame, *, ranking: Literal['unique', 'summed', 'natural'] = 'unique', X_reference: pandas.DataFrame | None = None, metric: Literal['auto', 'probability_shift', 'mean_abs_diff', 'decision_function_shift'] = 'auto', masking_strategy: Literal['zero', 'constant', 'mean', 'median', 'min', 'max'] = 'zero', constant_value: float = 0.0, max_k: int | None = None, n_random_rankings: int = 100, random_state: int | None = 42, output_path: str | pathlib.Path | None = None, plot_title: str | None = None, plot_width: int = 1100, plot_height: int = 560) Dict[str, Any][source]#

Evaluate SMX faithfulness via progressive top-k zone masking.

The ranked spectral zones are progressively masked on X_eval following the selected SMX ranking, and the resulting prediction shift is summarized by the area under the masking curve (AUC).
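The idea of progressive top-k masking can be sketched with a toy linear model. The curve below starts at k = 0, uses mean absolute prediction change as the shift, and integrates with the trapezoidal rule over unit steps in k; these details, and all names, are assumptions for illustration rather than the exact procedure used by evaluate_faithfulness().

```python
from typing import Callable, Dict, List


def masking_curve(
    predict: Callable[[List[float]], float],
    X: List[List[float]],
    ranked_zones: List[str],
    zone_columns: Dict[str, List[int]],
    fill: float = 0.0,
) -> List[float]:
    """Mean prediction shift after masking the top-k zones, for k = 0..K."""
    baseline = [predict(row) for row in X]
    X_masked = [list(row) for row in X]
    curve = [0.0]  # k = 0: nothing masked yet
    for zone in ranked_zones:
        for row in X_masked:
            for j in zone_columns[zone]:
                row[j] = fill  # masking is cumulative across zones
        shifts = [abs(predict(row) - b) for row, b in zip(X_masked, baseline)]
        curve.append(sum(shifts) / len(shifts))
    return curve


def trapezoid_auc(curve: List[float]) -> float:
    """Area under the curve with unit spacing between consecutive k."""
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))


def toy_model(row: List[float]) -> float:
    return 3.0 * row[0] + 1.0 * row[1]  # the third variable is irrelevant


X_eval = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
zone_columns = {"A": [0], "B": [1], "C": [2]}

good_curve = masking_curve(toy_model, X_eval, ["A", "B", "C"], zone_columns)
bad_curve = masking_curve(toy_model, X_eval, ["C", "B", "A"], zone_columns)
good_auc = trapezoid_auc(good_curve)
bad_auc = trapezoid_auc(bad_curve)
```

A ranking that masks the influential zones first produces a curve that rises early, so its area is larger than that of the reversed ranking.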

Parameters#

X_eval : pd.DataFrame

Evaluation spectra to be masked progressively.

ranking : {'unique', 'summed', 'natural'}, default 'unique'

Ranking table used to derive the ordered list of spectral zones. 'unique' uses the one-zone-per-row ranking in lrc_summed_unique_. 'summed' and 'natural' are deduplicated internally to one row per zone before masking.

X_reference : pd.DataFrame, optional

Reference spectra used to compute replacement values for non-zero masking strategies. Defaults to X_eval.

metric : {'auto', 'probability_shift', 'mean_abs_diff', 'decision_function_shift'}, default 'auto'

Prediction-shift metric to evaluate. 'auto' chooses 'probability_shift' when the estimator exposes predict_proba(), 'decision_function_shift' when it exposes decision_function(), otherwise 'mean_abs_diff'.
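The 'auto' rule described above amounts to a capability check on the estimator. A minimal sketch follows, with a hypothetical function name and dummy estimator classes standing in for real models:

```python
from typing import Any


def resolve_metric(estimator: Any, metric: str = "auto") -> str:
    """Pick the prediction-shift metric following the 'auto' rule."""
    if metric != "auto":
        return metric
    if hasattr(estimator, "predict_proba"):
        return "probability_shift"
    if hasattr(estimator, "decision_function"):
        return "decision_function_shift"
    return "mean_abs_diff"


# Dummy estimators used only to exercise the rule.
class ProbabilisticClassifier:
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


class MarginClassifier:
    def decision_function(self, X):
        return [0.0 for _ in X]


class Regressor:
    def predict(self, X):
        return [0.0 for _ in X]


chosen = [
    resolve_metric(ProbabilisticClassifier()),
    resolve_metric(MarginClassifier()),
    resolve_metric(Regressor()),
]
```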

masking_strategy : {'zero', 'constant', 'mean', 'median', 'min', 'max'}, default 'zero'

How masked spectral variables are replaced.

constant_value : float, default 0.0

Replacement value used when masking_strategy='constant'.

max_k : int, optional

Maximum number of ranked zones to mask. Defaults to all ranked zones available in X_eval.

n_random_rankings : int, default 100

Number of random rankings used to contextualize the observed AUC.

random_state : int, optional

Seed controlling the random baseline.
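How a random-ranking baseline can contextualize an observed AUC is sketched below on a toy additive model, in which each zone contributes a fixed amount to the prediction shift once masked. The additive model, the helper names, and the summary statistic are all assumptions for illustration; the statistics actually reported by evaluate_faithfulness() may differ.

```python
import random
from typing import Dict, List


def cumulative_auc(ranking: List[str], effect: Dict[str, float]) -> float:
    """Trapezoidal area under the cumulative-effect curve for one ranking."""
    curve = [0.0]
    for zone in ranking:
        curve.append(curve[-1] + effect[zone])
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))


def random_baseline(
    zones: List[str],
    effect: Dict[str, float],
    n_random_rankings: int = 100,
    random_state: int = 42,
) -> List[float]:
    """AUCs of shuffled rankings, reproducible for a fixed random_state."""
    rng = random.Random(random_state)
    aucs = []
    for _ in range(n_random_rankings):
        shuffled = zones[:]
        rng.shuffle(shuffled)
        aucs.append(cumulative_auc(shuffled, effect))
    return aucs


# Toy per-zone effects; the "observed" ranking orders zones by effect.
effect = {"A": 3.0, "B": 1.0, "C": 0.0, "D": 0.5}
observed_ranking = sorted(effect, key=effect.get, reverse=True)
observed_auc = cumulative_auc(observed_ranking, effect)
null_aucs = random_baseline(list(effect), effect)
fraction_at_least_as_good = sum(a >= observed_auc for a in null_aucs) / len(null_aucs)
```

A small fraction_at_least_as_good suggests that the observed ranking orders zones better than chance.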

output_path : str or Path, optional

If provided, also export a faithfulness plot to this path. The extension determines the format (.html or a static image).

plot_title : str, optional

Title override used when output_path is provided.

plot_width : int, default 1100

Plot width in pixels. Used only when output_path is provided.

plot_height : int, default 560

Plot height in pixels. Used only when output_path is provided.

Returns#

dict

Faithfulness summary including curve_df, auc, auc_normalized, level, and null-baseline statistics.

plot_zone_ranking_over_spectrum(output_path: str | pathlib.Path, *, ranking: Literal['unique', 'natural'] = 'unique', aggregation: Literal['mean', 'median'] = 'mean', title: str | None = None, X_natural: pandas.DataFrame | None = None, y_labels: pandas.Series | None = None, class_colors: dict | None = None, width: int | None = 1200, height: int | None = 500) pandas.DataFrame[source]#

Plot ranked spectral zones over a reference spectrum and save to file.

The output format is inferred from the extension of output_path: .html for an interactive figure, or .png / .svg / .pdf for a static image (requires kaleido).
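Extension-based dispatch of this kind can be sketched with pathlib. The function name and the returned labels are hypothetical; only the list of extensions comes from the description above.

```python
from pathlib import Path
from typing import Union

STATIC_SUFFIXES = {".png", ".svg", ".pdf"}


def export_kind(output_path: Union[str, Path]) -> str:
    """Classify an output path as 'interactive' or 'static' by its extension."""
    suffix = Path(output_path).suffix.lower()
    if suffix == ".html":
        return "interactive"
    if suffix in STATIC_SUFFIXES:
        return "static"  # static export is the case that needs kaleido
    raise ValueError(f"unsupported output extension: {suffix!r}")


kinds = [export_kind("ranking.html"), export_kind(Path("figs") / "ranking.PNG")]
```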

Parameters#

output_path : str or Path

Destination file. Use .html for an interactive figure, or .png / .svg / .pdf for a static image.

ranking : {'unique', 'natural'}, default 'unique'

Ranking source. 'unique' uses lrc_summed_unique_ (one row per zone). 'natural' uses lrc_natural_ and collapses multiple predicates per zone to the strongest LRC value.
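Collapsing several predicates per zone to the strongest one can be sketched as follows. Rows are plain dicts that reuse the column names Node, Zone and Local_Reaching_Centrality from lrc_natural_; the helper name and the toy values are illustrative.

```python
from typing import Any, Dict, List


def strongest_per_zone(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep, for each zone, the row with the highest LRC, sorted by LRC."""
    best: Dict[str, Dict[str, Any]] = {}
    for row in rows:
        zone = row["Zone"]
        current = best.get(zone)
        if current is None or row["Local_Reaching_Centrality"] > current["Local_Reaching_Centrality"]:
            best[zone] = row
    return sorted(best.values(), key=lambda r: r["Local_Reaching_Centrality"], reverse=True)


rows = [
    {"Node": "Low <= 1.0", "Zone": "Low", "Local_Reaching_Centrality": 0.2},
    {"Node": "Low > 2.0", "Zone": "Low", "Local_Reaching_Centrality": 0.7},
    {"Node": "High > 30.0", "Zone": "High", "Local_Reaching_Centrality": 0.5},
]
unique_rows = strongest_per_zone(rows)
```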

aggregation : {'mean', 'median'}, default 'mean'

Aggregation used to build the reference spectrum from zones_natural_.

title : str, optional

Figure title override.

X_natural : pd.DataFrame, optional

Full calibration dataset in natural (unpreprocessed) units. When provided together with y_labels, a mean spectrum is drawn for each class on top of the overall reference spectrum.
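The per-class mean spectra amount to a group-by-label average over rows. A minimal sketch with plain lists in place of DataFrames (the function name is hypothetical):

```python
from collections import defaultdict
from typing import Dict, List


def class_mean_spectra(
    X_natural: List[List[float]], y_labels: List[str]
) -> Dict[str, List[float]]:
    """Average the spectra of each class, variable by variable."""
    groups: Dict[str, List[List[float]]] = defaultdict(list)
    for row, label in zip(X_natural, y_labels):
        groups[label].append(row)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in groups.items()
    }


# Toy spectra (rows) with one label per row.
X_natural = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [10.0, 10.0, 10.0]]
y_labels = ["A", "A", "B"]
means = class_mean_spectra(X_natural, y_labels)
```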

y_labels : pd.Series, optional

Class labels aligned with the rows of X_natural. Required when X_natural is given.

class_colors : dict[str, str], optional

Mapping from class label to hex/CSS color string. Missing labels fall back to a built-in palette.

width : int, default 1200

Figure width in pixels. Used only for static image exports.

height : int, default 500

Figure height in pixels. Used only for static image exports.

Returns#

pd.DataFrame

Normalized ranking table used in the figure.

plot_faithfulness(output_path: str | pathlib.Path, *, title: str | None = None, width: int = 1100, height: int = 560) pandas.DataFrame[source]#

Plot the progressive masking faithfulness curve saved in faithfulness_.

Parameters#

output_path : str or Path

Destination file. Use .html for an interactive figure or an image extension for static export.

title : str, optional

Figure title override.

width : int, default 1100

Figure width in pixels. Used only for static image exports.

height : int, default 560

Figure height in pixels. Used only for static image exports.

Returns#

pd.DataFrame

Faithfulness masking curve used in the figure.