smx.pipeline
============

.. py:module:: smx.pipeline

.. autoapi-nested-parse::

   SMX: high-level facade for the full SMX explanation pipeline.

   This class internalises the seed-loop orchestration that every caller would
   otherwise have to rewrite manually (zone extraction → predicate generation →
   bagging → metric → graph → LRC → natural-scale mapping across N seeds).

   Individual component classes (``ZoneAggregator``, ``PredicateGenerator``, etc.)
   remain available for power users who need fine-grained control.


Attributes
----------

.. autoapisummary::

   smx.pipeline.logger
   smx.pipeline.SpectralCuts


Classes
-------

.. autoapisummary::

   smx.pipeline.SMX


Module Contents
---------------

.. py:data:: logger

.. py:data:: SpectralCuts

.. py:class:: SMX(spectral_cuts: SpectralCuts, quantiles: List[float], n_repetitions: int = 4, n_bags: int = 10, n_samples_fraction: float = 0.8, replace: bool = False, metric: Literal['covariance', 'perturbation'] = 'perturbation', estimator: Optional[Any] = None, perturbation_mode: str = 'median', perturbation_value: float = 0, perturbation_metric: str = 'probability_shift', perturbation_stats_source: str = 'full', normalize_by_zone_size: bool = True, zone_size_exponent: float = 1.0, covariance_threshold: float = 0.01, var_exp: bool = True, show_graph_details: bool = False, class_threshold: float = 0.5)

   Full SMX explanation pipeline as a single fit/transform object.

   Runs zone extraction → PCA aggregation → predicate generation →
   seed-loop (bagging → metric → graph → LRC) → cross-seed aggregation →
   natural-scale threshold mapping.

   Parameters
   ----------
   spectral_cuts : list of (name, start, end) tuples
       Zone definitions, e.g. ``[("Low", 1.0, 4.0), ("High", 4.0, 10.0)]``.
   quantiles : list of float
       Quantile fractions for predicate generation, e.g. ``[0.25, 0.5, 0.75]``.
   n_repetitions : int, default 4
       Number of independent bagging repetitions. Seeds are generated as
       ``[0, 1, …, n_repetitions-1]``.
   n_bags : int, default 10
       Number of bags per seed.
   n_samples_fraction : float, default 0.8
       Fraction of calibration samples drawn per bag. The minimum samples
       per predicate is hardcoded to 20% of the dataset.
   replace : bool, default False
       Whether to sample bags with replacement.
   metric : {'covariance', 'perturbation'}, default 'perturbation'
       Importance metric to use.
   estimator : sklearn-compatible estimator, optional
       Trained model required when ``metric='perturbation'``.
   perturbation_mode : str, default 'median'
       Replacement strategy for perturbation
       (``'constant'``, ``'mean'``, ``'median'``, ``'min'``, ``'max'``).
   perturbation_value : float, default 0
       Constant replacement value used when ``perturbation_mode='constant'``.
   perturbation_metric : str, default 'probability_shift'
       Perturbation importance measure. Determines how the impact of spectral
       zone perturbation is quantified. Choice depends on the estimator type
       and the desired sensitivity:

       **Classification estimators** (with ``predict_proba``):

       - ``'probability_shift'``: Mean total variation distance between pre- and
         post-perturbation class probabilities. Sensitive to confidence changes
         across all classes. Requires ``predict_proba``.
       - ``'accuracy_drop'``: Drop in accuracy when perturbed predictions are
         compared to original predictions.
       - ``'f1_drop'``: Weighted F1-score decrease after perturbation.
       - ``'decision_function_shift'``: Mean absolute change in decision function
         values (e.g. signed distances from hyperplane for SVC). Requires
         ``decision_function()``.

       **Regression estimators** (with ``predict`` returning continuous values):

       - ``'mean_abs_diff'``: Mean absolute difference between original and
         perturbed predictions.
       - ``'mean_diff'``: Mean signed difference (bias direction). Positive
         values indicate perturbation increases predictions, negative decreases.
       - ``'mean_relative_dev'``: Mean relative deviation, normalized by original
         prediction magnitude. Treats zero predictions as NaN.
   normalize_by_zone_size : bool, default True
       Divide raw perturbation importance by zone width.
   zone_size_exponent : float, default 1.0
       Exponent applied to zone size during normalisation.
   covariance_threshold : float, default 0.01
       Minimum covariance value to keep a predicate (covariance metric only).
   var_exp : bool, default True
       Weight graph edges by PC1 explained variance of the source zone.
   show_graph_details : bool, default False
       Print bidirectional-edge details during graph construction.
   class_threshold : float, default 0.5
       Decision boundary for ``Class_Predicted`` annotation on bags.

   Attributes (set after :meth:`fit`)
   ------------------------------------
   lrc_natural\_ : pd.DataFrame or None
       LRC with natural-scale thresholds (available only when
       ``X_cal_natural`` is provided to :meth:`fit`). Columns: ``Node``,
       ``Local_Reaching_Centrality``, ``Zone``, ``Threshold``, ``Operator``,
       ``Threshold_Natural``.
   lrc_summed\_ : pd.DataFrame
       Mean-aggregated LRC across seeds, preprocessed-scale thresholds.
   lrc_summed_unique\_ : pd.DataFrame
       Zone-deduplicated version of *lrc_summed_* (one row per zone).
   zone_scores\_ : pd.DataFrame
       PCA zone scores on the preprocessed calibration data.
   predicates_df\_ : pd.DataFrame
       Full predicate catalogue (generated from *zone_scores_*).
   pca_info\_ : dict
       PCA info for the preprocessed zones.
   pca_info_natural\_ : dict or None
       PCA info for the natural (unpreprocessed) zones
       (only when ``X_cal_natural`` is provided to :meth:`fit`).
   zones_natural\_ : dict or None
       Raw spectral zone DataFrames from the unpreprocessed data
       (only when ``X_cal_natural`` is provided to :meth:`fit`).
   graphs_by_seed\_ : dict[int, nx.DiGraph]
       Per-seed directed predicate graphs (useful for debugging).
   valid_seeds\_ : list[int]
       Seeds that produced a non-empty graph (subset of ``seeds``).
   faithfulness\_ : dict
       Progressive top-k masking evaluation summary produced by
       :meth:`evaluate_faithfulness`.


   .. py:attribute:: spectral_cuts


   .. py:attribute:: quantiles


   .. py:attribute:: n_repetitions
      :value: 4


   .. py:attribute:: seeds


   .. py:attribute:: n_bags
      :value: 10


   .. py:attribute:: n_samples_fraction
      :value: 0.8


   .. py:attribute:: replace
      :value: False


   .. py:attribute:: metric
      :value: 'perturbation'


   .. py:attribute:: estimator
      :value: None


   .. py:attribute:: perturbation_mode
      :value: 'median'


   .. py:attribute:: perturbation_value
      :value: 0


   .. py:attribute:: perturbation_metric
      :value: 'probability_shift'


   .. py:attribute:: perturbation_stats_source
      :value: 'full'


   .. py:attribute:: normalize_by_zone_size
      :value: True


   .. py:attribute:: zone_size_exponent
      :value: 1.0


   .. py:attribute:: covariance_threshold
      :value: 0.01


   .. py:attribute:: var_exp
      :value: True


   .. py:attribute:: show_graph_details
      :value: False


   .. py:attribute:: class_threshold
      :value: 0.5


   .. py:attribute:: lrc_natural_
      :type: Optional[pandas.DataFrame]
      :value: None


   .. py:attribute:: lrc_summed_
      :type: Optional[pandas.DataFrame]
      :value: None


   .. py:attribute:: lrc_summed_unique_
      :type: Optional[pandas.DataFrame]
      :value: None


   .. py:attribute:: zone_scores_
      :type: Optional[pandas.DataFrame]
      :value: None


   .. py:attribute:: predicates_df_
      :type: Optional[pandas.DataFrame]
      :value: None


   .. py:attribute:: pca_info_
      :type: Optional[Dict]
      :value: None


   .. py:attribute:: pca_info_natural_
      :type: Optional[Dict]
      :value: None


   .. py:attribute:: zones_natural_
      :type: Optional[Dict]
      :value: None


   .. py:attribute:: graphs_by_seed_
      :type: Dict[int, networkx.DiGraph]


   .. py:attribute:: valid_seeds_
      :type: List[int]
      :value: []


   .. py:attribute:: faithfulness_
      :type: Optional[Dict[str, Any]]
      :value: None


   .. py:method:: fit(X_cal_prep: pandas.DataFrame, y_pred_cal: Union[pandas.Series, numpy.ndarray], X_cal_natural: Optional[pandas.DataFrame] = None) -> SMX

      Run the full SMX explanation pipeline.

      Parameters
      ----------
      X_cal_prep : pd.DataFrame
          Pre-processed calibration spectra (samples × features).
      y_pred_cal : pd.Series or np.ndarray
          Continuous model predictions for the calibration set
          (aligned with *X_cal_prep*).
      X_cal_natural : pd.DataFrame, optional
          Unpreprocessed calibration spectra with the same shape as
          *X_cal_prep*. Required for ``lrc_natural_`` threshold mapping.
          When ``None``, the natural-scale mapping step is skipped and
          ``lrc_natural_``, ``zones_natural_``, and ``pca_info_natural_``
          remain ``None``.

      Returns
      -------
      self


   .. py:method:: evaluate_faithfulness(X_eval: pandas.DataFrame, *, ranking: Literal['unique', 'summed', 'natural'] = 'unique', X_reference: Optional[pandas.DataFrame] = None, metric: Literal['auto', 'probability_shift', 'mean_abs_diff', 'decision_function_shift'] = 'auto', masking_strategy: Literal['zero', 'constant', 'mean', 'median', 'min', 'max'] = 'zero', constant_value: float = 0.0, max_k: Optional[int] = None, n_random_rankings: int = 100, random_state: Optional[int] = 42, output_path: Optional[Union[str, pathlib.Path]] = None, plot_title: Optional[str] = None, plot_width: int = 1100, plot_height: int = 560) -> Dict[str, Any]

      Evaluate SMX faithfulness via progressive top-k zone masking.

      The ranked spectral zones are progressively masked on *X_eval*
      following the selected SMX ranking, and the resulting prediction
      shift is summarized by the area under the masking curve (AUC).

      Parameters
      ----------
      X_eval : pd.DataFrame
          Evaluation spectra to be masked progressively.
      ranking : {'unique', 'summed', 'natural'}, default 'unique'
          Ranking table used to derive the ordered list of spectral zones.
          ``'unique'`` uses the one-zone-per-row ranking in
          :attr:`lrc_summed_unique_`. ``'summed'`` and ``'natural'`` are
          deduplicated internally to one row per zone before masking.
      X_reference : pd.DataFrame, optional
          Reference spectra used to compute replacement values for
          non-zero masking strategies. Defaults to *X_eval*.
      metric : {'auto', 'probability_shift', 'mean_abs_diff', 'decision_function_shift'}, default 'auto'
          Prediction-shift metric to evaluate. ``'auto'`` chooses
          ``'probability_shift'`` when the estimator exposes
          ``predict_proba()``, ``'decision_function_shift'`` when it exposes
          ``decision_function()``, otherwise ``'mean_abs_diff'``.
      masking_strategy : {'zero', 'constant', 'mean', 'median', 'min', 'max'}, default 'zero'
          How masked spectral variables are replaced.
      constant_value : float, default 0.0
          Replacement value used when ``masking_strategy='constant'``.
      max_k : int, optional
          Maximum number of ranked zones to mask. Defaults to all ranked
          zones available in *X_eval*.
      n_random_rankings : int, default 100
          Number of random rankings used to contextualize the observed AUC.
      random_state : int, optional
          Seed controlling the random baseline.
      output_path : str or Path, optional
          If provided, also export a faithfulness plot to this path. The
          extension determines the format (``.html`` or a static image).
      plot_title : str, optional
          Title override used when *output_path* is provided.
      plot_width : int, default 1100
          Plot width in pixels. Used only when *output_path* is provided.
      plot_height : int, default 560
          Plot height in pixels. Used only when *output_path* is provided.

      Returns
      -------
      dict
          Faithfulness summary including ``curve_df``, ``auc``,
          ``auc_normalized``, ``level``, and null-baseline statistics.


   .. py:method:: plot_zone_ranking_over_spectrum(output_path: Union[str, pathlib.Path], *, ranking: Literal['unique', 'natural'] = 'unique', aggregation: Literal['mean', 'median'] = 'mean', title: Optional[str] = None, X_natural: Optional[pandas.DataFrame] = None, y_labels: Optional[pandas.Series] = None, class_colors: Optional[dict] = None, width: Optional[int] = 1200, height: Optional[int] = 500) -> pandas.DataFrame

      Plot ranked spectral zones over a reference spectrum and save to file.

      The output format is inferred from *output_path*: ``.html`` for an
      interactive figure, or ``.png`` / ``.svg`` / ``.pdf`` for a static image
      (requires ``kaleido``).

      Parameters
      ----------
      output_path : str or Path
          Destination file. The extension selects the output format.
      ranking : {'unique', 'natural'}, default 'unique'
          Ranking source. ``'unique'`` uses ``lrc_summed_unique_`` (one row
          per zone). ``'natural'`` uses ``lrc_natural_`` and collapses
          multiple predicates per zone to the strongest LRC value.
      aggregation : {'mean', 'median'}, default 'mean'
          Aggregation used to build the reference spectrum from
          ``zones_natural_``.
      title : str, optional
          Figure title override.
      X_natural : pd.DataFrame, optional
          Full calibration dataset in natural (unpreprocessed) units.
          When provided together with *y_labels*, a mean spectrum is drawn
          for each class on top of the overall reference spectrum.
      y_labels : pd.Series, optional
          Class labels aligned with the rows of *X_natural*. Required when
          *X_natural* is given.
      class_colors : dict[str, str], optional
          Mapping from class label to hex/CSS color string. Missing labels
          fall back to a built-in palette.
      width : int, default 1200
          Figure width in pixels. Used only for static image exports.
      height : int, default 500
          Figure height in pixels. Used only for static image exports.

      Returns
      -------
      pd.DataFrame
          Normalized ranking table used in the figure.


   .. py:method:: plot_faithfulness(output_path: Union[str, pathlib.Path], *, title: Optional[str] = None, width: int = 1100, height: int = 560) -> pandas.DataFrame

      Plot the progressive masking faithfulness curve saved in ``faithfulness_``.

      Parameters
      ----------
      output_path : str or Path
          Destination file. Use ``.html`` for an interactive figure or an
          image extension for static export.
      title : str, optional
          Figure title override.
      width : int, default 1100
          Figure width in pixels. Used only for static image exports.
      height : int, default 560
          Figure height in pixels. Used only for static image exports.

      Returns
      -------
      pd.DataFrame
          Faithfulness masking curve used in the figure.
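
The progressive top-k masking idea behind ``evaluate_faithfulness`` can be illustrated with a small NumPy sketch. This is not the library's implementation: ``TinyClassifier``, ``resolve_metric``, ``probability_shift``, ``masking_curve`` and ``curve_auc`` are hypothetical helpers written for this example. The sketch assumes the usual total variation convention (half the L1 distance between probability rows), zero masking, and a trapezoidal AUC with the number of masked zones rescaled to ``[0, 1]``; the normalisation SMX actually applies may differ.

```python
import numpy as np


class TinyClassifier:
    """Hypothetical stand-in estimator: the class-1 probability is a
    logistic function of a weighted sum of the spectral variables."""

    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float)

    def predict_proba(self, X):
        p1 = 1.0 / (1.0 + np.exp(-(np.asarray(X) @ self.weights)))
        return np.column_stack([1.0 - p1, p1])


def resolve_metric(estimator):
    """Follow the documented 'auto' rule: predict_proba first, then
    decision_function, otherwise mean_abs_diff."""
    if hasattr(estimator, "predict_proba"):
        return "probability_shift"
    if hasattr(estimator, "decision_function"):
        return "decision_function_shift"
    return "mean_abs_diff"


def probability_shift(p_before, p_after):
    """Mean total variation distance between two probability matrices
    (assumed convention: 0.5 * L1 distance per row)."""
    return float(np.mean(0.5 * np.abs(p_before - p_after).sum(axis=1)))


def masking_curve(estimator, X, ranked_zones):
    """Zero-mask zones cumulatively in ranked order and record the
    prediction shift after each step (k = 0 means nothing is masked)."""
    baseline = estimator.predict_proba(X)
    X_masked = X.copy()
    shifts = [0.0]
    for columns in ranked_zones:
        X_masked[:, columns] = 0.0  # corresponds to masking_strategy='zero'
        shifts.append(probability_shift(baseline, estimator.predict_proba(X_masked)))
    return np.array(shifts)


def curve_auc(shifts):
    """Trapezoidal area under the masking curve, with k rescaled to [0, 1]."""
    k = np.linspace(0.0, 1.0, len(shifts))
    return float(np.sum(0.5 * (shifts[1:] + shifts[:-1]) * np.diff(k)))


rng = np.random.default_rng(0)
X_eval = rng.normal(size=(50, 6))  # synthetic "spectra", illustration only
model = TinyClassifier(weights=[3.0, 3.0, 1.0, 1.0, 0.1, 0.1])

# Three two-variable zones (column indices), listed in the ranking under test.
ranked = [[0, 1], [2, 3], [4, 5]]
curve = masking_curve(model, X_eval, ranked)
auc_ranked = curve_auc(curve)

# Null baseline: the same zones masked in random orders.
null_aucs = []
for _ in range(20):
    order = rng.permutation(len(ranked))
    shuffled = [ranked[i] for i in order]
    null_aucs.append(curve_auc(masking_curve(model, X_eval, shuffled)))

print(resolve_metric(model), auc_ranked, float(np.mean(null_aucs)))
```

Every ordering ends at the same final point because all zones are masked by then, so the AUC only rewards rankings that produce a large shift early. Comparing ``auc_ranked`` against the distribution in ``null_aucs`` mirrors the role that ``n_random_rankings`` plays in the method above.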