smx.pipeline
============

.. py:module:: smx.pipeline

.. autoapi-nested-parse::

   SMX: high-level facade for the full SMX explanation pipeline.

   This class internalises the seed-loop orchestration that every caller would
   otherwise have to rewrite manually (zone extraction → predicate generation →
   bagging → metric → graph → LRC → natural-scale mapping across N seeds).
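
   The seed loop can be pictured with a minimal standard-library sketch. It
   assumes one seed per repetition (``0`` to ``n_repetitions - 1``) and
   ``n_bags`` index samples per seed, as the ``SMX`` parameters describe. The
   helper name ``draw_bags`` is hypothetical and is not part of ``smx``, and
   the real sampling code may differ.

   .. code-block:: python

      import random

      def draw_bags(n_samples, n_repetitions=4, n_bags=10, fraction=0.8, replace=False):
          """Return {seed: [bag, ...]} where each bag is a list of row indices."""
          bag_size = int(n_samples * fraction)
          rows = range(n_samples)
          bags_by_seed = {}
          for seed in range(n_repetitions):  # seeds 0 .. n_repetitions - 1
              rng = random.Random(seed)      # one independent generator per seed
              if replace:
                  bags = [rng.choices(rows, k=bag_size) for _ in range(n_bags)]
              else:
                  bags = [rng.sample(rows, bag_size) for _ in range(n_bags)]
              bags_by_seed[seed] = bags
          return bags_by_seed

      bags = draw_bags(n_samples=50)

   Because each seed owns its generator, rerunning the loop reproduces the
   same bags, which is what makes per-seed results comparable across runs.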

   Individual component classes (``ZoneAggregator``, ``PredicateGenerator``, etc.)
   remain available for power users who need fine-grained control.


Attributes
----------

.. autoapisummary::

   smx.pipeline.logger
   smx.pipeline.SpectralCuts


Classes
-------

.. autoapisummary::

   smx.pipeline.SMX


Module Contents
---------------

.. py:data:: logger

.. py:data:: SpectralCuts

.. py:class:: SMX(spectral_cuts: SpectralCuts, quantiles: List[float], n_repetitions: int = 4, n_bags: int = 10, n_samples_fraction: float = 0.8, replace: bool = False, metric: Literal['covariance', 'perturbation'] = 'perturbation', estimator: Optional[Any] = None, perturbation_mode: str = 'median', perturbation_value: float = 0, perturbation_metric: str = 'probability_shift', perturbation_stats_source: str = 'full', normalize_by_zone_size: bool = True, zone_size_exponent: float = 1.0, covariance_threshold: float = 0.01, var_exp: bool = True, show_graph_details: bool = False, class_threshold: float = 0.5)

   Full SMX explanation pipeline exposed as a single object with a
   scikit-learn-style :meth:`fit` method.

   Runs zone extraction → PCA aggregation → predicate generation →
   seed-loop (bagging → metric → graph → LRC) → cross-seed aggregation →
   natural-scale threshold mapping.
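
   As a rough illustration of the first stages, the sketch below assigns
   spectral columns to zones and derives quantile thresholds for one zone. It
   is not the library's implementation: the half-open interval
   ``[start, end)``, the ``<=`` / ``>`` operator pair, and the helper names
   are assumptions, and the PCA aggregation step is skipped by passing zone
   scores in directly.

   .. code-block:: python

      from statistics import quantiles as cut_points

      def assign_zones(wavelengths, spectral_cuts):
          """Map each zone name to the column indices with start <= wavelength < end."""
          return {
              name: [i for i, w in enumerate(wavelengths) if start <= w < end]
              for name, start, end in spectral_cuts
          }

      def quartile_predicates(zone, scores):
          """Build '<=' and '>' predicates at the 0.25 / 0.5 / 0.75 quantiles."""
          thresholds = cut_points(scores, n=4, method="inclusive")
          return [(zone, op, t) for t in thresholds for op in ("<=", ">")]

      wavelengths = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]
      cuts = [("Low", 1.0, 4.0), ("High", 4.0, 10.0)]
      zones = assign_zones(wavelengths, cuts)
      predicates = quartile_predicates("Low", [0.0, 1.0, 2.0, 3.0, 4.0])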

   Parameters
   ----------
   spectral_cuts : list of (name, start, end) tuples
       Zone definitions, e.g. ``[("Low", 1.0, 4.0), ("High", 4.0, 10.0)]``.
   quantiles : list of float
       Quantile fractions for predicate generation, e.g. ``[0.25, 0.5, 0.75]``.
   n_repetitions : int, default 4
       Number of independent bagging repetitions.  Seeds are generated as
       ``[0, 1, …, n_repetitions-1]``.
   n_bags : int, default 10
       Number of bags per seed.
   n_samples_fraction : float, default 0.8
       Fraction of calibration samples drawn per bag.  The minimum number of
       samples per predicate is hard-coded to 20 % of the dataset.
   replace : bool, default False
       Whether to sample bags with replacement.
   metric : {'covariance', 'perturbation'}, default 'perturbation'
       Importance metric to use.
   estimator : sklearn-compatible estimator, optional
       Fitted model.  Required when ``metric='perturbation'``.
   perturbation_mode : str, default 'median'
       Replacement strategy for perturbation (``'constant'``, ``'mean'``,
       ``'median'``, ``'min'``, ``'max'``).
   perturbation_value : float, default 0
       Constant replacement value used when ``perturbation_mode='constant'``.
   perturbation_metric : str, default 'probability_shift'
       Perturbation importance measure. Determines how the impact of
       spectral zone perturbation is quantified. Choice depends on the
       estimator type and the desired sensitivity:

       **Classification estimators** (method requirements are listed per metric):

       - ``'probability_shift'`` — Mean total variation distance between
         pre- and post-perturbation class probabilities. Sensitive to
         confidence changes across all classes. Requires ``predict_proba``.
       - ``'accuracy_drop'`` — Drop in accuracy when perturbed predictions
         are compared to original predictions.
       - ``'f1_drop'`` — Weighted F1-score decrease after perturbation.
       - ``'decision_function_shift'`` — Mean absolute change in decision
         function values (e.g. signed distances from hyperplane for SVC).
         Requires ``decision_function()``.

       **Regression estimators** (with ``predict`` returning continuous values):

       - ``'mean_abs_diff'`` — Mean absolute difference between original
         and perturbed predictions.
       - ``'mean_diff'`` — Mean signed difference (bias direction). Positive
         values indicate that perturbation increases predictions; negative
         values indicate that it decreases them.
       - ``'mean_relative_dev'`` — Mean relative deviation, normalized by
         original prediction magnitude. Treats zero predictions as NaN.
   normalize_by_zone_size : bool, default True
       Divide raw perturbation importance by zone width.
   zone_size_exponent : float, default 1.0
       Exponent applied to zone size during normalisation.
   covariance_threshold : float, default 0.01
       Minimum covariance value to keep a predicate (covariance metric only).
   var_exp : bool, default True
       Weight graph edges by PC1 explained variance of the source zone.
   show_graph_details : bool, default False
       Print bidirectional-edge details during graph construction.
   class_threshold : float, default 0.5
       Decision boundary for ``Class_Predicted`` annotation on bags.
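
   To make the perturbation measures concrete, here is a minimal sketch of
   ``'probability_shift'``, ``'mean_abs_diff'`` and ``'mean_diff'`` as
   described above. It assumes the conventional total variation distance
   (half the L1 distance between probability rows); whether ``smx`` applies
   the same 0.5 factor is not stated here, and the function bodies are
   illustrative rather than the library's code.

   .. code-block:: python

      def probability_shift(p_before, p_after):
          """Mean total variation distance between paired class-probability rows."""
          tv = [
              0.5 * sum(abs(a - b) for a, b in zip(row_a, row_b))
              for row_a, row_b in zip(p_before, p_after)
          ]
          return sum(tv) / len(tv)

      def mean_abs_diff(y_before, y_after):
          """Mean absolute change in a regression prediction."""
          return sum(abs(b - a) for a, b in zip(y_before, y_after)) / len(y_before)

      def mean_diff(y_before, y_after):
          """Mean signed change; positive when perturbation raises predictions."""
          return sum(b - a for a, b in zip(y_before, y_after)) / len(y_before)

      # The first sample loses confidence in class 0; the second is unchanged.
      p_before = [[0.9, 0.1], [0.2, 0.8]]
      p_after = [[0.6, 0.4], [0.2, 0.8]]
      shift = probability_shift(p_before, p_after)

   The signed and absolute regression measures can disagree: changes of
   opposite sign cancel in ``mean_diff`` but accumulate in ``mean_abs_diff``.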

   Attributes (set after :meth:`fit`)
   ------------------------------------
   lrc_natural\_ : pd.DataFrame or None
       LRC with natural-scale thresholds (available only when
       ``X_cal_natural`` is provided to :meth:`fit`). Columns:
       ``Node``, ``Local_Reaching_Centrality``, ``Zone``, ``Threshold``,
       ``Operator``, ``Threshold_Natural``.
   lrc_summed\_ : pd.DataFrame
       Mean-aggregated LRC across seeds, preprocessed-scale thresholds.
   lrc_summed_unique\_ : pd.DataFrame
       Zone-deduplicated version of ``lrc_summed_`` (one row per zone).
   zone_scores\_ : pd.DataFrame
       PCA zone scores on the preprocessed calibration data.
   predicates_df\_ : pd.DataFrame
       Full predicate catalogue (generated from ``zone_scores_``).
   pca_info\_ : dict
       PCA info for the preprocessed zones.
   pca_info_natural\_ : dict or None
       PCA info for the natural (unpreprocessed) zones (only when
       ``X_cal_natural`` is provided to :meth:`fit`).
   zones_natural\_ : dict or None
       Raw spectral zone DataFrames from the unpreprocessed data (only when
       ``X_cal_natural`` is provided to :meth:`fit`).
   graphs_by_seed\_ : dict[int, nx.DiGraph]
       Per-seed directed predicate graphs (useful for debugging).
   valid_seeds\_ : list[int]
       Seeds that produced a non-empty graph (subset of ``seeds``).
   faithfulness\_ : dict or None
       Progressive top-k masking evaluation summary.  Remains ``None`` until
       :meth:`evaluate_faithfulness` is called.
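
   The ``Local_Reaching_Centrality`` column can be read as "how much of the
   predicate graph is reachable from this node". The sketch below computes the
   unweighted, directed form of that measure with a breadth-first search. It
   is a simplification under stated assumptions: the graphs built by the
   pipeline may carry edge weights (see ``var_exp``), which this version
   ignores, and the node names are made up.

   .. code-block:: python

      from collections import deque

      def local_reaching_centrality(edges, node):
          """Fraction of the other nodes reachable from `node` in an unweighted digraph."""
          successors = {}
          nodes = set()
          for src, dst in edges:
              successors.setdefault(src, []).append(dst)
              nodes.update((src, dst))
          seen, queue = {node}, deque([node])
          while queue:
              for nxt in successors.get(queue.popleft(), []):
                  if nxt not in seen:
                      seen.add(nxt)
                      queue.append(nxt)
          return (len(seen) - 1) / (len(nodes) - 1)

      edges = [("A", "B"), ("B", "C"), ("D", "C")]
      scores = {n: local_reaching_centrality(edges, n) for n in "ABCD"}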


   .. py:attribute:: spectral_cuts


   .. py:attribute:: quantiles


   .. py:attribute:: n_repetitions
      :value: 4


   .. py:attribute:: seeds


   .. py:attribute:: n_bags
      :value: 10


   .. py:attribute:: n_samples_fraction
      :value: 0.8


   .. py:attribute:: replace
      :value: False


   .. py:attribute:: metric
      :value: 'perturbation'


   .. py:attribute:: estimator
      :value: None


   .. py:attribute:: perturbation_mode
      :value: 'median'


   .. py:attribute:: perturbation_value
      :value: 0


   .. py:attribute:: perturbation_metric
      :value: 'probability_shift'


   .. py:attribute:: perturbation_stats_source
      :value: 'full'


   .. py:attribute:: normalize_by_zone_size
      :value: True


   .. py:attribute:: zone_size_exponent
      :value: 1.0


   .. py:attribute:: covariance_threshold
      :value: 0.01


   .. py:attribute:: var_exp
      :value: True


   .. py:attribute:: show_graph_details
      :value: False


   .. py:attribute:: class_threshold
      :value: 0.5


   .. py:attribute:: lrc_natural_
      :type:  Optional[pandas.DataFrame]
      :value: None


   .. py:attribute:: lrc_summed_
      :type:  Optional[pandas.DataFrame]
      :value: None


   .. py:attribute:: lrc_summed_unique_
      :type:  Optional[pandas.DataFrame]
      :value: None


   .. py:attribute:: zone_scores_
      :type:  Optional[pandas.DataFrame]
      :value: None


   .. py:attribute:: predicates_df_
      :type:  Optional[pandas.DataFrame]
      :value: None


   .. py:attribute:: pca_info_
      :type:  Optional[Dict]
      :value: None


   .. py:attribute:: pca_info_natural_
      :type:  Optional[Dict]
      :value: None


   .. py:attribute:: zones_natural_
      :type:  Optional[Dict]
      :value: None


   .. py:attribute:: graphs_by_seed_
      :type:  Dict[int, networkx.DiGraph]


   .. py:attribute:: valid_seeds_
      :type:  List[int]
      :value: []


   .. py:attribute:: faithfulness_
      :type:  Optional[Dict[str, Any]]
      :value: None


   .. py:method:: fit(X_cal_prep: pandas.DataFrame, y_pred_cal: Union[pandas.Series, numpy.ndarray], X_cal_natural: Optional[pandas.DataFrame] = None) -> SMX

      Run the full SMX explanation pipeline.

      Parameters
      ----------
      X_cal_prep : pd.DataFrame
          Pre-processed calibration spectra (samples × features).
      y_pred_cal : pd.Series or np.ndarray
          Continuous model predictions for the calibration set (aligned
          with *X_cal_prep*).
      X_cal_natural : pd.DataFrame, optional
          Unpreprocessed calibration spectra with the same shape as
          *X_cal_prep*.  Required for ``lrc_natural_`` threshold mapping.
          When ``None``, the natural-scale mapping step is skipped and
          ``lrc_natural_``, ``zones_natural_``, and ``pca_info_natural_``
          remain ``None``.

      Returns
      -------
      self


   .. py:method:: evaluate_faithfulness(X_eval: pandas.DataFrame, *, ranking: Literal['unique', 'summed', 'natural'] = 'unique', X_reference: Optional[pandas.DataFrame] = None, metric: Literal['auto', 'probability_shift', 'mean_abs_diff', 'decision_function_shift'] = 'auto', masking_strategy: Literal['zero', 'constant', 'mean', 'median', 'min', 'max'] = 'zero', constant_value: float = 0.0, max_k: Optional[int] = None, n_random_rankings: int = 100, random_state: Optional[int] = 42, output_path: Optional[Union[str, pathlib.Path]] = None, plot_title: Optional[str] = None, plot_width: int = 1100, plot_height: int = 560) -> Dict[str, Any]

      Evaluate SMX faithfulness via progressive top-k zone masking.

      The ranked spectral zones are progressively masked on *X_eval* following
      the selected SMX ranking, and the resulting prediction shift is
      summarized by the area under the masking curve (AUC).

      Parameters
      ----------
      X_eval : pd.DataFrame
          Evaluation spectra to be masked progressively.
      ranking : {'unique', 'summed', 'natural'}, default 'unique'
          Ranking table used to derive the ordered list of spectral zones.
          ``'unique'`` uses the one-zone-per-row ranking in
          :attr:`lrc_summed_unique_`. ``'summed'`` and ``'natural'`` are
          deduplicated internally to one row per zone before masking.
      X_reference : pd.DataFrame, optional
          Reference spectra used to compute replacement values for
          masking strategies other than ``'zero'``. Defaults to *X_eval*.
      metric : {'auto', 'probability_shift', 'mean_abs_diff', 'decision_function_shift'}, default 'auto'
          Prediction-shift metric to evaluate. ``'auto'`` chooses
          ``'probability_shift'`` when the estimator exposes
          ``predict_proba()``, ``'decision_function_shift'`` when it exposes
          ``decision_function()``, otherwise ``'mean_abs_diff'``.
      masking_strategy : {'zero', 'constant', 'mean', 'median', 'min', 'max'}, default 'zero'
          How masked spectral variables are replaced.
      constant_value : float, default 0.0
          Replacement value used when ``masking_strategy='constant'``.
      max_k : int, optional
          Maximum number of ranked zones to mask. Defaults to all ranked
          zones available in *X_eval*.
      n_random_rankings : int, default 100
          Number of random rankings used to contextualize the observed AUC.
      random_state : int, optional
          Seed controlling the random baseline.
      output_path : str or Path, optional
          If provided, also export a faithfulness plot to this path. The
          extension determines the format (``.html`` or a static image).
      plot_title : str, optional
          Title override used when *output_path* is provided.
      plot_width : int, default 1100
          Plot width in pixels. Used only when *output_path* is provided.
      plot_height : int, default 560
          Plot height in pixels. Used only when *output_path* is provided.

      Returns
      -------
      dict
          Faithfulness summary including ``curve_df``, ``auc``,
          ``auc_normalized``, ``level``, and null-baseline statistics.
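
      The idea behind the masking curve can be sketched with the standard
      library alone. The toy linear model, the single-column zones, and the
      helper names below are hypothetical stand-ins. The sketch assumes
      ``'zero'`` masking, a mean-absolute-difference shift, and a trapezoidal
      AUC with unit spacing; the normalisation and null statistics reported by
      this method are not reproduced.

      .. code-block:: python

         import random

         def masking_curve(rows, zones, order, predict):
             """Mean absolute prediction change after zeroing the first k zones."""
             base = [predict(r) for r in rows]
             curve, masked = [0.0], set()
             for zone in order:
                 masked.update(zones[zone])
                 shifted = [
                     predict([0.0 if i in masked else v for i, v in enumerate(r)])
                     for r in rows
                 ]
                 curve.append(sum(abs(s - b) for s, b in zip(shifted, base)) / len(rows))
             return curve

         def auc(curve):
             """Trapezoidal area under the curve, one unit per masked zone."""
             return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))

         weights = [5.0, 1.0, 0.5]  # toy model: column 0 matters most

         def predict(row):
             return sum(w * v for w, v in zip(weights, row))

         zones = {"Z1": [0], "Z2": [1], "Z3": [2]}
         rows = [[1.0, 1.0, 1.0], [2.0, 0.0, 1.0]]

         observed = auc(masking_curve(rows, zones, ["Z1", "Z2", "Z3"], predict))

         # Random rankings give the null distribution the observed AUC is compared to.
         rng = random.Random(42)
         null = []
         for _ in range(100):
             order = list(zones)
             rng.shuffle(order)
             null.append(auc(masking_curve(rows, zones, order, predict)))

      A faithful ranking masks the most influential zones first, so its curve
      rises early and its AUC sits at the upper end of the random-ranking
      distribution.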


   .. py:method:: plot_zone_ranking_over_spectrum(output_path: Union[str, pathlib.Path], *, ranking: Literal['unique', 'natural'] = 'unique', aggregation: Literal['mean', 'median'] = 'mean', title: Optional[str] = None, X_natural: Optional[pandas.DataFrame] = None, y_labels: Optional[pandas.Series] = None, class_colors: Optional[dict] = None, width: Optional[int] = 1200, height: Optional[int] = 500) -> pandas.DataFrame

      Plot ranked spectral zones over a reference spectrum and save to file.

      The output format is inferred from *output_path* — ``.html`` for an
      interactive figure, or ``.png`` / ``.svg`` / ``.pdf`` for a static image
      (requires ``kaleido``).

      Parameters
      ----------
      output_path : str or Path
          Destination file. Use ``.html`` for an interactive figure or
          ``.png`` / ``.svg`` / ``.pdf`` for a static image.
      ranking : {'unique', 'natural'}, default 'unique'
          Ranking source. ``'unique'`` uses ``lrc_summed_unique_`` (one row per
          zone). ``'natural'`` uses ``lrc_natural_`` and collapses multiple
          predicates per zone to the strongest LRC value.
      aggregation : {'mean', 'median'}, default 'mean'
          Aggregation used to build the reference spectrum from
          ``zones_natural_``.
      title : str, optional
          Figure title override.
      X_natural : pd.DataFrame, optional
          Full calibration dataset in natural (unpreprocessed) units.  When
          provided together with *y_labels*, a mean spectrum is drawn for each
          class on top of the overall reference spectrum.
      y_labels : pd.Series, optional
          Class labels aligned with the rows of *X_natural*.  Required when
          *X_natural* is given.
      class_colors : dict[str, str], optional
          Mapping from class label to hex/CSS color string.  Missing labels
          fall back to a built-in palette.
      width : int, default 1200
          Figure width in pixels. Used only for static image exports.
      height : int, default 500
          Figure height in pixels. Used only for static image exports.

      Returns
      -------
      pd.DataFrame
          Normalized ranking table used in the figure.
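
      The per-zone collapse described for ``ranking='natural'`` (keep the
      strongest LRC value among a zone's predicates) can be sketched as
      follows. The column names follow the ``lrc_natural_`` description; the
      plain-dict rows and the helper name are illustrative assumptions, not
      the pandas code used internally.

      .. code-block:: python

         def strongest_per_zone(rows, key="Local_Reaching_Centrality"):
             """Keep the highest-LRC row of each zone and rank zones by that value."""
             best = {}
             for row in rows:
                 zone = row["Zone"]
                 if zone not in best or row[key] > best[zone][key]:
                     best[zone] = row
             return sorted(best.values(), key=lambda r: r[key], reverse=True)

         table = [
             {"Zone": "Low", "Operator": "<=", "Local_Reaching_Centrality": 0.40},
             {"Zone": "High", "Operator": ">", "Local_Reaching_Centrality": 0.75},
             {"Zone": "Low", "Operator": ">", "Local_Reaching_Centrality": 0.60},
         ]
         ranking = strongest_per_zone(table)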


   .. py:method:: plot_faithfulness(output_path: Union[str, pathlib.Path], *, title: Optional[str] = None, width: int = 1100, height: int = 560) -> pandas.DataFrame

      Plot the progressive masking faithfulness curve saved in ``faithfulness_``.

      Parameters
      ----------
      output_path : str or Path
          Destination file. Use ``.html`` for an interactive figure or an
          image extension for static export.
      title : str, optional
          Figure title override.
      width : int, default 1100
          Figure width in pixels. Used only for static image exports.
      height : int, default 560
          Figure height in pixels. Used only for static image exports.

      Returns
      -------
      pd.DataFrame
          Faithfulness masking curve used in the figure.