Your model hits 99% accuracy. Your SHAP values look clean. You ship it.
Three months later, someone retrains on fresh data and the feature importances shift. The fields your team built workflows around are no longer top contributors. The score still looks good on paper, but nobody trusts it anymore.
This is the explainability gap that most scoring systems ignore. Accuracy tells you the model works. SHAP tells you why. But neither tells you whether that “why” is stable.
That’s the problem Feature Stability Score (FSS) solves.
What is Feature Stability Score?
FSS is a quantitative measure of how consistent your feature importances are across validation folds. It was introduced by Riziq and Ichsan (2026) in their explainable machine learning framework for DDoS detection, published in AVITEC.
The formula is straightforward:
FSS = 1 - (σ_SHAP / μ_SHAP)
Where σ_SHAP is the standard deviation of a feature’s SHAP importance across k-fold cross-validation and μ_SHAP is the mean. It’s essentially 1 minus the coefficient of variation applied to SHAP values.
A high FSS (close to 1) means the feature contributes consistently regardless of which data partition the model sees. A low FSS means the feature’s importance is volatile. It might look critical in one fold and irrelevant in the next.
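The definition above is easy to compute directly. Here is a minimal numpy sketch: the function name and the per-fold importance numbers are illustrative, and the input is assumed to be each feature’s mean absolute SHAP value collected from k-fold cross-validation.

```python
import numpy as np

def feature_stability_score(fold_importances):
    """FSS = 1 - (sigma / mu), the std and mean of a feature's
    mean |SHAP| importance across cross-validation folds."""
    imp = np.asarray(fold_importances, dtype=float)
    return 1.0 - imp.std() / imp.mean()

# Hypothetical per-fold importances for two features (5 folds each):
stable   = [0.42, 0.40, 0.43, 0.41, 0.44]  # consistent across folds
volatile = [0.50, 0.05, 0.30, 0.02, 0.45]  # importance swings wildly

print(round(feature_stability_score(stable), 3))    # near 1: trustworthy
print(round(feature_stability_score(volatile), 3))  # well below 0.5: noisy
```

Note that a feature whose importance varies as much as its mean (σ ≥ μ) gets an FSS of zero or below, which is exactly the “volatile” behavior the metric is designed to flag.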
Why this matters for predictive scoring
Most teams evaluate their models on accuracy, AUC, precision, and recall. These are performance metrics. They tell you the model predicts well.
Feature importances (via SHAP or similar) give you interpretability. They tell you which fields drive the prediction.
But here’s what neither captures: whether those explanations hold up when the data shifts.
This is the difference between a model that’s accurate and a model that’s trustworthy.
Consider a lead scoring model. SHAP analysis says “days since last activity” is the top predictor. Your ops team builds routing rules around it. Marketing adjusts nurture cadences. Sales prioritizes accordingly.
If that feature’s importance is unstable, meaning it’s only dominant because of a quirk in the training split, those downstream decisions are built on sand.
FSS quantifies this risk. A feature with FSS > 0.8 is reliably important. A feature below 0.5 is noise masquerading as signal.
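Those thresholds translate directly into a triage rule. A small sketch, using the 0.8 and 0.5 cutoffs from the text (the function name and bucket labels are mine, not the paper’s):

```python
def classify_feature(fss, reliable_cutoff=0.8, noise_cutoff=0.5):
    """Bucket a feature by its Feature Stability Score."""
    if fss > reliable_cutoff:
        return "reliably important"
    if fss < noise_cutoff:
        return "noise masquerading as signal"
    return "moderately stable"

print(classify_feature(0.92))  # reliably important
print(classify_feature(0.31))  # noise masquerading as signal
```

In practice you would run this over every feature in the model and only build downstream process around the first bucket.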
The research behind it
Riziq and Ichsan tested FSS across six supervised algorithms (Decision Tree, Random Forest, XGBoost, LightGBM, MLP, and Naive Bayes) using standardized preprocessing and 5-fold stratified cross-validation. The key findings:
- Spearman correlation between FSS and model robustness: ρ = 0.857 (p = 0.014). Models with stable feature attributions generalize better. This isn’t a hunch; it’s statistically significant.
- LightGBM achieved the best balance between interpretability stability (FSS = 0.606) and performance robustness (CV F1-Score Std = 0.000139).
- Features with FSS > 0.8 formed the core decision logic of the model, while lower-FSS features contributed inconsistently across folds.
- Top-10 FSS correlated moderately with AUC (r = 0.40), confirming that stable top features contribute to better generalization, not just better explanations.
The paper frames this in a cybersecurity context (DDoS detection), but the principle is domain-agnostic. Any model that uses feature importances to inform decisions benefits from knowing whether those importances are stable.
How this applies to GTM scoring
In GTM operations, predictive scores drive real actions: routing, prioritization, segmentation, outreach timing. When a score says “this account is high-fit,” the team acts on it. When the explanation says “because of engagement recency and company size,” the team builds process around those fields.
FSS answers the question: will those explanations still hold next quarter?
A scoring platform that surfaces FSS alongside SHAP gives ops teams something they’ve never had: a confidence measure on the explanation itself, not just the prediction.
What a stable feature set enables
When your top features have high FSS scores, you can:
- Build automation with confidence. Routing rules based on stable features won’t break when the model retrains.
- Communicate clearly to leadership. “These are the 5 fields that consistently predict conversion” is a stronger statement than “these are the 5 fields that mattered this month.”
- Reduce maintenance overhead. Stable models need less babysitting. Unstable feature attributions are an early warning that your model is overfitting or that your data has structural issues.
- Prioritize data quality investment. If a feature has high FSS and high SHAP importance, it’s worth investing in that field’s completeness and accuracy. If it has high SHAP but low FSS, it might not be worth the effort.
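The last point is a two-axis decision: importance (SHAP) on one axis, stability (FSS) on the other. A minimal sketch of that prioritization, with entirely hypothetical feature names, importance values, and cutoffs:

```python
# Hypothetical (mean |SHAP|, FSS) pairs for a lead scoring model:
features = {
    "days_since_last_activity": (0.42, 0.91),  # important and stable
    "company_size":             (0.35, 0.88),  # important and stable
    "page_views_last_week":     (0.30, 0.41),  # important but volatile
    "industry_code":            (0.04, 0.85),  # stable but marginal
}

SHAP_CUTOFF, FSS_CUTOFF = 0.2, 0.8  # illustrative thresholds

def worth_investing(shap_imp, fss):
    """Invest in a field's completeness/accuracy only when it is
    both important (high SHAP) and stable (high FSS)."""
    return shap_imp >= SHAP_CUTOFF and fss >= FSS_CUTOFF

invest = sorted(n for n, (s, f) in features.items() if worth_investing(s, f))
print(invest)  # ['company_size', 'days_since_last_activity']
```

Here `page_views_last_week` falls out despite its high SHAP value: its low FSS says the importance may not survive a retrain, so data-quality effort there is a gamble.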
The bottom line
Accuracy says the model works. SHAP says why. FSS says whether that “why” is durable.
For any team building decisions on top of predictive scores, whether in cybersecurity, GTM operations, or anywhere else, FSS fills a gap that traditional metrics leave wide open. It’s the difference between a model you deploy and a model you trust.
Reference: Riziq, M. F., & Ichsan, I. N. (2026). Explainable Machine Learning Framework for Distributed Denial-of-Service (DDoS) Attack Detection using Comparative Evaluation and SHAP Analysis. AVITEC, 8(1), 91-110.