How a B2B software company used ML scoring to fix its pipeline

A B2B software company replaced its traditional lead scoring with a Gradient Boosting model trained on 4 years of CRM data. 98.39% accuracy. Here's what happened.

A B2B software company selling product design and manufacturing tools had a problem every ops team knows: too many leads, not enough signal.

Sales reps were manually qualifying a large volume of leads from mixed sources (inbound, events, campaigns, partner referrals). They had a traditional scoring model with point values assigned to behaviors like demo requests, email engagement, and website activity. It looked structured. It wasn’t working.

Reps were slow to contact leads that would convert because they were spending time on leads that wouldn’t. By the time they got to the good ones, competitors had already responded. Conversion rates stayed low. Marketing kept generating volume. Sales kept losing deals.

A research team documented the full transformation in a case study published in Frontiers in Artificial Intelligence.

The setup

The company used Microsoft Dynamics as its CRM and had 4 years of lead data, 23,154 records with 67 fields, covering January 2020 through April 2024. They ran a consultative sales process with multiple stages from lead generation through close.

Their existing scoring model was traditional: manually assigned point values for lead behaviors and characteristics. Marketing defined the weights. Sales followed the ranked list.

The research team extracted the CRM data and built a supervised machine learning classification model to predict whether a lead would become a qualified opportunity.

The process

Data cleaning reduced the dataset from 23,154 records and 67 fields to 16,600 records and 22 fields. Duplicates, inconsistent entries, and sparse columns were removed. Categorical variables were encoded. The class distribution was imbalanced (most leads didn’t convert), which is typical in B2B.
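To make the cleaning steps concrete, here's a minimal stdlib-only sketch of two of them, deduplication and categorical encoding. The field names (`email`, `lead_source`) are illustrative, not the paper's actual 67-field schema:

```python
def dedupe(records, key_fields):
    """Keep only the first occurrence of each unique key combination."""
    seen, kept = set(), []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

def one_hot(records, field):
    """Replace a categorical field with one binary column per category."""
    categories = sorted({rec[field] for rec in records})
    for rec in records:
        value = rec.pop(field)
        for cat in categories:
            rec[f"{field}={cat}"] = int(value == cat)
    return records

leads = [
    {"email": "a@x.com", "lead_source": "event"},
    {"email": "a@x.com", "lead_source": "event"},   # duplicate, dropped
    {"email": "b@y.com", "lead_source": "inbound"},
]
clean = one_hot(dedupe(leads, ["email"]), "lead_source")
# Each record now has binary lead_source=event / lead_source=inbound columns.
```

The same pattern, applied across duplicates, inconsistent entries, and sparse columns, is what takes a raw CRM export down to a trainable table.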

Training used a 70/30 split with 10-fold cross-validation on the training set. Fifteen different classification algorithms were tested under identical conditions using PyCaret, ensuring a fair comparison.
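The protocol looks roughly like this sketch, which uses scikit-learn as a stand-in for PyCaret's `compare_models()` and synthetic data in place of the CRM export (two candidate models shown, not fifteen):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic imbalanced data standing in for the cleaned CRM table
X, y = make_classification(n_samples=1000, weights=[0.85], random_state=0)

# 70/30 split, stratified so the minority (converting) class keeps its ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

candidates = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for name, model in candidates.items():
    # 10-fold cross-validation on the training portion only
    scores = cross_val_score(model, X_train, y_train, cv=10, scoring="roc_auc")
    print(f"{name}: mean AUC {scores.mean():.3f} over {len(scores)} folds")
```

Running every candidate through the identical split and fold structure is what makes the leaderboard comparison fair.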

Evaluation used accuracy, AUC, precision, recall, F1-score, confusion matrices, and ROC curves. The goal was to find the model with the best balance of precision (not flagging bad leads as good) and recall (not missing good leads).
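All four headline metrics fall out of the confusion-matrix counts. A quick sketch with invented numbers (not the paper's actual confusion matrix):

```python
def metrics(tp, fp, fn, tn):
    """Compute the four headline metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of leads flagged good, how many really were
    recall = tp / (tp + fn)      # of genuinely good leads, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts: 90 true positives, 10 false alarms, 10 misses, 890 true negatives
acc, prec, rec, f1 = metrics(tp=90, fp=10, fn=10, tn=890)
# acc = 0.98, prec = 0.9, rec = 0.9, f1 = 0.9
```

Note how accuracy lands at 0.98 even with 10% of good leads missed; that's why the study leaned on precision, recall, and F1 rather than accuracy alone.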

The results

The Gradient Boosting Classifier won. Here's how the top four of the 15 algorithms compared:

  • Gradient Boosting: 98.39% accuracy, 0.9891 AUC, 0.9586 recall, 0.9106 precision, 0.9338 F1
  • LightGBM: 98.35% accuracy, 0.9885 AUC, 0.9535 recall, 0.9112 precision, 0.9318 F1
  • XGBoost: 98.21% accuracy, 0.9872 AUC, 0.9360 recall, 0.9149 precision, 0.9248 F1
  • Logistic Regression: 98.18% accuracy, 0.9775 AUC, 0.9462 recall, 0.9047 precision, 0.9248 F1

A few things stand out:

98.39% accuracy on real, messy CRM data. Not a clean academic dataset. Real leads from a real company with real data quality issues.

AUC of 0.9891. The model distinguishes between leads that convert and leads that don’t with near-perfect separation. The KS statistic was 0.953, confirming strong class separation.

Recall of 95.86%. The model catches nearly all leads that would actually convert. Missed opportunities drop dramatically.

The top 4 algorithms finished within roughly 0.2 percentage points of each other on accuracy, but Gradient Boosting had the best balance across all metrics, particularly recall.
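The KS statistic mentioned above is just the maximum gap between the score distributions of converters and non-converters. A stdlib sketch, with made-up scores:

```python
def ks_statistic(pos_scores, neg_scores):
    """Max gap between the empirical CDFs of the two classes' model scores."""
    thresholds = sorted(set(pos_scores) | set(neg_scores))
    best = 0.0
    for t in thresholds:
        cdf_pos = sum(s <= t for s in pos_scores) / len(pos_scores)
        cdf_neg = sum(s <= t for s in neg_scores) / len(neg_scores)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Perfectly separated scores give KS = 1.0; heavy overlap pushes it toward 0.
ks = ks_statistic([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])  # 1.0
```

A KS of 0.953 means there's a score threshold where almost all converters sit on one side and almost all non-converters on the other.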

What drove the predictions

Feature importance analysis showed the top conversion predictors were:

  1. Lead source
  2. Reason for state
  3. Lead classification
  4. Product
  5. Number of responses
  6. Account type
  7. Interest level

The structural attributes (where the lead came from, how it was classified, what product it wanted) mattered more than the behavioral signals that traditional scoring systems emphasize.

The researchers noted that “the lead source variable reveals the marketing strategy of the lead source” and “number of responses reflects the number of interactions with the company.” The model identified these patterns automatically from the data, without anyone guessing at point values.

What changed operationally

The ML model gave the company something the traditional scoring system couldn’t: a probability, not a point total.

Every lead gets an actual conversion likelihood based on the full picture of its attributes. Sales reps can sort by probability and work the list in order. No more arbitrary cutoffs. No more intuition-based prioritization.
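Operationally, "work the list in order" reduces to a one-line sort on the model's output. Company names and probabilities here are invented for illustration:

```python
# Each lead carries a model-predicted conversion probability
leads = [
    {"name": "Acme Corp", "p_convert": 0.12},
    {"name": "Globex",    "p_convert": 0.87},
    {"name": "Initech",   "p_convert": 0.55},
]

# The rep's queue: highest conversion probability first, no point totals,
# no arbitrary score cutoffs
queue = sorted(leads, key=lambda lead: lead["p_convert"], reverse=True)
# Globex (0.87) gets called first, then Initech, then Acme.
```

The difference from point-based scoring is that 0.87 is a calibrated estimate of likelihood, not an opaque sum of hand-assigned weights.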

The paper concluded that “supervised machine learning based models can significantly improve the lead conversion” and marketing effectiveness in B2B companies.

The challenges they hit

The researchers were transparent about what was hard:

Overfitting risk. Gradient boosting is prone to memorizing training data. They mitigated this with cross-validation, hyperparameter tuning, and regularization (max tree depth and learning rate limits).

Class imbalance. Most leads don’t convert. They tested SMOTE (synthetic oversampling) but found it didn’t improve metrics, so they kept the natural distribution and used AUC-ROC and F1 instead of raw accuracy for evaluation.
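AUC sidesteps imbalance because it's the probability that a randomly chosen converter scores higher than a randomly chosen non-converter, independent of how rare converters are. A stdlib sketch with invented scores:

```python
def auc(pos_scores, neg_scores):
    """Probability a random positive outranks a random negative (ties = 0.5)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# 3 converters vs. 5 non-converters: a model predicting "no" for everyone
# would score 62.5% accuracy here, but AUC only rewards actual ranking skill.
score = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2, 0.2, 0.1])
```

That's why the team could keep the natural class distribution and still trust the evaluation.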

Inference speed. Gradient boosting models can be slow in production. The authors flagged this as a consideration for real-time deployment and suggested dimensionality reduction as a potential optimization.

Data quality. 18.8% of the “Title” field had outlier values. Lead source had 13.26% outliers. Real CRM data is messy, and the model is only as good as the data going in.

The takeaway

This wasn’t a theoretical exercise. It was a real B2B company with a real pipeline problem, real CRM data, and a traditional scoring model that wasn’t getting the job done.

The ML model outperformed it on every metric. Not because machine learning is magic, but because it measures what actually predicts conversion instead of guessing.

The gap between traditional scoring and ML scoring isn’t academic anymore. It’s operational. The companies that close it first get to their best leads first.


Reference: Gonzalez-Flores, L., Rubiano-Moreno, J., & Sosa-Gomez, G. (2025). The relevance of lead prioritization: a B2B lead scoring model based on machine learning. Frontiers in Artificial Intelligence, 8, 1554325.