Model Evaluation

Learn how to evaluate machine learning models in FormulaML. Understand metrics, validation techniques, and how to interpret results for business decisions.

Evaluating your machine learning models is crucial for understanding their performance and making informed business decisions. This guide covers evaluation techniques, metrics interpretation, and best practices.

The Evaluation Process

Why Evaluate Models?

  • Detect Overfitting: Verify that models work on new, unseen data
  • Compare Models: Choose the best algorithm for your problem
  • Business Confidence: Quantify model reliability for decision-making
  • Continuous Improvement: Identify areas for enhancement

Basic Evaluation Workflow

  1. Split data into training and testing sets
  2. Train model on training data
  3. Evaluate on test data (never seen during training)
  4. Interpret metrics in business context

Train-Test Split

The foundation of model evaluation is splitting your data properly.

Using TRAIN_TEST_SPLIT

=ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, y, test_size, random_state)

Parameters:

  • X: Features DataFrame
  • y: Target DataFrame
  • test_size: Fraction for testing (0.2 = 20% test, 80% train)
  • random_state: Seed for reproducibility (e.g., 42)

Returns: Array of 4 handles: [X_train, X_test, y_train, y_test]

Example: Proper Data Splitting

# Load data
Cell A1: =ML.DATASETS.IRIS()
Cell A2: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})  # Features
Cell A3: =ML.DATA.SELECT_COLUMNS(A1, 4)          # Target

# Split data (80/20 split)
Cell A5: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.2, 42)

# Access individual splits
Cell B1: =INDEX(A5, 1)  # X_train
Cell B2: =INDEX(A5, 2)  # X_test
Cell B3: =INDEX(A5, 3)  # y_train
Cell B4: =INDEX(A5, 4)  # y_test

Choosing Split Ratios

  • 80/20: Standard for most datasets
  • 70/30: More test data for smaller datasets (see the example below)
  • 90/10: Very large datasets
  • Consider: Dataset size, model complexity, business requirements
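
For example, switching the earlier Iris split to a 70/30 ratio only changes the test_size argument (all other cells stay the same):

# 70/30 split for a smaller dataset
Cell A5: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.3, 42)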

Basic Model Scoring

ML.EVAL.SCORE Function (Free)

=ML.EVAL.SCORE(trained_model, X_test, y_test)

Returns a single score metric:

  • Classification: Accuracy (0-1, fraction of predictions that are correct)
  • Regression: R² score (proportion of variance explained; typically 0-1, negative for very poor fits)

Example: Basic Evaluation

# Train model
Cell C1: =ML.CLASSIFICATION.SVM()
Cell C2: =ML.FIT(C1, B1, B3)  # Using training data

# Evaluate on test set
Cell C3: =ML.EVAL.SCORE(C2, B2, B4)  # Returns accuracy (e.g., 0.95)

Understanding Metrics

Classification Metrics

Accuracy

  • Formula: Correct Predictions / Total Predictions
  • Range: 0 to 1 (0% to 100%)
  • Interpretation: Overall correctness
  • Caution: Can be misleading with imbalanced classes

Business Example:

Score: 0.92
Meaning: Model correctly classifies 92% of cases
Good for: Balanced datasets (similar number of each class)

When Accuracy is Misleading

If 95% of emails are not spam:

  • Model always predicting “not spam” gets 95% accuracy
  • But misses 100% of actual spam!
  • Need additional metrics for imbalanced data (see the worked numbers below)
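
As a worked example, take 1,000 emails where 950 are legitimate and 50 are spam:

Always predict "not spam": 950 / 1,000 = 95% accuracy
Spam actually caught:        0 / 50    = 0%

The headline number looks strong even though the model never catches a single spam message.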

Regression Metrics

R² Score (Coefficient of Determination)

  • Formula: 1 - (unexplained variance / total variance)
  • Range: up to 1 (negative values mean the model is worse than predicting the mean)
  • Interpretation: Proportion of variance explained
  • Business Meaning: How well the model fits the data

Interpretation Guide:

R² = 1.0:  Perfect fit (rare, possibly overfitting)
R² = 0.9:  Excellent fit
R² = 0.7:  Good fit
R² = 0.5:  Moderate fit
R² < 0.3:  Poor fit
R² < 0:    Model worse than simply predicting the mean

Business Example:

Sales Prediction Model
R² = 0.85
Meaning: Model explains 85% of sales variation
Business Impact: Reliable for planning, 15% unexplained variance

Making Predictions

ML.PREDICT Function

=ML.PREDICT(trained_model, X_new)

Use trained models to make predictions on new data.

Prediction Workflow Example

# Train model
Cell A1: =ML.REGRESSION.LINEAR()
Cell A2: =ML.FIT(A1, historical_features, historical_sales)

# Prepare new data
Cell B1: =ML.DATA.CONVERT_TO_DF(NewData!A1:D10, TRUE)

# Make predictions
Cell C1: =ML.PREDICT(A2, B1)

# View predictions
Cell D1: =ML.DATA.SAMPLE(C1, -1)  # Show all predictions

Cross-Validation (Premium)

Cross-validation provides more robust evaluation by testing on multiple data splits.

ML.EVAL.CV_SCORE Function

=ML.EVAL.CV_SCORE(model, X, y, cv, scoring)

Parameters:

  • model: Untrained model object
  • X: Features
  • y: Target
  • cv: Number of folds (typically 5 or 10)
  • scoring: Metric type (e.g., "accuracy", "r2")

Cross-Validation Example

# Create model
Cell A1: =ML.CLASSIFICATION.LOGISTIC()

# Perform 5-fold cross-validation
Cell A2: =ML.EVAL.CV_SCORE(A1, features, target, 5, "accuracy")

# Returns array of 5 scores (one per fold)
# Average these for overall performance
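
To get a single headline number, a standard Excel average of the fold scores works. This sketch assumes the CV_SCORE result spills into the sheet as an array starting in A2; adjust the reference if your version returns a handle instead:

# Average the 5 fold scores
Cell A3: =AVERAGE(A2#)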

Understanding K-Fold Cross-Validation

With 5-fold CV:

  1. Data split into 5 equal parts
  2. Train on 4 parts, test on 1
  3. Repeat 5 times (each part as test once)
  4. Average scores for final metric

Benefits:

  • More reliable than single split
  • Uses all data for training and testing
  • Reduces variance in evaluation

Hyperparameter Tuning (Premium)

Find optimal model parameters automatically.

ML.EVAL.GRID_SEARCH Function

=ML.EVAL.GRID_SEARCH(model, param_grid, scoring, cv, refit)

Parameters:

  • model: Base model to optimize
  • param_grid: DataFrame of parameters to try
  • scoring: Evaluation metric
  • cv: Cross-validation folds
  • refit: TRUE to retrain the best model on the full training data

Grid Search Setup Example

Create parameter grid in Excel:

| Model | Parameter | Value1 | Value2 | Value3 |
|-------|-----------|--------|--------|--------|
| svm   | C         | 0.1    | 1.0    | 10.0   |
| svm   | kernel    | linear | rbf    | poly   |

# Convert parameter grid to DataFrame
Cell A1: =ML.DATA.CONVERT_TO_DF(ParamGrid!A1:E3, TRUE)

# Create model and perform grid search
Cell B1: =ML.CLASSIFICATION.SVM()
Cell B2: =ML.EVAL.GRID_SEARCH(B1, A1, "accuracy", 5, TRUE)

# Get best parameters
Cell C1: =ML.EVAL.BEST_PARAMS(B2)

# Get search results
Cell D1: =ML.EVAL.SEARCH_RESULTS(B2)

ML.EVAL.BEST_PARAMS Function

=ML.EVAL.BEST_PARAMS(grid_search_object)

Returns DataFrame with optimal parameter values.

ML.EVAL.SEARCH_RESULTS Function

=ML.EVAL.SEARCH_RESULTS(grid_search_object)

Returns DataFrame with all parameter combinations and scores.
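
Because both functions return DataFrames, you can preview them the same way as any other DataFrame handle, reusing ML.DATA.SAMPLE exactly as in the prediction example above:

# Show all parameter combinations with their scores
Cell D2: =ML.DATA.SAMPLE(D1, -1)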

Evaluation Strategies by Problem Type

Binary Classification

Example: Customer Churn (Will Stay/Will Leave)

# Basic evaluation
Cell A1: =ML.CLASSIFICATION.LOGISTIC(1.0, "l2")
Cell A2: =ML.FIT(A1, X_train, y_train)
Cell A3: =ML.EVAL.SCORE(A2, X_test, y_test)

# Interpret
Score > 0.85: Good for balanced data
Score > 0.90: Excellent performance
Consider: False positive vs false negative costs

Multi-Class Classification

Example: Product Category (Electronics/Clothing/Food)

# Use SVM for multi-class
Cell A1: =ML.CLASSIFICATION.SVM(1.0, "rbf")
Cell A2: =ML.FIT(A1, X_train, y_train)
Cell A3: =ML.EVAL.SCORE(A2, X_test, y_test)

# Interpret
Score meaning: Overall fraction of correct predictions across all classes
Check: Performance per class if available

Regression

Example: Sales Forecasting

# Compare multiple models
Cell A1: =ML.REGRESSION.LINEAR()
Cell A2: =ML.REGRESSION.RIDGE(1.0)
Cell A3: =ML.REGRESSION.LASSO(0.1)

# Train and evaluate each
Cell B1: =ML.FIT(A1, X_train, y_train)
Cell B2: =ML.FIT(A2, X_train, y_train)
Cell B3: =ML.FIT(A3, X_train, y_train)

Cell C1: =ML.EVAL.SCORE(B1, X_test, y_test)  # R² scores
Cell C2: =ML.EVAL.SCORE(B2, X_test, y_test)
Cell C3: =ML.EVAL.SCORE(B3, X_test, y_test)
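
To flag the best of the three at a glance, plain Excel works on the score cells, since ML.EVAL.SCORE returns an ordinary number:

# Highest R² among the three fitted models
Cell D1: =MAX(C1:C3)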

Clustering

Example: Customer Segmentation

# K-Means doesn't have traditional accuracy
Cell A1: =ML.CLUSTERING.KMEANS(4)
Cell A2: =ML.FIT(A1, customer_features)

# Evaluate by:
# 1. Business sense of clusters
# 2. Cluster stability
# 3. Silhouette score (if available)
# 4. Within-cluster sum of squares
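
To review whether the clusters make business sense (point 1 above), one sketch is to pull the cluster assignments back out. This assumes ML.PREDICT returns a cluster label per row when given a fitted clustering model, which mirrors its use with supervised models but is not spelled out in this guide:

# Assign each customer to a cluster (assumes PREDICT yields labels for KMeans)
Cell A3: =ML.PREDICT(A2, customer_features)

# View all assignments for a sanity check against known customer groups
Cell A4: =ML.DATA.SAMPLE(A3, -1)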

Interpreting Results for Business

Converting Scores to Business Impact

Classification Example

Accuracy: 0.92
Dataset: 10,000 customer records
Correct predictions: 9,200
Errors: 800

Business Impact:
- 92% of marketing campaigns targeted correctly
- 800 customers may receive wrong offers
- Calculate cost of errors vs benefit of correct predictions

Regression Example

R² Score: 0.75
Sales Prediction Model

Business Interpretation:
- Model explains 75% of sales variation
- 25% due to unmeasured factors
- Individual predictions still carry meaningful error; plan with ranges, not point estimates
- Good for planning, not perfect for guarantees

Common Evaluation Pitfalls

1. Data Leakage

Problem: Test data information leaks into training.
Solution: Always split before any preprocessing, and fit scalers or encoders on the training data only.

2. Overfitting

Signs:

  • Training score much higher than test score (see the check after this list)
  • Perfect training score (1.0)
  • Poor performance on new data
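
A quick check, reusing the split handles (B1-B4) and the fitted SVM (C2) from the earlier examples: score the same fitted model on both splits and compare.

# Score on the training data (expect this to be the higher of the two)
Cell D1: =ML.EVAL.SCORE(C2, B1, B3)

# Score on the held-out test data
Cell D2: =ML.EVAL.SCORE(C2, B2, B4)

# A large gap (e.g., 0.99 vs 0.80) is a strong sign of overfitting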

Solutions:

  • Regularization (Ridge, Lasso)
  • Simpler models
  • More training data
  • Cross-validation

3. Underfitting

Signs:

  • Low scores on both training and test
  • Model too simple for data

Solutions:

  • More complex models
  • Add features
  • Reduce regularization

4. Wrong Metric

Problem: Using accuracy alone on imbalanced data.
Solution: Consider problem-specific metrics and compare against a majority-class baseline.

Best Practices

1. Always Use Test Sets

NEVER evaluate on training data
ALWAYS hold out test data
IDEALLY use cross-validation for final models

2. Multiple Metrics

Don’t rely on single scores:

  • Check multiple metrics when possible
  • Consider business-specific measures
  • Validate with domain experts

3. Baseline Comparisons

Always compare to simple baselines:

  • Classification: Majority class prediction (sketched below)
  • Regression: Mean or median prediction
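
A minimal sketch of a majority-class baseline, assuming the raw target labels also sit on a worksheet range (the range Data!E2:E1001 and the "Yes"/"No" labels here are placeholders; adjust to your sheet):

# Accuracy of always predicting the majority class (hypothetical range and labels)
Cell A1: =MAX(COUNTIF(Data!E2:E1001, "Yes"), COUNTIF(Data!E2:E1001, "No")) / COUNTA(Data!E2:E1001)

# A trained classifier should clearly beat this number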

4. Reproducibility

Use random_state for consistent results:

=ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, y, 0.2, 42)  # Always use the same seed (here 42)

5. Business Validation

  • Show predictions to domain experts
  • Test on recent real-world data
  • Monitor performance over time

Model Selection Framework

Step-by-Step Process

  1. Split Data: 80/20 train-test split
  2. Try Simple Model: Linear/Logistic baseline
  3. Try Complex Models: SVM, Random Forest
  4. Compare Scores: Use same test set
  5. Select Best: Balance accuracy and interpretability
  6. Final Validation: Cross-validation or new data

Example Comparison Workflow

# Data preparation (same for all models)
Cell A1: =ML.DATASETS.DIABETES()
Cell A2: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3,4,5,6,7,8,9})
Cell A3: =ML.DATA.SELECT_COLUMNS(A1, 10)
Cell A4: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.2, 42)

# Model comparison grid
Model         | Formula                        | Score
Linear        | =ML.REGRESSION.LINEAR()        | 0.45
Ridge         | =ML.REGRESSION.RIDGE(1.0)      | 0.48
Lasso         | =ML.REGRESSION.LASSO(0.1)      | 0.47
Random Forest | =ML.REGRESSION.RANDOM_FOREST() | 0.52 (Premium)

# Winner: Random Forest (if accuracy critical)
# Alternative: Ridge (good balance, free)

Next Steps

Ready to see complete examples? Continue to Examples and Tutorials