Model Evaluation

Learn how to evaluate machine learning models in FormulaML. Understand metrics, validation techniques, and how to interpret results for business decisions.

Evaluating your machine learning models is crucial for understanding their performance and making informed business decisions. This guide covers evaluation techniques, metrics interpretation, and best practices.

The Evaluation Process

Why Evaluate Models?

  • Detect Overfitting: Verify that models work on new, unseen data
  • Compare Models: Choose the best algorithm for your problem
  • Business Confidence: Quantify model reliability for decision-making
  • Continuous Improvement: Identify areas for enhancement

Basic Evaluation Workflow

  1. Split data into training and testing sets
  2. Train model on training data
  3. Evaluate on test data (never seen during training)
  4. Interpret metrics in business context

Train-Test Split

The foundation of model evaluation is splitting your data properly.

Using TRAIN_TEST_SPLIT

=ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, y, test_size, random_state)

Parameters:

  • X: Features DataFrame
  • y: Target DataFrame
  • test_size: Fraction for testing (0.2 = 20% test, 80% train)
  • random_state: Seed for reproducibility (e.g., 42)

Returns: Array of 4 handles: [X_train, X_test, y_train, y_test]

Example: Proper Data Splitting

# Load data
Cell A1: =ML.DATASETS.IRIS()
Cell A2: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})  # Features
Cell A3: =ML.DATA.SELECT_COLUMNS(A1, 4)          # Target

# Split data (80/20 split)
Cell A5: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.2, 42)

# Access individual splits
Cell B1: =INDEX(A5, 1)  # X_train
Cell B2: =INDEX(A5, 2)  # X_test
Cell B3: =INDEX(A5, 3)  # y_train
Cell B4: =INDEX(A5, 4)  # y_test

Choosing Split Ratios

  • 80/20: Standard for most datasets
  • 70/30: More test data for smaller datasets (see the example below)
  • 90/10: Very large datasets
  • Consider: Dataset size, model complexity, business requirements
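
For example, switching the earlier Iris split to a 70/30 ratio only changes the test_size argument (all other cells stay the same):

# 70/30 split for a smaller dataset
Cell A5: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.3, 42)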

Basic Model Scoring

ML.EVAL.SCORE Function (Free)

=ML.EVAL.SCORE(trained_model, X_test, y_test)

Returns a single score metric:

  • Classification: Accuracy (0-1, fraction of predictions that are correct)
  • Regression: R² score (proportion of variance explained; typically 0-1, negative for very poor fits)

Example: Basic Evaluation

# Train model
Cell C1: =ML.CLASSIFICATION.SVM()
Cell C2: =ML.FIT(C1, B1, B3)  # Using training data

# Evaluate on test set
Cell C3: =ML.EVAL.SCORE(C2, B2, B4)  # Returns accuracy (e.g., 0.95)

Understanding Metrics

Classification Metrics

Accuracy

  • Formula: Correct Predictions / Total Predictions
  • Range: 0 to 1 (0% to 100%)
  • Interpretation: Overall correctness
  • Caution: Can be misleading with imbalanced classes

Business Example:

Score: 0.92
Meaning: Model correctly classifies 92% of cases
Good for: Balanced datasets (similar number of each class)

When Accuracy is Misleading

If 95% of emails are not spam:

  • Model always predicting “not spam” gets 95% accuracy
  • But misses 100% of actual spam!
  • Need additional metrics for imbalanced data (see the worked numbers below)
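
As a worked example, take 1,000 emails where 950 are legitimate and 50 are spam:

Always predict "not spam": 950 / 1,000 = 95% accuracy
Spam actually caught:        0 / 50    = 0%

The headline number looks strong even though the model never catches a single spam message.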

Regression Metrics

R² Score (Coefficient of Determination)

  • Formula: 1 - (unexplained variance / total variance)
  • Range: up to 1 (negative values mean the model is worse than predicting the mean)
  • Interpretation: Proportion of variance explained
  • Business Meaning: How well the model fits the data

Interpretation Guide:

R² = 1.0:  Perfect fit (rare, possibly overfitting)
R² = 0.9:  Excellent fit
R² = 0.7:  Good fit
R² = 0.5:  Moderate fit
R² < 0.3:  Poor fit
R² < 0:    Model worse than simply predicting the mean

Business Example:

Sales Prediction Model
R² = 0.85
Meaning: Model explains 85% of sales variation
Business Impact: Reliable for planning, 15% unexplained variance

Making Predictions

ML.PREDICT Function

=ML.PREDICT(trained_model, X_new)

Use trained models to make predictions on new data.

Prediction Workflow Example

# Train model
Cell A1: =ML.REGRESSION.LINEAR()
Cell A2: =ML.FIT(A1, historical_features, historical_sales)

# Prepare new data
Cell B1: =ML.DATA.CONVERT_TO_DF(NewData!A1:D10, TRUE)

# Make predictions
Cell C1: =ML.PREDICT(A2, B1)

# View predictions
Cell D1: =ML.DATA.SAMPLE(C1, -1)  # Show all predictions

Cross-Validation (Premium)

Cross-validation provides more robust evaluation by testing on multiple data splits.

ML.EVAL.CV_SCORE Function

=ML.EVAL.CV_SCORE(model, X, y, cv, scoring)

Parameters:

  • model: Untrained model object
  • X: Features
  • y: Target
  • cv: Number of folds (typically 5 or 10)
  • scoring: Metric type (e.g., "accuracy", "r2")

Cross-Validation Example

# Create model
Cell A1: =ML.CLASSIFICATION.LOGISTIC()

# Perform 5-fold cross-validation
Cell A2: =ML.EVAL.CV_SCORE(A1, features, target, 5, "accuracy")

# Returns array of 5 scores (one per fold)
# Average these for overall performance
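
To get a single headline number, a standard Excel average of the fold scores works. This sketch assumes the CV_SCORE result spills into the sheet as an array starting in A2; adjust the reference if your version returns a handle instead:

# Average the 5 fold scores
Cell A3: =AVERAGE(A2#)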

Understanding K-Fold Cross-Validation

With 5-fold CV:

  1. Data split into 5 equal parts
  2. Train on 4 parts, test on 1
  3. Repeat 5 times (each part as test once)
  4. Average scores for final metric

Benefits:

  • More reliable than single split
  • Uses all data for training and testing
  • Reduces variance in evaluation

Hyperparameter Tuning (Premium)

Find optimal model parameters automatically.

ML.EVAL.GRID_SEARCH Function

=ML.EVAL.GRID_SEARCH(model, param_grid, scoring, cv, refit)

Parameters:

  • model: Base model to optimize
  • param_grid: DataFrame of parameters to try
  • scoring: Evaluation metric
  • cv: Cross-validation folds
  • refit: TRUE to retrain the best model on the full training data

Grid Search Setup Example

Create parameter grid in Excel:

| Model | Parameter | Value1 | Value2 | Value3 |
|-------|-----------|--------|--------|--------|
| svm   | C         | 0.1    | 1.0    | 10.0   |
| svm   | kernel    | linear | rbf    | poly   |

# Convert parameter grid to DataFrame
Cell A1: =ML.DATA.CONVERT_TO_DF(ParamGrid!A1:E3, TRUE)

# Create model and perform grid search
Cell B1: =ML.CLASSIFICATION.SVM()
Cell B2: =ML.EVAL.GRID_SEARCH(B1, A1, "accuracy", 5, TRUE)

# Get best parameters
Cell C1: =ML.EVAL.BEST_PARAMS(B2)

# Get search results
Cell D1: =ML.EVAL.SEARCH_RESULTS(B2)

ML.EVAL.BEST_PARAMS Function

=ML.EVAL.BEST_PARAMS(grid_search_object)

Returns DataFrame with optimal parameter values.

ML.EVAL.SEARCH_RESULTS Function

=ML.EVAL.SEARCH_RESULTS(grid_search_object)

Returns DataFrame with all parameter combinations and scores.
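
Because both functions return DataFrames, you can preview them the same way as any other DataFrame handle, reusing ML.DATA.SAMPLE exactly as in the prediction example above:

# Show all parameter combinations with their scores
Cell D2: =ML.DATA.SAMPLE(D1, -1)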

Evaluation Strategies by Problem Type

Binary Classification

Example: Customer Churn (Will Stay/Will Leave)

# Basic evaluation
Cell A1: =ML.CLASSIFICATION.LOGISTIC(1.0, "l2")
Cell A2: =ML.FIT(A1, X_train, y_train)
Cell A3: =ML.EVAL.SCORE(A2, X_test, y_test)

# Interpret
Score > 0.85: Good for balanced data
Score > 0.90: Excellent performance
Consider: False positive vs false negative costs

Multi-Class Classification

Example: Product Category (Electronics/Clothing/Food)

# Use SVM for multi-class
Cell A1: =ML.CLASSIFICATION.SVM(1.0, "rbf")
Cell A2: =ML.FIT(A1, X_train, y_train)
Cell A3: =ML.EVAL.SCORE(A2, X_test, y_test)

# Interpret
Score meaning: Overall fraction of correct predictions across all classes
Check: Performance per class if available

Regression

Example: Sales Forecasting

# Compare multiple models
Cell A1: =ML.REGRESSION.LINEAR()
Cell A2: =ML.REGRESSION.RIDGE(1.0)
Cell A3: =ML.REGRESSION.LASSO(0.1)

# Train and evaluate each
Cell B1: =ML.FIT(A1, X_train, y_train)
Cell B2: =ML.FIT(A2, X_train, y_train)
Cell B3: =ML.FIT(A3, X_train, y_train)

Cell C1: =ML.EVAL.SCORE(B1, X_test, y_test)  # R² scores
Cell C2: =ML.EVAL.SCORE(B2, X_test, y_test)
Cell C3: =ML.EVAL.SCORE(B3, X_test, y_test)
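
To flag the best of the three at a glance, plain Excel works on the score cells, since ML.EVAL.SCORE returns an ordinary number:

# Highest R² among the three fitted models
Cell D1: =MAX(C1:C3)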

Clustering

Example: Customer Segmentation

# K-Means doesn't have traditional accuracy
Cell A1: =ML.CLUSTERING.KMEANS(4)
Cell A2: =ML.FIT(A1, customer_features)

# Evaluate by:
# 1. Business sense of clusters
# 2. Cluster stability
# 3. Silhouette score (if available)
# 4. Within-cluster sum of squares
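
To review whether the clusters make business sense (point 1 above), one sketch is to pull the cluster assignments back out. This assumes ML.PREDICT returns a cluster label per row when given a fitted clustering model, which mirrors its use with supervised models but is not spelled out in this guide:

# Assign each customer to a cluster (assumes PREDICT yields labels for KMeans)
Cell A3: =ML.PREDICT(A2, customer_features)

# View all assignments for a sanity check against known customer groups
Cell A4: =ML.DATA.SAMPLE(A3, -1)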

Interpreting Results for Business

Converting Scores to Business Impact

Classification Example

Accuracy: 0.92
Dataset: 10,000 customer records
Correct predictions: 9,200
Errors: 800

Business Impact:
- 92% of marketing campaigns targeted correctly
- 800 customers may receive wrong offers
- Calculate cost of errors vs benefit of correct predictions

Regression Example

R² Score: 0.75
Sales Prediction Model

Business Interpretation:
- Model explains 75% of sales variation
- 25% due to unmeasured factors
- Individual predictions still carry meaningful error; plan with ranges, not point estimates
- Good for planning, not perfect for guarantees

Common Evaluation Pitfalls

1. Data Leakage

Problem: Test data information leaks into training.
Solution: Always split before any preprocessing, and fit scalers or encoders on the training data only.

2. Overfitting

Signs:

  • Training score much higher than test score (see the check after this list)
  • Perfect training score (1.0)
  • Poor performance on new data
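
A quick check, reusing the split handles (B1-B4) and the fitted SVM (C2) from the earlier examples: score the same fitted model on both splits and compare.

# Score on the training data (expect this to be the higher of the two)
Cell D1: =ML.EVAL.SCORE(C2, B1, B3)

# Score on the held-out test data
Cell D2: =ML.EVAL.SCORE(C2, B2, B4)

# A large gap (e.g., 0.99 vs 0.80) is a strong sign of overfitting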

Solutions:

  • Regularization (Ridge, Lasso)
  • Simpler models
  • More training data
  • Cross-validation

3. Underfitting

Signs:

  • Low scores on both training and test
  • Model too simple for data

Solutions:

  • More complex models
  • Add features
  • Reduce regularization

4. Wrong Metric

Problem: Using accuracy alone on imbalanced data.
Solution: Consider problem-specific metrics and compare against a majority-class baseline.

Best Practices

1. Always Use Test Sets

NEVER evaluate on training data
ALWAYS hold out test data
IDEALLY use cross-validation for final models

2. Multiple Metrics

Don’t rely on single scores:

  • Check multiple metrics when possible
  • Consider business-specific measures
  • Validate with domain experts

3. Baseline Comparisons

Always compare to simple baselines:

  • Classification: Majority class prediction (sketched below)
  • Regression: Mean or median prediction
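
A minimal sketch of a majority-class baseline, assuming the raw target labels also sit on a worksheet range (the range Data!E2:E1001 and the "Yes"/"No" labels here are placeholders; adjust to your sheet):

# Accuracy of always predicting the majority class (hypothetical range and labels)
Cell A1: =MAX(COUNTIF(Data!E2:E1001, "Yes"), COUNTIF(Data!E2:E1001, "No")) / COUNTA(Data!E2:E1001)

# A trained classifier should clearly beat this number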

4. Reproducibility

Use random_state for consistent results:

=ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, y, 0.2, 42)  # Always use the same seed (here 42)

5. Business Validation

  • Show predictions to domain experts
  • Test on recent real-world data
  • Monitor performance over time

Model Selection Framework

Step-by-Step Process

  1. Split Data: 80/20 train-test split
  2. Try Simple Model: Linear/Logistic baseline
  3. Try Complex Models: SVM, Random Forest
  4. Compare Scores: Use same test set
  5. Select Best: Balance accuracy and interpretability
  6. Final Validation: Cross-validation or new data

Example Comparison Workflow

# Data preparation (same for all models)
Cell A1: =ML.DATASETS.DIABETES()
Cell A2: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3,4,5,6,7,8,9})
Cell A3: =ML.DATA.SELECT_COLUMNS(A1, 10)
Cell A4: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.2, 42)

# Model comparison grid
Model         | Formula                        | Score
Linear        | =ML.REGRESSION.LINEAR()        | 0.45
Ridge         | =ML.REGRESSION.RIDGE(1.0)      | 0.48
Lasso         | =ML.REGRESSION.LASSO(0.1)      | 0.47
Random Forest | =ML.REGRESSION.RANDOM_FOREST() | 0.52 (Premium)

# Winner: Random Forest (if accuracy critical)
# Alternative: Ridge (good balance, free)

Next Steps

Ready to see complete examples? Continue to Examples and Tutorials