Model Evaluation
Evaluating your machine learning models is crucial for understanding their performance and making informed business decisions. This guide covers evaluation techniques, metrics interpretation, and best practices.
The Evaluation Process
Why Evaluate Models?
- Prevent Overfitting: Ensure models work on new, unseen data
- Compare Models: Choose the best algorithm for your problem
- Business Confidence: Quantify model reliability for decision-making
- Continuous Improvement: Identify areas for enhancement
Basic Evaluation Workflow
1. Split data into training and testing sets
2. Train the model on the training data
3. Evaluate on test data (never seen during training)
4. Interpret metrics in business context
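As a quick preview, here is that workflow end to end using functions covered in the rest of this guide. The cell layout is illustrative, and the logistic model is just one example:

# Assumes features in A2 and target in A3 (see Train-Test Split below for loading data)
Cell A5: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.2, 42)
Cell B1: =INDEX(A5, 1) # X_train
Cell B2: =INDEX(A5, 2) # X_test
Cell B3: =INDEX(A5, 3) # y_train
Cell B4: =INDEX(A5, 4) # y_test
Cell C1: =ML.CLASSIFICATION.LOGISTIC()
Cell C2: =ML.FIT(C1, B1, B3) # Train on training data only
Cell C3: =ML.EVAL.SCORE(C2, B2, B4) # Evaluate on unseen test data
# Interpret C3 in context (e.g., 0.91 = 91% of test cases classified correctly)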
Train-Test Split
The foundation of model evaluation is splitting your data properly.
Using TRAIN_TEST_SPLIT
=ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, y, test_size, random_state)
Parameters:
- X: Features DataFrame
- y: Target DataFrame
- test_size: Fraction for testing (0.2 = 20% test, 80% train)
- random_state: Seed for reproducibility (e.g., 42)
Returns: Array of 4 handles: [X_train, X_test, y_train, y_test]
Example: Proper Data Splitting
# Load data
Cell A1: =ML.DATASETS.IRIS()
Cell A2: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3}) # Features
Cell A3: =ML.DATA.SELECT_COLUMNS(A1, 4) # Target
# Split data (80/20 split)
Cell A5: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.2, 42)
# Access individual splits
Cell B1: =INDEX(A5, 1) # X_train
Cell B2: =INDEX(A5, 2) # X_test
Cell B3: =INDEX(A5, 3) # y_train
Cell B4: =INDEX(A5, 4) # y_test
Choosing Split Ratios
- 80/20: Standard for most datasets
- 70/30: More test data for smaller datasets
- 90/10: Very large datasets
- Consider: Dataset size, model complexity, business requirements
Basic Model Scoring
ML.EVAL.SCORE Function (Free)
=ML.EVAL.SCORE(trained_model, X_test, y_test)
Returns a single score metric:
- Classification: Accuracy (0-1, fraction of predictions correct)
- Regression: R² score (typically 0-1, proportion of variance explained; can be negative)
Example: Basic Evaluation
# Train model
Cell C1: =ML.CLASSIFICATION.SVM()
Cell C2: =ML.FIT(C1, B1, B3) # Using training data
# Evaluate on test set
Cell C3: =ML.EVAL.SCORE(C2, B2, B4) # Returns accuracy (e.g., 0.95)
Understanding Metrics
Classification Metrics
Accuracy
- Formula: Correct Predictions / Total Predictions
- Range: 0 to 1 (0% to 100%)
- Interpretation: Overall correctness
- Caution: Can be misleading with imbalanced classes
Business Example:
Score: 0.92
Meaning: Model correctly classifies 92% of cases
Good for: Balanced datasets (similar number of each class)
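If predictions and actual labels sit side by side on a sheet, you can verify accuracy yourself with plain Excel. The ranges below are illustrative assumptions:

# Actual labels in E2:E101, predicted labels in F2:F101
Cell G1: =SUMPRODUCT(--(E2:E101=F2:F101))/ROWS(E2:E101)
# Fraction of predictions that match the actuals, e.g., 0.92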
When Accuracy is Misleading
If 95% of emails are not spam:
- Model always predicting “not spam” gets 95% accuracy
- But misses 100% of actual spam!
- Need additional metrics for imbalanced data
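A quick plain-Excel check here is per-class recall: of the actual spam emails, what fraction did the model catch? The ranges and "spam" label below are illustrative:

# Actual labels in E2:E101, predictions in F2:F101
Cell G2: =SUMPRODUCT((E2:E101="spam")*(F2:F101="spam"))/COUNTIF(E2:E101, "spam")
# A model that always predicts "not spam" scores 0 here despite 95% accuracy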
Regression Metrics
R² Score (Coefficient of Determination)
- Formula: 1 - (residual sum of squares / total sum of squares)
- Range: typically 0 to 1 (can be negative for poor models)
- Interpretation: Proportion of variance explained
- Business Meaning: How well the model fits the data
Interpretation Guide:
R² = 1.0: Perfect fit (rare, possibly overfitting)
R² = 0.9: Excellent fit
R² = 0.7: Good fit
R² = 0.5: Moderate fit
R² < 0.3: Poor fit
R² < 0: Model worse than always predicting the mean (a horizontal line)
Business Example:
Sales Prediction Model
R² = 0.85
Meaning: Model explains 85% of sales variation
Business Impact: Reliable for planning, 15% unexplained variance
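To sanity-check an R² value, you can compute it directly from actuals and predictions with standard Excel functions (ranges are illustrative):

# Actual values in E2:E101, predicted values in F2:F101
Cell G3: =1-SUMPRODUCT((E2:E101-F2:F101)^2)/SUMPRODUCT((E2:E101-AVERAGE(E2:E101))^2)
# 1 minus (residual sum of squares / total sum of squares)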
Making Predictions
ML.PREDICT Function
=ML.PREDICT(trained_model, X_new)
Use trained models to make predictions on new data.
Prediction Workflow Example
# Train model
Cell A1: =ML.REGRESSION.LINEAR()
Cell A2: =ML.FIT(A1, historical_features, historical_sales)
# Prepare new data
Cell B1: =ML.DATA.CONVERT_TO_DF(NewData!A1:D10, TRUE)
# Make predictions
Cell C1: =ML.PREDICT(A2, B1)
# View predictions
Cell D1: =ML.DATA.SAMPLE(C1, -1) # Show all predictions
Cross-Validation (Premium)
Cross-validation provides more robust evaluation by testing on multiple data splits.
ML.EVAL.CV_SCORE Function
=ML.EVAL.CV_SCORE(model, X, y, cv, scoring)
Parameters:
- model: Untrained model object
- X: Features
- y: Target
- cv: Number of folds (typically 5 or 10)
- scoring: Metric type (e.g., "accuracy", "r2")
Cross-Validation Example
# Create model
Cell A1: =ML.CLASSIFICATION.LOGISTIC()
# Perform 5-fold cross-validation
Cell A2: =ML.EVAL.CV_SCORE(A1, features, target, 5, "accuracy")
# Returns array of 5 scores (one per fold)
# Average these for overall performance
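Averaging the fold scores in plain Excel might look like this, assuming the five scores spill into a dynamic array from A2 (adjust if your version returns them differently):

Cell B2: =AVERAGE(A2#) # Mean score across the 5 folds
Cell B3: =STDEV.S(A2#) # Spread across folds (lower = more stable model)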
Understanding K-Fold Cross-Validation
With 5-fold CV:
- Data split into 5 equal parts
- Train on 4 parts, test on 1
- Repeat 5 times (each part as test once)
- Average scores for final metric
Benefits:
- More reliable than single split
- Uses all data for training and testing
- Reduces variance in evaluation
Hyperparameter Tuning (Premium)
Find optimal model parameters automatically.
ML.EVAL.GRID_SEARCH Function
=ML.EVAL.GRID_SEARCH(model, param_grid, scoring, cv, refit)
Parameters:
- model: Base model to optimize
- param_grid: DataFrame of parameters to try
- scoring: Evaluation metric
- cv: Cross-validation folds
- refit: TRUE to retrain the best model on the full data
Grid Search Setup Example
Create parameter grid in Excel:
| Model | Parameter | Value1 | Value2 | Value3 |
|-------|-----------|--------|--------|--------|
| svm | C | 0.1 | 1.0 | 10.0 |
| svm | kernel | linear | rbf | poly |
# Convert parameter grid to DataFrame
Cell A1: =ML.DATA.CONVERT_TO_DF(ParamGrid!A1:E3, TRUE)
# Create model and perform grid search
Cell B1: =ML.CLASSIFICATION.SVM()
Cell B2: =ML.EVAL.GRID_SEARCH(B1, A1, "accuracy", 5, TRUE)
# Get best parameters
Cell C1: =ML.EVAL.BEST_PARAMS(B2)
# Get search results
Cell D1: =ML.EVAL.SEARCH_RESULTS(B2)
ML.EVAL.BEST_PARAMS Function
=ML.EVAL.BEST_PARAMS(grid_search_object)
Returns DataFrame with optimal parameter values.
ML.EVAL.SEARCH_RESULTS Function
=ML.EVAL.SEARCH_RESULTS(grid_search_object)
Returns DataFrame with all parameter combinations and scores.
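With refit set to TRUE, the tuned model in B2 should itself be usable for scoring and prediction. This is an assumption about the add-in's behavior, worth verifying in your own workbook:

# Assumes B2 (grid search result with refit=TRUE) behaves like a trained model
Cell E1: =ML.EVAL.SCORE(B2, X_test, y_test) # Score the tuned model on held-out data
Cell E2: =ML.PREDICT(B2, X_new) # Predict with the tuned model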
Evaluation Strategies by Problem Type
Binary Classification
Example: Customer Churn (Will Stay/Will Leave)
# Basic evaluation
Cell A1: =ML.CLASSIFICATION.LOGISTIC(1.0, "l2")
Cell A2: =ML.FIT(A1, X_train, y_train)
Cell A3: =ML.EVAL.SCORE(A2, X_test, y_test)
# Interpret
Score > 0.85: Good for balanced data
Score > 0.90: Excellent performance
Consider: False positive vs false negative costs
Multi-Class Classification
Example: Product Category (Electronics/Clothing/Food)
# Use SVM for multi-class
Cell A1: =ML.CLASSIFICATION.SVM(1.0, "rbf")
Cell A2: =ML.FIT(A1, X_train, y_train)
Cell A3: =ML.EVAL.SCORE(A2, X_test, y_test)
# Interpret
Score meaning: Overall fraction of correct predictions across all classes combined
Check: Performance per class if available
Regression
Example: Sales Forecasting
# Compare multiple models
Cell A1: =ML.REGRESSION.LINEAR()
Cell A2: =ML.REGRESSION.RIDGE(1.0)
Cell A3: =ML.REGRESSION.LASSO(0.1)
# Train and evaluate each
Cell B1: =ML.FIT(A1, X_train, y_train)
Cell B2: =ML.FIT(A2, X_train, y_train)
Cell B3: =ML.FIT(A3, X_train, y_train)
Cell C1: =ML.EVAL.SCORE(B1, X_test, y_test) # R² scores
Cell C2: =ML.EVAL.SCORE(B2, X_test, y_test)
Cell C3: =ML.EVAL.SCORE(B3, X_test, y_test)
Clustering
Example: Customer Segmentation
# K-Means doesn't have traditional accuracy
Cell A1: =ML.CLUSTERING.KMEANS(4)
Cell A2: =ML.FIT(A1, customer_features)
# Evaluate by:
# 1. Business sense of clusters
# 2. Cluster stability
# 3. Silhouette score (if available)
# 4. Within-cluster sum of squares
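One lightweight business-sense check is segment size. Assuming ML.PREDICT returns a cluster label per row for a fitted K-Means (an assumption to verify) with integer labels 0-3 for k=4, plain Excel can tally the segments:

Cell A3: =ML.PREDICT(A2, customer_features) # Cluster label per customer (assumed behavior)
Cell B1: =ML.DATA.SAMPLE(A3, -1) # Spill all labels into the sheet
Cell C1: =COUNTIF(B1#, 0) # Size of cluster 0; repeat for clusters 1-3
# A tiny or dominant cluster is a signal to revisit the choice of k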
Interpreting Results for Business
Converting Scores to Business Impact
Classification Example
Accuracy: 0.92
Dataset: 10,000 customer records
Correct predictions: 9,200
Errors: 800
Business Impact:
- 92% of marketing campaigns targeted correctly
- 800 customers may receive wrong offers
- Calculate cost of errors vs benefit of correct predictions
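A simple way to put a dollar figure on that last point (all amounts below are hypothetical):

# Hypothetical economics: $5 gained per correct targeting, $2 lost per wrong offer
Cell A1: =9200*5 - 800*2 # Net value: 46,000 - 1,600 = $44,400
# Re-run with your own unit costs to decide whether 92% accuracy is good enough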
Regression Example
R² Score: 0.75
Sales Prediction Model
Business Interpretation:
- Model explains 75% of sales variation
- 25% due to unmeasured factors
- Prediction intervals: on the order of ±15% (illustrative figure, not derived from R² alone)
- Good for planning, not perfect for guarantees
Common Evaluation Pitfalls
1. Data Leakage
Problem: Test data information leaks into training.
Solution: Always split before any preprocessing, and fit scalers or encoders on the training split only.
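The safe ordering is sketched below. The scaler functions are hypothetical placeholders (this guide documents only TRAIN_TEST_SPLIT); the point is the order of operations: split first, then fit any preprocessing on the training split only:

# 1. Split FIRST
Cell A5: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.2, 42)
Cell B1: =INDEX(A5, 1) # X_train
Cell B2: =INDEX(A5, 2) # X_test
# 2. Fit preprocessing on X_train only (hypothetical function name)
# Cell C1: =ML.PREPROCESSING.FIT_SCALER(B1)
# 3. Apply that SAME fitted scaler to X_test -- never refit on test data
# Cell C2: =ML.PREPROCESSING.APPLY_SCALER(C1, B2)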
2. Overfitting
Signs:
- Training score much higher than test score (see the check after this list)
- Perfect training score (1.0)
- Poor performance on new data
Solutions:
- Regularization (Ridge, Lasso)
- Simpler models
- More training data
- Cross-validation
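You can measure the train-test gap directly with the scoring function you already have, reusing the cells from the earlier examples (C2 = trained model, B1-B4 = splits):

Cell D1: =ML.EVAL.SCORE(C2, B1, B3) # Score on TRAINING data
Cell D2: =ML.EVAL.SCORE(C2, B2, B4) # Score on TEST data
Cell D3: =IF(D1-D2>0.1, "Possible overfitting", "Gap looks OK") # Rough rule of thumb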
3. Underfitting
Signs:
- Low scores on both training and test
- Model too simple for data
Solutions:
- More complex models
- Add features
- Reduce regularization
4. Wrong Metric
Problem: Using accuracy for imbalanced data.
Solution: Consider problem-specific metrics, such as per-class recall (see the spam example above).
Best Practices
1. Always Use Test Sets
NEVER evaluate on training data
ALWAYS hold out test data
IDEALLY use cross-validation for final models
2. Multiple Metrics
Don’t rely on single scores:
- Check multiple metrics when possible
- Consider business-specific measures
- Validate with domain experts
3. Baseline Comparisons
Always compare to simple baselines:
- Classification: Majority class prediction
- Regression: Mean or median prediction
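Both baselines can be computed with ordinary Excel functions. The sketch assumes numeric class labels in the stated range:

# Classification baseline: accuracy of always predicting the majority class
Cell G1: =COUNTIF(E2:E101, MODE.SNGL(E2:E101))/ROWS(E2:E101)
# Regression baseline: always predicting the mean scores R² = 0 by definition,
# so a useful model should score clearly above 0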
4. Reproducibility
Use random_state for consistent results:
=ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, y, 0.2, 42) # Always use the same seed (here, 42)
5. Business Validation
- Show predictions to domain experts
- Test on recent real-world data
- Monitor performance over time
Model Selection Framework
Step-by-Step Process
1. Split Data: 80/20 train-test split
2. Try Simple Model: Linear/Logistic baseline
3. Try Complex Models: SVM, Random Forest
4. Compare Scores: Use same test set
5. Select Best: Balance accuracy and interpretability
6. Final Validation: Cross-validation or new data
Example Comparison Workflow
# Data preparation (same for all models)
Cell A1: =ML.DATASETS.DIABETES()
Cell A2: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3,4,5,6,7,8,9})
Cell A3: =ML.DATA.SELECT_COLUMNS(A1, 10)
Cell A4: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.2, 42)
# Model comparison grid
| Model | Formula | Score |
|-------|---------|-------|
| Linear | =ML.REGRESSION.LINEAR() | 0.45 |
| Ridge | =ML.REGRESSION.RIDGE(1.0) | 0.48 |
| Lasso | =ML.REGRESSION.LASSO(0.1) | 0.47 |
| Random Forest (Premium) | =ML.REGRESSION.RANDOM_FOREST() | 0.52 |
# Winner: Random Forest (if accuracy critical)
# Alternative: Ridge (good balance, free)
Next Steps
- See examples: Work through complete tutorials
- Advanced techniques: Learn about pipelines and preprocessing
- Troubleshooting: Check tips and common issues
- Function details: Reference the complete function list
Ready to see complete examples? Continue to Examples and Tutorials →