Models and Algorithms

Comprehensive guide to machine learning models available in FormulaML. Learn when and how to use regression, classification, and clustering algorithms.

Models and Algorithms

FormulaML provides a comprehensive suite of machine learning algorithms accessible through simple Excel formulas. This guide helps you choose the right model for your data and business problem.

Understanding Model Types

What Type of Problem Are You Solving?

Before selecting a model, identify your problem type:

  1. Regression: Predicting continuous numerical values

    • Examples: Sales forecasting, price prediction, demand estimation
    • Output: Numbers like $45,000 or 23.5%
  2. Classification: Categorizing data into groups

    • Examples: Customer churn (yes/no), product quality (good/defective), email spam detection
    • Output: Categories like “High Risk” or “Premium Customer”
  3. Clustering: Finding natural groups in data

    • Examples: Customer segmentation, product grouping, anomaly detection
    • Output: Group assignments like “Cluster 1” or “Segment A”

Regression Models

Regression models predict continuous numerical values based on input features.

Linear Regression (Free)

=ML.REGRESSION.LINEAR(fit_intercept)

When to use:

  • Simple relationships between variables
  • Baseline model for comparison
  • When interpretability is crucial

Parameters:

  • fit_intercept: TRUE to include intercept (default), FALSE for data through origin

Business Example: Predicting sales based on advertising spend:

Cell A1: =ML.REGRESSION.LINEAR(TRUE)
Cell A2: =ML.FIT(A1, advertising_features, sales_target)

Ridge Regression (Free)

=ML.REGRESSION.RIDGE(alpha, fit_intercept)

When to use:

  • When you have many features (multicollinearity)
  • To prevent overfitting
  • When all features might be relevant

Parameters:

  • alpha: Regularization strength (default: 1.0, higher = more regularization)
  • fit_intercept: TRUE/FALSE for intercept inclusion

Business Example: Predicting house prices with many correlated features:

Cell A1: =ML.REGRESSION.RIDGE(0.5, TRUE)

Lasso Regression (Free)

=ML.REGRESSION.LASSO(alpha, fit_intercept)

When to use:

  • Feature selection is important
  • You suspect many features are irrelevant
  • Want a sparse model (fewer features)

Parameters:

  • alpha: Regularization strength (default: 1.0)
  • fit_intercept: TRUE/FALSE for intercept

Business Example: Finding key factors affecting customer lifetime value:

Cell A1: =ML.REGRESSION.LASSO(0.1, TRUE)

Elastic Net (Free)

=ML.REGRESSION.ELASTIC_NET(alpha, l1_ratio, fit_intercept)

When to use:

  • Combines benefits of Ridge and Lasso
  • Many features with some groups of correlated features
  • Balanced regularization approach

Parameters:

  • alpha: Overall regularization strength (default: 1.0)
  • l1_ratio: Balance between L1 and L2 penalty (0-1, default: 0.5)
  • fit_intercept: TRUE/FALSE for intercept

Business Example: Complex demand forecasting with mixed feature types:

Cell A1: =ML.REGRESSION.ELASTIC_NET(1.0, 0.7, TRUE)

Random Forest Regression (Premium)

=ML.REGRESSION.RANDOM_FOREST_REG(n_estimators, criterion, max_depth, ...)

When to use:

  • Non-linear relationships
  • High accuracy requirements
  • Robust to outliers
  • No need for feature scaling

Key Parameters:

  • n_estimators: Number of trees (default: 100, more = better but slower)
  • criterion: “squared_error”, “absolute_error”, “friedman_mse”, or “poisson”
  • max_depth: Maximum tree depth (NULL for unlimited)
  • min_samples_split: Minimum samples to split node (default: 2)
  • random_state: For reproducibility

Business Example: Complex price optimization with non-linear patterns:

Cell A1: =ML.REGRESSION.RANDOM_FOREST_REG(200, "squared_error", 10)

Classification Models

Classification models predict discrete categories or classes.

Logistic Regression (Free)

=ML.CLASSIFICATION.LOGISTIC(C, penalty, fit_intercept, max_iter, tol)

When to use:

  • Binary or multi-class classification
  • Need probability estimates
  • Linear decision boundaries
  • Interpretable results needed

Parameters:

  • C: Inverse regularization strength (default: 1.0, smaller = stronger regularization)
  • penalty: “l2” (default), “l1”, “elasticnet”, or “none”
  • fit_intercept: TRUE/FALSE for intercept
  • max_iter: Maximum iterations (default: 100)
  • tol: Convergence tolerance (default: 0.0001)

Business Example: Customer churn prediction:

Cell A1: =ML.CLASSIFICATION.LOGISTIC(1.0, "l2", TRUE, 100, 0.0001)

Support Vector Machine - SVM (Free)

=ML.CLASSIFICATION.SVM(C, kernel, degree, gamma, coef0)

When to use:

  • Complex, non-linear boundaries
  • High-dimensional data
  • Binary or multi-class classification
  • Robust to outliers

Parameters:

  • C: Regularization parameter (default: 1.0)
  • kernel: “rbf” (default), “linear”, “poly”, “sigmoid”
  • degree: Polynomial degree (for “poly” kernel, default: 3)
  • gamma: Kernel coefficient (“scale” or “auto”, default: “scale”)
  • coef0: Independent term (for “poly” and “sigmoid”, default: 0.0)

Business Example: Credit risk classification with complex patterns:

Cell A1: =ML.CLASSIFICATION.SVM(1.0, "rbf", 3, "scale", 0.0)

Random Forest Classifier (Premium)

=ML.CLASSIFICATION.RANDOM_FOREST_CLF(n_estimators, criterion, max_depth, ...)

When to use:

  • Non-linear relationships
  • High accuracy requirements
  • Feature importance needed
  • Handles mixed data types well

Key Parameters:

  • n_estimators: Number of trees (default: 100)
  • criterion: “gini” (default), “entropy”, or “log_loss”
  • max_depth: Maximum tree depth (NULL for unlimited)
  • min_samples_split: Minimum samples to split (default: 2)
  • min_samples_leaf: Minimum samples in leaf (default: 1)
  • bootstrap: TRUE/FALSE for bootstrap samples (default: TRUE)
  • random_state: For reproducibility

Business Example: Product defect detection with multiple sensor readings:

Cell A1: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(150, "gini", 15)

Clustering Models

Clustering models find natural groups in your data without predefined labels.

K-Means Clustering (Free)

=ML.CLUSTERING.KMEANS(n_clusters, init, n_init, max_iter, tol, random_state, algorithm)

When to use:

  • Customer segmentation
  • Product grouping
  • Anomaly detection
  • Pattern discovery

Parameters:

  • n_clusters: Number of clusters to create (default: 8)
  • init: Initialization method (“k-means++” or “random”, default: “k-means++”)
  • n_init: Number of initialization runs (“auto” or integer, default: “auto”)
  • max_iter: Maximum iterations (default: 300)
  • tol: Convergence tolerance (default: 0.0001)
  • random_state: For reproducibility
  • algorithm: “lloyd” (default) or “elkan”

Business Example: Customer segmentation for targeted marketing:

Cell A1: =ML.CLUSTERING.KMEANS(5, "k-means++", "auto", 300, 0.0001, 42, "lloyd")
Cell A2: =ML.FIT(A1, customer_features)
Cell A3: =ML.PREDICT(A2, new_customers)

How to Choose Number of Clusters

Use the elbow method:

  1. Try different values of k (2-10)
  2. Calculate within-cluster sum of squares for each
  3. Plot and look for the “elbow” point
  4. Choose k at the elbow

Choosing the Right Model

Decision Framework

For Regression Problems:

  1. Start with Linear Regression - Simple baseline
  2. Try Ridge if many features - Handles multicollinearity
  3. Use Lasso for feature selection - Identifies important features
  4. Apply Elastic Net for balance - Best of both worlds
  5. Upgrade to Random Forest for complex patterns - Premium, highest accuracy

For Classification Problems:

  1. Start with Logistic Regression - Interpretable, fast
  2. Try SVM for complex boundaries - Non-linear patterns
  3. Use Random Forest for best accuracy - Premium, feature importance

For Clustering:

  1. Use K-Means - Simple, effective
  2. Experiment with different k values
  3. Validate clusters make business sense

Model Comparison Example

Compare multiple models on the same data:

# Load and split data
Cell A1: =ML.DATASETS.DIABETES()
Cell A2: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3,4,5,6,7,8,9})
Cell A3: =ML.DATA.SELECT_COLUMNS(A1, 10)
Cell A4: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.2, 42)

# Train multiple models
Cell B1: =ML.REGRESSION.LINEAR(TRUE)
Cell B2: =ML.FIT(B1, INDEX(A4,1), INDEX(A4,3))
Cell B3: =ML.EVAL.SCORE(B2, INDEX(A4,2), INDEX(A4,4))

Cell C1: =ML.REGRESSION.RIDGE(1.0, TRUE)
Cell C2: =ML.FIT(C1, INDEX(A4,1), INDEX(A4,3))
Cell C3: =ML.EVAL.SCORE(C2, INDEX(A4,2), INDEX(A4,4))

Cell D1: =ML.REGRESSION.LASSO(0.1, TRUE)
Cell D2: =ML.FIT(D1, INDEX(A4,1), INDEX(A4,3))
Cell D3: =ML.EVAL.SCORE(D2, INDEX(A4,2), INDEX(A4,4))

Model Training Workflow

Standard Workflow

  1. Create Model: Choose algorithm with parameters
  2. Train Model: Use ML.FIT with training data
  3. Make Predictions: Use ML.PREDICT on new data
  4. Evaluate: Check performance with ML.EVAL.SCORE

Complete Example

# Step 1: Create model
Cell A1: =ML.CLASSIFICATION.SVM(1.0, "rbf")

# Step 2: Train model
Cell A2: =ML.FIT(A1, X_train, y_train)

# Step 3: Make predictions
Cell A3: =ML.PREDICT(A2, X_test)

# Step 4: Evaluate
Cell A4: =ML.EVAL.SCORE(A2, X_test, y_test)

Model Parameters Guide

Regularization Parameters

  • C (SVM, Logistic): Inverse regularization (smaller = more regularization)
  • alpha (Ridge, Lasso): Direct regularization (larger = more regularization)
  • l1_ratio (Elastic Net): Balance between L1 and L2 (0 = Ridge, 1 = Lasso)

Tree-Based Parameters

  • n_estimators: More trees = better accuracy but slower
  • max_depth: Deeper trees = more complex patterns but risk overfitting
  • min_samples_split/leaf: Higher values = simpler trees, less overfitting

Convergence Parameters

  • max_iter: Maximum iterations before stopping
  • tol: Convergence tolerance (smaller = more precise)

Performance Considerations

Model Speed (Fastest to Slowest)

  1. Linear/Logistic Regression
  2. Ridge/Lasso/Elastic Net
  3. K-Means
  4. SVM
  5. Random Forest

Memory Usage (Lowest to Highest)

  1. Linear models (Linear, Ridge, Lasso)
  2. Logistic Regression
  3. K-Means
  4. SVM
  5. Random Forest

Accuracy Potential (Generally)

  1. Random Forest (Premium) - Highest
  2. SVM - High for complex patterns
  3. Elastic Net - Good with regularization
  4. Ridge/Lasso - Good for linear relationships
  5. Linear/Logistic - Baseline

Free vs Premium Models

Free Models Include:

  • Regression: Linear, Ridge, Lasso, Elastic Net
  • Classification: Logistic, SVM
  • Clustering: K-Means

Premium Models Include:

  • Regression: Random Forest Regressor
  • Classification: Random Forest Classifier
  • Additional evaluation methods and hyperparameter tuning

Best Practices

1. Start Simple

Always begin with simple models (Linear/Logistic) as baselines before moving to complex ones.

2. Use Cross-Validation

For important decisions, use cross-validation (premium) to ensure model reliability.

3. Check Feature Importance

Use Lasso or Random Forest to identify which features matter most.

4. Monitor Overfitting

Compare training and test scores - large gaps indicate overfitting.

5. Consider Business Context

Choose models based on:

  • Interpretability requirements
  • Prediction accuracy needs
  • Computational resources
  • Update frequency

Next Steps

Ready to evaluate your models? Continue to Model Evaluation