Models and Algorithms

FormulaML provides a comprehensive suite of machine learning algorithms accessible through simple Excel formulas. This guide helps you choose the right model for your data and business problem.

Understanding Model Types

What Type of Problem Are You Solving?

Before selecting a model, identify your problem type:

Regression: Predicting continuous numerical values
- Examples: Sales forecasting, price prediction, demand estimation
- Output: Numbers like $45,000 or 23.5%
Classification: Categorizing data into groups
- Examples: Customer churn (yes/no), product quality (good/defective), email spam detection
- Output: Categories like “High Risk” or “Premium Customer”
Clustering: Finding natural groups in data
- Examples: Customer segmentation, product grouping, anomaly detection
- Output: Group assignments like “Cluster 1” or “Segment A”

Regression Models

Regression models predict continuous numerical values based on input features.

Linear Regression (Free)

=ML.REGRESSION.LINEAR(fit_intercept)

When to use:

Simple relationships between variables
Baseline model for comparison
When interpretability is crucial

Parameters:

fit_intercept: TRUE to include intercept (default), FALSE for data through origin

Business Example: Predicting sales based on advertising spend:

Cell A1: =ML.REGRESSION.LINEAR(TRUE)
Cell A2: =ML.FIT(A1, advertising_features, sales_target)

Ridge Regression (Free)

=ML.REGRESSION.RIDGE(alpha, fit_intercept)

When to use:

When you have many features (multicollinearity)
To prevent overfitting
When all features might be relevant

Parameters:

alpha: Regularization strength (default: 1.0, higher = more regularization)
fit_intercept: TRUE/FALSE for intercept inclusion

Business Example: Predicting house prices with many correlated features:

Cell A1: =ML.REGRESSION.RIDGE(0.5, TRUE)

Lasso Regression (Free)

=ML.REGRESSION.LASSO(alpha, fit_intercept)

When to use:

Feature selection is important
You suspect many features are irrelevant
Want a sparse model (fewer features)

Parameters:

alpha: Regularization strength (default: 1.0)
fit_intercept: TRUE/FALSE for intercept

Business Example: Finding key factors affecting customer lifetime value:

Cell A1: =ML.REGRESSION.LASSO(0.1, TRUE)

Elastic Net (Free)

=ML.REGRESSION.ELASTIC_NET(alpha, l1_ratio, fit_intercept)

When to use:

Combines benefits of Ridge and Lasso
Many features with some groups of correlated features
Balanced regularization approach

Parameters:

alpha: Overall regularization strength (default: 1.0)
l1_ratio: Balance between L1 and L2 penalty (0-1, default: 0.5)
fit_intercept: TRUE/FALSE for intercept

Business Example: Complex demand forecasting with mixed feature types:

Cell A1: =ML.REGRESSION.ELASTIC_NET(1.0, 0.7, TRUE)

Random Forest Regression (Premium)

=ML.REGRESSION.RANDOM_FOREST_REG(n_estimators, criterion, max_depth, ...)

When to use:

Non-linear relationships
High accuracy requirements
Robust to outliers
No need for feature scaling

Key Parameters:

n_estimators: Number of trees (default: 100, more = better but slower)
criterion: “squared_error”, “absolute_error”, “friedman_mse”, or “poisson”
max_depth: Maximum tree depth (NULL for unlimited)
min_samples_split: Minimum samples to split node (default: 2)
random_state: For reproducibility

Business Example: Complex price optimization with non-linear patterns:

Cell A1: =ML.REGRESSION.RANDOM_FOREST_REG(200, "squared_error", 10)

Classification Models

Classification models predict discrete categories or classes.

Logistic Regression (Free)

=ML.CLASSIFICATION.LOGISTIC(C, penalty, fit_intercept, max_iter, tol)

When to use:

Binary or multi-class classification
Need probability estimates
Linear decision boundaries
Interpretable results needed

Parameters:

C: Inverse regularization strength (default: 1.0, smaller = stronger regularization)
penalty: “l2” (default), “l1”, “elasticnet”, or “none”
fit_intercept: TRUE/FALSE for intercept
max_iter: Maximum iterations (default: 100)
tol: Convergence tolerance (default: 0.0001)

Business Example: Customer churn prediction:

Cell A1: =ML.CLASSIFICATION.LOGISTIC(1.0, "l2", TRUE, 100, 0.0001)

Support Vector Machine - SVM (Free)

=ML.CLASSIFICATION.SVM(C, kernel, degree, gamma, coef0)

When to use:

Complex, non-linear boundaries
High-dimensional data
Binary or multi-class classification
Robust to outliers

Parameters:

C: Regularization parameter (default: 1.0)
kernel: “rbf” (default), “linear”, “poly”, “sigmoid”
degree: Polynomial degree (for “poly” kernel, default: 3)
gamma: Kernel coefficient (“scale” or “auto”, default: “scale”)
coef0: Independent term (for “poly” and “sigmoid”, default: 0.0)

Business Example: Credit risk classification with complex patterns:

Cell A1: =ML.CLASSIFICATION.SVM(1.0, "rbf", 3, "scale", 0.0)

Random Forest Classifier (Premium)

=ML.CLASSIFICATION.RANDOM_FOREST_CLF(n_estimators, criterion, max_depth, ...)

When to use:

Non-linear relationships
High accuracy requirements
Feature importance needed
Handles mixed data types well

Key Parameters:

n_estimators: Number of trees (default: 100)
criterion: “gini” (default), “entropy”, or “log_loss”
max_depth: Maximum tree depth (NULL for unlimited)
min_samples_split: Minimum samples to split (default: 2)
min_samples_leaf: Minimum samples in leaf (default: 1)
bootstrap: TRUE/FALSE for bootstrap samples (default: TRUE)
random_state: For reproducibility

Business Example: Product defect detection with multiple sensor readings:

Cell A1: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(150, "gini", 15)

Clustering Models

Clustering models find natural groups in your data without predefined labels.

K-Means Clustering (Free)

=ML.CLUSTERING.KMEANS(n_clusters, init, n_init, max_iter, tol, random_state, algorithm)

When to use:

Customer segmentation
Product grouping
Anomaly detection
Pattern discovery

Parameters:

n_clusters: Number of clusters to create (default: 8)
init: Initialization method (“k-means++” or “random”, default: “k-means++”)
n_init: Number of initialization runs (“auto” or integer, default: “auto”)
max_iter: Maximum iterations (default: 300)
tol: Convergence tolerance (default: 0.0001)
random_state: For reproducibility
algorithm: “lloyd” (default) or “elkan”

Business Example: Customer segmentation for targeted marketing:

Cell A1: =ML.CLUSTERING.KMEANS(5, "k-means++", "auto", 300, 0.0001, 42, "lloyd")
Cell A2: =ML.FIT(A1, customer_features)
Cell A3: =ML.PREDICT(A2, new_customers)

How to Choose Number of Clusters

Use the elbow method:

Try different values of k (2-10)
Calculate within-cluster sum of squares for each
Plot and look for the “elbow” point
Choose k at the elbow

Choosing the Right Model

Decision Framework

For Regression Problems:

Start with Linear Regression - Simple baseline
Try Ridge if many features - Handles multicollinearity
Use Lasso for feature selection - Identifies important features
Apply Elastic Net for balance - Best of both worlds
Upgrade to Random Forest for complex patterns - Premium, highest accuracy

For Classification Problems:

Start with Logistic Regression - Interpretable, fast
Try SVM for complex boundaries - Non-linear patterns
Use Random Forest for best accuracy - Premium, feature importance

For Clustering:

Use K-Means - Simple, effective
Experiment with different k values
Validate clusters make business sense

Model Comparison Example

Compare multiple models on the same data:

# Load and split data
Cell A1: =ML.DATASETS.DIABETES()
Cell A2: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3,4,5,6,7,8,9})
Cell A3: =ML.DATA.SELECT_COLUMNS(A1, 10)
Cell A4: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.2, 42)

# Train multiple models
Cell B1: =ML.REGRESSION.LINEAR(TRUE)
Cell B2: =ML.FIT(B1, INDEX(A4,1), INDEX(A4,3))
Cell B3: =ML.EVAL.SCORE(B2, INDEX(A4,2), INDEX(A4,4))

Cell C1: =ML.REGRESSION.RIDGE(1.0, TRUE)
Cell C2: =ML.FIT(C1, INDEX(A4,1), INDEX(A4,3))
Cell C3: =ML.EVAL.SCORE(C2, INDEX(A4,2), INDEX(A4,4))

Cell D1: =ML.REGRESSION.LASSO(0.1, TRUE)
Cell D2: =ML.FIT(D1, INDEX(A4,1), INDEX(A4,3))
Cell D3: =ML.EVAL.SCORE(D2, INDEX(A4,2), INDEX(A4,4))

Model Training Workflow

Standard Workflow

Create Model: Choose algorithm with parameters
Train Model: Use ML.FIT with training data
Make Predictions: Use ML.PREDICT on new data
Evaluate: Check performance with ML.EVAL.SCORE

Complete Example

# Step 1: Create model
Cell A1: =ML.CLASSIFICATION.SVM(1.0, "rbf")

# Step 2: Train model
Cell A2: =ML.FIT(A1, X_train, y_train)

# Step 3: Make predictions
Cell A3: =ML.PREDICT(A2, X_test)

# Step 4: Evaluate
Cell A4: =ML.EVAL.SCORE(A2, X_test, y_test)

Model Parameters Guide

Regularization Parameters

C (SVM, Logistic): Inverse regularization (smaller = more regularization)
alpha (Ridge, Lasso): Direct regularization (larger = more regularization)
l1_ratio (Elastic Net): Balance between L1 and L2 (0 = Ridge, 1 = Lasso)

Tree-Based Parameters

n_estimators: More trees = better accuracy but slower
max_depth: Deeper trees = more complex patterns but risk overfitting
min_samples_split/leaf: Higher values = simpler trees, less overfitting

Convergence Parameters

max_iter: Maximum iterations before stopping
tol: Convergence tolerance (smaller = more precise)

Performance Considerations

Model Speed (Fastest to Slowest)

Linear/Logistic Regression
Ridge/Lasso/Elastic Net
K-Means
SVM
Random Forest

Memory Usage (Lowest to Highest)

Linear models (Linear, Ridge, Lasso)
Logistic Regression
K-Means
SVM
Random Forest

Accuracy Potential (Generally)

Random Forest (Premium) - Highest
SVM - High for complex patterns
Elastic Net - Good with regularization
Ridge/Lasso - Good for linear relationships
Linear/Logistic - Baseline

Free vs Premium Models

Free Models Include:

Regression: Linear, Ridge, Lasso, Elastic Net
Classification: Logistic, SVM
Clustering: K-Means

Premium Models Include:

Regression: Random Forest Regressor
Classification: Random Forest Classifier
Additional evaluation methods and hyperparameter tuning

Best Practices

1. Start Simple

Always begin with simple models (Linear/Logistic) as baselines before moving to complex ones.

2. Use Cross-Validation

For important decisions, use cross-validation (premium) to ensure model reliability.

3. Check Feature Importance

Use Lasso or Random Forest to identify which features matter most.

4. Monitor Overfitting

Compare training and test scores - large gaps indicate overfitting.

5. Consider Business Context

Choose models based on:

Interpretability requirements
Prediction accuracy needs
Computational resources
Update frequency

Next Steps

Evaluate your models: Learn about Model Evaluation
Try examples: See complete workflows in Examples
Advanced techniques: Explore pipelines and preprocessing
Function details: Check the Function Reference

Ready to evaluate your models? Continue to Model Evaluation →

Table of Contents

Models and Algorithms

Understanding Model Types

What Type of Problem Are You Solving?

Regression Models

Linear Regression (Free)

Ridge Regression (Free)

Lasso Regression (Free)

Elastic Net (Free)

Random Forest Regression (Premium)

Classification Models

Logistic Regression (Free)

Support Vector Machine - SVM (Free)

Random Forest Classifier (Premium)

Clustering Models

K-Means Clustering (Free)

How to Choose Number of Clusters

Choosing the Right Model

Decision Framework

For Regression Problems:

For Classification Problems:

For Clustering:

Model Comparison Example

Model Training Workflow

Standard Workflow

Complete Example

Model Parameters Guide

Regularization Parameters

Tree-Based Parameters

Convergence Parameters

Performance Considerations

Model Speed (Fastest to Slowest)

Memory Usage (Lowest to Highest)

Accuracy Potential (Generally)

Free vs Premium Models

Free Models Include:

Premium Models Include:

Best Practices

1. Start Simple

2. Use Cross-Validation

3. Check Feature Importance

4. Monitor Overfitting

5. Consider Business Context

Next Steps