Table of Contents
Models and Algorithms
FormulaML provides a comprehensive suite of machine learning algorithms accessible through simple Excel formulas. This guide helps you choose the right model for your data and business problem.
Understanding Model Types
What Type of Problem Are You Solving?
Before selecting a model, identify your problem type:
-
Regression: Predicting continuous numerical values
- Examples: Sales forecasting, price prediction, demand estimation
- Output: Numbers like $45,000 or 23.5%
-
Classification: Categorizing data into groups
- Examples: Customer churn (yes/no), product quality (good/defective), email spam detection
- Output: Categories like “High Risk” or “Premium Customer”
-
Clustering: Finding natural groups in data
- Examples: Customer segmentation, product grouping, anomaly detection
- Output: Group assignments like “Cluster 1” or “Segment A”
Regression Models
Regression models predict continuous numerical values based on input features.
Linear Regression (Free)
=ML.REGRESSION.LINEAR(fit_intercept)
When to use:
- Simple relationships between variables
- Baseline model for comparison
- When interpretability is crucial
Parameters:
fit_intercept
: TRUE to include intercept (default), FALSE for data through origin
Business Example: Predicting sales based on advertising spend:
Cell A1: =ML.REGRESSION.LINEAR(TRUE)
Cell A2: =ML.FIT(A1, advertising_features, sales_target)
Ridge Regression (Free)
=ML.REGRESSION.RIDGE(alpha, fit_intercept)
When to use:
- When you have many features (multicollinearity)
- To prevent overfitting
- When all features might be relevant
Parameters:
alpha
: Regularization strength (default: 1.0, higher = more regularization)fit_intercept
: TRUE/FALSE for intercept inclusion
Business Example: Predicting house prices with many correlated features:
Cell A1: =ML.REGRESSION.RIDGE(0.5, TRUE)
Lasso Regression (Free)
=ML.REGRESSION.LASSO(alpha, fit_intercept)
When to use:
- Feature selection is important
- You suspect many features are irrelevant
- Want a sparse model (fewer features)
Parameters:
alpha
: Regularization strength (default: 1.0)fit_intercept
: TRUE/FALSE for intercept
Business Example: Finding key factors affecting customer lifetime value:
Cell A1: =ML.REGRESSION.LASSO(0.1, TRUE)
Elastic Net (Free)
=ML.REGRESSION.ELASTIC_NET(alpha, l1_ratio, fit_intercept)
When to use:
- Combines benefits of Ridge and Lasso
- Many features with some groups of correlated features
- Balanced regularization approach
Parameters:
alpha
: Overall regularization strength (default: 1.0)l1_ratio
: Balance between L1 and L2 penalty (0-1, default: 0.5)fit_intercept
: TRUE/FALSE for intercept
Business Example: Complex demand forecasting with mixed feature types:
Cell A1: =ML.REGRESSION.ELASTIC_NET(1.0, 0.7, TRUE)
Random Forest Regression (Premium)
=ML.REGRESSION.RANDOM_FOREST_REG(n_estimators, criterion, max_depth, ...)
When to use:
- Non-linear relationships
- High accuracy requirements
- Robust to outliers
- No need for feature scaling
Key Parameters:
n_estimators
: Number of trees (default: 100, more = better but slower)criterion
: “squared_error”, “absolute_error”, “friedman_mse”, or “poisson”max_depth
: Maximum tree depth (NULL for unlimited)min_samples_split
: Minimum samples to split node (default: 2)random_state
: For reproducibility
Business Example: Complex price optimization with non-linear patterns:
Cell A1: =ML.REGRESSION.RANDOM_FOREST_REG(200, "squared_error", 10)
Classification Models
Classification models predict discrete categories or classes.
Logistic Regression (Free)
=ML.CLASSIFICATION.LOGISTIC(C, penalty, fit_intercept, max_iter, tol)
When to use:
- Binary or multi-class classification
- Need probability estimates
- Linear decision boundaries
- Interpretable results needed
Parameters:
C
: Inverse regularization strength (default: 1.0, smaller = stronger regularization)penalty
: “l2” (default), “l1”, “elasticnet”, or “none”fit_intercept
: TRUE/FALSE for interceptmax_iter
: Maximum iterations (default: 100)tol
: Convergence tolerance (default: 0.0001)
Business Example: Customer churn prediction:
Cell A1: =ML.CLASSIFICATION.LOGISTIC(1.0, "l2", TRUE, 100, 0.0001)
Support Vector Machine - SVM (Free)
=ML.CLASSIFICATION.SVM(C, kernel, degree, gamma, coef0)
When to use:
- Complex, non-linear boundaries
- High-dimensional data
- Binary or multi-class classification
- Robust to outliers
Parameters:
C
: Regularization parameter (default: 1.0)kernel
: “rbf” (default), “linear”, “poly”, “sigmoid”degree
: Polynomial degree (for “poly” kernel, default: 3)gamma
: Kernel coefficient (“scale” or “auto”, default: “scale”)coef0
: Independent term (for “poly” and “sigmoid”, default: 0.0)
Business Example: Credit risk classification with complex patterns:
Cell A1: =ML.CLASSIFICATION.SVM(1.0, "rbf", 3, "scale", 0.0)
Random Forest Classifier (Premium)
=ML.CLASSIFICATION.RANDOM_FOREST_CLF(n_estimators, criterion, max_depth, ...)
When to use:
- Non-linear relationships
- High accuracy requirements
- Feature importance needed
- Handles mixed data types well
Key Parameters:
n_estimators
: Number of trees (default: 100)criterion
: “gini” (default), “entropy”, or “log_loss”max_depth
: Maximum tree depth (NULL for unlimited)min_samples_split
: Minimum samples to split (default: 2)min_samples_leaf
: Minimum samples in leaf (default: 1)bootstrap
: TRUE/FALSE for bootstrap samples (default: TRUE)random_state
: For reproducibility
Business Example: Product defect detection with multiple sensor readings:
Cell A1: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(150, "gini", 15)
Clustering Models
Clustering models find natural groups in your data without predefined labels.
K-Means Clustering (Free)
=ML.CLUSTERING.KMEANS(n_clusters, init, n_init, max_iter, tol, random_state, algorithm)
When to use:
- Customer segmentation
- Product grouping
- Anomaly detection
- Pattern discovery
Parameters:
n_clusters
: Number of clusters to create (default: 8)init
: Initialization method (“k-means++” or “random”, default: “k-means++”)n_init
: Number of initialization runs (“auto” or integer, default: “auto”)max_iter
: Maximum iterations (default: 300)tol
: Convergence tolerance (default: 0.0001)random_state
: For reproducibilityalgorithm
: “lloyd” (default) or “elkan”
Business Example: Customer segmentation for targeted marketing:
Cell A1: =ML.CLUSTERING.KMEANS(5, "k-means++", "auto", 300, 0.0001, 42, "lloyd")
Cell A2: =ML.FIT(A1, customer_features)
Cell A3: =ML.PREDICT(A2, new_customers)
How to Choose Number of Clusters
Use the elbow method:
- Try different values of k (2-10)
- Calculate within-cluster sum of squares for each
- Plot and look for the “elbow” point
- Choose k at the elbow
Choosing the Right Model
Decision Framework
For Regression Problems:
- Start with Linear Regression - Simple baseline
- Try Ridge if many features - Handles multicollinearity
- Use Lasso for feature selection - Identifies important features
- Apply Elastic Net for balance - Best of both worlds
- Upgrade to Random Forest for complex patterns - Premium, highest accuracy
For Classification Problems:
- Start with Logistic Regression - Interpretable, fast
- Try SVM for complex boundaries - Non-linear patterns
- Use Random Forest for best accuracy - Premium, feature importance
For Clustering:
- Use K-Means - Simple, effective
- Experiment with different k values
- Validate clusters make business sense
Model Comparison Example
Compare multiple models on the same data:
# Load and split data
Cell A1: =ML.DATASETS.DIABETES()
Cell A2: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3,4,5,6,7,8,9})
Cell A3: =ML.DATA.SELECT_COLUMNS(A1, 10)
Cell A4: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(A2, A3, 0.2, 42)
# Train multiple models
Cell B1: =ML.REGRESSION.LINEAR(TRUE)
Cell B2: =ML.FIT(B1, INDEX(A4,1), INDEX(A4,3))
Cell B3: =ML.EVAL.SCORE(B2, INDEX(A4,2), INDEX(A4,4))
Cell C1: =ML.REGRESSION.RIDGE(1.0, TRUE)
Cell C2: =ML.FIT(C1, INDEX(A4,1), INDEX(A4,3))
Cell C3: =ML.EVAL.SCORE(C2, INDEX(A4,2), INDEX(A4,4))
Cell D1: =ML.REGRESSION.LASSO(0.1, TRUE)
Cell D2: =ML.FIT(D1, INDEX(A4,1), INDEX(A4,3))
Cell D3: =ML.EVAL.SCORE(D2, INDEX(A4,2), INDEX(A4,4))
Model Training Workflow
Standard Workflow
- Create Model: Choose algorithm with parameters
- Train Model: Use ML.FIT with training data
- Make Predictions: Use ML.PREDICT on new data
- Evaluate: Check performance with ML.EVAL.SCORE
Complete Example
# Step 1: Create model
Cell A1: =ML.CLASSIFICATION.SVM(1.0, "rbf")
# Step 2: Train model
Cell A2: =ML.FIT(A1, X_train, y_train)
# Step 3: Make predictions
Cell A3: =ML.PREDICT(A2, X_test)
# Step 4: Evaluate
Cell A4: =ML.EVAL.SCORE(A2, X_test, y_test)
Model Parameters Guide
Regularization Parameters
- C (SVM, Logistic): Inverse regularization (smaller = more regularization)
- alpha (Ridge, Lasso): Direct regularization (larger = more regularization)
- l1_ratio (Elastic Net): Balance between L1 and L2 (0 = Ridge, 1 = Lasso)
Tree-Based Parameters
- n_estimators: More trees = better accuracy but slower
- max_depth: Deeper trees = more complex patterns but risk overfitting
- min_samples_split/leaf: Higher values = simpler trees, less overfitting
Convergence Parameters
- max_iter: Maximum iterations before stopping
- tol: Convergence tolerance (smaller = more precise)
Performance Considerations
Model Speed (Fastest to Slowest)
- Linear/Logistic Regression
- Ridge/Lasso/Elastic Net
- K-Means
- SVM
- Random Forest
Memory Usage (Lowest to Highest)
- Linear models (Linear, Ridge, Lasso)
- Logistic Regression
- K-Means
- SVM
- Random Forest
Accuracy Potential (Generally)
- Random Forest (Premium) - Highest
- SVM - High for complex patterns
- Elastic Net - Good with regularization
- Ridge/Lasso - Good for linear relationships
- Linear/Logistic - Baseline
Free vs Premium Models
Free Models Include:
- Regression: Linear, Ridge, Lasso, Elastic Net
- Classification: Logistic, SVM
- Clustering: K-Means
Premium Models Include:
- Regression: Random Forest Regressor
- Classification: Random Forest Classifier
- Additional evaluation methods and hyperparameter tuning
Best Practices
1. Start Simple
Always begin with simple models (Linear/Logistic) as baselines before moving to complex ones.
2. Use Cross-Validation
For important decisions, use cross-validation (premium) to ensure model reliability.
3. Check Feature Importance
Use Lasso or Random Forest to identify which features matter most.
4. Monitor Overfitting
Compare training and test scores - large gaps indicate overfitting.
5. Consider Business Context
Choose models based on:
- Interpretability requirements
- Prediction accuracy needs
- Computational resources
- Update frequency
Next Steps
- Evaluate your models: Learn about Model Evaluation
- Try examples: See complete workflows in Examples
- Advanced techniques: Explore pipelines and preprocessing
- Function details: Check the Function Reference
Ready to evaluate your models? Continue to Model Evaluation →