Advanced Features
Take your machine learning in Excel to the next level with FormulaML’s advanced capabilities. This guide covers sophisticated techniques for building production-ready ML workflows.
Machine Learning Pipelines
Pipelines chain multiple operations into a single, reusable workflow.
Creating Pipelines with ML.PIPELINE()
=ML.PIPELINE(step1, step2, step3, ...)
Combines preprocessing steps and models into one object.
Basic Pipeline Example
# Create components
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.REGRESSION.LINEAR()
# Combine into pipeline
Cell B1: =ML.PIPELINE(A1, A2)
# Use pipeline like a regular model
Cell C1: =ML.FIT(B1, X_train, y_train)
Cell C2: =ML.PREDICT(C1, X_test)
Advanced Pipeline with Multiple Steps
# Preprocessing steps
Cell A1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell A2: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A3: =ML.DIM_REDUCTION.PCA(5)
Cell A4: =ML.CLASSIFICATION.SVM()
# Create complex pipeline
Cell B1: =ML.PIPELINE(A1, A2, A3, A4)
# Train entire pipeline
Cell C1: =ML.FIT(B1, raw_data, labels)
Benefits of Pipelines
- Reproducibility: Same preprocessing for train and test
- Simplicity: Single object to manage
- Prevent Leakage: Proper transformation order
- Production Ready: Deploy complete workflow
Data Preprocessing
Transform your data for optimal model performance.
Scaling and Normalization
Standard Scaler
=ML.PREPROCESSING.STANDARD_SCALER()
Standardizes features to mean=0, variance=1.
When to use:
- Features have different scales
- Algorithms sensitive to scale (SVM, Neural Networks)
- Normal distribution assumed
Example:
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.FIT(A1, training_data)
Cell A3: =ML.TRANSFORM(A2, test_data)
Min-Max Scaler
=ML.PREPROCESSING.MIN_MAX_SCALER()
Scales features to [0, 1] range.
When to use:
- Need bounded values
- Preserving zero values important
- Uniform distribution
Example:
Cell A1: =ML.PREPROCESSING.MIN_MAX_SCALER()
Cell A2: =ML.FIT_TRANSFORM(A1, data)
Robust Scaler
=ML.PREPROCESSING.ROBUST_SCALER()
Uses median and IQR, robust to outliers.
When to use:
- Data contains outliers
- Non-normal distributions
- Need outlier-resistant scaling
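Example (following the pattern of the scalers above; the range name data_with_outliers is a placeholder):
Cell A1: =ML.PREPROCESSING.ROBUST_SCALER()
Cell A2: =ML.FIT_TRANSFORM(A1, data_with_outliers)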
Encoding Categorical Variables
One-Hot Encoder
=ML.PREPROCESSING.ONE_HOT_ENCODER(handle_unknown)
Parameters:
- handle_unknown: "error" or "ignore"
Converts categories to binary columns.
Example:
# For categories: ["red", "blue", "green"]
# Creates columns: is_red, is_blue, is_green
Cell A1: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
Cell A2: =ML.FIT_TRANSFORM(A1, categorical_data)
Ordinal Encoder
=ML.PREPROCESSING.ORDINAL_ENCODER(handle_unknown)
Assigns integer values to categories.
When to use:
- Natural ordering exists (e.g., “low”, “medium”, “high”)
- Tree-based models
- Memory constraints
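Example (following the one-hot encoder pattern above; the range name ordered_categories is a placeholder):
# For categories with a natural order: ["low", "medium", "high"]
Cell A1: =ML.PREPROCESSING.ORDINAL_ENCODER()
Cell A2: =ML.FIT_TRANSFORM(A1, ordered_categories)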
Handling Missing Data
Simple Imputer
=ML.IMPUTE.SIMPLE_IMPUTER(strategy, fill_value)
Strategies:
"mean"
: Replace with column mean"median"
: Replace with column median"most_frequent"
: Replace with mode"constant"
: Replace with fill_value
Imputation Examples
# Mean imputation
Cell A1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell A2: =ML.FIT_TRANSFORM(A1, data_with_missing)
# Constant value imputation
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("constant", 0)
Cell B2: =ML.FIT_TRANSFORM(B1, data_with_missing)
# Mode imputation for categories
Cell C1: =ML.IMPUTE.SIMPLE_IMPUTER("most_frequent")
Cell C2: =ML.FIT_TRANSFORM(C1, categorical_data)
Imputation Strategy Guide
- Numerical: Use mean or median
- Categorical: Use most_frequent
- Time Series: Forward fill or interpolation
- Known Default: Use constant
Dimensionality Reduction
Reduce feature count while preserving information.
Principal Component Analysis (PCA)
=ML.DIM_REDUCTION.PCA(n_components, whiten, svd_solver, ...)
Key Parameters:
- n_components: Number of components to keep
- whiten: TRUE for uncorrelated outputs
- svd_solver: "auto", "full", "arpack", "randomized"
PCA Example
# Reduce to 3 dimensions
Cell A1: =ML.DIM_REDUCTION.PCA(3)
Cell A2: =ML.FIT(A1, high_dim_data)
Cell A3: =ML.TRANSFORM(A2, high_dim_data)
# View explained variance
Cell B1: =ML.DIM_REDUCTION.PCA.RESULTS(A2)
When to Use PCA
- Too many features (>50)
- Multicollinearity issues
- Visualization needs (reduce to 2-3D)
- Noise reduction
- Speed up training
Kernel PCA (Premium) ⭐
=ML.DIM_REDUCTION.KERNEL_PCA(n_components, kernel, degree, gamma, coef0)
Kernels:
"linear"
: Standard PCA"poly"
: Polynomial features"rbf"
: Radial basis function"sigmoid"
: Sigmoid kernel"cosine"
: Cosine similarity
Example:
# Non-linear dimensionality reduction
Cell A1: =ML.DIM_REDUCTION.KERNEL_PCA(2, "rbf", , 0.1)
Cell A2: =ML.FIT_TRANSFORM(A1, complex_data)
Feature Selection
Choose the most informative features.
Select Percentile
=ML.FEATURE_SELECTION.SELECT_PERCENTILE(score_func, percentile)
Parameters:
- score_func: "chi2" for classification
- percentile: Top percentage to keep (e.g., 20)
Feature Selection Example
# Keep top 30% of features
Cell A1: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 30)
Cell A2: =ML.FIT(A1, X_train, y_train)
Cell A3: =ML.TRANSFORM(A2, X_test)
Feature Selection Strategies
- Statistical Tests: Chi-squared, ANOVA
- Model-Based: Lasso, Random Forest importance
- Iterative: Forward/backward selection
- Domain Knowledge: Business understanding
Column Transformers
Apply different transformations to different columns.
Creating Column Transformers
=ML.COMPOSE.COLUMN_TRANSFORMER(transformer, columns)
Column Transformer Example
# Different preprocessing for different columns
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.COMPOSE.COLUMN_TRANSFORMER(A1, {0,1,2})
Cell B1: =ML.PREPROCESSING.ONE_HOT_ENCODER()
Cell B2: =ML.COMPOSE.COLUMN_TRANSFORMER(B1, {3,4})
# Combine transformers
Cell C1: =ML.COMPOSE.DATA_TRANSFORMER(A2, B2)
Advanced Column Transformation
# Numerical columns: scale
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.COMPOSE.COLUMN_TRANSFORMER(A1, "age,salary,experience")
# Categorical columns: encode
Cell B1: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
Cell B2: =ML.COMPOSE.COLUMN_TRANSFORMER(B1, "department,location")
# Missing values: impute
Cell C1: =ML.IMPUTE.SIMPLE_IMPUTER("median")
Cell C2: =ML.COMPOSE.COLUMN_TRANSFORMER(C1, "bonus,commission")
# Combine all
Cell D1: =ML.COMPOSE.DATA_TRANSFORMER(A2, B2, C2)
Model Inspection (Premium) ⭐
Decision Boundaries
=ML.INSPECT.DECISION_BOUNDARY(model, X, response_method, ...)
Visualize classifier decision boundaries.
Parameters:
- model: Trained classifier
- X: Feature data
- response_method: "predict" or "decision_function"
- mesh_step: Grid resolution
- feature_indices: Which two features to plot
Example:
# Train classifier
Cell A1: =ML.CLASSIFICATION.SVM("rbf")
Cell A2: =ML.FIT(A1, X_train, y_train)
# Get decision boundary
Cell B1: =ML.INSPECT.DECISION_BOUNDARY(A2, X_test, "predict", 0.05, , {0,1})
Advanced Workflows
Complete Preprocessing Pipeline
# Step 1: Handle missing values
Cell A1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
# Step 2: Scale features
Cell A2: =ML.PREPROCESSING.STANDARD_SCALER()
# Step 3: Reduce dimensions
Cell A3: =ML.DIM_REDUCTION.PCA(10)
# Step 4: Select best features
Cell A4: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 50)
# Step 5: Model
Cell A5: =ML.CLASSIFICATION.SVM()
# Combine into pipeline
Cell B1: =ML.PIPELINE(A1, A2, A3, A4, A5)
# Train complete pipeline
Cell C1: =ML.FIT(B1, raw_data, labels)
Production-Ready Template
# Data validation
Cell A1: =ML.DATA.INFO(raw_data)
Cell A2: =ML.DATA.DESCRIBE(raw_data)
# Preprocessing pipeline
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("median")
Cell B2: =ML.PREPROCESSING.ROBUST_SCALER()
Cell B3: =ML.DIM_REDUCTION.PCA(0.95) # Keep 95% variance
# Model with preprocessing
Cell C1: =ML.REGRESSION.RANDOM_FOREST_REG(200)
Cell C2: =ML.PIPELINE(B1, B2, B3, C1)
# Training with cross-validation
Cell D1: =ML.EVAL.CV_SCORE(C2, X, y, 5, "r2")
Cell D2: =AVERAGE(D1#) # Average of the spilled CV scores
# Final training
Cell E1: =ML.FIT(C2, all_data, all_labels)
# Deployment ready
Cell F1: =ML.PREDICT(E1, new_data)
Performance Optimization
Tips for Large Datasets
- Sample First
  =ML.DATA.SAMPLE(large_data, 10000, , , 42)
- Reduce Dimensions Early
  =ML.DIM_REDUCTION.PCA(20) # Before modeling
- Use Efficient Algorithms
  - Linear models for baseline
  - Avoid deep trees
  - Limit iterations
- Batch Processing
  - Split data into chunks
  - Process each chunk separately
  - Combine results (see the sketch after this list)
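A minimal sketch of the batch-processing tip, assuming a trained model or pipeline in cell E1 (as in the production-ready template) and that ML.PREDICT spills its output; chunk_1 and chunk_2 are placeholder ranges, and VSTACK is a native Excel function, not part of FormulaML:
# Predict each chunk of rows separately
Cell F1: =ML.PREDICT(E1, chunk_1)
Cell G1: =ML.PREDICT(E1, chunk_2)
# Stack the prediction sets back into one column
Cell H1: =VSTACK(F1#, G1#)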
Memory Management
- Select Only Needed Columns
  =ML.DATA.SELECT_COLUMNS(data, {1,3,5,7})
- Drop Missing Rows
  =ML.DATA.DROP_MISSING_ROWS(data)
- Use Sparse Representations
  - One-hot encoding creates sparse data
  - Some algorithms optimize for sparsity
Best Practices
1. Pipeline Design
- Order matters: impute → scale → reduce → model
- Test pipeline on small data first
- Document each step’s purpose
2. Preprocessing Choices
- Scale after splitting (prevent leakage)
- Impute based on training data only
- Keep preprocessing consistent
3. Feature Engineering
- Create domain-specific features
- Combine existing features
- Use business knowledge
4. Validation Strategy
- Always validate preprocessing
- Check intermediate outputs
- Monitor for data drift
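One way to check an intermediate output, reusing the data functions from the production-ready template; this sketch assumes the transformed result spills so ML.DATA.DESCRIBE can reference it with the # operator, and X_train is a placeholder range:
# Summarize what a fitted preprocessing step actually produces
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.FIT_TRANSFORM(A1, X_train)
Cell A3: =ML.DATA.DESCRIBE(A2#)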
Common Advanced Patterns
Pattern 1: Ensemble Pipeline
# Multiple models in ensemble
Model1: Linear + Standard Scaler
Model2: SVM + Min-Max Scaler
Model3: Random Forest (no scaling needed)
Average predictions for final result
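A minimal sketch of this pattern using only functions covered in this guide; it builds two of the three models for brevity, assumes ML.PREDICT spills its results so the # operator can reference them, and uses placeholder ranges X_train, y_train, and X_test:
# Pipeline 1: Standard Scaler + Linear Regression
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.REGRESSION.LINEAR()
Cell A3: =ML.PIPELINE(A1, A2)
# Model 2: Random Forest (no scaling needed)
Cell B1: =ML.REGRESSION.RANDOM_FOREST_REG(200)
# Train both on the same data
Cell C1: =ML.FIT(A3, X_train, y_train)
Cell C2: =ML.FIT(B1, X_train, y_train)
# Predict with each, then average element-wise with native Excel
Cell D1: =ML.PREDICT(C1, X_test)
Cell D2: =ML.PREDICT(C2, X_test)
Cell E1: =(D1# + D2#) / 2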
Pattern 2: Feature Union
# Combine different feature sets
Original features + PCA components + Engineered features
Concatenate all for final feature set
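A minimal sketch of this pattern; this guide does not document a dedicated feature-union function, so the concatenation uses Excel's native HSTACK, original_features is a placeholder range, and the sketch assumes the PCA output spills:
# Compute PCA components from the original features
Cell A1: =ML.DIM_REDUCTION.PCA(3)
Cell A2: =ML.FIT_TRANSFORM(A1, original_features)
# Concatenate original features and PCA components side by side
Cell B1: =HSTACK(original_features, A2#)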
Pattern 3: Nested Cross-Validation
# Outer loop: Model selection
# Inner loop: Hyperparameter tuning
Most robust but computationally expensive
Next Steps
- Practice with examples: Try Pipeline Tutorial
- Optimize models: Learn Hyperparameter Tuning
- Reference details: Check Function Reference
- Troubleshooting: See Tips & Troubleshooting
Ready to solve common issues? Continue to Tips & Troubleshooting →