Advanced Features
Take your machine learning in Excel to the next level with FormulaML’s advanced capabilities. This guide covers sophisticated techniques for building production-ready ML workflows.
Machine Learning Pipelines
Pipelines chain multiple operations into a single, reusable workflow.
Creating Pipelines with ML.PIPELINE()
=ML.PIPELINE(step1, step2, step3, ...)
Combines preprocessing steps and models into one object.
Basic Pipeline Example
# Create components
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.REGRESSION.LINEAR()
# Combine into pipeline
Cell B1: =ML.PIPELINE(A1, A2)
# Use pipeline like a regular model
Cell C1: =ML.FIT(B1, X_train, y_train)
Cell C2: =ML.PREDICT(C1, X_test)
Advanced Pipeline with Multiple Steps
# Preprocessing steps
Cell A1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell A2: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A3: =ML.DIM_REDUCTION.PCA(5)
Cell A4: =ML.CLASSIFICATION.SVM()
# Create complex pipeline
Cell B1: =ML.PIPELINE(A1, A2, A3, A4)
# Train entire pipeline
Cell C1: =ML.FIT(B1, raw_data, labels)
Benefits of Pipelines
- Reproducibility: Same preprocessing for train and test
- Simplicity: Single object to manage
- Prevent Leakage: Proper transformation order
- Production Ready: Deploy complete workflow
Data Preprocessing
Transform your data for optimal model performance.
Scaling and Normalization
Standard Scaler
=ML.PREPROCESSING.STANDARD_SCALER()
Standardizes features to mean=0, variance=1.
When to use:
- Features have different scales
- Algorithms sensitive to scale (SVM, Neural Networks)
- Normal distribution assumed
Example:
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.FIT(A1, training_data)
Cell A3: =ML.TRANSFORM(A2, test_data)
Min-Max Scaler
=ML.PREPROCESSING.MIN_MAX_SCALER()
Scales features to [0, 1] range.
When to use:
- Need bounded values
- Preserving zero values important
- Uniform distribution
Example:
Cell A1: =ML.PREPROCESSING.MIN_MAX_SCALER()
Cell A2: =ML.FIT_TRANSFORM(A1, data)
Robust Scaler
=ML.PREPROCESSING.ROBUST_SCALER()
Uses median and IQR, robust to outliers.
When to use:
- Data contains outliers
- Non-normal distributions
- Need outlier-resistant scaling
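Example (following the pattern of the scalers above; the range name data_with_outliers is a placeholder):
Cell A1: =ML.PREPROCESSING.ROBUST_SCALER()
Cell A2: =ML.FIT_TRANSFORM(A1, data_with_outliers)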
Encoding Categorical Variables
One-Hot Encoder
=ML.PREPROCESSING.ONE_HOT_ENCODER(handle_unknown)
Parameters:
- handle_unknown: "error" or "ignore"
Converts categories to binary columns.
Example:
# For categories: ["red", "blue", "green"]
# Creates columns: is_red, is_blue, is_green
Cell A1: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
Cell A2: =ML.FIT_TRANSFORM(A1, categorical_data)
Ordinal Encoder
=ML.PREPROCESSING.ORDINAL_ENCODER(handle_unknown)
Assigns integer values to categories.
When to use:
- Natural ordering exists (e.g., “low”, “medium”, “high”)
- Tree-based models
- Memory constraints
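Example (following the one-hot encoder pattern above; the range name ordered_categories is a placeholder):
# For categories with a natural order: ["low", "medium", "high"]
Cell A1: =ML.PREPROCESSING.ORDINAL_ENCODER()
Cell A2: =ML.FIT_TRANSFORM(A1, ordered_categories)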
Handling Missing Data
Simple Imputer
=ML.IMPUTE.SIMPLE_IMPUTER(strategy, fill_value)
Strategies:
"mean"
: Replace with column mean"median"
: Replace with column median"most_frequent"
: Replace with mode"constant"
: Replace with fill_value
Imputation Examples
# Mean imputation
Cell A1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell A2: =ML.FIT_TRANSFORM(A1, data_with_missing)
# Constant value imputation
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("constant", 0)
Cell B2: =ML.FIT_TRANSFORM(B1, data_with_missing)
# Mode imputation for categories
Cell C1: =ML.IMPUTE.SIMPLE_IMPUTER("most_frequent")
Cell C2: =ML.FIT_TRANSFORM(C1, categorical_data)
Imputation Strategy Guide
- Numerical: Use mean or median
- Categorical: Use most_frequent
- Time Series: Forward fill or interpolation
- Known Default: Use constant
Dimensionality Reduction
Reduce feature count while preserving information.
Principal Component Analysis (PCA)
=ML.DIM_REDUCTION.PCA(n_components, whiten, svd_solver, ...)
Key Parameters:
- n_components: Number of components to keep
- whiten: TRUE for uncorrelated outputs
- svd_solver: "auto", "full", "arpack", "randomized"
PCA Example
# Reduce to 3 dimensions
Cell A1: =ML.DIM_REDUCTION.PCA(3)
Cell A2: =ML.FIT(A1, high_dim_data)
Cell A3: =ML.TRANSFORM(A2, high_dim_data)
# View explained variance
Cell B1: =ML.DIM_REDUCTION.PCA.RESULTS(A2)
When to Use PCA
- Too many features (>50)
- Multicollinearity issues
- Visualization needs (reduce to 2-3D)
- Noise reduction
- Speed up training
Kernel PCA (Premium) ⭐
=ML.DIM_REDUCTION.KERNEL_PCA(n_components, kernel, degree, gamma, coef0)
Kernels:
"linear"
: Standard PCA"poly"
: Polynomial features"rbf"
: Radial basis function"sigmoid"
: Sigmoid kernel"cosine"
: Cosine similarity
Example:
# Non-linear dimensionality reduction
Cell A1: =ML.DIM_REDUCTION.KERNEL_PCA(2, "rbf", , 0.1)
Cell A2: =ML.FIT_TRANSFORM(A1, complex_data)
Feature Selection
Choose the most informative features.
Select Percentile
=ML.FEATURE_SELECTION.SELECT_PERCENTILE(score_func, percentile)
Parameters:
- score_func: "chi2" for classification
- percentile: Top percentage to keep (e.g., 20)
Feature Selection Example
# Keep top 30% of features
Cell A1: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 30)
Cell A2: =ML.FIT(A1, X_train, y_train)
Cell A3: =ML.TRANSFORM(A2, X_test)
Feature Selection Strategies
- Statistical Tests: Chi-squared, ANOVA
- Model-Based: Lasso, Random Forest importance
- Iterative: Forward/backward selection
- Domain Knowledge: Business understanding
Column Transformers
Apply different transformations to different columns.
Creating Column Transformers
=ML.COMPOSE.COLUMN_TRANSFORMER(transformer, columns)
Column Transformer Example
# Different preprocessing for different columns
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.COMPOSE.COLUMN_TRANSFORMER(A1, {0,1,2})
Cell B1: =ML.PREPROCESSING.ONE_HOT_ENCODER()
Cell B2: =ML.COMPOSE.COLUMN_TRANSFORMER(B1, {3,4})
# Combine transformers
Cell C1: =ML.COMPOSE.DATA_TRANSFORMER(A2, B2)
Advanced Column Transformation
# Numerical columns: scale
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.COMPOSE.COLUMN_TRANSFORMER(A1, "age,salary,experience")
# Categorical columns: encode
Cell B1: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
Cell B2: =ML.COMPOSE.COLUMN_TRANSFORMER(B1, "department,location")
# Missing values: impute
Cell C1: =ML.IMPUTE.SIMPLE_IMPUTER("median")
Cell C2: =ML.COMPOSE.COLUMN_TRANSFORMER(C1, "bonus,commission")
# Combine all
Cell D1: =ML.COMPOSE.DATA_TRANSFORMER(A2, B2, C2)
Model Inspection (Premium) ⭐
Decision Boundaries
=ML.INSPECT.DECISION_BOUNDARY(model, X, response_method, ...)
Visualize classifier decision boundaries.
Parameters:
- model: Trained classifier
- X: Feature data
- response_method: "predict" or "decision_function"
- mesh_step: Grid resolution
- feature_indices: Which two features to plot
Example:
# Train classifier
Cell A1: =ML.CLASSIFICATION.SVM("rbf")
Cell A2: =ML.FIT(A1, X_train, y_train)
# Get decision boundary
Cell B1: =ML.INSPECT.DECISION_BOUNDARY(A2, X_test, "predict", 0.05, , {0,1})
Advanced Workflows
Complete Preprocessing Pipeline
# Step 1: Handle missing values
Cell A1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
# Step 2: Scale features
Cell A2: =ML.PREPROCESSING.STANDARD_SCALER()
# Step 3: Reduce dimensions
Cell A3: =ML.DIM_REDUCTION.PCA(10)
# Step 4: Select best features
Cell A4: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 50)
# Step 5: Model
Cell A5: =ML.CLASSIFICATION.SVM()
# Combine into pipeline
Cell B1: =ML.PIPELINE(A1, A2, A3, A4, A5)
# Train complete pipeline
Cell C1: =ML.FIT(B1, raw_data, labels)
Production-Ready Template
# Data validation
Cell A1: =ML.DATA.INFO(raw_data)
Cell A2: =ML.DATA.DESCRIBE(raw_data)
# Preprocessing pipeline
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("median")
Cell B2: =ML.PREPROCESSING.ROBUST_SCALER()
Cell B3: =ML.DIM_REDUCTION.PCA(0.95) # Keep 95% variance
# Model with preprocessing
Cell C1: =ML.REGRESSION.RANDOM_FOREST_REG(200)
Cell C2: =ML.PIPELINE(B1, B2, B3, C1)
# Training with cross-validation
Cell D1: =ML.EVAL.CV_SCORE(C2, X, y, 5, "r2")
Cell D2: =AVERAGE(D1#) # Average of the spilled CV scores
# Final training
Cell E1: =ML.FIT(C2, all_data, all_labels)
# Deployment ready
Cell F1: =ML.PREDICT(E1, new_data)
Performance Optimization
Tips for Large Datasets
- Sample First
  =ML.DATA.SAMPLE(large_data, 10000, , , 42)
- Reduce Dimensions Early
  =ML.DIM_REDUCTION.PCA(20) # Before modeling
- Use Efficient Algorithms
  - Linear models for baseline
  - Avoid deep trees
  - Limit iterations
- Batch Processing
  - Split data into chunks
  - Process each chunk separately
  - Combine results (see the sketch after this list)
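A minimal sketch of the batch-processing tip, assuming a trained model or pipeline in cell E1 (as in the production-ready template) and that ML.PREDICT spills its output; chunk_1 and chunk_2 are placeholder ranges, and VSTACK is a native Excel function, not part of FormulaML:
# Predict each chunk of rows separately
Cell F1: =ML.PREDICT(E1, chunk_1)
Cell G1: =ML.PREDICT(E1, chunk_2)
# Stack the prediction sets back into one column
Cell H1: =VSTACK(F1#, G1#)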
Memory Management
- Select Only Needed Columns
  =ML.DATA.SELECT_COLUMNS(data, {1,3,5,7})
- Drop Missing Rows
  =ML.DATA.DROP_MISSING_ROWS(data)
- Use Sparse Representations
  - One-hot encoding creates sparse data
  - Some algorithms optimize for sparsity
Best Practices
1. Pipeline Design
- Order matters: impute → scale → reduce → model
- Test pipeline on small data first
- Document each step’s purpose
2. Preprocessing Choices
- Scale after splitting (prevent leakage)
- Impute based on training data only
- Keep preprocessing consistent
3. Feature Engineering
- Create domain-specific features
- Combine existing features
- Use business knowledge
4. Validation Strategy
- Always validate preprocessing
- Check intermediate outputs
- Monitor for data drift
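One way to check an intermediate output, reusing the data functions from the production-ready template; this sketch assumes the transformed result spills so ML.DATA.DESCRIBE can reference it with the # operator, and X_train is a placeholder range:
# Summarize what a fitted preprocessing step actually produces
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.FIT_TRANSFORM(A1, X_train)
Cell A3: =ML.DATA.DESCRIBE(A2#)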
Common Advanced Patterns
Pattern 1: Ensemble Pipeline
# Multiple models in ensemble
Model1: Linear + Standard Scaler
Model2: SVM + Min-Max Scaler
Model3: Random Forest (no scaling needed)
Average predictions for final result
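A minimal sketch of this pattern using only functions covered in this guide; it builds two of the three models for brevity, assumes ML.PREDICT spills its results so the # operator can reference them, and uses placeholder ranges X_train, y_train, and X_test:
# Pipeline 1: Standard Scaler + Linear Regression
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.REGRESSION.LINEAR()
Cell A3: =ML.PIPELINE(A1, A2)
# Model 2: Random Forest (no scaling needed)
Cell B1: =ML.REGRESSION.RANDOM_FOREST_REG(200)
# Train both on the same data
Cell C1: =ML.FIT(A3, X_train, y_train)
Cell C2: =ML.FIT(B1, X_train, y_train)
# Predict with each, then average element-wise with native Excel
Cell D1: =ML.PREDICT(C1, X_test)
Cell D2: =ML.PREDICT(C2, X_test)
Cell E1: =(D1# + D2#) / 2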
Pattern 2: Feature Union
# Combine different feature sets
Original features + PCA components + Engineered features
Concatenate all for final feature set
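A minimal sketch of this pattern; this guide does not document a dedicated feature-union function, so the concatenation uses Excel's native HSTACK, original_features is a placeholder range, and the sketch assumes the PCA output spills:
# Compute PCA components from the original features
Cell A1: =ML.DIM_REDUCTION.PCA(3)
Cell A2: =ML.FIT_TRANSFORM(A1, original_features)
# Concatenate original features and PCA components side by side
Cell B1: =HSTACK(original_features, A2#)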
Pattern 3: Nested Cross-Validation
# Outer loop: Model selection
# Inner loop: Hyperparameter tuning
Most robust but computationally expensive
Next Steps
- Practice with examples: Try Pipeline Tutorial
- Optimize models: Learn Hyperparameter Tuning
- Reference details: Check Function Reference
- Troubleshooting: See Tips & Troubleshooting
Ready to solve common issues? Continue to Tips & Troubleshooting →