Advanced Features

Master advanced FormulaML capabilities including pipelines, preprocessing, dimensionality reduction, and feature engineering for sophisticated ML workflows.

Take your machine learning in Excel to the next level with FormulaML’s advanced capabilities. This guide covers sophisticated techniques for building production-ready ML workflows.

Machine Learning Pipelines

Pipelines chain multiple operations into a single, reusable workflow.

Creating Pipelines with ML.PIPELINE()

=ML.PIPELINE(step1, step2, step3, ...)

Combines preprocessing steps and a final model into a single reusable object.

Basic Pipeline Example

# Create components
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.REGRESSION.LINEAR()

# Combine into pipeline
Cell B1: =ML.PIPELINE(A1, A2)

# Use pipeline like a regular model
Cell C1: =ML.FIT(B1, X_train, y_train)
Cell C2: =ML.PREDICT(C1, X_test)

Advanced Pipeline with Multiple Steps

# Preprocessing steps
Cell A1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell A2: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A3: =ML.DIM_REDUCTION.PCA(5)
Cell A4: =ML.CLASSIFICATION.SVM()

# Create complex pipeline
Cell B1: =ML.PIPELINE(A1, A2, A3, A4)

# Train entire pipeline
Cell C1: =ML.FIT(B1, raw_data, labels)

Benefits of Pipelines

  • Reproducibility: Same preprocessing for train and test
  • Simplicity: Single object to manage
  • Prevent Leakage: Transformations are fit on training data only, in the right order
  • Production Ready: Deploy complete workflow

Data Preprocessing

Transform your data for optimal model performance.

Scaling and Normalization

Standard Scaler

=ML.PREPROCESSING.STANDARD_SCALER()

Standardizes features to mean=0, variance=1.

When to use:

  • Features have different scales
  • Algorithms sensitive to scale (SVM, Neural Networks)
  • Data is approximately normally distributed

Example:

Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.FIT(A1, training_data)
Cell A3: =ML.TRANSFORM(A2, test_data)

Min-Max Scaler

=ML.PREPROCESSING.MIN_MAX_SCALER()

Scales features to [0, 1] range.

When to use:

  • Need bounded values
  • Preserving zero values is important
  • Data is approximately uniformly distributed

Example:

Cell A1: =ML.PREPROCESSING.MIN_MAX_SCALER()
Cell A2: =ML.FIT_TRANSFORM(A1, data)

Robust Scaler

=ML.PREPROCESSING.ROBUST_SCALER()

Scales using the median and interquartile range (IQR), making it robust to outliers.

When to use:

  • Data contains outliers
  • Non-normal distributions
  • Need outlier-resistant scaling
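
Example (a sketch following the same fit/transform pattern as the other scalers; data_with_outliers is a placeholder range):

Cell A1: =ML.PREPROCESSING.ROBUST_SCALER()
Cell A2: =ML.FIT_TRANSFORM(A1, data_with_outliers)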

Encoding Categorical Variables

One-Hot Encoder

=ML.PREPROCESSING.ONE_HOT_ENCODER(handle_unknown)

Parameters:

  • handle_unknown: "error" or "ignore"

Converts categories to binary columns.

Example:

# For categories: ["red", "blue", "green"]
# Creates columns: is_red, is_blue, is_green

Cell A1: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
Cell A2: =ML.FIT_TRANSFORM(A1, categorical_data)

Ordinal Encoder

=ML.PREPROCESSING.ORDINAL_ENCODER(handle_unknown)

Assigns integer values to categories.

When to use:

  • Natural ordering exists (e.g., "low", "medium", "high")
  • Tree-based models
  • Memory constraints
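
Example (a sketch mirroring the one-hot example above; ordinal_data is a placeholder range, and the exact integer assigned to each category may depend on category order):

# For categories: ["low", "medium", "high"]
# Produces one integer code per category, e.g., 0, 1, 2

Cell A1: =ML.PREPROCESSING.ORDINAL_ENCODER()
Cell A2: =ML.FIT_TRANSFORM(A1, ordinal_data)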

Handling Missing Data

Simple Imputer

=ML.IMPUTE.SIMPLE_IMPUTER(strategy, fill_value)

Strategies:

  • "mean": Replace with column mean
  • "median": Replace with column median
  • "most_frequent": Replace with mode
  • "constant": Replace with fill_value

Imputation Examples

# Mean imputation
Cell A1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell A2: =ML.FIT_TRANSFORM(A1, data_with_missing)

# Constant value imputation
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("constant", 0)
Cell B2: =ML.FIT_TRANSFORM(B1, data_with_missing)

# Mode imputation for categories
Cell C1: =ML.IMPUTE.SIMPLE_IMPUTER("most_frequent")
Cell C2: =ML.FIT_TRANSFORM(C1, categorical_data)

Imputation Strategy Guide

  • Numerical: Use mean or median
  • Categorical: Use most_frequent
  • Time Series: Forward fill or interpolation
  • Known Default: Use constant

Dimensionality Reduction

Reduce feature count while preserving information.

Principal Component Analysis (PCA)

=ML.DIM_REDUCTION.PCA(n_components, whiten, svd_solver, ...)

Key Parameters:

  • n_components: Number of components to keep
  • whiten: TRUE for uncorrelated outputs
  • svd_solver: "auto", "full", "arpack", "randomized"

PCA Example

# Reduce to 3 dimensions
Cell A1: =ML.DIM_REDUCTION.PCA(3)
Cell A2: =ML.FIT(A1, high_dim_data)
Cell A3: =ML.TRANSFORM(A2, high_dim_data)

# View explained variance
Cell B1: =ML.DIM_REDUCTION.PCA.RESULTS(A2)

When to Use PCA

  • Too many features (>50)
  • Multicollinearity issues
  • Visualization needs (reduce to 2-3D)
  • Noise reduction
  • Speed up training

Kernel PCA (Premium) ⭐

=ML.DIM_REDUCTION.KERNEL_PCA(n_components, kernel, degree, gamma, coef0)

Kernels:

  • "linear": Standard PCA
  • "poly": Polynomial features
  • "rbf": Radial basis function
  • "sigmoid": Sigmoid kernel
  • "cosine": Cosine similarity

Example:

# Non-linear dimensionality reduction: RBF kernel with gamma = 0.1
# (the degree argument is skipped; it applies only to the "poly" kernel)
Cell A1: =ML.DIM_REDUCTION.KERNEL_PCA(2, "rbf", , 0.1)
Cell A2: =ML.FIT_TRANSFORM(A1, complex_data)

Feature Selection

Choose the most informative features.

Select Percentile

=ML.FEATURE_SELECTION.SELECT_PERCENTILE(score_func, percentile)

Parameters:

  • score_func: "chi2" for classification
  • percentile: Top percentage to keep (e.g., 20)

Feature Selection Example

# Keep top 30% of features
Cell A1: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 30)
Cell A2: =ML.FIT(A1, X_train, y_train)
Cell A3: =ML.TRANSFORM(A2, X_test)

Feature Selection Strategies

  1. Statistical Tests: Chi-squared, ANOVA
  2. Model-Based: Lasso, Random Forest importance
  3. Iterative: Forward/backward selection
  4. Domain Knowledge: Business understanding

Column Transformers

Apply different transformations to different columns.

Creating Column Transformers

=ML.COMPOSE.COLUMN_TRANSFORMER(transformer, columns)

Column Transformer Example

# Different preprocessing for different columns
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.COMPOSE.COLUMN_TRANSFORMER(A1, {0,1,2})

Cell B1: =ML.PREPROCESSING.ONE_HOT_ENCODER()
Cell B2: =ML.COMPOSE.COLUMN_TRANSFORMER(B1, {3,4})

# Combine transformers
Cell C1: =ML.COMPOSE.DATA_TRANSFORMER(A2, B2)

Advanced Column Transformation

# Numerical columns: scale
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.COMPOSE.COLUMN_TRANSFORMER(A1, "age,salary,experience")

# Categorical columns: encode
Cell B1: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
Cell B2: =ML.COMPOSE.COLUMN_TRANSFORMER(B1, "department,location")

# Missing values: impute
Cell C1: =ML.IMPUTE.SIMPLE_IMPUTER("median")
Cell C2: =ML.COMPOSE.COLUMN_TRANSFORMER(C1, "bonus,commission")

# Combine all
Cell D1: =ML.COMPOSE.DATA_TRANSFORMER(A2, B2, C2)

Model Inspection (Premium) ⭐

Decision Boundaries

=ML.INSPECT.DECISION_BOUNDARY(model, X, response_method, ...)

Visualize classifier decision boundaries.

Parameters:

  • model: Trained classifier
  • X: Feature data
  • response_method: "predict" or "decision_function"
  • mesh_step: Grid resolution
  • feature_indices: Which 2 features to plot

Example:

# Train classifier
Cell A1: =ML.CLASSIFICATION.SVM("rbf")
Cell A2: =ML.FIT(A1, X_train, y_train)

# Get decision boundary
Cell B1: =ML.INSPECT.DECISION_BOUNDARY(A2, X_test, "predict", 0.05, , {0,1})

Advanced Workflows

Complete Preprocessing Pipeline

# Step 1: Handle missing values
Cell A1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")

# Step 2: Scale features
Cell A2: =ML.PREPROCESSING.STANDARD_SCALER()

# Step 3: Reduce dimensions
Cell A3: =ML.DIM_REDUCTION.PCA(10)

# Step 4: Select best features ("chi2" requires non-negative inputs, so
# this step may need to run before scaling/PCA or use another score function)
Cell A4: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 50)

# Step 5: Model
Cell A5: =ML.CLASSIFICATION.SVM()

# Combine into pipeline
Cell B1: =ML.PIPELINE(A1, A2, A3, A4, A5)

# Train complete pipeline
Cell C1: =ML.FIT(B1, raw_data, labels)

Production-Ready Template

# Data validation
Cell A1: =ML.DATA.INFO(raw_data)
Cell A2: =ML.DATA.DESCRIBE(raw_data)

# Preprocessing pipeline
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("median")
Cell B2: =ML.PREPROCESSING.ROBUST_SCALER()
Cell B3: =ML.DIM_REDUCTION.PCA(0.95)  # Keep 95% variance

# Model with preprocessing
Cell C1: =ML.REGRESSION.RANDOM_FOREST_REG(200)
Cell C2: =ML.PIPELINE(B1, B2, B3, C1)

# Training with cross-validation
Cell D1: =ML.EVAL.CV_SCORE(C2, X, y, 5, "r2")
Cell D2: =AVERAGE(D1#)  # Average of the spilled CV scores

# Final training
Cell E1: =ML.FIT(C2, all_data, all_labels)

# Deployment ready
Cell F1: =ML.PREDICT(E1, new_data)

Performance Optimization

Tips for Large Datasets

  1. Sample First

    =ML.DATA.SAMPLE(large_data, 10000, , , 42)
    
  2. Reduce Dimensions Early

    =ML.DIM_REDUCTION.PCA(20)  # Before modeling
    
  3. Use Efficient Algorithms

    • Linear models for baseline
    • Avoid deep trees
    • Limit iterations
  4. Batch Processing (see the sketch after this list)

    • Split data into chunks
    • Process separately
    • Combine results
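
A sketch of the batch pattern using Excel 365's TAKE/DROP/VSTACK (the trained model in E1 and the 5,000-row chunk size are illustrative, and ML.PREDICT is assumed to spill its results):

# Split new data into two chunks
Cell A1: =TAKE(new_data, 5000)
Cell A2: =DROP(new_data, 5000)

# Predict each chunk separately
Cell B1: =ML.PREDICT(E1, A1#)
Cell B2: =ML.PREDICT(E1, A2#)

# Recombine predictions
Cell C1: =VSTACK(B1#, B2#)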

Memory Management

  1. Select Only Needed Columns

    =ML.DATA.SELECT_COLUMNS(data, {1,3,5,7})
    
  2. Drop Missing Rows

    =ML.DATA.DROP_MISSING_ROWS(data)
    
  3. Use Sparse Representations

    • One-hot encoding creates sparse data
    • Some algorithms optimize for sparsity

Best Practices

1. Pipeline Design

  • Order matters: impute → scale → reduce → model
  • Test pipeline on small data first
  • Document each step’s purpose

2. Preprocessing Choices

  • Scale after splitting: fit scalers on the training split only (see the sketch below)
  • Impute using statistics computed from training data only
  • Keep preprocessing identical between training and prediction
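
A minimal sketch of the leakage-safe pattern using the fit/transform calls shown earlier (X_train and X_test are placeholder ranges):

# Learn scaling parameters from training data only
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.FIT(A1, X_train)

# Apply the same learned parameters to both splits
Cell A3: =ML.TRANSFORM(A2, X_train)
Cell A4: =ML.TRANSFORM(A2, X_test)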

3. Feature Engineering

  • Create domain-specific features
  • Combine existing features
  • Use business knowledge

4. Validation Strategy

  • Always validate preprocessing
  • Check intermediate outputs
  • Monitor for data drift

Common Advanced Patterns

Pattern 1: Ensemble Pipeline

# Multiple models in ensemble
Model1: Linear + Standard Scaler
Model2: SVM + Min-Max Scaler
Model3: Random Forest (no scaling needed)
Average predictions for final result
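
A sketch of this pattern (range names are illustrative, and predictions are assumed to spill so they can be averaged element-wise):

# Model 1: scaled linear regression
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.REGRESSION.LINEAR()
Cell A3: =ML.PIPELINE(A1, A2)
Cell A4: =ML.FIT(A3, X_train, y_train)

# Model 2: random forest, no scaling needed
Cell B1: =ML.REGRESSION.RANDOM_FOREST_REG(200)
Cell B2: =ML.FIT(B1, X_train, y_train)

# Average the two prediction arrays
Cell C1: =ML.PREDICT(A4, X_test)
Cell C2: =ML.PREDICT(B2, X_test)
Cell C3: =(C1# + C2#) / 2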

Pattern 2: Feature Union

# Combine different feature sets
Original features + PCA components + Engineered features
Concatenate all for final feature set
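
This guide doesn't show a dedicated feature-union function, but spilled outputs can be concatenated with Excel's HSTACK; a sketch, assuming ML.FIT_TRANSFORM spills its result (original_features is a placeholder range):

# Derive PCA components from the original features
Cell A1: =ML.DIM_REDUCTION.PCA(3)
Cell A2: =ML.FIT_TRANSFORM(A1, original_features)

# Place original and derived features side by side
Cell B1: =HSTACK(original_features, A2#)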

Pattern 3: Nested Cross-Validation

# Outer loop: Model selection
# Inner loop: Hyperparameter tuning
Most robust but computationally expensive
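
A full nested loop is hard to express in single formulas, but the outer model-selection step can be sketched with the CV helper shown earlier (pipeline_1 and pipeline_2 are placeholder cells holding candidate pipelines, each of which would be tuned in an inner loop):

# Compare candidate pipelines by cross-validated score
Cell A1: =ML.EVAL.CV_SCORE(pipeline_1, X, y, 5, "r2")
Cell A2: =ML.EVAL.CV_SCORE(pipeline_2, X, y, 5, "r2")

# Average each spilled score array and keep the better pipeline
Cell B1: =AVERAGE(A1#)
Cell B2: =AVERAGE(A2#)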

Next Steps

Ready to solve common issues? Continue to Tips & Troubleshooting