Table of Contents
Data Functions Reference
Functions for loading, exploring, and manipulating data in FormulaML.
ML.DATASETS Namespace
Built-in datasets for learning and testing machine learning models.
ML.DATASETS.IRIS()
Loads the classic Iris flower classification dataset.
Syntax:
=ML.DATASETS.IRIS()
Parameters: None
Returns: DataFrame object handle containing 150 samples with 5 columns
Dataset Details:
- Samples: 150
- Features: 4 (sepal length, sepal width, petal length, petal width)
- Target: 1 (species: setosa, versicolor, virginica)
- Use Case: Multi-class classification
Example:
Cell A1: =ML.DATASETS.IRIS()
Result: <Dataset>
# Explore the dataset
Cell A2: =ML.DATA.INFO(A1)
Cell A3: =ML.DATA.SAMPLE(A1, 5)
ML.DATASETS.DIABETES()
Loads the diabetes regression dataset for predicting disease progression.
Syntax:
=ML.DATASETS.DIABETES()
Parameters: None
Returns: DataFrame object handle containing 442 samples with 11 columns
Dataset Details:
- Samples: 442
- Features: 10 (age, sex, BMI, blood pressure, and 6 blood serum measurements)
- Target: 1 (disease progression one year after baseline)
- Use Case: Regression analysis
Example:
Cell A1: =ML.DATASETS.DIABETES()
Result: <Dataset>
# Prepare for regression
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3,4,5,6,7,8,9}) # Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 10) # Target
ML.DATASETS.DIGITS()
Loads the handwritten digits classification dataset.
Syntax:
=ML.DATASETS.DIGITS()
Parameters: None
Returns: DataFrame object handle containing 1,797 samples with 65 columns
Dataset Details:
- Samples: 1,797
- Features: 64 (8x8 pixel values)
- Target: 1 (digit 0-9)
- Use Case: Multi-class image classification
Example:
Cell A1: =ML.DATASETS.DIGITS()
Result: <Dataset>
# Complex classification task
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63") # All pixel features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64) # Digit label
ML.DATASETS.OPENML() ⭐
Loads datasets from the OpenML repository (Premium feature).
Syntax:
=ML.DATASETS.OPENML(name, version)
Parameters:
name
(String, Required): Dataset name from OpenMLversion
(String, Optional): Dataset version (default: “active”)
Returns: DataFrame object handle
Example:
Cell A1: =ML.DATASETS.OPENML("wine-quality-red", "active")
Result: <Dataset>
# Load specific dataset
Cell B1: =ML.DATASETS.OPENML("credit-g")
Note: Visit https://www.openml.org/ to browse available datasets.
ML.DATA Namespace
Functions for data manipulation and exploration.
ML.DATA.CONVERT_TO_DF()
Converts Excel data range to a FormulaML DataFrame object.
Syntax:
=ML.DATA.CONVERT_TO_DF(data_range, has_header)
Parameters:
data_range
(Range, Required): Excel range containing datahas_header
(Boolean, Required): TRUE if first row contains column names
Returns: DataFrame object handle
Example:
# Convert Excel data with headers
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:E100, TRUE)
Result: <DataFrame>
# Convert without headers
Cell B1: =ML.DATA.CONVERT_TO_DF(Data!B2:F50, FALSE)
ML.DATA.INFO()
Returns detailed information about a DataFrame.
Syntax:
=ML.DATA.INFO(dataframe)
Parameters:
dataframe
(Object, Required): DataFrame object handle
Returns: DataFrame with columns: Column, Non-Null Count, Dtype
Example:
Cell A1: =ML.DATASETS.IRIS()
Cell A2: =ML.DATA.INFO(A1)
# Returns table showing:
# Column Non-Null Count Dtype
# sepal_length 150 float64
# sepal_width 150 float64
# ...
ML.DATA.DESCRIBE()
Generates descriptive statistics for numerical columns.
Syntax:
=ML.DATA.DESCRIBE(dataframe)
Parameters:
dataframe
(Object, Required): DataFrame object handle
Returns: DataFrame with statistics (count, mean, std, min, 25%, 50%, 75%, max)
Example:
Cell A1: =ML.DATASETS.DIABETES()
Cell A2: =ML.DATA.DESCRIBE(A1)
# Returns statistical summary:
# age sex bmi ...
# count 442.0 442.0 442.0
# mean 48.5 0.05 26.3
# std 13.1 0.04 4.4
# ...
ML.DATA.SAMPLE()
Returns a sample of rows from the DataFrame.
Syntax:
=ML.DATA.SAMPLE(dataframe, n, head, tail, seed, missing_values)
Parameters:
dataframe
(Object, Required): DataFrame object handlen
(Integer, Optional): Number of rows (default: 10, use -1 for all)head
(Boolean, Optional): Return first n rows (default: TRUE)tail
(Boolean, Optional): Return last n rows (default: FALSE)seed
(Integer, Optional): Random seed for reproducibilitymissing_values
(String, Optional): Excel error to show for missing (#N/A default)
Returns: DataFrame with sampled rows
Examples:
# First 5 rows
Cell A1: =ML.DATA.SAMPLE(dataframe, 5)
# Last 10 rows
Cell A2: =ML.DATA.SAMPLE(dataframe, 10, FALSE, TRUE)
# Random 20 rows with seed
Cell A3: =ML.DATA.SAMPLE(dataframe, 20, FALSE, FALSE, 42)
# All data
Cell A4: =ML.DATA.SAMPLE(dataframe, -1)
ML.DATA.SELECT_COLUMNS()
Selects specific columns from a DataFrame.
Syntax:
=ML.DATA.SELECT_COLUMNS(dataframe, columns)
Parameters:
dataframe
(Object, Required): DataFrame object handlecolumns
(Array/Integer/String, Required): Column indices or names
Returns: DataFrame object handle with selected columns
Examples:
# Select single column by index
Cell A1: =ML.DATA.SELECT_COLUMNS(dataframe, 0)
# Select multiple columns by indices
Cell A2: =ML.DATA.SELECT_COLUMNS(dataframe, {0,1,2,3})
# Select column range (as string)
Cell A3: =ML.DATA.SELECT_COLUMNS(dataframe, "0:3")
# Select by column names (if available)
Cell A4: =ML.DATA.SELECT_COLUMNS(dataframe, {"age","bmi","bp"})
ML.DATA.DROP_MISSING_ROWS()
Removes rows with missing values from DataFrame.
Syntax:
=ML.DATA.DROP_MISSING_ROWS(dataframe, how, thresh, subset)
Parameters:
dataframe
(Object, Required): DataFrame object handlehow
(String, Optional): “any” or “all” (default: “any”)thresh
(Integer, Optional): Minimum non-null values requiredsubset
(Array, Optional): Columns to check for missing values
Returns: DataFrame object handle without missing values
Examples:
# Drop rows with any missing values
Cell A1: =ML.DATA.DROP_MISSING_ROWS(dataframe)
# Drop rows where all values are missing
Cell A2: =ML.DATA.DROP_MISSING_ROWS(dataframe, "all")
# Keep rows with at least 3 non-null values
Cell A3: =ML.DATA.DROP_MISSING_ROWS(dataframe, , 3)
# Check specific columns only
Cell A4: =ML.DATA.DROP_MISSING_ROWS(dataframe, "any", , {"age","bmi"})
Common Patterns
Loading and Preparing Data
# Load dataset
Cell A1: =ML.DATASETS.IRIS()
# Explore structure
Cell A2: =ML.DATA.INFO(A1)
Cell A3: =ML.DATA.DESCRIBE(A1)
Cell A4: =ML.DATA.SAMPLE(A1, 10)
# Separate features and target
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3}) # Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 4) # Target
Converting Excel Data
# Prepare Excel range
# Assume data in columns A-E, rows 1-100, with headers
# Convert to DataFrame
Cell F1: =ML.DATA.CONVERT_TO_DF(A1:E100, TRUE)
# Clean missing values
Cell F2: =ML.DATA.DROP_MISSING_ROWS(F1)
# Select relevant columns
Cell F3: =ML.DATA.SELECT_COLUMNS(F2, {1,2,3})
Data Exploration Workflow
# Load data
Cell A1: =ML.DATASETS.DIABETES()
# Get overview
Cell B1: =ML.DATA.INFO(A1) # Structure
Cell B2: =ML.DATA.DESCRIBE(A1) # Statistics
Cell B3: =ML.DATA.SAMPLE(A1, 20) # Sample rows
# Check for patterns
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2}) # First 3 features
Cell C2: =ML.DATA.DESCRIBE(C1) # Stats for subset
Tips and Best Practices
-
Always explore data first
- Use INFO to understand structure
- Use DESCRIBE for statistical overview
- Use SAMPLE to inspect actual values
-
Handle missing values early
- Check INFO output for null counts
- Use DROP_MISSING_ROWS or imputation
-
Use consistent column selection
- Numerical indices are more reliable
- Document which columns represent what
-
Set random seeds for reproducibility
- Use same seed in SAMPLE for consistent results
- Important for debugging and testing
-
Memory considerations
- Use SAMPLE with small n for large datasets
- Select only necessary columns
Related Functions
- ML.PREPROCESSING.TRAIN_TEST_SPLIT() - Split data for training
- ML.IMPUTE.SIMPLE_IMPUTER() - Handle missing values
- ML.FIT() - Train models with data
- ML.PREDICT() - Make predictions on data