Data Functions Reference

Complete reference for FormulaML data handling functions including datasets, data manipulation, and exploration.

Data Functions Reference

Functions for loading, exploring, and manipulating data in FormulaML.

ML.DATASETS Namespace

Built-in datasets for learning and testing machine learning models.

ML.DATASETS.IRIS()

Loads the classic Iris flower classification dataset.

Syntax:

=ML.DATASETS.IRIS()

Parameters: None

Returns: DataFrame object handle containing 150 samples with 5 columns

Dataset Details:

  • Samples: 150
  • Features: 4 (sepal length, sepal width, petal length, petal width)
  • Target: 1 (species: setosa, versicolor, virginica)
  • Use Case: Multi-class classification

Example:

Cell A1: =ML.DATASETS.IRIS()
Result: <Dataset>

# Explore the dataset
Cell A2: =ML.DATA.INFO(A1)
Cell A3: =ML.DATA.SAMPLE(A1, 5)

ML.DATASETS.DIABETES()

Loads the diabetes regression dataset for predicting disease progression.

Syntax:

=ML.DATASETS.DIABETES()

Parameters: None

Returns: DataFrame object handle containing 442 samples with 11 columns

Dataset Details:

  • Samples: 442
  • Features: 10 (age, sex, BMI, blood pressure, and 6 blood serum measurements)
  • Target: 1 (disease progression one year after baseline)
  • Use Case: Regression analysis

Example:

Cell A1: =ML.DATASETS.DIABETES()
Result: <Dataset>

# Prepare for regression
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3,4,5,6,7,8,9})  # Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 10)                      # Target

ML.DATASETS.DIGITS()

Loads the handwritten digits classification dataset.

Syntax:

=ML.DATASETS.DIGITS()

Parameters: None

Returns: DataFrame object handle containing 1,797 samples with 65 columns

Dataset Details:

  • Samples: 1,797
  • Features: 64 (8x8 pixel values)
  • Target: 1 (digit 0-9)
  • Use Case: Multi-class image classification

Example:

Cell A1: =ML.DATASETS.DIGITS()
Result: <Dataset>

# Complex classification task
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")  # All pixel features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64)      # Digit label

ML.DATASETS.OPENML() ⭐

Loads datasets from the OpenML repository (Premium feature).

Syntax:

=ML.DATASETS.OPENML(name, version)

Parameters:

  • name (String, Required): Dataset name from OpenML
  • version (String, Optional): Dataset version (default: “active”)

Returns: DataFrame object handle

Example:

Cell A1: =ML.DATASETS.OPENML("wine-quality-red", "active")
Result: <Dataset>

# Load specific dataset
Cell B1: =ML.DATASETS.OPENML("credit-g")

Note: Visit https://www.openml.org/ to browse available datasets.


ML.DATA Namespace

Functions for data manipulation and exploration.

ML.DATA.CONVERT_TO_DF()

Converts Excel data range to a FormulaML DataFrame object.

Syntax:

=ML.DATA.CONVERT_TO_DF(data_range, has_header)

Parameters:

  • data_range (Range, Required): Excel range containing data
  • has_header (Boolean, Required): TRUE if first row contains column names

Returns: DataFrame object handle

Example:

# Convert Excel data with headers
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:E100, TRUE)
Result: <DataFrame>

# Convert without headers
Cell B1: =ML.DATA.CONVERT_TO_DF(Data!B2:F50, FALSE)

ML.DATA.INFO()

Returns detailed information about a DataFrame.

Syntax:

=ML.DATA.INFO(dataframe)

Parameters:

  • dataframe (Object, Required): DataFrame object handle

Returns: DataFrame with columns: Column, Non-Null Count, Dtype

Example:

Cell A1: =ML.DATASETS.IRIS()
Cell A2: =ML.DATA.INFO(A1)

# Returns table showing:
# Column          Non-Null Count  Dtype
# sepal_length    150            float64
# sepal_width     150            float64
# ...

ML.DATA.DESCRIBE()

Generates descriptive statistics for numerical columns.

Syntax:

=ML.DATA.DESCRIBE(dataframe)

Parameters:

  • dataframe (Object, Required): DataFrame object handle

Returns: DataFrame with statistics (count, mean, std, min, 25%, 50%, 75%, max)

Example:

Cell A1: =ML.DATASETS.DIABETES()
Cell A2: =ML.DATA.DESCRIBE(A1)

# Returns statistical summary:
#        age      sex      bmi     ...
# count  442.0    442.0    442.0
# mean   48.5     0.05     26.3
# std    13.1     0.04     4.4
# ...

ML.DATA.SAMPLE()

Returns a sample of rows from the DataFrame.

Syntax:

=ML.DATA.SAMPLE(dataframe, n, head, tail, seed, missing_values)

Parameters:

  • dataframe (Object, Required): DataFrame object handle
  • n (Integer, Optional): Number of rows (default: 10, use -1 for all)
  • head (Boolean, Optional): Return first n rows (default: TRUE)
  • tail (Boolean, Optional): Return last n rows (default: FALSE)
  • seed (Integer, Optional): Random seed for reproducibility
  • missing_values (String, Optional): Excel error to show for missing (#N/A default)

Returns: DataFrame with sampled rows

Examples:

# First 5 rows
Cell A1: =ML.DATA.SAMPLE(dataframe, 5)

# Last 10 rows
Cell A2: =ML.DATA.SAMPLE(dataframe, 10, FALSE, TRUE)

# Random 20 rows with seed
Cell A3: =ML.DATA.SAMPLE(dataframe, 20, FALSE, FALSE, 42)

# All data
Cell A4: =ML.DATA.SAMPLE(dataframe, -1)

ML.DATA.SELECT_COLUMNS()

Selects specific columns from a DataFrame.

Syntax:

=ML.DATA.SELECT_COLUMNS(dataframe, columns)

Parameters:

  • dataframe (Object, Required): DataFrame object handle
  • columns (Array/Integer/String, Required): Column indices or names

Returns: DataFrame object handle with selected columns

Examples:

# Select single column by index
Cell A1: =ML.DATA.SELECT_COLUMNS(dataframe, 0)

# Select multiple columns by indices
Cell A2: =ML.DATA.SELECT_COLUMNS(dataframe, {0,1,2,3})

# Select column range (as string)
Cell A3: =ML.DATA.SELECT_COLUMNS(dataframe, "0:3")

# Select by column names (if available)
Cell A4: =ML.DATA.SELECT_COLUMNS(dataframe, {"age","bmi","bp"})

ML.DATA.DROP_MISSING_ROWS()

Removes rows with missing values from DataFrame.

Syntax:

=ML.DATA.DROP_MISSING_ROWS(dataframe, how, thresh, subset)

Parameters:

  • dataframe (Object, Required): DataFrame object handle
  • how (String, Optional): “any” or “all” (default: “any”)
  • thresh (Integer, Optional): Minimum non-null values required
  • subset (Array, Optional): Columns to check for missing values

Returns: DataFrame object handle without missing values

Examples:

# Drop rows with any missing values
Cell A1: =ML.DATA.DROP_MISSING_ROWS(dataframe)

# Drop rows where all values are missing
Cell A2: =ML.DATA.DROP_MISSING_ROWS(dataframe, "all")

# Keep rows with at least 3 non-null values
Cell A3: =ML.DATA.DROP_MISSING_ROWS(dataframe, , 3)

# Check specific columns only
Cell A4: =ML.DATA.DROP_MISSING_ROWS(dataframe, "any", , {"age","bmi"})

Common Patterns

Loading and Preparing Data

# Load dataset
Cell A1: =ML.DATASETS.IRIS()

# Explore structure
Cell A2: =ML.DATA.INFO(A1)
Cell A3: =ML.DATA.DESCRIBE(A1)
Cell A4: =ML.DATA.SAMPLE(A1, 10)

# Separate features and target
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})  # Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 4)          # Target

Converting Excel Data

# Prepare Excel range
# Assume data in columns A-E, rows 1-100, with headers

# Convert to DataFrame
Cell F1: =ML.DATA.CONVERT_TO_DF(A1:E100, TRUE)

# Clean missing values
Cell F2: =ML.DATA.DROP_MISSING_ROWS(F1)

# Select relevant columns
Cell F3: =ML.DATA.SELECT_COLUMNS(F2, {1,2,3})

Data Exploration Workflow

# Load data
Cell A1: =ML.DATASETS.DIABETES()

# Get overview
Cell B1: =ML.DATA.INFO(A1)        # Structure
Cell B2: =ML.DATA.DESCRIBE(A1)    # Statistics
Cell B3: =ML.DATA.SAMPLE(A1, 20)  # Sample rows

# Check for patterns
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2})  # First 3 features
Cell C2: =ML.DATA.DESCRIBE(C1)                 # Stats for subset

Tips and Best Practices

  1. Always explore data first

    • Use INFO to understand structure
    • Use DESCRIBE for statistical overview
    • Use SAMPLE to inspect actual values
  2. Handle missing values early

    • Check INFO output for null counts
    • Use DROP_MISSING_ROWS or imputation
  3. Use consistent column selection

    • Numerical indices are more reliable
    • Document which columns represent what
  4. Set random seeds for reproducibility

    • Use same seed in SAMPLE for consistent results
    • Important for debugging and testing
  5. Memory considerations

    • Use SAMPLE with small n for large datasets
    • Select only necessary columns