Table of Contents
Working with Data
Data is the foundation of any machine learning project. This guide covers everything you need to know about handling data in FormulaML, from loading built-in datasets to preparing your own Excel data for analysis.
Understanding Object Handles
Before we dive into data operations, it’s crucial to understand how FormulaML manages data in Excel.
What are Object Handles?
Object handles are FormulaML’s way of storing complex data structures in Excel cells. They appear as text like <Dataset>
or <StandardScaler>
.
Think of handles like this:
- A handle is like a shipping label on a package
- The label (handle) fits in a cell, but references the entire package (your data)
- You pass handles between functions to operate on your data
Working with Handles
Cell A1: =ML.DATASETS.IRIS() → Returns: <Dataset>
Cell A2: =ML.DATA.INFO(A1) → Uses handle from A1
Cell A3: =ML.DATA.SAMPLE(A1, 5) → Also uses handle from A1
Each function that uses the handle can access the full dataset without copying it to multiple cells.
Loading Data
FormulaML provides several ways to load data for analysis.
Built-in Datasets
Perfect for learning and testing, these datasets are included with FormulaML:
Iris Dataset (Classification)
=ML.DATASETS.IRIS()
- 150 samples of iris flowers
- 4 features: sepal length/width, petal length/width
- 3 target classes: Setosa, Versicolor, Virginica
- Use case: Multi-class classification
Diabetes Dataset (Regression)
=ML.DATASETS.DIABETES()
- 442 patient samples
- 10 physiological features
- Target: Disease progression after one year
- Use case: Regression analysis
Digits Dataset (Classification)
=ML.DATASETS.DIGITS()
- 1,797 handwritten digit images
- 64 features (8x8 pixel values)
- 10 target classes (digits 0-9)
- Use case: Image classification
OpenML Datasets (Premium)
=ML.DATASETS.OPENML(dataset_id)
- Access thousands of datasets from OpenML repository
- Example:
=ML.DATASETS.OPENML(61)
loads the Iris dataset from OpenML - Requires premium subscription
Converting Excel Data
Transform your Excel data into FormulaML DataFrames:
Basic Conversion
=ML.DATA.CONVERT_TO_DF(data_range, has_header)
Parameters:
data_range
: Your Excel data range (e.g., A1:E100)has_header
: TRUE if first row contains column names, FALSE otherwise
Example:
=ML.DATA.CONVERT_TO_DF(Sheet1!A1:D50, TRUE)
Converts data from Sheet1, treating the first row as headers.
Practical Example: Sales Data
Suppose you have sales data in columns A-D:
- Column A: Date
- Column B: Product
- Column C: Quantity
- Column D: Revenue
Cell F1: =ML.DATA.CONVERT_TO_DF(A:D, TRUE)
Result: <DataFrame>
containing your sales data
Exploring Your Data
Once data is loaded, explore it to understand its structure and content.
Data Information
=ML.DATA.INFO(dataframe)
Returns detailed information about your dataset:
- Number of rows and columns
- Column names and data types
- Memory usage
- Missing value counts
Statistical Summary
=ML.DATA.DESCRIBE(dataframe)
Provides statistical summary for numerical columns:
- Count (non-null values)
- Mean (average)
- Standard deviation
- Min and max values
- Quartiles (25%, 50%, 75%)
Viewing Sample Data
=ML.DATA.SAMPLE(dataframe, n_rows)
Displays random rows from your dataset.
Example Workflow:
Cell A1: =ML.DATASETS.DIABETES()
Cell A3: =ML.DATA.INFO(A1) → Shows 442 samples, 11 columns
Cell A5: =ML.DATA.DESCRIBE(A1) → Statistical summary
Cell A7: =ML.DATA.SAMPLE(A1, 10) → View 10 random rows
Selecting and Manipulating Data
Selecting Columns
Select Specific Columns
=ML.DATA.SELECT_COLUMNS(dataframe, columns)
Examples:
Select single column: =ML.DATA.SELECT_COLUMNS(A1, 0)
Select multiple columns: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2})
Select column range: =ML.DATA.SELECT_COLUMNS(A1, "0:3")
Practical Example: Feature and Target Separation
For machine learning, you typically need to separate features (X) from target (y):
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3}) → Features (X)
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 4) → Target (y)
Data Sampling
Random Sampling
=ML.DATA.SAMPLE(dataframe, n_samples, random_state)
n_samples
: Number of rows to samplerandom_state
: Optional seed for reproducibility
Stratified Sampling
For balanced sampling across categories (useful for imbalanced datasets):
=ML.DATA.SAMPLE(dataframe, n_samples, random_state, stratify_column)
Data Preparation Workflow
Here’s a complete workflow for preparing your data:
Step 1: Load Your Data
Cell A1: =ML.DATA.CONVERT_TO_DF(RawData!A:F, TRUE)
Step 2: Explore the Data
Cell A3: =ML.DATA.INFO(A1)
Cell A5: =ML.DATA.DESCRIBE(A1)
Cell A7: =ML.DATA.SAMPLE(A1, 10)
Step 3: Check for Issues
Look for:
- Missing values in the INFO output
- Unusual statistics in DESCRIBE (outliers)
- Data types that need conversion
Step 4: Select Relevant Columns
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {1,2,3,4}) → Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 5) → Target
Step 5: Split for Training/Testing
Cell D1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, C1, 0.2, 42)
Working with Different Data Types
Numerical Data
FormulaML automatically handles numerical data. Ensure your Excel data:
- Uses consistent number formats
- Doesn’t mix text with numbers
- Handles missing values appropriately (blank cells or specific values)
Categorical Data
For text categories (e.g., “High”, “Medium”, “Low”):
- Convert to DataFrame normally
- FormulaML will encode categories automatically when needed
- Use appropriate algorithms that handle categorical data
Date and Time Data
Convert dates to numerical features before loading:
Excel formula to convert date to days: =A2-$A$2
Excel formula to extract month: =MONTH(A2)
Excel formula to extract day of week: =WEEKDAY(A2)
Handling Missing Data
Identifying Missing Values
The ML.DATA.INFO()
function shows missing value counts for each column.
Strategies for Missing Data
Option 1: Remove Rows with Missing Values
Use Excel filtering before converting to DataFrame
Option 2: Imputation (Premium)
=ML.IMPUTE.SIMPLE_IMPUTER(strategy)
Strategies: “mean”, “median”, “most_frequent”, “constant”
Example workflow:
Cell A1: =ML.DATA.CONVERT_TO_DF(MyData!A:D, TRUE)
Cell A2: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell A3: =ML.FIT_TRANSFORM(A2, A1)
Best Practices
1. Data Quality Checks
Always explore your data before modeling:
- Check for missing values
- Look for outliers in statistical summaries
- Verify data types are correct
2. Consistent Formatting
- Use consistent number formats across columns
- Avoid mixing data types in columns
- Remove special characters from text data
3. Data Size Considerations
- FormulaML handles datasets up to 1 million rows efficiently
- For larger datasets, consider sampling
- Premium features offer better performance for large data
4. Reproducibility
Use random_state
parameters for consistent results:
=ML.DATA.SAMPLE(A1, 100, 42) → Same sample every time
Common Data Scenarios
Scenario 1: Time Series Sales Forecasting
1. Load data: =ML.DATA.CONVERT_TO_DF(Sales!A:C, TRUE)
2. Create lag features manually in Excel
3. Select features: =ML.DATA.SELECT_COLUMNS(dataframe, {1,2,3})
4. Select target: =ML.DATA.SELECT_COLUMNS(dataframe, 4)
Scenario 2: Customer Segmentation
1. Load customer data: =ML.DATA.CONVERT_TO_DF(Customers!A:G, TRUE)
2. Select numerical features: =ML.DATA.SELECT_COLUMNS(dataframe, {2,3,4,5})
3. Apply clustering: =ML.CLUSTERING.KMEANS(n_clusters=3)
Scenario 3: Binary Classification
1. Load data: =ML.DATA.CONVERT_TO_DF(Data!A:F, TRUE)
2. Separate features and target
3. Split data: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, y, 0.3)
4. Train classifier: =ML.CLASSIFICATION.LOGISTIC()
Troubleshooting Data Issues
“Object handle not found”
- Ensure you’re referencing the correct cell containing the handle
- Handles expire when Excel closes; reload data if needed
“Invalid data format”
- Check for mixed data types in columns
- Remove non-numeric characters from numerical columns
- Ensure consistent formatting
“Memory error with large dataset”
- Sample your data:
=ML.DATA.SAMPLE(dataframe, 10000)
- Select only necessary columns
- Consider premium version for better performance
Next Steps
Now that you understand data handling in FormulaML:
- Learn about models: Explore Models and Algorithms
- Try examples: Work through our tutorials
- Advanced techniques: Discover preprocessing methods
Ready to build models? Continue to Models and Algorithms →