Scaling is a fundamental preprocessing step in machine learning that can dramatically affect the performance of models. By standardising the range of the data's features, scaling ensures that each feature contributes comparably to the outcome of the model, preventing any single feature with a wider range from disproportionately influencing the model’s predictions.
While scaling is a powerful tool for enhancing model accuracy, the choice of scaling method must align with your data characteristics and the model’s mathematical assumptions. Testing multiple scaling techniques during the cross-validation phase is a practical approach to determine which scaling method harmonises best with your data and model choice. Like a well-tailored suit, the right scaling ensures your data fits your model perfectly, enabling it to perform at its best.
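To make that concrete, here is a minimal sketch of the cross-validation approach, assuming scikit-learn and a synthetic dataset (the KNN estimator and the generated data are illustrative choices, not part of any particular project): the scaler is treated as just another hyperparameter inside a pipeline.

```python
# Sketch: comparing scalers inside cross-validation (scikit-learn assumed;
# the synthetic data and the KNN estimator are illustrative, not prescriptive).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, QuantileTransformer,
                                   RobustScaler, StandardScaler)

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("model", KNeighborsClassifier())])

# Treat the scaler itself as a hyperparameter and let cross-validation pick it.
param_grid = {
    "scaler": [
        MinMaxScaler(),
        StandardScaler(),
        MaxAbsScaler(),
        RobustScaler(),
        QuantileTransformer(output_distribution="normal", n_quantiles=100),
    ]
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("Best scaler:", search.best_params_["scaler"])
print("Best CV accuracy:", round(search.best_score_, 3))
```

Because the scaler sits inside the pipeline, it is re-fitted on the training portion of every fold, which avoids leaking test-set statistics into the transform.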
Why Scaling Is Imperative in Machine Learning
Imagine trying to interpret a conversation where one person speaks in a whisper and another through a megaphone. It would be difficult to gauge the true intent and weight of each speaker’s words. Similarly, in machine learning, features on different scales can distort each feature’s real significance for algorithms that are sensitive to feature magnitude, such as K-Nearest Neighbors (KNN) or anything trained with gradient descent. Scaling normalises these voice levels, so to speak, allowing the model to learn from the data more effectively, without bias introduced by how the data happens to be represented.
Main Scaling Methodologies
- Min-Max and MaxAbs scaling transform data to a bounded interval, which is essential for models that are sensitive to the absolute magnitude of feature values.
- Standardisation removes the mean and scales to unit variance, helping in cases where the algorithm assumes data is centred around zero.
- Robust Scaling uses the median and the interquartile range (IQR), thus reducing the influence of outliers.
- Quantile Transformer maps each feature onto a common target distribution (uniform or normal), which can improve predictive performance for datasets that do not conform to a normal distribution.
Min-Max Scaling
- Best for: Data that must lie within a bounded range and contains few or no outliers, since extreme values stretch the range and compress everything else.
- Ideal Algorithms: Neural networks, KNN, and any algorithm requiring a strict [0, 1] range.
- Examples: Image data normalisation, pre-processing for consumer choice modelling, or adjusting survey data spread across multiple-choice scales.
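As a minimal, hedged sketch of what Min-Max scaling does in practice (the toy numbers below are illustrative only, and scikit-learn's MinMaxScaler is assumed):

```python
# Min-Max scaling to [0, 1]: x' = (x - min) / (max - min), computed per feature.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.0, 50.0],
              [5.0, 100.0],
              [10.0, 150.0]])   # toy data: two features on different ranges

scaler = MinMaxScaler()          # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)
print(X_scaled)                              # each column now spans exactly [0, 1]
print(scaler.data_min_, scaler.data_max_)    # per-feature min/max learned from the data
```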
Standardisation (Z-Score Normalisation)
- Best for: Approximately Gaussian-distributed features, when outliers are not a concern.
- Ideal Algorithms: Linear and logistic regression, support vector machines, and linear discriminant analysis.
- Examples: Scaling stock prices for financial time series forecasting, patient health indicators in medical diagnostics, or feature scaling in anomaly detection systems.
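A minimal sketch of standardisation with scikit-learn's StandardScaler, using illustrative toy values; the statistics are learned on the training data and then reused for new observations:

```python
# Standardisation: z = (x - mean) / std, computed per feature on the training data.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[100.0, 0.001],
                    [110.0, 0.002],
                    [120.0, 0.003]])  # toy "price-like" and "rate-like" features

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)        # mean 0, unit variance per column
X_new_std = scaler.transform([[115.0, 0.0025]])    # reuse the training statistics
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_new_std)
```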
MaxAbs Scaling
- Best for: Data where the distribution is not known, and preserving zero entries in sparse data is crucial.
- Ideal Algorithms: Models for large-scale text processing or image processing where sparsity is a key factor.
- Examples: Document term frequency data in text classification, large-scale image datasets for computer vision tasks.
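A short sketch, assuming SciPy and scikit-learn, of why MaxAbs scaling suits sparse, term-frequency-style data (the matrix below is a toy example):

```python
# MaxAbs scaling divides each feature by its maximum absolute value, so zeros stay
# zero and the sparse structure is preserved (no centring is performed).
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler

# Toy sparse term-frequency-style matrix: 3 documents x 4 terms.
X = sp.csr_matrix([[3, 0, 0, 1],
                   [0, 2, 0, 0],
                   [6, 0, 4, 0]], dtype=float)

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)   # still a sparse matrix
print(X_scaled.toarray())            # each column now lies in [-1, 1]
print(scaler.max_abs_)               # per-feature maximum absolute values
```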
Robust Scaling
- Best for: Data with outliers.
- Ideal Algorithms: Any that need robustness to outliers, such as decision trees or clustering algorithms.
- Examples: Real estate pricing models, economic data analysis during volatile periods, handling of exception cases in manufacturing data.
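A minimal sketch of Robust Scaling on toy price data with one extreme outlier (illustrative values, scikit-learn assumed); note that the learned statistics are the median and IQR rather than the mean and standard deviation:

```python
# Robust scaling: x' = (x - median) / IQR, so a single extreme value barely moves
# the learned statistics.
import numpy as np
from sklearn.preprocessing import RobustScaler

prices = np.array([[200_000.0], [220_000.0], [240_000.0],
                   [260_000.0], [5_000_000.0]])  # one extreme "mansion" outlier

scaler = RobustScaler()                # default: median and IQR (25th-75th percentile)
scaled = scaler.fit_transform(prices)
print(scaler.center_, scaler.scale_)   # median and IQR, not mean and std
print(scaled.ravel())                  # typical prices stay in a narrow band
```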
Quantile Transformer Scaling
- Best for: Data with an unknown or non-normal distribution, since the transform makes no parametric assumptions.
- Ideal Algorithms: Neural networks in non-standard data settings, gradient boosting models.
- Examples: Complex sensor data in IoT, creative content features in recommendation systems, unusual pattern recognition in cybersecurity.
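A hedged sketch of the Quantile Transformer applied to synthetic, heavily skewed "sensor-like" readings (the exponential data is purely illustrative):

```python
# Quantile transform: each feature is mapped through its empirical CDF onto a
# uniform (or normal) target distribution, regardless of the original shape.
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(1000, 1))   # heavily skewed toy sensor readings

qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000, random_state=0)
X_gauss = qt.fit_transform(X)
print(round(float(X.mean()), 2), round(float(X_gauss.mean()), 2))  # skewed vs roughly centred
```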
What happens if Scaling is NOT Optimal?
Choosing an inappropriate scaling method can lead to suboptimal model performance. Here’s a hypothetical comparison of outcomes if one scaling method is substituted for another:
| If you use… | Instead of… | Possible Outcome |
|---|---|---|
| Min-Max Scaling | Standardisation | Poor performance with outliers; features not normally distributed. |
| Standardisation | Min-Max Scaling | Features with naturally large ranges may dominate. |
| Robust Scaling | Quantile Transformer | Over-normalisation, losing some important outlier information. |
| MaxAbs Scaling | Robust Scaling | Distortion in handling outliers, particularly in sparse datasets. |
| Quantile Transformer | Standardisation | Overfitting due to transforming features to a similar distribution unnaturally. |
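To make the first row of the table concrete, the following toy sketch (scikit-learn assumed, numbers illustrative) shows how a single outlier compresses Min-Max-scaled values, whereas Robust Scaling keeps the typical values well separated:

```python
# One outlier squeezes most Min-Max-scaled values towards zero, while the
# median/IQR used by RobustScaler leaves the typical values spread out.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

print(MinMaxScaler().fit_transform(x).ravel())  # ~[0, 0.001, 0.002, 0.003, 1.0]
print(RobustScaler().fit_transform(x).ravel())  # typical points remain well spaced
```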
When to use - What to use
Min-Max Scaling
Min-Max Scaling is particularly useful for algorithms that require input data to be bounded within a specific range:
- Neural Networks: Input normalisation can help speed up learning and lead to faster convergence.
- K-Nearest Neighbors (KNN): Since KNN calculates the distance between different instances, scaling ensures that all features contribute equally.
- Gradient Boosting Machines: Although tree-based methods usually handle varying scales well, gradient boosting can benefit from scaled input to stabilise and speed up training.
- Support Vector Machines (SVM): For kernels that compute the dot product (e.g., the polynomial kernel), normalised data ensures that all dimensions of the feature space are treated uniformly.
- Principal Component Analysis (PCA): Although not a machine learning model per se, PCA’s performance in dimensionality reduction is enhanced when features are on the same scale.
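As an illustrative, non-definitive pairing, Min-Max scaling can be placed in front of a distance-based model such as KNN inside a pipeline; the breast-cancer dataset below is only a convenient stand-in, and the exact scores will depend on your data:

```python
# Illustrative pairing: Min-Max scaling feeding a distance-based model (KNN).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

unscaled = KNeighborsClassifier()
scaled = make_pipeline(MinMaxScaler(), KNeighborsClassifier())

print("KNN without scaling:", cross_val_score(unscaled, X, y, cv=5).mean().round(3))
print("KNN with Min-Max:   ", cross_val_score(scaled, X, y, cv=5).mean().round(3))
```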
Standardisation (Z-Score Normalisation)
Standardisation is ideal for algorithms that assume data is centred and approximately normally distributed, or that are sensitive to the relative scale of features:
- Linear Regression: Parameter estimation can be more stable and interpretation easier when features are standardised.
- Logistic Regression: Similar to linear regression, standardisation helps in learning weights uniformly across all features.
- Support Vector Machines (SVM): Especially with non-linear kernels like RBF, where scaling influences the kernel’s ability to manage the feature space.
- Ridge and Lasso Regression: Regularisation paths in these methods are directly affected by the scale of the features.
- K-Means Clustering: Standardisation helps in measuring true similarities between instances, which is crucial for cluster formation.
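A hedged sketch of standardisation feeding an RBF-kernel SVM (the wine dataset is an illustrative stand-in); the RBF kernel's width is highly sensitive to feature scale, which is why the scaler sits in the same pipeline:

```python
# Illustrative pairing: standardisation in front of an RBF-kernel SVM, whose
# kernel is very sensitive to feature scale.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean().round(3))
```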
MaxAbs Scaling
MaxAbs Scaling is suitable for sparse data and for features measured on different scales, since it divides by the maximum absolute value without shifting the data or destroying zero entries:
- Sparse Data Models: Algorithms processing text data, such as NLP tasks, where data remains sparse and zero entries are preserved.
- Ridge Regression: Works well with sparse data, benefiting from feature scaling that does not shift the data distribution.
- Lasso Regression: Similar to ridge, especially when dealing with high-dimensional data.
- DBSCAN: Suitable for clustering high-dimensional data in which maintaining the sparsity and range of data is crucial.
- Latent Dirichlet Allocation (LDA): Used in text analysis and topic modelling, where preserving zero entries (sparsity) is beneficial.
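An illustrative sparse-text pipeline, assuming scikit-learn's CountVectorizer (the three documents are toy examples): the matrix stays sparse all the way through because MaxAbs scaling never centres the data.

```python
# Illustrative sparse-text pipeline: the count matrix stays sparse through MaxAbs
# scaling because no centring is applied and zero entries are untouched.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

docs = ["scaling keeps sparse data sparse",
        "maxabs scaling preserves zero entries",
        "text features are usually sparse"]

pipe = make_pipeline(CountVectorizer(), MaxAbsScaler())
X = pipe.fit_transform(docs)
print(type(X), X.shape)   # still a SciPy sparse matrix
```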
Robust Scaling
Robust Scaling is beneficial for datasets heavily populated with outliers:
- Robust Linear Models: These models are explicitly designed to handle outliers in data.
- Hierarchical Clustering: Ensures that clusters are not influenced by extreme values.
- DBSCAN: This clustering algorithm can perform better when outliers do not distort the metric space.
- Decision Trees and Random Forests: Although inherently robust to different scales, using robust scaling can improve model interpretability.
- All Kinds of Regression Models: Especially where outliers might otherwise skew coefficient estimates and their interpretation.
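A minimal sketch, under the assumption of synthetic data with a few injected extreme rows, of robust scaling in front of a regularised linear model (Ridge is an illustrative choice):

```python
# Illustrative pairing: robust scaling in front of a linear model when a few
# extreme observations would otherwise dominate the learned feature statistics.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:5] *= 50                          # inject a handful of extreme rows
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

model = make_pipeline(RobustScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)   # coefficients on the robustly scaled features
```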
Quantile Transformer Scaling
Quantile Transformer Scaling is ideal for modelling non-parametric data without assuming any specific data distribution:
- Neural Networks: Helps stabilise activations across layers.
- Gradient Boosting Machines: Ensures that the model is less sensitive to the outlier distribution.
- K-Means Clustering: Can improve cluster allocations by transforming data to approximately the same scale and distribution.
- Anomaly Detection: Facilitates identifying deviations as the data follows a similar distribution.
- Gaussian Processes: Benefits from data that adheres to a normal distribution, which the quantile transformer can approximate.
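Finally, a hedged sketch of the Quantile Transformer mapping heavily skewed synthetic features to an approximately normal shape before a small neural network (the data, network size, and scores are illustrative only):

```python
# Illustrative pairing: quantile-transforming skewed features to an approximately
# normal shape before a model that benefits from well-behaved inputs.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(300, 4))   # heavily skewed features
y = np.log1p(X).sum(axis=1) + rng.normal(scale=0.1, size=300)

model = make_pipeline(
    QuantileTransformer(output_distribution="normal", n_quantiles=200, random_state=0),
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
)
print("CV R^2:", cross_val_score(model, X, y, cv=5).mean().round(3))
```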