TL;DR — Drawing samples from Gaussian Mixture Models (GMMs) or other generative models is another creative oversampling technique that can potentially outperform SMOTE variants.
In a previous article, I discussed how one can come up with many creative oversampling techniques that can outperform SMOTE variants. We saw how oversampling using "crossovers" outperformed SMOTE variants.
In this article, we will show how a Gaussian Mixture Model (GMM), a type of generative model, can be used to oversample minority-class instances in an imbalanced dataset.
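The core idea can be sketched in a few lines: fit a GMM only on the minority-class instances, then draw synthetic samples from it until the classes are balanced. This is a minimal illustration using scikit-learn's `GaussianMixture` on a toy `make_classification` dataset; the component count and dataset settings here are assumptions, not the article's exact code.

```python
# Oversampling sketch: fit a GMM on minority-class rows only,
# then sample synthetic minority instances from it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

X_min = X[y == 1]                             # minority-class instances
n_needed = (y == 0).sum() - (y == 1).sum()    # samples needed to balance

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_min)
X_new, _ = gmm.sample(n_needed)               # draw synthetic minority samples

X_bal = np.vstack([X, X_new])
y_bal = np.concatenate([y, np.ones(n_needed, dtype=int)])
```

After this step, `X_bal`/`y_bal` contain equal numbers of both classes and can be fed to any downstream classifier.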
TL;DR: I show that decision trees with a single-step lookahead mechanism can outperform standard, greedy decision trees (no lookahead). No overfitting or lookahead pathology is observed in the sample dataset.
Suppose we are trying to predict whether a potential job candidate will be successful in the role.
Monte Carlo simulation is a great forecasting tool for sales, asset returns, project ROI, and more.
In a previous article, I provided a practical introduction to how Monte Carlo simulations can be used in a business setting to predict a range of possible business outcomes and their associated probabilities.
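As a refresher, here is what such a forecast looks like for one of the use cases mentioned above, asset returns: simulate many possible yearly paths and read off the percentiles as a probability range. The return and volatility figures are illustrative assumptions, not results from the article.

```python
# Minimal Monte Carlo forecast of one-year asset returns, assuming
# i.i.d. normally distributed monthly returns (illustrative numbers).
import numpy as np

rng = np.random.default_rng(42)
n_sims, n_months = 10_000, 12
mu, sigma = 0.01, 0.04          # assumed monthly mean return and volatility

monthly = rng.normal(mu, sigma, size=(n_sims, n_months))
annual = (1 + monthly).prod(axis=1) - 1   # compound monthly to annual return

p5, p50, p95 = np.percentile(annual, [5, 50, 95])
print(f"median {p50:.1%}, 90% interval [{p5:.1%}, {p95:.1%}]")
```

The percentile spread is the "range of possible outcomes and their probabilities" that a point forecast cannot give you.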
TL;DR — There are many ways to oversample imbalanced data, other than random oversampling, SMOTE, and its variants. In a classification dataset generated using scikit-learn’s make_classification default settings, samples generated using crossover operations outperform SMOTE and random oversampling on the most relevant metrics.
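To make the "crossover" idea concrete, here is one simple way such an operator can be implemented: each synthetic sample copies every feature from one of two randomly chosen minority-class parents. This is a sketch of a uniform crossover; the article's exact operators may differ.

```python
# Crossover oversampling sketch: each child takes each feature value
# from one of two randomly selected minority-class parents.
import numpy as np

def crossover_oversample(X_min, n_new, seed=None):
    """Generate n_new synthetic rows by uniform crossover of random parent pairs."""
    rng = np.random.default_rng(seed)
    p1 = X_min[rng.integers(len(X_min), size=n_new)]   # first parents
    p2 = X_min[rng.integers(len(X_min), size=n_new)]   # second parents
    mask = rng.random(p1.shape) < 0.5                  # per-feature parent choice
    return np.where(mask, p1, p2)

X_min = np.random.default_rng(0).normal(size=(50, 5))  # toy minority class
X_new = crossover_oversample(X_min, 100, seed=1)
```

Because every feature value is copied from a real minority instance, the synthetic points stay inside the observed per-feature value ranges, unlike SMOTE's interpolated points.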
Many of us have been in the situation of working on a predictive model with an imbalanced dataset.
The most popular approaches to handling the imbalance include:
Monte Carlo simulation is a computational technique with a wide range of applications, from solving some of the more difficult mathematical problems to risk management.
We will go through two examples to demonstrate how Monte Carlo simulations can help you quantify risks in your next project or business decision.
Suppose you have an innovative product that you have been selling for the past year.
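Before working through the example, here is the shape of the calculation: model the uncertain inputs (demand and price) as distributions, simulate many scenarios, and read off the probability of an outcome you care about. All distributions and figures below are placeholder assumptions for illustration, not the article's actual numbers.

```python
# Monte Carlo risk sketch: simulate next-year profit for the product
# under uncertain demand and price (all parameters are assumptions).
import numpy as np

rng = np.random.default_rng(7)
n_sims = 10_000

units = rng.normal(5_000, 800, n_sims).clip(min=0)  # uncertain unit demand
price = rng.uniform(18, 22, n_sims)                 # uncertain selling price
unit_cost = 12.0                                    # assumed fixed unit cost

profit = units * (price - unit_cost)
loss_prob = (profit < 30_000).mean()   # chance of missing a profit target
print(f"P(profit < $30k) = {loss_prob:.1%}")
```

A single-point forecast would hide `loss_prob` entirely; the simulation surfaces it directly.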
Gradient boosted trees have powered winning entries in many Kaggle competitions. It is no surprise that, for most tabular datasets you work with, you will likely find XGBoost or another implementation of boosted decision trees as the model with the best cross-validation score on your metric(s).
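The cross-validation check described above typically looks like the following; here scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the snippet has no extra dependency, and the dataset is a toy one.

```python
# Cross-validation sketch for a boosted-trees model (scikit-learn's
# gradient boosting as a stand-in for XGBoost).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(random_state=0),
                         X, y, cv=5, scoring="roc_auc")
print(f"mean CV ROC AUC: {scores.mean():.3f}")
```

A high mean here is exactly the "supposedly good cross-validation score" the question below refers to, and it is no guarantee of post-deployment performance.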
Question — How many times have you deployed a gradient boosted trees model with a supposedly good cross-validation score, but your…