Other generative models can similarly be used for oversampling

Image by Markéta Machová from Pixabay

Table of contents

  • Introduction
  • Dataset preparation
  • Intro to GMM
  • Using GMM as an oversampling technique
  • Evaluation of performance metrics
  • Conclusion


In a previous article, I discussed how one can devise creative oversampling techniques that can outperform SMOTE variants, and we saw how oversampling using “crossovers” did so.

Dataset preparation

We first generate…

Lookahead mechanisms in decision trees can produce better predictions

Image by Steve Buissinne from Pixabay

Why lookahead?

Suppose we are trying to predict whether a job candidate will be successful in the role.

Getting Started

Can sales of vanilla ice cream overtake chocolate?

Image by Nicky • 👉 PLEASE STAY SAFE 👈 from Pixabay
  • Problem Statement
  • Data preparation
  • Wrong method 1 — Independent simulation (parametric)
  • Wrong method 2 — Independent simulation (non-parametric)
  • Method 1 — Multivariate distribution
  • Method 2 — Copulas with marginal distributions
  • Method 3 — Simulating historical combinations of sales growth
  • Method 4 — Decorrelating store sales growth using PCA


Monte Carlo simulation is a great forecasting tool for sales, asset returns, project ROI, and more.

Crossover/recombination oversampling adds novelty to a dataset and can score well on classification metrics vs. SMOTE and random oversampling

Image by liyuanalison at Pixabay

Table of contents

  • Introduction
  • Dataset preparation
  • Random oversampling and SMOTE
  • Crossover oversampling
  • Evaluation of performance metrics
  • Conclusion


Many of us have been in the situation of working on a predictive model with an imbalanced dataset.

  • Oversampling techniques
  • Undersampling techniques
  • Combinations of over and under…

Assess probabilities of various business outcomes

Photo by Mark de Jong on Unsplash

Example 1: Sales Offer From a Wholesaler

Suppose you have an innovative product that you have been selling for the past year.

Model Interpretability

Tree-based ensembles and other popular algorithms often produce counter-intuitive predictions when left unchecked

Photo by Jose Vega from Pexels

Table of Contents:

  • Intro to model controllability
  • Preparing a sample dataset (House Sales in King County, USA)
  • Finding the model with the top cross-validation score (CatBoost)
  • Linear model’s outperformance in sanity checks
  • Conclusion


Gradient boosted trees have been widely used to win competitions on Kaggle. It is no surprise that, for most tabular datasets you work with, XGBoost or another implementation of boosted decision trees is likely to be the model with the best cross-validation score on your metric(s).

Bassel Karami

Leading a data science team building retail analytics for shopping malls in the MENA region. MSc Econometrics | CFA, FRM, and CMA.
