Introduction
An imbalanced dataset is one in which the distribution of classes or labels is highly skewed: one class (the minority class) is significantly underrepresented compared to another class (the majority class).
Imbalance arises in many real-world domains, including medical diagnosis, fraud detection, customer churn prediction, and rare event detection.
Traditional machine learning algorithms, which typically aim to optimize overall accuracy, tend to perform poorly on imbalanced data. Because the majority class has far more instances, the model can become biased towards predicting that class most of the time, and so generalizes poorly to the minority class. In some cases the model ignores the minority class altogether, which has serious consequences when the minority class carries critical information or represents important events.
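To make the accuracy problem concrete, here is a minimal sketch in plain Python. The data is a made-up toy set (990 majority samples, 10 minority samples), and the "model" is a hypothetical baseline that always predicts the majority class. The helper functions `accuracy` and `recall` are written here for illustration and do not come from any library.

```python
# Hypothetical toy labels: 990 majority samples (0) and 10 minority samples (1).
y_true = [0] * 990 + [1] * 10

# A degenerate "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual `positive` samples that the model predicted as `positive`."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


print(f"accuracy        = {accuracy(y_true, y_pred):.3f}")  # accuracy        = 0.990
print(f"minority recall = {recall(y_true, y_pred):.3f}")    # minority recall = 0.000
```

The baseline scores 99% accuracy without detecting a single minority sample, which is why overall accuracy alone is a misleading target on skewed data.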
Many approaches have been proposed to address data imbalance, working in different directions such as resampling the data or using more suitable evaluation metrics.
This blog post introduces three sampling methods: repeatFactorSampler, AdaptiveSampler, and weightedRandomSampler.