\documentclass[twoside,11pt]{article}
% Any additional packages needed should be included after jmlr2e.
% Note that jmlr2e.sty includes epsfig, amssymb, natbib and graphicx,
% and defines many common macros, such as 'proof' and 'example'.
%
% It also sets the bibliographystyle to plainnat; for more information on
% natbib citation styles, see the natbib documentation, a copy of which
% is archived at http://www.jmlr.org/format/natbib.pdf
\usepackage{jmlr2e}
\usepackage{amsmath}
\usepackage{algorithm}
\usepackage{bm}
\usepackage[noend]{algpseudocode}
\usepackage{wrapfig}
\usepackage{float}
% Definitions of handy macros can go here
\newcommand{\dataset}{{\cal D}}
\newcommand{\fracpartial}[2]{\frac{\partial #1}{\partial #2}}
% Heading arguments are {volume}{year}{pages}{submitted}{published}{author-full-names}
% \jmlrheading{1}{2017}{1-48}{4/00}{10/00}{Taylor G. Smith and Jason M. White}
% Short headings should be running head and authors last names
\ShortHeadings{Correcting Class Imbalance with Variational Auto-Encoders}{Smith and White}
\firstpageno{1}
\begin{document}
\title{On the Synthetic Generation of Minority Class Observations Using Variational Auto-Encoders}
\author{\name Taylor G.\ Smith \email taylor.smith@alkaline-ml.com
\AND
\name Jason M.\ White \email jason.m.white5@gmail.com}
\editor{Michael Bernico}
\maketitle
\begin{abstract}% <- trailing '%' for backward compatibility of .sty file
This document describes a methodology by which to remedy imbalanced datasets for the purpose of classification. A dataset may be considered imbalanced if its classification labels are disproportionately represented across classes. In many real-world domains, class imbalance---or the presence of extremely rare events which may be costly to misclassify---is quite prevalent. Commonly cited problem domains include fraud detection, medical imaging, anomaly detection and countless more. Our procedure augments the Synthetic Minority Over-sampling Technique \citep*{chawla2002smote}---or SMOTE---by training generative variational auto-encoders on the minority class(es) for the purpose of creating synthetic examples. This ensures a set of synthetic observations that are more true to the most archetypal observed points in the minority class.
\end{abstract}
\begin{keywords}
classification, class-imbalance, cost-sensitive learning, SMOTE, variational auto-encoders, skewed distributions
\end{keywords}
\section{Introduction}
A dataset may be considered imbalanced if its classification labels are disproportionately represented across classes. While class imbalance is very common and manifests itself in varying degrees, of particular interest are the cases in which one class---the majority class---is significantly more present than one or more minority classes, which are represented at a much smaller ratio. This can detrimentally impact a learning algorithm's ability to estimate a generalizable decision boundary. Consider, for instance, a medical test to determine whether a patient suffers from a rare disease. The dataset may be composed of 99.8\% negative observations with only 0.2\% positive examples. Even the most na\"ive classifier can achieve 99.8\% classification accuracy in this case by simply learning to always predict the negative class \citep{lewis1994heterogeneous}. However, such a test has no utility, since it will never correctly identify the condition of interest. Nor is this an isolated case; many machine learning domains---such as fraud detection, network security, and spam filtration---frequently face some level of class imbalance. This paper discusses some of the pitfalls of training machine learning models on such datasets and presents a remedying class-balancing technique.
Throughout this paper, we focus on inducing classification algorithms on a given training set, $X \in \mathbb{R}^{m \times n}$, with a corresponding set of class labels, $y \in \{0, 1, \ldots, c\}^{m}$, in which one or more minority class labels are represented at a significantly smaller proportion than one or more majority class labels. As noted by countless studies and illustrated by the medical test example above, classifier efficacy often cannot be meaningfully measured via conventional, cost-insensitive metrics such as accuracy (the percentage of testing observations properly identified by the learner). This greatly complicates the classification task, since such metrics will offer misleadingly optimistic scores for an otherwise ineffective classifier. What makes class imbalance a particularly interesting and relevant problem, therefore, is the tangible cost with which misclassification of rare events is typically associated. The real-world impact of such errors can be especially perilous in the medical domain, where diagnostic datasets are especially susceptible to class disparity, as high-risk examples of interest (e.g., instances of rare diseases) tend to constitute the minority class \citep{rahman2013addressing}.
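To make this concrete, consider the following minimal sketch (a hypothetical 1,000-sample dataset with the 0.2\% positive rate from the example above, scored with scikit-learn metrics): the degenerate classifier achieves near-perfect accuracy while recalling none of the positive cases.
\begin{verbatim}
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 998 negative observations, 2 positive (0.2% positive)
y_true = np.array([0] * 998 + [1] * 2)
y_pred = np.zeros_like(y_true)   # a degenerate "always negative" classifier

print(accuracy_score(y_true, y_pred))  # 0.998 -- misleadingly optimistic
print(recall_score(y_true, y_pred))    # 0.0   -- no positive case is found
\end{verbatim}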
Section 2 presents previous work to which our approach may be compared. Section 3 introduces generative models and, more specifically, variational auto-encoders. Section 4 outlines the details of our technique. Section 5 describes our experiments and compares the performance of our technique with other common class-imbalance solutions.
\section{Previous Work}
The problem of class imbalance is not a new one, and the machine learning community has long attempted to address it in two broad ways: class weighting, which assigns a heavier cost to misclassified events in the positive class and has motivated cost-sensitive classification metrics such as the ROC convex hull \citep{provost2001robust} and the F-measure \citep{lewis1994training}; and data preparation techniques that resample or otherwise manipulate the input data before a model is built. Our approach belongs to the latter category and seeks to synthetically augment the training set irrespective of the selected scoring metric. We present existing research for each of these strategies in the following subsections.
\subsection{Over-sampling minority class samples}
Research on the efficacy of resampling as a preprocessing technique for imbalanced datasets has been conducted extensively. One common approach for resampling a training set is over-sampling the minority class examples with replacement. \citet{japkowicz2000class} proposed several variants of over-sampling, with varying degrees of success. Her random resampling approach drew minority samples with replacement until the minority class was represented at the same magnitude as the majority class. She also proposed ``focused resampling,'' which drew minority samples only from along the decision boundary between the majority and minority classes.
Notably, she observed no clear advantage of ``focused'' over-sampling over generalized over-sampling \citep{japkowicz2000class}. Using a decision tree classifier, \citet{chawla2002smote} showed that duplicating minority observations actually further narrowed the (theoretically already very small) decision region, causing more splits in the tree and leading to overfitting.
\subsection{Under-sampling majority class samples}
In conjunction with her research on over-sampling methods, \citet*{japkowicz2000learning} showed that under-sampling the majority class can be effective---especially in tasks where $c > 2$---presenting the added benefit of reducing the training set size, and consequently the complexity of the training task. However, in cases of extreme class disparity, the amount of down-sampling required to balance the class labels may so significantly diminish the size of the dataset that it could become unusable.
\citet*{kubat1997addressing} proposed a more sophisticated form of under-sampling by which the majority class samples are segmented into one of four categories: those suffering from \emph{class-label noise}, unreliable \emph{borderline} samples that sit near the decision surface, \emph{redundant} examples, and \emph{safe} examples (archetypal of the majority class and necessary for training). They showed that the noisy and borderline majority class examples could easily be recognized in the training set by identifying and removing majority class samples that form \emph{Tomek links} \citep{tomek1976two} with minority class samples.
\subsection{Synthesizing minority class samples}
Rather than simply sampling the minority class with replacement, \citet{chawla2002smote} proposed the over-sampling of synthetic minority class observations. As previously mentioned, their SMOTE algorithm uses the \emph{k}-nearest neighbors of randomly sampled minority class observations to generalize the minority class decision regions: for each of $p$ randomly sampled minority observations, $\mathbf{x}^{(i)} \in \mathbb{R}^{n}$, compute the \emph{k}-nearest neighbors of $\mathbf{x}^{(i)}$, denoted $\mathbf{J} \in \mathbb{R}^{k \times n}$, and generate a synthetic example, $\mathbf{x}^{(i)\prime}$, between $\mathbf{x}^{(i)}$ and each nearest neighbor, $\mathbf{j}^{(l)}$:
\[
\mathbf{x}^{(i)\prime} = \mathbf{x}^{(i)} + \mathit{gap} \cdot (\mathbf{j}^{(l)} - \mathbf{x}^{(i)})
\]
where $\mathit{gap}$ is a random scalar between 0 and 1. The application of SMOTE has been shown to effectively aid in generalizing the decision region for the minority class(es) by synthetically creating minority class observations \citep*{chawla2002smote}. However, it runs the risk of \emph{over}-generalizing the minority decision regions in cases of significant class overlap.
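The following is a minimal sketch of this interpolation step (using NumPy and scikit-learn's \texttt{NearestNeighbors}); for simplicity it generates one synthetic example per minority observation from a single randomly chosen neighbor, rather than the amount-dependent sampling of the full SMOTE procedure.
\begin{verbatim}
import numpy as np
from sklearn.neighbors import NearestNeighbors

def interpolate_minority(minority, k=5, random_state=42):
    """Create one synthetic example per minority row by interpolating
    toward a randomly chosen one of its k nearest minority neighbors."""
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    # column 0 is each point itself, so keep only the k true neighbors
    neighbors = nn.kneighbors(minority, return_distance=False)[:, 1:]
    synthetic = []
    for i, x in enumerate(minority):
        j = minority[rng.choice(neighbors[i])]   # a random nearest neighbor
        gap = rng.uniform(0.0, 1.0)              # random scalar in [0, 1)
        synthetic.append(x + gap * (j - x))
    return np.asarray(synthetic)
\end{verbatim}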
Figure \ref{fig:no-balance} shows an unbalanced dataset with significant overlap between classes. The decision regions for the minority class are narrow with respect to the total decision space, and classification performance is poor. Figure \ref{fig:smote-balance} shows the same dataset after being balanced by SMOTE. The decision region for the minority class, though significantly generalized, has bled so far into the majority class space as to diminish the utility of the classifier.
\begin{figure}[H]
\centering
\includegraphics[scale=.5]{no-balance.png}
\caption{An SVM classifier exhibits poor performance on a highly unbalanced dataset with significant class overlap and a narrow decision region for the positive class.}
\label{fig:no-balance}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[scale=.5]{smote-balance.png}
\caption{The same dataset, after being balanced with SMOTE. The minority class decision region has been over-generalized to the point of bleeding so far into the majority class space that the usefulness of the classifier is diminished.}
\label{fig:smote-balance}
\end{figure}
\section{Generative Models}
At a high level, generative methods are a family of unsupervised learning techniques that can be trained to generate synthetic data. The case of image datasets is easy to conceptualize: a model can be thought of as not simply memorizing input samples for some classification task, but learning features that can be used to create a stylistically canonical image that is completely fake, yet resembles a sample that \emph{could} have been presented to the inducer for training. The advantage of generative algorithms is that they do not simply learn the classification boundaries between classes, as discriminative algorithms do, but model the distribution by which the data itself was generated.
\subsection{Variational Auto-Encoders}
SMRT is built atop variational auto-encoders. An auto-encoder is a special case of unsupervised multilayer perceptron (MLP) that estimates its own input vector as its output. More formally, the auto-encoder takes an input vector $\mathbf{x} \in \mathbb{R}^{n}$ and maps it deterministically to a hidden representation (typically a compressed space), $\mathbf{y} \in \mathbb{R}^{h}$, via $\mathbf{y} = f_{\theta}(\mathbf{x}) = g(\mathbf{Wx} + b)$, where $\mathbf{W}$ is an $h \times n$ matrix of weights and the mapping is parameterized by $\theta = \{\mathbf{W}, b\}$ \citep{meyer2015introduction}. Here, $b \in \mathbb{R}^{h}$ is a vector of biases, generally initialized as $\vec{1}$, and $g$ is the activation function, commonly either the \emph{sigmoid} or \emph{relu}:
\[
\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}
\]
\[
\mathrm{relu}(x) = \max(0, x)
\]
The goal, then, is that the mapping $\mathbf{x} \rightarrow \mathbf{y}$ reveals some non-linear transformation of $\mathbf{x}$, thus extracting new features derived from otherwise unobvious relationships between the input features. This offers the ability to reconstruct inputs with minimal error while identifying high-reconstruction-error examples as anomalous, or non-conformant to the underlying learned distribution. The architecture of an auto-encoder is thus two-part: the encode task compresses (or expands) the input signal into the hidden-layer space, while the decode task projects the hidden-layer representation back into the input space.
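As a minimal NumPy sketch of the encode and decode mappings above (the dimensions, random weights, and single hidden layer are illustrative assumptions only, not the architecture used in our experiments):
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.RandomState(0)
n, h = 20, 8                                  # input and hidden dimensions
W_enc = rng.normal(scale=0.1, size=(h, n))    # encoder weights (h x n)
b_enc = np.ones(h)                            # encoder biases
W_dec = rng.normal(scale=0.1, size=(n, h))    # decoder weights (n x h)
b_dec = np.ones(n)                            # decoder biases

x = rng.normal(size=n)                # a single input vector
y = sigmoid(W_enc @ x + b_enc)        # encode: map x into the hidden space
x_hat = sigmoid(W_dec @ y + b_dec)    # decode: project back to input space
reconstruction_error = np.mean((x - x_hat) ** 2)
\end{verbatim}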
Whereas general auto-encoders are deterministic MLPs, the \emph{variational} auto-encoder (VAE) is a generative adaptation that consists of two parts: a \emph{probabilistic} encoder, $q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})$, that approximates the (intractable) posterior distribution, $p(\mathbf{z}|\mathbf{x}^{(i)})$; and a \emph{generative} decoder, $p_{\theta}(\mathbf{x}^{(i)}|\mathbf{z})$, which does not depend on any particular input \citep{shiffman2016}. The VAE learns latent vectors $\mathbf{z}$ whose approximate posterior is a Gaussian with diagonal covariance structure \citep{kingma2013auto}:
\[
\log q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)}) = \log \mathcal{N}(\mathbf{z}; \bm{\mu}^{(i)}, \bm{\sigma}^{2(i)}\mathbf{I})
\]
where the mean and standard deviation of $\mathbf{z}$ (the approximate posterior), $\bm{\mu}^{(i)}$ and $\bm{\sigma}^{(i)}$, are the output of the auto-encoder's encode task \citep{kingma2013auto}.
The VAE parameters are found by minimizing a two-part loss function: the reconstruction loss (the mean squared error between the input vector and its reconstruction) plus the latent loss (the Kullback-Leibler, or KL, divergence), which measures how well the latent vectors approximate the unit Gaussian:
\[
D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})\parallel p_{\theta}(\mathbf{z}))
\]
In the decode stage, synthetic examples are generated by sampling the posterior:
\[
\mathbf{z}^{(i,l)} \sim q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})
\]
and finally decoding $\mathbf{z}^{(i,l)}$ with either a Bernoulli or Gaussian multilayer perceptron \citep{kingma2013auto}. The utility of using VAEs for class balancing is that the approximated posterior distribution generates synthetic examples that are probabilistically archetypal of the minority class samples, in a fashion similar to the ``safe examples'' proposed by \citet{kubat1997addressing}.
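This sampling step is typically implemented with the reparameterization trick of \citet{kingma2013auto}; the following minimal sketch uses placeholder encoder outputs for $\bm{\mu}^{(i)}$ and $\log \bm{\sigma}^{2(i)}$ and the closed form of the KL divergence against the unit Gaussian prior.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

# Placeholder encoder outputs for a single observation x^(i)
mu = np.array([0.3, -0.1])           # mean of q_phi(z | x^(i))
log_var = np.array([-1.0, -0.5])     # log variance of q_phi(z | x^(i))

# Reparameterization: z = mu + sigma * epsilon, with epsilon ~ N(0, I)
epsilon = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * log_var) * epsilon

# Closed-form KL divergence between q_phi(z | x^(i)) and the N(0, I) prior
kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
\end{verbatim}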
\section{SMRT}
We present a generative over-sampling approach similar to SMOTE, by which the observations in each minority class are used to fit a variational auto-encoder, and synthetic examples are generated until the class is represented at the user-specified ratio.
\begin{algorithm}
\caption{SMRT}\label{smrt}
\begin{algorithmic}[1]
\Procedure{Balance}{$X, y, ratio$}\Comment{Balance $X, y$ subject to $ratio$}
\State $labels \gets \text{distinct } \textit{y}$
\State $majority \gets argmax(count(y))$
\State $nlabels \gets \text{length of } labels$
\State $nreq \gets int(ratio \times \text{number of majority samples in } \textit{y})$
\For{i := 1 to \textit{nlabels}}
\State $label \gets labels[i]$
\If {$label \neq majority$}\Comment{Skip the majority class}
\State $Xsub \gets X \text{ where } y = label$
\State $p \gets nreq - \text{length of } Xsub$\Comment{\textit{n} needed for this class}
\State \text{Fit a variational auto-encoder with $Xsub$, call it $vae$}
\State \text{Generate $p$ synthetic examples from $vae$, update $X, y$}
\EndIf
\EndFor
\State \Return $X, y$ \Comment{Optionally shuffle}
\EndProcedure
\end{algorithmic}
\end{algorithm}
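For illustration, a minimal Python sketch of Algorithm \ref{smrt} follows. The \texttt{fit\_vae} and \texttt{sample\_vae} hooks are hypothetical placeholders for fitting a variational auto-encoder and drawing synthetic rows from it, rather than the interface of any particular implementation.
\begin{verbatim}
import numpy as np

def balance(X, y, ratio, fit_vae, sample_vae):
    """Balance (X, y) so each minority class reaches ratio * majority count.

    fit_vae(X_sub) -> model and sample_vae(model, p) -> (p, n) ndarray are
    hypothetical hooks for training a VAE and generating p synthetic rows.
    """
    X, y = np.asarray(X), np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    majority = labels[np.argmax(counts)]
    n_required = int(ratio * counts.max())

    for label in labels:
        if label == majority:              # skip the majority class
            continue
        X_sub = X[y == label]
        p = n_required - len(X_sub)        # number of synthetic rows needed
        if p <= 0:
            continue
        vae = fit_vae(X_sub)               # fit a VAE on this class only
        X_new = sample_vae(vae, p)         # generate p synthetic examples
        X = np.vstack([X, X_new])
        y = np.concatenate([y, np.full(p, label, dtype=y.dtype)])
    return X, y                            # optionally shuffle afterward
\end{verbatim}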
\section{Experiments}
We induced three different learners on each dataset: TODO
\subsection{Datasets}
\newpage
\bibliography{references}
\end{document}