# -*- coding: utf-8 -*-
"""anomaly-detection-with-auto-encoders.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1CqgRuzi8hIqCtnxi6B0zqvip9RwfU7uF
# Table of Contents
1. [Unsupervised Learning with Auto-Encoders](#1)
1. [Preprocessing](#2)
1. [Visualising clusters with t-SNE](#3)
1. [Train/Validate/Test split](#4)
1. [Normalising & Standardising](#5)
1. [Training the auto-encoder](#6)
1. [Reconstructions](#7)
1. [Setting a threshold for classification](#8)
1. [Latent Space ](#9)
1. [Conclusion](#10)
<a id="1"></a> <br>
# Unsupervised Learning with Auto-Encoders
If you are interested in an introduction to auto-encoders, head over to [Julien Despois' article](https://hackernoon.com/latent-space-visualization-deep-learning-bits-2-bd09a46920df).
If a more technical breakdown is what you are looking for, check out [Lilian Weng's blog post](https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html) from which the below image is sourced.
It illustrates the functioning of an auto-encoder for MNIST images, but the concept is the same.
![image.png](attachment:image.png)
The idea is quite straightforward:
1. Due to the **bottleneck architecture** of the neural network, it is forced to learn a **condensed representation** from which to reproduce the original input.
2. We feed it **only normal transactions**, which it will learn to reproduce with high fidelity.
3. As a consequence, if a **fraud transaction is sufficiently distinct** from normal transactions, the auto-encoder will have trouble reproducing it with its learned weights, and the subsequent **reconstruction loss will be high**.
4. Anything above a specific loss (threshold) will be **flagged as anomalous** and thus labeled as fraud.
<a id="2"></a> <br>
# Preprocessing
## Import Libraries & set Random Seeds
"""
from google.colab import drive
drive.mount('/content/drive')
# Commented out IPython magic to ensure Python compatibility.
# read & manipulate data
import pandas as pd
import numpy as np
import tensorflow as tf
# visualisations
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', context='notebook')
# %matplotlib notebook
# misc
import random as rn
# load the dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML/creditcard.csv')
# manual parameters
RANDOM_SEED = 42
TRAINING_SAMPLE = 200000
VALIDATE_SIZE = 0.2
# setting random seeds for libraries to ensure reproducibility
np.random.seed(RANDOM_SEED)
rn.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
"""## Renaming columns"""
# let's quickly convert the columns to lower case and rename the Class column,
# since `class` is a reserved keyword in Python
df.columns = map(str.lower, df.columns)
df.rename(columns={'class': 'label'}, inplace=True)
# print first 5 rows to get an initial impression of the data we're dealing with
df.head()
"""## Calculated field: log10(amount)
Turn the heavily skewed amount feature into its log10 equivalent, which is much closer to normally distributed.
"""
# add a negligible amount to avoid taking the log of 0
df['log10_amount'] = np.log10(df.amount + 0.00001)
# keep the label field at the back
df = df[
[col for col in df if col not in ['label', 'log10_amount']] +
['log10_amount', 'label']
]
"""<a id="3"></a> <br>
# Visualising clusters with t-SNE
*t-Distributed Stochastic Neighbor Embedding (t-SNE)*
From the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html): <br>
> t-SNE [1] is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.
In plain English, most certainly oversimplifying matters: **t-SNE is a dimensionality reduction technique used for visualisations** of complex datasets.
It **maps clusters in high-dimensional data** to **a two- or three dimensional plane** so we can get an idea of how easy it will be to **discriminate between classes**.
It does this by trying to keep the distance between data points in lower dimensions proportional to the probability that these data points are neighbours in the higher dimensions.
A more elaborate [introduction](https://www.datacamp.com/community/tutorials/introduction-t-sne) is available on DataCamp.
## Undersampling the non-fraud
To keep the computation time low, let's feed t-SNE only a small subsample (undersampling the clean transactions).
"""
# manual parameter
RATIO_TO_FRAUD = 15
# dropping columns we no longer need: time, and the raw amount (replaced by log10_amount)
df = df.drop(['time', 'amount'], axis=1)
# splitting by class
fraud = df[df.label == 1]
clean = df[df.label == 0]
# undersample clean transactions
clean_undersampled = clean.sample(
int(len(fraud) * RATIO_TO_FRAUD),
random_state=RANDOM_SEED
)
# concatenate with fraud transactions into a single dataframe
visualisation_initial = pd.concat([fraud, clean_undersampled])
column_names = list(visualisation_initial.drop('label', axis=1).columns)
# isolate features from labels
features, labels = visualisation_initial.drop('label', axis=1).values, \
visualisation_initial.label.values
print(f"""The non-fraud dataset has been undersampled from {len(clean):,} to {len(clean_undersampled):,}.
This represents a ratio of {RATIO_TO_FRAUD}:1 to fraud.""")
"""## t-SNE output"""
from sklearn.manifold import TSNE
from mpl_toolkits.mplot3d import Axes3D
def tsne_scatter(features, labels, dimensions=2, save_as='graph.png'):
    if dimensions not in (2, 3):
        raise ValueError('tsne_scatter can only plot in 2d or 3d (What are you? An alien that can visualise >3d?). Make sure the "dimensions" argument is in (2, 3)')

    # t-SNE dimensionality reduction
    features_embedded = TSNE(n_components=dimensions, random_state=RANDOM_SEED).fit_transform(features)

    # initialising the plot
    fig, ax = plt.subplots(figsize=(8,8))

    # switch to a 3d projection if requested
    if dimensions == 3: ax = fig.add_subplot(111, projection='3d')

    # plotting data
    ax.scatter(
        *zip(*features_embedded[np.where(labels==1)]),
        marker='o',
        color='r',
        s=2,
        alpha=0.7,
        label='Fraud'
    )
    ax.scatter(
        *zip(*features_embedded[np.where(labels==0)]),
        marker='o',
        color='g',
        s=2,
        alpha=0.3,
        label='Clean'
    )

    # add a legend, save the figure, and display it
    plt.legend(loc='best')
    plt.savefig(save_as)
    plt.show()
tsne_scatter(features, labels, dimensions=2, save_as='tsne_initial_2d.png')
"""Some clusters are apparent, but a minority of fraud transactions remains sneaky, sneaky.
<a id="4"></a> <br>
# Train/Validate/Test split
Our auto-encoder will **only train on transactions that were normal**.
What's left over will be combined with the fraud set to form our test sample.
We will be doing something akin to the below:
![image.png](attachment:image.png)
1. Training: only non-fraud
* Split into:
1. Actual training of our autoencoder
2. Validation of the neural network's ability to generalize
2. Testing : mix of fraud and non-fraud
* Treated like new data
* Attempt to locate outliers
1. Compute reconstruction loss
2. Apply threshold
"""
print(f"""Shape of the datasets:
clean (rows, cols) = {clean.shape}
fraud (rows, cols) = {fraud.shape}""")
# shuffle our training set
clean = clean.sample(frac=1).reset_index(drop=True)
# training set: exclusively non-fraud transactions
X_train = clean.iloc[:TRAINING_SAMPLE].drop('label', axis=1)
# testing set: the remaining non-fraud + all the fraud
X_test = pd.concat([clean.iloc[TRAINING_SAMPLE:], fraud]).sample(frac=1)
print(f"""Our testing set is composed as follows:
{X_test.label.value_counts()}""")
from sklearn.model_selection import train_test_split
# train // validate - no labels since they're all clean anyway
X_train, X_validate = train_test_split(X_train,
test_size=VALIDATE_SIZE,
random_state=RANDOM_SEED)
# manually splitting the labels from the test df
X_test, y_test = X_test.drop('label', axis=1).values, X_test.label.values
"""## Summary"""
print(f"""Shape of the datasets:
training (rows, cols) = {X_train.shape}
validate (rows, cols) = {X_validate.shape}
holdout (rows, cols) = {X_test.shape}""")
"""<a id="5"></a> <br>
# Normalising & Standardising
## Why
In an [excellent article by Jeremy Jordan](https://www.jeremyjordan.me/batch-normalization/), it is explained why making sure your data is normally distributed can **help stochastic gradient descent converge** more effectively.
In a nutshell:
![image.png](attachment:image.png)
## When
At what point in the data processing do we apply standardisation/normalisation? <br>
An [excellent answer was provided on StackOverflow](https://stackoverflow.com/questions/49444262/normalize-data-before-or-after-split-of-training-and-testing-data).
> Don't forget that **testing data points represent real-world data**. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and normalise the data by subtracting the mean and dividing by the variance. **If you take the mean and variance of the whole dataset you'll be introducing future information into the training explanatory variables** (i.e. the mean and variance).
>
> Therefore, you should **perform feature normalisation over the training data**. Then **perform normalisation on testing instances** as well, but this time **using the mean and variance of training** explanatory variables. In this way, we can test and evaluate whether our model can generalize well to new, unseen data points.
>
> <span style="font-size:10px">[Answer by [Giorgos Myrianthous](https://stackoverflow.com/users/7131757/giorgos-myrianthous)]</span>
## Building our pipeline
"""
from sklearn.preprocessing import Normalizer, MinMaxScaler
from sklearn.pipeline import Pipeline
# configure our pipeline: Normalizer rescales each row to unit norm,
# MinMaxScaler then squeezes each feature into the [0, 1] range
pipeline = Pipeline([('normalizer', Normalizer()),
                     ('scaler', MinMaxScaler())])
"""## Fitting the pipeline"""
# get normalization parameters by fitting to the training data
pipeline.fit(X_train);
"""## Applying transformations with acquired parameters"""
# transform the training and validation data with these parameters
X_train_transformed = pipeline.transform(X_train)
X_validate_transformed = pipeline.transform(X_validate)
"""## Before & After"""
g = sns.PairGrid(X_train.iloc[:,:3].sample(600, random_state=RANDOM_SEED))
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Before:')
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot);
g = sns.PairGrid(pd.DataFrame(X_train_transformed, columns=column_names).iloc[:,:3].sample(600, random_state=RANDOM_SEED))
plt.subplots_adjust(top=0.9)
g.fig.suptitle('After:')
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot);
"""We can tell the data is slightly more **uniform and proportionally distributed**. <br>
The ranges were also shrunk to fit **between 0 and 1**.
<a id="6"></a> <br>
# Training the auto-encoder
## TensorBoard
As documented in [this kernel by Aurelio Agundez](https://www.kaggle.com/aagundez/using-tensorboard-in-kaggle-kernels), TensorBoard requires a running kernel, so its output will only be available in an editor session.
Fork this notebook if you wish to interact with it.
"""
# Commented out IPython magic to ensure Python compatibility.
# Load the extension and start TensorBoard
# %load_ext tensorboard
# %tensorboard --logdir logs
"""## Architecture of our model
Keras has become the standard high-level API within TensorFlow. No surprise, it's awesome.
Check out their [blog post on the topic of autoencoders](https://blog.keras.io/building-autoencoders-in-keras.html).
"""
# data dimensions // hyperparameters
input_dim = X_train_transformed.shape[1]
print(input_dim)
BATCH_SIZE = 512
EPOCHS = 100
# https://keras.io/layers/core/
autoencoder = tf.keras.models.Sequential([
# deconstruct / encode
tf.keras.layers.Dense(input_dim, activation='elu', input_shape=(input_dim, )),
tf.keras.layers.Dense(16, activation='elu'),
tf.keras.layers.Dense(8, activation='elu'),
tf.keras.layers.Dense(4, activation='elu'),
tf.keras.layers.Dense(2, activation='elu'),
# reconstruction / decode
tf.keras.layers.Dense(4, activation='elu'),
tf.keras.layers.Dense(8, activation='elu'),
tf.keras.layers.Dense(16, activation='elu'),
tf.keras.layers.Dense(input_dim, activation='elu')
])
# https://keras.io/api/models/model_training_apis/
autoencoder.compile(optimizer="adam",
loss="mse",
metrics=["acc"])
# print an overview of our model
autoencoder.summary();
"""## Callbacks
* Stop training early once the validation loss stops improving (with some patience), restoring the best weights.
* Checkpoint the model with the lowest validation loss along the way.
* Get graphical insights with TensorBoard.
"""
from datetime import datetime
# current date and time
yyyymmddHHMM = datetime.now().strftime('%Y%m%d%H%M')
# new folder for a new run
log_subdir = f'{yyyymmddHHMM}_batch{BATCH_SIZE}_layers{len(autoencoder.layers)}'
# define our early stopping
early_stop = tf.keras.callbacks.EarlyStopping(
monitor='val_loss',
min_delta=0.0001,
patience=10,
verbose=1,
mode='min',
restore_best_weights=True
)
save_model = tf.keras.callbacks.ModelCheckpoint(
filepath='autoencoder_best_weights.hdf5',
save_best_only=True,
monitor='val_loss',
verbose=0,
mode='min'
)
# note: recent versions of tf.keras no longer accept a batch_size argument here
tensorboard = tf.keras.callbacks.TensorBoard(
    f'logs/{log_subdir}',
    update_freq='batch'
)
# callbacks argument only takes a list
cb = [early_stop, save_model, tensorboard]
"""## Training"""
history = autoencoder.fit(
X_train_transformed, X_train_transformed,
shuffle=True,
epochs=EPOCHS,
batch_size=BATCH_SIZE,
callbacks=cb,
validation_data=(X_validate_transformed, X_validate_transformed)
);
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.savefig("train.png")
plt.show()
"""<a id="7"></a> <br>
# Reconstructions
We **apply the transformation pipeline to our test set**. <br>
Then, we **pass the data through the trained autoencoder**.
"""
# transform the test set with the pipeline fitted to the training set
X_test_transformed = pipeline.transform(X_test)
# pass the transformed test set through the autoencoder to get the reconstructed result
reconstructions = autoencoder.predict(X_test_transformed)
"""**Calculate the reconstruction loss** for every transaction and draw a sample."""
# calculating the mean squared error reconstruction loss per row in the numpy array
mse = np.mean(np.power(X_test_transformed - reconstructions, 2), axis=1)
clean = mse[y_test==0]
fraud = mse[y_test==1]
fig, ax = plt.subplots(figsize=(6,6))
ax.hist(clean, bins=50, density=True, label="clean", alpha=.6, color="green")
ax.hist(fraud, bins=50, density=True, label="fraud", alpha=.6, color="red")
plt.title("(Normalized) Distribution of the Reconstruction Loss")
plt.legend()
plt.show()
"""Very promising! Although some transactions seem to fool the autoencoder, the fraudulent transactions clearly have a distinguishing element in their data that sets them apart from clean ones.
<a id="8"></a> <br>
# Setting a threshold for classification
## Unsupervised
Normally, in an unsupervised solution, this is where the story would end. We would **set a threshold that limits the number of false positives** to a manageable degree, **while still capturing the most anomalous data points**.
### Percentiles
We could set this threshold by flagging the top x% of reconstruction losses as anomalous (a quick sketch of this follows below).
### MAD
We could also use a **modified Z-score using the Median Absolute Deviation to define outliers** on our reconstruction data. Here is a [good blog post on the topic](https://medium.com/james-blogs/outliers-make-us-go-mad-univariate-outlier-detection-b3a72f1ea8c7) by João Rodrigues, illustrating why this algorithm is more robust and scalable than the percentiles method.
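For reference, the modified z-score that `mad_score` below implements (following the NIST handbook linked in its docstring) is

$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}, \qquad \mathrm{MAD} = \mathrm{median}(\lvert x_i - \tilde{x} \rvert),$$

where $\tilde{x}$ is the median of the reconstruction losses.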
"""
THRESHOLD = 3
def mad_score(points):
    """https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm """
    m = np.median(points)
    ad = np.abs(points - m)
    mad = np.median(ad)

    return 0.6745 * ad / mad
z_scores = mad_score(mse)
outliers = z_scores > THRESHOLD
print(f"Detected {np.sum(outliers):,} outliers in a total of {np.size(z_scores):,} transactions [{np.sum(outliers)/np.size(z_scores):.2%}].")
"""## Supervised
We know the labels, so we can verify our results.
### Classification Matrix on MAD outliers
A closer look:
"""
from sklearn.metrics import (confusion_matrix,
precision_recall_curve)
# get (mis)classification
cm = confusion_matrix(y_test, outliers)
# true/false positives/negatives
(tn, fp,
fn, tp) = cm.flatten()
print(f"""The classifications using the MAD method with threshold={THRESHOLD} are as follows:
{cm}
% of transactions labeled as fraud that were correct (precision): {tp}/({fp}+{tp}) = {tp/(fp+tp):.2%}
% of fraudulent transactions caught successfully (recall): {tp}/({fn}+{tp}) = {tp/(fn+tp):.2%}""")
"""### Asymmetric error cost
In the real world, we can expect **different costs associated with reporting a false positive versus reporting a false negative**. Missing a fraud case is likely to be much more costly than wrongly flagging a transaction as one. In [another kernel](https://www.kaggle.com/robinteuwens/fraud-detection-as-a-cost-optimization-problem/comments), I discuss an approach to determining these costs for this dataset in depth. A toy cost calculation follows below to make the idea concrete.
### Recall & Precision
Generally speaking, you will have to prioritise what you find more important. This dilemma is commonly called the **"recall vs precision" trade-off**.
If you want to increase recall, **adjust the MAD's Z-score threshold** downwards; if you want to increase precision, adjust it upwards (the `precision_recall_curve` sketch below plots this trade-off directly).
"""
clean = z_scores[y_test==0]
fraud = z_scores[y_test==1]
fig, ax = plt.subplots(figsize=(6,6))
ax.hist(clean, bins=50, density=True, label="clean", alpha=.6, color="green")
ax.hist(fraud, bins=50, density=True, label="fraud", alpha=.6, color="red")
plt.title("Distribution of the modified z-scores")
plt.legend()
plt.show()
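# Sketch of how the precision_recall_curve imported earlier can make the
# recall-vs-precision trade-off explicit, using the modified z-scores as the anomaly score
# (illustrative; the thresholds returned here are on the z-score scale).
precision, recall, pr_thresholds = precision_recall_curve(y_test, z_scores)
plt.subplots(figsize=(6, 6))
plt.plot(pr_thresholds, precision[:-1], label='precision')
plt.plot(pr_thresholds, recall[:-1], label='recall')
plt.xlabel('modified z-score threshold')
plt.legend(loc='best')
plt.title('Precision & recall as the threshold varies')
plt.show()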
"""<a id="9"></a> <br>
# Latent Space
It is always interesting to look at the **compressed representation** our neural network devised.
## Encoder
Let's build the encoder that gets us to the bottleneck. We take the layers from our autoencoder.
"""
encoder = tf.keras.models.Sequential(autoencoder.layers[:5])
encoder.summary()
"""## Undersampling
Consistent with the earlier t-SNE visualisation, let's undersample the clean transactions.
"""
# taking all the fraud, undersampling clean
fraud = X_test_transformed[y_test==1]
clean = X_test_transformed[y_test==0][:len(fraud) * RATIO_TO_FRAUD, ]
# combining arrays & building labels
features = np.append(fraud, clean, axis=0)
labels = np.append(np.ones(len(fraud)),
np.zeros(len(clean)))
# getting latent space representation
latent_representation = encoder.predict(features)
print(f'Clean transactions downsampled from {len(X_test_transformed[y_test==0]):,} to {len(clean):,}.')
print('Shape of latent representation:', latent_representation.shape)
"""## Visualising the Latent Space"""
X = latent_representation[:,0]
y = latent_representation[:,1]
# plotting
plt.subplots(figsize=(8, 8))
plt.scatter(X[labels==0], y[labels==0], s=1, c='g', alpha=0.3, label='Clean')
plt.scatter(X[labels==1], y[labels==1], s=2, c='r', alpha=0.7, label='Fraud')
# labeling
plt.legend(loc='best')
plt.title('Latent Space Representation')
# saving & displaying
plt.savefig('latent_representation_2d');
plt.show()
"""Although there is no perfectly distinct cluster, **most of the fradulent transactions appear to be neatly grouped together**.
This is in line with the hope/idea that both **classes would occupy distinct areas in latent space**, due to the **encoder's weights not being calibrated to cope with fraudulent transactions**.
![image.png](attachment:image.png)
<a id="10"></a> <br>
# Conclusion
We could already tell from our misclassifications that the network was not able to generalize perfectly. However, we must not forget that **our model was trained never having seen a single fraud case!** In that regard, its performance is decent. It illustrates the power of **autoencoders as anomaly detection tools**.
To improve its performance, perhaps we need to:
* improve the model architecture
* diversify the training data more, with a broader sample of clean transactions
* augment the data with different, additional features - the data itself might not be good enough to distinguish between classes perfectly (i.e. fraudsters are disguising themselves well enough to always go undetected using these data points, no matter the algorithm).
"""