-
Notifications
You must be signed in to change notification settings - Fork 89
/
Copy pathindex.html
576 lines (554 loc) · 24.4 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<title>Bias Variance Tradeoff</title>
<meta
name="description"
content="MLU-Explain: Visual Explanation of the Bias Variance Tradeoff."
/>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta
property="og:image"
content="https://mlu-explain.github.io/assets/ogimages/ogimage-bias-variance.png"
/>
<meta property="og:title" content="Bias Variance Tradeoff" />
<meta
property="og:description"
content="Learn the tradeoff between under- and over-fitting models, how it relates to bias and variance, and explore interactive examples related to LASSO and KNN."
/>
<meta property="og:image:width" content="1200" />
<meta property="og:image:height" content="630" />
<link rel="icon" href="./assets/mlu_robot.png" />
<link rel="stylesheet" href="css/katex.min.css" />
<link rel="stylesheet" href="css/styles.scss" />
<!-- Global site tag (gtag.js) - Google Analytics -->
<script
async
src="https://www.googletagmanager.com/gtag/js?id=G-1FYW57GW3G"
></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag("js", new Date());
gtag("config", "G-1FYW57GW3G");
</script>
</head>
<body>
<main>
<div id="intro-icon">
<a href="https://mlu-explain.github.io"
><svg
width="50"
height="50"
viewBox="0 0 234 216"
fill="none"
xmlns="https://www.w3.org/2000/svg"
>
<g id="mlu_robot 1" clip-path="url(#clip0)">
<g>
<path
id="Vector"
d="M90.6641 83.1836C96.8828 83.1836 101.941 78.1289 101.941 71.8906V71.8242C101.941 65.5898 96.8945 60.5312 90.6641 60.5312C84.4453 60.5312 79.3828 65.5898 79.3828 71.8242V71.8906C79.3828 78.1289 84.4336 83.1836 90.6641 83.1836Z"
fill="#232f3e"
/>
<path
id="Vector_2"
d="M143.305 83.1836C149.523 83.1836 154.586 78.1289 154.586 71.8906V71.8242C154.586 65.5898 149.535 60.5312 143.305 60.5312C137.09 60.5312 132.027 65.5898 132.027 71.8242V71.8906C132.027 78.1289 137.078 83.1836 143.305 83.1836Z"
fill="#232f3e"
/>
<path
id="Vector_3"
d="M163.586 159.402H173.609V122.641H163.586V159.402Z"
fill="#232f3e"
/>
<path
id="Vector_4"
d="M60.3594 159.402H70.3867V122.641H60.3594V159.402Z"
fill="#232f3e"
/>
<g id="Group">
<path
id="Vector_5"
d="M182.16 30.0781H51.8047V10.0234H182.16V30.0781ZM182.16 103.609H51.8047V40.1055H182.16V103.609ZM144.559 168.789H89.4062V113.641H144.559V168.789ZM0 0V10.0234H15.8789V46.7891H25.9023V10.0234H41.7812V113.641H79.3867V178.816H96.9297V215.578H106.957V178.816H127.016V215.578H137.039V178.816H154.586V113.641H192.188V10.0234H233.969V0"
fill="#232f3e"
/>
</g>
</g>
</g>
<defs>
<clipPath id="clip0">
<rect width="233.97" height="215.58" fill="#232f3e" />
</clipPath>
</defs>
</svg>
<h2 class="logo">MLU-expl<span id="ai">AI</span>n</h2>
</a>
</div>
<section id="intro">
<!-- MLU Robot Icon -->
<h1 id="intro__hed">
The <span id="bias">Bias</span> <span id="variance">Variance</span>
Tradeoff
</h1>
<h3 id="intro__date">
<a href="https://twitter.com/jdwlbr">Jared Wilber</a>
& Brent Werness, January 2021
</h3>
<p id="intro__dek">
Prediction errors can be decomposed into two main subcomponents of
interest: error from <span class="bolder">bias</span>, and error from
<span class="bolder">variance</span>. The tradeoff between a model's
ability to minimize bias and variance is foundational to training
machine learning models, so it's worth taking the time to understand
the concept.
</p>
</section>
<!-- BEGIN SCROLLYTELLING SECTION -->
<section id="scrolly">
<figure id="scroll-figure">
<div id="scroll-viz"></div>
<div id="svg2"></div>
</figure>
<article>
<div class="step" data-step="1">
<p class="step-head"></p>
<p>
Using our wildest imagination, we can picture a dataset consisting
of features X and labels Y, as on the left. Also imagine that we’d
like to generalize this relationship to additional values of X -
that we’d like to predict future values based on what we’ve
already seen before.
<br /><br />
With our imagination now undoubtedly spent, we can take a very
simple approach to modeling the relationship between X and Y by
just drawing a line to the general trend of the data.
</p>
</div>
<div class="step" data-step="2">
<p class="step-head">A Simple Model</p>
<p>
Our simple model isn’t the best at modeling the relationship -
clearly there's information in the data that it's failing to
capture.
<br /><br />
We'll measure the performance of our model by looking at the
<a href="https://en.wikipedia.org/wiki/Mean_squared_error"
>mean-squared error</a
>
of its output and the true values (displayed in the bottom
barchart). Our model is close to some of the training points, but
overall there's definitely room for improvement. <br /><br />
The error on the training data is important for model tuning, but
what we really care about is how it performs on data we haven't
seen before, called test data. So let's check that out as well.
</p>
</div>
<div class="step" data-step="3">
<p class="step-head">Low Complexity & Underfitting</p>
<p>
Uh-oh, it looks like our earlier suspicions were correct - our
model is garbage. The test error is even higher than the train
error!
<br /><br />
In this case, we say that our model is
<span id="underfit-highlight">underfitting</span> the data: our
model is so simple that it fails to adequately capture the
relationships in the data. The high test error is a direct result
of the lack of complexity of our model. <br /><br />
<span id="underfit-def"
>An underfit model is one that is too simple to accurately
capture the relationships between its features X and label
Y.</span
>
</p>
</div>
<div class="step" data-step="4">
<p class="step-head">A Complex Model</p>
<p>
Our previous model performed poorly because it was too simple.
Let's try our luck with something more complex. In fact, let's get
as complex as we can - let's train a model that predicts every
point in our training data perfectly.
<br /><br />
Great! Now our training error is zero. As the old saying goes in
Tennessee: Fool me once - shame on you. Fool me twice - er... you
can't get fooled again ;).
</p>
</div>
<div class="step" data-step=" 5">
<p class="step-head">High Complexity & Overfitting</p>
<p>
Wait a second... Even though our training error from our model was
effectively zero, the error on our test data is high. What gives?
<br /><br />
Unsurprisingly, our model is too complicated. We say that it
<span id="overfit-highlight">overfits</span> the data. Instead of
learning the true trends underlying our dataset, it memorized
noise and, as a result, the model is not generalizable to datasets
beyond its training data. <br /><br />
<span id="overfit-def"
>Overfitting refers to the case when a model is so specific to
the data on which it was trained that it is no longer applicable
to different datasets.</span
><br /><br />
In situations where your training error is low but your test error
is high, you've likely overfit your model.
</p>
</div>
<div class="step" data-step="6">
<p class="step-head">Test Error Decomposition</p>
<p>
Our test error can come as a result of both under- and over-
fitting our data, but how do the two relate to each other?
<br /><br />
In the general case,
<span id="decomp-highlight"
>mean-squared error can be decomposed into three components:
error due to bias, error to to variance, and error due to
noise.</span
>
<br /><br />
<span class="katex-bv-text"></span>
<br /><br />
Or, mathematically:
<br /><br />
<span class="katex-bv-equation"></span>
<br /><br />
We can’t do much about the irreducible<span
class="katex-noise"
></span>
term, but we can make use of the relationship between both bias
and variance to obtain better predictions.
</p>
</div>
<div class="step" data-step="7">
<p class="step-head">Bias</p>
<p>
<span id="bias-def"
>Bias represents the difference between the average prediction
and the true value</span
>: <br /><br />
<span class="katex-bias"></span>
<br /><br />
The term <span class="katex-bias-inline"></span> is a tricky one.
It refers to the average prediction after the model has been
trained over several independent datasets. We can think of the
bias as measuring a <i>systematic</i> error in prediction.
<br /><br />These different model realizations are shown in the
top chart, while the error decomposition (for each point of data)
is shown in the bottom chart. <br /><br />
For underfit (low-complexity) models, the majority of our error
comes from bias.
</p>
</div>
<div class="step" data-step="8">
<p class="step-head">Variance</p>
<p>
As with bias, the notion of variance also relates to different
realizations of our model. Specifically,
<span id="variance-def"
>variance measures how much, on average, predictions vary for a
given data point</span
>: <br /><br />
<span class="katex-var"></span>
<br /><br />
As you can see in the bottom plot, predictions from overfit
(high-complexity) models show a lot more error from variance than
from bias. It’s easy to imagine that any unseen data points will
be predicted with high error.
</p>
</div>
<div class="step" data-step="9">
<p class="step-head">Finding A Balance</p>
<p>
To obtain our best results, we should work to find a happy medium
between a model that is so basic it fails to learn meaningful
patterns in our data, and one that is so complex it fails to
generalize to unseen data .
<br /><br />
In other words, we don’t want an underfit model, but we don’t want
an overfit model either. We want something in between - something
with enough complexity to learn learn the generalizable patterns
in our data.
<br /><br />
<span id="balance-def"
>By trading some bias for variance (i.e. increasing the
complexity of our model), and without going overboard, we can
find a balanced model for our dataset</span
>.
</p>
</div>
<div class="step" data-step="10">
<p class="step-head">Across Complexities</p>
<p>
We just showed, at different levels of complexity, a sample of
model realizations alongside their corresponding prediction error
decompositions.
<br /><br />
Let’s direct our focus to the error decompositions across model
complexities.
<br /><br />
For each level of complexity, we’ll aggregate the error
decomposition across all data-points, and plot the aggregate
errors at their level of complexity.
<br /><br />
This aggregation applied to our balanced model (i.e. the middle
level of complexity) is shown to the left.
</p>
</div>
<div class="step" data-step="11">
<p class="step-head">The Bias Variance Trade-off</p>
<p>
Repeating this aggregation across our range of model complexities,
we can see the relationship between bias and variance in
prediction errors manifests itself as a U-shaped curve detailing
the trade off between bias and variance.
<br /><br />
When a model is too simple (i.e. small values along the x-axis),
it ignores useful information, and the error is composed mostly of
that from bias.
<br /><br />
When a model is too complex (i.e. large values along the x-axis),
it memorizes non-general patterns, and the error is composed
mostly of that from variance.
<br /><br />
<span id="conclusion"
>The ideal model aims to minimize both bias and variance. It
lays in the sweet spot - not too simple, nor too complex.
Achieving such a balance will yield the minimum error.</span
>
</p>
</div>
</article>
</section>
<!-- END SCROLLYTELLING -->
<div class="model-text">
<br /><br />
<p>
Many models have parameters that change the final learned models,
called hyperparameters Let's look at how these hyperparameters may be
used to control the the bias-variance tradeoff with two examples:
LOESS Regression and K-Nearest Neighbors.
</p>
<br />
</div>
<section class="models" class="section-offset">
<hr />
<!-- LOESS -->
<div class="model-text-wrapper">
<h1 class="model-header">LOESS Regression</h1>
<p class="model-text">
LOESS (LOcally Estimated Scatterplot Smoothing) regression is a
nonparametric technique for fitting a smooth surface between an
outcome and some predictor variables. The curve fitting at a given
point is weighted by nearby data. This weighting is governed by a
smoothing hyperparameter, which represents the proportion of
neighboring data used to calculate each local fit.
<br /><br />
Thus, the bias variance tradeoff for LOESS may be controlled for via
the smoothness parameter. When the smoothness is small, the amount
of data we consider is insufficient for an accurate fit, resulting
in large variance. However, if we make the smoothness too high (i.e.
over-smoothed), we trade local information for global, resulting in
large bias.
<br /><br />
Below a LOESS curve is fit to two variables. Randomize the training
data to observe the effect different model realizations have on
variance, and control the smoothness to observe the tradeoff between
under- and over-fitting (and thus, bias and variance).
</p>
</div>
<div class="model-controls">
<button class="data-button" id="button-loess">
Randomize Train Data
</button>
<div id="slider-container-loess"></div>
</div>
<div
class="model-wrapper"
style="text-align: center; justify-content: center"
>
<div id="fit-container-loess"></div>
<div id="predict-container-loess"></div>
</div>
<!-- </section> -->
<!-- END LOESS -->
<hr />
</section>
<section class="models" class="section-offset">
<!-- KNN -->
<div class="model-text-wrapper">
<h1 class="model-header">K-Nearest Neighbors</h1>
<p class="model-text">
K-Nearest Neighbors (KNN) classification is a simple technique for
assigning class membership to a data point with some majority vote
of its K-nearest neighbors. For example, when K = 1, the data point
is simply assigned to the class of that single nearest neighbor. If
K = 69, the data point is assigned to the majority class of its
69-nearest neighbors.
<br /><br />
We can observe the bias variance tradeoff in KNN directly by playing
with the hyperparameter K. When K is small, only a small number of
neighbors are considered during the classification vote. The
resulting islands and jagged boundaries are a result of high
variance, as classifications are determined by very localized
neighborhoods. For large values of K, we see very smoothed regions
that deviate sharply from the true decision boundary - go too high
and you’ll just obtain a majority vote. This is high bias. On the
other hand, for medium values of K, we see smooth the regions that
follow along the true decision boundary.
<br /><br />
Explore the trade-off for yourself below! The plot on the left shows
the training data. The plot on the right shows decision regions
based on the current value of K. Deeper colors reflect more
confidence in the classification. Hover over a point to see its
classification to the right, and the K-nearest neighbors used for
consideration to the left.
</p>
</div>
<div class="model-controls">
<button class="data-button" id="button-knn">Randomize Data</button>
<div id="slider-container"></div>
</div>
<div
class="model-wrapper"
style="text-align: center; justify-content: center"
>
<div id="fit-container"></div>
<div id="predict-container"></div>
</div>
<!-- End KNN --><!-- KNN -->
</section>
<hr />
<section class="section-offset">
<div style="text-align: center">
<h1 class="model-header">What About Double Descent?</h1>
<p class="model-text">
You may have heard about the phenomena, known as Double Descent,
wherein which the classic U-shaped bias variance curve (seen above
and in textbooks everywhere) is followed by a second dip (shown
below). Such a phenomena must nullify the bias variance tradeoff we
just spent so long explaining, right?
</p>
<br />
<div id="dd-container"></div>
<br />
<p class="model-text">
Fret not, dear reader! As we will detail in our next series of
articles, Double Descent actually supports the classical notion of
the bias variance trade off. Stay tuned to learn more!
<br /><br />
</p>
</div>
</section>
<hr />
<!-- CONCLUSION -->
<section>
<div style="text-align: center">
<h1 class="model-header">It's Finally Over</h1>
<p class="model-text">
<i>Exhales deeply.</i> It's finally over. Thanks for reading! We
hope that the article is insightful no matter where you are along
your machine learning journey, and that you came away with a better
undersatnding of the bias variance tradeoff. <br /><br />
To make things compact, we skipped over some relevant topics (such
as regularization), but stay-tuned for more MLU-Explain articles,
where we plan to explain those, and other, topics related to machine
learning.
<br /><br />
To learn more about machine learning, check out our
<a href="https://aws.amazon.com/machine-learning/mlu/"
>self-paced courses</a
>, our
<a href="https://www.youtube.com/channel/UC12LqyqTQYbXatYS9AA7Nuw"
>youtube videos</a
>, and the
<a href="https://d2l.ai/">Dive into Deep Learning</a> textbook. If
you have any comments/ideas/etc. related to MLU-Explain articles,
feel free to
<a href="https://twitter.com/jdwlbr"> reach out directly</a>. The
code is available
<a href="https://github.com/aws-samples/aws-mlu-explain">here</a>.
</p>
</div>
</section>
<hr />
<section id="outro">
<br /><br />
<h1 class="model-header">References + Open Source</h1>
<p class="model-text">
This article is a product of the following resources + the awesome
people who made (& contributed to) them:
</p>
<br />
<ul>
<li>
<a href="https://arxiv.org/abs/1812.11118"
>Reconciling modern machine learning practice and the
bias-variance trade-off</a
><br />
(Mikhail Belkin; Daniel Hsu; Siyuan Ma; Soumik Mandal, 2019)
</li>
<li>
<a href="https://web.stanford.edu/~hastie/Papers/ESLII.pdf">
The Elements of Statistical Learning</a
>
<br />(Trevor Hastie; Robert Tibshirani; Jerome Friedman, 2009).
</li>
<li>
<a href="https://d2l.ai/"> Dive into Deep Learning</a>
(Aston Zhang and Zachary C. Lipton and Mu Li and Alexander J. Smola,
2020).
</li>
<li>
<a href="https://scott.fortmann-roe.com/docs/BiasVariance.html">
Understanding The Bias-Variance Tradeoff</a
>
(Scott Fortmann-Roe, 2012).
</li>
<li>
<a
href="https://www.statsdirect.com/help/nonparametric_methods/loess.htm"
>
LOESS Curve Fitting</a
>
(statsdirect.com).
</li>
<li>
<a href="https://d3js.org/">D3.js</a> (Mike Bostock, Philippe
Rivière)
</li>
<li>
<a href="https://roughnotation.com/">Rough Notation</a> (Preet
Shihn)
</li>
<li>
<a href="https://katex.org/">KaTeX</a> (Emily Eisenberg, Sophie
Alpert)
</li>
<li>
<a href="https://github.com/russellgoldenberg/scrollama"
>Scrollama</a
>
(Russel Goldenberg)
</li>
<li>
<a href="https://github.com/HarryStevens/d3-regression"
>d3-regression</a
>
(Harry Stevens)
</li>
</ul>
<br /><br />
</section>
</main>
<script></script>
<!-- <div class='debug'></div> -->
<script src="js/index.js"></script>
<script src="js/katexCalls.js"></script>
</body>
</html>