<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<title>Double Descent</title>
<meta name="description" content="Double Descent" />
<meta
name="viewport"
content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no"
/>
<meta
name="description"
content="MLU-Explain: Visual Introduction to Double Descent."
/>
<meta
property="og:image"
content="https://mlu-explain.github.io/assets/ogimages/ogimage-double-descent.png"
/>
<meta property="og:title" content="Double Descent" />
<meta
property="og:description"
content="An introduction to the Double Descent phenomena in modern machine learning."
/>
<meta property="og:image:width" content="1000" />
<meta property="og:image:height" content="630" />
<link rel="icon" href="./assets/mlu_robot.png" />
<link rel="stylesheet" href="css/styles.scss" />
<!-- Global site tag (gtag.js) - Google Analytics -->
<script
async
src="https://www.googletagmanager.com/gtag/js?id=G-1FYW57GW3G"
></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag("js", new Date());
gtag("config", "G-1FYW57GW3G");
</script>
</head>
<body>
<main>
<div id="intro-icon">
<a href="https://mlu-explain.github.io"
><svg
width="50"
height="50"
viewBox="0 0 234 216"
fill="none"
xmlns="https://www.w3.org/2000/svg"
>
<g id="mlu_robot 1" clip-path="url(#clip0)">
<g>
<path
id="Vector"
d="M90.6641 83.1836C96.8828 83.1836 101.941 78.1289 101.941 71.8906V71.8242C101.941 65.5898 96.8945 60.5312 90.6641 60.5312C84.4453 60.5312 79.3828 65.5898 79.3828 71.8242V71.8906C79.3828 78.1289 84.4336 83.1836 90.6641 83.1836Z"
fill="white"
></path>
<path
id="Vector_2"
d="M143.305 83.1836C149.523 83.1836 154.586 78.1289 154.586 71.8906V71.8242C154.586 65.5898 149.535 60.5312 143.305 60.5312C137.09 60.5312 132.027 65.5898 132.027 71.8242V71.8906C132.027 78.1289 137.078 83.1836 143.305 83.1836Z"
fill="white"
></path>
<path
id="Vector_3"
d="M163.586 159.402H173.609V122.641H163.586V159.402Z"
fill="white"
></path>
<path
id="Vector_4"
d="M60.3594 159.402H70.3867V122.641H60.3594V159.402Z"
fill="white"
></path>
<g id="Group">
<path
id="Vector_5"
d="M182.16 30.0781H51.8047V10.0234H182.16V30.0781ZM182.16 103.609H51.8047V40.1055H182.16V103.609ZM144.559 168.789H89.4062V113.641H144.559V168.789ZM0 0V10.0234H15.8789V46.7891H25.9023V10.0234H41.7812V113.641H79.3867V178.816H96.9297V215.578H106.957V178.816H127.016V215.578H137.039V178.816H154.586V113.641H192.188V10.0234H233.969V0"
fill="white"
></path>
</g>
</g>
</g>
<defs>
<clipPath id="clip0">
<rect width="233.97" height="215.58" fill="white"></rect>
</clipPath>
</defs>
</svg>
<h2 class="logo">MLU-expl<span id="ai">AI</span>n</h2>
</a>
</div>
<section id="intro">
<h1 id="intro-hed">Double Descent</h1>
<h1 class="intro-sub">Part 1: A Visual Introduction</h1>
<h3 id="intro__date">
<a href="https://twitter.com/jdwlbr">Jared Wilber</a>
& Brent Werness, December 2021
</h3>
<p id="top-note">
Note - this is part 1 of a two-article series on
<span class="bold">Double Descent</span>. Part 2 is available
<a href="https://mlu-explain.github.io/double-descent2/">here</a>.
</p>
<p class="model-text">
In our<a href="https://mlu-explain.github.io/bias-variance/">
previous discussion of the bias-variance tradeoff</a
>, we ended with a note about one of modern machine learning’s more
surprising phenomena: <span class="bold">double descent</span>. Double
descent is interesting because it appears to stand counter to our
classical understanding of the bias-variance tradeoff. Namely, while
we expect the best model performance to be obtained via some balance
between bias (underfitting) and variance (overfitting), we instead
observe strong test performance from very overfit, complex models.
As a result, many practitioners and researchers are left questioning
the relevance of the traditional bias-variance tradeoff in modern
machine learning. <br /><br />
Here, we'll first introduce the phenomenon with a general example and
then offer a soft explanation for why it occurs. (In a follow-up
article, we'll describe the phenomenon in more low-level, mathematical
detail.)
<br /><br />
At the end of it all, we conclude that the double descent phenomenon
actually reinforces the importance of the bias-variance tradeoff.
</p>
</section>
<!-- start center scroll -->
<section id="scrolly">
<figure>
<div id="doubledescent-container">
</div>
</figure>
<article>
<div class="step" data-step="1">
<p>
For a typical case, plotting the training and testing error against
some measure of model complexity (say, training time) may look like
this figure, with both errors decreasing and the test error hovering
slightly above the train error.
</p>
</div>
<div class="step" data-step="2">
<p>
Under the classical bias-variance tradeoff, as we move further
right along the x-axis (i.e., increasing the complexity of our
model), we overfit and expect the test error to climb higher and
higher even as the train error continues decreasing.
</p>
</div>
<div class="step" data-step="3">
<p>
However, what we observe is quite different. Indeed, the test
error does shoot up, but it then descends back down to a new
minimum. In other words, even though our model is extremely
overfit, it achieves its best performance during this second
descent (hence the name,
<span class="bold">double descent</span>)!
</p>
</div>
<div class="step" data-step="4">
<p>
We call the under-parameterized region to the left of the second
descent the <span class="bold">classical regime</span>, and the
point of peak error the
<span class="bold">interpolation threshold</span>. In the
classical regime, the bias-variance tradeoff behaves as expected,
with the test error drawing out the familiar U-shape.
</p>
</div>
<div class="step" data-step="5">
<p>
To the right of the interpolation threshold, the behavior changes.
We call this over-parameterized region the
<span class="bold">interpolation regime</span>. In this regime,
the model perfectly memorizes, or interpolates, the training data.
That is, every model passes exactly through the given training
data, thus the only thing that changes is how the model connects
the dots between these data points.
</p>
</div>
</article>
</section>
<!-- end center scroll -->
<section id="section2">
<div class="model-text">
<h1 class="model-header">What's Going On?</h1>
<p class="model-text">
To better understand this phenomenon, let's explore it together!
Quick note - in this article, we'll keep things fairly high-level.
However, if you're interested in more mathematical, lower-level
details,
<a href="https://mlu-explain.github.io/double-descent2/"
><span class="bold"
>check out our sibling article on double descent</span
></a
>. <br /><br />
Let's begin with a simple problem. We will train on data sampled
from a cubic curve that has been occasionally corrupted with noise.
We will generate a tiny training set and a larger test set so that
we can rapidly explore what is going on, while still trusting the
values for the test error, which is obtained from predictions on the
test set. To maximize the stability of training, we will not employ
a full neural network, but rather pick random non-linear features
and then train a linear model on top.
<br /><br />
Below we show our data, as well as each model's associated mean
absolute error (MAE).
</p>
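<p class="model-text">
To make that setup concrete, the sketch below shows one way such a
model could be built (a hypothetical illustration in Python with
NumPy, not the code used to produce the figures here): we draw k
fixed random ReLU features, fit a linear model on top of them, and
score it with MAE. The data-generating helper and all names in the
snippet are assumptions made for illustration.
</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise_prob=0.1):
    # Sample from a cubic curve, occasionally corrupted with noise.
    x = rng.uniform(-1, 1, n)
    noise = rng.normal(0, 0.5, n) * (rng.random(n) &lt; noise_prob)
    return x, x**3 + noise

x_train, y_train = make_data(15)   # tiny training set
x_test, y_test = make_data(200)    # larger test set

def random_relu_features(x, k, seed=1):
    # k fixed random non-linear features: relu(a*x + b).
    r = np.random.default_rng(seed)
    a, b = r.normal(size=k), r.normal(size=k)
    return np.maximum(0.0, np.outer(x, a) + b)

k = 8  # model complexity = number of random features
phi_train = random_relu_features(x_train, k)
phi_test = random_relu_features(x_test, k)

# Linear model trained on top of the fixed random features.
w, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)

mae = np.mean(np.abs(phi_test @ w - y_test))
print(f"test MAE with k={k}: {mae:.3f}")
</code></pre>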
</div>
</section>
<!-- Side Scroller -->
<div id="ttt">
<section id="scrolly-side">
<figure>
<div id="scatter-container"></div>
<br />
<div id="error-container"></div>
</figure>
<article>
<div class="step-side" data-step="1">
<p>
To begin, we plot a simple model. Recall from
<a href="https://mlu-explain.github.io/bias-variance/"
>our previous discussion on the bias-variance tradeoff</a
>
that basic models cannot capture complex patterns in the data
and are thus underfitted, providing poor performance for the
task at hand.
</p>
</div>
<div class="step-side" data-step="2">
<p>
Next, we plot a model that's neither too simple nor too complex.
It is in this complexity region where, traditionally, we expect to
find the best-performing model. This is reflected in the low
error (≤ 0.25) in the bottom chart.
</p>
</div>
<div class="step-side" data-step="3">
<p>
Now, let's plot a complex model, one where the number of random
features equals the number of training data points. This situation,
in which the model passes through each and every point in the
training set, is our interpolation threshold. At this stage
we're overfitting, which leads to the high test error shown in
the plot. Typically, we would stop increasing the complexity
here and revert to a simpler model that achieves a good
tradeoff between bias and variance.
</p>
</div>
<div class="step-side" data-step="4">
<p>
The existence of the double descent phenomenon means that this
picture is incomplete. We stopped making more and more intricate
models when we reached a certain complexity level. But what
happens if we go further, beyond the interpolation threshold?
Let's look at a single example with 256 random features, and at
what happens to the test error curve as we extend into that
regime.
</p>
</div>
<div class="step-side" data-step="5">
<p>
The test MAE is even lower for these large models! The
traditional U-shape sometimes tells only part of the story.
Past the traditional U-shaped region lies the interpolation
regime. The idea is that every model past that spike in error is
complex enough to pass through every single training data point,
so all of these models interpolate the training set. More
complicated models can achieve smoother interpolations. If the
conditions are right (as they are in this experiment) these
enormous interpolating models can perform far better than
traditional well-fit models.
</p>
</div>
<div class="step-side" data-step="6">
<p>
Try for yourself! Toggle the slider to modify the number of
non-linear features used to build the models.
</p>
<div id="slider-container"></div>
</div>
</article>
<br /><br />
<br /><br />
<br /><br />
</section>
</div>
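<p class="model-text">
If you'd rather trace the curve in code than with the slider, a
sweep like the one below (again a hypothetical sketch, reusing the
helpers from the earlier snippet) records train and test MAE across
a range of feature counts. With a small training set like ours, the
test error typically spikes near the interpolation threshold and
then descends again for the largest models.
</p>
<pre><code># Sweep model complexity to trace out the double descent curve.
for k in [2, 4, 8, 15, 36, 64, 128, 256]:
    phi_train = random_relu_features(x_train, k)
    phi_test = random_relu_features(x_test, k)
    w, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    train_mae = np.mean(np.abs(phi_train @ w - y_train))
    test_mae = np.mean(np.abs(phi_test @ w - y_test))
    print(f"k={k:4d}  train MAE={train_mae:.3f}  test MAE={test_mae:.3f}")
</code></pre>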
<section id="gap">
<div class="model-text">
<h1 class="model-header">Minding The Gap</h1>
<div id="gap-container">
<img
src="line.gif"
alt="Image of interpolation region from above."
/>
<p class="body-text">
It's important to pay attention to the gap
<i>between the points</i> once we've entered our interpolation
region (K > 36). In the image to the left, we show, for a small
portion of the data above, how the interpolation varies across 40
to 500 features.
</p>
<br />
<p class="body-text">
Once we're at the interpolation threshold, every model from that
complexity level onwards passes through each training data point.
<span class="bolder"
>The only thing that changes is how the model connects the
in-between points</span
>. As the models become more and more complex, these connections
can become smoother, and the resulting prediction may fit your
test data better. This is why models in the interpolation region
can perform so well.
</p>
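<p class="body-text">
One way to see why the larger interpolating models connect the dots
more gently: when the system is under-determined, np.linalg.lstsq
returns the minimum-norm solution among all weight vectors that fit
the training data exactly. The hypothetical check below (reusing the
earlier helpers) compares a model just past the threshold with a
much larger one by measuring the roughness of their predictions on a
dense grid.
</p>
<pre><code>grid = np.linspace(-1, 1, 400)
for k in [40, 500]:
    w, *_ = np.linalg.lstsq(random_relu_features(x_train, k),
                            y_train, rcond=None)
    pred = random_relu_features(grid, k) @ w
    # Mean absolute second difference: a crude smoothness measure.
    roughness = np.mean(np.abs(np.diff(pred, 2)))
    print(f"k={k}: roughness={roughness:.5f}")
</code></pre>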
<h1 class="model-header">Final Takeaways</h1>
<br />
<p class="body-text">
The key takeaway here is that double descent is a real phenomenon,
although its existence does not nullify the bias-variance
tradeoff. It is believed that double descent helps explain why
deep neural networks perform so well at many tasks. By
building models with many more parameters than data points, deep
neural networks are often operating in this interpolating regime.
Traditional intuition from the bias-variance tradeoff would
discourage such an approach, suggesting that simpler models with
fewer parameters should perform better. However, this has been
contradicted by experiments. Double descent provides an
indication that even though models that pass through every
training data point are indeed overfitted, the structure of the
resulting network forces the interpolation to be smooth and
results in superior generalization to unseen data.
</p>
<br /><br />
</div>
<p></p>
</div>
</section>
<section id="conclusion">
<br /><br />
<p class="model-text">
Thanks for reading! If you've made it this far, consider viewing
<a href="https://mlu-explain.github.io/double-descent2/"
>our follow-up article</a
>
explaining double descent in more detail. To learn more about machine
learning, check out our
<a href="https://aws.amazon.com/machine-learning/mlu/"
>self-paced courses</a
>, our
<a href="https://www.youtube.com/channel/UC12LqyqTQYbXatYS9AA7Nuw"
>YouTube videos</a
>, and the
<a href="https://d2l.ai/">Dive into Deep Learning</a> textbook. If you
have any comments or ideas related to MLU-Explain articles, feel free
to <a href="https://twitter.com/jdwlbr"> reach out directly</a>. The
code for this article is available
<a
href="https://github.com/aws-samples/aws-mlu-explain"
>here</a
>.
</p>
<h1 class="model-header">References & Open Source</h1>
<p class="model-text">
This article is a product of the following resources + the awesome
people who made (& contributed to) them:
</p>
<br />
<ul>
<li>
<a href="https://arxiv.org/abs/1812.11118"
>Reconciling modern machine learning practice and the
bias-variance trade-off</a
><br />
(Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, 2019)
</li>
<li>
<a href="https://mltheory.org/deep.pdf">
Deep Double Descent: Where Bigger Models And More Data Hurt</a
>
<br />(Preetum Nakkiran et al., 2019).
</li>
<li>
<a href="https://arxiv.org/abs/1912.07242">
More Data Can Hurt for Linear Regression: Sample-wise Double
Descent</a
>
(Preetum Nakkiran, 2019).
</li>
<li>
<a href="https://d2l.ai/"> Dive into Deep Learning</a>
(Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola,
2020).
</li>
<li>
<a href="https://d3js.org/">D3.js</a> (Mike Bostock, Philippe
Rivière)
</li>
<li>
<a href="https://katex.org/">KaTeX</a> (Emily Eisenberg, Sophie
Alpert)
</li>
<li>
<a href="https://github.com/russellgoldenberg/scrollama"
>Scrollama</a
>
(Russell Goldenberg)
</li>
</ul>
<br /><br />
</section>
</main>
<script src="js/scrollCenter.js"></script>
<script src="js/scrollSide.js"></script>
</body>
</html>