Memory leak in Image Optimization #23189
Comments
Could you please try installing the latest canary?
Tried with 10.0.10-canary.4 and I'm still having issues. I had been running in a 512 MB RAM Docker image up to version 10.0.7; I increased it to 1024 MB to test this, but it's still using all of the available memory.
We are also having memory leaks and pod crashes since updating from Next.js 10.0.7 to 10.0.9 yesterday. We updated a few other dependencies as well, so it's not yet certain that the Next.js update caused the issue. Note that we do not use the image optimization feature.
I did a rollback on all deps other than next and still had the same problem. The memory leak might be part of the image optimization, because that is what is running slow: fetching an image from the _next/image endpoint is really slow while everything else seems to run as usual. I will try to only fetch images to try to pinpoint the leak.
I did a rollback of just Next.js to 10.0.7 and it fixed the issue, so it's definitely caused by Next.js. As I said, we are not using the image optimization feature at all.
We've fixed some known issues regarding memory usage in canary already. It would be great if you could test it.
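For reference, the canary release can be installed with `npm install next@canary` (or `yarn add next@canary`).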
I had the same memory problem here.
As I understand it, these memory issues are caused by the switch from `sharp` to the WASM-based `squoosh` for image processing.
I wonder if this is still a valid concern with the soon-to-be-released version.
Downgraded next to 10.0.6 because of a mem leak introduced in 10.0.7 - vercel/next.js#23189
Alrighty, then here again :) v10.0.10-canary.7 doesn't fix the issue in my case; it's still the same.
Canary 7 didn't fix the issue for me either. I am seeing a massive memory leak: 6.5 GB/16 GB on my Windows 10 dev machine and 3.5 GB/8 GB on my Azure Linux App Service. The resources are never recovered until the dev process or App Service is stopped.
Our team had the same situation as @fabb where we upgraded to 10.0.8 and we were not using next/image, yet our containers were running out of memory. After downgrading to 10.0.7 our containers were back to running normally. Interestingly, this only surfaced in production containers. Our local environments didn't face this issue, and our playground (something like dev/staging) environments also didn't have this issue. We still haven't been able to determine why.
I just found that we are also experiencing this same bug while hosting in AWS (Elastic Beanstalk). It is at 98% memory consumption :(
I believe this issue could be related to something I'm seeing when using Plaiceholder on Vercel (despite manually adding back the relevant dependency), where `next build` ends in a segmentation fault during page optimization:
00:01:35.213  info - Finalizing page optimization...
00:01:41.224  sh: line 1: 729 Segmentation fault (core dumped) next build
00:01:41.234  npm ERR! code ELIFECYCLE
00:01:41.234  npm ERR! errno 139
The same issue on my side; I had to switch from a 1 GB to a 6 GB instance on GAE (though memory usage stays between 4 and 5 GB). Tried the canary build, too.
In my initial testing of 10.1.1 using the node:14.16.0-alpine3.10 Docker image, the container's memory usage hit 2.5 GiB within 45 seconds and didn't decrease. In comparison, 10.0.7 spikes to 400 MiB when images are being generated and then falls back to 145 MiB.
Should we rename this issue since it's not only happening when using image optimization? Or should we rather create a new one?
I've tried with 10.1.1: 1630 MB. It has gotten better, but it's still a lot more than it used to be in 10.0.7, unfortunately.
@gustavpursche are you using the next/image component, or is memory usage high even without it?
That's something I haven't tested yet. I'm not super good at debugging at this level 🤦🏽 If it helps I can provide access to the private repo (the site itself is very small). I'm using shared hosting, and (on versions after 10.0.7) whenever someone loads a page with a lot of images (e.g. https://sibylleberg.com/en/books), the process is terminated by the hoster because the memory consumption is too high.
It got even worse for me with the new Next version + webpack 5. My solution was to roll back to version 10.0.6.
Yeah, I was a big supporter of trying out something different other than `sharp`, but please consider adding `sharp` back.
I've found a race condition in the image optimisation that seems to not only relate to this but also severely exacerbate it. I opened a separate ticket: #23436
This PR fixes the problem that the image optimization API uses a large amount of memory which is not correctly freed afterwards. There are multiple causes of this problem:

### 1. Too many WebAssembly instances are created

We used to do all the image processing operations (decode, resize, rotate, encodeJpeg, encodePng, encodeWebp) inside each worker thread, where each operation creates at least one WASM instance, and we create `os.cpus().length - 1` workers by default. That means in the worst case, there will be `N*6` WASM instances created (N is the number of CPU cores minus one).

This PR changes it to a pipeline-like architecture: there will be at most 6 workers, and the same type of operation will always be assigned to the same worker. With this change, 6 WASM instances will be created in the worst case.

### 2. WebAssembly memory can't be deallocated

It's known that [WebAssembly can't simply deallocate its memory as of today](https://stackoverflow.com/a/51544868/2424786). And due to the implementation/design of the WASM modules that we are using, they're not very suitable for long-running use; they're more like one-off uses. Each operation like resize allocates **new memory** to store its data, so memory increases quickly as more images are processed.

The fix is to get rid of `execOnce` for WASM module initializations, so each time a new WASM module will be created and the old module will be GC'd entirely, as there's no reference to it. That's the only and easiest way to free the memory used by a WASM module, AFAIK.

### 3. WebAssembly memory isn't correctly freed after finishing the operation

`wasm-bindgen` generates code with global variables like `cachegetUint8Memory0` and `wasm` that always hold the WASM memory as a reference. We need to manually clean them up after finishing each operation. This PR ensures that these variables are deleted so the memory overhead can go back to 0 when an operation is finished.

### 4. Memory leak inside event listeners

`emscripten` generates code with global error listener registration (without ever cleaning the listeners up): https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L39-L43

And the listener has references to the WASM instance directly or indirectly (`e`, `y`, `r`): https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L183-L192

That means whenever a WASM module is created (emscripten), its memory is kept alive by the global scope, and when we replace the WASM module with a new one, the new one is registered again while the old one is still referenced, which causes a leak. Since we're running them inside worker threads (which retry on failure), this PR simply removes these listeners.

### Test

Here are some statistics showing that these changes have improved memory usage a lot (the app I'm using to test has one page of 20 high-res PNGs):

Before this PR (`[email protected]`): memory went from ~250 MB to 3.2 GB (peak: 3.5 GB) and never decreased again.
<img src="https://user-images.githubusercontent.com/3676859/113058480-c3496100-91e0-11eb-9e5a-b325e484adac.png" width="500">

With fix 1 applied: memory went from ~280 MB to 1.5 GB (peak: 2 GB).
<img src="https://user-images.githubusercontent.com/3676859/113059060-921d6080-91e1-11eb-8ac6-83c70c1f2f75.png" width="500">

With fixes 1+2 applied: memory went from ~280 MB to 1.1 GB (peak: 1.6 GB).
<img src="https://user-images.githubusercontent.com/3676859/113059207-bf6a0e80-91e1-11eb-845a-870944f9e116.png" width="500">

With fixes 1+2+3+4 applied: it's back to normal; memory changed from ~300 MB to ~480 MB, peaking at 1.2 GB. You can clearly see that GC is working correctly here.
<img src="https://user-images.githubusercontent.com/3676859/113059362-ec1e2600-91e1-11eb-8d9a-8fbce8808802.png" width="500">

## Bug

- [x] Related issues #23189, #23436
- [ ] Integration tests added
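To make fixes 2 and 3 concrete, here is a minimal sketch of the pattern, assuming a wasm-bindgen-style module with a `decode` export (the file path and export name are hypothetical; this is not the actual Next.js source):

```js
const fs = require('fs')

// wasm-bindgen-style module-scope references that pin the WASM memory:
let wasm = null
let cachegetUint8Memory0 = null

async function init() {
  // Fix 2: instantiate a *fresh* module per operation instead of caching
  // one forever with execOnce, so old instances become eligible for GC.
  const bytes = fs.readFileSync('./decoder.wasm') // hypothetical path
  const { instance } = await WebAssembly.instantiate(bytes, {})
  wasm = instance.exports
  cachegetUint8Memory0 = new Uint8Array(wasm.memory.buffer)
}

async function decode(input) {
  await init()
  try {
    return wasm.decode(input) // hypothetical export
  } finally {
    // Fix 3: clear the globals after each operation so nothing keeps a
    // reference to the WASM linear memory once the result is returned.
    wasm = null
    cachegetUint8Memory0 = null
  }
}
```

One visible trade-off of re-instantiating per call is extra CPU time, which lines up with the slower first-load optimization reported further down in this thread.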
Please try out the Next.js canary release, which includes #23565.
@timneutkens Thanks for the new release! I can see that it has gotten much better: 10.1.3-canary.0: 940 MB. However, the memory consumption is still much higher than it used to be in my case.
Hi @gustavpursche, thanks for sharing the numbers! I'm curious whether 940 MB is the stable memory usage after image optimization, or the memory usage without image optimization. This is important for us to debug, because most of the memory should be freed within a short while (5–15 seconds) after image optimization, according to my tests in #23565.
@shuding Thanks for the question. As I said, I'm not very experienced in debugging at this level. Stack Overflow suggested a way to measure it, and the charts it gives me suggest you are right: the memory consumption isn't much higher in the long term (it's actually lower); it only peaks higher at the beginning.
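For anyone else who wants to watch the server's memory over time without extra tooling, one minimal option (a suggestion on my part, not necessarily what was used above) is to log `process.memoryUsage()` on an interval:

```js
// Sample the Node.js process memory every 5 seconds; run this inside the
// server process (e.g. from a custom server.js) and watch whether RSS
// keeps climbing after image requests.
const toMB = (bytes) => `${(bytes / 1024 / 1024).toFixed(1)} MB`

setInterval(() => {
  const { rss, heapUsed, external } = process.memoryUsage()
  console.log(`rss=${toMB(rss)} heapUsed=${toMB(heapUsed)} external=${toMB(external)}`)
}, 5000)
```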
@gustavpursche thank you very much, these 2 charts are helpful! They basically match my test results: the memory leak in next/image is gone. The higher initial memory usage could be something else, such as the new worker library that we introduced; I'll keep looking into it.
Great, thanks for your continued effort on this @shuding! Really cool to see the fast response and the amount of care that Vercel puts into things.
Here is a comparison of my memory usage. It's not a direct apples-to-apples comparison, as the blue line has probably cached most of the optimized images, while on the orange line I was hammering the image processor generating e-commerce images after a fresh build. But if anything, that indicates an even better result for the new version. Blue line: 10.0.7. The only downside of the new build is that the initial image optimization is noticeably slower than the previous version and can take seconds to populate images on the page. It seems to be exacerbated by loading more images in quick succession as well.
Hey @shuding, great piece of work there 😄 The canary release does indeed fix the memory overload, and it now seems to stay similar to 10.0.7, with no real noticeable difference between the two for me 😄
@shuding Now that this is closed, where can we follow along on these two issues?
I created issue #23637 to report the slowdown.
- Switched because vercel/next.js#23189 has been fixed there - vercel/next.js#23565
This issue has been automatically locked due to no recent activity. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.
What version of Next.js are you using?
10.0.9
What version of Node.js are you using?
14.15.5
What browser are you using?
Chrome
What operating system are you using?
Alpine (Docker image)
How are you deploying your application?
express server
Describe the Bug
Upgrading from 10.0.7 to 10.0.8 or 10.0.9 results in a server-side memory leak when using the next/image component.
The server consumes over 1 GB of memory after only 10–20 image optimizations.
Expected Behavior
Memory usage should be normal.
To Reproduce
Import Image from 'next/image' and use it as the documentation says (https://nextjs.org/docs/api-reference/next/image) with an allowed external image source.
In this case layout="fixed" was used, as in the sketch below.
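A minimal reproduction along those lines might look like this (the domain and image URL are placeholders; external hosts have to be whitelisted in `next.config.js`):

```js
// next.config.js: allow the external image host (placeholder domain)
module.exports = {
  images: {
    domains: ['images.example.com'],
  },
}
```

```jsx
// pages/index.js
import Image from 'next/image'

export default function Home() {
  return (
    <Image
      src="https://images.example.com/photo.jpg" // placeholder URL
      alt="Example image"
      width={800}
      height={600}
      layout="fixed"
    />
  )
}
```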