Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in Image Optimization #23189

Closed
Pythe1337N opened this issue Mar 18, 2021 · 37 comments
Closed

Memory leak in Image Optimization #23189

Pythe1337N opened this issue Mar 18, 2021 · 37 comments
Assignees
Milestone

Comments

@Pythe1337N
Copy link

What version of Next.js are you using?

10.0.9

What version of Node.js are you using?

14.15.5

What browser are you using?

Chrome

What operating system are you using?

alpine (docker image)

How are you deploying your application?

express server

Describe the Bug

Upgrading from 10.0.7 to 10.0.8 or 10.0.9 results in a server side memory leak when using the component.
Consumes over 1GB of memory after only 10-20 image optimizations.

Expected Behavior

Memory usage should be normal.

To Reproduce

Import Image from 'next/image' and use it as the the documentation says https://nextjs.org/docs/api-reference/next/image with an allowed external image source.
In this case layout="fixed" was used.

@Pythe1337N Pythe1337N added the bug Issue was opened via the bug report template. label Mar 18, 2021
@shuding
Copy link
Member

shuding commented Mar 18, 2021

Could you please try installing the latest canary (next@canary)? It should be fixed there, see also #22925.

@Pythe1337N
Copy link
Author

Tried with 10.0.10-canary.4 and I'm still having issues. Been running in a 512 MB ram docker image up to version 10.0.7. Increased it to 1024 MB to test this, but it's still using all of the memory available.

@fabb
Copy link
Contributor

fabb commented Mar 19, 2021

we are also having memory leaks and pod crashes since updating from next.js 10.0.7 to 10.0.9 yesterday. we updated a few other dependencies as well, so it's not yet certain that the next.js updated caused the issue. note that we do not use the image optimization feature.

@Pythe1337N
Copy link
Author

I did a rollback on all deps other than for next and still had the same problem. The memory leak might be part of the image optimization, but that is what is running slow. To fetch an image from the _next/image endpoint is really slow while everything else seems to run as usual. Will try to only fetch images to try to pinpoint the leak.

@fabb
Copy link
Contributor

fabb commented Mar 19, 2021

i did a rollback of just next.js to 10.0.7 and it fixed the issue. so it's definitely caused by next.js. as i said, we are not using the image optimization feature at all.

@shuding
Copy link
Member

shuding commented Mar 19, 2021

We've fixed some know issues regarding memory usage in canary already. It would be great if you can test next@canary vs latest stable release and share more information, so we can identify the cause quickly!

@ahmetuysal
Copy link

I think the problem persists in 10.0.10-canary.5. Memory utilization is steadily increasing in our website deployed to AWS Beanstalk. It was stable until I updated next to 10.0.10-canary.5 from 10.0.7.
image

@tiagocarmo
Copy link

I had the same memory problem here.
We did some tests on an application that we developed, when version 10.0.7 occurs, the memory increases consumption.
With version 10.0.6 it is ok to use.

@stefanprobst
Copy link
Contributor

as i understand it, these memory issues are caused by the switch from sharp to squoosh in #22253. the reason for the change given in that pr is that:

the native sharp dependency (which doesn't work on some Linux variants, nor M1 Mac)

i wonder if this is still a valid concern with the soon-to-be-released sharp v0.28 (see here). i'm also curious about a response to the sharp maintainer's comment here.

timcole added a commit to spaceflight-live/comet that referenced this issue Mar 21, 2021
Downgraded next to 10.0.6 because of a mem leak introduced in 10.0.7
 - vercel/next.js#23189
@gu-stav
Copy link

gu-stav commented Mar 22, 2021

Alrighty, then here again :)

v10.0.10-canary.7 doesn't fix the issue in my case. It's still:

10.0.10-canary.7: 1960MB
10.0.7: 230MB

@mjenzen
Copy link

mjenzen commented Mar 22, 2021

canary 7 didnt fix the issue for me either. I am showing a massive memory leak on my Windows 10 dev machine at 6,5GB/16GB and 3.5GB/8GB on my Azure Linux App Service. The resources are never recovered until the dev process or app service are stopped

@jdfm
Copy link

jdfm commented Mar 24, 2021

i did a rollback of just next.js to 10.0.7 and it fixed the issue. so it's definitely caused by next.js. as i said, we are not using the image optimization feature at all.

Our team had the same situation as @fabb where we upgraded to 10.0.8 and we were not using next/image, yet our containers were running out of memory. After downgrading to 10.0.7 our containers were back to running normally.

Interestingly, this only surfaced in production containers. Our local environments didn't face this issue, and our playground (something like dev/staging) environments also didn't have this issue. We still haven't been able to determine why.

@avisra
Copy link

avisra commented Mar 25, 2021

I just found that we are also experiencing this same bug while hosting in AWS (elastic beanstalk). It is at 98% memory consumption :(

@joe-bell
Copy link
Contributor

joe-bell commented Mar 26, 2021

I believe this issue could be related to something I'm seeing when using Plaiceholder on Vercel with [email protected]:
#23531

(despite manually adding back the sharp dependency)

00:01:35.213 | info  - Finalizing page optimization...
-- | --
00:01:41.224 | sh: line 1:   729 Segmentation fault      (core dumped) next build
00:01:41.234 | npm ERR! code ELIFECYCLE
00:01:41.234 | npm ERR! errno 139

@andrijavulicevic
Copy link

The same issue on my side had to switch from 1GB to 6GB instance (though memory usage keeps at between 4 and 5GB) on GAE. Tried canary build 10.0.10-canary.11 and it seems way better, average memory consumption is around 1.5GB.

@thomasgrivet
Copy link

Hello!
Same issue on my end, had to switch from a 512mb instance to a 2gb instance and even then it still crash from time to time.
I tried using 10.0.9 and the latest canary with no luck. I just got back to using 10.0.7, it seems to be working just fine for now.

@henryStelle
Copy link

Hello,
I see that #23188 has been merged in the 10.1.0 release - has this been confirmed to fix this issue? Also, Sharp 0.28 was just released, does it merit a second look considering #23189 (comment)?

@henryStelle
Copy link

In my initial testing of 10.1.1 using the node:14.16.0-alpine3.10 docker image, the container's memory usage hit 2.5GiB within 45 seconds and didn't decrease. In comparison, 10.0.7 spikes to 400MiB when images are being generated and then falls back to 145 MiB. As a side note, adding RUN apk add --no-cache libc6-compat to the Dockerfile when testing 10.1.1 resulted in the container quickly consuming over 3.5GiB, at which point the instance became unresponsive and was restarted. Is anyone else seeing similar performance issues?

@fabb
Copy link
Contributor

fabb commented Mar 30, 2021

Should we rename this issue since it‘s not only happening when using image optimization? Or should we rather create a new one?

@gu-stav
Copy link

gu-stav commented Mar 30, 2021

I've tried with 10.1.1 yesterday and the leak is still happening:

10.1.1: 1630MB
10.0.7: 230MB

It has gotten better, but it's still a lot more than it used to be in 10.0.7. Unfortunately.

@tbanys
Copy link

tbanys commented Mar 30, 2021

@gustavpursche are you using next/image component or memory usage high even without it?

@gu-stav
Copy link

gu-stav commented Mar 30, 2021

That's something I haven't tested yet. I'm not super good at debugging at this level 🤦🏽 If it helps I can provide access to the private repo (the site itself is very small). I'm using a shared hosting and whenever (after 10.0.7) someone loads a page with a lot of images (e.g. https://sibylleberg.com/en/books) the process is terminated by the hoster, because the memory consumption is too high.

@JVictorV
Copy link

Got even worse for me with thew new next version + webpack 5. My solution was to rollback to the 10.0.6 version

@karlhorky
Copy link
Contributor

Yeah, I was a big supporter of trying out something different other than sharp, but the squoosh solution just doesn't cut it for anything other than usage on an M1.

Please consider adding back sharp for all non-M1 environments @shuding . I say this as an M1 user myself.

@kmdev1
Copy link

kmdev1 commented Mar 30, 2021

I've found a race condition in the image optimisation that seems to not only relate to this but also severely exaggerates it:

I opened this separate ticket: #23436

@shuding shuding self-assigned this Mar 30, 2021
@shuding shuding added kind: bug and removed bug Issue was opened via the bug report template. labels Mar 30, 2021
kodiakhq bot pushed a commit that referenced this issue Mar 31, 2021
This RP fixes the problem that the image optimization API uses a large amount of memory, and is not correctly freed afterwards. There're multiple causes of this problem:

### 1. Too many WebAssembly instances are created

We used to do all the image processing operations (decode, resize, rotate, encodeJpeg, encodePng, encodeWebp) inside each worker thread, where each operation creates at least one WASM instance, and we create `os.cpus().length - 1` workers by default. That means in the worst case, there will be `N*6` WASM instances created (N is the number of CPU cores minus one).

This PR changes it to a pipeline-like architecture: there will be at most 6 workers, and the same type of operations will always be assigned to the same worker. With this change, 6 WASM instances will be created in the worst case.

### 2. WebAssembly memory can't be deallocated

It's known that [WebAssembly can't simply deallocate its memory as of today](https://stackoverflow.com/a/51544868/2424786). And due to the implementation/design of the WASM modules that we are using, they're not very suitable for long-running cases and it's more like a one-off use. For each operation like resize, it will allocate **new memory** to store that data. So the memory will increase quickly as more images are processed.

The fix is to get rid of `execOnce` for WASM module initializations, so each time a new WASM module will be created and the old module will be GC'd entirely as there's no reference to it. That's the only and easiest way to free the memory use of a WASM module AFAIK.

### 3. WebAssembly memory isn't correctly freed after finishing the operation

`wasm-bindgen` generates code with global variables like `cachegetUint8Memory0` and `wasm` that always hold the WASM memory as a reference. We need to manually clean them up after finishing each operation. 

This PR ensures that these variables will be deleted so the memory overhead can go back to 0 when an operation is finished.

### 4. Memory leak inside event listeners

`emscripten` generates code with global error listener registration (without cleaning them up): https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L39-L43

And the listener has references to the WASM instance directly or indirectly: https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L183-L192 (`e`, `y`, `r`).

That means whenever a WASM module is created (emscripten), its memory will be kept by the global scope. And when we replace the WASM module with a new one, the newer will be added again and the old will still be referenced, which causes a leak.

Since we're running them inside worker threads (which will retry on fail), this PR simply removes these listeners.

### Test

Here're some statistics showing that these changes have improved the memory usage a lot (the app I'm using to test has one page of 20 high-res PNGs):

Before this PR (`[email protected]`):

<img src="https://user-images.githubusercontent.com/3676859/113058480-c3496100-91e0-11eb-9e5a-b325e484adac.png" width="500">

Memory went from ~250MB to 3.2GB (peak: 3.5GB) and never decreased again.

With fix 1 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059060-921d6080-91e1-11eb-8ac6-83c70c1f2f75.png" width="500">

Memory went from ~280MB to 1.5GB (peak: 2GB).

With fix 1+2 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059207-bf6a0e80-91e1-11eb-845a-870944f9e116.png" width="500">

Memory went from ~280MB to 1.1GB (peak: 1.6GB).

With fix 1+2+3+4 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059362-ec1e2600-91e1-11eb-8d9a-8fbce8808802.png" width="500">

It's back to normal; memory changed from ~300MB to ~480MB, peaked at 1.2GB. You can clearly see that GC is working correctly here.

---

## Bug

- [x] Related issues #23189, #23436
- [ ] Integration tests added

## Feature

- [ ] Implements an existing feature request or RFC. Make sure the feature request has been accepted for implementation before opening a PR.
- [ ] Related issues linked using `fixes #number`
- [ ] Integration tests added
- [ ] Documentation added
- [ ] Telemetry added. In case of a feature if it's used or not.

## Documentation / Examples

- [ ] Make sure the linting passes
@timneutkens
Copy link
Member

Please try out Next.js canary which includes #23565: yarn add next@canary

@gu-stav
Copy link

gu-stav commented Mar 31, 2021

@timneutkens Thanks for the new release! I can see that it has gotten much better:

10.1.3-canary.0: 940MB
10.1.1: 1630MB
10.0.7: 230MB

However, the memory consumption is still much higher than it used to be in my case.

@shuding
Copy link
Member

shuding commented Mar 31, 2021

Hi @gustavpursche thanks for sharing the numbers! I'm curious that if 940MB is the stable memory usage after image optimization, or is it the memory usage without image optimization?

This is important for us to debug because most of the memory should be freed in a while (5s~15s) after image optimization, according to my test in #23565.

@gu-stav
Copy link

gu-stav commented Mar 31, 2021

@shuding Thanks for the question. As I said I'm not very experienced in debugging at this level. Stackoverflow said that mprof run node_modules/.bin/next could be a way (build was done in production mode).

It gives me the following:

10.1.3-canary.0
Figure_1

10.0.7
Figure_1

So I guess you are right and the memory consumption isn't much higher in the long term (it's actually lower) - it only peaks higher at the beginning.

@shuding
Copy link
Member

shuding commented Mar 31, 2021

@gustavpursche thank you very much, these 2 charts are helpful! These basically match my test results that the memory leak in next/image is gone.

The initial memory usage being higher could be something else such as the new worker library that we introduced, I'll keep looking into it.

@karlhorky
Copy link
Contributor

Great, thanks for your continued effort on this @shuding ! Really cool to see the fast response and the amount of care that Vercel puts into things.

@mjenzen
Copy link

mjenzen commented Apr 1, 2021

@shuding

Here is a comparison of my memory usage. Its not a direct apples-to-apples comparison as the blue line has probably cached most of the optimized images, but on the orange line, I was hammering on the image processor generating e-commerce images after a fresh build. But if anything that indicates an even better result for the new version.

image

blue line - 10.0.7
salmon-y orange line - 10.1.3-canary.0

The only downside with the new build, is that the initial image optimization is noticeably slower than the previous version and can take seconds to populate images on the page. It seems to be exacerbated by loading more images in quick succession as well.

@kmdev1
Copy link

kmdev1 commented Apr 1, 2021

Hey @shuding,

Great piece of work there 😄 - The canary release does indeed fix the overload of memory and it now seems to stay similar to that of 10.0.7 with no real noticeable difference between the two for myself 😄

@timneutkens timneutkens added this to the Iteration 18 milestone Apr 1, 2021
@henryStelle
Copy link

Great work @shuding. I am also seeing the memory being deallocated after the images are optimized with the same amount of memory being used during the optimization process, compared to 10.0.7. However, as @mjenzen reported, the rate of the initial image optimization has decreased significantly.

@karlhorky
Copy link
Contributor

The initial memory usage being higher could be something else such as the new worker library that we introduced, I'll keep looking into it.

However, as @mjenzen reported, the rate of the initial image optimization has decreased significantly.

@shuding Now that this is closed, where can we follow along about these two issues?

  1. Higher peak memory usage at start
  2. Up to 25x slower

@mjenzen
Copy link

mjenzen commented Apr 2, 2021

I created issue #23637 to report the slowdown

dazedbear added a commit to dazedbear/dazedbear.github.io that referenced this issue Apr 5, 2021
timcole added a commit to spaceflight-live/comet that referenced this issue Apr 18, 2021
- Switched because vercel/next.js#23189 has
  been fixed there
  - vercel/next.js#23565
SokratisVidros pushed a commit to SokratisVidros/next.js that referenced this issue Apr 20, 2021
This RP fixes the problem that the image optimization API uses a large amount of memory, and is not correctly freed afterwards. There're multiple causes of this problem:

### 1. Too many WebAssembly instances are created

We used to do all the image processing operations (decode, resize, rotate, encodeJpeg, encodePng, encodeWebp) inside each worker thread, where each operation creates at least one WASM instance, and we create `os.cpus().length - 1` workers by default. That means in the worst case, there will be `N*6` WASM instances created (N is the number of CPU cores minus one).

This PR changes it to a pipeline-like architecture: there will be at most 6 workers, and the same type of operations will always be assigned to the same worker. With this change, 6 WASM instances will be created in the worst case.

### 2. WebAssembly memory can't be deallocated

It's known that [WebAssembly can't simply deallocate its memory as of today](https://stackoverflow.com/a/51544868/2424786). And due to the implementation/design of the WASM modules that we are using, they're not very suitable for long-running cases and it's more like a one-off use. For each operation like resize, it will allocate **new memory** to store that data. So the memory will increase quickly as more images are processed.

The fix is to get rid of `execOnce` for WASM module initializations, so each time a new WASM module will be created and the old module will be GC'd entirely as there's no reference to it. That's the only and easiest way to free the memory use of a WASM module AFAIK.

### 3. WebAssembly memory isn't correctly freed after finishing the operation

`wasm-bindgen` generates code with global variables like `cachegetUint8Memory0` and `wasm` that always hold the WASM memory as a reference. We need to manually clean them up after finishing each operation. 

This PR ensures that these variables will be deleted so the memory overhead can go back to 0 when an operation is finished.

### 4. Memory leak inside event listeners

`emscripten` generates code with global error listener registration (without cleaning them up): https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L39-L43

And the listener has references to the WASM instance directly or indirectly: https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L183-L192 (`e`, `y`, `r`).

That means whenever a WASM module is created (emscripten), its memory will be kept by the global scope. And when we replace the WASM module with a new one, the newer will be added again and the old will still be referenced, which causes a leak.

Since we're running them inside worker threads (which will retry on fail), this PR simply removes these listeners.

### Test

Here're some statistics showing that these changes have improved the memory usage a lot (the app I'm using to test has one page of 20 high-res PNGs):

Before this PR (`[email protected]`):

<img src="https://user-images.githubusercontent.com/3676859/113058480-c3496100-91e0-11eb-9e5a-b325e484adac.png" width="500">

Memory went from ~250MB to 3.2GB (peak: 3.5GB) and never decreased again.

With fix 1 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059060-921d6080-91e1-11eb-8ac6-83c70c1f2f75.png" width="500">

Memory went from ~280MB to 1.5GB (peak: 2GB).

With fix 1+2 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059207-bf6a0e80-91e1-11eb-845a-870944f9e116.png" width="500">

Memory went from ~280MB to 1.1GB (peak: 1.6GB).

With fix 1+2+3+4 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059362-ec1e2600-91e1-11eb-8d9a-8fbce8808802.png" width="500">

It's back to normal; memory changed from ~300MB to ~480MB, peaked at 1.2GB. You can clearly see that GC is working correctly here.

---

## Bug

- [x] Related issues vercel#23189, vercel#23436
- [ ] Integration tests added

## Feature

- [ ] Implements an existing feature request or RFC. Make sure the feature request has been accepted for implementation before opening a PR.
- [ ] Related issues linked using `fixes #number`
- [ ] Integration tests added
- [ ] Documentation added
- [ ] Telemetry added. In case of a feature if it's used or not.

## Documentation / Examples

- [ ] Make sure the linting passes
flybayer pushed a commit to blitz-js/next.js that referenced this issue Apr 29, 2021
This RP fixes the problem that the image optimization API uses a large amount of memory, and is not correctly freed afterwards. There're multiple causes of this problem:

### 1. Too many WebAssembly instances are created

We used to do all the image processing operations (decode, resize, rotate, encodeJpeg, encodePng, encodeWebp) inside each worker thread, where each operation creates at least one WASM instance, and we create `os.cpus().length - 1` workers by default. That means in the worst case, there will be `N*6` WASM instances created (N is the number of CPU cores minus one).

This PR changes it to a pipeline-like architecture: there will be at most 6 workers, and the same type of operations will always be assigned to the same worker. With this change, 6 WASM instances will be created in the worst case.

### 2. WebAssembly memory can't be deallocated

It's known that [WebAssembly can't simply deallocate its memory as of today](https://stackoverflow.com/a/51544868/2424786). And due to the implementation/design of the WASM modules that we are using, they're not very suitable for long-running cases and it's more like a one-off use. For each operation like resize, it will allocate **new memory** to store that data. So the memory will increase quickly as more images are processed.

The fix is to get rid of `execOnce` for WASM module initializations, so each time a new WASM module will be created and the old module will be GC'd entirely as there's no reference to it. That's the only and easiest way to free the memory use of a WASM module AFAIK.

### 3. WebAssembly memory isn't correctly freed after finishing the operation

`wasm-bindgen` generates code with global variables like `cachegetUint8Memory0` and `wasm` that always hold the WASM memory as a reference. We need to manually clean them up after finishing each operation. 

This PR ensures that these variables will be deleted so the memory overhead can go back to 0 when an operation is finished.

### 4. Memory leak inside event listeners

`emscripten` generates code with global error listener registration (without cleaning them up): https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L39-L43

And the listener has references to the WASM instance directly or indirectly: https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L183-L192 (`e`, `y`, `r`).

That means whenever a WASM module is created (emscripten), its memory will be kept by the global scope. And when we replace the WASM module with a new one, the newer will be added again and the old will still be referenced, which causes a leak.

Since we're running them inside worker threads (which will retry on fail), this PR simply removes these listeners.

### Test

Here're some statistics showing that these changes have improved the memory usage a lot (the app I'm using to test has one page of 20 high-res PNGs):

Before this PR (`[email protected]`):

<img src="https://user-images.githubusercontent.com/3676859/113058480-c3496100-91e0-11eb-9e5a-b325e484adac.png" width="500">

Memory went from ~250MB to 3.2GB (peak: 3.5GB) and never decreased again.

With fix 1 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059060-921d6080-91e1-11eb-8ac6-83c70c1f2f75.png" width="500">

Memory went from ~280MB to 1.5GB (peak: 2GB).

With fix 1+2 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059207-bf6a0e80-91e1-11eb-845a-870944f9e116.png" width="500">

Memory went from ~280MB to 1.1GB (peak: 1.6GB).

With fix 1+2+3+4 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059362-ec1e2600-91e1-11eb-8d9a-8fbce8808802.png" width="500">

It's back to normal; memory changed from ~300MB to ~480MB, peaked at 1.2GB. You can clearly see that GC is working correctly here.

---

## Bug

- [x] Related issues vercel#23189, vercel#23436
- [ ] Integration tests added

## Feature

- [ ] Implements an existing feature request or RFC. Make sure the feature request has been accepted for implementation before opening a PR.
- [ ] Related issues linked using `fixes #number`
- [ ] Integration tests added
- [ ] Documentation added
- [ ] Telemetry added. In case of a feature if it's used or not.

## Documentation / Examples

- [ ] Make sure the linting passes
@balazsorban44
Copy link
Member

This issue has been automatically locked due to no recent activity. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.

@vercel vercel locked as resolved and limited conversation to collaborators Jan 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests