Memory leak in Image Optimization #23189
Comments
Could you please try installing the latest canary?
Tried with 10.0.10-canary.4 and I'm still having issues. I had been running in a 512 MB RAM Docker image up to version 10.0.7; I increased it to 1024 MB to test this, but it's still using all of the available memory.
We are also having memory leaks and pod crashes since updating from Next.js 10.0.7 to 10.0.9 yesterday. We updated a few other dependencies as well, so it's not yet certain that the Next.js update caused the issue. Note that we do not use the image optimization feature.
I did a rollback on all deps other than next and still had the same problem. The memory leak might be part of the image optimization, because that is what is running slow: fetching an image from the _next/image endpoint is really slow while everything else seems to run as usual. I will try to only fetch images to try to pinpoint the leak.
I did a rollback of just Next.js to 10.0.7 and it fixed the issue, so it's definitely caused by Next.js. As I said, we are not using the image optimization feature at all.
We've fixed some known issues regarding memory usage in canary already. It would be great if you could test it.
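For reference, the canary release can be installed with `npm install next@canary` (or `yarn add next@canary`).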
I had the same memory problem here.
As I understand it, these memory issues are caused by the switch from `sharp` to the WASM-based `squoosh` for image processing.
I wonder if this is still a valid concern with the soon-to-be-released version.
Downgraded next to 10.0.6 because of a mem leak introduced in 10.0.7 - vercel/next.js#23189
Alrighty, then here again :) v10.0.10-canary.7 doesn't fix the issue in my case; it's still the same.
Canary 7 didn't fix the issue for me either. I am seeing a massive memory leak: 6.5 GB/16 GB on my Windows 10 dev machine and 3.5 GB/8 GB on my Azure Linux App Service. The resources are never recovered until the dev process or App Service is stopped.
Our team had the same situation as @fabb where we upgraded to 10.0.8 and we were not using next/image, yet our containers were running out of memory. After downgrading to 10.0.7 our containers were back to running normally. Interestingly, this only surfaced in production containers. Our local environments didn't face this issue, and our playground (something like dev/staging) environments also didn't have this issue. We still haven't been able to determine why.
I just found that we are also experiencing this same bug while hosting in AWS (Elastic Beanstalk). It is at 98% memory consumption :(
I believe this issue could be related to something I'm seeing when using Plaiceholder on Vercel (despite manually adding back the relevant dependency), where `next build` ends in a segmentation fault during page optimization:
00:01:35.213  info - Finalizing page optimization...
00:01:41.224  sh: line 1: 729 Segmentation fault (core dumped) next build
00:01:41.234  npm ERR! code ELIFECYCLE
00:01:41.234  npm ERR! errno 139
The same issue on my side; I had to switch from a 1 GB to a 6 GB instance on GAE (though memory usage stays between 4 and 5 GB). Tried the canary build, too.
In my initial testing of 10.1.1 using the node:14.16.0-alpine3.10 Docker image, the container's memory usage hit 2.5 GiB within 45 seconds and didn't decrease. In comparison, 10.0.7 spikes to 400 MiB when images are being generated and then falls back to 145 MiB.
Should we rename this issue since it's not only happening when using image optimization? Or should we rather create a new one?
I've tried with 10.1.1: 1630 MB. It has gotten better, but it's still a lot more than it used to be in 10.0.7, unfortunately.
@gustavpursche are you using the next/image component, or is memory usage high even without it?
That's something I haven't tested yet. I'm not super good at debugging at this level 🤦🏽 If it helps I can provide access to the private repo (the site itself is very small). I'm using shared hosting, and (on versions after 10.0.7) whenever someone loads a page with a lot of images (e.g. https://sibylleberg.com/en/books), the process is terminated by the hoster because the memory consumption is too high.
It got even worse for me with the new Next version + webpack 5. My solution was to roll back to version 10.0.6.
Yeah, I was a big supporter of trying out something different other than `sharp`, but please consider adding `sharp` back.
I've found a race condition in the image optimisation that seems to not only relate to this but also severely exacerbate it. I opened a separate ticket: #23436
This PR fixes the problem that the image optimization API uses a large amount of memory which is not correctly freed afterwards. There are multiple causes of this problem:

### 1. Too many WebAssembly instances are created

We used to do all the image processing operations (decode, resize, rotate, encodeJpeg, encodePng, encodeWebp) inside each worker thread, where each operation creates at least one WASM instance, and we create `os.cpus().length - 1` workers by default. That means in the worst case, there will be `N*6` WASM instances created (N is the number of CPU cores minus one).

This PR changes it to a pipeline-like architecture: there will be at most 6 workers, and the same type of operation will always be assigned to the same worker. With this change, 6 WASM instances will be created in the worst case.

### 2. WebAssembly memory can't be deallocated

It's known that [WebAssembly can't simply deallocate its memory as of today](https://stackoverflow.com/a/51544868/2424786). And due to the implementation/design of the WASM modules that we are using, they're not very suitable for long-running use; they're more like one-off uses. Each operation like resize allocates **new memory** to store its data, so memory increases quickly as more images are processed.

The fix is to get rid of `execOnce` for WASM module initializations, so each time a new WASM module will be created and the old module will be GC'd entirely, as there's no reference to it. That's the only and easiest way to free the memory used by a WASM module, AFAIK.

### 3. WebAssembly memory isn't correctly freed after finishing the operation

`wasm-bindgen` generates code with global variables like `cachegetUint8Memory0` and `wasm` that always hold the WASM memory as a reference. We need to manually clean them up after finishing each operation. This PR ensures that these variables are deleted so the memory overhead can go back to 0 when an operation is finished.

### 4. Memory leak inside event listeners

`emscripten` generates code with global error listener registration (without ever cleaning the listeners up): https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L39-L43

And the listener has references to the WASM instance directly or indirectly (`e`, `y`, `r`): https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L183-L192

That means whenever a WASM module is created (emscripten), its memory is kept alive by the global scope, and when we replace the WASM module with a new one, the new one is registered again while the old one is still referenced, which causes a leak. Since we're running them inside worker threads (which retry on failure), this PR simply removes these listeners.

### Test

Here are some statistics showing that these changes have improved memory usage a lot (the app I'm using to test has one page of 20 high-res PNGs):

Before this PR (`[email protected]`): memory went from ~250 MB to 3.2 GB (peak: 3.5 GB) and never decreased again.
<img src="https://user-images.githubusercontent.com/3676859/113058480-c3496100-91e0-11eb-9e5a-b325e484adac.png" width="500">

With fix 1 applied: memory went from ~280 MB to 1.5 GB (peak: 2 GB).
<img src="https://user-images.githubusercontent.com/3676859/113059060-921d6080-91e1-11eb-8ac6-83c70c1f2f75.png" width="500">

With fixes 1+2 applied: memory went from ~280 MB to 1.1 GB (peak: 1.6 GB).
<img src="https://user-images.githubusercontent.com/3676859/113059207-bf6a0e80-91e1-11eb-845a-870944f9e116.png" width="500">

With fixes 1+2+3+4 applied: it's back to normal; memory changed from ~300 MB to ~480 MB, peaking at 1.2 GB. You can clearly see that GC is working correctly here.
<img src="https://user-images.githubusercontent.com/3676859/113059362-ec1e2600-91e1-11eb-8d9a-8fbce8808802.png" width="500">

## Bug

- [x] Related issues #23189, #23436
- [ ] Integration tests added
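To make fixes 2 and 3 concrete, here is a minimal sketch of the pattern, assuming a wasm-bindgen-style module with a `decode` export (the file path and export name are hypothetical; this is not the actual Next.js source):

```js
const fs = require('fs')

// wasm-bindgen-style module-scope references that pin the WASM memory:
let wasm = null
let cachegetUint8Memory0 = null

async function init() {
  // Fix 2: instantiate a *fresh* module per operation instead of caching
  // one forever with execOnce, so old instances become eligible for GC.
  const bytes = fs.readFileSync('./decoder.wasm') // hypothetical path
  const { instance } = await WebAssembly.instantiate(bytes, {})
  wasm = instance.exports
  cachegetUint8Memory0 = new Uint8Array(wasm.memory.buffer)
}

async function decode(input) {
  await init()
  try {
    return wasm.decode(input) // hypothetical export
  } finally {
    // Fix 3: clear the globals after each operation so nothing keeps a
    // reference to the WASM linear memory once the result is returned.
    wasm = null
    cachegetUint8Memory0 = null
  }
}
```

One visible trade-off of re-instantiating per call is extra CPU time, which lines up with the slower first-load optimization reported further down in this thread.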
Please try out the Next.js canary release, which includes #23565.
@timneutkens Thanks for the new release! I can see that it has gotten much better: 10.1.3-canary.0: 940 MB. However, the memory consumption is still much higher than it used to be in my case.
Hi @gustavpursche, thanks for sharing the numbers! I'm curious whether 940 MB is the stable memory usage after image optimization, or the memory usage without image optimization. This is important for us to debug, because most of the memory should be freed within a short while (5–15 seconds) after image optimization, according to my tests in #23565.
@shuding Thanks for the question. As I said, I'm not very experienced in debugging at this level. Stack Overflow suggested a way to measure it, and the charts it gives me suggest you are right: the memory consumption isn't much higher in the long term (it's actually lower); it only peaks higher at the beginning.
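For anyone else who wants to watch the server's memory over time without extra tooling, one minimal option (a suggestion on my part, not necessarily what was used above) is to log `process.memoryUsage()` on an interval:

```js
// Sample the Node.js process memory every 5 seconds; run this inside the
// server process (e.g. from a custom server.js) and watch whether RSS
// keeps climbing after image requests.
const toMB = (bytes) => `${(bytes / 1024 / 1024).toFixed(1)} MB`

setInterval(() => {
  const { rss, heapUsed, external } = process.memoryUsage()
  console.log(`rss=${toMB(rss)} heapUsed=${toMB(heapUsed)} external=${toMB(external)}`)
}, 5000)
```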
@gustavpursche thank you very much, these 2 charts are helpful! They basically match my test results: the memory leak in next/image is gone. The higher initial memory usage could be something else, such as the new worker library that we introduced; I'll keep looking into it.
Great, thanks for your continued effort on this @shuding! Really cool to see the fast response and the amount of care that Vercel puts into things.
Here is a comparison of my memory usage. It's not a direct apples-to-apples comparison, as the blue line has probably cached most of the optimized images, while on the orange line I was hammering the image processor generating e-commerce images after a fresh build. But if anything, that indicates an even better result for the new version. Blue line: 10.0.7. The only downside of the new build is that the initial image optimization is noticeably slower than the previous version and can take seconds to populate images on the page. It seems to be exacerbated by loading more images in quick succession as well.
Hey @shuding, great piece of work there 😄 The canary release does indeed fix the memory overload, and it now seems to stay similar to 10.0.7, with no real noticeable difference between the two for me 😄
@shuding Now that this is closed, where can we follow along on these two issues?
I created issue #23637 to report the slowdown.
- Switched because vercel/next.js#23189 has been fixed there - vercel/next.js#23565
This issue has been automatically locked due to no recent activity. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.
What version of Next.js are you using?
10.0.9
What version of Node.js are you using?
14.15.5
What browser are you using?
Chrome
What operating system are you using?
Alpine (Docker image)
How are you deploying your application?
express server
Describe the Bug
Upgrading from 10.0.7 to 10.0.8 or 10.0.9 results in a server-side memory leak when using the next/image component.
The server consumes over 1 GB of memory after only 10–20 image optimizations.
Expected Behavior
Memory usage should be normal.
To Reproduce
Import Image from 'next/image' and use it as the documentation says (https://nextjs.org/docs/api-reference/next/image) with an allowed external image source.
In this case layout="fixed" was used, as in the sketch below.
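A minimal reproduction along those lines might look like this (the domain and image URL are placeholders; external hosts have to be whitelisted in `next.config.js`):

```js
// next.config.js: allow the external image host (placeholder domain)
module.exports = {
  images: {
    domains: ['images.example.com'],
  },
}
```

```jsx
// pages/index.js
import Image from 'next/image'

export default function Home() {
  return (
    <Image
      src="https://images.example.com/photo.jpg" // placeholder URL
      alt="Example image"
      width={800}
      height={600}
      layout="fixed"
    />
  )
}
```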