Memory leak on v20231217 #429

Closed
dmitry-j-mikhin opened this issue Jan 31, 2024 · 31 comments

Comments

@dmitry-j-mikhin

I have a small VDS with 1GB of memory (https://asciinema.tw1.ru). I installed asciinema-server v20231217 for personal use. The server has no real load, just several requests per day, but every 2 days an OOM kill occurs:

Jan 26 18:06:40 2430241-cu30781.twc1.net kernel: Out of memory: Killed process 1109 (beam.smp) total-vm:1665272kB, anon-rss:457200kB, file-rss:0kB, shmem-rss:93584kB, UID:1000 pgtables:1252kB oom_score_adj:200
Jan 28 13:37:05 2430241-cu30781.twc1.net kernel: Out of memory: Killed process 14114 (beam.smp) total-vm:1679664kB, anon-rss:458068kB, file-rss:0kB, shmem-rss:93648kB, UID:1000 pgtables:1284kB oom_score_adj:200
Jan 30 10:21:24 2430241-cu30781.twc1.net kernel: Out of memory: Killed process 25835 (beam.smp) total-vm:1665368kB, anon-rss:468296kB, file-rss:0kB, shmem-rss:93584kB, UID:1000 pgtables:1264kB oom_score_adj:200

The memory usage graph looks like this:
[memory usage graph]

@dmitry-j-mikhin
Author

v20240203 also has this issue: the memory usage of the beam.smp process is constantly increasing.

@ku1ik
Contributor

ku1ik commented Feb 5, 2024

Thanks for the report. Great that you have the graph; this indeed looks like a leak. I'll try to reproduce it locally.

@dmitry-j-mikhin
Author

@ku1ik thanks for the feedback and for this wonderful project. If you need more information, logs, etc., I can provide that as well.

@jiriks74

I'm currently using the latest tag in Docker, so I don't know exactly which version I'm on.

But I can confirm that this indeed happens. I've been running the server for about 4 days now and I've noticed the memory usage steadily climbing. I restarted the container once to see if it was just a fluke, but the memory usage is still climbing like before. I'll probably set up a cron job to restart it at midnight every day to get around this until it's patched.

Also: there's no usage at all. I have a single account on this instance, there are no recordings (apart from the welcome one), no recordings are shared so no websites are loading them, and I haven't used it for about 2 days now, but memory usage is still going up.

Here's my graph:
[memory usage graph]

If you're interested I can make some tests, etc., as it's a personal instance where downtime won't annoy anyone.

@jiriks74

I just modified my view so that I can filter containers more easily, and I also found a way to capture non-visible parts of a website in Firefox (the built-in screenshot tool is still broken), so here's a better screenshot. You can also see the network usage to get an idea of how little this instance is used.

[asciinema-memory_leak: container memory and network usage graph]

Note

If you see multiple containers named asciinema-asciinema-1, it means there was a container recreation/restart. When I recreate/restart the container, some stats get a different ID. It's still the same container.

@ku1ik
Contributor

ku1ik commented May 14, 2024

Does the memory usage grow indefinitely, or does it top out at some value, then go down and up again?

@ku1ik
Contributor

ku1ik commented May 14, 2024

I can observe similar behavior on asciinema.org:

[asciinema.org memory usage graph]

@jiriks74

Does the memory usage grow indefinitely, or does it top out at some value, then go down and up again?

So far it grows indefinitely, until I restart it.

@dmitry-j-mikhin
Author

dmitry-j-mikhin commented May 14, 2024

So far it grows indefinitely, until I restart it.

Or until the OOM killer kills the process that's eating up all the memory :)

@dmitry-j-mikhin
Author

dmitry-j-mikhin commented May 14, 2024

These are my 30d graphs:
[30-day memory usage graphs]

@ku1ik
Contributor

ku1ik commented May 14, 2024

Thanks. Yeah, the problem is definitely there. I'm trying to pinpoint it now.

@jiriks74

I see that around the time of the reported v20231217 version, the admin dashboard was added:

added admin console endpoint in port 4002, with Phoenix LiveDashboard at http://localhost:4002/dashboard

https://github.com/asciinema/asciinema-server/releases/tag/v20231216

Maybe that could be the culprit?

@ku1ik
Contributor

ku1ik commented May 14, 2024

I think it's the built-in prometheus endpoint (http://IP:9568/metrics), which, when not queried, accumulates aggregated data in an ETS table.
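
For context, the accumulation is easy to demonstrate outside asciinema-server. Below is a minimal sketch assuming the telemetry_metrics_prometheus_core package; the module, metric, and registry names are made up for illustration, and I haven't checked how asciinema-server wires the reporter up exactly. Distribution samples are buffered in an ETS table and are only aggregated (with the backlog cleared) when a scrape runs, so an endpoint that is never queried keeps growing.

defmodule LeakDemo do
  # Illustrative only: shows unscraped distribution samples piling up in the
  # reporter's ETS table until a scrape aggregates them.
  def run do
    {:ok, _pid} =
      TelemetryMetricsPrometheus.Core.start_link(
        name: :demo_metrics,
        metrics: [
          # Buckets go under reporter_options in recent reporter versions.
          Telemetry.Metrics.distribution("demo.request.duration",
            reporter_options: [buckets: [10, 50, 100, 500]]
          )
        ]
      )

    # Emit many events; with nothing scraping, every sample is retained.
    Enum.each(1..100_000, fn i ->
      :telemetry.execute([:demo, :request], %{duration: rem(i, 500)}, %{})
    end)

    # A scrape (what GET /metrics does internally) aggregates and clears the backlog.
    _prometheus_text = TelemetryMetricsPrometheus.Core.scrape(:demo_metrics)
    :ok
  end
end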

@ku1ik
Contributor

ku1ik commented May 14, 2024

That's likely it: beam-telemetry/telemetry_metrics_prometheus_core#52

@jiriks74

jiriks74 commented May 14, 2024

I agree. I exposed the dashboard (recreated the container) and then opened it after a while. The usage has not gone up for an hour now (though that may be too little time for it to be visible):

[memory usage graph]

Note

The memory spike is from when I recreated the container to expose port 4002.

@jiriks74

I'll let this run overnight to see how the memory usage behaves and whether opening the dashboard causes the memory to be freed.

@jiriks74

jiriks74 commented May 14, 2024

I did a quick test before I went to sleep, and after opening the dashboard the memory usage dropped. I can confirm that the admin panel feature is the cause here.

[memory usage graph]

Note

RSS memory usage didn't go down. This may be caused by my 2GB swap, which is 80% free most of the time. From reading the upstream issue, it looks like the metrics are not flushed from memory until they're loaded by the panel. Since the metrics are static data, it would make sense for Linux to move them to swap.

@ku1ik
Contributor

ku1ik commented May 15, 2024

I nailed it. This is the memory graph from asciinema.org with the prometheus endpoint disabled. Note the last quarter: flat as a table :)

[asciinema.org memory usage graph]

ku1ik closed this as completed in 06cd7e5 on May 15, 2024
@ku1ik
Contributor

ku1ik commented May 15, 2024

I'll release a new version with the fix soon.

@jiriks74

Would there be a config option for it? If I were able to monitor things like the number of recordings, etc., I might integrate it into my monitoring stack.

@ku1ik
Contributor

ku1ik commented May 15, 2024

The built-in prometheus endpoint provided the following stats:

# last_value/2, distribution/2 and counter/2 come from Telemetry.Metrics;
# repo_distribution, phoenix_distribution, oban_counter and oban_distribution
# are metric option lists (e.g. buckets/tags) defined elsewhere in the module.
[
  # VM
  last_value("vm.memory.total", unit: :byte),
  last_value("vm.total_run_queue_lengths.total"),
  last_value("vm.total_run_queue_lengths.cpu"),
  last_value("vm.total_run_queue_lengths.io"),
  # Ecto
  distribution("asciinema.repo.query.total_time", repo_distribution),
  distribution("asciinema.repo.query.decode_time", repo_distribution),
  distribution("asciinema.repo.query.query_time", repo_distribution),
  distribution("asciinema.repo.query.idle_time", repo_distribution),
  distribution("asciinema.repo.query.queue_time", repo_distribution),
  # Phoenix
  distribution("phoenix.endpoint.start.system_time", phoenix_distribution),
  distribution("phoenix.endpoint.stop.duration", phoenix_distribution),
  distribution("phoenix.router_dispatch.start.system_time", phoenix_distribution),
  distribution("phoenix.router_dispatch.exception.duration", phoenix_distribution),
  distribution("phoenix.router_dispatch.stop.duration", phoenix_distribution),
  distribution("phoenix.socket_connected.duration", phoenix_distribution),
  distribution("phoenix.channel_join.duration", phoenix_distribution),
  distribution("phoenix.channel_handled_in.duration", phoenix_distribution),
  # Oban
  counter("oban.job.start.count", oban_counter),
  distribution("oban.job.stop.duration", oban_distribution),
  distribution("oban.job.exception.duration", oban_distribution)
]
So: BEAM VM memory/CPU stats, HTTP request stats, database query stats, and background job stats. It didn't include application-level stats like recording counts, etc.

I removed it for now to fix the leak, especially since it was undocumented and nobody used it (including me).

I may re-add it in the future, with some more useful stats, and with an explicit config option to enable it.
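
For what it's worth, the opt-in could be as small as gating the exporter on an environment variable. A rough, purely hypothetical sketch follows: the METRICS_ENABLED / METRICS_PORT names and config keys are not real asciinema-server settings, metrics() stands for the list quoted above, and it assumes the full telemetry_metrics_prometheus package, which serves /metrics on the configured port.

# In config/runtime.exs (which already starts with `import Config`):
if System.get_env("METRICS_ENABLED") in ~w(1 true yes) do
  config :asciinema, :prometheus,
    enabled: true,
    port: String.to_integer(System.get_env("METRICS_PORT") || "9568")
end

# In the application supervisor (hypothetical helper): start the exporter only
# when enabled; metrics() would return the metric definitions quoted earlier.
defp prometheus_children do
  prom_config = Application.get_env(:asciinema, :prometheus, [])

  if prom_config[:enabled] do
    [{TelemetryMetricsPrometheus, metrics: metrics(), port: prom_config[:port] || 9568}]
  else
    []
  end
end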

@jiriks74

Yeah, it was weird for me when I saw it in the logs (admin panel listening on 4002) while the docs said there's no admin panel.

And those stats are useless for me since I gather these things another way: CPU, memory, and network Rx/Tx are gathered by Docker and cAdvisor, and HTTP requests are gathered by Traefik.

@ku1ik
Contributor

ku1ik commented May 15, 2024

Yeah, it was weird for me when I saw it in the logs (admin panel listening on 4002) while the docs said there's no admin panel.

Both the dashboard and the /metrics endpoint report similar metrics, from the same source.

The admin panel is still there, and it was not the cause of the leak. It's a basic dashboard with some metrics, but it doesn't use any noticeable extra resources, and it doesn't hurt when it's not used.

The problem was the prometheus /metrics endpoint, which I've now removed.

@jiriks74

How does the /metrics endpoint work? I accessed :4002/metrics and got a 404. I'd just like to see whether I could fetch and discard the stats until there's a new release.

@ku1ik
Contributor

ku1ik commented May 15, 2024

@jiriks74 this one runs on its own port, 9568.
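
Until the fixed release is out, one stopgap is to scrape that endpoint on a timer and throw the response away, so the reporter keeps aggregating its buffered samples. A minimal sketch; the host, port, and interval are assumptions, and a plain cron job hitting the URL with any HTTP client would achieve the same:

defmodule MetricsDrain do
  @moduledoc false
  # Periodically GETs the prometheus endpoint and discards the body, so the
  # reporter aggregates its buffered samples instead of letting them pile up.
  use GenServer

  @url ~c"http://127.0.0.1:9568/metrics"  # assumed host/port
  @interval :timer.minutes(5)             # assumed interval

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(nil) do
    {:ok, _apps} = Application.ensure_all_started(:inets)
    schedule()
    {:ok, nil}
  end

  @impl true
  def handle_info(:drain, state) do
    # The response is intentionally ignored; only the scrape side effect matters.
    _ = :httpc.request(:get, {@url, []}, [], [])
    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :drain, @interval)
end

Adding a worker like this to a supervision tree (or just scheduling the GET outside the container) should keep memory flat until the upgrade.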

@ku1ik
Contributor

ku1ik commented May 15, 2024

@jiriks74 the new release is out (Docker image); I'm just wrapping up the release notes.

@jiriks74

Thanks <3

@jiriks74

jiriks74 commented May 15, 2024

The new release seems to be working. It's only been about 30 minutes, but I can confirm that I no longer see the memory leak trend I saw before.

Old release

[memory usage graph: old release]

New release

[memory usage graph: new release]

Note

The spike on the second graph is when I updated to the new release and started the server.

Both graphs are over an interval of 30 minutes.

@jiriks74

Hello, here's my last comment on this, I promise. Since memory leaks can sometimes be a pain, I kept monitoring to see whether anything would come up. I can now confirm that there hasn't been any significant memory usage growth over the last 24 hours. Below is a graph over 48 hours, to show the difference between the old and new releases more easily.

[48-hour memory usage graph]

@ku1ik
Contributor

ku1ik commented May 16, 2024

Thank you @jiriks74! I'm observing the same on asciinema.org so we can safely assume the problem has been solved. Thanks for the assistance! 🤝
