High CPU usage of the resource destructor #178

Open · varsill opened this issue Dec 9, 2024 · 2 comments

varsill commented Dec 9, 2024

Hello!
Recently I've come across a problem with high CPU usage when processing images with Vix v0.31.1 in Elixir 1.16.3 on OTP 26.
My scenario is as follows: I am generating a number of videos out of sequences of JPEG images in real time, which is why I need to load images from files and convert them to the YUV color space (so that they can later be encoded).
To be more precise, for each video I run the following sequence of operations multiple times per second (corresponding to the video's framerate); a rough sketch of the shape of this per-frame pipeline follows the list:

Vix.Image.new_from_binary
Vix.Image.write_to_binary
Vix.Operation.jpegload_buffer
Vix.Operation.linear!
Vix.Operation.bandjoin!
Vix.Operation.xyz
Vix.Operation.composite2
Image.YUV.write_to_binary
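For illustration, a minimal sketch of the general shape of such a per-frame pipeline (not my actual code; it uses Vix's full module paths Vix.Vips.Image and Vix.Vips.Operation, and all concrete argument values are placeholders):

```elixir
# Rough sketch only - argument values are placeholders, not the real pipeline.
defmodule FramePipeline do
  alias Vix.Vips.{Image, Operation}

  def process(jpeg_binary) do
    # decode a JPEG frame held in memory
    {:ok, frame} = Image.new_from_buffer(jpeg_binary)

    # example per-band arithmetic; [1.0] / [0.0] are placeholder coefficients
    adjusted = Operation.linear!(frame, [1.0], [0.0])

    # serialize back to raw bytes so the frame can be handed to the encoder
    {:ok, raw} = Image.write_to_binary(adjusted)
    raw
  end
end
```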

Though this task is indeed quite computationally heavy, it looks to me as if it were using more resources than I'd expect.
Apart from that, I am unable to scale the number of videos beyond 5 or 6, even though there are still CPU resources left on my computer.
I did some profiling and noticed that normal scheduler 1 is pushed to its limits, with its utilization reaching 100%, in contrast to the other normal schedulers, which are almost unused. Below is a screenshot from the Erlang observer:
[Screenshot: normal scheduler 1 at ~100% utilization while the remaining normal schedulers are nearly idle]
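(The same imbalance can also be confirmed without the observer using the :scheduler helper from OTP's runtime_tools; a small sketch, assuming runtime_tools is available:)

```elixir
# Sample per-scheduler utilization over one second (runtime_tools, OTP 21+).
# A single saturated normal scheduler stands out immediately.
:scheduler.utilization(1)
|> Enum.each(fn
  {:normal, id, util, percent} ->
    IO.puts("normal scheduler #{id}: #{percent} (#{Float.round(util, 3)})")

  other ->
    IO.inspect(other)
end)
```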

I thought it might have something to do with the fact that the dirty NIF resource destructor is always run on normal scheduler 1 (https://github.com/erlang/otp/blob/51548614b588c667a4230c17e2b187474dd70d63/erts/emulator/beam/erl_nif.c#L2841),
so I decided to see what operations are executed on that scheduler. I used the macOS sample tool to profile the beam.smp process, and it turns out that the majority of the time spent by the normal scheduler 1 thread is indeed spent in run_resource_dtor.
Below is part of the output of my profiling results:

   36632 Thread_2779870: erts_sched_1
    + 25350 ???  (in <unknown binary>)  [0x103f21a64]
    + ! 25346 ???  (in <unknown binary>)  [0x103e6dee0]
    + ! : 21982 erts_schedule  (in beam.smp) + 8660  [0x102adaf14]
    + ! : | 21953 handle_aux_work  (in beam.smp) + 3652  [0x102adff50]
    + ! : | + 21916 run_resource_dtor  (in beam.smp) + 256  [0x102cb3a04]
    + ! : | + ! 21245 g_object_unref  (in libvips.42.dylib) + 224  [0x35c327b88]
    + ! : | + ! : 17932 ???  (in libvips.42.dylib)  load address 0x35c088000 + 0x146d40  [0x35c1ced40]
    + ! : | + ! : | 17930 g_signal_emit  (in libvips.42.dylib) + 28  [0x35c334b60]
    + ! : | + ! : | + 17778 g_signal_emit_valist  (in libvips.42.dylib) + 64  [0x35c33418c]
    + ! : | + ! : | + ! 17613 ???  (in libvips.42.dylib)  load address 0x35c088000 + 0x2ac708  [0x35c334708]
    + ! : | + ! : | + ! : 17216 ???  (in libvips.42.dylib)  load address 0x35c088000 + 0x2ad35c  [0x35c33535c]
    + ! : | + ! : | + ! : | 17210 g_closure_invoke  (in libvips.42.dylib) + 228  [0x35c323fa8]
    + ! : | + ! : | + ! : | + 17145 ???  (in libvips.42.dylib)  load address 0x35c088000 + 0x143b40  [0x35c1cbb40]
    + ! : | + ! : | + ! : | + ! 17143 g_object_set  (in libvips.42.dylib) + 28  [0x35c329bac]
    + ! : | + ! : | + ! : | + ! : 16589 g_object_set_valist  (in libvips.42.dylib) + 700  [0x35c3295c0]
    + ! : | + ! : | + ! : | + ! : | 16413 g_object_unref  (in libvips.42.dylib) + 224  [0x35c327b88]
    + ! : | + ! : | + ! : | + ! : | + 11518 ???  (in libvips.42.dylib)  load address 0x35c088000 + 0x146d40  [0x35c1ced40]
    + ! : | + ! : | + ! : | + ! : | + ! 11517 g_signal_emit  (in libvips.42.dylib) + 28  [0x35c334b60]
    + ! : | + ! : | + ! : | + ! : | + ! : 11386 g_signal_emit_valist  (in libvips.42.dylib) + 64  [0x35c33418c]
   ...

Interestingly, the part starting with g_signal_emit repeats itself several thousand times, apparently decrementing some reference counter.
I've tried reducing the Vips concurrency, and after setting it to 1 the amount of CPU used by run_resource_dtor is smaller, but still significant (around 50% of the time measured with Vips concurrency set to 8).
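For illustration, one common way to dial libvips concurrency down is the standard VIPS_CONCURRENCY environment variable (this assumes libvips picks the variable up before Vix starts doing work; the exact mechanism used in the experiment above may differ):

```elixir
# Assumption: libvips reads VIPS_CONCURRENCY when its thread pools are set up,
# so this must run before any Vix/libvips work starts (e.g. in config/runtime.exs).
System.put_env("VIPS_CONCURRENCY", "1")
```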

I have several questions concerning my issue:

  1. Why does resource deallocation in Vips take that much CPU?
  2. Is it expected that the "g_signal_emit cycle" is run that many times?
  3. Are there some Vips flags that might be used to do the deallocation "in bulk"?
akash-akya (Owner) commented

@varsill thanks for the detailed issue report and analysis.

Deallocation taking a lot of CPU is certainly odd. I'll check this as a high priority as soon as I get some time.

Are there some Vips flags that might be used to do the deallocation "in bulk"?

As of now, no.

Is it possible to share some sample code so I can reproduce this easily? I'd also like to see the context in which these operations are performed (process involvement, etc.).

Also, a blind suggestion if you want an immediate workaround: try wrapping all the vips operations for each frame in a separate process and see if it helps. Something like Task.async(fn -> ... end) |> Task.await().
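A minimal, self-contained sketch of that idea (process_frame here is just a placeholder for the per-frame pipeline described above, not a Vix API):

```elixir
# Run one frame's vips operations in a short-lived process and wait for the
# result. `process_frame` is a placeholder 1-arity function supplied by the caller.
defmodule FrameWorker do
  def convert(jpeg_binary, process_frame) when is_function(process_frame, 1) do
    Task.async(fn -> process_frame.(jpeg_binary) end)
    |> Task.await(:infinity)
  end
end
```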


varsill commented Dec 10, 2024

@akash-akya Thank you very much for the quick response!

Deallocation taking a lot of CPU is certainly odd. I'll check this as a high priority as soon as I get some time.

Great, thanks in advance!

Also, a blind suggestion if you want an immediate workaround: try wrapping all the vips operations for each frame in a separate process and see if it helps. Something like Task.async(fn -> ... end) |> Task.await().

Indeed, this one seems to solve the issue with uneven work distribution among the normal schedulers. When I run :erlang.statistics(:microstate_accounting), I see that the normal schedulers are now sharing the auxiliary work. It allows me to scale the number of concurrent streams even further, so thanks for that workaround suggestion! Unfortunately, the overall CPU consumption seems to be slightly higher than without that change, and I still see that much of the work is spent in run_resource_dtor.
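(For reference, microstate accounting has to be switched on before the counters are meaningful; roughly:)

```elixir
# Enable microstate accounting, let the workload run for a while, then inspect
# how the auxiliary work (which includes resource destructors) is spread across
# the normal schedulers.
:erlang.system_flag(:microstate_accounting, true)
Process.sleep(5_000)

:erlang.statistics(:microstate_accounting)
|> Enum.filter(&(&1.type == :scheduler))
|> Enum.each(fn %{id: id, counters: counters} ->
  IO.puts("scheduler #{id} aux: #{counters[:aux]}")
end)
```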

Is it possible to share some sample code so I can reproduce this easily?

Sure, I will try to prepare a sample script that reproduces this behaviour.
