
Possible hang with OpenMP #96

Open
aulwes opened this issue May 2, 2024 · 3 comments
Labels: enhancement (New feature or request)
Milestone: v1.2.3

Comments

aulwes commented May 2, 2024

Hi, are there known issues with using OpenMP inside MALT? I attempted to modify the code in SymbolSolver::solveNames() to parallelize the loop over the theCommands list, to see if I could speed up the addr2line phase. However, it seems to hang. As a simple reproducer, I added this loop

// Requires <omp.h> and <unistd.h> (plus <iostream>) to be included in SymbolSolver.cpp.
#pragma omp parallel for
for (int c = 0; c < 8; ++c) {
    int tid = omp_get_thread_num();
    std::cerr << " thread " << tid << " going to sleep..." << std::endl;
    usleep(50000 * (tid + 1) + c * 2000);
}

right before the run loop that executes addr2line. I tried both the Intel and GCC compilers with '-fopenmp', but the above loop hangs as well. I set OMP_NUM_THREADS=4.

Thanks,
Rob

svalat commented May 3, 2024

Hi Rob, thanks for reporting your attempt.

I remember I already tried this briefly a couple of years ago and saw what you describe (exactly for that part, to speed up the symbol solving).

I would say there is probably an issue to fix at that stage (on exit): we must not re-enter the library if it performs a malloc. Or possibly it is because we are unloaded very late due to LD_PRELOAD, and the OpenMP runtime is already shut down by then.

In principle this should already be handled, but from what you describe (and what I remember of my own attempt) there is certainly a re-entrance problem.
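
To give an idea, here is roughly the kind of re-entrance guard I have in mind. This is only a sketch with illustrative names (malt_in_instrumentation, real_malloc and the hook layout are not MALT's actual code):

// Sketch only: a thread-local guard so the malloc wrapper skips instrumentation
// while we are already inside MALT. Build with -ldl; real code must also handle
// the case where dlsym itself allocates.
#include <cstdlib>
#include <dlfcn.h>

static thread_local bool malt_in_instrumentation = false;

extern "C" void * malloc(size_t size)
{
    // Resolve the real allocator lazily.
    static void * (*real_malloc)(size_t) = (void * (*)(size_t))dlsym(RTLD_NEXT, "malloc");

    // Re-entrance guard: threads spawned by the OpenMP runtime during symbol
    // solving fall through to the real allocator instead of recursing into MALT.
    if (malt_in_instrumentation)
        return real_malloc(size);

    malt_in_instrumentation = true;
    void * ptr = real_malloc(size);
    // ... record the allocation in MALT's data structures here ...
    malt_in_instrumentation = false;
    return ptr;
}

The key point is that the flag is thread_local, so helper threads created by the OpenMP runtime get their own guard instead of sharing a single global state.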

Anyway, I wanted to clean up that part of the code because of some patches added by others this year, so that I can make a new release soon. I can try to focus on this when I return in mid-May and see if I can also look at parallelism. I would probably use pthread at this layer instead of OpenMP, to limit interactions with extra components. But I can check first to see what the current status is.
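
As a sketch of that plain-thread idea (std::thread shown instead of raw pthread for brevity; solveOneCommand() and the list type are placeholders, not the real MALT code):

// Sketch: split theCommands over a fixed pool of worker threads.
#include <atomic>
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

// Placeholder for the existing per-command work (run addr2line and parse its output).
static void solveOneCommand(const std::string & cmd)
{
    std::system(cmd.c_str());
}

void solveNamesParallel(const std::vector<std::string> & theCommands, unsigned nbThreads)
{
    std::atomic<size_t> next(0);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < nbThreads; ++t) {
        workers.emplace_back([&]() {
            // Each worker grabs the next unsolved command until the list is exhausted.
            for (size_t i = next++; i < theCommands.size(); i = next++)
                solveOneCommand(theCommands[i]);
        });
    }

    for (auto & w : workers)
        w.join();
}

The shared atomic counter gives dynamic load balancing even if some addr2line invocations are much slower than others.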

Someone is also giving me feedback on whether MALT can optionally run on MacOSX, and we will certainly have problems in that part of the code, so I will in any case have a reason to re-examine it at the end of the month.

Just a question to understand: I myself looked at making symbol solving parallel when applying MALT to a very large C++ code at CERN, because it was taking a long time at the end of the run. I suppose your problem is similar?

svalat added the enhancement (New feature or request) label May 3, 2024
svalat added this to the v1.2.3 milestone May 3, 2024
aulwes commented May 3, 2024

Thank you for checking! What my colleague found is that the slowdown we're seeing comes from using the nm tool to get the global variables, not from addr2line. For one of the executables we're profiling, this nm step took over 20 minutes. What we discussed is whether this nm step could be done separately and cached ahead of time until MALT needs it. Or could we use objdump instead?

svalat commented May 3, 2024

Hmm, interesting to know.
Looking at it, I could probably use readelf directly when debug symbols are present (at least for the parts that have them). That would also add something currently missing in MALT: the source origin of the global variables.

https://stackoverflow.com/questions/11003376/extract-global-variables-from-a-out-file/11056685#11056685
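
Roughly the kind of readelf-based extraction I have in mind, only a sketch (the function name is illustrative and the parsing of readelf -sW output is simplified):

// Sketch: list global variables of a binary by parsing readelf -sW output.
#include <cstdio>
#include <iostream>
#include <sstream>
#include <string>

void listGlobalVariables(const std::string & binary)
{
    std::string cmd = "readelf -sW " + binary;
    FILE * pipe = popen(cmd.c_str(), "r");
    if (pipe == nullptr)
        return;

    char line[4096];
    while (fgets(line, sizeof(line), pipe) != nullptr) {
        // Symbol table lines look like:
        //   Num:    Value          Size Type    Bind   Vis      Ndx Name
        std::istringstream iss(line);
        std::string num, value, size, type, bind, vis, ndx, name;
        if (iss >> num >> value >> size >> type >> bind >> vis >> ndx >> name
            && type == "OBJECT" && bind == "GLOBAL")
            std::cout << name << " (size " << size << ")" << std::endl;
    }

    pclose(pipe);
}

The source origin of each variable would then come from the DWARF info (readelf -w), which is the part I still need to look at.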

Can you measure, on your problematic case, how much readelf -s, readelf -w, and nm --print-size -l -n -P --no-demangle cost on the various libs and the executable, all combined?

I suppose the cost comes from one big lib or from the executable itself? Or is parallelism the solution to reduce it a lot because multiple files are involved?

What you propose looks interesting. For the globals we can also offer options to either:

  1. Not track global variables at all (in principle this is needed once, in a first study, then most of the time we don't want to look at them anymore).
  2. As you propose, use a cache, at least for the fixed libs which are not recompiled, based on the MD5SUM of the object. For the executable itself, if the cost comes from it, this is harder to decide, I think, even if we can offer an option (see the sketch after this list).
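
To make option 2 concrete, here is a rough sketch of such a per-library cache. The file layout is illustrative, and it keys on path + mtime + size as a cheap stand-in for the MD5SUM mentioned above:

// Sketch: on-disk cache of symbol-extraction output, one cache file per library.
#include <fstream>
#include <sstream>
#include <string>
#include <sys/stat.h>

// Build a key that changes whenever the library file changes (stand-in for an MD5SUM).
std::string cacheKeyFor(const std::string & libPath)
{
    struct stat st;
    if (stat(libPath.c_str(), &st) != 0)
        return std::string();
    std::ostringstream key;
    key << libPath << ':' << st.st_mtime << ':' << st.st_size;
    return key.str();
}

// Return the cached nm/readelf output if the stored key still matches, empty otherwise.
std::string loadCachedSymbols(const std::string & cacheFile, const std::string & key)
{
    std::ifstream in(cacheFile);
    std::string storedKey;
    if (!std::getline(in, storedKey) || storedKey != key)
        return std::string();
    std::stringstream content;
    content << in.rdbuf();
    return content.str();
}

// Store freshly computed output together with its key.
void storeCachedSymbols(const std::string & cacheFile, const std::string & key, const std::string & data)
{
    std::ofstream out(cacheFile);
    out << key << '\n' << data;
}

The solver would then call loadCachedSymbols() first and only fall back to running nm or readelf (followed by storeCachedSymbols()) on a cache miss.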
