Possible memorypool implementation problem #21

Open
ghost opened this issue Sep 26, 2013 · 10 comments

@ghost

ghost commented Sep 26, 2013

I am having problems running lz4mt: specifically, when decompressing, it stalls at some point.

Running it under valgrind with 3 of its tools (memcheck, helgrind and DRD), all of them report errors when decompressing:

  • memcheck: complains about invalid reads and writes and ends with "Syscall param futex(futex) points to unaddressable byte(s)". All of these errors say that some address is inside a block of size 240 free'd.

  • helgrind: complains about possible data races when a given thread reads or writes.

  • DRD: same as helgrind but with a different description; it complains about a conflicting load/store by a given thread.

These stalls definitely happen with lz4mtDecompress in multithreaded mode; I am not yet sure about single-threaded mode.
A common line in the output of all 3 tools: ==procID== by 0xADDRESS lz4mtDecompress::{lambda(int, Lz4Mt::MemPool::Buffer*, bool, unsigned int)#1}::operator()(int, Lz4Mt::MemPool::Buffer*, bool, unsigned int)

I've tested with both gcc 4.7.2 and gcc 4.8.1, locally on my laptop running archlinux with recent packages (linux 3.11.1), and remotely on a cluster running centos with older linux 2.6.18.
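To make the memcheck report more concrete, here is a minimal sketch of the kind of race it seems to describe. This is not lz4mt's actual code, only my guess at the general shape: a worker thread touching a heap buffer after its owner has freed it.

    // Deliberately broken sketch (NOT lz4mt's code): use-after-free across threads.
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        auto* buf = new std::vector<char>(240);   // stand-in for a pooled buffer
        std::thread worker([buf] {
            (*buf)[0] = 1;   // may run after the delete below: memcheck reports
        });                  // invalid reads/writes "in a block ... free'd"
        delete buf;          // freed while the worker may still be using it
        worker.join();
        std::printf("done\n");
        return 0;
    }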

@t-mat
Owner

t-mat commented Sep 26, 2013

Hi samuel, thanks for the report!
I've checked your problem.

Summary

  • I've partly reproduced your problem.

Questions

  • Could you describe the "stall" in a bit more detail?
    • e.g. segfault, silence with 0% CPU usage, eating all CPU cycles, etc.
  • What kind of data did you use?
  • Could you reproduce your problem with enwik8?
  • Could you show me your full valgrind log?

Result

  • I've reproduced
    • Possible data race error with valgrind --tool=helgrind
    • Conflict with valgrind --tool=drd
  • I could not reproduce
    • Stall / stop when decompressing
    • Invalid reads/writes error with valgrind --tool=memcheck

Here is the full result.

Todo

  • Investigate valgrind's errors.
  • Reproduce samuel's problem.

@ghost ghost assigned t-mat Sep 26, 2013
@ghost
Author

ghost commented Sep 26, 2013

Thanks for your fast reply, Mr. Takayuki :)

Sorry for my incomplete report; I will try to get you all the information you asked for. For now I only have time for some answers:

  • Stall: silence with 0% CPU usage
  • I am using binary and text data
  • I can, under a certain condition I forgot to mention: the output goes to null.
    I think this will be enough to reproduce it:
    for i in {1..100}; do ./lz4mt481_12Sep_omp -dy --lz4mt-thread=0 enwik8.lz4 null; done
    Wait a little and it will eventually happen.

@t-mat
Owner

t-mat commented Sep 27, 2013

output is to null

It seems that null is the key to this problem.
I could not reproduce the "stall", but I always get a std::future_error from the following command:

$ ./lz4mt -d -y enwik8.linux.lz4.c0 null
terminate called after throwing an instance of 'std::future_error'
  what():  No associated state
Aborted (core dumped)
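For reference, libstdc++ throws this error when get() or wait() is called on a std::future whose shared state is gone, e.g. after it has been moved from. A minimal sketch (not lz4mt's code; strictly speaking such a get() is undefined behavior, but libstdc++ throws):

    #include <cstdio>
    #include <future>

    int main() {
        std::future<int> f = std::async(std::launch::async, [] { return 42; });
        std::future<int> g = std::move(f);   // f's shared state moves into g
        try {
            f.get();                         // libstdc++ throws std::future_error
        } catch (const std::future_error& e) {
            std::printf("%s\n", e.what());   // "No associated state"
        }
        return g.get() == 42 ? 0 : 1;
    }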

Sorry for my incomplete report,

No problem, it's a good report. This issue list is not for a QA/debug team, so a small report is a good starting point 😄

Todo

  • Investigate the null output problem.
  • Investigate valgrind's errors.
  • Reproduce @samalm321's "stall" problem.

@ghost
Author

ghost commented Sep 27, 2013

Here are 4 logs from valgrind, for enwik8 and a binary dataset I have named msg_bt.bin, with both the helgrind and drd tools:

Build environment:

$ uname -r
2.6.18-128.1.14.el5

 $ gcc -v       
Using built-in specs.
COLLECT_GCC=/home/cpd18777/gentoo_prefix/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.2/gcc
COLLECT_LTO_WRAPPER=/home/cpd18777/gentoo_prefix/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /home/cpd18777/gentoo_prefix/var/tmp/portage/sys-devel/gcc-4.7.2-r1/work/gcc-4.7.2/configure --prefix=/home/cpd18777/gentoo_prefix/usr --bindir=/home/cpd18777/gentoo_prefix/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.2 --includedir=/home/cpd18777/gentoo_prefix/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/include --datadir=/home/cpd18777/gentoo_prefix/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.2 --mandir=/home/cpd18777/gentoo_prefix/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.2/man --infodir=/home/cpd18777/gentoo_prefix/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.2/info --with-gxx-include-dir=/home/cpd18777/gentoo_prefix/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/include/g++-v4 --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec --disable-fixed-point --without-ppl --without-cloog --enable-lto --enable-nls --without-included-gettext --with-system-zlib --enable-obsolete --disable-werror --enable-secureplt --disable-multilib --with-multilib-list=m64 --enable-libmudflap --disable-libssp --enable-libgomp --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.7.2/python --enable-checking=release --disable-libgcj --enable-libstdcxx-time --enable-languages=c,c++,fortran --enable-shared --enable-threads=posix --with-local-prefix=/home/cpd18777/gentoo_prefix/usr --enable-__cxa_atexit --enable-clocale=gnu --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.7.2-r1 p1.5, pie-0.5.5'
Thread model: posix
gcc version 4.7.2 (Gentoo 4.7.2-r1 p1.5, pie-0.5.5) 

Run environment:

$ uname -r
2.6.32-279.22.1.el6.x86_64

$ valgrind --version
valgrind-3.8.1 

I am using the most recent version of lz4mt, compiled with -O0 -g so valgrind can output more info.

The stall problem has me puzzled; I can't always reproduce it. I think it only happens when lz4mt is executed many times in quick succession (as in that for-loop example) and each execution takes a very small amount of time. enwik8 almost always produces that std::future_error, but with the other dataset, msg_bt.bin, the error almost never happens and the execution stalls with 0% CPU usage after some iterations.

@t-mat
Owner

t-mat commented Sep 28, 2013

Thanks for the logs. I'm checking your report.

In 2a8ed67, I've resolved the std::future_error caused by null output.

In fb61bf3, I've resolved valgrind --tool=memcheck's "possibly lost" warning:

  • This is a false positive: an instance of Opt still exists when exit() is called.
  • To prevent this warning, I've split the real work out of main() into a separate function, and main() now just exit()s; see the sketch below.
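The fix has roughly the following shape (a sketch with illustrative names, not the actual lz4mt code):

    #include <cstdlib>
    #include <string>
    #include <vector>

    // Illustrative stand-in for lz4mt's option object.
    struct Opt {
        std::vector<std::string> args;
        Opt(int argc, char* argv[]) : args(argv, argv + argc) {}
    };

    int realMain(int argc, char* argv[]) {
        Opt opt(argc, argv);   // destroyed when realMain() returns
        // ... the real work goes here ...
        return EXIT_SUCCESS;
    }

    int main(int argc, char* argv[]) {
        // opt's destructor has already run by the time exit() is called,
        // so memcheck no longer flags it as "possibly lost".
        std::exit(realMain(argc, argv));
    }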

Todo

  • Investigate the null output problem.
  • Investigate valgrind --tool=helgrind.
  • Investigate valgrind --tool=drd.
  • Reproduce valgrind --tool=memcheck's warning:
    • complains about invalid reads and writes and ends with a "Syscall param futex(futex) points to unaddressable byte(s)". All of these errors say that some address is inside a block of size 240 free'd.
  • Reproduce "stall" problem.

@t-mat
Owner

t-mat commented Sep 28, 2013

MEMO TO ME

The GNU C++ Library
3. Using - Debugging Support - Data Race Hunting
http://gcc.gnu.org/onlinedocs/libstdc++/manual/debug.html#debug.races

c++ - std::thread problems - Stack Overflow
http://stackoverflow.com/questions/10618142/stdthread-problems
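If I read the first link correctly, libstdc++ exposes synchronization hooks that can be routed to DRD's annotation macros, so that the reference counting inside std::shared_ptr stops showing up as false positives. An untested sketch (the build line and file name are my assumptions):

    // Define the hooks BEFORE including any standard header.
    // Build (assumed): g++ -std=c++11 -O0 -g -pthread race_hunt.cpp
    // Run:             valgrind --tool=drd ./a.out
    #include <valgrind/drd.h>
    #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) ANNOTATE_HAPPENS_BEFORE(A)
    #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A)  ANNOTATE_HAPPENS_AFTER(A)

    #include <memory>
    #include <thread>

    int main() {
        auto p = std::make_shared<int>(42);
        std::thread t([p] { });   // copying p touches the shared reference count
        t.join();
        return *p == 42 ? 0 : 1;
    }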

@t-mat
Owner

t-mat commented Dec 7, 2013

MEMO

Bug 327881 - False Positive Warning on std::atomic_bool ( helgrind @ valgrind 3.9.0 )
https://bugs.kde.org/show_bug.cgi?id=327881

valgrind-variant
https://code.google.com/p/valgrind-variant/source/browse/trunk/valgrind/drd/tests/std_thread.cpp?spec=svn129&r=129

// Test whether no race conditions are reported on std::thread. Note: since
// the implementation of std::thread uses the shared pointer implementation,
// that implementation has to be annotated in order to avoid false positives.

I still have not investigated this issue, but I believe there are both real problems and false positives.
It seems that Valgrind (3.8.1 and 3.9.0) has some problems with std::atomic_* and std::shared_ptr.
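To make that concrete, here is a minimal sketch of the pattern (not lz4mt's code). As far as I can tell it is correct C++11, yet helgrind/DRD of those versions may still report a race on the plain variable, because they do not model the atomic's acquire/release ordering:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int main() {
        int payload = 0;
        std::atomic_bool done(false);

        std::thread producer([&] {
            payload = 42;   // plain write, published by the release store below
            done.store(true, std::memory_order_release);
        });

        while (!done.load(std::memory_order_acquire)) { /* spin */ }
        std::printf("payload = %d\n", payload);   // safe: acquire saw the release
        producer.join();
        return 0;
    }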

@t-mat
Owner

t-mat commented Mar 26, 2014

MEMO

GCC Bugzilla - Bug 51504
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51504

See comment #2 and #3.

Current state of drd and helgrind support for std::thread
http://stackoverflow.com/q/8393777/2132223

@t-mat
Owner

t-mat commented Mar 30, 2014

For me (@t-mat): investigate std::condition_variable, which could be the cause of the 'stall' problem.

progschj / ThreadPool - Deadlock spotted! #11

using "condition_variable_any" seems to fix the problem, so I think the real problem is inside the "condition_variable" implementation.

A lot of bugs are still unresolved for condition_variable; this one in particular seems the same (apparently they forgot to fix it):
https://svn.boost.org/trac/boost/ticket/4978
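For my own notes: the classic way a std::condition_variable stalls is a notify that fires before the wait begins, with no predicate to re-check. A minimal sketch (not lz4mt's code) of the broken and the safe form:

    #include <condition_variable>
    #include <mutex>
    #include <thread>

    std::mutex m;
    std::condition_variable cv;
    bool ready = false;

    int main() {
        std::thread t([] {
            std::lock_guard<std::mutex> lk(m);
            ready = true;
            cv.notify_one();   // may fire before main() starts waiting
        });

        std::unique_lock<std::mutex> lk(m);
        // Broken form: cv.wait(lk); stalls forever if the notify already fired.
        // Safe form: re-check a predicate under the lock, so a missed notify
        // (or a spurious wakeup) cannot cause a permanent stall.
        cv.wait(lk, [] { return ready; });

        t.join();
        return 0;
    }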

@t-mat
Owner

t-mat commented Apr 23, 2014
