Possible memorypool implementation problem #21

Open
ghost opened this issue Sep 26, 2013 · 10 comments

@ghost

ghost commented Sep 26, 2013

I am having problems running lz4mt: specifically, when decompressing, it stalls at some point.

Running it under valgrind with 3 of its tools (memcheck, helgrind and DRD), all of them report errors when decompressing:

  • memcheck: complains about invalid reads and writes and ends with "Syscall param futex(futex) points to unaddressable byte(s)". All of these errors say that some address is inside a block of size 240 free'd.

  • helgrind: complains about possible data races when a given thread reads or writes.

  • DRD: same as helgrind but with a different description; it complains about a conflicting load/store by a given thread.

These stalls definitely happen with lz4mtDecompress in multithreaded mode; I am not yet sure about single-threaded mode.
A common line in the output of all 3 tools: ==procID== by 0xADDRESS lz4mtDecompress::{lambda(int, Lz4Mt::MemPool::Buffer*, bool, unsigned int)#1}::operator()(int, Lz4Mt::MemPool::Buffer*, bool, unsigned int)

I've tested with both gcc 4.7.2 and gcc 4.8.1, locally on my laptop running archlinux with recent packages (linux 3.11.1), and remotely on a cluster running centos with older linux 2.6.18.
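To make the memcheck report more concrete, here is a minimal sketch of the kind of race it seems to describe. This is not lz4mt's actual code, only my guess at the general shape: a worker thread touching a heap buffer after its owner has freed it.

    // Deliberately broken sketch (NOT lz4mt's code): use-after-free across threads.
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        auto* buf = new std::vector<char>(240);   // stand-in for a pooled buffer
        std::thread worker([buf] {
            (*buf)[0] = 1;   // may run after the delete below: memcheck reports
        });                  // invalid reads/writes "in a block ... free'd"
        delete buf;          // freed while the worker may still be using it
        worker.join();
        std::printf("done\n");
        return 0;
    }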

@t-mat
Owner

t-mat commented Sep 26, 2013

Hi samuel, thanks for the report!
I've checked your problem.

Summary

  • I've partly reproduced your problem.

Questions

  • Could you describe the "stall" in a bit more detail?
    • e.g. segfault, silence with 0% CPU usage, eating all CPU cycles, etc.
  • What kind of data did you use?
  • Could you reproduce your problem with enwik8?
  • Could you show me your full valgrind log?

Result

  • I've reproduced
    • Possible data race error with valgrind --tool=helgrind
    • Conflict with valgrind --tool=drd
  • I could not reproduce
    • Stall / stop when decompressing
    • Invalid reads/writes error with valgrind --tool=memcheck

Here is the full result.

Todo

  • Investigate valgrind's errors.
  • Reproduce samuel's problem.

@ghost ghost assigned t-mat Sep 26, 2013
@ghost
Author

ghost commented Sep 26, 2013

Thanks for your fast reply, Mr. Takayuki :)

Sorry for my incomplete report; I will try to get you all the information you asked for. For now I only have time for some answers:

  • Stall: silence with 0% CPU usage
  • I am using binary and text data
  • I can, under a certain condition I forgot to mention: the output goes to null.
    I think this will be enough to reproduce it:
    for i in {1..100}; do ./lz4mt481_12Sep_omp -dy --lz4mt-thread=0 enwik8.lz4 null; done
    Wait a little and it will eventually happen.

@t-mat
Owner

t-mat commented Sep 27, 2013

output is to null

It seems that null is the key to this problem.
I could not reproduce the "stall", but I always get a std::future_error from the following command:

$ ./lz4mt -d -y enwik8.linux.lz4.c0 null
terminate called after throwing an instance of 'std::future_error'
  what():  No associated state
Aborted (core dumped)
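For reference, libstdc++ throws this error when get() or wait() is called on a std::future whose shared state is gone, e.g. after it has been moved from. A minimal sketch (not lz4mt's code; strictly speaking such a get() is undefined behavior, but libstdc++ throws):

    #include <cstdio>
    #include <future>

    int main() {
        std::future<int> f = std::async(std::launch::async, [] { return 42; });
        std::future<int> g = std::move(f);   // f's shared state moves into g
        try {
            f.get();                         // libstdc++ throws std::future_error
        } catch (const std::future_error& e) {
            std::printf("%s\n", e.what());   // "No associated state"
        }
        return g.get() == 42 ? 0 : 1;
    }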

Sorry for my incomplete report,

No problem, it's a good report. This issue list is not for a QA/debug team, so a small report is a good starting point 😄

Todo

  • Investigate the null output problem.
  • Investigate valgrind's errors.
  • Reproduce @samalm321's "stall" problem.

@ghost
Author

ghost commented Sep 27, 2013

Here are 4 logs from valgrind, for enwik8 and a binary dataset I have named msg_bt.bin, with both the helgrind and drd tools:

Build environment:

$ uname -r
2.6.18-128.1.14.el5

 $ gcc -v       
Using built-in specs.
COLLECT_GCC=/home/cpd18777/gentoo_prefix/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.2/gcc
COLLECT_LTO_WRAPPER=/home/cpd18777/gentoo_prefix/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.2/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /home/cpd18777/gentoo_prefix/var/tmp/portage/sys-devel/gcc-4.7.2-r1/work/gcc-4.7.2/configure --prefix=/home/cpd18777/gentoo_prefix/usr --bindir=/home/cpd18777/gentoo_prefix/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.2 --includedir=/home/cpd18777/gentoo_prefix/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/include --datadir=/home/cpd18777/gentoo_prefix/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.2 --mandir=/home/cpd18777/gentoo_prefix/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.2/man --infodir=/home/cpd18777/gentoo_prefix/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.2/info --with-gxx-include-dir=/home/cpd18777/gentoo_prefix/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/include/g++-v4 --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec --disable-fixed-point --without-ppl --without-cloog --enable-lto --enable-nls --without-included-gettext --with-system-zlib --enable-obsolete --disable-werror --enable-secureplt --disable-multilib --with-multilib-list=m64 --enable-libmudflap --disable-libssp --enable-libgomp --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.7.2/python --enable-checking=release --disable-libgcj --enable-libstdcxx-time --enable-languages=c,c++,fortran --enable-shared --enable-threads=posix --with-local-prefix=/home/cpd18777/gentoo_prefix/usr --enable-__cxa_atexit --enable-clocale=gnu --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.7.2-r1 p1.5, pie-0.5.5'
Thread model: posix
gcc version 4.7.2 (Gentoo 4.7.2-r1 p1.5, pie-0.5.5) 

Run environment:

$ uname -r
2.6.32-279.22.1.el6.x86_64

$ valgrind --version
valgrind-3.8.1 

I am using the most recent version of lz4mt, compiled with -O0 -g so valgrind can output more info.

The stall problem has me puzzled; I can't always reproduce it. I think it only happens when lz4mt is executed many times in quick succession (as in that for-loop example) and each execution takes a very small amount of time. enwik8 almost always produces that std::future_error, but with the other dataset, msg_bt.bin, the error almost never happens and the execution stalls with 0% CPU usage after some iterations.

@t-mat
Owner

t-mat commented Sep 28, 2013

Thanks for the logs. I'm checking your report.

In 2a8ed67, I've resolved the std::future_error caused by null output.

In fb61bf3, I've resolved valgrind --tool=memcheck's "possibly lost" warning:

  • This is a false positive: an instance of Opt still exists when exit() is called.
  • To prevent this warning, I've split the real work out of main() into a separate function, and main() now just exit()s; see the sketch below.
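The fix has roughly the following shape (a sketch with illustrative names, not the actual lz4mt code):

    #include <cstdlib>
    #include <string>
    #include <vector>

    // Illustrative stand-in for lz4mt's option object.
    struct Opt {
        std::vector<std::string> args;
        Opt(int argc, char* argv[]) : args(argv, argv + argc) {}
    };

    int realMain(int argc, char* argv[]) {
        Opt opt(argc, argv);   // destroyed when realMain() returns
        // ... the real work goes here ...
        return EXIT_SUCCESS;
    }

    int main(int argc, char* argv[]) {
        // opt's destructor has already run by the time exit() is called,
        // so memcheck no longer flags it as "possibly lost".
        std::exit(realMain(argc, argv));
    }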

Todo

  • Investigate the null output problem.
  • Investigate valgrind --tool=helgrind.
  • Investigate valgrind --tool=drd.
  • Reproduce valgrind --tool=memcheck's warning:
    • complains about invalid reads and writes and ends with a "Syscall param futex(futex) points to unaddressable byte(s)". All of these errors say that some address is inside a block of size 240 free'd.
  • Reproduce "stall" problem.

@t-mat
Owner

t-mat commented Sep 28, 2013

MEMO TO ME

The GNU C++ Library
3. Using - Debugging Support - Data Race Hunting
http://gcc.gnu.org/onlinedocs/libstdc++/manual/debug.html#debug.races

c++ - std::thread problems - Stack Overflow
http://stackoverflow.com/questions/10618142/stdthread-problems
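If I read the first link correctly, libstdc++ exposes synchronization hooks that can be routed to DRD's annotation macros, so that the reference counting inside std::shared_ptr stops showing up as false positives. An untested sketch (the build line and file name are my assumptions):

    // Define the hooks BEFORE including any standard header.
    // Build (assumed): g++ -std=c++11 -O0 -g -pthread race_hunt.cpp
    // Run:             valgrind --tool=drd ./a.out
    #include <valgrind/drd.h>
    #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) ANNOTATE_HAPPENS_BEFORE(A)
    #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A)  ANNOTATE_HAPPENS_AFTER(A)

    #include <memory>
    #include <thread>

    int main() {
        auto p = std::make_shared<int>(42);
        std::thread t([p] { });   // copying p touches the shared reference count
        t.join();
        return *p == 42 ? 0 : 1;
    }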

@t-mat
Owner

t-mat commented Dec 7, 2013

MEMO

Bug 327881 - False Positive Warning on std::atomic_bool ( helgrind @ valgrind 3.9.0 )
https://bugs.kde.org/show_bug.cgi?id=327881

valgrind-variant
https://code.google.com/p/valgrind-variant/source/browse/trunk/valgrind/drd/tests/std_thread.cpp?spec=svn129&r=129

// Test whether no race conditions are reported on std::thread. Note: since
// the implementation of std::thread uses the shared pointer implementation,
// that implementation has to be annotated in order to avoid false positives.

I still have not investigated this issue, but I believe there are both real problems and false positives.
It seems that Valgrind (3.8.1 and 3.9.0) has some problems with std::atomic_* and std::shared_ptr.
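To make that concrete, here is a minimal sketch of the pattern (not lz4mt's code). As far as I can tell it is correct C++11, yet helgrind/DRD of those versions may still report a race on the plain variable, because they do not model the atomic's acquire/release ordering:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int main() {
        int payload = 0;
        std::atomic_bool done(false);

        std::thread producer([&] {
            payload = 42;   // plain write, published by the release store below
            done.store(true, std::memory_order_release);
        });

        while (!done.load(std::memory_order_acquire)) { /* spin */ }
        std::printf("payload = %d\n", payload);   // safe: acquire saw the release
        producer.join();
        return 0;
    }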

@t-mat
Owner

t-mat commented Mar 26, 2014

MEMO

GCC Bugzilla - Bug 51504
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51504

See comment #2 and #3.

Current state of drd and helgrind support for std::thread
http://stackoverflow.com/q/8393777/2132223

@t-mat
Owner

t-mat commented Mar 30, 2014

For me (@t-mat): investigate std::condition_variable, which could be the cause of the 'stall' problem.

progschj / ThreadPool - Deadlock spotted! #11

using "condition_variable_any" seems to fix the problem, so I think the real problem is inside the "condition_variable" implementation.

A lot of bugs are still unresolved for condition_variable; this one in particular seems the same (apparently they forgot to fix it):
https://svn.boost.org/trac/boost/ticket/4978
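For my own notes: the classic way a std::condition_variable stalls is a notify that fires before the wait begins, with no predicate to re-check. A minimal sketch (not lz4mt's code) of the broken and the safe form:

    #include <condition_variable>
    #include <mutex>
    #include <thread>

    std::mutex m;
    std::condition_variable cv;
    bool ready = false;

    int main() {
        std::thread t([] {
            std::lock_guard<std::mutex> lk(m);
            ready = true;
            cv.notify_one();   // may fire before main() starts waiting
        });

        std::unique_lock<std::mutex> lk(m);
        // Broken form: cv.wait(lk); stalls forever if the notify already fired.
        // Safe form: re-check a predicate under the lock, so a missed notify
        // (or a spurious wakeup) cannot cause a permanent stall.
        cv.wait(lk, [] { return ready; });

        t.join();
        return 0;
    }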

@t-mat
Owner

t-mat commented Apr 23, 2014
