ThreadSanitizer
Time for a new tool - tsan
http://code.google.com/p/data-race-test/wiki/ThreadSanitizer
See this paste, http://pastebin.com/gKRDTkh3, for how I successfully analyzed a hairy race in parrot's new threads branch. This problem was a big blocker for parrot developers, as described in http://lists.parrot.org/pipermail/parrot-dev/2012-August/007078.html.
To catch common memory errors you simply have to use the new clang -faddress-sanitizer, following the instructions in perlhacktips.pod, or use valgrind.
But to detect race conditions tsan is another gold-mine from Google's Moscow lab. This is the old valgrind-based version:
cd ~/bin;
wget http://build.chromium.org/p/client.tsan/binaries/tsan-r4356-amd64-linux-self-contained.sh;
chmod +x tsan-r4356-amd64-linux-self-contained.sh;
cd -
*runloop_id_counter race with threads, t/pmc/task.t * https://github.com/parrot/parrot/issues/808
tsan-r4356-amd64-linux-self-contained.sh ./parrot t/pmc/task.t
==2162== ThreadSanitizer, a data race detector
==2162== Copyright (C) 2008-2010, and GNU GPL'd, by Google Inc.
==2162== Using Valgrind-3.8.0.SVN and LibVEX; rerun with -h for copyright info
==2162== Command: ./parrot t/pmc/task.t
==2162==
==2162== ThreadSanitizerValgrind r4356: hybrid=no
==2162== INFO: Allocating 256Mb (32 * 8M) for Segments.
==2162== INFO: Will allocate up to 640Mb for 'previous' stack traces.
Valgrind: ignoring NaCl's mmap(84G)
Valgrind: ignoring NaCl's mmap(84G)
Valgrind: ignoring NaCl's mmap(84G)
1..6
ok 1 - initialized
ok 2 task1 ran
ok 3 task2 ran
ok 4 sub1 ran
==2162== INFO: T4 has been created by T0. Use --announce-threads to see the creation stack.
==2162== INFO: T5 has been created by T0. Use --announce-threads to see the creation stack.
==2162== WARNING: Possible data race during write of size 4 at 0x6373A60: {{{
==2162== T5 (L{}):
==2162== #0 reset_runloop_id_counter /usr/src/parrot/threads/src/call/ops.c:155
==2162== #1 Parrot_thread_outer_runloop /usr/src/parrot/threads/src/thread.c:317
==2162== #2 __asan::AsanThread::ThreadStart /home/rurban/Perl/parrot/threads/parrot
==2162== Concurrent write(s) happened at (OR AFTER) these points:
==2162== T4 (L{}):
==2162== #0 new_runloop_jump_point /usr/src/parrot/threads/src/call/ops.c:191
==2162== #1 runops /usr/src/parrot/threads/src/call/ops.c:88
==2162== #2 Parrot_pcc_invoke_from_sig_object /usr/src/parrot/threads/src/call/pcc.c:338
==2162== #3 Parrot_ext_call /usr/src/parrot/threads/src/extend.c:158
==2162== #4 Parrot_Task_invoke /usr/src/parrot/threads/src/pmc/task.c:168
==2162== #5 Parrot_pcc_invoke_from_sig_object /usr/src/parrot/threads/src/call/pcc.c:330
==2162== #6 Parrot_ext_call /usr/src/parrot/threads/src/extend.c:158
==2162== #7 Parrot_cx_next_task /usr/src/parrot/threads/src/scheduler.c:222
==2162== #8 Parrot_thread_outer_runloop /usr/src/parrot/threads/src/thread.c:319
==2162== #9 __asan::AsanThread::ThreadStart /home/rurban/Perl/parrot/threads/parrot
==2162== Address 0x6373A60 is 0 bytes inside data symbol "runloop_id_counter"
==2162== Race verifier data: 0x539B6AC,0x539A8AE
==2162== }}}
This announces a possible data race on runloop_id_counter ("Possible data race during write of size 4 at 0x6373A60"): the threads T4 and T5 both write to it, at the listed backtraces.
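Reduced to a minimal sketch, the pattern tsan complains about looks roughly like the following. All names with a demo_ prefix are made up for illustration and are not parrot's code. I also assume that the write in new_runloop_jump_point is an increment, which the report itself does not say.

```c
/* Sketch only: hypothetical names, not parrot's actual code. */
#include <pthread.h>
#include <stddef.h>

/* One global counter shared by all threads, with no lock around it. */
static int demo_runloop_id_counter = 0;

/* Plays the role of T5 in the report: resets the shared counter. */
static void *demo_reset_counter(void *arg) {
    (void)arg;
    demo_runloop_id_counter = 0;      /* unsynchronized write */
    return NULL;
}

/* Plays the role of T4: bumps the shared counter (assumed increment). */
static void *demo_new_runloop(void *arg) {
    (void)arg;
    demo_runloop_id_counter++;        /* unsynchronized read-modify-write */
    return NULL;
}

/* Starts both threads and joins them. Returns 0 on success, -1 on error.
 * The final counter value is deliberately not returned: it depends on
 * timing, which is exactly the problem. */
int demo_run_racy_threads(void) {
    pthread_t t4, t5;
    if (pthread_create(&t4, NULL, demo_new_runloop, NULL) != 0)
        return -1;
    if (pthread_create(&t5, NULL, demo_reset_counter, NULL) != 0) {
        pthread_join(t4, NULL);
        return -1;
    }
    pthread_join(t4, NULL);
    pthread_join(t5, NULL);
    return 0;
}
```

Calling demo_run_racy_threads() from main and running the binary under tsan should give a report of the same shape as the one above, though whether a given run triggers it depends on timing.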
The logical error was confirmed on IRC by the author nine, Stefan Seifert (another fellow Austrian):
rurban: I think for practical purposes only the new #810 is blocking threads now. Should be easy to fix. Is niner somewhere around?
rurban: Because there is a wrong assumption in his code and paper. threads.c:313 "there can be no active runloops at this point" - there is
rurban: I'm building now with -DRUNLOOP_TRACE
benabik: rurban++
nine: rurban: I am around
rurban: Hi, Do you understand the #810 runloopidcounter race with threads, t/pmc/task.t ?
nine: rurban: you are absolutely right. That comment is from the time when there was only green threads. The runloop id counter should be moved into the interp. There's absolutely no reason to share it between threads.
rurban: :) sigh
rurban: So this is the remaining blocker
rurban: nine: Can you fix this so we can merge threads into master?
nine: Well actually the nci.t hangs got me worried. I guess the timer stuff is racy on some platforms. Tried to reproduce it today but there's just no way to get my linux boxes to show any fault.
nine: rurban: on it
nine: I'll just get me a cup of coffee and start hacking
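The fix nine describes, moving the counter into the interpreter, could look roughly like this. It is a sketch under my own assumptions: demo_interp and its single field are hypothetical and are not parrot's actual interpreter struct.

```c
/* Sketch only: demo_interp is hypothetical, not parrot's real interp layout. */
#include <pthread.h>
#include <stddef.h>

typedef struct demo_interp {
    int runloop_id_counter;   /* owned by exactly one thread, so no lock needed */
} demo_interp;

/* Hands out the next runloop id for this interpreter only. */
static int demo_next_runloop_id(demo_interp *interp) {
    return ++interp->runloop_id_counter;
}

/* Resets the counter of this interpreter only. */
static void demo_reset_runloop_ids(demo_interp *interp) {
    interp->runloop_id_counter = 0;
}

/* Thread body: reset, then allocate 1000 ids on the thread's own interp. */
static void *demo_thread_main(void *arg) {
    demo_interp *interp = arg;
    demo_reset_runloop_ids(interp);
    for (int i = 0; i < 1000; i++)
        demo_next_runloop_id(interp);
    return NULL;
}

/* Runs two threads, each with its own interpreter.
 * Returns the sum of both counters, or -1 if a thread could not be started. */
int demo_run_per_interp(void) {
    demo_interp a = {0}, b = {0};
    pthread_t ta, tb;
    if (pthread_create(&ta, NULL, demo_thread_main, &a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, demo_thread_main, &b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return a.runloop_id_counter + b.runloop_id_counter;
}
```

Since no memory is shared between the two threads, there is nothing left for a race detector to report here, and each counter ends at 1000.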
tsan cannot analyze the new sleep deadlock in t/pmc/nci_37.pasm in the threads branch, because parrot is looping endlessly in a while loop.
Here you really have to gdb into it and bt the two threads or use darwin's Activity Monitor to list the two threads waiting for each other.
See the bug "t/pmc/nci.t tests are fragile and broken by design", https://github.com/parrot/parrot/issues/808.
But for the common case, detecting possible races within certain timing ranges, tsan is very good. Without tsan it is also hard to reproduce the hang on normal HW at all, because it only shows up on super slow HW, like a Mac Powerbook 4/ppc or a mips32, essentially a simulated IRIX Indigo 2 with 200MHz.
http://code.google.com/p/thread-sanitizer/wiki/PopularDataRaces contains descriptions for the most popular data races:
- Simple race
- Thread-hostile reference counting
- Race on a complex object
- Notification
- Publishing objects without synchronization
- Initializing objects without synchronization
- Reader Lock during a write
- Race on bit field
- Double-checked locking
- Race during destruction
- Data race on vptr
- Race on free
- Race during exit
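As an illustration of one entry in this list, thread-hostile reference counting, here is a minimal sketch of the usual remedy: make the count atomic. It assumes C11 <stdatomic.h> and pthreads. The demo_ names are hypothetical and not taken from the wiki page.

```c
/* Sketch only: hypothetical object type with an atomic reference count. */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct demo_obj {
    atomic_int refcount;
} demo_obj;

/* Takes one more reference. */
static void demo_ref(demo_obj *o) {
    atomic_fetch_add(&o->refcount, 1);
}

/* Drops one reference. Returns 1 if the caller dropped the last one. */
static int demo_unref(demo_obj *o) {
    return atomic_fetch_sub(&o->refcount, 1) == 1;
}

/* Each worker takes and drops a reference 10000 times. */
static void *demo_worker(void *arg) {
    demo_obj *o = arg;
    for (int i = 0; i < 10000; i++) {
        demo_ref(o);
        (void)demo_unref(o);   /* never the last ref: the owner still holds one */
    }
    return NULL;
}

/* Starts up to 4 workers on one object that the caller owns with count 1,
 * joins them, and returns the final reference count. */
int demo_refcount_after_workers(void) {
    demo_obj o;
    pthread_t t[4];
    int started = 0;

    atomic_init(&o.refcount, 1);
    for (; started < 4; started++)
        if (pthread_create(&t[started], NULL, demo_worker, &o) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&o.refcount);
}
```

With a plain int instead of atomic_int the increments and decrements could interleave and lose updates, which is the thread-hostile case. With the atomic version the count is back at the owner's single reference once all workers are joined.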
Note that currently TSan v2 is being reimplemented in LLVM, similar to ASan. This is compiler-based and is already in the LLVM trunk. See http://clang.llvm.org/docs/ThreadSanitizer.html and http://code.google.com/p/thread-sanitizer/.
It may not be mature enough yet and unfortunately won't help you if you have races in non-instrumented code (inline assembly, JIT code, system libraries), but it is way faster than the Valgrind-based TSan and might be better for large apps. For simple small testcases the Valgrind-based TSan is good enough, but if you need it on a platform not supported by valgrind or PIN, or with bigger apps, go with LLVM trunk.
Note that there's a Windows version of TSan v1, based on PIN.