Red Hat Bugzilla – Bug 308231
Signaltest hangs in RHEL5 and RHEL5.1
Last modified: 2007-12-14 16:30:24 EST
I have been running some tests using signaltest and when I tried to run the
simplest one of the tests in RHEL5, the test halted right after the second
iteration. It works fine in RHEL-RT.
The test I was running is:
for i in `seq $REP`
(time signaltest -l 10000) 2>&1 | tee /tmp/lalala
echo -e "\n--[ step $i\n" >> $ARQ
cat /tmp/lalala >> $ARQ
Signaltest is a latency test tool for RHEL-RT. The RPM package for signaltest
can be found at:
In a x86_64 RHEL5 box with system idle, after the second iteration signaltest
hangs in stext (wchan). Clark did the same test in a virtual machine (xen) and
got the very same result. I repeated the test in a x86 machine and after few
iterations (6-10) signaltest hangs.
[lclaudio@lab sandbox]$ ps -Ao pid,psr,rtprio,stat,vsz,rss,cmd,wchan
PID PSR RTPRIO STAT VSZ RSS CMD WCHAN
2764 0 - Rl+ 28472 524 signaltest -l 10000 stext
2765 0 - S+ 67408 516 tee /tmp/lalala -
The test was repeated in RHEL5.1 and the result was the same.
This is just a 'heads-up' for I am not sure of the real impact of this bug. I
was not sure whether it was a kernel bug or not. Righ now, it seems to be a
I have also tested other kernel:
- kernel-2.6.18-1.2869.fc6.x86_64: In this test signaltest hangs in stext (wchan)
- kernel-2.6.19-1.2920.fc7.x86_64: In this one signaltest hangs with "-" in wchan.
- 2.6.21-31.el5rt, with passed sucessfully the test.
- 2.6.21-34.el5rt passed.
- 2.6.21-34.el5rtvanilla passed. kernel-rt-vanilla is the base kernel used for
rt without the rt patch. It is a vanilla kernel + some fixes not rt related.
- 2.6.21-38.el5rt passed.
- 184.108.40.206-rt9 passed.
- 2.6.23-rc4-rt1 FAILED. In this case signaltest hangs in "-", but in the first
run of the test. To better illustrate that:
[lclaudio@lab ~]$ ps -Ao pid,psr,rtprio,stat,vsz,rss,cmd,wchan | grep sig
2775 0 - Rl+ 24392 532 signaltest -l 10000 -
Another interesting data is that after hitting the bug, the next attempt to run
signaltest hangs right after it starts its threads.
I forgot mentioning the results for 2.6.18 vanilla:
[lclaudio@lab ~]$ uname -a
Linux lab.casa 2.6.18 #1 SMP PREEMPT Thu Sep 20 09:39:15 BRT 2007 x86_64 x86_64
[lclaudio@lab ~]$ ps -Ao pid,psr,rtprio,stat,vsz,rss,cmd,wchan
PID PSR RTPRIO STAT VSZ RSS CMD WCHAN
2594 1 - S+ 67412 516 tee /tmp/lalala write_chan
2593 0 - S+ 24384 620 signaltest -l 10000 pipe_wait
The result in the last post may have been a result of some transient or some
mistake I made. 2.6.18 vanilla has just successfully passed 20 repetitions of
the test (that runs signaltest 20 times).
So, 2.6.18 vanilla seems to be ok.
Off hand the test looks vaguely OK, except perhaps for the use of system in a
pthread (not relevant to the test case, which does not use -b).
I'm able to reproduce the originally reported x86_64 behavior on a single-socket
dual-core opteron, running a 32-bit load w/kernel-2.6.18-52.el5.i686. Now to
start gathering data from other kernels...
A vanilla 2.6.18 also makes it through a 20-loop test here, at least
occasionally. One boot, it ran to completion the first time, then hung just like
the original x86_64 behavior. After forcibly killing the test off and waiting a
bit, another run made it through 5 loops before dying. So it appears to be
slightly less problematic with a vanilla kernel, but still problematic.
Continuing to poke...
2.6.23 final hangs identically to 2.6.23-rc4-rt1 (note that in this case, the
signaltest process is pegging one of my cpu cores), which would suggest this
isn't an RT-specific bug. Checking out non-rt 2.6.21 and 2.6.22 in a sec...
220.127.116.11 - same results as 2.6.23 -- signaltest hangs on first attempt, pegging cpu
18.104.22.168 - similar results as RHEL5 kernel -- gets through a few loops, then hangs
2.6.21-41.el5rt - works just fine as expected
Seems there are actually two slightly different bugs in the mix here. With
2.6.22 and later, signaltest hangs the first time you try to run it ('signaltest
-l 10000' by hand hangs every time), while with 2.6.21 and earlier, signaltest
generally always runs successfully at least one loop before locking up, and
almost always works when run by hand (run it, up arrow, run again, rinse and
Not quite sure where to go next with this. I think maybe there's a regression to
track down for upstream's sake, then a fix from the rt patches needed to make
things work on the RHEL5 kernel (well, and on the upstream kernel).
Some more info on the process hang:
[root@void ~]# uname -a
Linux void.casa 2.6.21-1.3194.fc7 #1 SMP Wed May 23 22:47:07 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux
[root@void ~]# ps -Amo pid,tid,stat,blocked,caught,wchan:18,cmd
PID TID STAT BLOCKED CAUGHT WCHAN CMD
3433 - - - - - su -
- 3433 S fffffffe7ffb9eff 0000000000004000 wait -
3436 - - - - - -bash
- 3436 S 0000000000010000 000000004b813efb wait -
4633 - - - - - signaltest
- 4633 Rl+ 0000000000000200 0000000180004002 - -
- 4634 Sl+ 0000000000000000 0000000180004002 - -
- 4635 Sl+ 0000000000000000 0000000180004002 - -
Relevant pieces of sysrq+t:
su S 000000326e143f91 0 3433 3291 (NOTLB)
ffff81005472ded8 0000000000000082 0000000000000000 ffff81005457de48
00000034def4d110 0000000000000006 ffff81005b9d8080 ffff810002f49100
0000000000003066 ffff81005b9d8258 00000001562d5620 ffff810000030002
bash S 00000000fffffff6 0 3436 3433 (NOTLB)
ffff810054577ed8 0000000000000086 ffff810054577e88 ffffffff80293110
ffff8100101e6408 0000000000000003 ffff810054a900c0 ffff810070b7f040
0000000000011d40 ffff810054a90298 0000000000000292 0000000000000001
signaltest S 7fffffffffffffff 0 4634 3436 (NOTLB)
ffff81000ce85e38 0000000000000086 0000000200000000 ffff8100135af698
ffff810000000000 0000000000000002 ffff81005a934860 ffff81005c66b860
000000000000387a ffff81005a934a38 000000010ce85de8 000000008028066c
signaltest S 7fffffffffffffff 0 4635 3436 (NOTLB)
ffff810007465e38 0000000000000086 0000000000000001 0000000107465df8
0000000000000001 0000000000000002 ffff81005c66b860 ffff810070b7f040
00000000000013b4 ffff81005c66ba38 0000000100000001 0000000000000000
Minor correction... Vanilla 2.6.22 does not hang on the first iteration, but our
22.214.171.124 fedora kernel does, as does 2.6.23, which suggests possibly something
in cfs being to blame for that particular problem. Still digging.
Finally finished bisection, but I seem to suck at it or something, because this
is the second time in recent history I've done this and gotten results that make
no sense... According to the bisection test, the guilty party is a commit that
only touches drivers/char/watchdog/alim1535_wdt.c...
Will dig more tomorrow. Ugh.
I confess I felt a bit confused when reading the patch... :)
Judging by my initial tests, and confirming by your test results, I believe this
issue is pre 2.6.18 and that only rhel-rt seem to be free of it.
Yeah, there's definitely a pre-2.6.18 issue, but also a post-2.6.22 issue, with
the complete lockup of signaltest on the first try, which is what I was hoping
to isolate w/the git bisection. Retracing my steps there... Still need to dig
into what in the rt patches makes this work after that though.
Okay, steps retraced. Latest bisection pass looks far saner, claiming:
095e56c7036fe97bc3ebcd80ed6e121be0847656 is first bad commit
Author: Ingo Molnar <email@example.com>
Date: Fri Aug 24 20:39:10 2007 +0200
sched: fix startup penalty calculation
fix task startup penalty miscalculation: sysctl_sched_granularity is
unsigned int and wait_runtime is long so we first have to convert it
to long before turning it negative ...
Signed-off-by: Ingo Molnar <firstname.lastname@example.org>
:040000 040000 60d5ed11a5729cd25ffed10c3a59f265b2d997dc
1dbfcf204b5f2ed97dae0f651207fc820bff6d76 M kernel
Full bisect log:
# bad: [bbf25010f1a6b761914430f5fca081ec8c7accd1] Linux 2.6.23
git-bisect bad bbf25010f1a6b761914430f5fca081ec8c7accd1
# bad: [b07d68b5ca4d55a16fab223d63d5fb36f89ff42f] Linux 2.6.23-rc4
git-bisect bad b07d68b5ca4d55a16fab223d63d5fb36f89ff42f
# good: [f695baf2df9e0413d3521661070103711545207a] Linux 2.6.23-rc1
git-bisect good f695baf2df9e0413d3521661070103711545207a
# good: [22da317629bad23816f3fa26d657fb46e7af21c7] Merge branch 'merge' of
git-bisect good 22da317629bad23816f3fa26d657fb46e7af21c7
# good: [505683675cf0ed903765f160ad633f77ab90a9d4] Merge branch 'for-linus' of
git-bisect good 505683675cf0ed903765f160ad633f77ab90a9d4
# good: [2b56fec64faae9fc5c3e61bbfb851b7985292cd5] Merge
git-bisect good 2b56fec64faae9fc5c3e61bbfb851b7985292cd5
# bad: [6ae26fa468533c86aaa6936fd366142fcf01386f] Merge branch 'agp-patches' of
git-bisect bad 6ae26fa468533c86aaa6936fd366142fcf01386f
# good: [519ef1af47bb7379c4c06aeba2d78073df92c151] Pull thermal into release branch
git-bisect good 519ef1af47bb7379c4c06aeba2d78073df92c151
# bad: [0b887d037bf4b76eec1c960e5feecd6a5a806971] Merge branch 'release' of
git-bisect bad 0b887d037bf4b76eec1c960e5feecd6a5a806971
# good: [25c87f7f2a4fc3e50a7912b1c78405d454d1c4d9] Pull events into release branch
git-bisect good 25c87f7f2a4fc3e50a7912b1c78405d454d1c4d9
# bad: [095e56c7036fe97bc3ebcd80ed6e121be0847656] sched: fix startup penalty
git-bisect bad 095e56c7036fe97bc3ebcd80ed6e121be0847656
# good: [98fbc798533339be802c6dcd48c2293c712e87db] sched: optimize
task_tick_rt() a bit
git-bisect good 98fbc798533339be802c6dcd48c2293c712e87db
# good: [a6f2994042cc2db9e507dc702ed0b5e2cc5890fe] sched: simplify bonus
git-bisect good a6f2994042cc2db9e507dc702ed0b5e2cc5890fe
# good: [ea0aa3b23a193d1fc5c982286edecd071af67d94] sched: simplify bonus
git-bisect good ea0aa3b23a193d1fc5c982286edecd071af67d94
Ingo, looks like cfs doesn't much like this signaltest app... I don't yet see
why the referenced changeset would cause a hang, and I'm planning to throw
together another test kernel (126.96.36.199) w/that changeset backed out to verify
that it does actually make a difference (I don't always trust my bisection
skills), but any insight you might have would be greatly appreciated.
Okay, so not a 100% straight-forward back-out, as there's been a decent bit of
change in that code...
Created attachment 230241 [details]
small patch for debug purposes
Jarod, with this patch I was able to identify a condition that was keeping the
main thread from starting its work. For some strange reason, maybe scheduling,
one shared variable was not being incremented as it should. Please, apply this
patch to signaltest.c and run it.
Created attachment 230251 [details]
Jarod, using this workaround I was able to run signaltest in 2.6.23-rc4-rt1.
Without this patch it hangs right in the first try.
I got to this point after Jeff told me that the main thread was in a busy loop.
Anyway I don't few comfortable to point this as an application issue or a
Aha! That workaround patch just made everything click. Or so I think. I'm now
inclined to chalk this up to a bug in signaltest. Here's why:
In the current code, threadstarted isn't initialized until *after*
pthread_create is called, which calls the signalthread() function, which among
other things, tries to increment threadstarted. It looks like sometimes we may
fall through and set threadstarted = 1 before its actually incremented in
signalthread(), as pthread_create is non-blocking, so things actually go as
expected in the subsequent while loop. However, if signalthread() finishes
quickly enough, we have already attempted to increment threadstarted, which we
then stomp back down to 1. When that happens, the subsequent while loop gets
stuck, because threadstarted != 2 is always true, so allstarted = 0 keeps
getting set, so we stay in the loop forever.
The work-around patch isn't actually a work-around at all, its the proper fix.
threadstarted shouldn't be getting incremented before its initialized. At least,
that's how I'm seeing it right now...
Forgot to mention that the earlier debug patch fully supports my theory.
Nice! In my tests with the debug patch I have only seen 0s and 1s... but I
second you on the app bug theory. As it always worked fine in rhel-rt, I took
that the application was OK.
I will apply that patch to signaltest. Thanks for all your work and effort!
Doh, yes. I already ordered on of the required brown paper bags. I apply this in
to the git tree.
Luis, can you send it by mail please with a comment and your SOB ?
On those grounds, closing NOTABUG. I'd reassign it to a signaltest component if
we had one, but I trust we'll get it taken care of just the same. :)