Bug 457909 - sched_football fails with MRG -69 kernel on LS21 machine in fairly long run
Status: CLOSED NOTABUG
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.0
Hardware: x86_64 All
Priority: medium Severity: medium
Assigned To: Red Hat Real Time Maintenance
Reported: 2008-08-05 07:10 EDT by IBM Bug Proxy
Modified: 2009-10-06 17:10 EDT
Doc Type: Bug Fix
Last Closed: 2009-10-06 17:10:22 EDT

Attachments
failure log (90.40 KB, text/plain)
2008-08-05 07:10 EDT, IBM Bug Proxy
Fix synchronization in the test (1.46 KB, text/plain)
2008-08-05 07:10 EDT, IBM Bug Proxy
screenshot: start of problem (148.52 KB, image/png)
2009-06-22 15:03 EDT, IBM Bug Proxy
screenshot: zoomed out view of failure (147.15 KB, image/png)
2009-06-22 15:03 EDT, IBM Bug Proxy
Startup failure? (166.29 KB, image/png)
2009-06-23 11:22 EDT, IBM Bug Proxy
PATCH: add condvar to put game start control in the hands of the ref (2.10 KB, text/plain)
2009-06-23 17:31 EDT, IBM Bug Proxy
PATCH: atomic startup mechanism (4.52 KB, text/plain)
2009-06-23 21:31 EDT, IBM Bug Proxy

Description IBM Bug Proxy 2008-08-05 07:10:25 EDT
=Comment: #0=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:38 EDT
Problem description:

On running a fairly large number of iterations of sched_football on the MRG -69
kernel, a testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on the LS21
machine.

On the HS21 box I haven't seen this failure yet (out of 15k iterations).

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

How long does it (did it) take to reproduce it?

I ran 6000+ iterations of sched_football and the failure occurred once.

Final ball position : 495


Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13
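
For reference, the priorities in the log above are real-time thread priorities. A minimal sketch of how a thread attribute for one of these players might be prepared follows. It assumes SCHED_FIFO (the log does not name the policy), and the helper name fifo_attr_init is hypothetical, not taken from the testcase. Creating a thread with such an attribute normally needs root or CAP_SYS_NICE.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: fill `attr` so that a thread created with it runs
 * under SCHED_FIFO at priority `prio` (e.g. 15 for offense, 30 for defense).
 * Returns 0 on success or the first non-zero pthread error code. */
int fifo_attr_init(pthread_attr_t *attr, int prio)
{
    struct sched_param param = { .sched_priority = prio };
    int rc;

    assert(attr != NULL);
    if ((rc = pthread_attr_init(attr)) != 0)
        return rc;
    /* Without EXPLICIT_SCHED the new thread inherits the creator's policy
     * and the policy/priority set below are ignored. */
    if ((rc = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED)) != 0)
        return rc;
    /* Set the policy before the priority so the priority is checked
     * against the SCHED_FIFO range. */
    if ((rc = pthread_attr_setschedpolicy(attr, SCHED_FIFO)) != 0)
        return rc;
    return pthread_attr_setschedparam(attr, &param);
}
```

With attributes like these, a defense thread at priority 30 that never blocks should keep an offense thread at priority 15 off the same CPU, which is why any non-zero final ball position counts as a failure.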

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barriers better than
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or several openings with varying increment loops?  If the increments
happened right at the beginning of the game, then perhaps I missed something; if
they happened much later, then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barriers better than
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or several openings with varying increment loops?  If the increments
> happened right at the beginning of the game, then perhaps I missed something; if
> they happened much later, then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. With the above patch, I got 3 failures, with
ball positions 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at the system state when the offense threads were able to increment the ball
position.
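
For readers without the attached patch, a minimal sketch of the barrier-style start discussed in comments #7 and #8 follows. It is not the attached patch; the names kickoff, player and barrier_kickoff are hypothetical, and the threads run at default priority so the sketch works without real-time privileges.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct kickoff {
    pthread_barrier_t gate;   /* every player plus the referee meets here */
    pthread_mutex_t lock;     /* protects `released` */
    int released;             /* players that got past the gate */
};

static void *player(void *arg)
{
    struct kickoff *k = arg;
    int rc = pthread_barrier_wait(&k->gate);

    /* Exactly one waiter gets SERIAL_THREAD, the rest get 0. */
    assert(rc == 0 || rc == PTHREAD_BARRIER_SERIAL_THREAD);
    (void)rc;
    pthread_mutex_lock(&k->lock);
    k->released++;
    pthread_mutex_unlock(&k->lock);
    return NULL;
}

/* Start `nplayers` threads, release them together with the caller acting
 * as referee, and return how many got past the barrier (-1 on setup error). */
int barrier_kickoff(int nplayers)
{
    struct kickoff k = { .released = 0 };
    pthread_t *tids;
    int i;

    if (nplayers < 1)
        return -1;
    tids = malloc(sizeof(*tids) * (size_t)nplayers);
    if (tids == NULL)
        return -1;
    if (pthread_mutex_init(&k.lock, NULL) != 0 ||
        pthread_barrier_init(&k.gate, NULL, (unsigned)nplayers + 1) != 0) {
        free(tids);
        return -1;
    }
    for (i = 0; i < nplayers; i++)
        if (pthread_create(&tids[i], NULL, player, &k) != 0)
            abort();          /* sketch only: a short barrier would hang */

    pthread_barrier_wait(&k.gate);      /* referee: the game starts here */

    for (i = 0; i < nplayers; i++)
        pthread_join(tids[i], NULL);
    pthread_barrier_destroy(&k.gate);
    pthread_mutex_destroy(&k.lock);
    free(tids);
    return k.released;
}
```

A barrier only guarantees that nobody starts before everyone is ready. It says nothing about what the scheduler does later in the game, which matches the observation above that failures persisted with the barriers in place.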

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation, and
for some reason only about 250 iterations completed, after which the job timed
out :-( And of these 250 iterations, I got no failures. Got to start again.
Comment 1 IBM Bug Proxy 2008-08-05 07:10:29 EDT
Created attachment 313440
failure log
Comment 2 IBM Bug Proxy 2008-08-05 07:10:32 EDT
Created attachment 313441
Fix synchronization in the test
Comment 3 IBM Bug Proxy 2008-08-05 07:40:51 EDT
Sripathi, this bug could probably be a FOCUS bug, as it is currently being worked
on...
Comment 4 IBM Bug Proxy 2008-08-08 02:40:47 EDT
OK, I left the instrumented test running for 1.5 days; however, I find that only
about 300 iterations completed! Got to look at other ways to narrow down the
issue here...
Comment 5 IBM Bug Proxy 2008-08-13 06:20:36 EDT
This is what I am currently trying out: since the failure seems to happen
only about 1 or 2 times in 12k+ runs, it's difficult to reproduce this issue for
debugging purposes. With the modified testcase (including pthread_barriers), I
added hooks to trigger kdump the moment the ball position is detected to go up,
i.e., when the offense thread runs, hoping that the dump might provide some
information. Have infinite runs ongoing at the moment.
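
A rough sketch of what such a hook could look like is below. It is an assumption, not the instrumentation actually used: check_ball and record_move are hypothetical names, and record_move only counts calls. A real kdump trigger might instead write "c" to /proc/sysrq-trigger, which crashes the machine and only produces a dump if sysrq and kdump are set up.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*ball_hook_fn)(int position);

int moves_seen;       /* how many times the stand-in hook fired */
int last_position;    /* ball position reported on the last firing */

/* Stand-in hook for illustration. A hook that wrote "c" to
 * /proc/sysrq-trigger here would force a crash dump instead. */
void record_move(int position)
{
    moves_seen++;
    last_position = position;
}

/* Referee-side check: the ball must stay at zero while the defense is
 * running. Calls `hook` and returns 1 if the ball has moved, else 0. */
int check_ball(const volatile int *the_ball, ball_hook_fn hook)
{
    int pos;

    assert(the_ball != NULL && hook != NULL);
    pos = *the_ball;
    if (pos == 0)
        return 0;
    hook(pos);
    return 1;
}
```

Calling check_ball from the referee loop keeps the check on the highest-priority thread, so the hook fires close to the moment an offense thread gets to run.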
Comment 6 IBM Bug Proxy 2008-08-13 09:10:47 EDT
So the machine seems to be hung: it is not pingable, and kdump did not trigger
either. I had verified that kdump was set up fine on the system before starting the
sched_football runs. Unable to explain this...

This being the case, got to look at ways in which I could capture a snapshot of
all runqueues at the time the ball position increases. Does anyone have any other
thoughts on this?
Comment 7 IBM Bug Proxy 2008-08-14 07:00:38 EDT
Not sure if this will yield anything, but now doing test runs and logging ps
-eLo output whenever the_ball count is incremented.
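
A minimal sketch of that kind of snapshot, assuming the test simply shells out to ps, is shown here. The helper name snapshot_cmd is hypothetical, and the column list in the usage note is a guess, since the comment only says "ps -eLo".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

/* Run `cmd` through the shell and copy its standard output to `out`.
 * Returns the number of bytes copied, or -1 on error. */
long snapshot_cmd(const char *cmd, FILE *out)
{
    char buf[4096];
    size_t n;
    long total = 0;
    FILE *p;

    assert(cmd != NULL && out != NULL);
    p = popen(cmd, "r");
    if (p == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof(buf), p)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            pclose(p);
            return -1;
        }
        total += (long)n;
    }
    if (pclose(p) == -1)
        return -1;
    fflush(out);
    return total;
}
```

A call such as snapshot_cmd("ps -eLo pid,tid,class,rtprio,psr,comm", logfile) would record each thread's scheduling class, real-time priority and current CPU. Forking a shell takes far longer than the window being chased, so the snapshot may well show the state after the offense thread has already been preempted again.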
Comment 8 IBM Bug Proxy 2008-08-27 14:11:14 EDT
(In reply to comment #25)
> Not sure if this will yield anything, but now running test runs and logging ps
> -eLo output whenever the_ball count incremented.

OK, so the above didn't help. Got to think of something to try next. /me
scratches her head...
Comment 9 IBM Bug Proxy 2009-06-03 08:10:46 EDT
Created attachment 313440
failure log

------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.
Comment 10 IBM Bug Proxy 2009-06-03 11:31:26 EDT
Created attachment 313440 [details]
failure log

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test
=Comment: #0=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:38 EDT
Problem description:

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barriers better than
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several openings with varying increment loops?  If the increments
happened right at the beginning of the game, then perhaps I missed something; if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barriers better than
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several openings with varying increment loops?  If the increments
> happened right at the beginning of the game, then perhaps I missed something; if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. With the above patch, I got failures 3 times, with
ball positions 13, 1, 1. So, clearly, the barriers are not helping. I now need to
look at the system state when the offense threads were able to increment the ball
position.
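
For readers unfamiliar with the barrier idea discussed above, here is a minimal sketch of a barrier-based start, in which no thread proceeds until every player and the referee have arrived. It is not the attached patch and does not model real-time priorities; the names `player`, `start_line`, `ready` and `seen_at_start` are hypothetical, and the player count is taken from the logged run (4 offense plus 4 defense).

```python
import threading

PLAYERS = 8  # 4 offense + 4 defense, as in the logged run

ready = []            # names of threads that have reached the start line
seen_at_start = []    # how many were ready when each thread was released
lock = threading.Lock()
start_line = threading.Barrier(PLAYERS + 1)  # all players plus the referee

def player(name):
    with lock:
        ready.append(name)
    start_line.wait()  # blocks until everyone, referee included, has arrived
    with lock:
        seen_at_start.append(len(ready))

threads = [threading.Thread(target=player, args=(f"p{i}",))
           for i in range(PLAYERS)]
for t in threads:
    t.start()
start_line.wait()      # the main thread plays the referee
for t in threads:
    t.join()
print(seen_at_start)
```

Every released thread sees all players already at the start line, which is the guarantee a counter spin loop tries to approximate. As noted in the comment above, this only covers start-up and cannot explain increments that happen later in the game.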

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation, and
for some reason only about 250 iterations completed, after which the job timed
out :-( And of these 250 iterations, I got no failures. Got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test

------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures with the_ball value 1 in every case,
Comment 11 IBM Bug Proxy 2009-06-08 07:21:02 EDT
Created attachment 313440 [details]
failure log

Comment 12 IBM Bug Proxy 2009-06-08 12:00:52 EDT
Created attachment 313440 [details]
failure log

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barriers better than
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several openings with varying increment loops?  If the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increments happened later in
the game. One reason behind this exercise was to try to narrow down where
the issue is coming from. With the above patch, I got 3 failures, with
ball positions 13, 1, and 1. So, clearly, the barriers are not helping. I now
need to look at the system state when the offense threads were able to increment
the ball position.
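
For readers without the attachment, the barrier-based startup being discussed can be sketched roughly as follows. This is not the attached patch: it is a minimal illustration, assuming plain POSIX threads with no SCHED_FIFO priorities, and start_with_barrier, player, players_ready and start_barrier are hypothetical names. Every player announces itself and then blocks on a barrier; the referee joins the barrier last, so the game cannot start before all players exist, and it resets the ball only after that point.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* All names here are hypothetical; this is not the sched_football source. */
static pthread_barrier_t start_barrier;
static atomic_int players_ready;
static atomic_int the_ball;

static void *player(void *arg)
{
    (void)arg;
    atomic_fetch_add(&players_ready, 1);    /* announce arrival */
    pthread_barrier_wait(&start_barrier);   /* block until the referee joins */
    /* A real offense thread would loop here, incrementing the_ball. */
    return NULL;
}

/* Starts n_players threads and releases them together.
 * Returns the number of players that were ready at kickoff,
 * or -1 if the arguments or the setup are invalid. */
int start_with_barrier(int n_players)
{
    pthread_t *tids;
    int i, ready;

    if (n_players <= 0)
        return -1;
    tids = malloc(sizeof(*tids) * (size_t)n_players);
    if (tids == NULL)
        return -1;

    atomic_store(&players_ready, 0);
    /* One extra participant for the referee (the calling thread). */
    if (pthread_barrier_init(&start_barrier, NULL,
                             (unsigned)n_players + 1) != 0) {
        free(tids);
        return -1;
    }

    for (i = 0; i < n_players; i++) {
        /* A missing player would leave the barrier stuck, so treat
         * a failed pthread_create as fatal in this sketch. */
        if (pthread_create(&tids[i], NULL, player, NULL) != 0)
            abort();
    }

    pthread_barrier_wait(&start_barrier);   /* kickoff */
    atomic_store(&the_ball, 0);             /* referee resets the ball */
    ready = atomic_load(&players_ready);

    for (i = 0; i < n_players; i++)
        pthread_join(tids[i], NULL);
    pthread_barrier_destroy(&start_barrier);
    free(tids);
    return ready;
}
```

A call such as start_with_barrier(8) corresponds to the 4 offense plus 4 defense threads in the log above. Note that a barrier only orders the startup; as discussed in this comment, it cannot prevent increments that happen later in the game.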

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation, but
for some reason only about 250 iterations completed before the job timed
out :-( Of these 250 iterations, none failed, so I have to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test

------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures with the_ball value 1 in every case,
Comment 13 IBM Bug Proxy 2009-06-09 10:10:53 EDT
Created attachment 313440 [details]
failure log

=Comment: #0=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:38 EDT
Problem description:

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation, but
for some reason only about 250 iterations completed before the job timed
out :-( None of these 250 iterations failed, so I have to start again.



------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures with the_ball value 1 in every case,
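
The failure counts quoted in this bug can be put on a common scale. The sketch below uses only numbers reported in the comments. "6000+" is taken as 6000, so that rate is an upper bound, and the runs were on different kernels (and not necessarily the same machines), so the comparison is only indicative. The labels and the helper name `failure_rate_percent` are made up for this illustration.

```python
# (failures, iterations) as quoted in the comments of this bug.
RUNS = {
    "2.6.24.7-69.el5rt, LS21 (comment #0)": (1, 6000),       # "6000+": lower bound on iterations
    "2.6.24.7-74ibmrt2.5, llm54 (comment #4)": (1, 11937),   # "close to 11937"
    "2.6.24.7-111.el5rt (2009-06-03 comment)": (17, 2166),
}


def failure_rate_percent(failures, iterations):
    """Percentage of iterations that ended with a non-zero ball position."""
    return 100.0 * failures / iterations


if __name__ == "__main__":
    for label, (failures, iterations) in RUNS.items():
        rate = failure_rate_percent(failures, iterations)
        print(f"{label}: {failures}/{iterations} = {rate:.4f}%")
```

On these figures the 2009 run fails in a bit under 1% of iterations, against roughly 0.01 to 0.02% for the 2008 runs.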
Comment 14 IBM Bug Proxy 2009-06-09 19:10:46 EDT
Created attachment 313440 [details]
failure log

On running a fairly large number of iterations of sched_football on the MRG -69 kernel,
a testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On an HS21 box I haven't seen this failure yet (out of 15k iterations).

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

How long does it (did it) take to reproduce it?

I ran 6000+ iterations of sched_football and the failure occurred once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13
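
When running thousands of iterations, it helps to pull the failing ones out of the logs automatically. A minimal sketch, assuming each iteration prints a `Final ball position: N` line as in the output above; the function name `summarize` and the sample text are illustrative, not part of the test suite.

```python
import re

# Matches the summary line printed at the end of each game.
BALL_RE = re.compile(r"^Final ball position\s*:\s*(-?\d+)\s*$", re.MULTILINE)


def summarize(log_text):
    """Return (iterations seen, list of non-zero final ball positions)."""
    positions = [int(value) for value in BALL_RE.findall(log_text)]
    failures = [p for p in positions if p != 0]
    return len(positions), failures


SAMPLE = """\
Game On (5 seconds)!
Game Over!
Final ball position: 13
Game On (5 seconds)!
Game Over!
Final ball position: 0
"""

if __name__ == "__main__":
    iterations, failures = summarize(SAMPLE)
    print(f"{iterations} iterations, failing ball positions: {failures}")
```

A passing game leaves the ball at 0, so any non-zero value collected here marks an iteration in which an offense thread managed to run.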

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running a large
number of iterations with a modified patch.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test
Comment 15 IBM Bug Proxy 2009-06-22 00:00:50 EDT
Created attachment 313440 [details]
failure log

=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.



------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures with the_ball value 1 in every case,
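
To put the counts reported in this bug side by side, a rough calculation can
help. The sketch below assumes the iteration totals quoted in the comments are
accurate ("6000+" and "close to 11937" are approximate in the source), and the
helper name failure_rate_percent is made up for illustration. The runs were on
different kernels and machines, so the rates are not directly comparable.

```python
# Failure counts as reported in this bug: (kernel, failures, iterations).
# "6000+" and "close to 11937" are approximate, so treat rates as estimates.
REPORTED_RUNS = [
    ("2.6.24.7-69.el5rt", 1, 6000),
    ("2.6.24.7-74ibmrt2.5", 1, 11937),
    ("2.6.24.7-111.el5rt", 17, 2166),
]


def failure_rate_percent(failures, iterations):
    """Observed failures per 100 iterations."""
    return 100.0 * failures / iterations


if __name__ == "__main__":
    for kernel, failures, iterations in REPORTED_RUNS:
        rate = failure_rate_percent(failures, iterations)
        print(f"{kernel}: {failures}/{iterations} = {rate:.3f}%")
```

On these numbers the -111 run fails far more often than either 2008 run, though
always with a ball position of 1 rather than the larger values seen earlier.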
Comment 16 IBM Bug Proxy 2009-06-22 01:20:43 EDT
Created attachment 313440 [details]
failure log

Comment 17 IBM Bug Proxy 2009-06-22 04:21:14 EDT
Created attachment 313440 [details]
failure log

=Comment: #0=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:38 EDT
Problem description:

On running a fairly large number of iterations of sched_football on the MRG -69
kernel, a testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On the HS21 box I haven't seen this failure yet (out of 15k iterations).

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran 6000+ iterations of sched_football and the failure occurred once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13
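
With one failure in thousands of iterations, it may help to tally the results from the logs rather than read them by eye. The following is only a sketch, assuming each game ends with a "Final ball position: N" line as in the log above; the function name count_games is hypothetical and is not part of the testcase.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan log text for "Final ball position: N" lines.
 * Returns the number of games found and stores in *failures
 * how many of them ended with a non-zero ball position.
 * (count_games is a hypothetical helper, not part of sched_football.)
 */
int count_games(const char *log, int *failures)
{
    static const char key[] = "Final ball position:";
    const char *p = log;
    int games = 0;
    int pos;

    *failures = 0;
    while ((p = strstr(p, key)) != NULL) {
        p += sizeof(key) - 1;            /* skip past the key text */
        if (sscanf(p, "%d", &pos) == 1) {
            games++;
            if (pos != 0)                /* the ball should stay at zero */
                (*failures)++;
        }
    }
    return games;
}
```

Feeding it the concatenated logs of a long run gives the iteration and failure counts directly.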

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First, I am trying to rule out issues with the testcase, if any. Now running a
large number of iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barriers better than
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or several openings with varying increment loops?  If the increments
happened right at the beginning of the game, then perhaps I missed something; if
they happened much later, then the barriers certainly won't make any difference.
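
For reference, a barrier-based start (as opposed to spinning on a counter such as defense_count) can be sketched roughly as follows. This is not the attached patch and not the LTP source: the names start_barrier, players_started, player_thread and start_players are hypothetical, and real-time priorities are deliberately left out so that the sketch runs unprivileged.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_PLAYERS 16

static pthread_barrier_t start_barrier;  /* hypothetical name */
static atomic_int players_started;       /* players that got past the barrier */

/* Each player blocks on the barrier, so none of them proceeds until
 * all players and the starting thread have arrived. */
static void *player_thread(void *arg)
{
    (void)arg;
    pthread_barrier_wait(&start_barrier);
    atomic_fetch_add(&players_started, 1);
    return NULL;
}

/* Create n players, release them together, and return how many ran.
 * Returns -1 if n is out of range or the barrier cannot be set up. */
int start_players(int n)
{
    pthread_t tid[MAX_PLAYERS];
    int i;

    if (n < 1 || n > MAX_PLAYERS)
        return -1;
    atomic_store(&players_started, 0);
    /* n players plus the calling ("referee") thread */
    if (pthread_barrier_init(&start_barrier, NULL, (unsigned)n + 1) != 0)
        return -1;
    for (i = 0; i < n; i++)
        if (pthread_create(&tid[i], NULL, player_thread, NULL) != 0)
            abort();  /* sketch only: treat thread creation failure as fatal */
    pthread_barrier_wait(&start_barrier);  /* last arrival releases everyone */
    for (i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&start_barrier);
    return atomic_load(&players_started);
}
```

Calling start_players(4) should return 4 once all four threads have been released and joined. The barrier only guarantees a common starting point; as noted above, it cannot prevent increments that happen later in the game.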
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barriers better than
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or several openings with varying increment loops?  If the increments
> happened right at the beginning of the game, then perhaps I missed something; if
> they happened much later, then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. One reason behind this exercise was to try and narrow down where
the issue is coming from. With the above patch, I got failures 3 times, with
ball positions 13, 1, and 1. So, clearly, the barriers are not helping. I now need to
look at the system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation, and
for some reason only about 250 iterations completed before the job timed
out :-( Of these 250 iterations, I got no failures, so I have to start again.



------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures with the_ball value 1 in every case,
Comment 18 IBM Bug Proxy 2009-06-22 04:51:02 EDT
Created attachment 313440
failure log

Comment 19 IBM Bug Proxy 2009-06-22 15:03:05 EDT
Created attachment 313440
failure log

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test
=Comment: #0=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:38 EDT
Problem description:

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barriers better than
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or was it several openings with varying increment loops?  If the increments
happened right at the beginning of the game, then perhaps I missed something; if
they happened much later, then the barriers certainly won't make any difference.
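
Editorial note: for readers unfamiliar with the two startup schemes being compared, here is a rough sketch of a barrier-based start, as opposed to spinning on a counter such as defense_count. It is not the attached patch. The names kickoff, player and run_kickoff are assumptions, and the real test's SCHED_FIFO priorities are left out so the sketch runs without privileges.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_PLAYERS 64

static pthread_barrier_t kickoff;     /* hypothetical name */
static atomic_int players_on_field;

static void *player(void *arg)
{
    (void)arg;
    /* Block here until every player and the referee have arrived. */
    pthread_barrier_wait(&kickoff);
    atomic_fetch_add(&players_on_field, 1);
    return NULL;
}

/* Starts n players, releases them together with the calling "referee"
 * thread, and returns how many got onto the field (-1 for a bad n). */
int run_kickoff(int n)
{
    pthread_t tid[MAX_PLAYERS];
    int i;

    if (n < 1 || n > MAX_PLAYERS)
        return -1;
    atomic_store(&players_on_field, 0);
    if (pthread_barrier_init(&kickoff, NULL, (unsigned)n + 1) != 0)
        return -1;
    for (i = 0; i < n; i++) {
        /* Sketch only: a failed create would leave the others stuck. */
        if (pthread_create(&tid[i], NULL, player, NULL) != 0)
            abort();
    }
    pthread_barrier_wait(&kickoff);   /* referee arrives last: game on */
    for (i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&kickoff);
    assert(atomic_load(&players_on_field) == n);
    return atomic_load(&players_on_field);
}
```

The barrier only guarantees that nobody starts early. As the comment above points out, it cannot explain or prevent increments that happen well into the game.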
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barriers better than
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or was it several openings with varying increment loops?  If the increments
> happened right at the beginning of the game, then perhaps I missed something; if
> they happened much later, then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test

------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures, with the_ball value 1 in every case.

------- Comment From dvhltc@us.ibm.com 2009-06-22 12:17 EDT-------
>> No, the system should always schedule any runnable higher prio RT task
>> irrespective of how long it has run. If it doesn't that is a bug.

Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to effectively de-prioritize the defense threads, allowing the offense threads to run momentarily?

> Well. In a small wrapper over sched_football to collect sched_switch
> trace, in every test failure I collect it, just after the test. I then  use
> single view tool to visualize the scheduling changes.
> http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

YES!  This is a perfect application of the tool.  Thanks for giving it a shot Gowri and sharing your results!

:-)  I'm pretty sure "SingleView" is part of the website logic, not the name of the tool.  (As it appears in other articles as well).  "sched_switch vcd visualization" is probably the most accurate.

I took a closer look at the vcd data and attached a couple PNG screenshots of the relevant areas so others could get a quick view of how this thing works - especially those unlikely to install gtkwave *cough* managers *cough*.  Being preempted by the sirq thread is normal behavior throughout the run of the test (since the sirq is running at higher priority).  What is interesting is, as Gowri said, where offense 13253 is scheduled after the sirq which preempted 13258.  This not-scheduling-the-offence-threads behavior continues for a while, which makes me suspect the 95% rt limit.
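
Editorial note: on kernels that implement RT throttling, the "95%" figure is the ratio of the RT runtime budget to the RT period. With the mainline defaults of 950000 us of runtime per 1000000 us period that works out to 95%, and a runtime of -1 disables the limit. The sketch below only does that arithmetic. The function name is hypothetical, and whether the kernel under test has this mechanism at all is not established at this point in the thread.

```c
#include <assert.h>

/* Share of each period that RT tasks may consume, in percent. */
double rt_share_percent(long runtime_us, long period_us)
{
    assert(period_us > 0);
    if (runtime_us < 0)          /* -1 means no throttling */
        return 100.0;
    return 100.0 * (double)runtime_us / (double)period_us;
}
```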
Comment 20 IBM Bug Proxy 2009-06-22 15:03:13 EDT
Created attachment 348969 [details]
screenshot: start of problem


------- Comment (attachment only) From dvhltc@us.ibm.com 2009-06-22 12:09 EDT-------
Comment 21 IBM Bug Proxy 2009-06-22 15:03:22 EDT
Created attachment 348970 [details]
screenshot: zoomed out view of failure


------- Comment (attachment only) From dvhltc@us.ibm.com 2009-06-22 12:10 EDT-------
Comment 22 IBM Bug Proxy 2009-06-23 08:52:07 EDT
Created attachment 313440 [details]
failure log

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test
=Comment: #0=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:38 EDT
Problem description:

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test
=Comment: #0=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:38 EDT
Problem description:

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test
=Comment: #0=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:38 EDT
Problem description:

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.



------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures with the_ball value 1 in every case,

------- Comment From dvhltc@us.ibm.com 2009-06-22 12:17 EDT-------
>> No, the system should always schedule any runnable higher prio RT task
>> irrespective of how long it has run. If it doesn't that is a bug.

Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to effectively de-prioritize the defense threads, allowing the offense threads to run momentarily?

> Well. In a small wrapper over sched_football to collect sched_switch
> trace, in every test failure I collect it, just after the test. I then  use
> single view tool to visualize the scheduling changes.
> http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

YES!  This is a perfect application of the tool.  Thanks for giving it a shot Gowri and sharing your results!

:-)  I'm pretty sure "SingleView" is part of the website logic, not the name of the tool.  (As it appears in other articles as well).  "sched_switch vcd visualization" is probably the most accurate.

I took a closer look at the vcd data and attached a couple of PNG screenshots of the relevant areas so others could get a quick view of how this thing works - especially those unlikely to install gtkwave *cough* managers *cough*.  Being preempted by the sirq thread is normal behavior throughout the run of the test (since the sirq is running at higher priority).  What is interesting is, as Gowri said, where offense 13253 is scheduled after the sirq which preempted 13258.  This not-scheduling-the-offence-threads behavior continues for a while, which makes me suspect the 95% rt limit.

------- Comment From gowrishankar.m@in.ibm.com 2009-06-23 08:43 EDT-------
(In reply to comment #43)
> >> No, the system should always schedule any runnable higher prio RT task
> >> irrespective of how long it has run. If it doesn't that is a bug.
>
> Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to
> effectively de-prioritize the defense threads, allowing the offense threads to
> run momentarily?
>

I could not find files such as sched_rt_period_us or sched_rt_runtime_us, so
it seems that 2.6.24 does not support this feature.
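
For anyone repeating this check on another kernel, here is a minimal sketch of it in C. It assumes the knobs live under /proc/sys/kernel/, which is where kernels that have RT throttling expose them; the thread does not say where the files were looked for. The function names read_long, rt_share_percent and report_rt_throttling are made up for illustration and are not part of sched_football.

```c
#include <assert.h>
#include <stdio.h>

/* Read a single integer from a sysctl-style file.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
int read_long(const char *path, long *out)
{
    FILE *f;
    int ok;

    assert(path != NULL && out != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Percentage of each period that RT tasks may consume.
 * A negative runtime means throttling is disabled (no limit).
 * Returns -1.0 for an invalid period. */
double rt_share_percent(long runtime_us, long period_us)
{
    if (period_us <= 0)
        return -1.0;
    if (runtime_us < 0)
        return 100.0;
    return 100.0 * (double)runtime_us / (double)period_us;
}

/* Print the RT limit, or report that this kernel lacks the knobs. */
void report_rt_throttling(void)
{
    long runtime, period;

    if (read_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime) != 0 ||
        read_long("/proc/sys/kernel/sched_rt_period_us", &period) != 0) {
        printf("RT throttling knobs not present\n");
        return;
    }
    printf("RT tasks limited to %.1f%% of each %ld us period\n",
           rt_share_percent(runtime, period), period);
}
```

Calling report_rt_throttling() from main() prints either the configured share or the "not present" message. A runtime of 950000 us in a period of 1000000 us is the 95% limit being discussed.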

<snip>

> (since the sirq is running at higher priority).  What is interesting is, as
> Gowri said, where offense 13253 is scheduled after the sirq which preempted
> 13258.  This not-scheduling-the-offence-threads behavior continues for a while,
> which makes me suspect the 95% rt limit.
>

On closer look, I could see defense #13258 on cpu #1 vanishing within a few us.
I could also see another defense thread on cpu #3 doing the same. Moreover, in
another failure (below) I observed the same pair of threads (1 and 3) vanishing
just as the referee comes up. Are they being killed? Any unhandled signals?

I think that may be the reason an offense thread now gets the chance to run after the referee,
as the other 3 defense threads are still busy (one of them goes away in a few us, as I said).
Comment 23 IBM Bug Proxy 2009-06-23 11:11:19 EDT
Created attachment 313440 [details]
failure log

On running a fairly large number of iterations of sched_football on the MRG -69 kernel,
a testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on the LS21
machine.

On an HS21 box I have not seen this failure yet (out of 15k iterations).

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran 6000+ iterations of sched_football and the failure occurred once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13
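
The log above shows the two teams running at different realtime priorities (offense at 15, defense at 30). As a rough sketch only, this is one way thread attributes for such a setup could be prepared with the POSIX API. It assumes SCHED_FIFO, which the log does not state, and init_fifo_attr is a hypothetical helper, not a function from the LTP sources.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Fill attr so that a thread created with it runs SCHED_FIFO at prio.
 * Returns 0 on success or a pthread error number. */
int init_fifo_attr(pthread_attr_t *attr, int prio)
{
    struct sched_param param = { .sched_priority = prio };
    int rc;

    assert(attr != NULL);
    rc = pthread_attr_init(attr);
    if (rc != 0)
        return rc;
    /* Without EXPLICIT_SCHED the new thread inherits the creator's policy
     * and the settings below are ignored. */
    rc = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    if (rc != 0)
        return rc;
    /* Set the policy first: the priority is validated against it. */
    rc = pthread_attr_setschedpolicy(attr, SCHED_FIFO);
    if (rc != 0)
        return rc;
    return pthread_attr_setschedparam(attr, &param);
}
```

Preparing the attribute needs no privileges; creating a thread with it via pthread_create() normally requires root or a suitable RLIMIT_RTPRIO.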

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barriers better than
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several openings with varying increment loops?  If the increments
happened right at the beginning of the game, then perhaps I missed something; if
it happened much later then the barriers certainly won't make any difference.
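
For readers without the attachment, here is a minimal sketch of what barrier-based startup looks like in general. This is NOT the attached patch. The names kickoff, player and run_kickoff are invented for illustration, and the threads here run at default priority rather than as realtime offense/defense threads.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <pthread.h>

#define PLAYERS 4

static pthread_barrier_t kickoff;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int started;                 /* players released by the barrier */

static void *player(void *arg)
{
    (void)arg;
    /* Block until every player and the referee have arrived. */
    pthread_barrier_wait(&kickoff);
    pthread_mutex_lock(&lock);
    started++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Start PLAYERS threads, release them together, and return how many ran.
 * Returns -1 if the barrier or a thread cannot be created. */
int run_kickoff(void)
{
    pthread_t tid[PLAYERS];
    int i;

    started = 0;
    /* PLAYERS + 1 because the calling thread acts as the referee. */
    if (pthread_barrier_init(&kickoff, NULL, PLAYERS + 1) != 0)
        return -1;
    for (i = 0; i < PLAYERS; i++) {
        if (pthread_create(&tid[i], NULL, player, NULL) != 0)
            return -1;              /* sketch only: no cleanup on failure */
    }
    /* Nobody proceeds until all PLAYERS + 1 threads are waiting here. */
    pthread_barrier_wait(&kickoff);
    for (i = 0; i < PLAYERS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&kickoff);
    return started;
}
```

The barrier only orders the start of the game. As noted above, it cannot explain or prevent the offense getting CPU time later on.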
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barriers better than
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several openings with varying increment loops?  If the increments
> happened right at the beginning of the game, then perhaps I missed something; if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test


------- Comment From dvhltc@us.ibm.com 2009-06-23 11:03 EDT-------
See sched_football.c, referee():

/* Watch the game */
while ((now.tv_sec - start.tv_sec) < game_length) {
    sleep(1);
    gettimeofday(&now, NULL);
}
/* Blow the whistle */
printf("Game Over!\n");
final_ball = the_ball;

So we expect the referee to wake up once a second, check the time, and then go back to sleep.  This could be done with a single timer rather than multiple wake/sleep cycles, but I think the added scheduling is a good thing for this kind of test.

Gowri, can you please provide a screenshot of what you are seeing, or maybe mention the start-stop time range in the vcd file so we can be sure to be looking at the same thing as you?
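
For comparison, the single-timer alternative mentioned above could look roughly like the sketch below. It is not what sched_football.c does. It assumes CLOCK_MONOTONIC with an absolute deadline is acceptable for the test, and watch_game is a hypothetical name.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Sleep once until game_length seconds from now on CLOCK_MONOTONIC.
 * Returns the elapsed time in seconds, or -1.0 on error. */
double watch_game(int game_length)
{
    struct timespec start, end, deadline;
    int rc;

    assert(game_length >= 0);
    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;
    deadline = start;
    deadline.tv_sec += game_length;

    /* With an absolute deadline a signal cannot stretch the game:
     * on EINTR we simply wait for the same deadline again. */
    do {
        rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    } while (rc == EINTR);
    if (rc != 0)
        return -1.0;

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;
    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

With this version the referee wakes once at the deadline instead of once per second, so it takes away the extra scheduling events that the 1-second loop gives the test for free.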
Comment 24 IBM Bug Proxy 2009-06-23 11:21:53 EDT
Created attachment 313440 [details]
failure log

=Comment: #0=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:38 EDT
Problem description:

On running a fairly large number of iterations of sched_football on the MRG -69 kernel,
a testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On the HS21 box I didn't see this failure yet (out of 15k iterations).

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran 6000+ iterations of sched_football and the failure occurred once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barriers better than
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several openings with varying increment loops?  If the increments
happened right at the beginning of the game, then perhaps I missed something; if
it happened much later then the barriers certainly won't make any difference.
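
For readers following the barrier discussion, here is a minimal sketch of a barrier-based kickoff. It is not the attached patch. The names PLAYERS, kickoff, ready_count, started_early and run_kickoff are hypothetical, and the sketch leaves out real-time priorities and the ball itself. It only shows that no thread can pass pthread_barrier_wait() until every player and the referee have arrived, which is the startup ordering the barriers are meant to guarantee. It says nothing about increments that happen later in the game.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define PLAYERS 4                   /* hypothetical; mirrors players_per_team=4 */

static pthread_barrier_t kickoff;   /* hypothetical name for the start barrier */
static atomic_int ready_count;      /* players that have reached the barrier */
static atomic_int started_early;    /* set if a player ran before all were ready */

static void *player(void *arg)
{
        (void)arg;
        atomic_fetch_add(&ready_count, 1);
        /* Block until all players and the referee have arrived. */
        pthread_barrier_wait(&kickoff);
        /* After release, every player must already have checked in. */
        if (atomic_load(&ready_count) != PLAYERS)
                atomic_store(&started_early, 1);
        return NULL;
}

/* Returns 0 if no player got past the barrier before the roster was complete. */
int run_kickoff(void)
{
        pthread_t tid[PLAYERS];
        int i, rc;

        atomic_store(&ready_count, 0);
        atomic_store(&started_early, 0);

        /* PLAYERS + 1 waiters: the players plus the referee (this thread). */
        rc = pthread_barrier_init(&kickoff, NULL, PLAYERS + 1);
        assert(rc == 0);            /* the sketch treats setup failure as fatal */

        for (i = 0; i < PLAYERS; i++) {
                rc = pthread_create(&tid[i], NULL, player, NULL);
                assert(rc == 0);
        }

        /* The referee is the last party; arriving here starts the game. */
        pthread_barrier_wait(&kickoff);

        for (i = 0; i < PLAYERS; i++)
                pthread_join(tid[i], NULL);
        pthread_barrier_destroy(&kickoff);
        return atomic_load(&started_early);
}
```

Calling run_kickoff() from a test harness should return 0 on every run. A nonzero return would mean a thread was released before the roster was complete.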
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barriers better than
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several openings with varying increment loops?  If the increments
> happened right at the beginning of the game, then perhaps I missed something; if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test

------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures with the_ball value 1 in every case.

------- Comment From dvhltc@us.ibm.com 2009-06-22 12:17 EDT-------
>> No, the system should always schedule any runnable higher prio RT task
>> irrespective of how long it has run. If it doesn't that is a bug.

Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to effectively de-prioritize the defense threads, allowing the offense threads to run momentarily?

> Well. In a small wrapper over sched_football to collect sched_switch
> trace, in every test failure I collect it, just after the test. I then  use
> single view tool to visualize the scheduling changes.
> http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

YES!  This is a perfect application of the tool.  Thanks for giving it a shot Gowri and sharing your results!

:-)  I'm pretty sure "SingleView" is part of the website logic, not the name of the tool.  (As it appears in other articles as well).  "sched_switch vcd visualization" is probably the most accurate.

I took a closer look at the vcd data and attached a couple of PNG screenshots of the relevant areas so others could get a quick view of how this thing works - especially those unlikely to install gtkwave *cough* managers *cough*.  Being preempted by the sirq thread is normal behavior throughout the run of the test (since the sirq is running at higher priority).  What is interesting is, as Gowri said, where offense 13253 is scheduled after the sirq which preempted 13258.  This not-scheduling-the-offense-threads behavior continues for a while, which makes me suspect the 95% rt limit.

------- Comment From gowrishankar.m@in.ibm.com 2009-06-23 08:43 EDT-------
(In reply to comment #43)
> >> No, the system should always schedule any runnable higher prio RT task
> >> irrespective of how long it has run. If it doesn't that is a bug.
>
> Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to
> effectively de-prioritize the defense threads, allowing the offense threads to
> run momentarily?
>

I could not find files such as sched_rt_period_us or sched_rt_runtime_us, so
it seems 2.6.24 does not support this feature.
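
On kernels that do expose these two sysctls, the RT limit in question is the ratio of runtime to period. A small sketch of that check follows. The helper names rt_budget_percent and read_sysctl_long are hypothetical. The usual defaults of 1000000 us and 950000 us give the 95% figure, and a runtime of -1 means throttling is disabled. A missing file, as seen here, is reported as -1 by the reader.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Percentage of each period that RT tasks may consume.
 * Returns -1 when throttling is disabled (runtime_us < 0) or the
 * period is not positive.
 */
int rt_budget_percent(long period_us, long runtime_us)
{
        if (runtime_us < 0 || period_us <= 0)
                return -1;
        return (int)((runtime_us * 100) / period_us);
}

/*
 * Read one integer from a sysctl file such as
 * /proc/sys/kernel/sched_rt_runtime_us.
 * Returns 0 on success, -1 if the file is absent or unparsable.
 */
int read_sysctl_long(const char *path, long *out)
{
        FILE *f;
        int ok;

        assert(out != NULL);
        f = fopen(path, "r");
        if (f == NULL)
                return -1;      /* e.g. the kernel does not provide the knob */
        ok = (fscanf(f, "%ld", out) == 1);
        fclose(f);
        return ok ? 0 : -1;
}
```

If both reads succeed, passing the two values to rt_budget_percent() shows whether the defense threads could be throttled at all. If either read fails, the limit cannot be the explanation on that kernel.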

<snip>

> (since the sirq is running at higher priority).  What is interesting is, as
> Gowri said, where offense 13253 is scheduled after the sirq which preempted
> 13258.  This not-scheduling-the-offence-threads behavior continues for a while,
> which makes me suspect the 95% rt limit.
>

On a closer look, I could see defense #13258 on cpu #1 vanishing within a few us.
I could also see another defense thread on cpu #3 doing the same. Moreover, in
another failure (below) I observed the same pair of threads (on cpus 1 and 3) vanishing
just as the referee comes up. Are they being killed? Any unhandled signals?

I think that may be the reason an offense thread now gets the chance to run after the referee,
as the other 3 defense threads are still busy (one of them goes away in a few us, as I said).

------- Comment From dvhltc@us.ibm.com 2009-06-23 11:03 EDT-------
See sched_football.c, referee():

/* Watch the game */
while ((now.tv_sec - start.tv_sec) < game_length) {
        sleep(1);
        gettimeofday(&now, NULL);
}
/* Blow the whistle */
printf("Game Over!\n");
final_ball = the_ball;

So we expect the referee to wake up once a second, check the time, and then go back to sleep.  This could be done with a single timer rather than multiple wake/sleep cycles, but I think the added scheduling is a good thing for this kind of test.
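
For comparison, the single-timer alternative mentioned above could look roughly like the sketch below. It is not code from sched_football.c. The names game_deadline, sleep_until and watch_game are hypothetical, and it assumes CLOCK_MONOTONIC with an absolute deadline, so the referee is woken once at the end of the game instead of once per second.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Absolute deadline game_length_sec seconds after start. */
struct timespec game_deadline(struct timespec start, long game_length_sec)
{
        assert(game_length_sec >= 0);
        start.tv_sec += game_length_sec;
        return start;
}

/*
 * Sleep once until the absolute deadline. clock_nanosleep() returns an
 * error number directly, so restart the sleep if a signal interrupts it.
 * Returns 0 on success.
 */
int sleep_until(const struct timespec *deadline)
{
        int rc;

        do {
                rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                                     deadline, NULL);
        } while (rc == EINTR);
        return rc;
}

/* Referee's wait: one wakeup at the end of the game. Returns 0 on success. */
int watch_game(long game_length_sec)
{
        struct timespec start, end;

        if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
                return -1;
        end = game_deadline(start, game_length_sec);
        return sleep_until(&end);
}
```

The trade-off is the one already noted: this version removes the per-second referee wakeups, so it exercises the scheduler less than the loop in the test.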

Gowri, can you please provide a screenshot of what you are seeing, or maybe mention the start-stop time range in the vcd file so we can be sure to be looking at the same thing as you?
Comment 25 IBM Bug Proxy 2009-06-23 11:22:01 EDT
Created attachment 349102 [details]
Startup failure?


------- Comment on attachment From dvhltc@us.ibm.com 2009-06-23 11:19 EDT-------


This screenshot of Gowri's sched_football_329.vcd trace illustrates what appears to me to be a startup scheduling failure.  The game starts when the referee starts its 1 second sleep loop, which appears to be here in this screenshot.  Once the referee frees its CPU, we'd expect the 4th defense thread to get the CPU, but instead an offense thread gets it for 6us until the last defense thread preempts it.  I'm thinking this is all occurring on CPU 0 (is that what 00 means in the display?) - but I can't explain what LL means in the referee bar...  no explanation yet.
Comment 26 IBM Bug Proxy 2009-06-23 11:52:22 EDT
Created attachment 313440 [details]
failure log

------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures with the_ball value 1 in every case,

------- Comment From dvhltc@us.ibm.com 2009-06-22 12:17 EDT-------
>> No, the system should always schedule any runnable higher prio RT task
>> irrespective of how long it has run. If it doesn't that is a bug.

Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to effectively de-prioritize the defense threads, allowing the offense threads to run momentarily?

> Well. In a small wrapper over sched_football to collect sched_switch
> trace, in every test failure I collect it, just after the test. I then  use
> single view tool to visualize the scheduling changes.
> http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

YES!  This is a perfect application of the tool.  Thanks for giving it a shot Gowri and sharing your results!

:-)  I'm pretty sure "SingleView" is part of the website logic, not the name of the tool.  (As it appears in other articles as well).  "sched_switch vcd visualization" is probably the most accurate.

I took a closer look at the vcd data and attached a couple of PNG screenshots of the relevant areas so others could get a quick view of how this thing works - especially those unlikely to install gtkwave *cough* managers *cough*.  Being preempted by the sirq thread is normal behavior throughout the run of the test (since the sirq is running at higher priority).  What is interesting is, as Gowri said, where offense 13253 is scheduled after the sirq which preempted 13258.  This not-scheduling-the-offense-threads behavior continues for a while, which makes me suspect the 95% rt limit.

------- Comment From gowrishankar.m@in.ibm.com 2009-06-23 08:43 EDT-------
(In reply to comment #43)
> >> No, the system should always schedule any runnable higher prio RT task
> >> irrespective of how long it has run. If it doesn't that is a bug.
>
> Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to
> effectively de-prioritize the defense threads, allowing the offense threads to
> run momentarily?
>

I could not find files such as sched_rt_period_us or sched_rt_runtime_us, so
it seems that 2.6.24 does not support this feature.
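
For reference, on kernels that do provide RT throttling, the 95% figure corresponds to the default values of the two sysctls named above (sched_rt_runtime_us = 950000 out of sched_rt_period_us = 1000000). The following is only a minimal sketch in C of how one might check for the files and compute the limit; the helper names read_sysctl_long() and rt_limit_permille() are made up for illustration and are not part of sched_football or the kernel.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read a single integer from a sysctl-style file.
 * Returns 0 on success, -1 if the file is missing or unparsable
 * (for example on a kernel without RT throttling support). */
int read_sysctl_long(const char *path, long *out)
{
    FILE *f;
    int ok;

    assert(path != NULL && out != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical helper: share of each period that RT tasks may consume,
 * in parts per thousand. Returns 1000 when runtime_us is -1 (throttling
 * disabled) and -1 on nonsensical input. */
long rt_limit_permille(long runtime_us, long period_us)
{
    if (runtime_us == -1)
        return 1000;
    if (period_us <= 0 || runtime_us < 0 || runtime_us > period_us)
        return -1;
    return (long)((long long)runtime_us * 1000 / period_us);
}
```

If read_sysctl_long() fails for /proc/sys/kernel/sched_rt_runtime_us, that would be consistent with the observation above that the running kernel does not expose the feature; with the default values, rt_limit_permille(950000, 1000000) gives 950, i.e. 95%.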

<snip>

> (since the sirq is running at higher priority).  What is interesting is, as
> Gowri said, where offense 13253 is scheduled after the sirq which preempted
> 13258.  This not-scheduling-the-offense-threads behavior continues for a while,
> which makes me suspect the 95% rt limit.
>

On a closer look, I could see defense #13258 on cpu #1 vanishing within a few us.
I could also see another defense thread on cpu #3 doing the same. Moreover, in
another failure (below) I observed the same pair of threads (1 and 3) vanishing
just as the referee comes up. Are they being killed? Any unhandled signals?

I think that may be the reason an offense thread now gets the chance to run after the referee,
as the other 3 defense threads are still busy (one of them goes away within a few us, as I said).

------- Comment From dvhltc@us.ibm.com 2009-06-23 11:03 EDT-------
See sched_football.c, referee():

    /* Watch the game */
    while ((now.tv_sec - start.tv_sec) < game_length) {
        sleep(1);
        gettimeofday(&now, NULL);
    }
    /* Blow the whistle */
    printf("Game Over!\n");
    final_ball = the_ball;

So we expect the referee to wake up once a second, check the time, and then go back to sleep.  This could be done with a single timer rather than multiple wake/sleep cycles, but I think the added scheduling is a good thing for this kind of test.
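
For comparison only, the single-timer alternative could look roughly like the sketch below. It is not what sched_football.c does, and the function name referee_wait_ms() is hypothetical; it blocks once on an absolute CLOCK_MONOTONIC deadline instead of waking every second, which also removes the per-second scheduling activity that the current loop deliberately adds.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Hypothetical alternative to the sleep(1) loop: block once until an
 * absolute CLOCK_MONOTONIC deadline wait_ms milliseconds from now.
 * Returns 0 once the deadline has passed, or an errno value on failure. */
int referee_wait_ms(long wait_ms)
{
    struct timespec deadline;
    int ret;

    if (wait_ms < 0)
        return EINVAL;
    if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
        return errno;

    deadline.tv_sec += wait_ms / 1000;
    deadline.tv_nsec += (wait_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    /* With an absolute deadline, restarting after a signal (EINTR)
     * does not extend the total wait. */
    do {
        ret = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    } while (ret == EINTR);

    return ret;
}
```

A referee using this would call referee_wait_ms(game_length * 1000L) once and then read the_ball, at the cost of losing the once-a-second wakeups.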

Gowri, can you please provide a screenshot of what you are seeing, or maybe mention the start-stop time range in the vcd file so we can be sure to be looking at the same thing as you?

------- Comment From will_schmidt@vnet.ibm.com 2009-06-23 11:42 EDT-------
(In reply to comment #47)
> Created an attachment (id=46211) [details]
> Startup failure?
>
>  I'm thinking this is all occurring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

In data from the 8-way  JS22, I see values between "000" and "111", so that would be a binary rep of the cpu number.

LL appears to be coming out of this bit of code in the ftrace->vcd converter tool gadget.
    if (sched_switch[i].to_prio != program[sched_switch[i].to_pid].prio) {
        for (j = 0; j <= nof_bits; j++) {
            fprintf (fpo, "%c",
                "LH"[(sched_switch[i].to_cpu >> (nof_bits - j)) & 1]);
        }
    }

A comment up a little ways states:  "L/H binary encoded cpu number with priority inheritance".
Comment 27 IBM Bug Proxy 2009-06-23 12:02:39 EDT
Created attachment 313440 [details]
failure log

On running a fairly large number of iterations of sched_football on the MRG -69 kernel,
a testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on the LS21
machine.

On the HS21 box I didn't see this failure yet (out of 15k iterations).

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran 6000+ iterations of sched_football and the failure occurred once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barriers better than
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several openings with varying increment loops?  If the increments
happened right at the beginning of the game, then perhaps I missed something; if
they happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barriers better than
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several openings with varying increment loops?  If the increments
> happened right at the beginning of the game, then perhaps I missed something; if
> they happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test




------- Comment From will_schmidt@vnet.ibm.com 2009-06-23 11:58 EDT-------
(In reply to comment #47)

> last defense thread preempts it.  I'm thinking this is all occurring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

I also see a "HL" for sched_football-5629 at 41490us (in sched_football_329), and all of a sudden the notation made sense.
H/L usage indicates there is a prio inheritance condition.  s/H/1/ and s/L/0/ to map back to the processor number.

LL=00, LH=01, HL=10, HH=11.
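
The mapping above can be sketched as a small C decoder. This is only an illustration of the reading given in this comment; decode_cpu_label() and label_is_boosted() are hypothetical names and are not taken from the ftrace->vcd converter.

```c
#include <stddef.h>

/* Hypothetical decoder for the per-task labels seen in the vcd view.
 * 'H' or '1' is a set bit and 'L' or '0' a clear bit, most significant
 * bit first. Returns the cpu number, or -1 for an empty, overlong or
 * malformed label. */
int decode_cpu_label(const char *label)
{
    int cpu = 0;
    size_t i;

    if (label == NULL || label[0] == '\0')
        return -1;
    for (i = 0; label[i] != '\0'; i++) {
        char c = label[i];

        if (i >= 16)            /* keep the shift well inside int range */
            return -1;
        if (c == 'H' || c == '1')
            cpu = (cpu << 1) | 1;
        else if (c == 'L' || c == '0')
            cpu = cpu << 1;
        else
            return -1;
    }
    return cpu;
}

/* Nonzero if the label uses the L/H alphabet, which the converter code
 * quoted earlier emits when to_prio differs from the recorded prio. */
int label_is_boosted(const char *label)
{
    return label != NULL && (label[0] == 'L' || label[0] == 'H');
}
```

With this reading, the "LL" in the referee bar is cpu 0 with a priority that differs from the recorded one, and the "HL" seen for sched_football-5629 is cpu 2.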
Comment 28 IBM Bug Proxy 2009-06-23 12:21:37 EDT
Created attachment 313440 [details]
failure log

=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test



------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures, with the_ball value 1 in every case.

------- Comment From dvhltc@us.ibm.com 2009-06-22 12:17 EDT-------
>> No, the system should always schedule any runnable higher prio RT task
>> irrespective of how long it has run. If it doesn't that is a bug.

Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to effectively de-prioritize the defense threads, allowing the offense threads to run momentarily?

> Well. In a small wrapper over sched_football to collect sched_switch
> trace, in every test failure I collect it, just after the test. I then  use
> single view tool to visualize the scheduling changes.
> http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

YES!  This is a perfect application of the tool.  Thanks for giving it a shot Gowri and sharing your results!

:-)  I'm pretty sure "SingleView" is part of the website logic, not the name of the tool.  (As it appears in other articles as well).  "sched_switch vcd visualization" is probably the most accurate.

I took a closer look at the vcd data and attached a couple of PNG screenshots of the relevant areas so others could get a quick view of how this thing works - especially those unlikely to install gtkwave *cough* managers *cough*.  Being preempted by the sirq thread is normal behavior throughout the run of the test (since the sirq is running at higher priority).  What is interesting is, as Gowri said, where offense 13253 is scheduled after the sirq which preempted 13258.  This not-scheduling-the-offence-threads behavior continues for a while, which makes me suspect the 95% rt limit.
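
For reference, a minimal sketch of how the RT throttling budget suspected above could be checked from userspace. It assumes a kernel that exposes /proc/sys/kernel/sched_rt_period_us and /proc/sys/kernel/sched_rt_runtime_us (commonly 1000000 and 950000, which gives the 95% figure; a runtime of -1 disables throttling). The helper names are hypothetical and are not part of sched_football.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read one integer from a sysctl file.
 * Returns 0 on success, -1 if the file is missing or unparsable
 * (for example on a kernel that does not expose the knob). */
int read_sysctl_long(const char *path, long *out)
{
        FILE *fp;
        int ok;

        assert(path != NULL && out != NULL);
        fp = fopen(path, "r");
        if (!fp)
                return -1;
        ok = (fscanf(fp, "%ld", out) == 1);
        fclose(fp);
        return ok ? 0 : -1;
}

/* Percentage of each period that RT tasks may consume.
 * Returns 100 when throttling is effectively off (runtime < 0 or
 * runtime >= period) and -1 when the period is not positive. */
int rt_budget_percent(long period_us, long runtime_us)
{
        if (period_us <= 0)
                return -1;
        if (runtime_us < 0 || runtime_us >= period_us)
                return 100;
        return (int)((runtime_us * 100) / period_us);
}

/* Print the budget, or note that the knobs are absent. */
void report_rt_budget(void)
{
        long period, runtime;

        if (read_sysctl_long("/proc/sys/kernel/sched_rt_period_us", &period) ||
            read_sysctl_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime)) {
                printf("RT throttling knobs not present on this kernel\n");
                return;
        }
        printf("RT budget: %d%% of each %ld us period\n",
               rt_budget_percent(period, runtime), period);
}
```

Calling report_rt_budget() before a run would show whether the defense threads can be throttled at all on the kernel under test.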

------- Comment From gowrishankar.m@in.ibm.com 2009-06-23 08:43 EDT-------
(In reply to comment #43)
> >> No, the system should always schedule any runnable higher prio RT task
> >> irrespective of how long it has run. If it doesn't that is a bug.
>
> Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to
> effectively de-prioritize the defense threads, allowing the offense threads to
> run momentarily?
>

I could not find files such as sched_rt_period_us or sched_rt_runtime_us, so
it seems that 2.6.24 does not support this feature.

<snip>

> (since the sirq is running at higher priority).  What is interesting is, as
> Gowri said, where offense 13253 is scheduled after the sirq which preempted
> 13258.  This not-scheduling-the-offence-threads behavior continues for a while,
> which makes me suspect the 95% rt limit.
>

On closer inspection, I could see defense #13258 on cpu #1 vanishing within a few us.
I could also see another defense thread on cpu #3 doing the same. Moreover, in
another failure (below) I observed the same pair of threads (1 and 3) vanishing
just as the referee comes up. Are they being killed? Any unhandled signals?

I think that may be the reason an offense thread now gets the chance to run after the referee,
as the other 3 defense threads are still busy (one of them goes away in a few us, as I said).

------- Comment From dvhltc@us.ibm.com 2009-06-23 11:03 EDT-------
See sched_football.c, referee():

/* Watch the game */
while ((now.tv_sec - start.tv_sec) < game_length) {
        sleep(1);
        gettimeofday(&now, NULL);
}
/* Blow the whistle */
printf("Game Over!\n");
final_ball = the_ball;

So we expect the referee to wake up once a second, check the time, and then go back to sleep.  This could be done with a single timer rather than multiple wake/sleep cycles, but I think the added scheduling is a good thing for this kind of test.

Gowri, can you please provide a screenshot of what you are seeing, or maybe mention the start-stop time range in the vcd file so we can be sure to be looking at the same thing as you?
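
For comparison, a rough sketch of the single-timer alternative mentioned above. This is not what sched_football.c does: it swaps gettimeofday() for CLOCK_MONOTONIC and takes the game length in milliseconds, both of which are assumptions made for illustration, and the function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Sleep until start + game_length_ms using one absolute timer on
 * CLOCK_MONOTONIC. Returns 0 on success, or an error number. */
int wait_for_final_whistle(const struct timespec *start, long game_length_ms)
{
        struct timespec deadline;
        int rc;

        assert(start != NULL && game_length_ms >= 0);
        deadline = *start;
        deadline.tv_sec += game_length_ms / 1000;
        deadline.tv_nsec += (game_length_ms % 1000) * 1000000L;
        if (deadline.tv_nsec >= 1000000000L) {
                deadline.tv_sec += 1;
                deadline.tv_nsec -= 1000000000L;
        }
        /* With an absolute deadline, restarting after a signal does
         * not stretch the total sleep time. */
        do {
                rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                                     &deadline, NULL);
        } while (rc == EINTR);
        return rc;
}

/* Whole milliseconds elapsed from a to b (b must not precede a). */
long elapsed_ms(const struct timespec *a, const struct timespec *b)
{
        long long ns = (long long)(b->tv_sec - a->tv_sec) * 1000000000LL +
                       (b->tv_nsec - a->tv_nsec);

        return (long)(ns / 1000000LL);
}
```

The referee would wake exactly once with this approach, so it removes the once-a-second preemption of a defense thread that the current loop provides as extra scheduling activity.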

------- Comment From will_schmidt@vnet.ibm.com 2009-06-23 11:42 EDT-------
(In reply to comment #47)
> Created an attachment (id=46211) [details]
> Startup failure?
>
>  I'm thinking this is all occurring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

In data from the 8-way JS22, I see values between "000" and "111", so that would be a binary representation of the cpu number.

LL appears to be coming out of this bit of code in the ftrace->vcd converter tool gadget.
if (sched_switch[i].to_prio != program[sched_switch[i].to_pid].prio) {
        for (j = 0; j <= nof_bits; j++) {
                fprintf (fpo, "%c",
                    "LH"[(sched_switch[i].to_cpu >> (nof_bits - j)) & 1]);
        }
}

A comment up a little ways states:  "L/H binary encoded cpu number with priority inheritance".

------- Comment From will_schmidt@vnet.ibm.com 2009-06-23 11:58 EDT-------
(In reply to comment #47)

> last defense thread preempts it.  I'm thinking this is all occurring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

I also see an "HL" for sched_football-5629 at 41490us (in sched_football_329), and all of a sudden the notation made sense.
H/L usage indicates there is a prio inheritance condition.  s/H/1/ and s/L/0/ to map back to the processor number.

LL=00, LH=01, HL=10, HH=11.
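
The mapping above can be sketched as a small decoder. This is only an illustration of the notation as described in this comment, not code from the converter tool; the function name and the error convention are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Decode a cpu label from the VCD view. '0'/'1' are plain binary
 * digits; 'L'/'H' carry the same bit values but indicate a priority
 * inheritance condition. Returns the cpu number, or -1 for an empty
 * or malformed label. *prio_inherit is set to 1 if any L/H is seen. */
int decode_cpu_label(const char *label, int *prio_inherit)
{
        int cpu = 0;
        size_t i;

        assert(label != NULL && prio_inherit != NULL);
        *prio_inherit = 0;
        if (label[0] == '\0')
                return -1;
        for (i = 0; label[i] != '\0'; i++) {
                char c = label[i];
                int bit;

                if (c == '0' || c == 'L')
                        bit = 0;
                else if (c == '1' || c == 'H')
                        bit = 1;
                else
                        return -1;
                if (c == 'L' || c == 'H')
                        *prio_inherit = 1;
                /* Most significant bit comes first in the label. */
                cpu = (cpu << 1) | bit;
        }
        return cpu;
}
```

So the "HL" seen for sched_football-5629 decodes to cpu 2 with the priority inheritance flag set, while a plain "111" from the 8-way box decodes to cpu 7 with the flag clear.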

------- Comment From dvhltc@us.ibm.com 2009-06-23 12:14 EDT-------
I have yet to see a failure anywhere other than at the very beginning or the very end (and I think maybe only at the very beginning).  I wonder if it wouldn't be useful to exit the test immediately on failure and collect a vmcore.  Although, just doing this through gdb (userspace) might be adequate.
Comment 29 IBM Bug Proxy 2009-06-23 12:32:07 EDT
Created attachment 313440 [details]
failure log

=Comment: #0=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:38 EDT
Problem description:

On running a fairly large number of iterations of sched_football on the MRG -69 kernel,
a testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On the HS21 box I haven't seen this failure yet (out of 15k iterations).

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

How long does it (did it) take to reproduce it?

I ran 6000+ iterations of sched_football and the failure occurred once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.



------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures, with the_ball value 1 in every case.

------- Comment From dvhltc@us.ibm.com 2009-06-22 12:17 EDT-------
>> No, the system should always schedule any runnable higher prio RT task
>> irrespective of how long it has run. If it doesn't that is a bug.

Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to effectively de-prioritize the defense threads, allowing the offense threads to run momentarily?

> Well. In a small wrapper over sched_football to collect sched_switch
> trace, in every test failure I collect it, just after the test. I then  use
> single view tool to visualize the scheduling changes.
> http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

YES!  This is a perfect application of the tool.  Thanks for giving it a shot Gowri and sharing your results!

:-)  I'm pretty sure "SingleView" is part of the website logic, not the name of the tool.  (As it appears in other articles as well).  "sched_switch vcd visualization" is probably the most accurate.

I took a closer look at the vcd data and attached a couple of PNG screenshots of the relevant areas so others could get a quick view of how this thing works - especially those unlikely to install gtkwave *cough* managers *cough*.  Being preempted by the sirq thread is normal behavior throughout the run of the test (since the sirq is running at higher priority).  What is interesting is, as Gowri said, where offense 13253 is scheduled after the sirq which preempted 13258.  This not-scheduling-the-offence-threads behavior continues for a while, which makes me suspect the 95% rt limit.

------- Comment From gowrishankar.m@in.ibm.com 2009-06-23 08:43 EDT-------
(In reply to comment #43)
> >> No, the system should always schedule any runnable higher prio RT task
> >> irrespective of how long it has run. If it doesn't that is a bug.
>
> Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to
> effectively de-prioritize the defense threads, allowing the offense threads to
> run momentarily?
>

I could not find the sched_rt_period_us or sched_rt_runtime_us files, so
it seems 2.6.24 does not support this feature.
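
For reference, on kernels that do implement RT throttling, the two knobs live under /proc/sys/kernel, and the 95% figure is the default ratio of sched_rt_runtime_us (950000) to sched_rt_period_us (1000000); a runtime of -1 disables the limit. The following is a minimal sketch of such a check, not part of sched_football. The function names are made up for illustration, and a missing file is simply treated as "no throttling":

```c
#include <assert.h>
#include <stdio.h>

/* Read one integer from a procfs file. Returns 0 on success, -1 if the
 * file is missing or unreadable (e.g. a kernel without RT throttling). */
int read_long(const char *path, long *out)
{
    FILE *f;
    int ok;

    assert(path != NULL && out != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Share of each period that RT tasks may consume, in percent.
 * Returns -1.0 when throttling is disabled (runtime < 0) or the
 * period is not a positive number. */
double rt_budget_percent(long runtime_us, long period_us)
{
    if (runtime_us < 0 || period_us <= 0)
        return -1.0;
    return 100.0 * (double)runtime_us / (double)period_us;
}

/* Returns 1 if the running kernel exposes both knobs and the limit is
 * enabled, 0 otherwise. The paths are the standard procfs locations. */
int rt_throttling_active(void)
{
    long runtime, period;

    if (read_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime) != 0 ||
        read_long("/proc/sys/kernel/sched_rt_period_us", &period) != 0)
        return 0;
    return rt_budget_percent(runtime, period) >= 0.0;
}
```

If rt_throttling_active() returns 0 on the kernel under test, the 95% limit cannot explain the offense threads getting to run.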

<snip>

> (since the sirq is running at higher priority).  What is interesting is, as
> Gowri said, where offense 13253 is scheduled after the sirq which preempted
> 13258.  This not-scheduling-the-offence-threads behavior continues for a while,
> which makes me suspect the 95% rt limit.
>

On a closer look, I could see defense #13258 on cpu #1 vanishing within a few us.
I could also see another defense thread on cpu #3 doing the same. Moreover, in
another failure (below) I observed the same pair of threads (1 and 3) vanishing
just as the referee comes up. Are they being killed? Any unhandled signals?

I think that may be why an offense thread now gets the chance to run after the referee,
as the other 3 defense threads are still busy (one of them goes away within a few us, as I said).

------- Comment From dvhltc@us.ibm.com 2009-06-23 11:03 EDT-------
See sched_football.c, referee():

/* Watch the game */
while ((now.tv_sec - start.tv_sec) < game_length) {
        sleep(1);
        gettimeofday(&now, NULL);
}
/* Blow the whistle */
printf("Game Over!\n");
final_ball = the_ball;

So we expect the referee to wake up once a second, check the time, and then go back to sleep.  This could be done with a single timer rather than multiple wake/sleep cycles, but I think the added scheduling is a good thing for this kind of test.
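
For comparison only, the single-timer variant mentioned above could look roughly like the sketch below. It is not what sched_football.c does, and as noted the once-a-second wakeups are arguably useful for this test. The function name referee_wait is hypothetical, and the sketch uses CLOCK_MONOTONIC with an absolute deadline instead of gettimeofday():

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Sleep once until an absolute CLOCK_MONOTONIC deadline that lies
 * game_length seconds in the future, restarting the sleep if a signal
 * interrupts it. Returns 0 on success or an errno value on failure. */
int referee_wait(int game_length)
{
    struct timespec deadline;
    int err;

    assert(game_length >= 0);
    if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
        return errno;
    deadline.tv_sec += game_length;

    /* clock_nanosleep() returns the error number directly, not -1. */
    do {
        err = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                              &deadline, NULL);
    } while (err == EINTR);

    return err;
}
```

With this variant the referee is woken once per game, so the defense threads are preempted by it far less often than in the loop above. On older glibc versions the program may need to be linked with -lrt.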

Gowri, can you please provide a screenshot of what you are seeing, or maybe mention the start-stop time range in the vcd file so we can be sure to be looking at the same thing as you?

------- Comment From will_schmidt@vnet.ibm.com 2009-06-23 11:42 EDT-------
(In reply to comment #47)
> Created an attachment (id=46211) [details]
> Startup failure?
>
>  I'm thinking this is all occuring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

In data from the 8-way  JS22, I see values between "000" and "111", so that would be a binary rep of the cpu number.

LL appears to be coming out of this bit of code in the ftrace->vcd converter tool gadget.
if (sched_switch[i].to_prio != program[sched_switch[i].to_pid].prio) {
        for (j = 0; j <= nof_bits; j++) {
                fprintf (fpo, "%c",
                        "LH"[(sched_switch[i].to_cpu >> (nof_bits - j)) & 1]);
        }
}

A comment up a little ways states:  "L/H binary encoded cpu number with priority inheritance".

------- Comment From will_schmidt@vnet.ibm.com 2009-06-23 11:58 EDT-------
(In reply to comment #47)

> last defense thread preempts it.  I'm thinking this is all occuring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

I also see a "HL" for sched_football-5629 at 41490us (in sched_football_329), and all of a sudden the notation made sense.
H/L usage indicates there is a prio inheritance condition.  s/H/1/ and s/L/0/ to map back to the processor number.

LL=00, LH=01, HL=10, HH=11.
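
The mapping above can be written down as a small decoder. This is a sketch for reading labels by hand or in a script, not code from the converter tool; the function name decode_cpu_label is made up, and it assumes that labels outside a priority-inheritance condition use plain '0'/'1' digits, as the "000" to "111" values suggest:

```c
#include <assert.h>
#include <stddef.h>

/* Decode a CPU label from the vcd view, e.g. "10" or "HL".
 * '0'/'L' count as bit 0 and '1'/'H' as bit 1, most significant first.
 * If pi_active is non-NULL it is set to 1 when the label uses H/L
 * (priority changed, i.e. a PI condition) and to 0 otherwise.
 * Returns the CPU number, or -1 for an empty or malformed label. */
int decode_cpu_label(const char *label, int *pi_active)
{
    int cpu = 0;
    int pi = 0;
    size_t i;

    assert(label != NULL);
    if (label[0] == '\0')
        return -1;

    for (i = 0; label[i] != '\0'; i++) {
        int bit;

        switch (label[i]) {
        case '0': bit = 0; break;
        case '1': bit = 1; break;
        case 'L': bit = 0; pi = 1; break;
        case 'H': bit = 1; pi = 1; break;
        default:  return -1;
        }
        cpu = (cpu << 1) | bit;
    }

    if (pi_active != NULL)
        *pi_active = pi;
    return cpu;
}
```

For example, decode_cpu_label("HL", &pi) yields CPU 2 with pi set to 1, while decode_cpu_label("10", &pi) yields the same CPU with pi set to 0.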

------- Comment From dvhltc@us.ibm.com 2009-06-23 12:14 EDT-------
I have yet to see a failure anywhere other than at the very beginning or the very end (and I think maybe only at the very beginning).  I wonder if it wouldn't be useful to exit the test immediately on failure and collect a vmcore.  Although, just doing this through gdb (userspace) might be adequate.

------- Comment From dvhltc@us.ibm.com 2009-06-23 12:26 EDT-------
I discussed this a bit with tglx, and it turns out we have a couple of problems.

1) pthread_barriers are (of course) based on futexes.  The hb->lock is PI.  So if a low prio thread is the last one to pthread_barrier_wait() it seems likely that a higher prio one will wake, try to get the futex, hit the hb->lock, boost the low prio task, allow it to run long enough to move the ball, boom, we're dead.

2) pthread_barrier_wait() does a FUTEX_WAKE (all), which translates to a thundering herd, so even if 1) wasn't an issue, I think we would still see this problem.

3) 2.6.29-rt has an optimization to do the FUTEX_WAKE outside of the hb->lock, so it is probably worth seeing if we can reproduce there.  We probably can, but it will be harder to do so :-)

We need to rethink the validity of this test, and give some serious thought on how to start it up.  I'm thinking separate pthread_barriers for defense and offense.  I'll see if I can try that out today.
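
To make the startup problem concrete: the futex concerns above come from pthread_barrier_wait(). One alternative to the separate barriers suggested here is a gate that lives entirely in userspace, so starting the game involves no futex, no hb->lock, and no wake-all. The sketch below illustrates that idea only and is not a patch from this bug. All names are hypothetical, it runs at default priorities, and it spins with sched_yield(), which under SCHED_FIFO only yields to threads of the same priority, so a real test would have to make sure the referee can still run:

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_PLAYERS 64

static atomic_int players_ready;   /* players parked at the gate */
static atomic_int game_on;         /* set exactly once by the referee */

/* Player side: announce arrival, then spin in userspace until the
 * referee opens the gate. A real player would start its game loop
 * after the spin. */
static void *player(void *arg)
{
    (void)arg;
    atomic_fetch_add(&players_ready, 1);
    while (!atomic_load(&game_on))
        sched_yield();
    return NULL;
}

/* Referee side: start nplayers threads, wait until all of them are at
 * the gate, then open it. Returns the number of players that reached
 * the gate, or -1 if a thread could not be created. */
int run_start_gate(int nplayers)
{
    pthread_t tid[MAX_PLAYERS];
    int i, started = 0;

    assert(nplayers > 0 && nplayers <= MAX_PLAYERS);
    atomic_store(&players_ready, 0);
    atomic_store(&game_on, 0);

    for (i = 0; i < nplayers; i++) {
        if (pthread_create(&tid[i], NULL, player, NULL) != 0)
            break;
        started++;
    }

    while (atomic_load(&players_ready) < started)
        sched_yield();

    /* A real test would reset the ball here, before the whistle. */
    atomic_store(&game_on, 1);

    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    return started == nplayers ? atomic_load(&players_ready) : -1;
}
```

Because the gate is a plain atomic flag, no thread ever blocks in the kernel on the way into the game, so there is no lock through which a low prio thread could be boosted at kickoff. The cost is busy-waiting during startup.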
Comment 30 IBM Bug Proxy 2009-06-23 17:31:39 EDT
Created attachment 313440 [details]
failure log

------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures with the_ball value 1 in every case,

------- Comment From dvhltc@us.ibm.com 2009-06-22 12:17 EDT-------
>> No, the system should always schedule any runnable higher prio RT task
>> irrespective of how long it has run. If it doesn't that is a bug.

Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to effectively de-prioritize the defense threads, allowing the offense threads to run momentarily?

> Well. In a small wrapper over sched_football to collect sched_switch
> trace, in every test failure I collect it, just after the test. I then  use
> single view tool to visualize the scheduling changes.
> http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

YES!  This is a perfect application of the tool.  Thanks for giving it a shot Gowri and sharing your results!

:-)  I'm pretty sure "SingleView" is part of the website logic, not the name of the tool.  (As it appears in other articles as well).  "sched_switch vcd visualization" is probably the most accurate.

I took a closer look at the vcd data and attached a couple PNG screenshots of the relevant areas so others could get a quick view of this thing works - especially those unlikely to install gtkwave *cough* managers *cough*.  Being preempted by the sirq thread is normal behavior throughout the run of the test (since the sirq is running at higher priority).  What is interesting is, as Gowri said, where offense 13253 is scheduled after the sirq which preempted 13258.  This not-scheduling-the-offence-threads behavior continues for a while, which makes me suspect the 95% rt limit.

------- Comment From gowrishankar.m@in.ibm.com 2009-06-23 08:43 EDT-------
(In reply to comment #43)
> >> No, the system should always schedule any runnable higher prio RT task
> >> irrespective of how long it has run. If it doesn't that is a bug.
>
> Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to
> effectively de-prioritize the defense threads, allowing the offense threads to
> run momentarily?
>

I could not see  sched_rt_period_us, sched_rt_runtime_us like files. So
it seems like 2.6.24 does not support this feature.

<snip>

> (since the sirq is running at higher priority).  What is interesting is, as
> Gowri said, where offense 13253 is scheduled after the sirq which preempted
> 13258.  This not-scheduling-the-offence-threads behavior continues for a while,
> which makes me suspect the 95% rt limit.
>

On more closer look, I could see defense #13258 on cpu #1 vanishing away in few us.
I could also see another defense thread on cpu #3 doing the same. More over, in one
another failure (below) I observed the same pair of threads (1 and 3 vanishing away)
just by the time referee comes up. Are they being killed ? any unhandled signals ??

I think that may be the reason a offense thread gets the chance now to run after referee
as other 3 defense threads are still busy (one of them goes away in few us as I said).

------- Comment From dvhltc@us.ibm.com 2009-06-23 11:03 EDT-------
See sched_football.c, referee():

/* Watch the game */
while ((now.tv_sec - start.tv_sec) < game_length) {
sleep(1);
gettimeofday(&now, NULL);
}
/* Blow the whistle */
printf("Game Over!\n");
final_ball = the_ball;

So we expect the referee to wakeup once a second, check the time, and then go back to sleep.  This could be done with a single timer rather than multiple wake/sleep cycles, but I think the added scheduling is a good thing for this kind of test.

Gowri, can you please provide a screenshot of what you are seeing, or maybe mention the start-stop time range in the vcd file so we can be sure to be looking at the same thing as you?

------- Comment From will_schmidt@vnet.ibm.com 2009-06-23 11:42 EDT-------
(In reply to comment #47)
> Created an attachment (id=46211) [details]
> Startup failure?
>
>  I'm thinking this is all occuring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

In data from the 8-way JS22, I see values between "000" and "111", so that would be a binary representation of the cpu number.

LL appears to come from this bit of code in the ftrace->vcd converter tool:
if (sched_switch[i].to_prio != program[sched_switch[i].to_pid].prio) {
        for (j = 0; j <= nof_bits; j++) {
                fprintf (fpo, "%c",
                         "LH"[(sched_switch[i].to_cpu >> (nof_bits - j)) & 1]);
        }
}

A comment up a little ways states:  "L/H binary encoded cpu number with priority inheritance".

------- Comment From will_schmidt@vnet.ibm.com 2009-06-23 11:58 EDT-------
(In reply to comment #47)

> last defense thread preempts it.  I'm thinking this is all occurring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

I also see an "HL" for sched_football-5629 at 41490us (in sched_football_329), and all of a sudden the notation made sense.
H/L usage indicates there is a prio inheritance condition.  s/H/1/ and s/L/0/ to map back to the processor number.

LL=00, LH=01, HL=10, HH=11.
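
The mapping above can be written as a small decoder. This is only an illustrative sketch; the function name decode_pi_cpu is hypothetical and is not part of the converter tool.

```c
#include <stddef.h>

/*
 * Hypothetical helper: decode an H/L label such as "HL" into a CPU number,
 * treating H as 1 and L as 0 with the most significant bit first.
 * Returns -1 for an empty label or any character other than 'H' or 'L'.
 */
static int decode_pi_cpu(const char *label)
{
        int cpu = 0;

        if (label == NULL || *label == '\0')
                return -1;

        for (; *label != '\0'; label++) {
                if (*label != 'H' && *label != 'L')
                        return -1;
                cpu = (cpu << 1) | (*label == 'H');
        }
        return cpu;
}
```

A plain "01" label (no priority inheritance) is deliberately rejected here, so the return value also tells the caller whether the H/L form was used.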

------- Comment From dvhltc@us.ibm.com 2009-06-23 12:14 EDT-------
I have yet to see a failure anywhere other than at the very beginning or the very end (and I think maybe only at the very beginning).  I wonder if it wouldn't be useful to exit the test immediately on failure and collect a vmcore.  Although, just doing this through gdb (userspace) might be adequate.

------- Comment From dvhltc@us.ibm.com 2009-06-23 12:26 EDT-------
I discussed this a bit with tglx, and it turns out we have a couple of problems.

1) pthread_barriers are (of course) based on futexes.  The hb->lock is PI.  So if a low prio thread is the last one to pthread_barrier_wait() it seems likely that a higher prio one will wake, try to get the futex, hit the hb->lock, boost the low prio task, allow it to run long enough to move the ball, boom, we're dead.

2) pthread_barrier_wait() does a FUTEX_WAKE (all), which translates to a thundering herd, so even if 1) wasn't an issue, I think we would still see this problem.

3) 2.6.29-rt has an optimization to do the FUTEX_WAKE outside of the hb->lock, so it is probably worth seeing if we can reproduce there.  We probably can, but it will be harder to do so :-)

We need to rethink the validity of this test, and give some serious thought on how to start it up.  I'm thinking separate pthread_barriers for defense and offense.  I'll see if I can try that out today.
Comment 31 IBM Bug Proxy 2009-06-23 17:31:47 EDT
Created attachment 349154 [details]
PATCH: add condvar to put game start control in the hands of the ref


------- Comment on attachment From dvhltc@us.ibm.com 2009-06-23 17:21 EDT-------


Gowri, can you give this patch a try on your instrumented kernel and see if you can still reproduce the bug?  If this succeeds, then I think attachment 38143 [details] might be suspect.  See the patch for a description of the possible problem.
Comment 32 IBM Bug Proxy 2009-06-23 21:31:43 EDT
Created attachment 313440 [details]
failure log

On running a fairly large number of iterations of sched_football on the MRG -69 kernel,
a testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On an HS21 box I haven't seen this failure yet (out of 15k iterations).

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

How long does it (did it) take to reproduce it?

I ran 6000+ iterations of sched_football and the failure occurred once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barriers better than
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or were there several openings with varying increment loops?  If the increments
happened right at the beginning of the game, then perhaps I missed something; if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barriers better than
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or were there several openings with varying increment loops?  If the increments
> happened right at the beginning of the game, then perhaps I missed something; if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test



------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures, with the_ball value 1 in every case.

------- Comment From dvhltc@us.ibm.com 2009-06-22 12:17 EDT-------
>> No, the system should always schedule any runnable higher prio RT task
>> irrespective of how long it has run. If it doesn't that is a bug.

Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to effectively de-prioritize the defense threads, allowing the offense threads to run momentarily?

> Well. In a small wrapper over sched_football to collect sched_switch
> trace, in every test failure I collect it, just after the test. I then  use
> single view tool to visualize the scheduling changes.
> http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

YES!  This is a perfect application of the tool.  Thanks for giving it a shot Gowri and sharing your results!

:-)  I'm pretty sure "SingleView" is part of the website logic, not the name of the tool.  (As it appears in other articles as well).  "sched_switch vcd visualization" is probably the most accurate.

I took a closer look at the vcd data and attached a couple of PNG screenshots of the relevant areas so others could get a quick view of how this thing works - especially those unlikely to install gtkwave *cough* managers *cough*.  Being preempted by the sirq thread is normal behavior throughout the run of the test (since the sirq is running at higher priority).  What is interesting is, as Gowri said, where offense 13253 is scheduled after the sirq which preempted 13258.  This not-scheduling-the-offence-threads behavior continues for a while, which makes me suspect the 95% rt limit.

------- Comment From gowrishankar.m@in.ibm.com 2009-06-23 08:43 EDT-------
(In reply to comment #43)
> >> No, the system should always schedule any runnable higher prio RT task
> >> irrespective of how long it has run. If it doesn't that is a bug.
>
> Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to
> effectively de-prioritize the defense threads, allowing the offense threads to
> run momentarily?
>

I could not find the sched_rt_period_us or sched_rt_runtime_us files, so
it seems that 2.6.24 does not support this feature.
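
A quick way to check for the limit programmatically is sketched below. The helper names are hypothetical; the two procfs paths are the standard locations on mainline kernels that implement RT throttling, where the defaults of 950000 and 1000000 us give the 95% figure and a runtime of -1 disables the limit.

```c
#include <stdio.h>

#define RT_RUNTIME_PATH "/proc/sys/kernel/sched_rt_runtime_us"
#define RT_PERIOD_PATH  "/proc/sys/kernel/sched_rt_period_us"

/* Read one integer from a file. Returns 0 on success, -1 otherwise. */
static int read_long(const char *path, long *value)
{
        FILE *f = fopen(path, "r");
        int ok;

        if (f == NULL)
                return -1;
        ok = (fscanf(f, "%ld", value) == 1);
        fclose(f);
        return ok ? 0 : -1;
}

/* Percentage of each period that RT tasks may use, or -1 on bad input. */
static int rt_percent(long runtime_us, long period_us)
{
        if (runtime_us < 0)
                return 100;     /* -1 means throttling is disabled */
        if (period_us <= 0)
                return -1;
        return (int)(((long long)runtime_us * 100) / period_us);
}

/* Returns the RT limit in percent, or -1 if the knobs are not present. */
static int rt_limit_percent(const char *runtime_path, const char *period_path)
{
        long runtime_us, period_us;

        if (read_long(runtime_path, &runtime_us) != 0 ||
            read_long(period_path, &period_us) != 0)
                return -1;
        return rt_percent(runtime_us, period_us);
}
```

Calling rt_limit_percent(RT_RUNTIME_PATH, RT_PERIOD_PATH) on a kernel without these files would return -1, which is the situation described above.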

<snip>

> (since the sirq is running at higher priority).  What is interesting is, as
> Gowri said, where offense 13253 is scheduled after the sirq which preempted
> 13258.  This not-scheduling-the-offence-threads behavior continues for a while,
> which makes me suspect the 95% rt limit.
>

On closer inspection, I could see defense #13258 on cpu #1 vanishing within a few us.
I could also see another defense thread on cpu #3 doing the same. Moreover, in
another failure (below) I observed the same pair of threads (1 and 3) vanishing
just as the referee comes up. Are they being killed? Any unhandled signals?

I think that may be why an offense thread now gets the chance to run after the referee,
as the other 3 defense threads are still busy (one of them goes away within a few us, as I said).

Comment 33 IBM Bug Proxy 2009-06-23 21:31:52 EDT
Created attachment 349178 [details]
PATCH: atomic startup mechanism


------- Comment on attachment From dvhltc@us.ibm.com 2009-06-23 21:03 EDT-------


Gowri, I think this fixes it.  Thanks to John S. for his recommendation for a simplified startup mechanism.  I'm testing on an HS21 (elm9m93).  Can you give this a shot and see if it fixes the problem for you?

Patch description:

The current barrier implementation results in the lowest priority thread
actually starting the game (they are the last to be scheduled to call
pthread_barrier_wait).  This thread likely gets a priority boost as it holds
the hb->lock for the futex associated with the barrier.  This might lead to it
running ahead of the defense threads. 

In fact, any sort of barrier or cond var implementation (short of a pi aware
cond broadcast, which is not yet readily available) will result in a thundering
herd situation when the FUTEX_WAKE_ALL syscall is issued, which can result in
a short run of one or more offense threads while all the threads get to the
RUNNABLE state.

This patch removes the complex starting mechanisms and replaces them with a
simple atomic counter.  All player threads are started and once the
players_ready count reaches the total player count, the referee starts the game
by setting the ball position to zero.
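
The startup scheme described above can be sketched as follows. This is an illustration, not the attached patch: it uses C11 <stdatomic.h> rather than whatever atomic helpers the test suite provides, and the function names player_checkin and referee_kickoff are hypothetical. Only players_ready and the_ball are names taken from this report.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdatomic.h>
#include <time.h>

static atomic_int players_ready;        /* zero-initialised: nobody ready yet */
static atomic_int the_ball;

/* Hypothetical: called once by each offense and defense thread on entry. */
static void player_checkin(void)
{
        atomic_fetch_add(&players_ready, 1);
}

/*
 * Hypothetical: the referee polls until every player has checked in, then
 * starts the game by zeroing the ball. No futex wake-up is involved, so no
 * thread is woken or priority-boosted as part of the start.
 * Returns the number of players observed at kickoff.
 */
static int referee_kickoff(int total_players)
{
        const struct timespec pause = { 0, 100 * 1000 };  /* 100 us */

        while (atomic_load(&players_ready) < total_players)
                nanosleep(&pause, NULL);  /* sleep so lower prio players can run */

        atomic_store(&the_ball, 0);
        return atomic_load(&players_ready);
}
```

Any ball movement by an offense thread before kickoff is discarded by the store of zero, so only movement after all players are running counts against the result.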
Comment 34 IBM Bug Proxy 2009-06-24 06:41:33 EDT
Created attachment 313440 [details]
failure log

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increments happened later in
the game. One reason behind this exercise was to try to narrow down where
the issue is coming from. With the above patch, I got three failures, with final
ball positions 13, 1 and 1. So, clearly, the barriers are not helping. I now need
to look at the system state when the offense threads were able to increment the
ball position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more open-ended runs of this test with some instrumentation, and
for some reason only about 250 iterations completed before the job timed
out :-( Of these 250 iterations, none failed, so I have to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test



------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures, with the_ball value 1 in every case.

------- Comment From dvhltc@us.ibm.com 2009-06-22 12:17 EDT-------
>> No, the system should always schedule any runnable higher prio RT task
>> irrespective of how long it has run. If it doesn't that is a bug.

Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to effectively de-prioritize the defense threads, allowing the offense threads to run momentarily?

> Well. In a small wrapper over sched_football to collect sched_switch
> trace, in every test failure I collect it, just after the test. I then  use
> single view tool to visualize the scheduling changes.
> http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

YES!  This is a perfect application of the tool.  Thanks for giving it a shot Gowri and sharing your results!

:-)  I'm pretty sure "SingleView" is part of the website logic, not the name of the tool.  (As it appears in other articles as well).  "sched_switch vcd visualization" is probably the most accurate.

I took a closer look at the vcd data and attached a couple PNG screenshots of the relevant areas so others could get a quick view of how this thing works - especially those unlikely to install gtkwave *cough* managers *cough*.  Being preempted by the sirq thread is normal behavior throughout the run of the test (since the sirq is running at higher priority).  What is interesting is, as Gowri said, where offense 13253 is scheduled after the sirq which preempted 13258.  This not-scheduling-the-offence-threads behavior continues for a while, which makes me suspect the 95% rt limit.

------- Comment From gowrishankar.m@in.ibm.com 2009-06-23 08:43 EDT-------
(In reply to comment #43)
> >> No, the system should always schedule any runnable higher prio RT task
> >> irrespective of how long it has run. If it doesn't that is a bug.
>
> Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to
> effectively de-prioritize the defense threads, allowing the offense threads to
> run momentarily?
>

I could not find files such as sched_rt_period_us or sched_rt_runtime_us, so
it seems 2.6.24 does not support this feature.

<snip>

> (since the sirq is running at higher priority).  What is interesting is, as
> Gowri said, where offense 13253 is scheduled after the sirq which preempted
> 13258.  This not-scheduling-the-offence-threads behavior continues for a while,
> which makes me suspect the 95% rt limit.
>

On a closer look, I could see defense #13258 on cpu #1 vanishing within a few us.
I could also see another defense thread on cpu #3 doing the same. Moreover, in
another failure (below) I observed the same pair of threads (on cpus 1 and 3) vanishing
just as the referee comes up. Are they being killed? Any unhandled signals?

I think that may be the reason an offense thread now gets the chance to run after the referee,
as the other 3 defense threads are still busy (one of them goes away within a few us, as I said).

------- Comment From dvhltc@us.ibm.com 2009-06-23 11:03 EDT-------
See sched_football.c, referee():

/* Watch the game */
while ((now.tv_sec - start.tv_sec) < game_length) {
    sleep(1);
    gettimeofday(&now, NULL);
}
/* Blow the whistle */
printf("Game Over!\n");
final_ball = the_ball;

So we expect the referee to wakeup once a second, check the time, and then go back to sleep.  This could be done with a single timer rather than multiple wake/sleep cycles, but I think the added scheduling is a good thing for this kind of test.

Gowri, can you please provide a screenshot of what you are seeing, or maybe mention the start-stop time range in the vcd file so we can be sure to be looking at the same thing as you?

------- Comment From will_schmidt@vnet.ibm.com 2009-06-23 11:42 EDT-------
(In reply to comment #47)
> Created an attachment (id=46211) [details]
> Startup failure?
>
>  I'm thinking this is all occurring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

In data from the 8-way  JS22, I see values between "000" and "111", so that would be a binary rep of the cpu number.

LL appears to be coming out of this bit of code in the ftrace->vcd converter tool gadget.
if (sched_switch[i].to_prio != program[sched_switch[i].to_pid].prio) {
    for (j = 0; j <= nof_bits; j++) {
        fprintf (fpo, "%c",
            "LH"[(sched_switch[i].to_cpu >> (nof_bits - j)) & 1]);
    }
}

A comment up a little ways states:  "L/H binary encoded cpu number with priority inheritance".

------- Comment From will_schmidt@vnet.ibm.com 2009-06-23 11:58 EDT-------
(In reply to comment #47)

> last defense thread preempts it.  I'm thinking this is all occurring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

I also see an "HL" for sched_football-5629 at 41490us (in sched_football_329), and all of a sudden the notation made sense.
H/L usage indicates there is a prio inheritance condition.  s/H/1/ and s/L/0/ to map back to the processor number.

LL=00, LH=01, HL=10, HH=11.

------- Comment From dvhltc@us.ibm.com 2009-06-23 12:14 EDT-------
I have yet to see a failure anywhere other than at the very beginning or the very end (and I think maybe only at the very beginning).  I wonder if it wouldn't be useful to exit the test immediately on failure and collect a vmcore.  Although, just doing this through gdb (userspace) might be adequate.

------- Comment From dvhltc@us.ibm.com 2009-06-23 12:26 EDT-------
I discussed this a bit with tglx, and it turns out we have a couple of problems.

1) pthread_barriers are (of course) based on futexes, and the hb->lock is PI.  So if a low prio thread is the last one to call pthread_barrier_wait(), it seems likely that a higher prio one will wake, try to get the futex, hit the hb->lock, and boost the low prio task, allowing it to run long enough to move the ball. Boom, we're dead.

2) pthread_barrier_wait() does a FUTEX_WAKE (all), which translates to a thundering herd, so even if 1) wasn't an issue, I think we would still see this problem.

3) 2.6.29-rt has an optimization to do the FUTEX_WAKE outside of the hb->lock, so it is probably worth seeing if we can reproduce there.  We probably can, but it will be harder to do so :-)

We need to rethink the validity of this test, and give some serious thought on how to start it up.  I'm thinking separate pthread_barriers for defense and offense.  I'll see if I can try that out today.

------- Comment From gowrishankar.m@in.ibm.com 2009-06-24 06:30 EDT-------
Tested with the instrumented kernel I used last time (2.6.24.7-117.el5rttrace):
3000 runs succeeded without any failure on the LS21.
Comment 35 IBM Bug Proxy 2009-06-24 14:52:05 EDT
Created attachment 313440 [details]
failure log

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test
=Comment: #0=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:38 EDT
Problem description:

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barries better than the
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several opening with varying increment loops?  It the increments
> happened right at the beginning of the game, then perhaps I missed something, if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test
=Comment: #0=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:38 EDT
Problem description:

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barriers better than
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several openings with varying increment loops?  If the increments
happened right at the beginning of the game, then perhaps I missed something; if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barriers better than
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or is it several openings with varying increment loops?  If the increments
> happened right at the beginning of the game, then perhaps I missed something; if
> it happened much later then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. One reason behind this exercise was to try to narrow down where
the issue is coming from. With the above patch, I got 3 failures, with
ball positions 13, 1, 1. So, clearly, the barriers are not helping. I now need to
look at the system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation, and
for some reason only about 250 iterations completed, after which the job timed
out :-( And of these 250 iterations, I got no failures, so I have to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test



------- Comment From kirpraka@in.ibm.com 2009-06-03 08:09 EDT-------
Trying to recreate this bug with the latest MRG kernel.

------- Comment From kirpraka@in.ibm.com 2009-06-03 11:28 EDT-------
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures, with the_ball value 1 in every case.

------- Comment From dvhltc@us.ibm.com 2009-06-22 12:17 EDT-------
>> No, the system should always schedule any runnable higher prio RT task
>> irrespective of how long it has run. If it doesn't that is a bug.

Hrm... are we hitting the 95% maximum utilization, causing the scheduler to effectively de-prioritize the defense threads and allowing the offense threads to run momentarily?

> Well. In a small wrapper over sched_football to collect sched_switch
> trace, in every test failure I collect it, just after the test. I then  use
> single view tool to visualize the scheduling changes.
> http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

YES!  This is a perfect application of the tool.  Thanks for giving it a shot Gowri and sharing your results!

:-)  I'm pretty sure "SingleView" is part of the website logic, not the name of the tool.  (As it appears in other articles as well).  "sched_switch vcd visualization" is probably the most accurate.

I took a closer look at the vcd data and attached a couple of PNG screenshots of the relevant areas so others could get a quick view of how this thing works - especially those unlikely to install gtkwave *cough* managers *cough*.  Being preempted by the sirq thread is normal behavior throughout the run of the test (since the sirq is running at higher priority).  What is interesting is, as Gowri said, where offense 13253 is scheduled after the sirq which preempted 13258.  This not-scheduling-the-offence-threads behavior continues for a while, which makes me suspect the 95% rt limit.

------- Comment From gowrishankar.m@in.ibm.com 2009-06-23 08:43 EDT-------
(In reply to comment #43)
> >> No, the system should always schedule any runnable higher prio RT task
> >> irrespective of how long it has run. If it doesn't that is a bug.
>
> Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to
> effectively de-prioritize the defense threads, allowing the offense threads to
> run momentarily?
>

I could not find files such as sched_rt_period_us or sched_rt_runtime_us, so
it seems that 2.6.24 does not support this feature.

<snip>

> (since the sirq is running at higher priority).  What is interesting is, as
> Gowri said, where offense 13253 is scheduled after the sirq which preempted
> 13258.  This not-scheduling-the-offence-threads behavior continues for a while,
> which makes me suspect the 95% rt limit.
>

On closer look, I could see defense #13258 on cpu #1 vanishing within a few us.
I could also see another defense thread on cpu #3 doing the same. Moreover, in
another failure (below) I observed the same pair of threads (1 and 3) vanishing
just as the referee comes up. Are they being killed? Any unhandled signals?

I think that may be the reason an offense thread now gets the chance to run after the referee,
as the other 3 defense threads are still busy (one of them goes away in a few us, as I said).

------- Comment From dvhltc@us.ibm.com 2009-06-23 11:03 EDT-------
See sched_football.c, referee():

/* Watch the game */
while ((now.tv_sec - start.tv_sec) < game_length) {
    sleep(1);
    gettimeofday(&now, NULL);
}
/* Blow the whistle */
printf("Game Over!\n");
final_ball = the_ball;

So we expect the referee to wake up once a second, check the time, and then go back to sleep.  This could be done with a single timer rather than multiple wake/sleep cycles, but I think the added scheduling is a good thing for this kind of test.

Gowri, can you please provide a screenshot of what you are seeing, or maybe mention the start-stop time range in the vcd file so we can be sure to be looking at the same thing as you?

------- Comment From will_schmidt@vnet.ibm.com 2009-06-23 11:42 EDT-------
(In reply to comment #47)
> Created an attachment (id=46211) [details]
> Startup failure?
>
>  I'm thinking this is all occurring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

In data from the 8-way  JS22, I see values between "000" and "111", so that would be a binary rep of the cpu number.

LL appears to be coming out of this bit of code in the ftrace->vcd converter tool gadget.
if (sched_switch[i].to_prio != program[sched_switch[i].to_pid].prio) {
    for (j = 0; j <= nof_bits; j++) {
        fprintf (fpo, "%c",
                 "LH"[(sched_switch[i].to_cpu >> (nof_bits - j)) & 1]);
    }
}

A comment up a little ways states:  "L/H binary encoded cpu number with priority inheritance".

------- Comment From will_schmidt@vnet.ibm.com 2009-06-23 11:58 EDT-------
(In reply to comment #47)

> last defense thread preempts it.  I'm thinking this is all occurring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

I also see a "HL" for sched_football-5629 at 41490us (in sched_football_329), and all of a sudden the notation made sense.
H/L usage indicates there is a prio inheritance condition.  s/H/1/ and s/L/0/ to map back to the processor number.

LL=00, LH=01, HL=10, HH=11.

------- Comment From dvhltc@us.ibm.com 2009-06-23 12:14 EDT-------
I have yet to see a failure anywhere other than at the very beginning or the very end (and I think maybe only at the very beginning).  I wonder if it wouldn't be useful to exit the test immediately on failure and collect a vmcore.  Although, just doing this through gdb (userspace) might be adequate.

------- Comment From dvhltc@us.ibm.com 2009-06-23 12:26 EDT-------
I discussed this a bit with tglx, and it turns out we have a couple of problems.

1) pthread_barriers are (of course) based on futexes, and the hb->lock is PI.  So if a low prio thread is the last one to call pthread_barrier_wait(), it seems likely that a higher prio one will wake, try to get the futex, hit the hb->lock, and boost the low prio task, allowing it to run long enough to move the ball. Boom, we're dead.

2) pthread_barrier_wait() does a FUTEX_WAKE (all), which translates to a thundering herd, so even if 1) wasn't an issue, I think we would still see this problem.

3) 2.6.29-rt has an optimization to do the FUTEX_WAKE outside of the hb->lock, so it is probably worth seeing if we can reproduce there.  We probably can, but it will be harder to do so :-)

We need to rethink the validity of this test, and give some serious thought on how to start it up.  I'm thinking separate pthread_barriers for defense and offense.  I'll see if I can try that out today.

------- Comment From gowrishankar.m@in.ibm.com 2009-06-24 06:30 EDT-------
Tested with the instrumented kernel I used last time, 2.6.24.7-117.el5rttrace,
and 3000 runs succeeded without any failure on the LS21.

------- Comment From dvhltc@us.ibm.com 2009-06-24 14:44 EDT-------
Just completed 10495 successful runs on an HS21.  Marking bug as fixed by IBM and submitting atomic start patch to LTP.  Gowri, please ack there if you are happy with the fix.
Comment 36 IBM Bug Proxy 2009-07-15 02:01:49 EDT
Created attachment 313440 [details]
failure log

Trying to recreate this bug with the latest MRG kernel.
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures with the_ball value 1 in every case,
>> No, the system should always schedule any runnable higher prio RT task
>> irrespective of how long it has run. If it doesn't that is a bug.

Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to effectively de-prioritize the defense threads, allowing the offense threads to run momentarily?

> Well. In a small wrapper over sched_football to collect sched_switch
> trace, in every test failure I collect it, just after the test. I then  use
> single view tool to visualize the scheduling changes.
> http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

YES!  This is a perfect application of the tool.  Thanks for giving it a shot Gowri and sharing your results!

:-)  I'm pretty sure "SingleView" is part of the website logic, not the name of the tool.  (As it appears in other articles as well).  "sched_switch vcd visualization" is probably the most accurate.

I took a closer look at the vcd data and attached a couple PNG screenshots of the relevant areas so others could get a quick view of this thing works - especially those unlikely to install gtkwave *cough* managers *cough*.  Being preempted by the sirq thread is normal behavior throughout the run of the test (since the sirq is running at higher priority).  What is interesting is, as Gowri said, where offense 13253 is scheduled after the sirq which preempted 13258.  This not-scheduling-the-offence-threads behavior continues for a while, which makes me suspect the 95% rt limit.
(In reply to comment #43)
> >> No, the system should always schedule any runnable higher prio RT task
> >> irrespective of how long it has run. If it doesn't that is a bug.
>
> Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to
> effectively de-prioritize the defense threads, allowing the offense threads to
> run momentarily?
>

I could not find files such as sched_rt_period_us or sched_rt_runtime_us, so
it seems 2.6.24 does not support this feature.
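
For anyone who wants to repeat that check on another kernel, here is a minimal sketch in C. It assumes the two knobs, where present, live under /proc/sys/kernel/, and that a runtime of -1 means throttling is disabled. The helper names read_long_from_file() and rt_budget_percent() are made up for this illustration and are not part of sched_football.

```c
#include <stdio.h>

/* Hypothetical helper: read one integer from a sysctl-style file.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
int read_long_from_file(const char *path, long *out)
{
    FILE *fp = fopen(path, "r");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%ld", out) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}

/* Hypothetical helper: share of each period that RT tasks may use,
 * in percent. A negative runtime is treated as "no throttling" (100%).
 * Returns -1.0 when the period is not a positive number. */
double rt_budget_percent(long runtime_us, long period_us)
{
    if (period_us <= 0)
        return -1.0;
    if (runtime_us < 0)
        return 100.0;
    return 100.0 * (double)runtime_us / (double)period_us;
}
```

A caller would read /proc/sys/kernel/sched_rt_runtime_us and /proc/sys/kernel/sched_rt_period_us with read_long_from_file(); if either call returns -1, the kernel most likely has no RT throttling knob at all, which is what was observed on 2.6.24 above. A runtime of 950000 with a period of 1000000 gives the 95% limit suspected earlier.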

<snip>

> (since the sirq is running at higher priority).  What is interesting is, as
> Gowri said, where offense 13253 is scheduled after the sirq which preempted
> 13258.  This not-scheduling-the-offence-threads behavior continues for a while,
> which makes me suspect the 95% rt limit.
>

On closer inspection, I could see defense #13258 on cpu #1 vanishing within a few us.
I could also see another defense thread on cpu #3 doing the same. Moreover, in
another failure (below) I observed the same pair of threads (1 and 3) vanishing
just as the referee comes up. Are they being killed? Any unhandled signals?

I think that may be why an offense thread now gets a chance to run after the referee,
as the other 3 defense threads are still busy (one of them goes away within a few us, as I said).
See sched_football.c, referee():

    /* Watch the game */
    while ((now.tv_sec - start.tv_sec) < game_length) {
        sleep(1);
        gettimeofday(&now, NULL);
    }
    /* Blow the whistle */
    printf("Game Over!\n");
    final_ball = the_ball;

So we expect the referee to wakeup once a second, check the time, and then go back to sleep.  This could be done with a single timer rather than multiple wake/sleep cycles, but I think the added scheduling is a good thing for this kind of test.
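
For comparison, the single-timer variant mentioned above could look roughly like the sketch below. This is not what sched_football.c does; it is only an illustration of one absolute-deadline sleep replacing the once-a-second wakeups. The function name sleep_until_game_over() and the millisecond parameter are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Hypothetical alternative to the referee's sleep(1) loop: compute one
 * absolute deadline on CLOCK_MONOTONIC and sleep until it passes.
 * Returns 0 on success or an errno-style error number. */
int sleep_until_game_over(long game_length_ms)
{
    struct timespec deadline;
    int rc;

    if (game_length_ms < 0)
        return EINVAL;
    if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
        return errno;

    deadline.tv_sec += game_length_ms / 1000;
    deadline.tv_nsec += (game_length_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    /* clock_nanosleep() returns the error number directly; restart the
     * sleep against the same absolute deadline if a signal interrupts it. */
    do {
        rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    } while (rc == EINTR);
    return rc;
}
```

The trade-off is the one already noted: a single sleep is simpler, but the periodic wakeups of the existing loop exercise the scheduler more, which is arguably desirable in this test.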

Gowri, can you please provide a screenshot of what you are seeing, or maybe mention the start-stop time range in the vcd file so we can be sure to be looking at the same thing as you?
(In reply to comment #47)
> Created an attachment (id=46211) [details]
> Startup failure?
>
>  I'm thinking this is all occuring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

In data from the 8-way  JS22, I see values between "000" and "111", so that would be a binary rep of the cpu number.

LL appears to be coming out of this bit of code in the ftrace->vcd converter tool gadget.
    if (sched_switch[i].to_prio != program[sched_switch[i].to_pid].prio) {
        for (j = 0; j <= nof_bits; j++) {
            fprintf (fpo, "%c",
                "LH"[(sched_switch[i].to_cpu >> (nof_bits - j)) & 1]);
        }
    }

A comment up a little ways states:  "L/H binary encoded cpu number with priority inheritance".
(In reply to comment #47)

> last defense thread preempts it.  I'm thinking this is all occuring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

I also see an "HL" for sched_football-5629 at 41490us (in sched_football_329), and all of a sudden the notation made sense.
H/L usage indicates there is a prio inheritance condition.  s/H/1/ and s/L/0/ to map back to the processor number.

LL=00, LH=01, HL=10, HH=11.
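
To make the mapping concrete, here is a small decoder sketch in C. It is not taken from the converter tool; decode_cpu_label() is a made-up name, and treating any L/H character as "priority differs from the task's base priority" follows the reading of the converter snippet given above.

```c
#include <stddef.h>

/* Hypothetical decoder for the per-task labels seen in the vcd view.
 * "0"/"1" digits encode the cpu number in binary; "L"/"H" encode the
 * same bits but mark the switch as happening at a non-base priority.
 * Returns the cpu number, or -1 for an empty or malformed label.
 * If pi_active is non-NULL it is set to 1 when L/H was used, else 0. */
int decode_cpu_label(const char *label, int *pi_active)
{
    int cpu = 0;
    int pi = 0;
    size_t i;

    if (label == NULL || label[0] == '\0')
        return -1;

    for (i = 0; label[i] != '\0'; i++) {
        int bit;

        switch (label[i]) {
        case '0': bit = 0; break;
        case '1': bit = 1; break;
        case 'L': bit = 0; pi = 1; break;
        case 'H': bit = 1; pi = 1; break;
        default:  return -1;
        }
        cpu = (cpu << 1) | bit;
    }

    if (pi_active != NULL)
        *pi_active = pi;
    return cpu;
}
```

With this, "HL" decodes to cpu 2 with the inheritance flag set, while "10" decodes to cpu 2 without it, and "111" from the 8-way box decodes to cpu 7.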
I have yet to see a failure anywhere other than at the very beginning or the very end (and I think maybe only at the very beginning).  I wonder if it wouldn't be useful to exit the test immediately on failure and collect a vmcore.  Although, just doing this through gdb (userspace) might be adequate.
I discussed this a bit with tglx, and it turns out we have a couple of problems.

1) pthread_barriers are (of course) based on futexes.  the hb->lock is PI.  So if a low prio thread is the last one to pthread_barrier_wait() it seems likely that a higher prio one will wake, try to get the futex, hit the hb->lock, boost the low prio task, allow it to run long enough to move the ball, boom, we're dead.

2) pthread_barrier_wait() does a FUTEX_WAKE (all), which translates to a thundering herd, so even if 1) wasn't an issue, I think we would still see this problem.

3) 2.6.29-rt has an optimization to do the FUTEX_WAKE outside of the hb->lock, so it is probably worth seeing if we can reproduce there.  We probably can, but it will be harder to do so :-)

We need to rethink the validity of this test, and give some serious thought on how to start it up.  I'm thinking separate pthread_barriers for defense and offense.  I'll see if I can try that out today.
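
One way to keep futexes out of the start handshake entirely is to let the threads check in through an atomic counter and wait on an atomic flag. The sketch below only illustrates that idea; it is not the patch attached to this bug. It uses C11 atomics, runs the threads under the default scheduling policy (no SCHED_FIFO priorities are set, so it needs no privileges), and every name in it (players_ready, kickoff, run_kickoff_demo) is an assumption made for the example.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_PLAYERS 64

static atomic_int players_ready;  /* players that have checked in */
static atomic_int kickoff;        /* set to 1 by the referee to start play */

/* Each player checks in, then polls the flag. The handshake itself uses
 * only atomic operations, so there is no lock whose owner could be
 * priority-boosted while the game is being started. */
static void *player(void *arg)
{
    (void)arg;
    atomic_fetch_add(&players_ready, 1);
    while (!atomic_load(&kickoff))
        sched_yield();
    return NULL;
}

/* Referee side: create the players, wait until all have checked in,
 * then release them at once. Returns the number of players seen ready
 * at kickoff, or -1 on error. */
int run_kickoff_demo(int nplayers)
{
    pthread_t tid[MAX_PLAYERS];
    int i, seen;

    if (nplayers < 1 || nplayers > MAX_PLAYERS)
        return -1;

    atomic_store(&players_ready, 0);
    atomic_store(&kickoff, 0);

    for (i = 0; i < nplayers; i++) {
        if (pthread_create(&tid[i], NULL, player, NULL) != 0) {
            /* Release and reap the threads already started. */
            atomic_store(&kickoff, 1);
            while (i-- > 0)
                pthread_join(tid[i], NULL);
            return -1;
        }
    }

    while ((seen = atomic_load(&players_ready)) < nplayers)
        sched_yield();

    atomic_store(&kickoff, 1);
    for (i = 0; i < nplayers; i++)
        pthread_join(tid[i], NULL);
    return seen;
}
```

In the real test the referee would reset the_ball just before setting the flag. Under SCHED_FIFO the polling loops would need more care than shown here, since a spinning high-priority thread can keep lower-priority threads on the same cpu from ever checking in.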
Tested with the instrumented kernel I used last time, 2.6.24.7-117.el5rttrace,
and 3000 runs succeeded without any failure on the LS21.
Just completed 10495 successful runs on an HS21.  Marking bug as fixed by IBM and submitting atomic start patch to LTP.  Gowri, please ack there if you are happy with the fix.

------- Comment From sripathik@in.ibm.com 2009-07-15 01:59 EDT-------
Darren, was this patch accepted into LTP?
Comment 37 IBM Bug Proxy 2009-07-15 13:04:01 EDT
Created attachment 313440 [details]
failure log

On running a fairly large number of iterations of sched_football on the MRG -69 kernel,
a testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On the HS21 box I haven't seen this failure yet (out of 15k iterations).

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran 6000+ iterations of sched_football and the failure occurred once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barriers better than
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or was it several openings with varying increment loops?  If the increments
happened right at the beginning of the game, then perhaps I missed something; if
they happened much later, then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, took a look at the patch.  While I like the barriers better than
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or was it several openings with varying increment loops?  If the increments
> happened right at the beginning of the game, then perhaps I missed something; if
> they happened much later, then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increment happened later into
the game. So, one reason behind this exercise was to try and narrow down where
the issue is coming from. So with the above patch, I got failure 3 times with
ball position 13, 1, 1. So, clearly, the barriers are not helping. Got to now
look at system state when the offense threads were able to increment the ball
position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation..and
for some reason only about 250 iterations completed after which the job timed
out :-( And of these 250 iterations, I got no failures..got to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test

------- Comment From dvhltc@us.ibm.com 2009-07-15 12:52 EDT-------
Yes, the cvs version of ltp contains the atomic startup mechanism patch.
Comment 38 IBM Bug Proxy 2009-07-16 03:03:17 EDT
Created attachment 313440 [details]
failure log

on running failrly large number of iterations of sched_football on MRG -69 kernel,
testcase failure is seen once.

The final ball position, which should be zero, is 495 in one iteration on LS21
machine.

On HS21 box I din't see this failure yet ( out of 15k iterations )

$uname -a
Linux elm3c28 2.6.24.7-69.el5rt #1 SMP PREEMPT RT Wed Jun 25 16:59:05 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux

Machine : LS21

how long does it (did it) take to reproduce it?

I ran a 6000+ iterations of sched_football and failure aoccured once.

Final ball position : 495

Is the system (not just the application) hung? No. System continues to be up and
running.
=Comment: #1=================================================
Sudhanshu Singh <sudhanshusingh@in.ibm.com> - 2008-07-16 03:46 EDT

failure log

=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-31 03:44 EDT
Running modified sched_football in a loop on llm54.
=Comment: #4=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 00:58 EDT
So I ran close to 11937 iterations of the testcase (before the job timed out).
Again, I hit one failure:

--- Running testcase sched_football  ---
Thu Jul 31 05:59:07 EDT 2008
Logging to
/test/ankita/tests/internal/func/ltp/ltp/testcases/realtime/logs/llm54-x86_64-2.6.24.7-74ibmrt2.5-2008-31-07-sched_football.log
jvmsim disabled
Running with: players_per_team=4 game_length=5
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (5 seconds)!
Game Over!
Final ball position: 13

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-01 01:56 EDT
First trying to rule out issues with the testcase, if any. Now running large
iterations with a modified patch.
=Comment: #7=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 2008-08-01 12:26 EDT
Ankita, took a look at the patch.  While I like the barries better than the
relying on spinning on defense_count, I don't see an opening for the offense
threads to move the ball after the referee thread resets the ball position.
Have you taken a look to see how far into the game the offense thread was able
to increment the ball position, and was it only one opening with 13 increments,
or is it several opening with varying increment loops?  It the increments
happened right at the beginning of the game, then perhaps I missed something, if
it happened much later then the barriers certainly won't make any difference.
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-04 00:31 EDT
(In reply to comment #7)
> Ankita, I took a look at the patch.  While I like the barriers better than
> relying on spinning on defense_count, I don't see an opening for the offense
> threads to move the ball after the referee thread resets the ball position.
> Have you taken a look to see how far into the game the offense thread was able
> to increment the ball position, and was it only one opening with 13 increments,
> or were there several openings with varying increment loops?  If the increments
> happened right at the beginning of the game, then perhaps I missed something; if
> they happened much later, then the barriers certainly won't make any difference.

Darren, I agree that barriers will not help if the increments happened later in
the game. One reason behind this exercise was to try to narrow down where
the issue is coming from. With the above patch, I got 3 failures, with final
ball positions 13, 1 and 1, so clearly the barriers are not helping. I now need
to look at the system state when the offense threads were able to increment the
ball position.

=Comment: #10=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-08-05 01:51 EDT
I had kicked off more infinite runs of this test with some instrumentation, but
for some reason only about 250 iterations completed before the job timed
out :-( None of these 250 iterations failed, so I have to start again.
Created an attachment (id=313440)
failure log
Created an attachment (id=313441)
Fix synchronization in the test


Comment 39 IBM Bug Proxy 2009-09-14 08:01:34 EDT
Trying to recreate this bug with the latest MRG kernel.
I am currently running an infinite loop of sched_football on the MRG kernel 2.6.24.7-111.el5rt.
With 2166 iterations completed, I have observed 17 failures with the_ball value 1 in every case,
>> No, the system should always schedule any runnable higher prio RT task
>> irrespective of how long it has run. If it doesn't that is a bug.

Hrm... are we hitting the 95% maximum RT utilization, causing the scheduler to effectively de-prioritize the defense threads and let the offense threads run momentarily?

> Well. In a small wrapper over sched_football to collect sched_switch
> trace, in every test failure I collect it, just after the test. I then  use
> single view tool to visualize the scheduling changes.
> http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

YES!  This is a perfect application of the tool.  Thanks for giving it a shot Gowri and sharing your results!

:-)  I'm pretty sure "SingleView" is part of the website logic, not the name of the tool.  (As it appears in other articles as well).  "sched_switch vcd visualization" is probably the most accurate.

I took a closer look at the vcd data and attached a couple of PNG screenshots of the relevant areas so others could get a quick view of how this thing works - especially those unlikely to install gtkwave *cough* managers *cough*.  Being preempted by the sirq thread is normal behavior throughout the run of the test (since the sirq is running at higher priority).  What is interesting is, as Gowri said, where offense 13253 is scheduled after the sirq which preempted 13258.  This not-scheduling-the-offense-threads behavior continues for a while, which makes me suspect the 95% rt limit.
(In reply to comment #43)
> >> No, the system should always schedule any runnable higher prio RT task
> >> irrespective of how long it has run. If it doesn't that is a bug.
>
> Hrm... are we hitting the 95% maximum utilization?  Causing the scheduler to
> effectively de-prioritize the defense threads, allowing the offense threads to
> run momentarily?
>

I could not find files such as sched_rt_period_us or sched_rt_runtime_us, so
it seems that 2.6.24 does not support this feature.
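
For anyone repeating this check on a kernel that may have RT throttling, here is a minimal sketch of how the limit could be read and turned into a percentage. It assumes the knobs live under /proc/sys/kernel/ and that a missing file means the feature is absent, as observed above. The function names are hypothetical and not part of sched_football. On kernels that ship the feature, the usual defaults of 950000 and 1000000 us give the 95% figure discussed here.

```c
#include <stdio.h>

/* Share of each period that RT tasks may consume, in percent.
 * A negative runtime means throttling is disabled, reported as 100.
 * Returns -1 for a nonsensical period. */
int rt_share_percent(long runtime_us, long period_us)
{
    if (runtime_us < 0)
        return 100;
    if (period_us <= 0)
        return -1;
    return (int)(runtime_us * 100 / period_us);
}

/* Read one integer from a file. Returns 0 on success, -1 if the file is
 * missing or unreadable (for example, a kernel without RT throttling). */
int read_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Returns the RT share in percent, or -1 if the knobs are not present. */
int rt_throttle_percent(void)
{
    long runtime, period;

    if (read_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime) != 0 ||
        read_long("/proc/sys/kernel/sched_rt_period_us", &period) != 0)
        return -1;
    return rt_share_percent(runtime, period);
}
```

A result of -1 from rt_throttle_percent() would match what was seen on 2.6.24: no knobs, so no throttling to blame.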

<snip>

> (since the sirq is running at higher priority).  What is interesting is, as
> Gowri said, where offense 13253 is scheduled after the sirq which preempted
> 13258.  This not-scheduling-the-offence-threads behavior continues for a while,
> which makes me suspect the 95% rt limit.
>

On closer inspection, I could see defense #13258 on cpu #1 vanishing within a few us.
I could also see another defense thread on cpu #3 doing the same. Moreover, in
another failure (below) I observed the same pair of threads (on cpus 1 and 3) vanishing
just as the referee comes up. Are they being killed? Any unhandled signals?

I think that may be the reason an offense thread now gets the chance to run after the referee,
as the other 3 defense threads are still busy (one of them goes away in a few us, as I said).
See sched_football.c, referee():

    /* Watch the game */
    while ((now.tv_sec - start.tv_sec) < game_length) {
        sleep(1);
        gettimeofday(&now, NULL);
    }
    /* Blow the whistle */
    printf("Game Over!\n");
    final_ball = the_ball;

So we expect the referee to wake up once a second, check the time, and then go back to sleep.  This could be done with a single timer rather than multiple wake/sleep cycles, but I think the added scheduling is a good thing for this kind of test.
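
For comparison, the single-timer alternative mentioned above might look roughly like the sketch below. It is not what sched_football does. The function name wait_for_final_whistle is hypothetical, and it uses CLOCK_MONOTONIC with an absolute deadline rather than the gettimeofday() polling in the test.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Sleep once until an absolute CLOCK_MONOTONIC deadline that lies
 * game_length_ms in the future, restarting the sleep if a signal
 * interrupts it. Returns 0 on success or an errno-style error number. */
int wait_for_final_whistle(long game_length_ms)
{
    struct timespec deadline;
    int err;

    if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
        return errno;

    deadline.tv_sec += game_length_ms / 1000;
    deadline.tv_nsec += (game_length_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    /* clock_nanosleep() returns the error number directly, not via errno. */
    do {
        err = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    } while (err == EINTR);

    return err;
}
```

With this shape the referee is scheduled only once, at the end of the game, so it would exercise the scheduler less than the once-a-second loop does.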

Gowri, can you please provide a screenshot of what you are seeing, or maybe mention the start-stop time range in the vcd file so we can be sure to be looking at the same thing as you?
(In reply to comment #47)
> Created an attachment (id=46211) [details]
> Startup failure?
>
>  I'm thinking this is all occurring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

In data from the 8-way JS22, I see values between "000" and "111", so that would be a binary representation of the cpu number.

LL appears to be coming out of this bit of code in the ftrace->vcd converter tool gadget.
    if (sched_switch[i].to_prio != program[sched_switch[i].to_pid].prio) {
        for (j = 0; j <= nof_bits; j++) {
            fprintf (fpo, "%c",
                "LH"[(sched_switch[i].to_cpu >> (nof_bits - j)) & 1]);
        }
    }

A comment up a little ways states:  "L/H binary encoded cpu number with priority inheritance".
(In reply to comment #47)

> last defense thread preempts it.  I'm thinking this is all occurring on CPU 0
> (is that what 00 means in the display?) - but I can't explain what LL means in
> the referee bar...  no explanation yet.

I also see a "HL" for sched_football-5629 at 41490us (in sched_football_329), and all of a sudden the notation made sense.
H/L usage indicates there is a prio inheritance condition.  s/H/1/ and s/L/0/ to map back to the processor number.

LL=00, LH=01, HL=10, HH=11.
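
The mapping can be written down as a small decoder. This is only a sketch of the notation as described in this comment. The function name decode_cpu_label is made up here and is not part of the converter tool.

```c
#include <stddef.h>

/* Decode a cpu label from the vcd view. Digits '0'/'1' are the plain
 * binary cpu number. 'L'/'H' carry the same bits but indicate a
 * priority inheritance condition. Returns the cpu number, or -1 on an
 * empty or malformed label. If pi_active is non-NULL it is set to 1
 * when any L/H character was seen, else 0. */
int decode_cpu_label(const char *label, int *pi_active)
{
    int cpu = 0;
    int pi = 0;

    if (label == NULL || *label == '\0')
        return -1;

    for (const char *p = label; *p != '\0'; p++) {
        int bit;

        switch (*p) {
        case '0': bit = 0; break;
        case '1': bit = 1; break;
        case 'L': bit = 0; pi = 1; break;
        case 'H': bit = 1; pi = 1; break;
        default:  return -1;
        }
        cpu = (cpu << 1) | bit;
    }

    if (pi_active != NULL)
        *pi_active = pi;
    return cpu;
}
```

For example, decode_cpu_label("HL", &pi) yields cpu 2 with pi set to 1, while "111" from the 8-way data yields cpu 7 with pi set to 0.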
I have yet to see a failure anywhere other than at the very beginning or the very end (and I think maybe only at the very beginning).  I wonder if it wouldn't be useful to exit the test immediately on failure and collect a vmcore.  Although, just doing this through gdb (userspace) might be adequate.
I discussed this a bit with tglx, and it turns out we have a couple of problems.

1) pthread_barriers are (of course) based on futexes, and the hb->lock is PI.  So if a low prio thread is the last one to call pthread_barrier_wait(), it seems likely that a higher prio one will wake, try to get the futex, hit the hb->lock, and boost the low prio task, allowing it to run long enough to move the ball. Boom, we're dead.

2) pthread_barrier_wait() does a FUTEX_WAKE (all), which translates to a thundering herd, so even if 1) wasn't an issue, I think we would still see this problem.

3) 2.6.29-rt has an optimization to do the FUTEX_WAKE outside of the hb->lock, so it is probably worth seeing if we can reproduce there.  We probably can, but it will be harder to do so :-)

We need to rethink the validity of this test, and give some serious thought on how to start it up.  I'm thinking separate pthread_barriers for defense and offense.  I'll see if I can try that out today.
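
One barrier-free way to start the game is sketched below, purely as an illustration: each player announces itself with an atomic increment and the starter polls the counter, so no futex wake or hb->lock is involved in startup. This is not the attached patch. All names are hypothetical, and the scheduling policy, the priorities and the ball itself are left out so that the sketch runs unprivileged.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define PLAYERS_PER_TEAM 4

static atomic_int players_ready;  /* players that have reached their loop */
static atomic_int game_over;      /* set by the referee to end the game */

/* Hypothetical player body: check in with a plain atomic increment,
 * then stay in the game loop until the referee ends the game. */
static void *player(void *arg)
{
    (void)arg;
    atomic_fetch_add(&players_ready, 1);
    while (!atomic_load(&game_over))
        sched_yield();
    return NULL;
}

/* Create up to n players and poll until each created one has checked in.
 * Returns the number of threads actually created. */
static int start_team(pthread_t *team, int n)
{
    int before = atomic_load(&players_ready);
    int created = 0;

    while (created < n &&
           pthread_create(&team[created], NULL, player, NULL) == 0)
        created++;
    while (atomic_load(&players_ready) < before + created)
        sched_yield();
    return created;
}

/* Referee-controlled startup: offense first, then defense. Only after
 * both teams have checked in would the ball be reset and the game run.
 * Returns the number of players that checked in. */
int run_startup(void)
{
    pthread_t offense[PLAYERS_PER_TEAM], defense[PLAYERS_PER_TEAM];
    int n_off, n_def, ready;

    atomic_store(&players_ready, 0);
    atomic_store(&game_over, 0);

    n_off = start_team(offense, PLAYERS_PER_TEAM);
    n_def = start_team(defense, PLAYERS_PER_TEAM);
    ready = atomic_load(&players_ready);

    /* A real test would reset the ball and let the game run here. */

    atomic_store(&game_over, 1);
    for (int i = 0; i < n_off; i++)
        pthread_join(offense[i], NULL);
    for (int i = 0; i < n_def; i++)
        pthread_join(defense[i], NULL);
    return ready;
}
```

Under SCHED_FIFO the polling loops would need to sleep rather than yield, since a yielding high prio starter would never let lower prio players check in.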
Tested with the instrumented kernel I used last time, 2.6.24.7-117.el5rttrace,
and 3000 runs succeeded without any failure on the LS21.
Just completed 10495 successful runs on an HS21.  Marking bug as fixed by IBM and submitting atomic start patch to LTP.  Gowri, please ack there if you are happy with the fix.

------- Comment From sripathik@in.ibm.com 2009-07-15 01:59 EDT-------
Darren, was this patch accepted into LTP?

------- Comment From dvhltc@us.ibm.com 2009-07-15 12:52 EDT-------
Yes, the cvs version of ltp contains the atomic startup mechanism patch.

------- Comment From sripathik@in.ibm.com 2009-07-16 02:51 EDT-------
Note to RH: As seen in the last few comments, we found that the cause of this problem was in the test case. We have got the fix accepted in LTP. We are closing this bug on our side.
Comment 40 Clark Williams 2009-10-06 17:10:22 EDT
closing as well
