Bug 447871 - prio-wake testcase failures
Summary: prio-wake testcase failures
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: beta
Hardware: x86_64
OS: All
Priority: low
Severity: high
Target Milestone: ---
Assignee: Red Hat Real Time Maintenance
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-05-22 08:48 UTC by IBM Bug Proxy
Modified: 2012-02-16 22:10 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-01-05 21:12:00 UTC
Target Upstream Version:
Embargoed:


Attachments
fixes to test pass/fail reporting (1.06 KB, text/plain)
2008-05-22 08:48 UTC, IBM Bug Proxy
make output of prio-wake easily readable (3.45 KB, text/plain)
2008-05-22 08:48 UTC, IBM Bug Proxy


Links
IBM Linux Technology Center 45000 (Private: 0, Priority: None, Status: None, Summary: None, Last Updated: Never)

Description IBM Bug Proxy 2008-05-22 08:48:25 UTC
=Comment: #0=================================================
Ankita Garg <ankigarg.com> - 2008-05-22 01:27 EDT
Problem description:

The prio-wake testcase, under testcases/realtime/func/prio-wake/ in LTP, reports
roughly 30% failures when a large number of iterations is run. The test checks
priority-ordered wakeup with pthread_cond_*.
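
For reference, the logic the test exercises is roughly the following (a minimal
sketch only, not the actual LTP source; thread count, priorities, and the
pass/fail check are illustrative, and error handling is omitted):

/* Minimal sketch of the prio-wake idea: N SCHED_FIFO workers at distinct
 * priorities block on one condvar, the main thread broadcasts, and the run
 * PASSes only if the workers acquire the mutex in descending priority order.
 * Illustrative only; needs root for SCHED_FIFO, error checks omitted. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 8

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
static int start;                 /* predicate the workers wait on  */
static int wake_order[NTHREADS];  /* priorities, in order of wakeup */
static int woken;

static void *worker(void *arg)
{
    int prio = (int)(long)arg;

    pthread_mutex_lock(&mutex);
    while (!start)
        pthread_cond_wait(&cond, &mutex);
    wake_order[woken++] = prio;   /* record who got the mutex when */
    pthread_mutex_unlock(&mutex);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    pthread_attr_t attr;
    struct sched_param sp;
    int i, failed = 0;

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);

    for (i = 0; i < NTHREADS; i++) {
        sp.sched_priority = 10 + i;            /* distinct RT priorities */
        pthread_attr_setschedparam(&attr, &sp);
        pthread_create(&tid[i], &attr, worker, (void *)(long)(10 + i));
    }

    sleep(1);                                  /* let every worker block   */
    pthread_mutex_lock(&mutex);
    start = 1;
    pthread_cond_broadcast(&cond);             /* wake all waiters at once */
    pthread_mutex_unlock(&mutex);

    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    for (i = 1; i < NTHREADS; i++)             /* highest priority first? */
        if (wake_order[i] > wake_order[i - 1])
            failed = 1;
    printf("%s\n", failed ? "FAIL" : "PASS");
    return failed;
}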

There were several issues with the testcase which have now been fixed; the
patches are yet to be committed into LTP (will attach them to this bugzilla).

Ran the test bound to CPU 1 using the taskset command. With this, I see 0/100
failures. This might point to a scheduling issue. Need to look at this now.
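
For reference, the same CPU binding that taskset provides can also be done from
inside the test process before the worker threads are created; a sketch only,
assuming CPU 1 as above (not part of the LTP test):

/* Equivalent of running the test under "taskset -c 1": pin the whole process
 * to CPU 1; threads created afterwards inherit the affinity mask. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = self */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}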


If this is not an installation problem,
       Describe any custom patches installed.

       Provide output from "uname -a", if possible:

Linux elm3c19 2.6.24.7-54ibmrt2.3 #1 SMP PREEMPT RT Wed May 14 15:20:41 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux


Hardware Environment
    Machine type (p650, x235, SF2, etc.):
    Cpu type (Power4, Power5, IA-64, etc.):
    Describe any special hardware you think might be relevant to this problem:


Please provide contact information if the submitter is not the primary contact.


Please provide access information for the machine if it is available.


Is this reproducible? Run about 100 iterations of the test.
    If so, how long does it (did it) take to reproduce it?
    Describe the steps:

    If not, describe how the bug was encountered:


Is the system (not just the application) hung?
    If so, describe how you determined this:


Did the system produce an OOPS message on the console?
    If so, copy it here:


Is the system sitting in a debugger right now?
    If so, how long may it stay there?


Additional information:
=Comment: #3=================================================
Ankita Garg <ankigarg.com> - 2008-05-22 01:33 EDT

make output of prio-wake easily readable

=Comment: #4=================================================
Ankita Garg <ankigarg.com> - 2008-05-22 01:37 EDT

fixes to test pass/fail reporting

=Comment: #5=================================================
Ankita Garg <ankigarg.com> - 2008-05-22 01:37 EDT
Patches in #3 & #4 apply on top of the latest LTP tree.

Comment 1 IBM Bug Proxy 2008-05-22 08:48:29 UTC
Created attachment 306350 [details]
fixes to test pass/fail reporting

Comment 2 IBM Bug Proxy 2008-05-22 08:48:31 UTC
Created attachment 306351 [details]
make output of prio-wake easily readable

Comment 3 IBM Bug Proxy 2008-05-26 06:25:05 UTC
------- Comment From sripathi.com 2008-05-26 02:17 EDT-------
Have the LTP patches been sent to their ML? Have they been accepted?

------- Comment From ankigarg.com 2008-05-26 02:19 EDT-------
(In reply to comment #11)
> Have the LTP patches been sent to their ML? Have they been accepted?

Sending out the patches now...

Comment 4 IBM Bug Proxy 2008-05-27 12:32:48 UTC
------- Comment From ankigarg.com 2008-05-27 08:31 EDT-------
Trying the sched_switch tracer now.
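
For reference, a hedged sketch of selecting the sched_switch tracer through the
ftrace interface as it appears in later mainline kernels (the exact path and
tracer name on the 2.6.24.7-rt kernel used here may differ):

/* Select the sched_switch tracer; wakeup/context-switch events can then be
 * read from .../tracing/trace while the testcase runs.  Path assumes the
 * later mainline debugfs layout. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/kernel/debug/tracing/current_tracer", "w");

    if (!f) {
        perror("fopen current_tracer");
        return 1;
    }
    fputs("sched_switch", f);
    fclose(f);
    return 0;
}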

Comment 5 IBM Bug Proxy 2008-06-02 18:08:58 UTC
------- Comment From dvhltc.com 2008-06-02 14:07 EDT-------
I'd like to see if the -62 (alpha17) kernel improves this scenario, since it
fixes some PI-related bugs.

Comment 6 IBM Bug Proxy 2008-06-03 12:00:42 UTC
------- Comment From ankigarg.com 2008-06-03 07:57 EDT-------
I see an equivalent number of failures even with the -62 kernel.

Comment 7 IBM Bug Proxy 2008-06-03 22:40:43 UTC
------- Comment From dvhltc.com 2008-06-03 18:33 EDT-------
After discussing this a bit with various folks, I believe that the test case IS
actually valid, but that we can't expect it to pass until after we resolve some
issues with the current pthread_cond_* implementations (both glibc and kernel).

As it stands, priority inversion is possible with pthread_cond_* because the
condition variables do not use PI mutexes internally, and they do not have an
explicit ownership handoff at signal/broadcast time.  So if a broadcast is sent
and the implementation wakes the highest-priority thread first while the CPU is
still in interrupt context, that thread cannot grab the mutex immediately; when
the implementation then signals the next thread (which gets to run immediately
on its runqueue), that thread grabs the mutex, and the higher-priority thread
ends up blocking once the CPU returns execution to it.  An explicit handoff of
ownership prior to returning control to the threads should eliminate this
scenario.
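
To make that concrete, here is the mutex side of what is being described (a
sketch with illustrative names, not code from the LTP test): even when the
mutex paired with the condvar is a PI mutex, the condvar's internal broadcast
path still performs no ownership handoff, which is what the requeue_pi/condvar
work referred to below is meant to add.

/* A PI mutex for use with the condvar: the holder is boosted to the priority
 * of its highest-priority waiter.  This alone does not fix the inversion
 * above, because the condvar's wakeup path does not hand the mutex to the
 * highest-priority waiter. */
#define _GNU_SOURCE
#include <pthread.h>

static pthread_mutex_t cond_mutex;

static int init_pi_mutex(void)
{
    pthread_mutexattr_t ma;
    int ret;

    pthread_mutexattr_init(&ma);
    ret = pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);
    if (ret == 0)
        ret = pthread_mutex_init(&cond_mutex, &ma);
    pthread_mutexattr_destroy(&ma);
    return ret;
}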

So I think we will need to defer this bug until such time as we get the
long-standing "requeue_pi/condvar" issues sorted out.  That said, before we
defer the bug, I would like to make sure we agree that the test is valid.  I
know Steven Rostedt had concerns that it wasn't valid if multiple CPUs were
involved.  Given my explanation above, are there still concerns over the
validity of the test-case?

Comment 8 IBM Bug Proxy 2008-06-04 07:00:39 UTC
------- Comment From ankigarg.com 2008-06-04 02:59 EDT-------
(In reply to comment #16)

I too agree that the testcase is valid and in line with the expected behavior.
But with the current implementation of pthread_* and the RT scheduler, the
testcase would be expected to fail on SMP.  That is why, on binding the
testcase to a single CPU, I did not observe failures.  We could defer the bug;
I only wanted to track down the exact reason why the high-priority thread was
not being woken up first.  Could it be due to anything in the scheduler, beyond
the possibility that the higher-priority thread is being woken from interrupt
context?

Comment 9 IBM Bug Proxy 2008-06-23 19:16:47 UTC
------- Comment From dvhltc.com 2008-06-23 15:11 EDT-------
I agree with the DEFERRED state, but not so much with P5.  I think this is a
real issue we need to address, but we can't until the requeue_pi work is sorted
out.  Doesn't that keep it at least at the P3 level?

Comment 10 IBM Bug Proxy 2008-06-24 05:33:02 UTC
------- Comment From sripathi.com 2008-06-24 01:25 EDT-------
(In reply to comment #23)
> I agree with the DEFERRED state, but not so much with P5.  I think this is a
> real issue we need to address, but we can't until the requeue_pi work is sorted
> out.  Doesn't that keep it at least at the P3 level?

OK.

Comment 11 IBM Bug Proxy 2009-04-10 17:51:07 UTC
------- Comment From dvhltc.com 2009-04-10 13:41 EDT-------
*** Bug 52280 has been marked as a duplicate of this bug. ***

------- Comment From dvhltc.com 2009-04-10 13:44 EDT-------
I've tested this with the kernel fixes for Bug 48484 and a preliminary hacked glibc, with over 13k successful runs.  Moving back to Open state and documenting its dependency on the glibc fix as well.

------- Comment From dvhltc.com 2009-04-10 13:45 EDT-------
This bug will not be fixed in R2-SR1.  We are hoping for MRG 1.2, but for now marking it as upstream.

Comment 12 IBM Bug Proxy 2009-05-07 12:51:58 UTC
------- Comment From dino.com 2009-05-07 08:48 EDT-------
Sent the glibc patches for requeue_pi to Clark on the Rhel-rt-ibm list.

Comment 13 IBM Bug Proxy 2009-05-28 17:00:48 UTC
------- Comment From johnstul.com 2009-05-28 12:59 EDT-------
*** Bug 51506 has been marked as a duplicate of this bug. ***

Comment 14 Clark Williams 2012-01-05 21:12:00 UTC
The glibc guys won't take the requeue_pi patches and I don't want to deliver a special glibc just for realtime. Closing WONTFIX.

Comment 15 IBM Bug Proxy 2012-02-16 22:10:28 UTC
------- Comment From niv.com 2012-02-16 16:32 EDT-------
Closing the bug from our end as WILL_NOT_FIX. We should yank the test from LTP, or at least modify it to indicate that it will fail on SMP. Test cleanups will be handled under a separate bug.

